# TaleSpark

Focus: Multimodal Storytelling with Interleaved Output

An AI agent that thinks and creates like a creative director, seamlessly weaving together text, images, audio, and video in a single, fluid output stream. It leverages Gemini's native interleaved output to generate rich, mixed-media responses that combine narration with visuals, explanations with generated imagery, or storyboards with voiceover, all in one cohesive flow. Examples include:

- Interactive storybooks (text + generated illustrations inline)
- Marketing asset generator (copy + visuals + video in one go)
- Educational explainers (narration woven with diagrams)
- Social content creator (caption + image + hashtags together)
Mandatory Tech: Must use Gemini's interleaved/mixed output capabilities. The agent is hosted on Google Cloud.
## Project Structure

```
TaleSpark/
├── app.py                  # FastAPI backend
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── frontend/               # Vue 3 + TypeScript + Vite
│   ├── package.json
│   ├── tsconfig.json
│   ├── vite.config.ts
│   ├── index.html
│   ├── public/
│   │   └── favicon.svg
│   └── src/
│       ├── main.ts         # Entry point
│       ├── App.vue         # Root component
│       ├── types.ts        # TypeScript definitions
│       ├── styles/
│       │   └── main.css    # Global styles + CSS variables
│       ├── composables/
│       │   ├── useAppState.ts    # Global state management
│       │   ├── useSSE.ts         # Server-Sent Events streaming
│       │   └── useThreeScene.ts  # Three.js particle system
│       └── components/
│           ├── WelcomeScreen.vue # Hero with animated logo + particles
│           ├── StorySetup.vue    # Genre selection + prompt input
│           ├── StoryViewer.vue   # Streaming story display
│           ├── StoryComplete.vue # Celebration + stats
│           ├── LoadingScreen.vue # Animated quill writing
│           ├── GenreCard.vue     # 3D tilt genre cards
│           ├── SceneCard.vue     # Image + text + typewriter
│           └── AudioPlayer.vue   # Custom audio player
├── dist/                   # Built frontend (production)
├── static/                 # Generated images/audio
└── plans/
    └── frontend-architecture.md
```

## Architecture

```mermaid
flowchart LR
    %% Styles
    classDef frontend fill:#d4edda,stroke:#28a745,stroke-width:2px;
    classDef backend fill:#cce5ff,stroke:#007bff,stroke-width:2px;
    classDef ai fill:#f8d7da,stroke:#dc3545,stroke-width:2px;
    classDef cloud fill:#fff3cd,stroke:#ffc107,stroke-width:2px;

    %% Nodes
    subgraph Frontend
        UI[Web Browser]:::frontend
    end
    subgraph Backend
        API[FastAPI Endpoint]:::backend
        EQ[(Event Queue)]:::backend
        TQ[(TTS Text Queue)]:::backend
        LLM_W[Task 1: LLM Producer]:::backend
        TTS_W[Task 2: TTS Worker]:::backend
        FS[(Local Static Files)]:::backend
    end
    subgraph The_Brain
        LLM[Gemini 2.5 Pro]:::ai
        IMG[Imagen 3]:::ai
        TTS[GCP TTS API]:::cloud
    end

    %% Flow 1: Initialization
    UI -->|1. POST Prompt| API
    API -->|Starts| LLM_W
    API -->|Starts| TTS_W

    %% Flow 2: Task 1 (Text & Image Interleaved)
    LLM_W -->|2. Stream Chat| LLM
    LLM -.->|Text Chunks| LLM_W
    LLM_W -->|3. Tool Pause| IMG
    IMG -.->|Image Data| LLM_W

    %% Flow 3: Queue Routing
    LLM_W -->|Push Text/Img Event| EQ
    LLM_W -->|Push Sentences| TQ

    %% Flow 4: Task 2 (Parallel Audio)
    TQ -->|Pop Sentences| TTS_W
    TTS_W -->|4. Synthesize| TTS
    TTS -.->|MP3 Data| TTS_W
    TTS_W -->|Save File| FS
    TTS_W -->|Push Audio Event| EQ

    %% Flow 5: Output to Frontend
    EQ -->|5. SSE Stream| UI
    UI -.->|6. Fetch MP3/JPG| FS
```

## Prerequisites

- Python 3.10+
- Node.js 18+
- Google Cloud project with Gemini API enabled
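The two background tasks in the architecture diagram above can be sketched with `asyncio` queues. This is a minimal, self-contained illustration with stubbed LLM and TTS calls; the sentences, file names, and helper names are placeholders, not the actual `app.py` code:

```python
import asyncio

async def llm_producer(events: asyncio.Queue, sentences: asyncio.Queue) -> None:
    """Task 1: stream story chunks (stand-in for the Gemini call) to both queues."""
    for chunk in ["Once upon a time.", "A dragon spoke."]:
        await events.put({"type": "text", "chunk": chunk})  # straight to the SSE stream
        await sentences.put(chunk)                          # and to the TTS worker
    await sentences.put(None)  # sentinel: no more text to synthesize
    await events.put(None)     # sentinel: producer finished

async def tts_worker(sentences: asyncio.Queue, events: asyncio.Queue) -> None:
    """Task 2: turn sentences into audio events, running in parallel with Task 1."""
    while (sentence := await sentences.get()) is not None:
        # the real worker calls the Cloud TTS API and saves an MP3 under static/
        await events.put({"type": "audio", "text": sentence, "src": "/static/demo.mp3"})
    await events.put(None)     # sentinel: worker finished

async def run_pipeline() -> list[dict]:
    events: asyncio.Queue = asyncio.Queue()
    sentences: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(
        llm_producer(events, sentences),
        tts_worker(sentences, events),
    )
    out, finished = [], 0
    while finished < 2:        # drain until both sentinels arrive
        event = await events.get()
        if event is None:
            finished += 1
        else:
            out.append(event)
    return out

if __name__ == "__main__":
    for event in asyncio.run(run_pipeline()):
        print(event)
```

Because both tasks share the event queue, audio events interleave with text events in whatever order they are produced, which is what lets narration trail only slightly behind the text.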
## Setup

```powershell
# 1. Install Python dependencies
pip install -r requirements.txt

# 2. Install frontend dependencies (pnpm, via its PowerShell install script)
Invoke-WebRequest https://get.pnpm.io/install.ps1 -UseBasicParsing | Invoke-Expression
cd frontend
pnpm install

# 3. Configure Google Cloud (set PROJECT_ID in app.py)
# Required: Google Cloud project with Vertex AI enabled
```

## Running in Development

```bash
# Terminal 1: Start the FastAPI backend
python app.py
# Backend runs at http://localhost:8000

# Terminal 2: Start the Vue dev server (hot reload)
cd frontend
pnpm run dev
# Frontend runs at http://localhost:5173
```

The frontend proxies API requests to the backend:
- `/api/*` → `http://localhost:8000/api/*`
- `/static/*` → `http://localhost:8000/static/*`
## Production Build

```bash
# Build frontend
cd frontend
pnpm run build
# This creates the dist/ folder with static files

# Run production server
python app.py
# Serves the built frontend from dist/
```

## Frontend Features

- Three.js Particle Background — Ambient golden particles float upward, react to mouse movement, and change color per genre
- Genre Theming — 5 distinct themes (Fantasy, Sci-Fi, Mystery, Fairy Tale, Adventure) via CSS custom properties
- GSAP Animations — Smooth page transitions, logo entrance, button glows, card 3D tilts
- Real-time Streaming — Server-Sent Events deliver story content as it's generated
- Typewriter Effect — Text streams in character-by-character with cursor
- Custom Audio Player — Styled player with progress bar and auto-play
- Responsive Design — Works on desktop, tablet, and mobile
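The Real-time Streaming and Typewriter features above consume the backend's SSE frames. As an illustration of the wire format only (the real client uses the browser's `EventSource` in `useSSE.ts`; `parse_sse` and the sample payloads here are hypothetical):

```python
import json

def parse_sse(stream: str) -> list[dict]:
    """Split a raw SSE stream into the event dicts the UI renders."""
    events = []
    for frame in stream.split("\n\n"):       # a blank line terminates each frame
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events

raw = (
    'data: {"type": "text", "chunk": "Once upon a "}\n\n'
    'data: {"type": "audio", "src": "/static/aud_def456.mp3"}\n\n'
)
print(parse_sse(raw))
```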
## Backend & AI

- Gemini 2.5 Pro — Generates story text with interleaved tool calls
- Imagen 3.0 — Generates scene images
- Google Cloud Text-to-Speech API — Converts story text into spoken narration
- Server-Sent Events — Streams content in real-time
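Events reach the browser in the Server-Sent Events wire format: one JSON payload per `data:` line, terminated by a blank line. A minimal sketch of that serialization (the helper name `sse_format` is illustrative, not taken from `app.py`):

```python
import json

def sse_format(event: dict) -> str:
    """Serialize one event dict into an SSE 'data:' frame (blank line terminates it)."""
    return f"data: {json.dumps(event)}\n\n"

# In FastAPI, an async generator yielding such frames is typically wrapped in a
# StreamingResponse(..., media_type="text/event-stream").
print(sse_format({"type": "text", "chunk": "Once upon a "}), end="")
```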
## Google Cloud Setup

- Create a Google Cloud project
- Enable the Vertex AI API
- Set `PROJECT_ID` in `app.py`:

```python
PROJECT_ID = "your-project-id"
```

For production, you might want to use environment variables:
```bash
export PROJECT_ID="your-project-id"
export LOCATION="us-central1"
```

## Tech Stack

| Layer | Technology |
|---|---|
| Frontend Framework | Vue 3 + TypeScript |
| Build Tool | Vite |
| Animations | GSAP |
| 3D Effects | Three.js |
| Styling | CSS Custom Properties |
| Backend | FastAPI (Python) |
| AI | Google Gemini + Imagen |
| Audio | Google Cloud Text-to-Speech |
## API Endpoints

| Method | Endpoint | Description |
|---|---|---|
| GET | / | Serve frontend |
| POST | /api/generate | Generate story (SSE stream) |
Request:
```json
{ "prompt": "A young dragon discovers it can speak human languages..." }
```

Response: Server-Sent Events stream
```json
{"type": "image", "src": "/static/img_abc123.jpg"}
{"type": "text", "chunk": "Once upon a "}
{"type": "text", "chunk": "time, in a land..."}
{"type": "audio", "src": "/static/aud_def456.mp3"}
```

## License

MIT
Built for the Gemini Live Agent Challenge.