Inspiration
Every student has been stuck on homework at 11pm with no one to help. Google gives answers. ChatGPT gives answers. But real learning happens when someone guides you to the answer yourself. We wanted to build a tutor that acts like a brilliant friend — one who sees your homework, hears your struggle, and never just hands you the answer.
What it does
Nova is a real-time AI tutor that sees and hears you through your browser. Show her your homework through your camera, ask her a question, and she guides you Socratically — asking the right questions until you figure it out yourself. She adapts to your level automatically, whether you're in high school or university.
How we built it
- Frontend: Vanilla HTML/JS with Web Audio API for real-time PCM audio playback and WebSocket client for streaming
- Backend: Python FastAPI with asyncio for concurrent audio/video/response streams
- AI: Google Gemini Live API (gemini-2.5-flash-native-audio-preview-12-2025) for real-time multimodal understanding
- Deployment: Docker container on Railway, with a Google Cloud Run deployment script included
- Architecture: Browser captures mic (PCM 16kHz) and camera (JPEG 1fps) → WebSocket → FastAPI → Gemini Live API → audio response (PCM 24kHz) → browser speaker
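The stream formats in the architecture line imply some byte-level conversion and arithmetic on both sides of the pipe. Below is a minimal Python sketch that assumes 16-bit signed little-endian mono PCM in both directions and float samples in [-1, 1] like the ones Web Audio produces. The post states only the sample rates, so the bit depth is an assumption. The function names are hypothetical.

```python
import struct

INPUT_RATE = 16_000    # mic audio sent to the backend (from the architecture above)
OUTPUT_RATE = 24_000   # audio returned by the model (from the architecture above)
BYTES_PER_SAMPLE = 2   # assumption: 16-bit signed little-endian mono PCM


def float_to_pcm16(samples: list[float]) -> bytes:
    """Clamp float samples to [-1, 1] and pack them as 16-bit little-endian PCM."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clamped]
    return struct.pack(f"<{len(ints)}h", *ints)


def pcm16_to_float(data: bytes) -> list[float]:
    """Unpack 16-bit little-endian PCM into floats in [-1, 1) for playback."""
    count = len(data) // BYTES_PER_SAMPLE
    values = struct.unpack(f"<{count}h", data[: count * BYTES_PER_SAMPLE])
    return [v / 32768 for v in values]


def pcm16_duration(data: bytes, rate: int) -> float:
    """Seconds of audio contained in a mono 16-bit PCM buffer."""
    return len(data) / (BYTES_PER_SAMPLE * rate)
```

At these rates, one second of microphone audio takes 16,000 × 2 = 32,000 bytes and one second of Nova's reply takes 24,000 × 2 = 48,000 bytes. That means a reply chunk's byte length gives its playback duration directly.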
Challenges we ran into
- WebSocket + HTTPS requires WSS: browsers block insecure ws:// connections from HTTPS pages as mixed content
- Gemini Live API requires continuous audio input to trigger proactive vision responses
- PCM audio scheduling — multiple chunks arriving rapidly must be chained seamlessly or they overlap
- AudioWorklet vs ScriptProcessorNode: we moved from ScriptProcessorNode to AudioWorklet so audio processing runs on a dedicated audio thread instead of blocking the main thread
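The chunk-scheduling problem comes down to tracking a running "next start time". In the browser this would typically use AudioBufferSourceNode.start(when) measured against AudioContext.currentTime. The Python sketch below models only the timing logic. It assumes 24 kHz, 16-bit mono reply chunks, and PlaybackScheduler is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class PlaybackScheduler:
    """Assigns back-to-back start times to incoming PCM chunks.

    Each chunk starts at max(now, end of the previous chunk), so chunks
    that arrive in a burst neither overlap nor leave gaps.
    """

    sample_rate: int = 24_000   # assumption: model output rate
    bytes_per_sample: int = 2   # assumption: 16-bit mono PCM
    next_start: float = 0.0     # clock time at which the queued audio ends

    def schedule(self, chunk: bytes, now: float) -> float:
        """Return the start time for `chunk` given the current clock time."""
        duration = len(chunk) / (self.sample_rate * self.bytes_per_sample)
        start = max(now, self.next_start)
        self.next_start = start + duration
        return start
```

Chunks that arrive early are queued end-to-end. A chunk that arrives after the queue has drained (an underrun) starts immediately instead of being scheduled in the past.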
Accomplishments that we're proud of
- Built a fully working real-time multimodal tutor in under 24 hours
- Nova genuinely teaches — she never gives direct answers, always guides with questions
- Clean architecture with auto-reconnect — session survives Gemini API drops transparently
- Single HTML file frontend with no framework, no build step
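One way to structure the auto-reconnect is a retry loop around the upstream session. The browser's WebSocket stays open, and only the Gemini connection is re-established. Below is a minimal asyncio sketch. It assumes a hypothetical open_session coroutine that raises ConnectionError when the upstream drops and returns normally when the student ends the session. The retry limit and exponential backoff values are illustrative.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_reconnect(
    open_session: Callable[[], Awaitable[None]],  # hypothetical upstream session runner
    max_retries: int = 5,
    base_delay: float = 0.5,
) -> None:
    """Re-open the upstream session whenever it drops, with exponential backoff."""
    attempt = 0
    while True:
        try:
            await open_session()
            return  # normal return: the student ended the session
        except ConnectionError:
            attempt += 1
            if attempt > max_retries:
                raise  # give up and let the caller close the browser socket
            # Wait 0.5 s, 1 s, 2 s, ... before reconnecting.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

A longer-lived version would probably reset `attempt` once a session has stayed up for a while. That way, occasional drops hours apart don't exhaust the retry budget.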
What we learned
- Gemini Live API is remarkably capable at real-time multimodal understanding
- Real-time audio streaming in browsers requires careful PCM scheduling
- The hardest part of building an AI tutor isn't the AI — it's the persona and prompt engineering
- asyncio queues are a clean pattern for bridging WebSocket streams with external AI APIs
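The queue pattern can be sketched as four tasks joined by two asyncio.Queue objects. One pair moves browser audio/video frames toward the model, and the other moves model audio back toward the browser. This is a hedged illustration. It assumes the FastAPI WebSocket and the Gemini session are wrapped as async iterators and async send functions. pump, drain and bridge are hypothetical names, and None serves as an end-of-stream sentinel.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

Sink = Callable[[bytes], Awaitable[None]]


async def pump(source: AsyncIterator[bytes], queue: asyncio.Queue) -> None:
    """Copy items from an async iterator into a queue, then send a None sentinel."""
    async for item in source:
        await queue.put(item)  # blocks when the queue is full (backpressure)
    await queue.put(None)


async def drain(queue: asyncio.Queue, sink: Sink) -> None:
    """Forward queued items to an async sink until the None sentinel arrives."""
    while (item := await queue.get()) is not None:
        await sink(item)


async def bridge(
    client_frames: AsyncIterator[bytes],   # e.g. wrapped websocket.receive_bytes()
    send_to_model: Sink,                   # hypothetical wrapper around the Gemini session
    model_replies: AsyncIterator[bytes],   # hypothetical wrapper around Gemini responses
    send_to_client: Sink,                  # e.g. wrapped websocket.send_bytes()
) -> None:
    upstream: asyncio.Queue = asyncio.Queue(maxsize=64)
    downstream: asyncio.Queue = asyncio.Queue(maxsize=64)
    # All four tasks run concurrently, so neither direction waits on the other.
    await asyncio.gather(
        pump(client_frames, upstream),
        drain(upstream, send_to_model),
        pump(model_replies, downstream),
        drain(downstream, send_to_client),
    )
```

The bounded queues provide backpressure. If sending to the model stalls, pump blocks on put instead of buffering frames without limit.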
What's next for Nova
- Google Cloud Run deployment when regional billing supports it
- Session memory — Nova remembers what was covered and gives end-of-session summaries
- Multiple subject modes with specialized tutoring approaches
- Mobile app for easier homework camera setup
- Parent dashboard showing learning progress