MediScan Live — Bringing Real-Time AI Triage to Rural India

Inspiration

India has roughly one doctor for every 1,456 people — and in rural areas, that ratio is far worse. Community health workers (ASHA workers) are often the only medical presence in villages across Tamil Nadu, Rajasthan, and Andhra Pradesh. They are trained, dedicated, and deeply trusted by their communities — but they work alone, without specialist support, making triage decisions under pressure with limited information.

We asked a simple question: what if every health worker could have a specialist AI looking over their shoulder, in real time, in their own language?

That question became MediScan Live.

The inspiration wasn't abstract. In Coimbatore, we've seen firsthand how a 45-minute drive to the nearest district hospital can be the difference between a minor wound and a life-threatening infection — simply because no one on the ground knew whether to refer the patient immediately or treat and monitor. MediScan Live is built to close that gap.

What It Does

MediScan Live is a real-time multimodal AI agent that a health worker can talk to naturally while pointing their phone camera at a patient. The agent:

Sees wounds, rashes, skin conditions, and medication labels through the live camera feed
Hears the health worker's voice in Tamil, Hindi, Telugu, or English
Responds in the same language with calm, structured triage guidance — in under 2 seconds
Handles interruptions gracefully — if the worker starts speaking mid-response, the agent immediately stops and listens
Escalates emergencies automatically — alerting the nearest doctor via SMS when it detects signs of seizure, severe burns, or cardiac events

The system routes each query to one of four specialized sub-agents:

Sub-Agent	Handles
Wound Triage Agent	Cuts, burns, bleeding, injuries
Skin Condition Agent	Rashes, swelling, discoloration
Medication Agent	Drug labels, dosage, contraindications
Emergency Agent	Seizures, cardiac signs, severe trauma

How We Built It

Architecture

The system is built on five layers, each chosen deliberately:

User Device (React PWA) ↕ WebRTC / WebSocket Cloud Run Gateway (Node.js) ↕ Gemini Live API streaming Gemini 2.0 Flash Live ↕ Tool dispatch Google ADK Multi-Agent Orchestration ↕ Backend services Vertex AI Search · Firestore · Pub/Sub · Cloud Functions

The Real-Time Streaming Pipeline

The core technical challenge was building a truly bidirectional real-time pipeline — not the typical request-response pattern of most AI apps. We use the Gemini 2.0 Flash Live API, which maintains a persistent WebSocket connection where audio streams in and audio streams back out simultaneously, with the model processing both video frames and speech in a single unified pass.

The latency budget works out as follows. If $L_{total}$ is the total perceived latency:

$$L_{total} = L_{network} + L_{gemini} + L_{tts} + L_{rag}$$

Where:

$L_{network} \approx 80\text{ms}$ — WebSocket round trip on 4G
$L_{gemini} \approx 600\text{ms}$ — Gemini Live API first token latency
$L_{tts} \approx 0\text{ms}$ — Gemini Live outputs audio natively (no separate TTS step)
$L_{rag} \approx 200\text{ms}$ — Vertex AI Search retrieval, runs in parallel

$$L_{total} \approx 880\text{ms} \approx \textbf{under 1 second}$$

This sub-second response time is what makes the conversation feel natural rather than robotic.

Interruption Handling

The Live API's native interruption model works by detecting voice activity on the input stream while audio is still being output. When a new voice segment begins, the model sends an implicit turn_complete signal, halting its own output and switching to listening mode. We surface this to the UI as an immediate visual state change — the pulsing "speaking" animation snaps back to the "listening" state within one animation frame (~16ms).

Multi-Agent Routing

The ADK root agent classifies incoming intent using a combination of keyword matching and semantic similarity. For a query $q$, the routing score $S$ for each sub-agent $a_i$ is:

$$S(q, a_i) = \alpha \cdot \text{keyword}(q, a_i) + (1 - \alpha) \cdot \text{semantic}(q, a_i)$$

Where $\alpha = 0.6$ (keyword matching weighted higher for speed in low-connectivity scenarios). The agent with the highest $S$ is selected. If $\max_i S(q, a_i) < \theta$ (threshold = 0.4), the router defaults to the Wound Triage Agent as the safest fallback.

RAG Pipeline

The medical knowledge base is indexed in Vertex AI Search and retrieved using dense vector search. For each query, the top $k = 3$ most relevant protocol snippets are injected into the model's context window before it generates a response. This grounds every answer in verified clinical guidelines rather than model knowledge alone — critical for a medical application where hallucination has real consequences.

Tech Stack Summary

Layer	Technology	Why
Frontend	React PWA + WebRTC	Works on low-end Android, no app install needed
Gateway	Node.js + Cloud Run	Non-blocking I/O, scales to zero between sessions
AI Core	Gemini 2.0 Flash Live	Only model with native bidirectional audio streaming
Orchestration	Google ADK	Multi-agent routing without custom scaffolding
Knowledge Base	Vertex AI Search	Managed RAG with medical protocol documents
Session Logging	Firestore	Real-time doctor dashboard, natural document schema
Emergency Alerts	Pub/Sub + Cloud Functions + Twilio	Decoupled alerting that never delays the agent
Translation	Cloud Translation API	Normalizes logs to English for doctor review

Challenges We Faced

1. The Streaming Protocol Learning Curve

The Gemini Live API is fundamentally different from a standard API call — it's a stateful, persistent connection that requires careful management of audio chunks, turn boundaries, and session lifecycle. Getting the first clean end-to-end audio-in, audio-out loop working took significant debugging. The key insight was that audio must be sent in small chunks (≤100ms) rather than large buffers, and the session must be explicitly kept alive with periodic keep-alive signals.

2. Language Detection in Noisy Environments

Rural health workers often speak in code-mixed language — Tamil sentences with English medical terms, or Hindi with local dialect words. Pure language detection APIs struggled with this. Our solution was to let Gemini Live handle language detection natively (it's remarkably good at code-mixed Indian languages) and use the Translation API only as a normalization layer for logs, not for the live conversation.

3. Balancing Medical Safety with Helpfulness

Every response the agent gives has real stakes. Too cautious, and it becomes useless ("please see a doctor" for every query). Too confident, and it could cause harm. We spent significant time crafting the system prompts for each sub-agent to thread this needle — always providing actionable first-aid steps while consistently signposting when professional referral is needed. The emergency agent in particular required careful calibration of its trigger thresholds.

4. Offline Resilience

Network connectivity in rural India is unreliable. The PWA is built with a Workbox service worker that caches the full app shell, so the UI loads even with no connection. When the WebSocket drops, the app shows a clear "Reconnecting..." state rather than silently failing, and automatically re-establishes the session when connectivity returns.

5. Latency on Low-End Devices

The AudioWorklet-based audio capture pipeline had to be carefully tuned. On low-end Android devices, the default buffer sizes caused audio artifacts. We settled on a 128-sample buffer at 16kHz, which gives 8ms chunks — small enough for real-time feel, large enough to avoid dropouts on a ₹6,000 Android phone.

What We Learned

Building MediScan Live taught us that the hardest part of a real-time AI agent isn't the AI — it's the plumbing. Getting audio to flow cleanly in both directions, handling dropped connections gracefully, managing session state across a distributed system — these engineering challenges consumed far more time than the model integration itself.

We also learned that language is culture. The Tamil prompts we tested with actual ASHA workers in Coimbatore revealed that formal Tamil sounds clinical and cold, while the conversational register the workers actually use feels warm and trustworthy. The agent's personality matters as much as its accuracy.

Most importantly, we learned that constraint is a feature. Designing for a health worker with a low-end phone, poor connectivity, and no time to read a manual forced us to strip everything non-essential. The result is a UI with exactly one button — and that simplicity is what makes it actually usable in the field.

What's Next

Clinical validation with ASHA workers across three districts in Tamil Nadu
Offline inference mode using a quantized on-device model for zero-connectivity scenarios
Doctor dashboard — a real-time web interface where supervising physicians can monitor active sessions and intervene when flagged
Expansion to veterinary triage — the same architecture serves livestock health workers, a significant need in rural agricultural communities
Integration with India's Ayushman Bharat digital health records for seamless patient history context

Built With

Gemini 2.0 Flash Live API · Google ADK · Cloud Run · Vertex AI Search · Firestore · Pub/Sub · Cloud Functions · Firebase Hosting · React · Node.js · Python · WebRTC · Twilio · Workbox

MediScan Live was built for the Google Live Agent Challenge Hackathon. Every design decision was made with one person in mind — the ASHA worker standing in a village, alone, with a patient in front of her and no doctor nearby.

Built With

2.0
ai
api
audio
css
firebase
firestore
firestore-session-logger
flash
gateway
gemini
google
google-adk
google-cloud-discoveryengine
google-cloud-firestore
google-cloud-pubsub
google-genai
javascript
live
mediadevices
node.js
python-3.11-?-adk-agents
rag-pipeline
react
run
tailwind
translation
twilio
vertex
vite
web
websocket
workbox

Updates

yogitha Sivakumar started this project — Mar 16, 2026 03:51 PM EDT

Leave feedback in the comments!

Log in or sign up for Devpost to join the conversation.