TalkingSpaces Project Overview
1) Project Vision
TalkingSpaces is a Gemini + ElevenLabs museum companion that lets a visitor point a phone at an artifact and hold a live, in-character voice conversation with one or more historical personas.
The product goal is simple: replace passive museum consumption (plaques/audio guide scripts) with active, memorable dialogue that is educational, emotionally engaging, and multilingual-ready.
2) Product Outcome (What the User Should Have)
By the time an end user finishes one interaction, they should have:
- A clearly identified artifact and characters shown in UI
- A short, understandable artifact context summary
- A natural voice conversation with historically grounded personas
- Distinct character voices for multi-subject works
- Suggested questions to continue learning without friction
- Low-latency interaction that feels real-time and “alive”
3) Core Experience Flow
- User opens mobile web app and captures a photo
- Gemini identifies artifact + candidate characters
- User confirms/corrects identity
- Gemini research runs in selected mode (`standard` or `deep`)
- ElevenLabs Voice Designer creates/reuses voices per character
- ElevenLabs conversational agent starts WebRTC session
- User speaks naturally; characters reply in distinct voices and can reference each other
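For concreteness, the sketch below strings these steps together as one typed flow. Every interface, field, and function name in it is illustrative only, not an existing project or SDK API.

```ts
// Illustrative orchestration of the flow above, with each step injected as a
// dependency so no real SDK signature is assumed.
type ResearchMode = "standard" | "deep";

interface Identification {
  artifactName: string;
  characters: string[];
  confidence: number;
}

interface FlowSteps {
  identifyArtifact(photo: Blob): Promise<Identification>;
  confirmWithUser(id: Identification): Promise<Identification>;
  runResearch(id: Identification, mode: ResearchMode): Promise<string>;
  ensureVoices(characters: string[]): Promise<Map<string, string>>; // name -> voice_id
  createConversationToken(id: Identification, voices: Map<string, string>): Promise<string>;
}

async function runArtifactFlow(photo: Blob, mode: ResearchMode, steps: FlowSteps) {
  const identified = await steps.identifyArtifact(photo);          // Gemini vision
  const confirmed = await steps.confirmWithUser(identified);       // user confirm/correct gate
  const research = await steps.runResearch(confirmed, mode);       // standard or deep research
  const voices = await steps.ensureVoices(confirmed.characters);   // create/reuse voice per character
  const token = await steps.createConversationToken(confirmed, voices);
  return { research, token };                                      // frontend starts WebRTC with token
}
```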
4) Tooling and Services
Frontend
- Next.js (mobile-first web app)
- Camera capture + upload UX
- Real-time conversation UI (`Connecting`, `Live`, `Listening`, `Speaking`)
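A minimal sketch of how the client could model those UI states; the allowed-transition table is an assumption, not a spec.

```ts
// Conversation UI states named above, with an assumed transition table.
type ConversationState = "Connecting" | "Live" | "Listening" | "Speaking";

const allowedTransitions: Record<ConversationState, ConversationState[]> = {
  Connecting: ["Live"],              // WebRTC session established
  Live: ["Listening", "Speaking"],   // session open, idle
  Listening: ["Speaking", "Live"],   // user audio detected
  Speaking: ["Listening", "Live"],   // character audio playing
};

function transition(from: ConversationState, to: ConversationState): ConversationState {
  if (!allowedTransitions[from].includes(to)) {
    throw new Error(`Invalid transition: ${from} -> ${to}`);
  }
  return to;
}
```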
AI + Voice
- Gemini API (single LLM provider for this build)
- ElevenLabs Voice Designer (`eleven_multilingual_ttv_v2`)
- ElevenLabs Conversational AI (`eleven_v3`)
Realtime + Media
- WebRTC for live audio streaming
- Optional waveform and conversation state indicators
Backend / Platform
- Node.js API routes for orchestration
- Session token management for ElevenLabs conversation start
- Artifact/character/voice cache storage
Data / Storage
- Artifact entities
- Character entities
- Cached `voice_id` and `agent_id`
- Conversation/session logs (for quality + safety monitoring)
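The entity shapes below are a rough sketch of what these stores might hold; every field name is an assumption until the Phase 0 schema work finalizes it.

```ts
// Illustrative entity shapes for the stores listed above (field names are assumptions).
interface Artifact {
  id: string;
  title: string;
  museumId?: string;
  researchSummary?: string;   // cached standard/deep research output
}

interface Character {
  id: string;
  artifactId: string;
  name: string;
  personaPrompt: string;      // grounding + persona constraints
  voiceId?: string;           // cached ElevenLabs voice_id
}

interface SessionLog {
  id: string;
  artifactId: string;
  agentId: string;            // cached ElevenLabs agent_id
  startedAt: string;          // ISO timestamp
  transcriptRef?: string;     // pointer to transcript for quality/safety review
}
```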
Quality + Operations
- Structured logging and tracing per pipeline stage
- Prompt/version registry for reproducibility
- Cost dashboards (LLM + voice usage)
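One possible shape for per-stage structured logging is a helper that emits a single JSON line per pipeline stage; the stage names and fields below are assumptions and can be swapped for whatever tracing library the team adopts.

```ts
// Minimal structured-logging sketch: one JSON line per pipeline stage.
type Stage = "identify" | "research" | "voice_create" | "conversation";

function logStage(stage: Stage, sessionId: string, fields: Record<string, unknown>) {
  console.log(JSON.stringify({
    ts: new Date().toISOString(),
    stage,
    sessionId,
    ...fields,   // e.g. latencyMs, errorCode, tokenCount
  }));
}

// Usage: logStage("identify", sessionId, { latencyMs: 840, errorCode: null });
```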
5) Key Challenges and Risks
A) Accuracy and Trust
- Vision misidentification risk for ambiguous or damaged artifacts
- Hallucinated historical details from model outputs
- Need clear confidence + correction UX before live conversation
B) Multi-Character Orchestration
- Prompting for believable turn-taking between personas
- Preventing every character from chiming in on every reply
- Keeping each persona consistent over long sessions
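One way to make these turn-taking constraints concrete is to encode them directly in the agent's system prompt. The wording below is an untested sketch for discussion, not the project's actual prompt.

```ts
// Illustrative turn-taking constraints for a multi-persona system prompt.
const turnTakingRules = (characters: string[]): string => `
You are role-playing the following personas: ${characters.join(", ")}.
Rules for every reply:
- Only one or two personas speak per turn; pick whoever the question most concerns.
- Prefix each spoken line with the persona's name so the voice layer can route it.
- Personas may briefly acknowledge or disagree with each other, but must not repeat each other.
- Stay within documented history for your persona; say you are not certain rather than invent details.
`;
```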
C) Latency and UX
- Compounded delay across STT/LLM/TTS/network
- Need sub-second feedback cues even when generation is slow
- Audio quality variation in noisy museum environments
D) Safety and Sensitive Content
- Historical topics include slavery, colonialism, repatriation, conflict
- Responses must remain nuanced, factual, and non-harmful
- Need policy + fallback behavior for unsafe or uncertain outputs
E) Cost and Scale
- Voice generation can be expensive for cold-start scans
- Deep research increases token cost and latency
- Requires aggressive caching and session lifecycle controls
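A cache-first lookup like the sketch below is the usual way to avoid paying voice-generation cost on repeat scans; the cache interface and key format are assumptions (Redis, a database table, or KV storage would all work).

```ts
// Cache-first voice reuse: pay the Voice Designer cost once per character.
interface VoiceCache {
  get(key: string): Promise<string | undefined>;   // returns cached voice_id
  set(key: string, voiceId: string): Promise<void>;
}

async function getOrCreateVoice(
  cache: VoiceCache,
  characterId: string,
  createVoice: () => Promise<string>,   // expensive ElevenLabs Voice Designer call
): Promise<string> {
  const key = `voice:${characterId}`;
  const cached = await cache.get(key);
  if (cached) return cached;            // repeat scans skip generation entirely

  const voiceId = await createVoice();  // cold-start cost paid once per character
  await cache.set(key, voiceId);
  return voiceId;
}
```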
F) Legal and Museum Constraints
- Potential restrictions on photo capture in specific exhibits
- IP/reproduction concerns across institutions
- Need terms, usage boundaries, and partner policy path
6) Execution Roadmap
Phase 0: Foundation (Week 1)
- Set up mono-repo/app structure
- Implement Gemini client wrapper + retries + typed schemas
- Implement ElevenLabs client wrapper + token endpoint
- Define data models for Artifact/Character/Voice/Session
Deliverable: callable backend primitives with tests.
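A minimal sketch of the "client wrapper + retries + typed schemas" primitive, assuming zod for validation and leaving the actual Gemini call injected so no specific SDK signature is implied.

```ts
import { z } from "zod";   // validation library is an assumption; any validator works

// Expected shape of the identification output (fields are placeholders).
const IdentificationSchema = z.object({
  artifact: z.string(),
  characters: z.array(z.string()),
  confidence: z.number().min(0).max(1),
});
type Identification = z.infer<typeof IdentificationSchema>;

// Generic retry wrapper with exponential backoff.
async function callWithRetries<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Validate the model's JSON output before it reaches the product flow.
async function identify(rawCall: () => Promise<string>): Promise<Identification> {
  const raw = await callWithRetries(rawCall);
  return IdentificationSchema.parse(JSON.parse(raw));   // throws on schema mismatch
}
```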
Phase 1: Vertical Slice (Weeks 2-3)
- Mobile capture UI
- Artifact identification + confirm screen
- Standard research mode only
- Single-character voice conversation loop
Deliverable: end-to-end “photo to first conversation” for one character.
Phase 2: Multi-Persona Core (Weeks 4-5)
- Multi-character extraction and confirmation UX
- Voice generation/caching per character
- Multi-voice agent setup and switching
- Persona consistency constraints in prompting
Deliverable: stable 2-4 character live conversation.
Phase 3: Reliability and Safety (Weeks 6-7)
- Guardrails for sensitive topics
- Fact uncertainty signaling in responses
- Observability dashboards (latency, failures, cost)
- Retry/backoff and degraded-mode handling
Deliverable: production-quality behavior under real network conditions.
Phase 4: Beta Readiness (Weeks 8-9)
- Deep research mode
- Suggested questions and onboarding polish
- Session analytics + retention instrumentation
- Terms/privacy/copyright handling
Deliverable: closed beta launch package.
7) Definition of Done (Launch Gate)
A release is ready when all items below are true:
- P95 time from photo to first response is within target
- Identification correction flow exists and is used
- Multi-character responses are coherent and distinct
- Unsafe/uncertain content handling is validated
- Caching significantly reduces repeat-session cost
- Error handling + fallback paths are visible in UI
- Logs/metrics/alerts are operational
- Legal/privacy baseline is approved
8) Build-Readiness Checklist
Use this checklist before active development starts.
Product
- [ ] Problem statement and success metrics approved
- [ ] Evidence labels (`Verified`, `Likely`, `Hypothesis`) applied to requirements
- [ ] MVP scope frozen (what is explicitly out-of-scope)
Technical Architecture
- [ ] Gemini model selection finalized for `identify` and `research`
- [ ] ElevenLabs API keys and rate limits validated
- [ ] Data schema finalized for Artifact/Character/Voice/Session
- [ ] Caching strategy defined (`voice_id`, `agent_id`, research cache)
UX
- [ ] Mobile capture and confirmation flows designed
- [ ] Conversation states designed (`Connecting`, `Live`, `Listening`, `Speaking`)
- [ ] Failure-state UX designed (network/model/audio)
Safety and Quality
- [ ] Sensitive-topic policy prompts defined and reviewed
- [ ] Human review workflow for problematic sessions defined
- [ ] Automated test plan approved (unit, integration, E2E)
Ops
- [ ] Logging/tracing schema defined
- [ ] Cost monitoring configured
- [ ] Staging and production environments separated
- [ ] Incident ownership and on-call path assigned
Go-to-Beta
- [ ] Internal dogfood completed with issue triage
- [ ] Pilot artifact set curated and validated
- [ ] Beta success criteria documented
9) Suggested First Sprint Backlog (Practical)
- [ ] Implement `POST /identify` (Gemini vision + schema validation)
- [ ] Implement `POST /research` (`standard` mode only)
- [ ] Implement `POST /voices/create` (ElevenLabs voice generation + cache)
- [ ] Implement `POST /conversation/token` (ElevenLabs token handoff)
- [ ] Build mobile flow: capture -> confirm -> start conversation
- [ ] Add telemetry: step latency, error codes, token usage
- [ ] Add smoke tests for end-to-end happy path
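To illustrate the `POST /conversation/token` handoff from the backlog above, here is a sketch of a Next.js route handler that keeps the ElevenLabs API key server-side. The upstream URL is a placeholder environment variable; the real endpoint should come from the official ElevenLabs API contract (consistent with the decision log below).

```ts
// app/api/conversation/token/route.ts -- sketch only, not the final handler.
import { NextResponse } from "next/server";

// Placeholder: set this from the officially documented ElevenLabs token endpoint.
const ELEVENLABS_TOKEN_URL = process.env.ELEVENLABS_TOKEN_URL!;

export async function POST(req: Request) {
  const { agentId } = await req.json();
  if (!agentId) {
    return NextResponse.json({ error: "agentId required" }, { status: 400 });
  }

  // Keep the xi-api-key server-side; the client only ever sees the short-lived token.
  const upstream = await fetch(
    `${ELEVENLABS_TOKEN_URL}?agent_id=${encodeURIComponent(agentId)}`,
    { headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY! } },
  );

  if (!upstream.ok) {
    return NextResponse.json({ error: "token request failed" }, { status: 502 });
  }

  const body = await upstream.json();
  return NextResponse.json(body);   // token/URL the WebRTC client uses to start the session
}
```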
10) Decision Log (Initial)
- Gemini is the only LLM provider for this build.
- OpenRouter is removed from this implementation path.
- Tweet-derived analysis is input context, not final truth.
- Production truth comes from tested behavior and official API contracts.