TalkingSpaces Project Overview

1) Project Vision

TalkingSpaces is a Gemini + ElevenLabs museum companion that lets a visitor point a phone at an artifact and hold a live, in-character voice conversation with one or more historical personas.

The product goal is simple: replace passive museum consumption (plaques and audio-guide scripts) with active, memorable dialogue that is educational, emotionally engaging, and multilingual-ready.

2) Product Outcome (What the User Should Have)

By the time an end user finishes one interaction, they should have:

  • A clearly identified artifact and its characters, shown in the UI
  • A short, understandable artifact context summary
  • A natural voice conversation with historically grounded personas
  • Distinct character voices for multi-subject works
  • Suggested questions to continue learning without friction
  • Low-latency interaction that feels real-time and “alive”

3) Core Experience Flow

  1. User opens mobile web app and captures a photo
  2. Gemini identifies artifact + candidate characters
  3. User confirms or corrects the identification
  4. Gemini research runs in selected mode (standard or deep)
  5. ElevenLabs Voice Designer creates/reuses voices per character
  6. ElevenLabs conversational agent starts WebRTC session
  7. User speaks naturally; characters reply in distinct voices and can reference each other
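
The numbered flow above maps onto a thin backend orchestration. Below is a minimal TypeScript sketch under that assumption; every helper and type here is a hypothetical stand-in for the client wrappers described in section 4, not a finished implementation.

```ts
// Sketch of the photo-to-first-conversation pipeline. All helpers are
// hypothetical stand-ins for the Phase 0 backend primitives.

interface IdentifiedArtifact {
  artifactId: string;
  title: string;
  candidateCharacters: string[];
  confidence: number; // gates the confirm/correct step
}

type ResearchMode = "standard" | "deep";

// Hypothetical backend primitives (built in Phase 0).
declare function identifyArtifact(photo: Blob): Promise<IdentifiedArtifact>;
declare function confirmWithUser(a: IdentifiedArtifact): Promise<IdentifiedArtifact>;
declare function runResearch(a: IdentifiedArtifact, mode: ResearchMode): Promise<string>;
declare function ensureVoices(characters: string[]): Promise<Map<string, string>>; // name -> voice_id
declare function createConversationSession(
  a: IdentifiedArtifact,
  research: string,
  voices: Map<string, string>,
): Promise<string>; // returns a session token for the WebRTC client

async function startExperience(photo: Blob, mode: ResearchMode): Promise<string> {
  const identified = await identifyArtifact(photo);                 // steps 1-2
  const confirmed = await confirmWithUser(identified);              // step 3
  const research = await runResearch(confirmed, mode);              // step 4
  const voices = await ensureVoices(confirmed.candidateCharacters); // step 5
  return createConversationSession(confirmed, research, voices);    // step 6; step 7 runs client-side
}
```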

4) Tooling and Services

Frontend

  • Next.js (mobile-first web app)
  • Camera capture + upload UX
  • Real-time conversation UI (Connecting, Live, Listening, Speaking)

AI + Voice

  • Gemini API (single LLM provider for this build)
  • ElevenLabs Voice Designer (eleven_multilingual_ttv_v2)
  • ElevenLabs Conversational AI (eleven_v3)

Realtime + Media

  • WebRTC for live audio streaming
  • Optional waveform and conversation state indicators

Backend / Platform

  • Node.js API routes for orchestration
  • Session token management for ElevenLabs conversation start
  • Artifact/character/voice cache storage

Data / Storage

  • Artifact entities
  • Character entities
  • Cached voice_id and agent_id
  • Conversation/session logs (for quality + safety monitoring)
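
One possible TypeScript shape for these entities, shown as a sketch. Only voice_id and agent_id appear in this document; every other field is an illustrative assumption.

```ts
// Illustrative data model for the entities above. Only voice_id and
// agent_id come from this doc; other fields are assumptions.

interface Artifact {
  id: string;
  title: string;
  museum?: string;
  summary: string;       // short artifact context shown to the user
  characterIds: string[];
}

interface Character {
  id: string;
  artifactId: string;
  name: string;
  personaPrompt: string; // persona consistency constraints live here
  voice_id?: string;     // cached ElevenLabs voice, if already designed
}

interface VoiceCacheEntry {
  voice_id: string;
  agent_id: string;      // cached conversational agent for reuse
  createdAt: Date;
}

interface SessionLog {
  sessionId: string;
  artifactId: string;
  startedAt: Date;
  transcriptRef?: string; // pointer used for quality + safety monitoring
}
```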

Quality + Operations

  • Structured logging and tracing per pipeline stage
  • Prompt/version registry for reproducibility
  • Cost dashboards (LLM + voice usage)

5) Key Challenges and Risks

A) Accuracy and Trust

  • Vision misidentification risk for ambiguous or damaged artifacts
  • Hallucinated historical details from model outputs
  • Need clear confidence + correction UX before live conversation

B) Multi-Character Orchestration

  • Prompting for believable turn-taking between personas
  • Preventing every character from responding in every reply (over-talking)
  • Keeping each persona consistent over long sessions

C) Latency and UX

  • Compounded delay across STT/LLM/TTS/network
  • Need sub-second feedback cues even when generation is slow
  • Audio quality variation in noisy museum environments

D) Safety and Sensitive Content

  • Historical topics include slavery, colonialism, repatriation, conflict
  • Responses must remain nuanced, factual, and non-harmful
  • Need policy + fallback behavior for unsafe or uncertain outputs

E) Cost and Scale

  • Voice generation can be expensive for cold-start scans
  • Deep research increases token cost and latency
  • Requires aggressive caching and session lifecycle controls

F) Legal and Policy

  • Potential restrictions on photo capture in specific exhibits
  • IP/reproduction concerns across institutions
  • Need terms, usage boundaries, and partner policy path

6) Execution Roadmap

Phase 0: Foundation (Week 1)

  • Set up monorepo/app structure
  • Implement Gemini client wrapper + retries + typed schemas
  • Implement ElevenLabs client wrapper + token endpoint
  • Define data models for Artifact/Character/Voice/Session

Deliverable: callable backend primitives with tests.
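
As a sketch of the "client wrapper + retries + typed schemas" primitive: the example below assumes zod for schema validation and hides the actual Gemini call behind a hypothetical callGeminiVision; the schema fields are illustrative, not the final contract.

```ts
import { z } from "zod";

// Typed schema for the identify step; the exact fields are an assumption.
const IdentifySchema = z.object({
  title: z.string(),
  candidateCharacters: z.array(z.string()),
  confidence: z.number().min(0).max(1),
});
type IdentifyResult = z.infer<typeof IdentifySchema>;

// Generic retry-with-backoff wrapper, independent of any SDK.
async function withRetries<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 500ms, 1s, 2s...
      await new Promise((r) => setTimeout(r, 500 * 2 ** i));
    }
  }
  throw lastError;
}

// Hypothetical thin wrapper around the actual Gemini vision call.
declare function callGeminiVision(photo: Blob, prompt: string): Promise<unknown>;

async function identify(photo: Blob): Promise<IdentifyResult> {
  const raw = await withRetries(() =>
    callGeminiVision(photo, "Identify this artifact and its depicted figures."),
  );
  // parse() throws on malformed model output. Schema failures are deliberately
  // outside the retry loop: they should surface, not be retried blindly.
  return IdentifySchema.parse(raw);
}
```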

Phase 1: Vertical Slice (Weeks 2-3)

  • Mobile capture UI
  • Artifact identification + confirm screen
  • Standard research mode only
  • Single-character voice conversation loop

Deliverable: end-to-end “photo to first conversation” for one character.

Phase 2: Multi-Persona Core (Weeks 4-5)

  • Multi-character extraction and confirmation UX
  • Voice generation/caching per character
  • Multi-voice agent setup and switching
  • Persona consistency constraints in prompting

Deliverable: stable 2-4 character live conversation.
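
One illustrative way to express the turn-taking and consistency constraints from section 5B in a system prompt. The wording below is an untested assumption, not a validated prompt.

```ts
// Illustrative multi-persona system prompt builder. The constraints mirror
// the risks in section 5B; the exact wording is an untested assumption.
function buildPersonaSystemPrompt(
  characters: { name: string; personaPrompt: string }[],
): string {
  const roster = characters
    .map((c) => `- ${c.name}: ${c.personaPrompt}`)
    .join("\n");
  return [
    "You voice the following historical personas:",
    roster,
    "Rules:",
    "- Only one or two personas speak per reply; not every character responds to every turn.",
    "- Each persona keeps a consistent voice, vocabulary, and point of view for the whole session.",
    "- Personas may address or disagree with each other by name.",
    "- If a fact is uncertain, say so in character rather than inventing details.",
  ].join("\n");
}
```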

Phase 3: Reliability and Safety (Weeks 6-7)

  • Guardrails for sensitive topics
  • Fact uncertainty signaling in responses
  • Observability dashboards (latency, failures, cost)
  • Retry/backoff and degraded-mode handling

Deliverable: production-quality behavior under real network conditions.

Phase 4: Beta Readiness (Weeks 8-9)

  • Deep research mode
  • Suggested questions and onboarding polish
  • Session analytics + retention instrumentation
  • Terms/privacy/copyright handling

Deliverable: closed beta launch package.

7) Definition of Done (Launch Gate)

A release is ready when all items below are true:

  • P95 time from photo to first response is within target
  • Identification correction flow exists and is used
  • Multi-character responses are coherent and distinct
  • Unsafe/uncertain content handling is validated
  • Caching significantly reduces repeat-session cost
  • Error handling + fallback paths are visible in UI
  • Logs/metrics/alerts are operational
  • Legal/privacy baseline is approved

8) Build-Readiness Checklist

Use this checklist before active development starts.

Product

  • [ ] Problem statement and success metrics approved
  • [ ] Evidence labels (Verified, Likely, Hypothesis) applied to requirements
  • [ ] MVP scope frozen (explicitly listing what is out of scope)

Technical Architecture

  • [ ] Gemini model selection finalized for identify and research
  • [ ] ElevenLabs API keys and rate limits validated
  • [ ] Data schema finalized for Artifact/Character/Voice/Session
  • [ ] Caching strategy defined (voice_id, agent_id, research cache)

UX

  • [ ] Mobile capture and confirmation flows designed
  • [ ] Conversation states designed (Connecting, Live, Listening, Speaking)
  • [ ] Failure-state UX designed (network/model/audio)

Safety and Quality

  • [ ] Sensitive-topic policy prompts defined and reviewed
  • [ ] Human review workflow for problematic sessions defined
  • [ ] Automated test plan approved (unit, integration, E2E)

Ops

  • [ ] Logging/tracing schema defined
  • [ ] Cost monitoring configured
  • [ ] Staging and production environments separated
  • [ ] Incident ownership and on-call path assigned

Go-to-Beta

  • [ ] Internal dogfood completed with issue triage
  • [ ] Pilot artifact set curated and validated
  • [ ] Beta success criteria documented

9) Suggested First Sprint Backlog (Practical)

  • [ ] Implement POST /identify (Gemini vision + schema validation)
  • [ ] Implement POST /research (standard mode only)
  • [ ] Implement POST /voices/create (ElevenLabs voice generation + cache)
  • [ ] Implement POST /conversation/token (ElevenLabs token handoff)
  • [ ] Build mobile flow: capture -> confirm -> start conversation
  • [ ] Add telemetry: step latency, error codes, token usage
  • [ ] Add smoke tests for end-to-end happy path
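
A minimal sketch of the POST /conversation/token handoff as a Next.js route handler. The ElevenLabs signed-URL endpoint and response field below are assumptions to verify against the official API contract (per the decision log); the API key deliberately never leaves the server.

```ts
// Sketch of POST /conversation/token as a Next.js route handler.
// The ElevenLabs URL and response shape are assumptions to confirm
// against the official API docs.
export async function POST(req: Request): Promise<Response> {
  const { agentId } = await req.json();

  const res = await fetch(
    `https://api.elevenlabs.io/v1/convai/conversation/get_signed_url?agent_id=${encodeURIComponent(agentId)}`,
    { headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY ?? "" } },
  );
  if (!res.ok) {
    // Degraded-mode handling: surface a typed error the UI can render.
    return Response.json({ error: "token_unavailable" }, { status: 502 });
  }

  const body = await res.json();
  // Hand only the short-lived signed URL to the client, never the API key.
  return Response.json({ signedUrl: body.signed_url });
}
```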

10) Decision Log (Initial)

  • Gemini is the only LLM provider for this build.
  • OpenRouter is removed from this implementation path.
  • Tweet-derived analysis is input context, not final truth.
  • Production truth comes from tested behavior and official API contracts.
