TalkingSpaces Project Overview

1) Project Vision

TalkingSpaces is a Gemini + ElevenLabs museum companion that lets a visitor point a phone at an artifact and hold a live, in-character voice conversation with one or more historical personas.

The product goal is simple: replace passive museum consumption (plaques and audio-guide scripts) with active, memorable dialogue that is educational, emotionally engaging, and multilingual-ready.

2) Product Outcome (What the User Should Have)

By the time an end user finishes one interaction, they should have:

  • A clearly identified artifact and its characters, shown in the UI
  • A short, understandable artifact context summary
  • A natural voice conversation with historically grounded personas
  • Distinct character voices for multi-subject works
  • Suggested questions to continue learning without friction
  • Low-latency interaction that feels real-time and “alive”

3) Core Experience Flow

  1. User opens mobile web app and captures a photo
  2. Gemini identifies artifact + candidate characters
  3. User confirms or corrects the identification
  4. Gemini research runs in selected mode (standard or deep)
  5. ElevenLabs Voice Designer creates/reuses voices per character
  6. ElevenLabs conversational agent starts WebRTC session
  7. User speaks naturally; characters reply in distinct voices and can reference each other
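
The numbered flow above maps onto a thin backend orchestration. Below is a minimal TypeScript sketch under that assumption; every helper and type here is a hypothetical stand-in for the client wrappers described in section 4, not a finished implementation.

```ts
// Sketch of the photo-to-first-conversation pipeline. All helpers are
// hypothetical stand-ins for the Phase 0 backend primitives.

interface IdentifiedArtifact {
  artifactId: string;
  title: string;
  candidateCharacters: string[];
  confidence: number; // gates the confirm/correct step
}

type ResearchMode = "standard" | "deep";

// Hypothetical backend primitives (built in Phase 0).
declare function identifyArtifact(photo: Blob): Promise<IdentifiedArtifact>;
declare function confirmWithUser(a: IdentifiedArtifact): Promise<IdentifiedArtifact>;
declare function runResearch(a: IdentifiedArtifact, mode: ResearchMode): Promise<string>;
declare function ensureVoices(characters: string[]): Promise<Map<string, string>>; // name -> voice_id
declare function createConversationSession(
  a: IdentifiedArtifact,
  research: string,
  voices: Map<string, string>,
): Promise<string>; // returns a session token for the WebRTC client

async function startExperience(photo: Blob, mode: ResearchMode): Promise<string> {
  const identified = await identifyArtifact(photo);                 // steps 1-2
  const confirmed = await confirmWithUser(identified);              // step 3
  const research = await runResearch(confirmed, mode);              // step 4
  const voices = await ensureVoices(confirmed.candidateCharacters); // step 5
  return createConversationSession(confirmed, research, voices);    // step 6; step 7 runs client-side
}
```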

4) Tooling and Services

Frontend

  • Next.js (mobile-first web app)
  • Camera capture + upload UX
  • Real-time conversation UI (Connecting, Live, Listening, Speaking)

AI + Voice

  • Gemini API (single LLM provider for this build)
  • ElevenLabs Voice Designer (eleven_multilingual_ttv_v2)
  • ElevenLabs Conversational AI (eleven_v3)

Realtime + Media

  • WebRTC for live audio streaming
  • Optional waveform and conversation state indicators

Backend / Platform

  • Node.js API routes for orchestration
  • Session token management for ElevenLabs conversation start
  • Artifact/character/voice cache storage

Data / Storage

  • Artifact entities
  • Character entities
  • Cached voice_id and agent_id
  • Conversation/session logs (for quality + safety monitoring)
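
One possible TypeScript shape for these entities, shown as a sketch. Only voice_id and agent_id appear in this document; every other field is an illustrative assumption.

```ts
// Illustrative data model for the entities above. Only voice_id and
// agent_id come from this doc; other fields are assumptions.

interface Artifact {
  id: string;
  title: string;
  museum?: string;
  summary: string;       // short artifact context shown to the user
  characterIds: string[];
}

interface Character {
  id: string;
  artifactId: string;
  name: string;
  personaPrompt: string; // persona consistency constraints live here
  voice_id?: string;     // cached ElevenLabs voice, if already designed
}

interface VoiceCacheEntry {
  voice_id: string;
  agent_id: string;      // cached conversational agent for reuse
  createdAt: Date;
}

interface SessionLog {
  sessionId: string;
  artifactId: string;
  startedAt: Date;
  transcriptRef?: string; // pointer used for quality + safety monitoring
}
```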

Quality + Operations

  • Structured logging and tracing per pipeline stage
  • Prompt/version registry for reproducibility
  • Cost dashboards (LLM + voice usage)

5) Key Challenges and Risks

A) Accuracy and Trust

  • Vision misidentification risk for ambiguous or damaged artifacts
  • Hallucinated historical details from model outputs
  • Need clear confidence + correction UX before live conversation

B) Multi-Character Orchestration

  • Prompting for believable turn-taking between personas
  • Preventing every character from responding in every reply (over-talking)
  • Keeping each persona consistent over long sessions

C) Latency and UX

  • Compounded delay across STT/LLM/TTS/network
  • Need sub-second feedback cues even when generation is slow
  • Audio quality variation in noisy museum environments

D) Safety and Sensitive Content

  • Historical topics include slavery, colonialism, repatriation, conflict
  • Responses must remain nuanced, factual, and non-harmful
  • Need policy + fallback behavior for unsafe or uncertain outputs

E) Cost and Scale

  • Voice generation can be expensive for cold-start scans
  • Deep research increases token cost and latency
  • Requires aggressive caching and session lifecycle controls

F) Legal and Policy

  • Potential restrictions on photo capture in specific exhibits
  • IP/reproduction concerns across institutions
  • Need terms, usage boundaries, and partner policy path

6) Execution Roadmap

Phase 0: Foundation (Week 1)

  • Set up monorepo/app structure
  • Implement Gemini client wrapper + retries + typed schemas
  • Implement ElevenLabs client wrapper + token endpoint
  • Define data models for Artifact/Character/Voice/Session

Deliverable: callable backend primitives with tests.
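
As a sketch of the "client wrapper + retries + typed schemas" primitive: the example below assumes zod for schema validation and hides the actual Gemini call behind a hypothetical callGeminiVision; the schema fields are illustrative, not the final contract.

```ts
import { z } from "zod";

// Typed schema for the identify step; the exact fields are an assumption.
const IdentifySchema = z.object({
  title: z.string(),
  candidateCharacters: z.array(z.string()),
  confidence: z.number().min(0).max(1),
});
type IdentifyResult = z.infer<typeof IdentifySchema>;

// Generic retry-with-backoff wrapper, independent of any SDK.
async function withRetries<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Exponential backoff: 500ms, 1s, 2s...
      await new Promise((r) => setTimeout(r, 500 * 2 ** i));
    }
  }
  throw lastError;
}

// Hypothetical thin wrapper around the actual Gemini vision call.
declare function callGeminiVision(photo: Blob, prompt: string): Promise<unknown>;

async function identify(photo: Blob): Promise<IdentifyResult> {
  const raw = await withRetries(() =>
    callGeminiVision(photo, "Identify this artifact and its depicted figures."),
  );
  // parse() throws on malformed model output. Schema failures are deliberately
  // outside the retry loop: they should surface, not be retried blindly.
  return IdentifySchema.parse(raw);
}
```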

Phase 1: Vertical Slice (Weeks 2-3)

  • Mobile capture UI
  • Artifact identification + confirm screen
  • Standard research mode only
  • Single-character voice conversation loop

Deliverable: end-to-end “photo to first conversation” for one character.

Phase 2: Multi-Persona Core (Weeks 4-5)

  • Multi-character extraction and confirmation UX
  • Voice generation/caching per character
  • Multi-voice agent setup and switching
  • Persona consistency constraints in prompting

Deliverable: stable 2-4 character live conversation.
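
One illustrative way to express the turn-taking and consistency constraints from section 5B in a system prompt. The wording below is an untested assumption, not a validated prompt.

```ts
// Illustrative multi-persona system prompt builder. The constraints mirror
// the risks in section 5B; the exact wording is an untested assumption.
function buildPersonaSystemPrompt(
  characters: { name: string; personaPrompt: string }[],
): string {
  const roster = characters
    .map((c) => `- ${c.name}: ${c.personaPrompt}`)
    .join("\n");
  return [
    "You voice the following historical personas:",
    roster,
    "Rules:",
    "- Only one or two personas speak per reply; not every character responds to every turn.",
    "- Each persona keeps a consistent voice, vocabulary, and point of view for the whole session.",
    "- Personas may address or disagree with each other by name.",
    "- If a fact is uncertain, say so in character rather than inventing details.",
  ].join("\n");
}
```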

Phase 3: Reliability and Safety (Weeks 6-7)

  • Guardrails for sensitive topics
  • Fact uncertainty signaling in responses
  • Observability dashboards (latency, failures, cost)
  • Retry/backoff and degraded-mode handling

Deliverable: production-quality behavior under real network conditions.

Phase 4: Beta Readiness (Weeks 8-9)

  • Deep research mode
  • Suggested questions and onboarding polish
  • Session analytics + retention instrumentation
  • Terms/privacy/copyright handling

Deliverable: closed beta launch package.

7) Definition of Done (Launch Gate)

A release is ready when all items below are true:

  • P95 time from photo to first response is within target
  • Identification correction flow exists and is used
  • Multi-character responses are coherent and distinct
  • Unsafe/uncertain content handling is validated
  • Caching significantly reduces repeat-session cost
  • Error handling + fallback paths are visible in UI
  • Logs/metrics/alerts are operational
  • Legal/privacy baseline is approved

8) Build-Readiness Checklist

Use this checklist before active development starts.

Product

  • [ ] Problem statement and success metrics approved
  • [ ] Evidence labels (Verified, Likely, Hypothesis) applied to requirements
  • [ ] MVP scope frozen (explicitly listing what is out of scope)

Technical Architecture

  • [ ] Gemini model selection finalized for identify and research
  • [ ] ElevenLabs API keys and rate limits validated
  • [ ] Data schema finalized for Artifact/Character/Voice/Session
  • [ ] Caching strategy defined (voice_id, agent_id, research cache)

UX

  • [ ] Mobile capture and confirmation flows designed
  • [ ] Conversation states designed (Connecting, Live, Listening, Speaking)
  • [ ] Failure-state UX designed (network/model/audio)

Safety and Quality

  • [ ] Sensitive-topic policy prompts defined and reviewed
  • [ ] Human review workflow for problematic sessions defined
  • [ ] Automated test plan approved (unit, integration, E2E)

Ops

  • [ ] Logging/tracing schema defined
  • [ ] Cost monitoring configured
  • [ ] Staging and production environments separated
  • [ ] Incident ownership and on-call path assigned

Go-to-Beta

  • [ ] Internal dogfood completed with issue triage
  • [ ] Pilot artifact set curated and validated
  • [ ] Beta success criteria documented

9) Suggested First Sprint Backlog (Practical)

  • [ ] Implement POST /identify (Gemini vision + schema validation)
  • [ ] Implement POST /research (standard mode only)
  • [ ] Implement POST /voices/create (ElevenLabs voice generation + cache)
  • [ ] Implement POST /conversation/token (ElevenLabs token handoff)
  • [ ] Build mobile flow: capture -> confirm -> start conversation
  • [ ] Add telemetry: step latency, error codes, token usage
  • [ ] Add smoke tests for end-to-end happy path
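
A minimal sketch of the POST /conversation/token handoff as a Next.js route handler. The ElevenLabs signed-URL endpoint and response field below are assumptions to verify against the official API contract (per the decision log); the API key deliberately never leaves the server.

```ts
// Sketch of POST /conversation/token as a Next.js route handler.
// The ElevenLabs URL and response shape are assumptions to confirm
// against the official API docs.
export async function POST(req: Request): Promise<Response> {
  const { agentId } = await req.json();

  const res = await fetch(
    `https://api.elevenlabs.io/v1/convai/conversation/get_signed_url?agent_id=${encodeURIComponent(agentId)}`,
    { headers: { "xi-api-key": process.env.ELEVENLABS_API_KEY ?? "" } },
  );
  if (!res.ok) {
    // Degraded-mode handling: surface a typed error the UI can render.
    return Response.json({ error: "token_unavailable" }, { status: 502 });
  }

  const body = await res.json();
  // Hand only the short-lived signed URL to the client, never the API key.
  return Response.json({ signedUrl: body.signed_url });
}
```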

10) Decision Log (Initial)

  • Gemini is the only LLM provider for this build.
  • OpenRouter is removed from this implementation path.
  • Tweet-derived analysis is input context, not final truth.
  • Production truth comes from tested behavior and official API contracts.
