Product Requirements Document
TalkingSpaces — Museum AI Companion
Source tweet: https://x.com/isnit0/status/2024104717039685915
Official build guide: https://elevenlabs.io/blog/talk-to-a-statue-building-a-multi-modal-elevenagents-powered-app
Creator: Joe Reeve (@isnit0)
Status: Alpha — DM-gated access (as of Feb 2026)
Analysis method: Whisper + Gemini (3 videos) + X API thread + ElevenLabs build guide
Last updated: 2026-02-20
Evidence policy (for build decisions):
- Verified: directly observed in video UI/audio or official vendor docs
- Likely: strong inference from multiple artifacts, not yet implementation-proof
- Hypothesis: plausible product behavior requiring prototype validation
Model policy (updated):
- Gemini is the only LLM provider for this build (vision + research)
- OpenRouter is removed from local analysis tooling in this repo
- Model selection is fixed to Gemini unless a future architecture decision changes it
1. Product Overview
Vision
A mobile web app that transforms any museum visit into a live voice conversation. Point your phone at a statue or artifact — the app identifies it, researches its history, generates a period-appropriate AI voice per character, and lets you hold a real-time spoken conversation with the exhibit — even with multiple characters simultaneously.
Problem Statement
Museum experiences are passive. Plaques are static, audio guides are scripted, and human guides are expensive and unavailable at scale. Visitors — especially children and international tourists — leave without deeply connecting with what they've seen.
Proposed Solution
A mobile-optimised Next.js web app that:
1. Takes a photo of any statue, monument, or artwork
2. Identifies all depicted characters via Gemini vision
3. Researches historical context via Gemini (deep or standard mode, user selects)
4. Generates a unique, period-appropriate voice per character via ElevenLabs Voice Designer (eleven_multilingual_ttv_v2)
5. Enables real-time spoken conversation via ElevenLabs Agent over WebRTC (eleven_v3)
Creator's Own Words
"I made an app that lets you talk to any statue you want using AI. The app will use OpenAI research to find all the historical context about the statue. Then, using that information and the ElevenLabs Voice Designer API, it will design the perfect voice completely using AI that matches that statue. And then we simply use that voice in an ElevenLabs agent."
How It Was Built
The entire app was built from a single one-shot prompt in Cursor with Claude Opus 4.5 (high) from an empty Next.js project. The prompt:
"We need to make an app that: is optimised for mobile, allows the user to take a picture of a statue/monument that includes one or more people, uses an OpenAI LLM API call to identify the statue/characters/location, allows the user to check it's correct then do either a deep research or standard search, then creates an ElevenLabs agent (allowing multiple voices) that the user can talk to as though they're talking to the characters. Each character should use Voice Designer API to create a matching voice. The purpose is to be fun and educational."
Evidence Confidence Snapshot
| Item | Confidence | Notes |
|---|---|---|
| Multi-character live conversation behavior | Verified | Observed in extracted frames/transcripts |
| Pipeline stages (Identified → Research → Voices → Ready) | Verified | Visible in UI captures |
| Exact internal model IDs used by the original production app | Hypothesis | Must be validated independently; your build is Gemini-only |
| Domain/app availability claims from tweet thread | Likely | Needs independent runtime verification |
2. Target Users
| User Type | Need | Pain Point Solved |
|---|---|---|
| Casual museum visitor | Engaging, effortless learning | Plaques are dry and hard to read |
| Families with children | Interactive, entertaining experience | Kids disengage quickly from traditional guides |
| International tourists | Native language support | Audio guides rarely support their language |
| Educators / school groups | Deep contextual engagement | Hard to make ancient artifacts relatable |
| Museums / galleries | Scalable modern visitor experience | Human guides are expensive and limited |
3. Core User Flow
1. OPEN APP (mobile browser) → tap camera button
2. PHOTOGRAPH statue/monument/artwork
3. CONFIRM → app shows identified name/characters, user verifies
4. RESEARCH MODE → choose deep research or standard search (Gemini)
5. VOICES → ElevenLabs Voice Designer creates a voice per character
6. READY → enter live WebRTC conversation
7. TALK → speak; each character responds in their distinct AI voice
(characters can also respond to each other)
Pipeline status tabs in UI: Identified → Research → Voices → Ready
4. Confirmed Demo Scenarios (from thread)
| Tweet | Artifact | Characters | Notable |
|---|---|---|---|
| Main | Colossal bust of Ramesses II | 1 | First demo, established product |
| Tweet 1 | The Ancestor (British Museum) | 1 | Second demo |
| Tweet 2 | Two friends playing knucklebones | 2 | Multi-persona, friends talking to each other |
| Tweet 3 | Sir Hans Sloane | 1 | Founder of the British Museum — meta |
| Tweet 4 | Parthenon Metope Relief (Elgin Marbles) | 4 simultaneously | Most technically complex demo |
Tweet 4 (Parthenon) is the headline demo — 4 characters speaking to the user and to each other simultaneously, all with distinct voices. This is the key technical differentiator.
5. Features
5.1 Core Features (MVP — confidence-labeled)
F1 — Artifact Identification (Gemini vision)
- User photographs statue/monument via mobile camera
- Image sent to Gemini vision with structured output
- Returns JSON: statueName, location, artist, year, description, characters[]
- Each character entry includes: name, description, era, voiceDescription
- User confirms identification is correct before proceeding
- Supports 1 to N characters per artifact
JSON schema returned:
{
  "statueName": "string",
  "location": "string (city, country)",
  "artist": "string",
  "year": "string",
  "description": "string",
  "characters": [{
    "name": "string",
    "description": "string",
    "era": "string",
    "voiceDescription": "string — audio quality + age + gender + vocal qualities + accent + pacing + personality"
  }]
}
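A minimal sketch of the F1 identification call, assuming the Google Gen AI JS SDK (@google/genai) with structured output matching the schema above. The identifyArtifact helper, the env-driven model selection, and the prompt wording are illustrative assumptions, not confirmed details of the original app.

```typescript
// Sketch of F1 (assumed wiring): photo in, structured artifact JSON out.
import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

export async function identifyArtifact(imageBase64: string) {
  const response = await ai.models.generateContent({
    // Model policy: Gemini only; concrete model set via env (placeholder default).
    model: process.env.GEMINI_VISION_MODEL ?? "gemini-2.0-flash",
    contents: [
      { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
      { text: "Identify this statue/monument and every depicted character. Return JSON only." },
    ],
    config: {
      responseMimeType: "application/json",
      responseSchema: {
        type: Type.OBJECT,
        properties: {
          statueName: { type: Type.STRING },
          location: { type: Type.STRING },
          artist: { type: Type.STRING },
          year: { type: Type.STRING },
          description: { type: Type.STRING },
          characters: {
            type: Type.ARRAY,
            items: {
              type: Type.OBJECT,
              properties: {
                name: { type: Type.STRING },
                description: { type: Type.STRING },
                era: { type: Type.STRING },
                voiceDescription: { type: Type.STRING },
              },
            },
          },
        },
      },
    },
  });
  // Structured output mode returns JSON text conforming to the schema above.
  return JSON.parse(response.text ?? "{}");
}
```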
F2 — Historical Research (two modes)
- Standard search: fast, surface-level context
- Deep research: thorough, richer historical detail (slower)
- User selects mode after identification
- Research stored per artifact — not repeated on revisit
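How the two research modes map onto Gemini calls is not documented. The sketch below assumes deep mode adds Google Search grounding and a more exhaustive prompt, while standard mode is a single fast pass; the researchArtifact helper and model choice are hypothetical.

```typescript
// Sketch of F2 (assumed approach): one entry point, two depths.
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

export async function researchArtifact(
  statueName: string,
  characters: { name: string }[],
  mode: "standard" | "deep"
) {
  const depth =
    mode === "deep"
      ? "Write a thorough historical briefing: creation, provenance, controversies, and the daily life of each figure."
      : "Write a concise historical summary (5-8 sentences).";

  const response = await ai.models.generateContent({
    model: process.env.GEMINI_RESEARCH_MODEL ?? "gemini-2.0-flash",
    contents: `${depth}\nArtifact: ${statueName}\nCharacters: ${characters
      .map((c) => c.name)
      .join(", ")}`,
    // Google Search grounding only for deep mode (assumption).
    config: mode === "deep" ? { tools: [{ googleSearch: {} }] } : undefined,
  });
  return response.text; // stored per artifact so it is not repeated on revisit
}
```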
F3 — AI Voice Generation per Character
- Uses ElevenLabs Voice Designer API (eleven_multilingual_ttv_v2)
- Generates 3 voice previews per character, selects first
- Voice description formula (critical for quality):
  "Perfect audio quality. [age/gender] with [vocal tone], [precise accent]. [pacing]. [personality trait]."
  Example: "Perfect audio quality. A powerful woman in her 30s with a deep resonant voice and a thick Celtic British accent. Her tone is commanding and fierce. She speaks at a measured, deliberate pace with passionate intensity."
- Longer, character-appropriate sample text (50+ words) produces more stable voices
- Voice saved as voice_id, cached per character — never regenerated on revisit
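A sketch of F3's Voice Designer wiring: generate previews from the voiceDescription, keep the first, and persist it as a reusable voice. The designCharacterVoice helper is hypothetical, and the REST paths and field names reflect the ElevenLabs Voice Design API as understood here; verify them against the current ElevenLabs docs before building on them.

```typescript
// Sketch of F3 (assumed wiring): design previews, keep the first, save as a voice.
const ELEVEN_API = "https://api.elevenlabs.io/v1";
const headers = {
  "xi-api-key": process.env.ELEVENLABS_API_KEY!,
  "Content-Type": "application/json",
};

export async function designCharacterVoice(character: {
  name: string;
  voiceDescription: string;
  sampleText: string; // 50+ words, in character, for more stable previews
}): Promise<string> {
  // 1) Generate voice previews from the description (endpoint path is an assumption).
  const previewRes = await fetch(`${ELEVEN_API}/text-to-voice/create-previews`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      voice_description: character.voiceDescription,
      text: character.sampleText,
      model_id: "eleven_multilingual_ttv_v2",
    }),
  });
  const { previews } = await previewRes.json();

  // 2) Persist the first preview as a named, reusable voice.
  const createRes = await fetch(`${ELEVEN_API}/text-to-voice/create-voice-from-preview`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      voice_name: character.name,
      voice_description: character.voiceDescription,
      generated_voice_id: previews[0].generated_voice_id,
    }),
  });
  const { voice_id } = await createRes.json();
  return voice_id; // cache on the Character record so it is never regenerated
}
```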
F4 — Multi-Voice ElevenLabs Agent (WebRTC)
- Creates one ElevenLabs Agent per artifact scan (eleven_v3)
- Primary character assigned as the default voiceId
- Additional characters registered in supportedVoices[] with label + voiceDescription
- Agent switches between character voices in real time during conversation
- System prompt instructs the agent to play ALL characters simultaneously
- Config: turnTimeout: 10s, maxDurationSeconds: 600 (10 min max session)
- Real-time audio via WebRTC (server-side token → client-side session)
Agent config (confirmed):
{
  conversationConfig: {
    agent: {
      language: "en",
      prompt: { prompt: systemPrompt, temperature: 0.7 }
    },
    tts: {
      voiceId: primaryCharacter.voiceId,
      modelId: "eleven_v3",
      supportedVoices: otherCharacters.map(c => ({
        voiceId: c.voiceId,
        label: c.name,
        description: c.voiceDescription
      }))
    },
    turn: { turnTimeout: 10 },
    conversation: { maxDurationSeconds: 600 }
  }
}
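A sketch of how the confirmed config above might be passed to agent creation. It assumes the ElevenLabs JS SDK exposes conversationalAi.agents.create (consistent with the token call shown in section 6.4); method and response field names should be verified against the SDK version in use.

```typescript
// Sketch of F4 (assumed SDK call): wrap the confirmed config in an agent-create request.
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

export async function createArtifactAgent(
  systemPrompt: string,
  primaryCharacter: { voiceId: string },
  otherCharacters: { voiceId: string; name: string; voiceDescription: string }[]
) {
  const agent = await elevenlabs.conversationalAi.agents.create({
    conversationConfig: {
      agent: {
        language: "en",
        prompt: { prompt: systemPrompt, temperature: 0.7 },
      },
      tts: {
        voiceId: primaryCharacter.voiceId,
        modelId: "eleven_v3",
        supportedVoices: otherCharacters.map((c) => ({
          voiceId: c.voiceId,
          label: c.name,
          description: c.voiceDescription,
        })),
      },
      turn: { turnTimeout: 10 },
      conversation: { maxDurationSeconds: 600 },
    },
  });
  return agent.agentId; // field name assumed from the SDK's camelCase responses; cache per scan
}
```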
F5 — Multi-Persona Simultaneous Conversation
- Agent prompt: "You are playing ALL [N] characters simultaneously"
- Characters respond to user AND to each other in the same conversation
- Demonstrated with 2 characters (knucklebones), and 4 characters (Parthenon Metope)
- This is the core technical differentiator — no other museum app does this
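The creator's exact system prompt is not published. The template below is an illustrative sketch consistent with the observed behavior: all characters played simultaneously, bracketed tone tags (e.g. [stern]) in responses, and characters addressing each other as well as the visitor.

```typescript
// Sketch of the multi-persona system prompt (illustrative wording, not the creator's prompt).
export function buildSystemPrompt(
  artifact: { statueName: string; description: string },
  characters: { name: string; description: string; era: string }[],
  research: string
) {
  const roster = characters
    .map((c) => `- ${c.name} (${c.era}): ${c.description}`)
    .join("\n");

  return `You are playing ALL ${characters.length} characters depicted in "${artifact.statueName}" simultaneously.
Characters:
${roster}

Rules:
- Stay in character at all times; each reply is voiced by exactly one character.
- Prefix emotional delivery with bracketed tone tags, e.g. [stern], [thoughtful].
- Characters may address each other as well as the visitor.
- Use the historical research below; if unsure, say so in character rather than inventing facts.

Historical research:
${research}

The artifact: ${artifact.description}`;
}
```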
5.2 Supporting Features
F6 — Exhibit Info Card
- Displays: artifact name, artist, year/period, location, description
- Shown during and after pipeline, before conversation starts
F7 — Suggested Question Prompts
- Pre-seeded question bubbles: "How did you end up here?", "Tell me who you are"
- Lowers barrier to starting conversations
F8 — Listening Indicator
- "Listening..." UI state during voice input
F9 — Research Mode Selector
- User explicitly chooses: deep research vs standard search
- Gives control over speed vs depth trade-off
6. Technical Architecture
6.1 Current Build Target Tech Stack (recommended)
| Layer | Technology | Model/Version | Role |
|---|---|---|---|
| Frontend | Next.js (mobile-optimised) | — | Mobile web app, camera access |
| Vision / Identification | Gemini API | Gemini model set via env | Identify artifact + characters |
| Research | Gemini API | Gemini model set via env | Historical context (deep or standard) |
| Voice Design | ElevenLabs Voice Designer | eleven_multilingual_ttv_v2 | Generate character voices |
| Conversation | ElevenLabs Conversational AI | eleven_v3 | Real-time multi-voice agent |
| Audio transport | WebRTC | — | Low-latency real-time audio |
| STT | ElevenLabs (built-in) | — | Voice input → text |
6.2 Pipeline Flow
Mobile camera capture (base64 JPEG)
↓
Gemini vision call (structured output)
→ { statueName, characters[{ name, voiceDescription }] }
↓
User confirms identification
↓
Gemini research call (deep or standard)
→ historical context per character
↓
ElevenLabs Voice Designer (per character)
→ voice previews → save → voice_id[]
↓
ElevenLabs Agent create
→ { agentId, primaryVoice, supportedVoices[] }
↓
WebRTC session start (server token → client)
↓
Real-time conversation loop
(STT → multi-persona LLM → TTS voice switching)
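The same flow expressed as one server-side function, composed from the hypothetical helpers sketched in section 5; module paths are illustrative, and the user's confirmation step sits in the UI between identification and research.

```typescript
// Sketch of the full pipeline (illustrative composition of the section 5 sketches).
import { identifyArtifact } from "./identify";
import { researchArtifact } from "./research";
import { designCharacterVoice } from "./voices";
import { buildSystemPrompt } from "./prompt";
import { createArtifactAgent } from "./agent";

export async function runPipeline(imageBase64: string, mode: "standard" | "deep") {
  // 1) Identify the artifact and its characters (Gemini vision, structured output)
  const artifact = await identifyArtifact(imageBase64);

  // 2) Research historical context at the depth the user chose
  const research = await researchArtifact(artifact.statueName, artifact.characters, mode);

  // 3) Design one cached voice per character (ElevenLabs Voice Designer)
  for (const character of artifact.characters) {
    character.voiceId = await designCharacterVoice({
      name: character.name,
      voiceDescription: character.voiceDescription,
      sampleText: character.description, // ideally a longer, in-character sample (50+ words)
    });
  }

  // 4) Create the multi-voice agent for this scan; the client then fetches a
  //    conversation token and starts the WebRTC session (section 6.4)
  const [primary, ...others] = artifact.characters;
  const systemPrompt = buildSystemPrompt(artifact, artifact.characters, research ?? "");
  const agentId = await createArtifactAgent(systemPrompt, primary, others);

  return { artifact, agentId };
}
```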
6.3 Data Model (from build guide)
Character {
  name: string
  description: string
  era: string
  voiceDescription: string  // detailed prompt for Voice Designer
  voiceId: string           // ElevenLabs voice_id (cached)
}

Artifact {
  statueName: string
  location: string
  artist: string
  year: string
  description: string
  characters: Character[]
  agentId: string           // ElevenLabs agent_id (cached per scan)
}
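The data model implies a lookup-or-create pattern for voice_id and agent_id. A minimal sketch follows; the key-value storage interface is an assumption, since the original backing store is not specified.

```typescript
// Sketch of the caching strategy: voice_ids (and, analogously, agent_ids) are
// created once and reused on revisit, so cold scans pay the generation cost exactly once.
interface KV {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

export async function getOrCreateVoice(
  kv: KV,
  artifactKey: string,
  character: { name: string; voiceDescription: string; sampleText: string },
  design: (c: { name: string; voiceDescription: string; sampleText: string }) => Promise<string>
): Promise<string> {
  const cacheKey = `voice:${artifactKey}:${character.name}`;
  const cached = await kv.get(cacheKey);
  if (cached) return cached; // revisit: no Voice Designer call, no extra cost

  const voiceId = await design(character); // cold scan: generate once
  await kv.set(cacheKey, voiceId);
  return voiceId;
}
```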
6.4 WebRTC Integration
// Server-side: get conversation token
const { conversationToken } = await elevenlabs.conversationalAi
  .agents.getConversationToken(agentId);

// Client-side: start session
await conversation.startSession({ token: conversationToken });
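In a Next.js app the server-side call would typically live behind an API route so the ElevenLabs key never reaches the browser. A sketch, assuming an App Router route handler; the route path is illustrative and the SDK call mirrors the snippet above.

```typescript
// Sketch: app/api/conversation-token/route.ts (illustrative path)
import { NextResponse } from "next/server";
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

export async function GET(request: Request) {
  const agentId = new URL(request.url).searchParams.get("agentId");
  if (!agentId) {
    return NextResponse.json({ error: "agentId required" }, { status: 400 });
  }
  // Issue a short-lived conversation token for the client to start the WebRTC session.
  const { conversationToken } =
    await elevenlabs.conversationalAi.agents.getConversationToken(agentId);
  return NextResponse.json({ conversationToken });
}
```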
7. User Stories
| Priority | Story |
|---|---|
| P0 | As a visitor, I can photograph a statue and have the app identify it within seconds |
| P0 | As a visitor, I can speak to the statue in real time and it responds in character |
| P0 | As a visitor, each character has a distinct voice that matches their era and personality |
| P1 | As a visitor, I can choose how deeply the app researches before talking |
| P1 | As a visitor, I can interact with all characters in a group sculpture simultaneously |
| P1 | As a visitor, characters respond to each other — not just to me |
| P1 | As a visitor, the app suggests questions to help me start |
| P2 | As a repeat visitor, cached voices load instantly for artifacts I've seen before |
| P2 | As an international visitor, I can interact in my native language |
| P2 | As a visitor, I can hold a conversation for up to 10 minutes per artifact |
8. Open Questions (Updated)
| # | Question | Status |
|---|---|---|
| 1 | Identification model | ✅ Resolved (for this build): Gemini-only; tune model choice for accuracy/cost/latency |
| 2 | ElevenLabs cost model | ⚠️ Open: Voice Designer + Agent per unique scan — caching mitigates but cold scans are expensive |
| 3 | Museum partnerships | ⚠️ Open: Works on any public artifact without museum opt-in (raises IP/access questions at scale) |
| 4 | Content accuracy | ⚠️ Open: No validation layer mentioned — Gemini research outputs used directly |
| 5 | Publishing intent | ✅ Resolved: Alpha via DM. Build guide published on ElevenLabs blog for community to self-build |
| 6 | Voice description quality | ✅ Resolved: Formula confirmed — "Perfect audio quality. [age/gender]. [vocal tone]. [precise accent]. [pacing]. [personality]" |
| 7 | Max session length | ✅ Resolved: 600 seconds (10 minutes) |
9. Competitive Landscape
| Product | Approach | Gap vs This App |
|---|---|---|
| Traditional audio guides | Pre-recorded, scripted | Not interactive, not personalized |
| Bloomberg Connects | Curated museum app | No AI conversation, no dynamic voice |
| Google Lens | Identifies objects, links Wikipedia | No voice, no conversation, no character |
| Museum chatbots (text) | Text Q&A about exhibits | No voice, single character only |
| This app | Real-time multi-voice conversation with artifact | First-mover: in-character voice + multi-persona |
10. Success Metrics (Suggested)
| Metric | Target |
|---|---|
| Time to first conversation | < 20s from photo capture |
| Identification accuracy | > 90% for major museum collections |
| Questions per artifact per session | > 3 |
| Artifacts per visit | 3+ |
| Session duration | > 2 minutes average |
| Alpha → beta conversion | > 40% of DM requesters become repeat users |
11. Build Complexity Assessment
Surprisingly low. The creator built this with a single prompt in Cursor (Claude Opus 4.5, high). Core complexity is in:
- Voice description quality — the prompt formula for Voice Designer is the key craft skill
- Multi-persona agent prompting — instructing the agent to play N characters simultaneously
- WebRTC integration — server token → client session (well-documented by ElevenLabs)
- Caching strategy — ensuring voice_id and agent_id are stored to avoid regeneration costs
A developer with Next.js experience and ElevenLabs API access could replicate this in a weekend.
12. App Identity
App URL observed in video UI: talkingpaintings.ai
Treat as Likely until independently verified (e.g., live request + matching functionality).
13. Demo Analysis (Video Evidence)
Main tweet video (~80s) — Ramesses II
- Full pipeline walkthrough: PHOTO → Identified → Research → Voices → Ready
- Single character: Ramesses II (ancient Egyptian pharaoh)
- Demonstrates text + voice chat, "Listening..." indicator
- ElevenLabs Voice Designer + Agent overlays visible
Tweet 3 — Sir Hans Sloane (~3.5 min, 220s)
Artifact: Terracotta bust of Sir Hans Sloane, British Museum
Character: Sir Hans Sloane (founder of the British Museum)
Conversation highlights (Whisper transcript):
- User asks how Sloane built his collection → Sloane describes his Jamaica expedition in the 1680s
- User asks directly about slavery: "I've heard that slaves were involved in collecting some of these exhibits"
- Sloane responds in character, acknowledging that his wealth was tied to Jamaican plantations and enslaved labour
- User asks: "What did you make of it at the time?"
- Sloane defends himself as a man of his era, focused on scientific knowledge
- User closes warmly: "I'm grateful that you put some of that wealth to good use"
Why this matters for the PRD: The AI handles morally complex historical topics in character — it doesn't deflect or refuse. This is intentional product design. The museum experience is enriched precisely because visitors can ask uncomfortable questions that plaques can't address.
UI confirmed: Connecting... → Live → Speaking... / Listening... states all visible
Tweet 4 — Parthenon Metope / Elgin Marbles (~4.5 min, 256s)
Artifact: Parthenon metope relief: Lapiths vs Centaurs, British Museum
Characters (4 simultaneous):
- The Lapith Rider
- The Lapith Defender
- The War Horse
- The Charging Horse
Artifact details shown in UI:
- Artist: Attributed to sculptors of the Parthenon workshop under Phidias
- Year: c. 447–432 BCE (Classical Greek period)
- Research summary visible: "Centauromachy as a concentrated flash of..."
Multi-character conversation (Whisper transcript highlights):
- Lapith Rider introduces all 4 characters simultaneously in first response
- User: "Tell me your story" → Each character narrates their role in the battle with distinct voice:
- Lapith Defender [stern]: describes the Centaurs' attack at the wedding feast
- War Horse [deep, resonant]: describes sensing fear, the rider poised for battle
- Charging Horse [energetic]: describes the charge into combat
Elgin Marbles political conversation:
- User pivots: "You're part of the Elgin Marbles — there's some complicated politics. Can you tell me?"
- Characters respond in character with nuance:
- Lapith Defender [somber]: "We once stood proudly upon the Parthenon..."
- War Horse [sighs]: "The light here is different, the air cooler than the sun-drenched Acropolis..."
- Charging Horse [thoughtful]: "There is much discussion, a great debate, about where we truly belong..."
- User: "What do you think about being here at the British Museum?" → characters give varied personal opinions
Key PRD finding from this demo:
- Emotional tone descriptors visible in chat UI: [stern], [deep, resonant], [energetic], [somber], [sigh], [thoughtful], [firm], [proud], [eager] — these are embedded in the agent's response text and rendered in the chat log
- 4-character voice switching in a single conversation is seamless
- The app surfaces genuinely controversial cultural heritage debates (Elgin Marbles repatriation) through character voices — far more engaging than a Wikipedia summary
14. Content Strategy Insight
Both tweet 3 and tweet 4 deliberately choose artifacts with moral and political complexity:
- Sir Hans Sloane → slavery and colonialism
- Elgin Marbles → cultural heritage and repatriation
This isn't accidental. The product's value is highest precisely where static information fails — where visitors want to challenge the artifact, ask uncomfortable questions, and hear a perspective. No plaque addresses these topics. No audio guide takes a position. This app does.
PRD implication: Content strategy and prompt engineering for morally complex figures is a core product competency, not just a feature.
Sources: X thread (@isnit0), ElevenLabs build guide (elevenlabs.io/blog), video analysis (Whisper + Gemini ×3), X API thread fetch
Engagement: 1,783 likes | 926,210 impressions | 193 reposts | 769 bookmarks (as of Feb 2026)
App URL: talkingpaintings.ai (unverified; see Section 12)