Product Requirements Document
TalkingSpaces — Museum AI Companion
Source tweet: https://x.com/isnit0/status/2024104717039685915
Official build guide: https://elevenlabs.io/blog/talk-to-a-statue-building-a-multi-modal-elevenagents-powered-app
Creator: Joe Reeve (@isnit0)
Status: Alpha — DM-gated access (as of Feb 2026)
Analysis method: Whisper + Gemini (3 videos) + X API thread + ElevenLabs build guide
Last updated: 2026-02-20
Evidence policy (for build decisions):
- Verified: directly observed in video UI/audio or official vendor docs
- Likely: strong inference from multiple artifacts, not yet implementation-proof
- Hypothesis: plausible product behavior requiring prototype validation
Model policy (updated):
- Gemini is the only LLM provider for this build (vision + research)
- OpenRouter is removed from local analysis tooling in this repo
- Model selection is fixed to Gemini unless a future architecture decision changes it
1. Product Overview
Vision
A mobile web app that transforms any museum visit into a live voice conversation. Point your phone at a statue or artifact — the app identifies it, researches its history, generates a period-appropriate AI voice per character, and lets you hold a real-time spoken conversation with the exhibit — even with multiple characters simultaneously.
Problem Statement
Museum experiences are passive. Plaques are static, audio guides are scripted, and human guides are expensive and unavailable at scale. Visitors — especially children and international tourists — leave without deeply connecting with what they've seen.
Proposed Solution
A mobile-optimised Next.js web app that:
1. Takes a photo of any statue, monument, or artwork
2. Identifies all depicted characters via Gemini vision
3. Researches historical context via Gemini (deep or standard mode, user selects)
4. Generates a unique, period-appropriate voice per character via ElevenLabs Voice Designer (eleven_multilingual_ttv_v2)
5. Enables real-time spoken conversation via ElevenLabs Agent over WebRTC (eleven_v3)
Creator's Own Words
"I made an app that lets you talk to any statue you want using AI. The app will use OpenAI research to find all the historical context about the statue. Then, using that information and the ElevenLabs Voice Designer API, it will design the perfect voice completely using AI that matches that statue. And then we simply use that voice in an ElevenLabs agent."
How It Was Built
The entire app was built from a single one-shot prompt in Cursor with Claude Opus 4.5 (high) from an empty Next.js project. The prompt:
"We need to make an app that: is optimised for mobile, allows the user to take a picture of a statue/monument that includes one or more people, uses an OpenAI LLM API call to identify the statue/characters/location, allows the user to check it's correct then do either a deep research or standard search, then creates an ElevenLabs agent (allowing multiple voices) that the user can talk to as though they're talking to the characters. Each character should use Voice Designer API to create a matching voice. The purpose is to be fun and educational."
Evidence Confidence Snapshot
| Item | Confidence | Notes |
|---|---|---|
| Multi-character live conversation behavior | Verified | Observed in extracted frames/transcripts |
| Pipeline stages (Identified → Research → Voices → Ready) | Verified | Visible in UI captures |
| Exact internal model IDs used by the original production app | Hypothesis | Must be validated independently; your build is Gemini-only |
| Domain/app availability claims from tweet thread | Likely | Needs independent runtime verification |
2. Target Users
| User Type | Need | Pain Point Solved |
|---|---|---|
| Casual museum visitor | Engaging, effortless learning | Plaques are dry and hard to read |
| Families with children | Interactive, entertaining experience | Kids disengage quickly from traditional guides |
| International tourists | Native language support | Audio guides rarely support their language |
| Educators / school groups | Deep contextual engagement | Hard to make ancient artifacts relatable |
| Museums / galleries | Scalable modern visitor experience | Human guides are expensive and limited |
3. Core User Flow
1. OPEN APP (mobile browser) → tap camera button
2. PHOTOGRAPH statue/monument/artwork
3. CONFIRM → app shows identified name/characters, user verifies
4. RESEARCH MODE → choose deep research or standard search (Gemini)
5. VOICES → ElevenLabs Voice Designer creates a voice per character
6. READY → enter live WebRTC conversation
7. TALK → speak; each character responds in their distinct AI voice
(characters can also respond to each other)
Pipeline status tabs in UI: Identified → Research → Voices → Ready
4. Confirmed Demo Scenarios (from thread)
| Tweet | Artifact | Characters | Notable |
|---|---|---|---|
| Main | Colossal bust of Ramesses II | 1 | First demo, established product |
| Tweet 1 | The Ancestor (British Museum) | 1 | Second demo |
| Tweet 2 | Two friends playing knucklebones | 2 | Multi-persona, friends talking to each other |
| Tweet 3 | Sir Hans Sloane | 1 | Founder of the British Museum — meta |
| Tweet 4 | Parthenon Metope Relief (Elgin Marbles) | 4 simultaneously | Most technically complex demo |
Tweet 4 (Parthenon) is the headline demo — 4 characters speaking to the user and to each other simultaneously, all with distinct voices. This is the key technical differentiator.
5. Features
5.1 Core Features (MVP — confidence-labeled)
F1 — Artifact Identification (Gemini vision)
- User photographs statue/monument via mobile camera
- Image sent to Gemini vision with structured output
- Returns JSON: statueName, location, artist, year, description, characters[]
- Each character entry includes: name, description, era, voiceDescription
- User confirms identification is correct before proceeding
- Supports 1 to N characters per artifact
JSON schema returned:
{
  "statueName": "string",
  "location": "string (city, country)",
  "artist": "string",
  "year": "string",
  "description": "string",
  "characters": [{
    "name": "string",
    "description": "string",
    "era": "string",
    "voiceDescription": "string — audio quality + age + gender + vocal qualities + accent + pacing + personality"
  }]
}
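A minimal sketch of the F1 identification call, assuming the Google Gen AI JS SDK (@google/genai) with structured output matching the schema above. The identifyArtifact helper, the env-driven model selection, and the prompt wording are illustrative assumptions, not confirmed details of the original app.

```typescript
// Sketch of F1 (assumed wiring): photo in, structured artifact JSON out.
import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

export async function identifyArtifact(imageBase64: string) {
  const response = await ai.models.generateContent({
    // Model policy: Gemini only; concrete model set via env (placeholder default).
    model: process.env.GEMINI_VISION_MODEL ?? "gemini-2.0-flash",
    contents: [
      { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
      { text: "Identify this statue/monument and every depicted character. Return JSON only." },
    ],
    config: {
      responseMimeType: "application/json",
      responseSchema: {
        type: Type.OBJECT,
        properties: {
          statueName: { type: Type.STRING },
          location: { type: Type.STRING },
          artist: { type: Type.STRING },
          year: { type: Type.STRING },
          description: { type: Type.STRING },
          characters: {
            type: Type.ARRAY,
            items: {
              type: Type.OBJECT,
              properties: {
                name: { type: Type.STRING },
                description: { type: Type.STRING },
                era: { type: Type.STRING },
                voiceDescription: { type: Type.STRING },
              },
            },
          },
        },
      },
    },
  });
  // Structured output mode returns JSON text conforming to the schema above.
  return JSON.parse(response.text ?? "{}");
}
```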
F2 — Historical Research (two modes)
- Standard search: fast, surface-level context
- Deep research: thorough, richer historical detail (slower)
- User selects mode after identification
- Research stored per artifact — not repeated on revisit
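How the two research modes map onto Gemini calls is not documented. The sketch below assumes deep mode adds Google Search grounding and a more exhaustive prompt, while standard mode is a single fast pass; the researchArtifact helper and model choice are hypothetical.

```typescript
// Sketch of F2 (assumed approach): one entry point, two depths.
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

export async function researchArtifact(
  statueName: string,
  characters: { name: string }[],
  mode: "standard" | "deep"
) {
  const depth =
    mode === "deep"
      ? "Write a thorough historical briefing: creation, provenance, controversies, and the daily life of each figure."
      : "Write a concise historical summary (5-8 sentences).";

  const response = await ai.models.generateContent({
    model: process.env.GEMINI_RESEARCH_MODEL ?? "gemini-2.0-flash",
    contents: `${depth}\nArtifact: ${statueName}\nCharacters: ${characters
      .map((c) => c.name)
      .join(", ")}`,
    // Google Search grounding only for deep mode (assumption).
    config: mode === "deep" ? { tools: [{ googleSearch: {} }] } : undefined,
  });
  return response.text; // stored per artifact so it is not repeated on revisit
}
```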
F3 — AI Voice Generation per Character
- Uses ElevenLabs Voice Designer API (eleven_multilingual_ttv_v2)
- Generates 3 voice previews per character, selects first
- Voice description formula (critical for quality):
  "Perfect audio quality. [age/gender] with [vocal tone], [precise accent]. [pacing]. [personality trait]."
  Example: "Perfect audio quality. A powerful woman in her 30s with a deep resonant voice and a thick Celtic British accent. Her tone is commanding and fierce. She speaks at a measured, deliberate pace with passionate intensity."
- Longer, character-appropriate sample text (50+ words) produces more stable voices
- Voice saved as voice_id, cached per character — never regenerated on revisit
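A sketch of F3's Voice Designer wiring: generate previews from the voiceDescription, keep the first, and persist it as a reusable voice. The designCharacterVoice helper is hypothetical, and the REST paths and field names reflect the ElevenLabs Voice Design API as understood here; verify them against the current ElevenLabs docs before building on them.

```typescript
// Sketch of F3 (assumed wiring): design previews, keep the first, save as a voice.
const ELEVEN_API = "https://api.elevenlabs.io/v1";
const headers = {
  "xi-api-key": process.env.ELEVENLABS_API_KEY!,
  "Content-Type": "application/json",
};

export async function designCharacterVoice(character: {
  name: string;
  voiceDescription: string;
  sampleText: string; // 50+ words, in character, for more stable previews
}): Promise<string> {
  // 1) Generate voice previews from the description (endpoint path is an assumption).
  const previewRes = await fetch(`${ELEVEN_API}/text-to-voice/create-previews`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      voice_description: character.voiceDescription,
      text: character.sampleText,
      model_id: "eleven_multilingual_ttv_v2",
    }),
  });
  const { previews } = await previewRes.json();

  // 2) Persist the first preview as a named, reusable voice.
  const createRes = await fetch(`${ELEVEN_API}/text-to-voice/create-voice-from-preview`, {
    method: "POST",
    headers,
    body: JSON.stringify({
      voice_name: character.name,
      voice_description: character.voiceDescription,
      generated_voice_id: previews[0].generated_voice_id,
    }),
  });
  const { voice_id } = await createRes.json();
  return voice_id; // cache on the Character record so it is never regenerated
}
```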
F4 — Multi-Voice ElevenLabs Agent (WebRTC)
- Creates one ElevenLabs Agent per artifact scan (eleven_v3)
- Primary character assigned as the default voiceId
- Additional characters registered in supportedVoices[] with label + voiceDescription
- Agent switches between character voices in real time during conversation
- System prompt instructs the agent to play ALL characters simultaneously
- Config: turnTimeout: 10s, maxDurationSeconds: 600 (10 min max session)
- Real-time audio via WebRTC (server-side token → client-side session)
Agent config (confirmed):
{
  conversationConfig: {
    agent: {
      language: "en",
      prompt: { prompt: systemPrompt, temperature: 0.7 }
    },
    tts: {
      voiceId: primaryCharacter.voiceId,
      modelId: "eleven_v3",
      supportedVoices: otherCharacters.map(c => ({
        voiceId: c.voiceId,
        label: c.name,
        description: c.voiceDescription
      }))
    },
    turn: { turnTimeout: 10 },
    conversation: { maxDurationSeconds: 600 }
  }
}
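A sketch of how the confirmed config above might be passed to agent creation. It assumes the ElevenLabs JS SDK exposes conversationalAi.agents.create (consistent with the token call shown in section 6.4); method and response field names should be verified against the SDK version in use.

```typescript
// Sketch of F4 (assumed SDK call): wrap the confirmed config in an agent-create request.
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

export async function createArtifactAgent(
  systemPrompt: string,
  primaryCharacter: { voiceId: string },
  otherCharacters: { voiceId: string; name: string; voiceDescription: string }[]
) {
  const agent = await elevenlabs.conversationalAi.agents.create({
    conversationConfig: {
      agent: {
        language: "en",
        prompt: { prompt: systemPrompt, temperature: 0.7 },
      },
      tts: {
        voiceId: primaryCharacter.voiceId,
        modelId: "eleven_v3",
        supportedVoices: otherCharacters.map((c) => ({
          voiceId: c.voiceId,
          label: c.name,
          description: c.voiceDescription,
        })),
      },
      turn: { turnTimeout: 10 },
      conversation: { maxDurationSeconds: 600 },
    },
  });
  return agent.agentId; // field name assumed from the SDK's camelCase responses; cache per scan
}
```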
F5 — Multi-Persona Simultaneous Conversation
- Agent prompt: "You are playing ALL [N] characters simultaneously"
- Characters respond to user AND to each other in the same conversation
- Demonstrated with 2 characters (knucklebones), and 4 characters (Parthenon Metope)
- This is the core technical differentiator — no other museum app does this
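The creator's exact system prompt is not published. The template below is an illustrative sketch consistent with the observed behavior: all characters played simultaneously, bracketed tone tags (e.g. [stern]) in responses, and characters addressing each other as well as the visitor.

```typescript
// Sketch of the multi-persona system prompt (illustrative wording, not the creator's prompt).
export function buildSystemPrompt(
  artifact: { statueName: string; description: string },
  characters: { name: string; description: string; era: string }[],
  research: string
) {
  const roster = characters
    .map((c) => `- ${c.name} (${c.era}): ${c.description}`)
    .join("\n");

  return `You are playing ALL ${characters.length} characters depicted in "${artifact.statueName}" simultaneously.
Characters:
${roster}

Rules:
- Stay in character at all times; each reply is voiced by exactly one character.
- Prefix emotional delivery with bracketed tone tags, e.g. [stern], [thoughtful].
- Characters may address each other as well as the visitor.
- Use the historical research below; if unsure, say so in character rather than inventing facts.

Historical research:
${research}

The artifact: ${artifact.description}`;
}
```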
5.2 Supporting Features
F6 — Exhibit Info Card
- Displays: artifact name, artist, year/period, location, description
- Shown during and after pipeline, before conversation starts
F7 — Suggested Question Prompts
- Pre-seeded question bubbles: "How did you end up here?", "Tell me who you are"
- Lowers barrier to starting conversations
F8 — Listening Indicator
- "Listening..." UI state during voice input
F9 — Research Mode Selector
- User explicitly chooses: deep research vs standard search
- Gives control over speed vs depth trade-off
6. Technical Architecture
6.1 Current Build Target Tech Stack (recommended)
| Layer | Technology | Model/Version | Role |
|---|---|---|---|
| Frontend | Next.js (mobile-optimised) | — | Mobile web app, camera access |
| Vision / Identification | Gemini API | Gemini model set via env | Identify artifact + characters |
| Research | Gemini API | Gemini model set via env | Historical context (deep or standard) |
| Voice Design | ElevenLabs Voice Designer | eleven_multilingual_ttv_v2 | Generate character voices |
| Conversation | ElevenLabs Conversational AI | eleven_v3 | Real-time multi-voice agent |
| Audio transport | WebRTC | — | Low-latency real-time audio |
| STT | ElevenLabs (built-in) | — | Voice input → text |
6.2 Pipeline Flow
Mobile camera capture (base64 JPEG)
↓
Gemini vision call (structured output)
→ { statueName, characters[{ name, voiceDescription }] }
↓
User confirms identification
↓
Gemini research call (deep or standard)
→ historical context per character
↓
ElevenLabs Voice Designer (per character)
→ voice previews → save → voice_id[]
↓
ElevenLabs Agent create
→ { agentId, primaryVoice, supportedVoices[] }
↓
WebRTC session start (server token → client)
↓
Real-time conversation loop
(STT → multi-persona LLM → TTS voice switching)
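The same flow expressed as one server-side function, composed from the hypothetical helpers sketched in section 5; module paths are illustrative, and the user's confirmation step sits in the UI between identification and research.

```typescript
// Sketch of the full pipeline (illustrative composition of the section 5 sketches).
import { identifyArtifact } from "./identify";
import { researchArtifact } from "./research";
import { designCharacterVoice } from "./voices";
import { buildSystemPrompt } from "./prompt";
import { createArtifactAgent } from "./agent";

export async function runPipeline(imageBase64: string, mode: "standard" | "deep") {
  // 1) Identify the artifact and its characters (Gemini vision, structured output)
  const artifact = await identifyArtifact(imageBase64);

  // 2) Research historical context at the depth the user chose
  const research = await researchArtifact(artifact.statueName, artifact.characters, mode);

  // 3) Design one cached voice per character (ElevenLabs Voice Designer)
  for (const character of artifact.characters) {
    character.voiceId = await designCharacterVoice({
      name: character.name,
      voiceDescription: character.voiceDescription,
      sampleText: character.description, // ideally a longer, in-character sample (50+ words)
    });
  }

  // 4) Create the multi-voice agent for this scan; the client then fetches a
  //    conversation token and starts the WebRTC session (section 6.4)
  const [primary, ...others] = artifact.characters;
  const systemPrompt = buildSystemPrompt(artifact, artifact.characters, research ?? "");
  const agentId = await createArtifactAgent(systemPrompt, primary, others);

  return { artifact, agentId };
}
```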
6.3 Data Model (from build guide)
Character {
  name: string
  description: string
  era: string
  voiceDescription: string  // detailed prompt for Voice Designer
  voiceId: string           // ElevenLabs voice_id (cached)
}

Artifact {
  statueName: string
  location: string
  artist: string
  year: string
  description: string
  characters: Character[]
  agentId: string           // ElevenLabs agent_id (cached per scan)
}
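The data model implies a lookup-or-create pattern for voice_id and agent_id. A minimal sketch follows; the key-value storage interface is an assumption, since the original backing store is not specified.

```typescript
// Sketch of the caching strategy: voice_ids (and, analogously, agent_ids) are
// created once and reused on revisit, so cold scans pay the generation cost exactly once.
interface KV {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

export async function getOrCreateVoice(
  kv: KV,
  artifactKey: string,
  character: { name: string; voiceDescription: string; sampleText: string },
  design: (c: { name: string; voiceDescription: string; sampleText: string }) => Promise<string>
): Promise<string> {
  const cacheKey = `voice:${artifactKey}:${character.name}`;
  const cached = await kv.get(cacheKey);
  if (cached) return cached; // revisit: no Voice Designer call, no extra cost

  const voiceId = await design(character); // cold scan: generate once
  await kv.set(cacheKey, voiceId);
  return voiceId;
}
```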
6.4 WebRTC Integration
// Server-side: get conversation token
const { conversationToken } = await elevenlabs.conversationalAi
  .agents.getConversationToken(agentId);

// Client-side: start session
await conversation.startSession({ token: conversationToken });
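In a Next.js app the server-side call would typically live behind an API route so the ElevenLabs key never reaches the browser. A sketch, assuming an App Router route handler; the route path is illustrative and the SDK call mirrors the snippet above.

```typescript
// Sketch: app/api/conversation-token/route.ts (illustrative path)
import { NextResponse } from "next/server";
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";

const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

export async function GET(request: Request) {
  const agentId = new URL(request.url).searchParams.get("agentId");
  if (!agentId) {
    return NextResponse.json({ error: "agentId required" }, { status: 400 });
  }
  // Issue a short-lived conversation token for the client to start the WebRTC session.
  const { conversationToken } =
    await elevenlabs.conversationalAi.agents.getConversationToken(agentId);
  return NextResponse.json({ conversationToken });
}
```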
7. User Stories
| Priority | Story |
|---|---|
| P0 | As a visitor, I can photograph a statue and have the app identify it within seconds |
| P0 | As a visitor, I can speak to the statue in real time and it responds in character |
| P0 | As a visitor, each character has a distinct voice that matches their era and personality |
| P1 | As a visitor, I can choose how deeply the app researches before talking |
| P1 | As a visitor, I can interact with all characters in a group sculpture simultaneously |
| P1 | As a visitor, characters respond to each other — not just to me |
| P1 | As a visitor, the app suggests questions to help me start |
| P2 | As a repeat visitor, cached voices load instantly for artifacts I've seen before |
| P2 | As an international visitor, I can interact in my native language |
| P2 | As a visitor, I can hold a conversation for up to 10 minutes per artifact |
8. Open Questions (Updated)
| # | Question | Status |
|---|---|---|
| 1 | Identification model | ✅ Resolved (for this build): Gemini-only; tune model choice for accuracy/cost/latency |
| 2 | ElevenLabs cost model | ⚠️ Open: Voice Designer + Agent per unique scan — caching mitigates but cold scans are expensive |
| 3 | Museum partnerships | ⚠️ Open: Works on any public artifact without museum opt-in (raises IP/access questions at scale) |
| 4 | Content accuracy | ⚠️ Open: No validation layer mentioned — Gemini research outputs used directly |
| 5 | Publishing intent | ✅ Resolved: Alpha via DM. Build guide published on ElevenLabs blog for community to self-build |
| 6 | Voice description quality | ✅ Resolved: Formula confirmed — "Perfect audio quality. [age/gender]. [vocal tone]. [precise accent]. [pacing]. [personality]" |
| 7 | Max session length | ✅ Resolved: 600 seconds (10 minutes) |
9. Competitive Landscape
| Product | Approach | Gap vs This App |
|---|---|---|
| Traditional audio guides | Pre-recorded, scripted | Not interactive, not personalized |
| Bloomberg Connects | Curated museum app | No AI conversation, no dynamic voice |
| Google Lens | Identifies objects, links Wikipedia | No voice, no conversation, no character |
| Museum chatbots (text) | Text Q&A about exhibits | No voice, single character only |
| This app | Real-time multi-voice conversation with artifact | First-mover: in-character voice + multi-persona |
10. Success Metrics (Suggested)
| Metric | Target |
|---|---|
| Time to first conversation | < 20s from photo capture |
| Identification accuracy | > 90% for major museum collections |
| Questions per artifact per session | > 3 |
| Artifacts per visit | 3+ |
| Session duration | > 2 minutes average |
| Alpha → beta conversion | > 40% of DM requesters become repeat users |
11. Build Complexity Assessment
Surprisingly low. The creator built this with a single prompt in Cursor (Claude Opus 4.5, high). Core complexity is in:
- Voice description quality — the prompt formula for Voice Designer is the key craft skill
- Multi-persona agent prompting — instructing the agent to play N characters simultaneously
- WebRTC integration — server token → client session (well-documented by ElevenLabs)
- Caching strategy — ensuring voice_id and agent_id are stored to avoid regeneration costs
A developer with Next.js experience and ElevenLabs API access could replicate this in a weekend.
12. App Identity
App URL observed in video UI: talkingpaintings.ai
Treat as Likely until independently verified (e.g., live request + matching functionality).
13. Demo Analysis (Video Evidence)
Main tweet video (~80s) — Ramesses II
- Full pipeline walkthrough: PHOTO → Identified → Research → Voices → Ready
- Single character: Ramesses II (ancient Egyptian pharaoh)
- Demonstrates text + voice chat, "Listening..." indicator
- ElevenLabs Voice Designer + Agent overlays visible
Tweet 3 — Sir Hans Sloane (~3.5 min, 220s)
Artifact: Terracotta bust of Sir Hans Sloane, British Museum
Character: Sir Hans Sloane (founder of the British Museum)
Conversation highlights (Whisper transcript):
- User asks how Sloane built his collection → Sloane describes his Jamaica expedition in the 1680s
- User asks directly about slavery: "I've heard that slaves were involved in collecting some of these exhibits"
- Sloane responds in character, acknowledging that his wealth was tied to Jamaican plantations and enslaved labour
- User asks: "What did you make of it at the time?"
- Sloane defends himself as a man of his era, focused on scientific knowledge
- User closes warmly: "I'm grateful that you put some of that wealth to good use"
Why this matters for the PRD: The AI handles morally complex historical topics in character — it doesn't deflect or refuse. This is intentional product design. The museum experience is enriched precisely because visitors can ask uncomfortable questions that plaques can't address.
UI confirmed: Connecting... → Live → Speaking... / Listening... states all visible
Tweet 4 — Parthenon Metope / Elgin Marbles (~4.5 min, 256s)
Artifact: Parthenon metope relief: Lapiths vs Centaurs, British Museum
Characters (4 simultaneous):
- The Lapith Rider
- The Lapith Defender
- The War Horse
- The Charging Horse
Artifact details shown in UI:
- Artist: Attributed to sculptors of the Parthenon workshop under Phidias
- Year: c. 447–432 BCE (Classical Greek period)
- Research summary visible: "Centauromachy as a concentrated flash of..."
Multi-character conversation (Whisper transcript highlights):
- Lapith Rider introduces all 4 characters simultaneously in first response
- User: "Tell me your story" → Each character narrates their role in the battle with distinct voice:
- Lapith Defender [stern]: describes the Centaurs' attack at the wedding feast
- War Horse [deep, resonant]: describes sensing fear, the rider poised for battle
- Charging Horse [energetic]: describes the charge into combat
Elgin Marbles political conversation:
- User pivots: "You're part of the Elgin Marbles — there's some complicated politics. Can you tell me?"
- Characters respond in character with nuance:
- Lapith Defender [somber]: "We once stood proudly upon the Parthenon..."
- War Horse [sighs]: "The light here is different, the air cooler than the sun-drenched Acropolis..."
- Charging Horse [thoughtful]: "There is much discussion, a great debate, about where we truly belong..."
- User: "What do you think about being here at the British Museum?" → characters give varied personal opinions
Key PRD finding from this demo:
- Emotional tone descriptors visible in chat UI: [stern], [deep, resonant], [energetic], [somber], [sigh], [thoughtful], [firm], [proud], [eager] — these are embedded in the agent's response text and rendered in the chat log
- 4-character voice switching in a single conversation is seamless
- The app surfaces genuinely controversial cultural heritage debates (Elgin Marbles repatriation) through character voices — far more engaging than a Wikipedia summary
14. Content Strategy Insight
Both tweet 3 and tweet 4 deliberately choose artifacts with moral and political complexity:
- Sir Hans Sloane → slavery and colonialism
- Elgin Marbles → cultural heritage and repatriation
This isn't accidental. The product's value is highest precisely where static information fails — where visitors want to challenge the artifact, ask uncomfortable questions, and hear a perspective. No plaque addresses these topics. No audio guide takes a position. This app does.
PRD implication: Content strategy and prompt engineering for morally complex figures is a core product competency, not just a feature.
Sources: X thread (@isnit0), ElevenLabs build guide (elevenlabs.io/blog), video analysis (Whisper + Gemini ×3), X API thread fetch
Engagement: 1,783 likes | 926,210 impressions | 193 reposts | 769 bookmarks (as of Feb 2026)
App URL: talkingpaintings.ai (unverified; see Section 12)