Voice AI for roleplay training: latency, silence, and the illusion of a real conversation
What it takes to make a voice roleplay partner realistic enough that trainees engage with it as practice rather than as a test. The problem appears in sales, HR, clinical communication, and anywhere else a conversation can make or break the relationship.
Some conversations are too important to practise on the people they are actually for. Sales reps cannot ramp on real customer calls without losing deals. HR managers cannot rehearse a termination meeting on an actual employee. Junior clinicians cannot practise breaking bad news on real patients. A voice-first AI practice partner can fill that gap, but only if the experience feels real enough that the trainee engages with it as a conversation rather than as a test. This paper describes how we built a voice AI roleplay training platform, first deployed for wholesale sales at a UK consumer-goods brand and designed to generalise to HR, clinical communication, and other domains. It covers the three problems at the heart of making the system work wherever it is applied: latency in the voice pipeline, silence detection for turn-taking, and persona stability across multi-turn conversation.
1. The problem: high-stakes conversations you cannot rehearse
Every organisation has conversations that matter disproportionately. A sales rep’s first call with a reluctant buyer. An HR manager’s first termination meeting. A junior doctor’s first conversation with a family about a poor prognosis. A customer service lead on the phone with someone who has threatened to cancel. The common pattern: high stakes, interpersonal, real-time, and hard to train for because the obvious practice surface is a real person with real consequences attached.
The traditional alternatives all have gaps. Role-play with a peer or manager is effective but inconsistent, expensive, and difficult to schedule at scale. Shadowing is passive; the trainee hears the conversation but does not have to produce it under pressure. And the default option, learning on the job, works eventually, but every mishandled objection or fumbled first conversation is a real commercial or human cost.
The client we first built this for is a UK consumer-goods brand with a national wholesale sales team. Their reps run conversations with retail buyers for a living: pitching new ranges, negotiating prices, managing launch cycles, handling complaints. A rep who knows the catalogue backwards can still lose an account by mishandling an objection. The same shape of problem shows up in many other domains. HR managers rehearsing difficult feedback. Customer service leads practising de-escalation. Clinicians learning how to deliver unwelcome news. Interview panels preparing to evaluate consistently. The architecture we describe here was built with sales as the first deployment and is deliberately designed to generalise.
A practice partner that is always available, always in character, and realistic enough that the trainee feels actual pressure would fill the gap between the existing options. The question is whether voice AI can produce an experience someone actually engages with, rather than one they dismiss as obvious after two minutes.
2. What a good practice partner has to do
For a trainee to engage with the simulator as practice rather than as a toy, four things have to be true at once.
It has to sound like a person. Synthetic voice breaks the frame immediately. This is what modern TTS is for. Services like ElevenLabs produce voices that pass casual listening and carry enough expression to read as genuinely human.
It has to respond in conversational time. If there is a two-second gap after the trainee finishes talking, the illusion is gone. Human conversational turn-taking sits at around 200ms on average. The latency budget for the whole pipeline is small, and every component has to earn its share.
It has to stay in character. Whatever persona the scenario calls for (sceptical buyer, upset employee, anxious patient, hostile customer) cannot drift into helpful-chatbot mode when the trainee asks an awkward question. LLMs have strong gravitational pull toward their default register; keeping a counterparty persona stable across multiple turns is a specific engineering problem.
It has to score meaningfully. After the conversation, the trainee wants to know how they did in terms that match how they would be judged in reality. Did they address the objection. Did they handle the escalation. Did they read the room. That is a different problem from scoring syntactically correct output.
None of these four is individually novel. The interesting part is making all four work together in a single product where breaking any one of them collapses the experience.
3. The voice pipeline, and why latency is the whole game
Voice AI at this layer is a three-step pipeline: speech-to-text on the trainee’s input (STT), large-language-model generation of the counterparty’s response (LLM), and text-to-speech synthesis of that response (TTS). Each stage contributes latency, and the total gap from “trainee stops talking” to “counterparty starts talking” has to feel conversational.
Human conversational turn-taking sits around 200ms on average. Anything over about two seconds feels broken. The practical target we aimed for was under 1.5 seconds for most turns, which is achievable but requires discipline at every stage.
Making that budget work means streaming everything that can be streamed. STT returns partial transcripts as the trainee speaks, so the moment they finish we already have the text. The LLM streams tokens as they generate, and we start sending them to TTS the moment we have enough to render a coherent phrase. TTS streams audio chunks back to the client for immediate playback. The trainee hears the first words of the response before the LLM has finished generating the last words.
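To make the hand-off concrete, here is a minimal sketch of the streaming pattern in Python. The generate, synthesise, and play callables are hypothetical stand-ins for whichever STT, LLM, and TTS providers sit behind them, and the phrase-length threshold is an assumed value; the point is only that no stage waits for a full upstream result before starting work.

```python
from typing import AsyncIterator, Awaitable, Callable

# Phrase boundaries at which a partial LLM response is safe to hand to TTS.
PHRASE_BREAKS = (".", ",", "?", "!", ";", ":")
MIN_PHRASE_CHARS = 40  # assumed threshold: enough text to render a natural-sounding phrase


async def phrases(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Group streamed LLM tokens into phrases large enough for TTS to render."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if len(buffer) >= MIN_PHRASE_CHARS and buffer.rstrip().endswith(PHRASE_BREAKS):
            yield buffer
            buffer = ""
    if buffer.strip():
        yield buffer  # flush whatever remains when generation ends


async def respond(
    final_transcript: str,
    generate: Callable[[str], AsyncIterator[str]],      # LLM: prompt -> token stream
    synthesise: Callable[[str], AsyncIterator[bytes]],  # TTS: phrase -> audio chunk stream
    play: Callable[[bytes], Awaitable[None]],           # client-side playback of a chunk
) -> None:
    """Start playing the first phrase before the LLM has finished the last one."""
    async for phrase in phrases(generate(final_transcript)):
        async for chunk in synthesise(phrase):
            await play(chunk)
```

The phrase grouping is the main tuning point: chunks that are too short make the synthesised speech sound choppy, chunks that are too long push the first audio later into the latency budget.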
This architecture is not specific to any domain. It is the standard pattern for any real-time voice AI product. But the latency discipline is easy to lose. A system that starts playing the response at 2.5 seconds instead of 1.2 seconds is experienced as fundamentally worse, even though little more than a second separates them.
4. Silence detection: the hardest UX problem
The hardest single problem in this product is deciding when the trainee has finished their turn.
The obvious approach is voice-activity detection: wait for a short silence (say 500ms) and treat the turn as over. This works badly. Trainees pause mid-sentence to think, to breathe, or to reach for a figure. If the simulator interrupts them, the illusion breaks immediately. They then adapt by speaking continuously without natural pauses, which makes the training unhelpful because real counterparties do not require continuous speech.
The opposite failure is also bad. If the simulator waits too long (say 1.5 seconds of silence) the conversation feels laggy even when the trainee has definitely finished talking.
Our current approach combines three signals. The first is voice-activity detection for raw silence length. The second is a running confidence from the STT system about whether the last utterance is syntactically complete; a sentence ending with a trailing “and...” is almost certainly not done. The third is contextual: during certain moments in the conversation (for example, after the counterparty has asked a direct question) we lower the silence threshold because the trainee is expected to be forming an answer rather than thinking.
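As an illustration, the three signals might combine into a single end-of-turn decision along the lines below. The thresholds and the completeness heuristic are assumptions for the sketch, not our production values.

```python
from dataclasses import dataclass

# Words that, when they end an utterance, strongly suggest the sentence is
# unfinished ("...and", "...because"). Illustrative list, not exhaustive.
TRAILING_FRAGMENTS = {"and", "but", "so", "because", "which", "the", "a", "to"}


@dataclass
class TurnState:
    silence_ms: float        # raw silence length from voice-activity detection
    last_transcript: str     # latest partial transcript from STT
    awaiting_answer: bool    # the counterparty has just asked a direct question


def looks_complete(text: str) -> bool:
    """Crude check that the last utterance reads as syntactically complete."""
    words = text.strip().rstrip(".?!,").split()
    if not words:
        return False
    return words[-1].lower() not in TRAILING_FRAGMENTS


def turn_is_over(state: TurnState) -> bool:
    # Lower the silence threshold when the trainee is expected to be answering,
    # keep it higher when they may be pausing to think or to find a figure.
    threshold_ms = 600 if state.awaiting_answer else 1100
    if not looks_complete(state.last_transcript):
        threshold_ms += 400  # a trailing "and..." buys the trainee extra time
    return state.silence_ms >= threshold_ms
```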
None of these signals is novel individually. The combination gets us closer to reliable turn-taking than any one of them does alone. It still gets things wrong sometimes. That is why we give trainees a retake affordance, and why we treat turn-taking as a UX problem rather than a fully-solved algorithmic one.
5. Persona stability across multi-turn conversation
LLMs have a strong statistical pull toward their default register, which for an assistant model is helpful, diplomatic, and explanatory. A “sceptical retail buyer worried about shelf space”, an “anxious patient receiving unwelcome news”, or a “frustrated employee on the edge of resigning” are all personas that tend to drift toward helpful-chatbot mode after a few turns, especially when the trainee asks a question the model recognises as a common support query.
We hold the persona in place with three tools.
Detailed system prompt. The character is specified in detail: who they are, what their commercial or personal incentives are, what their communication style is, and explicitly what they will not do. For example, do not offer to help the trainee improve their approach; stay in character as someone who is judging them, not coaching them. That second clause catches a surprising amount of drift.
Turn-level reinforcement. On every LLM call we pass not just the conversation history but a short persona reminder immediately before the latest turn. Without it, personas drift noticeably by turn eight or nine.
Response-level filtering. A small classifier watches generated responses for chatbot tells (offering to help, apologising excessively, providing unsolicited explanations) and regenerates when it sees them. This catches most of the remaining drift before it reaches the trainee.
None of this is elegant, but together it produces personas that hold across 15-minute conversations without breaking character. The same three mechanisms work across domains; what changes is the persona definitions and the domain-specific chatbot tells the classifier watches for.
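For illustration, here is how the second and third mechanisms might look in code: a persona reminder injected immediately before the latest turn, and a regenerate-on-drift loop. The generate callable is a hypothetical wrapper around the LLM call, and the keyword check stands in for the small classifier; both are sketches rather than our implementation.

```python
from typing import Callable, Dict, List

# Phrases the response filter treats as chatbot tells. A keyword list here
# stands in for the small classifier described above.
CHATBOT_TELLS = ("happy to help", "i apologise", "as an ai", "let me explain", "great question")

PERSONA_REMINDER = (
    "Reminder: you are the sceptical retail buyer. You are judging the rep, "
    "not coaching them. Do not offer help, advice, or unsolicited explanations."
)

Message = Dict[str, str]


def looks_like_drift(response: str) -> bool:
    text = response.lower()
    return any(tell in text for tell in CHATBOT_TELLS)


def counterparty_turn(
    history: List[Message],
    trainee_utterance: str,
    generate: Callable[[List[Message]], str],  # hypothetical wrapper around the LLM call
    max_retries: int = 2,
) -> str:
    # The persona reminder goes in immediately before the latest turn on every call.
    messages = history + [
        {"role": "system", "content": PERSONA_REMINDER},
        {"role": "user", "content": trainee_utterance},
    ]
    response = generate(messages)
    for _ in range(max_retries):
        if not looks_like_drift(response):
            break
        response = generate(messages)  # regenerate when a chatbot tell slips through
    return response
```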
6. Scoring a conversation
Scoring a sales conversation, a difficult-feedback conversation, or a bad-news delivery is not like scoring a multiple-choice test. There is no single correct answer. What you want to assess is judgment. Did the trainee identify the main objection or concern. Did they address it without becoming defensive. Did they move toward whatever the right resolution looks like in that domain: an agreement, a plan, a safe handover.
We use a post-call LLM evaluation with a structured rubric tuned to each scenario. The rubric has concrete observable criteria (“the rep acknowledged the pricing objection before pivoting to value”; “the manager named the specific behaviour rather than generalising”; “the clinician paused after delivering the news before continuing”) rather than abstract quality measures (“the rep was persuasive”; “the manager was empathetic”). Observable criteria can be evaluated consistently; abstract ones cannot.
The rubric is tied to each scenario separately. A scenario about handling a sales complaint uses different criteria from a scenario about giving performance feedback, which uses different criteria again from a scenario about a patient conversation. The scoring model has access to the transcript, the scenario brief, and the rubric, and it produces both a numerical score per criterion and a short qualitative note naming the specific moment it is referring to.
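As a sketch, the rubric and its output can be represented as simply as the following. The field names are illustrative, the criteria reuse the wholesale example above, and the scoring model call itself is omitted because it is provider-specific.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    key: str
    description: str   # concrete and observable, never abstract


@dataclass
class CriterionResult:
    key: str
    score: int         # small scale, used mainly to track progression across attempts
    note: str          # names the specific moment in the transcript it refers to


PRICING_COMPLAINT_RUBRIC: List[Criterion] = [
    Criterion("acknowledge_objection",
              "The rep acknowledged the pricing objection before pivoting to value."),
    Criterion("stayed_non_defensive",
              "The rep responded to pushback without becoming defensive."),
    Criterion("moved_to_resolution",
              "The rep moved the conversation toward a concrete next step."),
]


def score_call(transcript: str, scenario_brief: str, rubric: List[Criterion]) -> List[CriterionResult]:
    """Post-call evaluation: prompt the scoring model with the transcript, the
    scenario brief, and the rubric, then parse one score and one note per
    criterion. The model call is provider-specific and omitted from this sketch."""
    ...
```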
The qualitative notes matter more than the scores. A manager or training lead running this as part of their coaching loop uses the notes as starting points for one-on-ones. The numerical score mostly exists to track progression across attempts.
7. Where the same architecture applies
A short map of domains where we see the same shape of problem, and where the same platform (with new scenarios and new rubrics) works with minimal adaptation.
Sales. The first deployment. Reps practise pitches, objection handling, and close-rate conversations against buyer personas tuned to the specific account types they will encounter.
HR and people management. Difficult conversations: performance feedback, termination, mediation between colleagues, handling a grievance, returning from long-term sick leave. These are rarer than sales calls in any given career, which makes rehearsing them more valuable, not less. A manager may have to deliver bad news to a team member once a year and has no good way to practise for it.
Clinical communication. Junior clinicians learning to break bad news, manage anxious patients, or handle families during difficult decisions. Traditional training uses simulated patients (actors), which is effective but expensive and rarely available on demand. Voice AI standardises the practice surface and makes it available whenever the trainee has time for it.
Customer service and de-escalation. Support teams handling angry or distressed customers. The pattern is similar to sales but the counterparty profiles are different and the successful-outcome criteria change accordingly.
Interview training. Either side of the table. Candidates rehearsing for high-stakes interviews, or interviewers preparing to evaluate consistently. Persona stability matters especially here: the interviewer persona must not slip into coaching mode.
What changes across these domains is scenario design, persona library, and rubric. The platform underneath stays the same. That is where the engineering investment pays back: build the latency, silence, and persona stack once, then ship it into new domains by configuring scenarios.
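In practice, "configuring scenarios" can be as lightweight as a definition like the one below, which bundles the brief, the persona, the rubric, and any turn-taking overrides. The field names and values are illustrative rather than a description of our actual configuration format.

```python
# Illustrative scenario definition for a new domain. Only configuration like
# this changes between deployments; the latency, silence, and persona stack
# underneath stays the same. Field names and values are assumptions.
HR_FEEDBACK_SCENARIO = {
    "domain": "hr",
    "brief": (
        "Deliver performance feedback to a team member who has missed two "
        "project deadlines and has started to withdraw in meetings."
    ),
    "persona": {
        "name": "Frustrated employee on the edge of resigning",
        "incentives": "Wants to feel heard; suspects the feedback is unfair.",
        "style": "Short answers, defensive at first, opens up if listened to.",
        "will_not_do": "Coach the manager or volunteer solutions unprompted.",
    },
    "rubric": [
        "The manager named the specific behaviour rather than generalising.",
        "The manager let the employee give their account before responding.",
        "The conversation ended with an agreed, concrete next step.",
    ],
    "turn_taking": {"base_silence_ms": 1100, "answer_silence_ms": 600},
}
```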
8. What works and what does not
A few observations from shipping this.
The illusion is surprisingly robust once the latency is right. Trainees engage with well-designed scenarios as if they are real conversations, including getting frustrated, getting nervous, and celebrating when they succeed. The “this is just a chatbot” dismissal evaporates within the first minute if the system feels fast enough.
Scenarios need deep design, not just a brief paragraph. Shallow scenarios get figured out in two attempts and stop being useful. Deep ones, with multiple branching objection patterns, several counterparty moods, and believable edge cases, hold up to many attempts and become a real training surface. This is true in every domain we have looked at.
The hardest users are not new trainees but experienced ones. New users engage enthusiastically; they welcome the practice. Experienced users try to break the simulator by asking questions only a real person could answer. When they succeed, the illusion breaks. When they fail, the simulator has earned their respect.
Scoring feedback is mostly used by coaches, not by individual trainees. That changed how we thought about the scoring surface. It is a coaching tool, not a gamification tool. Managers, training leads, and educators are the actual users of the feedback output; trainees use the conversation itself.
9. What we would do differently
Two honest reflections.
First, we built voice input and voice output to parity without enough thought about which mattered more. In practice the quality of voice output (the counterparty sounding real) is far more important to the illusion than the quality of voice input (perfect transcription). Perfect STT with synthetic-sounding TTS feels worse than imperfect STT with convincing TTS. If we were starting over we would spend more effort on voice output polish and less on STT robustness.
Second, we underestimated how much scenario design would matter relative to platform engineering. We built a sophisticated platform that any scenario can plug into, and then spent months discovering that most scenarios, written as a short brief, produce thin training experiences. The platform was ready for scenario designers we did not have. Next time we would build fewer platform abstractions and more scenarios ourselves, earlier, and in each domain we plan to enter.
Closing note
Voice AI for training sits in an unusual product category. It is not really an AI product; it is a training product that uses AI as a generation engine. The commercial decisions are about pedagogy, learning design, and how the product fits into existing training processes. The engineering decisions are about latency, realism, and persona stability. Those two conversations have to happen at the same table, and most teams organise themselves in ways that prevent that.
The value of this kind of product is not in replacing any existing training tool. It is in giving people a practice surface that was not there before: always available, always in character, consistent across the team, and responsive enough that engaging with it feels like practice rather than busywork. The domain of the first deployment does not define the architecture. The architecture is about conversation, not about commerce.
If you are building something in a similar space, or you have opinions on silence detection and persona stability that differ from ours, we are happy to compare notes.
