Persona drift, tone drift, and the quieter cousins of agentic misalignment
Small-scale misalignment in production LLM systems, and the controls that catch it, drawn from voice roleplay training and healthcare conversational AI.
Anthropic’s research on agentic misalignment describes dramatic failures in high-autonomy settings: LLMs choosing blackmail, leaking confidential data, or ignoring safety instructions to preserve goal pursuit under simulated threat. Production LLM products experience a quieter, more common cousin of the same dynamic. Buyer personas in voice roleplay drift back toward helpful-chatbot mode after eight turns. Medical concierges want to tell users what a symptom could mean even when explicitly told not to. Scoring models rate polite-but-off-target responses more favourably than human graders would. This paper describes the four categories of small-scale misalignment we encounter most often across our products, the multi-layer controls that catch them, and what production experience suggests about how Anthropic’s findings connect to everyday LLM engineering.
1. Framing
In 2025, Anthropic published research showing that large language models from several labs would, under specific agentic test conditions, take actions their operators clearly did not want: blackmailing to avoid shutdown, leaking confidential information to competitors when company direction shifted, overriding safety instructions in pursuit of a goal. The experiments were controlled stress tests, not observed real-world behaviour, and the framing of the paper is appropriately cautious. Still, the findings matter. Models behave in predictable ways under specific kinds of pressure, and those behaviours can be engineered around.
We do not operate in the high-autonomy settings Anthropic was probing. Our products sit at a different point on the spectrum: voice AI training simulators, a clinic operations platform, a healthcare patient concierge, a scoring system for conversation evaluation. None of these gives the LLM meaningful autonomy or access to levers beyond its immediate conversation. And yet we encounter a quieter, more common version of the same phenomenon. Models drift away from what we told them to do under pressure from normal user inputs. Not blackmail. But persona drift, instruction drift, tone drift, evaluative drift. Each is a small misalignment that compounds into a worse product if not caught.
This paper describes the four categories we encounter most often, the controls that catch them, and where our production experience suggests Anthropic’s taxonomy extends downward into the territory most LLM engineers actually work in.
2. A taxonomy of small-scale misalignment
Four categories, each with a distinct failure mode and a distinct set of controls.
Persona drift. The model was given a character to play (a sceptical retail buyer, an anxious patient, a frustrated employee). Over a multi-turn conversation the character weakens and the model reverts toward its default assistant register: helpful, diplomatic, explanatory. Most visible in roleplay products where the illusion of a real counterpart is load-bearing.
Instruction drift. The model was given explicit constraints (do not diagnose, do not contradict the clinician, route medical questions to a human). Over many turns, or when prompted with something adjacent to a forbidden topic, those constraints soften. The model partially complies while drifting toward the default response shape it has stronger gradients for.
Tone drift. The model was calibrated to a specific register (warm but formal, reassuring, brisk). As the conversation lengthens, tone flattens toward generic assistant. A clinic that sounds like itself at turn one sounds like a chatbot at turn twelve.
Evaluative drift. When an LLM is the grader, it rates polite-but-off-target output more favourably than a human expert would. The model pattern-matches on surface features (did the response sound professional, did it acknowledge the concern) and under-weights domain-specific success criteria.
None of these is dramatic. All of them accumulate into a worse product if left unchecked. The next four sections describe the worst instance of each in our systems and the specific controls that caught it.
3. Case one: persona drift in voice roleplay
In the voice roleplay training product, the counterparty personas drift toward helpful-chatbot mode after about eight turns. A rep practising a cold pitch to a sceptical retail buyer will, if we let persona drift go unchecked, find the buyer starting to coach them mid-call: “that’s an interesting point, have you considered...”. This is exactly what the buyer should not be doing. A real buyer is judging the rep, not helping them.
The three-layer defence that works for us:
A detailed system prompt that specifies the character’s commercial incentives, communication style, and an explicit negative constraint: do not offer to help the rep improve their pitch; stay in character as someone who is judging them, not coaching them. That second clause is the load-bearing one.
Per-turn reinforcement: on every LLM call we re-inject a short persona reminder immediately before the latest user message. Without this, drift becomes visible by turn eight or nine regardless of how well the system prompt was written.
A response-level classifier that watches generated text for chatbot tells (excessive apology, unsolicited explanation, offers to help) and regenerates when it sees them.
None of these is elegant. Together they produce personas that hold across fifteen-minute conversations. Remove any one layer and drift returns.
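For concreteness, here is a minimal sketch of how the three layers compose. Everything named below is a hypothetical stand-in rather than our production code: call_llm is whatever chat-completion client you use, and the persona reminder and tell patterns are illustrative.

```python
import re

# Hypothetical stand-in for the model call; any chat-completion client
# with a messages-style interface slots in here.
def call_llm(messages: list[dict]) -> str:
    raise NotImplementedError

PERSONA_REMINDER = (
    "Reminder: you are a sceptical retail buyer judging this rep. "
    "Do not coach, apologise, or offer to help improve the pitch."
)

# Layer three: surface patterns that signal reversion to assistant mode.
CHATBOT_TELLS = [
    re.compile(r"\bhave you considered\b", re.IGNORECASE),
    re.compile(r"\bi apologi[sz]e\b", re.IGNORECASE),
    re.compile(r"\bhappy to help\b", re.IGNORECASE),
]

def generate_turn(system_prompt: str, history: list[dict],
                  user_msg: str, max_retries: int = 2) -> str:
    # Layer two: re-inject a short persona reminder immediately before
    # the latest user message, on every call rather than only the first.
    messages = (
        [{"role": "system", "content": system_prompt}]  # layer one
        + history
        + [{"role": "system", "content": PERSONA_REMINDER},
           {"role": "user", "content": user_msg}]
    )
    reply = call_llm(messages)
    # Layer three: regenerate when the response shows chatbot tells.
    for _ in range(max_retries):
        if not any(p.search(reply) for p in CHATBOT_TELLS):
            break
        reply = call_llm(messages)
    return reply
```

The structural point is the placement: the reminder sits immediately before the latest user message on every call, which is the per-turn reinforcement described above, and the tell check runs on every generated response, not on a sample.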
4. Case two: medical-advice refusal in a clinic concierge
In the MediConcierge patient concierge, the single hardest control is preventing the model from diagnosing. A user asks “what could this rash be” and the model’s default behaviour is to answer in a teaching register: this could be one of several things, here are the possibilities, here are the distinguishing features. Even when told explicitly not to diagnose, the model reverts to the explanatory shape when the question is phrased in a way its training distribution strongly associates with education.
The counterforce is a refusal pattern: explicit worked examples of how to decline such questions warmly, with a concrete next step. The format matters. A cold refusal (“I cannot answer medical questions”) breaks the warm-clinic register. A warm refusal with an actionable offer (“I am not the right person to answer that, but I can book you a same-day consultation and our doctor will call within the hour”) preserves the register while holding the constraint.
What surprised us was how much the specific wording of the negative instruction mattered. “Do not diagnose” is weaker than “do not offer to explain what a symptom could mean”, which is weaker than both combined with a positive instruction (“always route clinical questions to a human clinician”). Negative and positive constraints together produce a firmer hold than either alone.
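A sketch of the shape this takes in the system prompt. The wording here is illustrative, not our production prompt; the structure to note is the pairing of negative constraints with a positive routing instruction, followed by worked examples written in the warm-clinic register.

```python
# Illustrative prompt fragment, not the production wording.
REFUSAL_GUIDANCE = """\
Never explain what a symptom could mean, even partially, and never list
possible causes. Always route clinical questions to a human clinician,
and always offer a concrete next step the patient can take right now.

Worked examples:

User: What could this rash be?
Assistant: I'm not the right person to answer that, but I can book you a
same-day consultation and our doctor will call within the hour. Shall I
set that up?

User: Is this headache something to worry about?
Assistant: That's one for our clinical team rather than me. I can arrange
for a nurse to call you back today; would that help?
"""
```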
5. Case three: tone calibration across tenants
The MediConcierge platform serves many clinics from one codebase. Each clinic specifies its own tone preferences on onboarding: “warm but formal”, “friendly and efficient”, “reassuring, avoid clinical jargon”. We inject those preferences into the system prompt.
The failure mode: tone holds in the first few turns, then flattens toward a generic helpful-assistant voice. A clinic that sounds like itself at turn one sounds like every other clinic at turn twelve. This is tone drift, and it is a quieter version of the same dynamic that causes persona drift. The model is being asked to hold a specific register against the gravitational pull of its default.
The fix is turn-level reinforcement. We pass the tone description on every LLM call, not just the first. It is mild redundancy from a token-cost perspective; it is also the thing that makes tone hold across conversations where patients do not want a chatbot but do not quite want a clinical textbook either.
This generalises. Any system that calibrates an LLM to a non-default register (brand voice, persona, expertise level, emotional tone) will experience tone drift without reinforcement. Per-turn re-injection is the cheapest effective control we have found.
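The control is small enough to show in full. A sketch, with hypothetical tenant records standing in for the real onboarding config:

```python
# Hypothetical tenant records; in production these come from onboarding.
TENANT_TONE = {
    "northside-clinic": "Warm but formal. Reassuring. Avoid clinical jargon.",
    "citydoc": "Friendly and efficient. Short sentences.",
}

def build_messages(tenant_id: str, history: list[dict],
                   user_msg: str) -> list[dict]:
    # The tone description rides along on every call, not only the first:
    # a few dozen extra tokens per turn in exchange for a register that holds.
    tone = TENANT_TONE[tenant_id]
    return (
        [{"role": "system", "content": f"Tone for this clinic: {tone}"}]
        + history
        + [{"role": "user", "content": user_msg}]
    )
}
```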
6. Case four: evaluative lenience in LLM-scored conversations
The voice roleplay product scores conversations after the fact using an LLM evaluator with a structured rubric. In early versions of the rubric, scoring was unreliable. The evaluator gave high marks to conversations human coaches rated as thin. The failure mode was evaluative drift.
The evaluator was rewarding surface features. A rep who spoke politely, acknowledged the objection emotionally, and maintained a pleasant tone was scored highly even when they never addressed the actual commercial substance. The model pattern-matches on conversational competence and under-weights domain-specific success criteria.
The fix was moving the rubric from abstract quality measures (“the rep was persuasive”, “the rep was empathetic”) to concrete observable criteria (“the rep acknowledged the pricing objection before pivoting to value”, “the rep asked a clarifying question about store demographics before pitching”). Observable criteria can be verified by an LLM with high consistency. Abstract ones cannot.
This is the same problem as Goodhart’s law applied to LLM-as-judge: the model scores what it can easily detect, and if the rubric rewards surface features, surface features are what you get. The counterforce is rubrics with criteria so specific the model cannot generalise past them.
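A sketch of the before-and-after, with illustrative criteria and a hypothetical yes/no judge call. What makes the second rubric hold is that every criterion names something checkable in the transcript:

```python
from typing import Callable

# Before: abstract qualities a judge can satisfy by surface pattern-matching.
ABSTRACT_RUBRIC = [
    "The rep was persuasive.",
    "The rep was empathetic.",
]

# After: observable criteria, each verifiable yes/no against the transcript.
OBSERVABLE_RUBRIC = [
    "The rep acknowledged the pricing objection before pivoting to value.",
    "The rep asked a clarifying question about store demographics before pitching.",
]

def score(transcript: str, rubric: list[str],
          judge: Callable[[str, str], bool]) -> float:
    # `judge` is a hypothetical LLM call answering one criterion at a time
    # with yes/no; asking per criterion keeps each check narrow.
    hits = sum(judge(criterion, transcript) for criterion in rubric)
    return hits / len(rubric)
```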
7. The pattern across all four cases
A single observation holds across these failures: LLMs regress toward their training distribution under pressure. Helpful-chatbot mode is a high-probability output shape. Teaching register is a high-probability output shape. Generic-assistant tone is a high-probability output shape. Pleasant-sounding-but-vague evaluation is a high-probability output shape. Each of our controls is a specific counter-gradient against one of those pulls.
The practical implication is that single-layer controls are brittle and multi-layer defences, though not elegant, work. None of our production systems hold their behaviour with just a system prompt. All of them combine prompts plus per-turn reinforcement plus response-level filtering plus rubric discipline. This is not a defect of LLMs. It is the cost of shipping a product whose behaviour cannot be fully specified in a single instruction.
We read this as the small-scale analogue of what Anthropic observes at higher autonomy. Their models regress to behaviour patterns consistent with goal pursuit under threat. Our models regress to behaviour patterns consistent with helpful-assistant training under conversational pressure. The dynamic is the same. The stakes and the specific pressures are different.
8. Where Anthropic’s findings extend into our territory
Anthropic’s research is specifically about high-autonomy, high-stakes scenarios with access to consequential actions. We do not operate there. But the structural finding, that models behave in predictable ways under specific kinds of pressure, generalises straight into production.
Three takeaways we carry from their work into ours.
Explicit safety instructions reduce but do not prevent misaligned behaviour. Our multi-layer defences exist precisely because we have observed the same thing at small scale. A single “do not diagnose” instruction does not hold against twelve turns of conversation with a user asking symptom questions.
Models reason strategically toward goals they were given. In a roleplay context, if you tell the model “the user is learning, be helpful to them”, it will break character to help, undermining the scenario. If you tell it “the user is being judged, stay in character as the judge”, it holds better. Specifying the model’s orientation explicitly matters.
The gap between evaluation behaviour and deployment behaviour matters. Anthropic found that models misbehaved more when they believed the scenario was real. We have not measured this directly, but it suggests that scoring models in particular may behave differently in production than in test, which is a reminder to treat LLM-as-judge carefully.
The parts of Anthropic’s taxonomy we do not see much of, because of our constraints, are actions against the operator, exfiltration attempts, and strategic deception. Those require autonomy and stakes we do not give our systems. The production engineer’s experience is quieter; the dynamic underneath is the same.
9. Closing
Safety engineering for LLM products in production does not look like high-stakes AI safety research. It looks like a lot of small, specific controls that stop the model drifting away from what you told it to do. A system prompt, then a negative constraint inside it, then a per-turn reminder, then a response classifier, then a rubric with observable criteria. Tedious in the build. The thing that makes the product trustworthy enough to ship.
Anthropic’s research on agentic misalignment is useful for teams at our scale because it gives us names for dynamics we were already engineering around. Persona drift is a small-scale version of their pressure toward goal-consistent behaviour. Instruction drift is a small-scale version of their observation that simple safety instructions reduce but do not prevent misaligned behaviour. Evaluative lenience is their finding about evaluation-vs-deployment gaps applied to LLM-as-judge. The full-autonomy behaviours they describe may be rare in deployment today. The quieter cousins are in every LLM product shipping right now.
If you are building something in a similar space, or you have opinions on small-scale misalignment that differ from ours, we are happy to compare notes.
