How we calibrated a Synthetic Focus Group from 'this looks great!' to 93% accuracy
6 months, 16 niches, 2,069 real insights, and 13 fixes that made AI personas actually disagree with us.

When we shipped the first version of GoNoGo's Synthetic Focus Group (SFG), every persona loved everything.
The setup: a founder finishes a 30-minute voice discovery interview about their idea. From that conversation plus a stack of insights we'd scraped for the niche, we spin up five AI personas — a CTO, a budget-conscious shopper, a skeptic, an early adopter, a casual user — and ask the panel to react to the founder's two value-prop variants and a pricing test. All five voted the same way on every variant. All five "would maybe buy." Slightly different wording. Same vibe.
The problem is obvious in retrospect: we'd built a confirmation engine, not a focus group.
This is the story of the next six months — what broke, what we tried, what stuck. By the end we hit 93% predicted-vs-real accuracy across 16 niches with a 95% CI of 91.4–94.6%. Here's how.
What SFG is actually for (and what it isn't)
Before the engineering, the use case.
A founder finishes the voice discovery and now has a stack of decisions to make: which value prop leads, what to charge, which marketing claims hold up, which feature to build first, which segment to target. Traditionally these get answered by (a) gut feel, (b) asking 5 friends, or (c) running a real survey that takes weeks and costs money to recruit the right respondents.
The SFG sits in between. It's a panel of synthetic respondents, each modelled on a specific archetype-with-real-grievances, that the founder can re-run as many times as they want against any decision they need to make — A/B variants, pricing, claims, positioning, segments.
What it gives you: fast, repeatable reads on decision-stage trade-offs (value-prop variants, pricing, marketing claims, positioning, segment fit) without recruiting a single respondent.
When we say "93% accuracy" later in this post, that's what's being measured: how closely a synthesized archetype's modelled behavior matches the observed behavior of real users in that archetype, on data the model didn't see during synthesis. Not pre-cognition. Behavioral fidelity.
That distinction matters because it tells you what the SFG is good for (decision-stage trade-offs, claim stress-tests, segment fit, pricing) and what it's bad for (predicting macro-market outcomes, novel categories with no public user data, regulated industries where the data isn't there).
The three things that broke our personas
After watching ~50 sessions where personas all said "great idea!" we noticed three failure modes:

1. Template personas: frustrations came from the LLM's priors, not from real users, so every persona converged on the same generic complaints.
2. Uniform sampling: every persona ran at the same temperature, so a cautious CFO and an impulsive early adopter produced the same spread of responses.
3. Position bias: whichever variant appeared first won disproportionately, regardless of content.
We fixed each one. Here's how.
Fix #1: Personas built from real grievances, not templates
The base persona-generation algorithm now does this: pull real frustrations from the scraped niche insights, let the LLM infer additional ones, weight the real frustrations 3× over the inferred ones, attach up to five verbatim quotes from the source data, and score the result for richness so thin personas can be flagged.
The critical line in `persona_builder.py`:
```python
# Real frustrations from source data weighted 3x over LLM-generated ones
weighted_frustrations = (
    real_frustrations * 3 +
    llm_inferred_frustrations * 1
)
```
That 3× multiplier is the entire difference between a persona who says "I'd want better onboarding" (LLM generic) and one who says "I bounced from the last 4 tools because none of them imported my Notion docs without breaking nested toggles" (real Reddit thread).
The 3× weight on real frustrations is the cheapest, highest-leverage change in the whole pipeline. Without it, you're just paraphrasing the model's prior beliefs back to the founder.
Each persona also carries up to 5 verbatim quotes from the source data, plus a richness score (0–1) so the orchestrator can flag thin personas before they pollute results. Average richness when >100 insights are available: 0.85+.
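As a sketch of what each persona carries, something like the following (the `Persona` shape and `richness_score` heuristic here are illustrative, not our production code):

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    archetype: str
    frustrations: list[str]  # real grievances weighted over LLM-inferred ones
    quotes: list[str] = field(default_factory=list)  # up to 5 verbatim source quotes
    richness: float = 0.0    # 0-1, used to flag thin personas

def richness_score(real_frustrations: list[str], quotes: list[str],
                   insight_count: int) -> float:
    """Toy heuristic: more real frustrations, verbatim quotes, and source
    insights make a richer persona; saturates at 1.0."""
    return min(1.0,
               0.4 * min(len(real_frustrations) / 5, 1.0) +
               0.3 * min(len(quotes) / 5, 1.0) +
               0.3 * min(insight_count / 100, 1.0))
```

The real scorer's inputs and weights will differ; the point is that richness is computed from observable source-data quantities, so the orchestrator can flag thin personas mechanically rather than by eyeballing transcripts.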
Takeaway: persona realism is upstream of every other decision. If your input data is "what an LLM thinks a CTO sounds like," everything downstream is fan-fiction.
Fix #2: Temperature tuned per archetype
Behavioral diversity isn't a prompt problem — it's a sampling problem.
We tuned `temperature` per archetype in `get_temperature()`:
```python
ARCHETYPE_TEMPERATURE = {
    "EARLY_ADOPTER": 0.9,    # impulsive, willing to leap
    "CASUAL": 0.7,
    "MAINSTREAM": 0.6,
    "PRAGMATIST": 0.5,       # analytical, predictable
    "CTO": 0.4,
    "CFO": 0.4,
    "SKEPTIC": 0.5,          # rigid, negative-biased
    "BUDGET_CONSCIOUS": 0.5,
    # ... 15 archetypes total
}
```
This alone meaningfully shifted distributions. Skeptics started landing in the 4–6 appeal range by default. Early adopters jumped to 7–9. Pragmatists stayed in the 5–7 band where they belonged.
We also embedded cognitive bias hints directly into each archetype's system prompt. Pragmatists get explicit "status quo bias" framing. Skeptics get "negativity bias" framing. CFOs get loss-aversion phrasing.
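A minimal sketch of how the temperature lookup and the bias hints might be wired together (`BIAS_HINTS`, `DEFAULT_TEMPERATURE`, and `build_system_prompt` are illustrative names, and the archetype table is abbreviated):

```python
ARCHETYPE_TEMPERATURE = {"EARLY_ADOPTER": 0.9, "PRAGMATIST": 0.5,
                         "SKEPTIC": 0.5, "CFO": 0.4}
DEFAULT_TEMPERATURE = 0.6  # fallback for archetypes missing from the table

BIAS_HINTS = {
    "PRAGMATIST": "You exhibit status quo bias: you favor proven tools over new ones.",
    "SKEPTIC": "You exhibit negativity bias: you look for flaws first.",
    "CFO": "You are loss-averse: weigh every purchase as a risk of wasted budget.",
}

def get_temperature(archetype: str) -> float:
    return ARCHETYPE_TEMPERATURE.get(archetype, DEFAULT_TEMPERATURE)

def build_system_prompt(archetype: str, frustrations: list[str]) -> str:
    lines = [f"You are a {archetype.replace('_', ' ').title()} evaluating a product."]
    if archetype in BIAS_HINTS:
        lines.append(BIAS_HINTS[archetype])
    lines += [f"- A real frustration you carry: {f}" for f in frustrations]
    return "\n".join(lines)
```

Each panel member then gets its own system prompt and its own sampling temperature, rather than five copies of one configuration.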
The personas didn't just sound different — they actually disagreed with each other.
Takeaway: if every persona is sampled at the same temperature, you're running the same character five times in different costumes.
Fix #3: Variant shuffling
Stupid, easy, huge:
```python
import random

# Shuffle the variant order per persona to neutralize position bias,
# then apply neutral labels to whatever lands in each slot
random.shuffle(variants)
variant_labels = [f"Option {i + 1}" for i in range(len(variants))]
```
For an A/B/C test with 5 personas, each persona sees the variants in a different random order under neutral labels. Position effects average out across the panel.
We measured this. Before shuffling: Option A won 64% of two-variant tests across our calibration set. After shuffling: 50.3% / 49.7%. The label was carrying a 14-point bias.
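The audit itself is easy to reproduce. A sketch of how wins can be tallied by presentation slot (the session shape and `position_win_rates` are illustrative):

```python
from collections import Counter

def position_win_rates(sessions: list[dict]) -> dict[int, float]:
    """For each session {'order': [variant ids in shown order], 'winner': id},
    tally how often each presentation slot (0 = shown first) wins."""
    slot_wins = Counter()
    for s in sessions:
        slot_wins[s["order"].index(s["winner"])] += 1
    total = len(sessions)
    return {slot: wins / total for slot, wins in sorted(slot_wins.items())}
```

If slot 0 wins well above 1/n across many sessions, you have position bias, no matter how good the variants are.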
Takeaway: before tuning anything sophisticated, audit for the dumb biases first. They cost you 14 points and a `random.shuffle()` call to fix.
By the numbers
A snapshot of where the system landed after the calibration pass:

- 2,069 manually-tagged insights across 16 niches
- 93.1% behavioral fidelity (95% CI 91.4–94.6%)
- 4/4 decision match rate on go/no-go verdicts
- Average persona richness of 0.85+ when >100 insights are available
We needed to know if any of this was working. So we built a calibration suite.
But first — what are we actually measuring?
We are not measuring "did SFG predict whether the product succeeded." We're measuring something narrower and more testable: given a known archetype and a known set of grievances, does the synthesized persona produce the same pattern of pain points, needs, sentiment, and decisions that the real users in that archetype produced — on data the model didn't see?
If yes, the persona is a faithful behavioral model of its archetype. That's what we calibrated against.
The setup: 2,069 manually-tagged insights across 16 niches, each with known ground-truth pain points, needs, sentiment distribution, and the decision a real founder would have arrived at when looking at the full dataset.
We split 70/30 — synthesize personas using 70% of insights per niche, then ask each persona to characterize the held-out 30% (without ever seeing it). Compare the persona's response to ground truth across 5 weighted dimensions:
| Dimension | Metric | Weight |
|---|---|---|
| Pain point overlap | Semantic Jaccard (threshold 0.53) | 0.30 |
| Pain point ranking | Spearman's ρ | 0.15 |
| Needs overlap | Semantic Jaccard | 0.25 |
| Sentiment distribution | 1 − √JSD | 0.20 |
| Language similarity | Cosine of embeddings | 0.10 |
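Combining the dimensions is just a weighted sum. A toy sketch (`WEIGHTS` keys and `fidelity_score` are illustrative names; each per-dimension score is assumed to be pre-normalized to [0, 1], e.g. Spearman's ρ rescaled from [−1, 1]):

```python
import math

WEIGHTS = {
    "pain_overlap": 0.30,   # semantic Jaccard, threshold 0.53
    "pain_ranking": 0.15,   # Spearman's rho, rescaled to [0, 1]
    "needs_overlap": 0.25,  # semantic Jaccard
    "sentiment": 0.20,      # 1 - sqrt(JSD)
    "language": 0.10,       # cosine of embeddings
}

def fidelity_score(dims: dict[str, float]) -> float:
    """Combine per-dimension scores (each in [0, 1]) into the weighted
    behavioral-fidelity score reported per niche."""
    assert math.isclose(sum(WEIGHTS.values()), 1.0)
    return sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)
```

The weights sum to 1, so a niche's score reads directly as a percentage.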
Final score across the 16 niches: 93.1% behavioral fidelity, 95% CI 91.4–94.6%. Best-performing niches: content tools (93.2%), freelancers (92.7%), fitness (90.7%).
Decision match rate (does the synthesized panel reach the same go/no-go verdict as the held-out real data on 4 axes — concern, need, verdict, recommendations): 4/4 across the calibration set.
To restate what that number means: when we ask a synthesized archetype to characterize a problem space using only the 70% it was built from, its description of pain points, needs, sentiment, and recommended decisions matches the description that real users in that archetype produced (on the held-out 30%) at 93% similarity, on average, across 16 niches. Not "predicts the future at 93%." Reproduces archetype behavior at 93%.
The most valuable thing this gave us wasn't the headline number. It was the per-niche breakdown — we could see which archetype pools were weak, which niches needed more insight sources, which prompts were drifting.
The headline accuracy number is for marketing. The per-niche breakdown is for engineering. Build both.
How an A/B test actually runs
When a founder gives us two landing page variants:

1. Build the five-persona panel from the discovery interview plus the niche's scraped insights.
2. Shuffle the variant order per persona and relabel neutrally ("Option 1", "Option 2").
3. Sample each persona at its archetype temperature and collect an appeal score plus reasoning for every variant.
4. Aggregate scores across the panel and surface the winner with each persona's reasoning attached.
The output isn't just a winner. It's a transcript a founder can actually read — with reasoning that maps to specific frustrations from real users.
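Under those assumptions, a single panel pass can be sketched like this (`run_ab_test` and the `ask_persona` callback are illustrative; `ask_persona` stands in for the underlying per-persona LLM call and is assumed to return an appeal score plus reasoning):

```python
import random

def run_ab_test(personas, variants, ask_persona):
    """One panel pass: each persona sees the variants in a fresh random
    order under neutral labels, scores each, and explains why."""
    transcript = []
    for persona in personas:
        shuffled = variants[:]
        random.shuffle(shuffled)  # per-persona order, neutralizes position bias
        for i, variant in enumerate(shuffled):
            score, why = ask_persona(persona, f"Option {i + 1}", variant)
            transcript.append({
                "persona": persona["archetype"],
                "variant": variant["id"],
                "appeal": score,
                "reasoning": why,
            })
    # Winner = highest mean appeal across the whole panel
    by_variant = {}
    for row in transcript:
        by_variant.setdefault(row["variant"], []).append(row["appeal"])
    winner = max(by_variant, key=lambda v: sum(by_variant[v]) / len(by_variant[v]))
    return winner, transcript
```

The transcript, not the winner, is the deliverable: every appeal score arrives with the persona's reasoning attached.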
What else the SFG can do
Once we had calibrated personas, A/B testing turned out to be the smallest use case. The same panel can run: pricing tests, marketing-claim stress-tests, feature prioritization, segment-fit checks, and positioning comparisons.
There's also a separate Reality Check feature that flips the comparison around: it lets you run a real human survey, then dual-scores the SFG prediction against the real responses. That's how we keep the 93% number honest as the model evolves.
What still doesn't work
A few things we're still not satisfied with: niches with little public user data produce thin personas (the richness score flags them, but it can't conjure grievances that were never written down), novel categories with no comparable real users remain out of reach, and regulated industries where the source data simply isn't public stay out of scope.
Try it
The full Synthetic Focus Group lives inside GoNoGo. The free tier gives you 3 projects with the voice Discovery agent — enough to feel out whether the methodology makes sense for your idea before unlocking the full multi-agent pipeline (which is where SFG, A/B testing, pricing tests and the rest live, behind a one-time per-project credit — no subscription).
If you build something with synthetic personas yourself — the three things that mattered most for us, ranked: (1) real grievances over LLM templates (3× weight), (2) per-archetype temperature, (3) variant shuffling. Without all three, you'll just keep getting "this looks great!" forever.
Validate your idea in 30 minutes
Voice-first AI consulting team. 17 reports. Free to start.
Start Free →

Get the next post in your inbox
One technical post every 2 days — engineering deep-dives on building voice AI, synthetic focus groups, and AI agents. Plus PH launch notification. No spam.