
How we calibrated a Synthetic Focus Group from 'this looks great!' to 93% accuracy

6 months, 16 niches, 2,069 real insights, and 13 fixes that made AI personas actually disagree with us.

GoNoGo Team
April 20, 2026

When we shipped the first version of GoNoGo's Synthetic Focus Group (SFG), every persona loved everything.

The setup: a founder finishes a 30-minute voice discovery interview about their idea. From that conversation plus a stack of insights we'd scraped for the niche, we spin up five AI personas — a CTO, a budget-conscious shopper, a skeptic, an early adopter, a casual user — and ask the panel to react to the founder's two value-prop variants and a pricing test. All five voted the same way on every variant. All five "would maybe buy." Slightly different wording. Same vibe.

The problem is obvious in retrospect: we'd built a confirmation engine, not a focus group.

This is the story of the next six months — what broke, what we tried, what stuck. By the end we hit 93% predicted-vs-real accuracy across 16 niches with a 95% CI of 91.4–94.6%. Here's how.

**TL;DR** — If you're building anything with synthetic personas, three things matter more than the rest: (1) generate persona grievances from real user data, weighted 3× over LLM-imagined ones, (2) tune sampling temperature *per archetype* (skeptics ≠ early adopters), (3) shuffle variant labels per persona to kill position bias. We measured the third one alone shaving 14 percentage points off label bias.

What SFG is actually for (and what it isn't)

Before the engineering, the use case.

A founder finishes the voice discovery and now has a stack of decisions to make: which value prop leads, what to charge, which marketing claims hold up, which feature to build first, which segment to target. Traditionally these get answered by (a) gut feel, (b) asking 5 friends, or (c) running a real survey that takes weeks and costs money to recruit the right respondents.

The SFG sits in between. It's a panel of synthetic respondents, each modelled on a specific archetype-with-real-grievances, that the founder can re-run as many times as they want against any decision they need to make — A/B variants, pricing, claims, positioning, segments.
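A respondent like that can be pictured as a small record: an archetype plus the grievances, goals, and verbatim quotes sourced for it. A minimal sketch of that shape (field names are our illustration, not GoNoGo's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    # Hypothetical shape; field names are illustrative, not the production schema
    archetype: str            # e.g. "SKEPTIC", "CTO", "EARLY_ADOPTER"
    frustrations: list[str]   # sampled from real user insights, not LLM priors
    goals: list[str]
    quotes: list[str] = field(default_factory=list)  # up to 5 verbatim source quotes
    richness: float = 0.0     # 0-1, flags thin personas before they pollute results

panel = [
    Persona("SKEPTIC", ["pricing anchored to a tier I don't need"], ["predictable costs"]),
    Persona("EARLY_ADOPTER", ["existing tools move too slowly"], ["try new workflows first"]),
]
```

The point of making grievances first-class fields is that every downstream vote can cite them.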

What it gives you:

  • A vote with reasoning, not a number — "the Skeptic rejected variant B because pricing felt anchored to a tier she doesn't need"
  • Disagreement between archetypes — surfacing the trade-off you were about to ignore
  • Reproducible runs — same insights → same panel → same decision logic. You can re-test the same idea after changing one word in the headline
**What SFG is NOT.** It is not a fortune teller. It does not predict whether your startup will succeed. It does not predict what *the market* will do. It models the **decision behavior of a specific archetype**, given the frustrations and goals we've sourced for that archetype from real public data. That's a narrower claim than "predict the future" — and a much more honest one.

    When we say "93% accuracy" later in this post, that's what's being measured: how closely a synthesized archetype's modelled behavior matches the observed behavior of real users in that archetype, on data the model didn't see during synthesis. Not pre-cognition. Behavioral fidelity.

    That distinction matters because it tells you what the SFG is good for (decision-stage trade-offs, claim stress-tests, segment fit, pricing) and what it's bad for (predicting macro-market outcomes, novel categories with no public user data, regulated industries where the data isn't there).

    The three things that broke our personas

After watching ~50 sessions where personas all said "great idea!", we noticed three failure modes:

  • Personas had no real grievances. They were generated from the LLM's vague prior of "what a CTO might say." So a CTO persona evaluating a B2B SaaS would just... vibe. No specific scar tissue, no real pain.
  • Sampling temperature was uniform. Skeptics rolled the same temperature (0.7) as early adopters. Skeptics weren't actually skeptical — they were just slightly less enthusiastic.
  • Variant labels biased everything. "Option A" reliably won over "Option B" — classic position bias. Personas were anchoring on label, not content.
We fixed each one. Here's how.

    Fix #1: Personas built from real grievances, not templates

    The base persona-generation algorithm now does this:

  • Niche detection. A small LLM classifier maps the project to one of 16 niches (B2B SaaS, marketplace, dev tools, ecommerce, hardware, fitness, content, freelancers, ...). Each niche has a different archetype pool.
  • Insight collection. We pull real posts from Reddit, HackerNews, ProductHunt, G2, app store reviews — anywhere the niche's actual users complain. Typical project gets 100–300 raw insights.
  • Per-persona synthesis. For each archetype slot (3–5 per project), we sample 3–8 frustrations and 3–8 goals directly from the real insights for that persona's likely demographic.
The critical line in `persona_builder.py`:

    # Real frustrations from source data weighted 3x over LLM-generated ones
    weighted_frustrations = (
        real_frustrations * 3 +
        llm_inferred_frustrations * 1
    )
    

    That 3× multiplier is the entire difference between a persona who says "I'd want better onboarding" (LLM generic) and one who says "I bounced from the last 4 tools because none of them imported my Notion docs without breaking nested toggles" (real Reddit thread).

    The 3× weight on real frustrations is the cheapest, highest-leverage change in the whole pipeline. Without it, you're just paraphrasing the model's prior beliefs back to the founder.
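Note that duplicating real items 3× in the pool makes each draw three times as likely to land on a real grievance, but a naive sample over the pool could return the same frustration twice. One way to sample without duplicates (our sketch, not the production code):

```python
import random

def sample_frustrations(real, inferred, k=5):
    # Hypothetical helper: real frustrations duplicated 3x, so each draw
    # is three times as likely to hit a real grievance as an inferred one
    pool = real * 3 + inferred * 1
    picked = []
    while pool and len(picked) < k:
        choice = random.choice(pool)
        picked.append(choice)
        pool = [f for f in pool if f != choice]  # drop all copies of the pick
    return picked

real = ["Notion import breaks nested toggles", "sync loses offline edits"]
inferred = ["wants better onboarding"]
sampled = sample_frustrations(real, inferred, k=2)
```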

    Each persona also carries up to 5 verbatim quotes from the source data, plus a richness score (0–1) so the orchestrator can flag thin personas before they pollute results. Average richness when >100 insights are available: 0.85+.
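One plausible way to compute such a richness score is to measure how full each slot is relative to its cap (3–8 frustrations, 3–8 goals, up to 5 quotes). This scoring function is our guess at the shape, not the production formula:

```python
def richness_score(n_frustrations, n_goals, n_quotes):
    # Hypothetical scoring: fraction of each slot filled, relative to its cap,
    # averaged. Caps taken from the pipeline limits described above.
    caps = ((n_frustrations, 8), (n_goals, 8), (n_quotes, 5))
    return sum(min(n, cap) / cap for n, cap in caps) / len(caps)
```

A persona with full slots scores 1.0; a thin one built from a handful of insights scores low enough for the orchestrator to flag it.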

    Takeaway: persona realism is upstream of every other decision. If your input data is "what an LLM thinks a CTO sounds like," everything downstream is fan-fiction.

    Fix #2: Temperature tuned per archetype

    Behavioral diversity isn't a prompt problem — it's a sampling problem.

    We tuned `temperature` per archetype in `get_temperature()`:

    ARCHETYPE_TEMPERATURE = {
        "EARLY_ADOPTER":     0.9,   # impulsive, willing to leap
        "CASUAL":            0.7,
        "MAINSTREAM":        0.6,
        "PRAGMATIST":        0.5,   # analytical, predictable
        "CTO":               0.4,
        "CFO":               0.4,
        "SKEPTIC":           0.5,   # rigid, negative-biased
        "BUDGET_CONSCIOUS":  0.5,
        # ... 15 archetypes total
    }
    
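The lookup itself is trivial; the only real design question is the fallback for archetypes not in the table. A sketch of what `get_temperature()` might look like (the default value is our assumption, not a documented setting):

```python
ARCHETYPE_TEMPERATURE = {
    "EARLY_ADOPTER": 0.9,
    "SKEPTIC": 0.5,
    "CTO": 0.4,
    # ... remaining archetypes as in the table above
}

DEFAULT_TEMPERATURE = 0.6  # assumed fallback, matching the MAINSTREAM value

def get_temperature(archetype: str) -> float:
    # Unknown or misspelled archetypes get a middle-of-the-road default
    return ARCHETYPE_TEMPERATURE.get(archetype.upper(), DEFAULT_TEMPERATURE)
```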

    This alone meaningfully shifted distributions. Skeptics started landing in the 4–6 appeal range by default. Early adopters jumped to 7–9. Pragmatists stayed in the 5–7 band where they belonged.

    We also embedded cognitive bias hints directly into each archetype's system prompt. Pragmatists get explicit "status quo bias" framing. Skeptics get "negativity bias" framing. CFOs get loss-aversion phrasing.
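The hints can be as simple as a per-archetype suffix on the system prompt. A sketch with our own hypothetical phrasings (the production prompts are presumably longer):

```python
BIAS_HINTS = {
    # Hypothetical wordings; illustrative only
    "PRAGMATIST": "You exhibit status quo bias: you prefer proven workflows and distrust big switches.",
    "SKEPTIC": "You exhibit negativity bias: you weigh flaws more heavily than benefits.",
    "CFO": "You are loss-averse: you frame every spend around its downside risk.",
}

def build_system_prompt(archetype: str, base_prompt: str) -> str:
    # Append the archetype's cognitive-bias framing, if one exists
    hint = BIAS_HINTS.get(archetype, "")
    return f"{base_prompt}\n\n{hint}".rstrip()

prompt = build_system_prompt("SKEPTIC", "You are a skeptical B2B buyer.")
```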

    The personas didn't just sound different — they actually disagreed with each other.

    Takeaway: if every persona is sampled at the same temperature, you're running the same character five times in different costumes.

    Fix #3: Variant shuffling

    Stupid, easy, huge:

    import random

    # Shuffle presentation order per persona; the labels stay neutral,
    # so the same content lands in a different slot for each persona
    order = random.sample(variants, k=len(variants))
    labeled = list(zip(["Option 1", "Option 2", "Option 3"], order))


    For an A/B/C test with 5 personas, each persona sees the variants in a different random order under neutral labels. Position effects average out across the panel.

    We measured this. Before shuffling: Option A won 64% of two-variant tests across our calibration set. After shuffling: 50.3% / 49.7%. The label was carrying a 14-point bias.
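An audit for this kind of label bias takes a few lines. A sketch of the check, with toy data standing in for the calibration set:

```python
def label_win_rate(winners, label="Option A"):
    # winners: one winning label per two-variant test in the calibration set.
    # A fair setup should hover near 0.5 for either label.
    return sum(1 for w in winners if w == label) / len(winners)

# Toy data mirroring the pre-shuffle numbers reported above
winners = ["Option A"] * 64 + ["Option B"] * 36
rate = label_win_rate(winners)  # a 14-point skew over the fair 0.5 baseline
```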

    Position bias in LLM panels is real and large. If you're not shuffling labels, your A/B "winners" are partially a measurement of which slot you put them in.

    Takeaway: before tuning anything sophisticated, audit for the dumb biases first. They cost you 14 points and a `random.shuffle()` call to fix.

    By the numbers

  • 93.1% behavioral fidelity
  • 16 niches calibrated
  • 2,069 tagged insights
  • 70/30 train/test split

    A snapshot of where the system landed after the calibration pass:

| Metric | Value |
|---|---|
| Average persona richness (>100 insights) | 0.85+ |
| Position-bias reduction from shuffling | 14 percentage points |
| Real-grievance weighting over LLM | 3× |
| Archetypes available | 15 (across B2B, consumer, marketplace) |

    We needed to know if any of this was working. So we built a calibration suite.

    But first — what are we actually measuring?

    We are not measuring "did SFG predict whether the product succeeded." We're measuring something narrower and more testable: given a known archetype and a known set of grievances, does the synthesized persona produce the same pattern of pain points, needs, sentiment, and decisions that the real users in that archetype produced — on data the model didn't see?

    If yes, the persona is a faithful behavioral model of its archetype. That's what we calibrated against.

    The setup: 2,069 manually-tagged insights across 16 niches, each with known ground-truth pain points, needs, sentiment distribution, and the decision a real founder would have arrived at when looking at the full dataset.

    We split 70/30 — synthesize personas using 70% of insights per niche, then ask each persona to characterize the held-out 30% (without ever seeing it). Compare the persona's response to ground truth across 5 weighted dimensions:

| Dimension | Metric | Weight |
|---|---|---|
| Pain point overlap | Semantic Jaccard (threshold 0.53) | 0.30 |
| Pain point ranking | Spearman's ρ | 0.15 |
| Needs overlap | Semantic Jaccard | 0.25 |
| Sentiment distribution | 1 − √JSD | 0.20 |
| Language similarity | Cosine of embeddings | 0.10 |
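"Semantic Jaccard" here means set overlap where two items match if their similarity clears a threshold, rather than requiring exact string equality. A sketch under that assumption (the greedy matching is our simplification; `sim` is any pairwise similarity function, e.g. cosine over embeddings):

```python
def semantic_jaccard(a_items, b_items, sim, threshold=0.53):
    # Greedy one-to-one matching: a pair counts as overlap when the
    # similarity function clears the threshold (0.53 per the calibration)
    remaining = list(b_items)
    matched = 0
    for a in a_items:
        best = max(remaining, key=lambda b: sim(a, b), default=None)
        if best is not None and sim(a, best) >= threshold:
            matched += 1
            remaining.remove(best)
    return matched / (len(a_items) + len(b_items) - matched)

# With exact-match similarity, this reduces to classic Jaccard
exact = lambda a, b: 1.0 if a == b else 0.0
score = semantic_jaccard(["slow sync", "no SSO"], ["no SSO", "bad docs"], exact)
```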

    Final score across the 16 niches: 93.1% behavioral fidelity, 95% CI 91.4–94.6%. Best-performing niches: content tools (93.2%), freelancers (92.7%), fitness (90.7%).

    Decision match rate (does the synthesized panel reach the same go/no-go verdict as the held-out real data on 4 axes — concern, need, verdict, recommendations): 4/4 across the calibration set.

    To restate what that number means: when we ask a synthesized archetype to characterize a problem space using only the 70% it was built from, its description of pain points, needs, sentiment, and recommended decisions matches the description that real users in that archetype produced (on the held-out 30%) at 93% similarity, on average, across 16 niches. Not "predicts the future at 93%." Reproduces archetype behavior at 93%.

    The most valuable thing this gave us wasn't the headline number. It was the per-niche breakdown — we could see which archetype pools were weak, which niches needed more insight sources, which prompts were drifting.

    The headline accuracy number is for marketing. The per-niche breakdown is for engineering. Build both.

    How an A/B test actually runs

    When a founder gives us two landing page variants:

  • Generate panel — 3–5 personas synthesized from the project's collected insights (already done at discovery time).
  • Per-persona evaluation in parallel — each persona sees all variants in one prompt, with shuffled labels.
  • Structured response — for each variant: appeal score (1–10), willingness (`would_buy` / `might_buy` / `would_not_buy`), pros, cons, 2–6 sentence reasoning.
  • Round 2 — panel discussion — personas react to each other's reasoning. This is where the interesting stuff happens. The skeptic challenges the early adopter. Scores shift. Sometimes the panel realigns entirely.
  • Aggregate — winner by win count first, average appeal as tiebreak.
The output isn't just a winner. It's a transcript a founder can actually read — with reasoning that maps to specific frustrations from real users.
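The aggregation rule in the last step (winner by win count, average appeal as tiebreak) can be sketched as:

```python
from collections import Counter
from statistics import mean

def pick_winner(votes):
    # votes: {persona_name: {variant: appeal score 1-10}}
    # Each persona's "win" goes to its top-scored variant.
    wins = Counter(max(scores, key=scores.get) for scores in votes.values())
    top = max(wins.values())
    tied = [v for v, c in wins.items() if c == top]
    if len(tied) == 1:
        return tied[0]
    # Tiebreak: highest average appeal across the whole panel
    return max(tied, key=lambda v: mean(s[v] for s in votes.values()))

winner = pick_winner({
    "skeptic":       {"A": 4, "B": 6},
    "early_adopter": {"A": 9, "B": 7},
    "pragmatist":    {"A": 6, "B": 5},
})
```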

    What else the SFG can do

    Once we had calibrated personas, A/B testing turned out to be the smallest use case. The same panel can run:

  • Claim validation — paste 1–10 marketing claims, each persona votes agree / disagree with reasoning. Surfaces which claims a real audience would call BS on.
  • Pricing tests — test 3+ price points, get per-persona perceived value and conversion likelihood.
  • Adaptive hypothesis generation — auto-generates 5–6 testable hypotheses covering problem fit, segment fit, behavior change, switching costs, pricing.
  • Early adopter lead extraction — pulls 20 real handles from the source insights — actual people who described the exact problem you're solving. Not synthetic. Outreach list.
There's also a separate Reality Check feature that flips the comparison around: it lets you run a real human survey, then dual-scores the SFG prediction against the real responses. That's how we keep the 93% number honest as the model evolves.

    What still doesn't work

    A few things we're still not satisfied with:

  • Single-model persona reasoning. Persona inference currently runs on one frontier-class LLM. We cross-verify factual claims across multiple providers in a separate feature, but the persona reasoning itself is single-backbone. That's a known shared-blind-spot risk we want to address — multi-model panels are on the roadmap.
  • No benchmarking against traditional focus groups. Only against holdout real-user data. Comparing AI personas to a real moderated focus group with 8 humans is the obvious next benchmark, and it's expensive enough that we keep deferring it.
  • Niches we don't have insight sources for (regulated industries mostly) drop to ~75% accuracy. The whole approach falls apart when you can't pull real user grievances from somewhere public.
We're not done. The 93% number is a calibration milestone, not a verdict. Anyone who tells you their AI focus group has solved the problem is selling.

    Try it

    The full Synthetic Focus Group lives inside GoNoGo. The free tier gives you 3 projects with the voice Discovery agent — enough to feel out whether the methodology makes sense for your idea before unlocking the full multi-agent pipeline (which is where SFG, A/B testing, pricing tests and the rest live, behind a one-time per-project credit — no subscription).

    If you build something with synthetic personas yourself — the three things that mattered most for us, ranked: (1) real grievances over LLM templates (3× weight), (2) per-archetype temperature, (3) variant shuffling. Without all three, you'll just keep getting "this looks great!" forever.
