May 12, 2026·4 min read

Training the coach: using pro demos to teach Claude what good play looks like

The coaching report is the moat. Here is the approach we are taking to make it actually read like a coach — including how pro demos fit in as a reference corpus, not training data.

The moat is the digest and the prompt

CSReplays does not fine-tune a model. We do not train a custom transformer on a folder of demos. The moat is two things: the shape of the digest we feed the model, and the system prompt that turns the digest into coaching.

Both are iterated against real demos, by hand, until the output passes a single bar: would a real CS player call this useful?

The kill-shot test

Before we wrote a line of webapp code, we ran one test. Parse one real demo. Hand-craft a digest. Prompt Claude with a coaching system prompt. Read the output and ask, honestly: does this read like a coach, or like generic LLM filler about "improving your aim"?

If the answer was "filler" after a day of iteration, the project was dead. The bar for shipping was that several real CS players, given a stranger's report, would point at specific round references and say "yeah, that is what was actually happening."

We passed that bar in early May. Everything since has been about making the same loop survive at scale.

How pro demos fit in

The instinct, when you have demos, is to fine-tune on them. We deliberately do not. A model fine-tuned on a few hundred pro demos would learn to write text that looks like commentary, not text that coaches a 15k Premier player.

Instead, pro demos are used in three ways:

As a baseline distribution for utility lineups. Every map has a catalog of standard smokes and mollies: execute smokes, retake mollies, lurk pops. The catalog is hand-curated, but populated by mining pro demos: any smoke trajectory that appears in at least N pro matches is a "known lineup." The digest can then say "you used the mid-to-A execute smoke" instead of "you used a smoke somewhere on A."
As a reference for what good defaults look like. Pro T-side defaults on Mirage, pro CT rotates on Inferno, pro post-plant positions on Anubis. These show up in the digest as comparison points: "you played a 2-A/3-B default on T-side; the pro distribution is 3-A/2-B 70% of the time on this map." Useful when grounded; useless when stated as a flat rule.
As a hallucination ground-truth set. When the report says "you peeked A-long after losing the same duel in round 11," we need to be certain round 11 actually contained that duel. The verification pass runs against the parsed demo, not against the model. Pro demos are used to seed the test corpus for that verifier, because they give us a deeper sample of correct round-reference behavior than consumer demos alone would.
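To make the first and third uses concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the CSReplays implementation: the record types (SmokeThrow, Duel, DuelClaim), the coarse grid cells used to group trajectories, and the idea that report claims arrive in structured form are all hypothetical. The only parts taken from the text are the rule that a trajectory seen in at least N pro matches counts as a known lineup, and that a round reference is checked against the parsed demo rather than against the model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SmokeThrow:
    """Hypothetical record for one smoke found in a parsed pro demo."""
    match_id: str
    map_name: str
    throw_cell: tuple[int, int]  # thrower position snapped to a coarse grid (assumption)
    land_cell: tuple[int, int]   # landing position snapped to the same grid (assumption)


def known_lineups(throws, min_matches):
    """Return the lineup keys that appear in at least `min_matches` distinct matches."""
    matches_by_key = defaultdict(set)
    for throw in throws:
        key = (throw.map_name, throw.throw_cell, throw.land_cell)
        # A set of match ids means repeats inside one match count only once.
        matches_by_key[key].add(throw.match_id)
    return {key for key, matches in matches_by_key.items() if len(matches) >= min_matches}


@dataclass(frozen=True)
class Duel:
    """Hypothetical record for one duel extracted from the parsed demo."""
    round_number: int
    location: str
    winner: str
    loser: str


@dataclass(frozen=True)
class DuelClaim:
    """Hypothetical structured form of a report claim such as 'lost a duel at A-long in round 11'."""
    round_number: int
    location: str
    loser: str


def claim_is_supported(claim, duels):
    """True only if the parsed demo contains a duel matching the claim."""
    return any(
        duel.round_number == claim.round_number
        and duel.location == claim.location
        and duel.loser == claim.loser
        for duel in duels
    )
```

Counting distinct matches rather than raw throws keeps one team spamming the same smoke in a single game from promoting it to the catalog. How claims are extracted from the report text is left open here; the sketch only shows the check once a claim is in structured form.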

How the prompt is iterated

The system prompt lives in a single file in the repo. Changes to it are deployed with the worker. The loop is:

1. Pick a demo from a small, hand-curated test set (mix of skill levels, mix of maps).
2. Run the current prompt against its digest. Read the output.
3. Find one specific failure: a generic line, a missed pattern, a round reference that is technically correct but uninteresting.
4. Edit the prompt to address that one failure. Re-run against the full test set to make sure the fix did not regress anything else.
5. Commit. Deploy. Move on.
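The re-run in step 4 does not need scoring to be useful. A minimal sketch, assuming the test set is a mapping from demo name to digest text and that `generate` is a placeholder for whatever function calls the model (both are hypothetical names, not part of the CSReplays repo): regenerate every report and write each one to a file so a person can read them side by side.

```python
from pathlib import Path
from typing import Callable


def run_test_set(
    system_prompt: str,
    digests: dict[str, str],
    generate: Callable[[str, str], str],
    out_dir: Path,
) -> list[Path]:
    """Regenerate a report for every digest and save it for human review.

    `generate(system_prompt, digest)` stands in for the real model call.
    Nothing is scored here; the files exist to be read.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name in sorted(digests):  # stable order makes runs easy to compare
        report = generate(system_prompt, digests[name])
        path = out_dir / f"{name}.txt"
        path.write_text(report, encoding="utf-8")
        written.append(path)
    return written
```

Writing one directory per prompt revision would let an ordinary diff tool show what changed between runs, while the judgment about whether a report is useful stays with the reader.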

This is unsexy. There is no clever evaluation harness, no benchmark, no leaderboard. The benchmark is "do real CS players read this and feel seen." That cannot be automated, so we do not try.

What this means for users

You will see the report quality change over time, often noticeably between weeks. That is the prompt and the digest improving in the repo, not a different model behind the scenes. The model is Claude — currently Sonnet 4.6, with Opus 4.7 ready behind a feature flag if the quality gap becomes worth the cost.

You will not see your demos turned into training data. They are scoped to your account, encrypted at rest, never indexed, and never used to train models — ours or anyone else's. The pro-demo corpus we mine is a separate, public-source set.

The boring summary: the model is fixed. The thing that gets better is the layer between the demo and the model. That layer is the actual product.
