Jun 1, 2026·15 min read

Reading strategy out of a demo: heuristics and a trained role classifier

Starts from first principles — what a Counter-Strike demo actually is and what data it exposes — then shows two ways we read strategy out of it: timing heuristics for team tactics, and a calibrated classifier for player roles. Real metrics, confusion matrix and all.

What this is about

Every serious Counter-Strike match leaves behind a perfect recording — and that recording is *data*, not video. This post is about two ways we read strategy out of those recordings: simple timing rules for what a *team* is doing, and a trained model for what *role* a player plays. The techniques are general; the interesting part is how much falls out of one very rich data source. But first, for anyone who hasn't stared at a CS2 server at 3am: what is the thing we're parsing?

First — what's a demo?

Counter-Strike 2 is a 5-versus-5 tactical shooter. A match runs up to 24 rounds in regulation (first to 13 wins); each round the two teams (called T and CT) spend money on weapons and utility (grenades: smokes, flashbangs, molotovs) and try to win the round by eliminating the other team or by the bomb objective. Tens of millions of people play it, and almost every competitive match gets recorded as a demo.

A demo (a .dem file) is not a video. It's a complete, replayable log of the match at the network-protocol level. The game server advances the world in discrete steps called ticks — CS2 runs at 64 ticks per second — and the demo records the full game state plus every event on each tick. "Playing" a demo means the engine *re-simulates* the entire match from that log, which is why you can watch it from any player's eyes, any camera angle, scrub backwards, freeze a single moment. It's the difference between a screen recording of a program and the program's actual execution trace.

Concretely: one pro match is 100–600 MB and, at 64 ticks per second over roughly 40 minutes, on the order of 150,000 ticks. Competitive games produce GOTV demos (Valve's spectator-broadcast format), and that's what we collect. Under the hood it's the Source 2 engine's protobuf-encoded stream of entity updates and events: a firehose of "this entity moved here," "this event fired," not a tidy spreadsheet.

What the demo exposes (and what we parse out of it)

Nobody hands you "here are the kills." The demo is a low-level binary stream of entity *deltas* and game events, and a parser has to reconstruct the high-level picture. We use awpy (a Python library) built on the demoparser2 CS2 parser, which turns that stream into structured tables. The ones that matter here:

ticks    — per player, per tick: X/Y/Z position, where they're aiming, health,
           armor, active weapon, equipment value, money, side (T/CT), alive?, and
           the named map area ("callout") they're standing in (e.g. "Jungle").
           ~10 players x 64/sec x ~40 min (2,400 s)  ≈  1.5 million rows.
kills    — attacker + victim (id + position), weapon, headshot?, tick, round
damages  — every hit: attacker, victim, weapon, HP/armor dealt, hit group, tick
grenades — per-tick projectile flight paths → where a nade was thrown from + where it landed
smokes   — smoke-cloud events: position, who threw it, which side
infernos — molotov / fire events: position, thrower, side
rounds   — freeze-end / end ticks, winner, win reason (elimination / bomb / defuse / time)
bomb     — plant + defuse events and positions
header   — map name, tick rate, demo protocol version

Every player is keyed by their steamid64, which we map to a name. Three things worth stressing for a non-CS reader:

It's positional and temporal. We know where all ten players are, what they're holding, and how much money they have — 64 times a second. So almost any question about Counter-Strike ("did they take map control?", "was that a coordinated push?", "who took the first fight?") becomes a *query over positions and ticks*, not a judgment call.
The demo is the source of truth. Everything above is reconstructed from the file itself — no Valve API needed to know what happened in the match. (Valve's API shows up elsewhere — turning a player's "sharecode" into a downloadable demo — but the gameplay data lives in the .dem.)
Granularity is the whole point. A box score says a player got 20 kills. The demo says *which tick*, *from where*, *with what weapon*, *against whom*, *with which teammates alive nearby* — which is exactly the raw material the heuristics and classifier below run on.
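
To make "a query over positions and ticks" concrete, here is a minimal sketch of the "who took the first fight?" question as a lookup over kill rows. The field names (`round_num`, `tick`, `attacker`, `victim`) and the toy rows are assumptions for illustration, not the exact awpy schema.

```python
from collections import Counter

# Toy rows shaped like the `kills` table; field names are assumed.
kills = [
    {"round_num": 1, "tick": 5200, "attacker": "A", "victim": "X"},
    {"round_num": 1, "tick": 4800, "attacker": "B", "victim": "Y"},
    {"round_num": 2, "tick": 9100, "attacker": "A", "victim": "Z"},
    {"round_num": 2, "tick": 9900, "attacker": "X", "victim": "A"},
]


def opening_kills(rows):
    """Return {round_num: the kill row with the lowest tick in that round}."""
    first = {}
    for row in rows:
        best = first.get(row["round_num"])
        if best is None or row["tick"] < best["tick"]:
            first[row["round_num"]] = row
    return first


openers = opening_kills(kills)
# How often each player got the opening kill across the rounds seen.
opening_kill_counts = Counter(r["attacker"] for r in openers.values())
```

The same pattern (group by round, order by tick) underlies most of the features discussed later; only the filter changes.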

Two ways to read strategy out of a demo

So: given all that, how do you read *strategy* out of it? Two questions, two very different techniques.

"What is this team doing?" — executes, nade combos, pistol-round tendencies. These are heuristics: hand-written rules over grenade timing. No model, no training. They are transparent by construction — you can read the rule and know exactly why a round was flagged.
"What role does this player play?" — entry, AWPer, support, lurker. This is a trained classifier: ~30 numeric features per player, fed to a calibrated logistic-regression model trained on labeled pro demos, with a hand-rule scorer as a fallback.

We will show both, including the parts where the model is honestly mediocre. The whole point of writing this down is that you can check our work.

Part 1 — Heuristics: timing is the signal

Every grenade we extracted has a throw_tick. Almost everything about team strategy falls out of looking at *which nades were thrown close together in time, by how many players*. Three patterns, three thresholds.

A nade combo is one player throwing two or more grenades in quick succession — a flash-into-molly, a double-smoke. We define "quick" as within 3 seconds (192 ticks at 64-tick). We group every round's grenades by thrower, then walk each thrower's sequence looking for runs:

// webapp/lib/pro.ts
const COMBO_GAP_TICKS = 3 * 64;   // 3s
const EXEC_WINDOW_TICKS = 6 * 64; // 6s
const PISTOL_ROUNDS = new Set([1, 13]); // MR12 pistols

// COMBOS — per thrower, runs of >=2 nades within COMBO_GAP_TICKS
const byThr = new Map<string, StratNade[]>();
for (const g of gl) {
  const seq = byThr.get(g.thrower_steamid) ?? [];
  seq.push(g);
  byThr.set(g.thrower_steamid, seq);
}
for (const [sid, seq] of byThr) {
  for (let i = 0; i < seq.length - 1; ) {
    const run = [seq[i]];
    while (i + 1 < seq.length && seq[i + 1].throw_tick! - run[run.length - 1].throw_tick! <= COMBO_GAP_TICKS) {
      run.push(seq[++i]);
    }
    if (run.length >= 2) {
      nCombos++;
      const sequence = run.map((g) => G_ABBR[g.grenade_type] ?? g.grenade_type).join(' → ');
      const c = comboAgg.get(sequence) ?? { count: 0, players: {} };
      c.count++; c.players[sid] = (c.players[sid] ?? 0) + 1;
      comboAgg.set(sequence, c);
    }
    i++;
  }
}

The result is a count per *sequence* — "flash → molly", "smoke → smoke" — with the player who does it most. That is how a team page can say "apEX: flash → flash, 11 times."

An execute is the team-level version: two or more *distinct* throwers smoking within a 6-second window. The distinct-thrower requirement is what separates a coordinated execute from one player throwing two smokes:

// EXECUTES — >=2 distinct throwers smoke within EXEC_WINDOW_TICKS
const smokes = gl.filter((g) => g.grenade_type === 'smoke');
for (let a = 0; a < smokes.length; a++) {
  const win = [smokes[a]];
  const thr = new Set([smokes[a].thrower_steamid]);
  for (let b = a + 1; b < smokes.length && smokes[b].throw_tick! - smokes[a].throw_tick! <= EXEC_WINDOW_TICKS; b++) {
    win.push(smokes[b]); thr.add(smokes[b].thrower_steamid);
  }
  if (thr.size >= 2 && win.length >= 2) {        // >=2 *different* players → execute
    nExecutes++;
    const cx = win.reduce((s, g) => s + g.land_x, 0) / win.length;
    const cy = win.reduce((s, g) => s + g.land_y, 0) / win.length;
    const r = worldToRadar(map, cx, cy);
    const area = r ? (nearestCallout(await loadCallouts(map), r.x, r.y) ?? 'site') : 'site';
    const ek = `${map}|${(win[0].side ?? '?').toLowerCase()}|${area}`;
    execAgg.set(ek, (execAgg.get(ek) ?? 0) + 1);
    break;                                        // stop scanning: at most one execute counted per pass over gl
  }
}

Pistol rounds are just rounds 1 and 13 under MR12. On those rounds we tally what utility each side threw and where the smokes landed, so a team page can show "CT pistol: usually smokes Connector."
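
A minimal sketch of that pistol-round tally, in Python for brevity. It assumes each grenade row already carries a resolved landing callout (`land_area`); that field, like `round_num`, `side` and `grenade_type`, is an assumed name here, and the rows are toy data.

```python
from collections import Counter, defaultdict

PISTOL_ROUNDS = {1, 13}  # MR12: the first round of each half

# Toy grenade rows; all field names are assumptions for illustration.
grenades = [
    {"round_num": 1, "side": "ct", "grenade_type": "smoke", "land_area": "Connector"},
    {"round_num": 1, "side": "t", "grenade_type": "flash", "land_area": "Ramp"},
    {"round_num": 13, "side": "ct", "grenade_type": "smoke", "land_area": "Connector"},
    {"round_num": 13, "side": "ct", "grenade_type": "smoke", "land_area": "Jungle"},
    {"round_num": 5, "side": "ct", "grenade_type": "smoke", "land_area": "Window"},
]


def pistol_tendencies(rows):
    """Tally utility per side, and smoke landing areas per side, on pistol rounds only."""
    util = defaultdict(Counter)
    smoke_areas = defaultdict(Counter)
    for g in rows:
        if g["round_num"] not in PISTOL_ROUNDS:
            continue  # ignore gun rounds
        util[g["side"]][g["grenade_type"]] += 1
        if g["grenade_type"] == "smoke":
            smoke_areas[g["side"]][g["land_area"]] += 1
    return util, smoke_areas


util, smoke_areas = pistol_tendencies(grenades)
top_ct_smoke = smoke_areas["ct"].most_common(1)[0]  # (area, count)
```

The most common entry per side is what would back a line like "CT pistol: usually smokes Connector."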

That is the per-team view. There is also a map-level version (gate-2/detect_executes.py) that answers a richer question: not "does this team execute A" but "what *combination of areas* do teams smoke together." The example that motivated it: on Mirage, an A-execute is usually CT plus Jungle plus Stairs smoked at once, into pop-flashes. So instead of aggregating single areas, we capture the set of areas smoked together and group executes by that set across all teams:

python

# one execute = the SET of distinct areas smoked together
areapos = {}
for g in win:
    r = w2r(mp, g["land_x"], g["land_y"])
    if not r: continue
    ar = nearest(cs, r[0], r[1])            # nearest callout to this smoke
    p = areapos.setdefault(ar, [0.0, 0.0, 0])
    p[0] += r[0]; p[1] += r[1]; p[2] += 1
if len(areapos) < 2:
    continue                                # all smokes hit one callout — not a combo
combo = tuple(sorted(areapos))              # e.g. ("CT", "Jungle", "Stairs")

# pop-flash follow-up: a flash within 3s after the last smoke
last = win[-1]["throw_tick"]
flashed = any(0 <= f["throw_tick"] - last <= FLASH_FOLLOW for f in flashes)

Each combo accumulates a count, an average number of smokes, and a pop-flash rate — the fraction of times the smokes were followed by a flash within 3 seconds. The output is a catalog of the most common multi-smoke executes per map, currently 107 combos across seven maps, each rendered as linked smoke circles on a mini-radar in the Lineups view.
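
The accumulation step might look roughly like the following sketch. The input shape (one tuple per detected execute) and the output field names are assumptions; only the three statistics themselves come from the description above.

```python
from collections import defaultdict

# One entry per detected execute: (area combo, number of smokes, flash followed?)
# Toy data; the tuple layout is an assumption for illustration.
executes = [
    (("CT", "Jungle", "Stairs"), 3, True),
    (("CT", "Jungle", "Stairs"), 4, False),
    (("CT", "Jungle", "Stairs"), 3, True),
    (("Connector", "Window"), 2, False),
]


def build_catalog(rows):
    """Group executes by area combo; report count, mean smokes, and pop-flash rate."""
    acc = defaultdict(lambda: {"count": 0, "smokes": 0, "flashed": 0})
    for combo, n_smokes, flashed in rows:
        a = acc[combo]
        a["count"] += 1
        a["smokes"] += n_smokes
        a["flashed"] += int(flashed)
    catalog = {
        combo: {
            "count": a["count"],
            "avg_smokes": a["smokes"] / a["count"],
            "popflash_rate": a["flashed"] / a["count"],
        }
        for combo, a in acc.items()
    }
    # Most common combos first.
    return dict(sorted(catalog.items(), key=lambda kv: -kv[1]["count"]))


catalog = build_catalog(executes)
```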

Why heuristics here and not a model? Because the rules *are* the explanation. "Two players, two smokes, six seconds" is a definition a coach would accept, and when we flag a round you can replay it and see exactly that. A classifier would add opacity for no accuracy gain on a question this crisp. The honest limitation: we have no per-round win/loss in the corpus yet, so we report frequency and location — *what* and *where* — not win rate.

Part 2 — A trained classifier for player roles

Roles are not crisp. "Is this player a lurker" is not a six-second timing rule; it is a fuzzy pattern across a whole match. So here we do build a model — but a small, legible, calibrated one, and we are explicit about how well it works.

The features. Per player per match, we compute 31 numeric features grouped by what they capture: opening duels, spatial positioning, weapon usage, utility usage, trades and survival, economy, and a set of lurker-specific behavioral signals. The feature list is a frozen, ordered contract — reorder it and you must bump the schema version so old models are retired:

python

# gate-2/extract_role_features.py  (excerpt of the 31-name contract)
FEATURE_NAMES = [
    # Group A — Opening duels
    "opening_attempt_rate_t", "first_kill_rate", "first_kill_to_death_ratio",
    "avg_time_to_first_engagement_t", ...
    # Group C — Weapon usage (the strongest AWPer signal)
    "awp_kill_fraction", "awp_kill_fraction_t", "awp_kill_fraction_ct",
    "awp_round_pickup_rate", "scout_kill_fraction", "awp_buy_rate",
    # Group D — Utility usage (the primary Support signal)
    "flashes_thrown_per_round", "smokes_thrown_per_round",
    "mollies_thrown_per_round", "flash_assist_rate", "util_damage_per_round",
    # Group G — Lurker behavioral signature (v2)
    "teammate_proximity_at_kill_t", "late_round_kill_rate_t", "time_alone_rate_t",
]
SCHEMA_VERSION = 2

Most features are simple ratios. A few are not. The spatial features are computed by a vectorized self-join in polars: for each sampled tick of each round, join the tick frame to itself on (tick, side) to get every same-side pair, take each player's nearest teammate, and average. The flash-assist feature is a genuine causal check — a teammate's kill of a victim who took *my* flashbang damage within the prior two seconds:

python

from collections import defaultdict

flash_assists = defaultdict(int)
for k in kills:
    victim, t, rn = k.get("victim_steamid"), k.get("tick"), k.get("round_num")
    attacker_sid = int(k["attacker_steamid"])
    attacker_side = side_by_round_player.get((rn, attacker_sid))
    # any teammate whose flash hit `victim` shortly before this kill?
    for candidate in all_steamids:
        # skip the killer and anyone not on the killer's side this round
        if candidate == attacker_sid or side_by_round_player.get((rn, candidate)) != attacker_side:
            continue
        for ft in flash_dmg.get((candidate, int(victim)), ()):
            if 0 <= t - ft <= FLASH_ASSIST_TICKS:   # 2s window
                flash_assists[candidate] += 1
                break                               # at most one assist per candidate per kill
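
The spatial self-join described before the flash-assist excerpt can be sketched in plain Python. This is a stdlib stand-in for the polars version, not the production code: the column names are assumptions, and it uses 2D distance for simplicity even though Z is available.

```python
import math
from collections import defaultdict

# Toy tick rows; column names are assumed.
ticks = [
    {"tick": 100, "side": "t", "steamid": 1, "X": 0.0, "Y": 0.0},
    {"tick": 100, "side": "t", "steamid": 2, "X": 300.0, "Y": 400.0},
    {"tick": 100, "side": "t", "steamid": 3, "X": 0.0, "Y": 1000.0},
    {"tick": 100, "side": "ct", "steamid": 9, "X": 10.0, "Y": 0.0},
    {"tick": 164, "side": "t", "steamid": 1, "X": 0.0, "Y": 0.0},
    {"tick": 164, "side": "t", "steamid": 2, "X": 0.0, "Y": 100.0},
]


def avg_nearest_teammate(rows):
    """Average, per player, of the distance to the closest same-side player at each tick."""
    groups = defaultdict(list)  # the "join key": (tick, side)
    for r in rows:
        groups[(r["tick"], r["side"])].append(r)

    samples = defaultdict(list)
    for members in groups.values():
        for me in members:
            dists = [
                math.hypot(me["X"] - other["X"], me["Y"] - other["Y"])
                for other in members
                if other["steamid"] != me["steamid"]
            ]
            if dists:  # a player with no teammate on this tick contributes nothing
                samples[me["steamid"]].append(min(dists))
    return {sid: sum(d) / len(d) for sid, d in samples.items()}


avg = avg_nearest_teammate(ticks)
```

Grouping on `(tick, side)` plays the role of the join key; a vectorized join does the same pairing without the Python loops.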

The lurker story is worth telling, because it is a case where the first version of the features was wrong. v1 measured *average* distance from teammates — which over-tagged any rifler on a big map as a lurker, because even non-lurkers sit 400–800 units away during normal play. Known lurkers (ropz, zont1x, NiKo) were being mis-classified as support. v2 replaced the average with three features that measure *the moments that actually prove the thesis*: how far from teammates the player was when they got a kill, how many of their kills came late in the round, and what fraction of round-time they spent genuinely alone. The comments in the code record exactly this calibration.
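
A toy illustration of why the average was the wrong statistic. The 1000-unit "alone" threshold and the distance samples below are hypothetical, chosen only so that two very different players share the same mean.

```python
ALONE_UNITS = 1000.0  # hypothetical threshold for "genuinely alone"


def mean_distance(dists):
    """v1-style feature: average nearest-teammate distance."""
    return sum(dists) / len(dists)


def time_alone_rate(dists, threshold=ALONE_UNITS):
    """v2-style feature: fraction of samples spent beyond the threshold."""
    return sum(1 for d in dists if d > threshold) / len(dists)


# A rifler holding a steady mid-range spacing on a big map.
rifler = [600.0] * 8
# A lurker: with the pack half the time, far away the other half.
lurker = [100.0] * 4 + [1100.0] * 4

same_mean = mean_distance(rifler) == mean_distance(lurker)
rifler_alone = time_alone_rate(rifler)
lurker_alone = time_alone_rate(lurker)
```

Both players average 600 units, so the v1 feature cannot separate them; the alone-rate does.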

The model. A scikit-learn pipeline: impute missing features to zero, standard-scale, then a calibrated multinomial logistic regression. The calibration matters because the role radar in the UI shows a probability simplex — it needs to *mean* something, not just argmax:

python

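from sklearn.calibration import CalibratedClassifierCV
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline() -> Pipeline:
    base = LogisticRegression(penalty="l2", C=0.5, max_iter=2000,
                              class_weight="balanced", solver="lbfgs")
    calibrated = CalibratedClassifierCV(base, method="isotonic", cv=2)
    return Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value=0.0)),
        ("scale", StandardScaler()),
        ("clf", calibrated),
    ])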

The evaluation is the honest part. We have 14 labeled pro players with 4 to 5 matches each, 63 samples in total. With data that small, the only trustworthy cross-validation is Leave-One-Group-Out *by player*, so a player never appears in both train and validation. Here are the real numbers from the current training report:

Samples: 63   Features: 31   CV: Leave-One-Group-Out by steamid64 (14 folds)

              precision  recall   f1   support
   entry         0.62     0.68   0.65    19
   awper         0.93     1.00   0.97    14
   support       0.62     0.50   0.55    16
   lurker        0.50     0.50   0.50    14

   f1-macro (pooled): 0.667

Read that honestly. AWPer is nearly perfect: awp_kill_fraction is an almost-clean signal, so the model basically can't miss it. Support and lurker are coin-flips (recall 0.50 each): they share utility and positioning behavior, and the confusion matrix shows exactly where it leaks:

 true \ pred   entry  awper  support  lurker
   entry         13      0       2       4
   awper          0     14       0       0
   support        5      0       8       3
   lurker         3      1       3       7

Entries get mistaken for lurkers and vice-versa (off-diagonal 4 and 3), and support bleeds into entry (5 of 16). That is the model telling you the truth: at the boundary between a passive entry and an aggressive lurker, 31 features and 63 samples are not enough to be sure.
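
As a check you can run yourself, the per-class report follows directly from the confusion matrix. The sketch below re-derives precision, recall and F1 with plain arithmetic; the matrix values are copied from the table above, and the helper name is ours.

```python
ROLES = ["entry", "awper", "support", "lurker"]

# Rows are true roles, columns are predicted roles (same order as ROLES).
CONFUSION = [
    [13, 0, 2, 4],
    [0, 14, 0, 0],
    [5, 0, 8, 3],
    [3, 1, 3, 7],
]


def per_class_metrics(cm, labels):
    """Precision, recall, F1 and support for each class of a square confusion matrix."""
    out = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]
        predicted = sum(row[i] for row in cm)  # column sum
        actual = sum(cm[i])                    # row sum (support)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        out[label] = {"precision": precision, "recall": recall,
                      "f1": f1, "support": actual}
    return out


metrics = per_class_metrics(CONFUSION, ROLES)
macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)
```

Rounded, `macro_f1` comes out at 0.667, matching the pooled figure in the report.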

One subtlety we want to flag so the number is not misread: the *pooled* f1-macro is 0.667, but the *per-fold* mean is far lower. That is an artifact of Leave-One-Group-Out (LOGO), not a second opinion. Each held-out fold is a single player with a single true role, so three of the four classes have zero support in that fold, each of them scores 0, and a four-class macro-f1 is capped at 0.25 even on a perfect prediction. The pooled figure is the one that means something.
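
The cap is easy to reproduce. The sketch below is a stdlib illustration under two stated assumptions: the per-fold score averages over all four role labels, and a class with no true or predicted samples scores 0. The toy players and roles are made up.

```python
ROLES = ["entry", "awper", "support", "lurker"]


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), holding out one group at a time."""
    for held in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held]
        test = [i for i, g in enumerate(groups) if g == held]
        yield held, train, test


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1; a class that never appears scores 0."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy corpus: one role per player, several samples per player.
players = ["p1", "p1", "p2", "p2", "p3", "p4"]
roles = ["entry", "entry", "lurker", "lurker", "awper", "support"]

fold_scores = []
for held, train_idx, test_idx in leave_one_group_out(players):
    y_true = [roles[i] for i in test_idx]
    y_pred = list(y_true)  # pretend the model is perfect on this fold
    fold_scores.append(macro_f1(y_true, y_pred, ROLES))

# Pooling all held-out predictions first, then scoring once.
pooled_score = macro_f1(roles, roles, ROLES)
```

Every fold scores 0.25 despite perfect predictions, while the pooled score is 1.0: the gap comes from the averaging, not from the model.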

Graceful degradation. The classifier ships with the trained model *and* a deterministic hand-rule scorer. If the joblib model is absent — initial deploy, training in flight, or a load failure on a Modal image build — we fall back to the heuristic rather than returning nothing. The hand rules are baseline-subtracted weighted sums, calibrated against a labeled pro demo, so a player only earns role-signal once they rise above the "everyone does this a bit" floor:

python

"lurker": (
    5.0 * pos(f("time_alone_rate_t") - 0.20)
    + 4.0 * pos((f("teammate_proximity_at_kill_t") - 600) * 0.001)
    + 3.0 * pos(f("late_round_kill_rate_t") - 0.20)
    + 2.0 * pos(f("clutch_attempt_rate") - 0.05)
    - 2.0 * pos(f("opening_attempt_rate_t") - 0.20)   # penalize entry behavior
    - 0.8 * pos(f("first_kill_rate") - 0.15)
),

The selection between the two is invisible to the caller — both return the same {role_scores, primary_role, primary_role_confidence} shape — and a CONFIDENCE_FLOOR of 0.35 lets the UI show "mixed role" rather than forcing a hard label when the model is unsure.
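
A rough sketch of that selection and the confidence floor. The return keys and the 0.35 floor come from the description above; the normalization of hand-rule scores, the function names, and passing model output as a plain dict are assumptions made for illustration.

```python
CONFIDENCE_FLOOR = 0.35  # value quoted above


def normalize(scores):
    """Clip negatives to zero and rescale so the scores sum to 1 (assumed scheme)."""
    clipped = {role: max(0.0, s) for role, s in scores.items()}
    total = sum(clipped.values())
    if total == 0.0:
        return {role: 1.0 / len(clipped) for role in clipped}
    return {role: s / total for role, s in clipped.items()}


def classify(model_probs, hand_scores):
    """Use model probabilities when available, else normalized hand-rule scores."""
    role_scores = model_probs if model_probs is not None else normalize(hand_scores)
    primary = max(role_scores, key=role_scores.get)
    return {
        "role_scores": role_scores,
        "primary_role": primary,
        "primary_role_confidence": role_scores[primary],
    }


def display_label(result, floor=CONFIDENCE_FLOOR):
    """What a UI might show: the role, or 'mixed role' below the floor."""
    if result["primary_role_confidence"] >= floor:
        return result["primary_role"]
    return "mixed role"


# Model missing: fall back to hand-rule scores.
fallback = classify(None, {"entry": 1.0, "awper": 0.0, "support": 1.0, "lurker": 2.0})
# Model present but undecided between three roles.
unsure = classify({"entry": 0.30, "awper": 0.10, "support": 0.30, "lurker": 0.30}, {})
```

Because both paths return the same dictionary shape, the caller never needs to know which one produced it.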

Why one of each

The split is not arbitrary. Use a heuristic when the definition *is* the explanation and you want a coach to be able to nod at the rule — executes and combos are exactly that. Reach for a trained model only when the pattern is genuinely fuzzy and no rule captures it cleanly — roles are that. And when you do train, on a corpus this small, the most useful thing you can publish is not the headline f1 but the confusion matrix and the place where the model admits it is guessing.

Both of these run over the same corpus_grenades and parsed-demo data the previous post described. The pipeline builds the corpus; these two layers read meaning out of it. As the demo set grows — and the free HLTV downloader means it keeps growing — the heuristics stay exactly as legible, and the classifier gets the one thing it is actually short on: more labeled players.
