Uday Singh

Tells · part 1 · linear algebra for poker players

The residual stream is the table

What a language model computes when it reads a hand of poker, and where that computation lives. One hand of no-limit hold'em builds the geometric picture under every measurement in this series.

Uday · Tells, part 1 · June 2026

Every poker player knows about tells. Shaky hands usually mean a monster. The player who glances at his chips the instant the flop comes has connected with it and is planning a bet. A tell is information leaking out of a hidden state: something true inside a player (the cards, the intention) that you cannot observe directly, and a measurable behavior on the outside that correlates with it.

A language model has hidden state too. When it reads a hand history and decides whether the hero should call, something inside it registers the third spade, computes something like pot odds, weighs something like a range. All of that is invisible by default; what comes out is a single word. Mechanistic interpretability is the field that reads the network’s tells. Its simplest instrument, the linear probe, is a trained tell-reader: a rule that looks at the machine’s internal numbers and says “it thinks the flush got there.”

This post builds the picture those instruments stand on: what the machine does when it reads the hand, how text becomes geometry, where the model keeps its working state, and how information moves through it. It assumes a reader who knows poker and can code, and nothing about neural networks. One hand carries the whole post:

Villain raises the button, Hero defends the big blind. Flop K♠ 9♠ 2♦: check, bet, call. Turn 7♥: check, check. River A♠: Hero checks, Villain jams. Hero ?

1 · The machine has one job

Strip away the chat window and a language model is one function: given some text, output a probability for every possible next chunk of text. That is the entire job. There is no module for facts, no module for grammar, no module for poker. It writes essays and code the same way it would finish our hand, by predicting one chunk, appending it, and predicting again, in a loop.
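The predict, append, repeat loop is short enough to sketch. Everything here is a toy assumption: `TOY_MODEL` is a hand-written lookup on the last token only, and `next_token_probs` is a hypothetical stand-in for the real function, which conditions on the whole text.

```python
# Toy stand-in for a language model: last token -> next-token probabilities.
# Invented numbers; a real model conditions on the entire preceding text.
TOY_MODEL = {
    "Hero": {"calls": 0.6, "folds": 0.4},
    "calls": {".": 1.0},
    "folds": {".": 1.0},
}

def next_token_probs(tokens):
    """Return {token: probability} for the next token (hypothetical helper)."""
    return TOY_MODEL[tokens[-1]]

def generate(tokens, max_new_tokens):
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        # Greedy pick keeps the sketch deterministic; real decoding usually samples.
        best = max(probs, key=probs.get)
        tokens.append(best)
        if best == ".":
            break
    return tokens

print(generate(["Villain", "jams", ".", "Hero"], max_new_tokens=5))
```

The only part that changes between this toy and a real model is what sits inside `next_token_probs`; the loop around it is the same.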

The job sounds too small. But predicting the next word well requires modeling whatever process generated the words. If the training data contains millions of hand histories, then putting the right probability on Hero’s next action requires registering what the river card did to the board, because hands where the flush arrives end differently from hands where it misses. Try it. Same hand, two rivers:

fold 48% · call 41% · raise 6% · tank 5%
Figure 1. The model's actual output: a probability for every continuation (illustrative numbers). Switch the river and watch the distribution move. Whatever moved it is the thing this series is trying to find.

Nobody programmed a flush detector. The model was trained by reading trillions of words with the next one hidden, guessing, and getting every internal dial nudged a hair in whichever direction would have made the guess slightly better. Poker knowledge ended up encoded in the dials for one reason: it made the guesses better. So instead of asking whether the model “really understands” flushes, ask a mechanical question: where, in the numbers, does the third spade live? That question has an answer, and by the end of this series we’ll measure it.

2 · The hand, chopped into chips

The machine can’t read letters. It has a fixed vocabulary of roughly 150,000 text fragments (whole words, word pieces, punctuation), and the first thing that happens to our hand history is that it gets chopped into those fragments, each identified by a number:

River#3204  A#317 ♠#9081 .#13  Villain#28116  jam#7521 s#82 .#13  Hero#15490 ?#30
Figure 2. Tokenization (IDs invented, structure real). Note the practical annoyance: "A♠" is two tokens and "jams" splits in the middle. Card notation tokenizes badly, which matters later when we decide exactly which position to measure.
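To see why "jams" splits, here is a minimal sketch of the chopping step. It is a greedy longest-match over a nine-entry vocabulary built from the invented IDs in Figure 2 (taking 9081 to be the ♠ token). Real tokenizers use learned merge rules such as byte-pair encoding and handle spaces differently, so treat `VOCAB` and `tokenize` as illustrative assumptions.

```python
# Invented IDs from Figure 2; a real vocabulary has on the order of 150,000 entries.
VOCAB = {
    "River": 3204, "A": 317, "♠": 9081, ".": 13, "Villain": 28116,
    "jam": 7521, "s": 82, "Hero": 15490, "?": 30,
}

def tokenize(text):
    """Greedy longest-match tokenizer (a simplification of real schemes)."""
    pieces = []
    i = 0
    while i < len(text):
        if text[i].isspace():        # this toy simply skips whitespace
            i += 1
            continue
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                pieces.append((piece, VOCAB[piece]))
                i += length
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return pieces

for piece, token_id in tokenize("River A♠. Villain jams. Hero ?"):
    print(f"{piece}#{token_id}", end=" ")
print()
```

"jams" is not in the vocabulary but "jam" and "s" are, so the longest match takes "jam" and leaves the "s" for the next step.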

Tokenization is a lookup, and so is the next step. Nothing mathematical has happened yet. It’s about to.

3 · Meaning is position

The model’s second move is to swap each token ID for a vector: a list of numbers, say 4,096 of them for a mid-sized model, fetched from a giant table with one row per vocabulary entry (the swap is called embedding). That list is the model’s working description of the token.
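In code, embedding is row indexing and nothing more. This sketch assumes NumPy, shrinks the table to toy sizes, and fills it with random numbers where a real model would have learned ones; the token IDs are the invented ones from Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, D_MODEL = 30_000, 8   # toy sizes; the post's model is ~150,000 x 4,096
# Random here, learned in a real model: one row per vocabulary entry.
embedding_table = rng.normal(size=(VOCAB_SIZE, D_MODEL))

token_ids = [3204, 317, 9081, 13, 28116, 7521, 82, 13, 15490, 30]
x = embedding_table[token_ids]    # fetch one row per token

print(x.shape)                        # one vector per token
print(np.array_equal(x[3], x[7]))     # both "." tokens fetch the same row
```

The same ID always fetches the same row, which is why, at this stage, a token's vector knows nothing about its neighbors.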

Here is the first idea to internalize: stop reading “a list of 4,096 numbers” and start reading “coordinates.” Each token is a point in a space with 4,096 axes. Nobody can picture that space, and you don’t need to. It behaves exactly like the 2D version, just with more axes, and in 2D you can see what training does to it:

[Figure: 2 of 4,096 axes shown. Points: A♠, K♥, Q♣, 7♦ (cards); check, call, bet, raise, jam (actions); pizza. Nearby points = similar meaning; parallel arrows = the same relationship, one step more aggressive.]
Figure 3. A 2D shadow of embedding space. Cards cluster with cards, actions with actions; pizza sits far from both. The two orange arrows are the same arrow: check→call and raise→jam differ by one notch of aggression, and that shared relationship is a shared direction.

Training pushes tokens that behave similarly toward each other, so position comes to encode meaning. Directions pick up something better: relationships. The famous party trick is king − man + woman ≈ queen; the table version is that the step from check to call points the same way as the step from raise to jam. Hold onto “meaning lives along directions.” Every measurement in this series rests on it.
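A tiny sketch makes "same relationship, same direction" concrete. The 2D coordinates below are invented by hand so that the two aggression steps come out parallel; in a real model the vectors are learned and the match is only approximate.

```python
import numpy as np

# Hand-picked 2D coordinates (invented for illustration).
emb = {
    "check": np.array([1.0, 1.0]),
    "call":  np.array([2.0, 1.5]),
    "raise": np.array([3.0, 3.0]),
    "jam":   np.array([4.0, 3.5]),
    "pizza": np.array([-5.0, 4.0]),
}

def cosine(a, b):
    """Cosine similarity: 1 means same direction, 0 means perpendicular."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

one_notch = emb["call"] - emb["check"]     # the "one step more aggressive" arrow
other_notch = emb["jam"] - emb["raise"]
print(cosine(one_notch, other_notch))      # close to 1: the arrows are parallel

# The analogy trick: start at "raise", take the check->call step, see where you land.
target = emb["raise"] + one_notch
nearest = min(emb, key=lambda word: np.linalg.norm(emb[word] - target))
print(nearest)
```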

Here is the analogy that will carry you through everything else: a token’s vector is a HUD readout. Online players summarize an opponent as a stat line (VPIP, PFR, three-bet percentage, fold to c-bet), and no single stat means much, but combinations do. No single stat means “this player is a nit,” either; nit-ness is a combination: low VPIP, low PFR, high fold-to-three-bet, each weighted some amount. A weighted blend of coordinates is exactly what a direction is. The model’s vectors are HUDs with two differences. There are 4,096 columns instead of fifteen. And none of the columns are labeled, because the network invented its own stat definitions during training and never told anyone what they are. Finding out is the job.
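The HUD version of a direction fits in a few lines. The stat values and the weights in `NIT_DIRECTION` are invented for illustration; the point is only that "nit-ness" is one weighted sum over the columns.

```python
STATS = ["vpip", "pfr", "fold_to_3bet"]
# Invented weights: low VPIP and PFR push the score up, so does folding to three-bets.
NIT_DIRECTION = [-1.0, -1.0, 1.0]

def score(hud, direction):
    """Weighted blend of stats: a dot product with the direction."""
    return sum(stat * weight for stat, weight in zip(hud, direction))

tight_player = [0.12, 0.09, 0.80]   # hypothetical stat lines, as fractions
loose_player = [0.45, 0.35, 0.30]

print(score(tight_player, NIT_DIRECTION))
print(score(loose_player, NIT_DIRECTION))
```

The tight player scores positive and the loose one negative. Swap in 4,096 unlabeled columns and this is the operation the rest of the series keeps coming back to.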

4 · A matrix is a machine

Here’s where I got lost at first. Everything so far was lookups. The next step is the first matrix multiplication, and if “matrix multiply” is a half-remembered ritual of rows and columns, everything after this point stays blurry. The fix is to stop seeing arithmetic and start seeing a machine:

[Figure: token vector (4,096 numbers) → W, a matrix (64 fixed recipes of weighted sums) → distilled vector (64 numbers).]
Figure 4. Multiplying a vector by a matrix means running it through a bank of fixed recipes. Each output number is one recipe: take 0.3 of input slot 17, subtract 0.8 of slot 1,204, add a pinch of slot 3,001… The recipes were learned; applying them is mechanical.

Each output number is its own learned weighted sum over the whole input, so a 4,096-to-64 machine is 64 recipes applied to the same vector.

Three properties of these machines do real work later. They are linear: run a blend of two vectors through the machine and you get the same blend of the two outputs, which is why the whole space transforms coherently instead of tearing. They can change dimensionality on purpose: the machine above distills a 4,096-stat HUD into a 64-number specialist summary, deliberately throwing away everything it doesn’t care about. And chaining two machines yields one machine, which is why a 36-layer network is analyzable at all. For the full visual treatment, watch chapter 3 of 3Blue1Brown’s Essence of Linear Algebra.1
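All three properties can be checked numerically. This sketch assumes NumPy and toy sizes (16 → 4 → 3 instead of 4,096 → 64), with random matrices standing in for learned ones. It writes the product as `W @ x` with recipes as rows; the attention code later in this post uses the transposed convention, `x @ W`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_mid, d_out = 16, 4, 3
W1 = rng.normal(size=(d_mid, d_in))    # each row is one recipe
W2 = rng.normal(size=(d_out, d_mid))
x, y = rng.normal(size=d_in), rng.normal(size=d_in)

# 1. Each output number is one recipe: a weighted sum over the whole input.
by_hand = np.array([sum(w * v for w, v in zip(row, x)) for row in W1])
print(np.allclose(by_hand, W1 @ x))

# 2. Linear: the machine applied to a blend equals the blend of the outputs.
print(np.allclose(W1 @ (2.0 * x + y), 2.0 * (W1 @ x) + W1 @ y))

# 3. Chaining two machines yields one machine.
W_combined = W2 @ W1
print(np.allclose(W2 @ (W1 @ x), W_combined @ x))
print(W_combined.shape)
```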

5 · The residual stream is the table

Now the architecture. The model is a stack of about 36 blocks, identical in design but each with its own learned weights (the design is called a transformer). Every token’s vector passes through all of them, and each block adds its output onto the vector. Adds, never overwrites. So the vector for a given token gets revised 36 times on its way up. At layer 0 the vector for “Hero” encodes the word and its position,2 and nothing else. By the middle layers, after each token has been mixing in information from the tokens around it (the next section explains how), it encodes who raised, how big, what street, and what that implies. That running, accumulating vector is called the residual stream. The metaphor that made it click for me gives this post its title.

The residual stream is the table. At a live game, the state of the hand isn’t stored in any one player’s head; it’s out on the felt: the board cards, the pot, the bet in front of villain, the posture of everyone still in. Each street, players read the table and add to it. In the model, each token’s stream is its patch of felt: every block reads what’s there and lays down another chip of information. By the late layers, the patch in front of our final token holds everything the network has figured out about the hand:

  • layer 0: the word "Hero", sitting at position 47 in the sequence, and nothing else
  • ~layer 3: grammar resolved: this token is a decision point awaiting an action verb
  • ~layer 9: board synthesis: A♠ completed the K♠ 9♠ flop; the flush arrived
  • ~layer 15: arithmetic: jam is 1.4× pot, Hero is getting about 1.7-to-1
  • ~layer 22: range logic: check-back turn then jam river reads polarized (nuts or air)
  • ~layer 29: the verdict forming: probability mass sliding toward "calls"
Figure 5. The final token's patch of felt, accumulating annotations on its way up the stack. This is a cartoon; real features are messier, distributed, and overlapping, and pinning actual claims like these to actual layers is precisely what probing experiments do. But the shape of the picture is right: information accrues by addition, layer by layer.
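The arithmetic line in Figure 5 is easy to check by hand. A minimal sketch, measuring everything in units of the pot before the jam; the function names are my own.

```python
def pot_odds(pot, bet):
    """Odds offered to the caller: amount to win divided by amount to call."""
    to_win = pot + bet            # the pot plus Villain's jam
    return to_win / bet

def break_even_equity(pot, bet):
    """Share of the final pot the caller puts in: the equity a call needs."""
    return bet / (pot + 2 * bet)

odds = pot_odds(pot=1.0, bet=1.4)
equity = break_even_equity(pot=1.0, bet=1.4)
print(f"{odds:.2f}-to-1, break-even equity {equity:.1%}")
```

A 1.4× pot jam lays 2.4 to win against 1.4 to call, about 1.7-to-1, so a call needs to be good a little over a third of the time.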

Two things follow from “add, never overwrite.” First, because every block writes by addition, the natural format for information is a direction: to record “flush completed,” nudge the vector along the flush-completed direction. That is why meaning-as-direction keeps showing up.
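Writing by addition and reading by dot product can be shown in a dozen lines. This is a sketch under heavy assumptions: the two "feature directions" are invented and made exactly perpendicular with a QR decomposition, whereas real directions are learned and only approximately orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
# Two orthonormal directions (hypothetical features), built with QR.
q, _ = np.linalg.qr(rng.normal(size=(d_model, 2)))
flush_dir, pot_odds_dir = q[:, 0], q[:, 1]

stream = rng.normal(size=d_model)        # whatever was on the felt already
before = stream @ flush_dir              # read along the flush direction

stream = stream + 3.0 * flush_dir        # one block writes "flush completed"
stream = stream + 1.7 * pot_odds_dir     # a later block writes something else

after = stream @ flush_dir
print(round(after - before, 6))          # the flush write, untouched by the second write
```

Because the second write is perpendicular to the first, it leaves the flush reading alone. That is the sense in which many annotations can share one vector.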

Second, anyone can read the table at any point. Including us. An activation, the raw material of every interpretability experiment, is a snapshot of one token’s residual stream at one layer: 4,096 numbers, copied out mid-computation, like photographing the felt between streets. The model is a program, a long sequence of matrix multiplications and a few simple nonlinear steps, and you can pause it at a chosen line and copy out a variable, the same way you’d print a variable from the middle of a function.3
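Here is the "pause and copy out a variable" idea on a toy stack. None of this is a real library's API: `block_weights` and `run_with_snapshot` are made up for the sketch, and each toy block is a single random matrix with a tanh, not a real attention plus MLP block.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers = 8, 6
# Toy "blocks": one fixed random matrix each, output added onto the stream.
block_weights = [0.1 * rng.normal(size=(d_model, d_model)) for _ in range(n_layers)]

def run_with_snapshot(x, capture_layer):
    """Run the toy stack; copy out the stream after `capture_layer` blocks."""
    snapshot = None
    for layer, W in enumerate(block_weights, start=1):
        x = x + np.tanh(x @ W)           # add, never overwrite
        if layer == capture_layer:
            snapshot = x.copy()          # the activation: a copy, not a reference
    return x, snapshot

x0 = rng.normal(size=d_model)
final, activation = run_with_snapshot(x0, capture_layer=3)
print(activation.shape)
```

The snapshot at layer 3 is simply the value of `x` at that point in the loop; the model's output is unaffected by the fact that someone looked.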

One distinction to keep, because the two get conflated constantly: the stream and the weights. Training wrote the weights (the embedding table, every block’s matrices) slowly, then froze them. They are the player, carrying everything the years of practice taught him. The stream is the hand he’s playing: rebuilt from nothing every time the model reads a prompt, used, and thrown away. Everything this series measures lives in the stream. The question the series ends on, whether tell-readers go stale as the player keeps learning, lives in the gap between the two.

6 · Attention is alignment

One problem remains. After embedding, each token’s patch of felt knows only about itself. The vector for “jams” knows it is the word “jams.” But the jam means something different on an A♠ river after a spade flop than on a brick, so information has to move from earlier positions into later ones. The mechanism that moves it needs two properties. It has to be learned, because the network must figure out for itself what is relevant to what. And it has to be soft, because training can only tune things that change smoothly, so “look at token 3, ignore token 5” must be expressed as percentages, never as hard switches. Attention is that mechanism, the piece the transformer was built around: a learned, soft lookup. Mechanically, it’s a job market.

Attention runs in parallel units called heads. Inside one head, three of the matrix-machines from Figure 4 distill every token’s big vector into three small ones, and the three roles are worth keeping separate in your head. The query is a want-ad: what this token is looking for. The key is a résumé: what this token advertises containing. The value is the actual employee: the package of information handed over if hired. What a token seeks, what it advertises, and what it delivers are different things, and the network learns each independently.

Scoring is a dot product, the simplest similarity measure there is. To compare the current token’s want-ad against an earlier token’s résumé, line up the two small vectors, multiply them entry by entry, and add everything up. When the vectors point in similar directions, the products reinforce and the sum is large; when they’re unrelated, it hovers near zero. Alignment, as a number. The network has total freedom, through the query and key recipes, to learn what kind of matching matters; it invents coordinate systems in which “relevant to me” becomes “geometrically aligned with me.”

Then softmax turns the scores into a pot split. Exponentiate each score, divide by the sum, and the results are positive and total exactly 1. Every token has exactly 100% of attention to distribute over itself and everything before it (a mask blocks the future: no peeking at cards that haven’t come), and softmax decides the split while exaggerating the leaders; a score that’s moderately ahead gets a dominant share. Drag the want-ad around and watch the split:

query: the orange arrow (what the decision token seeks)
flop K♠9♠2♦ (key): q·k 2.0 → 9%
bets 75% (key): q·k 3.3 → 29%
river A♠ (key): q·k 3.9 → 57%
turn 7♥ (key): q·k 1.6 → 5%

Drag the orange arrowhead, or run the three experiments left to right.

Figure 6. One attention head, live. Teal arrows are keys (what earlier tokens advertise); orange is the query (what the decision token seeks); each q·k is a dot product; the bars are softmax splitting 100% of attention. Three things to make happen before moving on. Aligned beats everything. Perpendicular dies even when the arrowheads are close, because the score measures alignment, not proximity. And a longer query in the same direction keeps the same winner while sharpening the split: softmax exaggerates big scores, which is exactly why real attention divides the scores by √64, the square root of the head dimension, to keep the pot from saturating.
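For readers without the live figure, the three experiments can be run in plain Python. The 2D query and keys below are invented, not the ones in Figure 6, and the head dimension of 64 is the post's running example.

```python
import math

def dot(a, b):
    """Line up, multiply entry by entry, add everything up."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

query = [1.0, 0.0]
keys = {"aligned": [2.0, 0.0], "perpendicular": [0.0, 2.0], "opposed": [-2.0, 0.0]}

scores = [dot(query, k) for k in keys.values()]
split = softmax(scores)

# A longer query in the same direction: same winner, sharper split.
long_query = [3.0 * q for q in query]
sharper = softmax([dot(long_query, k) for k in keys.values()])

# Dividing scores by sqrt(64) pulls the split back toward even.
scaled = softmax([s / math.sqrt(64) for s in scores])

for name, a, b, c in zip(keys, split, sharper, scaled):
    print(f"{name:14s} base {a:.3f}  long query {b:.3f}  scaled {c:.3f}")
```

The perpendicular key scores exactly zero however long it is, and tripling the query pushes nearly all of the pot to the aligned key.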

Once the split is decided, each earlier token hands over its value package in proportion to its share, and one last learned matrix translates the blend back into the residual stream’s 4,096-dimensional language. The last step is the one to remember: the result is added onto the token’s existing vector. Nothing is overwritten. The token’s running HUD just gained a new annotation: the third spade arrived.

A real layer runs about 64 of these heads in parallel, each with its own learned matching rule, each reading the same residual stream and matching on different criteria. The head in Figure 6 behaves like a flush-tracker, hiring the river A♠ hardest because that’s the card that completed the draw. Other heads might match on bet sizes, position, whose turn it is. Nobody assigns the specialties. They emerge, plausibly because redundant heads waste capacity and training tends to punish waste.

The whole mechanism fits in eight lines, and seeing how little there is demystifies it:

q = x @ W_Q                   # x: [seq, 4096] -> q: [seq, 64]
k = x @ W_K
v = x @ W_V
scores = q @ k.T / 8          # every query dotted with every key, divided by sqrt(64)
scores = mask_future(scores)  # causal: no looking ahead
split = softmax(scores)       # shares of the pot; each row sums to 1
out = split @ v               # blend of value packages
x = x + out @ W_O             # translate back, add to the stream
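Those eight lines lean on two helpers, `mask_future` and `softmax`, that are not defined. Here is one way to fill them in so the whole thing runs, assuming NumPy, toy sizes, and random matrices in place of trained ones. It is a single head with no normalization layers, so a sketch of the mechanism and not a production layer.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max keeps exp() from overflowing."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def mask_future(scores):
    """Causal mask: set scores for later positions to -inf (zero share after softmax)."""
    seq = scores.shape[0]
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    return np.where(future, -np.inf, scores)

def attention_head(x, W_Q, W_K, W_V, W_O):
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    scores = q @ k.T / np.sqrt(q.shape[-1])   # sqrt of the head dimension
    split = softmax(mask_future(scores))      # each row sums to 1
    return x + (split @ v) @ W_O, split       # add to the stream, never overwrite

rng = np.random.default_rng(0)
seq, d_model, d_head = 5, 32, 8               # toy sizes, not 4,096 and 64
x = rng.normal(size=(seq, d_model))
W_Q, W_K, W_V = (0.1 * rng.normal(size=(d_model, d_head)) for _ in range(3))
W_O = 0.1 * rng.normal(size=(d_head, d_model))

new_x, split = attention_head(x, W_Q, W_K, W_V, W_O)
print(new_x.shape)
print(np.round(split, 2))   # lower-triangular: no token attends to the future
```

The first row of `split` is always 100% on itself, because the first token has nothing earlier to look at.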

Two distinctions in those lines pay off later. The four matrices per head are weights, fixed after training, while the attention pattern itself is recomputed from scratch for every input; the rules for looking are permanent, the looking is per-hand.4 And the lines factor into two separate questions: the query and key matrices decide where to look, the value and output matrices decide what to move, and a head can have an interesting looking rule, an interesting moving rule, or both.5

Attention gathers; it does very little thinking of its own. The other machine in each block, the MLP, is the opposite: no communication between tokens, just heavy computation on each patch of felt separately. Mechanically it is two of the matrix-machines from Figure 4 with a simple gate between them that snaps negative numbers to zero; without the gate, the two would collapse into one machine and compute nothing new. Attention hauls the pot size and the jam size onto the final token’s felt; the MLP is where something like 1.7-to-1 gets computed and written down. Fetch, then think, thirty-six times.
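The claim that the gate is what keeps two machines from collapsing into one is checkable. This sketch assumes NumPy, toy sizes, random weights, and the plain zero-out-negatives gate described above (ReLU); many real models use smoother gates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
W_in = rng.normal(size=(d_model, d_hidden))
W_out = rng.normal(size=(d_hidden, d_model))
x = rng.normal(size=(4, d_model))          # four token positions

def mlp(x):
    hidden = np.maximum(x @ W_in, 0.0)     # the gate: negatives snap to zero
    return hidden @ W_out

# Without the gate, two machines are one machine.
without_gate = (x @ W_in) @ W_out
single_machine = x @ (W_in @ W_out)
print(np.allclose(without_gate, single_machine))

# With the gate, no single matrix reproduces the result.
print(np.allclose(mlp(x), single_machine))

# No communication between tokens: each row is processed on its own.
print(np.allclose(mlp(x)[0], mlp(x[:1])[0]))
```

In the block itself the MLP output is added onto the stream, the same as attention's; the sketch leaves that addition out to isolate the gate.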

7 · The last machine

One step remains between the final block and the bars in Figure 1. After thirty-six rounds of fetching and thinking, the patch of felt in front of the last token holds the model’s full read on the hand. The unembedding turns that read into a prediction, and it is one more matrix machine, the widest of all: every token in the vocabulary owns a direction, a row of the matrix, and the final vector gets dotted against all 150,000 of them. Alignment becomes score, softmax splits the pot, and out come the bars we started with: “calls” 62%, “folds” 31%. Sample one, append it, run the whole function again. That loop is text generation, and a chat conversation is the loop running on a transcript.

the final vector: the orange arrow; calls, folds, raises, tanks: one direction each
calls: v·w 2.2 → 62%
folds: v·w 1.5 → 31%
raises: v·w -0.6 → 4%
tanks: v·w -0.8 → 3%

Drag the orange arrowhead. The bars are softmax over four dot products.

Figure 7. The unembedding with a four-word vocabulary. Each word owns a direction, a row of the matrix in ink; the amber arrow is the final token’s stream after the last block. Dot products become scores, softmax splits the pot, and the bars from Figure 1 fall out. The real matrix has 150,000 rows.

Read the output step again in the language of section 3. Each vocabulary direction is a weighted sum over the 4,096 coordinates of the stream. The model’s own output layer is a bank of 150,000 linear readouts, trained by the model for its own use. Reading the residual stream with a weighted sum is the native way information leaves this machine; it happens 150,000 times for every token it produces.
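The four-word unembedding from Figure 7 can be replayed in plain Python. The 2D directions in `W_U` and the `final_vector` are invented, chosen only so that the dot products land on the figure's scores (2.2, 1.5, -0.6, -0.8).

```python
import math

WORDS = ["calls", "folds", "raises", "tanks"]
# One direction (row) per word; invented to reproduce Figure 7's scores.
W_U = [[0.8, 0.6], [0.5, 0.5], [-0.4, 0.2], [-0.1, -0.6]]
final_vector = [2.0, 1.0]       # the last token's stream after the final block

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Alignment becomes score: dot the final vector against every word's direction.
logits = [sum(w * v for w, v in zip(row, final_vector)) for row in W_U]
probs = softmax(logits)

for word, logit, p in zip(WORDS, logits, probs):
    print(f"{word:7s} v·w {logit:+.1f}  {p:.0%}")
```

Rounded to whole percents the split comes out 62, 31, 4, 3, matching the figure. The real version does the same thing with 150,000 rows and 4,096 columns.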

8 · Where this is going

Back to tells. A linear probe, the instrument this series will spend months with, is one weighted sum and a threshold applied to a snapshot of the residual stream. After section 7 it should look familiar: the unembedding already reads the stream with 150,000 weighted sums, one per word. A probe is one more readout, added from outside, trained on a label the model never produced: this hand was a bluff. It is a HUD rule, precisely: weight these internal stats like so, and if the total clears the bar, tag him.

The data behind a probe is the least glamorous part of the field. The model reads five thousand hands while a hook copies out the snapshot at each decision token, and next to each snapshot goes a label that a poker solver already knew: this one was a bluff, this one was value. What you end up with is, honestly, a spreadsheet. Five thousand rows, 4,096 numeric columns, one label column. Nothing has been discovered at this point; a program ran five thousand times and something wrote down an intermediate variable.
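To make the spreadsheet concrete, here is a sketch that fabricates one and fits a probe to it. Nothing here comes from a real model: the "activations" are random noise with bluffs shifted along one hidden direction, the sizes are shrunk, and the probe is fit with a few hundred steps of plain gradient descent on logistic loss (my choice for the sketch, not a claim about how the series will train them).

```python
import numpy as np

rng = np.random.default_rng(0)
n_hands, d_model = 2000, 32     # toy stand-ins for 5,000 rows and 4,096 columns

# Synthetic activations: bluffs are shifted along one hidden direction.
bluff_dir = rng.normal(size=d_model)
bluff_dir /= np.linalg.norm(bluff_dir)
labels = rng.integers(0, 2, size=n_hands)          # 1 = bluff, 0 = value (invented)
acts = rng.normal(size=(n_hands, d_model)) + 4.0 * labels[:, None] * bluff_dir

train, test = slice(0, 1500), slice(1500, None)
acts = acts - acts[train].mean(axis=0)             # center using training rows only

# The probe: one weight per column plus a bias.
w, b = np.zeros(d_model), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts[train] @ w + b)))
    grad = p - labels[train]
    w -= 0.1 * acts[train].T @ grad / grad.size
    b -= 0.1 * grad.mean()

pred = (acts[test] @ w + b) > 0                    # weighted sum, then a threshold
accuracy = (pred == labels[test]).mean()
alignment = w @ bluff_dir / np.linalg.norm(w)      # did the probe find the direction?
print(f"held-out accuracy {accuracy:.2f}, cosine with planted direction {alignment:.2f}")
```

On this planted data the probe's weight vector ends up pointing almost exactly along the direction that was hidden in the columns. With a real model there is no planted direction to compare against, only the held-out accuracy.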

Plotted instead of filed, each row is a point in the space from section 3, and the labels color the points. Five thousand hands make two clouds. Either a flat plane sits between them or it doesn’t:

[Figure: left panel, a table with columns hand, x17, x2048, ..., label, one row per hand. Right panel, the same hands plotted with columns x17 and x2048 as coordinates.]

Each hand writes one row and drops one point. Same numbers, two views.

Figure 8. The spreadsheet and the geometry are the same object. Every hand appends a row on the left and a point on the right, and the point’s coordinates are two of the row’s 4,096 columns. The plane is the probe: one side reads bluff. Numbers invented for teaching.

The reason probes are kept linear is that linear functions are weak. One weighted sum cannot compute equity from raw card tokens; equity is a violently nonlinear function of the cards. So when a linear rule reads equity off layer 16 anyway, the nonlinear work happened inside the model, and the probe found where the answer was written on the felt.
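The weakness is easy to demonstrate on the smallest nonlinear function there is. In this sketch XOR stands in for equity (a loose analogy, not a poker computation): a brute-force search over linear rules cannot get it right from the raw inputs, but succeeds once a single nonlinear feature has already been computed and written down as an extra column.

```python
from itertools import product

# XOR: the label is 1 when exactly one input is 1.
POINTS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def best_linear_accuracy(rows):
    """Try every rule 'w . x + b > 0' with small integer weights; return the best accuracy."""
    n_features = len(rows[0][0])
    best = 0
    for *w, b in product(range(-3, 4), repeat=n_features + 1):
        correct = sum(
            (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == bool(y)
            for x, y in rows
        )
        best = max(best, correct)
    return best / len(rows)

raw = best_linear_accuracy(POINTS)
# Suppose the nonlinear work (here, the product x1 * x2) was already done upstream.
with_feature = best_linear_accuracy([((a, c, a * c), y) for (a, c), y in POINTS])

print(raw, with_feature)
```

The best linear rule on the raw inputs gets three of four; with the extra column it gets all four. When a linear probe succeeds, the credit goes to whatever computed that column.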

That’s the plan from here. Next post: dot products and the rest of the geometry, formally, alongside building a GPT from scratch following Raschka. A few posts after that, the first real experiment: training probes for hand equity on an open model, where, unusually for this field, the ground truth is computable to four decimal places. Down the road, the questions the series exists for: whether probes trained to catch a model lying also fire when it bluffs, and whether tell-readers go stale as the player keeps learning after deployment. The background I’m working through is Neel Nanda’s guide plus the ARENA curriculum.

The felt is set. Next street.

The geometry in section 3 and the layer numbers in Figure 5 are illustrative; the numbers in Figures 1, 6, 7, and 8 are constructed for teaching. I’ve also left out the normalization steps that rescale the vector between blocks; they matter for engineering and change nothing in this picture. Corrections welcome.

Footnotes

  1. The best eleven minutes in mathematics education; I will not compete with it in prose.

  2. Position enters through a second mechanism that varies by model family: older models add a learned position vector to each token’s embedding, newer ones rotate the query and key vectors by position-dependent angles. Either way, the stream knows where its token sits.

  3. Libraries like TransformerLens make this painless: attach a small function (a hook) at “layer 16, last token,” run the model on a prompt, and the hook copies the 4,096 numbers to disk.

  4. The looking is also expensive. Every token scores against every earlier token, so a 10,000-token context needs on the order of 100 million scores per head per layer. Attention cost grows with the square of context length, which is why long contexts cost what they do.

  5. The framing comes from A Mathematical Framework for Transformer Circuits (Elhage et al., 2021), which names them the QK circuit and the OV circuit and analyzes them separately.