What Is Linear Algebra Really Doing in LLMs?

Recently, while revisiting LLM material more systematically, I noticed a subtle problem. Many concepts felt familiar, and I could use them in the usual contexts. But when I tried to explain them from the beginning, including why they have this form, what problem they solve, and what role they actually play inside an LLM, the explanation often became less precise.

Rebuilding those explanations from the ground up almost unavoidably starts with linear algebra. The basic objects inside an LLM are already vectors and matrices: a hidden state is a vector, a linear layer is matrix multiplication, attention uses dot products, and LoRA uses low-rank updates. These statements are all true, but they are mostly entry points. The harder questions are: why can information be represented as vectors? What exactly does matrix multiplication change? How does attention let different tokens interact? Why do rank and SVD naturally connect to compression and LoRA?

This post follows those questions and tries to reorganize the role of linear algebra in LLMs. I want to take concepts that are familiar but easy to describe vaguely and pull them apart: starting from representation, looking at how vectors carry information; then looking at how matrices rewrite representations inside each position; then looking at how attention uses matching and weighted mixing to let tokens communicate; and finally looking at why trained matrices can reveal structure that low-rank methods can use.

If the sentence “LLMs are full of matrix multiplications” feels more concrete by the end, the post has done its job. Matrix multiplication is only the visible operation. The deeper layer is the linear algebra language behind it: representation, transformation, interaction, scale, and the structure that appears after training.

Quick Recap: LLM Computation

Start with a decoder-only Transformer forward pass. From the outside, the model turns a sequence of token IDs into a probability distribution for the next token:

\[\text{token IDs} \rightarrow \text{embeddings} \rightarrow \text{transformer blocks} \rightarrow \text{logits} \rightarrow \text{next-token probabilities}\]

The object being updated throughout this computation is a matrix of hidden states for the whole sequence. The model first uses an embedding table to map each token ID to a $d$-dimensional vector. Positional information enters either through the state stream (for example, position embeddings added to the token embeddings) or inside the attention computation (for example, rotary embeddings applied to queries and keys). If the sequence length is $n$ and the model width is $d$, the initial state matrix can be written as

\[H^{(0)} \in \mathbb{R}^{n \times d}.\]

After that, the main body of the LLM can be seen as repeatedly updating this state matrix. Layer $\ell$ receives

\[H^{(\ell)} \in \mathbb{R}^{n \times d}\]

and outputs

\[H^{(\ell+1)} = \mathrm{Block}_{\ell}(H^{(\ell)}).\]

Inside a block, attention and the MLP project the state matrix into spaces with different purposes and sometimes different widths. Their outputs are eventually projected back to the model width $d$ and added to the residual stream. The hidden states passed between layers therefore usually keep the same $n \times d$ shape.

A common pre-norm block can be viewed in two steps: first an attention update, then an MLP update. Let

\[X = \mathrm{Norm}(H^{(\ell)}).\]

Attention projects the same state matrix $X$ into three matrices:

\[Q = XW_Q,\quad K = XW_K,\quad V = XW_V.\]

The rows of $Q$, $K$, and $V$ are still aligned with token positions. The difference is that they are produced by different projection matrices and serve different roles. $Q$ and $K$ generate matching scores between positions. $V$ contains the information that will later be read and mixed.

The model then uses $QK^\top$ to compute all pairwise matching scores at once:

\[S = \frac{QK^\top}{\sqrt{d_k}} + M,\]

where $M$ is the causal mask: $M_{ij} = 0$ when $j \le i$ and $M_{ij} = -\infty$ when $j > i$, so future positions receive zero weight after softmax. Applying softmax row by row turns these scores into read weights:

\[A = \mathrm{softmax}(S).\]

Then the model mixes values using those weights:

\[C = AV.\]

Real models usually have multi-head attention and an output projection. For now, we can compress that into one attention update:

\[U_{\mathrm{attn}} = C W_O.\]

This update is added back to the original state through the residual connection:

\[\widetilde{H} = H^{(\ell)} + U_{\mathrm{attn}}.\]

At this point, attention has done one specific job: it projected the state matrix into representations for matching and reading, used $QK^\top$ to produce cross-position weights, and used $AV$ to mix information from visible positions back into each position.

Next comes the MLP. It applies the same transformation to the state at each position and mainly processes features within each position. Let

\[R = \mathrm{Norm}(\widetilde{H}).\]

A simplified MLP can be written as

\[\mathrm{MLP}(R) = \sigma(RW_1)W_2.\]

It first projects each state to an intermediate width, applies a nonlinearity, and then projects back to the model width $d$. This update is also added back to the state stream:

\[H^{(\ell+1)} = \widetilde{H} + \mathrm{MLP}(R).\]

So a transformer block can be viewed as: first use attention to read information across positions, then use the MLP to process features inside each position, and write both updates back into the same state matrix through residual connections.
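The two-step block described above can be sketched in a few lines of NumPy. This is a minimal single-head sketch, not any particular model's implementation: `rms_norm` is an assumed stand-in for $\mathrm{Norm}$, ReLU is an assumed stand-in for $\sigma$, and all sizes and weights are toy values.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Assumed stand-in for Norm: rescale each row to unit root-mean-square.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax_rows(s):
    # Row-wise softmax; subtracting the row max avoids overflow.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block(H, W_Q, W_K, W_V, W_O, W_1, W_2):
    n, d_k = H.shape[0], W_Q.shape[1]
    X = rms_norm(H)
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    # Causal mask: 0 on and below the diagonal, -inf above it.
    M = np.where(np.tril(np.ones((n, n), dtype=bool)), 0.0, -np.inf)
    A = softmax_rows(Q @ K.T / np.sqrt(d_k) + M)
    H_tilde = H + (A @ V) @ W_O                        # attention update + residual
    R = rms_norm(H_tilde)
    return H_tilde + np.maximum(R @ W_1, 0.0) @ W_2    # MLP update + residual

rng = np.random.default_rng(0)
n, d, d_k, d_ff = 5, 8, 8, 32                          # toy sizes
shapes = [(d, d_k), (d, d_k), (d, d_k), (d_k, d), (d, d_ff), (d_ff, d)]
weights = [rng.normal(scale=0.1, size=s) for s in shapes]

H = rng.normal(size=(n, d))
H_next = block(H, *weights)
print(H_next.shape)  # (5, 8): same n x d shape as the input
```

Because of the causal mask, changing the last row of `H` leaves every earlier row of the output unchanged, which is a quick way to check that information only flows from earlier positions to later ones.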

After many such layers, the model takes the final hidden state at a position and produces vocabulary logits:

\[z_t^\top = (h_t^{(L)})^\top W_{\mathrm{out}}.\]

Here $h_t^{(L)} \in \mathbb{R}^d$ is the final state at position $t$, and

\[W_{\mathrm{out}} \in \mathbb{R}^{d \times \lvert \mathcal{V} \rvert}\]

maps it into vocabulary space. Each logit can be understood as a linear score between the final state and one output-token direction. After softmax, these scores become the probability distribution for the next token:

\[p(x_{t+1}\mid x_{\le t}) = \mathrm{softmax}(z_t).\]

Looking back over the whole forward pass, linear algebra appears throughout the state update chain. The embedding table turns discrete token IDs into $H^{(0)}$, placing symbols into a continuous vector space. Each transformer block receives $H^{(\ell)}$ and uses $W_Q,W_K,W_V$ to project the same states into three matrices. $QK^\top$ produces matching scores between positions. Softmax turns those scores into read weights. $AV$ mixes value information back into each position. The attention output is projected back to width $d$ and written into the state matrix. The MLP then processes each position internally with $W_1$, a nonlinearity, and $W_2$. After many layers, a final hidden state is read by $W_{\mathrm{out}}$ into vocabulary logits. At a high level, the forward pass is a state matrix $H$ being repeatedly projected, compared, mixed, written back, and scored.

This overview only sets the stage. The next step is to look more closely: what is each operation doing, why is it a reasonable operation, what information does it change, and what role does it play? How does a hidden state carry information as a vector? How does matrix multiplication rewrite the representation inside each position? How does attention make different positions match and exchange information? How does weighted mixing write context back into a token state? We start with the smallest object: why a hidden state can be a vector at all.

From Token IDs to Vector States

When an LLM first sees text, it sees a sequence of token IDs.

After tokenization, a piece of text becomes an integer sequence:

\[x_1,x_2,\ldots,x_n.\]

Each $x_i$ is an index in the vocabulary:

\[x_i \in \{1,\ldots,\lvert \mathcal{V} \rvert\}.\]

These integers are useful for lookup, but they do not have a natural geometric meaning. If token ID 1234 and token ID 1235 differ by 1, that says only that their IDs are adjacent. It does not mean the two tokens are close in meaning, grammar, or usage. The model needs to do continuous computation: weighting, projection, comparison, and updating. Raw integer IDs cannot support those operations directly.

An embedding lookup assigns a vector state to each token ID. If the vocabulary size is $\lvert \mathcal{V} \rvert$ and the model width is $d$, the embedding table is

\[E \in \mathbb{R}^{\lvert \mathcal{V} \rvert \times d}.\]

If the token ID at position $i$ is $x_i$, the model takes row $x_i$:

\[e_i = E_{x_i} \in \mathbb{R}^d.\]

This $e_i$ is the initial state at position $i$ before the transformer blocks. It has $d$ coordinates. It can be updated by addition, projected by matrices, and adjusted during training through gradients. If we stack the vectors for a length-$n$ sequence by position, we get the initial state matrix:

\[H^{(0)} = \begin{pmatrix} e_1^\top \\\\ e_2^\top \\\\ \vdots \\\\ e_n^\top \end{pmatrix} \in \mathbb{R}^{n \times d}.\]

Each row corresponds to the initial hidden state at one position. Once the sequence enters the transformer blocks, this state matrix becomes the core object the model processes. At layer $\ell$, the sequence state is

\[H^{(\ell)} \in \mathbb{R}^{n \times d}.\]

The $i$-th row can be written as $(h_i^{(\ell)})^\top$, where

\[h_i^{(\ell)} \in \mathbb{R}^d\]

is the hidden state for position $i$ at that layer. As the layer index increases, $h_i^{(\ell)}$ keeps absorbing information from the current token, from the context, and from features produced by earlier layers. The $i$-th row still belongs to position $i$, but its contents are no longer just the original token embedding.
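The lookup-and-stack step can be sketched as plain array indexing. The vocabulary size, width, and token IDs below are made-up toy values, and IDs are 0-indexed here because that is how NumPy indexes rows.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4                      # toy sizes, chosen for illustration
E = rng.normal(size=(vocab_size, d))       # embedding table, one row per token ID

token_ids = np.array([3, 7, 3, 1])         # hypothetical tokenized input (0-indexed)
H0 = E[token_ids]                          # row lookup: H0[i] = E[token_ids[i]]

print(H0.shape)                            # (4, 4): n positions, d coordinates each
# The same token ID always starts from the same vector.
print(np.array_equal(H0[0], H0[2]))        # True

# Equivalent view: a one-hot matrix times E performs the same lookup.
one_hot = np.eye(vocab_size)[token_ids]
print(np.allclose(one_hot @ E, H0))        # True
```

The one-hot view is only there to show that the lookup is itself a linear operation; in practice the indexing form is what gets used.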

This is the basic role of a vector state in an LLM: it is the workspace for a position. The embedding gives the position an initial state. Each transformer block reads from that state, reads from context when attention is involved, and writes an update back into the same state. Later layers then read from the updated state.

This design is convenient for the model. Every position has a continuous state of the same width. Attention outputs, MLP outputs, and residual connections can all write back into the same object. The object has coordinates, so it can be updated by addition. It has directions, so it can be read by linear layers. As a row of a matrix, it can be processed together with the whole sequence. The model does not need one fixed coordinate to correspond to one human concept. Training shapes the whole space so that certain directions and combinations become useful.

Consider a small example. Suppose a hidden state is three-dimensional:

\[h = \begin{pmatrix} 3 \\\\ -2 \\\\ 4 \end{pmatrix}.\]

If we read it with the direction

\[u_1 = \begin{pmatrix} 1 \\\\ 0 \\\\ 0 \end{pmatrix},\]

then

\[u_1^\top h = 3.\]

This reads the component of $h$ along the first coordinate axis. If we use a direction that mixes coordinates,

\[u_2 = \frac{1}{\sqrt{3}} \begin{pmatrix} 1 \\\\ -1 \\\\ 1 \end{pmatrix},\]

then

\[u_2^\top h = \frac{1}{\sqrt{3}}(3 + 2 + 4) = \frac{9}{\sqrt{3}}.\]

This reads the response of $h$ along a mixed direction, and $9/\sqrt{3} = 3\sqrt{3} \approx 5.2$. That direction uses the first and third coordinates positively and the second coordinate with the opposite sign. The takeaway is a basic fact: given a direction $u$, the number $u^\top h$ reads how strongly the hidden state responds along that direction.
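The two readouts above are easy to check numerically. The sketch below just reuses the numbers from the example.

```python
import numpy as np

h = np.array([3.0, -2.0, 4.0])
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([1.0, -1.0, 1.0]) / np.sqrt(3.0)

r1 = u1 @ h                              # reads the first coordinate
r2 = u2 @ h                              # reads along the mixed direction
print(r1)                                # 3.0
print(np.isclose(r2, 9 / np.sqrt(3)))    # True
print(np.isclose(np.linalg.norm(u2), 1)) # True: u2 is a unit vector
```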

In general, for any

\[u \in \mathbb{R}^d,\]

a linear readout is

\[u^\top h_i^{(\ell)}.\]

If $u$ is a unit vector, this value is the signed projection length of $h_i^{(\ell)}$ along direction $u$. Intuitively, $h_i^{(\ell)}$ is a working memory for the current position, and $u$ is a read direction. Computing $u^\top h_i^{(\ell)}$ asks: how strongly does this working memory respond along this direction?

This small observation explains why vector states are suitable internal representations. The same state can be read along many directions, and each direction produces a different linear signal (Fig. 1).

Figure 1. The same hidden state $h$ can be read along any direction $u$; the scalar $u^\top h$ is the signed projection length of $h$ onto $u$. The dial sweeps gently back and forth — grab the dot to steer $u$, and it resumes from wherever you leave it.

Real models organize many read directions into matrices, and they send the whole sequence of states through matrix multiplication at once. At this point, the objects we need are in place: a sequence state matrix $H$, and each row inside it can be read along directions. The question now shifts from “what is the state?” to “how do matrices use these states?”

The Two Jobs of Matrices

Once every position has a hidden state, the next question is what matrix multiplication does to those states. A useful way to organize the answer is by function. In a Transformer, matrix multiplication mostly does two kinds of work.

Rewriting Each Token

The first kind is token-internal transformation. These operations act on each position separately. If

\[H \in \mathbb{R}^{n \times d}\]

and

\[W \in \mathbb{R}^{d \times m},\]

then

\[Y = HW \in \mathbb{R}^{n \times m}.\]

The $i$-th row is

\[y_i^\top = h_i^\top W.\]

So the same matrix $W$ is applied to every token’s hidden state. This changes how each token is represented, but it does not mix information across token positions. Position $i$ stays position $i$. Its state is rewritten into a new set of features.
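A small experiment makes the "no mixing across positions" claim concrete. The sizes and random values are arbitrary; the point is only that perturbing one row of $H$ changes one row of $HW$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 4, 6, 3
H = rng.normal(size=(n, d))
W = rng.normal(size=(d, m))
Y = H @ W

# Row i of Y depends only on row i of H.
rows_match = all(np.allclose(Y[i], H[i] @ W) for i in range(n))

# Perturb position 2 only; every other output row stays the same.
H_perturbed = H.copy()
H_perturbed[2] += 1.0
Y_perturbed = H_perturbed @ W
changed = [not np.allclose(Y[i], Y_perturbed[i]) for i in range(n)]
print(rows_match, changed)  # True [False, False, True, False]
```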

This covers many matrices in an LLM: $W_Q,W_K,W_V$, MLP projections, output projections, and LoRA updates on top of these matrices. The exact purpose differs, but the structural pattern is the same. Each token carries a vector, and the matrix reads that vector into another representation.

Creating Token Interaction

The second kind is token-to-token interaction. In a standard decoder-only Transformer, the main place where different tokens communicate is self-attention. $Q$, $K$, and $V$ have already been prepared by token-internal transformations. The model then needs to decide two things: which previous positions each position should read from, and how the information it reads should be written back into the current position.

Each position can be viewed as carrying three related objects. $q_i$ is the read request generated by position $i$: it expresses what kind of signal the current position wants from context. $k_j$ is the match signal exposed by position $j$: it expresses what kinds of requests that position can match. $v_j$ is the content that position $j$ offers to other positions. Query and key do routing. Value provides the content being routed.

The score between position $i$ and position $j$ is

\[q_i^\top k_j.\]

Collecting all such scores gives the score matrix

\[QK^\top.\]

This matrix is different from the token-internal $HW$ multiplication. It compares rows against rows. Its $(i,j)$ entry says how strongly position $i$ is inclined to read from position $j$.

For example, with four positions, the score matrix has the form

\[QK^\top = \begin{pmatrix} q_1^\top k_1 & q_1^\top k_2 & q_1^\top k_3 & q_1^\top k_4 \\\\ q_2^\top k_1 & q_2^\top k_2 & q_2^\top k_3 & q_2^\top k_4 \\\\ q_3^\top k_1 & q_3^\top k_2 & q_3^\top k_3 & q_3^\top k_4 \\\\ q_4^\top k_1 & q_4^\top k_2 & q_4^\top k_3 & q_4^\top k_4 \end{pmatrix}.\]

Each row belongs to the position doing the reading. Each column belongs to a candidate position being read. In a decoder-only model, a causal mask prevents a position from reading future positions. For a sequence like “Alice dropped the glass because she ...”, when the model is processing “she”, that row may score previous positions such as “Alice”, “dropped”, “glass”, and “because”, as well as the current position. Future tokens are masked before softmax.

The score table can be computed with one matrix multiplication. The scores between positions have no recursive dependency on each other, and the causal mask handles which columns remain visible for each row. This shape explains why self-attention is parallelizable: the model can first compute all queries and keys, then use one large matrix multiplication to get all pairwise scores.
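The score table and the causal visibility pattern can be sketched as follows. Queries and keys are random stand-ins here; in a model they would come from $XW_Q$ and $XW_K$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q = rng.normal(size=(n, d_k))                    # stand-in queries, one row per position
K = rng.normal(size=(n, d_k))                    # stand-in keys, one row per position

scores = Q @ K.T                                 # shape (n, n): all pairs in one product
# Entry (i, j) is the dot product of query i with key j.
print(np.isclose(scores[2, 1], Q[2] @ K[1]))     # True

# Causal visibility: row i may read columns j <= i.
visible = np.tril(np.ones((n, n), dtype=bool))
masked = np.where(visible, scores, -np.inf)      # future entries become -inf
print(visible.astype(int))                       # lower-triangular pattern of ones
```

No entry of `scores` depends on any other entry, which is why the whole table comes out of a single matrix product.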

The scores are still only routing signals. The content is in $V$. The model applies scaling and masking:

\[S = \frac{QK^\top}{\sqrt{d_k}} + M,\]

then softmax row by row:

\[A = \mathrm{softmax}(S).\]

The $i$-th row of $A$ contains the weights position $i$ assigns to visible positions. Then

\[C = AV.\]

For a single row,

\[c_i = \sum_j a_{ij}v_j.\]

This is the actual content mixing step. The new vector $c_i$ for position $i$ is a weighted sum of value vectors from visible positions. If $a_{ij}$ is large, then value $v_j$ contributes more to the new content at position $i$.

This separation gives matrix multiplication a clearer role inside the Transformer. $HW$ transforms each token independently. $QK^\top$ and $AV$ create token-to-token interaction. Once the distinction is clear, the rest of the model is easier to place: linear layers rewrite states inside a position, while attention lets positions communicate.

Why Dot Products Become Scores

In the previous section, $q_i^\top k_j$ appeared as entry $(i,j)$ of the score matrix. It controls how strongly position $i$ tends to read from position $j$.

Alignment and Length

The plain dot product is the first step toward making that score feel natural:

\[q^\top k = \sum_{r=1}^{d_k} q_r k_r.\]

Each term $q_r k_r$ contributes positively when the two coordinates have the same sign, negatively when they have opposite signs, and little when one of them is near zero. The sum collects all these coordinate-level alignments into one scalar. That scalar can then be used as a logit before softmax.

A dot product also has a geometric form:

\[q^\top k = \lVert q\rVert \lVert k\rVert \cos\theta.\]

So it depends on both direction and length. If the two vectors point in similar directions, $\cos\theta$ is large. If their norms are large, the same directional alignment produces a larger score. The dot product keeps norm information in the score. Cosine similarity would normalize both vectors and keep only direction. In real models, norm meanings are shaped by normalization, training dynamics, and layer position, so they should not be assigned a fixed semantic interpretation. Mechanically, though, dot product gives the model both directional and magnitude degrees of freedom.
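The difference between a dot product and cosine similarity shows up as soon as one vector is rescaled. The vectors below are arbitrary toy values.

```python
import numpy as np

q = np.array([1.0, 2.0, 2.0])
k = np.array([2.0, 0.0, 1.0])

def cosine(a, b):
    # Cosine similarity: dot product after normalizing both lengths away.
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

dot = q @ k                                        # 1*2 + 2*0 + 2*1
# Doubling q doubles the dot product but leaves the cosine unchanged.
print(dot, (2 * q) @ k)                            # 4.0 8.0
print(np.isclose(cosine(q, k), cosine(2 * q, k)))  # True
```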

Learnable Matching

For attention, the more important point is that $q$ and $k$ are learned projections of hidden states:

\[q_i = W_Q^\top h_i,\quad k_j = W_K^\top h_j\]

if we use column-vector notation. Substituting these into the score gives

\[q_i^\top k_j = (W_Q^\top h_i)^\top(W_K^\top h_j) = h_i^\top W_QW_K^\top h_j.\]

Let

\[B = W_QW_K^\top.\]

Then the attention score is

\[h_i^\top B h_j.\]

This is a learned bilinear comparison between hidden states. Without the projections, a score like $h_i^\top h_j$ would compare hidden states using a fixed symmetric similarity rule. With learned query and key projections, the model learns what kind of matching matters for this head.

This also explains why query and key use different projections. At the same position $i$, both $q_i$ and $k_i$ exist, but they serve different roles. $q_i$ is used when position $i$ queries other positions. $k_i$ is used when other positions query position $i$. One expresses “what I am looking for now”. The other expresses “what requests can find me”. Separate projections let the model separate these roles.

There is also a precise matrix constraint behind this. If query and key used the same projection $W$, the comparison matrix would be

\[B = WW^\top.\]

This matrix is symmetric and positive semidefinite. It gives the model a restricted kind of matching rule. With two projections,

\[B = W_QW_K^\top,\]

which can represent a broader low-rank bilinear rule. In particular, $B$ does not have to be symmetric, so the score from one direction can differ from the score in the reverse direction:

\[h_i^\top B h_j \quad \text{can differ from} \quad h_j^\top B h_i.\]

This gives the model directional matching rules. It also weakens the tendency for a token to score itself strongly. If the model used $h_i^\top h_i$, the self-score would be a norm squared. With separate learned projections, the self-score becomes

\[q_i^\top k_i = h_i^\top W_QW_K^\top h_i,\]

and it can be shaped by the learned comparison rule.
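The bilinear view and the symmetry constraint can both be checked with random matrices. This is only a sketch with toy sizes; the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_k = 6, 2
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
h_i, h_j = rng.normal(size=d), rng.normal(size=d)

B = W_Q @ W_K.T                                  # (d, d) bilinear comparison matrix
q_i, k_j = W_Q.T @ h_i, W_K.T @ h_j
print(np.isclose(q_i @ k_j, h_i @ B @ h_j))      # True: same score, two views
print(np.linalg.matrix_rank(B) <= d_k)           # True: rank is capped by d_k
print(np.allclose(B, B.T))                       # False: the rule is directional

# Shared projection: B = W W^T is symmetric with non-negative eigenvalues.
B_shared = W_Q @ W_Q.T
print(np.allclose(B_shared, B_shared.T))             # True
print(np.linalg.eigvalsh(B_shared).min() > -1e-9)    # True, up to rounding
```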

Consider again “Alice dropped the glass because she ...”. When the model processes “she”, the current token’s own state matters, but it may need to read from “Alice” to resolve the reference. The query at “she” can express something like “I need a person entity that can serve as the referent”. The key at “Alice” can express “this position provides a person entity”. If those signals match, $q_{\text{she}}^\top k_{\text{Alice}}$ can be higher than $q_{\text{she}}^\top k_{\text{she}}$. The score is routing: the current position can place weight on the position that provides the needed information.

This is also the point where the phrase “attention finds similar tokens” becomes too weak. The score is a learned compatibility function between projected roles. Depending on the head, the useful relation may be syntactic, referential, positional, or something more abstract.

Keeping Softmax Scale Stable

There is one more scale issue. $q^\top k$ is a sum of $d_k$ products. If each coordinate has roughly similar scale, the sum tends to grow with dimension. A common simplified analysis assumes $q_r$ and $k_r$ have mean 0, variance about 1, and are approximately independent. Then

\[\mathrm{Var}(q^\top k) = \sum_{r=1}^{d_k}\mathrm{Var}(q_rk_r) \approx d_k.\]

So the standard deviation grows roughly like $\sqrt{d_k}$. Attention uses

\[\frac{q^\top k}{\sqrt{d_k}}\]

Dividing by $\sqrt{d_k}$ brings the standard deviation of the score back to roughly 1, regardless of the head dimension. If logits entering softmax are too large, softmax becomes too sharp too early: a few positions get weights near 1, many positions get weights near 0, and the gradients through the saturated softmax shrink toward zero, which destabilizes training. This derivation is a scale-level approximation. It is not an exact statement about all model activations, yet it explains why the scaling term is natural (Fig. 2).


Figure 2. $q^\top k$ is a sum of $d_k$ products, so its scale tends to grow like $\sqrt{d_k}$. When logits get too large, softmax collapses toward one-hot (a few ≈ 1, the rest ≈ 0) and gradients destabilize — so attention uses $q^\top k/\sqrt{d_k}$ to pull the scale back. The scale drifts gently, or drag the slider.
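The scale argument can be checked with a quick simulation. It uses exactly the simplified assumption from the text (independent standard-normal coordinates), so it says nothing about real activations; the sample count and the example logits are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 10_000
scaled_std = {}

for d_k in (16, 64, 256):
    # Assumption from the text: independent coordinates with mean 0 and variance 1.
    q = rng.normal(size=(samples, d_k))
    k = rng.normal(size=(samples, d_k))
    dots = np.sum(q * k, axis=1)
    scaled_std[d_k] = (dots / np.sqrt(d_k)).std()
    # Raw std grows roughly like sqrt(d_k); the scaled std stays near 1.
    print(d_k, round(dots.std(), 2), round(scaled_std[d_k], 2))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([1.0, 0.5, 0.0, -0.5])
# The same pattern at 8x the scale puts almost all weight on one entry.
print(softmax(logits).max(), softmax(8 * logits).max())
```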

After scaling and masking, attention logits can be written as

\[s_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}} + m_{ij}.\]

Here $m_{ij}$ handles visibility: it is $0$ for visible positions. In a decoder-only model, if $j$ is in the future of $i$, then $m_{ij} = -\infty$ (in practice, a very large negative number), so $\exp(s_{ij}) = 0$ and position $j$ receives zero weight. Softmax is applied row by row:

\[a_{ij} = \frac{\exp(s_{ij})}{\sum_t \exp(s_{it})}.\]

Now each row becomes a distribution over visible positions. The dot product produces matching scores; scaling stabilizes score magnitude; the mask enforces causality; softmax converts scores into read weights.

From Scores to State Updates

Mixing Values by Weight

Once attention has produced weights, the model still needs to write information back into each position. The operation that does this is $AV$.

Let the value matrix be

\[V = \begin{pmatrix} v_1^\top \\\\ v_2^\top \\\\ \vdots \\\\ v_n^\top \end{pmatrix}.\]

The attention weights are

\[A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\\\ a_{21} & a_{22} & \cdots & a_{2n} \\\\ \vdots & \vdots & \ddots & \vdots \\\\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}.\]

Then

\[C = AV.\]

The $i$-th output row is

\[c_i^\top = \sum_{j=1}^{n} a_{ij} v_j^\top.\]

Equivalently,

\[c_i = \sum_{j=1}^{n} a_{ij}v_j.\]

This formula makes the information flow explicit. The new vector at position $i$ is a weighted mixture of value vectors from visible positions. If $a_{ij}$ is larger, value vector $v_j$ contributes more to the content written into position $i$.

For a small three-dimensional example, suppose position $i$ can see three value vectors:

\[v_1 = \begin{pmatrix} 1 \\\\ 0 \\\\ 2 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 0 \\\\ 3 \\\\ 1 \end{pmatrix}, \quad v_3 = \begin{pmatrix} 2 \\\\ -1 \\\\ 0 \end{pmatrix}.\]

Suppose the softmax row for position $i$ is

\[a_i = \begin{pmatrix} 0.1 & 0.7 & 0.2 \end{pmatrix}.\]

Then the context vector written back to position $i$ is

\[c_i = 0.1v_1 + 0.7v_2 + 0.2v_3 = \begin{pmatrix} 0.5 \\\\ 1.9 \\\\ 0.9 \end{pmatrix}.\]

The example is only meant to show the linear combination. The second value vector has the largest weight, so it contributes most to $c_i$; the first and third value vectors still keep part of their influence. Attention is mixing information in vector space, not choosing a single token in a discrete way (Fig. 3).

Figure 3. The sentence “Alice dropped the glass because she …”. The chosen query (highlighted row) gives each visible position an attention weight $a_j=\mathrm{softmax}(q^\top k_j)$ (bars); the values then flow into the new state $c=\sum_j a_j v_j$ on the right, with heavier weights drawn thicker. Future positions are greyed by the causal mask. The query cycles on its own; hover or click any row to pin it. (Weights are illustrative.)

The causal mask only affects which weights can be nonzero. After softmax, future positions have weight 0, so they contribute nothing to $c_i$. The matrix multiplication $AV$ is still the same weighted-sum operation.
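The worked example above is a single vector-matrix product, which the sketch below reproduces. The second weight row, with a zero in the last slot, is a made-up illustration of a masked position.

```python
import numpy as np

V = np.array([[1.0, 0.0, 2.0],     # v_1
              [0.0, 3.0, 1.0],     # v_2
              [2.0, -1.0, 0.0]])   # v_3
a_i = np.array([0.1, 0.7, 0.2])    # one softmax row; weights sum to 1

c_i = a_i @ V                      # same as 0.1*v_1 + 0.7*v_2 + 0.2*v_3
print(np.allclose(c_i, [0.5, 1.9, 0.9]))   # True

# A masked position has weight 0, so v_3 contributes nothing here.
a_masked = np.array([0.125, 0.875, 0.0])   # hypothetical weights after masking
c_masked = a_masked @ V
print(np.allclose(c_masked, 0.125 * V[0] + 0.875 * V[1]))  # True
```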

This is the point where token interaction becomes content. $QK^\top$ decides where to read. Softmax turns reading preference into weights. $AV$ mixes the content being read. Then $W_O$ and the residual connection write the result back into the residual stream.

Combining Heads Before Writeback

Real attention usually uses multiple heads. The cleanest way to view the heads is through their learned projections. For head $r$,

\[Q^{(r)} = XW_Q^{(r)},\quad K^{(r)} = XW_K^{(r)},\quad V^{(r)} = XW_V^{(r)}.\]

If the model width is $d$ and there are $h$ heads (here $h$ counts heads and is not a hidden state vector), a common choice is

\[d_{\text{head}} = \frac{d}{h}.\]

Then each head produces

\[C^{(r)} \in \mathbb{R}^{n \times d_{\text{head}}}.\]

The sequence length $n$ is unchanged. Each head returns one vector per token, but the vector width is $d_{\text{head}}$. After all heads run, their outputs are concatenated along the feature dimension:

\[C_{\text{cat}} = [C^{(1)}\; C^{(2)}\; \cdots \; C^{(h)}] \in \mathbb{R}^{n \times (h d_{\text{head}})}.\]

With $d_{\text{head}}=d/h$, this becomes

\[C_{\text{cat}} \in \mathbb{R}^{n \times d}.\]

So concatenation keeps the number of tokens the same and stacks head outputs in the feature width. At token position $i$, the row becomes a longer vector containing the output from all heads at that same position.

Then comes the output projection:

\[U_{\mathrm{attn}} = C_{\text{cat}} W_O.\]

If $C_{\text{cat}} \in \mathbb{R}^{n \times d}$, then typically

\[W_O \in \mathbb{R}^{d \times d}.\]

The output projection mixes the head results inside each token position. Token-to-token communication has already happened inside each head through $QK^\top$ and $AV$; $W_O$ decides how those separate reads should be combined before the update returns to the residual stream.

A small example makes this clearer. Suppose a token has two heads, each producing two features:

\[c_{\text{cat}} = \begin{pmatrix} a \\\\ b \\\\ x \\\\ y \end{pmatrix}.\]

Here $(a,b)$ comes from head 1 and $(x,y)$ comes from head 2. Now multiply by a $4\times4$ output projection. The first output feature might be

\[o_1 = 0.5a + 0.2b - 0.1x + 0.7y.\]

This one output coordinate can use features from both heads. Other output coordinates use other learned combinations. That is how $W_O$ integrates head outputs into a single residual-width vector for the same token.
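Concatenation followed by $W_O$ can be sketched with random stand-ins for the per-head outputs. The names `n_heads` and `head_outputs` are illustrative, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_heads = 3, 4, 2
d_head = d // n_heads

# Stand-ins for the per-head outputs C^(r); in a model these come from A^(r) V^(r).
head_outputs = [rng.normal(size=(n, d_head)) for _ in range(n_heads)]
C_cat = np.concatenate(head_outputs, axis=1)     # (n, d): heads side by side per token
W_O = rng.normal(size=(d, d))
U_attn = C_cat @ W_O                             # (n, d): mixes head features per token

# Equivalent view: each head owns a (d_head, d) slice of W_O and the results add up.
U_sum = sum(head_outputs[r] @ W_O[r * d_head:(r + 1) * d_head] for r in range(n_heads))
print(C_cat.shape, U_attn.shape)                 # (3, 4) (3, 4)
print(np.allclose(U_attn, U_sum))                # True
```

The second view is sometimes easier to reason about: every head writes its own additive contribution into the residual-width update.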

After the output projection, the attention update is written back through the residual connection:

\[H' = H + U_{\mathrm{attn}}.\]

The context read by attention now becomes part of each token’s state. Later layers can process it further.

How Linear Layers Rewrite Tokens

After attention has written context into each position, the Transformer still repeatedly uses ordinary linear layers to process each token state. This is the main job of the MLP.

Take a single token state

\[h \in \mathbb{R}^{d}.\]

A linear layer maps it to

\[y = W^\top h\]

in column-vector notation, or $h^\top W$ in row-vector notation. If

\[W = \begin{pmatrix} w_1 & w_2 & \cdots & w_m \end{pmatrix},\]

where each $w_j \in \mathbb{R}^d$ is a column, then

\[y_j = w_j^\top h.\]

So each output feature is a readout of the input state along one learned direction. A dense matrix contains many such read directions. It can combine old coordinates into new features.

If $W$ were the identity matrix, the output would equal the input. If $W$ were diagonal, each coordinate would only be scaled independently. A dense matrix allows coordinates to mix, so it can rewrite the representation into a new coordinate system. That coordinate change is not necessarily a pure rotation. A neural network weight can also stretch some directions, compress others, and change the dimension.

In the MLP, the common pattern is expansion, nonlinearity, and projection back:

\[\mathrm{MLP}(h) = W_2^\top \sigma(W_1^\top h)\]

or, in row-matrix form,

\[\mathrm{MLP}(H) = \sigma(HW_1)W_2.\]

The first matrix reads the token state into a wider feature space. The nonlinearity or gate changes which features are active. The second matrix writes the result back to model width. This happens independently at every token position. The MLP does not read new information from other tokens. It processes the information that attention and the residual stream have already placed inside the current token state.
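The claim that the MLP works independently at every position can be tested directly. In this sketch ReLU is an assumed stand-in for $\sigma$, and the width `d_ff` is a toy value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 4, 6, 24                 # d_ff = 4 * d is a common but assumed choice
W_1 = rng.normal(size=(d, d_ff))
W_2 = rng.normal(size=(d_ff, d))

def mlp(H):
    # ReLU stands in for sigma; real models often use GELU or a gated variant.
    return np.maximum(H @ W_1, 0.0) @ W_2

H = rng.normal(size=(n, d))
out = mlp(H)

# Applying the MLP to one row alone gives the same result as the batched call.
print(np.allclose(out[1], mlp(H[1:2])[0]))       # True
# Reordering positions just reorders the outputs: no cross-token mixing.
perm = np.array([2, 0, 3, 1])
print(np.allclose(mlp(H[perm]), out[perm]))      # True
```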

The vocabulary projection $W_{\mathrm{out}}$ at the end of the model (not to be confused with the attention output projection $W_O$) is also a linear readout. If

\[h_t^{(L)} \in \mathbb{R}^d\]

is the final state for position $t$, then

\[z^\top = (h_t^{(L)})^\top W_{\mathrm{out}}\]

produces logits over the vocabulary. The $j$-th logit is

\[z_j = w_j^\top h_t^{(L)},\]

where $w_j$, the $j$-th column of $W_{\mathrm{out}}$, is the output direction for token $j$. This is the final linear read from state space into token space.
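The final readout is one more dot product per vocabulary entry. The sketch below uses a toy vocabulary of five tokens and random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab_size = 8, 5                            # toy sizes
h_final = rng.normal(size=d)                    # final state at position t
W_out = rng.normal(size=(d, vocab_size))        # one column per vocabulary token

z = h_final @ W_out                             # all logits in one product
print(np.isclose(z[3], W_out[:, 3] @ h_final))  # True: logit 3 reads along column 3

p = np.exp(z - z.max())
p /= p.sum()                                    # softmax over the vocabulary
print(p.argmax() == z.argmax())                 # True: softmax preserves the ranking
```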

Matrix multiplication inside a token has several roles. It can prepare query, key, and value representations. It can expand and transform features inside the MLP. It can combine multi-head outputs. It can read the final state into vocabulary logits. The shared mathematical pattern is that a matrix contains many learned directions for reading and writing features.

Outer Products and Low-Rank Structure

So far, we have mostly looked at matrix multiplication through output coordinates: each column of $W$ is a read direction, and $h^\top W$ reads a token state into new features. There is another useful way to view the same operation. Matrix multiplication can be written as a sum of outer products. This view connects matrix multiplication, rank, SVD, compression, and LoRA.

For two matrices

\[A \in \mathbb{R}^{m \times p},\quad B \in \mathbb{R}^{p \times n},\]

the usual entry-wise formula is

\[(AB)_{ij} = \sum_{k=1}^{p} A_{ik}B_{kj}.\]

This is the row-column dot product view. If we look at the whole output matrix instead, we can group the multiplication by the shared index $k$:

\[AB = \sum_{k=1}^{p} A_{:,k} B_{k,:}.\]

Here $A_{:,k}$ is the $k$-th column of $A$, and $B_{k,:}$ is the $k$-th row of $B$. Each term

\[A_{:,k}B_{k,:}\]

is an outer product: a column vector times a row vector. It produces a rank-1 matrix.

For example, let

\[A = \begin{pmatrix} 1 & 2 \\\\ 3 & 4 \end{pmatrix}, \quad B = \begin{pmatrix} 5 & 6 \\\\ 7 & 8 \end{pmatrix}.\]

Then

\[AB = \begin{pmatrix} 1 \\\\ 3 \end{pmatrix} \begin{pmatrix} 5 & 6 \end{pmatrix} + \begin{pmatrix} 2 \\\\ 4 \end{pmatrix} \begin{pmatrix} 7 & 8 \end{pmatrix}.\]

The first outer product is

\[\begin{pmatrix} 1 \\\\ 3 \end{pmatrix} \begin{pmatrix} 5 & 6 \end{pmatrix} = \begin{pmatrix} 5 & 6 \\\\ 15 & 18 \end{pmatrix}.\]

The second is

\[\begin{pmatrix} 2 \\\\ 4 \end{pmatrix} \begin{pmatrix} 7 & 8 \end{pmatrix} = \begin{pmatrix} 14 & 16 \\\\ 28 & 32 \end{pmatrix}.\]

Adding them gives

\[AB = \begin{pmatrix} 19 & 22 \\\\ 43 & 50 \end{pmatrix}.\]

This is the same matrix multiplication, but the outer-product view reveals something structural. A product $AB$ is a sum of simple rank-1 contributions. Each contribution pairs one column of $A$ with one row of $B$ at a single value of the shared index $k$, and writes a whole rank-1 pattern into the output matrix.
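As a quick numerical check of the outer-product view, the following NumPy sketch rebuilds $AB$ from the two rank-1 terms of the worked example. The variable names (`terms`, `outer_sum`) are illustrative choices, not part of any library.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

# One rank-1 term per shared index k: (column k of A) times (row k of B).
terms = [np.outer(A[:, k], B[k, :]) for k in range(A.shape[1])]
outer_sum = sum(terms)

print(terms[0])   # first outer product
print(terms[1])   # second outer product
print(outer_sum)  # same matrix as A @ B
```

Summing over $k$ is the only step that couples the terms, which is why each term can be inspected on its own.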

This is the reason rank comes next. The previous section asked how matrices read new features. This view asks how many independent contributions the matrix actually uses.

Rank Counts Independent Directions

Rank answers a simple structural question: how many independent output directions does this matrix actually use?

A $3\times3$ matrix could in principle output any vector in $\mathbb{R}^3$. But if its rank is 2, all of its outputs lie in a two-dimensional plane. Some output direction is unreachable, no matter what input we provide.

Consider

\[M = \begin{pmatrix} 1 & 2 & 3 \\\\ 4 & 5 & 6 \\\\ 5 & 7 & 9 \end{pmatrix}.\]

The third row is the sum of the first two rows:

\[(5,7,9) = (1,2,3) + (4,5,6).\]

So the rows are not independent. The matrix cannot have rank 3. In fact, its rank is 2.

Looking at columns gives the same kind of story. The output of $Mx$ can be written as a combination of columns:

\[Mx = x_1 c_1 + x_2 c_2 + x_3 c_3,\]

where $c_1,c_2,c_3$ are the columns of $M$. If one column is a combination of the others, it does not add a new independent output direction. The rank counts the dimension of the space spanned by the columns, which is the set of all possible outputs.
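The rank of the example matrix can be checked numerically. This is a small sketch using `np.linalg.matrix_rank`, which estimates rank from singular values with a tolerance; the specific column relation in the comment was worked out by hand for this $M$ only.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6],
              [5, 7, 9]])

rank = np.linalg.matrix_rank(M)   # numerical rank, computed from singular values
print(rank)

# The columns are dependent as well: c1 - 2*c2 + c3 = 0 for this particular M.
c1, c2, c3 = M[:, 0], M[:, 1], M[:, 2]
print(c1 - 2 * c2 + c3)
```

Row dependence and column dependence give the same count, which is why the text can switch between the two views.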

This matters for LLMs because a matrix shape can be large while its effective action is concentrated in fewer directions. Shape tells us how many rows, columns, and parameters the matrix has. Rank tells us how many independent directions those parameters actually support. An attention head’s bilinear comparison matrix $W_QW_K^\top$ is low-rank because its rank is limited by the head dimension. A LoRA update is low-rank because it is explicitly written as a product of two skinny matrices.
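To make the head-dimension bound concrete, here is a sketch with random stand-ins for $W_Q$ and $W_K$. The sizes `d_model = 64` and `d_head = 8` are assumptions chosen for illustration and do not come from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 8          # illustrative sizes, not from a real model

W_Q = rng.standard_normal((d_model, d_head))
W_K = rng.standard_normal((d_model, d_head))

# Bilinear comparison matrix: shape (d_model, d_model), rank at most d_head.
W_QK = W_Q @ W_K.T
print(W_QK.shape, np.linalg.matrix_rank(W_QK))
```

The matrix has $64 \times 64$ entries, but it is built from only eight outer products, so its rank cannot exceed eight.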

Rank also has a limitation. Exact rank counts every nonzero direction, even if the direction is extremely weak. Real model weights are rarely exactly low-rank; more often they have a few strong directions and many weak ones. To describe that strength ordering, we need SVD.

SVD Orders Directions by Strength

Rank tells us how many independent directions exist. SVD tells us how strong they are. It decomposes a matrix into simple read-write channels ordered by strength.

Read, Scale, Write

Start with one channel. Choose an input direction $v$, an output direction $u$, and a nonnegative strength $\sigma$. The matrix

\[M = \sigma u v^\top\]

acts on an input $x$ as

\[Mx = \sigma u(v^\top x).\]

Read this from right to left. First, $v^\top x$ reads how much input $x$ lies along direction $v$. This produces a scalar. Multiplying by $\sigma$ scales that scalar. Multiplying by $u$ writes the result into the output direction $u$.

So $uv^\top$ has a concrete meaning: $v^\top$ reads, and $u$ writes. The outer product combines “read along $v$” and “write along $u$” into one matrix channel. SVD decomposes a matrix into many such channels:

\[A = \sigma_1 u_1 v_1^\top + \sigma_2 u_2 v_2^\top + \cdots + \sigma_r u_r v_r^\top.\]

This is the expanded form of

\[A = U\Sigma V^\top.\]

$V$ collects input directions, $\Sigma$ collects strengths, and $U$ collects output directions.

Reconstructing by Summing Channels

Use a small matrix:

\[A = \begin{pmatrix} 3 & 3 \\\\ 1 & -1 \end{pmatrix}.\]

For input

\[x = \begin{pmatrix} x_1 \\\\ x_2 \end{pmatrix},\]

we get

\[Ax = \begin{pmatrix} 3x_1 + 3x_2 \\\\ x_1 - x_2 \end{pmatrix}.\]

The first row reads the sum of the two coordinates and scales it by 3. The second row reads their difference. The SVD decomposes this matrix into two channels:

\[v_1 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\\\ 1 \end{pmatrix}, \quad u_1 = \begin{pmatrix} 1 \\\\ 0 \end{pmatrix}, \quad \sigma_1 = 3\sqrt{2},\]

and

\[v_2 = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\\\ -1 \end{pmatrix}, \quad u_2 = \begin{pmatrix} 0 \\\\ 1 \end{pmatrix}, \quad \sigma_2 = \sqrt{2}.\]

The first channel reads the sum direction $v_1$, writes to the first output coordinate, and has strength $3\sqrt{2}$. The second reads the difference direction $v_2$, writes to the second output coordinate, and has strength $\sqrt{2}$.

Writing each channel as a matrix:

\[\sigma_1 u_1 v_1^\top = 3\sqrt{2} \begin{pmatrix} 1 \\\\ 0 \end{pmatrix} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \end{pmatrix} = \begin{pmatrix} 3 & 3 \\\\ 0 & 0 \end{pmatrix}.\]

The second channel is

\[\sigma_2 u_2 v_2^\top = \sqrt{2} \begin{pmatrix} 0 \\\\ 1 \end{pmatrix} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & -1 \end{pmatrix} = \begin{pmatrix} 0 & 0 \\\\ 1 & -1 \end{pmatrix}.\]

Adding them gives the original matrix:

\[\begin{pmatrix} 3 & 3 \\\\ 0 & 0 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\\\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 3 & 3 \\\\ 1 & -1 \end{pmatrix}.\]

Each $\sigma_i u_i v_i^\top$ is a read-write channel that can be executed on its own, and the full matrix is the sum of those channel contributions.

To see those channels act on an input, take

\[x = \begin{pmatrix} 2 \\\\ 1 \end{pmatrix}.\]

Direct multiplication gives

\[Ax = \begin{pmatrix} 9 \\\\ 1 \end{pmatrix}.\]

The first channel reads

\[v_1^\top x = \frac{2+1}{\sqrt{2}} = \frac{3}{\sqrt{2}},\]

then scales and writes

\[3\sqrt{2}u_1(v_1^\top x) = \begin{pmatrix} 9 \\\\ 0 \end{pmatrix}.\]

The second channel reads

\[v_2^\top x = \frac{2-1}{\sqrt{2}} = \frac{1}{\sqrt{2}},\]

then scales and writes

\[\sqrt{2}u_2(v_2^\top x) = \begin{pmatrix} 0 \\\\ 1 \end{pmatrix}.\]

Adding the two channel outputs gives

\[\begin{pmatrix} 9 \\\\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\\\ 1 \end{pmatrix} = \begin{pmatrix} 9 \\\\ 1 \end{pmatrix}.\]

This is the same as direct multiplication by $A$. In LLM terms, a linear-layer weight can be understood in the same way. It reads certain directions from the hidden state and writes those responses into new feature directions. Large singular values correspond to strong channels. Small singular values correspond to weak channels.
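The channel decomposition above can be reproduced with `np.linalg.svd`. This is a minimal sketch; note that a numerical solver may flip the signs of $u_i$ and $v_i$ together, so the code compares the products $\sigma_i u_i v_i^\top$ rather than the individual vectors.

```python
import numpy as np

A = np.array([[3.0, 3.0], [1.0, -1.0]])
x = np.array([2.0, 1.0])

# np.linalg.svd returns U, the singular values, and V^T (rows are the v_i).
U, S, Vt = np.linalg.svd(A)

# Channel i as a matrix: sigma_i * u_i v_i^T. A joint sign flip of u_i and
# v_i by the solver leaves this product unchanged.
channels = [S[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(S))]

# Channel i acting on x: read v_i^T x, scale by sigma_i, write along u_i.
outputs = [S[i] * U[:, i] * (Vt[i, :] @ x) for i in range(len(S))]

print(S)                 # singular values, largest first
print(sum(channels))     # reconstructs A
print(sum(outputs))      # equals A @ x
```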

Why Low-Rank Approximation Works

SVD decomposes a matrix into read-write channels ordered by strength. For the matrix above,

\[\begin{pmatrix} 3 & 3 \\\\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 3 & 3 \\\\ 0 & 0 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\\\ 1 & -1 \end{pmatrix}.\]

The first channel has strength $3\sqrt{2}$. The second has strength $\sqrt{2}$. Both are part of the original matrix, but their strengths differ.

A low-rank approximation keeps the strongest channels and discards weaker ones. Keeping only the first channel gives

\[A_1 = \begin{pmatrix} 3 & 3 \\\\ 0 & 0 \end{pmatrix}.\]

This is rank 1 because its output always lies along one direction: the first output coordinate. It still accepts a two-dimensional input, but it can write only one independent output direction.

For the same input

\[x = \begin{pmatrix} 2 \\\\ 1 \end{pmatrix},\]

the full matrix gives

\[Ax = \begin{pmatrix} 9 \\\\ 1 \end{pmatrix},\]

while the rank-1 approximation gives

\[A_1x = \begin{pmatrix} 9 \\\\ 0 \end{pmatrix}.\]

The second channel has been dropped, so the second output coordinate disappears. This is the approximation error. The approximation still preserves the main action for this input because the dominant contribution came from the first channel. If a task depends heavily on the dropped channel, the approximation will lose important information.

In general, SVD gives

\[A = \sum_{i=1}^{r}\sigma_i u_i v_i^\top,\]

with singular values ordered from large to small. Keeping the first $k$ channels gives

\[A_k = \sum_{i=1}^{k}\sigma_i u_i v_i^\top.\]

The rank of $A_k$ is at most $k$. Its output can only be built from $u_1,\ldots,u_k$. A low-rank matrix can still have a large shape, but it writes into fewer independent output directions.
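A truncation like $A_k$ is a few lines of NumPy. The helper name `truncate_svd` is hypothetical. The last lines also check a standard fact that the text does not state explicitly: measured in the spectral norm, the error of the rank-$k$ truncation equals the first dropped singular value.

```python
import numpy as np

def truncate_svd(A, k):
    """Return the rank-k approximation built from the k strongest channels."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

A = np.array([[3.0, 3.0], [1.0, -1.0]])
x = np.array([2.0, 1.0])

A1 = truncate_svd(A, 1)
print(A1)                        # close to [[3, 3], [0, 0]]
print(A @ x, A1 @ x)             # full output vs. rank-1 output

# Spectral-norm error equals the first dropped singular value, sqrt(2) here.
err = np.linalg.norm(A - A1, 2)
print(err)
```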

This connects directly to compression. Suppose a weight matrix maps $d_{\mathrm{in}}$ input dimensions to $d_{\mathrm{out}}$ output dimensions:

\[A \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}.\]

Here we are using column-vector notation $y=Ax$. If we use the row-vector convention from $HW$, the shapes are transposed, but the decomposition idea is the same.

The full matrix stores

\[d_{\mathrm{out}}d_{\mathrm{in}}\]

parameters. A rank-$k$ SVD approximation can be written as

\[A_k = U_k\Sigma_kV_k^\top,\]

where

\[U_k \in \mathbb{R}^{d_{\mathrm{out}} \times k},\quad \Sigma_k \in \mathbb{R}^{k \times k},\quad V_k \in \mathbb{R}^{d_{\mathrm{in}} \times k}.\]

The parameter count becomes roughly

\[k(d_{\mathrm{out}}+d_{\mathrm{in}}+1).\]

Computation can also be read as channels: $V_k^\top$ reads the input into $k$ coefficients, $\Sigma_k$ scales them, and $U_k$ writes them back to output space. A large linear map has been rewritten as “read a small number of channels, then write them back”.
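The read, scale, write pipeline can be sketched directly on the three factors without ever forming the dense $A_k$. The sizes below are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 32, 48, 4       # illustrative sizes

A = rng.standard_normal((d_out, d_in))
U, S, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

x = rng.standard_normal(d_in)

coeffs = Vt_k @ x                # read: k coefficients
scaled = S_k * coeffs            # scale: one strength per channel
y_k = U_k @ scaled               # write: back to d_out dimensions

A_k = U_k @ np.diag(S_k) @ Vt_k  # the same map as one dense matrix
full_params = d_out * d_in
factored_params = k * (d_out + d_in + 1)
print(full_params, factored_params)
```

A random Gaussian matrix like this one has no strong low-rank structure, so the example only illustrates shapes and cost, not approximation quality.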

In LLMs, this applies naturally to large linear layers such as attention projections, MLP projections, and output projections. Their shapes are large, but if their main action is concentrated in a small number of strong channels, a low-rank approximation can imitate much of their behavior with fewer parameters.

This still has to be verified. A weak channel may matter for rare inputs or important predictions. Low-rank approximation provides a useful structural hypothesis: a large matrix may be approximated by a smaller number of channels. Actual model compression still needs evaluation on perplexity, downstream tasks, and often recovery fine-tuning.

This line leads naturally to LoRA. SVD compression approximates an existing weight with a low-rank form. LoRA keeps the original weight and trains only a low-rank update. The next section looks at why that update can also be understood as a small number of new read-write channels.

Why LoRA Uses Low-Rank Updates

SVD compression starts with a complete matrix and tries to approximate it using fewer channels. LoRA deals with a different situation during fine-tuning. The pretrained weights already carry a large amount of capability. A new task often needs only an update next to the original linear layer.

For one linear layer, using column-vector notation:

\[y = Wx.\]

Full fine-tuning directly modifies the whole weight. After training, the layer can be written as

\[y = (W+\Delta W)x = Wx + \Delta W x.\]

Here $W$ is the original pretrained weight, and $\Delta W$ is the learned change from fine-tuning. If

\[W \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}},\]

then $\Delta W$ has the same shape. For large matrices in an LLM, this means training and storing a full-size update.

LoRA constrains the update to have low-rank form:

\[\Delta W = BA,\]

where

\[A \in \mathbb{R}^{r \times d_{\mathrm{in}}},\quad B \in \mathbb{R}^{d_{\mathrm{out}} \times r},\quad r \ll \min(d_{\mathrm{in}},d_{\mathrm{out}}).\]

The forward pass becomes

\[y = Wx + \frac{\alpha}{r}B(Ax).\]

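$W$ is frozen. Training updates only $A$ and $B$. The factor $\alpha/r$ is a fixed scalar hyperparameter that controls the strength of the LoRA path, so the effective update is $\frac{\alpha}{r}BA$. The discussion below drops this scalar wherever it does not change the argument.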

This formula is the same read-write channel idea. First, $A$ reads $r$ numbers from input $x$:

\[z = Ax \in \mathbb{R}^{r}.\]

Then $B$ writes those $r$ numbers back to output space:

\[\delta y = Bz \in \mathbb{R}^{d_{\mathrm{out}}}.\]

So the LoRA path is

\[x \xrightarrow{\ A\ } z \xrightarrow{\ B\ } \delta y.\]

The original layer gives $Wx$. The LoRA path gives an extra update $\delta y$. The two are added (Fig. 4).

Figure 4. Beside the frozen main path $Wx$, LoRA learns one low-rank detour: $x \to A$ (squeeze to $r$ dimensions) $\to B$ (write back). The height of the middle bottleneck is the rank $r$, so $\mathrm{rank}(\Delta W)\le r$. The bottleneck and the parameter counts of the full $\Delta W$ and of LoRA vary with $r$ (shown for $d=4096$, $r=8$).

For a concrete shape, suppose the input is three-dimensional, the output is four-dimensional, and the LoRA rank is $r=2$:

\[x \in \mathbb{R}^{3},\quad A \in \mathbb{R}^{2\times3},\quad B \in \mathbb{R}^{4\times2}.\]

Write the two rows of $A$ as read directions:

\[A = \begin{pmatrix} a_1^\top \\\\ a_2^\top \end{pmatrix}.\]

Then

\[Ax = \begin{pmatrix} a_1^\top x \\\\ a_2^\top x \end{pmatrix}.\]

This step performs two reads. The first row reads the response along $a_1$. The second reads the response along $a_2$.

Write the two columns of $B$ as output directions:

\[B = \begin{pmatrix} b_1 & b_2 \end{pmatrix}.\]

Then

\[BAx = B \begin{pmatrix} a_1^\top x \\\\ a_2^\top x \end{pmatrix} = b_1(a_1^\top x)+b_2(a_2^\top x).\]

This is two new read-write channels. The first reads with $a_1^\top$ and writes along $b_1$. The second reads with $a_2^\top$ and writes along $b_2$. A rank-2 LoRA update can provide at most two independent channels of this kind.

As a matrix,

\[BA = b_1a_1^\top + b_2a_2^\top.\]

Each $b_ia_i^\top$ is a rank-1 outer product. A rank-$r$ LoRA update is the sum of $r$ such terms, so

\[\mathrm{rank}(BA) \le r.\]

This explains both the constraint and the efficiency. LoRA compresses the freedom of $\Delta W$ into a small number of read-write channels. If the fine-tuning task mainly needs to adjust a small number of directions, the constraint works well. If the task requires many independent changes, too small a rank will limit expressiveness.
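The $3 \to 2 \to 4$ example can be checked numerically. This is a sketch with random factors used only for illustration; in actual LoRA training the factors are learned, and the scalar $\alpha/r$ is omitted here as in the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 3, 4, 2

A = rng.standard_normal((r, d_in))    # rows a_1^T, a_2^T are read directions
B = rng.standard_normal((d_out, r))   # columns b_1, b_2 are write directions
x = rng.standard_normal(d_in)

z = A @ x                             # two reads
delta_y = B @ z                       # written back to output space

# The same update as a sum of two rank-1 channels b_i a_i^T.
delta_W = sum(np.outer(B[:, i], A[i, :]) for i in range(r))

print(delta_W.shape, np.linalg.matrix_rank(delta_W))
```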

The parameter count follows directly. A full update needs

\[d_{\mathrm{out}}d_{\mathrm{in}}\]

parameters. LoRA trains $A$ and $B$, for

\[r d_{\mathrm{in}} + d_{\mathrm{out}}r = r(d_{\mathrm{in}}+d_{\mathrm{out}})\]

parameters. If $d_{\mathrm{in}}=d_{\mathrm{out}}=4096$ and $r=8$, a full update has about 16.8 million parameters, while LoRA has

\[8(4096+4096)=65536\]

parameters.
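The same arithmetic as a small sketch, with two hypothetical helper functions that simply encode the two formulas above:

```python
def full_update_params(d_in: int, d_out: int) -> int:
    """Parameters in a dense update of shape (d_out, d_in)."""
    return d_in * d_out

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two LoRA factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

d_in = d_out = 4096
r = 8
full = full_update_params(d_in, d_out)
lora = lora_params(d_in, d_out, r)
print(full, lora, full / lora)   # 16777216 65536 256.0
```

For this layer shape, the LoRA factors hold 256 times fewer trainable parameters than a full update.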

The rank constraint applies only to the update $\Delta W$. The actual weight used after adaptation is

\[W + \frac{\alpha}{r}BA.\]

The original $W$ still carries the high-dimensional behavior learned during pretraining. LoRA adds a low-rank change on top. The assumption is that, for a new task, the additional change can often be expressed through a small number of channels.

This also distinguishes LoRA from SVD compression. SVD compression starts from a full weight $W$ and tries to imitate $W$ itself with fewer channels. LoRA starts from a frozen $W$ and trains an extra $\Delta W$. One compresses existing capability. The other adds a lightweight adaptation path next to existing capability.

After training, the LoRA update can be merged back into the original matrix:

\[W_{\mathrm{merged}} = W + \frac{\alpha}{r}BA.\]

After merging, inference still uses an ordinary linear layer. LoRA decomposes the update into two small matrices during training, then folds the learned update back into the original weight for deployment.
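A short sketch can confirm that the merged layer and the two-path layer compute the same output. All sizes and the value of `alpha` are illustrative assumptions, and the random matrices stand in for trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 12, 4, 8.0   # illustrative values

W = rng.standard_normal((d_out, d_in))   # stand-in for the frozen weight
A = rng.standard_normal((r, d_in))       # stand-ins for trained LoRA factors
B = rng.standard_normal((d_out, r))
x = rng.standard_normal(d_in)

# Training-time path: frozen layer plus the scaled low-rank detour.
y_lora = W @ x + (alpha / r) * (B @ (A @ x))

# Deployment: fold the update into one ordinary dense weight.
W_merged = W + (alpha / r) * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_lora, y_merged))
```

Keeping the factors separate during training saves memory, while merging removes the extra matrix products at inference time.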

In Transformers, LoRA is commonly applied to attention projections such as $W_Q,W_K,W_V,W_O$, and sometimes to MLP projections. These are linear layers, so LoRA can add a small number of trainable read-write channels to each. Updating $W_Q$ changes how hidden states become queries. Updating $W_V$ changes the content that can be read. Updating $W_O$ changes how head outputs write back into the residual stream.

LoRA combines several linear algebra ideas from this post: matrix multiplication as a linear transformation, outer products as rank-1 read-write channels, rank as the number of independent channels, and SVD as a way to see that a few strong channels can matter a lot. It puts that structure into training by freezing most weights and learning only a small low-rank update.

Norms, Normalization, and Stability

So far, the discussion has focused mostly on directions: which directions a hidden state responds to, which directions a matrix reads from, and which directions it writes to. Real Transformers also depend on scale. Vectors have lengths. Matrices have stretch factors. Logits have numerical scale. Gradient updates have step size. If scale drifts too far, the same direction can lead to very different computation.

Scale Changes Scores

For a hidden state

\[h = \begin{pmatrix} h_1 \\\\ h_2 \\\\ \vdots \\\\ h_d \end{pmatrix},\]

the common $L_2$ norm is

\[\lVert h\rVert_2 = \sqrt{h_1^2+h_2^2+\cdots+h_d^2}.\]

Length affects many computations inside an LLM. For attention,

\[q^\top k = \lVert q\rVert \lVert k\rVert \cos\theta.\]

Here $\theta$ is the angle between $q$ and $k$, so the score depends on both direction and length. Output logits have the same issue:

\[z_j = h^\top w_j.\]

If the final state $h$ or the output direction $w_j$ has larger norm, the logit can be larger even when the direction is the same. Larger logit differences make softmax sharper. Smaller logit differences make it flatter.

For example,

\[h = \begin{pmatrix} 3 \\\\ 4 \\\\ 0 \end{pmatrix},\quad c = \begin{pmatrix} 6 \\\\ 8 \\\\ 0 \end{pmatrix}.\]

These two vectors point in the same direction, but their lengths are 5 and 10. If

\[w = \begin{pmatrix} 0 \\\\ 1 \\\\ 0 \end{pmatrix},\]

then

\[w^\top h = 4,\quad w^\top c = 8.\]

The direction stayed the same. The readout doubled. Scale matters.
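The effect on softmax can be seen in a small sketch. The three output directions in `W_out` are hypothetical and chosen only so that the logits are easy to read.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

h = np.array([3.0, 4.0, 0.0])
c = 2.0 * h                      # same direction, twice the length

# Three hypothetical output directions w_j stored as rows.
W_out = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])

logits_h = W_out @ h             # [4, 3, 0]
logits_c = W_out @ c             # [8, 6, 0]

p_h = softmax(logits_h)
p_c = softmax(logits_c)
print(p_h.max(), p_c.max())      # the longer state gives a sharper distribution
```

Doubling the state doubles every logit gap, so the top probability moves closer to 1 even though no direction changed.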

LayerNorm and RMSNorm enter at this scale-control layer. The “Norm” in their names stands for normalization: the layer estimates the scale of a vector or activation group, then uses that estimate to rescale the input.

Stabilizing Token States

During training, the transformer state usually has shape

\[H \in \mathbb{R}^{B\times n\times d},\]

where $B$ is batch size, $n$ is sequence length, and $d$ is hidden width. A single token state is

\[h_{b,i} \in \mathbb{R}^d.\]

For normalization, the crucial choice is the dimension over which the statistics are computed.

BatchNorm typically computes statistics for the same feature across a batch. If the relevant samples are indexed by $s=1,\ldots,m$, then for feature $j$ it computes

\[\mu_j = \frac{1}{m}\sum_{s=1}^{m}x_{s,j},\]

and

\[\sigma_j^2 = \frac{1}{m}\sum_{s=1}^{m}(x_{s,j}-\mu_j)^2.\]

Then each feature value is standardized using batch statistics and given a learned per-feature scale $\gamma_j$ and shift $\beta_j$:

\[y_{s,j} = \gamma_j \frac{x_{s,j}-\mu_j}{\sqrt{\sigma_j^2+\epsilon}} + \beta_j.\]

This is natural in CNNs, where a channel often corresponds to a type of local visual feature, and many images or spatial locations can provide stable statistics for that channel.

Language modeling is less friendly to batch-level statistics. A batch can contain different sentences, different lengths, different positions, and different contexts. A hidden state at position $i$ may be representing a sentence start, code indentation, long-range reference, punctuation, or padding-adjacent content. If we normalize one hidden feature using statistics from other sequences and positions in the same batch, the result for one token depends on unrelated examples.

That dependence is especially awkward for autoregressive models. The output for a prompt should depend on the prompt and model parameters. BatchNorm uses current-batch statistics during training and usually moving averages during inference. The source of statistics changes between training and inference, and the statistics change with batch size, sequence length, and padding. Language models often need batch size 1, dynamic sequence lengths, and KV-cache incremental generation. Batch-level statistics are a poor interface for this setting.

LayerNorm brings the statistics back to the current token’s feature vector. For the token state

\[h_{b,i} = \begin{pmatrix} h_{b,i,1} \\\\ h_{b,i,2} \\\\ \vdots \\\\ h_{b,i,d} \end{pmatrix},\]

LayerNorm computes the mean over its $d$ features:

\[\mu_{b,i}=\frac{1}{d}\sum_{j=1}^{d}h_{b,i,j},\]

and the variance:

\[v_{b,i}=\frac{1}{d}\sum_{j=1}^{d}(h_{b,i,j}-\mu_{b,i})^2.\]

This gives a standardized token state:

\[\hat{h}_{b,i} = \frac{h_{b,i}-\mu_{b,i}}{\sqrt{v_{b,i}+\epsilon}}.\]

Then LayerNorm applies a learned feature-wise affine transform:

\[y_{b,i}=\boldsymbol{\gamma}\odot\hat{h}_{b,i}+\boldsymbol{\beta}.\]

Here

\[\boldsymbol{\gamma},\boldsymbol{\beta}\in\mathbb{R}^d.\]

Their length is the hidden size $d$, so each feature has its own scale and shift; a single scalar could only scale or shift the whole vector uniformly. Across the whole tensor $H\in\mathbb{R}^{B\times n\times d}$, the same $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are broadcast to every batch element and every token position.

To see the sharing, ignore the batch dimension and take a sequence with three tokens:

\[H= \begin{pmatrix} h_1^\top \\\\ h_2^\top \\\\ h_3^\top \end{pmatrix} \in\mathbb{R}^{3\times d}.\]

Each row is a $d$-dimensional hidden state:

\[h_1,h_2,h_3\in\mathbb{R}^d.\]

LayerNorm computes separate statistics for $h_1$, $h_2$, and $h_3$, giving

\[\hat{h}_1,\hat{h}_2,\hat{h}_3\in\mathbb{R}^d.\]

Then the same parameters are applied to all three:

\[y_1=\boldsymbol{\gamma}\odot\hat{h}_1+\boldsymbol{\beta},\]

\[y_2=\boldsymbol{\gamma}\odot\hat{h}_2+\boldsymbol{\beta},\]

\[y_3=\boldsymbol{\gamma}\odot\hat{h}_3+\boldsymbol{\beta}.\]

Each token has different hidden contents and its own normalization statistics. The learned scale and shift are shared. Intuitively, $\gamma_j$ says how strongly feature $j$ should enter the computation that follows this normalization layer.

The operation $\boldsymbol{\gamma}\odot\hat{h}$ is coordinate-wise multiplication. In matrix language, it is a diagonal scaling:

\[\boldsymbol{\gamma}\odot\hat{h} = \mathrm{diag}(\boldsymbol{\gamma})\hat{h}.\]

It does feature-wise scaling and shifting. Mixing across features is still handled by dense linear layers such as $W_Q,W_K,W_V,W_O$ and the MLP projections.

The learned parameters matter because standardization makes the token state statistically stable, but the next layer may still need different feature strengths. Some features should enter more strongly, some more weakly, and some may need a stable baseline. $\boldsymbol{\gamma}$ learns feature strength after normalization. $\boldsymbol{\beta}$ learns feature baseline. The pattern is: pull a drifting input back into a stable coordinate system, then learn the useful scale and offset inside that system.

For a small example, suppose a standardized vector is

\[\hat{h}= \begin{pmatrix} 1 \\\\ -1 \\\\ 0 \end{pmatrix}.\]

A layer might learn

\[\boldsymbol{\gamma}= \begin{pmatrix} 2 \\\\ 0.5 \\\\ 1 \end{pmatrix},\quad \boldsymbol{\beta}= \begin{pmatrix} 0.1 \\\\ 0 \\\\ -0.2 \end{pmatrix}.\]

Then

\[\boldsymbol{\gamma}\odot\hat{h}+\boldsymbol{\beta} = \begin{pmatrix} 2.1 \\\\ -0.5 \\\\ -0.2 \end{pmatrix}.\]

LayerNorm and BatchNorm therefore differ in their statistics. BatchNorm asks how a feature is distributed across a batch. LayerNorm asks about the scale of the current token’s hidden state. Transformers need the latter as a stable interface.
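The difference in statistics can be made concrete with a sketch. Both functions are simplified illustrations (the BatchNorm one covers training mode only, with no running averages), and the tensor sizes are arbitrary. The experiment changes one sequence in the batch and checks whether the output for the other sequence moves.

```python
import numpy as np

def layer_norm(H, gamma, beta, eps=1e-5):
    """Normalize each token state over its last (feature) axis."""
    mu = H.mean(axis=-1, keepdims=True)
    var = H.var(axis=-1, keepdims=True)       # population variance, 1/d
    return gamma * (H - mu) / np.sqrt(var + eps) + beta

def batch_norm_train(X, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm sketch: per-feature statistics over all rows."""
    mu = X.mean(axis=0, keepdims=True)
    var = X.var(axis=0, keepdims=True)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
B, n, d = 2, 3, 8
H = rng.standard_normal((B, n, d))
gamma, beta = np.ones(d), np.zeros(d)

H_other = H.copy()
H_other[1] = 10.0 * rng.standard_normal((n, d))   # change only sequence 1

ln_a, ln_b = layer_norm(H, gamma, beta), layer_norm(H_other, gamma, beta)
bn_a = batch_norm_train(H.reshape(-1, d), gamma, beta).reshape(B, n, d)
bn_b = batch_norm_train(H_other.reshape(-1, d), gamma, beta).reshape(B, n, d)

print(np.allclose(ln_a[0], ln_b[0]))   # LayerNorm: is sequence 0 unaffected?
print(np.allclose(bn_a[0], bn_b[0]))   # BatchNorm: is sequence 0 unaffected?
```

With LayerNorm, the output for sequence 0 depends only on sequence 0. With batch statistics, it shifts when an unrelated sequence in the batch changes.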

RMSNorm simplifies the same idea. LayerNorm subtracts the mean, divides by a standard deviation, and then applies $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$. Many decoder-only LLMs mainly need stable scale at sublayer input. RMSNorm keeps the scale division and the learned $\boldsymbol{\gamma}$, while dropping mean-centering and $\boldsymbol{\beta}$.

For a token state $h\in\mathbb{R}^d$, RMSNorm computes

\[\mathrm{RMS}(h) = \sqrt{\frac{1}{d}\sum_{j=1}^{d}h_j^2+\epsilon},\]

then

\[\mathrm{RMSNorm}(h) = \boldsymbol{\gamma}\odot\frac{h}{\mathrm{RMS}(h)}.\]

For

\[h= \begin{pmatrix} 3 \\\\ 4 \\\\ 0 \end{pmatrix},\]

we have, ignoring the small $\epsilon$,

\[\mathrm{RMS}(h)=\sqrt{\frac{3^2+4^2+0^2}{3}}=\frac{5}{\sqrt{3}}.\]

Ignoring $\boldsymbol{\gamma}$, the normalized vector is

\[\frac{h}{\mathrm{RMS}(h)} = \begin{pmatrix} \frac{3\sqrt{3}}{5} \\\\ \frac{4\sqrt{3}}{5} \\\\ 0 \end{pmatrix}.\]

If the original vector is scaled by 10, the RMS is also scaled by 10 (again ignoring $\epsilon$), so the factor cancels:

\[\mathrm{RMS}(10h)=\frac{50}{\sqrt{3}}.\]

\[\frac{10h}{\mathrm{RMS}(10h)} = \frac{h}{\mathrm{RMS}(h)}.\]

This is the core effect of RMSNorm: if a token state becomes larger or smaller as a whole, the input to the sublayer is pulled back to a comparable scale.
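A minimal RMSNorm sketch reproduces the worked example and the scale invariance. The function name `rms_norm` and the default `eps` are assumptions for illustration; `gamma` is set to ones to isolate the normalization step.

```python
import numpy as np

def rms_norm(h, gamma, eps=1e-6):
    """RMSNorm over the last axis: divide by the root mean square, then scale."""
    rms = np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)
    return gamma * h / rms

h = np.array([3.0, 4.0, 0.0])
gamma = np.ones(3)               # identity scale, to isolate the normalization

out = rms_norm(h, gamma)
out_scaled = rms_norm(10.0 * h, gamma)
print(out)                       # close to [3, 4, 0] * sqrt(3) / 5
print(np.allclose(out, out_scaled, atol=1e-6))
```

The invariance is exact only when $\epsilon = 0$; with a small positive $\epsilon$ the two outputs agree to within a tiny tolerance.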

With the object being normalized and its statistics clear, the placement of normalization in pre-norm is easier to understand. A transformer block repeatedly adds updates into the residual stream:

\[H\leftarrow H+U_{\mathrm{attn}},\]

then

\[H\leftarrow H+U_{\mathrm{mlp}}.\]

These updates accumulate layer after layer. The residual stream preserves information, but it also carries scale forward. If some updates are large, later layers see larger hidden states. If some directions are repeatedly amplified, attention logits, MLP activations, and final logits can become too large.

Pre-norm writes the block as

\[X=\mathrm{Norm}(H),\]

\[H'=H+\mathrm{Attention}(X),\]

then

\[R=\mathrm{Norm}(H'),\]

\[H_{\mathrm{next}}=H'+\mathrm{MLP}(R).\]

The attention and MLP read normalized inputs. The residual stream itself still passes forward through addition. This gives each sublayer a more stable input scale while preserving a direct residual path for forward information and backward gradients.

Post-norm instead places normalization after the sublayer update:

\[H'=\mathrm{Norm}(H+\mathrm{Attention}(H)).\]

Here the sublayer reads $H$ directly and normalization is applied to the sum, so the Norm sits on the residual path itself: every forward signal and backward gradient has to pass through it at each sublayer. Pre-norm instead keeps scale control at the sublayer entrance and leaves the residual path as a plain addition, which is one reason it is common in deep decoder-only models.
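The two block layouts can be written side by side. This is a structural sketch only: `toy_attn` and `toy_mlp` are hypothetical stand-ins (fixed linear maps that do not mix tokens), and the norm is a plain RMS normalization without learned parameters. The stand-ins record the scale of their inputs so the effect of pre-norm is visible.

```python
import numpy as np

def rms_norm(H, eps=1e-6):
    return H / np.sqrt(np.mean(H * H, axis=-1, keepdims=True) + eps)

def pre_norm_block(H, attn, mlp, norm=rms_norm):
    H = H + attn(norm(H))        # sublayer reads a normalized copy
    H = H + mlp(norm(H))         # residual stream itself is only added to
    return H

def post_norm_block(H, attn, mlp, norm=rms_norm):
    H = norm(H + attn(H))        # normalization sits on the residual path
    H = norm(H + mlp(H))
    return H

# Hypothetical stand-ins for attention and the MLP: fixed maps that also
# record the RMS of whatever input they receive.
rng = np.random.default_rng(0)
d = 16
W1 = rng.standard_normal((d, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, d)) / np.sqrt(d)
seen = []

def toy_attn(X):
    seen.append(float(np.sqrt(np.mean(X * X))))
    return X @ W1

def toy_mlp(X):
    seen.append(float(np.sqrt(np.mean(X * X))))
    return np.tanh(X @ W2)

H = 50.0 * rng.standard_normal((3, d))    # a residual stream with large scale
H_next = pre_norm_block(H, toy_attn, toy_mlp)
print(seen)                               # both sublayer inputs have RMS near 1
```

Even though the residual stream has a scale of about 50, both sublayers in the pre-norm block see inputs with RMS close to 1.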

Controlling Stretch and Update Length

Normalization controls the scale of vectors entering sublayers. Linear layers then change vector lengths in direction-dependent ways. For a vector, norm asks how long it is. For a matrix, a natural question is how much it can stretch a vector. If the input is $x$ and the output is $Wx$, the stretch factor is

\[\frac{\lVert Wx\rVert}{\lVert x\rVert}.\]

This depends on direction. For

\[W= \begin{pmatrix} 4 & 0 & 0 \\\\ 0 & 1 & 0 \\\\ 0 & 0 & 0.25 \end{pmatrix},\]

the unit directions

\[e_1=\begin{pmatrix}1\\\\0\\\\0\end{pmatrix},\quad e_2=\begin{pmatrix}0\\\\1\\\\0\end{pmatrix},\quad e_3=\begin{pmatrix}0\\\\0\\\\1\end{pmatrix}\]

are mapped to

\[We_1=\begin{pmatrix}4\\\\0\\\\0\end{pmatrix},\quad We_2=\begin{pmatrix}0\\\\1\\\\0\end{pmatrix},\quad We_3=\begin{pmatrix}0\\\\0\\\\0.25\end{pmatrix}.\]

The first direction is stretched by a factor of 4, the second is unchanged, and the third is shrunk to a quarter of its length. Matrix norms summarize this kind of behavior. The spectral norm focuses on the largest possible stretch over all unit inputs:

\[\lVert W\rVert_2 = \max_{\lVert x\rVert=1}\lVert Wx\rVert.\]

In this example,

\[\lVert W\rVert_2=4.\]

SVD connects directly to this. Since

\[Wx=\sum_i \sigma_i u_i(v_i^\top x),\]

each singular value $\sigma_i$ is the stretch strength of a read-write channel. The largest singular value equals the spectral norm:

\[\lVert W\rVert_2=\sigma_1.\]

Singular values therefore describe both the scale effect of a matrix along different channels and its low-rank structure (Fig. 5).

Figure 5. A linear layer $W$ maps the unit circle (all unit inputs) to an ellipse. Its two semi-axis lengths are the singular values $\sigma_1 \ge \sigma_2$; the longest semi-axis, $\sigma_1$, is the spectral norm $\lVert W\rVert_2$.
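The link between stretch, singular values, and the spectral norm can be checked on the diagonal example. The random unit inputs are only an empirical illustration of the maximum in the definition.

```python
import numpy as np

W = np.diag([4.0, 1.0, 0.25])

sigma = np.linalg.svd(W, compute_uv=False)   # singular values, largest first
spec = np.linalg.norm(W, 2)                  # spectral norm

# Empirical stretch factors over random unit inputs never exceed sigma_1.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
stretch = np.linalg.norm(X @ W.T, axis=1)

print(sigma, spec, stretch.max())
```

Every sampled stretch factor lies between the smallest and largest singular values, and only inputs aligned with $v_1$ reach the upper bound.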

Training stability depends on these scales. A linear layer that strongly amplifies some directions can push activations, attention logits, or gradients to large magnitudes. Many layers in sequence can compound that effect. Directions that are compressed too much may weaken important signals. Residual connections, normalization, initialization, learning rate, optimizers, and sometimes gradient clipping all help manage the same scale problem.

Gradient clipping is also a norm operation. Treat the gradient as one long vector $g$. A parameter update is roughly

\[\theta\leftarrow\theta-\eta g.\]

If $\lVert g\rVert$ spikes, the update can travel too far. Global norm clipping rescales the gradient to stay under a threshold $\tau$:

\[g_{\mathrm{clipped}} = g\cdot \min\left(1,\frac{\tau}{\lVert g\rVert}\right).\]

This preserves direction while limiting update length.
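Global norm clipping is short enough to sketch in full. The helper name `clip_by_global_norm` is hypothetical, and the gradients are represented as a list of arrays that is treated as one long vector.

```python
import numpy as np

def clip_by_global_norm(grads, tau):
    """Scale a list of gradient arrays so their joint L2 norm is at most tau."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, tau / total) if total > 0 else 1.0
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # joint norm is 13
clipped, norm_before = clip_by_global_norm(grads, tau=1.0)
norm_after = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
print(norm_before, norm_after)
```

All arrays share one scale factor, so the relative sizes of the gradient components, and therefore the update direction, are unchanged.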

The common thread is scale management. Vector norms tell us how long states and updates are. Matrix norms tell us how much a linear layer can stretch an input. BatchNorm, LayerNorm, and RMSNorm differ in the statistics they use to control activation scale. Pre-norm places that control at the sublayer entrance. Singular values describe stretch along matrix channels. Gradient clipping controls the length of training updates. These mechanisms all keep deep networks from losing useful information to uncontrolled magnitude.

Back to the Whole Picture

Returning to the opening question, linear algebra in LLMs is a language for describing internal computation. Token IDs are placed into vector space. Each position receives a $d$-dimensional hidden state. Layer after layer, the model reads, rewrites, mixes, rescales, and writes back those states.

The first role is representation. Vectors turn discrete tokens into objects that can be continuously computed on. A hidden state can be read along many directions. $u^\top h$ gives the response along direction $u$. Embeddings, residual streams, and output logits all rely on this state space.

The second role is transformation. Ordinary linear layers operate inside each token position. $hW$ or $Wx$ rewrites the current state into a new set of coordinates. $W_Q,W_K,W_V$ read the same hidden state into query, key, and value roles. MLP projections expand, process, and write back features. The output projection reads the final state into vocabulary logits.

The third role is interaction. $QK^\top$ and $AV$ are where attention lets tokens communicate. $QK^\top$ produces matching scores between positions. Softmax turns them into read weights. $AV$ uses those weights to mix value information back into each position. Multi-head attention repeats this with multiple routing rules, and $W_O$ merges the resulting reads back into the residual stream.

The fourth role is scale control. Vector norms affect dot products, attention logits, and output logits. Matrix norms and singular values describe how linear layers amplify or shrink directions. LayerNorm, RMSNorm, pre-norm, and gradient clipping all help keep signals at usable magnitudes.

The final role is structure. Outer products show that matrix multiplication can be decomposed into rank-1 contributions. Rank measures how many independent directions a matrix uses. SVD orders those directions by strength. Low-rank approximation, SVD compression, and LoRA all use the fact that many useful linear actions may concentrate in relatively few channels.

So the sentence “LLMs are full of matrix multiplications” is true, but it is only the entry point. Matrix multiplication can rewrite a token representation, create routing scores in attention, merge head outputs, expose low-rank structure, or reveal scale through singular values. Linear algebra gives us a coordinate system for tracking those actions: where information is represented, which directions read it, which matrices rewrite it, where it flows across positions, and what scale it keeps as it moves.

Through this frame, attention, MLPs, normalization, LoRA, and compression all fall under the same set of traceable questions: what is being read, where is it written, which directions changed, what scale is controlled, and how much structure is preserved or discarded? That is the value of the linear algebra view: it turns a set of seemingly separate modules into a language for representation, transformation, interaction, scale, and structure.