Deep Parameter-Level Understanding: How Input Flows Through a Trained Language Model

Modern language models like Phi-3 or LLaMA are often treated as black boxes — you feed them text and get intelligent answers. But beneath that, they are nothing more than massive, structured matrices of numbers (parameters) performing linear algebra operations in sequence. To truly understand how these models think, we must follow the journey of an input token through every stage of computation — from text to logits — and observe how parameters shape meaning.

1. Tokenization: Text to Numbers

Every input string is first broken into tokens — discrete integer IDs mapped by a vocabulary.
Example:

“Hello world” → [15496, 995]

Each token ID is an index into an embedding matrix E of shape (vocab_size, embedding_dim).
At the parameter level:

x₀ = E[token_id]

This vector x₀ (say 4096-dimensional) is the model’s numeric representation of the word.
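
To make the lookup concrete, here is a minimal NumPy sketch. The sizes are scaled down and the embedding matrix is random, standing in for trained weights; the token IDs are taken from the example above and would in practice come from the tokenizer.

import numpy as np

vocab_size, d_model = 32000, 64          # d_model shrunk for the demo; a 7B-class model uses ~4096
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model), dtype=np.float32)   # stand-in for the learned embedding matrix

token_ids = [15496, 995]                 # "Hello world" (exact IDs depend on the tokenizer)
x0 = E[token_ids]                        # row lookup: one d_model-dimensional vector per token
print(x0.shape)                          # (2, 64)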

2. Embedding and Positional Encoding

Self-attention on its own treats the input as an unordered set of tokens, so the model needs explicit information about where each token occurs.
A positional encoding (learned or sinusoidal) is added to each embedding:

x₀ = E[token_id] + P[position]
  • E = learned token embeddings

  • P = learned positional embeddings

Both are stored in the model’s parameters and updated during training. (LLaMA and Phi-3 actually encode position with rotary embeddings, RoPE, which rotate the query and key vectors inside attention instead of adding a learned P.)
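
As a sketch of the additive scheme described above, again with random stand-ins for E and P and scaled-down dimensions:

import numpy as np

vocab_size, d_model, max_len = 32000, 64, 4096   # toy sizes; real models use d_model ≈ 4096
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model), dtype=np.float32)  # learned token embeddings (random stand-in)
P = rng.standard_normal((max_len, d_model), dtype=np.float32)     # learned positional embeddings (random stand-in)

token_ids = np.array([15496, 995])               # "Hello world"
positions = np.arange(len(token_ids))
x0 = E[token_ids] + P[positions]                 # per-token input to the first Transformer block, shape (2, d_model)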

3. Transformer Layers: Parameterized Flow of Information

The input now flows through a stack of identically structured Transformer blocks (e.g., 32–80 layers), each with its own parameters.

Each layer has two main parts:

  1. Multi-Head Self-Attention (MHSA)

  2. Feedforward Network (FFN)

Let’s zoom in to the parameter level.

(a) Self-Attention: The Dynamic Router

Each token embedding is linearly projected into three spaces using trainable matrices:

Q = xW_Q
K = xW_K
V = xW_V

Here:

  • W_Q, W_K, W_V are parameter matrices (each of size d_model × d_head per attention head; the projections for all heads are stored together as matrices of size d_model × (n_heads · d_head)).

  • These matrices are what the model learns in order to detect relationships between tokens.

The attention weights are computed as:

A = softmax(QKᵀ / √d_head)

This determines how much each token should attend to others.
Then the weighted sum of values gives:

z = A × V

The combined attention output passes through another learned projection:

x₁ = zW_O

where W_O is the output projection matrix.

At this point, each token’s representation has mixed information from all other tokens, guided entirely by learned matrices W_Q, W_K, W_V, W_O.
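
Here is a single-head NumPy sketch of that computation, with small random matrices standing in for the trained W_Q, W_K, W_V, W_O, plus the causal mask that decoder-only models apply so a token cannot attend to later positions (omitted from the formulas above for brevity):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

d_model, d_head, n_tokens = 64, 16, 2            # toy sizes; a 7B-class model uses 4096 and 128
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_head), dtype=np.float32) * 0.02 for _ in range(3))
W_O = rng.standard_normal((d_head, d_model), dtype=np.float32) * 0.02

x = rng.standard_normal((n_tokens, d_model), dtype=np.float32)   # token representations entering the layer

Q, K, V = x @ W_Q, x @ W_K, x @ W_V              # project into query / key / value spaces
scores = Q @ K.T / np.sqrt(d_head)               # pairwise similarities between tokens
scores += np.triu(np.full((n_tokens, n_tokens), -np.inf, dtype=np.float32), k=1)  # causal mask
A = softmax(scores)                              # attention weights; each row sums to 1
z = A @ V                                        # weighted mixture of value vectors
x1 = z @ W_O                                     # back to model width (real models concatenate many heads first)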

(b) Feedforward Network: Nonlinear Transformation

After attention, the model applies a two-layer MLP to each token independently:

h₁ = x₁W₁ + b₁
h₂ = GELU(h₁)
x₂ = h₂W₂ + b₂
  • W₁ and W₂ are large parameter matrices (e.g., W₁ of size 4096 × 11008 and W₂ of size 11008 × 4096).

  • The first projection expands the token’s hidden representation and the second compresses it back, enabling nonlinear mixing of semantic features.

Each sub-layer then updates the representation through a residual connection:

x ← x + x₂

with layer normalization applied around each sub-layer (in pre-norm models such as LLaMA and Phi-3, it is applied to the sub-layer’s input). Residual connections ensure stable gradient flow and preserve the information accumulated in earlier layers.
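
A sketch of the feed-forward step and the residual update, with toy sizes and random weights. Plain GELU and LayerNorm are shown here for simplicity; models such as LLaMA actually use a gated SwiGLU feed-forward and RMSNorm.

import numpy as np

def gelu(h):
    # tanh approximation of the GELU activation
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def layer_norm(h, eps=1e-5):
    # normalization without the learned scale/shift, for brevity
    return (h - h.mean(axis=-1, keepdims=True)) / np.sqrt(h.var(axis=-1, keepdims=True) + eps)

d_model, d_ff = 64, 172                  # toy sizes; LLaMA-7B uses 4096 and 11008
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_model, d_ff), dtype=np.float32) * 0.02
W2 = rng.standard_normal((d_ff, d_model), dtype=np.float32) * 0.02
b1 = np.zeros(d_ff, dtype=np.float32)
b2 = np.zeros(d_model, dtype=np.float32)

x1 = rng.standard_normal((2, d_model), dtype=np.float32)   # attention output for two tokens

h1 = layer_norm(x1) @ W1 + b1            # pre-norm, then expand: d_model -> d_ff
h2 = gelu(h1)                            # elementwise nonlinearity
x2 = h2 @ W2 + b2                        # compress back: d_ff -> d_model
x = x1 + x2                              # residual connection carries the old representation forward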

4. The Final Projection: Turning Thought into Words

After the last transformer block, we obtain a final hidden state h_final for each token.

To predict the next token, we project h_final back into vocabulary space. In weight-tied models this uses the transposed embedding matrix Eᵀ (models such as LLaMA learn a separate output projection, lm_head, of the same shape):

logits = h_final × Eᵀ

This gives one score per vocabulary token — the model’s belief in what comes next.
Applying softmax(logits) yields a probability distribution over the entire vocabulary.
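
A sketch of this projection, with a random stand-in for a tied embedding matrix (a separate lm_head matrix of the same shape would work identically):

import numpy as np

vocab_size, d_model = 32000, 64
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model), dtype=np.float32)  # embedding matrix reused as the output projection
h_final = rng.standard_normal(d_model, dtype=np.float32)          # final hidden state of the last token

logits = h_final @ E.T                    # one score per vocabulary entry, shape (32000,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                      # softmax: a probability distribution over the vocabulary
print(probs.sum(), probs.argmax())        # ≈ 1.0, and the index of the most likely next token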

5. Sampling: Converting Probabilities to Output

Finally, the model samples (or picks) the next token:

next_token = argmax(softmax(logits))

or via stochastic sampling (temperature, top-k, or nucleus sampling).
The chosen token is appended to the input and the whole forward pass runs again, autoregressively generating text one token at a time.
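
Both decoding strategies are easy to sketch; the logits here are random stand-ins, and the top-k and temperature values are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(32000, dtype=np.float32)   # stand-in for the model's next-token scores

# Greedy decoding: the softmax is monotonic, so argmax over the logits is enough.
greedy_token = int(np.argmax(logits))

# Temperature + top-k sampling: rescale the scores, keep only the k best, sample among them.
def sample_top_k(logits, k=50, temperature=0.8):
    top = np.argsort(logits)[-k:]                        # indices of the k highest-scoring tokens
    scaled = logits[top] / temperature                   # temperature < 1 sharpens, > 1 flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))

sampled_token = sample_top_k(logits)
print(greedy_token, sampled_token)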

6. Where the “Intelligence” Lives

Every “understanding” or “reasoning” capability of the model is encoded in the millions or billions of numbers inside:

  • W_Q, W_K, W_V, W_O

  • W₁, W₂

  • E and P

Each parameter fine-tunes how inputs mix, how attention flows, and how representations evolve.
At scale, these matrices form a distributed semantic memory — not rules, but high-dimensional geometry learned from data.

7. Summary of the Flow

Stage | Operation    | Parameters            | Output
------|--------------|-----------------------|----------------------
1     | Tokenization | Vocabulary            | Token IDs
2     | Embedding    | E, P                  | Token vectors
3     | Attention    | W_Q, W_K, W_V, W_O    | Contextual features
4     | FFN          | W₁, W₂                | Transformed semantics
5     | Output       | Eᵀ (or lm_head)       | Next-token logits
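
Putting the stages together, the whole pipeline fits in a short toy forward pass. This is a hedged sketch only: one attention head per layer, no normalization, additive positions instead of RoPE, random weights in place of trained parameters, and scaled-down dimensions.

import numpy as np

rng = np.random.default_rng(0)
V, D, H, F, L = 32000, 64, 16, 172, 2            # vocab, d_model, d_head, d_ff, n_layers (toy sizes)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(h):
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))

def init(*shape):
    return rng.standard_normal(shape, dtype=np.float32) * 0.02

E, P = init(V, D), init(4096, D)                 # token and positional embeddings (random stand-ins)
blocks = [dict(WQ=init(D, H), WK=init(D, H), WV=init(D, H), WO=init(H, D),
               W1=init(D, F), W2=init(F, D)) for _ in range(L)]

def forward(token_ids):
    n = len(token_ids)
    x = E[token_ids] + P[np.arange(n)]                       # stages 1-2: embed + position
    mask = np.triu(np.full((n, n), -np.inf, dtype=np.float32), k=1)
    for b in blocks:                                         # stages 3-4: stacked Transformer blocks
        Q, K, Vv = x @ b["WQ"], x @ b["WK"], x @ b["WV"]
        A = softmax(Q @ K.T / np.sqrt(H) + mask)             # causal self-attention weights
        x = x + (A @ Vv) @ b["WO"]                           # residual attention update
        x = x + gelu(x @ b["W1"]) @ b["W2"]                  # residual feed-forward update
    return x[-1] @ E.T                                       # stage 5: logits for the next token

logits = forward([15496, 995])                               # "Hello world" -> scores over the vocabulary
print(int(np.argmax(logits)))                                # greedy choice of the next token ID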

Closing Thought

Understanding a model like Phi-3 or LLaMA at the parameter level reveals a simple but profound truth: these “intelligent” systems are deterministic numerical pipelines, right up to the final sampling step. The complexity and creativity we perceive are emergent properties of large-scale optimization of these matrices: a symphony of dot products and nonlinearities that together simulate reasoning.

In essence:

A language model doesn’t “know” words — it shapes probability landscapes where meaning naturally emerges through matrix multiplication.
