Foundations

From Networks to LLMs

How the neural network principles you learned scale up to create large language models

From How Networks Learn: Now you understand how neural networks learn patterns. LLMs are just very large networks trained on text. Let's see how.

Same Principles, Massive Scale

The weights and biases you adjusted in the perceptron? GPT-4 reportedly has around 1.7 trillion of them. The training process you watched? It ran on thousands of GPUs for months. LLMs are built on the same fundamental ideas - just much bigger.

1. What Are Parameters, Really?

When you "download a model," you're downloading numbers - the same weights and biases you've been exploring, just billions or trillions of them organized into specialized structures.

That's it. A model file is just a long list of weights and biases - the same numbers you've been adjusting in the perceptron above. The magic isn't in the file format; it's in finding the right numbers through training.

Perceptron Model File

The perceptron you just played with has exactly 3 parameters: two weights and one bias. Here's what a "model file" would look like:

// perceptron.json - 3 parameters
{
  "w1": 0.6,  // Weight for input 1
  "w2": 0.4,  // Weight for input 2
  "b": -0.3   // Bias
}
File size: ~50 bytes
These are the exact sliders you adjusted earlier. That's all a model is!
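As a sketch, here's that entire "model" loaded and run in a few lines of Python (the `perceptron` function and its step activation are illustrative, not from any particular library):

```python
import json

# The same 3 parameters shown above, loaded as a "model file".
model = json.loads('{"w1": 0.6, "w2": 0.4, "b": -0.3}')

def perceptron(x1, x2, p):
    # Weighted sum of inputs plus bias, then a step activation.
    z = p["w1"] * x1 + p["w2"] * x2 + p["b"]
    return 1 if z > 0 else 0

print(perceptron(1, 1, model))  # -> 1 (0.6 + 0.4 - 0.3 = 0.7 > 0)
print(perceptron(0, 0, model))  # -> 0 (-0.3 is not > 0)
```

Swap different numbers into `model` and the same code behaves differently - the parameters, not the code, are the model.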

2. How LLMs Are Trained

LLMs use the same training loop you learned - but with a clever twist: they don't need human labels. The text labels itself.

From Neural Networks to LLMs

Everything you just learned applies to LLMs - but at a scale that changes what's possible.

No Labels Needed - Text Labels Itself

Traditional neural networks need labeled data: "This image is a cat." LLMs train differently - every sentence is its own training example.

How one sentence becomes a training example:

  • Input: "The cat sat on the ___"
  • Predicts: "floor" (32% confident)
  • Actual: "mat"
  • Learn: adjust weights to increase the probability of "mat" next time

Key insight: Every piece of text on the internet is a training example. No human had to label anything - the next word IS the label.
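To make "the text labels itself" concrete, here's a minimal sketch (using naive whitespace tokenization for clarity; real tokenizers split text into subword pieces):

```python
sentence = "The cat sat on the mat"
tokens = sentence.split()

# Every prefix predicts the word that follows it - no human labeling needed.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, label in examples:
    print(" ".join(context), "->", label)
# One sentence yields 5 training examples, ending with:
# The cat sat on the -> mat
```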

3. Putting It All Together

An LLM is just a very large neural network. Here's the complete mental model:

The LLM Mental Model

Text
  → Tokenizer
  → Token IDs
  → Embedding Layer (the input to the neural network)
  → Vectors (768+ numbers each)
  → Transformer Block × 50-100+
      ↳ Attention (every layer, not just preprocessing!)
      ↳ Feedforward network
      ↳ Repeat many times...
  → Output Layer (probabilities for all ~50,000 tokens)
  → Sample one token (temperature applied here)
  → Append to input and run again (the autoregressive loop)
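The loop at the end of that pipeline can be sketched as follows (`toy_model` is a stand-in for the entire network, and its probability table is invented for illustration):

```python
import random

def toy_model(tokens):
    # Stand-in for the full network: maps a context to next-token probabilities.
    table = {"the": {"mat": 0.24, "floor": 0.15, "couch": 0.11, "bed": 0.08}}
    return table.get(tokens[-1], {"<end>": 1.0})

def generate(prompt, steps=1):
    tokens = prompt.split()
    for _ in range(steps):
        probs = toy_model(tokens)
        # Sample one token weighted by probability, append it, run again.
        next_token = random.choices(list(probs), weights=probs.values())[0]
        tokens.append(next_token)
    return " ".join(tokens)

print(generate("The cat sat on the"))  # e.g. "The cat sat on the mat"
```

The key point is the last two lines of the loop: the model's own output is fed back in as input, one token at a time.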

Why embeddings matter

They're literally the input format for the neural network. Without converting tokens to vectors, there's no way to feed text into the math.
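A sketch of what the embedding layer does - it's just a lookup into a big table of learned vectors (the table here is random and tiny; real models learn 768+ dimensions per vocabulary entry):

```python
import numpy as np

np.random.seed(0)
vocab = {"The": 0, "cat": 1, "sat": 2, "on": 3, "the": 4}
d_model = 8  # toy size; real models use 768 or more

# One vector per vocabulary entry (random here, learned during training).
embedding_table = np.random.randn(len(vocab), d_model)

token_ids = [vocab[w] for w in ["The", "cat", "sat", "on", "the"]]
vectors = embedding_table[token_ids]  # shape (5, 8): the network's actual input
```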

Attention everywhere

Attention isn't preprocessing — it happens at every layer. Early layers see syntax, later layers see meaning and reasoning.
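For the curious, one attention step can be sketched in a few lines of NumPy (toy vectors, a single head, and no learned projections - a real transformer adds query/key/value weight matrices on top of this):

```python
import numpy as np

np.random.seed(1)
d = 4
X = np.random.randn(3, d)  # one toy vector per token

# Each token scores every other token (scaled dot products)...
scores = X @ X.T / np.sqrt(d)

# ...softmax turns each row of scores into weights that sum to 1...
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# ...and each token becomes a weighted mix of all tokens: context awareness.
output = weights @ X
```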

See how an LLM predicts the next word, then the next, then the next...

Input: "The cat sat on the"

Tokenize: The | cat | sat | on | the

Embed: each token becomes a vector of 768 numbers

Neural Network + Attention: attention focuses on "cat" and "sat" to predict where cats typically sit, while billions of parameters process the vectors through many layers

Output - next-token probabilities:

  • mat - 24%
  • floor - 15%
  • couch - 11%
  • bed - 8%
  • ...plus ~50,000 more tokens with tiny probabilities

Selected token: "mat" - learned from millions of texts that cats sit on mats!
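The "sample one token" step, with temperature, can be sketched like this (the logit values are invented for illustration):

```python
import math
import random

logits = {"mat": 2.0, "floor": 1.5, "couch": 1.2, "bed": 0.9}  # made-up scores

def softmax_with_temperature(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = {t: v / temperature for t, v in logits.items()}
    m = max(scaled.values())  # subtract the max for numerical stability
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

probs = softmax_with_temperature(logits, temperature=0.7)
token = random.choices(list(probs), weights=probs.values())[0]
```

At temperature near 0 the top-scoring token almost always wins; at high temperature the choice becomes nearly uniform - which is why temperature feels like a "creativity" dial.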

What we're skipping

The real magic is in "attention" - how words in a sentence look at each other to understand context. That's what makes "bank" mean different things in "river bank" vs "bank account". But you don't need to understand attention to reason about what LLMs can and can't do.

Remember parameters? GPT-4's estimated 1.7 trillion parameters are the weights in this giant network - just like the w1, w2, and bias you adjusted in the perceptron, but trillions of them.
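A quick back-of-envelope check on what those counts mean for file size (the 1.7 trillion figure is an unconfirmed estimate, and real models mix numeric precisions):

```python
# Storing each parameter as a 16-bit (2-byte) float:
params_perceptron = 3
params_gpt4_estimate = 1.7e12  # unconfirmed estimate

bytes_per_param = 2
print(params_perceptron * bytes_per_param)                 # 6 bytes - fits anywhere
print(params_gpt4_estimate * bytes_per_param / 1e12, "TB")  # ~3.4 terabytes
```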

What you learned

  • A "model file" is just a list of parameters (weights, biases, attention matrices)
  • LLMs learn by predicting the next word - no human labels needed (self-supervised)
  • Scale matters: bigger models trained on more data develop emergent capabilities
  • Training costs millions; inference (using the model) costs fractions of a cent per query