From Networks to LLMs
How the neural network principles you learned scale up to create large language models
From How Networks Learn: Now you understand how neural networks learn patterns. LLMs are just very large networks trained on text. Let's see how.
Same Principles, Massive Scale
The weights and biases you adjusted in the perceptron? GPT-4 reportedly has around 1.7 trillion of them. The training process you watched? It ran on thousands of GPUs for months. LLMs are built on the same fundamental ideas - just bigger.
1. What Are Parameters, Really?
When you "download a model," you're downloading numbers - the same weights and biases you've been exploring, just trillions of them organized into specialized structures.
When you "download a model," you're downloading numbers.
That's it. A model file is just a list of weights and biases - the same numbers you've been adjusting in the perceptron above. The magic isn't in the format, it's in finding the right numbers through training.
Perceptron Model File
The perceptron you just played with has exactly 3 parameters: two weights and one bias. Here's what a "model file" would look like:
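A minimal sketch (the parameter values here are made up for illustration - a trained perceptron would have whatever numbers training found):

```python
import json

# The entire "model file" for a 2-input perceptron: three parameters.
# These particular values are illustrative, not from a trained model.
perceptron_model = {
    "w1": 0.8,
    "w2": -0.5,
    "bias": 0.2,
}

# Saving and loading a model is just serializing the numbers.
saved = json.dumps(perceptron_model)
loaded = json.loads(saved)
print(loaded)
```

An LLM's file is the same idea with trillions of numbers instead of three, organized into structures like attention matrices.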
2. How LLMs Are Trained
LLMs use the same training loop you learned - but with a clever twist: they don't need human labels. The text labels itself.
From Neural Networks to LLMs
Everything you just learned applies to LLMs - but at a scale that changes what's possible.
No Labels Needed - Text Labels Itself
Traditional neural networks need labeled data: "This image is a cat." LLMs train differently - every sentence is its own training example.
How one sentence becomes a training example:
Key insight: Every piece of text on the internet is a training example. No human had to label anything - the next word IS the label.
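The self-labeling idea can be sketched in a few lines. This toy version splits on spaces; real LLMs use subword tokenizers, but the principle is identical:

```python
sentence = "The cat sat on the mat"
tokens = sentence.split()  # toy word-level tokenization; real LLMs use subwords

# Every prefix predicts the word that follows it: the text labels itself.
examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for context, target in examples:
    print(" ".join(context), "->", target)
```

One six-word sentence yields five (context, next-word) training pairs, ending with "The cat sat on the" -> "mat" - no human labeling required.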
3. Putting It All Together
An LLM is just a very large neural network. Here's the complete mental model:
The LLM Mental Model
Why embeddings matter
They're literally the input format for the neural network. Without converting tokens to vectors, there's no way to feed text into the math.
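A minimal sketch of an embedding lookup - the vocabulary, dimension, and vector values below are made up; in a real model the table is learned during training:

```python
import random

random.seed(0)

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
dim = 8  # real models use hundreds of dimensions per token, e.g. 768

# One vector per token in the vocabulary (random here, learned in practice).
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab
]

# Converting text to vectors is just a table lookup per token.
tokens = ["the", "cat", "sat", "on", "the"]
vectors = [embedding_table[vocab[t]] for t in tokens]
print(len(vectors), "vectors of", len(vectors[0]), "numbers each")
```

Those lists of numbers are what the network's layers actually multiply by their weights.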
Attention everywhere
Attention isn't preprocessing — it happens at every layer. Early layers see syntax, later layers see meaning and reasoning.
See how an LLM predicts the next word, then the next, then the next...
1. Input: "The cat sat on the"
2. Tokenize: the text is split into tokens the model knows
3. Embed: each token becomes a vector of numbers (e.g. 768 per token)
4. Neural network + attention: billions of parameters process the vectors through many layers, attending to "cat" and "sat" to predict where cats typically sit
5. Output: a probability for every token in the vocabulary - "mat" scores highest, with ~50,000 other tokens at tiny probabilities
6. Selected token: "mat" - learned from millions of texts that cats sit on mats!
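The last two steps - raw scores to probabilities to a chosen token - can be sketched with a softmax over a handful of made-up candidate scores:

```python
import math

# Hypothetical raw scores (logits) for a few candidate next tokens.
logits = {"mat": 4.0, "floor": 2.5, "couch": 2.0, "moon": -1.0}

# Softmax: exponentiate and normalize so the scores sum to 1.
total = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / total for tok, v in logits.items()}

# Greedy decoding picks the most likely token; real models often sample instead.
best = max(probs, key=probs.get)
print(best)  # "mat" wins by a wide margin
```

Sampling from the probabilities instead of always taking the top token is what makes LLM output vary between runs.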
What we're skipping
The real magic is in "attention" - how words in a sentence look at each other to understand context. That's what makes "bank" mean different things in "river bank" vs "bank account". But you don't need to understand attention to reason about what LLMs can and can't do.
Remember parameters? GPT-4's reported 1.7 trillion parameters are the weights in this giant network - just like the w1, w2, and bias you adjusted in the perceptron, but trillions of them.
What you learned
- •A "model file" is just a list of parameters (weights, biases, attention matrices)
- •LLMs learn by predicting the next word - no human labels needed (self-supervised)
- •Scale matters: bigger models trained on more data develop emergent capabilities
- •Training costs millions; inference (using the model) costs fractions of a cent per query