Week 3
Training Data Archaeology
Explore what is actually in AI training data
Step 1: Explore Dataset Composition
LLMs learn from massive datasets such as The Pile, Common Crawl, and C4. But what exactly is in these datasets?
The datasets:
- The Pile: 825GiB of text combining 22 smaller datasets for diverse language modeling
- Common Crawl: ~800GB filtered sample of web pages; the full Common Crawl archive of internet crawls is far larger (petabyte scale)
- C4: 750GB cleaned version of Common Crawl with aggressive filtering
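One way to make "composition" concrete is to measure what share of a sample's bytes comes from each source. The sketch below is a minimal illustration, assuming each record is a dict with a `source` label and a `text` field. Both field names, the `composition_by_bytes` helper, and the sample records are hypothetical and are not taken from any of the datasets above.

```python
from collections import Counter


def composition_by_bytes(records):
    """Return each source's share of total UTF-8 bytes, largest first.

    Assumes every record has hypothetical "source" and "text" keys.
    """
    totals = Counter()
    for record in records:
        totals[record["source"]] += len(record["text"].encode("utf-8"))
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {source: size / grand_total for source, size in totals.most_common()}


# Made-up records for illustration only; not drawn from any real dataset.
sample = [
    {"source": "web", "text": "a" * 60},
    {"source": "code", "text": "b" * 30},
    {"source": "web", "text": "c" * 10},
]

shares = composition_by_bytes(sample)
for source, share in shares.items():
    print(f"{source}: {share:.0%}")
```

Measuring by bytes rather than by document count keeps a handful of very long documents from being under-represented. Counting documents instead would answer a different question.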
Step 2: Filter and View Samples
Step 3: Research Challenge