Week 3

Training Data Archaeology

Explore what is actually in AI training data

Explore Dataset Composition

Large language models (LLMs) learn from massive datasets such as The Pile, Common Crawl, and C4. But what exactly is in these datasets?

The datasets:

The Pile: 825GB combining 22 smaller datasets for diverse language modeling
Common Crawl: ~800GB of filtered web pages drawn from massive internet crawls (the full Common Crawl archive is far larger, at petabyte scale)
C4: 750GB cleaned version of Common Crawl with aggressive filtering
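
One way to make "composition" concrete is to measure what share of a dataset's bytes comes from each source. The sketch below is a minimal illustration, not the method used by any of these datasets: it assumes each record is a dictionary with a "text" field and a "source" label, and both the field names and the toy records are hypothetical.

```python
from collections import Counter


def composition(records, source_key="source"):
    """Return each source's share of total UTF-8 bytes, largest first.

    Assumes every record is a dict with a "text" field and a label under
    `source_key`; these field names are illustrative, not a dataset schema.
    """
    sizes = Counter()
    for rec in records:
        # Measure size in bytes rather than characters, since dataset
        # sizes are usually quoted in bytes.
        sizes[rec[source_key]] += len(rec["text"].encode("utf-8"))
    total = sum(sizes.values())
    if total == 0:
        return {}
    return {src: n / total for src, n in sizes.most_common()}


# Toy records invented purely for illustration (not real dataset samples).
toy = [
    {"source": "web", "text": "a" * 60},
    {"source": "code", "text": "b" * 30},
    {"source": "books", "text": "c" * 10},
]

shares = composition(toy)
for src, share in shares.items():
    print(f"{src}: {share:.0%}")
```

Counting bytes instead of documents matters because a few very long documents can dominate a dataset that looks balanced by document count.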

Filter and View Samples


Research Challenge
