Week 3

Training Data Archaeology

Explore what is actually in AI training data

1

Explore Dataset Composition

LLMs learn from massive datasets such as The Pile, Common Crawl, and C4. But what exactly is in these datasets? Each one is summarized below.

The datasets:

  • The Pile: 825GB combining 22 smaller datasets for diverse language modeling
  • Common Crawl: ~800GB of filtered web pages drawn from massive internet crawls (the full Common Crawl archive is far larger, on the order of petabytes)
  • C4: 750GB cleaned version of Common Crawl with aggressive filtering
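To make "composition" concrete, here is a minimal sketch of how you might tally what share of a dataset sample comes from each source. It assumes each record is a dict with hypothetical `source` and `text` fields; the real datasets above each use their own formats and metadata, and the sample records here are made up purely for illustration.

```python
from collections import defaultdict


def composition_by_source(records):
    """Return each source's share of total UTF-8 bytes, as percentages.

    `records` is an iterable of dicts with hypothetical keys
    "source" (a label such as "web") and "text" (the document body).
    """
    bytes_per_source = defaultdict(int)
    for record in records:
        # Measure size in bytes, since dataset sizes are usually quoted in GB.
        bytes_per_source[record["source"]] += len(record["text"].encode("utf-8"))

    total = sum(bytes_per_source.values())
    if total == 0:
        return {}

    # Largest sources first, so the dominant components are easy to spot.
    ranked = sorted(bytes_per_source.items(), key=lambda kv: kv[1], reverse=True)
    return {source: 100.0 * size / total for source, size in ranked}


# Made-up sample records, not real data from any of the datasets above.
sample = [
    {"source": "web", "text": "a" * 60},
    {"source": "code", "text": "b" * 30},
    {"source": "books", "text": "c" * 10},
]

shares = composition_by_source(sample)
for source, pct in shares.items():
    print(f"{source}: {pct:.1f}%")
```

Counting bytes rather than documents is a deliberate choice in this sketch: a handful of very long documents can outweigh thousands of short ones, so the two measures can give quite different pictures of the same dataset.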

2

Filter and View Samples

3

Research Challenge
