Similar items by topic, tags, and provider (metadata-only).
dataset · Hugging Face
Huge cleaned English web corpus; best raw breadth for LLM pretraining.
dataset · Hugging Face
Legally cleanest starting corpus: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
dataset · Hugging Face
Synthetic textbook/blog/WikiHow-style corpus that helps tutor-like explanations.
dataset · Hugging Face
Massive public dataset hub spanning NLP, code, vision, audio, robotics, and benchmarks.
dataset · Foundation · Hugging Face
Essential for loading, cleaning, streaming, and publishing training and evaluation data.
dataset · Hugging Face
Benchmark-oriented high-quality Common Crawl derivative used in DCLM.
dataset · Hugging Face
Crosslingual prompt pool across many languages and tasks.
dataset · Hugging Face
Good open visual instruction tuning layer after base vision-language pretraining.
dataset · Hugging Face
Good synthetic multi-turn chat augmentation.
dataset · Hugging Face
Large prompt/source mixture widely used in open instruction tuning.
dataset · Hugging Face
High-quality human assistant trees with ratings; strong open chat data.
dataset · Hugging Face
Smaller but clean human-written instruction set.