Similar items by topic, tags, and provider (metadata-only).
dataset · Hugging Face
Best multilingual extension of the FineWeb pipeline; very broad language coverage.
dataset · Hugging Face
Best starting corpus for legally cleaner data: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
dataset · Hugging Face
Synthetic textbook-, blog-, and WikiHow-style corpus that helps with tutor-like explanations.
dataset · Hugging Face
Massive public dataset hub spanning NLP, code, vision, audio, robotics, and benchmarks.
dataset · Foundation · Hugging Face
Essential for loading, cleaning, streaming, and publishing training and evaluation data.
dataset · Hugging Face
Benchmark-oriented high-quality Common Crawl derivative used in DCLM.
dataset · Hugging Face
High-quality human-written assistant conversation trees with ratings; strong open chat data.
dataset · Hugging Face
Crosslingual prompt pool across many languages and tasks.
dataset · Hugging Face
Good open visual instruction-tuning layer to apply after base vision-language pretraining.
dataset · Hugging Face
Good synthetic multi-turn chat augmentation.
dataset · Hugging Face
Large prompt/source mixture widely used in open instruction tuning.
dataset · Hugging Face
Smaller but clean human-written instruction set.