Similar items by topic, tags, and provider (metadata-only).
Dataset · Foundation · Hugging Face
Essential for loading, cleaning, streaming, and publishing training / eval data.
Video · Hugging Face
Modern LLM engineering, datasets, transformers, and community tutorials.
Dataset · UCI
Direct browser for UCI datasets when you want a clean, filterable dataset list for website linking.
Dataset · Hugging Face
Huge cleaned English web corpus; best raw breadth for LLM pretraining.
Dataset · Hugging Face
Best multilingual extension of the FineWeb pipeline; very broad language coverage.
Dataset · Hugging Face
Synthetic textbook/blog/WikiHow-style corpus that helps tutor-like explanations.
Dataset · Hugging Face
Legally cleanest starting corpus: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
Dataset · Hugging Face
High-quality human assistant trees with ratings; strong open chat data.
Dataset · Hugging Face
Cross-lingual prompt pool spanning many languages and tasks.
Dataset · Hugging Face
Good open visual instruction-tuning layer to apply after base vision-language pretraining.
Dataset · Hugging Face
Good synthetic multi-turn chat augmentation.
Dataset · Hugging Face
Large prompt/source mixture widely used in open instruction tuning.