Similar items by topic, tags, and provider (metadata-only).
dataset · OpenAI
Widely used grade-school math reasoning benchmark for evaluation and small-scale instruction data work.
dataset · foundation · MIT
Excellent lecture notes, exams, and videos across advanced technical topics.
dataset · UCI
Classic and modern ML datasets that are ideal for education, benchmarking, and tabular experiments.
dataset · OpenML
Open ecosystem for datasets, tasks, models, and runs that helps standardize ML benchmarking.
dataset · Hugging Face
Massive public dataset hub spanning NLP, code, vision, audio, robotics, and benchmarks.
dataset · Hugging Face
Best multilingual extension of the FineWeb pipeline, with very broad language coverage.
dataset · Hugging Face
Huge cleaned English web corpus; best raw breadth for LLM pretraining.
dataset · Hugging Face
Synthetic textbook/blog/WikiHow-style corpus that helps tutor-like explanations.
dataset · Hugging Face
Best starting corpus with cleaner licensing: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
dataset · Hugging Face
Crosslingual prompt pool across many languages and tasks.
dataset · Hugging Face
Good open visual instruction-tuning layer to apply after base vision-language pretraining.
dataset · Hugging Face
Good synthetic multi-turn chat augmentation.