Similar items by topic, tags, and provider (metadata-only).
dataset | UCI
Direct browser for UCI datasets when you want a clean, filterable dataset list for website linking.
dataset | foundation | MIT
Excellent lecture notes, exams, and videos across advanced technical topics.
dataset | Hugging Face
Massive public dataset hub spanning NLP, code, vision, audio, robotics, and benchmarks.
dataset | OpenML
Open ecosystem for datasets, tasks, models, and runs that helps standardize ML benchmarking.
dataset | Hugging Face
Huge cleaned English web corpus; best raw breadth for LLM pretraining.
dataset | Hugging Face
Synthetic textbook/blog/WikiHow-style corpus that supports tutor-like explanations.
dataset | Hugging Face
Legally cleaner starting corpus: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
dataset | foundation | openstax.org
Open textbooks across core STEM and humanities subjects.
dataset | Hugging Face
High-quality human assistant trees with ratings; strong open chat data.
dataset | OpenAI
Widely used grade-school math reasoning benchmark for evaluation and small-scale instruction data work.
dataset | Hugging Face
Best multilingual extension of the FineWeb pipeline; very broad language coverage.
repo | GitHub
One of the best open instruction mixtures; includes FLAN, P3, Super-Natural Instructions, and more.