Similar items by topic, tags, and provider (metadata-only).
course
MIT
Excellent educational content, but generally licensed for noncommercial use only (CC BY-NC-SA).
dataset
Hugging Face
Synthetic corpus in textbook, blog, and WikiHow styles that supports tutor-like explanations.
dataset
Hugging Face
Best starting corpus when cleaner licensing matters: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
dataset
Hugging Face
Huge cleaned English web corpus; best raw breadth for LLM pretraining.
dataset
UCI
Classic and modern ML datasets that are ideal for education, benchmarking, and tabular experiments.
dataset
Hugging Face
Best multilingual extension of the FineWeb pipeline; very broad language coverage.
dataset
Hugging Face
Massive public dataset hub spanning NLP, code, vision, audio, robotics, and benchmarks.
dataset (Foundation)
Hugging Face
Essential for loading, cleaning, streaming, and publishing training / eval data.
dataset (foundation)
openstax.org
Open textbooks across core STEM and humanities subjects.
dataset
Hugging Face
High-quality human-assistant conversation trees with ratings; strong open chat data.
dataset
oercommons.org
Broad public digital library of open educational resources.
dataset
Hugging Face
Crosslingual prompt pool across many languages and tasks.