Similar items by topic, tags, and provider (metadata-only).
dataset | Hugging Face
Curators note they do not own the underlying text and maintain a takedown process.
dataset | Hugging Face
Massive public dataset hub spanning NLP, code, vision, audio, robotics, and benchmarks.
dataset | Hugging Face
Best multilingual extension of the FineWeb pipeline; very broad language coverage.
dataset | Hugging Face
Huge cleaned English web corpus; best raw breadth for LLM pretraining.
dataset | Hugging Face
Synthetic textbook/blog/WikiHow-style corpus that helps tutor-like explanations.
dataset | Hugging Face
Best starting corpus for cleaner licensing: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
dataset | Foundation | Hugging Face
Essential for loading, cleaning, streaming, and publishing training and evaluation data.
dataset | aimi.stanford.edu
Excellent medical data, but shared data is broadly non-commercial.
dataset | api.semanticscholar.org
Public dataset license is limited to internal, non-commercial research/education use.
dataset | Hugging Face
Crosslingual prompt pool across many languages and tasks.
dataset | Hugging Face
Good open visual instruction tuning layer after base vision-language pretraining.
dataset | Hugging Face
Good synthetic multi-turn chat augmentation.