Similar items by topic, tags, and provider (metadata-only).
dataset · aimi.stanford.edu
High-quality AI-ready clinical datasets and dataset index.
dataset · api.semanticscholar.org
Public dataset license is limited to internal, non-commercial research/education use.
dataset · Hugging Face
Massive public dataset hub spanning NLP, code, vision, audio, robotics, and benchmarks.
dataset · Foundation · Hugging Face
Essential for loading, cleaning, streaming, and publishing training / eval data.
dataset · UCI
Direct browser for UCI datasets when you want a clean, filterable dataset list for website linking.
dataset · Hugging Face
Mixed-source replication corpus including CommonCrawl, C4, GitHub, arXiv, Wikipedia, and StackExchange.
dataset · Hugging Face
Curators note they do not own the underlying text and maintain a takedown process.
dataset · Foundation · MIT
Excellent lecture notes, exams, and videos across advanced technical topics.
dataset · UCI
Classic and modern ML datasets that are ideal for education, benchmarking, and tabular experiments.
dataset · OpenML
Open ecosystem for datasets, tasks, models, and runs that helps standardize ML benchmarking.
dataset · Hugging Face
Best multilingual extension of the FineWeb pipeline; very broad language coverage.
dataset · Hugging Face
Huge cleaned English web corpus; best raw breadth for LLM pretraining.