Similar items by topic, tags, and provider (metadata-only).
together.ai (dataset)
Access: open. Pretraining mix
paperswithcode.com (dataset)
Access: open. Dataset index across ML tasks
Hugging Face (dataset)
Huge cleaned English web corpus; best raw breadth for LLM pretraining.
huggingface.co (dataset)
Access: open (check license). Trivia QA dataset for retrieval + reading comprehension
huggingface.co (dataset)
Access: open. Large code corpus
huggingface.co (dataset)
Access: open (check license). Reading comprehension QA dataset
huggingface.co (dataset)
Benchmarks & dataset references
huggingface.co (dataset)
Access: open. Math reasoning
huggingface.co (dataset)
Access: open (check license). Open-domain QA dataset (long + short answers)
huggingface.co (dataset)
Access: open. Advanced math
huggingface.co (dataset)
Access: research-only. Video-text dataset
huggingface.co (dataset)
Access: open (check license). Multi-hop QA for RAG evaluation