Similar items by topic, tags, and provider (metadata-only).
repo, GitHub
One of the best open instruction mixtures; includes FLAN, P3, Super-Natural Instructions, and more.
repo, Foundation, ggml-org
Core local inference stack for CPU / GPU quantized deployment and experimentation.
repo, GitHub
Dataset and benchmark for code search and code-language retrieval tasks.
dataset, foundation, MIT
Excellent lecture notes, exams, and videos across advanced technical topics.
dataset, Hugging Face
Best multilingual extension of the FineWeb pipeline; very broad language coverage.
dataset, Hugging Face
Huge cleaned English web corpus; best raw breadth for LLM pretraining.
dataset, Hugging Face
Synthetic textbook/blog/WikiHow-style corpus that helps tutor-like explanations.
dataset, Hugging Face
Best starting corpus from a licensing standpoint: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
dataset, Hugging Face
Large prompt/source mixture widely used in open instruction tuning.
dataset, Hugging Face
Crosslingual prompt pool across many languages and tasks.
dataset, Hugging Face
High-quality human-assistant conversation trees with ratings; strong open chat data.
dataset, Hugging Face
Good synthetic multi-turn chat augmentation.