Similar items by topic, tags, and provider (metadata-only).
dataset | Hugging Face
Good open visual instruction tuning layer after base vision-language pretraining.
repo | GitHub
Large collection of NLP tasks with human-readable instructions.
repo | GitHub
Dataset and benchmark for code search and code-language retrieval tasks.
repo | foundation | ggml-org
Core local inference stack for CPU / GPU quantized deployment and experimentation.
dataset | foundation | MIT
Excellent lecture notes, exams, and videos across advanced technical topics.
dataset | Hugging Face
Best multilingual extension of the FineWeb pipeline; very broad language coverage.
dataset | Hugging Face
Huge cleaned English web corpus; best raw breadth for LLM pretraining.
dataset | Hugging Face
Synthetic textbook/blog/WikiHow-style corpus that helps tutor-like explanations.
dataset | Hugging Face
Best starting corpus with cleaner licensing: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
dataset | UCI
Classic and modern ML datasets that are ideal for education, benchmarking, and tabular experiments.
video | Hugging Face
Modern LLM engineering, datasets, transformers, and community tutorials.
dataset | foundation | Hugging Face
Essential for loading, cleaning, streaming, and publishing training / eval data.