Foundation / pretraining
Large corpora and knowledge bases used to train foundation models. Always verify license and dataset terms for your use case.
Datasets for training, evaluation, and retrieval-augmented generation, with link health and fallback mirrors surfaced where available.
High-signal instruction datasets and practical fine-tuning references (LoRA / QLoRA / PEFT).
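The core idea behind LoRA-style fine-tuning can be sketched without any framework: freeze the pretrained weight matrix and train only a low-rank update. A minimal NumPy illustration of the math (not the `peft` API; all names and sizes here are illustrative):

```python
import numpy as np

# Toy LoRA sketch: instead of updating the full weight W (d_out x d_in),
# train two small factors A (r x d_in) and B (d_out x r). The effective
# weight is W + (alpha / r) * B @ A, with far fewer trainable parameters.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                # zero-init: no change at step 0

def lora_forward(x):
    # Base path plus the scaled low-rank update.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs full: {full_params}")
```

Because `B` starts at zero, the adapted model is exactly the base model before training, which is why LoRA can be bolted onto a pretrained checkpoint safely; QLoRA applies the same trick on top of a quantized base model.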
Benchmarks and datasets used to measure retrieval quality (and to train/evaluate embedding models).
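A common retrieval-quality metric these benchmarks report is Recall@k: the fraction of queries whose relevant document appears in the top-k results. A self-contained sketch with toy embeddings (synthetic data, not any particular benchmark):

```python
import numpy as np

# Recall@k sketch: rank documents by cosine similarity of (toy)
# embeddings and check whether each query's relevant doc is in the top k.
rng = np.random.default_rng(1)
n_docs, dim, k = 100, 32, 5

docs = rng.normal(size=(n_docs, dim))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Queries are noisy copies of their relevant doc, so retrieval succeeds.
relevant = np.arange(10)
queries = docs[relevant] + 0.1 * rng.normal(size=(len(relevant), dim))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

sims = queries @ docs.T                  # cosine similarity matrix
topk = np.argsort(-sims, axis=1)[:, :k]  # top-k doc indices per query
hits = [rel in row for rel, row in zip(relevant, topk)]
recall_at_k = sum(hits) / len(hits)
print(f"Recall@{k} = {recall_at_k:.2f}")
```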
Open corpora to ingest for RAG and references for building + evaluating grounded QA workflows.
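The grounded-QA loop these corpora feed can be sketched in a few lines: score ingested passages against a question, pick the best, and build a prompt that constrains the answer to that context. A minimal illustration using bag-of-words overlap as a stand-in for real embeddings (the corpus snippets and scoring function are purely illustrative):

```python
# Minimal RAG retrieval step: rank passages by lexical overlap with the
# question, then assemble a grounded prompt from the top hit.
corpus = [
    "The Pile is an 800GB English text corpus for language model pretraining.",
    "Wikipedia dumps are a common open corpus for RAG ingestion.",
    "BEIR is a heterogeneous benchmark for zero-shot retrieval evaluation.",
]

def score(question, passage):
    # Fraction of question words that also appear in the passage.
    q = set(question.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def retrieve(question, k=1):
    ranked = sorted(corpus, key=lambda p: score(question, p), reverse=True)
    return ranked[:k]

question = "Which corpus is common for RAG ingestion"
context = retrieve(question)[0]
prompt = f"Answer using only this context:\n{context}\n\nQ: {question}\nA:"
print(prompt)
```

A real pipeline swaps the overlap score for an embedding model and a vector index, but the ingest-retrieve-assemble shape is the same, which is why retrieval-corpus quality directly bounds answer quality.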