Category
Caution
Mixed-source replication corpus including CommonCrawl, C4, GitHub, arXiv, Wikipedia, and StackExchange.
Caution
Hugging Face
dataset
unknown
Local AI / LLM Engineering / RAG
Local AI / LLM Engineering / RAG
Warning
Reference
publishable
avoid-or-link-only
no
no
nan
Warning
direct_links_master
direct_links_master, mega_open_hub, training_data_stack
6bf6e3b4f4ec0bf7