Category
Base pretraining
Very large, cleaned English web corpus; broad raw coverage for LLM pretraining.
Base pretraining
Hugging Face
dataset
unknown
Local AI / LLM Engineering / RAG
Local AI / LLM Engineering / RAG
Open data
Dataset
manual_review
manual-review
yes
no
nan
A
direct_links_master
direct_links_master, mega_open_hub, website_existing
7edad47ddec51c5a