Category
Base pretraining
Best starting corpus with cleaner licensing: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.
Base pretraining
Hugging Face
dataset
unknown
Local AI / LLM Engineering / RAG
Local AI / LLM Engineering / RAG
Open data
Dataset
publishable
candidate
yes
no
nan
A
direct_links_master
direct_links_master, mega_open_hub
67f077e36aae6596