Category
Base pretraining
Best multilingual extension of the FineWeb pipeline; very broad language coverage.
Base pretraining
Hugging Face
dataset
unknown
Local AI / LLM Engineering / RAG
Local AI / LLM Engineering / RAG
Open data
Dataset
manual_review
manual-review
yes
no
nan
A
direct_links_master
direct_links_master, mega_open_hub, training_data_stack
6e067ab1eda68fa5