FineWeb

Large, cleaned and deduplicated English web corpus built from Common Crawl; offers broad raw coverage for LLM pretraining.

ai base-pretraining continual-pretraining cpt dataset hugging-face llm-engineering local-ai open-data rag text training-data

Resource Metadata

Category: Base pretraining
Provider: Hugging Face
Type: dataset
Level: unknown
Topic: Local AI / LLM Engineering / RAG
Track: Local AI / LLM Engineering / RAG
Section: Open data
Format: Dataset
Status: manual_review
Commercial: manual-review
Featured: yes
Fast start: no
Sequence: nan
Priority: A
Primary source: direct_links_master
Sources: direct_links_master, mega_open_hub, website_existing
ID: 7edad47ddec51c5a