AI Power Progress iA

Common Pile v0.1

The best choice for a legally cleaner starting corpus: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.

base-pretraining continual-pretraining course cpt dataset hugging-face llm-engineering local-ai rag research text training-data

Resource Metadata

Category: Base pretraining
Provider: Hugging Face
Type: dataset
Level: unknown
Topic: Local AI / LLM Engineering / RAG
Track: Local AI / LLM Engineering / RAG
Section: Open data
Format: Dataset
Status: publishable
Commercial: candidate
Featured: yes
Fast start: no
Sequence: nan
Priority: A
Primary source: direct_links_master
Sources: direct_links_master, mega_open_hub
ID: 67f077e36aae6596
