AI Power Progress iA

Common Crawl

Broad web crawl data; requires careful filtering and deduplication before use as training data.
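
As a rough illustration of what "filtering and deduplication" can mean in practice, the sketch below applies a minimum-length filter and exact-duplicate removal to a handful of in-memory text records. It is a minimal example under stated assumptions, not Common Crawl's own tooling: the function names, the `MIN_WORDS` threshold, and the sample records are all hypothetical, and real pipelines usually add language detection, quality heuristics, and near-duplicate detection on top.

```python
import hashlib
import re

MIN_WORDS = 5  # hypothetical threshold; tune per corpus


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_and_dedup(docs, min_words=MIN_WORDS):
    """Yield documents that pass a length filter and are not exact duplicates."""
    seen = set()
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # too short to be useful (e.g. navigation fragments)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # same normalized text was already emitted
        seen.add(digest)
        yield doc


# Hypothetical sample records standing in for extracted page text.
sample = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",
    "Click here",
    "Common Crawl publishes large snapshots of the public web.",
]
kept = list(filter_and_dedup(sample))
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size. This only catches exact matches after normalization; near-duplicates would need a separate technique such as MinHash.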

Tags: ai-training-data dataset open-data

Resource Metadata

Category: base_weights
Provider: commoncrawl.org
Type: dataset
Level: unknown
Topic: AI Training Data
Track: n/a
Section: Open Data Directory
Format: n/a
Status: publishable
Commercial: conditional
Featured: no
Fast start: no
Sequence: n/a
Priority: n/a
Primary source: training_data_stack
Sources: training_data_stack, website_existing
ID: 7140efd5e93222fe

Related Resources

Similar items by topic, tags, and provider (metadata-only).

dataset

Common Voice

Mozilla

Large multilingual speech dataset project for ASR, speech research, and voice tooling.

dataset

Wikimedia Dumps

dumps.wikimedia.org

Strong encyclopedic backbone for general knowledge and factual style.

dataset, Build

PhysioNet

physionet.org

Canonical source for ECG, ICU, waveform, and related biomedical datasets.

dataset, Build

OpenNeuro

openneuro.org

Best open hub for MRI/EEG/MEG/iEEG-style data.

dataset

Open Images V7

storage.googleapis.com

Large supervised vision dataset with labels, boxes, masks, relations, narratives.