AI Power Progress iA

arXiv Bulk Data

Research paper corpora and metadata for data pipelines; access them by following arXiv's official bulk data guidance.

Tags: dataset, research

Resource Metadata

Category: research_metadata
Provider: arxiv.org
Type: dataset
Level: unknown
Topic: general
Track: n/a
Section: n/a
Format: n/a
Status: publishable
Commercial: unknown
Featured: no
Fast start: no
Sequence: n/a
Priority: n/a
Primary source: training_data_stack
Sources: training_data_stack
ID: 4bc651230dedd766

Related Resources

Similar items by topic, tags, and provider (metadata-only).

dataset

arXiv bulk data

info.arxiv.org

Open bulk access for research papers across math, physics, CS, etc.

dataset, Build

PhysioNet

physionet.org

Canonical source for ECG, ICU, waveform, and related biomedical datasets.

dataset

OpenAlex

docs.openalex.org

CC0 research graph with snapshot updates; ideal for research retrieval and paper routing.

dataset

Common Voice

Mozilla

Large multilingual speech dataset project for ASR, speech research, and voice tooling.

dataset

Common Pile v0.1

Hugging Face

Best starting corpus for cleaner licensing: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.