arXiv Bulk Data

Research corpora and metadata pipelines. Access the data through arXiv's official bulk data guidance.

Resource Metadata

Category: research_metadata
Provider: arxiv.org
Type: dataset
Level: unknown
Topic: general
Track: n/a
Section: n/a
Format: n/a
Status: publishable
Commercial: unknown
Featured: no
Fast start: no
Sequence: n/a
Priority: n/a
Primary source: training_data_stack
Sources: training_data_stack
ID: 4bc651230dedd766

Fallback Access

https://web.archive.org/web/*/https://arxiv.org/help/bulk_data
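
The fallback link above uses the Wayback Machine's wildcard form, which lists every capture of the original page. As a minimal sketch of how that fallback could be used programmatically, the snippet below builds the wildcard URL, a timestamped snapshot URL, and a query URL for the Internet Archive availability endpoint (`https://archive.org/wayback/available`). The function names are hypothetical helpers introduced here for illustration, and the code only constructs URLs and reads an already-fetched JSON response; it makes no network requests itself.

```python
from typing import Optional
from urllib.parse import urlencode

# Original page referenced by the Fallback Access entry.
ORIGINAL_URL = "https://arxiv.org/help/bulk_data"


def wayback_calendar_url(original_url: str) -> str:
    """Wildcard form: lists all captures of original_url."""
    return f"https://web.archive.org/web/*/{original_url}"


def wayback_snapshot_url(original_url: str, timestamp: str) -> str:
    """Timestamped form: resolves to the capture closest to timestamp.

    timestamp is a digit string such as YYYY, YYYYMMDD, or YYYYMMDDhhmmss.
    """
    if not timestamp.isdigit() or not 4 <= len(timestamp) <= 14:
        raise ValueError("timestamp must be 4 to 14 digits, e.g. 20200101")
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


def availability_query_url(original_url: str, timestamp: Optional[str] = None) -> str:
    """Build a query URL for the Wayback availability endpoint."""
    params = {"url": original_url}
    if timestamp is not None:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def closest_snapshot(response: dict) -> Optional[str]:
    """Extract the closest snapshot URL from a decoded availability response.

    Returns None when the response reports no available capture.
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None
```

Fetching `availability_query_url(ORIGINAL_URL)` with any HTTP client and passing the decoded JSON to `closest_snapshot` would give a single archived copy to fall back on, assuming the page has been captured at least once.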

Tags: dataset research

Related Resources

Similar items by topic, tags, and provider (metadata-only).

dataset | info.arxiv.org

arXiv bulk data

info.arxiv.org

Open bulk access for research papers across math, physics, CS, etc.

dataset | arxiv.org

arXiv Bulk Data (S3)

arxiv.org

retrieval_index

dataset | arxiv.org

arXiv CS

arxiv.org

Access: open. Computer science papers.

dataset | arxiv.org

arXiv Quantum Category

arxiv.org

Access: open. Quantum physics papers.

dataset | Foundation | CERN

CERN Open Data - Welcome

CERN

Orientation guide for using CERN Open Data for research and education.

resource | Advanced | OpenAlex / arXiv

OpenAlex + arXiv research stack

OpenAlex / arXiv

Use OpenAlex for metadata and citation graph work and arXiv for current papers and technical preprints.

dataset | Build | physionet.org

PhysioNet

physionet.org

Canonical source for ECG, ICU, waveform, and related biomedical datasets.

dataset | foundation | MIT

MIT OpenCourseWare

MIT

Excellent lecture notes, exams, and videos across advanced technical topics.

dataset | pmc.ncbi.nlm.nih.gov

PMC Open Access Subset

pmc.ncbi.nlm.nih.gov

Millions of full-text biomedical articles under reuse-friendly licenses.

dataset | docs.openalex.org

OpenAlex

docs.openalex.org

CC0 research graph with snapshot updates; ideal for research retrieval and paper routing.

dataset | Mozilla

Common Voice

Mozilla

Large multilingual speech dataset project for ASR, speech research, and voice tooling.

dataset | Hugging Face

Common Pile v0.1

Hugging Face

A strong starting corpus with cleaner licensing: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.