AI Power Progress iA

Semantic Scholar full datasets

The public dataset license limits use to internal, non-commercial research and education.

Tags: api-semanticscholar-org dataset research semantic-scholar-full-datasets warning warnings

Resource Metadata

Category: Caution
Provider: api.semanticscholar.org
Type: dataset
Level: unknown
Topic: Local AI / LLM Engineering / RAG
Track: Local AI / LLM Engineering / RAG
Section: Warning
Format: Reference
Status: manual_review
Commercial: avoid-or-link-only
Featured: no
Fast start: no
Sequence: nan
Priority: Warning
Primary source: direct_links_master
Sources: direct_links_master, mega_open_hub
ID: 1174d0593cc678d0

Learning placement: track: Local AI / LLM Engineering / RAG · stage: nan

Related Resources

Similar items by topic, tags, and provider (metadata-only).

dataset · UCI

UCI Datasets

UCI

Direct browser for UCI datasets when you want a clean, filterable dataset list for website linking.

dataset · Hugging Face

RedPajama

Hugging Face

Mixed-source replication corpus including CommonCrawl, C4, GitHub, arXiv, Wikipedia, and StackExchange.

dataset · Hugging Face

OpenWebText

Hugging Face

Curators note they do not own the underlying text and maintain a takedown process.

dataset · Hugging Face

Common Pile v0.1

Hugging Face

The legally cleanest starting corpus: 8 TB of public-domain and openly licensed text spanning books, papers, code, encyclopedias, educational materials, and transcripts.

dataset · OpenML

OpenML Docs

OpenML

Open ecosystem for datasets, tasks, models, and runs that helps standardize ML benchmarking.

dataset · Hugging Face

FineWeb2

Hugging Face

Best multilingual extension of the FineWeb pipeline, with very broad language coverage.

dataset · Hugging Face

FineWeb

Hugging Face

Huge cleaned English web corpus; best raw breadth for LLM pretraining.