LLM Training Data + Open Data Vault

Open datasets for AI, science, and engineering.

Use trusted open datasets to power LLM training, RAG, and research workflows across physics, mathematics, computer science, biomedical engineering, and emerging tech.

Use with onsite AI
Index local files for fast lookup.
Search data before prompting the tutor.
Open Data

ARM Cortex-M Documentation

Arm · All levels · catalog

Visit
Marketing Data

Ahrefs (YouTube)

Ahrefs · All levels · video

Visit
Neuroscience

Allen Brain Atlas

Allen Institute · Intermediate · dataset

Visit
Open Data

Allen Brain Atlas

Allen Institute · All levels · dataset

Visit
Marketing Data

Amazon Customer Reviews (AWS Open Data)

AWS Open Data · Intermediate · dataset

Visit
AI Training Data

Andrej Karpathy (YouTube)

Andrej Karpathy · All levels · video

Visit
Audio Data

Andrej Karpathy (YouTube)

Andrej Karpathy · All levels · video

Visit
Code Data

Andrej Karpathy (YouTube)

Andrej Karpathy · All levels · video

Visit
Data Analysis

Andrej Karpathy (YouTube)

Andrej Karpathy · All levels · video

Visit
Instruction Data

Andrej Karpathy (YouTube)

Andrej Karpathy · All levels · video

Visit
Multimodal Data

Andrej Karpathy (YouTube)

Andrej Karpathy · All levels · video

Visit
Open Data

Andrej Karpathy (YouTube)

Andrej Karpathy · All levels · video

Visit
Tabular Data

Andrej Karpathy (YouTube)

Andrej Karpathy · All levels · video

Visit
Open Data

Anthropic HH-RLHF Dataset

Hugging Face · All levels · dataset

Visit
Open Data

arXiv Bulk Data Access

arXiv · Intermediate · dataset

Visit
Open Data

Autoware Foundation

Autoware · All levels · catalog

Visit
Open Data

Autoware Foundation

Autonomous driving stack.

GitHub · All levels · dataset

Visit
Bioinformatics

Awesome Bioinformatics

GitHub · All levels · catalog

Visit
Programming

Awesome C#

GitHub · All levels · catalog

Visit
Programming

Awesome C++

GitHub · All levels · catalog

Visit
Data Analysis

Awesome Data Science

GitHub · All levels · catalog

Visit
Embedded Systems

Awesome Embedded Systems

GitHub · All levels · catalog

Visit
FPGA/HDL

Awesome FPGA

GitHub · All levels · catalog

Visit
Web Development

Awesome Full Stack

GitHub · All levels · catalog

Visit
Web Development

Awesome JSON

GitHub · All levels · catalog

Visit
Neuroscience

Awesome Neuroscience

GitHub · All levels · catalog

Visit
Robotics

Awesome Open-Source Robotics

GitHub · All levels · catalog

Visit
PCB Design

Awesome PCB

GitHub · All levels · catalog

Visit
Programming

Awesome Python

GitHub · All levels · catalog

Visit
Quantum Computing

Awesome Quantum Computing

GitHub · All levels · catalog

Visit
Robotics

Awesome ROS 2

GitHub · All levels · catalog

Visit
VLSI

Awesome VLSI

GitHub · All levels · catalog

Visit
BCI/EEG

BCI Competition Datasets

BCI Competition · Intermediate · dataset

Visit
Open Data

BCI Competition Datasets

BCI Competition · All levels · dataset

Visit
BCI/Neurotech

BCI Competition Datasets

BCI Competition · Intermediate · dataset

Visit
Open Data

CERN Open Data

CERN · Advanced · dataset

Visit
Computer Vision

COCO Dataset

COCO · Intermediate · dataset

Visit
Research

CORE Open Research

CORE · All levels · catalog

Visit
Marketing Data

CRM Data Architecture

YouTube · All levels · video

Visit
Code Data

CodeSearchNet

GitHub · Intermediate · dataset

Visit
Open Data

Common Crawl

Common Crawl · Advanced · dataset

Visit
AI Training Data

Common Crawl

Common Crawl · Intermediate · dataset

Visit
AI Training Data

Cosmopedia (FineWeb-EDU)

Hugging Face · Intermediate · dataset

Visit
Marketing Data

Criteo Display Advertising Challenge Dataset

Criteo · Advanced · dataset

Visit
Marketing Data

Customer Data Platforms (CDP)

YouTube · All levels · video

Visit
Marketing Data

Data Cleaning and Enrichment

YouTube · All levels · video

Visit
AI Training Data

DeepLearning.AI (YouTube)

DeepLearning.AI · All levels · video

Visit
Audio Data

DeepLearning.AI (YouTube)

DeepLearning.AI · All levels · video

Visit
Code Data

DeepLearning.AI (YouTube)

DeepLearning.AI · All levels · video

Visit
Data Analysis

DeepLearning.AI (YouTube)

DeepLearning.AI · All levels · video

Visit
Instruction Data

DeepLearning.AI (YouTube)

DeepLearning.AI · All levels · video

Visit
Multimodal Data

DeepLearning.AI (YouTube)

DeepLearning.AI · All levels · video

Visit
Open Data

DeepLearning.AI (YouTube)

DeepLearning.AI · All levels · video

Visit
Tabular Data

DeepLearning.AI (YouTube)

DeepLearning.AI · All levels · video

Visit
Research

Directory of Open Access Journals

DOAJ · All levels · catalog

Visit
Open Data

ESP32 Technical Reference Manuals

Espressif · All levels · catalog

Visit
Embedded Systems

Embedded Systems Resources

GitHub · All levels · catalog

Visit
Science

European Open Science Cloud

EOSC · All levels · catalog

Visit
Marketing Data

Event Tracking Design

YouTube · All levels · video

Visit
Open Data

FLAN Instruction Dataset

Hugging Face · All levels · dataset

Visit
Marketing Data

First-Party Data Strategies

YouTube · All levels · video

Visit
Open Data

FreeRTOS Kernel + Docs

FreeRTOS · All levels · catalog

Visit
Data Analysis

GNU Octave

GNU · Foundational · software

Visit
Open Data

GW Open Science Center

GWOSC · Intermediate · dataset

Visit
Open Data

GitHub Code Dataset (codeparrot)

Hugging Face · All levels · dataset

Visit
Marketing Data

Google Ads Transparency Center

Google · Intermediate · dataset

Visit
Marketing Data

Google Analytics (YouTube)

Google Analytics · All levels · video

Visit
Science

Google Dataset Search

Google · All levels · catalog

Visit
Marketing Data

HubSpot (YouTube)

HubSpot · All levels · video

Visit
Open Data

Hugging Face Datasets

Hugging Face · All levels · catalog

Visit
Open Data

IBM Quantum Textbook

IBM · All levels · catalog

Visit
Marketing Data

Identity Resolution

YouTube · All levels · video

Visit
Computer Vision

ImageNet Dataset

ImageNet · Intermediate · dataset

Visit
Data Analysis

Jupyter Notebook

Jupyter · Foundational · software

Visit
Data Analysis

Jupyter Notebook Documentation

Jupyter · Foundational · docs

Visit
Open Data

KITTI Vision Dataset

KITTI · All levels · dataset

Visit
Marketing Data

Kaggle - Marketing Analytics

Kaggle · Intermediate · dataset

Visit
Open Data

Kaggle Datasets

Kaggle · Foundational · catalog

Visit
Open Data

LAION-5B

LAION · Advanced · dataset

Visit
Multimodal Data

LAION-5B

LAION · Advanced · dataset

Visit
Audio Data

LibriSpeech

OpenSLR · Intermediate · dataset

Visit
Data Analysis

MATLAB Onramp

MathWorks · Foundational · interactive

Visit
Open Data

MIT OCW Computer Science Notes

MIT OCW · All levels · catalog

Visit
Open Data

MIT OCW Flight Dynamics Notes

MIT OCW · All levels · catalog

Visit
Open Data

MIT OCW Mathematics Notes

MIT OCW · All levels · catalog

Visit
Open Data

MITRE ATT&CK

MITRE · All levels · dataset

Visit
AI Training Data

MS COCO Dataset

Microsoft · Intermediate · dataset

Visit
Marketing Data

Marketing Data Governance

YouTube · All levels · video

Visit
Marketing Data

Marketing Data Pipelines

YouTube · All levels · video

Visit
Data Analysis

Matplotlib Gallery

Matplotlib · Foundational · docs

Visit
Biomedical

MedNIST Dataset

MONAI · Intermediate · dataset

Visit
Marketing Data

Meta Ads Library

Meta · Intermediate · dataset

Visit
Audio Data

Mozilla Common Voice

Mozilla · Intermediate · dataset

Visit
Science

NASA Earthdata

NASA · All levels · dataset

Visit
Open Data

NASA Lessons Learned (LLIS)

NASA · All levels · catalog

Visit
Science

NASA Open Data Portal

NASA · All levels · dataset

Visit
Open Data

NASA Systems Engineering Handbook

NASA · All levels · reference

Visit
Open Data

NASA Technical Reports Server (NTRS)

NASA · All levels · search

Visit
Bioinformatics

NCBI Genome Data

NCBI · Intermediate · dataset

Visit
Biomedical

NIH 3D Print Exchange

NIH · Intermediate · catalog

Visit
Open Data

NIST Cryptography Publications

NIST · All levels · reference

Visit
Open Data

NIST Zero Trust Architecture

NIST · All levels · reference

Visit
Science

NOAA Data Catalog

NOAA · All levels · dataset

Visit
Open Data

NVD Vulnerability Database

NIST · All levels · catalog

Visit
Open Data

NVIDIA Omniverse

NVIDIA · All levels · catalog

Visit
Open Data

NeuroTechX · awesome-neurotech

Clone repos for open neuroscience texts.

GitHub · All levels · repository

Visit
Neuromorphic Computing

Neuromorphic Computing Resources

GitHub · All levels · catalog

Visit
Data Analysis

NumPy Documentation

NumPy · Foundational · docs

Visit
Open Data

OSSU Computer Science Curriculum

OSSU · All levels · catalog

Visit
Open Data

OWASP Top 10

OWASP · All levels · reference

Visit
Open Data

Open Images Dataset

Google · Intermediate · dataset

Visit
AI Training Data

Open Images Dataset

Google · Intermediate · dataset

Visit
Neuroscience

Open Neuroscience Resources

GitHub · All levels · catalog

Visit
Photonics

Open Photonics Resources

GitHub · All levels · catalog

Visit
Open Data

Open-Source Neuroscience

GitHub · All levels · catalog

Visit
Research

OpenAlex

OpenAlex · All levels · dataset

Visit
AI Training Data

OpenAlex

OpenAlex · Intermediate · dataset

Visit
Instruction Data

OpenAssistant OASST1

OpenAssistant · Intermediate · dataset

Visit
Open Data

OpenAssistant OASST1

Hugging Face · All levels · dataset

Visit
Open Data

OpenCores

OpenCores · All levels · catalog

Visit
Data Analysis

OpenIntro Statistics

OpenIntro · Foundational · book

Visit
Open Data

OpenML

OpenML · All levels · catalog

Visit
Tabular Data

OpenML

OpenML · Intermediate · dataset

Visit
Open Data

OpenML Dataset Repository

OpenML · All levels · catalog

Visit
BCI/Neurotech

OpenNeuro

OpenNeuro · Intermediate · dataset

Visit
Open Data

OpenNeuro Datasets

OpenNeuro · Intermediate · dataset

Visit
Open Data

OpenStax Mathematics Textbooks

OpenStax · All levels · catalog

Visit
Open Data

OpenStax Physics

OpenStax · All levels · catalog

Visit
AI Training Data

OpenWebText Corpus

OpenWebText · Intermediate · dataset

Visit
Open Data

OpenWebText2

OpenWebText · Advanced · dataset

Visit
Data Analysis

Pandas Documentation

Pandas · Intermediate · docs

Visit
Research

Papers with Code

Papers with Code · All levels · catalog

Visit
Biomedical

PhysioNet

PhysioNet · Intermediate · dataset

Visit
Open Data

PhysioNet Biomedical Signals

PhysioNet · All levels · dataset

Visit
Data Analysis

Plotly Python

Plotly · Foundational · docs

Visit
Ethical Hacking

Practical Ethical Hacking

GitHub · Intermediate · catalog

Visit
Open Data

Project Gutenberg

Project Gutenberg · Foundational · dataset

Visit
AI Training Data

Project Gutenberg

Project Gutenberg · Foundational · dataset

Visit
Open Data

ProofWiki Export

ProofWiki · All levels · dataset

Visit
AI Training Data

PubMed Central Open Access Subset

NCBI · Advanced · dataset

Visit
Quantum Computing

Quantum Algorithm Zoo

NIST · Advanced · catalog

Visit
Open Data

Quantum Algorithm Zoo

NIST · All levels · catalog

Visit
Data Analysis

R for Data Science

R4DS · Intermediate · book

Visit
Open Data

ROS Wiki

Open Robotics · All levels · catalog

Visit
Open Data

RedPajama Dataset

Together · Advanced · dataset

Visit
Open Data

Robotics Book Notes

GitHub · All levels · catalog

Visit
Open Data

Robotics Book Notes

Open robotics notes.

GitHub · All levels · dataset

Visit
Open Data

SPIE Course Materials

SPIE · All levels · catalog

Visit
Photonics

SPIE Digital Library

SPIE · Advanced · catalog

Visit
Instruction Data

SQuAD v2.0

Stanford · Intermediate · dataset

Visit
Data Analysis

SciPy Documentation

SciPy · Intermediate · docs

Visit
Research

Semantic Scholar

Semantic Scholar · All levels · search

Visit
AI Training Data

Sentdex (YouTube)

Sentdex · All levels · video

Visit
Audio Data

Sentdex (YouTube)

Sentdex · All levels · video

Visit
Code Data

Sentdex (YouTube)

Sentdex · All levels · video

Visit
Data Analysis

Sentdex (YouTube)

Sentdex · All levels · video

Visit
Instruction Data

Sentdex (YouTube)

Sentdex · All levels · video

Visit
Multimodal Data

Sentdex (YouTube)

Sentdex · All levels · video

Visit
Open Data

Sentdex (YouTube)

Sentdex · All levels · video

Visit
Tabular Data

Sentdex (YouTube)

Sentdex · All levels · video

Visit
Marketing Data

Server-Side Tracking

YouTube · All levels · video

Visit
Open Data

Spacekit Orbital Mechanics Notes

GitHub · All levels · catalog

Visit
AI Training Data

Stack Exchange Data Dump

Stack Exchange · Intermediate · dataset

Visit
Open Data

Stack Exchange Data Dump

Stack Exchange · All levels · dataset

Visit
Open Data

StatQuest Companion Datasets

StatQuest · All levels · dataset

Visit
Open Data

StatQuest Datasets

Clone datasets and notes.

GitHub · All levels · dataset

Visit
Open Data

TUM Vision Datasets

TUM · All levels · dataset

Visit
Open Data

The Pile

EleutherAI · Advanced · dataset

Visit
Open Data

The Pile (EleutherAI)

EleutherAI · All levels · dataset

Visit
Code Data

The Stack (BigCode)

BigCode · Advanced · dataset

Visit
Open Data

TheAlgorithms Repository

GitHub · All levels · catalog

Visit
Open Data

TheAlgorithms · Python

Clone repositories for algorithm explanations.

GitHub · All levels · repository

Visit
Open Data

UCI Machine Learning Repository

UCI · All levels · catalog

Visit
Open Data

UCI Machine Learning Repository

UCI · All levels · catalog

Visit
Marketing Data

UCI Online Retail

UCI · Intermediate · dataset

Visit
Marketing Data

UCI Online Retail II

UCI · Intermediate · dataset

Visit
Marketing Data

UCI Online Shoppers Purchasing Intention

UCI · Intermediate · dataset

Visit
Science

USGS Science Data Catalog

USGS · All levels · dataset

Visit
VLSI

VLSI Academy

VLSI System Design · Intermediate · catalog

Visit
Multimodal Data

WIT (Wikipedia Image-Text)

Google · Intermediate · dataset

Visit
AI Training Data

WikiText-103

Salesforce · Foundational · dataset

Visit
Open Data

Wikimedia Dumps

Wikimedia · Foundational · dataset

Visit
AI Training Data

Wikipedia Dumps

Wikimedia · Foundational · dataset

Visit
Open Data

Wikipedia English Dumps

Wikimedia · All levels · dataset

Visit
Marketing Data

Yelp Open Dataset

Yelp · Intermediate · dataset

Visit
AI Training Data

arXiv Bulk Data (S3)

arXiv · Advanced · dataset

Visit
AI Training Data

arXiv Bulk Data Access

arXiv · Advanced · dataset

Visit
Open Data

arXiv Mathematics Archive

arXiv · All levels · catalog

Visit
Open Data

arXiv Quantum Physics Archive

arXiv · All levels · catalog

Visit
Biomedical

edX - Biomedical Engineering

edX · Foundational · catalog

Visit
Bioengineering

edX Bioengineering Courses

edX · Foundational · catalog

Visit

Related hubs

Pair datasets with course material, analysis tools, and the evidence timeline to build high-quality RAG and training pipelines.

Local training data (RAG-ready)

Store datasets and notes under `/mnt/pp_data/training_data` or `/home/powerprogress` and index them here. The index helps your onsite AI find relevant files fast.
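
As a rough illustration of what indexing a local folder can look like, here is a minimal Python sketch of an inverted index over text files. It is not the site's actual indexer: the function names `tokenize` and `build_index`, the `TEXT_SUFFIXES` set, and the choice of a simple token-to-files map are all assumptions made for this example.

```python
from collections import defaultdict
from pathlib import Path
import re

# Assumption: only these file types are treated as indexable text.
TEXT_SUFFIXES = {".txt", ".md", ".csv", ".json"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(root: str) -> dict[str, set[str]]:
    """Map each token to the set of files (paths relative to root) containing it."""
    base = Path(root)
    if not base.is_dir():
        # A missing directory is reported rather than silently skipped.
        raise FileNotFoundError(f"path does not exist yet: {root}")
    index: dict[str, set[str]] = defaultdict(set)
    for path in sorted(base.rglob("*")):
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for token in set(tokenize(text)):
                index[token].add(str(path.relative_to(base)))
    return dict(index)
```

Under these assumptions, `build_index("/mnt/pp_data/training_data")` would return a dictionary whose keys are tokens and whose values are the relative paths of the files that mention them. Looking up a token is then a single dictionary access, which is what makes an index faster than rescanning every file.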

Index local files

Tip: a 404 means the path does not exist yet.

Search indexed data


How the onsite AI uses this data

RAG workflow: index your local files, search for relevant entries, then paste the references into AI Tutor prompts.
Training prep: use open data catalogs to build safe, licensed datasets.
Compliance: only use data you are allowed to store and process.
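
The search-then-paste step of the RAG workflow can be sketched in a few lines of Python. This is a simplified stand-in, not the onsite AI's implementation: the in-memory `docs` dictionary, the sample file names and contents, the term-overlap scoring, and the prompt layout in `build_prompt` are all assumptions chosen to keep the example self-contained.

```python
import re


def tokenize(text: str) -> set[str]:
    """Return the distinct lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(docs: dict[str, str], query: str, top_k: int = 3) -> list[tuple[str, int]]:
    """Rank documents by how many distinct query terms they contain."""
    terms = tokenize(query)
    scored = [(path, len(terms & tokenize(text))) for path, text in docs.items()]
    hits = [(path, score) for path, score in scored if score > 0]
    # Highest score first; ties are broken alphabetically by path.
    return sorted(hits, key=lambda item: (-item[1], item[0]))[:top_k]


def build_prompt(question: str, docs: dict[str, str], top_k: int = 3) -> str:
    """Prepend the best-matching passages to the question as numbered references."""
    hits = search(docs, question, top_k)
    refs = [f"[{i}] {path}: {docs[path]}" for i, (path, _) in enumerate(hits, start=1)]
    context = "\n".join(refs) if refs else "(no matching local files)"
    return f"References:\n{context}\n\nQuestion: {question}"


# Hypothetical local notes, standing in for indexed files.
docs = {
    "notes/gw.txt": "LIGO strain data measures gravitational waves.",
    "notes/eeg.txt": "EEG recordings for motor imagery experiments.",
}
print(build_prompt("Where is gravitational strain data?", docs))
```

Counting shared terms is the crudest possible relevance score. A real pipeline would more likely use TF-IDF, BM25, or embedding similarity, but the shape of the workflow (retrieve, then attach references to the prompt) stays the same.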
AI Tutor
Ask a question to get guidance.
AI Assist
Use this to augment any workflow on the page.