Similar items by topic, tags, and provider (metadata-only).
dataset: archive.org
Access: open. Q&A dumps
dataset: commoncrawl.org
Broad web crawl data; requires careful filtering and deduplication.
dataset: lib.ncsu.edu
Access: open. arXiv metadata/full text access options
repo: github.com
Access: open. Toy conveyor/valve data for acoustic anomaly detection
repo: github.com
Access: open. Scholarly corpus
repo: github.com
Access: open. Embedding benchmark suite; task/dataset licenses vary
repo: github.com
Access: open. MicroPython source/examples
repo: github.com
Access: open. Compiler test suite
repo: github.com
Access: open. Kernel source
repo: github.com
Access: open. Formalized math
repo: github.com
Access: open. Cell painting imaging datasets
repo: github.com
Access: open. Public datasets