Free Datasets
A curated catalog of high-quality, freely available datasets for machine learning, data science, and analytics.
Categories
| Category | Examples | Count |
|---|---|---|
| General Purpose | Iris, Titanic, MNIST, CIFAR-10, ImageNet subsets | 15+ |
| Computer Vision | COCO, Open Images, PASCAL VOC, CelebA, LFW | 12+ |
| Natural Language | Wikipedia dumps, Common Crawl, BookCorpus, SQuAD | 15+ |
| Tabular & Structured | UCI ML Repository, Kaggle datasets, Census data | 12+ |
| Audio & Speech | LibriSpeech, Common Voice, AudioSet, VoxCeleb | 10+ |
| Time Series | Stock prices, weather, energy, IoT sensor data | 10+ |
| Geospatial | OpenStreetMap, satellite imagery, climate data | 8+ |
| Healthcare & Bio | MIMIC, PhysioNet, PubMed, protein structures | 10+ |
| Government & Public | US Census, EU Open Data, World Bank, UN data | 12+ |
Dataset Registries
The following platforms index thousands of additional datasets:
| Platform | URL | Notes |
|---|---|---|
| Hugging Face Datasets | huggingface.co/datasets | 100k+ datasets; download with the `datasets` Python library |
| Kaggle | kaggle.com/datasets | 50k+ datasets, requires free account |
| Google Dataset Search | datasetsearch.research.google.com | Search engine for datasets across the web |
| UCI ML Repository | archive.ics.uci.edu | Classic ML datasets, well-documented |
| Papers With Code | paperswithcode.com/datasets | Datasets linked to research papers |
| AWS Open Data | registry.opendata.aws | Large-scale datasets hosted on S3 |
| GitHub Awesome Lists | github.com/awesomedata/awesome-public-datasets | Community-curated list |
| data.gov | data.gov | US government open data |
| EU Open Data | data.europa.eu | European Union open data |
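
As a rough illustration of the "download with the `datasets` library" note in the Hugging Face row, the sketch below previews the first few rows of a Hub dataset. It assumes the `datasets` package is installed (`pip install datasets`) and that network access is available. The helper names `split_slice` and `load_preview` are hypothetical, introduced only for this example, and any dataset identifier you pass in is your own choice.

```python
def split_slice(split: str, n_rows: int) -> str:
    """Build a split expression such as "train[:5]" selecting the first n_rows rows."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return f"{split}[:{n_rows}]"


def load_preview(dataset_id: str, split: str = "train", n_rows: int = 5):
    """Return the first n_rows rows of a Hugging Face Hub dataset (hypothetical helper)."""
    # Imported lazily so this file can be imported without the dependency installed.
    from datasets import load_dataset

    # load_dataset accepts slice expressions in the split argument.
    return load_dataset(dataset_id, split=split_slice(split, n_rows))


# Example call (needs network access; the identifier is an assumption):
# preview = load_preview("rajpurkar/squad", n_rows=3)
# print(preview[0])
```

The slice limits which rows are returned; the underlying files may still be downloaded in full, so very large datasets may be better served by the library's streaming mode.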
Metadata Standard
Every dataset entry in our catalog includes the following metadata:
- Name and description
- Size (rows, columns, file size)
- Format (CSV, JSON, Parquet, images, etc.)
- License (CC0, CC-BY, MIT, etc.)
- Direct link to download
- Browser-compatible flag (whether the dataset can be loaded in our tools)
- Citation (BibTeX where available)
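
The fields listed above could be modeled as a small record type. The following is a minimal sketch, not the catalog's actual schema: the class name `DatasetRecord`, every field name, and all example values (including the `example.com` URL) are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetRecord:
    """One catalog entry; field names are illustrative, not an official schema."""

    name: str
    description: str
    rows: int
    columns: int
    file_size_bytes: int
    file_format: str                        # e.g. "CSV", "JSON", "Parquet", "images"
    license: str                            # e.g. "CC0", "CC-BY", "MIT"
    download_url: str                       # direct link to download
    browser_compatible: bool                # can it be loaded in the browser tools?
    citation_bibtex: Optional[str] = None   # BibTeX where available

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that are empty or blank."""
        required = ("name", "description", "file_format", "license", "download_url")
        return [field for field in required if not getattr(self, field).strip()]

    def to_json(self) -> str:
        """Serialize the record to a JSON string."""
        return json.dumps(asdict(self), indent=2)


# Placeholder values only; this is not a real catalog entry.
example = DatasetRecord(
    name="example-dataset",
    description="Placeholder entry used to illustrate the metadata fields.",
    rows=1000,
    columns=12,
    file_size_bytes=48_000,
    file_format="CSV",
    license="CC0",
    download_url="https://example.com/example-dataset.csv",
    browser_compatible=True,
)

print(example.to_json())
```

Keeping the citation optional mirrors the "where available" wording, while `missing_fields` offers one simple way to flag incomplete entries before they are published.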