Healthcare & Bio Datasets
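Datasets for medical AI, bioinformatics, drug discovery, and clinical research. Most are free to download, but several require credentialing or an approved application; check the License column before you start.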
Ethical use
Healthcare datasets may contain sensitive information. Always read the data use agreement before downloading, and follow your institution's ethical guidelines.
Clinical
| Dataset | Records | Size | License | Link |
|---|---|---|---|---|
| MIMIC-IV | 300K+ patients | 7 GB | PhysioNet credentialed | physionet.org |
| eICU | 200K+ ICU stays | 3 GB | PhysioNet credentialed | physionet.org |
| PhysioNet | 80+ datasets | Varies | Various open | physionet.org |
| NHANES | 150K+ participants | 500 MB | Open | cdc.gov |
| UK Biobank | 500K participants | Petabytes | Application required | ukbiobank.ac.uk |
Genomics & Proteomics
| Dataset | Records | Size | License | Link |
|---|---|---|---|---|
| UniProt | 250M+ proteins | 120 GB | CC-BY-4.0 | uniprot.org |
| PDB | 200K+ structures | 50 GB | CC0 | rcsb.org |
| NCBI GenBank | 230M+ sequences | Terabytes | Open | ncbi.nlm.nih.gov |
| 1000 Genomes | 3,202 genomes | 800 TB | Fort Lauderdale Agreement | internationalgenome.org |
| AlphaFold DB | 200M+ predictions | 23 TB | CC-BY-4.0 | alphafold.ebi.ac.uk |
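
As a rough illustration of programmatic access, the sketch below builds a UniProt REST URL for a single entry in FASTA format and parses the response. It assumes the `https://rest.uniprot.org/uniprotkb/<accession>.fasta` endpoint; the helper names (`uniprot_fasta_url`, `parse_fasta`, `fetch_fasta`) are hypothetical and not part of any UniProt client library.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the UniProtKB REST endpoint.
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"


def uniprot_fasta_url(accession: str) -> str:
    """Build the URL for one UniProtKB entry in FASTA format (hypothetical helper)."""
    return f"{UNIPROT_REST}/{quote(accession, safe='')}.fasta"


def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are wrapped; concatenate them under the current header.
            records[header] += line
    return records


def fetch_fasta(accession: str, timeout: float = 30.0) -> dict:
    """Download and parse one entry. Requires network access."""
    with urlopen(uniprot_fasta_url(accession), timeout=timeout) as resp:
        return parse_fasta(resp.read().decode("utf-8"))
```

For example, `fetch_fasta("P69905")` would request the entry with accession P69905. For bulk work, the full downloads listed in the table are likely a better fit than per-entry requests.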
Drug Discovery
| Dataset | Compounds | Size | License | Link |
|---|---|---|---|---|
| ChEMBL | 2.4M compounds | 4 GB | CC-BY-SA-3.0 | ebi.ac.uk/chembl |
| PubChem | 116M+ compounds | 50 GB | Open | pubchem.ncbi.nlm.nih.gov |
| ZINC | 230M+ compounds | 300 GB | Free academic | zinc.docking.org |
| DrugBank | 14K+ drugs | 500 MB | CC-BY-NC-4.0 | drugbank.com |
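
For small lookups, PubChem can be queried over its PUG REST interface instead of downloading the full dump. The sketch below is a minimal example under that assumption: it looks up compound properties by name, and the helper names (`compound_property_url`, `fetch_properties`) are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the PubChem PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(name, properties):
    """Build a PUG REST URL that looks up properties by compound name (hypothetical helper)."""
    props = ",".join(properties)
    # Percent-encode the name so spaces and slashes are safe in the URL path.
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/property/{props}/JSON"


def fetch_properties(name, properties, timeout=30.0):
    """Return the list of property records for a compound. Requires network access."""
    with urlopen(compound_property_url(name, properties), timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumes the JSON response nests records under PropertyTable -> Properties.
    return payload["PropertyTable"]["Properties"]
```

A call such as `fetch_properties("aspirin", ["MolecularFormula", "MolecularWeight"])` would return one record per matching compound. Keep request rates low and check PubChem's usage policy before scripting large batches.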
Medical Literature
| Dataset | Records | Size | License | Link |
|---|---|---|---|---|
| PubMed | 36M+ articles | Metadata: 50 GB | Open | pubmed.ncbi.nlm.nih.gov |
| PMC Open Access | 8.5M+ full-text | 400 GB | CC variants | ncbi.nlm.nih.gov/pmc |
| CORD-19 | 1M+ COVID papers | 12 GB | Various | semanticscholar.org |
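
PubMed metadata can also be searched without a bulk download. The following sketch assumes the NCBI E-utilities `esearch.fcgi` endpoint with JSON output; the helper names (`esearch_url`, `search_pmids`) are hypothetical, and the query term is only an example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the NCBI E-utilities service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20, db="pubmed"):
    """Build an ESearch URL that returns matching record IDs as JSON (hypothetical helper)."""
    query = urlencode({"db": db, "term": term, "retmode": "json", "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def search_pmids(term, retmax=20, timeout=30.0):
    """Return up to `retmax` PubMed IDs for a query. Requires network access."""
    with urlopen(esearch_url(term, retmax), timeout=timeout) as resp:
        payload = json.load(resp)
    # Assumes the JSON response nests IDs under esearchresult -> idlist.
    return payload["esearchresult"]["idlist"]
```

For instance, `search_pmids("sepsis AND icu", retmax=5)` would return a short list of PMIDs as strings. NCBI rate-limits unauthenticated requests, so heavier use may call for an API key or the bulk files listed in the table.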