Biological Datasets

Dataset

General

Entrez: several databases from the National Center for Biotechnology Information (NCBI)

Google Dataset Search

Kaggle Bioinformatic Datasets

Therapeutics Data Commons: hosts a number of ML-ready datasets and tasks.

Academic Torrents: hundreds of TB of research data

Catalogue of Life Index of World's Known Species: 2009 Compilation of 66 data sources

Protein

AlphaFold Protein Structure Database

ProteinNet

Phactor: chemical reactions, structures, reagents, etc.

RNA

scRNAseq Dataset Collection

Genomic

Genome Analysis Toolkit: Broad Institute Genome Analysis Toolkit (GATK) best practices and documentation.

Genome Aggregation Database (gnomAD): collaborative effort that aggregates sequencing data from multiple sources, including genome-wide data in addition to exome data. The current version is built against the Genome Reference Consortium (GRC) Human Build 37.

European Genome-phenome Archive (EGA): genetic, phenotypic, and clinical data

KEGG: Kyoto Encyclopedia of Genes and Genomes

Addgene Vector Database

Microsoft Health and Genomics Dataset

Cancer Biology

The Cancer Genome Atlas (TCGA): cancer samples from over 11,000 patients, collected over a 12-year period.

Pan-Cancer Atlas

National Mammography Database

TCIA: archive of medical images of cancer organized as “Collections”, which typically group patients by a common disease (e.g. lung cancer), image modality (MRI, CT, etc.), or research focus. DICOM is the primary file format used by TCIA for image storage.

National Biomedical Imaging Archive (NBIA): a searchable repository of in vivo images that support lesion detection and classification, accelerated diagnostic imaging decisions, and quantitative imaging assessment of drug response

ICGC–TCGA DREAM Genomic Mutation Calling Challenge: identifying cancer-associated mutations and rearrangements in whole-genome sequencing (WGS) data

US CDC WONDER: Cancer Morbidity Data: CDC maintains public statistics on a variety of different diseases, including cancer.

Heart Imaging

EchoNet-LVH: large public echocardiogram dataset with 12k labeled echocardiogram videos and expert annotations

Brain Imaging

OASIS: OASIS-3 is a longitudinal neuroimaging, clinical, cognitive, and biomarker dataset for normal aging and Alzheimer’s Disease.

Mindboggle: labelled macroscopic anatomy in magnetic resonance images of 101 healthy participants (Paper)

MRI

fastMRI Dataset: more than 1,500 fully sampled knee MRIs.

Tractography

TractoInferno Machine Learning Tractography Dataset: multi-site tractography database, including both research- and clinical-like human acquisitions.

Parkinson's

Fox Insight, Parkinson’s Progression Markers Initiative, RNA Sequencing Project, LRRK2, BioFIND, etc. (Michael J. Fox Foundation)

Infectious Diseases

Inspirations and Collections

https://github.com/seandavi/awesome-single-cell: List of software packages (and the people developing these methods) for single-cell data analysis, including RNA-seq, ATAC-seq, etc.


Version History

  • Version history is not maintained for resources, lists, and tools.
  • 06.11.2018 Initial revision.
