Biological Datasets
Dataset
General
Entrez: several databases from the National Center for Biotechnology Information (NCBI)
Therapeutics Data Commons: a collection of ML-ready datasets and tasks
Academic Torrents: hundreds of TB of research data
Catalogue of Life Index of World's Known Species: 2009 Compilation of 66 data sources
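As a rough illustration of how the Entrez databases listed above can be queried programmatically, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint. The helper name `esearch_url` and the example query are hypothetical; the code only constructs the URL and makes no network call.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL (hypothetical helper, not part of any library).

    db     -- Entrez database name, e.g. "pubmed" or "protein"
    term   -- search expression in Entrez query syntax
    retmax -- maximum number of record IDs to return
    """
    query = urlencode(
        {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    )
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# Example query (an assumption for illustration only).
example_url = esearch_url("pubmed", "BRCA1 AND human")
```

The resulting URL can be fetched with any HTTP client. NCBI asks heavy users to rate-limit their requests and to identify themselves, so check the E-utilities usage guidelines before running this in bulk.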
Protein
AlphaFold Protein Structure Database
Phactor: chemical reactions, structures, reagents, etc.
RNA
Genomic
Genome Analysis Toolkit: Broad Institute Genome Analysis Toolkit (GATK) best practices and documentation.
Genome Aggregation Database (gnomAD): collaborative effort that aggregates data from multiple sequencing projects, including genome-wide data in addition to exome data. The current version is built against the Genome Reference Consortium (GRC) Human Build 37.
European Genome-phenome Archive (EGA): genetic, phenotypic, and clinical data
KEGG: Kyoto Encyclopedia of Genes and Genomes
Microsoft Health and Genomics Dataset
Cancer Biology
The Cancer Genome Atlas (TCGA): cancer samples from over 11,000 patients collected over a 12-year period.
TCIA: The Cancer Imaging Archive, an archive of medical images of cancer organized as “Collections”, typically grouping patients by a common disease (e.g. lung cancer), image modality (MRI, CT, etc.), or research focus. DICOM is the primary file format used by TCIA for image storage.
National Biomedical Imaging Archive (NBIA): a searchable repository of in vivo images that support lesion detection and classification, accelerated diagnostic imaging decisions, and quantitative imaging assessment of drug response.
ICGC–TCGA DREAM Genomic Mutation Calling Challenge: a challenge for identifying cancer-associated mutations and rearrangements in whole-genome sequencing (WGS) data.
US CDC WONDER: Cancer Morbidity Data: CDC maintains public statistics on a variety of different diseases, including cancer.
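Since TCIA stores its images as DICOM, a quick sanity check after downloading a collection is to confirm which files are DICOM at all. The sketch below relies on the DICOM Part 10 file layout, a 128-byte preamble followed by the four bytes `DICM`. The function names are hypothetical, and the check is only a heuristic: some legacy DICOM files omit the preamble and will not be detected.

```python
from pathlib import Path

DICOM_PREAMBLE_LEN = 128   # bytes of preamble before the marker
DICOM_MAGIC = b"DICM"      # marker that follows the preamble


def looks_like_dicom(path) -> bool:
    """Return True if the file carries the 'DICM' marker at byte offset 128."""
    with open(path, "rb") as fh:
        header = fh.read(DICOM_PREAMBLE_LEN + len(DICOM_MAGIC))
    # Files shorter than 132 bytes yield a short slice and compare unequal.
    return header[DICOM_PREAMBLE_LEN:] == DICOM_MAGIC


def find_dicom_files(root):
    """Recursively list files under root that pass the marker check."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and looks_like_dicom(p)
    )
```

Files that pass this check can then be handed to a full DICOM parser for metadata and pixel data.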
Heart Imaging
EchoNet-LVH: large public echocardiogram dataset with 12k labeled echocardiogram videos and expert annotations
Brain Imaging
OASIS: OASIS-3 is a longitudinal neuroimaging, clinical, cognitive, and biomarker dataset for normal aging and Alzheimer’s Disease.
Mindboggle: labelled macroscopic anatomy in magnetic resonance images of 101 healthy participants (Paper)
MRI
fastMRI Dataset: more than 1,500 fully sampled knee MRIs.
Tractography
TractoInferno Machine Learning Tractography Dataset: multi-site tractography database, including both research- and clinical-like human acquisitions.
Parkinson's
Infectious Diseases
Inspirations and Collections
https://github.com/seandavi/awesome-single-cell: List of software packages (and the people developing these methods) for single-cell data analysis, including RNA-seq, ATAC-seq, etc.
Version History
- Version history is not maintained for resources, lists, and tools.
- 06.11.2018 Initial revision.