Computational Biology Datasets Suitable For Machine Learning
July 30, 2024

Machine learning (ML) is a rapidly evolving subfield of artificial intelligence that focuses on developing algorithms and models that allow computers to learn from data without being explicitly programmed. The field has progressed significantly since its origins in the 1950s, with major advances in neural networks, deep learning, and applications across diverse domains. ML encompasses several learning approaches, including supervised learning for tasks like classification and regression, unsupervised learning for discovering patterns in unlabeled data, and reinforcement learning for decision-making through trial and error. Key algorithms and techniques include decision trees, support vector machines, neural networks, and ensemble methods.

In computational biology, ML has found numerous applications, from analyzing genomic and proteomic data to drug discovery and development. It plays a crucial role in tasks such as gene expression analysis, protein structure prediction, metabolic pathway modeling, and integrative multi-omics approaches. Computational biology itself is an interdisciplinary field that combines biology, computer science, mathematics, and statistics to analyze and interpret biological data. It relies heavily on bioinformatics tools and databases, and employs computational techniques such as sequence alignment and phylogenetic analysis. As the field advances, it faces important ethical and regulatory considerations, particularly regarding data privacy and the implications of genomic research and personalized medicine.

This is a curated list of computational biology datasets that have been pre-processed for machine learning. The list is a work in progress; please submit a pull request for any dataset you would like to advertise!

Name | Description | Comments |
---|---|---|
The Cancer Genome Atlas | Variety of Cancer Data | most cancer types have 100-1000 samples |
NIH GDC | Cancer, many types of genomic data | |
UK Biobank | ||
European Genome-Phenome Archive | ||
METABRIC | The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers. | |
HapMap | ||
23andMe | 2280 Public Domain Curated Genotypes | |
Mice | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure out of the data. |
Arabidopsis | SNPs, 100+ phenotypes |
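
The comment on the mouse SNP data above suggests that family structure might be learnable from genotypes. One simple starting point is pairwise identity-by-state (IBS) similarity, since close relatives tend to share more alleles than unrelated individuals. The sketch below assumes genotypes are already coded as 0/1/2 allele counts per SNP; the function names and the toy genotypes are hypothetical and are not taken from the dataset. Real pedigree inference would need far more care, for example handling missing calls and population structure.

```python
from itertools import combinations


def ibs_similarity(a, b):
    """Mean identity-by-state between two genotype vectors coded 0/1/2.

    Each SNP contributes 2 shared alleles when the genotypes match,
    1 when they differ by one allele, and 0 for opposite homozygotes.
    """
    if len(a) != len(b) or not a:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    shared = sum(2 - abs(x - y) for x, y in zip(a, b))
    return shared / (2 * len(a))


def rank_pairs(genotypes):
    """Return ((sample, sample), score) pairs, most similar first."""
    scores = {
        (s, t): ibs_similarity(genotypes[s], genotypes[t])
        for s, t in combinations(sorted(genotypes), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy genotypes for illustration only.
toy = {
    "parent": [0, 1, 2, 1, 0, 2],
    "child": [0, 1, 1, 1, 0, 2],
    "unrelated": [2, 0, 0, 2, 2, 0],
}

for pair, score in rank_pairs(toy):
    print(pair, round(score, 3))
```

Pairs with unusually high IBS are candidates for close relatives; a clustering or graph-based method could then be applied to the full similarity matrix.
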
Name | Description | Comments |
---|---|---|
TargetFinder | ~100,000 DNA-DNA interaction pairs |
Name | Description | Comments |
---|---|---|
GEO | Main place for NCBI data | |
ENCODE | Variety of assays to identify functional elements | |
ArrayExpress | DNA sequencing, gene/protein expression, epigenetics | |
Cytometry Continuous | Flow cytometry data on 11 proteins and phospholipids; discretized and cleaned data available offline | Classical benchmark dataset for learning graphical models; contains known errors |
Transcription factor binding | ChIP-Seq data on 12 TFs | |
GTEx | Landmark study for eQTL analysis | |
PharmacoGenomics DB | ||
ProteomeXChange | ||
BeatAML | whole-exome sequencing, RNA sequencing and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients |
Name | Description | Comments |
---|---|---|
Single-cell expression atlas | ||
scPerturb | single-cell perturbation-response datasets | harmonized and preprocessed across 44 original datasets |
Name | Description | Comments |
---|---|---|
TRRUST | manually curated database of human transcriptional regulatory network | |
Yeast Network | 23-million yeast 2-hybrid experiments to investigate genetic interactions | |
Perturb-Seq | Integrated model of perturbations, single cell phenotypes, and epistatic interactions | |
KEGG Metabolic Regulatory Network (Undirected) | 65554 instances, 29 attributes each | |
KEGG Metabolic Regulatory Network (Directed) | 53414 instances, 24 attributes each |
Images
Name | Description | Comments |
---|---|---|
The Cancer Imaging Archive | Extracts the images from the TCGA data | |
Multiple Myeloma DREAM Challenge | Challenge to identify Multiple Myeloma Patients | |
Breast Cancer Wisconsin (Diagnostic) Data Set | Predict whether the cancer is benign or malignant | |
DDSM | Mammogram Database | |
Kaggle Soft Tissue Sarcomas | Preprocessed subset of the TCIA study “Soft Tissue Sarcoma” | segmentation task |
Kaggle Cervical Cancer Screening | Classify cervix type from images | |
CAMELYON17 | Pathology challenge – automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections | |
Grand Challenges | Datasets from biomedical image analysis competitions | |
Breast Cancer MRI Dataset | Demographic, clinical, pathology, treatment, outcomes, and genomic data + MRI images |
Name | Description | Comments |
---|---|---|
ENIGMA Cerebellum | Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction | |
Seizure Prediction | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure). |
Name | Description | Comments |
---|---|---|
MIMIC | 59,000 EHRs | |
UCI Diabetes | Data from 130 US hospitals, 1999-2008 | |
i2b2 | Clinical notes only, designed for NLP tasks | |
PhysioNet | ||
Metadata Acquired from Clinical Case Reports (MACCRs) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases | |
eICU | 200k EHRs | |
All of Us | >250k EHRs, some genomic data | |
PMC-Patients | 167k patient summaries with 3.1 M patient-article relevance annotations and 293k patient-patient similarity annotations |
Name | Description | Comments |
---|---|---|
CheXpert | 200k chest radiographs | Competition and leaderboard associated |
MIMIC-CXR | ~400k chest x-rays, 14 labels | Data on PhysioNet |
PadChest | 160k chest x-rays, 174 different findings |
Name | Description | Comments |
---|---|---|
HINT (High-quality INTeractomes) | curated compilation of high-quality protein-protein interactions from 8 interactome resources |
Name | Description | Comments |
---|---|---|
National Population Health Survey | Longitudinal survey that collects health information every two years. |
Name | Description | Comments |
---|---|---|
ProteinNet | Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. |
Name | Description | Comments |
---|---|---|
BioASQ | Abstracts of medical articles (from PubMed); ontologies of medical concepts. | Tasks: MLC, QA. |
Cases | Articles from medical case studies. | |
UPMC Pathology | UPMC Pathology case studies. |
Name | Description | Comments |
---|---|---|
Therapeutic Data Commons | Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing. | Available as Python modules. |
Cancer Omics Drug Experiment Response Dataset | Molecular datasets paired with corresponding drug sensitivity data | Seeks to standardize datasets of cancer drug responses into a standard schema |
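
Since Therapeutic Data Commons is listed above as available as Python modules, a minimal loading sketch may help. It assumes the `PyTDC` package is installed and that the dataset can be downloaded on first use; the dataset name `Caco2_Wang` is just one example, and the helper names `load_caco2_split` and `split_sizes` are hypothetical and not part of the library.

```python
def split_sizes(split):
    """Return the number of rows in each partition of a split mapping."""
    return {name: len(part) for name, part in split.items()}


def load_caco2_split():
    """Load one example ADME dataset and return its default split.

    Assumption: PyTDC is installed (pip install PyTDC) and network
    access is available, because the data is fetched on first use.
    """
    from tdc.single_pred import ADME  # imported lazily; third-party package

    data = ADME(name="Caco2_Wang")
    # get_split() returns a mapping with "train", "valid" and "test" partitions.
    return data.get_split()
```

Calling `split_sizes(load_caco2_split())` should then report how many examples fall into each partition, which is a quick sanity check before training a model.
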