
Computational Biology Datasets Suitable For Machine Learning

July 30, 2024, by admin

Machine learning (ML) is a rapidly evolving subfield of artificial intelligence that focuses on developing algorithms and models that allow computers to learn from data without explicit programming. The field has progressed significantly since its origins in the 1950s, with major advancements in neural networks, deep learning, and applications across diverse domains. ML encompasses several types of learning approaches, including supervised learning for tasks like classification and regression, unsupervised learning for discovering patterns in unlabeled data, and reinforcement learning for decision-making through trial and error. Key algorithms and techniques in ML include decision trees, support vector machines, neural networks, and ensemble methods.

In the realm of computational biology, ML has found numerous applications, from analyzing genomic and proteomic data to drug discovery and development. It plays a crucial role in tasks such as gene expression analysis, protein structure prediction, metabolic pathway modeling, and integrative multi-omics approaches.

Computational biology itself is an interdisciplinary field that combines biology, computer science, mathematics, and statistics to analyze and interpret biological data. It relies heavily on bioinformatics tools and databases, and employs various computational techniques like sequence alignment and phylogenetic analysis. As the field continues to advance, it faces important ethical and regulatory considerations, particularly regarding data privacy and the implications of genomic research and personalized medicine.

This is a curated list of computational biology datasets that have been pre-processed for machine learning. The list is a work in progress; please submit a pull request for any dataset you would like to advertise!

Genotyping

Name	Description	Comments
The Cancer Genome Atlas	Variety of Cancer Data	most cancer types have 100-1000 samples
NIH GDC	Cancer, many types of genomic data
UK Biobank
European Genome-Phenome Archive
METABRIC	The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers.
HapMap
23andMe	2280 Public Domain Curated Genotypes
Mice	SNPs, 2000+ samples	4 generations. It might be possible to learn a family structure out of the data.
Arabidopsis	SNPs, 100+ phenotypes
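
The comment on the Mice row suggests that family structure might be learnable from SNP data. One simple starting point is pairwise identity-by-state (IBS) similarity, where close relatives tend to score higher than unrelated individuals. The sketch below is only an illustration under stated assumptions: the genotypes are hypothetical toy values coded as minor-allele counts (0, 1, or 2), the function names are made up for this example, and nothing here reflects the actual format or contents of the Mice dataset.

```python
from itertools import combinations

# Hypothetical toy genotypes (NOT taken from any dataset in the table).
# Each value is the number of minor-allele copies (0, 1, or 2) at one SNP.
genotypes = {
    "parent_1":  [0, 1, 2, 0, 1, 2, 0, 2],
    "parent_2":  [2, 1, 0, 0, 1, 2, 1, 2],
    "child":     [1, 1, 1, 0, 1, 2, 0, 2],
    "unrelated": [2, 0, 0, 2, 2, 0, 2, 0],
}


def ibs_similarity(a, b):
    """Mean identity-by-state similarity in [0, 1] between two genotype vectors."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must cover the same SNPs")
    # At each SNP the two individuals differ by 0, 1, or 2 allele copies.
    total_difference = sum(abs(x - y) for x, y in zip(a, b))
    return 1.0 - total_difference / (2 * len(a))


def rank_pairs(samples):
    """Return (similarity, name_a, name_b) tuples, most similar pair first."""
    scored = [
        (ibs_similarity(samples[a], samples[b]), a, b)
        for a, b in combinations(samples, 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    for similarity, a, b in rank_pairs(genotypes):
        print(f"{a:>10} vs {b:<10} IBS = {similarity:.4f}")
```

On real data with thousands of SNPs, a similarity matrix like this could feed a clustering step or a pedigree-reconstruction method; IBS alone does not distinguish relationship types.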

Promoter-Enhancer Pairs

Name	Description	Comments
TargetFinder	~100,000 DNA-DNA interaction pairs

Gene/Protein Expression

Name	Description	Comments
GEO	Main place for NCBI data
ENCODE	Variety of assays to identify functional elements
ArrayExpress	DNA sequencing, gene/protein expression, epigenetics
Cytometry Continuous	flow cytometry data of 11 proteins+phospholipids, Discretized and cleaned data available offline	Classical benchmark dataset for learning graphical models; contains known errors
Transcription factor binding	ChIP-Seq data on 12 TFs
GTEx	Landmark study for eQTL analysis
PharmacoGenomics DB
ProteomeXchange
BeatAML	whole-exome sequencing, RNA sequencing and analyses of ex vivo drug sensitivity	672 tumour specimens collected from 562 patients

Single-cell Data

Name	Description	Comments
Single-cell expression atlas
scPerturb	single-cell perturbation-response datasets	harmonized and preprocessed across 44 original datasets

Regulatory Networks

Name	Description	Comments
TRRUST	manually curated database of human transcriptional regulatory network
Yeast Network	23-million yeast 2-hybrid experiments to investigate genetic interactions
Perturb-Seq	Integrated model of perturbations, single cell phenotypes, and epistatic interactions
KEGG Metabolic Regulatory Network (Undirected)	65554 instances, 29 attributes each
KEGG Metabolic Regulatory Network (Directed)	53414 instances, 24 attributes each

Images

Name	Description	Comments
The Cancer Imaging Archive	Extracts the images from the TCGA data
Multiple Myeloma DREAM Challenge	Challenge to identify Multiple Myeloma Patients
Breast Cancer Wisconsin (Diagnostic) Data Set	Predict whether the cancer is benign or malignant
DDSM	Mammogram Database
Kaggle Soft Tissue Sarcomas	Preprocessed subset of the TCIA study “Soft Tissue Sarcoma”	segmentation task
Kaggle Cervical Cancer Screening	Classify cervix type from images
CAMELYON17	Pathology challenge – automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections
Grand Challenges	Datasets from biomedical image analysis competitions
Breast Cancer MRI Dataset	Demographic, clinical, pathology, treatment, outcomes, and genomic data + MRI images

fMRI

Name	Description	Comments
ENIGMA Cerebellum	Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction
Seizure Prediction	Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure).
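
For the seizure prediction task, a classifier usually works on features computed over fixed-length windows of the EEG signal rather than on raw samples. The sketch below shows two commonly used window features, line length and variance. It is a minimal illustration, not the challenge's method: the signals are synthetic sine waves, the names `segment_a` and `segment_b` are neutral placeholders with no claim about which class they resemble, and the sampling rate and window size are assumed values.

```python
import math


def line_length(signal):
    """Sum of absolute differences between consecutive samples."""
    return sum(abs(b - a) for a, b in zip(signal, signal[1:]))


def variance(signal):
    """Population variance of the samples."""
    mean = sum(signal) / len(signal)
    return sum((x - mean) ** 2 for x in signal) / len(signal)


def window_features(signal, window_size):
    """Split the signal into non-overlapping windows; return (line_length, variance) per window."""
    features = []
    for start in range(0, len(signal) - window_size + 1, window_size):
        window = signal[start:start + window_size]
        features.append((line_length(window), variance(window)))
    return features


def synthetic_segment(amplitude, frequency_hz, sampling_rate_hz=256, seconds=2):
    """Generate a sine wave as a stand-in for one EEG channel (synthetic, for illustration only)."""
    n_samples = sampling_rate_hz * seconds
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * t / sampling_rate_hz)
        for t in range(n_samples)
    ]


# Two placeholder segments that differ in amplitude and frequency.
segment_a = synthetic_segment(amplitude=1.0, frequency_hz=5.0)
segment_b = synthetic_segment(amplitude=3.0, frequency_hz=20.0)

# One-second windows at the assumed 256 Hz sampling rate.
features_a = window_features(segment_a, window_size=256)
features_b = window_features(segment_b, window_size=256)

if __name__ == "__main__":
    print("segment_a:", features_a)
    print("segment_b:", features_b)
```

Feature tuples like these would then be labeled per window and passed to any standard classifier.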

Electronic Medical Records

Name	Description	Comments
MIMIC	59,000 EHRs
UCI Diabetes	Data from 130 US hospitals, 1999-2008
i2b2	Clinical notes only, designed for NLP tasks
PhysioNet
Metadata Acquired from Clinical Case Reports (MACCRs)	3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases
eICU	200k EHRs
All of Us	>250k EHRs, some genomic data
PMC-Patients	167k patient summaries with 3.1 M patient-article relevance annotations and 293k patient-patient similarity annotations

Radiographs

Name	Description	Comments
CheXpert	200k chest radiographs	Competition and leaderboard associated
MIMIC-CXR	~400k chest x-rays, 14 labels	Data on PhysioNet
PadChest	160k chest x-rays, 174 different findings

Protein-Protein Interactions

Name	Description	Comments
HINT (High-quality INTeractomes)	curated compilation of high-quality protein-protein interactions from 8 interactome resources

Longitudinal Studies

Name	Description	Comments
National Population Health Survey	Longitudinal survey that collects health information every two years.

Protein Structure

Name	Description	Comments
ProteinNet	Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits.

Natural Language Data

Name	Description	Comments
BioASQ	Abstracts of medical articles (from PubMed); ontologies of medical concepts.	Tasks: multi-label classification (MLC), question answering (QA).
Cases	Articles from medical case studies.
UPMC Pathology	UPMC Pathology case studies.

Therapeutics

Name	Description	Comments
Therapeutic Data Commons	Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing.	Available as Python modules.
Cancer Omics Drug Experiment Response Dataset	Molecular datasets paired with corresponding drug sensitivity data	Seeks to standardize datasets of cancer drug responses into a standard schema