
Computational Biology Datasets Suitable For Machine Learning

July 30, 2024, by admin

Machine learning (ML) is a rapidly evolving subfield of artificial intelligence that focuses on developing algorithms and models that allow computers to learn from data without explicit programming. The field has progressed significantly since its origins in the 1950s, with major advancements in neural networks, deep learning, and applications across diverse domains. ML encompasses several types of learning approaches, including supervised learning for tasks like classification and regression, unsupervised learning for discovering patterns in unlabeled data, and reinforcement learning for decision-making through trial and error. Key algorithms and techniques in ML include decision trees, support vector machines, neural networks, and ensemble methods.

In computational biology, ML has found numerous applications, from analyzing genomic and proteomic data to drug discovery and development. It plays a crucial role in tasks such as gene expression analysis, protein structure prediction, metabolic pathway modeling, and integrative multi-omics approaches.

Computational biology itself is an interdisciplinary field that combines biology, computer science, mathematics, and statistics to analyze and interpret biological data. It relies heavily on bioinformatics tools and databases, and employs computational techniques such as sequence alignment and phylogenetic analysis. As the field continues to advance, it faces important ethical and regulatory considerations, particularly regarding data privacy and the implications of genomic research and personalized medicine.

This is a curated list of computational biology datasets that have been pre-processed for machine learning. The list is a work in progress; please submit a pull request for any dataset you would like to advertise!

Genotyping

Name | Description | Comments
The Cancer Genome Atlas | Variety of cancer data | Most cancer types have 100-1000 samples
NIH GDC | Cancer, many types of genomic data
UK Biobank
European Genome-Phenome Archive
METABRIC | The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers.
HapMap
23andMe | 2280 Public Domain Curated Genotypes
Mice | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure out of the data.
Arabidopsis | SNPs, 100+ phenotypes
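
The Mice entry notes that it might be possible to learn a family structure from SNP data. As a minimal sketch of one possible starting point, assuming genotypes are coded as 0/1/2 allele counts, the snippet below ranks pairs of samples by a simple identity-by-state (IBS) similarity. The function names, sample names, and genotype values are hypothetical and made up for illustration; they are not drawn from any dataset listed here.

```python
from itertools import combinations


def ibs_similarity(a, b):
    """Mean identity-by-state similarity of two genotype vectors coded 0/1/2."""
    if len(a) != len(b) or not a:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    # Each SNP contributes 1 - |a - b| / 2:
    # 1.0 for identical genotypes, 0.0 for opposite homozygotes.
    return sum(1 - abs(x - y) / 2 for x, y in zip(a, b)) / len(a)


def rank_pairs(genotypes):
    """Return (similarity, sample_1, sample_2) tuples, most similar pair first."""
    pairs = [
        (ibs_similarity(genotypes[s], genotypes[t]), s, t)
        for s, t in combinations(sorted(genotypes), 2)
    ]
    return sorted(pairs, reverse=True)


# Made-up toy genotypes (assumption, not real data).
toy = {
    "parent": [0, 1, 2, 1, 0, 2],
    "child": [0, 1, 2, 1, 1, 2],
    "unrelated": [2, 0, 0, 1, 2, 0],
}
ranked = rank_pairs(toy)
```

In this toy example the parent/child pair comes out as the most similar. On real data, pairwise similarity would only be a first step; recovering an actual pedigree would need more careful modeling.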

Promoter-Enhancer Pairs

Name | Description | Comments
TargetFinder | ~100,000 DNA-DNA interaction pairs

Gene/Protein Expression

Name | Description | Comments
GEO | Main place for NCBI data
ENCODE | Variety of assays to identify functional elements
ArrayExpress | DNA sequencing, gene/protein expression, epigenetics
Cytometry | Continuous flow cytometry data of 11 proteins+phospholipids; discretized and cleaned data available offline | Classical benchmark dataset for learning graphical models; contains known errors
Transcription factor binding | ChIP-Seq data on 12 TFs
GTEx | Landmark study for eQTL analysis
PharmacoGenomics DB
ProteomeXchange
BeatAML | Whole-exome sequencing, RNA sequencing and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients

Single-cell Data

Name | Description | Comments
Single-cell expression atlas
scPerturb | Single-cell perturbation-response datasets | Harmonized and preprocessed across 44 original datasets

Regulatory Networks

Name | Description | Comments
TRRUST | Manually curated database of human transcriptional regulatory network
Yeast Network | 23 million yeast 2-hybrid experiments to investigate genetic interactions
Perturb-Seq | Integrated model of perturbations, single cell phenotypes, and epistatic interactions
KEGG Metabolic Regulatory Network (Undirected) | 65554 instances, 29 attributes each
KEGG Metabolic Regulatory Network (Directed) | 53414 instances, 24 attributes each

Medical Imaging

Name | Description | Comments
The Cancer Imaging Archive | Extracts the images from the TCGA data
Multiple Myeloma DREAM Challenge | Challenge to identify Multiple Myeloma patients
Breast Cancer Wisconsin (Diagnostic) Data Set | Predict whether the cancer is benign or malignant
DDSM | Mammogram database
Kaggle Soft Tissue Sarcomas | Preprocessed subset of the TCIA study “Soft Tissue Sarcoma” | Segmentation task
Kaggle Cervical Cancer Screening | Classify cervix type from images
CAMELYON17 | Pathology challenge: automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections
Grand Challenges | Datasets from biomedical image analysis competitions
Breast Cancer MRI Dataset | Demographic, clinical, pathology, treatment, outcomes, and genomic data + MRI images

fMRI

Name | Description | Comments
ENIGMA Cerebellum | Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction
Seizure Prediction | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure).
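
For a task like Seizure Prediction, one common way to frame time-series classification is to cut the signal into windows and summarize each window with features that a classifier can consume. The following is a minimal sketch under stated assumptions: fixed-size windows and the line-length feature (the sum of absolute sample-to-sample differences). The function names, window size, and signal values are hypothetical and not taken from the dataset.

```python
def sliding_windows(signal, size, step):
    """Split a 1-D signal into windows of `size` samples, advancing by `step`."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def line_length(window):
    """Sum of absolute differences between consecutive samples in a window."""
    return sum(abs(b - a) for a, b in zip(window, window[1:]))


# Made-up signal (assumption): a flat stretch followed by a rapidly oscillating one.
toy_signal = [0, 0, 0, 0, 1, -1, 1, -1]
features = [line_length(w) for w in sliding_windows(toy_signal, size=4, step=4)]
```

Each entry of `features` would then be paired with a label (pre-seizure or interictal) to train a classifier; real pipelines usually combine several features per window.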

Electronic Medical Records

Name | Description | Comments
MIMIC | 59,000 EHRs
UCI Diabetes | Data from 130 US hospitals for 1999-2008
i2b2 | Clinical notes only, designed for NLP tasks
PhysioNet
Metadata Acquired from Clinical Case Reports (MACCRs) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases
eICU | 200k EHRs
All of Us | >250k EHRs, some genomic data
PMC-Patients | 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations

Radiographs

Name | Description | Comments
CheXpert | 200k chest radiographs | Competition and leaderboard associated
MIMIC-CXR | ~400k chest x-rays, 14 labels | Data on PhysioNet
PadChest | 160k chest x-rays, 174 different findings

Protein-Protein Interactions

Name | Description | Comments
HINT (High-quality INTeractomes) | Curated compilation of high-quality protein-protein interactions from 8 interactome resources

Longitudinal Studies

Name | Description | Comments
National Population Health Survey | Longitudinal survey that collects health information every two years.

Protein Structure

Name | Description | Comments
ProteinNet | Standardized dataset for learning protein structure. Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits.

Natural Language Data

Name | Description | Comments
BioASQ | Abstracts of medical articles (from PubMed); ontologies of medical concepts. | Tasks: multi-label classification (MLC), question answering (QA).
Cases | Articles from medical case studies.
UPMC Pathology | UPMC Pathology case studies.

Therapeutics

Name | Description | Comments
Therapeutic Data Commons | Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing. | Available as Python modules.
Cancer Omics Drug Experiment Response Dataset | Molecular datasets paired with corresponding drug sensitivity data | Seeks to standardize datasets of cancer drug responses into a standard schema