Tackling Bias in Gene Expression Dataset Selection
December 19, 2024
Introduction
In the age of machine learning and bioinformatics, the quality of data dictates the reliability of results. Gene expression datasets play a pivotal role in feature selection, a technique essential for filtering irrelevant or redundant information and identifying the most informative features. However, a significant bias in the use of outdated gene expression datasets, particularly from the microarray era, has emerged as a critical problem in bioinformatics research.
A review of over 1,200 publications between 2010 and 2020 reveals a troubling trend: many studies rely on older datasets riddled with issues such as mislabeled samples, class imbalances, and outdated gene definitions. This overreliance on flawed data not only skews the development of feature selection methods but also undermines the accuracy and reproducibility of scientific findings.
This blog explores the origins of this bias, the impact of outdated datasets on feature selection research, and actionable strategies to overcome these challenges.
Timeline of Main Events:
- 1984: Introduction of the Classification and Regression Trees (CART) algorithm, a foundation for later machine learning.
- Early 1990s: The Human Genome Project (HGP) begins.
- 1992: Affymetrix is founded; Boser, Guyon, and Vapnik introduce the kernel trick for Support Vector Machines (SVMs).
- 1994: Affymetrix releases the GeneChip microarray platform.
- Mid-1990s: Large-scale gene expression techniques emerge; the R programming language is first released.
- 1995: Tin Kam Ho introduces random decision forests, later popularized as the Random Forest (RF) method by Breiman in 2001.
- 1996: Dual-channel microarrays are developed; Tibshirani introduces the Least Absolute Shrinkage and Selection Operator (LASSO) method.
- 1997: Introduction of the ReliefF algorithm and the Orange data mining software.
- 1998: Illumina Inc. is founded.
- 1999: Agilent Technologies is founded.
- First significant applications of feature selection and machine learning to gene expression data appear, pioneered by Alon et al. and Golub et al.
- Alon et al. publish their study on colon cancer gene expression.
- Golub et al. publish their study on leukemia gene expression.
- 2000: Alizadeh et al. publish study on diffuse large B-cell lymphoma (DLBCL) using gene expression profiling.
- 2001: Initial draft of the Human Genome is released.
- Bhattacharjee et al. and Khan et al. publish studies on lung cancer and small round blue cell tumors (SRBCT), respectively, using gene expression data.
- 2002: Illumina’s BeadArray technology is released.
- Shipp et al. publish study on DLBCL using gene expression profiling.
- The Gene Expression Omnibus (GEO) database is released.
- Multiple significant gene expression studies are published (Beer, Gordon, Singh, Pomeroy, van ’t Veer, Armstrong, etc.).
- 2003: The Human Genome Project (HGP) is completed (covering 92% of the human genome).
- 2004: Introduction of the mRMR and FS-NEAT algorithms.
- 2005: The 454 GS-20 pyrosequencer is released.
- 2010: ABI capillary sequencer is released.
- 2011: Introduction of the scikit-learn Python library.
- The inSilicoDb database is established.
- 2013: Single-Cell gene expression analysis becomes prominent.
- 2019: Introduction of the CuMiDa database.
- 2021: Introduction of the BARRA:CuRDa database.
- 2022: New sequencing technologies provide long-read sequences, allowing the completion of the human genome.
- Nature publishes the article “The rise and fall (and rise) of datasets,” highlighting data issues in machine learning.
The Historical Context of Gene Expression Data
The Microarray Era
In the mid-1990s, microarray technology revolutionized gene expression analysis. It enabled researchers to measure thousands of genes simultaneously, laying the groundwork for numerous breakthroughs. However, early microarray platforms were prone to inaccuracies, including non-specific hybridization and limited probe coverage.
Despite these limitations, the datasets generated during this period became benchmarks for computational methods like feature selection. Studies such as those by Alon et al. (1999) and Golub et al. (1999) focused on human cancer datasets, setting a precedent for their use in later research.
The Advent of RNA-seq
The introduction of RNA sequencing (RNA-seq) offered a more comprehensive and precise approach to gene expression analysis. Unlike microarrays, RNA-seq quantifies RNA molecules directly, enabling the discovery of novel transcripts and alternative splicing events. Yet, many researchers continued to rely on microarray data due to its availability and historical prominence.
The Problem: Old Data, New Challenges
Key Issues with Outdated Datasets
- Outdated Technology: Early microarray platforms often lacked specificity and sensitivity. Discontinued probes and incomplete genome annotations further compound these issues.
- Mislabeled Samples: Many older datasets contain mislabeled samples or suffer from biases introduced during experimental design.
- Class Imbalances: Uneven distribution of classes within training and test sets can lead to misleading results (a quick sanity-check sketch follows this list).
- Pre-Filtered Data: Some datasets underwent pre-filtering processes, inadvertently introducing biases into subsequent analyses.
- Accessibility Problems: Links to older datasets are often broken, making it challenging to validate results.
- Lack of Diversity: A disproportionate focus on human cancer datasets limits the generalizability of feature selection methods to other organisms or conditions.
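Before benchmarking against any legacy dataset, a few lines of code can surface the first two of these issues. The sketch below is a minimal example, assuming hypothetical CSV files with samples as rows, gene expression values as columns, and a "label" column; every file and column name is illustrative, not taken from any dataset discussed in this post.

```python
# Minimal sanity-check sketch; file and column names are hypothetical.
import numpy as np
import pandas as pd

train = pd.read_csv("train_expression.csv")  # samples x genes, plus "label"
test = pd.read_csv("test_expression.csv")

# 1. Class balance: a heavily skewed split makes plain accuracy misleading.
print("Train class proportions:\n", train["label"].value_counts(normalize=True))
print("Test class proportions:\n", test["label"].value_counts(normalize=True))

# 2. Distribution shift between splits: compare overall intensity scales.
train_vals = train.drop(columns="label").to_numpy()
test_vals = test.drop(columns="label").to_numpy()
print(f"Median intensity ratio (test/train): "
      f"{np.median(test_vals) / np.median(train_vals):.2f}")
```

On a split like the one in Singh et al. (2002), where the training and test sets reportedly differ nearly 10-fold in microarray intensity, the final ratio would flag the problem immediately.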
The Vicious Cycle of Dataset Reuse
Researchers frequently reuse older datasets due to their widespread citation and ease of access. Benchmarking practices in computer science further perpetuate this cycle, as new algorithms are often tested against well-known datasets for comparative purposes.
Glossary of Key Terms

| Term | Definition |
|---|---|
| Feature Selection | The process of selecting a subset of relevant features (variables or attributes) to improve machine learning performance, reduce overfitting, and simplify analysis. |
| Gene Expression | The process by which information from a gene is used to synthesize a functional gene product, such as a protein, resulting in variations in mRNA levels. |
| Microarray | A high-throughput technology for measuring the expression levels of thousands of genes simultaneously using predefined probes. |
| RNA-seq | A sequencing technology that measures RNA transcript abundance by sequencing all RNA molecules in a sample, producing a matrix of read counts. |
| Transcriptomics | The study of the transcriptome, encompassing all RNA transcripts in a cell or tissue, to analyze gene expression patterns. |
| Probes | Short nucleic acid sequences designed to hybridize with complementary sequences in gene expression experiments, such as microarrays. |
| cDNA | Complementary DNA synthesized from mRNA by reverse transcriptase, commonly used in microarray experiments. |
| Normalization | A preprocessing technique that removes systematic variations and biases from gene expression data so samples can be compared accurately. |
| Differentially Expressed Genes (DEG) | Genes showing statistically significant differences in expression between biological conditions. |
| Univariate Feature Selection | A method that evaluates each feature individually by its association with the target variable, without considering feature dependencies. |
| Wrapper Method | A feature selection technique that evaluates subsets of features by training a learning algorithm on them (see the sketch following this table). |
| Batch Effects | Systematic variations in gene expression data caused by technical factors rather than biological differences. |
| Benchmarking | The process of testing and comparing algorithm performance using standardized datasets for objective evaluation. |
| GEO Database | A public NCBI database for depositing and accessing gene expression data. |
| Indirect Citations | Citing a study that references the original source, rather than citing the original source directly. |
| Model Organisms | Non-human species (e.g., Mus musculus, Rattus norvegicus) studied to understand biological phenomena applicable across species, including humans. |
| Homogeneity | The state of being uniform or the same; in feature selection research, an overly homogeneous set of benchmark datasets limits generalizability. |
| Heterogeneity | The state of being diverse or different; in gene expression studies, it refers to biological and technical variation in the data. |
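To make the Univariate Feature Selection and Wrapper Method entries above concrete, here is a small scikit-learn sketch contrasting the two strategies. The data are synthetic; the matrix dimensions and parameters are placeholders, not a real expression dataset.

```python
# Contrast of univariate vs. wrapper selection on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for an expression matrix: 100 samples x 500 "genes".
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

# Univariate: score each gene independently against the class labels.
univariate = SelectKBest(score_func=f_classif, k=20).fit(X, y)

# Wrapper: recursively eliminate the genes ranked lowest by the classifier.
wrapper = RFE(LogisticRegression(max_iter=1000),
              n_features_to_select=20, step=20).fit(X, y)

print("Univariate picks:", univariate.get_support(indices=True))
print("Wrapper picks:   ", wrapper.get_support(indices=True))
```

Note that the wrapper is far more expensive: it refits the classifier at every elimination round, which is why the sketch removes 20 genes at a time.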
The Consequences of Biased Data Usage
Inaccurate Conclusions
Even minor errors in a dataset can cascade into significant inaccuracies in model predictions and feature selection outcomes.
Limited Real-World Applicability
Models trained on biased or outdated data may fail when applied to new, diverse datasets, limiting their utility in practical applications such as disease diagnosis or drug discovery.
Stagnation in Method Development
Reliance on flawed datasets hinders innovation, as researchers focus on optimizing algorithms for outdated data instead of addressing contemporary challenges.
Moving Forward: Recommendations for Change
To address these issues, the bioinformatics community must adopt a more critical approach to dataset selection and usage:
- Cite Original Sources: Always reference the original data source to ensure transparency.
- Provide Dataset Details: Include information on the number of samples, features, and classes to facilitate reproducibility.
- Ensure Accessibility: Maintain working links to data repositories.
- Diversify Dataset Selection: Use a variety of datasets, including those from non-cancer studies and other model organisms.
- Avoid Pre-Filtered Data: Conduct analyses on raw datasets to prevent inadvertent biases.
- Incorporate RNA-seq Data: Leverage RNA-seq datasets to validate findings alongside older microarray data.
- Follow FAIR Principles: Adhere to guidelines for creating Findable, Accessible, Interoperable, and Reusable datasets.
- Promote Interpretable Machine Learning: Use interpretable models to enhance understanding and reproducibility of results; a minimal sketch follows this list.
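As one sketch of that last recommendation (an illustration, not a prescribed method), an L1-penalized logistic regression is an interpretable option: its nonzero coefficients directly name the genes the model relies on. The data and gene names below are synthetic placeholders.

```python
# Interpretable-model sketch: L1-penalized logistic regression whose
# nonzero coefficients name the "genes" the model actually uses.
# All data and gene names are synthetic placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=120, n_features=300,
                           n_informative=8, random_state=0)
genes = np.array([f"gene_{i}" for i in range(X.shape[1])])

model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(StandardScaler().fit_transform(X), y)

selected = genes[model.coef_[0] != 0]
print(f"{selected.size} genes with nonzero weights:", selected[:10])
```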
The Role of Reviewers and Editors
Journal editors and reviewers play a crucial role in enforcing best practices. By mandating the use of diverse, up-to-date datasets and encouraging detailed reporting, they can drive the bioinformatics community toward more reliable and impactful research.
Conclusion
The field of gene expression analysis stands at a crossroads. While historical datasets have laid the foundation for numerous advances, the time has come to break free from the limitations of the past. By embracing diverse, high-quality datasets and addressing inherent biases, researchers can unlock the full potential of feature selection methods, paving the way for transformative discoveries in biology and medicine.
FAQs: Gene Expression Data Analysis
1. What is feature selection in the context of gene expression data analysis?
Feature selection, when applied to gene expression data (often referred to as gene selection), is a process of identifying the most relevant genes that can distinguish between different sample populations or biological conditions. This is achieved by filtering out noisy, irrelevant, or redundant genes. These relevant genes, often called informative genes, can be used as potential biomarkers for diagnosis, prognosis, or drug target identification. In essence, it helps to simplify complex gene expression datasets, making them more manageable for analysis and machine learning.
2. Why is the age of gene expression datasets a critical factor when conducting feature selection research?
The age of gene expression datasets is a crucial, often overlooked aspect in feature selection research. Older datasets, especially those from the early days of microarray technology, may have been generated using less accurate probes and probe sets, or based on incomplete genomic information. Over time, gene expression technology has improved considerably, leading to more accurate data. Older datasets also frequently have issues such as different class distributions between training and test sets, mislabeled samples, and may have undergone pre-filtering which can introduce bias. Using these outdated datasets can lead to unreliable conclusions and may not generalize well to newer, more accurate data. The use of such data perpetuates a cycle of inaccurate findings.
3. What are some common biases and issues associated with commonly used gene expression datasets in feature selection research?
Several biases and issues plague commonly used gene expression datasets. Class distribution discrepancies between training and testing sets are frequent. Some datasets might contain mislabeled samples or outliers, which can affect the performance of feature selectors and classifiers. Pre-filtering of genes in some datasets reduces the dimension of the problem and biases results, since it mixes the results of a third-party filter with the feature selection being tested, making it impossible to evaluate how the algorithm would behave with original, unfiltered data. Additionally, the datasets may be heterogeneous in terms of sample preparation protocols and sample types. These flaws in older datasets can ultimately impact the validity and generalizability of feature selection algorithms trained on them.
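The pre-filtering concern can be demonstrated in a few lines: if genes are filtered using all samples before cross-validation, even pure noise looks informative. The sketch below is an illustration on random data, not on any dataset discussed here, contrasting the biased and honest protocols with scikit-learn.

```python
# Demonstration of selection bias: on pure noise, honest accuracy should
# sit near chance (~0.5), while pre-filtering on all samples inflates it.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5000))   # 50 samples, 5000 random "genes"
y = rng.integers(0, 2, size=50)   # random labels: no real signal exists

# Biased protocol: filter genes using ALL samples, then cross-validate.
X_filtered = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(SVC(), X_filtered, y, cv=5).mean()

# Honest protocol: the filter is refit inside each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC())
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"biased accuracy: {biased:.2f}  honest accuracy: {honest:.2f}")
```

The biased estimate typically lands well above chance, while the pipeline version stays near 0.5, the honest answer for random labels.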
4. How do microarray and RNA-seq technologies differ, and what implications do these differences have for feature selection analysis?
Microarray and RNA-seq technologies differ significantly in how they measure gene expression. Microarrays rely on predefined probes designed to hybridize with mRNA molecules, generating light emissions measured as log-intensities. RNA-seq, on the other hand, involves sequencing the total RNA, allowing for the identification of novel transcripts, allele variants, and splicing variations. This approach allows a more comprehensive and unbiased measurement of gene expression and results in read count matrices, rather than the log-intensity values of microarrays. RNA-seq datasets need different preprocessing steps than microarrays, such as normalization considering both library size and gene length, and batch effect correction. Therefore, feature selection algorithms need to be adapted and evaluated based on the specific nature of the data they’re working with.
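For readers who want the normalization difference made concrete, the sketch below computes two common RNA-seq normalizations from a raw count matrix: counts per million (library size only) and TPM (gene length, then library size). The counts and lengths are toy values; production pipelines typically rely on dedicated tools such as edgeR or DESeq2.

```python
# CPM vs. TPM from a raw RNA-seq count matrix (toy values only).
import numpy as np

# Rows are genes, columns are samples.
counts = np.array([[500.0, 1500.0],
                   [ 10.0,   25.0],
                   [300.0,  900.0]])
gene_lengths_kb = np.array([2.0, 0.5, 1.5])  # transcript lengths, kilobases

# CPM: correct for library size (total reads per sample) only.
cpm = counts / counts.sum(axis=0) * 1e6

# TPM: divide by gene length first, then rescale each sample to one
# million, making values comparable across both samples and genes.
rate = counts / gene_lengths_kb[:, None]
tpm = rate / rate.sum(axis=0) * 1e6

print("CPM:\n", cpm.round(1))
print("TPM:\n", tpm.round(1))
```

Note how the short gene’s TPM rises relative to its CPM: dividing by length first prevents long genes from dominating simply because they accumulate more reads.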
5. What are the major public repositories for gene expression data, and what are their strengths and weaknesses in the context of feature selection?
The main public repositories for gene expression data are the NCBI Gene Expression Omnibus (GEO), EMBL-EBI ArrayExpress, and The Cancer Genome Atlas (TCGA). GEO and ArrayExpress are the de facto repositories where researchers deposit their gene expression datasets, while TCGA is a large-scale project focused on cancer RNA-seq data. These are biology-first databases, with organization, jargon, and file formats that may not be optimal for computer scientists looking to quickly test feature selection algorithms. Other databases, such as inSilicoDb, CuMiDa, and BARRA:CuRDa, are curated to be more accessible for machine learning research, with files in ready-to-use formats, but some of them are not updated with newer datasets. Overall, the more machine-learning-focused databases promote old datasets that, albeit easy to use, contribute to the perpetuation of bias.
6. What specific recommendations are given to researchers to ensure the responsible use of gene expression data for feature selection?
Researchers are advised to follow several guidelines to ensure responsible data usage. First, they should always cite the original data source, not just intermediate publications. Second, they should thoroughly report the main characteristics of datasets used. Third, working links to the data source should be provided. They should validate algorithms using diverse datasets, not just those of human cancer, and test the algorithms with various model organisms. The usage of pre-filtered datasets should be avoided, and researchers should refrain from relying too heavily on older datasets. Finally, researchers should always consider the difference between microarray and RNA-seq data and handle them accordingly. Also, they should use state-of-the-art data curation practices when designing new databases.
7. What is the impact of indirect citations and broken links on the integrity of feature selection research?
Indirect citations, referencing an intermediary study instead of the original data source, hinder proper assessment of data quality and context, and perpetuate the cycle of outdated datasets usage. Broken links to dataset sources make the data inaccessible, preventing others from replicating the studies, validating the research, and understanding the data used to develop the algorithms. Both practices contribute to a lack of transparency and reproducibility in the field of gene expression data analysis.
8. How does the paper suggest addressing the tendency to rely on old datasets, especially in feature selection research?
The paper emphasizes that relying on older datasets is a harmful mindset, as it does not reflect the current state of the art in biological data generation and analysis. It promotes a mindset change in the research community, encouraging the validation of feature selection algorithms with current, diverse, and well-annotated datasets. Using old datasets is acceptable for comparison purposes with older studies, but new algorithms should always be tested on current datasets that can generate reliable biological information. Researchers should actively seek and curate newer datasets from multiple organisms and conditions. Moreover, reviewers and editors should enforce these practices in their publications. Additionally, the use of interpretable machine learning and visualization methods may also improve the understanding and replicability of feature selection experiments.
Study Guide: Feature Selection and Gene Expression Data
Short Answer Quiz
- What is feature selection and why is it important in the context of machine learning and bioinformatics?
- How does microarray technology work and what type of data does it produce?
- How does RNA-seq technology differ from microarray and what are its advantages?
- Why is the normalization step important in gene expression data analysis, and how does it differ between microarray and RNA-seq data?
- What were the most common datasets used for feature selection research between 2011 and 2016 according to the review by Ang et al. (2016), and what is the issue with the age of these datasets?
- What are some of the issues identified with the Singh et al. (2002) and Gordon et al. (2002) datasets that could affect the performance of feature selection algorithms?
- What problems did Mramor et al. (2007) identify with the Bhattacharjee et al. (2001) dataset, and how can these issues impact feature selection studies?
- Why is the leukemia dataset from Golub et al. (1999) considered problematic, despite its pioneering nature, and why does it continue to be used so frequently?
- How does the practice of using pre-filtered datasets affect the evaluation of feature selection algorithms?
- What are the authors’ recommendations for researchers, reviewers, and editors to improve the quality and validity of feature selection studies applied to gene expression data?
Short Answer Quiz – Answer Key
- Feature selection involves algorithms and methods that create more representative datasets by filtering out noisy, irrelevant, or redundant samples. It is important because it leads to optimized machine learning training and identifies relevant features for knowledge discovery, such as hub genes or biomarkers.
- Microarrays use predefined probes to hybridize with cDNA, generating fluorescent light that is quantified to measure gene expression levels. This process results in a matrix of log-intensity values representing the expression of genes in samples.
- RNA-seq, unlike microarrays, quantifies the total RNA expression in a single experiment without using predesigned probes. This allows for the identification of novel transcripts, allele variants, and alternative splicing and has lower technical variability than microarrays.
- Normalization is essential to remove biases and variations that could distort results. In microarrays it addresses artifacts in raw data and variations within and between arrays. In RNA-seq, it accounts for differences in library size and gene length, ensuring accurate comparisons across samples.
- The most common datasets were colon cancer by Alon et al. (1999), leukemia by Golub et al. (1999), DLBCL by Alizadeh et al. (2000), SRBCT by Khan et al. (2001), and prostate cancer by Singh et al. (2002). The main issue is that these datasets were already 14 to 17 years old at the time of the review, and thus potentially not representative of current data.
- The Singh et al. dataset has a different class distribution between the training and test sets, with almost a 10-fold difference in microarray intensity between the two. The Gordon et al. dataset has an uneven class split, with a single feature capable of perfectly sorting training samples, a biological impossibility, but that same feature is not relevant in the test set.
- Mramor et al. (2007) found seven mislabeled samples in the Bhattacharjee et al. (2001) dataset. These errors in data distribution and outliers can significantly impact the accuracy and effectiveness of feature selection algorithms.
- The Golub et al. (1999) dataset is small, highly heterogeneous, and has classes that can be perfectly classified using several assortments of genes. Its continued use stems from its early and significant impact on the field, as well as its availability in shared databases.
- Using pre-filtered datasets reduces the actual dimensionality of the task, making it impossible to evaluate how the selection algorithms would deal with the full set of features, including irrelevant or redundant ones. This inadvertently mixes the results of a third-party filtering algorithm with a researcher’s own work.
- The authors recommend always citing original data sources, specifying dataset characteristics, including working hyperlinks, using diverse datasets, exploring model organisms beyond H. sapiens, preventing use of pre-filtered datasets, avoiding only older datasets, exploring RNA-seq, following guidelines when creating new databases, and that reviewers and editors enforce these recommendations.
Essay Questions
- Discuss the historical context of gene expression data and feature selection, outlining how advancements in technology have influenced the methods and challenges in the field. Analyze how the early focus on microarray data has shaped the current practices and what implications these historical factors have had on current research.
- Critically evaluate the trade-offs between using widely available, popular datasets and using newer, potentially more accurate datasets in feature selection. What are the benefits and drawbacks of each approach, and how should researchers navigate this decision-making process to enhance the validity of their studies?
- Analyze the challenges of bias in gene expression data, exploring how biological and technical variations affect the outcomes of feature selection studies. How can researchers identify and mitigate these biases to achieve more reliable and reproducible results, and what strategies could enhance the diversity of datasets used for research?
- Compare and contrast the different types of gene expression databases, such as those designed for machine learning versus biology-focused databases, analyzing the advantages and limitations of each for feature selection research. How can such databases be improved to better support high-quality research, especially regarding data sharing, curation, and access?
- Based on the recommendations provided, propose a practical framework for researchers conducting feature selection studies with gene expression data. Describe the steps they should take from initial planning to final publication to ensure that their work is robust, reproducible, and has a valid biological interpretation.