Direct RNA Sequencing (DRS)

Step-by-Step Guide: Differences Between edgeR, LIMMA, and DESeq2 for Differential Expression Analysis

December 28, 2024 Off By admin
Shares

As a bioinformatician, you may be tasked with explaining the differences between various methods for differential expression (DE) analysis, such as edgeR, LIMMA, and DESeq. Here’s a detailed explanation aimed at experimental biologists and beginners, including the underlying principles and practical steps for using each method.

Table of Contents

1. Overview of Differential Expression Analysis

Before diving into each method, it’s important to understand the goal of differential expression analysis: identifying genes (or other features) that are expressed differently across different experimental conditions or groups. In RNA-Seq, this involves comparing gene expression levels between groups (e.g., treated vs untreated) to detect genes with significantly different expression.

Differential expression (DE) analysis aims to identify genes whose expression levels vary significantly between conditions or groups. The tools:

  • edgeR: Uses a negative binomial (NB) distribution for RNA-Seq data, especially suitable for small sample sizes.
  • DESeq2: Also employs the NB model but focuses on robust handling of dispersion estimation and normalization.
  • LIMMA: Originally for microarray data but extended (with voom) for RNA-Seq. It models expression using linear modeling and transforms RNA-Seq counts to log2 counts per million (logCPM).

2. Key Assumptions and Statistical Models

The core difference between these methods lies in the statistical models and normalization strategies they use. Each tool assumes a different distribution of the data and applies distinct methods to account for technical biases in the data.

  • edgeR and DESeq2 use the negative binomial distribution, which is suitable for count-based data (like RNA-Seq), where the data is overdispersed (i.e., the variance is greater than the mean).
  • LIMMA, originally developed for microarrays, uses a linear model with empirical Bayes moderation and log-transformed data (or the Voom transformation for RNA-Seq data), making it suitable for both microarray and RNA-Seq datasets.

3. Detailed Differences Between edgeR, LIMMA, and DESeq

Key Differences Between edgeR, LIMMA, and DESeq2

FeatureedgeRDESeq2LIMMA (with voom)
Statistical ModelNegative binomialNegative binomialLinear models with log2(CPM) transformation
Normalization MethodTMM (Trimmed Mean of M-values)Geometric normalizationQuantile normalization or TMM (RNA-Seq)
Suitable forSmall sample sizes, overdispersed countsLarger datasets, robust for variable dispersionRNA-Seq and microarray data
Data TypeRNA-Seq countsRNA-Seq countsRNA-Seq counts, microarray intensities
Ease of UseRequires statistical knowledgeUser-friendlyUser-friendly

edgeR

  • Model: Uses negative binomial distribution to model count data.
  • Normalization: Implements the Trimmed Mean of M-values (TMM) method to normalize data. This method assumes that most genes are not differentially expressed and normalizes based on the distribution of gene expression levels.
  • Methodology: Performs exact tests and generalized linear models (GLMs) to identify differentially expressed genes.
  • Assumptions: Works well when data is overdispersed, i.e., the variance exceeds the mean.

Key Feature: edgeR is particularly suited for small datasets or when working with low counts.

Example R Code (for DE analysis using edgeR):

R
# Load edgeR package
library(edgeR)

# Assuming 'counts' is a matrix of gene expression data (genes x samples)
# and 'group' is a vector specifying the condition of each sample (e.g., control, treatment)
y <- DGEList(counts=counts, group=group)
y <- calcNormFactors(y) # TMM normalization
design <- model.matrix(~group) # Design matrix for comparison
y <- estimateDisp(y, design) # Estimate dispersion
fit <- glmFit(y, design) # Fit GLM
lrt <- glmLRT(fit) # Likelihood ratio test
topTags(lrt) # Show top differentially expressed genes


DESeq2

  • Model: Also uses a negative binomial distribution, similar to edgeR.
  • Normalization: Utilizes the size factor method for normalization, which assumes that most genes are not differentially expressed. The scaling factor for each sample is computed based on the geometric mean of gene counts.
  • Methodology: Uses a Wald test or likelihood ratio test (LRT) for differential expression analysis.
  • Assumptions: Assumes that most genes are not differentially expressed, and adjusts for differences in library size across samples.

Key Feature: DESeq2 is known for handling small sample sizes and accounting for experimental batch effects.

Example R Code (for DE analysis using DESeq2):

R
# Load DESeq2 package
library(DESeq2)

# Assuming 'dds' is the DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~group)
dds <- DESeq(dds) # Run DESeq2 analysis
res <- results(dds) # Get differential expression results
summary(res) # Show results summary


LIMMA (Voom for RNA-Seq)

  • Model: Originally designed for microarrays, LIMMA applies a linear model (after log transformation of RNA-Seq counts) and uses empirical Bayes moderation to stabilize variance estimates. The Voom transformation is used for RNA-Seq, which estimates the mean-variance relationship.
  • Normalization: Quantile normalization is used for microarray data, while for RNA-Seq data, TMM normalization (via edgeR) is often recommended.
  • Methodology: Uses linear models (for microarrays) and log-transformed count data for RNA-Seq. Uses empirical Bayes methods to shrink the variance of individual coefficients.
  • Assumptions: It assumes that the data is normally distributed after transformation.

Key Feature: LIMMA is well-suited for large RNA-Seq datasets and for when comparisons across many groups are needed.

Example R Code (for DE analysis using LIMMA with Voom):

R
# Load LIMMA package
library(limma)
library(edgeR)

# Assuming 'counts' is the gene expression matrix and 'group' is the condition vector
y <- DGEList(counts=counts)
y <- calcNormFactors(y) # Normalize counts using TMM
v <- voom(y, design=design) # Voom transformation
fit <- lmFit(v) # Fit linear model
fit <- eBayes(fit) # Apply empirical Bayes moderation
topTable(fit) # Show top differentially expressed genes


Workflow for Each Tool

3.1 Pre-requisites

  • Input: Raw count matrix (genes x samples).
  • Software: R, edgeR, DESeq2, LIMMA.
  • Ensure samples are grouped (e.g., control vs treated).

3.2 Steps for edgeR

bash
# R commands for edgeR
library(edgeR)

# Load count data and group information
counts <- read.csv("counts.csv", row.names=1)
group <- factor(c("control", "control", "treated", "treated"))

# Create a DGEList object
dge <- DGEList(counts=counts, group=group)

# Normalize counts (TMM)
dge <- calcNormFactors(dge)

# Estimate dispersions
dge <- estimateDisp(dge)

# Perform differential expression analysis
fit <- glmQLFit(dge)
qlf <- glmQLFTest(fit)

# View top genes
topTags(qlf)


3.3 Steps for DESeq2

bash
# R commands for DESeq2
library(DESeq2)

# Load count data and group information
counts <- read.csv("counts.csv", row.names=1)
colData <- data.frame(condition=factor(c("control", "control", "treated", "treated")))
dds <- DESeqDataSetFromMatrix(countData=counts, colData=colData, design=~condition)

# Normalize counts and estimate dispersions
dds <- DESeq(dds)

# Perform differential expression analysis
res <- results(dds)

# View top genes
resOrdered <- res[order(res$pvalue), ]
head(resOrdered)


3.4 Steps for LIMMA with Voom

bash
# R commands for LIMMA
library(limma)
library(edgeR)

# Load count data and group information
counts <- read.csv("counts.csv", row.names=1)
group <- factor(c("control", "control", "treated", "treated"))

# Normalize counts (TMM)
dge <- DGEList(counts=counts)
dge <- calcNormFactors(dge)

# Transform counts with voom
voomCounts <- voom(dge, design=model.matrix(~group))

# Fit linear model and perform differential expression
fit <- lmFit(voomCounts, design=model.matrix(~group))
fit <- eBayes(fit)

# View top genes
topTable(fit)


4. Practical Tips for Experimental Biologists

  • When to Use Each Tool:
    • edgeR: For smaller sample sizes or when extreme outliers need robust handling.
    • DESeq2: General-purpose tool; handles normalization and dispersion estimation well.
    • LIMMA: Best when integrating RNA-Seq with microarray workflows.
  • Key Output:
    • Log-fold change (LFC): Indicates the magnitude of gene expression change.
    • Adjusted p-value: Helps identify significantly differentially expressed genes after correction.

5. Scripts for Batch Processing

5.1 UNIX Script to Run edgeR on Multiple Datasets

bash
#!/bin/bash
# edgeR Batch Processing Script

for file in *.csv; do
Rscript -e "
library(edgeR)
counts <- read.csv('$file', row.names=1)
group <- factor(c('control', 'control', 'treated', 'treated'))
dge <- DGEList(counts=counts, group=group)
dge <- calcNormFactors(dge)
dge <- estimateDisp(dge)
fit <- glmQLFit(dge)
qlf <- glmQLFTest(fit)
write.csv(topTags(qlf), file='${file%.csv}_edgeR_results.csv')
"

done

5.2 Perl Script for Input File Validation

#!/usr/bin/perl
# Validate Count Files
use strict;
use warnings;

my $dir = './counts/';
opendir(DIR, $dir) or die "Cannot open directory: $!";

while (my $file = readdir(DIR)) {
next unless ($file =~ /\.csv$/);
open my $fh, '<', "$dir/$file" or die "Cannot open $file: $!";
my $header = <$fh>;
if ($header =~ /GeneID,Sample/) {
print "$file is valid.\n";
} else {
print "$file is invalid.\n";
}
close $fh;
}
closedir(DIR);


6. Recent Updates

Choosing the Right Method

Each method has its strengths and weaknesses. Here’s a quick guide to choosing the right tool:

  • edgeR: Best for small RNA-Seq datasets, low-count genes, and when you need a precise, count-based model.
  • DESeq2: Ideal for RNA-Seq data, especially with small sample sizes or when batch effects need to be considered.
  • LIMMA: Useful when working with larger RNA-Seq datasets or if you’re dealing with microarray data. The Voom transformation makes it suitable for RNA-Seq as well.

7. Additional Considerations

  • All three methods rely on assumptions about the data’s distribution and noise structure.
  • Cross-validation: As a best practice, you may want to validate results through cross-validation or by using additional datasets to ensure the robustness of findings.

Conclusion

While edgeR, DESeq2, and LIMMA all perform similar tasks (differential expression analysis), they each employ different statistical models and normalization strategies that are suitable for different types of data. Understanding these methods’ assumptions, strengths, and weaknesses is crucial in choosing the right tool for your analysis.

Shares