Basic Normalization, Batch Correction, and Visualization of RNA-seq Data
January 10, 2025This step-by-step guide will walk you through a basic RNA-seq data analysis workflow using edgeR
in R. The workflow includes data loading, normalization, batch correction, visualization, and differential expression analysis. The example data used here are from the GEO dataset GSE124167, which includes RNA-seq data from treated and untreated mice across three different library preparation kits (batches).
Step 1: Load Required Libraries and Data
First, install and load the necessary R packages. Then, download and parse the RNA-seq count data from GEO.
Install and Load Libraries
# Install necessary packages if not already installed install.packages("data.table") install.packages("dplyr") install.packages("ggplot2") BiocManager::install("edgeR") BiocManager::install("limma") # Load libraries library(data.table) library(dplyr) library(edgeR) library(limma) library(ggplot2)
Download and Parse Data
# Function to parse data properly parseProperly <- function(x) { x <- x[, -c(ncol(x) - 1, ncol(x))] # Remove last two columns return(x) } # URLs for the data m1 <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124167/suppl/GSE124167_FINAL_master_list_of_gene_counts_MIN.V4.txt.gz" m2 <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124167/suppl/GSE124167_FINAL_master_list_of_gene_counts_MIN.sense.Pico.R.txt.gz" m3 <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124167/suppl/GSE124167_FINAL_master_list_of_gene_counts_MIN.sense.Truseq.txt.gz" # Read and parse data method1 <- parseProperly(fread(m1, data.table = FALSE)) method2 <- parseProperly(fread(m2, data.table = FALSE)) method3 <- parseProperly(fread(m3, data.table = FALSE)) # Combine data into a single table counts <- Reduce(function(x, y) full_join(x, y, by = "id"), list(method1, method2, method3)) # Replace NAs with zeros counts[is.na(counts)] <- 0 # Move "id" column to row names rownames(counts) <- counts$id counts$id <- NULL
Step 2: Set Up edgeR Object
Create a DGEList
object, which is the primary data structure used by edgeR
. Add metadata for the library preparation kits and treatments.
# Create DGEList object y <- DGEList(counts = counts) # Add metadata y$samples$kit <- c(rep("Pico", 6), rep("Illumina", 6), rep("V4", 6)) y$samples$treatment <- rep(c(rep("treated", 3), rep("untreated", 3)), 3) # Normalize data y <- calcNormFactors(y) logCPMs <- cpm(y, log = TRUE)
Step 3: Perform PCA for Quality Control
Principal Component Analysis (PCA) is a useful technique for identifying batch effects and other sources of variation in the data.
# Calculate row-wise variance rv <- apply(logCPMs, 1, var) # Select top 1000 most variable genes top1000 <- order(rv, decreasing = TRUE)[1:1000] logCPM_top1000 <- logCPMs[top1000, ] # Run PCA pca <- prcomp(t(logCPM_top1000)) # Combine PCA results with metadata to_plot <- data.frame(pca$x, y$samples) # Calculate percentage of variance explained by each PC percentVar <- pca$sdev^2 / sum(pca$sdev^2) * 100 # Plot PCA results ggplot(to_plot, aes(x = PC1, y = PC2, color = kit, shape = treatment)) + geom_point(size = 3) + xlab(paste0("PC1 - Var.expl = ", round(percentVar[1], 2), "%")) + ylab(paste0("PC2 - Var.expl = ", round(percentVar[2], 2), "%")) + ggtitle("PCA Before Batch Correction")
Step 4: Correct for Batch Effects
Use the removeBatchEffect
function from the limma
package to correct for batch effects.
# Correct for batch effects batch <- factor(y$samples$kit) logCPMs_corrected <- removeBatchEffect(logCPMs, batch = batch) # Repeat PCA with corrected data logCPM_corrected_top1000 <- logCPMs_corrected[top1000, ] pca_corrected <- prcomp(t(logCPM_corrected_top1000)) to_plot_corrected <- data.frame(pca_corrected$x, y$samples) percentVar_corrected <- pca_corrected$sdev^2 / sum(pca_corrected$sdev^2) * 100 # Plot PCA after batch correction ggplot(to_plot_corrected, aes(x = PC1, y = PC2, color = kit, shape = treatment)) + geom_point(size = 3) + xlab(paste0("PC1 - Var.expl = ", round(percentVar_corrected[1], 2), "%")) + ylab(paste0("PC2 - Var.expl = ", round(percentVar_corrected[2], 2), "%")) + ggtitle("PCA After Batch Correction")
Step 5: Differential Expression Analysis
Perform differential expression analysis using edgeR
, accounting for batch effects.
# Design matrix accounting for batch and treatment design <- model.matrix(~kit + treatment, y$samples) # Estimate dispersion and fit model y <- estimateDisp(y, design) fit <- glmQLFit(y, design) # Test for differential expression fit <- glmQLFTest(fit, coef = 4) # Fourth column corresponds to treatment tt <- data.frame(topTags(fit, n = Inf)) # Classify genes as up-regulated, down-regulated, or not significant tt_modified <- tt %>% mutate(status = factor(case_when( logFC > 0 & FDR < 0.05 ~ "up", logFC < 0 & FDR < 0.05 ~ "down", TRUE ~ "not.signif" ), levels = c("up", "not.signif", "down"))) # MA plot ggplot(tt_modified, aes(x = logCPM, y = logFC, color = status)) + geom_point(size = 1) + scale_color_manual(values = c("firebrick", "grey", "dodgerblue")) + ggtitle("MA Plot") # Volcano plot ggplot(tt_modified, aes(x = logFC, y = -log10(FDR), color = status)) + geom_point(size = 1) + scale_color_manual(values = c("firebrick", "grey", "dodgerblue")) + ggtitle("Volcano Plot")
Tips and Tricks
- Batch Effect Correction: Always check for batch effects using PCA before proceeding with differential expression analysis. Use
removeBatchEffect
or include batch in the design matrix. - Top Variable Genes: For PCA, focus on the top 1000 most variable genes to reduce noise and improve visualization.
- Visualization: Use MA plots and volcano plots to visualize differential expression results. These plots help identify significant genes and their fold changes.
- Normalization: Always normalize your data using
calcNormFactors
inedgeR
to account for library size differences.
This guide provides a comprehensive workflow for RNA-seq data analysis, including normalization, batch correction, and visualization. The code is modular and can be adapted to other datasets or NGS assays like ATAC-seq or ChIP-seq.