Basic Normalization, Batch Correction, and Visualization of RNA-seq Data

January 10, 2025 Off By admin

This step-by-step guide will walk you through a basic RNA-seq data analysis workflow using edgeR in R. The workflow includes data loading, normalization, batch correction, visualization, and differential expression analysis. The example data used here are from the GEO dataset GSE124167, which includes RNA-seq data from treated and untreated mice across three different library preparation kits (batches).

Table of Contents

Step 1: Load Required Libraries and Data

First, install and load the necessary R packages. Then, download and parse the RNA-seq count data from GEO.

Install and Load Libraries

# Install necessary packages if not already installed
install.packages("data.table")
install.packages("dplyr")
install.packages("ggplot2")
BiocManager::install("edgeR")
BiocManager::install("limma")

# Load libraries
library(data.table)
library(dplyr)
library(edgeR)
library(limma)
library(ggplot2)

Download and Parse Data

# Function to parse data properly
parseProperly <- function(x) {
  x <- x[, -c(ncol(x) - 1, ncol(x))]  # Remove last two columns
  return(x)
}

# URLs for the data
m1 <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124167/suppl/GSE124167_FINAL_master_list_of_gene_counts_MIN.V4.txt.gz"
m2 <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124167/suppl/GSE124167_FINAL_master_list_of_gene_counts_MIN.sense.Pico.R.txt.gz"
m3 <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE124nnn/GSE124167/suppl/GSE124167_FINAL_master_list_of_gene_counts_MIN.sense.Truseq.txt.gz"

# Read and parse data
method1 <- parseProperly(fread(m1, data.table = FALSE))
method2 <- parseProperly(fread(m2, data.table = FALSE))
method3 <- parseProperly(fread(m3, data.table = FALSE))

# Combine data into a single table
counts <- Reduce(function(x, y) full_join(x, y, by = "id"), list(method1, method2, method3))

# Replace NAs with zeros
counts[is.na(counts)] <- 0

# Move "id" column to row names
rownames(counts) <- counts$id
counts$id <- NULL

Step 2: Set Up edgeR Object

Create a DGEList object, which is the primary data structure used by edgeR. Add metadata for the library preparation kits and treatments.

# Create DGEList object
y <- DGEList(counts = counts)

# Add metadata
y$samples$kit <- c(rep("Pico", 6), rep("Illumina", 6), rep("V4", 6))
y$samples$treatment <- rep(c(rep("treated", 3), rep("untreated", 3)), 3)

# Normalize data
y <- calcNormFactors(y)
logCPMs <- cpm(y, log = TRUE)

Step 3: Perform PCA for Quality Control

Principal Component Analysis (PCA) is a useful technique for identifying batch effects and other sources of variation in the data.

# Calculate row-wise variance
rv <- apply(logCPMs, 1, var)

# Select top 1000 most variable genes
top1000 <- order(rv, decreasing = TRUE)[1:1000]
logCPM_top1000 <- logCPMs[top1000, ]

# Run PCA
pca <- prcomp(t(logCPM_top1000))

# Combine PCA results with metadata
to_plot <- data.frame(pca$x, y$samples)

# Calculate percentage of variance explained by each PC
percentVar <- pca$sdev^2 / sum(pca$sdev^2) * 100

# Plot PCA results
ggplot(to_plot, aes(x = PC1, y = PC2, color = kit, shape = treatment)) +
  geom_point(size = 3) +
  xlab(paste0("PC1 - Var.expl = ", round(percentVar[1], 2), "%")) +
  ylab(paste0("PC2 - Var.expl = ", round(percentVar[2], 2), "%")) +
  ggtitle("PCA Before Batch Correction")

Step 4: Correct for Batch Effects

Use the removeBatchEffect function from the limma package to correct for batch effects.

# Correct for batch effects
batch <- factor(y$samples$kit)
logCPMs_corrected <- removeBatchEffect(logCPMs, batch = batch)

# Repeat PCA with corrected data
logCPM_corrected_top1000 <- logCPMs_corrected[top1000, ]
pca_corrected <- prcomp(t(logCPM_corrected_top1000))
to_plot_corrected <- data.frame(pca_corrected$x, y$samples)
percentVar_corrected <- pca_corrected$sdev^2 / sum(pca_corrected$sdev^2) * 100

# Plot PCA after batch correction
ggplot(to_plot_corrected, aes(x = PC1, y = PC2, color = kit, shape = treatment)) +
  geom_point(size = 3) +
  xlab(paste0("PC1 - Var.expl = ", round(percentVar_corrected[1], 2), "%")) +
  ylab(paste0("PC2 - Var.expl = ", round(percentVar_corrected[2], 2), "%")) +
  ggtitle("PCA After Batch Correction")

Step 5: Differential Expression Analysis

Perform differential expression analysis using edgeR, accounting for batch effects.

# Design matrix accounting for batch and treatment
design <- model.matrix(~kit + treatment, y$samples)

# Estimate dispersion and fit model
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design)

# Test for differential expression
fit <- glmQLFTest(fit, coef = 4)  # Fourth column corresponds to treatment
tt <- data.frame(topTags(fit, n = Inf))

# Classify genes as up-regulated, down-regulated, or not significant
tt_modified <- tt %>%
  mutate(status = factor(case_when(
    logFC > 0 & FDR < 0.05 ~ "up",
    logFC < 0 & FDR < 0.05 ~ "down",
    TRUE ~ "not.signif"
  ), levels = c("up", "not.signif", "down")))

# MA plot
ggplot(tt_modified, aes(x = logCPM, y = logFC, color = status)) +
  geom_point(size = 1) +
  scale_color_manual(values = c("firebrick", "grey", "dodgerblue")) +
  ggtitle("MA Plot")

# Volcano plot
ggplot(tt_modified, aes(x = logFC, y = -log10(FDR), color = status)) +
  geom_point(size = 1) +
  scale_color_manual(values = c("firebrick", "grey", "dodgerblue")) +
  ggtitle("Volcano Plot")

Tips and Tricks

Batch Effect Correction: Always check for batch effects using PCA before proceeding with differential expression analysis. Use removeBatchEffect or include batch in the design matrix.
Top Variable Genes: For PCA, focus on the top 1000 most variable genes to reduce noise and improve visualization.
Visualization: Use MA plots and volcano plots to visualize differential expression results. These plots help identify significant genes and their fold changes.
Normalization: Always normalize your data using calcNormFactors in edgeR to account for library size differences.

This guide provides a comprehensive workflow for RNA-seq data analysis, including normalization, batch correction, and visualization. The code is modular and can be adapted to other datasets or NGS assays like ATAC-seq or ChIP-seq.