Calculating TPM from featureCounts Output

January 10, 2025 Off By admin

In this guide, we will walk through the process of calculating Transcripts Per Million (TPM) from the output of featureCounts. TPM is a widely used normalization method for RNA-seq data that accounts for both gene length and sequencing depth. We will provide a step-by-step explanation, along with R code to perform the calculations.

Table of Contents

Step 1: Understand the TPM Formula

The TPM formula is as follows:

$TPM_{i} = \sum ^{j = 1 n} ( Length ^{j} Reads ^{j} ) ( Length ^{i} Reads ^{i} ) \times 10 ^{6}$

Where:

$Reads_{i}$ : Number of reads mapped to gene $i$ .
$Length_{i}$ : Length of gene $i$ in kilobases (kb).
$\sum_{j = 1 n} (Length ^{j} Reads ^{j})$ : Sum of reads per kilobase for all genes.

Step 2: Prepare the featureCounts Output

Assume you have a featureCounts output file with the following columns:

Geneid: Gene identifier.
Length: Length of the gene in base pairs.
sample.bam: Number of reads mapped to the gene.

Example:

Geneid      Length   sample.bam
NM_032291   10934    25
NM_001101   2000     100

Step 3: Load the Data in R

Load the featureCounts output into R and convert the gene length to kilobases.

# Load necessary libraries
library(dplyr)

# Example featureCounts output
feature_counts <- data.frame(
  Geneid = c("NM_032291", "NM_001101"),
  Length = c(10934, 2000),
  sample.bam = c(25, 100)
)

# Convert gene length to kilobases
feature_counts <- feature_counts %>%
  mutate(Length_kb = Length / 1000)

Step 4: Calculate Reads Per Kilobase (RPK)

Calculate the reads per kilobase (RPK) for each gene.

# Calculate RPK
feature_counts <- feature_counts %>%
  mutate(RPK = sample.bam / Length_kb)

Step 5: Calculate the Scaling Factor

Sum the RPK values for all genes and divide by $1 0^{6}$ to get the scaling factor.

# Calculate scaling factor
scaling_factor <- sum(feature_counts$RPK) / 1e6

Step 6: Calculate TPM

Divide the RPK values by the scaling factor to obtain TPM.

# Calculate TPM
feature_counts <- feature_counts %>%
  mutate(TPM = RPK / scaling_factor)

# View the final table
print(feature_counts)

Step 7: Automate the Process with a Function

Here’s a reusable R function to calculate TPM from featureCounts output:

calculate_tpm <- function(feature_counts) {
  # Convert gene length to kilobases
  feature_counts <- feature_counts %>%
    mutate(Length_kb = Length / 1000)
  
  # Calculate RPK
  feature_counts <- feature_counts %>%
    mutate(RPK = sample.bam / Length_kb)
  
  # Calculate scaling factor
  scaling_factor <- sum(feature_counts$RPK) / 1e6
  
  # Calculate TPM
  feature_counts <- feature_counts %>%
    mutate(TPM = RPK / scaling_factor)
  
  return(feature_counts)
}

# Example usage
tpm_results <- calculate_tpm(feature_counts)
print(tpm_results)

Step 8: Handle Multiple Samples

If you have multiple samples, you can extend the function to handle a matrix of counts.

calculate_tpm_matrix <- function(count_matrix, gene_lengths) {
  # Convert gene lengths to kilobases
  gene_lengths_kb <- gene_lengths / 1000
  
  # Calculate RPK for each sample
  rpk_matrix <- sweep(count_matrix, 1, gene_lengths_kb, "/")
  
  # Calculate scaling factor for each sample
  scaling_factors <- colSums(rpk_matrix) / 1e6
  
  # Calculate TPM for each sample
  tpm_matrix <- sweep(rpk_matrix, 2, scaling_factors, "/")
  
  return(tpm_matrix)
}

# Example usage
count_matrix <- matrix(c(25, 100, 50, 200), nrow = 2, byrow = TRUE)
gene_lengths <- c(10934, 2000)

tpm_matrix <- calculate_tpm_matrix(count_matrix, gene_lengths)
print(tpm_matrix)

Tips and Tricks

Gene Length: Ensure the gene length is in base pairs before converting to kilobases.
Zero Counts: Handle genes with zero counts appropriately to avoid division by zero.
Multiple Samples: Use the matrix-based function for datasets with multiple samples.
Visualization: Use TPM values for downstream analysis like clustering, PCA, or heatmaps.