Calculating TPM from featureCounts Output
January 10, 2025In this guide, we will walk through the process of calculating Transcripts Per Million (TPM) from the output of featureCounts
. TPM is a widely used normalization method for RNA-seq data that accounts for both gene length and sequencing depth. We will provide a step-by-step explanation, along with R code to perform the calculations.
Step 1: Understand the TPM Formula
The TPM formula is as follows:
TPMi=(ReadsiLengthi)×106∑j=1n(ReadsjLengthj)
Where:
- Readsi: Number of reads mapped to gene i.
- Lengthi: Length of gene i in kilobases (kb).
- ∑j=1n(ReadsjLengthj): Sum of reads per kilobase for all genes.
Step 2: Prepare the featureCounts Output
Assume you have a featureCounts
output file with the following columns:
Geneid
: Gene identifier.Length
: Length of the gene in base pairs.sample.bam
: Number of reads mapped to the gene.
Example:
Geneid Length sample.bam NM_032291 10934 25 NM_001101 2000 100
Step 3: Load the Data in R
Load the featureCounts
output into R and convert the gene length to kilobases.
# Load necessary libraries library(dplyr) # Example featureCounts output feature_counts <- data.frame( Geneid = c("NM_032291", "NM_001101"), Length = c(10934, 2000), sample.bam = c(25, 100) ) # Convert gene length to kilobases feature_counts <- feature_counts %>% mutate(Length_kb = Length / 1000)
Step 4: Calculate Reads Per Kilobase (RPK)
Calculate the reads per kilobase (RPK) for each gene.
# Calculate RPK feature_counts <- feature_counts %>% mutate(RPK = sample.bam / Length_kb)
Step 5: Calculate the Scaling Factor
Sum the RPK values for all genes and divide by 106 to get the scaling factor.
# Calculate scaling factor scaling_factor <- sum(feature_counts$RPK) / 1e6
Step 6: Calculate TPM
Divide the RPK values by the scaling factor to obtain TPM.
# Calculate TPM feature_counts <- feature_counts %>% mutate(TPM = RPK / scaling_factor) # View the final table print(feature_counts)
Step 7: Automate the Process with a Function
Here’s a reusable R function to calculate TPM from featureCounts
output:
calculate_tpm <- function(feature_counts) { # Convert gene length to kilobases feature_counts <- feature_counts %>% mutate(Length_kb = Length / 1000) # Calculate RPK feature_counts <- feature_counts %>% mutate(RPK = sample.bam / Length_kb) # Calculate scaling factor scaling_factor <- sum(feature_counts$RPK) / 1e6 # Calculate TPM feature_counts <- feature_counts %>% mutate(TPM = RPK / scaling_factor) return(feature_counts) } # Example usage tpm_results <- calculate_tpm(feature_counts) print(tpm_results)
Step 8: Handle Multiple Samples
If you have multiple samples, you can extend the function to handle a matrix of counts.
calculate_tpm_matrix <- function(count_matrix, gene_lengths) { # Convert gene lengths to kilobases gene_lengths_kb <- gene_lengths / 1000 # Calculate RPK for each sample rpk_matrix <- sweep(count_matrix, 1, gene_lengths_kb, "/") # Calculate scaling factor for each sample scaling_factors <- colSums(rpk_matrix) / 1e6 # Calculate TPM for each sample tpm_matrix <- sweep(rpk_matrix, 2, scaling_factors, "/") return(tpm_matrix) } # Example usage count_matrix <- matrix(c(25, 100, 50, 200), nrow = 2, byrow = TRUE) gene_lengths <- c(10934, 2000) tpm_matrix <- calculate_tpm_matrix(count_matrix, gene_lengths) print(tpm_matrix)
Tips and Tricks
- Gene Length: Ensure the gene length is in base pairs before converting to kilobases.
- Zero Counts: Handle genes with zero counts appropriately to avoid division by zero.
- Multiple Samples: Use the matrix-based function for datasets with multiple samples.
- Visualization: Use TPM values for downstream analysis like clustering, PCA, or heatmaps.
By following this guide, you can calculate TPM from featureCounts
output and normalize your RNA-seq data for further analysis.