DNA-crispr

FPKM vs Raw Counts vs RPKM: Step-by-Step Guide

January 3, 2025 By admin

This guide clarifies the differences between raw counts, RPKM, and FPKM in RNA-seq analysis and explains when to use each. It also includes example calculations and points to tools for generating and analyzing these values.


1. Definitions and Differences

  • Raw Counts: The number of reads directly mapped to a gene/transcript. Used as input for tools like DESeq2 and edgeR for differential expression analysis.
  • RPKM (Reads Per Kilobase of transcript per Million mapped reads): Normalizes raw counts by gene length and sequencing depth.
  • FPKM (Fragments Per Kilobase of transcript per Million mapped reads): The paired-end counterpart of RPKM. It counts each sequenced fragment (read pair) once instead of counting its two reads separately. For single-end data, FPKM and RPKM are identical.
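
Written as a formula, the normalization described above looks roughly as follows. The symbols are notation chosen here for illustration and are not taken from any particular tool: C_g is the number of reads mapped to gene g, L_g is the gene length in base pairs, and N is the total number of mapped reads in the sample.

```latex
\mathrm{RPKM}_g
  = \frac{C_g}{\left(L_g / 10^{3}\right)\,\left(N / 10^{6}\right)}
  = \frac{10^{9}\, C_g}{L_g\, N}
```

FPKM has the same form, with C_g and N counted in fragments rather than reads. As a quick check, a gene with 500 mapped reads and a length of 2000 bp in a library of 10 million mapped reads gives 10^9 × 500 / (2000 × 10^7) = 25.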

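Key Considerations:

  • Use raw counts, not RPKM/FPKM, as input to DESeq2 and edgeR.
  • RPKM and FPKM are suited to ranking expression within a single sample; they are not ideal for comparing expression across samples.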


2. Calculations

FPKM Calculation (Python Script):

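def calculate_fpkm(counts, gene_lengths, total_mapped_reads):
    """Return FPKM per gene.

    For paired-end data, pass fragment counts and the total number of
    mapped fragments. Gene lengths are in base pairs.
    """
    fpkm_values = {}
    for gene, count in counts.items():
        length_kb = gene_lengths[gene] / 1000.0  # gene length in kilobases
        # Normalize by length first, then by library size in millions
        fpkm = (count / length_kb) / (total_mapped_reads / 1e6)
        fpkm_values[gene] = fpkm
    return fpkm_values

# Example usage
counts = {'GeneA': 500, 'GeneB': 1000}
gene_lengths = {'GeneA': 2000, 'GeneB': 1000}  # in bp
total_mapped_reads = 10_000_000

fpkm = calculate_fpkm(counts, gene_lengths, total_mapped_reads)
print(fpkm)  # {'GeneA': 25.0, 'GeneB': 100.0}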

RPKM Calculation (R Script):

R
calculate_rpkm <- function(raw_counts, gene_lengths, total_mapped_reads) {
  # Vectorized: divide counts by length in kb, then by mapped reads in millions
  rpkm <- (raw_counts / (gene_lengths / 1000)) / (total_mapped_reads / 1e6)
  return(rpkm)
}

# Example usage
raw_counts <- c(GeneA = 500, GeneB = 1000)
gene_lengths <- c(GeneA = 2000, GeneB = 1000) # in bp
total_mapped_reads <- 10e6

rpkm <- calculate_rpkm(raw_counts, gene_lengths, total_mapped_reads)
print(rpkm)


3. Tools and Pipelines

  • Raw Counts:
    • HTSeq-count: Generates raw counts from aligned BAM files.
    • Command:
      bash
      htseq-count -f bam -r pos -s no input.bam genes.gtf > counts.txt
  • FPKM/RPKM Calculation:
    • StringTie: Reports both FPKM and TPM values for each gene and transcript.
    • Command:
      bash
      stringtie input.bam -G annotation.gtf -o output.gtf -A gene_abundances.txt
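  • DESeq2/edgeR: Analyze raw counts for differential expression. Both apply their own normalization internally, so supply unnormalized counts rather than RPKM/FPKM values.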
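
To connect the htseq-count step with the calculations in Section 2, the sketch below reads a counts file into a dictionary. It assumes the default two-column, tab-separated htseq-count output and skips the summary rows that htseq-count appends at the end, whose names begin with a double underscore (for example `__no_feature` and `__ambiguous`). The function name `read_htseq_counts` and the inline sample data are hypothetical and used only for illustration.

```python
import csv
import io


def read_htseq_counts(handle):
    """Parse htseq-count output (gene_id<TAB>count) into a dict of ints."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        gene_id, value = row
        if gene_id.startswith("__"):
            continue  # summary rows such as __no_feature are not genes
        counts[gene_id] = int(value)
    return counts


# Made-up file contents standing in for counts.txt
example = io.StringIO(
    "GeneA\t500\n"
    "GeneB\t1000\n"
    "__no_feature\t120\n"
    "__ambiguous\t30\n"
)

counts = read_htseq_counts(example)
print(counts)  # {'GeneA': 500, 'GeneB': 1000}
```

With a real file, the same function can be called as `read_htseq_counts(open("counts.txt"))`, and the resulting dictionary can be passed to a function such as `calculate_fpkm` together with gene lengths and the total number of mapped reads.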

4. Pros and Cons

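| Metric     | Pros                                         | Cons                                       |
|------------|----------------------------------------------|--------------------------------------------|
| Raw Counts | Input for robust statistical methods.        | Requires normalization for interpretation. |
| RPKM/FPKM  | Useful for within-sample expression ranking. | Not ideal for cross-sample comparisons.    |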
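
A toy example may help show why RPKM/FPKM values are hard to compare across samples. The sketch below uses made-up counts for two hypothetical samples sequenced to the same depth. GeneA and GeneB are assumed to be biologically unchanged, while GeneC is more highly expressed in the second sample and therefore takes a larger share of the reads. For simplicity, the library size is taken to be the sum of the listed counts, which is an assumption of this sketch.

```python
def fpkm(counts, lengths_bp):
    """FPKM per gene; library size is the sum of counts (toy assumption)."""
    total = sum(counts.values())
    return {
        gene: (count / (lengths_bp[gene] / 1000.0)) / (total / 1e6)
        for gene, count in counts.items()
    }


lengths_bp = {"GeneA": 1000, "GeneB": 1000, "GeneC": 1000}  # made-up lengths

# Same depth (300 fragments each); only GeneC differs biologically
sample1 = {"GeneA": 100, "GeneB": 100, "GeneC": 100}
sample2 = {"GeneA": 50, "GeneB": 50, "GeneC": 200}

fpkm1 = fpkm(sample1, lengths_bp)
fpkm2 = fpkm(sample2, lengths_bp)

# Within sample 2, the ranking still identifies GeneC as the top gene
top_gene = max(fpkm2, key=fpkm2.get)
print(top_gene)

# Across samples, GeneA appears to drop although it was assumed unchanged
ratio = fpkm1["GeneA"] / fpkm2["GeneA"]
print(round(ratio, 3))
```

In this setup GeneA's FPKM is twice as high in the first sample as in the second, purely because GeneC absorbs more of the fixed sequencing depth in the second sample. Within each sample the ordering of genes remains informative, which matches the table above.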

5. Recent Tools and Resources

  • Salmon/Kallisto: Fast pseudo-alignment and quantification of RNA-seq data.
  • Bioconductor Workflow: Comprehensive guide for RNA-seq analysis (link).
  • HAROLD Blog: Insights into RNA-seq expression units (link).

By understanding the differences between FPKM, raw counts, and RPKM, and following these steps, you’ll be equipped to choose the appropriate metric for your RNA-seq analysis.
