FPKM vs Raw Counts vs RPKM: Step-by-Step Guide
January 3, 2025
This guide clarifies the differences between FPKM, raw counts, and RPKM in RNA-seq analysis and explains when and how to use each. It also walks through the calculations and points to tools for analysis.
1. Definitions and Differences
- Raw Counts: The number of reads (or fragments) assigned to each gene or transcript, with no normalization applied. Raw counts are the expected input for differential expression tools such as DESeq2 and edgeR.
- RPKM (Reads Per Kilobase of transcript per Million mapped reads): Normalizes raw counts by gene length (in kilobases) and by sequencing depth (in millions of mapped reads).
- FPKM (Fragments Per Kilobase of transcript per Million mapped reads): The paired-end counterpart of RPKM. The two reads of a pair come from one fragment, so FPKM counts fragments rather than reads to avoid counting the same molecule twice.
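Written as formulas, the two normalized units look like this. The symbols are chosen here for illustration and are not part of any tool's output: $C_g$ is the number of reads mapped to gene $g$, $F_g$ the number of fragments, $L_g$ the gene length in base pairs, and $N_R$ and $N_F$ the total mapped reads and fragments in the sample.

$$
\mathrm{RPKM}_g = \frac{C_g}{\left(L_g / 10^{3}\right)\left(N_R / 10^{6}\right)} = \frac{10^{9}\, C_g}{L_g\, N_R},
\qquad
\mathrm{FPKM}_g = \frac{10^{9}\, F_g}{L_g\, N_F}
$$

For single-end data every read is its own fragment, so the two values coincide.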
Key Considerations:
- Use raw counts for statistical models like DESeq2, which handle normalization internally.
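- Use FPKM/RPKM to compare expression levels of genes within a single sample, not for differential expression across samples: the values depend on each sample's overall composition, so the same gene can receive different FPKM/RPKM values in two samples without any change in its own read count.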
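To make the within-sample versus cross-sample point concrete, here is a small Python sketch on made-up data. The gene names, lengths, and counts are hypothetical, and the total is taken to be the sum of the listed counts, which is a simplification.

```python
# Hypothetical toy data: fragment counts for three genes in two samples.
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 1000}
sample1 = {"geneA": 1000, "geneB": 1000, "geneC": 1000}
sample2 = {"geneA": 1000, "geneB": 1000, "geneC": 8000}  # only geneC changes


def fpkm(counts, lengths):
    """FPKM per gene: fragments / (length in kb * total fragments in millions)."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}


fpkm1 = fpkm(sample1, lengths_bp)
fpkm2 = fpkm(sample2, lengths_bp)

# Within a sample, FPKM ranks genes sensibly: geneA and geneB have the same
# count, but geneB is twice as long, so its FPKM is half of geneA's.
print("sample1 geneA vs geneB:", fpkm1["geneA"], fpkm1["geneB"])

# Across samples, geneA's count is unchanged, yet its FPKM drops because
# geneC now takes up a larger share of the library.
print("geneA in sample1 vs sample2:", fpkm1["geneA"], fpkm2["geneA"])
```

The drop in geneA's value between the two samples comes entirely from geneC, which is the kind of composition effect that count-based tools such as DESeq2 are designed to account for.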
2. Calculations
FPKM Calculation (Python Script):
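A minimal sketch of such a script, assuming fragment counts and gene lengths are already available as dictionaries. The function name `compute_fpkm`, the gene names, and the numbers are hypothetical, and the library total is approximated by the sum of the listed counts; in practice the total number of mapped fragments reported by the aligner may be the better denominator.

```python
def compute_fpkm(fragment_counts, gene_lengths_bp):
    """Return a dict of FPKM values, one per gene.

    fragment_counts: gene -> number of fragments assigned to the gene
    gene_lengths_bp: gene -> gene (or transcript) length in base pairs
    """
    total_fragments = sum(fragment_counts.values())
    if total_fragments == 0:
        raise ValueError("no mapped fragments: FPKM is undefined")

    per_million = total_fragments / 1e6  # sequencing-depth scaling factor
    result = {}
    for gene, fragments in fragment_counts.items():
        length_kb = gene_lengths_bp[gene] / 1e3  # length scaling factor
        result[gene] = fragments / (length_kb * per_million)
    return result


# Hypothetical input, for illustration only.
counts = {"gene1": 500, "gene2": 1500, "gene3": 0}
lengths = {"gene1": 2000, "gene2": 3000, "gene3": 1500}

fpkm_values = compute_fpkm(counts, lengths)
for gene, value in fpkm_values.items():
    print(f"{gene}\t{value:.2f}")
```

With such a tiny library the values are very large; on real data with millions of fragments they fall into a more familiar range.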
RPKM Calculation (R Script):
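A minimal sketch of the RPKM arithmetic, written in Python rather than R so that it matches the language of the FPKM script; the same two steps translate directly to R. The function name `rpkm` and the example numbers are hypothetical.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """RPKM for one gene, computed in the two steps the name spells out."""
    # Step 1: depth normalization -> reads per million mapped reads (RPM).
    rpm = read_count / (total_mapped_reads / 1e6)
    # Step 2: length normalization -> RPM per kilobase of transcript.
    return rpm / (gene_length_bp / 1e3)


# Hypothetical single-end example: 500 reads on a 2,000 bp gene,
# in a library of 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)  # 25.0
```

Doing the depth step first gives 50 reads per million, and dividing by the 2 kb length gives 25; the order of the two divisions does not affect the result.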
3. Tools and Pipelines
- Raw Counts:
  - HTSeq-count: Generates raw counts from aligned BAM files.
    - Command:
- FPKM/RPKM Calculation:
  - StringTie: Provides both FPKM and TPM normalization.
    - Command:
- DESeq2/edgeR: Analyzes raw counts for differential expression.
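The HTSeq-count and StringTie steps listed above might be invoked roughly as follows. This is a sketch, not the guide's original commands: the file names are placeholders, and the option values (unstranded library, position-sorted BAM, `exon` features keyed by `gene_id`) are assumptions that need to match your own data. The Python snippet only assembles and prints the command lines so they can be reviewed before being run in a shell.

```python
import shlex

# Hypothetical file names; replace with your own BAM and GTF paths.
bam = "aligned.sorted.bam"
gtf = "annotation.gtf"

# htseq-count: -f input format, -r sort order of the BAM, -s strandedness,
# -t feature type to count, -i GTF attribute used as the gene ID.
htseq_cmd = ["htseq-count", "-f", "bam", "-r", "pos", "-s", "no",
             "-t", "exon", "-i", "gene_id", bam, gtf]

# StringTie: -G reference annotation, -e restrict estimation to the
# reference transcripts, -o output GTF (carries FPKM and TPM),
# -A per-gene abundance table.
stringtie_cmd = ["stringtie", bam, "-G", gtf, "-e",
                 "-o", "sample.gtf", "-A", "sample_gene_abund.tab"]

# htseq-count writes its counts to stdout, so redirect to a file in the shell.
print(shlex.join(htseq_cmd), "> counts.txt")
print(shlex.join(stringtie_cmd))
```

The resulting `counts.txt` holds the raw counts that DESeq2 or edgeR expect, while the StringTie outputs carry the FPKM and TPM values.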
4. Pros and Cons
Metric | Pros | Cons |
---|---|---|
Raw Counts | Input for robust statistical methods. | Requires normalization for interpretation. |
RPKM/FPKM | Useful for within-sample expression ranking. | Not ideal for cross-sample comparisons. |
5. Recent Tools and Resources
- Salmon/Kallisto: Fast pseudo-alignment and quantification of RNA-seq data.
- Bioconductor Workflow: Comprehensive guide for RNA-seq analysis (link).
- HAROLD Blog: Insights into RNA-seq expression units (link).
By understanding the differences between FPKM, raw counts, and RPKM, and following these steps, you’ll be equipped to choose the appropriate metric for your RNA-seq analysis.