Should We Remove Duplicated Reads in RNA-Seq?
January 2, 2025
In RNA-Seq analysis, duplicated reads can arise from different sources, and whether to remove them is a debated issue. This decision should be based on understanding the cause of duplication and its impact on the data. Below are step-by-step instructions and considerations for handling duplicated reads in RNA-Seq data:
Step 1: Understanding Duplicated Reads
- Definition: Duplicated reads are reads (or read pairs) that align to the same position and strand on the reference genome, usually with identical sequence. They can result from PCR amplification or from naturally abundant, highly expressed transcripts.
- Types of Duplicates:
  - PCR Duplicates: Generated during the library preparation process.
  - True Biological Duplicates: Representing highly expressed genes or repetitive regions.
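To make the definition concrete, here is a minimal sketch of position-based duplicate flagging. The `(name, chrom, pos, strand)` record layout and the function name are assumptions made for illustration; real tools work on BAM records and also consider mate positions and library information.

```python
from collections import defaultdict

def flag_positional_duplicates(reads):
    """Return the names of reads that share (chrom, pos, strand) with an earlier read.

    `reads` is an iterable of (name, chrom, pos, strand) tuples. This simplified
    layout is a hypothetical stand-in for real alignment records.
    """
    by_position = defaultdict(list)
    for name, chrom, pos, strand in reads:
        by_position[(chrom, pos, strand)].append(name)

    duplicates = set()
    for names in by_position.values():
        # Keep the first read seen at each position; flag the rest.
        duplicates.update(names[1:])
    return duplicates

reads = [
    ("r1", "chr1", 100, "+"),
    ("r2", "chr1", 100, "+"),  # same position and strand as r1
    ("r3", "chr1", 100, "-"),  # same position, opposite strand
    ("r4", "chr2", 500, "+"),
]
print(sorted(flag_positional_duplicates(reads)))  # ['r2']
```

Note that nothing in this grouping can tell whether r2 is a PCR copy of r1 or an independent fragment from an abundant transcript, which is why removal is debated for RNA-Seq.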
Step 2: Assessing Duplication in RNA-Seq Data
- Check Duplication Rate: Before removing duplicates, assess the duplication rate in your data. High duplication rates are common in RNA-Seq and are often due to highly expressed genes, but they can also indicate low library complexity or over-amplification.
- Tools for Duplication Assessment:
  - FastQC: Can help you assess sequence duplication levels.
  - Picard Tools (MarkDuplicates): Useful for detecting and marking duplicate reads.
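Once duplicates have been marked (not removed), a rough duplication rate can be read from the SAM flag field, where bit 0x400 means "PCR or optical duplicate". The sketch below assumes SAM-formatted text lines as input and a hypothetical function name; it counts single reads, so the value will not necessarily match the pair-aware figure reported in Picard's metrics file.

```python
DUPLICATE = 0x400      # SAM flag bit: PCR or optical duplicate
UNMAPPED = 0x4         # read is unmapped
SECONDARY = 0x100      # secondary alignment
SUPPLEMENTARY = 0x800  # supplementary alignment

def duplication_fraction(sam_lines):
    """Fraction of mapped primary alignments that carry the duplicate flag."""
    total = 0
    dups = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue  # count each mapped read once
        total += 1
        if flag & DUPLICATE:
            dups += 1
    return dups / total if total else 0.0

example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",  # marked duplicate
    "r3\t16\tchr1\t250\t60\t4M\t*\t0\t0\tTTGA\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",            # unmapped, ignored
]
# One duplicate among three mapped primary reads.
print(round(duplication_fraction(example), 3))
```

In practice the lines could come from the text output of `samtools view` on a duplicate-marked BAM file.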
Step 3: Should You Remove Duplicates?
- Not Always Necessary: Removing duplicates might lead to the loss of important biological information, especially for highly expressed genes.
- When to Remove Duplicates:
  - If you suspect PCR bias.
  - If duplicates are overwhelmingly present due to low sample complexity or over-amplification.
- When to Keep Duplicates:
  - In cases of highly expressed genes where duplicates are not artifacts but a reflection of biological abundance.
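One way to tell these situations apart is to look at duplication per gene: heavy duplication in abundant genes is expected, while heavy duplication in genes with few reads is harder to explain biologically. The following is only a heuristic sketch; the function name, the input layout, and both thresholds are hypothetical and would need tuning for real data.

```python
def suspicious_genes(gene_counts, min_dup_fraction=0.5, max_total=100):
    """List genes with few reads but a high share of duplicates.

    `gene_counts` maps gene -> (total_reads, duplicate_reads).
    Both thresholds are illustrative assumptions, not recommended values.
    """
    flagged = []
    for gene, (total, dups) in gene_counts.items():
        if total == 0:
            continue
        dup_fraction = dups / total
        if total <= max_total and dup_fraction >= min_dup_fraction:
            flagged.append(gene)
    return sorted(flagged)

counts = {
    "GENE_A": (50000, 42000),  # abundant transcript: duplicates expected
    "GENE_B": (40, 30),        # few reads, mostly duplicates: possible artifact
    "GENE_C": (60, 3),         # few reads, few duplicates
}
print(suspicious_genes(counts))  # ['GENE_B']
```

If many low-count genes are flagged this way, PCR over-amplification becomes a more plausible explanation than biology.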
Step 4: Remove Duplicates (If Decided)
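Using Samtools:
To remove duplicates using samtools, run samtools markdup with the -r option on a coordinate-sorted BAM file whose mate tags were added beforehand with samtools fixmate -m. Without -r, duplicates are only marked, not removed.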
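The original commands for this step were not preserved, so the following is a sketch of one common samtools sequence (collate, fixmate, sort, markdup), wrapped in Python so the steps can be inspected before anything runs. The file names and the `prefix` parameter are placeholders, and the helper names are hypothetical.

```python
import subprocess

def samtools_dedup_commands(in_bam, out_bam, prefix="tmp"):
    """Build the samtools commands for duplicate removal (nothing is executed here)."""
    collated = f"{prefix}.collate.bam"
    fixmate = f"{prefix}.fixmate.bam"
    position_sorted = f"{prefix}.sorted.bam"
    return [
        # Group reads by name so mates are adjacent.
        ["samtools", "collate", "-o", collated, in_bam],
        # Add mate score tags (-m), which markdup relies on.
        ["samtools", "fixmate", "-m", collated, fixmate],
        # markdup expects coordinate-sorted input.
        ["samtools", "sort", "-o", position_sorted, fixmate],
        # -r removes duplicates instead of only marking them.
        ["samtools", "markdup", "-r", position_sorted, out_bam],
    ]

def run_commands(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

commands = samtools_dedup_commands("input.bam", "dedup.bam")
for cmd in commands:
    print(" ".join(cmd))
# To execute, assuming samtools is installed and input.bam exists:
# run_commands(commands)
```

Dropping `-r` from the last command keeps the duplicates in the file with the duplicate flag set, which leaves the decision to downstream tools.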
Using Picard:
If using Picard tools, run MarkDuplicates with REMOVE_DUPLICATES=true to remove duplicates; by default they are only flagged in the output.
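As the original command was not preserved, here is a sketch of a typical Picard MarkDuplicates invocation, built as an argument list. The jar location and file names are placeholders, the helper name is hypothetical, and the exact argument syntax may differ between Picard versions.

```python
def picard_markduplicates_command(in_bam, out_bam, metrics,
                                  picard_jar="picard.jar", remove=True):
    """Build a Picard MarkDuplicates command line (nothing is executed here)."""
    cmd = [
        "java", "-jar", picard_jar, "MarkDuplicates",
        f"I={in_bam}",   # input BAM, typically coordinate-sorted
        f"O={out_bam}",  # output BAM
        f"M={metrics}",  # duplication metrics report
    ]
    if remove:
        # Without this option duplicates are flagged but kept.
        cmd.append("REMOVE_DUPLICATES=true")
    return cmd

picard_cmd = picard_markduplicates_command(
    "sorted.bam", "dedup.bam", "dup_metrics.txt")
print(" ".join(picard_cmd))
```

The metrics file is worth keeping even when duplicates are removed, since it records the duplication rate used in Step 2.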
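Using DEDUP in R:
In R, you can use packages such as Rsubread to handle duplicates: its removeDupReads function filters duplicated reads from a mapping file, and featureCounts can skip reads already marked as duplicates via its ignoreDup argument.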
Step 5: Validate Impact on Differential Expression
- Compare Results: Run differential expression analysis with and without duplicate removal to check for any major differences in the gene expression profiles.
- Tools for Differential Expression:
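Whichever differential expression tool is used, the with/without comparison can be sketched as a set comparison of the significant genes from both runs. The input layout (gene mapped to adjusted p-value), the function name, and the 0.05 cutoff are assumptions for illustration.

```python
def compare_de_calls(with_dups, without_dups, alpha=0.05):
    """Compare significant genes from two runs.

    Each argument maps gene -> adjusted p-value from one analysis.
    """
    sig_with = {g for g, p in with_dups.items() if p < alpha}
    sig_without = {g for g, p in without_dups.items() if p < alpha}
    union = sig_with | sig_without
    shared = sig_with & sig_without
    return {
        "shared": sorted(shared),
        "only_with_duplicates": sorted(sig_with - sig_without),
        "only_after_removal": sorted(sig_without - sig_with),
        # Jaccard index: 1.0 means both runs call exactly the same genes.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

padj_with = {"G1": 0.001, "G2": 0.03, "G3": 0.20}
padj_without = {"G1": 0.002, "G2": 0.08, "G3": 0.04}
result = compare_de_calls(padj_with, padj_without)
print(result["shared"])                # ['G1']
print(result["only_with_duplicates"])  # ['G2']
print(result["only_after_removal"])    # ['G3']
```

Genes that appear in only one of the two lists are the ones worth inspecting by hand, for example by looking at their per-gene duplication.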
Step 6: Other Tools for Handling Duplicates
- EXPRESS Pipeline (Berkeley): This tool smooths the coverage distribution and removes outlier spikes, which can be a better approach than simply removing duplicates.
- Picard Tools: A reliable set of command-line tools for working with high-throughput sequencing data in SAM/BAM format, including duplicate marking and removal.
Conclusion:
Whether to remove duplicated reads in RNA-Seq depends on the nature of the duplicates and the potential for amplification artifacts. In general, it’s better to investigate the duplication rates and make decisions based on data quality and biological context rather than blindly removing duplicates.