Should We Remove Duplicated Reads in RNA-Seq?

January 2, 2025 By admin

In RNA-Seq analysis, duplicated reads can arise from different sources, and whether to remove them is a debated issue. This decision should be based on understanding the cause of duplication and its impact on the data. Below are step-by-step instructions and considerations for handling duplicated reads in RNA-Seq data:

Step 1: Understanding Duplicated Reads

Definition: Duplicated reads are reads (or read pairs) with identical sequences that align to the same position on the reference genome. They can come from PCR amplification during library preparation or from independent fragments of naturally abundant, highly expressed transcripts.
Types of Duplicates:
- PCR Duplicates: Generated during the library preparation process.
- True Biological Duplicates: Representing highly expressed genes or repetitive regions.

Step 2: Assessing Duplication in RNA-Seq Data

Check Duplication Rate: Before removing duplicates, assess the duplication rate in your data. High duplication rates are common in RNA-Seq and are often due to highly expressed genes, but they can also indicate low library complexity or over-amplification.
Tool for Duplication Assessment:
- FastQC: Can help you assess sequence duplication levels.
  bash
  fastqc your_data.fastq
- Picard Tools (MarkDuplicates): Useful for detecting and marking duplicate reads.
  bash
  picard MarkDuplicates I=input.bam O=output.bam M=metrics.txt REMOVE_DUPLICATES=false
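
Once MarkDuplicates has written its metrics file (metrics.txt in the command above), the duplication rate can be read from it. The sketch below assumes the usual Picard layout: a tab-separated table that follows a line starting with `## METRICS CLASS` and contains a `PERCENT_DUPLICATION` column (reported as a fraction). The function name `dup_rate` is illustrative only.

```bash
# Print library name and duplication fraction from a Picard
# MarkDuplicates metrics file. 'dup_rate' is a hypothetical helper name.
dup_rate() {
  awk -F'\t' '
    /^## METRICS CLASS/ { in_metrics = 1; next }   # the table starts after this line
    in_metrics && !col {                           # next line is the column header
      for (i = 1; i <= NF; i++) if ($i == "PERCENT_DUPLICATION") col = i
      next
    }
    in_metrics && col && NF == 0 { exit }          # a blank line ends the table
    in_metrics && col { print $1 "\t" $col }       # LIBRARY and duplication fraction
  ' "$1"
}

# Usage: dup_rate metrics.txt
```

Looking the column up by name rather than by position keeps the helper working if the number of columns differs between Picard versions.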

Step 3: Should You Remove Duplicates?

Not Always Necessary: Removing duplicates might lead to the loss of important biological information, especially for highly expressed genes.
When to Remove Duplicates:
- If you suspect PCR bias.
- If duplicates are overwhelmingly present due to low sample complexity or over-amplification.
When to Keep Duplicates:
- In cases of highly expressed genes where duplicates are not artifacts but a reflection of biological abundance.

Step 4: Remove Duplicates (If Decided)

Using `Samtools`:

To remove duplicates using samtools:
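
A minimal sketch of one common samtools workflow for paired-end data is shown here. It assumes a samtools version that provides `markdup`; the function name `dedup_samtools` and the intermediate file names are illustrative and not part of the original post.

```bash
# Sketch: remove duplicates from a paired-end BAM with samtools.
# Usage: dedup_samtools input.bam dedup.bam
dedup_samtools() {
  in_bam=$1
  out_bam=$2
  prefix=${out_bam%.bam}   # base name for the intermediate files

  # 1. Group reads by name so that mates are adjacent (needed by fixmate).
  samtools collate -o "$prefix.collated.bam" "$in_bam" || return 1
  # 2. Add the mate score tags (-m) that markdup uses to pick which copy to keep.
  samtools fixmate -m "$prefix.collated.bam" "$prefix.fixmate.bam" || return 1
  # 3. Sort by coordinate, since markdup works on position-sorted input.
  samtools sort -o "$prefix.sorted.bam" "$prefix.fixmate.bam" || return 1
  # 4. Mark duplicates and remove them (-r) rather than only flagging them.
  samtools markdup -r "$prefix.sorted.bam" "$out_bam"
}
```

Dropping `-r` in the last step keeps the duplicates in the file but flags them, which leaves the decision to downstream tools.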

Using `Picard`:

If using Picard tools, to remove duplicates:
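
As a sketch, the MarkDuplicates call from Step 2 can be reused with `REMOVE_DUPLICATES=true` so that duplicates are dropped from the output. The wrapper name `dedup_picard` and the file names are assumptions for illustration; the input BAM is assumed to be coordinate-sorted.

```bash
# Sketch: remove duplicates with Picard (wrapper name is illustrative).
# Usage: dedup_picard sorted.bam dedup.bam metrics.txt
dedup_picard() {
  in_bam=$1
  out_bam=$2
  metrics=$3

  # REMOVE_DUPLICATES=true drops duplicates from the output BAM
  # instead of only setting the duplicate flag on them.
  picard MarkDuplicates \
    I="$in_bam" \
    O="$out_bam" \
    M="$metrics" \
    REMOVE_DUPLICATES=true
}
```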

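Using `Rsubread` in R:

In R, the Rsubread package can handle duplicates at the counting step: its featureCounts function skips reads that are already flagged as duplicates when called with ignoreDup = TRUE.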
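
Rsubread's counting function, featureCounts, is also distributed as a command-line program in the Subread package. A minimal sketch, assuming single-end reads and a BAM in which duplicates were flagged but not removed (for example by MarkDuplicates with REMOVE_DUPLICATES=false): the `--ignoreDup` option excludes flagged reads from the counts. The wrapper name and file arguments are illustrative, and paired-end data needs additional options not shown here.

```bash
# Sketch: count reads per gene while skipping reads flagged as duplicates.
# Usage: count_ignoring_dups annotation.gtf counts.txt marked.bam
count_ignoring_dups() {
  annotation=$1
  out_table=$2
  marked_bam=$3

  # --ignoreDup: do not count reads that carry the duplicate flag.
  featureCounts --ignoreDup -a "$annotation" -o "$out_table" "$marked_bam"
}
```

Running the same command without `--ignoreDup` gives the matching table with duplicates kept, which is useful for the comparison in Step 5.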

Step 5: Validate Impact on Differential Expression

Compare Results: Run differential expression analysis with and without duplicate removal to check for any major differences in the gene expression profiles.
Tools for Differential Expression:
- DESeq2:
  R
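  library(DESeq2)
  # countData: gene-by-sample count matrix; colData: sample table with a 'condition' column
  dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ condition)
  dds <- DESeq(dds)    # estimate size factors and dispersions, then fit the model
  res <- results(dds)  # named 'res' so the results() function is not masked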
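
Before rerunning the full differential expression analysis, a quick first check is to see which genes lose the most reads to duplicate removal. The sketch below assumes two single-sample count tables in featureCounts format (gene ID in column 1, counts in column 7), one made with duplicates kept and one without; the function name `dedup_loss` is hypothetical.

```bash
# Print gene ID, count with duplicates, count without duplicates,
# and the fraction of reads removed by deduplication.
# Usage: dedup_loss counts_with_dups.txt counts_dedup.txt
dedup_loss() {
  awk -F'\t' '
    /^#/ || $1 == "Geneid" { next }          # skip comment and header lines
    NR == FNR { with_dups[$1] = $7; next }   # first file: duplicates kept
    ($1 in with_dups) && with_dups[$1] > 0 { # second file: duplicates removed
      printf "%s\t%d\t%d\t%.2f\n", $1, with_dups[$1], $7, 1 - $7 / with_dups[$1]
    }
  ' "$1" "$2"
}
```

Genes with a removed fraction close to 1 are typically the highly expressed ones, where deduplication is most likely to distort expression estimates.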

Step 6: Online Tools for Duplicate Removal

EXPRESS Pipeline (Berkeley): This tool smooths the coverage distribution and removes outlier spikes, which can be a better approach than simply removing duplicates.
Picard Tools: A widely used set of command-line tools for working with sequencing data in SAM/BAM format, including duplicate marking and removal.

Conclusion:

Whether to remove duplicated reads in RNA-Seq depends on the nature of the duplicates and the potential for amplification artifacts. In general, it’s better to investigate the duplication rates and make decisions based on data quality and biological context rather than blindly removing duplicates.