AI-bioinformatics

Should We Remove Duplicated Reads in RNA-Seq?

January 2, 2025 By admin

In RNA-Seq analysis, duplicated reads can arise from different sources, and whether to remove them is a debated issue. This decision should be based on understanding the cause of duplication and its impact on the data. Below are step-by-step instructions and considerations for handling duplicated reads in RNA-Seq data:

Step 1: Understanding Duplicated Reads

  • Definition: Duplicated reads are reads that align to the same position and strand on the reference genome, usually with identical sequences. They can come from PCR amplification of a single fragment or from independent fragments of naturally highly expressed transcripts.
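  • Types of Duplicates:
    • PCR duplicates: technical copies of the same original fragment, produced during library amplification.
    • Natural duplicates: independent fragments from highly expressed transcripts that happen to align to the same position.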
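To make the definition concrete, here is a minimal sketch that counts reads sharing chromosome, start position, and strand with an earlier read. It is a simplification: the function name `count_position_duplicates` is hypothetical, and dedicated tools such as Picard also take mate position and clipping into account.

```shell
# Reads SAM text on stdin, e.g.: samtools view input.bam | count_position_duplicates
count_position_duplicates() {
    awk -F '\t' '
        /^@/ { next }                  # skip SAM header lines
        int($2 / 4) % 2 { next }       # skip unmapped reads (FLAG bit 0x4)
        {
            # FLAG bit 0x10 set means the read aligns to the reverse strand
            strand = (int($2 / 16) % 2) ? "-" : "+"
            key = $3 ":" $4 ":" strand
            total++
            if (seen[key]++) dup++     # any read after the first at this key
        }
        END { printf "total=%d duplicates=%d\n", total, dup }
    '
}
```

This count cannot tell PCR duplicates from natural ones; it only shows how many reads stack at identical coordinates.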

Step 2: Assessing Duplication in RNA-Seq Data

  • Check Duplication Rate: Before removing duplicates, assess the duplication rate in your data. In RNA-Seq, high duplication rates are often due to highly expressed genes, but they can also indicate low library complexity or over-amplification.
  • Tool for Duplication Assessment:
    • FastQC: Can help you assess sequence duplication levels.
      bash
      fastqc your_data.fastq
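    • Picard Tools (MarkDuplicates): Useful for detecting and marking duplicate reads; the duplication rate is reported in the metrics file. For assessment, mark duplicates without removing them.
      bash
      picard MarkDuplicates I=input.bam O=marked.bam M=metrics.txt REMOVE_DUPLICATES=false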
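Once Picard has written its metrics file, the duplication rate can be pulled out with a short helper. The sketch below assumes the usual MarkDuplicates metrics layout (a tab-separated header row starting with LIBRARY, followed by one data row per library); the function name `picard_percent_duplication` is hypothetical.

```shell
# Print the PERCENT_DUPLICATION value for the first library in a
# Picard MarkDuplicates metrics file (path given as the first argument).
picard_percent_duplication() {
    awk -F '\t' '
        $1 == "LIBRARY" {              # column header row of the metrics table
            for (i = 1; i <= NF; i++)
                if ($i == "PERCENT_DUPLICATION") col = i
            getline                    # move to the first data row
            if (col) print $col
            exit
        }
    ' "$1"
}
```

Usage would look like `picard_percent_duplication metrics.txt`. Despite the column name, Picard reports this value as a fraction between 0 and 1.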

Step 3: Should You Remove Duplicates?

  • Not Always Necessary: Removing duplicates might lead to the loss of important biological information, especially for highly expressed genes.
  • When to Remove Duplicates:
    • If you suspect PCR bias.
    • If duplicates are overwhelmingly present due to low sample complexity or over-amplification.
  • When to Keep Duplicates:
    • In cases of highly expressed genes where duplicates are not artifacts but a reflection of biological abundance.

Step 4: Remove Duplicates (If Decided)

Using Samtools:

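To remove duplicates using samtools (rmdup is documented as obsolete in current samtools releases, which recommend samtools markdup instead; for single-end reads, rmdup also needs the -s option):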

bash
samtools rmdup input.bam output.bam
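The markdup route that samtools recommends in place of rmdup needs a few preparatory steps, because markdup relies on mate tags added by fixmate and on coordinate-sorted input. A minimal sketch, following the sequence given in the samtools markdup documentation; the wrapper name `dedup_with_markdup` and the temporary file names are assumptions for illustration.

```shell
# Usage: dedup_with_markdup input.bam output.bam
dedup_with_markdup() {
    local in_bam="$1" out_bam="$2" tmp status
    tmp="$(mktemp -d)" || return 1

    # 1. name-sort so that mates are adjacent
    # 2. fixmate -m adds the mate score tags that markdup uses
    # 3. coordinate-sort, as markdup requires
    # 4. markdup -r removes duplicates instead of only flagging them
    samtools sort -n -o "$tmp/namesort.bam" "$in_bam" &&
        samtools fixmate -m "$tmp/namesort.bam" "$tmp/fixmate.bam" &&
        samtools sort -o "$tmp/positionsort.bam" "$tmp/fixmate.bam" &&
        samtools markdup -r "$tmp/positionsort.bam" "$out_bam"
    status=$?

    rm -rf "$tmp"                      # clean up intermediate files
    return "$status"
}
```

Dropping the `-r` option leaves duplicates in the output with the duplicate flag set, which keeps the decision reversible.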

Using Picard:

To remove duplicates with Picard (the metrics file argument M is required):

bash
picard MarkDuplicates I=input.bam O=output.bam M=metrics.txt REMOVE_DUPLICATES=true
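An alternative to deleting reads outright is to leave them in the BAM with the duplicate flag set (Picard does this when REMOVE_DUPLICATES is false, its default) and filter on that flag only where needed. The sketch below uses the SAM duplicate flag 0x400 (decimal 1024) with samtools view; the two function names are hypothetical.

```shell
# Count reads that carry the duplicate flag (0x400 = 1024).
count_marked_duplicates() {
    samtools view -c -f 1024 "$1"
}

# Write a BAM copy that excludes reads carrying the duplicate flag.
# Usage: drop_marked_duplicates marked.bam dedup.bam
drop_marked_duplicates() {
    samtools view -b -F 1024 -o "$2" "$1"
}
```

Keeping the marked file means the same alignment can be counted with and without duplicates, with no need to realign.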

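Handling Duplicates in R:

In R, the Rsubread package can skip reads already flagged as duplicates (for example by Picard MarkDuplicates) at the counting stage, via the ignoreDup argument of featureCounts:

R
library(Rsubread)
# marked.bam and annotation.gtf are placeholder file names
counts <- featureCounts(files="marked.bam", annot.ext="annotation.gtf", isGTFAnnotationFile=TRUE, isPairedEnd=TRUE, nthreads=4, ignoreDup=TRUE)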

Step 5: Validate Impact on Differential Expression
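One way to approach this step, sketched under the assumption that you have two per-gene count tables for the same sample (one counted with duplicates, one without, each as tab-separated gene and count), is to look at how much each gene loses. The function name `compare_counts` and the table layout are assumptions for illustration.

```shell
# Usage: compare_counts counts_with_dups.tsv counts_dedup.tsv
# Output columns: gene, count with duplicates, count after removal,
# fraction of reads retained.
compare_counts() {
    awk -F '\t' '
        NR == FNR { before[$1] = $2; next }    # first file: counts with duplicates
        ($1 in before) && before[$1] > 0 {
            printf "%s\t%d\t%d\t%.2f\n", $1, before[$1], $2, $2 / before[$1]
        }
    ' "$1" "$2"
}
```

Genes with a low retained fraction are the ones most affected by deduplication. Running the differential expression analysis on both tables and comparing the resulting gene lists then shows whether the choice changes the conclusions.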

Step 6: Other Tools for Handling Duplicates

  • EXPRESS Pipeline (Berkeley): This tool smooths the coverage distribution and removes outlier spikes, which can be a better approach than simply removing duplicates.
  • Picard Tools: A reliable set of command-line tools for working with sequencing data in SAM/BAM format, including duplicate marking and removal.

Conclusion:

Whether to remove duplicated reads in RNA-Seq depends on the nature of the duplicates and the potential for amplification artifacts. In general, it’s better to investigate the duplication rates and make decisions based on data quality and biological context rather than blindly removing duplicates.
