Integrativebioinformatics

Step-by-Step Guide: Error Correction Tools for PacBio Long Reads

January 10, 2025, by admin

PacBio long reads are valuable for genome assembly and structural variant detection, but raw continuous long reads (CLR) have a high per-base error rate, typically around 10-15% and dominated by insertions and deletions. Error correction tools improve read accuracy before assembly or variant calling. This guide gives an overview of popular tools for PacBio long-read error correction, their pros and cons, and step-by-step instructions for using them.


1. Overview of PacBio Error Correction Tools

Below are some widely used tools for correcting PacBio long reads:

| Tool | Type of Correction | Pros | Cons |
|---|---|---|---|
| Proovread | Hybrid (Short + Long Reads) | High accuracy, flexible, supports various data types | Requires high-coverage short reads |
| LoRDEC | Hybrid (Short + Long Reads) | Fast, efficient, uses de Bruijn graphs | Requires high-quality short reads |
| PacBioToCA | Hybrid (Short + Long Reads) | Part of Celera Assembler, well-documented | Slow, requires high computational resources |
| HGAP | Self-Correction | Designed for PacBio, integrates with SMRT Analysis | Requires high coverage (>50x) |
| ECTools | Hybrid (Short + Long Reads) | Uses unitigs from short-read assemblies | Requires specific grid computing setup |

2. Using Proovread for Hybrid Correction

Proovread uses high-coverage short reads (e.g., Illumina) to correct PacBio long reads.

Step 1: Install Proovread

```bash
git clone --recursive https://github.com/BioInf-Wuerzburg/proovread.git
cd proovread/util/bwa
make   # builds the bundled bwa that proovread uses for mapping
```

Step 2: Run Proovread

```bash
proovread -l pacbio_reads.fasta -s illumina_reads.fastq -t 16 --pre corrected_output
```
  • -l: PacBio long reads.
  • -s: Illumina short reads.
  • -t: Number of threads.
  • --pre: Prefix for the output directory and files.

Pros:

  • High accuracy.
  • Flexible with different types of short-read data.

Cons:

  • Requires high-coverage short reads.
  • Can be computationally intensive.
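
Because Proovread depends on high-coverage short reads, it can help to estimate the short-read depth before starting a run. The function below is a minimal sketch, not part of Proovread: `estimate_coverage` is a hypothetical helper name, it assumes an uncompressed FASTQ file with four lines per record, and the genome size is a value you supply yourself.

```shell
# Estimate sequencing depth from an uncompressed 4-line-per-record FASTQ file.
# Usage: estimate_coverage <reads.fastq> <genome_size_bp>
estimate_coverage() {
    local fastq="$1" genome_size="$2"
    # Reject a missing, zero, or non-numeric genome size.
    if ! [ "$genome_size" -gt 0 ] 2>/dev/null; then
        echo "genome size must be a positive integer" >&2
        return 1
    fi
    # Sequence lines are the 2nd line of every 4-line record.
    LC_ALL=C awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0) }
        END { printf "%.1f\n", bases / g }
    ' "$fastq"
}
```

For example, `estimate_coverage illumina_reads.fastq 5000000` prints the estimated depth for a 5 Mb genome. What counts as "high coverage" depends on the tool and dataset, so treat the number as a rough guide.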

3. Using LoRDEC for Hybrid Correction

LoRDEC uses de Bruijn graphs constructed from short reads to correct long reads.

Step 1: Install LoRDEC

```bash
wget https://www.atgc-montpellier.fr/download/sources/lordec/lordec-0.8.tar.gz
tar -xvzf lordec-0.8.tar.gz
cd lordec-0.8
make
```

Step 2: Run LoRDEC

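```bash
lordec-correct -i pacbio_reads.fasta -2 illumina_reads.fastq -k 19 -s 3 -o corrected_reads.fasta
```
  • -i: PacBio long reads.
  • -2: Illumina short reads.
  • -k: k-mer size (19 is a commonly used value).
  • -s: Solidity threshold, the minimum number of times a k-mer must occur in the short reads to be trusted (for example, 3).
  • -o: Output file.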
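
The k-mer size affects how well the de Bruijn graph separates true sequence from errors, so it can be worth trying more than one value. The sketch below only prints one `lordec-correct` command per candidate k and does not run anything. `lordec_sweep` is a hypothetical helper name, the output file naming is an assumption, and `-s 3` (the solidity threshold) is an assumed value to tune for your data.

```shell
# Print (do not run) one lordec-correct command per candidate k-mer size.
# Usage: lordec_sweep <long_reads.fasta> <short_reads.fastq> <k> [<k> ...]
lordec_sweep() {
    local long_reads="$1" short_reads="$2"
    shift 2
    local k
    for k in "$@"; do
        # -s 3 is an assumed solidity threshold; adjust it to your short-read depth.
        echo "lordec-correct -i $long_reads -2 $short_reads -k $k -s 3 -o corrected_k${k}.fasta"
    done
}
```

Calling `lordec_sweep pacbio_reads.fasta illumina_reads.fastq 17 19 21` lists three commands that you can review and then run one at a time.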

Pros:

  • Fast and efficient.
  • Works well with high-quality short reads.

Cons:

  • Requires high-quality short reads.
  • May not handle complex repeats well.
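
LoRDEC writes corrected regions in uppercase and leaves uncorrected regions in lowercase, which gives a quick way to gauge how much of the data was corrected. The following is a rough sketch under that assumption; `corrected_fraction` is a hypothetical helper name and is not shipped with LoRDEC.

```shell
# Report the percentage of bases written in uppercase in a FASTA file.
# Usage: corrected_fraction <corrected_reads.fasta>
corrected_fraction() {
    LC_ALL=C awk '
        /^>/ { next }                       # skip header lines
        {
            total += length($0)             # count all bases on this line
            upper += gsub(/[ACGTN]/, "")    # gsub returns the number of uppercase bases removed
        }
        END {
            if (total == 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * upper / total
        }
    ' "$1"
}
```

A low percentage may point to insufficient short-read coverage or a poorly chosen k-mer size.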

4. Using PacBioToCA for Hybrid Correction

PacBioToCA is part of the Celera Assembler and uses short reads to correct PacBio long reads.

Step 1: Install Celera Assembler

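```bash
# Download a Celera Assembler (wgs-assembler) release archive from the project's
# SourceForge page, then unpack it. "wgs-X.Y" is a placeholder for the version you chose.
tar -xjf wgs-X.Y.tar.bz2
cd wgs-X.Y
# Source releases are built with make; follow the build notes shipped in the archive.
```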

Step 2: Run PacBioToCA

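```bash
fastqToCA -libraryname illumina -technology illumina -reads illumina_reads.fastq > illumina.frg
pacBioToCA -l corrected -t 16 -s pacbio.spec -fastq pacbio_reads.fastq illumina.frg
```
  • fastqToCA: Converts the Illumina FASTQ file to Celera's FRG format, which pacBioToCA expects for short reads.
  • -l: Library name, used as the output prefix.
  • -t: Number of threads.
  • -s: Specification file (configure parameters here).
  • -fastq: PacBio long reads in FASTQ format.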

Pros:

  • Well-documented.
  • Part of a comprehensive assembly pipeline.

Cons:

  • Slow, especially on large genomes.
  • Requires substantial memory and CPU resources.

5. Using HGAP for Self-Correction

HGAP (Hierarchical Genome Assembly Process) is designed for PacBio data and performs self-correction.

Step 1: Install SMRT Analysis

Download and install SMRT Analysis from the PacBio website.

Step 2: Run HGAP

```bash
smrtpipe.py --params=hgap_params.xml --output=output_dir xml:input.xml
```
  • --params: HGAP parameter (settings) file.
  • --output: Output directory.
  • xml:input.xml: Input XML file listing the raw read files. The xml: prefix tells smrtpipe.py how to interpret the argument.

Pros:

  • Designed specifically for PacBio data.
  • Integrates with SMRT Analysis.

Cons:

  • Requires high coverage (>50x).
  • Limited to PacBio data.
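
HGAP-style self-correction maps the shorter reads onto the longest "seed" reads, which is part of why it needs deep coverage. As a rough illustration of how a seed-length cutoff relates to coverage, the sketch below finds the read length at which reads at least that long add up to a chosen depth. `seed_length_cutoff` is a hypothetical helper and not part of SMRT Analysis; the genome size and the target depth are values you supply, and any specific target (such as 30x) is an assumption.

```shell
# Find the read-length cutoff at which reads at least that long
# add up to a target coverage of the genome.
# Usage: seed_length_cutoff <reads.fasta> <genome_size_bp> <target_coverage>
seed_length_cutoff() {
    local fasta="$1" genome_size="$2" target_cov="$3"
    # 1) print the length of every record (multi-line FASTA is supported)
    # 2) sort the lengths, longest first
    # 3) accumulate lengths until the required number of bases is reached
    LC_ALL=C awk '
        /^>/ { if (len) print len; len = 0; next }
        { len += length($0) }
        END { if (len) print len }
    ' "$fasta" |
    sort -rn |
    awk -v need="$((genome_size * target_cov))" '
        { sum += $1; last = $1 }
        sum >= need { print $1; found = 1; exit }
        # Not enough data: fall back to the shortest read length (0 if no reads).
        END { if (!found) print last + 0 }
    '
}
```

For example, `seed_length_cutoff pacbio_reads.fasta 5000000 30` prints the length above which the reads alone give about 30x of a 5 Mb genome. If the result equals the shortest read length, the dataset probably lacks the depth that self-correction needs.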

6. Using ECTools for Hybrid Correction

ECTools uses unitigs from short-read assemblies to correct long reads.

Step 1: Install ECTools

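```bash
git clone https://github.com/jgurtowski/ectools.git
cd ectools
# ECTools is a collection of Python and shell scripts, so there is no compile step.
export ECTOOLS_HOME="$PWD"   # convenience variable used in the commands that follow
```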

Step 2: Run ECTools

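ECTools is driven by scripts rather than a single command. Work in a fresh directory, copy correct.sh from the ECTools directory into it, and set the variables at the top of the script (such as the path to the unitig file) before submitting jobs.

```bash
python ${ECTOOLS_HOME}/partition.py 20 500 pacbio_reads.fasta
cd 0001 && qsub -cwd -j y -t 1:${NUM_FILES_IN_DIRECTORY} ../correct.sh
```
  • partition.py 20 500: Splits the PacBio long reads into batches (20 reads per file, 500 files per directory).
  • correct.sh: Correction script; the short-read assembly (unitigs) and other settings are configured inside it.
  • qsub -t: Submits one array task per batch file; NUM_FILES_IN_DIRECTORY is the number of files in the batch directory.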

Pros:

  • Uses unitigs for correction.
  • Can handle complex datasets.

Cons:

  • Requires a specific grid computing setup.

7. Choosing the Right Tool

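  • Hybrid Correction: Use Proovread or LoRDEC if you have high-coverage, high-quality short reads.
  • Self-Correction: Use HGAP if you have high-coverage (>50x) PacBio data.
  • Comprehensive Pipeline: Use PacBioToCA if you are already using Celera Assembler.
  • Unitig-Based Correction: Use ECTools if you already have a short-read assembly and access to a compute grid.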

By following this guide, you can effectively correct PacBio long reads, improving the accuracy of your downstream analyses. Choose the tool that best fits your data and computational resources.
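
Whichever tool you choose, a quick sanity check is to compare basic statistics of the raw and corrected reads, since a large drop in read count or total bases suggests that many reads were discarded or trimmed. The function below is a minimal sketch; `fasta_summary` is a hypothetical helper name and assumes uncompressed FASTA input.

```shell
# Print read count, total bases and longest read for a FASTA file.
# Usage: fasta_summary <reads.fasta>
fasta_summary() {
    LC_ALL=C awk '
        /^>/ { if (len > max) max = len; n++; len = 0; next }   # new record starts
        { len += length($0); total += length($0) }              # sequence line
        END {
            if (len > max) max = len                            # account for the last record
            printf "reads=%d bases=%d longest=%d\n", n, total, max
        }
    ' "$1"
}
```

Run it on both files, for example `fasta_summary pacbio_reads.fasta` and `fasta_summary corrected_reads.fasta`, and compare the two lines.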
