Step-by-Step Guide: Error Correction Tools for PacBio Long Reads

January 10, 2025, by admin

PacBio long reads are valuable for genome assembly and structural variant detection, but raw continuous long reads (CLR) carry a high per-base error rate, typically around 10-15% and dominated by insertions and deletions. Correcting these errors improves read accuracy before assembly or variant calling. This guide gives an overview of popular tools for PacBio long-read error correction, their pros and cons, and step-by-step instructions for using each.

1. Overview of PacBio Error Correction Tools

Below are some widely used tools for correcting PacBio long reads:

Tool	Type of Correction	Pros	Cons
Proovread	Hybrid (Short + Long Reads)	High accuracy, flexible, supports various data types	Requires high-coverage short reads
LoRDEC	Hybrid (Short + Long Reads)	Fast, efficient, uses de Bruijn graphs	Requires high-quality short reads
PacBioToCA	Hybrid (Short + Long Reads)	Part of Celera Assembler, well-documented	Slow, requires high computational resources
HGAP	Self-Correction	Designed for PacBio, integrates with SMRT Analysis	Requires high coverage (>50x)
ECTools	Hybrid (Short + Long Reads)	Uses unitigs from short-read assemblies	Requires specific grid computing setup

2. Using Proovread for Hybrid Correction

Proovread uses high-coverage short reads (e.g., Illumina) to correct PacBio long reads.

Step 1: Install Proovread

git clone --recursive https://github.com/BioInf-Wuerzburg/proovread.git
cd proovread
make

Step 2: Run Proovread

proovread -l pacbio_reads.fasta -s illumina_reads.fastq -t 16 --pre corrected_output

-l: PacBio long reads.
-s: Illumina short reads.
-t: Number of threads.
--pre: Prefix for the output directory and files.

Pros:

High accuracy.
Flexible with different types of short-read data.

Cons:

Requires high-coverage short reads.
Can be computationally intensive.
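
One common way to keep memory use and run time manageable is to split the long reads into chunks and correct each chunk in its own run against the full short-read set. The sketch below is a minimal, tool-independent FASTA splitter built on awk; the function name split_fasta and the chunk naming scheme are illustrative assumptions, not part of Proovread.

```shell
# split_fasta IN.fasta READS_PER_CHUNK PREFIX
# Writes PREFIX.0001.fasta, PREFIX.0002.fasta, ... with at most
# READS_PER_CHUNK records each. Multi-line sequences are kept intact.
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new chunk file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s.%04d.fasta", prefix, ++chunk)
            }
            count++
        }
        # Copy every line of the current record into the current chunk.
        out != "" { print > out }
    ' "$1"
}
```

For example, split_fasta pacbio_reads.fasta 20000 chunk writes chunk.0001.fasta, chunk.0002.fasta and so on; each chunk can then be passed to proovread as the -l input in a separate run.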

3. Using LoRDEC for Hybrid Correction

LoRDEC uses de Bruijn graphs constructed from short reads to correct long reads.

Step 1: Install LoRDEC

wget https://www.atgc-montpellier.fr/download/sources/lordec/lordec-0.8.tar.gz
tar -xvzf lordec-0.8.tar.gz
cd lordec-0.8
make

Step 2: Run LoRDEC

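lordec-correct -i pacbio_reads.fasta -2 illumina_reads.fastq -k 19 -s 3 -o corrected_reads.fasta

-i: PacBio long reads.
-2: Illumina short reads.
-k: k-mer size used to build the de Bruijn graph (19 here).
-s: Solidity threshold; k-mers seen fewer than this many times in the short reads are ignored (3 here).
-o: Output file.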
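
LoRDEC is described as writing corrected bases in uppercase and leaving uncorrected bases in lowercase. Assuming that convention holds for your version, the share of uppercase bases gives a rough picture of how much of the data was corrected. The helper below is a minimal sketch; the name corrected_fraction is illustrative and not part of LoRDEC.

```shell
# corrected_fraction FILE.fasta
# Prints the fraction of sequence characters that are uppercase,
# as a number between 0 and 1 with four decimals.
corrected_fraction() {
    LC_ALL=C awk '
        /^>/ { next }                      # skip header lines
        {
            total += length($0)            # count all bases on this line
            upper += gsub(/[A-Z]/, "")     # gsub returns the number of uppercase bases
        }
        END {
            if (total == 0) { print "0.0000"; exit }
            printf "%.4f\n", upper / total
        }
    ' "$1"
}
```

Running corrected_fraction corrected_reads.fasta on the LoRDEC output prints a single number; a low value suggests the short reads covered little of the long-read data, or that -k and -s need tuning.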

Pros:

Fast and efficient.
Works well with high-quality short reads.

Cons:

Requires high-quality short reads.
May not handle complex repeats well.

4. Using PacBioToCA for Hybrid Correction

PacBioToCA is part of the Celera Assembler and uses short reads to correct PacBio long reads.

Step 1: Install Celera Assembler

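# Celera Assembler is distributed through the wgs-assembler project on SourceForge:
#   https://sourceforge.net/projects/wgs-assembler/
# Download a release archive from that page. "celera-assembler.tar.gz" below
# stands for whichever archive you downloaded.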
tar -xvzf celera-assembler.tar.gz
cd celera-assembler
make

Step 2: Run PacBioToCA

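pacBioToCA -l corrected -t 16 -s pacbio.spec -fastq pacbio_reads.fastq illumina_reads.frg

-l: Library name, used as the prefix for output files.
-t: Number of threads.
-s: Specification file (configure parameters here).
-fastq: PacBio long reads in FASTQ format.
illumina_reads.frg: Illumina short reads in Celera Assembler's FRG format (converted from FASTQ with fastqToCA).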

Pros:

Well-documented.
Part of a comprehensive assembly pipeline.

Cons:

Slow and resource-intensive (high CPU and memory requirements).
Celera Assembler is no longer actively developed; its successor is Canu.

5. Using HGAP for Self-Correction

HGAP (Hierarchical Genome Assembly Process) is designed for PacBio data and performs self-correction.

Step 1: Install SMRT Analysis

Download and install SMRT Analysis from the PacBio website.

Step 2: Run HGAP

smrtpipe.py --params=hgap_params.xml --output=output_dir xml:input.xml

--params: HGAP parameter file.
--output: Output directory.
xml:input.xml: Input XML file specifying reads; the xml: prefix tells smrtpipe.py that the input is an XML file.

Pros:

Designed specifically for PacBio data.
Integrates with SMRT Analysis.

Cons:

Requires high coverage (>50x).
Limited to PacBio data.
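
The >50x requirement can be checked before launching a run by dividing the total number of sequenced bases by the expected genome size. The sketch below assumes the subreads are available as FASTA and that you supply the genome size in base pairs; estimate_coverage and has_min_coverage are illustrative names, not part of SMRT Analysis.

```shell
# estimate_coverage READS.fasta GENOME_SIZE_BP
# Prints total bases divided by genome size, with one decimal.
estimate_coverage() {
    awk -v g="$2" '
        /^>/ { next }                  # skip header lines
        { bases += length($0) }        # sum sequence lengths
        END { printf "%.1f\n", bases / g }
    ' "$1"
}

# has_min_coverage READS.fasta GENOME_SIZE_BP MIN_COVERAGE
# Exit status 0 if the estimated coverage is at least MIN_COVERAGE.
has_min_coverage() {
    cov=$(estimate_coverage "$1" "$2")
    awk -v c="$cov" -v m="$3" 'BEGIN { exit !(c >= m) }'
}
```

For a bacterial genome of about 4.6 Mb, has_min_coverage subreads.fasta 4600000 50 succeeds only if the reads add up to at least 50x. This is a raw estimate: it counts every base, including those in short or low-quality reads that HGAP may filter out.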

6. Using ECTools for Hybrid Correction

ECTools uses unitigs from short-read assemblies to correct long reads.

Step 1: Install ECTools

git clone https://github.com/jgurtowski/ectools.git
cd ectools

Step 2: Run ECTools

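python partition.py 20 500 pacbio_reads.fasta
qsub -cwd -j y -t 1:500 ../correct.sh

partition.py 20 500: Split the PacBio long reads into files of 20 reads each, with 500 files per partition directory.
correct.sh: Takes no input flags. Before submitting, edit the variables at the top of the script (short-read unitig file, paths to the nucmer tools); see the repository README for the exact names.
qsub -t 1:500: Submit correct.sh from inside each partition directory as a grid array job, one task per file.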

Pros:

Uses unitigs for correction.
Can handle complex datasets.

Cons:

Requires specific grid computing setup.
Slow for large datasets.

7. Choosing the Right Tool

Hybrid Correction: Use Proovread or LoRDEC if you have high-coverage short reads.
Self-Correction: Use HGAP if you have high-coverage PacBio data.
Comprehensive Pipeline: Use PacBioToCA if you are already using Celera Assembler.

By following this guide, you can effectively correct PacBio long reads, improving the accuracy of your downstream analyses. Choose the tool that best fits your data and computational resources.
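
Whichever tool is used, a quick sanity check is to compare basic length statistics of the reads before and after correction, since some tools trim or split reads in regions they cannot correct. The sketch below is a minimal, tool-independent summary; the function name fasta_stats is illustrative, and the numbers describe read lengths only, not accuracy.

```shell
# fasta_stats FILE.fasta
# Prints: reads=<count> bases=<total> mean=<mean length> n50=<N50>
# Records with an empty sequence are not counted.
fasta_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # emit previous read length
        { len += length($0) }                            # accumulate multi-line sequences
        END { if (len > 0) print len }                   # emit the last read
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "reads=0 bases=0 mean=0 n50=0"; exit }
            half = total / 2
            # N50: length of the read at which the running sum of
            # lengths (longest first) reaches half of all bases.
            for (i = 1; i <= NR; i++) {
                acc += lens[i]
                if (acc >= half) { n50 = lens[i]; break }
            }
            printf "reads=%d bases=%d mean=%.1f n50=%d\n", NR, total, total / NR, n50
        }
    '
}
```

Running fasta_stats pacbio_reads.fasta and fasta_stats corrected_reads.fasta side by side shows how many reads and bases survived correction and whether the length distribution shifted.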