Step-by-Step Guide: Error Correction Tools for PacBio Long Reads
January 10, 2025

PacBio long reads are valuable for genome assembly and structural variant detection, but raw continuous long reads (CLR) have high error rates, typically around 10-15% and dominated by insertions and deletions. Error correction tools improve the accuracy of these reads before downstream analysis. This guide gives an overview of popular tools for PacBio long-read error correction, their pros and cons, and step-by-step instructions for using them.
1. Overview of PacBio Error Correction Tools
Below are some widely used tools for correcting PacBio long reads:
Tool | Type of Correction | Pros | Cons |
---|---|---|---|
Proovread | Hybrid (Short + Long Reads) | High accuracy, flexible, supports various data types | Requires high-coverage short reads |
LoRDEC | Hybrid (Short + Long Reads) | Fast, efficient, uses de Bruijn graphs | Requires high-quality short reads |
PacBioToCA | Hybrid (Short + Long Reads) | Part of Celera Assembler, well-documented | Slow, requires high computational resources |
HGAP | Self-Correction | Designed for PacBio, integrates with SMRT Analysis | Requires high coverage (>50x) |
ECTools | Hybrid (Short + Long Reads) | Uses unitigs from short-read assemblies | Requires specific grid computing setup |
2. Using Proovread for Hybrid Correction
Proovread uses high-coverage short reads (e.g., Illumina) to correct PacBio long reads.
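Before running a hybrid corrector, it can help to check how much short-read coverage is actually available. The sketch below estimates coverage as total sequenced bases divided by genome size. The function name `fastq_coverage` is illustrative, and it assumes a standard four-line-per-record FASTQ and a genome size you supply yourself; the guide does not state a numeric cutoff for "high coverage", so none is hard-coded here.

```shell
# Estimate sequencing depth as (total bases) / (genome size).
# ASSUMPTION: plain, uncompressed FASTQ with exactly 4 lines per record.
# Usage: fastq_coverage <reads.fastq> <genome_size_bp>
fastq_coverage() {
    awk -v genome="$2" '
        NR % 4 == 2 { bases += length($0) }   # line 2 of each record is the sequence
        END {
            if (genome <= 0) {
                print "genome size must be greater than 0" > "/dev/stderr"
                exit 1
            }
            printf "%.1f\n", bases / genome
        }
    ' "$1"
}

# Example with a hypothetical 5 Mb genome:
# fastq_coverage illumina_reads.fastq 5000000
```

The result is only a rough average; it says nothing about how evenly the reads cover the genome.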
Step 1: Install Proovread
git clone --recursive https://github.com/BioInf-Wuerzburg/proovread.git
cd proovread/util/bwa
make
Step 2: Run Proovread
proovread -l pacbio_reads.fasta -s illumina_reads.fastq -t 16 --pre corrected_output
- -l: PacBio long reads.
- -s: Illumina short reads.
- -t: Number of threads.
- --pre: Prefix for the output directory and files.
Pros:
- High accuracy.
- Flexible with different types of short-read data.
Cons:
- Requires high-coverage short reads.
- Can be computationally intensive.
3. Using LoRDEC for Hybrid Correction
LoRDEC uses de Bruijn graphs constructed from short reads to correct long reads.
Step 1: Install LoRDEC
wget https://www.atgc-montpellier.fr/download/sources/lordec/lordec-0.8.tar.gz
tar -xvzf lordec-0.8.tar.gz
cd lordec-0.8
make
Step 2: Run LoRDEC
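lordec-correct -i pacbio_reads.fasta -2 illumina_reads.fastq -k 19 -s 3 -o corrected_reads.fasta
- -i: PacBio long reads.
- -2: Illumina short reads.
- -k: k-mer size (19 here).
- -s: Solidity threshold, the minimum number of times a k-mer must occur in the short reads to be trusted (3 here).
- -o: Output file.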
Pros:
- Fast and efficient.
- Works well with high-quality short reads.
Cons:
- Requires high-quality short reads.
- May not handle complex repeats well.
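One quick way to gauge how much of the data LoRDEC managed to correct is to look at letter case in the output. This sketch assumes LoRDEC's convention of writing corrected (solid) regions in uppercase and uncorrected regions in lowercase; the function name `uppercase_fraction` is illustrative and not part of LoRDEC.

```shell
# Report the percentage of bases written in uppercase in a FASTA file.
# ASSUMPTION: uppercase A/C/G/T/N marks corrected regions, lowercase marks
# regions left uncorrected. Other IUPAC codes are not counted as uppercase.
# Usage: uppercase_fraction <corrected_reads.fasta>
uppercase_fraction() {
    LC_ALL=C awk '
        /^>/ { next }                          # skip header lines
        {
            total += length($0)
            seq = $0
            upper += gsub(/[ACGTN]/, "", seq)  # gsub returns the number of matches
        }
        END {
            if (total == 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * upper / total
        }
    ' "$1"
}

# Example:
# uppercase_fraction corrected_reads.fasta
```

A low percentage suggests the short reads did not cover much of the long-read data, which is worth checking before moving on to assembly.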
4. Using PacBioToCA for Hybrid Correction
PacBioToCA is part of the Celera Assembler and uses short reads to correct PacBio long reads.
Step 1: Install Celera Assembler
wget https://github.com/PacificBiosciences/pbcore/raw/master/celera-assembler.tar.gz
tar -xvzf celera-assembler.tar.gz
cd celera-assembler
make
Step 2: Run PacBioToCA
runCA -p output_dir -d output_dir -s pacbio.spec pacbio_reads.fasta illumina_reads.fastq
- -p: Output prefix.
- -d: Output directory.
- -s: Specification file (configure parameters here).
Pros:
- Well-documented.
- Part of a comprehensive assembly pipeline.
Cons:
- Slow.
- Requires high computational resources.
5. Using HGAP for Self-Correction
HGAP (Hierarchical Genome Assembly Process) is designed for PacBio data and performs self-correction.
Step 1: Install SMRT Analysis
Download and install SMRT Analysis from the PacBio website.
Step 2: Run HGAP
smrtpipe.py --params=hgap_params.xml --output=output_dir input.xml
- --params: HGAP parameter file.
- --output: Output directory.
- input.xml: Input XML file specifying reads.
Pros:
- Designed specifically for PacBio data.
- Integrates with SMRT Analysis.
Cons:
- Requires high coverage (>50x).
- Limited to PacBio data.
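Because self-correction depends on the long reads covering each other, it is worth checking the >50x requirement before starting a run. The sketch below estimates long-read coverage from a FASTA file and tests it against a minimum. The function names are illustrative, the genome size is something you must supply, and the default of 50 simply mirrors the figure given above.

```shell
# Estimate coverage of a FASTA read set as (total bases) / (genome size).
# Handles sequences wrapped over several lines.
# Usage: fasta_coverage <reads.fasta> <genome_size_bp>
fasta_coverage() {
    awk -v genome="$2" '
        !/^>/ { bases += length($0) }          # count every non-header line
        END {
            if (genome <= 0) {
                print "genome size must be greater than 0" > "/dev/stderr"
                exit 1
            }
            printf "%.1f\n", bases / genome
        }
    ' "$1"
}

# Succeed (exit status 0) only if coverage is strictly above the minimum.
# Usage: has_min_coverage <reads.fasta> <genome_size_bp> [min_coverage]
has_min_coverage() {
    cov=$(fasta_coverage "$1" "$2") || return 1
    awk -v c="$cov" -v m="${3:-50}" 'BEGIN { exit !(c > m) }'
}

# Example with a hypothetical 5 Mb genome:
# has_min_coverage pacbio_reads.fasta 5000000 && echo "enough coverage for HGAP"
```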
6. Using ECTools for Hybrid Correction
ECTools uses unitigs from short-read assemblies to correct long reads.
Step 1: Install ECTools
git clone https://github.com/edrezen/ECTools.git
cd ECTools
make
Step 2: Run ECTools
correct.sh -l pacbio_reads.fasta -s short_read_assembly.fasta -o corrected_reads.fasta
- -l: PacBio long reads.
- -s: Short-read assembly (unitigs).
- -o: Output file.
Pros:
- Unitigs are longer than raw short reads, which helps them align to long reads less ambiguously.
- Can handle complex datasets.
Cons:
- Requires specific grid computing setup.
- Slow for large datasets.
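Grid-based correction usually means cutting the long reads into small files that can be processed as independent jobs. The following is a generic sketch of that splitting step, not ECTools' own partitioning code; the function name `split_fasta`, the chunk size, and the output naming scheme are all assumptions made for illustration.

```shell
# Split a FASTA file into chunks of at most N records each.
# Writes <output_prefix>0001.fasta, <output_prefix>0002.fasta, ...
# Usage: split_fasta <reads.fasta> <records_per_chunk> <output_prefix>
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            if (records % n == 0) {            # time to start a new chunk
                if (out != "") close(out)
                out = sprintf("%s%04d.fasta", prefix, ++chunk)
            }
            records++
        }
        out != "" { print > out }              # header and sequence lines alike
    ' "$1"
}

# Example: 500 reads per file, written under a hypothetical chunks/ directory
# mkdir -p chunks && split_fasta pacbio_reads.fasta 500 chunks/part_
```

Each resulting file can then be handed to one grid job, and the corrected outputs concatenated afterwards.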
7. Choosing the Right Tool
- Hybrid Correction: Use Proovread or LoRDEC if you have high-coverage short reads.
- Self-Correction: Use HGAP if you have high-coverage PacBio data.
- Comprehensive Pipeline: Use PacBioToCA if you are already using Celera Assembler.
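The rules above can be written down as a small helper. This is only a sketch of the decision logic: the function name `suggest_tool` is illustrative, coverage values must be whole numbers, and the 30x cutoff for "high-coverage short reads" is an assumption, since the guide gives no number. The >50x cutoff for HGAP comes from the table in section 1.

```shell
# Suggest a correction tool from estimated coverage (whole numbers only).
# Usage: suggest_tool <long_read_cov> <short_read_cov> [uses_celera: yes|no]
suggest_tool() {
    long_cov=$1
    short_cov=$2
    uses_celera=${3:-no}
    min_short_cov=30   # ASSUMPTION: the guide does not define "high coverage"

    if [ "$uses_celera" = "yes" ] && [ "$short_cov" -ge "$min_short_cov" ]; then
        echo "PacBioToCA"
    elif [ "$short_cov" -ge "$min_short_cov" ]; then
        echo "Proovread or LoRDEC"
    elif [ "$long_cov" -gt 50 ]; then
        echo "HGAP"
    else
        echo "none: coverage is too low for the tools in this guide"
    fi
}

# Example:
# suggest_tool 20 40        # hybrid correction is possible
# suggest_tool 60 0         # only self-correction is possible
```

Hybrid options are checked first simply because they are listed first; when both hybrid and self-correction are possible, the better choice depends on your data and compute budget.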
By following this guide, you can effectively correct PacBio long reads, improving the accuracy of your downstream analyses. Choose the tool that best fits your data and computational resources.