Step-by-Step Guide: Best Pipeline for Human Whole Exome Sequencing (WES)
December 28, 2024Whole Exome Sequencing (WES) is a powerful tool for understanding genetic variations within coding regions of the genome. This pipeline guide covers the steps from raw sequencing data to variant detection and annotation, with the most commonly used tools in the bioinformatics field today. The pipeline described below includes updates for modern tools and a focus on accuracy and reproducibility.
1. Preprocessing and Initial Setup
1.1. Set Up the Environment
Before you start, ensure your computational environment is set up with the necessary software and dependencies. If you are using a Unix-based system (Linux/Mac), you can install the tools via package managers (e.g., apt
, brew
, or conda
). The key tools include:
- BWA (Burrows-Wheeler Aligner) for read alignment
- SAMtools for SAM/BAM file manipulation
- Picard for file processing
- GATK (Genome Analysis Toolkit) for variant calling and realignment
- SIFT or PolyPhen for variant effect prediction
- VCFtools for handling VCF files
Install these via conda
, apt
, or brew
:
2. Quality Control and Preprocessing of Raw Data
2.1. Convert Illumina Data to FASTQ Format (if required)
In some cases, you may receive data in a different format. To convert Illumina’s proprietary format to standard FASTQ:
2.2. Quality Control
Run quality control on the raw sequencing data using tools like FastQC. This step ensures that your sequencing reads are of good quality and can help detect any issues with the sequencing run.
Inspect the output files to identify any problems with the data (e.g., poor quality scores).
3. Alignment
3.1. Align Reads to Reference Genome Using BWA
The next step is to align your reads to the reference human genome. Download the human genome reference (e.g., hg38 or GRCh38), and index it using bwa index
:
Now align the sequencing reads to the reference genome:
Here:
-t 4
uses 4 threads for faster alignment.output.sam
is the alignment file in SAM format.
3.2. Convert SAM to BAM
Once the reads are aligned, convert the SAM file to BAM format (binary format) for efficiency:
3.3. Sort and Index the BAM File
Sort the BAM file to prepare it for downstream processing:
Index the sorted BAM file:
4. Realignment Around Indels
4.1. Marking Duplicate Reads
Use Picard to mark PCR duplicates (if present), which is crucial for accurate variant calling:
4.2. Indel Realignment Using GATK
Realignment around indels improves variant calling accuracy. First, identify regions requiring realignment:
Then, perform the actual realignment:
5. Variant Calling
5.1. Call SNPs and Indels Using GATK
To call single nucleotide polymorphisms (SNPs) and indels, use GATK’s HaplotypeCaller or UnifiedGenotyper:
Alternatively, for older GATK versions:
This generates a VCF (Variant Call Format) file containing all identified variants.
6. Variant Filtering
6.1. Apply Variant Quality Score Recalibration (VQSR)
To improve the accuracy of variant calls, use VQSR to filter out false positives. First, you need a dbSNP resource for known variants:
This applies a recalibration model to filter out low-confidence variants.
7. Variant Annotation
7.1. Annotate Variants Using ANNOVAR or VEP
To understand the biological impact of the detected variants, use annotation tools like ANNOVAR or VEP (Variant Effect Predictor).
For example, using ANNOVAR:
This annotates variants with gene information and predicted deleteriousness (e.g., SIFT, PolyPhen).
8. Functional Prediction
8.1. Predict Variant Impact Using SIFT or PolyPhen
To classify the functional impact of variants (e.g., benign or deleterious), tools like SIFT and PolyPhen can be used. For instance, with SIFT:
This tool predicts whether the mutations are likely to disrupt protein function.
9. Visualization and Interpretation
9.1. Visualize Variants Using IGV
To explore and visualize your aligned data, use the Integrative Genomics Viewer (IGV):
IGV allows you to visually inspect the alignments and variant calls across the genome.
10. Additional Considerations
10.1. Batch Processing of Multiple Samples
For larger datasets (e.g., 10+ samples), it’s essential to handle each sample individually during alignment, realignment, and variant calling. However, after these steps, you can combine them for joint variant calling. For example:
Where gvcf_list.txt
is a list of GVCF files from individual samples.
Conclusion
This pipeline provides a comprehensive, reproducible workflow for processing human whole exome sequencing data, from raw reads to annotated variants. Depending on your specific goals (e.g., variant discovery, functional prediction), you may adjust the tools and parameters. As bioinformatics tools evolve, it’s essential to stay updated with the latest versions and methodologies.