Step-by-Step Guide: Best Pipeline for Human Whole Exome Sequencing (WES)

December 28, 2024 Off By admin

Whole Exome Sequencing (WES) is a powerful tool for understanding genetic variations within coding regions of the genome. This pipeline guide covers the steps from raw sequencing data to variant detection and annotation, with the most commonly used tools in the bioinformatics field today. The pipeline described below includes updates for modern tools and a focus on accuracy and reproducibility.

Table of Contents

1. Preprocessing and Initial Setup

1.1. Set Up the Environment

Before you start, ensure your computational environment is set up with the necessary software and dependencies. If you are using a Unix-based system (Linux/Mac), you can install the tools via package managers (e.g., apt, brew, or conda). The key tools include:

BWA (Burrows-Wheeler Aligner) for read alignment
SAMtools for SAM/BAM file manipulation
Picard for file processing
GATK (Genome Analysis Toolkit) for variant calling and realignment
SIFT or PolyPhen for variant effect prediction
VCFtools for handling VCF files

Install these via conda, apt, or brew:

2. Quality Control and Preprocessing of Raw Data

2.1. Convert Illumina Data to FASTQ Format (if required)

In some cases, you may receive data in a different format. To convert Illumina’s proprietary format to standard FASTQ:

2.2. Quality Control

Run quality control on the raw sequencing data using tools like FastQC. This step ensures that your sequencing reads are of good quality and can help detect any issues with the sequencing run.

Inspect the output files to identify any problems with the data (e.g., poor quality scores).

3. Alignment

3.1. Align Reads to Reference Genome Using BWA

The next step is to align your reads to the reference human genome. Download the human genome reference (e.g., hg38 or GRCh38), and index it using bwa index:

Now align the sequencing reads to the reference genome:

Here:

-t 4 uses 4 threads for faster alignment.
output.sam is the alignment file in SAM format.

3.2. Convert SAM to BAM

Once the reads are aligned, convert the SAM file to BAM format (binary format) for efficiency:

3.3. Sort and Index the BAM File

Sort the BAM file to prepare it for downstream processing:

Index the sorted BAM file:

4. Realignment Around Indels

4.1. Marking Duplicate Reads

Use Picard to mark PCR duplicates (if present), which is crucial for accurate variant calling:

4.2. Indel Realignment Using GATK

Realignment around indels improves variant calling accuracy. First, identify regions requiring realignment:

Then, perform the actual realignment:

5. Variant Calling

5.1. Call SNPs and Indels Using GATK

To call single nucleotide polymorphisms (SNPs) and indels, use GATK’s HaplotypeCaller or UnifiedGenotyper:

Alternatively, for older GATK versions:

This generates a VCF (Variant Call Format) file containing all identified variants.

6. Variant Filtering

6.1. Apply Variant Quality Score Recalibration (VQSR)

To improve the accuracy of variant calls, use VQSR to filter out false positives. First, you need a dbSNP resource for known variants:

This applies a recalibration model to filter out low-confidence variants.

7. Variant Annotation

7.1. Annotate Variants Using ANNOVAR or VEP

To understand the biological impact of the detected variants, use annotation tools like ANNOVAR or VEP (Variant Effect Predictor).

For example, using ANNOVAR:

This annotates variants with gene information and predicted deleteriousness (e.g., SIFT, PolyPhen).

8. Functional Prediction

8.1. Predict Variant Impact Using SIFT or PolyPhen

To classify the functional impact of variants (e.g., benign or deleterious), tools like SIFT and PolyPhen can be used. For instance, with SIFT:

This tool predicts whether the mutations are likely to disrupt protein function.

9. Visualization and Interpretation

9.1. Visualize Variants Using IGV

To explore and visualize your aligned data, use the Integrative Genomics Viewer (IGV):

IGV allows you to visually inspect the alignments and variant calls across the genome.

10. Additional Considerations

10.1. Batch Processing of Multiple Samples

For larger datasets (e.g., 10+ samples), it’s essential to handle each sample individually during alignment, realignment, and variant calling. However, after these steps, you can combine them for joint variant calling. For example:

Where gvcf_list.txt is a list of GVCF files from individual samples.

Conclusion

This pipeline provides a comprehensive, reproducible workflow for processing human whole exome sequencing data, from raw reads to annotated variants. Depending on your specific goals (e.g., variant discovery, functional prediction), you may adjust the tools and parameters. As bioinformatics tools evolve, it’s essential to stay updated with the latest versions and methodologies.