humangenome

Step-by-Step Guide: Best Pipeline for Human Whole Exome Sequencing (WES)

December 28, 2024 Off By admin
Shares

Whole Exome Sequencing (WES) is a powerful tool for understanding genetic variations within coding regions of the genome. This pipeline guide covers the steps from raw sequencing data to variant detection and annotation, with the most commonly used tools in the bioinformatics field today. The pipeline described below includes updates for modern tools and a focus on accuracy and reproducibility.


1. Preprocessing and Initial Setup

1.1. Set Up the Environment

Before you start, ensure your computational environment is set up with the necessary software and dependencies. If you are using a Unix-based system (Linux/Mac), you can install the tools via package managers (e.g., apt, brew, or conda). The key tools include:

Install these via conda, apt, or brew:

bash
conda install bwa samtools picard gatk4 vcf-tools

2. Quality Control and Preprocessing of Raw Data

2.1. Convert Illumina Data to FASTQ Format (if required)

In some cases, you may receive data in a different format. To convert Illumina’s proprietary format to standard FASTQ:

bash
maq sol2sanger input_illumina.solid.fastq output.sanger.fastq

2.2. Quality Control

Run quality control on the raw sequencing data using tools like FastQC. This step ensures that your sequencing reads are of good quality and can help detect any issues with the sequencing run.

bash
fastqc input.fastq

Inspect the output files to identify any problems with the data (e.g., poor quality scores).


3. Alignment

3.1. Align Reads to Reference Genome Using BWA

The next step is to align your reads to the reference human genome. Download the human genome reference (e.g., hg38 or GRCh38), and index it using bwa index:

bash
bwa index hg38.fasta

Now align the sequencing reads to the reference genome:

bash
bwa mem -t 4 hg38.fasta input.fastq > output.sam

Here:

  • -t 4 uses 4 threads for faster alignment.
  • output.sam is the alignment file in SAM format.

3.2. Convert SAM to BAM

Once the reads are aligned, convert the SAM file to BAM format (binary format) for efficiency:

bash
samtools view -Sb output.sam > output.bam

3.3. Sort and Index the BAM File

Sort the BAM file to prepare it for downstream processing:

bash
samtools sort output.bam -o output.sorted.bam

Index the sorted BAM file:

bash
samtools index output.sorted.bam

4. Realignment Around Indels

4.1. Marking Duplicate Reads

Use Picard to mark PCR duplicates (if present), which is crucial for accurate variant calling:

bash
picard MarkDuplicates I=output.sorted.bam O=output.dedup.bam M=output.metrics.txt

4.2. Indel Realignment Using GATK

Realignment around indels improves variant calling accuracy. First, identify regions requiring realignment:

bash
gatk RealignerTargetCreator -R hg38.fasta -I output.dedup.bam -O output.intervals

Then, perform the actual realignment:

bash
gatk IndelRealigner -R hg38.fasta -I output.dedup.bam -targetIntervals output.intervals -O output.realigned.bam

5. Variant Calling

5.1. Call SNPs and Indels Using GATK

To call single nucleotide polymorphisms (SNPs) and indels, use GATK’s HaplotypeCaller or UnifiedGenotyper:

bash
gatk HaplotypeCaller -R hg38.fasta -I output.realigned.bam -O output.raw.vcf

Alternatively, for older GATK versions:

bash
gatk UnifiedGenotyper -R hg38.fasta -I output.realigned.bam -O output.raw.vcf -stand_call_conf 30 -stand_emit_conf 10

This generates a VCF (Variant Call Format) file containing all identified variants.


6. Variant Filtering

6.1. Apply Variant Quality Score Recalibration (VQSR)

To improve the accuracy of variant calls, use VQSR to filter out false positives. First, you need a dbSNP resource for known variants:

bash
gatk ApplyVQSR -R hg38.fasta -V output.raw.vcf --recal-file output.recal --tranches-file output.tranches -O output.filtered.vcf

This applies a recalibration model to filter out low-confidence variants.


7. Variant Annotation

7.1. Annotate Variants Using ANNOVAR or VEP

To understand the biological impact of the detected variants, use annotation tools like ANNOVAR or VEP (Variant Effect Predictor).

For example, using ANNOVAR:

bash
table_annovar.pl output.filtered.vcf humandb/ -build hg38 -out output.annotated -protocol refGene,dbnsfp30a -operation g,f

This annotates variants with gene information and predicted deleteriousness (e.g., SIFT, PolyPhen).


8. Functional Prediction

8.1. Predict Variant Impact Using SIFT or PolyPhen

To classify the functional impact of variants (e.g., benign or deleterious), tools like SIFT and PolyPhen can be used. For instance, with SIFT:

bash
sift input.vcf > output.sift

This tool predicts whether the mutations are likely to disrupt protein function.


9. Visualization and Interpretation

9.1. Visualize Variants Using IGV

To explore and visualize your aligned data, use the Integrative Genomics Viewer (IGV):

bash
igv.sh output.filtered.vcf

IGV allows you to visually inspect the alignments and variant calls across the genome.


10. Additional Considerations

10.1. Batch Processing of Multiple Samples

For larger datasets (e.g., 10+ samples), it’s essential to handle each sample individually during alignment, realignment, and variant calling. However, after these steps, you can combine them for joint variant calling. For example:

bash
gatk GenotypeGVCFs -R hg38.fasta -V gvcf_list.txt -O final_variants.vcf

Where gvcf_list.txt is a list of GVCF files from individual samples.


Conclusion

This pipeline provides a comprehensive, reproducible workflow for processing human whole exome sequencing data, from raw reads to annotated variants. Depending on your specific goals (e.g., variant discovery, functional prediction), you may adjust the tools and parameters. As bioinformatics tools evolve, it’s essential to stay updated with the latest versions and methodologies.

Shares