GATK Tutorial: Comprehensive Guide to Variant Discovery and Genotyping
November 20, 2023
Table of Contents
Introduction to GATK
- Overview: Understand GATK as a versatile toolkit for variant discovery and genotyping from high-throughput sequencing data, developed by the Broad Institute.
- Workflow Overview: Explore the typical GATK workflow involving read mapping, duplicate marking, base quality recalibration, variant calling, and variant filtering.
Key GATK Tools
- Picard: Processing Aligned Sequences. Picard tools sort and index aligned reads and mark duplicates; in GATK4 they are bundled with the toolkit (for example, MarkDuplicates).
- BaseRecalibrator: Recalibrating Base Quality Scores. BaseRecalibrator builds a recalibration model from known variant sites; ApplyBQSR then writes the adjusted base qualities to a new BAM.
- HaplotypeCaller: SNP and Indel Calling. HaplotypeCaller calls SNPs and indels together by locally reassembling reads into haplotypes, which improves accuracy around indels.
- VariantFiltration: Filtering Variant Calls. VariantFiltration flags calls that fail user-supplied annotation thresholds (hard filters) so that they can be reviewed or excluded.
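The order in which these tools are chained can be sketched in a few lines of Python. This is a minimal sketch, not an official pipeline: the function name and all file names are hypothetical, and the commands are only built as argument lists rather than executed. The GATK4 tool names and arguments used (MarkDuplicates, BaseRecalibrator, ApplyBQSR, HaplotypeCaller with -ERC GVCF) are standard, but check them against the documentation for your GATK version.

```python
from typing import List


def preprocessing_commands(ref: str, bam: str, known_sites: str, sample: str) -> List[List[str]]:
    """Build GATK4 command lines for one sample, in the order they would run.

    All output file names are derived from `sample` and are illustrative only.
    """
    dedup = f"{sample}.dedup.bam"
    table = f"{sample}.recal.table"
    recal = f"{sample}.recal.bam"
    gvcf = f"{sample}.g.vcf.gz"
    return [
        # 1. Mark duplicate reads (Picard tool bundled with GATK4).
        ["gatk", "MarkDuplicates", "-I", bam, "-O", dedup, "-M", f"{sample}.dup_metrics.txt"],
        # 2. Model systematic base quality errors using known variant sites.
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", dedup, "--known-sites", known_sites, "-O", table],
        # 3. Apply the recalibration model to produce an analysis-ready BAM.
        ["gatk", "ApplyBQSR", "-R", ref, "-I", dedup, "--bqsr-recal-file", table, "-O", recal],
        # 4. Call variants per sample in GVCF mode for later joint genotyping.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", recal, "-O", gvcf, "-ERC", "GVCF"],
    ]


if __name__ == "__main__":
    for cmd in preprocessing_commands("ref.fasta", "sample1.bam", "dbsnp.vcf.gz", "sample1"):
        print(" ".join(cmd))
```

Each list can be handed to subprocess.run once the inputs exist; keeping the commands as data first makes them easy to log and review.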
Best Practices with GATK
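- GATK Recommendations: Implement best practices, including marking duplicates, recalibrating base qualities against known sites, and using hard filters to reduce false positives. Realignment around indels was a separate step in GATK3; GATK4 dropped it because HaplotypeCaller and Mutect2 reassemble reads locally.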
- Data Compatibility: Ensure adherence to version compatibility guidelines to prevent discrepancies across tools and versions.
- Data Types: GATK’s flexibility to process whole genomes, exomes, or targeted panels in common formats (BAM/CRAM, VCF).
Advanced Variant Analysis with GATK
- Variant Analysis Tasks: Explore GATK tools for diverse variant analysis tasks, including SNP/indel calling, structural variant calling, and copy number analysis.
- HaplotypeCaller for Accuracy: Emphasize the higher accuracy of HaplotypeCaller over the older UnifiedGenotyper, particularly for indels. UnifiedGenotyper belongs to GATK3 and is not part of GATK4.
Filtering Strategies and Annotation
- Filtering Recommendations: Discuss different filtering strategies for SNP and indel calls.
- Annotation Tools: Explore VariantAnnotator for adding annotations to calls, and VariantRecalibrator (followed by ApplyVQSR) for fitting and applying an adaptive error model trained on known variant sites.
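To make the idea of hard filtering concrete, the sketch below applies annotation thresholds to a single variant record represented as a plain dictionary. The thresholds mirror the generic starting points commonly cited in GATK's hard-filtering guidance (for example, QD < 2.0 and FS > 60.0 for SNPs); they are starting points rather than tuned values, and the rule names, data layout, and function are assumptions made for this illustration. In a real pipeline the same expressions would be passed to VariantFiltration.

```python
import operator
from typing import Callable, Dict, List, Tuple

# (filter name, annotation key, comparison that means "fail", threshold)
Rule = Tuple[str, str, Callable[[float, float], bool], float]

SNP_RULES: List[Rule] = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS60", "FS", operator.gt, 60.0),
    ("SOR3", "SOR", operator.gt, 3.0),
    ("MQ40", "MQ", operator.lt, 40.0),
    ("MQRankSum-12.5", "MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum-8", "ReadPosRankSum", operator.lt, -8.0),
]

INDEL_RULES: List[Rule] = [
    ("QD2", "QD", operator.lt, 2.0),
    ("FS200", "FS", operator.gt, 200.0),
    ("ReadPosRankSum-20", "ReadPosRankSum", operator.lt, -20.0),
]


def failed_filters(annotations: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return the names of all rules that the variant fails.

    Annotations missing from the record are skipped rather than failed;
    that is a choice made for this sketch.
    """
    failed = []
    for name, key, is_failing, threshold in rules:
        value = annotations.get(key)
        if value is not None and is_failing(value, threshold):
            failed.append(name)
    return failed


if __name__ == "__main__":
    snp = {"QD": 1.5, "FS": 10.0, "MQ": 60.0}
    print(failed_filters(snp, SNP_RULES))
```

SNPs and indels get separate rule sets because the same annotation behaves differently for the two variant types, which is why the two are usually filtered separately.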
Special Considerations
- Non-Diploid Organisms: Address considerations for non-diploid organisms, chiefly setting HaplotypeCaller’s --sample-ploidy argument (the default is 2), which can be combined with GVCF mode for joint genotyping.
- RNAseq Data: Provide additional tools and steps, including SplitNCigarReads and BaseRecalibrator, as part of best practices for RNAseq data.
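For non-diploid samples, it helps to see which alternate-allele fractions each ploidy allows. The sketch below lists them as exact fractions and builds the matching HaplotypeCaller argument. The function names are hypothetical; --sample-ploidy is the HaplotypeCaller argument, and the sketch assumes an idealized sample with no sequencing noise or reference bias.

```python
from fractions import Fraction
from typing import List


def expected_alt_fractions(ploidy: int) -> List[Fraction]:
    """Expected alternate-allele fractions for 0..ploidy copies of the alt allele."""
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return [Fraction(copies, ploidy) for copies in range(ploidy + 1)]


def ploidy_args(ploidy: int) -> List[str]:
    """Arguments to append to a HaplotypeCaller command for this ploidy."""
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return ["--sample-ploidy", str(ploidy)]


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, [str(f) for f in expected_alt_fractions(n)], ploidy_args(n))
```

A tetraploid sample, for instance, can carry a variant at one quarter of reads, which a diploid model would tend to treat as noise or force into a heterozygous call.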
Ongoing Developments and Machine Learning
- Machine Learning Integration: Highlight ongoing developments, including the use of machine learning models to filter variants and improve accuracy.
- Community Engagement: Point to the active GATK forum as a resource for troubleshooting and advice.
Running GATK: Optimization and Best Practices
- Execution Modes: Understand different execution modes (single-threaded, multi-threaded, Spark-based) and their importance for scalability.
- Optimization Strategies: Recommendations for optimizing runtime parameters, regularly rebuilding indices, and monitoring resource usage.
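- gatk Wrapper Script: Introduction to the gatk launcher script (named gatk-launch in the GATK4 beta releases), which runs individual tools locally or on Spark and passes JVM settings through --java-options. It is not a workflow engine; multi-step pipelines are typically orchestrated with WDL/Cromwell or Nextflow.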
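Runtime settings such as the Java heap size are passed to the gatk wrapper through --java-options. The sketch below assembles such a command; the helper names, default heap size, and paths are assumptions for illustration, while --java-options, -Xmx, and -Djava.io.tmpdir are standard wrapper and JVM settings.

```python
import shlex
from typing import List, Optional


def gatk_command(tool: str, tool_args: List[str], heap_gb: int = 4,
                 tmp_dir: Optional[str] = None) -> List[str]:
    """Build a gatk invocation with an explicit JVM heap size."""
    if heap_gb < 1:
        raise ValueError("heap_gb must be at least 1")
    java_opts = [f"-Xmx{heap_gb}g"]
    if tmp_dir:
        # Point the JVM at a scratch directory with enough free space.
        java_opts.append(f"-Djava.io.tmpdir={tmp_dir}")
    # The JVM options travel as ONE argument to the wrapper script.
    return ["gatk", "--java-options", " ".join(java_opts), tool, *tool_args]


def render(cmd: List[str]) -> str:
    """Shell-quoted form of the command, suitable for logging."""
    return shlex.join(cmd)


if __name__ == "__main__":
    cmd = gatk_command("HaplotypeCaller", ["-R", "ref.fasta", "-I", "in.bam", "-O", "out.vcf.gz"],
                       heap_gb=8, tmp_dir="/scratch")
    print(render(cmd))
```

Set the heap below the memory actually available to the job, since the JVM and native libraries need headroom beyond -Xmx.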
Large Cohort Analysis and Distributed Processing
- Cohort Analysis: Techniques for running multi-sample cohort analysis, including joint genotyping and considerations for large cohorts.
- Apache Spark Integration: Highlight the use of Apache Spark clusters for accelerated GATK performance through distributed processing.
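Outside Spark, the usual way to parallelize a caller is to scatter it over intervals and gather the results. The sketch below splits one region into contiguous, non-overlapping shards whose strings can be passed to -L. It is a simplified illustration with a hypothetical function name: it cuts at arbitrary coordinates, whereas production pipelines usually rely on GATK's SplitIntervals tool and interval padding to choose safer boundaries.

```python
from typing import List


def scatter_interval(contig: str, start: int, end: int, n_shards: int) -> List[str]:
    """Split the 1-based inclusive region contig:start-end into up to n_shards pieces.

    Shards are contiguous and non-overlapping, so no position is skipped
    or processed twice.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    if n_shards < 1:
        raise ValueError("n_shards must be at least 1")
    length = end - start + 1
    n = min(n_shards, length)          # never create empty shards
    base, extra = divmod(length, n)    # the first `extra` shards get one more base
    shards = []
    pos = start
    for i in range(n):
        size = base + (1 if i < extra else 0)
        shards.append(f"{contig}:{pos}-{pos + size - 1}")
        pos += size
    return shards


if __name__ == "__main__":
    for shard in scatter_interval("chr20", 1, 1_000_000, 4):
        print(shard)
```

Each shard string would be supplied to a separate tool invocation via -L, and the per-shard outputs merged afterwards.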
Integration with External Tools and Platforms
- Containerization and Workflow Platforms: Discuss the benefits of using established container or workflow platforms like Docker, WDL, Cromwell, Nextflow for streamlined execution at scale.
- Benchmarking and Validation: Emphasize the importance of benchmarking GATK pipeline performance and validating results against other pipelines.
Somatic Variant Calling and Tumor Analysis
- Tumor Analysis Best Practices: Guidelines for calling somatic SNVs/Indels in paired tumor-normal samples.
- Tools for Tumor Analysis: Introduce Mutect2 (part of GATK) for somatic SNVs and indels, alongside third-party tools such as Manta for structural variants and CoNVaDING for copy number analysis.
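A paired tumor-normal call with Mutect2 can be sketched as two commands: calling, then filtering. This is a minimal sketch with hypothetical file and sample names. It omits resources that the somatic Best Practices normally add (a germline resource, a panel of normals, contamination estimates), and the -normal argument shown follows GATK 4.1 and later.

```python
from typing import List


def mutect2_commands(ref: str, tumor_bam: str, normal_bam: str,
                     normal_sample: str, out_prefix: str) -> List[List[str]]:
    """Build Mutect2 and FilterMutectCalls command lines for one tumor-normal pair."""
    unfiltered = f"{out_prefix}.unfiltered.vcf.gz"
    filtered = f"{out_prefix}.filtered.vcf.gz"
    return [
        # Both BAMs are inputs; -normal names the sample (from the read group SM tag)
        # that should be treated as the matched normal.
        ["gatk", "Mutect2", "-R", ref, "-I", tumor_bam, "-I", normal_bam,
         "-normal", normal_sample, "-O", unfiltered],
        # Mutect2 output is unfiltered; FilterMutectCalls assigns the FILTER field.
        ["gatk", "FilterMutectCalls", "-R", ref, "-V", unfiltered, "-O", filtered],
    ]


if __name__ == "__main__":
    for cmd in mutect2_commands("ref.fasta", "tumor.bam", "normal.bam", "NORMAL_01", "pair1"):
        print(" ".join(cmd))
```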
Maintenance and Training
- Version Updates: Stress the significance of regularly updating GATK to benefit from bug fixes, features, and accuracy improvements.
- Training Resources: Point to the training materials and workshops offered by the GATK team at the Broad Institute’s Data Sciences Platform (DSP).
Efficient Workflow Strategies
- Intermediate File Management: Emphasize the importance of storing intermediate files efficiently, considering options like CRAM for space-saving.
- Batch Processing and Sample Metadata: Discuss the benefits of batching samples into cohorts for optimized runtimes and the importance of sample metadata for workflow integrity.
- Quality Control and Contamination Checks: Address tools like VerifyBamID for tracing sample swaps and contamination issues.
General Best Practices and Tips
- Data Formatting and Compatibility: Stress the need for proper formatting, matching reference genome builds, and avoiding mixed builds.
- Resource Allocation: Tips on optimizing resource allocation, adjusting Java heap size, and ensuring sufficient sequencing coverage.
- Monitoring and Troubleshooting: Recommendations for proactive monitoring of resource usage, identifying bottlenecks, and investigating unexpected issues.
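Mixed reference builds often show up as mismatched contig names or lengths (for example, "chr1" in one file and "1" in another). The sketch below compares the @SQ records of a SAM/BAM header, as printed by a command such as samtools view -H, against a second contig table. The function names and message wording are assumptions made for this illustration.

```python
from typing import Dict, List


def parse_sq_lines(header_text: str) -> Dict[str, int]:
    """Extract contig name -> length from the @SQ lines of a SAM header."""
    contigs: Dict[str, int] = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def contig_mismatches(first: Dict[str, int], second: Dict[str, int]) -> List[str]:
    """Describe every contig that is missing from one side or differs in length."""
    problems = []
    for name in sorted(set(first) | set(second)):
        if name not in first:
            problems.append(f"{name}: only in second")
        elif name not in second:
            problems.append(f"{name}: only in first")
        elif first[name] != second[name]:
            problems.append(f"{name}: length {first[name]} vs {second[name]}")
    return problems


if __name__ == "__main__":
    bam_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"
    reference = {"1": 248956422}
    print(contig_mismatches(parse_sq_lines(bam_header), reference))
```

An empty result is only a basic sanity check; identical names and lengths do not guarantee identical sequence.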
In conclusion, this tutorial provides a comprehensive guide to using GATK for variant discovery and genotyping. By following the outlined best practices, users can harness the power of GATK to analyze high-throughput sequencing data with accuracy and efficiency. Here are some key takeaways:
- Adherence to Best Practices: Strictly follow GATK’s best practices to ensure accurate variant calling, from data preprocessing to variant filtering.
- Advanced Variant Analysis: Explore GATK’s diverse tools for advanced variant analysis, such as SNP/indel calling, structural variant calling, and copy number analysis.
- Optimization Strategies: Optimize GATK performance by selecting the right execution mode, tuning parameters, and monitoring resource usage.
- Integration with External Tools: Seamlessly integrate GATK into established container or workflow platforms for scalability and reproducibility.
- Somatic Variant Calling and Tumor Analysis: Follow best practices for somatic variant calling in tumor-normal pairs, using Mutect2 together with third-party tools such as Manta and CoNVaDING.
- Efficient Workflow Strategies: Efficiently manage intermediate files, batch samples intelligently, and perform quality control checks to ensure accurate and timely results.
- Continuous Learning: Stay updated with the latest developments, version updates, and engage with the GATK community for ongoing support and optimization.
- Data Formatting and Quality Control: Ensure proper formatting of input files, pay attention to reference genome builds, and perform quality control checks to avoid data inconsistencies and errors.
- Multi-threading and Parallelization: Most GATK4 tools are largely single-threaded; HaplotypeCaller and Mutect2 expose --native-pair-hmm-threads for the PairHMM step, and most of the speed-up comes from scattering runs across intervals with -L and merging the results.
- Variant Filtering and Post-Processing: Apply recommended hard filters to variant calls and consider additional post-processing steps using tools like VCFTools for refining variant datasets.
- Customization and Validation: When customizing GATK tools or parameters, validate changes thoroughly to confirm improvements in accuracy or performance.
- Resource Monitoring and Adjustments: Monitor resource usage, identify bottlenecks, and make necessary adjustments, such as RAM allocation, to optimize pipeline performance.
- Organization and Metadata: Keep your workflow organized with proper sample and file metadata to prevent sample swaps and maintain overall workflow integrity.
- Compatibility and Versioning: Pay close attention to compatibility across file formats, reference genomes, sample metadata, and tool versions to avoid mismatches and ensure smooth analysis.
- Cloud Resources and Dockerization: Explore cloud resources for analyzing large cohorts and consider using pre-built Docker images or containers for consistent and portable GATK environments.
- Checksums and Data Integrity: Implement checksums throughout the pipeline to verify data integrity and detect errors early in the analysis process.
- Long-Term Monitoring: Continuously monitor pipeline runtimes over time, investigate unexpected slowdowns or crashes proactively, and implement optimizations as needed.
- Collaboration and Community Engagement: Engage with the GATK community through forums, discussions, and training materials to enhance collaboration and stay informed about new features and best practices.
- Ploidy and Variant Interpretation: Understand how ploidy affects expected allele frequencies and coverage, as this knowledge is crucial for proper interpretation of variants, especially in non-diploid organisms or complex genomic regions.
- Variant Inspection Before Exclusion: Always inspect filtered variants before excluding them from the analysis. Filter aggressively but prudently to avoid discarding true positives inadvertently.
- VCFtools for Post-Processing: Consider using VCFtools alongside GATK for additional post-processing such as site- and genotype-level filtering, summary statistics, and format conversion. Phasing and genotype refinement need dedicated tools.
- Storage-Intensive Steps Optimization: For steps that are storage-intensive, such as BAM sorting, conversion to CRAM, and indexing, offload these tasks to high-performance file systems to improve efficiency.
- Build in Checksums: Introduce checksums at various stages of the pipeline to check data integrity systematically. This approach helps catch errors early and ensures the reliability of your results.
- Cloud Resource Leverage: When dealing with large cohorts spanning hundreds or thousands of samples, leverage cloud resources to scale your analysis efficiently. Cloud platforms like Terra provide a robust environment for running and sharing GATK workflows at scale.
- Cross-Validation with Other Pipelines: Evaluate the concordance of GATK results with those from other pipelines. Any discrepancies need to be thoroughly investigated to understand the underlying reasons and ensure the reliability of your analysis.
- Training Materials and Courses: Take advantage of the training materials and workshops provided by the GATK team at the Broad Institute’s Data Sciences Platform (DSP). These resources are valuable for learning best practices and staying updated on new features introduced in GATK.
- Somatic Variant Calling Best Practices: Follow GATK best practices for calling somatic SNVs/Indels in paired tumor-normal samples. Use Mutect2 for SNVs and indels, and third-party tools such as Manta for structural variants and CoNVaDING for copy number alterations.
- Known Variants Considerations: Incorporating known variants, such as those from dbSNP, into your analysis can aid in filtering. However, exercise caution as databases may contain errors, and it’s essential to use them judiciously.
- OxoG Filters for Oxidative Artifacts: Filter oxidative (OxoG) artifacts, which appear as spurious G>T/C>A substitutions, to reduce false positives in somatic calling. In recent GATK4 releases this is handled by Mutect2’s read orientation model (LearnReadOrientationModel feeding FilterMutectCalls).
- Allele-Specific Expression Tools: For RNA-seq data, tools like ASEReadCounter can be employed to refine the understanding of expression Quantitative Trait Loci (eQTL) variants, contributing to a more comprehensive analysis.
- Integration with BQSR Training Sets: Integrate GATK with Base Quality Score Recalibration (BQSR) training sets specific to the sequencing chemistries and protocols used in your experiment, enhancing accuracy in base quality scores.
- Input File Formatting and Checks: Follow guidelines on properly formatting input files, headers, encodings, etc. Improperly formatted data can lead to issues in downstream tools and potentially break the analysis pipeline.
- Adjust Default Parameters in Integrated Pipelines: When running integrated pipeline scripts, carefully check default parameters to see if they need adjustments based on the specifics of your data or analysis requirements.
- Consistent Reference Genome Builds: Ensure that reference genome builds and annotations match across all stages of the analysis. Mixing different builds can lead to compatibility problems and inaccurate results.
- Caution with gVCF Cohorts: While gVCF cohorts hold information for all sites, be cautious as they can inflate storage requirements and potentially slow downstream analyses. Use them judiciously based on your specific needs.
- Careful Inspection of BAM Header Tags: Inspect BAM header tags carefully and address any issues before proceeding with the analysis. Problems at this stage can propagate downstream, affecting the accuracy of results.
- Enable Read Filters as Needed: Enable read filters with --read-filter as needed, such as NotSecondaryAlignmentReadFilter (the GATK3 equivalent was NotPrimaryAlignmentFilter). Including secondary alignments in the analysis can skew results, so filter reads appropriately based on your analysis goals.
- Considerations for RNAseq Input Data: Look for strandedness and read group tags in RNAseq input data. Depending on the experimental setup, you may need to make custom changes to the pipeline to accommodate specific requirements.
- Increase Runtime Resources for Tools: Increase runtime resources, such as Java heap size, if encountering out-of-memory errors while running tools. Adequate resources are essential for smooth tool execution, especially on large datasets.
- Ensure Sufficient Sequencing Coverage: Before initiating variant calling analysis, ensure that the sequencing coverage is sufficient. Inadequate coverage can reduce the power of the analysis and compromise the reliability of results.
- Masking Low Complexity Regions: Masking low complexity regions in the genome can help avoid false positives in alignments around those regions. This step is particularly relevant when dealing with challenging genomic regions.
- Pay Attention to Log Warnings: Always pay close attention to log warnings, as they can indicate suboptimal parameters or issues with data quality. Addressing these warnings is crucial for maintaining the integrity of the analysis.
- Check GATK Documentation for Latest Recommendations: GATK evolves across versions, and it’s essential to check the latest documentation for recommendations and best practices. Keeping abreast of updates ensures that your analysis benefits from the latest improvements.
- Multithreading for Parallel Execution: For HaplotypeCaller and Mutect2, raise --native-pair-hmm-threads to parallelize the PairHMM step, and scatter runs across intervals to parallelize the rest. The -nt/-nct options from GATK3 are not available in GATK4.
- Limit Variant Calling Scope: For exome or targeted-panel data, restrict variant calling to the captured regions by passing an interval list with -L. This reduces computational demands and avoids calls in regions without meaningful coverage.
- Follow Pre-Processing Steps for Targeted Panels: Even for targeted panels, follow the standard pre-processing steps (duplicate marking where appropriate, and BQSR). Indel realignment applies only to GATK3 pipelines, since GATK4 callers reassemble reads locally. These steps contribute to improved accuracy in variant calling.
- Base Quality Score Recalibration (BQSR): Use Base Quality Score Recalibration (BQSR) to correct for systematic errors and biases in base quality scores. This step is critical for enhancing the accuracy of variant calls.
- Apply Hard Filters Post-Variant Calling: Apply hard filters recommended by GATK post-variant calling to reduce false positives. Following GATK’s filtering recommendations is crucial for refining variant call sets.
- Read Preprocessing for RNAseq: For RNAseq data, perform read preprocessing, such as splitting spliced reads, before variant calling. This ensures that the input data is appropriately prepared for variant analysis.
- Choose Appropriate Genome Resources: When running variant calling, choose appropriate genome resources such as dbSNP and training sets tailored to your pipeline and data. This contributes to the accuracy of variant calls.
- Validate Customizations to GATK Tools: If you make any customizations to GATK tools or parameters, validate them to ensure they genuinely improve accuracy or performance. Thorough testing is crucial before incorporating customizations into production pipelines.
- Monitor Pipeline Resource Usage and Bottlenecks: Regularly monitor pipeline resource usage and identify any bottlenecks. Making adjustments, such as RAM allocation, can significantly impact performance.
- Stay Organized with Metadata: Keep your analysis organized with sample and file metadata to avoid swaps and maintain workflow integrity. Well-organized metadata facilitates efficient tracking of samples and results.
- Strict Concordance for Tumor-Normal Somatic Calling: Use strictly concordant sample pairs for tumor-normal somatic variant calling. Contamination can confound results, so ensuring high-quality matched samples is crucial.
- Version Compatibility Across Tools: Pay close attention to compatibility across file formats, reference genomes, sample metadata, and tool versions. Mismatches in these aspects can lead to errors and inconsistencies in the analysis.
- Optimizing Spark Pipelines: When running distributed Spark pipelines, tune cluster resource allocation and executor settings to optimize performance. Efficient configuration is essential for achieving scalability.
- Prefer GATK4 Tools for Improved Performance: Use the latest GATK4 tools where possible, as they often offer better threading, lower memory usage, and improved accuracy compared to earlier versions.
- Integration with NGS Aligners and Variant Databases: Integrate GATK with NGS aligners like BWA and variant databases like ClinVar to enhance the analysis context. This integration ensures comprehensive analysis and interpretation of genomic variants.
- Phasing Variants for Improved Imputation: Phasing can refine genotype calls and improve downstream imputation accuracy. ReadBackedPhasing was a GATK3 tool; in GATK4, HaplotypeCaller emits physical phasing information (the PID and PGT tags) for nearby variants.
- Efficient Intermediate File Storage: Storing intermediate files efficiently is crucial when running large workflows. Options like CRAM can save storage space, especially when dealing with extensive datasets.
- Batching Samples for Joint Calling: Batching samples into cohorts improves runtimes for joint calling and genotyping. However, batches should have similar coverage to ensure accurate and efficient analysis.
- Detecting Sample Swaps and Contamination: Tracing sample swaps and contamination is critical for ensuring the integrity of the analysis. Tools like VerifyBamID can help detect issues before variant calling, especially in tumor-normal analyses.
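- Best Practices for Tumor Analysis: For tumor analysis, estimate cross-sample contamination before filtering; GATK4 uses GetPileupSummaries and CalculateContamination for this, replacing the GATK3 tool ContEst. Follow GATK Best Practices for calling somatic SNVs/Indels in paired tumor-normal samples with Mutect2, adding third-party tools such as Manta and CoNVaDING where structural or copy number calls are needed.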
- Incorporating Known Variants with Caution: Incorporate known variants like dbSNP for filtering, but exercise caution as databases may contain errors. Regularly updating reference databases is advisable.
- Setting Appropriate Intervals During Scattering: When scattering tools like HaplotypeCaller, set intervals/padding appropriately to avoid gaps or mismatches. Properly defined intervals contribute to accurate variant calling.
- Multithreading in Samtools: Enable multithreading for computationally intensive samtools operations, such as sort and index, with -@ (--threads). Using the appropriate thread count can significantly speed up processing, especially on large datasets.
- Watch for Changes Between Major Versions: Watch out for changes between major versions of GATK, such as default output formats or changes in the behavior of certain commands. Checking release notes is crucial when upgrading.
- Install Compatible Versions of Dependencies: GATK4 needs a supported Java runtime (check the release notes for the required version), while companion tools such as samtools depend on htslib and zlib. Keep these dependencies at compatible versions to avoid build and runtime issues.
- Use Dependency Managers for Installation: If encountering issues during installation, consider using a dependency manager like conda. Dependency managers can help resolve and install the required libraries more efficiently.
- Memory Considerations for Large BAMs: Pay close attention to memory requirements, especially for sorting and indexing large BAM files. For samtools sort, set the maximum memory per thread with -m (for example, -m 2G), keeping in mind that total usage is roughly the thread count × the per-thread limit.
- Understand Read Depth Calculations: By default, samtools depth skips reads flagged as unmapped, secondary, QC-fail, or duplicate (flag 1024), so the reported depth can be lower than the raw number of overlapping reads. Check which flags are excluded before comparing depth values between tools.
- Sorting Before Deduplication: Samtools rmdup only works on coordinate-sorted input, because it looks for read pairs mapping to the same position. It has been superseded by samtools markdup, which also expects coordinate-sorted input that has first been processed with samtools fixmate.
- Watch for Output Redirection: When piping between samtools commands, watch out for output that gets directed to stdout vs. stderr. Useful redirection, like “2> log,” can capture info/warning/error messages for better debugging.
- Per-Position Stats vs. Region-Based Stats: samtools depth reports per-position depth, while samtools coverage and samtools bedcov give per-contig or per-region summaries. For more detailed region-based stats (e.g., exon coverage distributions), use additional tools like bedtools, mosdepth, or Qualimap.
- Command Line Argument Sensitivity: Samtools can be sensitive to small changes in flags or the order of operations. When encountering errors or unexpected output, double-check your command line arguments to ensure accuracy.
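- Check for Samtools Output Warnings: Pay attention to samtools warnings, which are printed with a prefix like “[W::bam_hdr_read]” naming the htslib function that raised them. A common instance reports a missing EOF marker, which suggests a truncated BAM file.
- Logging for Pipeline Debugging: samtools writes warnings and errors to stderr, so capture that stream (for example, with “2> log”) and raise the detail level with --verbosity if needed. Note that -l is not a logging switch (in samtools sort it sets the compression level). Logging is essential for pipeline development, troubleshooting, and debugging unexpected results.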
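The checksum recommendation from the takeaways above can be sketched with the Python standard library. This is a minimal sketch: the function names are hypothetical and SHA-256 is one reasonable choice of digest, not something prescribed by GATK. The file is read in chunks so that large BAM or CRAM files do not need to fit in memory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare a file against a previously recorded digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    import sys
    for file_path in sys.argv[1:]:
        print(sha256_of_file(file_path), file_path)
```

Recording a digest when a file is written and calling verify before the next step reads it catches truncated copies and silent corruption between pipeline stages.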