Step-by-Step Guide: Sanity Checks for NGS Data
January 10, 2025
Next-Generation Sequencing (NGS) data requires rigorous quality control to ensure reliability and accuracy. This guide outlines essential sanity checks for NGS data, covering both raw sequence quality and alignment metrics.
1. Raw Sequence Quality Checks
Perform these checks on raw FASTQ files to assess sequencing quality.
1.1. Use FastQC for Quality Control
FastQC is a widely used tool for assessing raw sequence quality.
Run FastQC:
fastqc input.fastq.gz -o output_dir
Key Metrics to Check:
- Per Base Sequence Quality: Median quality scores should stay high (ideally Q ≥ 30) across most positions; a gradual drop toward the 3' end of reads is common.
- Per Sequence Quality Scores: Identify sequences with consistently low quality.
- Per Base Sequence Content: Check for biases in nucleotide composition (e.g., GC bias).
- Sequence Duplication Levels: High duplication may indicate PCR artifacts.
- Overrepresented Sequences: Identify adapter contamination or other artifacts.
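FastQC also writes a tab-separated summary.txt (status, module name, file name) inside each report, which makes the checks above easy to script. The following is a minimal sketch, assuming the report has been unpacked (for example by running fastqc with --extract); the helper name fastqc_flagged and the example path are our own assumptions.

```shell
# Print every FastQC module whose status is not PASS.
# Assumes the usual summary.txt layout: STATUS<TAB>MODULE<TAB>FILENAME.
# fastqc_flagged is a hypothetical helper name, not part of FastQC.
fastqc_flagged() {
    awk -F'\t' '$1 != "PASS" { printf "%s\t%s\n", $1, $2 }' "$1"
}

# Usage (path is an assumption based on the command above):
# fastqc_flagged output_dir/input_fastqc/summary.txt
```

A WARN or FAIL is a prompt to look at the plot, not an automatic verdict; some library types routinely trigger certain modules.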
1.2. Check Read Length Distribution
Ensure reads are of expected length.
Example Command:
zcat input.fastq.gz | awk 'NR%4==2 {print length($0)}' | sort -n | uniq -c
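When reads vary in length, a compact summary can be easier to compare against expectations than the full table. A small sketch, assuming standard 4-line FASTQ records; read_length_summary is a hypothetical helper name.

```shell
# Summarize read lengths from uncompressed FASTQ on stdin.
# Assumes 4-line FASTQ records (no wrapped sequence lines).
# read_length_summary is a hypothetical helper name.
read_length_summary() {
    awk 'NR % 4 == 2 {
             n++; len = length($0); sum += len
             if (n == 1 || len < min) min = len
             if (n == 1 || len > max) max = len
         }
         END {
             if (n == 0) { print "no reads"; exit 1 }
             printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
         }'
}

# Usage:
# zcat input.fastq.gz | read_length_summary
```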
1.3. Assess GC Content
Compare the GC content of your data to the expected distribution for your organism.
Example Command:
zcat input.fastq.gz | awk 'NR%4==2 {total+=length($0); gc+=gsub(/[GCgc]/, "")} END {print gc/total}'
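A single overall fraction cannot be compared against a distribution, so it can help to bin GC content per read. The sketch below is one way to do that, assuming 4-line FASTQ records and 10% bins; gc_histogram is a hypothetical helper name.

```shell
# Histogram of per-read GC content in 10% bins, from FASTQ on stdin.
# Assumes 4-line records; N bases count toward the read length.
# gc_histogram is a hypothetical helper name.
gc_histogram() {
    awk 'NR % 4 == 2 {
             len = length($0)
             if (len == 0) next
             seq = $0
             gc = gsub(/[GCgc]/, "", seq)   # count on a copy, leave $0 intact
             bin = int(10 * gc / len)
             if (bin == 10) bin = 9         # put 100% GC reads in the top bin
             count[bin]++
         }
         END {
             for (b = 0; b < 10; b++)
                 printf "%d-%d%%\t%d\n", b * 10, b * 10 + 10, count[b] + 0
         }'
}

# Usage:
# zcat input.fastq.gz | gc_histogram
```

A second peak away from the expected GC range is often worth following up with the contamination checks later in this guide.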
2. Alignment Quality Checks
After aligning reads to a reference genome, perform these checks.
2.1. Alignment Statistics
Use tools like samtools to generate alignment statistics.
Run samtools flagstat:
samtools flagstat aligned.bam
Key Metrics:
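- Total Reads: Ensure the total matches the number of reads you sequenced. Note that flagstat counts alignment records, so secondary and supplementary alignments inflate the total.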
- Mapped Reads: High mapping rates (e.g., > 80%) are generally desirable.
- Properly Paired Reads: For paired-end data, check the percentage of properly paired reads.
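The mapping-rate check can be automated by parsing the flagstat text. This is a rough sketch, assuming the default text output of samtools flagstat and using the 80% guideline above as the default threshold; check_mapping_rate is a hypothetical helper name.

```shell
# Compute the mapped percentage from `samtools flagstat` text on stdin and
# return non-zero if it is below a threshold (default 80).
# Only the QC-passed counts (first number on each line) are used.
# check_mapping_rate is a hypothetical helper name.
check_mapping_rate() {
    awk -v min="${1:-80}" '
        $4 == "in" && $5 == "total" { total = $1 }
        $4 == "mapped"              { mapped = $1 }   # skips "primary mapped"
        END {
            if (total == 0) { print "no reads"; exit 2 }
            rate = 100 * mapped / total
            printf "mapped=%.2f%%\n", rate
            if (rate < min) exit 1
        }'
}

# Usage:
# samtools flagstat aligned.bam | check_mapping_rate 80
```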
2.2. Check for Chromosomal Bias
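Check that read counts are roughly proportional to chromosome length; strong deviations deserve a closer look. The command below requires an indexed BAM file and prints, for each reference sequence, its name, its length, and the numbers of mapped and unmapped reads.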
Example Command:
samtools idxstats aligned.bam
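Raw counts are hard to compare between chromosomes of different sizes, so normalizing by length can make outliers easier to spot. A minimal sketch, assuming the four-column tab-separated idxstats output; reads_per_mb is a hypothetical helper name.

```shell
# Convert `samtools idxstats` output (name, length, mapped, unmapped) on
# stdin into mapped reads per megabase for each reference sequence.
# The "*" row (unplaced reads, length 0) is skipped.
# reads_per_mb is a hypothetical helper name.
reads_per_mb() {
    awk -F'\t' '$2 > 0 { printf "%s\t%.1f\n", $1, $3 / ($2 / 1000000) }'
}

# Usage:
# samtools idxstats aligned.bam | reads_per_mb
```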
2.3. Assess Insert Size Distribution
For paired-end data, check the insert size distribution.
Run Picard CollectInsertSizeMetrics:
java -jar picard.jar CollectInsertSizeMetrics \
    I=aligned.bam \
    O=insert_size_metrics.txt \
    H=insert_size_histogram.pdf
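For a quick cross-check without Picard, the template length stored in column 9 (TLEN) of each alignment record can be summarized directly. This is only an approximation of what Picard reports, sketched under the assumption that the input is restricted to properly paired reads; median_insert_size is a hypothetical helper name.

```shell
# Median template length from SAM records on stdin.
# Uses column 9 (TLEN) and keeps only positive values so that each pair
# is counted once. median_insert_size is a hypothetical helper name.
median_insert_size() {
    awk -F'\t' '$1 !~ /^@/ && $9 > 0 { print $9 }' |
        sort -n |
        awk '{ v[NR] = $1 }
             END {
                 if (NR == 0) { print "no pairs"; exit 1 }
                 if (NR % 2) print v[(NR + 1) / 2]
                 else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
             }'
}

# Usage (properly paired reads only, flag 0x2):
# samtools view -f 0x2 aligned.bam | median_insert_size
```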
3. Library Complexity and Duplication
Assess library complexity to identify potential issues with PCR amplification.
3.1. Use Picard MarkDuplicates
Identify and mark duplicate reads. By default, MarkDuplicates flags duplicates in the output BAM rather than removing them.
Run Picard MarkDuplicates:
java -jar picard.jar MarkDuplicates \
    I=aligned.bam \
    O=deduplicated.bam \
    M=mark_duplicates_metrics.txt
Key Metrics:
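- Duplicate Rate: High duplication rates (> 20%, reported as PERCENT_DUPLICATION) may indicate PCR artifacts, although the acceptable level depends on the assay and sequencing depth.
- Library Complexity: Assess the number of unique molecules in the library (reported as ESTIMATED_LIBRARY_SIZE).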
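Picard metrics files are tab-separated text, so individual values can be pulled out for scripted checks. The sketch below assumes the usual layout (comment lines starting with "#", then a header row, then the data row) and reads only the first data row; picard_metric is a hypothetical helper name.

```shell
# Print one named column from the first data row of a Picard metrics file.
# Assumes the usual layout: "#" comment lines, a tab-separated header row,
# then the data row. Returns non-zero if the column is not found.
# picard_metric is a hypothetical helper name.
picard_metric() {
    awk -F'\t' -v name="$2" '
        /^#/ || NF == 0 { next }
        !col {
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            if (!col) exit 1
            next
        }
        { print $col; exit }' "$1"
}

# Usage:
# picard_metric mark_duplicates_metrics.txt PERCENT_DUPLICATION
```

PERCENT_DUPLICATION is reported as a fraction between 0 and 1, so the 20% guideline corresponds to a value of 0.2.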
4. Annotation-Based Checks
Check for biases related to genomic features.
4.1. Gene Coverage
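Check how reads are distributed across annotated genes. For each feature, bedtools reports the number of overlapping reads and the fraction of its bases that are covered.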
Use Bedtools Coverage:
bedtools coverage -a genes.bed -b aligned.bam > gene_coverage.txt
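The coverage table can then be filtered for poorly covered genes. A minimal sketch, assuming the default bedtools coverage output (the last column is the fraction of the feature covered) and assuming that genes.bed carries the gene name in column 4; low_coverage_genes is a hypothetical helper name.

```shell
# List features whose covered fraction is below a cutoff (default 0.5).
# Assumes default `bedtools coverage` output, where the last four columns
# are: overlap count, bases covered, feature length, fraction covered.
# Assumes column 4 of the BED file holds the gene name.
# low_coverage_genes is a hypothetical helper name.
low_coverage_genes() {
    awk -F'\t' -v min="${1:-0.5}" '$NF < min { print $4 "\t" $NF }'
}

# Usage:
# low_coverage_genes 0.5 < gene_coverage.txt
```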
4.2. Check for Strand Bias
For RNA-Seq or ChIP-Seq, check for strand-specific biases.
Example Command:
samtools view -c -f 0x10 -F 0x4 aligned.bam # Reverse strand (mapped reads)
samtools view -c -F 0x14 aligned.bam # Forward strand (mapped reads)
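The two counts can be combined into a single forward-strand fraction in one pass over the file. This sketch assumes SAM text on stdin and counts each alignment record once, without distinguishing first and second mates; forward_fraction is a hypothetical helper name.

```shell
# Fraction of mapped reads on the forward strand, from SAM records on stdin.
# Bit 0x4 (unmapped) and bit 0x10 (reverse strand) are tested with integer
# arithmetic because POSIX awk has no bitwise operators.
# forward_fraction is a hypothetical helper name.
forward_fraction() {
    awk -F'\t' '$1 ~ /^@/ { next }                 # skip header lines
         int($2 / 4) % 2 == 1 { next }             # skip unmapped reads
         { if (int($2 / 16) % 2 == 1) rev++; else fwd++ }
         END {
             if (fwd + rev == 0) { print "no mapped reads"; exit 1 }
             printf "%.3f\n", fwd / (fwd + rev)
         }'
}

# Usage:
# samtools view aligned.bam | forward_fraction
```

For unstranded libraries a value near 0.5 is the usual expectation; stranded protocols need to be interpreted per gene or per mate, which this sketch does not attempt.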
5. Contamination Checks
Identify potential contamination from other species or sequences.
5.1. Check for rRNA Contamination
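For RNA-Seq, quantify rRNA reads. A BAM file is compressed binary, so it must be decoded with samtools before grep can search it. The pattern only matches if the rRNA name appears in the alignment records, for example as a reference sequence name.
Example Command:
samtools view aligned.bam | grep -c "rRNA_gene_name"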
5.2. Use Kraken for Taxonomic Classification
Identify contamination from other species.
Run Kraken:
kraken --db kraken_db input.fastq --output kraken_output.txt
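The per-read output can be tallied to see which taxa dominate. A rough sketch, assuming the standard tab-separated Kraken output in which column 1 is "C" or "U" and column 3 is the taxonomy ID; top_taxa is a hypothetical helper name.

```shell
# Count classified reads per taxonomy ID from Kraken per-read output on
# stdin and print the most frequent IDs (default: top 10) as "count taxid".
# Assumes column 1 is C/U and column 3 is the taxonomy ID.
# top_taxa is a hypothetical helper name.
top_taxa() {
    awk -F'\t' '$1 == "C" { n[$3]++ }
         END { for (t in n) print n[t], t }' |
        sort -rn |
        head -n "${1:-10}"
}

# Usage:
# top_taxa 10 < kraken_output.txt
```

Taxonomy IDs that do not belong to the expected organism, and that account for more than a handful of reads, are candidates for contamination.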
6. Visualize Results
Use tools like MultiQC to aggregate and visualize QC metrics.
Run MultiQC:
multiqc output_dir/ -o multiqc_report
By performing these sanity checks, you can identify potential issues in your NGS data and ensure the reliability of your downstream analyses.