
Step-by-Step Guide: Sanity Checks for NGS Data

January 10, 2025, by admin

Next-Generation Sequencing (NGS) data requires rigorous quality control to ensure reliability and accuracy. This guide outlines essential sanity checks for NGS data, covering raw sequence quality, alignment metrics, library complexity, annotation-based checks, and contamination.


1. Raw Sequence Quality Checks

Perform these checks on raw FASTQ files to assess sequencing quality.

1.1. Use FastQC for Quality Control

FastQC is a widely used tool for assessing raw sequence quality.

Run FastQC:

bash
mkdir -p output_dir  # FastQC does not create the output directory itself
fastqc input.fastq.gz -o output_dir

Key Metrics to Check:

  • Per Base Sequence Quality: Quality scores should stay high (ideally Q ≥ 30) across most positions; some decline toward the 3' end of reads is common.
  • Per Sequence Quality Scores: Identify sequences with consistently low quality.
  • Per Base Sequence Content: Check for position-specific biases in nucleotide composition (e.g., skewed base frequencies at the start of reads).
  • Sequence Duplication Levels: High duplication may indicate PCR artifacts.
  • Overrepresented Sequences: Identify adapter contamination or other artifacts.
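
To triage many samples quickly, the PASS/WARN/FAIL status of each module can be read from the summary.txt file that FastQC writes inside its output archive. The following is a minimal sketch, assuming the usual tab-separated layout (status, module name, file name); the function name fastqc_flagged is hypothetical.

```bash
# List FastQC modules that did not PASS.
# Reads a FastQC summary.txt on stdin (tab-separated: status, module, file).
# fastqc_flagged is a hypothetical helper name, not part of FastQC.
fastqc_flagged() {
    awk -F'\t' '$1 != "PASS" { print $1 ": " $2 }'
}
```

Assuming the archive for input.fastq.gz is named input_fastqc.zip, it could be used as: unzip -p output_dir/input_fastqc.zip input_fastqc/summary.txt | fastqc_flagged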

1.2. Check Read Length Distribution

Ensure reads are of expected length.

Example Command:

bash
# Count reads per length; sort numerically so lengths appear in order
zcat input.fastq.gz | awk 'NR%4==2 {print length($0)}' | sort -n | uniq -c
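
When the full length table is too long to scan, a one-line summary can be easier to compare against the expected read length. This is a sketch only, assuming standard four-line FASTQ records; read_length_summary is a hypothetical helper name.

```bash
# Summarize read lengths from an uncompressed FASTQ stream on stdin.
# Assumes standard 4-line records (sequence on line 2 of every 4).
read_length_summary() {
    awk 'NR % 4 == 2 {
             n++; len = length($0); sum += len
             if (n == 1 || len < min) min = len
             if (n == 1 || len > max) max = len
         }
         END {
             if (n == 0) { print "no reads"; exit 1 }
             printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, sum / n
         }'
}
```

For example: zcat input.fastq.gz | read_length_summary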

1.3. Assess GC Content

Compare the GC content of your data to the expected distribution for your organism.

Example Command:

bash
# Add the read length to the total before gsub() strips G/C from the line
zcat input.fastq.gz | awk 'NR%4==2 {total+=length($0); gc+=gsub(/[GCgc]/, "")} END {print gc/total}'
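
A single overall GC fraction can hide a second peak caused by contamination, so it may help to look at the per-read distribution as well. The sketch below bins reads by GC content in 10% steps; gc_histogram is a hypothetical helper name, and the bin width is an arbitrary choice.

```bash
# Histogram of per-read GC content in 10% bins, from FASTQ on stdin.
# Output: one line per bin, "<low>-<high>%<TAB><read count>".
gc_histogram() {
    awk 'NR % 4 == 2 && length($0) > 0 {
             seq = $0
             gc = gsub(/[GCgc]/, "", seq)      # count G/C on a copy of the read
             bin = int(10 * gc / length($0))   # 0..10, where 10 means 100% GC
             if (bin == 10) bin = 9            # fold 100% into the top bin
             hist[bin]++
         }
         END {
             for (b = 0; b < 10; b++)
                 printf "%d-%d%%\t%d\n", b * 10, (b + 1) * 10, hist[b] + 0
         }'
}
```

For example: zcat input.fastq.gz | gc_histogram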

2. Alignment Quality Checks

After aligning reads to a reference genome, perform these checks.

2.1. Alignment Statistics

Use tools like samtools to generate alignment statistics.

Run samtools flagstat:

bash
samtools flagstat aligned.bam

Key Metrics:

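  • Total Reads: Ensure the total matches the number of reads in the input FASTQ files (flagstat counts alignment records, so secondary and supplementary alignments can inflate it).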
  • Mapped Reads: High mapping rates (e.g., > 80%) are generally desirable.
  • Properly Paired Reads: For paired-end data, check the percentage of properly paired reads.
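
In a pipeline, the mapping rate can be checked automatically rather than by eye. The following is a minimal sketch that parses the plain-text flagstat output; the function name check_mapping_rate and the default threshold of 80% are assumptions for illustration, and the rate is computed per alignment record, as flagstat reports it.

```bash
# Compute the mapping rate from `samtools flagstat` text on stdin.
# Exits 1 if the rate is below the threshold in $1 (percent, default 80).
check_mapping_rate() {
    awk -v min="${1:-80}" '
        $4 == "in" && $5 == "total" { total  = $1 }   # "N + 0 in total (...)"
        $4 == "mapped"              { mapped = $1 }   # "N + 0 mapped (...)"
        END {
            if (total == 0) { print "no reads"; exit 2 }
            rate = 100 * mapped / total
            printf "mapped=%.2f%%\n", rate
            exit (rate < min ? 1 : 0)
        }'
}
```

For example: samtools flagstat aligned.bam | check_mapping_rate 80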

2.2. Check for Chromosomal Bias

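Ensure the number of mapped reads per chromosome is roughly proportional to chromosome length. samtools idxstats reports, for each reference sequence, its name, its length, and the number of mapped and unmapped reads.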

Example Command:

bash
samtools idxstats aligned.bam
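
Raw counts are hard to compare between chromosomes of different sizes, so normalizing by length can make outliers (for example, an over-represented mitochondrial genome) stand out. This is a sketch only; reads_per_mb is a hypothetical helper name.

```bash
# Convert `samtools idxstats` output on stdin to mapped reads per megabase.
# Input columns: name, length, mapped reads, unmapped reads.
# The final "*" line (length 0, unplaced reads) is skipped.
reads_per_mb() {
    awk -F'\t' '$2 > 0 { printf "%s\t%.1f\n", $1, $3 / ($2 / 1e6) }'
}
```

For example: samtools idxstats aligned.bam | reads_per_mb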

2.3. Assess Insert Size Distribution

For paired-end data, check the insert size distribution.

Run Picard CollectInsertSizeMetrics:

bash
java -jar picard.jar CollectInsertSizeMetrics \
  I=aligned.bam \
  O=insert_size_metrics.txt \
  H=insert_size_histogram.pdf
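
As a quick cross-check that does not require Picard, the median insert size can be estimated from the template length field (TLEN, column 9 of each SAM record). The sketch below is an approximation under that assumption, not a replacement for Picard's metrics; median_insert_size is a hypothetical helper name.

```bash
# Median of positive TLEN values (SAM column 9) from SAM records on stdin.
# Only the leftmost mate of a pair has TLEN > 0, so each pair counts once.
median_insert_size() {
    awk -F'\t' '$9 > 0 { print $9 }' | sort -n |
        awk '{ v[NR] = $1 }
             END {
                 if (NR == 0) exit 1
                 if (NR % 2) print v[(NR + 1) / 2]
                 else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
             }'
}
```

For example, restricting to properly paired reads: samtools view -f 0x2 aligned.bam | median_insert_size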

3. Library Complexity and Duplication

Assess library complexity to identify potential issues with PCR amplification.

3.1. Use Picard MarkDuplicates

Identify and mark duplicate reads. By default, Picard flags duplicates in the output BAM rather than removing them.

Run Picard MarkDuplicates:

bash
java -jar picard.jar MarkDuplicates \
  I=aligned.bam \
  O=marked_duplicates.bam \
  M=mark_duplicates_metrics.txt

Key Metrics:

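  • Duplicate Rate (PERCENT_DUPLICATION, reported as a fraction): High duplication rates (e.g., > 20%) may indicate PCR artifacts, although acceptable levels depend on the assay and sequencing depth.
  • Library Complexity (ESTIMATED_LIBRARY_SIZE): An estimate of the number of unique molecules in the library.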
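
The metrics file is tab-separated text, so individual values can be pulled out for logging or thresholding. The following is a sketch, assuming the usual Picard layout in which a "## METRICS CLASS" line is followed by a header row and then the first data row; picard_metric is a hypothetical helper name.

```bash
# Print one named column from the first data row of a Picard metrics file
# read on stdin. Columns are looked up by header name, not by position.
# Exits 1 if the column (or the metrics table) is not found.
picard_metric() {
    awk -F'\t' -v col="$1" '
        /^## METRICS CLASS/ && (getline hdr) > 0 && (getline row) > 0 {
            n = split(hdr, h, "\t"); split(row, v, "\t")
            for (i = 1; i <= n; i++)
                if (h[i] == col) { print v[i]; found = 1 }
            exit
        }
        END { exit !found }'
}
```

For example: picard_metric PERCENT_DUPLICATION < mark_duplicates_metrics.txt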

4. Annotation-Based Checks

Check for biases related to genomic features.

4.1. Gene Coverage

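Check that annotated genes are covered as expected. For DNA sequencing, coverage should be roughly uniform across genes; for RNA-Seq, it varies with expression level.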

Use Bedtools Coverage:

bash
bedtools coverage -a genes.bed -b aligned.bam > gene_coverage.txt
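
By default, bedtools coverage appends four columns to each input feature: the read count, the number of covered bases, the feature length, and the fraction of the feature covered. A short filter can then list poorly covered genes. This is a sketch, assuming genes.bed carries the gene name in column 4; low_coverage_features and the 0.5 default cutoff are hypothetical choices.

```bash
# List features whose covered fraction (last column of default
# `bedtools coverage` output) is below the cutoff in $1 (default 0.5).
# Assumes the feature name is in column 4 of the original BED file.
low_coverage_features() {
    awk -F'\t' -v min="${1:-0.5}" '$NF < min { print $4 "\t" $NF }'
}
```

For example: low_coverage_features 0.5 < gene_coverage.txt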

4.2. Check for Strand Bias

For RNA-Seq or ChIP-Seq, compare the number of reads mapped to each strand. Genome-wide, unstranded data should show a roughly even split.

Example Command:

bash
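# -c counts records directly; 0x4 excludes unmapped reads from both counts
samtools view -c -f 0x10 -F 0x4 aligned.bam  # Mapped reads on the reverse strand
samtools view -c -F 0x14 aligned.bam         # Mapped reads on the forward strand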
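
Both counts can also be obtained in a single pass by decoding the FLAG field (SAM column 2) directly. The sketch below does this with integer arithmetic so that it works in any POSIX awk; strand_fraction is a hypothetical helper name.

```bash
# Count mapped reads per strand from SAM records on stdin and report
# the forward-strand fraction. Bit 0x4 = unmapped, bit 0x10 = reverse.
strand_fraction() {
    awk -F'\t' '
        /^@/ { next }                       # skip SAM header lines
        int($2 / 4) % 2 { next }            # 0x4 set: unmapped, ignore
        { if (int($2 / 16) % 2) rev++; else fwd++ }
        END {
            if (fwd + rev == 0) exit 1
            printf "forward=%d reverse=%d forward_fraction=%.3f\n",
                   fwd, rev, fwd / (fwd + rev)
        }'
}
```

For example: samtools view aligned.bam | strand_fraction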

5. Contamination Checks

Identify potential contamination from other species or sequences.

5.1. Check for rRNA Contamination

For RNA-Seq, quantify rRNA reads.

Example Command:

bash
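# BAM is binary, so grep cannot search it; count reads overlapping rRNA loci instead.
# rRNA.bed is an assumed BED file of rRNA gene coordinates for your reference.
samtools view -c -L rRNA.bed aligned.bam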

5.2. Use Kraken for Taxonomic Classification

Identify contamination from other species.

Run Kraken:

bash
kraken --db kraken_db input.fastq --output kraken_output.txt
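
Kraken's per-read output is tab-separated, with C (classified) or U (unclassified) in the first column and the assigned taxonomy ID in the third. A quick tally of the most frequent taxonomy IDs can show whether an unexpected organism dominates. This is a sketch under that assumption; kraken_top_taxa is a hypothetical helper name.

```bash
# Tally classified reads per taxonomy ID from Kraken per-read output on
# stdin and print the top N (default 10) as "<count><TAB><taxid>".
kraken_top_taxa() {
    awk -F'\t' '$1 == "C" { n[$3]++ }
                END { for (t in n) print n[t] "\t" t }' |
        sort -k1,1nr -k2,2n | head -n "${1:-10}"
}
```

For example: kraken_top_taxa 5 < kraken_output.txt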

6. Visualize Results

Use tools like MultiQC to aggregate and visualize QC metrics.

Run MultiQC:

bash
multiqc output_dir/ -o multiqc_report

By performing these sanity checks, you can identify potential issues in your NGS data and ensure the reliability of your downstream analyses.
