Step-by-Step Guide: Sanity Checks for NGS Data

January 10, 2025, by admin

Next-Generation Sequencing (NGS) data requires rigorous quality control to ensure reliability and accuracy. This guide outlines essential sanity checks for NGS data, covering both raw sequence quality and alignment metrics.

1. Raw Sequence Quality Checks

Perform these checks on raw FASTQ files to assess sequencing quality.

1.1. Use FastQC for Quality Control

FastQC is a widely used tool for assessing raw sequence quality.

Run FastQC:

fastqc input.fastq.gz -o output_dir

Key Metrics to Check:

Per Base Sequence Quality: Quality scores should stay high (ideally Q ≥ 30) across most positions; a modest drop toward the 3' end of reads is common.
Per Sequence Quality Scores: Identify sequences with consistently low quality.
Per Base Sequence Content: Check for biases in nucleotide composition (e.g., GC bias).
Sequence Duplication Levels: High duplication may indicate PCR artifacts.
Overrepresented Sequences: Identify adapter contamination or other artifacts.
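
FastQC also writes a tab-separated summary.txt (status, module name, file name) inside its output archive, which makes the metrics above easy to screen in bulk. The function below is a minimal sketch that lists every module not marked PASS; the name fastqc_flagged is illustrative, and the archive path in the usage comment assumes FastQC's usual input_fastqc.zip naming.

```shell
# List FastQC modules that did not PASS.
# Expects FastQC's summary.txt on stdin: STATUS <tab> MODULE <tab> FILENAME.
fastqc_flagged() {
  awk -F '\t' 'NF && $1 != "PASS" { print $1 "\t" $2 }'
}

# Usage (assumes the archive is named input_fastqc.zip):
#   unzip -p output_dir/input_fastqc.zip input_fastqc/summary.txt | fastqc_flagged
```

A WARN or FAIL is a prompt to look at the corresponding plot, not an automatic reason to discard the data.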

1.2. Check Read Length Distribution

Ensure reads are of expected length.

Example Command:

zcat input.fastq.gz | awk 'NR%4==2 {print length($0)}' | sort -n | uniq -c
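
When the full length table is too long to scan, a compact summary may be enough. The sketch below assumes a standard four-line-per-record FASTQ stream on stdin; the function name fastq_length_summary is illustrative.

```shell
# Summarise read lengths from an uncompressed 4-line FASTQ stream on stdin.
# Prints the read count plus minimum, mean and maximum read length.
fastq_length_summary() {
  awk 'NR % 4 == 2 {
         len = length($0)
         if (n == 0 || len < min) min = len
         if (len > max) max = len
         sum += len
         n++
       }
       END {
         if (n == 0) { print "reads=0"; exit }
         printf "reads=%d min=%d mean=%.1f max=%d\n", n, min, sum / n, max
       }'
}

# Usage:
#   zcat input.fastq.gz | fastq_length_summary
```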

1.3. Assess GC Content

Compare the GC content of your data to the expected distribution for your organism.

Example Command:

zcat input.fastq.gz | awk 'NR%4==2 {total+=length($0); gc+=gsub(/[GCgc]/, "")} END {print gc/total}'
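
A single overall GC fraction can hide a second peak caused by contamination, so it can help to look at the per-read distribution as well. The following is a rough sketch that bins per-read GC percentage into ten bins; it assumes four-line FASTQ records, and the function name fastq_gc_histogram is illustrative.

```shell
# Bin per-read GC percentage into ten bins (0-9, 10-19, ..., 90-100).
# Reads an uncompressed 4-line FASTQ stream on stdin.
fastq_gc_histogram() {
  awk 'NR % 4 == 2 {
         seq = $0
         len = length(seq)
         if (len == 0) next
         gc = gsub(/[GCgc]/, "", seq)   # gsub on a copy returns the GC count
         bin = int(10 * gc / len)
         if (bin > 9) bin = 9           # put 100% GC in the top bin
         count[bin]++
       }
       END {
         for (b = 0; b <= 9; b++)
           printf "%d-%d%%\t%d\n", b * 10, (b == 9 ? 100 : b * 10 + 9), count[b] + 0
       }'
}

# Usage:
#   zcat input.fastq.gz | fastq_gc_histogram
```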

2. Alignment Quality Checks

After aligning reads to a reference genome, perform these checks.

2.1. Alignment Statistics

Use tools like samtools to generate alignment statistics.

Run `samtools flagstat`:

samtools flagstat aligned.bam

Key Metrics:

Total Reads: Ensure the total matches what you expect from the input FASTQ files (note that flagstat also counts secondary and supplementary alignments).
Mapped Reads: High mapping rates (e.g., > 80%) are generally desirable.
Properly Paired Reads: For paired-end data, check the percentage of properly paired reads.
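
To apply the mapping-rate threshold across many samples, the flagstat text can be reduced to a single number. This is a sketch that relies on the default text layout of samtools flagstat (lines such as "950 + 0 mapped (95.00% : N/A)"); the function name flagstat_mapping_rate is illustrative.

```shell
# Print the overall mapping rate (percent) from `samtools flagstat` text on stdin.
# Uses the QC-passed counts in column 1 of the "in total" and "mapped" lines;
# the "primary mapped" line printed by newer samtools versions is ignored.
flagstat_mapping_rate() {
  awk '$4 == "in" && $5 == "total" { total = $1 }
       $4 == "mapped"              { mapped = $1 }
       END {
         if (total > 0) printf "%.2f\n", 100 * mapped / total
         else print "NA"
       }'
}

# Usage:
#   samtools flagstat aligned.bam | flagstat_mapping_rate
```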

2.2. Check for Chromosomal Bias

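Check that mapped read counts are roughly proportional to chromosome length; strong deviations (for example, an excess of mitochondrial reads) can point to a problem. The command below needs an indexed BAM file and prints the reference name, sequence length, mapped read count, and unmapped read count for each sequence.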

Example Command:

samtools idxstats aligned.bam
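
Raw counts are hard to compare between chromosomes of different sizes, so normalising by length can help. The sketch below converts idxstats output into mapped reads per megabase; the function name idxstats_reads_per_mb is illustrative.

```shell
# Convert `samtools idxstats` output (name, length, mapped, unmapped) into
# mapped reads per megabase for each reference sequence.
# The trailing "*" line (length 0) is skipped.
idxstats_reads_per_mb() {
  awk -F '\t' '$2 > 0 { printf "%s\t%.1f\n", $1, $3 / ($2 / 1000000) }'
}

# Usage:
#   samtools idxstats aligned.bam | idxstats_reads_per_mb
```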

2.3. Assess Insert Size Distribution

For paired-end data, check the insert size distribution.

Run Picard CollectInsertSizeMetrics:

java -jar picard.jar CollectInsertSizeMetrics \
  I=aligned.bam \
  O=insert_size_metrics.txt \
  H=insert_size_histogram.pdf
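
As a quick cross-check that does not require Picard, the template length (TLEN, SAM column 9) can be summarised directly. This is a rough sketch, not a replacement for Picard's metrics: it takes SAM records on stdin and reports the median of the positive TLEN values; the function name sam_median_tlen is illustrative.

```shell
# Median of positive template lengths (SAM column 9) from SAM records on stdin.
# Only positive values are used, so each pair is counted once
# (the leftmost read of a pair carries the positive TLEN).
sam_median_tlen() {
  awk -F '\t' '$1 !~ /^@/ && $9 > 0 { print $9 }' |
    sort -n |
    awk '{ v[NR] = $1 }
         END {
           if (NR == 0) { print "NA"; exit }
           if (NR % 2) print v[(NR + 1) / 2]
           else print (v[NR / 2] + v[NR / 2 + 1]) / 2
         }'
}

# Usage (properly paired reads only):
#   samtools view -f 0x2 aligned.bam | sam_median_tlen
```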

3. Library Complexity and Duplication

Assess library complexity to identify potential issues with PCR amplification.

3.1. Use Picard MarkDuplicates

Identify and mark duplicate reads. By default, MarkDuplicates flags duplicates in the output BAM rather than removing them.

Run Picard MarkDuplicates:

java -jar picard.jar MarkDuplicates \
  I=aligned.bam \
  O=deduplicated.bam \
  M=mark_duplicates_metrics.txt

Key Metrics:

Duplicate Rate: High duplication rates (> 20%) may indicate PCR artifacts.
Library Complexity: Assess the number of unique molecules in the library.
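
Both metrics can be read from the metrics file by column name (PERCENT_DUPLICATION and ESTIMATED_LIBRARY_SIZE). The sketch below assumes the usual Picard layout, in which a "## METRICS CLASS" line is followed by a tab-separated header row and then the data rows; the function name picard_metric is illustrative.

```shell
# Print one column from the first data row of a Picard metrics file.
# Usage: picard_metric FILE COLUMN_NAME
picard_metric() {
  awk -F '\t' -v col="$2" '
    /^## METRICS CLASS/ { want_header = 1; next }
    want_header {
      # Header row: find the index of the requested column.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      want_header = 0
      in_data = 1
      next
    }
    in_data && idx { print $idx; exit }
  ' "$1"
}

# Usage:
#   picard_metric mark_duplicates_metrics.txt PERCENT_DUPLICATION
#   picard_metric mark_duplicates_metrics.txt ESTIMATED_LIBRARY_SIZE
```

PERCENT_DUPLICATION is reported as a fraction, so 0.20 corresponds to the 20% threshold mentioned above. If the file contains several libraries, only the first row is printed.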

4. Annotation-Based Checks

Check for biases related to genomic features.

4.1. Gene Coverage

Check how read coverage is distributed across annotated genes, and look for genes with unexpectedly low or patchy coverage.

Use Bedtools Coverage:

bedtools coverage -a genes.bed -b aligned.bam > gene_coverage.txt
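
In its default mode, bedtools coverage appends four columns to each input feature, the last being the fraction of the feature covered by at least one read. The sketch below lists features whose covered fraction falls under a threshold. It assumes genes.bed carries the gene name in column 4; the function name low_coverage_features and the 0.5 default are illustrative choices.

```shell
# Print features whose covered fraction (last column of default
# `bedtools coverage` output) is below a threshold.
# Usage: low_coverage_features FILE [MIN_FRACTION]   (default 0.5)
# Assumes column 4 of the input BED holds the feature name.
low_coverage_features() {
  awk -F '\t' -v min="${2:-0.5}" '$NF + 0 < min + 0 { print $4 "\t" $NF }' "$1"
}

# Usage:
#   low_coverage_features gene_coverage.txt 0.8
```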

4.2. Check for Strand Bias

For RNA-Seq or ChIP-Seq, check for strand-specific biases.

Example Command:

samtools view -c -f 0x10 -F 0x4 aligned.bam  # Mapped reads on the reverse strand
samtools view -c -F 0x14 aligned.bam         # Mapped reads on the forward strand
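
The two counts can also be gathered in a single pass and turned into a ratio. This sketch reads SAM records on stdin and tests the FLAG bits 0x4 (unmapped) and 0x10 (reverse strand) with integer arithmetic, since POSIX awk has no bitwise operators; the function name sam_strand_counts is illustrative.

```shell
# Count mapped reads per strand from SAM records on stdin.
# FLAG bit 0x4 = read unmapped, FLAG bit 0x10 = read on reverse strand.
sam_strand_counts() {
  awk -F '\t' '$1 ~ /^@/ { next }               # skip header lines
       int($2 / 4) % 2 == 1 { next }            # skip unmapped reads
       int($2 / 16) % 2 == 1 { rev++; next }    # reverse strand
       { fwd++ }                                # everything else is forward
       END {
         total = fwd + rev
         printf "forward=%d reverse=%d", fwd, rev
         if (total > 0) printf " forward_fraction=%.3f", fwd / total
         printf "\n"
       }'
}

# Usage:
#   samtools view aligned.bam | sam_strand_counts
```

For unstranded libraries a forward fraction near 0.5 is the usual expectation; stranded RNA-Seq protocols are expected to deviate from that at the level of individual genes.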

5. Contamination Checks

Identify potential contamination from other species or sequences.

5.1. Check for rRNA Contamination

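For RNA-Seq, quantify rRNA reads. BAM files are compressed binary, so decode them with samtools view before searching, and replace rRNA_gene_name with the name of an rRNA reference sequence used in your alignment.

Example Command:

samtools view aligned.bam | grep -c "rRNA_gene_name"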
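
A raw count is easier to interpret as a share of all mapped reads. The sketch below works from samtools idxstats output and only applies when the rRNA sequences are present as separately named reference sequences (for example, when reads were aligned to a reference that includes rRNA entries); the function name idxstats_percent_matching and the pattern in the usage comment are illustrative.

```shell
# Percentage of mapped reads whose reference name (idxstats column 1)
# matches an extended regular expression.
# Reads `samtools idxstats` output on stdin; the pattern is argument 1.
idxstats_percent_matching() {
  awk -F '\t' -v pat="$1" '
    { total += $3 }
    $1 ~ pat { hit += $3 }
    END {
      if (total > 0) printf "%.2f\n", 100 * hit / total
      else print "NA"
    }'
}

# Usage (pattern is a placeholder for your rRNA sequence names):
#   samtools idxstats aligned.bam | idxstats_percent_matching 'rRNA'
```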

5.2. Use Kraken for Taxonomic Classification

Identify contamination from other species.

Run Kraken:

kraken --db kraken_db input.fastq --output kraken_output.txt
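
Kraken's per-read output is tab-separated, with "C" (classified) or "U" (unclassified) in the first column, so the classified share is simple to compute. The function below is a minimal sketch under that assumption; the name kraken_classified_percent is illustrative.

```shell
# Summarise a Kraken per-read output file.
# Column 1 is "C" (classified) or "U" (unclassified).
# Usage: kraken_classified_percent FILE
kraken_classified_percent() {
  awk -F '\t' '$1 == "C" { c++ }
       $1 == "C" || $1 == "U" { n++ }
       END {
         if (n > 0) printf "classified=%d total=%d percent=%.2f\n", c, n, 100 * c / n
         else print "NA"
       }' "$1"
}

# Usage:
#   kraken_classified_percent kraken_output.txt
```

How many reads should be classified depends on the database: with a database that contains the target organism, most reads are expected to classify, and the interesting question becomes which taxa they are assigned to.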

6. Visualize Results

Use tools like MultiQC to aggregate and visualize QC metrics.

Run MultiQC:

multiqc output_dir/ -o multiqc_report

By performing these sanity checks, you can identify potential issues in your NGS data and ensure the reliability of your downstream analyses.
