Step-by-Step Manual: Numbers Every Bioinformatician Should Know
January 9, 2025
Bioinformatics is a rapidly evolving field, and staying updated with the latest trends, tools, and metrics is crucial. Below is an updated list of numbers, recent topics, and tips that every bioinformatician should know.
1. Sequencing and Genomics
1.1. Human Genome Basics
- Human Genome Size: ~3.2 billion base pairs (bp).
- Coding Genes: ~20,000–25,000.
- Non-Coding Genes: ~22,000.
- Pseudogenes: ~14,000.
- Transcripts: ~200,000.
- SNPs (Single Nucleotide Polymorphisms): ~3 million per individual.
- Structural Variants: ~10,000 per individual.
1.2. Sequencing Metrics
- Illumina NovaSeq 6000 (S4 Flow Cell):
  - Output: ~8 billion reads (2×150 bp).
  - Cost: ~$16,000 per flow cell.
- PacBio HiFi Reads:
  - Read Length: ~10–25 kb.
  - Accuracy: ~99.9%.
  - Cost: ~$1,000–$1,500 per human genome.
- Oxford Nanopore:
  - Read Length: Up to 1 Mb.
  - Accuracy: ~95–98%.
  - Cost: ~$1,000 per human genome.
1.3. File Sizes
- FASTQ File:
  - ~0.5 GB per 1 million read pairs (2×100 bp, uncompressed).
- BAM File:
  - ~0.2–0.3 GB per 1 million read pairs (compressed).
- CRAM File:
  - ~0.1–0.2 GB per 1 million read pairs (more compressed than BAM).
- VCF File:
  - ~1–2 GB for a human genome (compressed).
2. Computational Performance (2023 Updates)
2.1. Alignment Speed
- BWA-MEM:
  - ~10–20 GB/hour on a single CPU core.
- STAR:
  - ~30–50 GB/hour on a single CPU core.
- Minimap2:
  - ~50–100 GB/hour on a single CPU core (for long reads).
2.2. Memory Usage
- BWA-MEM:
  - ~5–10 GB for human genome alignment.
- STAR:
  - ~30 GB for human genome alignment.
- GATK HaplotypeCaller:
  - ~8–16 GB for variant calling.
2.3. Storage Requirements
- Human Genome:
  - Raw FASTQ: ~200–300 GB.
  - Aligned BAM: ~100–150 GB.
  - Variant Calls (VCF): ~1–2 GB.
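The per-genome figures above can be scaled to a whole project with simple arithmetic. The following is a minimal sketch, not a sizing tool: the function name `project_storage_tb` and the 1 TB = 1000 GB convention are assumptions made for illustration, and the ranges are the ones listed in this section.

```python
# Per-genome storage ranges in GB, copied from the list in this section.
PER_GENOME_GB = {
    "fastq": (200, 300),
    "bam": (100, 150),
    "vcf": (1, 2),
}

def project_storage_tb(n_genomes, keep=("fastq", "bam", "vcf")):
    """Return (low, high) storage in TB for n_genomes (1 TB = 1000 GB assumed)."""
    low_gb = sum(PER_GENOME_GB[kind][0] for kind in keep) * n_genomes
    high_gb = sum(PER_GENOME_GB[kind][1] for kind in keep) * n_genomes
    return low_gb / 1000, high_gb / 1000

# Example: 100 genomes, keeping every file type vs. dropping FASTQ after alignment.
for kinds in (("fastq", "bam", "vcf"), ("bam", "vcf")):
    low, high = project_storage_tb(100, keep=kinds)
    print(f"{'+'.join(kinds)}: {low:.1f}-{high:.1f} TB")
```

Passing a shorter `keep` tuple shows how much space is recovered by deleting intermediate files, which is often the first lever in a storage budget.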
3. Quality Control and Filtering (2023 Updates)
3.1. FASTQ Quality Scores
- Sanger/Illumina 1.8+:
  - ASCII Range: 33–126.
  - Quality Score Range: 0–93.
- Illumina 1.3+:
  - ASCII Range: 64–126.
  - Quality Score Range: 0–62.
- Recommended Cutoffs:
  - Base Quality: ≥20 (Q20 = 1% error rate).
  - Read Quality: ≥30 (Q30 = 0.1% error rate).
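The relationship between a quality character, its Phred score, and the error rate can be sketched in a few lines. This example assumes the standard Phred definition (error probability = 10^(-Q/10)); the function names are illustrative and not taken from any particular library.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)

def decode_qualities(qual_string, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    offset=33 is Sanger / Illumina 1.8+; offset=64 is Illumina 1.3+.
    """
    return [ord(char) - offset for char in qual_string]

# Q20 and Q30 correspond to 1% and 0.1% error rates.
for q in (20, 30):
    print(f"Q{q}: {phred_to_error(q):.4f}")

# 'I' is ASCII 73, so it encodes Q40 with offset 33.
scores = decode_qualities("I5+")
print(scores)
```

Using the wrong offset shifts every score by 31, so it is worth checking which encoding a file uses before applying any cutoff.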
3.2. RNA-seq Metrics
- Recommended Reads:
  - Human: ~30–50 million reads per sample.
  - Mouse: ~20–30 million reads per sample.
- Mapping Rate: ≥70% for RNA-seq data.
3.3. ChIP-seq Metrics
- Recommended Reads:
  - Transcription Factors: ~20–30 million reads.
  - Histone Marks: ~40–60 million reads.
- Peak Calling: ~10,000–50,000 peaks per sample.
4. Variant Calling and Population Genetics (2023 Updates)
4.1. Human Genetic Diversity
- SNPs per Individual: ~3 million.
- Transition/Transversion Ratio: ~2:1.
- FST (Genetic Differentiation):
  - African vs. European: ~0.07.
  - African vs. Asian: ~0.08.
  - Asian vs. European: ~0.05.
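The transition/transversion ratio quoted above is easy to check on a call set. The sketch below assumes variants are available as (ref, alt) pairs of single bases; the function name `ti_tv_ratio` and the input shape are illustrative assumptions, not the interface of any specific tool.

```python
BASES = {"A", "C", "G", "T"}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def ti_tv_ratio(snvs):
    """Return transitions / transversions for (ref, alt) pairs.

    Anything that is not a single-base substitution is skipped.
    Returns None when there are no transversions.
    """
    transitions = transversions = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else None

calls = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(calls))
```

A ratio well below the expected value for the assay is commonly treated as a hint that the call set contains many false positives.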
4.2. Variant Calling
- GATK HaplotypeCaller:
  - Recommended Coverage: ≥30x.
- Variant Filtering:
  - QD (Quality by Depth): ≥2.
  - FS (Fisher Strand Bias): ≤60.
  - MQ (Mapping Quality): ≥40.
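The three thresholds above can be applied as a simple hard filter. This is a minimal sketch that assumes each variant's annotations have already been parsed into a dictionary (for example from the VCF INFO field); the names `HARD_FILTERS` and `failed_filters` are hypothetical, and treating a missing annotation as "not failing" is a design choice of this example.

```python
# Pass conditions taken from the list above: QD >= 2, FS <= 60, MQ >= 40.
HARD_FILTERS = {
    "QD": lambda value: value >= 2.0,
    "FS": lambda value: value <= 60.0,
    "MQ": lambda value: value >= 40.0,
}

def failed_filters(info):
    """Return the annotation names that are present in info and fail their cutoff."""
    return [
        name
        for name, passes in HARD_FILTERS.items()
        if name in info and not passes(info[name])
    ]

variants = [
    {"QD": 12.3, "FS": 1.2, "MQ": 60.0},   # passes everything
    {"QD": 1.5, "FS": 75.0, "MQ": 60.0},   # fails QD and FS
]
for info in variants:
    failures = failed_filters(info)
    print("PASS" if not failures else ";".join(failures))
```

Returning the list of failed annotations, rather than a single boolean, makes it easier to see which threshold removes the most calls.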
5. Key Bioinformatics Tools and Formats (2023 Updates)
5.1. SAM/BAM Flags
- Common Flags:
  - 4: Read unmapped.
  - 12: Read and its mate both unmapped (4 + 8).
- Tool: Use samtools flags or Picard’s ExplainFlags to interpret flags.
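SAM flags are bit fields, so a combined value such as 12 is simply the sum of the individual bits that are set. The sketch below decodes a flag by testing each bit; the bit values and short names follow the SAM specification, while the function name `explain_flag` is an illustrative assumption.

```python
# Bit values and short names for the SAM FLAG field.
SAM_FLAG_BITS = [
    (0x1, "PAIRED"),
    (0x2, "PROPER_PAIR"),
    (0x4, "UNMAP"),
    (0x8, "MUNMAP"),
    (0x10, "REVERSE"),
    (0x20, "MREVERSE"),
    (0x40, "READ1"),
    (0x80, "READ2"),
    (0x100, "SECONDARY"),
    (0x200, "QCFAIL"),
    (0x400, "DUP"),
    (0x800, "SUPPLEMENTARY"),
]

def explain_flag(flag):
    """Return the names of all bits set in a SAM flag value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]

for value in (4, 12, 99):
    print(value, explain_flag(value))
```

The same bitwise test is what a filter on a required or excluded flag value performs, which is why flags can be combined freely when selecting reads.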
5.2. Taxonomy IDs
- Human: 9606.
- Mouse: 10090.
- E. coli K-12 MG1655: 511145.
5.3. Reference Genome Builds
- Human:
  - hg19: GRCh37.
  - hg38: GRCh38.
- Mouse:
  - mm10: GRCm38.
6. Recent Topics and Tips (2023)
6.1. Single-Cell Sequencing
- 10x Genomics:
  - Cells per Run: ~1,000–10,000.
  - Cost: ~$1,000–$2,000 per run.
  - Recommended Reads per Cell: ~50,000–100,000.
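Sequencing depth for a single-cell run follows directly from the cell count and the per-cell read target. The following sketch multiplies the two across the ranges listed above; the function name `reads_for_run` is an illustrative assumption.

```python
def reads_for_run(n_cells, reads_per_cell):
    """Total reads to request for a single-cell run."""
    return n_cells * reads_per_cell

# Sweep the ends of the ranges quoted above.
for n_cells in (1_000, 10_000):
    for depth in (50_000, 100_000):
        total = reads_for_run(n_cells, depth)
        print(f"{n_cells:>6,} cells x {depth:>7,} reads/cell = {total / 1e6:,.0f} million reads")
```

At the upper end (10,000 cells at 100,000 reads per cell) a single sample needs about a billion reads, which is worth checking against the flow cell output before pooling samples.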
6.2. Long-Read Sequencing
- PacBio HiFi:
  - Read Length: ~10–25 kb.
  - Accuracy: ~99.9%.
- Oxford Nanopore:
  - Read Length: Up to 1 Mb.
  - Accuracy: ~95–98%.
6.3. AI and Machine Learning in Bioinformatics
- DeepVariant: A deep learning-based variant caller from Google.
- AI for Drug Discovery: Tools like AlphaFold for protein structure prediction.
6.4. Cloud Computing
- AWS, Google Cloud, Azure: Leverage cloud platforms for scalable and cost-effective bioinformatics analyses.
- Cost Management: Use spot instances and preemptible VMs to reduce costs.
6.5. Data Management and Reproducibility
- FAIR Principles: Ensure data is Findable, Accessible, Interoperable, and Reusable.
- Containerization: Use Docker or Singularity for reproducible environments.
- Workflow Managers: Use Nextflow, Snakemake, or WDL for automating and scaling workflows.
7. Practical Tips (2023)
7.1. Reproducibility
- Version Control: Use Git for scripts and workflows.
- Containerization: Use Docker or Singularity for reproducible environments.
7.2. Performance Optimization
- Parallelization: Use tools like GNU Parallel or Snakemake for parallel processing.
- Cloud Computing: Leverage AWS, Google Cloud, or Azure for large-scale analyses.
7.3. Data Management
- Backup: Regularly back up critical data.
- Metadata: Document all steps and parameters for reproducibility.
8. Example Calculations (2023)
8.1. Coverage Calculation
- Genome Size: 3.2 billion bp.
- Read Length: 100 bp.
- Desired Coverage: 30x.
- Total Reads Needed:
  - Formula: (Genome Size × Coverage) / Read Length.
  - Calculation: (3.2 × 10^9 × 30) / 100 = 960 million reads.
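The coverage formula above translates directly into code. This is a minimal sketch with illustrative function names; it rounds the read count up and treats coverage as the mean over the whole genome, ignoring unmappable regions and duplicates.

```python
def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Reads required for a target mean coverage, rounded up (integer inputs)."""
    total_bases = genome_size_bp * coverage
    return -(-total_bases // read_length_bp)  # ceiling division on integers

def coverage_from_reads(n_reads, read_length_bp, genome_size_bp):
    """Mean coverage obtained from a given number of reads."""
    return n_reads * read_length_bp / genome_size_bp

n = reads_needed(3_200_000_000, 30, 100)
print(f"{n:,} reads")              # individual 100 bp reads
print(f"{n // 2:,} read pairs")    # the same bases as 2x100 bp pairs
print(coverage_from_reads(n, 100, 3_200_000_000))
```

With paired-end data the read count halves when expressed as pairs, so it helps to state explicitly whether a quoted number refers to reads or to read pairs.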
8.2. File Size Estimation
- FASTQ File:
  - 1 million read pairs (2×100 bp, uncompressed) = ~0.5 GB.
  - 100 million read pairs = ~50 GB.
- BAM File:
  - 1 million read pairs = ~0.25 GB.
  - 100 million read pairs = ~25 GB.
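As a rough cross-check on per-read file-size figures, uncompressed FASTQ size can be estimated from the four-line record layout (header, sequence, "+", qualities). The sketch below assumes a 50-byte header line, which is a hypothetical value since real headers vary by instrument and run, and it ignores compression entirely.

```python
def fastq_bytes(n_fragments, read_length_bp, paired=True, header_bytes=50):
    """Estimate uncompressed FASTQ size in bytes.

    Each record has four lines: header, sequence, '+', qualities.
    header_bytes is an assumed average header length (including '@').
    """
    # header + newline, sequence + newline, '+' + newline, qualities + newline
    record = (header_bytes + 1) + (read_length_bp + 1) + 2 + (read_length_bp + 1)
    records = n_fragments * (2 if paired else 1)
    return records * record

for pairs in (1_000_000, 100_000_000):
    size_gb = fastq_bytes(pairs, 100) / 1e9
    print(f"{pairs:,} pairs (2x100 bp): ~{size_gb:.2f} GB uncompressed")
```

Under these assumptions a record costs about 2.5 bytes per base, so the estimate scales linearly with both read length and read count; gzip-compressed files are typically several times smaller.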
9. Conclusion
By familiarizing yourself with these key numbers, recent topics, and tips, you can make informed decisions about sequencing, computational resources, and data analysis. Whether you’re designing experiments, optimizing pipelines, or interpreting results, these numbers will serve as a valuable reference in your bioinformatics journey. Stay updated with the latest trends and tools to remain at the forefront of the field.