Bioinformatics Cheat Sheet: A Step-by-Step Guide for Beginners
December 28, 2024
This cheat sheet gives bioinformaticians, especially beginners, a quick reference for the concepts, commands, scripts, and tools that come up most often in bioinformatics workflows. It covers basic Unix commands, Perl one-liners, Python snippets, and the biological concepts and conversion factors every bioinformatician should know.
1. Key Biological Concepts and Units
- IUPAC Ambiguity Codes for Nucleotides
  - A = Adenine
  - C = Cytosine
  - G = Guanine
  - T = Thymine
  - U = Uracil (RNA only)
  - R = A or G (Purines)
  - Y = C or T (Pyrimidines)
  - M = A or C
  - K = G or T
  - S = G or C
  - W = A or T
  - B = C, G, or T
  - D = A, G, or T
  - H = A, C, or T
  - V = A, C, or G
  - N = Any base
- Amino Acid Single-Letter Codes
  - A = Alanine
  - C = Cysteine
  - D = Aspartic Acid
  - E = Glutamic Acid
  - F = Phenylalanine
  - G = Glycine
  - H = Histidine
  - I = Isoleucine
  - K = Lysine
  - L = Leucine
  - M = Methionine
  - N = Asparagine
  - P = Proline
  - Q = Glutamine
  - R = Arginine
  - S = Serine
  - T = Threonine
  - V = Valine
  - W = Tryptophan
  - Y = Tyrosine
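- 1 diploid human genome ≈ 7 pg of DNA (about 3.5 pg per haploid copy)
  - This is useful for DNA quantification, for example when estimating genome copies from a measured DNA mass.
- 1 base pair (bp) ≈ 660 Daltons (average for double-stranded DNA)
  - This conversion is helpful when calculating the molecular weight of DNA sequences.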
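The IUPAC ambiguity codes listed above are often used to write sequence motifs. As a minimal sketch (the function names `iupac_to_regex` and `find_motif` are hypothetical, and treating U as T for matching is an assumption), a motif can be translated into a regular expression and searched for in a DNA sequence:

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
# Assumption: U is treated as T so that RNA-style motifs match DNA text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'GAWTC' into a regex such as 'GA[AT]TC'."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # raises KeyError for a non-IUPAC character
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_motif(motif, sequence):
    """Return 0-based start positions of all matches, including overlapping ones."""
    # A lookahead lets the regex engine report overlapping matches.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

For example, `find_motif("GAWTC", "TTGATTCGAATC")` returns `[2, 7]`, because W matches either A or T.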
2. Unix Commands and One-Liners
- Convert Unix to Windows Line Endings
- Sort a BED File by Chromosome and Position
- Find and Replace in Multiple Files
- Remove Header from a File
- Sum the Values in the First Column of a File
- Print a Specific Line from a File (e.g., Line 83)
- Insert a Header Line into a File
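These tasks are usually handled with standard Unix tools, but the logic behind a few of them can also be sketched in Python. The function names below are hypothetical, and the sketch assumes a tab-separated BED file with no track or comment lines and a numeric first column for the sum:

```python
def sort_bed(lines):
    """Sort BED lines by chromosome name, then numerically by start and end."""
    def key(line):
        fields = line.rstrip("\n").split("\t")
        # Chromosome names sort as plain strings, so chr10 comes before chr2.
        return (fields[0], int(fields[1]), int(fields[2]))
    return sorted(lines, key=key)


def drop_header(lines):
    """Return all lines except the first one."""
    return list(lines)[1:]


def sum_first_column(lines):
    """Sum the first whitespace-separated field of every non-blank line."""
    return sum(float(line.split()[0]) for line in lines if line.strip())


def nth_line(lines, n):
    """Return line number n (1-based), or None if there are fewer than n lines."""
    for number, line in enumerate(lines, start=1):
        if number == n:
            return line
    return None
```

Each function accepts any iterable of lines, so an open file handle works directly, for example `with open("regions.bed") as handle: sorted_lines = sort_bed(handle)` (the file name is a placeholder).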
3. Perl One-Liners
- Remove Windows Line Breaks (convert to Unix format)
- Replace String in Multiple Files
- Simple Text Processing with Perl
- Count the Number of Lines in a File
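For comparison, the same text-processing tasks can be sketched in Python. This is a rough equivalent rather than a translation of any particular Perl one-liner; the function names are hypothetical, files are edited in place, and UTF-8 is assumed for the search strings:

```python
from pathlib import Path


def crlf_to_lf(path):
    """Convert Windows (CRLF) line endings to Unix (LF) in place."""
    file_path = Path(path)
    data = file_path.read_bytes()
    # Working on bytes avoids any automatic newline translation.
    file_path.write_bytes(data.replace(b"\r\n", b"\n"))


def replace_in_files(paths, old, new):
    """Replace every occurrence of `old` with `new` in each listed file."""
    old_bytes, new_bytes = old.encode("utf-8"), new.encode("utf-8")
    for path in paths:
        file_path = Path(path)
        data = file_path.read_bytes()
        file_path.write_bytes(data.replace(old_bytes, new_bytes))


def count_lines(path):
    """Count the lines in a file without loading it all into memory."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)
```

Because these functions overwrite their input, it is safer to try them on a copy of the data first.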
4. Python Scripts for Bioinformatics
- Calculate GC Content of a DNA Sequence
- Read and Parse a FASTA File
- Basic Sequence Alignment Using Biopython
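A minimal sketch of the three tasks above follows. The function names are hypothetical. GC content is computed over the full sequence length, so ambiguity codes such as N count toward the length but not toward GC (an assumption). The alignment helper relies on Biopython's `Bio.Align.PairwiseAligner` with its default scoring and is only usable when Biopython is installed:

```python
try:
    from Bio import Align  # optional dependency, needed only for alignment
except ImportError:
    Align = None


def gc_content(seq):
    """Return the GC percentage of a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def best_global_alignment(seq_a, seq_b):
    """Return the first best-scoring global alignment found by Biopython."""
    if Align is None:
        raise RuntimeError("Biopython is required for alignment")
    aligner = Align.PairwiseAligner()
    aligner.mode = "global"
    return aligner.align(seq_a, seq_b)[0]
```

`parse_fasta` accepts an open file handle, so `for name, seq in parse_fasta(open("reads.fasta")): print(name, gc_content(seq))` reports GC content per record (the file name is a placeholder). Printing the object returned by `best_global_alignment` shows the aligned sequences.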
5. Handy Biological Conversion Factors
- Molecular Weights
  - 1 mole of DNA = 6.022 × 10²³ molecules (Avogadro’s number)
  - 1 bp ≈ 660 Daltons (average molecular weight of a base pair in double-stranded DNA)
- Volume Conversions
  - 1 mm³ = 1 µL
  - 1 µm³ = 1 fL
  - 1 liter = 1000 mL
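These factors combine into two common calculations: the mass of a single double-stranded DNA molecule and the number of copies in a measured mass. The sketch below uses the 660 Da per base pair average and Avogadro's number from above; the function names are hypothetical, and the 6.4 × 10⁹ bp figure for a diploid human genome in the usage note is an approximation introduced here for illustration:

```python
AVOGADRO = 6.022e23   # molecules per mole
DA_PER_BP = 660.0     # average mass of one base pair, in Daltons (g/mol)


def dsdna_mass_pg(length_bp):
    """Mass in picograms of one double-stranded DNA molecule of length_bp."""
    grams = length_bp * DA_PER_BP / AVOGADRO
    return grams * 1e12  # grams to picograms


def copies_from_ng(mass_ng, length_bp):
    """Number of dsDNA molecules of length_bp contained in mass_ng nanograms."""
    grams = mass_ng * 1e-9
    grams_per_molecule = length_bp * DA_PER_BP / AVOGADRO
    return grams / grams_per_molecule
```

With these definitions, `dsdna_mass_pg(6.4e9)` comes out at roughly 7 pg, matching the rule of thumb for one diploid human genome, and `copies_from_ng(1, 1000)` gives about 9.1 × 10⁸ copies of a 1 kb fragment per nanogram.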
6. Tm Calculation for DNA Sequences
- Tm (Melting Temperature) Estimation Using GC Content
For a rough estimate on short oligonucleotides (under about 14 nt), use the Wallace rule:
Tm = 2°C × (A + T) + 4°C × (G + C)
where A, T, G, and C are the counts of the respective bases in the sequence.
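The formula above translates directly into code. This is a minimal sketch with a hypothetical function name; characters other than A, T, G, and C are ignored, which is an assumption rather than part of the rule:

```python
def wallace_tm(seq):
    """Estimate melting temperature in °C as 2 * (A + T) + 4 * (G + C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc
```

For example, `wallace_tm("ATGCATGC")` returns 24, from four A/T bases (8 °C) plus four G/C bases (16 °C).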
7. Bioinformatics Tools and References
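- BLAST (Basic Local Alignment Search Tool)
  - Compares nucleotide or protein query sequences against sequence databases.
- Bowtie/TopHat
  - Align short sequencing reads to a reference genome.
  - Bowtie: fast, memory-efficient aligner for large sets of short reads.
  - TopHat: splice-aware aligner for RNA-Seq reads, built on Bowtie (its authors now recommend HISAT2 as the successor).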
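- SAMtools
  - Reads, writes, and manipulates the SAM/BAM formats for aligned sequencing data.
  - Example command to view a BAM file, including its header (input.bam is a placeholder file name): samtools view -h input.bam
- BEDTools
  - A suite of tools for manipulating genomic intervals.
  - Example: to merge overlapping regions in a position-sorted BED file (input.bed is a placeholder file name): bedtools merge -i input.bed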
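To make the idea of merging overlapping regions concrete, here is a small pure-Python sketch of the operation. It illustrates the concept only and is not how BEDTools is implemented. The function name is hypothetical, and intervals are assumed to be (chromosome, start, end) tuples in BED-style half-open coordinates, so intervals that touch end-to-start are merged as well:

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    # Sorting by chromosome and start places candidates for merging side by side.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last_chrom, last_start, last_end = merged[-1]
            merged[-1] = (last_chrom, last_start, max(last_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```

For example, `merge_intervals([("chr1", 1, 5), ("chr1", 3, 8), ("chr1", 10, 12)])` returns `[("chr1", 1, 8), ("chr1", 10, 12)]`.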
8. Best Practices and Tips
- Version Control: Always use version control (e.g., Git) for your bioinformatics projects.
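- Document Your Code: Add comments to scripts and notebooks so that others (and your future self) can follow and rerun the analysis.
- Reproducibility: Use isolated environments (e.g., Conda environments or Docker containers) to pin software versions so that analyses behave the same across systems.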
- Data Formats: Understand the file formats you are working with (e.g., FASTA, FASTQ, BAM, VCF).
- Efficient Data Processing: Use parallel processing and optimize algorithms for large datasets (e.g., RNA-Seq).
Conclusion
This Bioinformatics Cheat Sheet offers a collection of essential commands, conversions, and tools for bioinformaticians. As you progress in bioinformatics, adapting these tools to your specific research needs and automating repetitive tasks will make your work more efficient. Keep this sheet handy as a quick reference to streamline your bioinformatics workflows.