Bioinformatics Cheatsheet

Bioinformatics Cheat Sheet: A Step-by-Step Guide for Beginners

December 28, 2024 By admin

This cheat sheet is designed to help bioinformaticians, especially beginners, quickly reference key concepts, commands, scripts, and useful tools that are essential in bioinformatics workflows. It includes basic Unix commands, Perl one-liners, Python snippets, and helpful biological concepts that every bioinformatician should be familiar with.


1. Key Biological Concepts and Units

  • IUPAC Ambiguity Codes for Nucleotides
    • A = Adenine
    • C = Cytosine
    • G = Guanine
    • T = Thymine
    • U = Uracil (RNA only)
    • R = A or G (Purines)
    • Y = C or T (Pyrimidines)
    • M = A or C
    • K = G or T
    • S = G or C
    • W = A or T
    • B = C, G, or T
    • D = A, G, or T
    • H = A, C, or T
    • V = A, C, or G
    • N = Any base
  • Amino Acid Single-Letter Codes
    • A = Alanine
    • C = Cysteine
    • D = Aspartic Acid
    • E = Glutamic Acid
    • F = Phenylalanine
    • G = Glycine
    • H = Histidine
    • I = Isoleucine
    • K = Lysine
    • L = Leucine
    • M = Methionine
    • N = Asparagine
    • P = Proline
    • Q = Glutamine
    • R = Arginine
    • S = Serine
    • T = Threonine
    • V = Valine
    • W = Tryptophan
    • Y = Tyrosine
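  • 1 Human Genome (diploid) ≈ 6.5–7 pg of DNA (haploid ≈ 3.3–3.5 pg)
    • This is useful for estimating genome or cell copy numbers from a measured DNA mass.
  • 1 Base Pair (bp) ≈ 660 Daltons (average for double-stranded DNA)
    • This conversion is helpful when calculating the molecular weight of DNA sequences.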
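
The IUPAC ambiguity codes listed above are often used to write degenerate motifs (primers, restriction sites). As a minimal sketch, the snippet below turns such a motif into a regular expression. The names `IUPAC` and `iupac_to_regex` are made up for this example, and U is treated as T, which is an assumption that only suits DNA matching.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for
# (same table as in the list above; U is treated as T here by assumption).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(motif):
    """Translate an IUPAC motif such as 'GARTC' into a regex string."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # raises KeyError for non-IUPAC characters
        # Single bases stay literal; ambiguous codes become a character class
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

pattern = iupac_to_regex("GARTC")
print(pattern)                                # GA[AG]TC
print(bool(re.fullmatch(pattern, "GAATC")))   # True
print(bool(re.fullmatch(pattern, "GACTC")))   # False
```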

2. Unix Commands and One-Liners

  • Convert Unix to Windows Line Endings
    bash
    # GNU sed syntax; assumes the file does not already use \r\n line endings
    sed -i 's/$/\r/' file.txt
  • Sort a BED File by Chromosome and Position
    bash
    sort -k1,1 -k2,2n file.bed > file.sorted.bed
  • Find and Replace in Multiple Files
    bash
    perl -pi -w -e 's/old_value/new_value/g' *.txt
  • Remove Header from a File
    bash
    tail -n +2 file.txt > file_no_header.txt
  • Sum the Values in the First Column of a File
    bash
    awk '{s+=$1} END {print s}' file.txt
  • Print a Specific Line from a File (e.g., Line 83)
    bash
    sed -n '83p' file.txt
  • Insert a Header Line into a File
    bash
    # GNU sed syntax (BSD/macOS sed handles -i and the i command differently)
    sed -i '1i# New header line' file.txt
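
When a BED file is already loaded in Python, the same ordering as the `sort -k1,1 -k2,2n` one-liner above can be sketched as follows. The helper name `sort_bed_lines` is hypothetical, and the sketch assumes tab-separated records with no `track`, `browser`, or `#` header lines.

```python
def sort_bed_lines(lines):
    """Sort BED records by chromosome name (as text), then by numeric start."""
    # Drop blank lines and trailing newlines before sorting
    records = [ln.rstrip("\n") for ln in lines if ln.strip()]

    def key(record):
        fields = record.split("\t")
        return (fields[0], int(fields[1]))  # (chromosome, start position)

    return sorted(records, key=key)

example = [
    "chr2\t500\t600",
    "chr1\t1000\t1100",
    "chr10\t50\t80",
    "chr1\t200\t300",
]
for rec in sort_bed_lines(example):
    print(rec)
```

As with the shell command, chromosome names are compared as text, so chr10 sorts before chr2.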

3. Perl One-Liners

  • Remove Windows Line Breaks (convert to Unix format)
    bash
    perl -pi -e 's/\r\n/\n/g' file.txt
  • Replace String in Multiple Files
    bash
    perl -pi -e 's/old_string/new_string/g' *.txt
  • Print Lines Matching a Pattern (grep-like)
    bash
    perl -ne 'print if /pattern/' file.txt
  • Count the Number of Lines in a File
    bash
    perl -lne 'END { print $. }' file.txt

4. Python Scripts for Bioinformatics

  • Calculate GC Content of a DNA Sequence
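    python
    def gc_content(sequence):
        sequence = sequence.upper()  # also count lowercase (soft-masked) bases
        if not sequence:
            return 0.0  # avoid dividing by zero on an empty sequence
        gc_count = sequence.count('G') + sequence.count('C')
        return (gc_count / len(sequence)) * 100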
  • Read and Parse a FASTA File
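    python
    def read_fasta(file):
        """Parse a FASTA file into a dict mapping header -> sequence."""
        sequences = {}
        seq_name = ''
        seq_data = ''
        with open(file, 'r') as f:
            for line in f:
                if line.startswith('>'):
                    # Store the previous record before starting a new one
                    if seq_name:
                        sequences[seq_name] = seq_data
                    seq_name = line[1:].strip()
                    seq_data = ''
                else:
                    seq_data += line.strip()
        # Store the last record (skipped if the file contained no headers)
        if seq_name:
            sequences[seq_name] = seq_data
        return sequences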
  • Basic Sequence Alignment Using Biopython
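    python
    # Note: Bio.pairwise2 is deprecated in recent Biopython releases;
    # Bio.Align.PairwiseAligner is the suggested replacement.
    from Bio import pairwise2
    from Bio.pairwise2 import format_alignment

    seq1 = "AGCTGTA"
    seq2 = "AGCTGCA"
    # globalxx: global alignment, match = 1, no mismatch or gap penalties
    alignments = pairwise2.align.globalxx(seq1, seq2)
    for alignment in alignments:
        print(format_alignment(*alignment))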
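
To see what the `globalxx` score means without installing Biopython, here is a minimal dynamic-programming sketch under the same scoring scheme (match = 1, mismatch = 0, gaps free). With these settings the best score equals the length of the longest common subsequence. The function name `globalxx_score` is made up for this example, and it returns only the score, not the alignments.

```python
def globalxx_score(a, b):
    """Best global alignment score with match = 1, mismatch = 0, gaps = 0."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP table
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (1 if ch_a == ch_b else 0)  # align ch_a with ch_b
            # Other options: gap in b (prev[j]) or gap in a (curr[j - 1])
            curr.append(max(diag, prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

print(globalxx_score("AGCTGTA", "AGCTGCA"))  # 6
```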


5. Handy Biological Conversion Factors

  • Molecular Weights
    • 1 mole of DNA = 6.022 × 10²³ molecules (Avogadro’s number)
    • 1 bp ≈ 660 Daltons (average molecular weight of a base pair, i.e. about 660 g/mol per bp)
  • Volume Conversions
    • 1 mm³ = 1 µL
    • 1 µm³ = 1 fL
    • 1 liter = 1000 mL
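
The molecular-weight factors above combine into a quick mass-to-copy-number calculation. This is a rough sketch: the function names are hypothetical, and the haploid genome length of 3.1 × 10⁹ bp used in the example is an approximate figure assumed here, not taken from this sheet.

```python
AVOGADRO = 6.022e23      # molecules per mole
DALTONS_PER_BP = 660.0   # average mass of one double-stranded base pair (g/mol)

def dsdna_mass_pg(length_bp):
    """Approximate mass in picograms of ONE double-stranded DNA molecule."""
    grams = length_bp * DALTONS_PER_BP / AVOGADRO
    return grams * 1e12  # g -> pg

def copies_in_mass(mass_ng, length_bp):
    """Approximate number of molecules of a given length in mass_ng of dsDNA."""
    grams = mass_ng * 1e-9  # ng -> g
    return grams * AVOGADRO / (length_bp * DALTONS_PER_BP)

# Assumed haploid human genome length of about 3.1e9 bp: roughly 3.4 pg
print(round(dsdna_mass_pg(3.1e9), 2))
# 1 ng of a 3,000 bp plasmid: roughly 3e8 copies
print(f"{copies_in_mass(1, 3000):.2e}")
```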

6. Tm Calculation for DNA Sequences

  • Tm (Melting Temperature) Estimation from Base Counts (Wallace Rule)
    For a rough estimate on short oligonucleotides (about 14 nt or fewer), use the formula:
    Tm = 2°C × (A + T) + 4°C × (G + C)
    Where A, T, G, and C are the counts of the respective bases in the sequence.
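
The formula above is easy to apply in code. Here is a minimal sketch; the function name `wallace_tm` is made up for this example, and any character other than A, C, G, or T is simply ignored.

```python
def wallace_tm(sequence):
    """Rough Tm in °C: 2 × (A + T) + 4 × (G + C)."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

# AGCTGTA has 4 A/T bases and 3 G/C bases: 2*4 + 4*3 = 20
print(wallace_tm("AGCTGTA"))  # 20
```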

7. Bioinformatics Tools and References


8. Best Practices and Tips


Conclusion

This Bioinformatics Cheat Sheet offers a collection of essential commands, conversions, and tools for bioinformaticians. As you progress in bioinformatics, adapting these tools to your specific research needs and automating repetitive tasks will make your work more efficient. Keep this sheet handy as a quick reference to streamline your bioinformatics workflows.
