python-bioinformatics-basics

Mastering Bioinformatics with Biopython: A Comprehensive Guide

March 24, 2024 Off By admin
Shares

Prerequisites:

Course Outcome:

Table of Contents

Introduction to Biopython

Overview of Biopython and its capabilities in bioinformatics

Biopython is a popular open-source Python package designed for bioinformatics analysis. It provides tools for working with biological data, including sequence analysis, molecular structures, and phylogenetics. Here’s an overview of Biopython and its capabilities:

  1. Sequence handling: Biopython can parse and manipulate biological sequences, including DNA, RNA, and protein sequences. It provides classes and functions for reading, writing, and analyzing sequences.
  2. Sequence alignment: Biopython supports sequence alignment algorithms, such as pairwise and multiple sequence alignment. It can perform alignments using popular algorithms like BLAST and ClustalW.
  3. Sequence motif analysis: Biopython includes tools for identifying sequence motifs, such as regular expressions and position-specific scoring matrices (PSSMs).
  4. File format support: Biopython can read and write various file formats commonly used in bioinformatics, including FASTA, GenBank, and PDB (Protein Data Bank) formats.
  5. Biological databases: Biopython provides interfaces to access biological databases, such as NCBI (National Center for Biotechnology Information) and UniProt, allowing users to search and retrieve biological data.
  6. Phylogenetics: Biopython supports phylogenetic analysis, including tree construction, manipulation, and visualization.
  7. Structure analysis: Biopython can parse and analyze molecular structure files, such as those in PDB format. It provides tools for calculating properties of molecular structures and visualizing them.
  8. Population genetics: Biopython includes modules for analyzing population genetic data, such as calculating allele frequencies and Hardy-Weinberg equilibrium.
  9. Machine learning integration: Biopython can be integrated with machine learning libraries like scikit-learn for advanced bioinformatics analyses.

Biopython is widely used in bioinformatics research and provides a comprehensive set of tools for analyzing biological data. Its flexibility and ease of use make it a valuable tool for both beginners and experienced bioinformaticians.

Installing Biopython and setting up the development environment

To install Biopython and set up your development environment, follow these steps:

  1. Install Python: If you don’t already have Python installed, download and install it from the official Python website (https://www.python.org/). Biopython requires Python 3.6 or later.
  2. Install Biopython: You can install Biopython using pip, the Python package installer. Open a terminal or command prompt and run the following command:
    pip install biopython

    This will download and install the Biopython package and its dependencies.

  3. Set up a development environment: It’s recommended to use a virtual environment to manage your Python dependencies and isolate them from other projects. You can create a virtual environment using the following commands:

    On Windows:

    python -m venv myenv
    myenv\Scripts\activate

    On macOS and Linux:

    bash
    python3 -m venv myenv
    source myenv/bin/activate

    Replace myenv with the name you want to give to your virtual environment.

  4. Verify installation: After setting up your virtual environment, you can verify that Biopython is installed correctly by running a Python shell and importing Biopython:
    python
    python
    >>> import Bio
    >>> Bio.__version__

    This should print the version of Biopython you have installed.

  5. Start coding: You can now start using Biopython in your Python scripts. Import the Biopython modules you need and start writing your bioinformatics analyses.

Remember to deactivate your virtual environment when you’re done working on your project:

deactivate

This will return you to your regular system environment.

Working with Biological Sequences

Reading and writing sequence files in various formats (FASTA, GenBank, etc.)

Biopython provides functions for reading and writing sequence files in various formats, including FASTA, GenBank, and others. Here’s how you can read and write sequence files in different formats using Biopython:

  1. Reading sequence files:
    python
    from Bio import SeqIO

    # Read a FASTA file
    fasta_file = "example.fasta"
    for record in SeqIO.parse(fasta_file, "fasta"):
    print(record.id)
    print(record.seq)

    # Read a GenBank file
    genbank_file = "example.gbk"
    for record in SeqIO.parse(genbank_file, "genbank"):
    print(record.id)
    print(record.seq)

    The SeqIO.parse() function reads the sequence file and returns a SeqRecord object for each entry in the file. You can access the ID (record.id) and sequence (record.seq) of each entry.

  2. Writing sequence files:
    python
    from Bio import SeqIO
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    # Create a SeqRecord object
    sequence = Seq("ATCGATCGATCG")
    record = SeqRecord(sequence, id="seq1", description="Example sequence")

    # Write to a FASTA file
    fasta_outfile = "output.fasta"
    with open(fasta_outfile, "w") as output_handle:
    SeqIO.write(record, output_handle, "fasta")

    # Write to a GenBank file
    genbank_outfile = "output.gbk"
    with open(genbank_outfile, "w") as output_handle:
    SeqIO.write(record, output_handle, "genbank")

    The SeqRecord class is used to create a record with a sequence (sequence), an ID (id), and a description (description). The SeqIO.write() function writes the record to a file in the specified format ("fasta" or "genbank").

These examples demonstrate how to read and write sequence files in various formats using Biopython. You can use similar approaches for other supported formats by changing the format strings in the SeqIO.parse() and SeqIO.write() functions.

Manipulating sequences (e.g., slicing, concatenation, reverse complement)

Biopython provides functions for manipulating sequences, such as slicing, concatenation, and reverse complementation. Here’s how you can perform these operations:

  1. Slicing sequences:
    python
    from Bio.Seq import Seq

    # Create a sequence
    sequence = Seq("ATCGATCGATCG")

    # Slice the sequence
    print(sequence[2:6]) # Output: 'CGAT'

  2. Concatenating sequences:
    python
    from Bio.Seq import Seq

    # Create two sequences
    seq1 = Seq("ATCG")
    seq2 = Seq("GATC")

    # Concatenate the sequences
    concatenated_seq = seq1 + seq2
    print(concatenated_seq) # Output: 'ATCGGATC'

  3. Reverse complementing sequences:
    python
    from Bio.Seq import Seq

    # Create a sequence
    sequence = Seq("ATCGATCGATCG")

    # Reverse complement the sequence
    reverse_complement = sequence.reverse_complement()
    print(reverse_complement) # Output: 'CGATCGATCGAT'

These examples demonstrate how to manipulate sequences using Biopython. Biopython provides many other functions for working with sequences, such as translating sequences, finding motifs, and calculating molecular weights. Refer to the Biopython documentation for more information: https://biopython.org/docs/

Sequence Alignment

Performing pairwise and multiple sequence alignments

Biopython provides modules for performing pairwise and multiple sequence alignments using popular algorithms such as Needleman-Wunsch (for global alignment), Smith-Waterman (for local alignment), and ClustalW (for multiple sequence alignment). Here’s how you can perform these alignments:

  1. Pairwise sequence alignment:
    python
    from Bio import pairwise2
    from Bio.Seq import Seq

    # Create two sequences
    seq1 = Seq("ACGT")
    seq2 = Seq("AGT")

    # Perform pairwise alignment
    alignments = pairwise2.align.globalxx(seq1, seq2)

    # Print the alignments
    for alignment in alignments:
    print(pairwise2.format_alignment(*alignment))

    This example performs a global alignment of two sequences (seq1 and seq2) using the globalxx function (which uses a simple match/mismatch scoring scheme). The format_alignment function is used to print the alignments.

  2. Multiple sequence alignment:
    python
    from Bio import AlignIO
    from Bio.Align import MultipleSeqAlignment
    from Bio.SeqRecord import SeqRecord

    # Create a list of sequences
    sequences = [
    SeqRecord(Seq("ATCG"), id="seq1"),
    SeqRecord(Seq("AGT"), id="seq2"),
    SeqRecord(Seq("ACGT"), id="seq3")
    ]

    # Perform multiple sequence alignment
    alignment = MultipleSeqAlignment(sequences)
    print(alignment)

    # Write the alignment to a file
    with open("output.fasta", "w") as output_handle:
    AlignIO.write(alignment, output_handle, "fasta")

    This example creates a list of sequences and performs a multiple sequence alignment using the MultipleSeqAlignment class. The AlignIO.write function is used to write the alignment to a file in FASTA format.

Biopython provides more advanced options for sequence alignment, such as specifying custom scoring matrices, gap penalties, and alignment algorithms. Refer to the Biopython documentation for more information: https://biopython.org/docs/

Visualizing sequence alignments

Visualizing sequence alignments can help you understand the similarities and differences between sequences. Biopython provides functionality for visualizing alignments using the Bio.pairwise2 and Bio.Align modules. Here’s how you can visualize sequence alignments:

  1. Install Biopython: If you haven’t already installed Biopython, you can do so using pip:
    pip install biopython
  2. Perform the alignment: First, perform the sequence alignment using the pairwise2 module. Here’s an example of pairwise sequence alignment:
    python
    from Bio import pairwise2
    from Bio.Seq import Seq

    # Create two sequences
    seq1 = Seq("ACGT")
    seq2 = Seq("AGT")

    # Perform pairwise alignment
    alignments = pairwise2.align.globalxx(seq1, seq2)

    # Get the first alignment
    alignment = alignments[0]

    # Print the alignment
    print(pairwise2.format_alignment(*alignment))

  3. Visualize the alignment: Biopython doesn’t have built-in functions for graphical visualization of alignments. However, you can use external tools such as Jalview (http://www.jalview.org/) or BioEdit (http://bioedit.software.informer.com/) to visualize the alignment.

    Alternatively, you can create a simple text-based visualization of the alignment by printing the aligned sequences:

    python
    aligned_seq1, _, aligned_seq2, _ = alignment
    for i in range(0, len(aligned_seq1), 60):
    print(aligned_seq1[i:i+60])
    print(aligned_seq2[i:i+60])
    print()

    This will print the aligned sequences in blocks of 60 characters, showing the matching and mismatching positions.

While Biopython doesn’t provide graphical visualization tools for alignments, it provides a foundation for aligning sequences and extracting alignment information that can be used with other tools for visualization.

Working with Sequence Annotations

Extracting and manipulating annotation information from sequence files

To extract and manipulate annotation information from sequence files, such as GenBank files, you can use Biopython’s SeqIO module. Here’s an example of how to extract and manipulate annotation information:

  1. Extracting annotation information:
    python
    from Bio import SeqIO

    # Open a GenBank file
    genbank_file = "example.gbk"

    # Parse the GenBank file and extract annotation information
    for record in SeqIO.parse(genbank_file, "genbank"):
    print("ID:", record.id)
    print("Description:", record.description)
    print("Annotations:")
    for key, value in record.annotations.items():
    print(f" {key}: {value}")

    This example opens a GenBank file and extracts the sequence ID, description, and other annotations associated with the sequence.

  2. Manipulating annotation information:

    You can manipulate annotation information by modifying the annotations dictionary of the SeqRecord object:

    python
    from Bio import SeqIO

    # Open a GenBank file
    genbank_file = "example.gbk"

    # Parse the GenBank file and manipulate annotation information
    for record in SeqIO.parse(genbank_file, "genbank"):
    # Modify the description
    record.description = "Modified description"
    # Add a new annotation
    record.annotations["organism"] = "Homo sapiens"
    # Print the modified record
    print(record.format("genbank"))

    This example modifies the description and adds a new annotation (“organism”) to the SeqRecord object, then prints the modified record in GenBank format.

Biopython’s SeqIO module provides a convenient way to extract and manipulate annotation information from sequence files. You can use this functionality to extract specific annotations, modify them, or add new annotations as needed.

Annotating sequences and visualizing annotations

To annotate sequences and visualize the annotations, you can use Biopython to add annotations to a SeqRecord object and then visualize the annotations using external tools or by creating custom visualizations. Here’s an example of how to annotate sequences and visualize the annotations:

  1. Annotating sequences:
    python
    from Bio import SeqIO
    from Bio.SeqFeature import SeqFeature, FeatureLocation

    # Open a GenBank file
    genbank_file = "example.gbk"

    # Parse the GenBank file and annotate the sequence
    for record in SeqIO.parse(genbank_file, "genbank"):
    # Add a new feature to the sequence
    feature_location = FeatureLocation(10, 50) # Example location
    feature = SeqFeature(feature_location, type="CDS", qualifiers={"gene": "example_gene"})
    record.features.append(feature)

    # Print the annotated record
    print(record.format("genbank"))

    This example adds a new feature (CDS – coding sequence) to the SeqRecord object with a location from 10 to 50 and a gene qualifier “example_gene”.

  2. Visualizing annotations:

    Biopython does not provide built-in functions for visualizing annotations. However, you can use external tools such as GenomeDiagram (https://biopython.org/docs/1.75/api/Bio.Graphics.GenomeDiagram.html) or create custom visualizations using matplotlib or other plotting libraries.

    Here’s an example of how you can create a simple visualization of annotations using matplotlib:

    python
    import matplotlib.pyplot as plt

    # Define the sequence length
    seq_length = len(record.seq)

    # Create a plot
    plt.figure(figsize=(10, 2))
    plt.xlim(0, seq_length)
    plt.ylim(0, 1)
    plt.yticks([])

    # Add annotations
    for feature in record.features:
    if feature.type == "CDS":
    start, end = feature.location.start, feature.location.end
    plt.plot([start, end], [0.5, 0.5], linewidth=5, color="blue", solid_capstyle="butt")

    # Show the plot
    plt.show()

    This example creates a plot where each blue line represents a coding sequence (CDS) annotation. You can customize the visualization to include different types of features or annotations as needed.

Biopython provides functionality for annotating sequences and extracting annotation information, but visualizing annotations typically requires using external tools or custom visualization code.

Phylogenetics with Biopython

Building phylogenetic trees from sequence data

To build phylogenetic trees from sequence data, you can use Biopython’s Phylo module, which provides tools for performing phylogenetic analyses. Here’s an example of how to build a phylogenetic tree from sequence data:

  1. Prepare the sequence data:
    python
    from Bio import SeqIO

    # Open a FASTA file containing sequences
    fasta_file = "sequences.fasta"

    # Parse the FASTA file and extract sequences
    sequences = []
    for record in SeqIO.parse(fasta_file, "fasta"):
    sequences.append(record.seq)

  2. Perform multiple sequence alignment (optional):

    Before building a phylogenetic tree, it’s often necessary to perform multiple sequence alignment to align the sequences. Biopython provides tools for this, such as AlignIO and MultipleSeqAlignment. Here’s an example:

    python
    from Bio.Align import MultipleSeqAlignment
    from Bio.Align import AlignIO
    from Bio import Align

    aligner = Align.PairwiseAligner()
    aligner.mode = 'local'
    alignments = aligner.align(sequences)

    alignment = MultipleSeqAlignment([s for s in alignments])

  3. Build the phylogenetic tree:
    python
    from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
    from Bio import Phylo

    # Calculate the distance matrix
    calculator = DistanceCalculator('identity')
    distance_matrix = calculator.get_distance(alignment)

    # Build the tree
    constructor = DistanceTreeConstructor()
    tree = constructor.upgma(distance_matrix)

    # Draw the tree
    Phylo.draw(tree)

    This example uses the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm to construct the tree. You can use other algorithms supported by Biopython, such as Neighbor-Joining or Maximum Likelihood, by specifying the appropriate constructor (DistanceTreeConstructor).

  4. Visualize the phylogenetic tree:

    Biopython provides basic tree visualization capabilities using Phylo.draw(). However, for more advanced visualizations, you may want to export the tree in Newick format and use external tools such as FigTree or iTOL (Interactive Tree of Life) for visualization.

    python
    # Export the tree in Newick format
    Phylo.write(tree, "tree.nwk", "newick")

    You can then open the exported Newick file in a tree visualization tool for further analysis and visualization.

This is a basic example of how to build a phylogenetic tree from sequence data using Biopython. Depending on your data and analysis requirements, you may need to customize the process or use additional tools and algorithms.

Visualizing and analyzing phylogenetic trees

To visualize and analyze phylogenetic trees, you can use Biopython’s Phylo module for basic visualization and external tools or libraries for more advanced analyses. Here’s how you can visualize and analyze a phylogenetic tree using Biopython and additional tools:

  1. Visualizing phylogenetic trees:

    Biopython’s Phylo.draw() function provides a basic way to visualize phylogenetic trees. Here’s an example:

    python
    from Bio import Phylo

    # Read the tree from a Newick file
    tree = Phylo.read("tree.nwk", "newick")

    # Basic tree visualization
    Phylo.draw(tree)

    For more advanced visualizations, you can export the tree in Newick format and use external tools like FigTree, iTOL, or ETE Toolkit for interactive and customizable visualizations.

  2. Annotating phylogenetic trees:

    You can annotate phylogenetic trees with additional information, such as branch lengths, support values, or metadata associated with the sequences. Here’s an example:

    python
    # Add branch lengths to the tree
    for clade in tree.find_clades():
    if clade.branch_length is None:
    clade.branch_length = 0.1 # Example branch length

    # Annotate tree with support values (bootstrap values)
    for clade in tree.find_clades():
    if clade.confidence is not None:
    clade.name = f"{clade.name}:{clade.confidence}%"

    # Visualize the annotated tree
    Phylo.draw(tree)

  3. Analyzing phylogenetic trees:

    Biopython’s Phylo module provides functions for basic tree analysis, such as calculating the tree’s root, finding common ancestors, and traversing the tree. For more advanced analyses, you can use external tools or libraries like ete3 for tree manipulation and analysis.

    python
    # Get the root of the tree
    root = tree.root

    # Find common ancestors
    common_ancestor = tree.common_ancestor(["A", "B"])

    # Traverse the tree
    for clade in tree.find_clades():
    print(clade)

    Additionally, you can use tools like ete3 for advanced tree analysis, including subtree extraction, branch length manipulation, and visualization.

Biopython provides basic functionalities for visualizing and analyzing phylogenetic trees. For more advanced analyses and visualizations, consider using external tools or libraries that offer more specialized features and customization options.

Structural Bioinformatics

Parsing and analyzing protein structure files (PDB format)

To parse and analyze protein structure files in PDB format, you can use Biopython’s Bio.PDB module, which provides tools for working with protein structures. Here’s an example of how to parse and analyze a protein structure file in PDB format:

  1. Parse a PDB file:
    python
    from Bio.PDB import PDBParser

    # Create a PDB parser object
    parser = PDBParser()

    # Parse the PDB file
    structure = parser.get_structure("1abc", "1abc.pdb")

    Replace "1abc" with the structure ID and "1abc.pdb" with the path to your PDB file.

  2. Accessing structure information:
    python
    # Access information about the structure
    print("Structure ID:", structure.id)
    print("Number of models:", len(structure))
  3. Accessing chain and residue information:
    python
    # Iterate over chains in the structure
    for chain in structure[0]:
    # Access information about the chain
    print("Chain ID:", chain.id)
    print("Number of residues:", len(chain))

    # Iterate over residues in the chain
    for residue in chain:
    # Access information about the residue
    print("Residue ID:", residue.id)
    print("Residue name:", residue.resname)

  4. Calculating protein properties:

    Biopython’s Bio.PDB module provides functions for calculating various properties of protein structures, such as the distance between atoms, the angle between atoms, and the RMSD (Root Mean Square Deviation) between structures. Here’s an example:

    python
    from Bio.PDB import calc_angle, calc_distance

    # Calculate the distance between two atoms
    atom1 = structure[0]['A'][10]['CA'] # Example atom
    atom2 = structure[0]['A'][20]['CA'] # Example atom
    distance = calc_distance(atom1, atom2)

    # Calculate the angle between three atoms
    atom3 = structure[0]['A'][30]['CA'] # Example atom
    angle = calc_angle(atom1, atom2, atom3)

  5. Visualizing the protein structure:

    Biopython’s Bio.PDB module does not provide built-in visualization tools for protein structures. However, you can export the structure in PDB format and visualize it using external tools like PyMOL or VMD.

    python
    # Export the structure in PDB format
    io = PDBIO()
    io.set_structure(structure)
    io.save("output.pdb")

    You can then open the exported PDB file in a molecular visualization tool for visualization and analysis.

Biopython’s Bio.PDB module provides a powerful and flexible framework for parsing, analyzing, and manipulating protein structure files in PDB format.

Visualizing protein structures

To visualize protein structures, you can use external tools like PyMOL, Chimera, or VMD. These tools provide interactive 3D visualization capabilities for protein structures. Here’s a general approach to visualizing protein structures using PyMOL as an example:

  1. Export the protein structure in PDB format:

    Before you can visualize a protein structure, you need to export it in PDB format. If you have already parsed the structure using Biopython, you can export it as follows:

    python
    from Bio.PDB import PDBIO

    # Export the structure in PDB format
    io = PDBIO()
    io.set_structure(structure)
    io.save("output.pdb")

  2. Open PyMOL:

    Start PyMOL on your computer. You can download PyMOL from the official website (https://pymol.org/).

  3. Load the protein structure:

    In PyMOL, use the “File” menu to open the exported PDB file containing your protein structure.

  4. Visualize the protein structure:

    Once the structure is loaded, you can use PyMOL’s graphical interface to manipulate and visualize the protein structure. You can rotate, zoom, and pan the structure to view it from different angles. PyMOL also provides tools for coloring the structure, highlighting specific residues, and measuring distances between atoms.

  5. Save the visualization:

    After you have customized the visualization to your liking, you can save the image or session for future reference. PyMOL allows you to save images in various formats and save sessions to resume your work later.

PyMOL provides a range of advanced features for visualizing and analyzing protein structures, making it a powerful tool for structural biology research. Other tools like Chimera and VMD offer similar functionalities and can also be used for visualizing protein structures.

Biological Databases and Data Retrieval

Accessing biological databases (e.g., NCBI, UniProt) using Biopython

To access biological databases such as NCBI and UniProt using Biopython, you can use the Bio.Entrez and Bio.SeqIO modules. These modules provide functions for searching and retrieving data from these databases. Here’s an example of how to access NCBI and UniProt databases using Biopython:

  1. Accessing NCBI databases:
    python
    from Bio import Entrez

    # Provide your email address to NCBI
    Entrez.email = "your_email@example.com"

    # Search NCBI nucleotide database
    search_term = "CYP2D6[Gene Name] AND Homo sapiens[Organism]"
    handle = Entrez.esearch(db="nucleotide", term=search_term, retmax=10)
    record = Entrez.read(handle)
    handle.close()

    # Get the list of matching IDs
    id_list = record["IdList"]

    # Fetch sequences for the matching IDs
    for uid in id_list:
    handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
    sequence = handle.read()
    handle.close()
    print(sequence)

    This example searches the NCBI nucleotide database for sequences of the CYP2D6 gene in Homo sapiens and retrieves the first 10 matching sequences in FASTA format.

  2. Accessing UniProt database:
    python
    from Bio import ExPASy
    from Bio import SeqIO

    # Access a specific UniProt entry
    uniprot_id = "P68871"
    handle = ExPASy.get_sprot_raw(uniprot_id)
    record = SeqIO.read(handle, "swiss")
    handle.close()

    # Accessing specific information from the record
    print("ID:", record.id)
    print("Description:", record.description)
    print("Sequence:", str(record.seq))

    This example accesses a specific entry (P68871) from the UniProt database and prints its ID, description, and sequence.

Biopython provides a convenient way to access and retrieve data from biological databases like NCBI and UniProt. It’s important to provide your email address to NCBI (using Entrez.email) to comply with their usage policies.

Retrieving and parsing data from databases

To retrieve and parse data from biological databases using Biopython, you can use the Bio.Entrez module for NCBI databases and various parsers such as Bio.SeqIO for sequence data. Here’s a general example of how to retrieve and parse data from NCBI’s nucleotide database:

  1. Retrieve data from NCBI:
    python
    from Bio import Entrez

    # Provide your email address to NCBI
    Entrez.email = "your_email@example.com"

    # Search NCBI nucleotide database
    search_term = "CYP2D6[Gene Name] AND Homo sapiens[Organism]"
    handle = Entrez.esearch(db="nucleotide", term=search_term, retmax=10)
    record = Entrez.read(handle)
    handle.close()

    # Get the list of matching IDs
    id_list = record["IdList"]

    # Fetch sequences for the matching IDs
    for uid in id_list:
    handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
    sequence = handle.read()
    handle.close()
    print(sequence)

    This example searches the NCBI nucleotide database for sequences of the CYP2D6 gene in Homo sapiens and retrieves the first 10 matching sequences in FASTA format.

  2. Parse the retrieved data:

    Once you have retrieved the data, you can parse it using the appropriate parser. For example, to parse the retrieved sequences as FASTA records:

    python
    from Bio import SeqIO

    for uid in id_list:
    handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
    records = SeqIO.parse(handle, "fasta")
    for record in records:
    print("ID:", record.id)
    print("Description:", record.description)
    print("Sequence:", str(record.seq))
    handle.close()

    This code uses SeqIO.parse() to parse the retrieved data as FASTA records and then iterates over each record to access its ID, description, and sequence.

Biopython provides similar functionality for other databases and data formats. You can use Bio.SeqIO to parse sequence data in various formats (FASTA, GenBank, etc.) and Bio.PDB to parse protein structure data in PDB format, among other capabilities.

Data Analysis and Visualization

Performing basic data analysis on biological data using Biopython

Biopython provides several tools for performing basic data analysis on biological data. Here’s a general overview of some common tasks you can perform using Biopython:

  1. Sequence analysis:
    python
    from Bio.Seq import Seq
    from Bio.SeqUtils import GC

    # Create a DNA sequence
    dna_seq = Seq("ATCGATCGATCG")

    # Count nucleotides
    print("Number of A's:", dna_seq.count("A"))

    # Calculate GC content
    print("GC content:", GC(dna_seq))

    # Translate DNA to protein sequence
    protein_seq = dna_seq.translate()
    print("Protein sequence:", protein_seq)

  2. Sequence alignment:
    • Performing pairwise and multiple sequence alignments.
    python
    from Bio import pairwise2, SeqIO
    from Bio.pairwise2 import format_alignment

    # Read sequences from a file
    seq1 = SeqIO.read("seq1.fasta", "fasta").seq
    seq2 = SeqIO.read("seq2.fasta", "fasta").seq

    # Perform pairwise alignment
    alignments = pairwise2.align.globalxx(seq1, seq2)

    # Print the alignments
    for alignment in alignments:
    print(format_alignment(*alignment))

  3. Phylogenetic analysis:
    • Building phylogenetic trees from sequence data.
    python
    from Bio import Phylo, AlignIO
    from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

    # Read sequences from a file
    alignment = AlignIO.read("alignment.fasta", "fasta")

    # Calculate the distance matrix
    calculator = DistanceCalculator('identity')
    distance_matrix = calculator.get_distance(alignment)

    # Build the tree
    constructor = DistanceTreeConstructor()
    tree = constructor.upgma(distance_matrix)

    # Draw the tree
    Phylo.draw(tree)

  4. Accessing and parsing biological databases:
    • Retrieving and parsing data from databases like NCBI or UniProt.
    python
    from Bio import Entrez, SeqIO

    # Provide your email address to NCBI
    Entrez.email = "your_email@example.com"

    # Search NCBI nucleotide database
    search_term = "CYP2D6[Gene Name] AND Homo sapiens[Organism]"
    handle = Entrez.esearch(db="nucleotide", term=search_term, retmax=10)
    record = Entrez.read(handle)
    handle.close()

    # Get the list of matching IDs
    id_list = record["IdList"]

    # Fetch sequences for the matching IDs
    for uid in id_list:
    handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
    record = SeqIO.read(handle, "fasta")
    print("ID:", record.id)
    print("Sequence:", str(record.seq))
    handle.close()

These examples cover some of the basic data analysis tasks you can perform using Biopython. Biopython provides extensive documentation and tutorials for more advanced analyses and specific tasks.

Visualizing biological data using matplotlib and other libraries

Visualizing biological data using libraries like matplotlib can be incredibly useful for gaining insights and communicating findings. Here are some examples of how you can visualize different types of biological data:

  1. Sequence composition:
    • Visualize the nucleotide or amino acid composition of a sequence using a bar plot.
    python
    import matplotlib.pyplot as plt
    from Bio.Seq import Seq
    from Bio.SeqUtils import nt_distribution

    # Create a DNA sequence
    seq = Seq("ATCGATCGATCG")

    # Calculate nucleotide distribution
    nt_dist = nt_distribution(str(seq))

    # Plot the distribution
    plt.bar(nt_dist.keys(), nt_dist.values())
    plt.xlabel('Nucleotide')
    plt.ylabel('Count')
    plt.title('Nucleotide Composition')
    plt.show()

  2. GC content:
    • Visualize the GC content of sequences using a histogram.
    python
    import matplotlib.pyplot as plt
    from Bio.SeqUtils import GC

    # Calculate GC content for multiple sequences
    gc_contents = [GC(seq) for seq in sequences]

    # Plot the GC content distribution
    plt.hist(gc_contents, bins=20)
    plt.xlabel('GC Content (%)')
    plt.ylabel('Count')
    plt.title('GC Content Distribution')
    plt.show()

  3. Sequence alignment:
    • Visualize sequence alignments using a heatmap.
    python
    import seaborn as sns
    import matplotlib.pyplot as plt
    from Bio import AlignIO

    # Read the alignment from a file
    alignment = AlignIO.read("alignment.fasta", "fasta")

    # Create a heatmap
    sns.heatmap(alignment, cmap="viridis")
    plt.xlabel('Position')
    plt.ylabel('Sequence')
    plt.title('Sequence Alignment')
    plt.show()

  4. Phylogenetic tree:
    • Visualize a phylogenetic tree using a dendrogram.
    python
    from Bio import Phylo
    import matplotlib.pyplot as plt

    # Read the tree from a Newick file
    tree = Phylo.read("tree.nwk", "newick")

    # Plot the tree
    Phylo.draw(tree)
    plt.show()

  5. Protein structure:
    • Visualize protein structures using external tools like PyMOL or Chimera. Alternatively, you can use libraries like nglview for interactive visualization in Jupyter notebooks.
    python
    import nglview as nv

    # Load a protein structure file
    view = nv.show_pdbid("1crn")

    # Display the structure
    view

These examples demonstrate how you can use matplotlib and other libraries to visualize different types of biological data. Visualizations can be customized further to suit specific requirements and improve readability.

Biopython in Practical Applications

Biopython is the largest and most common bioinformatics package for Python. It contains a number of different sub modules for common bioinformatics tasks.

It is freely available here: http://biopython.org/wiki/Biopython

while the entire source code is hosted here: https://github.com/biopython/biopython

The Biopython code is distributed under the Biopython License.

Biopython sports several features. It provides:

Data structures for biological sequences and features thereof, as well as a multitude of manipulation functions for performing common tasks, such as translation, transcription, weight computations, …

Functions for loading biological data into -like data structures.

dict

Supported formats include, for instance: FASTA, PDB, GenBank, Blast, SCOP, PubMed/Medline, ExPASy-related formats

Functions for data analysis, including basic data mining and machine learning algorithm: k– Nearest Neighbors, Naive Bayes, and Support Vector Machines

Access to online services and database, including NCBI services (Blast, Entrez, PubMed) and ExPASY services (SwissProt, Prosite)

Access to local services, including Blast, Clustalw, EMBOSS.

In order to use local services (such as Blast), the latter must be properly installed and configured on the local machine.

Setting up the local services is beyond the scope of this course.

In order to install Biopython:

On a Debian-based distributions (for instance Ubuntu), just type:

$ sudo apt-get install python-biopython

On a generic GNU/Linux distribution, just type:

$ pip install biopython –user

Please refer to the official documentation for further instructions.

While Biopython is the main player in the field, it is not the only one. Other interesting packages are:

ETE and DendroPy, dedicated to computation and visualization of phylogenetic trees.

CSB for dealing with sequences and structures, computing alignments and profiles (with profile HMMs), and Monte Carlo sampling.

biskit is designed for structural bioinformatics tasks, such as loading 3D protein structures in Protein Data Bank (PDB) format, homology modelling, molecular dynamics simulations (using external simulators) and trajectory computations, and docking, among others.

You are free to mix and match all the packages you need.

Biopython: Sequences

Sequences lay at the core of bioinformatics: although DNA, RNA and proteins are molecules with specific structures and dynamical behaviors, their basic building block is their sequence.

Bio.Seq

Biopython encodes sequences using objects of type module. Take a look at their manual:

Seq

, provided by the

sub-

from Bio.Seq import Seq help(Seq)

The official documentation of the Biopython class can be found on the Biopython wiki.

Seq

Each object is associated to a formal alphabet.

Seq

Each sequence has its own alphabet

Conversions between different sequence types (transcription, translation) can be seen as a change of alphabet.

Some operations (e.g. concatenation) can not be done between sequences with incompatible alphabets.

The various alphabets supported by Biopython can be found in

Bio.Alphabet

.

Bio.Alphabet.IUPAC

The most basic alphabets are:

, e.g.

: a sequence of DNA

IUPAC.unambiguous_dna

: a sequence of RNA

IUPAC.unambiguous_rna

: a protein sequence

IUPAC.protein

Example. To create a generic sequence, i.e. a sequence not bound to any specific alphabet, write:

>>> from Bio.Seq import Seq

>>> s = Seq(“ACCGTTTAAC”) # no alphabet specified

>>> print type(s)

<class ‘Bio.Seq.Seq’>

>>> s

Seq(‘ACCGTTTAAC’, Alphabet())

>>> print s ACCGTTTAAC

To create a DNA sequence, specify the corresponding alphabet when creating the object:

Seq

>>> from Bio.Seq import Seq

>>> from Bio.Alphabet import IUPAC

>>> s = Seq(“ACCGTTTAAC”, IUPAC.unambiguous_dna)

>>> s

Seq(‘ACCGTTTAAC’, IUPACUnambiguousDNA())

objects act mostly like standard Python strings . For instance, you can write:

str

Seq

>>> from Bio.Seq import Seq

>>> s = Seq(“ACTG”)

>>> len(s) 4

>>> s[2] ‘T’

>>> s[1:1]

Seq(‘CT’, Alphabet())

>>> s.count(“C”) 3

However, some operations are restricted to objects with the same alphabet (in order to

Seq

prevent the user from, say, concatenating nucleotides and amino acids):

>>> s = Seq(“ACTG”, IUPAC.unambiguous_dna)

>>> s + s

Seq(‘ACTGACTG’, IUPACUnambiguousDNA())

>>> t = Seq(“AAADGHHHTEWW”, IUPAC.protein)

>>> s

Seq(‘ACTG’, IUPACUnambiguousDNA())

>>> t

Seq(‘AAADGHHHTEWW’, IUPACProtein())

>>> s + t

Traceback (most recent call last):

File “<stdin>”, line 1, in <module>

File “/home/stefano/.local/lib/python2.7/site-packages/Bio/Seq.py”, line 292, in add

self.alphabet, other.alphabet))

TypeError: Incompatible alphabets IUPACUnambiguousDNA() and IUPACProtein()

To convert back a

Seq

object into a normal

object, just use the normal type-casting

operator :

str()

str

>>> s = Seq(“ACTG”, IUPAC.unambiguous_dna)

>>> str(s)

Warning

Just like strings, objects are immutable:

Seq

>>> s = Seq(“ACCGTTTAAC”)

>>> s[0] = “G”

Traceback (most recent call last):

File “<stdin>”, line 1, in <module>

TypeError: ‘Seq’ object does not support item assignment

Thanks to their immutability, objects do).

Seq

objects can be used as keys in dictionaries (like

Note

str

Biopython also provides a mutable sequence type .

MutableSeq

You can convert back and forth between and :

MutableSeq

Seq

>>> s = Seq(“ACTG”, IUPAC.unambiguous_dna)

>>> muts = s.tomutable()

>>> muts[0] = “C”

>>> muts

MutableSeq(‘CCTG’, IUPACUnambiguousDNA())

>>> muts.toseq()

Seq(‘CCTG’, IUPACUnambiguousDNA())

Some operations are only available when specific alphabets are used. Let’s see a few examples.

Example. The method allows to complement a DNA or RNA sequence:

complement()

>>> s = Seq(“CCGTTAAAAC”, IUPAC.unambiguous_dna)

>>> s.complement()

Seq(‘GGCAATTTTG’, IUPACUnambiguousDNA())

while the also reverses the result from left to right:

reverse_complement()

>>> s.reverse_complement()

Seq(‘GTTTTAACGG’, IUPACUnambiguousDNA())

Warning

Neither of these methods is available for other alphabets:

>>> p = Seq(“ACLKQ”, IUPAC.protein)

>>> p.complement()

Traceback (most recent call last):

File “<stdin>”, line 1, in <module>

File “/home/stefano/.local/lib/python2.7/site-packages/Bio/Seq.py”, line 760, in complement

raise ValueError(“Proteins do not have complements!”) ValueError: Proteins do not have complements!

Example. (Adapted from the Biopython tutorial.) Transcription is the biological process by which a DNA sequence is transcribed into the corresponding DNA.

A picture of the transcription process:

5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ (coding strand)

|||||||||||||||||||||||||||||||||||||||

3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ (template strand)

|

Transcription

5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’ (mRNA)

The same process can be “simulated” on objects. Starting from the coding strand (as is

Seq

usually done in bioinformatics), it is only a matter of converting T’s into U’s:

>>> coding_dna

Seq(‘ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG’, IUPACUnambiguousDNA())

>>> mrna = coding_dna.transcribe()

>>> mrna

Seq(‘AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG’, IUPACUnambiguousRNA())

From the template strand instead we first have to compute the reverse complement, and then perform the character substitution:

>>> template_dna

‘TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC’

>>> template_dna.reverse_complement().transcribe()

Seq(‘AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG’, IUPACUnambiguousRNA())

It is also easy to perform the reverse process, back-transcription:

>>> mrna

‘AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG’

>>> mrna.back_transcribe()

Seq(‘ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG’, IUPACUnambiguousDNA())

Warning

These functions also take care of checking that the input alphabet is the right one (e.g. DNA) and apply the right alphabet to the result (e.g. RNA).

Example. (Adapted from the Biopython tutorial.) Let’s not take a look at translation, i.e. mRNA to protein.

The code is as follows:

>>>from Bio.Seq import Seq

>>>from Bio.Alphabet import IUPAC

>>>mrna = Seq(“AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG”, IUPAC.unambiguous_rna)

>>>print mrna

AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG

>>>print mrna.translate() MAIVMGR*KGAR*

Here the stop codons are indicated with an asterisk .

“*”

Now, different coding tables are available, for instance:

>>> coding_dna.translate(table=“Vertebrate Mitochondrial”) Seq(‘MAIVMGRWKGAR*’, HasStopCodon(IUPACProtein(), ‘*’))

It’s also possible to tell the method to stop at the first stop codon:

translate()

>>> coding_dna.translate()

Seq(‘MAIVMGR*KGAR*’, HasStopCodon(IUPACProtein(), ‘*’))

>>> coding_dna.translate(to_stop=True) Seq(‘MAIVMGR’, IUPACProtein())

In this case the stop codon is not included in the resulting sequence.

Note

The Genetic Codes page of the NCBI provides the full list of translation tables used by Biopython.

In order to visualize the various translation tables, you can do:

from Bio.Data import CodonTable

ttable = CodonTable.unambiguous_dna_by_name[“Standard”] print ttable

ttable = CodonTable.unambiguous_dna_by_name[“Vertebrate Mitochondrial”] print ttable

You can also visualize what are the encodings of stop and start codons:

print ttable.stop_codons print ttable.start_codons

Let’s take a look at some common sequence formats.

FASTA is the most basic format, we have used it several times already:

>sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pilin) MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAV

NNFEAHTINTVVHTNDSDKGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITID SNDLNFASSGVNKVSSTQKLSIHADATRVTGGALTAGQYQGLVSIILTKST

>sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pilin) MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVS

NTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDK NAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLVSGNYRANITITSTIK

GenBank is a richer sequence format for genes, it includes fields for various kinds of annotations (e.g. source organism, bibliographical references, database-specific accession number).

A GenBank file looks like this:

LOCUS

DEFINITION

5I3U_F

39 bp DNA

linear SYN 31AUG2016

ACCESSION

VERSION KEYWORDS SOURCE

ORGANISM

REFERENCE

AUTHORS

TITLE

JOURNAL PUBMED

FEATURES

source

Chain F, Structure Of Hiv1 Reverse Transcriptase Nsite Complex;

Catalytic Incorporation Of Aztmp To A Dna Aptamer In Crystal. 5I3U_F

5I3U_F

.

synthetic construct synthetic construct

other sequences; artificial sequences.

1 (bases 1 to 39)

Tu,X., Das,K., Han,Q., Bauman,J.D., Clark AD,Jr., Hou,X.,

Frenkel,Y.V., Gaffney,B.L., Jones,R.A., Boyer,P.L., Hughes,S.H., Sarafianos,S.G. and Arnold,E.

Structural basis of HIV1 resistance to AZT by excision Nat. Struct. Mol. Biol. 17 (10), 12021209 (2010)

20852643

Location/Qualifiers 1..39

/organism=“synthetic construct”

/mol_type=“other DNA”

/db_xref=“taxon:32630”

ORIGIN

1 taatacncnc ccttcggtgc tttgcaccga agggggggn

//

UniProt (obtained by combining SwissProt, TrEMBL and PIR) is the GenBank equivalent for protein sequences:

ID 143B_HUMAN STANDARD; PRT; 245 AA. AC P31946;

DT 01JUL1993 (Rel. 26, Created)

DT 01FEB1996 (Rel. 33, Last sequence update)

DT 15SEP2003 (Rel. 42, Last annotation update)

DE 1433 protein beta/alpha (Protein kinase C inhibitor protein1) DE (KCIP1) (Protein 1054).

GN YWHAB.

OS Homo sapiens (Human).

OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.

OX NCBI_TaxID=9606; RN [1]

RP SEQUENCE FROM N.A.

RC TISSUE=Keratinocytes;

RX MEDLINE=93294871; PubMed=8515476;

RA Leffers H., Madsen P., Rasmussen H.H., Honore B., Andersen A.H., RA Walbum E., Vandekerckhove J., Celis J.E.;

RT “Molecular cloning and expression of the transformation sensitive RT epithelial marker stratifin. A member of a protein family that has RT been involved in the protein kinase C signalling pathway.“;

The module provides a functions to parse all of the above (among others).

Bio.SeqIO

In particular, the returns an iterable of

SeqRecord

parse()

function takes a file handle and the format of the file (as a string) and objects:

from Bio.SeqIO import parse help(parse)

f = open(“test.fasta”)

for record in parse(f, “fasta”): print record.id, record.seq

Warning

An iterable is a function that allows to iterate over a collection of objects (similar to a list). When given an iterator, you can either use it as in the code above:

iterable = get_it_from_somewhere() # e.g. parse()

for element in iterable: do_something(element)

or extract one element at a time, with the method:

next()

iterable = get_it_from_somewhere() # e.g. parse()

first_element = iterable.next() do_something(first_element)

second_element = iterable.next() do_something(second_element)

# …

The iterable allows to iterate over a list of

objects. The

objects associate

a sequence with additional meta-data, such as human-readable descriptions, annotations,

Seq

SeqRecord

SeqRecord

and sequence features:

>>> from Bio.SeqRecord import SeqRecord

>>> help(SeqRecord)

Each object provides these attributes:

SeqRecord

  1. : the sequence itself, typically a object.

Seq

.seq

  1. : the primary identifier of the sequence – a string.

.id

  1. : a “common” identifier for the sequence, also a string.

.name

  1. : a human readable description of the sequence.

.description

  1. : a of additional information.

dict

.annotations

among others.

In order to read a object from a file:

SeqRecord

>>> from Bio import SeqIO

>>> record = SeqIO.parse(open(“sequences.fasta”), “fasta”).next()

>>> record

SeqRecord(seq=Seq(‘METNCRKLVSACVQLGVQPAAVECLFSKDSEIKKVEFTDSPESRKEAASSKFFP…RPV’,

SingleLetterAlphabet()), …

>>> record.id ‘Q99697’

>>> record.name ‘Q99697’

>>> record.annotations

{}

Here annotations.

record.annotations

is empty because the FASTA format does not really support sequence

The very nice thing about the

and

mechanism is that they allow to process all

of the different formats described above (FASTA, GenBank, UniProt) using the very same code!

SeqRecord

SeqIO

Let’s see what happens with a GenBank file:

>>> record = SeqIO.parse(open(“data.genbank”), “genbank”).next()

>>> record

SeqRecord(seq=Seq(‘TAATACNCNCCCTTCGGTGCTTTGCACCGAAGGGGGGGN’, IUPACAmbiguousDNA()), …

>>> from pprint import pprint

>>> pprint(record.annotations)

{‘accessions’: [‘5I3U_F’],

‘data_file_division’: ‘SYN’, ‘date’: ’31-AUG-2016′,

‘keywords’: [”],

‘organism’: ‘synthetic construct’,

‘references’: [Reference(title=’Structural basis of HIV-1 resistance to AZT by excision’,

…)],

‘source’: ‘synthetic construct’,

‘taxonomy’: [‘other sequences’, ‘artificial sequences’], ‘topology’: ‘linear’}

This way, you can (almost) forget about the format of the sequence data you are working with

– all the parsing is done automatically by Biopython.

Finally, the

SeqIO.write()

method allows to write a

objects or a list of

objects to disk, in a given format:

SeqRecord

SeqRecord

fr = open(“data.genbank”, “r”)

record = SeqIO.parse(fr, “genbank”).next()

fw = open(“seq1.fasta”, “w)

SeqIO.write(record, fw, “fasta”)

Biopython: Structures

Biopython allows to manipulate polypeptide structures via the module.

Bio.PDB

We will focus on data coming from the Protein Data Bank (PDB): http://www.rcsb.org/pdb/home/home.do

The PDB is by far the largest protein structure resource available online. At the time of writing, it hosts more thank 120k distinct protein structures, including protein-protein, protein-DNA, protein-RNA complexes.

In order to load the required code, just type:

from Bio.PDB import *

Note

For the documentation of the module, see:

Bio.PDB

http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

Note

In the following we will use the Zaire Ebolavirus glicoprotein structure here:

5JQ3

http://www.rcsb.org/pdb/explore/explore.do?structureId=5JQ3 This is what this structure looks like:

, available

Protein Structure File Formats

The PDB distributes protein structures in two different formats:

  1. The XML-based file format. It is not supported by Biopython.

.cif

mmcif

  1. The

file format. (The actual file extension is

.) See for instance:

https://files.rcsb.org/view/5JQ3.cif

This is the most recent and “parsable” file format.

  1. The file format, which is a specially formatted text file. See for instance:

pdb

https://files.rcsb.org/view/5JQ3.pdb

This is the historical file format. It is (mostly) still distributed because of legacy tools that do

not (yet) understand the newer and cleaner files.

mmcif

Warning

Some (many?) PDB files distributed by the Protein Data Bank contain formatting errors

that make them ambiguous or difficult to parse.

The module attempts to deal with these errors automatically.

Bio.PDB

Of course, Biopython is not perfect, and some formatting errors may still make it do the wrong thing, or raise an exception.

Depending on your goal, you can either:

manually correct the structure, e.g. if you have to analyze a single structure

ignore the error, e.g. when you are just computing global statistics about a large number of structures, and a few errors do not really impact the overall numbers

We will see how to ignore these “spurious” errors soon.

pdb

The the

mmcif

Bio.PDB

module implements two different parsers, one for the format. Let’s try them out.

format and one for

Starting with:

Bio.MMCIF.MMCIFParser

>>> from Bio.PDB import *

To load a

cif

file, use the

module:

>>> parser = MMCIFParser()

>>> struct = parser.get_structure(“5jq3”, “5jq3.cif”) # four warnings

>>> struct

<Structure id=5jq3>

>>> print type(struct)

<class ‘Bio.PDB.Structure.Structure’>

To load a

Bio.PDB.PDBParser

pdb

file, use the

module:

>>> parser = PDBParser()

>>> struct = parser.get_structure(“5jq3”, “5jq3.pdb”) # four warnings

>>> struct

<Structure id=5jq3>

>>> print type(struct)

<class ‘Bio.PDB.Structure.Structure’>

As you can see, both parsers give you the same kind of object, a .

Structure

The header of the structure is a which stores meta-data about the protein structure,

dict

including its resolution (in Angstroms), name, author, release date, etc. To access it, write:

>>> print struct.header.keys()

[‘structure_method’, ‘head’, ‘journal’, ‘journal_reference’, ‘compound’, ‘keywords’, ‘name’, ‘author’, ‘deposition_date’,

‘release_date’, ‘source’, ‘resolution’, ‘structure_reference’]

>>> print struct.header[“name”]

crystal structure of ebola glycoprotein

>>> print struct.header[“release_date”] 2016-06-29

>>> print struct.header[“resolution”]

2.23 # Å

Warning

The header dictionary may be an empty, depending on the structure and the parser. Try it

out with the parser!

mmcif

Note

You can download the required files directly from the PBD server with:

>>> pdbl = PDBList()

>>> pdbl.retrieve_pdb_file(‘1FAT’) Downloading PDB structure ‘1FAT’…

‘/home/stefano/scratch/fa/pdb1fat.ent’

The file will be put into the current working directory.

The object also allows you to retrieve a full copy of the PDB database, if you wish

PDBList

to do so. Note that the database takes quite a bit of disk space.

Structures as Hierarchies

Structural information is natively hierarchical.

The hierarchy is represented by Structural Bioinformatics FAQ):

Bio.PDB

as follows (image taken from the Biopython

Example. For instance, the PDB structure is composed of:

5JQ3

A single model, containing Two chains:

chain A, containing 338 residues chain B, containing 186 residues

Each residue is composed by multiple atoms, each having a 3D position represented as by (x, y, z) coordinates.

More specifically, here are the related classes:

Bio.PDB

A describes one or more models, i.e. different 3D conformations of the very

Structure

same structure (for NMR). The

Structure.get_models()

method returns an iterator over the models:

>>> it = struct.get_models()

>>> it

<generator object get_models at 0x7f318f0b3fa0>

>>> models = list(it)

>>> models

[<Model id=0>]

>>> type(models[0])

<class ‘Bio.PDB.Model.Model’>

Here I turned the iterator into a proper list with the function.

Model

dict

list()

A

Structure

object:

also acts as a

: given a model ID, it returns the corresponding

>>> struct[0]

<Model id=0>

A

Model

The

>>> chains = list(models[0].get_chains())

>>> chains

[<Chain id=A>, <Chain id=B>]

>>> type(chains[0])

<class ‘Bio.PDB.Chain.Chain’>

>>> type(chains[1])

<class ‘Bio.PDB.Chain.Chain’>

Model.get_chain()

describes exactly one 3D conformation. It contains one or more chains. method returns an iterator over the chains:

A also acts as a

dict

Model

mapping from chain IDs to

objects:

>>> model = models[0]

>>> model

<Model id=0>

>>> model[“B”]

<Chain id=B>

A describes a proper polypeptide structure, i.e. a consecutive sequence of bound

Chain

residues. The

Chain.get_residues()

method returns an iterator over the residues:

>>> residues = list(chains[0].get_residues())

>>> len(residues) 338

>>> type(residues[0])

<class ‘Bio.PDB.Residue.Residue’>

A

Residue

Chain

The atoms:

Residue.get_atom()

holds the atoms that belong to an amino acid.

(unfortunately, “atom” is singular) returns an iterator over the

>>> atoms = list(residues[0].get_atom())

>>> atoms

[<Atom N>, <Atom CA>, <Atom C>, <Atom O>, <Atom CB>, <Atom OG>]

>>> type(atoms[0])

<class ‘Bio.PDB.Atom.Atom’>

An holds the 3D coordinate of an atom, as a :

Vector

Atom

>>> atoms[0].get_vector()

<Vector -68.82, 17.89, -5.51>

The x, y, z coordinates in a can be accessed with:

Vector

>>> vector = atoms[0].get_vector()

<Vector -68.82, 17.89, -5.51>

>>> coords = list(vector.get_array())

[-68.81500244, 17.88899994, -5.51300001]

The other methods of the

norm()

Vector

class allow to compute the

with another vector,

compute its

angle()

, map it to the unit sphere with

, and multiply the

coordinates by a matrix with .

left_multiply()

normalize()

Example. To iterate over all atoms in a structure, type:

parser = PDBParser()

struct = parser.get_structure(“target”, “5jq3.pdb”)

for model in struct:

for chain in model:

for res in chain:

for atom in res: print atom

and to print a single atom:

>>> struct[0][“B”][502][“CA”].get_vector()

<Vector -64.45, 8.02, 1.62>

There are also shortcuts to skip the middle levels:

for res in model.get_residues(): print residue

and you can go back “up” the hierarchy:

model = chain.get_parent()

Finally, you can inspect the properties of atoms and residues using the methods described here:

http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ#analysis

The meaning of residue IDs are not obvious. From the Biopython Structural Bioinformatics FAQ:

What is a residue id?

This is a bit more complicated, due to the clumsy PDB format. A residue id is a tuple with three elements:

The hetero-flag: this is plus the name of the hetero-residue (e.g. ‘H_GLC’ in the

“H_”

case of a glucose molecule), or ‘W’ in the case of a water molecule. The sequence identifier in the chain, e.g. 100

The insertion code, e.g. ‘A’. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could e.g. have sequence identifiers and insertion codes as follows: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure. The id of the above glucose residue would thus be (‘H_GLC’, 100, ‘A’). If the hetero-flag and insertion code are blank, the sequence identifier alone can be used:

# Full id

residue = chain[(‘ ‘, 100, ‘ ‘)]

# Shortcut id

residue = chain[100]

The reason for the hetero-flag is that many, many PDB files use the same sequence identifier for an amino acid and a hetero-residue or a water, which would create obvious problems if the hetero-flag was not used.

Writing a file

pdb

pdb

Given a

Structure

object, it is easy to write it to disk in

format:

struct = … # a Structure

io = PDBIO()

io.set_structure(struct) io.save(“output.pdb”)

And that’s it!

Structure Alignment

In order to be comparable, structures must first be aligned: superimposed on top of each other.

Note

Structure alignment is very common problem.

Say you have a wildtype protein and an SNP mutant, and you would like to compare the structural consequences of the mutation. Then in order to compare the structure, first you want them to match “as closely as possible”.

It also occurs in protein structure prediction: in order to evaluate the quality of a structural predictor (a software that takes the sequence of a protein and attempts to figure out how the structure looks like), the predictions produced by the software must be compared to the real corresponding protein structures. The same goes, of course, for dynamical molecular simulation software.

For a quick overview of several superposition methods, see: https://en.wikipedia.org/wiki/Structural_alignment

The most basic superposition algorithms work by rototranslating the two proteins so to maximize their alignment.

In particular, the quality of the alignment is measured by the Root Mean Squared Error (RMSD), defined as:

RM SD(s, s′) = 1

n

n

∑ ∥xi

− x’i ∥2

i=1

This is a least-squares minimization problem, and is solved using standard decomposition techniques.

Warning

This is a useful, but naive interpretation of the structural alignment problem. It may not be suitable to the specific task.

For instance, it is well known that there is a correlation between the RMSD values and the length of the protein, which makes average RMSD values computed over different proteins very biased.

Different models of structure alignment avoid this issue, such as TM-score: http://zhanglab.ccmb.med.umich.edu/TM-score/

We will not cover them here.

The

Bio.PDB

module supports RMSD-based structural alignment via the

class,

let’s see how.

Superimposer

First, let’s extract the list of atoms from the two chains of our structure:

>>> atoms_a = list(struct[0][“A”].get_atoms())

>>> atoms_b = list(struct[0][“B”].get_atoms())

We make sure that the two lists of atoms are of the same length; the work otherwise:

Superimposer

does not

>>> atoms_a = atoms_a[:len(atoms_b)]

Now we create a object and specify the two lists of atoms that should be aligned:

Superimposer

>>> si = Superimposer()

>>> si.set_atoms(atoms_a, atoms_b)

Here the atoms

are fixed, while the atoms

are movable, i.e. the second

structure is aligned on top of the first one.

atoms_b

atoms_a

The superposition occurs automatically, behind the scene. We can access the minimum RMSD

value from the object with:

si

>>> print si.rms 26.463458137

as well as the rotation matrix and the translation vector:

>>> rotmatrix = si.rotran[0]

>>> rotmatrix

array([[-0.43465167, 0.59888164, 0.67262078],

[ 0.63784155, 0.73196744, -0.23954504],

[-0.63579563, 0.32490683, -0.70014246]])

>>> transvector = si.rotran[1]

>>> transvector

Finally, we can apply the optimal rototranslation to the movable atoms:

array([92.66668863, 31.1358793 , 17.84958153])

>> si.apply(atoms_b)

Conclusion and Next Steps

In this tutorial, we covered several key concepts in Biopython and bioinformatics, including:

  1. Biopython Basics: We learned how to manipulate biological sequences, perform sequence alignments, access biological databases, and visualize biological data using libraries like matplotlib.
  2. Sequence Analysis: We covered basic sequence analysis tasks such as counting nucleotides, calculating GC content, and translating DNA sequences.
  3. Sequence Alignment: We learned how to perform pairwise and multiple sequence alignments using Biopython’s Bio.pairwise2 and Bio.Align modules.
  4. Phylogenetic Analysis: We covered how to build phylogenetic trees from sequence data using Biopython’s Bio.Phylo.TreeConstruction module.
  5. Biological Database Access: We learned how to access and retrieve data from biological databases like NCBI and UniProt using Biopython’s Bio.Entrez and Bio.ExPASy modules.

 

Shares