Mastering Bioinformatics with Biopython: A Comprehensive Guide
March 24, 2024Prerequisites:
- Basic knowledge of Python programming language
- Understanding of basic bioinformatics concepts (e.g., biological sequences, sequence alignment)
Course Outcome:
- Ability to effectively use Biopython for various bioinformatics tasks, including sequence manipulation, alignment, and phylogenetics
- Proficiency in working with biological databases and retrieving relevant data using Biopython
- Skills in visualizing biological data and structures using Biopython and other libraries
- Understanding of best practices and advanced techniques in using Biopython for bioinformatics applications
Table of Contents
Introduction to Biopython
Overview of Biopython and its capabilities in bioinformatics
Biopython is a popular open-source Python package designed for bioinformatics analysis. It provides tools for working with biological data, including sequence analysis, molecular structures, and phylogenetics. Here’s an overview of Biopython and its capabilities:
- Sequence handling: Biopython can parse and manipulate biological sequences, including DNA, RNA, and protein sequences. It provides classes and functions for reading, writing, and analyzing sequences.
- Sequence alignment: Biopython supports sequence alignment algorithms, such as pairwise and multiple sequence alignment. It can perform alignments using popular algorithms like BLAST and ClustalW.
- Sequence motif analysis: Biopython includes tools for identifying sequence motifs, such as regular expressions and position-specific scoring matrices (PSSMs).
- File format support: Biopython can read and write various file formats commonly used in bioinformatics, including FASTA, GenBank, and PDB (Protein Data Bank) formats.
- Biological databases: Biopython provides interfaces to access biological databases, such as NCBI (National Center for Biotechnology Information) and UniProt, allowing users to search and retrieve biological data.
- Phylogenetics: Biopython supports phylogenetic analysis, including tree construction, manipulation, and visualization.
- Structure analysis: Biopython can parse and analyze molecular structure files, such as those in PDB format. It provides tools for calculating properties of molecular structures and visualizing them.
- Population genetics: Biopython includes modules for analyzing population genetic data, such as calculating allele frequencies and Hardy-Weinberg equilibrium.
- Machine learning integration: Biopython can be integrated with machine learning libraries like scikit-learn for advanced bioinformatics analyses.
Biopython is widely used in bioinformatics research and provides a comprehensive set of tools for analyzing biological data. Its flexibility and ease of use make it a valuable tool for both beginners and experienced bioinformaticians.
Installing Biopython and setting up the development environment
To install Biopython and set up your development environment, follow these steps:
- Install Python: If you don’t already have Python installed, download and install it from the official Python website (https://www.python.org/). Biopython requires Python 3.6 or later.
- Install Biopython: You can install Biopython using pip, the Python package installer. Open a terminal or command prompt and run the following command:
pip install biopython
This will download and install the Biopython package and its dependencies.
- Set up a development environment: It’s recommended to use a virtual environment to manage your Python dependencies and isolate them from other projects. You can create a virtual environment using the following commands:
On Windows:
python -m venv myenv
myenv\Scripts\activate
On macOS and Linux:
bashpython3 -m venv myenv
source myenv/bin/activate
Replace
myenv
with the name you want to give to your virtual environment. - Verify installation: After setting up your virtual environment, you can verify that Biopython is installed correctly by running a Python shell and importing Biopython:python
python
import Bio
Bio.__version__
This should print the version of Biopython you have installed.
- Start coding: You can now start using Biopython in your Python scripts. Import the Biopython modules you need and start writing your bioinformatics analyses.
Remember to deactivate your virtual environment when you’re done working on your project:
deactivate
This will return you to your regular system environment.
Working with Biological Sequences
Reading and writing sequence files in various formats (FASTA, GenBank, etc.)
Biopython provides functions for reading and writing sequence files in various formats, including FASTA, GenBank, and others. Here’s how you can read and write sequence files in different formats using Biopython:
- Reading sequence files:python
from Bio import SeqIO
# Read a FASTA file
fasta_file = "example.fasta"
for record in SeqIO.parse(fasta_file, "fasta"):
print(record.id)
print(record.seq)# Read a GenBank file
genbank_file = "example.gbk"
for record in SeqIO.parse(genbank_file, "genbank"):
print(record.id)
print(record.seq)
The
SeqIO.parse()
function reads the sequence file and returns aSeqRecord
object for each entry in the file. You can access the ID (record.id
) and sequence (record.seq
) of each entry. - Writing sequence files:python
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord# Create a SeqRecord object
sequence = Seq("ATCGATCGATCG")
record = SeqRecord(sequence, id="seq1", description="Example sequence")# Write to a FASTA file
fasta_outfile = "output.fasta"
with open(fasta_outfile, "w") as output_handle:
SeqIO.write(record, output_handle, "fasta")# Write to a GenBank file
genbank_outfile = "output.gbk"
with open(genbank_outfile, "w") as output_handle:
SeqIO.write(record, output_handle, "genbank")
The
SeqRecord
class is used to create a record with a sequence (sequence
), an ID (id
), and a description (description
). TheSeqIO.write()
function writes the record to a file in the specified format ("fasta"
or"genbank"
).
These examples demonstrate how to read and write sequence files in various formats using Biopython. You can use similar approaches for other supported formats by changing the format strings in the SeqIO.parse()
and SeqIO.write()
functions.
Manipulating sequences (e.g., slicing, concatenation, reverse complement)
Biopython provides functions for manipulating sequences, such as slicing, concatenation, and reverse complementation. Here’s how you can perform these operations:
- Slicing sequences:python
from Bio.Seq import Seq
# Create a sequence
sequence = Seq("ATCGATCGATCG")# Slice the sequence
print(sequence[2:6]) # Output: 'CGAT'
- Concatenating sequences:python
from Bio.Seq import Seq
# Create two sequences
seq1 = Seq("ATCG")
seq2 = Seq("GATC")# Concatenate the sequences
concatenated_seq = seq1 + seq2
print(concatenated_seq) # Output: 'ATCGGATC'
- Reverse complementing sequences:python
from Bio.Seq import Seq
# Create a sequence
sequence = Seq("ATCGATCGATCG")# Reverse complement the sequence
reverse_complement = sequence.reverse_complement()
print(reverse_complement) # Output: 'CGATCGATCGAT'
These examples demonstrate how to manipulate sequences using Biopython. Biopython provides many other functions for working with sequences, such as translating sequences, finding motifs, and calculating molecular weights. Refer to the Biopython documentation for more information: https://biopython.org/docs/
Sequence Alignment
Performing pairwise and multiple sequence alignments
Biopython provides modules for performing pairwise and multiple sequence alignments using popular algorithms such as Needleman-Wunsch (for global alignment), Smith-Waterman (for local alignment), and ClustalW (for multiple sequence alignment). Here’s how you can perform these alignments:
- Pairwise sequence alignment:python
from Bio import pairwise2
from Bio.Seq import Seq# Create two sequences
seq1 = Seq("ACGT")
seq2 = Seq("AGT")# Perform pairwise alignment
alignments = pairwise2.align.globalxx(seq1, seq2)# Print the alignments
for alignment in alignments:
print(pairwise2.format_alignment(*alignment))
This example performs a global alignment of two sequences (
seq1
andseq2
) using theglobalxx
function (which uses a simple match/mismatch scoring scheme). Theformat_alignment
function is used to print the alignments. - Multiple sequence alignment:python
from Bio import AlignIO
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord# Create a list of sequences
sequences = [
SeqRecord(Seq("ATCG"), id="seq1"),
SeqRecord(Seq("AGT"), id="seq2"),
SeqRecord(Seq("ACGT"), id="seq3")
]# Perform multiple sequence alignment
alignment = MultipleSeqAlignment(sequences)
print(alignment)# Write the alignment to a file
with open("output.fasta", "w") as output_handle:
AlignIO.write(alignment, output_handle, "fasta")
This example creates a list of sequences and performs a multiple sequence alignment using the
MultipleSeqAlignment
class. TheAlignIO.write
function is used to write the alignment to a file in FASTA format.
Biopython provides more advanced options for sequence alignment, such as specifying custom scoring matrices, gap penalties, and alignment algorithms. Refer to the Biopython documentation for more information: https://biopython.org/docs/
Visualizing sequence alignments
Visualizing sequence alignments can help you understand the similarities and differences between sequences. Biopython provides functionality for visualizing alignments using the Bio.pairwise2
and Bio.Align
modules. Here’s how you can visualize sequence alignments:
- Install Biopython: If you haven’t already installed Biopython, you can do so using pip:
pip install biopython
- Perform the alignment: First, perform the sequence alignment using the
pairwise2
module. Here’s an example of pairwise sequence alignment:pythonfrom Bio import pairwise2
from Bio.Seq import Seq# Create two sequences
seq1 = Seq("ACGT")
seq2 = Seq("AGT")# Perform pairwise alignment
alignments = pairwise2.align.globalxx(seq1, seq2)# Get the first alignment
alignment = alignments[0]# Print the alignment
print(pairwise2.format_alignment(*alignment))
- Visualize the alignment: Biopython doesn’t have built-in functions for graphical visualization of alignments. However, you can use external tools such as Jalview (http://www.jalview.org/) or BioEdit (http://bioedit.software.informer.com/) to visualize the alignment.
Alternatively, you can create a simple text-based visualization of the alignment by printing the aligned sequences:
pythonaligned_seq1, _, aligned_seq2, _ = alignment
for i in range(0, len(aligned_seq1), 60):
print(aligned_seq1[i:i+60])
print(aligned_seq2[i:i+60])
print()
This will print the aligned sequences in blocks of 60 characters, showing the matching and mismatching positions.
While Biopython doesn’t provide graphical visualization tools for alignments, it provides a foundation for aligning sequences and extracting alignment information that can be used with other tools for visualization.
Working with Sequence Annotations
Extracting and manipulating annotation information from sequence files
To extract and manipulate annotation information from sequence files, such as GenBank files, you can use Biopython’s SeqIO
module. Here’s an example of how to extract and manipulate annotation information:
- Extracting annotation information:python
from Bio import SeqIO
# Open a GenBank file
genbank_file = "example.gbk"# Parse the GenBank file and extract annotation information
for record in SeqIO.parse(genbank_file, "genbank"):
print("ID:", record.id)
print("Description:", record.description)
print("Annotations:")
for key, value in record.annotations.items():
print(f" {key}: {value}")
This example opens a GenBank file and extracts the sequence ID, description, and other annotations associated with the sequence.
- Manipulating annotation information:
You can manipulate annotation information by modifying the
annotations
dictionary of theSeqRecord
object:pythonfrom Bio import SeqIO
# Open a GenBank file
genbank_file = "example.gbk"# Parse the GenBank file and manipulate annotation information
for record in SeqIO.parse(genbank_file, "genbank"):
# Modify the description
record.description = "Modified description"
# Add a new annotation
record.annotations["organism"] = "Homo sapiens"
# Print the modified record
print(record.format("genbank"))
This example modifies the description and adds a new annotation (“organism”) to the
SeqRecord
object, then prints the modified record in GenBank format.
Biopython’s SeqIO
module provides a convenient way to extract and manipulate annotation information from sequence files. You can use this functionality to extract specific annotations, modify them, or add new annotations as needed.
Annotating sequences and visualizing annotations
To annotate sequences and visualize the annotations, you can use Biopython to add annotations to a SeqRecord
object and then visualize the annotations using external tools or by creating custom visualizations. Here’s an example of how to annotate sequences and visualize the annotations:
- Annotating sequences:python
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation# Open a GenBank file
genbank_file = "example.gbk"# Parse the GenBank file and annotate the sequence
for record in SeqIO.parse(genbank_file, "genbank"):
# Add a new feature to the sequence
feature_location = FeatureLocation(10, 50) # Example location
feature = SeqFeature(feature_location, type="CDS", qualifiers={"gene": "example_gene"})
record.features.append(feature)# Print the annotated record
print(record.format("genbank"))
This example adds a new feature (CDS – coding sequence) to the
SeqRecord
object with a location from 10 to 50 and a gene qualifier “example_gene”. - Visualizing annotations:
Biopython does not provide built-in functions for visualizing annotations. However, you can use external tools such as GenomeDiagram (https://biopython.org/docs/1.75/api/Bio.Graphics.GenomeDiagram.html) or create custom visualizations using matplotlib or other plotting libraries.
Here’s an example of how you can create a simple visualization of annotations using matplotlib:
pythonimport matplotlib.pyplot as plt
# Define the sequence length
seq_length = len(record.seq)# Create a plot
plt.figure(figsize=(10, 2))
plt.xlim(0, seq_length)
plt.ylim(0, 1)
plt.yticks([])# Add annotations
for feature in record.features:
if feature.type == "CDS":
start, end = feature.location.start, feature.location.end
plt.plot([start, end], [0.5, 0.5], linewidth=5, color="blue", solid_capstyle="butt")# Show the plot
plt.show()
This example creates a plot where each blue line represents a coding sequence (CDS) annotation. You can customize the visualization to include different types of features or annotations as needed.
Biopython provides functionality for annotating sequences and extracting annotation information, but visualizing annotations typically requires using external tools or custom visualization code.
Phylogenetics with Biopython
Building phylogenetic trees from sequence data
To build phylogenetic trees from sequence data, you can use Biopython’s Phylo
module, which provides tools for performing phylogenetic analyses. Here’s an example of how to build a phylogenetic tree from sequence data:
- Prepare the sequence data:python
from Bio import SeqIO
# Open a FASTA file containing sequences
fasta_file = "sequences.fasta"# Parse the FASTA file and extract sequences
sequences = []
for record in SeqIO.parse(fasta_file, "fasta"):
sequences.append(record.seq)
- Perform multiple sequence alignment (optional):
Before building a phylogenetic tree, it’s often necessary to perform multiple sequence alignment to align the sequences. Biopython provides tools for this, such as
AlignIO
andMultipleSeqAlignment
. Here’s an example:pythonfrom Bio.Align import MultipleSeqAlignment
from Bio.Align import AlignIO
from Bio import Alignaligner = Align.PairwiseAligner()
aligner.mode = 'local'
alignments = aligner.align(sequences)alignment = MultipleSeqAlignment([s for s in alignments])
- Build the phylogenetic tree:python
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio import Phylo# Calculate the distance matrix
calculator = DistanceCalculator('identity')
distance_matrix = calculator.get_distance(alignment)# Build the tree
constructor = DistanceTreeConstructor()
tree = constructor.upgma(distance_matrix)# Draw the tree
Phylo.draw(tree)
This example uses the UPGMA (Unweighted Pair Group Method with Arithmetic Mean) algorithm to construct the tree. You can use other algorithms supported by Biopython, such as Neighbor-Joining or Maximum Likelihood, by specifying the appropriate constructor (
DistanceTreeConstructor
). - Visualize the phylogenetic tree:
Biopython provides basic tree visualization capabilities using
Phylo.draw()
. However, for more advanced visualizations, you may want to export the tree in Newick format and use external tools such as FigTree or iTOL (Interactive Tree of Life) for visualization.python# Export the tree in Newick format
Phylo.write(tree, "tree.nwk", "newick")
You can then open the exported Newick file in a tree visualization tool for further analysis and visualization.
This is a basic example of how to build a phylogenetic tree from sequence data using Biopython. Depending on your data and analysis requirements, you may need to customize the process or use additional tools and algorithms.
Visualizing and analyzing phylogenetic trees
To visualize and analyze phylogenetic trees, you can use Biopython’s Phylo
module for basic visualization and external tools or libraries for more advanced analyses. Here’s how you can visualize and analyze a phylogenetic tree using Biopython and additional tools:
- Visualizing phylogenetic trees:
Biopython’s
Phylo.draw()
function provides a basic way to visualize phylogenetic trees. Here’s an example:pythonfrom Bio import Phylo
# Read the tree from a Newick file
tree = Phylo.read("tree.nwk", "newick")# Basic tree visualization
Phylo.draw(tree)
For more advanced visualizations, you can export the tree in Newick format and use external tools like FigTree, iTOL, or ETE Toolkit for interactive and customizable visualizations.
- Annotating phylogenetic trees:
You can annotate phylogenetic trees with additional information, such as branch lengths, support values, or metadata associated with the sequences. Here’s an example:
python# Add branch lengths to the tree
for clade in tree.find_clades():
if clade.branch_length is None:
clade.branch_length = 0.1 # Example branch length# Annotate tree with support values (bootstrap values)
for clade in tree.find_clades():
if clade.confidence is not None:
clade.name = f"{clade.name}:{clade.confidence}%"# Visualize the annotated tree
Phylo.draw(tree)
- Analyzing phylogenetic trees:
Biopython’s
Phylo
module provides functions for basic tree analysis, such as calculating the tree’s root, finding common ancestors, and traversing the tree. For more advanced analyses, you can use external tools or libraries likeete3
for tree manipulation and analysis.python# Get the root of the tree
root = tree.root# Find common ancestors
common_ancestor = tree.common_ancestor(["A", "B"])# Traverse the tree
for clade in tree.find_clades():
print(clade)
Additionally, you can use tools like
ete3
for advanced tree analysis, including subtree extraction, branch length manipulation, and visualization.
Biopython provides basic functionalities for visualizing and analyzing phylogenetic trees. For more advanced analyses and visualizations, consider using external tools or libraries that offer more specialized features and customization options.
Structural Bioinformatics
Parsing and analyzing protein structure files (PDB format)
To parse and analyze protein structure files in PDB format, you can use Biopython’s Bio.PDB
module, which provides tools for working with protein structures. Here’s an example of how to parse and analyze a protein structure file in PDB format:
- Parse a PDB file:python
from Bio.PDB import PDBParser
# Create a PDB parser object
parser = PDBParser()# Parse the PDB file
structure = parser.get_structure("1abc", "1abc.pdb")
Replace
"1abc"
with the structure ID and"1abc.pdb"
with the path to your PDB file. - Accessing structure information:python
# Access information about the structure
print("Structure ID:", structure.id)
print("Number of models:", len(structure))
- Accessing chain and residue information:python
# Iterate over chains in the structure
for chain in structure[0]:
# Access information about the chain
print("Chain ID:", chain.id)
print("Number of residues:", len(chain))# Iterate over residues in the chain
for residue in chain:
# Access information about the residue
print("Residue ID:", residue.id)
print("Residue name:", residue.resname)
- Calculating protein properties:
Biopython’s
Bio.PDB
module provides functions for calculating various properties of protein structures, such as the distance between atoms, the angle between atoms, and the RMSD (Root Mean Square Deviation) between structures. Here’s an example:pythonfrom Bio.PDB import calc_angle, calc_distance
# Calculate the distance between two atoms
atom1 = structure[0]['A'][10]['CA'] # Example atom
atom2 = structure[0]['A'][20]['CA'] # Example atom
distance = calc_distance(atom1, atom2)# Calculate the angle between three atoms
atom3 = structure[0]['A'][30]['CA'] # Example atom
angle = calc_angle(atom1, atom2, atom3)
- Visualizing the protein structure:
Biopython’s
Bio.PDB
module does not provide built-in visualization tools for protein structures. However, you can export the structure in PDB format and visualize it using external tools like PyMOL or VMD.python# Export the structure in PDB format
io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")
You can then open the exported PDB file in a molecular visualization tool for visualization and analysis.
Biopython’s Bio.PDB
module provides a powerful and flexible framework for parsing, analyzing, and manipulating protein structure files in PDB format.
Visualizing protein structures
To visualize protein structures, you can use external tools like PyMOL, Chimera, or VMD. These tools provide interactive 3D visualization capabilities for protein structures. Here’s a general approach to visualizing protein structures using PyMOL as an example:
- Export the protein structure in PDB format:
Before you can visualize a protein structure, you need to export it in PDB format. If you have already parsed the structure using Biopython, you can export it as follows:
pythonfrom Bio.PDB import PDBIO
# Export the structure in PDB format
io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")
- Open PyMOL:
Start PyMOL on your computer. You can download PyMOL from the official website (https://pymol.org/).
- Load the protein structure:
In PyMOL, use the “File” menu to open the exported PDB file containing your protein structure.
- Visualize the protein structure:
Once the structure is loaded, you can use PyMOL’s graphical interface to manipulate and visualize the protein structure. You can rotate, zoom, and pan the structure to view it from different angles. PyMOL also provides tools for coloring the structure, highlighting specific residues, and measuring distances between atoms.
- Save the visualization:
After you have customized the visualization to your liking, you can save the image or session for future reference. PyMOL allows you to save images in various formats and save sessions to resume your work later.
PyMOL provides a range of advanced features for visualizing and analyzing protein structures, making it a powerful tool for structural biology research. Other tools like Chimera and VMD offer similar functionalities and can also be used for visualizing protein structures.
Biological Databases and Data Retrieval
Accessing biological databases (e.g., NCBI, UniProt) using Biopython
To access biological databases such as NCBI and UniProt using Biopython, you can use the Bio.Entrez
and Bio.SeqIO
modules. These modules provide functions for searching and retrieving data from these databases. Here’s an example of how to access NCBI and UniProt databases using Biopython:
- Accessing NCBI databases:python
from Bio import Entrez
# Provide your email address to NCBI
Entrez.email = "[email protected]"# Search NCBI nucleotide database
search_term = "CYP2D6[Gene Name] AND Homo sapiens[Organism]"
handle = Entrez.esearch(db="nucleotide", term=search_term, retmax=10)
record = Entrez.read(handle)
handle.close()# Get the list of matching IDs
id_list = record["IdList"]# Fetch sequences for the matching IDs
for uid in id_list:
handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
sequence = handle.read()
handle.close()
print(sequence)
This example searches the NCBI nucleotide database for sequences of the CYP2D6 gene in Homo sapiens and retrieves the first 10 matching sequences in FASTA format.
- Accessing UniProt database:python
from Bio import ExPASy
from Bio import SeqIO# Access a specific UniProt entry
uniprot_id = "P68871"
handle = ExPASy.get_sprot_raw(uniprot_id)
record = SeqIO.read(handle, "swiss")
handle.close()# Accessing specific information from the record
print("ID:", record.id)
print("Description:", record.description)
print("Sequence:", str(record.seq))
This example accesses a specific entry (P68871) from the UniProt database and prints its ID, description, and sequence.
Biopython provides a convenient way to access and retrieve data from biological databases like NCBI and UniProt. It’s important to provide your email address to NCBI (using Entrez.email
) to comply with their usage policies.
Retrieving and parsing data from databases
To retrieve and parse data from biological databases using Biopython, you can use the Bio.Entrez
module for NCBI databases and various parsers such as Bio.SeqIO
for sequence data. Here’s a general example of how to retrieve and parse data from NCBI’s nucleotide database:
- Retrieve data from NCBI:python
from Bio import Entrez
# Provide your email address to NCBI
Entrez.email = "[email protected]"# Search NCBI nucleotide database
search_term = "CYP2D6[Gene Name] AND Homo sapiens[Organism]"
handle = Entrez.esearch(db="nucleotide", term=search_term, retmax=10)
record = Entrez.read(handle)
handle.close()# Get the list of matching IDs
id_list = record["IdList"]# Fetch sequences for the matching IDs
for uid in id_list:
handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
sequence = handle.read()
handle.close()
print(sequence)
This example searches the NCBI nucleotide database for sequences of the CYP2D6 gene in Homo sapiens and retrieves the first 10 matching sequences in FASTA format.
- Parse the retrieved data:
Once you have retrieved the data, you can parse it using the appropriate parser. For example, to parse the retrieved sequences as FASTA records:
pythonfrom Bio import SeqIO
for uid in id_list:
handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")
for record in records:
print("ID:", record.id)
print("Description:", record.description)
print("Sequence:", str(record.seq))
handle.close()
This code uses
SeqIO.parse()
to parse the retrieved data as FASTA records and then iterates over each record to access its ID, description, and sequence.
Biopython provides similar functionality for other databases and data formats. You can use Bio.SeqIO
to parse sequence data in various formats (FASTA, GenBank, etc.) and Bio.PDB
to parse protein structure data in PDB format, among other capabilities.
Data Analysis and Visualization
Performing basic data analysis on biological data using Biopython
Biopython provides several tools for performing basic data analysis on biological data. Here’s a general overview of some common tasks you can perform using Biopython:
- Sequence analysis:
- Counting nucleotides or amino acids in a sequence.
- Calculating GC content of a DNA sequence.
- Translating DNA sequences to protein sequences.
pythonfrom Bio.Seq import Seq
from Bio.SeqUtils import GC# Create a DNA sequence
dna_seq = Seq("ATCGATCGATCG")# Count nucleotides
print("Number of A's:", dna_seq.count("A"))# Calculate GC content
print("GC content:", GC(dna_seq))# Translate DNA to protein sequence
protein_seq = dna_seq.translate()
print("Protein sequence:", protein_seq)
- Sequence alignment:
- Performing pairwise and multiple sequence alignments.
pythonfrom Bio import pairwise2, SeqIO
from Bio.pairwise2 import format_alignment# Read sequences from a file
seq1 = SeqIO.read("seq1.fasta", "fasta").seq
seq2 = SeqIO.read("seq2.fasta", "fasta").seq# Perform pairwise alignment
alignments = pairwise2.align.globalxx(seq1, seq2)# Print the alignments
for alignment in alignments:
print(format_alignment(*alignment))
- Phylogenetic analysis:
- Building phylogenetic trees from sequence data.
pythonfrom Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor# Read sequences from a file
alignment = AlignIO.read("alignment.fasta", "fasta")# Calculate the distance matrix
calculator = DistanceCalculator('identity')
distance_matrix = calculator.get_distance(alignment)# Build the tree
constructor = DistanceTreeConstructor()
tree = constructor.upgma(distance_matrix)# Draw the tree
Phylo.draw(tree)
- Accessing and parsing biological databases:
- Retrieving and parsing data from databases like NCBI or UniProt.
pythonfrom Bio import Entrez, SeqIO
# Provide your email address to NCBI
Entrez.email = "[email protected]"# Search NCBI nucleotide database
search_term = "CYP2D6[Gene Name] AND Homo sapiens[Organism]"
handle = Entrez.esearch(db="nucleotide", term=search_term, retmax=10)
record = Entrez.read(handle)
handle.close()# Get the list of matching IDs
id_list = record["IdList"]# Fetch sequences for the matching IDs
for uid in id_list:
handle = Entrez.efetch(db="nucleotide", id=uid, rettype="fasta", retmode="text")
record = SeqIO.read(handle, "fasta")
print("ID:", record.id)
print("Sequence:", str(record.seq))
handle.close()
These examples cover some of the basic data analysis tasks you can perform using Biopython. Biopython provides extensive documentation and tutorials for more advanced analyses and specific tasks.
Visualizing biological data using matplotlib and other libraries
Visualizing biological data using libraries like matplotlib can be incredibly useful for gaining insights and communicating findings. Here are some examples of how you can visualize different types of biological data:
- Sequence composition:
- Visualize the nucleotide or amino acid composition of a sequence using a bar plot.
pythonimport matplotlib.pyplot as plt
from Bio.Seq import Seq
from Bio.SeqUtils import nt_distribution# Create a DNA sequence
seq = Seq("ATCGATCGATCG")# Calculate nucleotide distribution
nt_dist = nt_distribution(str(seq))# Plot the distribution
plt.bar(nt_dist.keys(), nt_dist.values())
plt.xlabel('Nucleotide')
plt.ylabel('Count')
plt.title('Nucleotide Composition')
plt.show()
- GC content:
- Visualize the GC content of sequences using a histogram.
pythonimport matplotlib.pyplot as plt
from Bio.SeqUtils import GC# Calculate GC content for multiple sequences
gc_contents = [GC(seq) for seq in sequences]# Plot the GC content distribution
plt.hist(gc_contents, bins=20)
plt.xlabel('GC Content (%)')
plt.ylabel('Count')
plt.title('GC Content Distribution')
plt.show()
- Sequence alignment:
- Visualize sequence alignments using a heatmap.
pythonimport seaborn as sns
import matplotlib.pyplot as plt
from Bio import AlignIO# Read the alignment from a file
alignment = AlignIO.read("alignment.fasta", "fasta")# Create a heatmap
sns.heatmap(alignment, cmap="viridis")
plt.xlabel('Position')
plt.ylabel('Sequence')
plt.title('Sequence Alignment')
plt.show()
- Phylogenetic tree:
- Visualize a phylogenetic tree using a dendrogram.
pythonfrom Bio import Phylo
import matplotlib.pyplot as plt# Read the tree from a Newick file
tree = Phylo.read("tree.nwk", "newick")# Plot the tree
Phylo.draw(tree)
plt.show()
- Protein structure:
- Visualize protein structures using external tools like PyMOL or Chimera. Alternatively, you can use libraries like nglview for interactive visualization in Jupyter notebooks.
pythonimport nglview as nv
# Load a protein structure file
view = nv.show_pdbid("1crn")# Display the structure
view
These examples demonstrate how you can use matplotlib and other libraries to visualize different types of biological data. Visualizations can be customized further to suit specific requirements and improve readability.
Biopython in Practical Applications
Biopython is the largest and most common bioinformatics package for Python. It contains a number of different sub modules for common bioinformatics tasks.
It is freely available here: http://biopython.org/wiki/Biopython
while the entire source code is hosted here: https://github.com/biopython/biopython
The Biopython code is distributed under the Biopython License.
Biopython sports several features. It provides:
Data structures for biological sequences and features thereof, as well as a multitude of manipulation functions for performing common tasks, such as translation, transcription, weight computations, …
Functions for loading biological data into -like data structures.
dict
Supported formats include, for instance: FASTA, PDB, GenBank, Blast, SCOP, PubMed/Medline, ExPASy-related formats
Functions for data analysis, including basic data mining and machine learning algorithm: k– Nearest Neighbors, Naive Bayes, and Support Vector Machines
Access to online services and database, including NCBI services (Blast, Entrez, PubMed) and ExPASY services (SwissProt, Prosite)
Access to local services, including Blast, Clustalw, EMBOSS.
In order to use local services (such as Blast), the latter must be properly installed and configured on the local machine.
Setting up the local services is beyond the scope of this course.
In order to install Biopython:
On a Debian-based distributions (for instance Ubuntu), just type:
$ sudo apt-get install python-biopython
On a generic GNU/Linux distribution, just type:
$ pip install biopython –user
Please refer to the official documentation for further instructions.
While Biopython is the main player in the field, it is not the only one. Other interesting packages are:
ETE and DendroPy, dedicated to computation and visualization of phylogenetic trees.
CSB for dealing with sequences and structures, computing alignments and profiles (with profile HMMs), and Monte Carlo sampling.
biskit is designed for structural bioinformatics tasks, such as loading 3D protein structures in Protein Data Bank (PDB) format, homology modelling, molecular dynamics simulations (using external simulators) and trajectory computations, and docking, among others.
You are free to mix and match all the packages you need.
Biopython: Sequences
Sequences lay at the core of bioinformatics: although DNA, RNA and proteins are molecules with specific structures and dynamical behaviors, their basic building block is their sequence.
Bio.Seq
Biopython encodes sequences using objects of type module. Take a look at their manual:
Seq
, provided by the
sub-
from Bio.Seq import Seq help(Seq)
The official documentation of the Biopython class can be found on the Biopython wiki.
Seq
Each object is associated to a formal alphabet.
Seq
Each sequence has its own alphabet
Conversions between different sequence types (transcription, translation) can be seen as a change of alphabet.
Some operations (e.g. concatenation) can not be done between sequences with incompatible alphabets.
The various alphabets supported by Biopython can be found in
Bio.Alphabet
.
Bio.Alphabet.IUPAC
The most basic alphabets are:
, e.g.
: a sequence of DNA
IUPAC.unambiguous_dna
: a sequence of RNA
IUPAC.unambiguous_rna
: a protein sequence
IUPAC.protein
Example. To create a generic sequence, i.e. a sequence not bound to any specific alphabet, write:
>>> from Bio.Seq import Seq
>>> s = Seq(“ACCGTTTAAC”) # no alphabet specified
>>> print type(s)
<class ‘Bio.Seq.Seq’>
>>> s
Seq(‘ACCGTTTAAC’, Alphabet())
>>> print s ACCGTTTAAC
To create a DNA sequence, specify the corresponding alphabet when creating the object:
Seq
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> s = Seq(“ACCGTTTAAC”, IUPAC.unambiguous_dna)
>>> s
Seq(‘ACCGTTTAAC’, IUPACUnambiguousDNA())
objects act mostly like standard Python strings . For instance, you can write:
str
Seq
>>> from Bio.Seq import Seq
>>> s = Seq(“ACTG”)
>>> len(s) 4
>>> s[2] ‘T’
>>> s[1:–1]
Seq(‘CT’, Alphabet())
>>> s.count(“C”) 3
However, some operations are restricted to objects with the same alphabet (in order to
Seq
prevent the user from, say, concatenating nucleotides and amino acids):
>>> s = Seq(“ACTG”, IUPAC.unambiguous_dna)
>>> s + s
Seq(‘ACTGACTG’, IUPACUnambiguousDNA())
>>> t = Seq(“AAADGHHHTEWW”, IUPAC.protein)
>>> s
Seq(‘ACTG’, IUPACUnambiguousDNA())
>>> t
Seq(‘AAADGHHHTEWW’, IUPACProtein())
>>> s + t
Traceback (most recent call last):
File “<stdin>”, line 1, in <module>
File “/home/stefano/.local/lib/python2.7/site-packages/Bio/Seq.py”, line 292, in add
self.alphabet, other.alphabet))
TypeError: Incompatible alphabets IUPACUnambiguousDNA() and IUPACProtein()
To convert back a
Seq
object into a normal
object, just use the normal type-casting
operator :
str()
str
>>> s = Seq(“ACTG”, IUPAC.unambiguous_dna)
>>> str(s)
Warning
Just like strings, objects are immutable:
Seq
>>> s = Seq(“ACCGTTTAAC”)
>>> s[0] = “G”
Traceback (most recent call last):
File “<stdin>”, line 1, in <module>
TypeError: ‘Seq’ object does not support item assignment
Thanks to their immutability, objects do).
Seq
objects can be used as keys in dictionaries (like
Note
str
Biopython also provides a mutable sequence type .
MutableSeq
You can convert back and forth between and :
MutableSeq
Seq
>>> s = Seq(“ACTG”, IUPAC.unambiguous_dna)
>>> muts = s.tomutable()
>>> muts[0] = “C”
>>> muts
MutableSeq(‘CCTG’, IUPACUnambiguousDNA())
>>> muts.toseq()
Seq(‘CCTG’, IUPACUnambiguousDNA())
Some operations are only available when specific alphabets are used. Let’s see a few examples.
Example. The method allows to complement a DNA or RNA sequence:
complement()
>>> s = Seq(“CCGTTAAAAC”, IUPAC.unambiguous_dna)
>>> s.complement()
Seq(‘GGCAATTTTG’, IUPACUnambiguousDNA())
while the also reverses the result from left to right:
reverse_complement()
>>> s.reverse_complement()
Seq(‘GTTTTAACGG’, IUPACUnambiguousDNA())
Warning
Neither of these methods is available for other alphabets:
>>> p = Seq(“ACLKQ”, IUPAC.protein)
>>> p.complement()
Traceback (most recent call last):
File “<stdin>”, line 1, in <module>
File “/home/stefano/.local/lib/python2.7/site-packages/Bio/Seq.py”, line 760, in complement
raise ValueError(“Proteins do not have complements!”) ValueError: Proteins do not have complements!
Example. (Adapted from the Biopython tutorial.) Transcription is the biological process by which a DNA sequence is transcribed into the corresponding DNA.
A picture of the transcription process:
5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ (coding strand)
|||||||||||||||||||||||||||||||||||||||
3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ (template strand)
|
Transcription
↓
5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’ (mRNA)
The same process can be “simulated” on objects. Starting from the coding strand (as is
Seq
usually done in bioinformatics), it is only a matter of converting T’s into U’s:
>>> coding_dna
Seq(‘ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG’, IUPACUnambiguousDNA())
>>> mrna = coding_dna.transcribe()
>>> mrna
Seq(‘AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG’, IUPACUnambiguousRNA())
From the template strand instead we first have to compute the reverse complement, and then perform the character substitution:
>>> template_dna
‘TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC’
>>> template_dna.reverse_complement().transcribe()
Seq(‘AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG’, IUPACUnambiguousRNA())
It is also easy to perform the reverse process, back-transcription:
>>> mrna
‘AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG’
>>> mrna.back_transcribe()
Seq(‘ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG’, IUPACUnambiguousDNA())
Warning
These functions also take care of checking that the input alphabet is the right one (e.g. DNA) and apply the right alphabet to the result (e.g. RNA).
Example. (Adapted from the Biopython tutorial.) Let’s not take a look at translation, i.e. mRNA to protein.
The code is as follows:
>>>from Bio.Seq import Seq
>>>from Bio.Alphabet import IUPAC
>>>mrna = Seq(“AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG”, IUPAC.unambiguous_rna)
>>>print mrna
AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
>>>print mrna.translate() MAIVMGR*KGAR*
Here the stop codons are indicated with an asterisk .
“*”
Now, different coding tables are available, for instance:
>>> coding_dna.translate(table=“Vertebrate Mitochondrial”) Seq(‘MAIVMGRWKGAR*’, HasStopCodon(IUPACProtein(), ‘*’))
It’s also possible to tell the method to stop at the first stop codon:
translate()
>>> coding_dna.translate()
Seq(‘MAIVMGR*KGAR*’, HasStopCodon(IUPACProtein(), ‘*’))
>>> coding_dna.translate(to_stop=True) Seq(‘MAIVMGR’, IUPACProtein())
In this case the stop codon is not included in the resulting sequence.
Note
The Genetic Codes page of the NCBI provides the full list of translation tables used by Biopython.
In order to visualize the various translation tables, you can do:
from Bio.Data import CodonTable
ttable = CodonTable.unambiguous_dna_by_name[“Standard”] print ttable
ttable = CodonTable.unambiguous_dna_by_name[“Vertebrate Mitochondrial”] print ttable
You can also visualize what are the encodings of stop and start codons:
print ttable.stop_codons print ttable.start_codons
Let’s take a look at some common sequence formats.
FASTA is the most basic format, we have used it several times already:
>sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pilin) MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAV
NNFEAHTINTVVHTNDSDKGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITID SNDLNFASSGVNKVSSTQKLSIHADATRVTGGALTAGQYQGLVSIILTKST
>sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pilin) MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVS
NTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDK NAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLVSGNYRANITITSTIK
GenBank is a richer sequence format for genes, it includes fields for various kinds of annotations (e.g. source organism, bibliographical references, database-specific accession number).
A GenBank file looks like this:
LOCUS
DEFINITION
5I3U_F
39 bp DNA
linear SYN 31–AUG–2016
ACCESSION
VERSION KEYWORDS SOURCE
ORGANISM
REFERENCE
AUTHORS
TITLE
JOURNAL PUBMED
…
FEATURES
source
Chain F, Structure Of Hiv–1 Reverse Transcriptase N–site Complex;
Catalytic Incorporation Of Aztmp To A Dna Aptamer In Crystal. 5I3U_F
5I3U_F
.
synthetic construct synthetic construct
other sequences; artificial sequences.
1 (bases 1 to 39)
Tu,X., Das,K., Han,Q., Bauman,J.D., Clark AD,Jr., Hou,X.,
Frenkel,Y.V., Gaffney,B.L., Jones,R.A., Boyer,P.L., Hughes,S.H., Sarafianos,S.G. and Arnold,E.
Structural basis of HIV–1 resistance to AZT by excision Nat. Struct. Mol. Biol. 17 (10), 1202–1209 (2010)
20852643
Location/Qualifiers 1..39
/organism=“synthetic construct”
/mol_type=“other DNA”
/db_xref=“taxon:32630”
ORIGIN
1 taatacncnc ccttcggtgc tttgcaccga agggggggn
//
UniProt (obtained by combining SwissProt, TrEMBL and PIR) is the GenBank equivalent for protein sequences:
ID 143B_HUMAN STANDARD; PRT; 245 AA. AC P31946;
DT 01–JUL–1993 (Rel. 26, Created)
DT 01–FEB–1996 (Rel. 33, Last sequence update)
DT 15–SEP–2003 (Rel. 42, Last annotation update)
DE 14–3–3 protein beta/alpha (Protein kinase C inhibitor protein–1) DE (KCIP–1) (Protein 1054).
GN YWHAB.
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
OX NCBI_TaxID=9606; RN [1]
RP SEQUENCE FROM N.A.
RC TISSUE=Keratinocytes;
RX MEDLINE=93294871; PubMed=8515476;
RA Leffers H., Madsen P., Rasmussen H.H., Honore B., Andersen A.H., RA Walbum E., Vandekerckhove J., Celis J.E.;
RT “Molecular cloning and expression of the transformation sensitive RT epithelial marker stratifin. A member of a protein family that has RT been involved in the protein kinase C signalling pathway.“;
The module provides a functions to parse all of the above (among others).
Bio.SeqIO
In particular, the returns an iterable of
SeqRecord
parse()
function takes a file handle and the format of the file (as a string) and objects:
from Bio.SeqIO import parse help(parse)
f = open(“test.fasta”)
for record in parse(f, “fasta”): print record.id, record.seq
Warning
An iterable is a function that allows to iterate over a collection of objects (similar to a list). When given an iterator, you can either use it as in the code above:
iterable = get_it_from_somewhere() # e.g. parse()
for element in iterable: do_something(element)
or extract one element at a time, with the method:
next()
iterable = get_it_from_somewhere() # e.g. parse()
first_element = iterable.next() do_something(first_element)
second_element = iterable.next() do_something(second_element)
# …
The iterable allows to iterate over a list of
objects. The
objects associate
a sequence with additional meta-data, such as human-readable descriptions, annotations,
Seq
SeqRecord
SeqRecord
and sequence features:
>>> from Bio.SeqRecord import SeqRecord
>>> help(SeqRecord)
Each object provides these attributes:
SeqRecord
- : the sequence itself, typically a object.
Seq
.seq
- : the primary identifier of the sequence – a string.
.id
- : a “common” identifier for the sequence, also a string.
.name
- : a human readable description of the sequence.
.description
- : a of additional information.
dict
.annotations
among others.
In order to read a object from a file:
SeqRecord
>>> from Bio import SeqIO
>>> record = SeqIO.parse(open(“sequences.fasta”), “fasta”).next()
>>> record
SeqRecord(seq=Seq(‘METNCRKLVSACVQLGVQPAAVECLFSKDSEIKKVEFTDSPESRKEAASSKFFP…RPV’,
SingleLetterAlphabet()), …
>>> record.id ‘Q99697’
>>> record.name ‘Q99697’
>>> record.annotations
{}
Here annotations.
record.annotations
is empty because the FASTA format does not really support sequence
The very nice thing about the
and
mechanism is that they allow to process all
of the different formats described above (FASTA, GenBank, UniProt) using the very same code!
SeqRecord
SeqIO
Let’s see what happens with a GenBank file:
>>> record = SeqIO.parse(open(“data.genbank”), “genbank”).next()
>>> record
SeqRecord(seq=Seq(‘TAATACNCNCCCTTCGGTGCTTTGCACCGAAGGGGGGGN’, IUPACAmbiguousDNA()), …
>>> from pprint import pprint
>>> pprint(record.annotations)
{‘accessions’: [‘5I3U_F’],
‘data_file_division’: ‘SYN’, ‘date’: ’31-AUG-2016′,
‘keywords’: [”],
‘organism’: ‘synthetic construct’,
‘references’: [Reference(title=’Structural basis of HIV-1 resistance to AZT by excision’,
…)],
‘source’: ‘synthetic construct’,
‘taxonomy’: [‘other sequences’, ‘artificial sequences’], ‘topology’: ‘linear’}
This way, you can (almost) forget about the format of the sequence data you are working with
– all the parsing is done automatically by Biopython.
Finally, the
SeqIO.write()
method allows to write a
objects or a list of
objects to disk, in a given format:
SeqRecord
SeqRecord
fr = open(“data.genbank”, “r”)
record = SeqIO.parse(fr, “genbank”).next()
fw = open(“seq1.fasta”, “w)
SeqIO.write(record, fw, “fasta”)
Biopython: Structures
Biopython allows to manipulate polypeptide structures via the module.
Bio.PDB
We will focus on data coming from the Protein Data Bank (PDB): http://www.rcsb.org/pdb/home/home.do
The PDB is by far the largest protein structure resource available online. At the time of writing, it hosts more thank 120k distinct protein structures, including protein-protein, protein-DNA, protein-RNA complexes.
In order to load the required code, just type:
from Bio.PDB import *
Note
For the documentation of the module, see:
Bio.PDB
http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ
Note
In the following we will use the Zaire Ebolavirus glicoprotein structure here:
5JQ3
http://www.rcsb.org/pdb/explore/explore.do?structureId=5JQ3 This is what this structure looks like:
, available
Protein Structure File Formats
The PDB distributes protein structures in two different formats:
- The XML-based file format. It is not supported by Biopython.
.cif
mmcif
- The
file format. (The actual file extension is
.) See for instance:
https://files.rcsb.org/view/5JQ3.cif
This is the most recent and “parsable” file format.
- The file format, which is a specially formatted text file. See for instance:
pdb
https://files.rcsb.org/view/5JQ3.pdb
This is the historical file format. It is (mostly) still distributed because of legacy tools that do
not (yet) understand the newer and cleaner files.
mmcif
Warning
Some (many?) PDB files distributed by the Protein Data Bank contain formatting errors
that make them ambiguous or difficult to parse.
The module attempts to deal with these errors automatically.
Bio.PDB
Of course, Biopython is not perfect, and some formatting errors may still make it do the wrong thing, or raise an exception.
Depending on your goal, you can either:
manually correct the structure, e.g. if you have to analyze a single structure
ignore the error, e.g. when you are just computing global statistics about a large number of structures, and a few errors do not really impact the overall numbers
We will see how to ignore these “spurious” errors soon.
pdb
The the
mmcif
Bio.PDB
module implements two different parsers, one for the format. Let’s try them out.
format and one for
Starting with:
Bio.MMCIF.MMCIFParser
>>> from Bio.PDB import *
To load a
cif
file, use the
module:
>>> parser = MMCIFParser()
>>> struct = parser.get_structure(“5jq3”, “5jq3.cif”) # four warnings
>>> struct
<Structure id=5jq3>
>>> print type(struct)
<class ‘Bio.PDB.Structure.Structure’>
To load a
Bio.PDB.PDBParser
pdb
file, use the
module:
>>> parser = PDBParser()
>>> struct = parser.get_structure(“5jq3”, “5jq3.pdb”) # four warnings
>>> struct
<Structure id=5jq3>
>>> print type(struct)
<class ‘Bio.PDB.Structure.Structure’>
As you can see, both parsers give you the same kind of object, a .
Structure
The header of the structure is a which stores meta-data about the protein structure,
dict
including its resolution (in Angstroms), name, author, release date, etc. To access it, write:
>>> print struct.header.keys()
[‘structure_method’, ‘head’, ‘journal’, ‘journal_reference’, ‘compound’, ‘keywords’, ‘name’, ‘author’, ‘deposition_date’,
‘release_date’, ‘source’, ‘resolution’, ‘structure_reference’]
>>> print struct.header[“name”]
crystal structure of ebola glycoprotein
>>> print struct.header[“release_date”] 2016-06-29
>>> print struct.header[“resolution”]
2.23 # Å
Warning
The header dictionary may be an empty, depending on the structure and the parser. Try it
out with the parser!
mmcif
Note
You can download the required files directly from the PBD server with:
>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file(‘1FAT’) Downloading PDB structure ‘1FAT’…
‘/home/stefano/scratch/fa/pdb1fat.ent’
The file will be put into the current working directory.
The object also allows you to retrieve a full copy of the PDB database, if you wish
PDBList
to do so. Note that the database takes quite a bit of disk space.
Structures as Hierarchies
Structural information is natively hierarchical.
The hierarchy is represented by Structural Bioinformatics FAQ):
Bio.PDB
as follows (image taken from the Biopython
Example. For instance, the PDB structure is composed of:
5JQ3
A single model, containing Two chains:
chain A, containing 338 residues chain B, containing 186 residues
Each residue is composed by multiple atoms, each having a 3D position represented as by (x, y, z) coordinates.
More specifically, here are the related classes:
Bio.PDB
A describes one or more models, i.e. different 3D conformations of the very
Structure
same structure (for NMR). The
Structure.get_models()
method returns an iterator over the models:
>>> it = struct.get_models()
>>> it
<generator object get_models at 0x7f318f0b3fa0>
>>> models = list(it)
>>> models
[<Model id=0>]
>>> type(models[0])
<class ‘Bio.PDB.Model.Model’>
Here I turned the iterator into a proper list with the function.
Model
dict
list()
A
Structure
object:
also acts as a
: given a model ID, it returns the corresponding
>>> struct[0]
<Model id=0>
A
Model
The
>>> chains = list(models[0].get_chains())
>>> chains
[<Chain id=A>, <Chain id=B>]
>>> type(chains[0])
<class ‘Bio.PDB.Chain.Chain’>
>>> type(chains[1])
<class ‘Bio.PDB.Chain.Chain’>
Model.get_chain()
describes exactly one 3D conformation. It contains one or more chains. method returns an iterator over the chains:
A also acts as a
dict
Model
mapping from chain IDs to
objects:
>>> model = models[0]
>>> model
<Model id=0>
>>> model[“B”]
<Chain id=B>
A describes a proper polypeptide structure, i.e. a consecutive sequence of bound
Chain
residues. The
Chain.get_residues()
method returns an iterator over the residues:
>>> residues = list(chains[0].get_residues())
>>> len(residues) 338
>>> type(residues[0])
<class ‘Bio.PDB.Residue.Residue’>
A
Residue
Chain
The atoms:
Residue.get_atom()
holds the atoms that belong to an amino acid.
(unfortunately, “atom” is singular) returns an iterator over the
>>> atoms = list(residues[0].get_atom())
>>> atoms
[<Atom N>, <Atom CA>, <Atom C>, <Atom O>, <Atom CB>, <Atom OG>]
>>> type(atoms[0])
<class ‘Bio.PDB.Atom.Atom’>
An holds the 3D coordinate of an atom, as a :
Vector
Atom
>>> atoms[0].get_vector()
<Vector -68.82, 17.89, -5.51>
The x, y, z coordinates in a can be accessed with:
Vector
>>> vector = atoms[0].get_vector()
<Vector -68.82, 17.89, -5.51>
>>> coords = list(vector.get_array())
[-68.81500244, 17.88899994, -5.51300001]
The other methods of the
norm()
Vector
class allow to compute the
with another vector,
compute its
angle()
, map it to the unit sphere with
, and multiply the
coordinates by a matrix with .
left_multiply()
normalize()
Example. To iterate over all atoms in a structure, type:
parser = PDBParser()
struct = parser.get_structure(“target”, “5jq3.pdb”)
for model in struct:
for chain in model:
for res in chain:
for atom in res: print atom
and to print a single atom:
>>> struct[0][“B”][502][“CA”].get_vector()
<Vector -64.45, 8.02, 1.62>
There are also shortcuts to skip the middle levels:
for res in model.get_residues(): print residue
and you can go back “up” the hierarchy:
model = chain.get_parent()
Finally, you can inspect the properties of atoms and residues using the methods described here:
http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ#analysis
The meaning of residue IDs are not obvious. From the Biopython Structural Bioinformatics FAQ:
What is a residue id?
This is a bit more complicated, due to the clumsy PDB format. A residue id is a tuple with three elements:
The hetero-flag: this is plus the name of the hetero-residue (e.g. ‘H_GLC’ in the
“H_”
case of a glucose molecule), or ‘W’ in the case of a water molecule. The sequence identifier in the chain, e.g. 100
The insertion code, e.g. ‘A’. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could e.g. have sequence identifiers and insertion codes as follows: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure. The id of the above glucose residue would thus be (‘H_GLC’, 100, ‘A’). If the hetero-flag and insertion code are blank, the sequence identifier alone can be used:
# Full id
residue = chain[(‘ ‘, 100, ‘ ‘)]
# Shortcut id
residue = chain[100]
The reason for the hetero-flag is that many, many PDB files use the same sequence identifier for an amino acid and a hetero-residue or a water, which would create obvious problems if the hetero-flag was not used.
Writing a file
pdb
pdb
Given a
Structure
object, it is easy to write it to disk in
format:
struct = … # a Structure
io = PDBIO()
io.set_structure(struct) io.save(“output.pdb”)
And that’s it!
Structure Alignment
In order to be comparable, structures must first be aligned: superimposed on top of each other.
Note
Structure alignment is very common problem.
Say you have a wildtype protein and an SNP mutant, and you would like to compare the structural consequences of the mutation. Then in order to compare the structure, first you want them to match “as closely as possible”.
It also occurs in protein structure prediction: in order to evaluate the quality of a structural predictor (a software that takes the sequence of a protein and attempts to figure out how the structure looks like), the predictions produced by the software must be compared to the real corresponding protein structures. The same goes, of course, for dynamical molecular simulation software.
For a quick overview of several superposition methods, see: https://en.wikipedia.org/wiki/Structural_alignment
The most basic superposition algorithms work by rototranslating the two proteins so to maximize their alignment.
In particular, the quality of the alignment is measured by the Root Mean Squared Error (RMSD), defined as:
RM SD(s, s′) = 1
√
n
n
∑ ∥xi
− x’i ∥2
i=1
This is a least-squares minimization problem, and is solved using standard decomposition techniques.
Warning
This is a useful, but naive interpretation of the structural alignment problem. It may not be suitable to the specific task.
For instance, it is well known that there is a correlation between the RMSD values and the length of the protein, which makes average RMSD values computed over different proteins very biased.
Different models of structure alignment avoid this issue, such as TM-score: http://zhanglab.ccmb.med.umich.edu/TM-score/
We will not cover them here.
The
Bio.PDB
module supports RMSD-based structural alignment via the
class,
let’s see how.
Superimposer
First, let’s extract the list of atoms from the two chains of our structure:
>>> atoms_a = list(struct[0][“A”].get_atoms())
>>> atoms_b = list(struct[0][“B”].get_atoms())
We make sure that the two lists of atoms are of the same length; the work otherwise:
Superimposer
does not
>>> atoms_a = atoms_a[:len(atoms_b)]
Now we create a object and specify the two lists of atoms that should be aligned:
Superimposer
>>> si = Superimposer()
>>> si.set_atoms(atoms_a, atoms_b)
Here the atoms
are fixed, while the atoms
are movable, i.e. the second
structure is aligned on top of the first one.
atoms_b
atoms_a
The superposition occurs automatically, behind the scene. We can access the minimum RMSD
value from the object with:
si
>>> print si.rms 26.463458137
as well as the rotation matrix and the translation vector:
>>> rotmatrix = si.rotran[0]
>>> rotmatrix
array([[-0.43465167, 0.59888164, 0.67262078],
[ 0.63784155, 0.73196744, -0.23954504],
[-0.63579563, 0.32490683, -0.70014246]])
>>> transvector = si.rotran[1]
>>> transvector
Finally, we can apply the optimal rototranslation to the movable atoms:
array([–92.66668863, 31.1358793 , 17.84958153])
>> si.apply(atoms_b)
Conclusion and Next Steps
In this tutorial, we covered several key concepts in Biopython and bioinformatics, including:
- Biopython Basics: We learned how to manipulate biological sequences, perform sequence alignments, access biological databases, and visualize biological data using libraries like matplotlib.
- Sequence Analysis: We covered basic sequence analysis tasks such as counting nucleotides, calculating GC content, and translating DNA sequences.
- Sequence Alignment: We learned how to perform pairwise and multiple sequence alignments using Biopython’s
Bio.pairwise2
andBio.Align
modules. - Phylogenetic Analysis: We covered how to build phylogenetic trees from sequence data using Biopython’s
Bio.Phylo.TreeConstruction
module. - Biological Database Access: We learned how to access and retrieve data from biological databases like NCBI and UniProt using Biopython’s
Bio.Entrez
andBio.ExPASy
modules.