Fundamentals of Homology Modeling in Bioinformatics

March 14, 2024, by admin

Introduction to Homology Modeling

Homology modeling, also known as comparative modeling, is a computational method used to predict the three-dimensional structure of a protein based on its similarity to known protein structures. The principles of homology modeling are based on the assumption that proteins with similar sequences are likely to adopt similar structures and functions. Here are the key principles of homology modeling:

Sequence Alignment: The first step in homology modeling is to align the target protein sequence with one or more template protein sequences whose structures have been experimentally determined.
Homology Assessment: The degree of sequence similarity between the target protein and the template(s) is assessed using bioinformatics tools such as BLAST or PSI-BLAST. Higher sequence similarity indicates a closer evolutionary relationship and higher likelihood of structural similarity.
Model Building: Once a suitable template is identified, the three-dimensional structure of the target protein is predicted by constructing a model that mimics the structure of the template. This is typically done using software that can generate coordinates for the atoms in the model.
Model Refinement: The initial model is refined to improve its quality and accuracy. This may involve adjusting the backbone and side-chain conformations, optimizing hydrogen bonding, and minimizing steric clashes.
Validation: The final model is validated using various techniques to ensure its reliability and accuracy. This may include assessing the stereochemical quality, checking for structural errors, and comparing the model to experimental data if available.
Application: The homology model can be used for various purposes, such as understanding protein function, predicting the effects of mutations, and designing novel proteins with specific properties.

Homology modeling is a powerful tool in structural biology and drug discovery, as it allows researchers to study proteins that are difficult to crystallize or experimentally characterize. However, it is important to note that homology models are predictions and should be interpreted with caution, especially if the sequence similarity between the target and template is low.

Importance and applications in bioinformatics and structural biology

Homology modeling is of great importance in bioinformatics and structural biology due to its wide range of applications. Here are some key aspects:

Structural Prediction: Homology modeling allows researchers to predict the three-dimensional structures of proteins based on their amino acid sequences. This is particularly useful for proteins whose structures have not been experimentally determined.
Functional Annotation: By predicting the structure of a protein, homology modeling can provide insights into its function. This information is valuable for understanding biological processes and designing experiments to study protein function.
Drug Discovery: Homology modeling is used in drug discovery to predict the structure of a protein target, such as a receptor or enzyme, that is involved in a disease. This information can be used to design small molecules that bind to the target and modulate its activity, leading to the development of new drugs.
Protein Engineering: Homology modeling can be used to design novel proteins with specific functions or properties. By predicting the structure of a protein and making targeted modifications to its sequence, researchers can create proteins with improved stability, activity, or binding specificity.
Evolutionary Studies: Comparing homology models of related proteins can provide insights into their evolutionary relationships and the structural basis of their differences in function. This information can help researchers understand how proteins evolve and adapt to different environments.
Structure-Function Relationships: Homology modeling can be used to study the relationship between protein structure and function. By comparing the structures of proteins with different functions but similar structures, researchers can identify key structural features that are important for function.

Overall, homology modeling is a versatile tool that has a wide range of applications in bioinformatics and structural biology. It allows researchers to study proteins at the atomic level, providing insights into their structure, function, and evolution.

Overview of comparative protein structure modeling

Comparative protein structure modeling, also known as homology modeling, is a computational technique used to predict the three-dimensional structure of a protein from its amino acid sequence and the known structure of one or more related proteins (templates). Here is an overview of the process:

Sequence Alignment: The first step is to align the target protein sequence with the sequences of one or more template proteins whose structures have been experimentally determined. This alignment is crucial for accurately predicting the structure of the target protein.
Template Selection: The template protein(s) with the highest sequence similarity to the target protein are selected. The quality of the alignment between the target and template sequences is assessed to ensure that it is reliable.
Model Building: Based on the alignment, a preliminary model of the target protein is built using the coordinates of the atoms in the template structure(s). The model is constructed by transferring the coordinates of equivalent residues from the template to the target sequence.
Model Refinement: The initial model is refined to improve its quality and accuracy. This may involve adjusting the backbone and side-chain conformations, optimizing hydrogen bonding, and minimizing steric clashes.
Validation: The final model is validated using various techniques to ensure its reliability and accuracy. This may include assessing the stereochemical quality, checking for structural errors, and comparing the model to experimental data if available.
Application: The homology model can be used for various purposes, such as understanding protein function, predicting the effects of mutations, and designing novel proteins with specific properties.

It’s important to note that while homology modeling is a powerful tool, the accuracy of the predicted model depends on the quality of the sequence alignment and the similarity between the target and template proteins. Therefore, the model should be interpreted with caution, especially if the sequence similarity is low.
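
As a rough illustration of what "similarity between the target and template" means in practice, the sketch below computes percent identity from a pairwise alignment that has already been produced. The function name and the toy sequences are assumptions made for this example, and counting identity over aligned (non-gap) columns is only one of several conventions in use. An often-quoted rule of thumb is that models become much less reliable once identity falls below roughly 30%, but that threshold is a guideline rather than a hard limit.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over the columns where neither sequence has a gap.

    Both inputs must come from the same alignment, so they have equal
    length and use '-' for gaps.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")

    aligned_columns = 0
    identical_columns = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are not counted in this convention
        aligned_columns += 1
        if a == b:
            identical_columns += 1

    if aligned_columns == 0:
        return 0.0
    return 100.0 * identical_columns / aligned_columns


# Toy alignment (hypothetical sequences, not real proteins)
target = "MKT-AYIAKQR"
template = "MKTLAYLAK-R"
print(f"{percent_identity(target, template):.1f}% identity")
```

Other conventions divide by the length of the shorter sequence or by the full alignment length, so it is worth stating which one is used when reporting a value.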

Understanding Protein Structure

Basics of protein structure

Proteins are large, complex molecules that play essential roles in the structure, function, and regulation of cells and tissues. Understanding the basics of protein structure is fundamental to many areas of biology, including bioinformatics and structural biology. Here are the key components of protein structure:

Primary Structure: The primary structure of a protein refers to the linear sequence of amino acids linked together by peptide bonds. The sequence is specified by the nucleotide sequence of the corresponding gene, which is translated according to the genetic code.
Secondary Structure: Secondary structure refers to the local spatial arrangement of the polypeptide backbone of a protein. The two most common types of secondary structure are alpha helices and beta sheets, which are stabilized by hydrogen bonds between the backbone atoms.
Tertiary Structure: Tertiary structure refers to the three-dimensional arrangement of the entire polypeptide chain of a protein. It is determined by the interactions between amino acid side chains, such as hydrogen bonds, hydrophobic interactions, and disulfide bonds.
Quaternary Structure: Quaternary structure refers to the arrangement of multiple protein subunits (polypeptide chains) that come together to form a functional protein complex. Each subunit may contribute to the overall structure and function of the complex.
Protein Folding: Protein folding is the process by which a protein adopts its functional three-dimensional structure. It is driven by the interactions between amino acid side chains and is guided by the primary structure of the protein.
Protein Domains: Protein domains are structural and functional units within a protein that can fold independently and often have specific functions. Domains are typically composed of 50-350 amino acids and can be found in a variety of proteins.
Protein Conformation: The conformation of a protein refers to its overall three-dimensional shape, which is determined by its primary, secondary, tertiary, and quaternary structure. Conformational changes can affect protein function and regulation.

Understanding protein structure is crucial for studying protein function, predicting protein properties, and designing drugs that target specific proteins. Experimental techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy are used to determine protein structures, often at or near atomic resolution, while computational modeling is used to predict and analyze structures that have not been determined experimentally.

Primary, secondary, tertiary, and quaternary structures

Proteins are biological macromolecules made up of amino acid residues linked together by peptide bonds. The structure of a protein is hierarchical, with four levels of organization: primary, secondary, tertiary, and quaternary structure.

Primary Structure: The primary structure of a protein is the linear sequence of amino acids in a polypeptide chain. It is encoded by the corresponding gene and is critical for the folding and function of the protein. Even a single change in the primary structure, such as one caused by a point mutation, can have profound effects on protein function.
Secondary Structure: Secondary structure refers to the local folding patterns that occur within a polypeptide chain. The two most common types of secondary structure are alpha helices and beta sheets. Alpha helices are right-handed coils stabilized by hydrogen bonds between the backbone amide hydrogen and carbonyl oxygen atoms. Beta sheets are formed by hydrogen bonding between adjacent strands of the polypeptide chain, creating a sheet-like structure.
Tertiary Structure: Tertiary structure refers to the overall three-dimensional conformation of a protein. It is determined by the interactions between amino acid side chains (e.g., hydrogen bonds, hydrophobic interactions, disulfide bonds) and is critical for the protein’s function. Tertiary structure is often described in terms of domains, which are distinct structural units within a protein that fold independently and often have specific functions.
Quaternary Structure: Quaternary structure refers to the arrangement of multiple protein subunits (polypeptide chains) in a multi-subunit complex. Each subunit may contribute to the overall structure and function of the complex. Quaternary structure is important for the function of many proteins, including enzymes, antibodies, and membrane proteins.

Overall, the four levels of protein structure are interconnected and together determine the overall shape, stability, and function of a protein. Understanding protein structure at these different levels is crucial for understanding protein function and for the design of therapeutics and other biotechnological applications.

Protein folding and stability

Protein folding is the process by which a protein adopts its three-dimensional structure, known as its native conformation. This process is guided by the protein’s primary structure, which determines how the amino acid sequence will fold into a functional protein. Protein folding is driven by various interactions, including hydrogen bonds, van der Waals forces, hydrophobic interactions, and disulfide bonds.

The folding process can be simplified into several key steps:

Primary Structure: The linear sequence of amino acids in a protein determines its folding pathway. The sequence dictates which amino acids are adjacent to each other in the chain, influencing how they interact during folding.
Secondary Structure Formation: The protein begins to fold into local structures, such as alpha helices and beta sheets, stabilized by hydrogen bonds between the backbone atoms.
Tertiary Structure Formation: The secondary structures further fold and pack together to form the protein’s overall three-dimensional shape. This process is driven by hydrophobic interactions, van der Waals forces, and additional hydrogen bonding and disulfide bond formation.
Quaternary Structure (if applicable): Some proteins consist of multiple subunits that assemble to form a functional protein complex. The assembly of these subunits into the quaternary structure adds another level of folding and stability to the protein.

Protein stability refers to the tendency of a protein to maintain its native conformation under a given set of conditions, such as temperature, pH, and presence of denaturants. Factors that affect protein stability include:

Hydrophobic Interactions: Hydrophobic interactions play a crucial role in protein folding and stability. Water molecules tend to exclude hydrophobic residues, driving them to the protein’s core, where they are shielded from the surrounding solvent.
Hydrogen Bonds: Hydrogen bonds between amino acid residues help stabilize secondary structures, such as alpha helices and beta sheets, as well as tertiary and quaternary structures.
Disulfide Bonds: Disulfide bonds form between cysteine residues and can stabilize the protein’s tertiary and quaternary structure. Disulfide bonds are particularly important for proteins that need to maintain stability in oxidative environments.
Van der Waals Forces: Van der Waals forces are weak attractive forces between atoms and molecules. These forces help stabilize the compact structure of a protein.
pH and Temperature: Changes in pH and temperature can disrupt protein stability by affecting the interactions that stabilize the protein’s structure. Proteins have an optimal pH and temperature range for stability, outside of which they may denature and lose their function.

Understanding protein folding and stability is crucial for various fields, including biochemistry, structural biology, and drug discovery. By studying these processes, researchers can gain insights into how proteins function and how their structure can be modified for therapeutic purposes.
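
The role of hydrophobic residues described above can be made a little more concrete with a hydropathy average. The sketch below computes the mean Kyte-Doolittle hydropathy of a sequence (often called a GRAVY score). It is a crude, sequence-only proxy: it says nothing about where residues sit in the folded structure and should not be read as a stability prediction. The function name and example sequences are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values: positive = hydrophobic, negative = hydrophilic
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean Kyte-Doolittle hydropathy over all residues of a sequence."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence is empty")
    unknown = set(sequence) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


# Hypothetical peptides: one rich in aliphatic residues, one rich in charged residues
print(gravy("LIVLAVIL"))   # positive: hydrophobic on average
print(gravy("DKEDRKEE"))   # negative: hydrophilic on average
```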

Protein Structure Databases

Introduction to protein structure databases (e.g., PDB)

Protein structure databases are repositories that store experimentally determined three-dimensional structures of proteins and other biological macromolecules. One of the most widely used and comprehensive protein structure databases is the Protein Data Bank (PDB). Here’s an introduction to PDB and other related databases:

Protein Data Bank (PDB): The PDB is a freely accessible database that archives experimentally determined three-dimensional structures of proteins, nucleic acids, and complex assemblies. It provides a wealth of information about the structures of biological macromolecules, including atomic coordinates, experimental methods used for structure determination, and related metadata. Researchers use the PDB to study protein structure, function, and interactions, as well as for drug discovery and design.
Research Collaboratory for Structural Bioinformatics (RCSB) PDB: The RCSB PDB is the US member of the worldwide PDB (wwPDB) partnership, whose members jointly handle the deposition, processing, and distribution of PDB data. It provides a user-friendly interface for searching and accessing PDB structures, as well as tools for visualizing and analyzing protein structures.
Protein Data Bank in Europe (PDBe): PDBe, hosted by the European Bioinformatics Institute (EMBL-EBI), is the European counterpart to the RCSB PDB and a fellow member of the worldwide PDB (wwPDB) partnership. It provides access to the same PDB data through its own web interface and services, and offers additional resources and tools for analyzing protein structures.
Structural Classification of Proteins (SCOP): SCOP is a database that classifies protein structures into a hierarchy of structural domains based on their structural and evolutionary relationships. It helps researchers understand the relationships between protein structures and functions.
CATH (Class, Architecture, Topology, Homology) Database: Similar to SCOP, the CATH database provides a hierarchical classification of protein domain structures based on their architecture, topology, and homologous relationships. It is useful for studying protein structure evolution and function.
Protein Model Portal: The Protein Model Portal is a resource that provides access to comparative protein structure models generated by different modeling methods. It allows users to search for and compare protein models based on their sequence or structure similarity.

These databases play a crucial role in bioinformatics and structural biology, providing researchers with valuable resources for studying protein structure, function, and evolution.

Accessing and retrieving protein structures

Accessing and retrieving protein structures from databases like the Protein Data Bank (PDB) can be done through various methods, including web interfaces, APIs, and software tools. Here’s a general overview of how you can access and retrieve protein structures:

Web Interfaces:
- Visit the website of the protein structure database you want to access (e.g., RCSB PDB or PDBe, the Protein Data Bank in Europe).
- Use the search functionalities provided to search for specific proteins or structures of interest.
- View and download the structures directly from the website.
Programmatic Access (APIs):
- Many protein structure databases provide APIs (Application Programming Interfaces) that allow you to programmatically access and retrieve data.
- Use programming languages like Python, Perl, or Java to interact with the API and retrieve the desired protein structures.
Software Tools:
- There are several software tools available that can help you access and retrieve protein structures from databases.
- Examples include PyMOL, Chimera, and VMD, which allow you to visualize and analyze protein structures, as well as download them from databases.
Download Formats:
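- Protein structures can be downloaded in various formats, such as the legacy PDB format (a fixed-column text format still read by most tools), PDBx/mmCIF (the macromolecular Crystallographic Information File format, which is now the standard format of the PDB archive), and others.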
- Choose the appropriate format for your needs based on the software or analysis you plan to perform.
Metadata and Annotations:
- Protein structure databases often provide additional metadata and annotations for each structure, such as experimental methods used for structure determination, resolution, and related publications.
- Use this information to understand the context and quality of the structure you are retrieving.

It’s important to cite the original source (database and relevant publication) when using retrieved protein structures in your research or publications to acknowledge the efforts of the researchers who determined the structures.
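
For programmatic access, a minimal sketch using only the Python standard library is shown below. It assumes the RCSB file-download URL pattern `https://files.rcsb.org/download/<ID>.<ext>`, which is in use at the time of writing but should be checked against the RCSB documentation. The helper names `rcsb_url` and `fetch_structure` are made up for this example.

```python
from urllib.request import urlopen

# Assumed URL pattern of the RCSB file download service
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.{ext}"
EXTENSIONS = {"pdb": "pdb", "mmcif": "cif"}


def rcsb_url(pdb_id: str, fmt: str = "pdb") -> str:
    """Build the download URL for a four-character PDB ID."""
    pdb_id = pdb_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    if fmt not in EXTENSIONS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id, ext=EXTENSIONS[fmt])


def fetch_structure(pdb_id: str, fmt: str = "pdb", timeout: float = 30.0) -> str:
    """Download a structure file and return its contents as text (needs network access)."""
    with urlopen(rcsb_url(pdb_id, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(rcsb_url("2hu4"))            # legacy PDB format
print(rcsb_url("3CL0", "mmcif"))   # PDBx/mmCIF format

# Example download (commented out because it requires a network connection):
# text = fetch_structure("2HU4")
# print(text.splitlines()[0])
```

Libraries such as Biopython offer higher-level download and parsing helpers. The standard-library version is used here only to keep the example free of dependencies.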

Understanding the importance of structure quality in modeling

The quality of a protein structure is crucial in molecular modeling and related studies because it directly impacts the reliability and accuracy of the models and predictions derived from it. Here are some key reasons why structure quality is important in modeling:

Reliability of Predictions: High-quality protein structures provide a more reliable basis for predicting protein function, interactions, and dynamics. Models built on low-quality structures may lead to inaccurate or misleading results.
Drug Design and Discovery: In structure-based drug design, the accuracy of the protein structure is critical for identifying potential drug binding sites and designing molecules that interact with the protein. Errors in the structure can lead to ineffective or even harmful drug candidates.
Functional Annotation: Protein structures are often used to infer the function of unknown proteins through structural homology. High-quality structures improve the accuracy of functional annotations based on structural similarity.
Molecular Dynamics Simulations: In molecular dynamics simulations, the accuracy of the initial protein structure directly affects the validity of the simulation results. Errors in the structure can propagate and lead to unrealistic dynamics.
Protein-Protein Interactions: Understanding protein-protein interactions relies on accurate structural information. Errors in the structure can affect the interpretation of binding interfaces and interaction mechanisms.
Comparative Modeling and Homology Studies: Comparative modeling relies on high-quality template structures for accurate model building. Low-quality templates can lead to inaccurate models and incorrect functional predictions.
Data Deposition and Sharing: High-quality protein structures deposited in public databases, such as the Protein Data Bank (PDB), serve as valuable resources for the scientific community. Ensuring the quality of deposited structures enhances the reliability of research based on these data.

In summary, the quality of protein structures is essential for the reliability and accuracy of molecular modeling studies, affecting a wide range of applications in structural biology, bioinformatics, and drug discovery.
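
One quick, if partial, quality check on a candidate template is to read the experimental method and resolution from the header of a legacy PDB-format file. The sketch below assumes the usual `EXPDTA` and `REMARK   2 RESOLUTION.` record layout. The header snippet is invented for illustration and is not taken from any real entry, and the function names are hypothetical.

```python
import re
from typing import Optional


def parse_method(pdb_text: str) -> Optional[str]:
    """Return the experimental method from the first EXPDTA record, if any."""
    for line in pdb_text.splitlines():
        if line.startswith("EXPDTA"):
            return line[10:].strip()
    return None


def parse_resolution(pdb_text: str) -> Optional[float]:
    """Return the resolution in angstroms, or None if it is absent or not applicable."""
    for line in pdb_text.splitlines():
        if line.startswith("REMARK   2 RESOLUTION."):
            match = re.search(r"(\d+\.\d+)\s+ANGSTROMS", line)
            return float(match.group(1)) if match else None
    return None


# Invented header lines in the legacy PDB layout (not a real entry)
EXAMPLE_HEADER = """\
EXPDTA    X-RAY DIFFRACTION
REMARK   2
REMARK   2 RESOLUTION.    2.50 ANGSTROMS.
"""

print(parse_method(EXAMPLE_HEADER), parse_resolution(EXAMPLE_HEADER))
```

Resolution is only one indicator. The validation metrics discussed later in this article (geometry, clashes, backbone angles) give a fuller picture of template quality.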

Sequence Alignment Techniques

Sequence alignment is a fundamental technique used in bioinformatics to compare and identify similarities between sequences of DNA, RNA, or proteins. It helps in understanding evolutionary relationships, predicting structure and function, and annotating genes. There are two main types of sequence alignment: pairwise sequence alignment and multiple sequence alignment.

Pairwise Sequence Alignment: Pairwise sequence alignment is the comparison of two sequences to identify regions of similarity that may indicate functional, structural, or evolutionary relationships between the sequences. Exact pairwise alignments are computed with dynamic programming, using the Needleman-Wunsch algorithm for global alignment or the Smith-Waterman algorithm for local alignment. For searching a query sequence against a whole database, one of the most widely used methods is the Basic Local Alignment Search Tool (BLAST). BLAST uses a heuristic algorithm to find regions of local similarity between sequences, which can be useful for identifying homologous sequences and functional domains.

Multiple Sequence Alignment: Multiple sequence alignment (MSA) is the alignment of three or more sequences to identify conserved regions and patterns of similarity across all sequences. MSA is important for understanding the evolutionary history of sequences, identifying functional domains, and predicting the effects of mutations. Clustal Omega is a commonly used algorithm for multiple sequence alignment, which uses a progressive alignment approach to align sequences based on their pairwise similarities.

In both pairwise and multiple sequence alignment, the goal is to find the alignment with the best score, where matches and conservative substitutions are rewarded (typically through a substitution matrix such as BLOSUM62) and gaps (insertions or deletions) are penalized, while considering the evolutionary relationships and biological significance of the sequences. These alignments can be visualized using tools such as Jalview, which provides a graphical representation of the aligned sequences, highlighting conserved regions and gaps.
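
To show how a pairwise alignment score is actually optimized, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch algorithm). The scoring values (match +1, mismatch -1, linear gap penalty -2) are simplifying assumptions chosen for readability. Production tools normally use a substitution matrix and affine gap penalties, and BLAST itself relies on heuristics rather than this exhaustive approach.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def substitution(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best_score, aligned_a, aligned_b = needleman_wunsch("ACGT", "AGT")
print(best_score)
print(aligned_a)
print(aligned_b)
```

When several alignments share the best score, the traceback returns only one of them, preferring a match or mismatch over a gap.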

Homology Modeling Steps

Homology modeling, also known as comparative modeling, is a computational method used to predict the three-dimensional structure of a protein based on its similarity to a known protein structure (template). Here are the general steps involved in homology modeling:

Template Selection: Identify one or more protein structures that are closely related to the target protein. The template should have a high sequence similarity and be structurally and functionally similar to the target protein.
Sequence Alignment with the Template: Align the amino acid sequence of the target protein with the sequence of the template protein. This step is crucial for mapping the target sequence onto the template structure and identifying equivalent residues.
Model Building: Build a preliminary model of the target protein based on the alignment with the template. This can be done using software that generates coordinates for the atoms in the model based on the template structure.
Model Refinement: Refine the initial model to improve its quality and accuracy. This may involve adjusting the backbone and side-chain conformations, optimizing hydrogen bonding, and minimizing steric clashes.
Evaluation of the Model Quality: Assess the quality of the final model using various validation techniques. This may include checking for stereochemical quality, evaluating the overall fold of the model, and comparing the model to experimental data if available.
Model Optimization: Iteratively refine the model based on validation results and expert knowledge. This may involve further refinement of the model structure and reassessment of the model quality.
Final Model Selection: Select the final model based on its quality and suitability for the intended application. The model can then be used for further studies, such as predicting protein-protein interactions, understanding structure-function relationships, or designing novel proteins.

Homology modeling is a powerful tool for predicting protein structures and is widely used in structural biology, drug discovery, and other areas of research. However, the accuracy of the homology model depends on the quality of the sequence alignment and the similarity between the target and template proteins.
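
The template selection step can be sketched as a simple filter-and-rank over candidate hits. Everything specific in the example below is an assumption made for illustration: the `TemplateHit` fields, the template identifiers, the numbers, and the cutoffs (30% identity, 70% coverage). In practice the choice also depends on bound ligands, conformational state, and the purpose of the model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TemplateHit:
    template_id: str    # hypothetical identifier
    identity: float     # percent sequence identity to the target
    coverage: float     # fraction of the target covered by the alignment (0 to 1)
    resolution: float   # experimental resolution in angstroms (lower is better)


def rank_templates(hits: List[TemplateHit],
                   min_identity: float = 30.0,
                   min_coverage: float = 0.7) -> List[TemplateHit]:
    """Drop weak hits, then sort by identity, coverage, and resolution."""
    usable = [h for h in hits
              if h.identity >= min_identity and h.coverage >= min_coverage]
    return sorted(usable, key=lambda h: (-h.identity, -h.coverage, h.resolution))


# Invented candidates, not real search results
candidates = [
    TemplateHit("tmpl_A", identity=62.0, coverage=0.95, resolution=2.1),
    TemplateHit("tmpl_B", identity=62.0, coverage=0.95, resolution=1.6),
    TemplateHit("tmpl_C", identity=24.0, coverage=0.99, resolution=1.2),
    TemplateHit("tmpl_D", identity=48.0, coverage=0.55, resolution=1.9),
]

for hit in rank_templates(candidates):
    print(hit.template_id, hit.identity, hit.coverage, hit.resolution)
```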

Tools and Software for Homology Modeling

Homology modeling software plays a crucial role in predicting protein structures based on sequence similarity to known structures. Here’s an overview of some popular homology modeling software and tools:

MODELLER: MODELLER is a widely used software for homology modeling developed by the Sali Lab at the University of California, San Francisco. It uses a satisfaction of spatial restraints approach to generate 3D models of proteins based on alignment with one or more template structures. MODELLER is highly versatile and allows for advanced modeling protocols and optimization options.
SWISS-MODEL: SWISS-MODEL is an automated homology modeling server developed by the Swiss Institute of Bioinformatics. It provides an easy-to-use web interface for modeling protein structures based on a single template or multiple templates. SWISS-MODEL offers a range of modeling options and provides high-quality models suitable for many applications.
Phyre2: Phyre2 is a web-based tool for protein structure prediction developed by Lawrence Kelley and colleagues in the Structural Bioinformatics Group at Imperial College London. In addition to homology modeling, Phyre2 offers ab initio modeling and fold recognition methods. It provides a user-friendly interface and is suitable for both novice and expert users.
I-TASSER: I-TASSER (Iterative Threading ASSEmbly Refinement) is a hierarchical approach to protein structure prediction developed by the Zhang Lab at the University of Michigan. It combines threading, ab initio modeling, and atomic-level structure refinement to generate accurate 3D models of proteins. I-TASSER is available as a web server and standalone software.
Rosetta: Rosetta is a suite of software tools for macromolecular modeling developed by the Baker Lab at the University of Washington. It includes modules for protein structure prediction, protein-protein docking, and protein design. Rosetta is widely used in the research community and is known for its accuracy and versatility.
PyMOL: PyMOL is a molecular visualization tool that is often used alongside homology modeling software. PyMOL itself does not build homology models, but it can be used to visualize and analyze the models generated by other tools. It is particularly useful for visualizing the 3D structures of proteins and examining their features.
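
Several of these tools take a target-template alignment as input. As an example, MODELLER reads alignments in a PIR-style format in which each entry has a `>P1;code` line, a colon-separated description line, and a sequence terminated by `*`. The sketch below only builds such text and does not call MODELLER. The ten-field layout reflects the format as commonly shown in MODELLER tutorials and should be verified against the MODELLER manual, and the codes, sequences, and residue numbers are hypothetical.

```python
def pir_entry(code: str, aligned_sequence: str, kind: str = "sequence",
              pdb_code: str = None, first_residue="", first_chain: str = "",
              last_residue="", last_chain: str = "") -> str:
    """Build one PIR-style alignment entry as text.

    kind is "sequence" for the target or, for example, "structureX"
    for a template solved by X-ray crystallography.
    """
    fields = [
        kind,
        pdb_code if pdb_code is not None else code,
        str(first_residue), first_chain,   # first residue number and chain
        str(last_residue), last_chain,     # last residue number and chain
        "",        # protein name (left empty here)
        "",        # source organism (left empty here)
        "-1.00",   # resolution placeholder, assumed to mean "not given"
        "-1.00",   # R-factor placeholder, assumed to mean "not given"
    ]
    return f">P1;{code}\n{':'.join(fields)}\n{aligned_sequence}*\n"


# Hypothetical template and target taken from the same pairwise alignment
template_entry = pir_entry("tmplA", "MKTLAYLAK-R", kind="structureX",
                           pdb_code="tmpl", first_residue=1, first_chain="A",
                           last_residue=10, last_chain="A")
target_entry = pir_entry("target", "MKT-AYIAKQR")

alignment_text = template_entry + "\n" + target_entry
print(alignment_text)
```

The resulting text would typically be saved to a `.ali` file and passed to the modeling program together with the template coordinate file.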

Model Evaluation and Validation

Model evaluation and validation are critical steps in homology modeling to ensure the quality and reliability of the predicted protein structure. Here are some key techniques used for assessing and validating homology models:

Ramachandran Plot: The Ramachandran plot is a graphical representation of the phi (ϕ) and psi (ψ) angles of amino acid residues in a protein structure. It helps in identifying regions of the protein that have unusual backbone torsion angles, which can indicate errors in the model.
MolProbity: MolProbity is a web service that provides a comprehensive analysis of protein structures, including geometry, steric clashes, and hydrogen bonding. It evaluates the quality of the model based on the principles of stereochemistry and identifies potential errors or outliers.
VERIFY3D: VERIFY3D is a method for assessing the compatibility of an atomic model (3D) with its own amino acid sequence (1D). It assigns each residue to a structural environment class based on its location in the model and scores how well the amino acid type fits that environment, using statistics derived from known protein structures. Regions with low scores may be misfolded or misaligned.
ProSA-web: ProSA-web is a web server that calculates an overall model quality score (a z-score) from knowledge-based potentials of mean force. It compares this score with the scores of experimentally determined structures of similar size to assess the model’s reliability.
Protein Structure Validation Suite (PSVS): PSVS is a software package that includes a collection of tools for validating protein structures, including PROCHECK, MolProbity, and other validation methods. It provides a comprehensive assessment of the model’s quality and helps in identifying potential errors.
Protein Data Bank Validation Reports: The Protein Data Bank (PDB) provides validation reports for deposited protein structures, which include information on geometry, bond lengths, and other quality indicators. These reports can be used to assess the quality of homology models based on comparison with experimental structures.
Visual Inspection: Visual inspection of the model using molecular visualization software (e.g., PyMOL, Chimera) is also important for identifying any obvious errors or discrepancies in the structure.

By using these techniques, researchers can assess the quality of homology models and identify areas for improvement, ultimately leading to more reliable and accurate protein structures.
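
The phi and psi angles plotted in a Ramachandran plot are ordinary dihedral (torsion) angles computed from backbone atom coordinates. A minimal sketch in plain Python follows. Phi for residue i uses the atoms C(i-1), N(i), CA(i), C(i), and psi uses N(i), CA(i), C(i), N(i+1). The helper names and the test coordinates are assumptions made for this example, and deciding which (phi, psi) pairs count as favored or outliers is left to dedicated tools such as MolProbity or PROCHECK.

```python
import math
from typing import Sequence

Point = Sequence[float]


def _sub(a: Point, b: Point):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Point, b: Point) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Point, b: Point):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0: Point, p1: Point, p2: Point, p3: Point) -> float:
    """Signed dihedral angle in degrees defined by four points."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)

    # Normalize the central bond so the projections below are simple
    length = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / length, b1[1] / length, b1[2] / length)

    # Components of the outer bonds perpendicular to the central bond
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))

    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi(c_prev: Point, n: Point, ca: Point, c: Point) -> float:
    return dihedral(c_prev, n, ca, c)


def psi(n: Point, ca: Point, c: Point, n_next: Point) -> float:
    return dihedral(n, ca, c, n_next)


# Simple geometric check with made-up coordinates (not real atoms)
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```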

Advanced Topics in Homology Modeling

Advanced topics in homology modeling involve addressing complex challenges that arise during the modeling process. Here are some key considerations:

Incorporating Structural Information from Templates:
- The target sequence may contain regions, typically loops, that have no counterpart in the template structure. These regions can be structurally important and may affect the overall fold of the protein.
- Techniques such as loop modeling and fragment insertion can be used to model these missing regions based on structural information from other templates or other sources.
Dealing with Gaps and Insertions in the Alignment:
- Gaps and insertions in the alignment can arise due to sequence variations between the target and template proteins.
- Methods such as gap modeling and loop refinement can be used to optimize the alignment and fill in gaps, ensuring that the model accurately reflects the target sequence.
Handling Template Modeling Errors:
- Template structures may contain errors or inaccuracies that can affect the quality of the model.
- Techniques such as model refinement, energy minimization, and molecular dynamics simulations can be used to improve the accuracy of the model and correct for errors in the template structure.
Incorporating Multiple Templates:
- In cases where no single template provides a good match to the target sequence, multiple templates can be used to model different regions of the protein.
- Methods such as template combination and hybrid modeling can be used to integrate structural information from multiple templates into a single model.
Modeling Protein-Protein Interactions:
- Homology modeling can be used to predict the structures of protein complexes and study protein-protein interactions.
- Techniques such as interface modeling and docking can be used to model the binding interface between proteins and predict the structure of the complex.
Model Validation and Quality Assessment:
- Rigorous validation and quality assessment are essential for ensuring the reliability of homology models.
- Advanced validation methods, such as cross-validation, ensemble modeling, and model averaging, can be used to assess the quality of the model and identify potential errors or inaccuracies.

By addressing these advanced topics, researchers can improve the accuracy and reliability of homology models and gain deeper insights into protein structure and function.

Case Studies and Applications

Homology Modeling Exercise

We will investigate the structure of the influenza virus neuraminidase protein and look at how its function may be blocked by using the neuraminidase inhibitor oseltamivir, the active ingredient in the drug Tamiflu.

The substrate of influenza virus neuraminidase is sialic acid. The structures of sialic acid and the drugs oseltamivir (in Tamiflu) and zanamivir (in Relenza) are shown below (from P.J. Collins et al., Nature 453, 1258 (2008)).

Briefly describe the similarities and differences between the enzyme substrate, sialic acid, and the two enzyme inhibitors.

The structure of an N2 neuraminidase with sialic acid bound in the active site can be found in the PDB entry 2BAT. An N1 neuraminidase with the active site blocked by oseltamivir can be found in 2HU4. Some naturally occurring mutated variants of viral neuraminidase have been found to give rise to Tamiflu-resistant strains of influenza. One such mutant is avian influenza N1 His274Tyr. According to P.J. Collins et al. (Nature 453, 1258 (2008)), this mutant binds oseltamivir with much lower affinity than the wild-type enzyme: enzyme inhibition is reduced by a factor of ~265. The structure of oseltamivir bound to His274Tyr N1 neuraminidase can be found in the PDB entry 3CL0.
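To put the ~265-fold loss of inhibition into energetic terms, the fold change can be converted into an approximate binding free-energy difference, ΔΔG = RT ln(fold change). The following is a minimal sketch, assuming that the factor of 265 can be treated as a ratio of inhibition constants and assuming a temperature of 298.15 K. The function name is illustrative and does not come from the cited paper.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ddg_from_fold_change(fold_change, temperature_k=298.15):
    """Return the free-energy penalty (kcal/mol) for a given fold loss in affinity."""
    if fold_change <= 0:
        raise ValueError("fold_change must be positive")
    return R_KCAL * temperature_k * math.log(fold_change)


# Assumption: the ~265-fold reduction is interpreted as a ratio of inhibition constants.
ddg = ddg_from_fold_change(265)
print(f"{ddg:.1f} kcal/mol")
```

Under these assumptions the penalty comes out at roughly 3.3 kcal/mol, which is on the scale of losing a few favorable contacts rather than abolishing binding altogether.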

Download the “PyMOL scene” or “PyMOL session” from this website. Save the file on your Desktop and open it in PyMOL. This PyMOL session contains:

2HU4, chain A: A wild-type N1 neuraminidase bound to oseltamivir (green)
2BAT, chain A: A wild-type N2 neuraminidase bound to sialic acid (blue)
3CL0, chain A: A His274Tyr N1 mutant bound to oseltamivir (yellow)

The proteins are shown in cartoon rendering and the ligands as sticks. You could easily have generated this “scene” yourself (and saved it as a PyMOL session), but we do not have time for that now! Actually, if you have used PyMOL a bit before and have the time, please download the PDB files and get the 3 structures into the same session. That is, do it “properly”.

It is much easier to see similarities and differences if we align the structures in 3D space. Type “align 2HU4, 2BAT” (watch the screen as you press “enter”!) to superimpose these two structures. This is PyMOL’s variant of intermolecular alignment: one of the structures is translated and rotated in order to get the lowest possible RMSD with respect to the other. Which of the two structures was moved?
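The RMSD that the superposition minimizes is the root-mean-square distance between paired atoms. Below is a minimal sketch of that calculation. It assumes the two coordinate lists are already paired atom by atom and already superimposed, so it leaves out the sequence alignment and fitting that PyMOL’s align carries out. The coordinates are made-up values for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Hypothetical paired atoms: the second set is shifted by 1 Å along z.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
mobile = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
print(rmsd(reference, mobile))
```

Because every atom in the toy example is displaced by the same 1 Å, the RMSD is also 1 Å. In a real superposition the deviations vary from atom to atom, and the fit looks for the rotation and translation that make this value as small as possible.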

Also align 3CL0 with the two others. Do sialic acid and oseltamivir bind in the same pocket on the enzyme surface? Would you describe the structures as identical? Similar? Dissimilar?

We will now look at 2HU4 only. Turn off the two other objects in the mini-menu on the right-hand side of the viewer window. You now see only wild-type N1 neuraminidase bound to oseltamivir in the active-site pocket.
“oselt-WT” is a selection containing only the oseltamivir molecule. Click on the selection to confirm this. Make a new selection containing only the residues within 8 Å of oseltamivir, i.e. the “active-site region”. To do this, type “select ActSite, oselt-WT around 8”. Hide everything and then show both oseltamivir and ActSite as “sticks”. Color them “by element”, but use different colors for the C-atoms to make it easy to distinguish oseltamivir from the protein residues. Hide everything else. You should now be able to see all the residues of the neuraminidase packing around oseltamivir (as seen below).

For the (ActSite) selection, do “L” → “residues” to get these residues labeled.
You can let PyMOL attempt to locate H-bonds between oseltamivir and the protein by doing the following for the (oselt-WT) selection: “Actions” → “find” → “polar contacts” → “to other atoms in object”. As you can see, PyMOL suggests too many H-bonds, but at least you get some idea.
List two acidic residues forming H-bonds to oseltamivir. Which basic residue is donating an H-bond to the amide group of oseltamivir? Two other basic residues contact the carboxyl group of oseltamivir. Which are they? One of them is even involved in strong, so-called “bidentate” H-bonding. Which one? List some other residues involved in van der Waals interactions packing with oseltamivir.
Turn on all three objects again, 2HU4, 2BAT, and 3CL0. Find residue 274 in the three structures. Show them as sticks. What are they? What can you say about the properties of these residues?
Color the three objects differently, but “by element”. Now you see the (aligned) active site residues of the three enzymes.
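The 8 Å “active-site region” selected in the steps above amounts to a distance cutoff between ligand atoms and protein atoms. Below is a minimal sketch of the idea in plain Python, assuming atoms are supplied as (residue label, x, y, z) tuples. The labels and coordinates are invented for illustration and are not taken from 2HU4.

```python
def residues_near_ligand(protein_atoms, ligand_coords, cutoff=8.0):
    """Return sorted labels of residues with at least one atom within `cutoff` Å of the ligand."""
    cutoff_sq = cutoff * cutoff
    near = set()
    for label, x, y, z in protein_atoms:
        for lx, ly, lz in ligand_coords:
            dist_sq = (x - lx) ** 2 + (y - ly) ** 2 + (z - lz) ** 2
            if dist_sq <= cutoff_sq:
                near.add(label)
                break  # one close ligand atom is enough for this protein atom
    return sorted(near)


# Hypothetical data: three protein atoms and a two-atom "ligand".
protein_atoms = [
    ("RES1", 3.0, 4.0, 0.0),   # 5 Å from the first ligand atom
    ("RES2", 0.0, 6.0, 0.0),   # 6 Å from the first ligand atom
    ("RES3", 20.0, 0.0, 0.0),  # far away from both ligand atoms
]
ligand_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(residues_near_ligand(protein_atoms, ligand_coords))
```

The sketch keeps a whole residue as soon as one of its atoms lies within the cutoff. In PyMOL, “around” on its own works atom by atom, and the byres operator can be used to expand such a selection to complete residues.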

Above you have identified several residues involved in forming H-bonds with oseltamivir: Glu119, Asp151, Arg152, Arg292 and Arg371. Are these residues conserved in all three neuraminidases? What can you say about the conformations/rotamers of the side chains for these residues? Why do you think these residues are conserved? What would happen if for example the mutation Glu119Trp was introduced? Would the enzyme be inhibited by oseltamivir? Would it have any activity on sialic acid? Can you find any active site residues that are not conserved between the three neuraminidases?

Locate residues 274 and Glu276. Take a closer look at the conformation of Glu276 in the 3 structures. Are you able to explain why the His274Tyr mutant binds oseltamivir less efficiently than the wild-type enzyme? Can you explain why His274Tyr is “allowed”, i.e. the enzyme has wild-type activity on its substrate? Why is this mutation particularly good for the virus and bad for the doctor trying to treat a patient with Tamiflu?

Now let us try some modeling!

You have a cousin working for Médecins Sans Frontières near Goroka in Papua New Guinea. You get this e-mail from her:

Dear cousin,

we have some serious problems here with an outbreak of an influenza-like disease. We have high mortality rates and it appears to be highly contagious. We have some indications that this might be an oseltamivir-resistant strain of influenza. I remember you told me about that bioinformatics course and perhaps you can help me with some quick modeling. It will be much faster than going to the lab and we are certainly in a hurry here 🙁

Through another contact we managed to get the neuraminidase gene from this virus sequenced:

>Possible neuraminidase [Putative influenza A virus (Goroka)]
MNPNQKILTIGSVSLSIATICFLMQIAILVSTVTLHFKQYECNSPPQNQVMLCDPTIIERNITEIVYLTNTT
IEREICPRLAEYKNWTRPQCDISGFAPYSKDNSIRLSAGGDIWVIREPYLSCDPDKCYQFILGQGSTLNNVH
SNDTVHDRTPYRTLMMNELGVPFHLGTKNVCIAWSTSSCHDGKAWLHVCVMGDDKNASATFIYNGRLVDSIV
SWSRKILRTQESECVCINGSCTVVMQDGSASGKADTKILFIEDGKILHTSTLSGSAQHVEECSCYPRYPGVK
CVCRDNWKGSNRPLVDINIDYSIVTSYVCSGLIGDTPRRNDSSSSSHCLDPQNEEGGHGVKGWAFDDGNDVW
MGRTISEKLRSGYESFKVIEGWSKPNSKLNIRQVIVERGNRSGYSGIFSVEGRSCINRCFYVELIRGRKDET
EVLWTSSSIVVFCGTSGTYGTGSWPDGADL

Can you have a look at this and give me feedback? Can structural bioinformatics tell us anything about this strain? Might it be resistant to Tamiflu?

Please help us,

Your cousin

What should you do? In this case, would you recommend trying to crystallize the protein and investigate the structure by X-ray crystallography? Why not?

You decide that you will try homology modeling for the Goroka neuraminidase. What are the 6 steps involved?

Go to the NCBI website (http://blast.ncbi.nlm.nih.gov/Blast.cgi) and do “protein blast”. Use the Goroka-sequence as a query sequence, use blastp and search the pdb-database for possible templates. Make sure you use the pdb-database, and nothing else! Do you find any templates that are suitable for homology modeling? Search for the string “2HTY” on the results page. Below the header at the sequence alignment for this hit click on “See 23 more title(s)” to find 2HU4. Is 2HU4, containing oseltamivir, a possible template? What is the sequence identity between the template 2HU4 and the target (Goroka sequence)? Does homology modeling appear to be a possibility? Will you be able to use 2HU4 for modeling the full-length protein?

The sequence alignment you got from blastp is the following, in CLUSTAL format:

CLUSTAL

Target CDISGFAPYSKDNSIRLSAGGDIWVIREPYLSCDPDKCYQFILGQGSTLNNVHSNDTVHD
Templ  CPINGWAVYSKDNSIRIGSKGDVFVIREPFISCSHLECRTFFLTQGALLNDKHSNGTVKD

Target RTPYRTLMMNELG-VPFHLGTKNVCIAWSTSSCHDGKAWLHVCVMGDDKNASATFIYNGR
Templ  RSPHRTLMSCPVGEAPSPYNSRFESVAWSASACHDGTSWLTIGISGPDNGAVAVLKYNGI

Target LVDSIVSWSRKILRTQESECVCINGSCTVVMQDGSASGKADTKILFIEDGKILHTSTLSG
Templ  ITDTIKSWRNNILRTQESECACVNGSCFTVMTDGPSNGQASYKIFKMEKGKVVKSVELDA

Target SAQHVEECSCYPRYPGVKCVCRDNWKGSNRPLVDINIDYSIVTSYVCSGLIGDTPRRNDS
Templ  PNYHYEECSCYPNAGEITCVCRDNWHGSNRPWVSFNQNLEYQIGYICSGVFGDNPRPNDG

Target SSSSHCLDPQNEEGGHGVKGWAFDDGNDVWMGRTISEKLRSGYESFKVIEGWSKPNSKLN
Templ  TGS---CGPVSSNGAYGVKGFSFKYGNGVWIGRTKSTNSRSGFEMIWDPNGWTETDSSFS

Target IRQVIVERGNRSGYSGIF----SVEGRSCINRCFYVELIRGRKDETEVLWTSSSIVVFCG
Templ  VKQDIVAITDWSGYSGSFVQHPELTGLDCIRPCFWVELIRGRPKES-TIWTSGSSISFCG

Target TSGTYGTGSWPDGADL
Templ  VNSDTVGWSWPDGAEL

How many indels (insertions/deletions) are there? Is getting the correct sequence alignment important for homology modeling? How could you improve the alignment?
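One way to check your answers is to compute the alignment statistics directly. The sketch below is a minimal example, assuming that an indel is counted as one contiguous run of gap characters in either row and that identity is reported over the columns where both sequences have a residue. Other tools may define both quantities differently. The rows are copied from the blastp alignment above, with gaps written as “-”.

```python
import re

# Alignment rows copied from the blastp result (gaps written as "-").
TARGET_ROWS = [
    "CDISGFAPYSKDNSIRLSAGGDIWVIREPYLSCDPDKCYQFILGQGSTLNNVHSNDTVHD",
    "RTPYRTLMMNELG-VPFHLGTKNVCIAWSTSSCHDGKAWLHVCVMGDDKNASATFIYNGR",
    "LVDSIVSWSRKILRTQESECVCINGSCTVVMQDGSASGKADTKILFIEDGKILHTSTLSG",
    "SAQHVEECSCYPRYPGVKCVCRDNWKGSNRPLVDINIDYSIVTSYVCSGLIGDTPRRNDS",
    "SSSSHCLDPQNEEGGHGVKGWAFDDGNDVWMGRTISEKLRSGYESFKVIEGWSKPNSKLN",
    "IRQVIVERGNRSGYSGIF----SVEGRSCINRCFYVELIRGRKDETEVLWTSSSIVVFCG",
    "TSGTYGTGSWPDGADL",
]
TEMPLATE_ROWS = [
    "CPINGWAVYSKDNSIRIGSKGDVFVIREPFISCSHLECRTFFLTQGALLNDKHSNGTVKD",
    "RSPHRTLMSCPVGEAPSPYNSRFESVAWSASACHDGTSWLTIGISGPDNGAVAVLKYNGI",
    "ITDTIKSWRNNILRTQESECACVNGSCFTVMTDGPSNGQASYKIFKMEKGKVVKSVELDA",
    "PNYHYEECSCYPNAGEITCVCRDNWHGSNRPWVSFNQNLEYQIGYICSGVFGDNPRPNDG",
    "TGS---CGPVSSNGAYGVKGFSFKYGNGVWIGRTKSTNSRSGFEMIWDPNGWTETDSSFS",
    "VKQDIVAITDWSGYSGSFVQHPELTGLDCIRPCFWVELIRGRPKES-TIWTSGSSISFCG",
    "VNSDTVGWSWPDGAEL",
]


def alignment_stats(target, template):
    """Count columns, gap-free columns, identical columns and gap runs (indels)."""
    if len(target) != len(template):
        raise ValueError("aligned sequences must have the same length")
    aligned = identical = 0
    for t, s in zip(target, template):
        if t == "-" or s == "-":
            continue  # skip columns that contain a gap
        aligned += 1
        if t == s:
            identical += 1
    indels = len(re.findall(r"-+", target)) + len(re.findall(r"-+", template))
    identity_pct = 100.0 * identical / aligned if aligned else 0.0
    return {
        "columns": len(target),
        "aligned": aligned,
        "identical": identical,
        "identity_pct": identity_pct,
        "indels": indels,
    }


target = "".join(TARGET_ROWS)
template = "".join(TEMPLATE_ROWS)
stats = alignment_stats(target, template)
print(stats)
```

Note that this identity value covers only the aligned region shown here, so it can differ slightly from the figure blastp reports, depending on how gap columns are counted.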

In order to do homology modeling for the Goroka target sequence, go to the SWISS-MODEL website (http://swissmodel.expasy.org). Click on “Start Modelling”. You can create an account, but that is not necessary for this exercise. Instead of using the default mode under “Modelling”, which is the simplest (but might give more errors), we will try “Alignment Mode”. We need a good alignment to start this job. However, due to lack of time, we will use the alignment we got from blastp above. Open the template structure in PyMOL (for example the file http://folk.uio.no/jonkl/pubstuff/2hu4ChA.pdb). Show as “cartoon” or “ribbon”. Locate the positions of the indels in the alignment above. Are they in loops/coils, as they should preferably be, or in helices/sheets? One of the indels is at the end of a beta strand, and we will move it into the loop after this strand by changing

Target IRQVIVERGNRSGYSGIF----SVEGRSCINRCFYVELIRGRKDETEVLWTSSSIVVFCG
Templ  VKQDIVAITDWSGYSGSFVQHPELTGLDCIRPCFWVELIRGRPKES-TIWTSGSSISFCG

to

Target IRQVIVERGNRSGYSGIFSVE----GRSCINRCFYVELIRGRKDETEVLWTSSSIVVFCG
Templ  VKQDIVAITDWSGYSGSFVQHPELTGLDCIRPCFWVELIRGRPKES-TIWTSGSSISFCG
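The manual edit above can also be expressed as a small string operation, which helps to confirm that only the gap moved and no residue was changed. This is a minimal sketch with a hypothetical helper name, assuming the gap is shifted to the right past residues of the same row.

```python
def shift_gap_right(aligned_row, gap_start, shift):
    """Move the gap run that starts at index `gap_start` by `shift` columns to the right."""
    if aligned_row[gap_start] != "-":
        raise ValueError("gap_start must point at a gap character")
    gap_end = gap_start
    while gap_end < len(aligned_row) and aligned_row[gap_end] == "-":
        gap_end += 1
    moved = aligned_row[gap_end:gap_end + shift]  # residues that jump over the gap
    if len(moved) < shift or "-" in moved:
        raise ValueError("not enough residues after the gap to shift over")
    gap = aligned_row[gap_start:gap_end]
    return aligned_row[:gap_start] + moved + gap + aligned_row[gap_end + shift:]


before = "IRQVIVERGNRSGYSGIF----SVEGRSCINRCFYVELIRGRKDETEVLWTSSSIVVFCG"
after = shift_gap_right(before, before.index("-"), 3)
print(after)

# The ungapped target sequence must be unchanged by the edit.
assert after.replace("-", "") == before.replace("-", "")
```

Shifting by three columns moves the residues SVE in front of the gap, which reproduces the edited target row shown above.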

Actually, what we have done here is to remove a very short alpha helix. It is only four residues and not much of an alpha helix anyway. We did this to move the indels out of the core of the protein and the main secondary structure elements. This gives us the following alignment of target and template, in CLUSTAL and FASTA format, respectively: