AI Revolution in Protein Structure Prediction

December 19, 2024 By admin

The AI Revolution in Protein Structure Prediction: Unraveling the Mysteries of Proteins

For decades, the quest to predict a protein’s three-dimensional (3D) structure from its amino acid sequence has fascinated and challenged scientists. This endeavor is essential because a protein’s structure governs its function, playing a crucial role in biological processes and drug discovery. While experimental methods like X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy have provided invaluable insights, they are often resource-intensive and time-consuming. The advent of computational techniques, particularly those harnessing artificial intelligence (AI), has transformed this field, bridging the gap between the vast number of known protein sequences and experimentally determined structures.

Traditional Approaches: Setting the Stage

Before the AI revolution, two primary computational approaches dominated the protein structure prediction landscape:

Template-Based Modeling (TBM)

TBM utilizes known protein structures as templates to predict the structure of target proteins, leveraging sequence similarity. The process involves four steps:

Identifying suitable templates.
Aligning the target sequence with the template.
Constructing an initial framework.
Refining the model.
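
The alignment step above is classically solved with dynamic programming. The following is a minimal sketch of Needleman-Wunsch global alignment, not the procedure of any particular TBM tool. The function name `global_align` and the scoring values (match +1, mismatch -1, linear gap -2) are illustrative assumptions; production pipelines typically use substitution matrices, affine gap penalties, and profile or HMM alignments.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Scoring values are illustrative assumptions, not a standard matrix.
    Returns (aligned_a, aligned_b, score).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

Calling `global_align(target, template)` returns the two gapped strings and the alignment score; the gapped strings are what the later model-building steps would consume.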

TBM can be divided into two categories:

Homology Modeling: Effective when sequence identity between target and template is roughly 30% or higher.
Threading (Fold Recognition): Useful for lower sequence similarities, employing sequence profiles or hidden Markov models (HMMs) to identify compatible templates.
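
As a rough illustration of how the 30% rule of thumb separates the two categories, the sketch below computes percent identity from a pairwise alignment and picks a strategy. The function names and the choice of denominator (aligned, non-gap columns) are assumptions made here; identity is defined in several ways in practice, and real pipelines use more evidence than a single cutoff.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_target, aligned_template)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def choose_tbm_strategy(identity, threshold=30.0):
    """Pick a TBM category from sequence identity (threshold is a rule of thumb)."""
    return "homology modeling" if identity >= threshold else "threading"


identity = percent_identity("ACD-EF", "ACDGEY")  # 4 matches over 5 aligned columns
print(identity, choose_tbm_strategy(identity))
```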

Template-Free Modeling (FM)

Also known as ab initio modeling, FM predicts structures without relying on templates. It uses coarse-grained representations, energy functions, and sampling techniques to build models from scratch. FM is the approach of choice when no suitable template exists, but its accuracy often lags behind TBM for proteins with well-characterized homologs.
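
The three ingredients named above (a coarse-grained representation, an energy function, and a sampling procedure) can be seen in miniature in the textbook HP lattice model. The sketch below is a toy under stated assumptions: residues are reduced to hydrophobic (H) or polar (P) beads on a 2D grid, the energy simply rewards non-bonded H-H contacts, and sampling is plain random generation of self-avoiding walks. It is not the fragment assembly or replica-exchange Monte Carlo used by real FM tools, and all names are hypothetical.

```python
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def random_walk(n, rng):
    """Grow a self-avoiding walk of n beads on a 2D lattice; None if it gets stuck."""
    path = [(0, 0)]
    occupied = {(0, 0)}
    for _ in range(n - 1):
        x, y = path[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in occupied]
        if not free:
            return None
        nxt = rng.choice(free)
        path.append(nxt)
        occupied.add(nxt)
    return path


def hp_energy(seq, path):
    """-1 for each pair of H beads that are lattice neighbours but not chain neighbours."""
    pos = {p: i for i, p in enumerate(path)}
    energy = 0
    for i, (x, y) in enumerate(path):
        if seq[i] != "H":
            continue
        for dx, dy in MOVES:
            j = pos.get((x + dx, y + dy))
            # j > i + 1 counts each pair once and skips bonded neighbours
            if j is not None and j > i + 1 and seq[j] == "H":
                energy -= 1
    return energy


def fold(seq, n_samples=2000, seed=0):
    """Return the lowest-energy conformation found by random sampling."""
    rng = random.Random(seed)
    best_path, best_e = None, 1  # any valid energy is <= 0
    for _ in range(n_samples):
        path = random_walk(len(seq), rng)
        if path is None:
            continue
        e = hp_energy(seq, path)
        if e < best_e:
            best_path, best_e = path, e
    return best_path, best_e
```

Even in this toy, the number of conformations grows exponentially with chain length, which is why practical FM methods invest so heavily in smarter sampling and better energy functions.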

The Game-Changing Role of Deep Learning

The integration of deep learning has revolutionized protein structure prediction, offering unprecedented accuracy and speed. Here are the key contributions of AI:

Contact and Distance Maps

Deep learning models predict residue-residue contact or distance maps, representing spatial relationships within a protein. These maps guide folding simulations. For instance, RaptorX-Contact introduced deep residual convolutional networks (ResNets) to improve contact map prediction.
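
To make the two representations concrete, here is a minimal sketch that derives a distance map and a binary contact map from known 3D coordinates, which is how training targets for such predictors could be built. The 8 Å cutoff is the commonly used contact definition; the function names, the `min_separation` parameter, and the toy coordinates are assumptions for illustration, and which atom represents each residue (for example Cα or Cβ) is a modeling choice left open here.

```python
import math


def distance_map(coords):
    """Pairwise Euclidean distances (in Å) between residue coordinates."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def contact_map(coords, cutoff=8.0, min_separation=1):
    """Binary map: 1 if two residues are closer than `cutoff` Å, else 0.

    Pairs closer than `min_separation` positions in sequence are set to 0,
    so the diagonal is excluded by default.
    """
    dmap = distance_map(coords)
    n = len(coords)
    return [[1 if abs(i - j) >= min_separation and dmap[i][j] < cutoff else 0
             for j in range(n)] for i in range(n)]


# Toy example: four residues spaced 3.8 Å apart along a straight line.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
for row in contact_map(coords):
    print(row)
```

A distance map keeps the full real-valued information, while the contact map collapses it to a yes/no answer per residue pair.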

End-to-End Folding

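AlphaFold2 by DeepMind marked a paradigm shift: a single end-to-end network predicts 3D atomic coordinates directly from the amino acid sequence and its multiple sequence alignment (MSA). For many targets its accuracy approaches that of experimental methods, reshaping the landscape of structural biology.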

Protein Language Models (PLMs)

Inspired by natural language processing, PLMs interpret protein sequences as a “language.” They generate embeddings that capture the context of each amino acid, enabling fast and efficient predictions. While PLM-based methods are faster, their accuracy still generally trails that of MSA-based approaches.

Multi-Domain Protein Prediction: A Persistent Challenge

Many proteins consist of multiple independently folding domains, making their structure prediction complex. Current methods often model each domain separately and then assemble the full-length protein. Advanced tools like D-I-TASSER and LOMETS3 aim to improve template recognition and alignment accuracy for multi-domain proteins, but challenges remain.

CASP: The Olympics of Protein Structure Prediction

The Critical Assessment of Protein Structure Prediction (CASP) evaluates and benchmarks predictive methods biennially. Metrics such as the TM-score and global distance test (GDT) score assess structural quality. CASP competitions have highlighted the dominance of AlphaFold2 and its derivatives, particularly in CASP15, where ligand-binding predictions were also introduced as a new category.
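
Both metrics can be written down compactly. The sketch below evaluates the TM-score and GDT_TS formulas for a single, already-computed superposition, taking as input the per-residue distances (in Å) between aligned model and native residues. This is a simplification: the actual scores search over superpositions to maximize the value, which is not done here. The function names are hypothetical, and the 0.5 Å value of d0 for very short chains is an assumption of this sketch.

```python
def tm_d0(length):
    """Length-dependent distance scale d0 used by the TM-score."""
    if length <= 21:
        return 0.5  # assumed floor for very short chains
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score for one fixed superposition; ranges from 0 to 1."""
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def gdt_ts(distances, target_length, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT_TS for one fixed superposition; ranges from 0 to 100."""
    fractions = [sum(1 for d in distances if d <= c) / target_length
                 for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example: five aligned residues with increasing deviation.
deviations = [0.5, 1.5, 3.0, 6.0, 10.0]
print(tm_score(deviations, 5), gdt_ts(deviations, 5))
```

Because each residue's contribution is bounded, a few badly placed residues cannot dominate either score, which is one reason these metrics are preferred over RMSD for judging global fold quality.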

The AlphaFold Protein Structure Database

AlphaFold DB, developed by DeepMind and EMBL-EBI, hosts over 200 million predicted protein structures. This freely accessible resource has become indispensable for researchers, aiding in understanding protein-ligand interactions, drug discovery, and functional annotation.

Timeline of Main Events

1970: Needleman and Wunsch publish their algorithm for sequence alignment, laying foundational work for comparing protein sequences.
1973: Anfinsen demonstrates that a protein’s amino acid sequence determines its tertiary structure, establishing the central paradigm for protein research.
1981: Smith and Waterman develop an algorithm for identifying common subsequences in protein sequences, enhancing sequence comparison.
1990: Skolnick and Kolinski perform simulations on the folding of a globular protein, a significant step in computational protein folding.
1990: Altschul, Gish, Miller, Myers, Lipman develop the BLAST algorithm, a fundamental tool in bioinformatics for finding similar sequences in databases.
1991: Chiu and Kolodziejczak infer consensus structure from nucleic acid sequences, which laid groundwork for structural inference techniques.
1992: Kuntz publishes his work on structure-based strategies for drug design and discovery.
1993: Šali and Blundell describe their method of comparative protein modeling by satisfaction of spatial restraints.
1994: The Critical Assessment of Protein Structure Prediction (CASP) is established by John Moult and others to objectively evaluate protein structure prediction techniques.
1997: The first version of Rosetta modeling software is released by David Baker’s group, a key development in free modeling.
1997: Fischer and Eisenberg assign folds to the proteins encoded by the genome of Mycoplasma genitalium.
1997: Sánchez and Šali evaluate comparative protein structure modeling by MODELLER-3.
2001: Jones develops FRAGFOLD for predicting novel protein folds.
2002: Zhang and Kuntz describe using homology models to determine the structure and function of G protein-coupled receptors.
2003: Schwede, Kopp, Guex, and Peitsch develop the SWISS-MODEL server for automated protein homology modeling.
2003: Kim, Chivian, and Baker develop the Robetta server for protein structure prediction and analysis.
2004: Zhang and Skolnick develop the TASSER method for automated structure prediction of weakly homologous proteins.
2004: Evers and Klebe apply virtual screening to identify an antagonist of the neurokinin-1 receptor using a ligand-supported homology model.
2004: Rohl, Strauss, Misura, and Baker publish work on protein structure prediction with Rosetta.
2005: Söding, Biegert, and Lupas release the HHpred server for protein homology detection and structure prediction.
2005: Capriotti, Fariselli, and Casadio publish work on predicting stability changes upon mutation from the protein sequence or structure.
2005: The Universal Protein Resource (UniProt) is formalized as a standard resource.
2006: Klebe reviews virtual ligand screening strategies.
2006: Vajda and Guarnieri investigate protein-ligand interaction sites.
2007: Malmström, Riffle, Strauss, Chivian, Davis, Bonneau, and Baker assign superfamilies for the yeast proteome by integrating structure prediction with the Gene Ontology.
2007: Wu and Zhang develop the LOMETS server for protein structure prediction.
2007: Zhang and Kuntz describe using homology models to determine the structure and function of G protein-coupled receptors.
2008: Cheng develops a multi-template combination algorithm for protein comparative modeling.
2009: Kelley and Sternberg develop the Phyre server for protein structure prediction.
2009: Loewenstein, Raimondo, Redfern, Watson, Frishman, Linial, Orengo, Thornton and Tramontano publish on protein function annotation by homology based inference.
2009: Tokuriki and Tawfik study the stability effects of mutations and protein evolvability.
2009: Zhang, Thiele, Weekes, Li, Jaroszewski, Ginalski, Deacon, Wooley, Lesley, Wilson, et al. determine three-dimensional structures for the central metabolic network of Thermotoga maritima.
2010: Roy, Kucukural, and Zhang develop the I-TASSER platform for protein structure and function prediction.
2010: Xu and Zhang develop an ab initio protein structure assembly method.
2010: Xu and Zhang publish work on assessing protein structure similarity with TM-score.
2011: Peng and Xu publish a method for exploiting structure information for protein alignment by statistical inference.
2011: Yang and Zhou develop techniques to improve protein fold recognition and template based modeling using predicted structural properties.
2011: Shan et al. describe how a drug molecule finds its target binding site.
2011: Mukherjee, Szilagyi, Roy, and Zhang perform genome-wide protein structure predictions.
2011: Liu, Jiang, and Li develop a hybrid approach for 3D molecular similarity calculation, SHAFTS.
2012: Zhou and Skolnick develop a structure-based small molecule virtual screening approach, FINDSITEX.
2012: Roy and Zhang develop a method for recognizing protein-ligand binding sites.
2012: Xu and Zhang apply ab initio structure prediction to Escherichia coli, a step toward genome-wide protein structure modeling and fold assignment.
2012: Cheng develops the MULTICOM toolbox for protein structure prediction.
2013: Song, DiMaio, Wang, Kim, Miles, Brunette, Thompson, and Baker develop RosettaCM for high-resolution comparative modeling.
2013: Xue, Xu, Wang and Zhang develop ThreaDom for extracting protein domain boundary information from multiple threading alignments.
2014: Pires, Ascher, and Blundell develop mCSM for predicting the effects of mutations in proteins.
2014: Piana, Klepeis, and Shaw assess the accuracy of physical models used in protein folding simulations.
2014: Xu, Jaroszewski, Li, and Godzik develop FFAS-3D for improving fold recognition.
2014: Ma, Wang, Wang, and Xu publish work on protein homology detection using Markov Random Fields.
2015: Porta-Pardo, Hrabe, and Godzik develop Cancer3D, a tool for understanding cancer mutations through protein structures.
2015: Yang, Yan, Roy, Xu, Poisson, and Zhang release the I-TASSER Suite for protein structure and function prediction.
2015: Meier and Söding publish on automatic prediction of protein structures by probabilistic multi-template homology modeling.
2015: Xu, Jaroszewski, Li, and Godzik develop the AIDA method for ab initio domain assembly.
2015: Kelley, Mezulis, Yates, Wass, and Sternberg develop the Phyre2 web portal for protein modeling, prediction, and analysis.
2016: He, Zhang, Ren, and Sun develop deep residual learning for image recognition (ResNet), later applied to protein structure prediction.
2016: Porta-Pardo and Godzik study mutation drivers of immunological responses to cancer.
2016: Quan, Lv, and Zhang develop STRUM, a structure-based method for predicting protein stability changes upon single-point mutation.
2017: Ovchinnikov, Park, Varghese, Huang, Pavlopoulos, Kim, Kamisetty, Kyrpides, and Baker use metagenome sequence data to determine protein structure.
2017: Wang, Sun, Li, Zhang, and Xu introduce RaptorX-Contact, revolutionizing contact prediction using deep ResNets.
2017: Steinegger and Söding describe the MMseqs2 method for sensitive protein sequence searching.
2017: Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, and Lerer develop PyTorch.
2018: Sundaram et al. predict the clinical impact of human mutation with deep neural networks.
2018: Zhu, Wang, Bu, and Xu publish work on protein threading using residue co-variation and deep learning.
2019: Zheng, Li, Zhang, Pearce, Mortuza, and Zhang develop deep-learning contact-map-guided protein structure prediction in CASP13.
2019: Greener, Kandathil, and Jones develop DMPfold and use deep learning for de novo protein modelling.
2019: Senior, Evans, Jumper, Kirkpatrick, Sifre, Green, Qin, Žídek, Nelson, Bridgland, et al. (Google DeepMind) use deep learning for improved protein structure prediction.
2019: Han et al. discover ARD-69 as a highly potent PROTAC degrader of the androgen receptor.
2019: Li et al. use deep learning with coevolutionary matrices for contact map prediction.
2019: Zheng et al. develop LOMETS2, an improved meta-threading server.
2019: Park, Ayoub, Lee, Xu, Kim, Zheng, Zhang, Sha, An, Zhang, et al. determine a cryo-EM structure of the human MLL1 core complex bound to the nucleosome.
2019: Zhou, Hu, Zhang, Zhang and Zhang assemble multidomain protein structures through analogous global structural alignments.
2019: Zheng et al. develop CEthreader for detecting distant-homology protein structures by aligning deep neural-network based contact maps.
2020: Senior, Evans, Jumper, Kirkpatrick, Sifre, Green, Qin, Žídek, Nelson, Bridgland, et al. publish on improved protein structure prediction using potentials from deep learning.
2020: Yang et al. show that interresidue orientations improve protein structure prediction.
2020: Zhang and Shen publish work on template-based prediction of protein structure with deep learning.
2020: Bhattacharya, Roche, Moussad, and Bhattacharya develop DisCovER for threading using distance and orientation based covariational information.
2020: Zhang, Zheng, Mortuza, Li and Zhang use deep multiple sequence alignment to improve contact prediction and fold-recognition.
2020: Callaway reports on DeepMind’s significant leap in solving protein structures with AI.
2020: Morton, Strauss, Blackwell, Berenberg, Gligorijevic, and Bonneau publish work on protein structural alignments from sequence.
2021: Pearce and Zhang publish work on the solution of the protein structure prediction problem.
2021: Du et al. release the trRosetta server for protein structure prediction.
2021: Woodard and Zhang develop ADDRESS, a database of disease associated human variants incorporating protein structures.
2021: Zheng et al. describe protein structure prediction with deep learning for distance and hydrogen bonding restraints.
2021: Zhang et al. demonstrate structure-based function and interaction prediction for essential genes in a minimal genome.
2021: Baek et al. introduce RoseTTAFold, a three-track network for protein structure prediction.
2021: Li et al. deduce high accuracy protein contact maps from triplets of coevolutionary matrices via deep learning.
2021: The UniProt Consortium publishes a 2021 update to the UniProt database.
2022: Varadi et al. create the AlphaFold Protein Structure Database (AlphaFold DB).
2022: Zhou et al. develop I-TASSER-MTD for multi-domain protein structure and function prediction.
2022: Bludau et al. study the structural context of posttranslational modifications at a proteome-wide scale.
2022: Zheng, Wuyun, Zhou, Li, Freddolino, and Zhang publish LOMETS3, which integrates deep learning and profile alignment for template recognition.
2022: Zhou, Peng, Zheng, Li, Zhang, and Zhang improve multi-domain protein assembly using DEMO2.
2022: Wu et al. introduce OmegaFold for de novo protein structure prediction.
2022: Aderinwale et al. describe real time structure search and classification for AlphaFold protein models.
2022: Liu et al. create the PSP dataset of millions of protein sequences for structure prediction.
2022: Li et al. describe Uni-Fold, an open source platform for protein folding model development.
2022: del Alamo et al. describe sampling alternative conformations with AlphaFold2.
2022: Wayment-Steele, Ojoawo, Otten, Apitz, Pitsawong, Hömberger, Ovchinnikov, Colwell, and Kern describe predicting multiple conformations via sequence clustering and AlphaFold2.
2022: Evans et al. describe protein complex prediction with AlphaFold-Multimer.
2022: Cheng, Zhao, Lu, Fang, Yu, Zheng, Wu, Zhang, Peng, and You create FastFold to reduce AlphaFold training time.
2022: Wang, Fang, Wu, Liu, Xue, Xiang, Yu, Wang and Ma create HelixFold, an implementation of AlphaFold2 using PaddlePaddle.
2023: Lin et al. demonstrate evolutionary-scale prediction of protein structure with a language model (ESMFold).
2023: Hekkelman et al. develop AlphaFill to enrich AlphaFold models with ligands and cofactors.
2023: Wehrspan et al. investigate iron-sulfur and zinc binding sites within AlphaFold DB.
2023: Jakubec et al. present PrankWeb 3 for ligand binding site prediction.
2023: van Kempen et al. release Foldseek for fast and accurate protein structure search.
2023: Bordin et al. use AlphaFold2 to reveal commonalities and novelties in protein structure space across 21 model organisms.
2023: Zheng, Wuyun, Freddolino and Zhang integrate deep learning, threading alignments, and a multi-MSA strategy for protein structure prediction in CASP15.
2023: Pang et al. develop CoDock-Ligand for ligand binding prediction.
2023: Xu et al. develop a template guided method for protein-ligand complex prediction.
2023: Shen et al. create zPoseScore for protein ligand docking pose scoring.
2023: Kotelnikov et al. create ClusPro LigTBM for ligand-protein docking in CASP15.
2023: Fang et al. introduce HelixFold-Single, an MSA-free protein structure prediction method using a protein language model.
2023: Li, Zhang, Feng, Pearce, Freddolino and Zhang develop a method integrating deep learning for RNA structure prediction.
2023: Wang, Feng, Han, Wang, Ye, Du, Wei, Zhang, Peng and Yang publish work on RNA 3D structure prediction with a transformer network (trRosettaRNA).
2023: Ruffolo et al. describe fast, accurate antibody structure prediction using deep learning.
2023: Jing, Wu, Luo and Xu create RaptorX-Single, a method for single-sequence protein structure prediction integrating protein language models.
2023: Zheng et al. present their method of protein structure prediction using deep learning distance and hydrogen-bonding restraints in CASP14.
2023: Park et al. develop single-sequence protein structure prediction using a supervised transformer language model.
2024: Pantolini, Studer, Pereira, Durairaj, Tauriello, and Schwede create embedding-based alignment, combining protein language models with dynamic programming alignment.
2024: Baek, McHugh, Anishchenko, Jiang, Baker and DiMaio publish work on protein-nucleic acid complexes using RoseTTAFoldNA.
2024: Terwilliger, Liebschner, Croll, Williams, McCoy, Poon, Afonine, Oeffner, Richardson, Read, et al. argue that AlphaFold predictions are valuable hypotheses that accelerate, but do not replace, experimental structure determination.
2024: Llinares-Lopez, Berthet, Blondel, Teboul and Vert work on deep embedding and alignment of protein sequences.
2024: Hou, Jin, Cui, Peng, Zhao, Song, and Zhang work on protein multiple conformation prediction.
2024: Wuyun et al. publish a review of recent progress in protein tertiary structure prediction.
2024: Li et al. produce a detailed description of deep learning geometrical potentials for protein structure prediction.
2024: Kaminski et al. publish pLM-BLAST, a tool for distant homology detection based on direct comparison of sequence representations from protein language models.

Challenges and Future Directions

Despite remarkable progress, significant challenges remain:

Multi-Domain Proteins: Accurate modeling of complex, multi-domain proteins is still in its infancy.
MSA Dependence: High-quality multiple sequence alignments (MSAs) are computationally expensive, prompting the exploration of MSA-free approaches like PLMs.
Beyond Proteins: Predicting RNA and RNA-protein complex structures is an emerging frontier.

Conclusion

The field of protein structure prediction has undergone a seismic transformation, thanks to AI-driven approaches like AlphaFold2. These advancements not only bring us closer to solving the protein folding problem but also open new avenues for biological research and therapeutic discovery. As deep learning techniques continue to evolve, the promise of unraveling even the most complex biological structures is within reach.

FAQ on Recent Advances in Protein Tertiary Structure Prediction

What is the significance of protein tertiary structure prediction, and why has it been a focus of research?
Proteins perform a wide array of critical functions, from providing structural support to catalyzing biochemical reactions, and their functionality is directly linked to their unique three-dimensional (3D) structures. Understanding the sequence-structure-function paradigm is essential in modern biomedical studies. Despite the abundance of known protein amino acid sequences, understanding their biological functions requires knowledge of their 3D structures, which aren’t always easily obtained through experimental methods. Accurate and efficient computational methods to predict protein structure are vital for advancing our understanding of biological processes and developing new therapies.

What are the main categories of protein structure prediction methods and how do they differ?
Protein structure prediction methods can be broadly categorized into template-based modeling (TBM) and free modeling (FM). TBM methods use known protein structures (templates) to predict the structure of a target protein that shares sequence similarity. In contrast, FM methods, also known as ab initio or de novo modeling, build structures from scratch without relying on global templates. Additional approaches include contact-based methods, which predict residue-residue contacts, and distance-based methods, which extend this idea by predicting distances between residues. More recently, end-to-end methods, which use deep learning to directly generate structures, and protein language model (PLM)-based methods, which leverage large protein sequence datasets, have emerged as powerful strategies.

How does template-based modeling (TBM) work, and when is it most effective?
TBM operates by identifying template structures in the Protein Data Bank (PDB) that are similar to the target protein. It involves four key steps: (i) template identification, (ii) sequence alignment between the query and template, (iii) building the initial structure using aligned regions, and (iv) constructing and refining unaligned regions. TBM is most effective when there is significant sequence identity (usually 30% or greater) between the target protein and a template (homology modeling). It can also be used when sequence identity is below this threshold by using “threading” or fold recognition methods, which exploit conserved structural elements.

What are the challenges associated with free modeling (FM), and how have these been addressed?
FM methods face challenges because they build structures without the help of known templates. Molecular dynamics simulations can in principle achieve this, but they require immense computational power and are in practice limited to short peptides. FM methods address this by using simplified, coarse-grained representations of proteins and employing physics- or knowledge-based energy functions to guide structure assembly through extensive sampling procedures. Initially, they had lower accuracy than TBM, but with techniques like fragment assembly, better energy functions, and expanded conformational search approaches, methods like Rosetta and QUARK have become far more competitive, particularly for targets without available templates.

What is the role of Multiple Sequence Alignments (MSAs) in protein structure prediction, and why is it a potential bottleneck?
Multiple sequence alignments (MSAs) are created by searching databases of protein sequences to identify sequences that are related to the target protein. Information derived from an MSA, such as co-evolutionary coupling (correlated mutations) between pairs of residues, is used as input for machine learning models to predict structural properties, like contact maps or distance maps. The MSA generation process, although essential for capturing key structural information, is a computationally demanding step that can take significantly longer than the actual structure prediction, acting as a bottleneck in high-throughput analyses.

How have deep learning and protein language models (PLMs) revolutionized protein structure prediction, and what are some examples of tools using these methods?
Deep learning, particularly using deep residual neural networks, has significantly advanced contact and distance map prediction accuracy. It has also enabled the development of end-to-end methods for structure prediction. PLMs, trained on large protein sequence datasets, have provided a means to learn co-evolutionary information directly from sequences, rather than relying on MSAs, facilitating fast and accurate MSA-free structure prediction. Examples include AlphaFold2, a top-performing method that combines deep learning with MSAs, and methods such as ESMFold, OmegaFold, and HelixFold-Single, which incorporate PLMs for fast, single-sequence structure prediction.

What are the specific challenges associated with predicting multi-domain protein structures, and how are these being addressed?
Multi-domain proteins pose a challenge because they contain multiple independently folding units, and their overall structure is determined not only by the individual domain structures but also by how these domains interact and orient in 3D space. Modeling multi-domain proteins typically involves splitting the sequence into individual domains, predicting their structures separately, and then assembling them into full-length structures. Approaches include linker-based modeling and rigid-body docking using information from homologous proteins. New methods such as LOMETS3 have also emerged that improve template recognition and alignment for multi-domain proteins, along with methods for generating deeper MSAs, like DeepMSA2. The most advanced methods, like the D-I-TASSER pipeline, integrate these approaches with deep learning spatial restraint predictions to directly predict full-length structures, outperforming previous multi-domain structure prediction efforts.

How are protein structure prediction methods being used for downstream applications and what are their limitations?
Highly accurate protein structure predictions from tools like AlphaFold2 are broadly useful for research. For example, the AlphaFold DB is a freely accessible database with over 200 million predicted structures used for understanding protein function, drug discovery, and more. Methods for predicting ligand binding sites, mapping post-translational modifications, and searching for structural similarity all make use of these newly abundant structures. Despite the progress, challenges remain, such as modeling multi-domain proteins and very large proteins, or capturing multiple conformations and interactions, and methods continue to be refined to address these limitations. It should also be noted that, despite their high accuracy, predicted structures are still models and do not always replace the need for experimental structure determination.
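
The co-evolutionary coupling mentioned in the MSA answer above can be illustrated with a simple statistic: the mutual information between two alignment columns. The sketch below is a simplified stand-in, not what modern predictors actually compute; they rely on techniques such as direct coupling analysis or feed the MSA to deep networks, and they correct for phylogenetic bias and sequence redundancy, none of which is done here. The toy alignment and function names are assumptions.

```python
import math
from collections import Counter


def column(msa, k):
    """Residues found at alignment position k across all sequences."""
    return [seq[k] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    ci, cj = column(msa, i), column(msa, j)
    pi, pj = Counter(ci), Counter(cj)
    pij = Counter(zip(ci, cj))
    mi = 0.0
    for (a, b), count in pij.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi


# Toy MSA: positions 1 and 3 mutate together (K with E, R with D).
msa = ["AKLE", "AKLE", "ARLD", "ARLD"]
print(mutual_information(msa, 1, 3))  # perfectly coupled columns
print(mutual_information(msa, 0, 1))  # a fully conserved column carries no signal
```

Column pairs with high scores are candidates for residues in spatial contact, which is the signal that contact and distance map predictors exploit.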

Glossary of Key Terms

Ab initio/De novo Modeling: Protein structure prediction methods that build models from scratch without relying on global templates; also referred to as “free modeling (FM).”
AlphaFold2: A deep learning-based program developed by DeepMind that has achieved breakthrough performance in predicting protein structures.
AlphaFold DB: The AlphaFold Protein Structure Database, a large and freely accessible database of protein structure predictions.
CASP (Critical Assessment of Protein Structure Prediction): A biennial experiment designed to assess the current state of the art in protein structure prediction.
Contact Map: A matrix representing whether pairs of amino acid residues in a protein are in close proximity (within 8 Å).
Co-evolution: The phenomenon that occurs when mutations in different parts of a protein, often in contact with each other, are correlated, suggesting a functional or structural relationship.
Deep Learning: A type of machine learning based on artificial neural networks with multiple layers (deep neural networks).
Distance Map: A map indicating the distances between all pairs of amino acid residues in a protein.
D-I-TASSER: A protein structure prediction method developed by Yang Zhang’s group that uses a distance-based approach and incorporates deep learning and homology modeling.
End-to-End Methods: Prediction methods that directly input sequence data and output final structures, streamlining the process.
Free Modeling (FM): Protein structure prediction methods that build models from scratch without relying on global templates; also referred to as “ab initio” or “de novo” modeling.
GDT Score (Global Distance Test): A score used in CASP to assess the quality of protein structure predictions with a focus on backbone modeling quality.
HMM (Hidden Markov Model): A probabilistic model that captures the evolutionary changes within a multiple sequence alignment (MSA).
Homology Modeling: Also known as comparative modeling, a TBM method used when there is substantial sequence similarity (typically 30% or greater) between the template and the protein of interest.
LOMETS: A meta-threading server that combines the results of multiple threading programs for enhanced accuracy.
MSA (Multiple Sequence Alignment): An alignment of multiple protein sequences to identify conserved regions and co-evolutionary patterns.
Molecular Dynamics (MD): A computer simulation method for studying the physical movements of atoms and molecules.
PDB (Protein Data Bank): A database of experimentally determined 3D structures of biological macromolecules, including proteins.
PLM (Protein Language Model): A neural network model trained on extensive protein sequence data to learn the “language” of proteins.
Position-specific Scoring Matrix (PSSM): Captures the amino acid tendencies at each position within the multiple sequence alignment (MSA).
Replica Exchange Monte Carlo (REMC): A sampling algorithm often used to escape local energy minima during protein folding simulations.
ResNet (Residual Neural Network): A deep neural network architecture that facilitates the training of very deep networks through the use of residual connections.
Rosetta: A software suite for protein structure prediction developed by David Baker’s group.
Template-Based Modeling (TBM): A method of protein structure prediction that relies on using known protein structures as templates to model the structure of a new protein.
Threading (Fold Recognition): A TBM method used when sequence similarity drops below 30% between the template and target, relying on structural profile comparisons rather than direct sequence alignments.
TM-score (Template Modeling score): A metric that assesses the global similarity between two protein structures, ranging from 0 to 1. TM-scores above 0.5 indicate that the proteins share the same fold.
UniProt: A comprehensive database of protein sequence and functional information.
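
As an illustration of the PSSM entry above, the following sketch turns a small MSA into per-position log-odds scores. It assumes a uniform background frequency and a flat pseudocount, both simplifications chosen here for brevity; established tools use empirical background frequencies and sequence weighting. The function name `pssm` and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm(msa, pseudocount=1.0):
    """Per-column log2-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for k in range(len(msa[0])):
        # Gaps and non-standard letters are ignored when counting.
        counts = Counter(seq[k] for seq in msa if seq[k] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


msa = ["AKLE", "AKLE", "ARLD", "ARLD"]
profile = pssm(msa)
print(round(profile[0]["A"], 2), round(profile[0]["W"], 2))
```

Positive scores mark residues that are over-represented at a position relative to the background, and negative scores mark residues that are rarely or never observed there.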

Protein Tertiary Structure Prediction: A Review

Quiz

Instructions: Answer the following questions in 2-3 sentences each.

What is the significance of understanding the protein sequence-structure-function paradigm in modern biomedical studies?
Briefly describe the four steps involved in template-based modeling (TBM).
How do free modeling (FM) methods differ from template-based modeling (TBM) methods in protein structure prediction?
Explain the concept of a contact map in protein structure prediction and its role.
What is the key difference between contact map prediction and distance map prediction?
What is a protein language model (PLM) and how are PLMs used in protein structure prediction?
What is a TM-score, and why is it used in protein structure prediction?
What are the advantages of using protein language models over MSAs in structure prediction?
How do linker-based and docking-based methods differ in multi-domain protein structure assembly?
What is the purpose of the Critical Assessment of Protein Structure Prediction (CASP) competition?

Quiz Answer Key

Understanding the protein sequence-structure-function paradigm is critical because a protein’s 3D structure, which is determined by its amino acid sequence, dictates its function. Gaining insight into this paradigm allows researchers to better understand various biological processes and develop targeted therapies.
TBM involves four main steps: identifying templates in the Protein Data Bank (PDB) that are related to the query protein, aligning the query sequence with these templates, building an initial structural framework by copying the aligned regions from the templates, and then constructing the unaligned regions and refining the final structure.
Free modeling methods construct protein structures from scratch without depending on global templates, using coarse-grained protein elements and physics or knowledge-based energy functions, while TBM methods rely on known protein structures as templates to guide prediction.
A contact map is a binary matrix that records whether each pair of residues is in contact in a protein’s 3D structure, with contact conventionally defined as a distance below 8 angstroms (usually measured between Cβ atoms, or Cα for glycine). Predicted contact maps serve as spatial restraints that guide the construction of structural models.
Contact map prediction involves determining binary values of whether residues are in contact, while distance map prediction estimates the likelihood of residue distances falling within specific ranges, providing a more detailed picture of inter-residue spatial arrangements.
A protein language model (PLM) is a type of neural network trained on extensive protein sequence data to learn the “semantic meaning” of each amino acid in the context of the full protein sequence. In structure prediction, PLMs generate high-dimensional embeddings of protein sequences that help identify distant homologous relationships and predict tertiary structures.
The TM-score is a metric that measures the structural similarity between a predicted protein structure and the experimentally determined native structure. It ranges from 0 to 1, with a score greater than 0.5 indicating that the structures share the same global topology. Because it is normalized by protein length and down-weights large local deviations, it is generally considered more robust than RMSD for comparing protein structures.
Because PLMs implicitly encode co-evolutionary information, they enable faster, MSA-free predictions and avoid the time-intensive bottleneck of MSA construction. This is particularly useful in high-throughput applications such as protein design.
Linker-based methods focus on modeling the linker regions between domains, exploring conformational space with domain orientations guided by generic interactions. Docking-based methods assemble individual domain structures through rigid-body docking, which can be guided by known templates.
CASP is a biennial competition that objectively evaluates the performance of protein structure prediction methods using a double-blind approach. It is considered a crucial benchmark, designed to stimulate research and advancement in the field.
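
The contact map and distance map answers above can be illustrated with a short sketch that derives both representations from a known set of coordinates. The 8 angstrom cutoff comes from the answer key; the bin edges in `BIN_EDGES`, the function names, and the toy coordinates are hypothetical choices for illustration only. A real predictor outputs a probability for each bin for every residue pair, whereas this sketch only shows the bin labels that such a model would be trained to reproduce.

```python
import bisect
import math

CONTACT_CUTOFF = 8.0                 # angstroms, as in the answer key
BIN_EDGES = [4.0, 8.0, 12.0, 16.0]   # hypothetical distance-bin edges


def distance_matrix(coords):
    """Pairwise distances between residue coordinates (x, y, z)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(dist):
    """Binary map: 1 if the pair is closer than the cutoff, else 0."""
    return [[1 if d < CONTACT_CUTOFF else 0 for d in row] for row in dist]


def distance_bins(dist):
    """Map each distance to a bin index.

    With the edges above: 0 is below 4, 1 is 4 to 8, 2 is 8 to 12,
    3 is 12 to 16, and 4 is 16 or more.
    """
    return [[bisect.bisect_right(BIN_EDGES, d) for d in row] for row in dist]


if __name__ == "__main__":
    # Toy coordinates: three closely spaced residues and one distant one.
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
    dist = distance_matrix(coords)
    for row in contact_map(dist):
        print(row)
    for row in distance_bins(dist):
        print(row)
```

The contact map collapses every distance into a yes or no value, while the binned map keeps how far apart non-contacting residues are. That extra resolution is what makes distance maps the more informative restraint.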

Essay Questions

Instructions: Answer the following questions in a well-developed essay format.

Discuss the evolution of protein structure prediction methods, highlighting key advancements from template-based modeling to deep learning-based approaches. What are the strengths and limitations of each approach?
Compare and contrast the roles of contact maps and distance maps in protein structure prediction. How have these techniques, in conjunction with deep learning, revolutionized the field?
Explain the significance of protein language models (PLMs) in overcoming the limitations of MSA-based protein structure prediction. How do PLMs generate valuable embeddings and what advantages do they offer for structural biologists?
Analyze the impact of AlphaFold2 on the field of protein structure prediction. What specific innovations contributed to its success and how has it influenced the development of new structure prediction tools and databases?
Describe the challenges of multi-domain protein structure prediction and how recent advancements, such as those implemented in D-I-TASSER, aim to address these difficulties.

Reference

Wuyun, Q., Chen, Y., Shen, Y., Cao, Y., Hu, G., Cui, W., … & Zheng, W. (2024). Recent progress of protein tertiary structure prediction. Molecules, 29(4), 832.