Protein Secondary Structure Prediction-Tips to improve prediction
August 21, 2019
Protein Secondary Structure Prediction-Background theory
Secondary structure is one level in the hierarchy of protein structure. It provides repeating sets of interactions between functional groups along the polypeptide backbone, which in turn create irregularly shaped surfaces of projecting amino acid side chains. Secondary structures in proteins arise from repeating patterns of similar peptide dihedral angles (φ and ψ) for successive residues.
Alpha helix
The most common type of secondary structure in proteins is the α-helix. Linus Pauling was the first to predict the existence of α-helices. The prediction was confirmed when the first three-dimensional structure of a protein, myoglobin, was determined by X-ray crystallography (by John Kendrew and co-workers).
An α-helix is a right-handed coil of amino-acid residues on a polypeptide chain, typically ranging between 4 and 40 residues. This coil is held together by hydrogen bonds between the C=O oxygen of one turn and the N-H hydrogen of the next turn. Each such hydrogen bond spans four residues (i to i+4), while a complete turn of the helix takes only 3.6 residues. This regular pattern gives the α-helix very definite features with regard to the thickness of the coil and the length of each complete turn along the helix axis.
The α-helix is not the only helical structure in proteins. Other helical structures include the 3_10 helix, which is stabilized by hydrogen bonds of the type (i, i+3) and the π-helix, which is stabilized by hydrogen bonds of the type (i, i+5). The 3_10 helix has a smaller radius, compared to the α-helix, while the π-helix has a larger radius.
The alpha helix conformation is particularly stable for two main reasons. Firstly, the side chain groups are quite well separated. Secondly, and most importantly, each peptide link is involved in two hydrogen bonds. The C=O is hydrogen bonded to the N-H of the peptide link four units ahead in the primary structure, while it follows that the N-H is hydrogen bonded to the C=O of the peptide link four units behind.
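The hydrogen-bonding patterns described above can be written down in a few lines. The sketch below lists the backbone C=O to N-H pairs for an idealised helix of each type; the function name and the 1-based numbering are choices made here for illustration, and real helices fray at their ends.

```python
# Hydrogen-bond offsets named in the text: the C=O of residue i accepts
# a hydrogen bond from the N-H of residue i + offset.
HELIX_OFFSETS = {"3_10": 3, "alpha": 4, "pi": 5}


def backbone_hbonds(n_residues, helix_type="alpha"):
    """Return (i, j) pairs: C=O of residue i bonded to N-H of residue j.

    Residues are numbered from 1. This describes an idealised helix only.
    """
    offset = HELIX_OFFSETS[helix_type]
    return [(i, i + offset) for i in range(1, n_residues - offset + 1)]


for kind in HELIX_OFFSETS:
    print(kind, backbone_hbonds(8, kind))
```

For an eight-residue segment this gives four (i, i+4) bonds for the α-helix, five (i, i+3) bonds for the 3_10 helix and three (i, i+5) bonds for the π-helix.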
Alpha helix – Amino-acid propensities
Different amino-acid sequences have different propensities for forming α-helical structure. Methionine, alanine, leucine, glutamate, and lysine (“MALEK” in the amino-acid 1-letter codes) all have especially high helix-forming propensities, whereas proline and glycine have poor helix-forming propensities.
The amino acids that make up a particular helix can be plotted on a helical wheel, a representation that illustrates the orientations of the constituent amino acids. Often in globular proteins, as well as in specialized structures such as coiled-coils and leucine zippers, an α-helix will exhibit two “faces”: one containing predominantly hydrophobic amino acids oriented toward the interior of the protein, in the hydrophobic core, and one containing predominantly polar amino acids oriented toward the solvent-exposed surface of the protein.
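A helical wheel is easy to compute: with 3.6 residues per turn, each residue is rotated by 360/3.6 = 100 degrees relative to the previous one. The following is a minimal sketch; the hydrophobic residue set is an assumed, simplified classification, and the example peptide is made up.

```python
# Assumed, simplified hydrophobic set for illustration only.
HYDROPHOBIC = set("AILMFVWC")


def helical_wheel(sequence, degrees_per_residue=100.0):
    """Return (residue, angle in degrees, is_hydrophobic) per position.

    100 degrees per residue follows from 3.6 residues per alpha-helical turn.
    """
    return [
        (aa, (i * degrees_per_residue) % 360.0, aa in HYDROPHOBIC)
        for i, aa in enumerate(sequence.upper())
    ]


for aa, angle, hydrophobic in helical_wheel("LKKLLKK"):
    print(f"{aa} {angle:6.1f} {'hydrophobic' if hydrophobic else 'polar'}")
```

If the hydrophobic residues cluster within a limited range of angles, the helix has the two-faced (amphipathic) character described above.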
Coiled coils
Coiled-coil α helices are highly stable forms in which two or more helices wrap around each other in a “supercoil” structure. Coiled coils contain a highly characteristic sequence motif known as a heptad repeat, in which the motif repeats itself every seven residues along the sequence (amino acid residues, not DNA base-pairs). The first and especially the fourth residues (known as the a and d positions) are almost always hydrophobic; the fourth residue is typically leucine – this gives rise to the name of the structural motif called a leucine zipper, which is a type of coiled-coil. These hydrophobic residues pack together in the interior of the helix bundle. In general, the fifth and seventh residues (the e and g positions) have opposing charges and form a salt bridge stabilized by electrostatic interactions.
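The heptad repeat can be checked with a short script. The sketch below labels each residue with its heptad position (a to g) and measures how often the a and d positions are hydrophobic; the hydrophobic set, the function names and the toy peptide are assumptions for illustration, not a real coiled-coil predictor.

```python
HEPTAD = "abcdefg"
# Assumed, simplified hydrophobic set for illustration only.
HYDROPHOBIC = set("AILMFVW")


def heptad_register(sequence, start="a"):
    """Label each residue with a heptad position, starting at `start`."""
    offset = HEPTAD.index(start)
    return [(aa, HEPTAD[(i + offset) % 7]) for i, aa in enumerate(sequence.upper())]


def core_hydrophobic_fraction(sequence, start="a"):
    """Fraction of residues at the a and d positions that are hydrophobic."""
    core = [aa for aa, pos in heptad_register(sequence, start) if pos in "ad"]
    if not core:
        return 0.0
    return sum(aa in HYDROPHOBIC for aa in core) / len(core)


def best_register(sequence):
    """Starting position that puts the most hydrophobic residues at a and d."""
    return max(HEPTAD, key=lambda s: core_hydrophobic_fraction(sequence, s))


peptide = "LKELEEKLKELEEK"  # made-up two-heptad peptide
print(best_register(peptide), core_hydrophobic_fraction(peptide, "a"))
```

Trying all seven registers is a simple way to find the frame in which the hydrophobic core positions line up.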
Beta sheet
The second major type of secondary structure in proteins is the β-sheet. β-sheets consist of several β-strands, stretched segments of the polypeptide chain kept together by a network of hydrogen bonds.
The β-sheet (also β-pleated sheet) is a common motif of regular secondary structure in proteins. Beta sheets consist of beta strands (also β-strand) connected laterally by at least two or three backbone hydrogen bonds, forming a generally twisted, pleated sheet. A β-strand is a stretch of polypeptide chain typically 3 to 10 amino acids long with backbone in an extended conformation.
β sheets are further subdivided into parallel and antiparallel β sheets, depending on whether the strands run in the same or opposite directions (N- to C-terminus). Antiparallel β sheets are slightly more stable than parallel β sheets because the hydrogen bonding pattern is closer to optimal.
Parallel Beta sheets
Beta sheets are parallel if the polypeptide strands run in the same direction, N-terminus to C-terminus. The N-terminus of one beta strand will be opposite the N-terminus of the other beta strand.
The parallel arrangement is less stable because the geometry of the individual amino acid molecules forces the hydrogen bonds to occur at an angle, making them longer and thus weaker.
Anti-parallel beta sheets
Beta sheets are anti-parallel if the polypeptide strands run in opposite directions. The N-terminus of one beta strand will be opposite the C-terminus of the other beta strand.
In the anti-parallel arrangement the hydrogen bonds are aligned directly opposite each other, making for stronger and more stable bonds. An anti-parallel beta-pleated sheet forms when a polypeptide chain sharply reverses direction. This can occur in the presence of two consecutive proline residues, which create an angled kink in the polypeptide chain and bend it back upon itself.
Beta sheets – Amino-acid propensities
Large aromatic residues (tyrosine, phenylalanine, tryptophan) and β-branched amino acids (threonine, valine, isoleucine) are favored to be found in β-strands in the middle of β-sheets. Different types of residues (such as proline) are likely to be found in the edge strands in β-sheets, presumably to avoid the “edge-to-edge” association between proteins that might lead to aggregation and amyloid formation.
Computational prediction of protein secondary structure
The function of a protein is closely related to its structure; predicting the unknown protein structures of a proteome can therefore strongly support any attempt to discover their function. Advances in sequencing technology have made a large volume of protein sequence data available in databases at very low cost; however, structures are known for only about 0.2% of these sequences, and known functions are scarcer still. It therefore remains a big challenge for biologists and medical scientists to understand the structures and functions of proteins from such massive sequence data. As a result, using advanced computational methods to learn the structural information of proteins has become a basic task in protein science and bioinformatics, and it can help to explain how proteins carry out their biological functions and how proteins interact with each other.
The secondary structure is a bridge between the primary and tertiary structure; it forms at an early stage of folding and is the foundation of a protein's 3-D structure. The study of protein secondary structure is therefore indispensable as the first and most important step in 3-D structure studies, and it can help to understand the relationship between the function and the primary structure of proteins. Besides helping to determine the 3-D spatial structure of a protein, it is also used in many other areas of protein science, such as the prediction of native tertiary structure, prediction of transition-state position, real-value prediction of solvent accessibility, prediction of protein-protein interactions, prediction of protein structural classes, prediction of protein domains, prediction of β-turns in proteins and so on.
Three generations of methods for improving secondary structure prediction
Secondary structure prediction techniques have been classified into three generations. In the first generation, secondary structures were predicted from a protein sequence according to the statistical propensities of amino acid residues towards a specific secondary structure element. The most representative of these first-generation methods is the Chou–Fasman method, which combined propensities with heuristic rules. The second-generation methods, represented by the Garnier-Osguthorpe-Robson (GOR) method and the Lim method, used a sliding window of neighbouring residues and various theoretical algorithms such as statistical information, graph theory, neural networks, logic-based machine learning techniques and nearest-neighbour methods. The use of information from neighbouring residues became possible as more protein structures became available to estimate pairwise, triplet or longer-segment frequencies. The third generation of techniques is characterized by the use of evolutionary information derived from the alignment of multiple homologous sequences. During this period, new computational algorithms have been implemented. Examples are support vector machines, Bayesian or hidden semi-Markov networks, and conditional random fields for combined prediction. Of these methods, neural-network-based models have reported the highest accuracy.
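To make the idea of propensities and sliding windows concrete, here is a toy predictor in the spirit of the early methods. It is not the Chou–Fasman or GOR method: it only uses the residue classes named earlier in this article (MALEK as helix formers, proline and glycine as helix breakers, aromatic and β-branched residues as strand formers), and the scores, window size and threshold are placeholders chosen for illustration.

```python
HELIX_FORMERS = set("MALEK")
HELIX_BREAKERS = set("PG")
STRAND_FORMERS = set("YFWTVI")


def residue_scores(aa):
    """Return (helix_score, strand_score); the numbers are placeholders."""
    if aa in HELIX_FORMERS:
        helix = 1.0
    elif aa in HELIX_BREAKERS:
        helix = -1.0
    else:
        helix = 0.0
    strand = 1.0 if aa in STRAND_FORMERS else 0.0
    return helix, strand


def predict(sequence, window=5, threshold=0.5):
    """Assign H, E or C to each residue from window-averaged scores."""
    seq = sequence.upper()
    half = window // 2
    states = []
    for i in range(len(seq)):
        # Window centred on residue i, truncated at the sequence ends.
        segment = seq[max(0, i - half): i + half + 1]
        scores = [residue_scores(aa) for aa in segment]
        helix = sum(h for h, _ in scores) / len(segment)
        strand = sum(s for _, s in scores) / len(segment)
        if helix >= threshold and helix > strand:
            states.append("H")
        elif strand >= threshold and strand > helix:
            states.append("E")
        else:
            states.append("C")
    return "".join(states)


print(predict("AAAAAAAAGGPGVVVVVVVV"))
```

A predictor like this sees only local composition, which is one reason the third-generation methods that add evolutionary information perform so much better.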
Case study- Secondary structure prediction using PSIPRED
To perform any structure prediction, we need the amino acid sequence of the protein whose structure is unknown (the so-called target structure). Our target is proteinase 3 (PR3) from Xenopus laevis (the African clawed frog). We will first predict its secondary structure elements and then explore its fold.
Input: Amino acid sequence of proteinase 3 L homeolog precursor [Xenopus laevis]
The proteinase 3 [Xenopus laevis] sequence can be found here: https://www.ncbi.nlm.nih.gov/protein/147904866?report=fasta
1. Go to the PSIPRED server (http://bioinf.cs.ucl.ac.uk/psipred/) in a web browser. From here we will submit our sequence and explore the predicted secondary structure elements (SSEs). First, choose “Predict Secondary Structure” and copy the target sequence ( [Xenopus laevis] ) into the form called “Input Sequence”. Fill in “Short identifier for submission” before starting the prediction, and optionally your e-mail address so you can fetch the result at a later stage. Please wait for the prediction to finish. After a few minutes the results of the prediction will appear on the screen. Each residue has now been assigned to either helix (H), strand (E), or coil (C). Along with this assignment, a confidence level on a scale from 0 (low) to 9 (high) is also reported.
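Once you have the per-residue states and confidence digits, it is often handy to collapse them into segments. The sketch below assumes you have copied the prediction and confidence rows into two plain strings of equal length; the strings in the example are made up and are not the actual PR3 result.

```python
from itertools import groupby


def segments(pred, conf):
    """Group per-residue states into (state, start, end, mean_confidence).

    `pred` is a string of H/E/C states and `conf` a string of digits 0-9,
    one per residue. Positions are 1-based and inclusive.
    """
    if len(pred) != len(conf):
        raise ValueError("prediction and confidence strings differ in length")
    result = []
    pos = 0
    for state, run in groupby(pred):
        length = len(list(run))
        scores = [int(c) for c in conf[pos:pos + length]]
        result.append((state, pos + 1, pos + length, sum(scores) / length))
        pos += length
    return result


# Made-up example strings, not real server output.
for state, start, end, mean_conf in segments("CCHHHHCEEEC", "98789974689"):
    print(f"{state} {start:3d}-{end:<3d} mean confidence {mean_conf:.2f}")
```

Segments with a low mean confidence are the ones to treat with the most caution when comparing predictions.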
2. Interestingly, the server has predicted an alpha helix near the N-terminus, which is rather unexpected given what we know of the human PR3 structure. More in line with the human structure, we also see several beta strands and the C-terminal alpha helix. We have now successfully performed an SSE prediction of the Xenopus laevis PR3.
However, we haven’t seen this N-terminal alpha helix before. Let’s find out whether it is unique to Xenopus laevis, or whether we can find it in the human PR3 as well. Perform the same type of analysis on the human PR3 sequence ( [Homo sapiens] ). How do the SSEs compare?
3. Go to http://bioinf.cs.ucl.ac.uk/psipred/ again, choose “Fold Recognition (pGenTHREADER – with profiles and predicted secondary structure)”, and paste in the sequence of [Xenopus laevis] . This will take some time, and we don’t want to overload the server, so you can instead download the results here.
Not surprisingly, the fold recognition reveals two familiar structures. Interestingly, the human neutrophil elastase (HNE) structure (2z7f) shows a higher score than the human PR3. To find out why, we should take a look at the alignments between them.
Advantage of secondary structure prediction
Secondary structure predictions are used automatically by methods aiming at higher-dimensional aspects of protein structure and at improving database searches and alignment accuracy. One method has successfully related secondary structure predictions automatically to functional aspects. However, secondary structure-based identifications of binding sites or other functional aspects are still restricted to single-case expert analyses.
Proteins can be classified into families based on predicted and observed secondary structure. However, such procedures have been limited to a very coarse-grained grouping only exceptionally useful for inferring function. Nevertheless, in particular, predictions of membrane helices and coiled-coil regions are crucial for genome analysis.
Practical observations to consider when using secondary structure predictions
(i) How to obtain the best results?
The major source of improvement is the divergence of the multiple sequence alignment used for prediction. Thus, if you have a small family, the expected prediction accuracy is lower.
Particularly sensitive to divergence are the reliability indices; i.e., less divergence yields overestimated reliability indices.
The most successful strategy to find the most reliably predicted regions may be to use the reliability index provided by a method rather than the agreement between different methods.
If you know there are nonglobular or structural domains in your protein, chop it up before you build the alignment. If you can improve the alignment, try to do so before the prediction.
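One crude way to get a feeling for how divergent your alignment is: compute the mean pairwise sequence identity over all rows. This is only an assumed proxy chosen for illustration (prediction servers use their own measures), and the toy alignment below is made up. A lower mean identity means a more divergent family.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Fraction of identical residues over columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def mean_pairwise_identity(alignment):
    """Average identity over all pairs of rows in the alignment."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two aligned sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


toy_alignment = ["ACDE", "ACDF", "A-DE"]  # made-up rows of equal length
print(round(mean_pairwise_identity(toy_alignment), 3))
```

An alignment made up of near-identical sequences scores close to 1.0 here, which is the situation in which the reliability indices tend to be overestimated.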
(ii) Identify membrane proteins?
Predicted membrane helices indicate that your protein is not globular. Membrane helix predictions are usually more reliable than predictions for globular proteins and should therefore be given preference. Methods developed for globular proteins often do not predict helices at the positions of membrane helices; rather, when such methods are mistakenly applied, membrane helices are often predicted as strands. In contrast, globular methods appear relatively more accurate for porin-like β-strand membrane regions. Detection of membrane proteins has less than a 3% error rate for the best methods. Most helices are correctly predicted, yet the number of helices may nevertheless vary. Helix caps are clearly predicted inaccurately.
Note that general methods predicting three-state secondary structure for globular proteins also predict caps less accurately.
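As a quick first look before running a dedicated membrane predictor, you can scan the sequence for long hydrophobic stretches. The sketch below uses the classical Kyte–Doolittle hydropathy scale with a 19-residue window and a cutoff of 1.6; this is a technique brought in here for illustration, not one of the methods discussed above, and it is far less accurate than the best membrane helix predictors.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence.

    Non-standard residue codes (e.g. X) raise a KeyError in this sketch.
    """
    seq = sequence.upper()
    if len(seq) < window:
        return []
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_membrane_windows(sequence, window=19, threshold=1.6):
    """1-based start positions of windows whose mean hydropathy exceeds the cutoff."""
    profile = hydropathy_profile(sequence, window)
    return [i + 1 for i, value in enumerate(profile) if value > threshold]


print(candidate_membrane_windows("K" * 10 + "L" * 22 + "K" * 10))
```

Any hits are only candidates; confirm them with a method designed for membrane helices.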
(iii) Classify through coiled-coil regions?
Predictions of long coiled-coil regions clearly indicate that your protein is locally nonglobular. Long coiled-coil proteins are likely to be structural proteins. Longer regions are predicted more accurately.
(iv) Classify through secondary structure content?
Classifying proteins according to the secondary structure composition is helpful, but arbitrary. One hope may be to infer from the predicted secondary structure content that a particular protein is not typical. However, this attempt fails, since known protein structures vary significantly, between 10 and 90% of regular secondary structure (helix, strand). Thus, secondary structure composition does not help to predict globularity.
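Computing the secondary structure content from a three-state prediction string is straightforward. A minimal sketch, assuming the prediction uses the H/E/C alphabet from the PSIPRED example (the example string is made up):

```python
from collections import Counter


def secondary_structure_content(pred):
    """Fraction of residues predicted as helix (H), strand (E) and coil (C)."""
    if not pred:
        raise ValueError("empty prediction string")
    counts = Counter(pred.upper())
    return {state: counts.get(state, 0) / len(pred) for state in "HEC"}


def regular_fraction(pred):
    """Fraction of residues in regular secondary structure (helix or strand)."""
    content = secondary_structure_content(pred)
    return content["H"] + content["E"]


print(secondary_structure_content("HHHHEECCCC"))
print(round(regular_fraction("HHHHEECCCC"), 2))
```

As noted above, the resulting numbers are useful as a description but not as evidence for or against globularity.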
(v) Identify domains or structural regions?
If you see two separate secondary structure patterns, you may suspect that the protein has two structural domains. An extreme example is an N-terminal all-alpha region and a C-terminal all-beta region. If you have to cut your protein, stay more than two residues away from predicted helices and strands.
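The cutting rule above can be applied mechanically to a prediction string. The sketch below returns the coil positions that are more than `margin` residues away from any predicted helix or strand; the function name and the example string are made up for illustration.

```python
def safe_cut_positions(pred, margin=2):
    """1-based coil positions more than `margin` residues from any H or E.

    A position qualifies when it and the `margin` residues on either side
    (as far as the sequence extends) are all predicted as coil (C).
    """
    pred = pred.upper()
    safe = []
    for i in range(len(pred)):
        neighbourhood = pred[max(0, i - margin): i + margin + 1]
        if set(neighbourhood) == {"C"}:
            safe.append(i + 1)
    return safe


print(safe_cut_positions("HHHHCCCCCCCEEEE"))  # -> [7, 8, 9]
```

If the list comes back empty, the linker between the two regions is too short to cut cleanly under this rule.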
(vi) Monitor influences of point mutations?
Secondary structure prediction methods are, on average, as accurate in predicting the overall content of secondary structure as careful circular dichroism (CD) and Fourier-transform infrared (FTIR) spectroscopy. However, these experimental methods allow you to monitor in detail structural responses to mutations. Such changes are less likely to be reflected as accurately by prediction methods.
(vii) Find binding sites or motifs?
Most often, binding sites lie in nonregular secondary structure elements. For example, we have not predicted regular secondary structure for any of the known nuclear localization signals (128).
Secondary structure predictions do not suffice to identify binding motifs, such as the zinc-finger II motif. However, the combination of sequence motif and predicted secondary structure may be very helpful.
(viii) Infer functional/structural similarity?
If you know the function/structure of protein A and want to infer whether B shares this function/structure, a similarity in the local secondary structure may help you substantially.
List of protein secondary structure prediction programs
Name | Method description | Type | Link | Initial release |
---|---|---|---|---|
Porter 5 | Fast, state-of-the-art ab initio prediction of protein secondary structure in 3 and 8 classes | Webserver/downloadable | server/download | 2018 |
SPIDER2 | The most comprehensive and accurate prediction by iterative Deep Neural Network (DNN) for protein structural properties including secondary structure, local backbone angles, and accessible surface area (ASA) | Webserver/downloadable | server/download | 2015 |
s2D | Predicts disorder and secondary structure in one unified framework. Trained on solution-based NMR data. | Webserver/downloadable | server/download | 2015 |
RaptorX-SS8 | Predicts both 3-state and 8-state secondary structure using conditional neural fields from PSI-BLAST profiles | Webserver/downloadable | server/download | 2011 |
NetSurfP | Profile-based neural network | Webserver/downloadable | server | 2009 |
GOR | Information theory/Bayesian inference | Many implementations | Basic GOR, GOR V | 2002 (GOR V) |
Jpred | Multiple Neural network assignment from PSI-BLAST and HMMER profiles. Predicts secondary structure and solvent accessibility | Webserver | server and API | 1998 |
Meta-PP | Consensus prediction of other servers | Webserver | server | 1999 |
PREDATOR | Knowledge-based database comparison | Webserver | server | 1997 |
PredictProtein | Profile-based neural network | Webserver | server | 1992 |
PSIPRED | two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST | Webserver | server | 1999 |
SOPMA | Self OPtimised Prediction Method from multiple Alignments (based on nearest neighbour method) | Webserver | server | 1995 |
Homology Modeling Professional for HyperChem | Frequency analysis of amino acid residues observed in proteins | Commercial | algorithm | 2002 |
SymPred | an improved dictionary based approach which captures local sequence similarities in a group of proteins | Webserver | server | 2004 |
YASSPP | Cascaded SVM-based predictor using PSI-BLAST profiles | Webserver | server | 2006 |
PSSpred | Multiple backpropagation neural network predictors from PSI-BLAST profiles | Webserver/downloadable program | server and downloadable program | 2012 |
HCAM | Hydropathy Clustering Assisted Method by detection of physicochemical patterns | Downloadable program plus database | main page | 2013 |
Frag1D | Prediction of both secondary structure and Shape Strings (discrete states of dihedral angles) using profile based fragment matching | Webserver/downloadable program | main page | 2010 |
References:
1. Rost, B. (2001). Protein secondary structure prediction continues to rise. Journal of Structural Biology, 134(2-3), 204-218.
2. Yang, Y., Gao, J., Wang, J., Heffernan, R., Hanson, J., Paliwal, K., & Zhou, Y. (2016). Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Briefings in Bioinformatics, 19(3), 482-494.
3. Jiang, Q., Jin, X., Lee, S. J., & Yao, S. (2017). Protein secondary structure prediction: A survey of the state of the art. Journal of Molecular Graphics and Modelling, 76, 379-402.