Introduction to Protein Primary structure Analysis
July 4, 2021Primary Structure
Proteins are macromolecules comprising the 20 naturally occurring amino acids. Amino acids, held together by peptide bonds, make up proteins. Under physiological conditions, proteins fold into characteristic three-dimensional structures that dictate their biological properties. The common configuration of natural amino acids is characterized by an amino and a carboxyl group around a central α carbon atom.
Primary structure is the simplest level of protein structure. The amino acid sequence determines the protein’s shape and structure and consequently its function.
The specific sequence is very important, since even a small change (called a mutation) could cause a disorder. For example, sickle cell anemia is a disorder in which the body’s hemoglobin contains just single amino acid substitution, in which glutamate at position 6th has been replace with valine.
Structure of Amino acids
All 20 of the common amino acids are alpha-amino acids. They contain a carboxyl group, an amino group, and a side chain (R group), all attached to the α-carbon.
Exceptions are:
Glycine, which does not have a side chain. Its α-carbon contains two hydrogens.
Proline, in which the nitrogen is part of a ring.
Thus, each amino acid has an amine group at one end and an acid group at the other and a distinctive side chain. The backbone is the same for all amino acids while the side chain differs from one amino acid to the next.
All of the 20 amino acids except glycine are of the L-configuration, as for all but one amino acid the α-carbon is an asymmetric carbon. Because glycine does not contain an asymmetric carbon atom, it is not optically active and, thus, it is neither D nor L.
Classification of amino acids on the basis of R-group
- Nonpolar, Aliphatic amino acids: The R groups in this class of amino acids are nonpolar and hydrophobic. Glycine, Alanine, Valine, leucine, Isoleucine, Methionine, Proline.
- . Aromatic amino acids: Phenylalanine, tyrosine, and tryptophan, with their aromatic side chains, are relatively nonpolar (hydrophobic). All can participate in hydrophobic interactions.
- Polar, Uncharged amino acids: The R groups of these amino acids are more soluble in water, or more hydrophilic, than those of the nonpolar amino acids, because they contain functional groups that form hydrogen bonds with water. This class of amino acids includes serine, threonine, cysteine, asparagine, and glutamine.
- Acidic amino acids: Amino acids in which R-group is acidic or negatively charged. Glutamic acid and Aspartic acid
- Basic amino acids: Amino acids in which R-group is basic or positively charged. Lysine, Arginine, Histidine
The respective side chain of each amino acid determines the chemical properties, such as hydrophobic, polar, acidic, or basic. If the characteristics of a protein were to depend solely on the unfolded amino acid sequence (frequently referred to as the primary structure), similar properties would be expected due to the limitation of just 20 amino acids. Indeed, denatured (unfolded) proteins have very similar properties that correspond essentially to a homogeneous cross-section of randomly distributed side chains. Nevertheless, the primary structure is essential for determining secondary and tertiary structure and with that, the three-dimensional conformation of the protein.
Polypeptide primary structure, i.e., the amino acid sequence from the N- to the C- terminus, can contain between three and several hundred amino acids. Each amino acid in the polypeptide chain is abbreviated either with a three letter or one letter code.
Functions of Amino acids
In particular, 20 very important amino acids are crucial for life as they contain peptides and proteins and are known to be the building blocks for all living things.
The linear sequence of amino acid residues in a polypeptide chain determines the three-dimensional configuration of a protein, and the structure of a protein determines its function.
Amino acids are imperative for sustaining the health of the human body. They largely promote the:
Production of hormones
• Structure of muscles
• Human nervous system’s healthy functioning
• The health of vital organs
• Normal cellular structure
The amino acids are used by various tissues to synthesize proteins and to produce nitrogen-containing compounds (e.g., purines, heme, creatine, epinephrine), or they are oxidized to produce energy.
The breakdown of both dietary and tissue proteins yields nitrogen-containing substrates and carbon skeletons.
The nitrogen-containing substrates are used in the biosynthesis of purines, pyrimidines, neurotransmitters, hormones, porphyrins, and nonessential amino acids.
The carbon skeletons are used as a fuel source in the citric acid cycle, used for gluconeogenesis, or used in fatty acid synthesis.
Why we Need to Determine the Primary Structure of Protein?
Protein play important role as “the building block of life” –
- Enzyme, Hormones and hormone receptors
- Sensing device: such as rhodopsin
- Role in immune system
- Expression of genetic information (transcription)
- Constituent of important body part: collagen
- Transporters: Albumin, Myoglobin & Hb.
- To locate the gene of interest in the host cell.
- To artificial synthesis the above products by using the applications of biotechnology, we need the determine the primary structure of protein.
Protein databases
Primary databases are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record. Protein sequence information has been effectively dealt in a concerted approach by establishing, maintaining and disseminating databases, providing user-friendly software tools and develop state-of-the-art analysis tools to interpret structural data. Databases are central, shareable resources made available in public domain and represent convenient and efficient means of storing vast amount of information. Depending on the nature of the different levels of information, databases are classified into different types for the end user.
There are various databases for each of the nature of protein information that range from primary, composite, secondary and pattern databases.
1. SWISS-PROT
This protein database was produced collaboratively by the Department of Medical
Biochemistry at the University of Geneva and the EMBL (European Molecular Biology
Laboratory). Since 1994, it moved to EMBL’s UK outstation, the EBI (European
Bioinformatics Institute) and in April 1998, it moved to Swiss Institute of Bioinformatics (SIB) and is maintained collaboratively by SIB and EBI/EMBL. It provides the description of the function of proteins, structure of its domains, post-translational modifications etc., is minimally redundant and is interlinked to many other resources.
- TrEMBL
This database has been designed to allow rapid access to protein sequence data. TrEMBL refers to Translated EMBL and was created as a supplement to SWISS-PROT in 1996 to include translations of all coding sequences in EMBL. - PIR
This is the Protein Information Resource developed as a Protein sequence database at the National Biomedical Research Foundation (NBRF) in the early 1960s and collaboratively by PIR-International since 1988. The consortia include the PIR at NBRF, JIPID the International Protein Information Database of Japan and MIPS the Martinsried Institute for Protein Sequences.
Composite protein sequence databases
Composite databases have been created to simplify the sequence search for a protein query in a single compilation in context of the many different primary database searches, by merging a variety of different primary resources. These databases are non-redundant and render sequence searching much more efficient.Non-Redundant DataBase (NRDB) is the default database of the NCBI (National Center for
Biotechnology Information) BLAST (Basic Local Alignment Search Tool) service and is acomposite of GenPept, PDB sequences, SWISS-PROT, SPupdate (weekly update of SWISSPROT), PIR and GenPept update (daily updates of GenPept). It provides comprehensive upto-date information and is non-identical rather than non-redundant, that is, it reiterates only identical sequence copies and hence results in artifacts. SWISS-PROT + TrEMBL: It is a combined resource of SWISS-PROT + TrEMBL at the EBI and is minimally redundant. It can be searched at the SRS sequence retrieval system on the EBI webserver
Secondary databases
Secondary databases are a consequence of analyses of the sequences of the primary
databases, mainly based from SWISS-PROT. Such databases augment the primary database
searches, derived from multiple sequence information, by which an unknown query
sequence can be searched against a library of patterns of conserved regions of sequence alignments which reflect some vital biological role, and based on these predefined characteristics of the patterns, the query protein can be assigned to a known family. However, secondary databases can never replace the primary sources but supplement the primary sequence search.
Prosite
It is the first secondary database and consists of entries describing the protein families, domains and functional sites as well as amino acid patterns, signatures, and profiles. This database was created in 1988 and is manually curated by a team of the Swiss Institute of Bioinformatics and tightly integrated into Swiss-Prot protein annotation.
Blocks
Blocks are multiply aligned ungapped segments corresponding to the most highly
conserved regions of proteins. The blocks for the Blocks database are made automatically by looking for the most highly conserved regions in groups of proteins documented in InterPro. Results are reported in a multiple sequence alignment format without calibration and in the standard Block format for searching.
Prints
This is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterize a protein family by iterative scanning of a SWISS-PROT/TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs. PRINTS can be accessed by Accession number, PRINTS code, database code, text, sequence, title, number of motifs, author or query language
Profiles
In the motif-based approach of protein family characterization, it is probable that variable regions between conserved motifs also contain valuable sequence information. Profiles indicate where the insertions and deletions are allowed in the complete sequence alignment and provide a sensitive means of detecting distant sequence relationships.
Pfam
The Pfam database contains information about protein domains and families. For each entry a protein sequence alignment and a hidden Markov model is stored. These hidden Markov models can be used to search sequence databases. For each family in Pfam it is possible to look at multiple alignments, view protein domain architectures, examine species distribution, follow links to other databases and view known protein structures.
Primary structure analysis bioinformatics tools
Many programs exist for, e.g.,
- Prediction of physico-chemical parameters of a protein sequence (amino-acid and atomic compositions, pI, extinction coefficient, etc.).
- Detection of repetitive protein sequences.
- Statistical analysis of protein sequences.
- Prediction of coiled coil regions in proteins (different methods).
- Identification of PEST regions.
- Prediction of peptide binding.
- Amino acid scale representation (Hydrophobicity, other conformational parameters, etc.).
- Representations of a protein fragment as a helical wheel.
Located on the ExPASY server are:
ProtParam – Physico-chemical parameters of a protein sequence (amino-acid and atomic compositions, pI, extinction coefficient, etc.).
Compute pI/Mw – Compute the theoretical pI and Mw from a SWISS-PROT or TrEMBL entry or for a user sequence.
ProtScale – Amino acid scale representation (Hydrophobicity, other conformational parameters, etc.).
RandSeq – Random protein sequence generator.
Amino acid composition, mass & pI:
Amino acid composition & Mass – ProtParam (ExPASy, Switzerland)
Isoelectric Point – Compute pI/Mw tool (ExPASy, Switzerland). If you want a plot of the relationship between charge and pH use ProteinChemist (ProteinChemist.com) or JVirGel Proteomic Tools (PRODORIC Net, Germany).
Mass, pI, composition and mol% acidic, basic, aromatic, polar etc. amino acids – PEPSTATS (EMBOSS). Biochemistry-online (Vitalonic, Russia) gives one % composition, molecular weight, pI, and charge at any desired pH.Peptide Molecular Weight Calculator (GenScript) – the online calculator determines the chemical formula and molecular weight of your peptide of interest. You can also specify post-translational modifications, such as N- and C- terminal modifications and positioning of disulfide bridges, to obtain more accurate outputs.
Isoelectric Point Calculator 2.0 (IPC 2.0) – is a server for the prediction of isoelectric points and pKa values using a mixture of deep learning and support vector regression models. The prediction accuracy (RMSD) of IPC 2.0 for proteins and peptides outperforms previous algorithms. (Reference: Kozlowski LP (2021) Nucl. Acids Res. Web Server issue).Composition/Molecular Weight Calculation (Georgetown University Medical Center, U.S.A.) – the only problem with this site is that when run in batch mode it does not identify the sequence by name, merely sequential number
Batch Protein Isoelectric Point determination – part of the Sequence Manipulation Suite or ENDMEMO
Batch Protein Molecular Weight determination – part of the Sequence Manipulation Suite or ENDMEMO
Protein calculator (C. Putnam, The Scripps Research Institute, U.S.A.) – calculates mass, pI, charge at a given pH, counts amino acid residues etc.Tm Predictor (P.C. Lyu Lab., National Tsing-Hua University, Taiwan) – calculates the theoretical protein melting temperature.
Computation of size of DNA and Protein Fragments from Their Electrophoretic Mobility
Exercise
Example: Using ProtParam, Compute pI/Mw, ProtScale and RandSeq for the analysis of SWISS_PROT Accn.no P35523.
Enter the Accession no into the query box for Acc.no. and hit Submit.
ProtParam: What are the Parameters of the potential extracellular domain (aa 858-988) of the protein?
Compute pI/Mw: What are the Parameters of the potential extracellular domain (aa 858-988) of the protein?
ProtScale: View the Hphobhobicity plot according to Kyte & Doolittle (default) with a Window size of 21. Click Sumbit on the next page again. Can you predict the number of transmembrane spanning domains (very hydrophobic regions) using this tool? (12 tms)
RandSeq: Change the radio button from average amino acid composition (default) to Composition of a specific sequence and enter your SWISS-PROT/TrEMBL ID P35523. Click Submit. A random sequence will be generated.
REFERENCES
- Xiong J. (2006). Essential Bioinformatics. Texas A & M University. Cambridge University Press.
- Arthur M Lesk (2014). Introduction to bioinformatics. Oxford University Press. Oxford, United Kingdom
- http://www.electronicsandcommunications.com/2018/08/secondary-databases-in-bioinformatics.html
4.https://www.ebi.ac.uk/training/online/course/bioinformatics-terrified-2018/primary-and-secondary-databases