Introduction to Protein Primary Structure Analysis

July 4, 2021, by admin

Primary Structure

Proteins are macromolecules built from the 20 naturally occurring (standard) amino acids, which are joined to one another by peptide bonds. Under physiological conditions, proteins fold into characteristic three-dimensional structures that dictate their biological properties. All standard amino acids share a common layout: an amino group and a carboxyl group attached to a central α-carbon atom.

Primary structure, the linear sequence of amino acids, is the simplest level of protein structure. This sequence determines the protein’s three-dimensional shape and, consequently, its function.

The specific sequence is very important, since even a small change (called a mutation) can cause a disorder. For example, sickle cell anemia is caused by a single amino acid substitution in the β-globin chain of hemoglobin, in which the glutamate at position 6 is replaced by valine.
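The substitution described above can be sketched in a few lines of Python. This is only an illustration: the eight-residue string is the start of mature human β-globin written in one-letter code, and the helper name `substitute` is made up for this example.

```python
# First eight residues of mature human beta-globin (one-letter code).
# Positions are 1-based, as in the text above.
BETA_GLOBIN_START = "VHLTPEEK"


def substitute(sequence: str, position: int, new_residue: str) -> str:
    """Return a copy of `sequence` with the residue at 1-based `position` replaced."""
    if not 1 <= position <= len(sequence):
        raise ValueError("position lies outside the sequence")
    index = position - 1  # convert to Python's 0-based indexing
    return sequence[:index] + new_residue + sequence[index + 1:]


normal = BETA_GLOBIN_START
sickle = substitute(normal, 6, "V")  # glutamate (E) at position 6 -> valine (V)

print(normal)  # VHLTPEEK
print(sickle)  # VHLTPVEK
```

The two strings differ at a single position, which is all it takes to change the behaviour of the folded protein.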

Structure of Amino acids

All 20 of the common amino acids are alpha-amino acids. They contain a carboxyl group, an amino group, and a side chain (R group), all attached to the α-carbon.

Exceptions are:

Glycine, whose side chain is just a hydrogen atom, so its α-carbon carries two hydrogens.
Proline, in which the nitrogen is part of a ring.

Thus, each amino acid has an amine group at one end, an acid group at the other, and a distinctive side chain. The backbone is the same for all amino acids, while the side chain differs from one amino acid to the next.
All of the 20 amino acids except glycine have the L-configuration, because in each of them the α-carbon is an asymmetric carbon. Because glycine does not contain an asymmetric carbon atom, it is not optically active and is therefore neither D nor L.

Classification of amino acids on the basis of R-group

Nonpolar, aliphatic amino acids: The R groups in this class of amino acids are nonpolar and hydrophobic. This class includes glycine, alanine, valine, leucine, isoleucine, methionine, and proline.
Aromatic amino acids: Phenylalanine, tyrosine, and tryptophan, with their aromatic side chains, are relatively nonpolar (hydrophobic). All can participate in hydrophobic interactions.
Polar, uncharged amino acids: The R groups of these amino acids are more soluble in water, or more hydrophilic, than those of the nonpolar amino acids, because they contain functional groups that form hydrogen bonds with water. This class includes serine, threonine, cysteine, asparagine, and glutamine.
Acidic amino acids: Amino acids in which the R group is acidic, or negatively charged. This class includes glutamic acid and aspartic acid.
Basic amino acids: Amino acids in which the R group is basic, or positively charged. This class includes lysine, arginine, and histidine.
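The five classes listed above can be written down as a small lookup table. The sketch below uses the standard one-letter codes; the names `R_GROUP_CLASSES` and `class_counts` are invented for this example and do not come from any particular library.

```python
# One-letter codes grouped by the R-group classes listed above.
R_GROUP_CLASSES = {
    "nonpolar aliphatic": "GAVLIMP",
    "aromatic": "FYW",
    "polar uncharged": "STCNQ",
    "acidic": "DE",
    "basic": "KRH",
}

# Inverted table: residue -> class name.
RESIDUE_TO_CLASS = {
    residue: name
    for name, residues in R_GROUP_CLASSES.items()
    for residue in residues
}


def class_counts(sequence: str) -> dict:
    """Count how many residues of `sequence` fall into each R-group class."""
    counts = {name: 0 for name in R_GROUP_CLASSES}
    for residue in sequence.upper():
        if residue not in RESIDUE_TO_CLASS:
            raise ValueError(f"not a standard amino acid: {residue!r}")
        counts[RESIDUE_TO_CLASS[residue]] += 1
    return counts


print(class_counts("VHLTPEEK"))
```

For the short peptide VHLTPEEK this reports three nonpolar aliphatic residues, one polar uncharged residue, two acidic residues, and two basic residues.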

The side chain of each amino acid determines its chemical character: hydrophobic, polar, acidic, or basic. If the characteristics of a protein depended solely on the unfolded amino acid sequence (frequently referred to as the primary structure), all proteins would be expected to have similar properties, because only 20 amino acids are available. Indeed, denatured (unfolded) proteins have very similar properties that correspond essentially to a homogeneous cross-section of randomly distributed side chains. Nevertheless, the primary structure is essential for determining secondary and tertiary structure and, with that, the three-dimensional conformation of the protein.

Polypeptide primary structure, i.e., the amino acid sequence from the N- to the C-terminus, can range from a few residues to several thousand amino acids. Each amino acid in the polypeptide chain is abbreviated with either a three-letter or a one-letter code.
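Converting between the two notations is a simple table lookup. The sketch below uses the standard three-letter and one-letter codes; the function names and the hyphen-separated input format are assumptions made for this example.

```python
# Standard three-letter -> one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_sequence: str, separator: str = "-") -> str:
    """Convert e.g. 'Val-His-Leu' (N- to C-terminus) into 'VHL'."""
    codes = three_letter_sequence.split(separator)
    return "".join(THREE_TO_ONE[code.strip().capitalize()] for code in codes)


def to_three_letter(one_letter_sequence: str, separator: str = "-") -> str:
    """Convert e.g. 'VHL' into 'Val-His-Leu'."""
    return separator.join(ONE_TO_THREE[r] for r in one_letter_sequence.upper())


print(to_one_letter("Val-His-Leu-Thr-Pro-Glu-Glu-Lys"))  # VHLTPEEK
print(to_three_letter("VHLTPEEK"))  # Val-His-Leu-Thr-Pro-Glu-Glu-Lys
```

Both functions read and write the sequence from the N-terminus to the C-terminus, matching the convention described above.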

Functions of Amino acids

In particular, the 20 standard amino acids are crucial for life: they make up peptides and proteins and are known as the building blocks of all living things.
The linear sequence of amino acid residues in a polypeptide chain determines the three-dimensional configuration of a protein, and the structure of a protein determines its function.
Amino acids are essential for sustaining the health of the human body. They largely promote:
• Production of hormones
• Structure of muscles
• Healthy functioning of the human nervous system
• The health of vital organs
• Normal cellular structure
The amino acids are used by various tissues to synthesize proteins and to produce nitrogen-containing compounds (e.g., purines, heme, creatine, epinephrine), or they are oxidized to produce energy.
The breakdown of both dietary and tissue proteins yields nitrogen-containing substrates and carbon skeletons.
The nitrogen-containing substrates are used in the biosynthesis of purines, pyrimidines, neurotransmitters, hormones, porphyrins, and nonessential amino acids.
The carbon skeletons are used as a fuel source in the citric acid cycle, used for gluconeogenesis, or used in fatty acid synthesis.

Why Do We Need to Determine the Primary Structure of a Protein?

Proteins play important roles as “the building blocks of life”:

Enzymes, hormones, and hormone receptors
Sensing devices, such as rhodopsin
Roles in the immune system
Expression of genetic information (transcription)
Constituents of important body parts, such as collagen
Transporters: albumin, myoglobin, and hemoglobin (Hb)

We need to determine the primary structure of a protein in order to:

Locate the gene of interest in the host cell.
Synthesize the above products artificially using the applications of biotechnology.

Protein databases

Primary databases are populated with experimentally derived data such as nucleotide sequences, protein sequences, or macromolecular structures. Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature. Once given a database accession number, the data in primary databases are never changed: they form part of the scientific record. Protein sequence information has been handled through a concerted effort to establish, maintain, and disseminate databases, to provide user-friendly software tools, and to develop state-of-the-art analysis tools for interpreting structural data. Databases are central, shareable resources made available in the public domain, and they represent a convenient and efficient means of storing vast amounts of information. Depending on the nature of the information they hold, databases are classified into different types for the end user.

Protein databases fall into several types according to the nature of the information they hold: primary, composite, secondary, and pattern databases.

1. SWISS-PROT
This protein database was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and the EMBL (European Molecular Biology Laboratory). In 1994 it moved to EMBL’s UK outstation, the EBI (European Bioinformatics Institute), and in April 1998 it moved to the Swiss Institute of Bioinformatics (SIB); it is maintained collaboratively by SIB and EBI/EMBL. It describes the function of proteins, the structure of their domains, post-translational modifications, and so on; it is minimally redundant and is interlinked with many other resources.

2. TrEMBL
This database has been designed to allow rapid access to protein sequence data. TrEMBL stands for Translated EMBL; it was created in 1996 as a supplement to SWISS-PROT to include translations of all coding sequences in EMBL.

3. PIR
This is the Protein Information Resource, developed as a protein sequence database at the National Biomedical Research Foundation (NBRF) in the early 1960s and maintained collaboratively by PIR-International since 1988. The consortium comprises the PIR at NBRF, JIPID (the Japan International Protein Information Database), and MIPS (the Martinsried Institute for Protein Sequences).

Composite protein sequence databases
Composite databases merge a variety of primary resources into a single compilation, so that a protein query can be searched once rather than against many different primary databases. These databases are non-redundant and render sequence searching much more efficient.

Non-Redundant DataBase (NRDB): This is the default database of the NCBI (National Center for Biotechnology Information) BLAST (Basic Local Alignment Search Tool) service and is a composite of GenPept, PDB sequences, SWISS-PROT, SPupdate (weekly updates of SWISS-PROT), PIR, and GenPept update (daily updates of GenPept). It provides comprehensive, up-to-date information, but it is non-identical rather than non-redundant: it removes only identical sequence copies, so near-identical entries remain and can produce artifacts.
SWISS-PROT + TrEMBL: This is a combined resource of SWISS-PROT and TrEMBL at the EBI and is minimally redundant. It can be searched with the SRS (Sequence Retrieval System) on the EBI web server.
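The difference between "non-identical" and "non-redundant" can be illustrated with a toy merge. The sketch below is not how NRDB is actually built; the record identifiers, the sequences, and the function name `merge_non_identical` are all made up for illustration.

```python
def merge_non_identical(*sources):
    """Merge (identifier, sequence) records from several sources.

    Only exact sequence copies are collapsed into one entry; sequences that
    differ by even one residue are kept as separate entries.
    """
    merged = {}  # sequence -> identifiers of all records carrying it
    for records in sources:
        for identifier, sequence in records:
            merged.setdefault(sequence.upper(), []).append(identifier)
    return merged


# Hypothetical records standing in for two primary databases.
source_a = [("a1", "MKTAYIAKQR"), ("a2", "MSLLTEVETY")]
source_b = [("b1", "MKTAYIAKQR"), ("b2", "MKTAYIAKQK")]

merged = merge_non_identical(source_a, source_b)
for sequence, identifiers in merged.items():
    print(sequence, identifiers)
```

Records a1 and b1 collapse into one entry because their sequences are identical, while b2, which differs from them by a single residue, survives as a separate entry. That leftover near-duplicate is the kind of redundancy a non-identical database still contains.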

Primary structure analysis bioinformatics tools

Many programs exist for tasks such as:

Prediction of physico-chemical parameters of a protein sequence (amino-acid and atomic compositions, pI, extinction coefficient, etc.).
Detection of repetitive protein sequences.
Statistical analysis of protein sequences.
Prediction of coiled coil regions in proteins (different methods).
Identification of PEST regions.
Prediction of peptide binding.
Amino acid scale representation (Hydrophobicity, other conformational parameters, etc.).
Representations of a protein fragment as a helical wheel.

The following tools are located on the ExPASy server:
ProtParam – Physico-chemical parameters of a protein sequence (amino-acid and atomic compositions, pI, extinction coefficient, etc.).
Compute pI/Mw – Compute the theoretical pI and Mw from a SWISS-PROT or TrEMBL entry or for a user sequence.
ProtScale – Amino acid scale representation (Hydrophobicity, other conformational parameters, etc.).
RandSeq – Random protein sequence generator.
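To make the idea of physico-chemical parameters concrete, here is a minimal sketch of two of them: amino acid composition and average molecular weight. It is not the ProtParam implementation. The residue masses are commonly tabulated average values in daltons, the function names are invented for this example, and no post-translational modifications are considered.

```python
from collections import Counter

# Commonly tabulated average residue masses in Da (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free N- and C-termini


def composition(sequence: str) -> dict:
    """Return {residue: (count, percent)} for every residue in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {res: (n, 100.0 * n / total) for res, n in sorted(counts.items())}


def molecular_weight(sequence: str) -> float:
    """Average molecular weight (Da) of an unmodified linear peptide."""
    return sum(AVERAGE_RESIDUE_MASS[r] for r in sequence.upper()) + WATER_MASS


peptide = "VHLTPEEK"
for residue, (count, percent) in composition(peptide).items():
    print(f"{residue}: {count} ({percent:.1f}%)")
print(f"Mw: {molecular_weight(peptide):.2f} Da")
```

The residue masses are summed and one water is added because each peptide bond releases a water molecule, leaving a single H and OH at the two ends of the chain.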

Amino acid composition, mass & pI:

Amino acid composition & Mass – ProtParam (ExPASy, Switzerland)
Isoelectric Point – Compute pI/Mw tool (ExPASy, Switzerland). If you want a plot of the relationship between charge and pH use ProteinChemist (ProteinChemist.com) or JVirGel Proteomic Tools (PRODORIC Net, Germany).
Mass, pI, composition and mol% of acidic, basic, aromatic, polar etc. amino acids – PEPSTATS (EMBOSS). Biochemistry-online (Vitalonic, Russia) reports percent composition, molecular weight, pI, and charge at any desired pH.
Peptide Molecular Weight Calculator (GenScript) – the online calculator determines the chemical formula and molecular weight of your peptide of interest. You can also specify post-translational modifications, such as N- and C- terminal modifications and positioning of disulfide bridges, to obtain more accurate outputs.

Isoelectric Point Calculator 2.0 (IPC 2.0) – a server for the prediction of isoelectric points and pK_a values using a mixture of deep learning and support vector regression models. The prediction accuracy (RMSD) of IPC 2.0 for proteins and peptides outperforms previous algorithms. (Reference: Kozlowski LP (2021) Nucl. Acids Res. Web Server issue).
Composition/Molecular Weight Calculation (Georgetown University Medical Center, U.S.A.) – the only problem with this site is that, when run in batch mode, it does not identify each sequence by name, only by a sequential number.

Batch Protein Isoelectric Point determination – available in the Sequence Manipulation Suite or at ENDMEMO

Batch Protein Molecular Weight determination – available in the Sequence Manipulation Suite or at ENDMEMO

Protein calculator (C. Putnam, The Scripps Research Institute, U.S.A.) – calculates mass, pI, charge at a given pH, counts amino acid residues etc.
Tm Predictor (P.C. Lyu Lab., National Tsing-Hua University, Taiwan) – calculates the theoretical protein melting temperature.

Computation of size of DNA and Protein Fragments from Their Electrophoretic Mobility
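The tools above differ in detail, but the basic idea behind a theoretical pI is the same: sum the charges of all ionizable groups at a given pH and search for the pH where the net charge is zero. The sketch below shows that idea only. The pKa values are an illustrative set and are not taken from any specific tool listed here, so the numbers it gives will differ somewhat from those tools; all names are invented for this example.

```python
# Illustrative pKa values; every tool uses its own set, so results differ.
PKA_N_TERMINUS = 8.6
PKA_C_TERMINUS = 3.6
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}           # protonated form is +1
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}  # deprotonated form is -1


def net_charge(sequence: str, ph: float) -> float:
    """Net charge of an unmodified peptide at the given pH (Henderson-Hasselbalch)."""
    seq = sequence.upper()
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))
    for residue, pka in PKA_POSITIVE.items():
        charge += seq.count(residue) / (1.0 + 10 ** (ph - pka))
    for residue, pka in PKA_NEGATIVE.items():
        charge -= seq.count(residue) / (1.0 + 10 ** (pka - ph))
    return charge


def isoelectric_point(sequence: str, tolerance: float = 1e-4) -> float:
    """Find the pH of zero net charge by bisection between pH 0 and 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(sequence, middle) > 0:
            low = middle   # still positively charged: the pI lies at higher pH
        else:
            high = middle
    return (low + high) / 2.0


peptide = "VHLTPEEK"
print(f"charge at pH 7: {net_charge(peptide, 7.0):+.2f}")
print(f"theoretical pI: {isoelectric_point(peptide):.2f}")
```

Bisection works here because the net charge only decreases as the pH rises, so there is exactly one crossing point. Calling `net_charge` over a range of pH values gives the charge-versus-pH curve mentioned above.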

Exercise

Example: Use ProtParam, Compute pI/Mw, ProtScale and RandSeq to analyze the SWISS-PROT entry with accession number P35523.

Enter the accession number into the query box and hit Submit.
ProtParam: What are the parameters of the potential extracellular domain (aa 858-988) of the protein?
Compute pI/Mw: What are the pI and Mw of the potential extracellular domain (aa 858-988) of the protein?
ProtScale: View the hydrophobicity plot according to Kyte & Doolittle (default) with a window size of 21. Click Submit again on the next page. Can you predict the number of transmembrane-spanning domains (very hydrophobic regions) using this tool? (Answer: 12 transmembrane segments.)
RandSeq: Change the radio button from average amino acid composition (default) to Composition of a specific sequence and enter your SWISS-PROT/TrEMBL ID P35523. Click Submit. A random sequence will be generated.
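The hydrophobicity plot in the ProtScale step is a sliding-window average over a per-residue scale. The sketch below reproduces that idea with the Kyte & Doolittle hydropathy values and a plain unweighted window. It is not ProtScale's code. The threshold of 1.6 and the toy sequence are assumptions for illustration, and counting peaks this way is only a rough guide to transmembrane segments.

```python
# Kyte & Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 21) -> list:
    """Return (centre position, mean hydropathy) for every full window.

    Positions are 1-based and refer to the residue at the window centre.
    """
    seq = sequence.upper()
    half = window // 2
    profile = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        score = sum(KYTE_DOOLITTLE[r] for r in segment) / window
        profile.append((start + half + 1, score))
    return profile


def count_hydrophobic_regions(profile: list, threshold: float = 1.6) -> int:
    """Count separate stretches of the profile that rise above `threshold`."""
    regions = 0
    inside = False
    for _, score in profile:
        if score > threshold and not inside:
            regions += 1
        inside = score > threshold
    return regions


# Toy sequence: two hydrophobic stretches separated by charged residues.
toy = "D" * 30 + "L" * 25 + "D" * 30 + "I" * 25 + "D" * 30
profile = hydropathy_profile(toy, window=21)
print(count_hydrophobic_regions(profile))  # 2
```

To try it on the exercise protein, paste the sequence of P35523 in place of `toy` and compare the number of peaks with what the ProtScale plot shows.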
