Predicting protein function using the InterPro database and InterProScan: a tutorial
June 11, 2019
What is InterPro?
InterPro is an integrative database that was established in the late 1990s, when the PROSITE, PRINTS, Pfam and ProDom databases formed a consortium to amalgamate the predictive signatures they had each produced into a single resource. Since then, six other member databases have joined and their data has been integrated: SMART, TIGRFAMs, PIRSF, SUPERFAMILY, PANTHER and Gene3D. The signatures of each member database are built using different but complementary methodologies.
When different signatures match the same set of proteins in the same region of the sequence, they are assumed to describe the same functional family, domain or site, and a curator places them into a single InterPro entry. Grouping equivalent signatures from different sources has obvious benefits: the signatures receive consistent names and annotation. It also highlights potentially erroneous signature hits. A remote homologue might be expected to match only one signature from an entry that contains several, but such outliers can also be explained by the single match being a false positive, so the user should treat these results with more caution.
Considering the full set of signatures from the member databases also increases the overall coverage of protein space. InterPro signature matches to the UniProt Knowledgebase (UniProtKB) are calculated regularly using the InterProScan software package. This information is used to assist UniProtKB curators in their annotation of Swiss-Prot proteins, and it is the basis of the automatic systems that add annotation to UniProtKB/TrEMBL. The UniParc protein archive and the UniMES metagenomic sequence database are also put through InterPro analysis pipelines, and many genome sequencing projects continue to use InterPro and its software to characterise whole genomes functionally.
If a signature matches only a subset of the proteins matched by another signature, it is likely to be more functionally or taxonomically specific than the other. In this case the signatures are deemed to be related: the signature matching the subset is termed the child, and the other signature is its parent. These parent-child relationships are created by InterPro's curators during the integration process, and a hierarchy of how the integrated signatures relate to each other is built up as a result. In this way, InterPro also increases the depth of annotation of protein space.
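The subset idea behind parent and child signatures can be illustrated with ordinary set comparisons. The following is only a minimal sketch: the signature names and protein identifiers are hypothetical, and real curation also takes match positions and expert judgement into account.

```python
def relationship(a, b):
    """Classify signature A against signature B using their sets of matched proteins."""
    if a == b:
        return "equivalent to"    # candidates for the same InterPro entry
    if a < b:
        return "child of"         # A matches a strict subset of B's proteins
    if a > b:
        return "parent of"        # A matches a strict superset of B's proteins
    return "not nested with"      # partial overlap or no overlap at all


# Hypothetical signatures and the proteins each one matches.
matched = {
    "SIG_GENERAL": {"P1", "P2", "P3", "P4"},
    "SIG_SPECIFIC": {"P1", "P2"},
    "SIG_OTHER": {"P5"},
}

names = sorted(matched)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, relationship(matched[first], matched[second]), second)
```

Here SIG_SPECIFIC would be reported as a child of SIG_GENERAL because every protein it matches is also matched by the broader signature.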
When an InterPro entry is created, curators add annotation such as a descriptive abstract, a name and cross-references to other resources, including Gene Ontology (GO) terms. Semi-automatic procedures create and maintain links to an array of other databases, including the protease resource MEROPS, the protein interaction database IntAct, the protein sequence clusters in CluSTr and the 3D protein structure database PDB. In addition, if a protein has a solved 3D structure in PDB, or a structure modelled in either the MODBASE or SWISS-MODEL databases, this information is shown alongside the member databases' signature matches in the graphical display on the InterPro web interface.
Users can access all pre-computed matches of signatures to UniProtKB through the web interface in a variety of graphical and text-based formats. They can change how these matches are shown, for example by sorting by UniProtKB identifier or name, or by choosing to display matches according to taxonomy, known 3D structures or splice variants. They can also download XML-format files of matches to UniProtKB, the UniProt Archive (UniParc) and the UniMES metagenomic sequence database.
InterProScan is available on the web at http://www.ebi.ac.uk/Tools/InterProScan/, and the whole package can be downloaded from the FTP site ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/index.html. InterProScan allows users to submit their own sequences to the search algorithms and processing used by InterPro and its member databases. Results are available in several formats and show the signatures that match the sequence(s), the InterPro entry (if any) into which each signature is integrated, and any GO terms associated with those entries. SOAP-based web services also exist (http://www.ebi.ac.uk/Tools/webservices/WSInterProScan.html), which allow users to submit their own nucleotide and protein sequences programmatically.
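As an illustration of working with InterProScan results, the sketch below reads tab-separated output and collects the InterPro entries matched by each sequence. It assumes the column order of the TSV format produced by InterProScan 5, which may differ from the version you run, and the sample rows (coordinates, scores, dates and the `query1` identifier) are made up for the example rather than taken from a real run.

```python
import csv
import io
from collections import defaultdict

# Column order assumed from the InterProScan 5 TSV output; check it against your version.
COLUMNS = [
    "protein", "md5", "length", "analysis", "signature", "signature_desc",
    "start", "stop", "score", "status", "date",
    "interpro", "interpro_desc", "go", "pathways",
]


def parse_tsv(handle):
    """Yield one dictionary per signature match."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        record = dict(zip(COLUMNS, row))  # trailing optional columns may be absent
        record["start"] = int(record["start"])
        record["stop"] = int(record["stop"])
        go = record.get("go", "-")
        record["go"] = [] if go in ("", "-") else go.split("|")
        yield record


def entries_by_protein(records):
    """Group InterPro accessions by protein, skipping unintegrated signatures."""
    grouped = defaultdict(set)
    for record in records:
        accession = record.get("interpro", "-")
        if accession not in ("", "-"):
            grouped[record["protein"]].add(accession)
    return dict(grouped)


# Made-up rows for illustration only; they are not real InterProScan output.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    ["query1", "md5placeholder", "296", "Pfam", "PF00531", "Death domain",
     "20", "100", "1.0E-10", "T", "11-06-2019", "IPR000488", "Death domain", "GO:0007165"],
    ["query1", "md5placeholder", "296", "Pfam", "PF01582", "TIR domain",
     "160", "290", "1.0E-20", "T", "11-06-2019", "IPR000157", "TIR domain", "GO:0007165"],
    ["query1", "md5placeholder", "296", "Coils", "Coil", "-",
     "5", "15", "-", "T", "11-06-2019", "-", "-"],
])

print(entries_by_protein(parse_tsv(io.StringIO(SAMPLE))))
```

To use it on a real results file, replace `io.StringIO(SAMPLE)` with an open file handle.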
Methodology
- Searching InterPro with an amino acid sequence
First, locate the sequence search box on the InterPro homepage. Next, choose a protein sequence to search against the database. Copy and paste the sequence into the search box and press 'Search'.
The InterPro sequence search results overview page.
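Before pasting a sequence into the search box, it can help to strip any FASTA header and check for stray characters. The sketch below is one way to do that. The allowed alphabet (the 20 standard residues plus X) is an assumption made for this example, and the website may accept other symbols; the example record is hypothetical.

```python
# Assumed alphabet: the 20 standard residues plus X for an unknown residue.
# The web form may accept further symbols; adjust this set if needed.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYX")


def clean_sequence(text):
    """Return a single upper-case protein sequence with header and whitespace removed."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) > 1:
        raise ValueError("expected one sequence, found %d FASTA records" % len(headers))
    residues = "".join(line for line in lines if not line.startswith(">"))
    # Drop internal whitespace, normalise case, remove a trailing stop symbol.
    residues = "".join(residues.split()).upper().rstrip("*")
    if not residues:
        raise ValueError("no sequence found")
    unexpected = sorted(set(residues) - ALLOWED)
    if unexpected:
        raise ValueError("unexpected characters: " + "".join(unexpected))
    return residues


# Hypothetical record, for illustration only.
example = """>query hypothetical example record
mkt ayiak
QRQISFVK"""

print(clean_sequence(example))
```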
Section A – shows the family to which InterPro predicts the sequence belongs. This is displayed as a hierarchy, where appropriate. Clicking the link takes you to the InterPro entry page for the family, where detailed information about its function can be found.
Section B – summarises the domains and repeats that InterPro predicts the protein to contain. The sequence is represented as a dark bar with its length in amino acids displayed along the bottom. Domains and repeats are displayed as coloured bars. Mousing over a bar reveals the type of domain or repeat it represents, along with its position on the sequence and a link to the relevant InterPro entry page.
Section C – holds detailed signature match information, showing the raw match positions of all the different InterPro signatures on the sequence. These include (where available) signatures representing families, domains, repeats and sites, as well as unintegrated signatures that are not associated with InterPro entries. The information displayed in this section can be controlled using the interactive menu on the left-hand side of the screen (Section D).
Section E – shows the Gene Ontology (GO) terms predicted for the protein. These terms are assigned based on the matches to the InterPro entries shown above.
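The way GO terms follow from entry matches can be sketched as a simple lookup: collect the GO terms attached to every InterPro entry the sequence matches. The mapping below is illustrative only and has not been checked against a current InterPro release; the real terms are listed on each entry page.

```python
# Illustrative entry-to-GO mapping; look up the real terms on the entry pages.
ENTRY_TO_GO = {
    "IPR000488": ["GO:0007165"],
    "IPR000157": ["GO:0007165"],
}


def predict_go_terms(matched_entries, entry_to_go):
    """Return the GO terms attached to the matched entries, without duplicates."""
    terms = []
    for entry in matched_entries:
        for term in entry_to_go.get(entry, []):  # entries without GO terms add nothing
            if term not in terms:
                terms.append(term)
    return terms


# "IPR_UNANNOTATED" is a hypothetical entry with no GO terms attached.
print(predict_go_terms(["IPR000488", "IPR000157", "IPR_UNANNOTATED"], ENTRY_TO_GO))
```

Two entries that carry the same term contribute it only once, which is why a protein with several matched domains can still show a short list of GO terms.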
We will first examine the domains that InterPro predicts the query protein to contain.
Mousing over the domains reveals their name, their position and the InterPro entries with which they are associated.
InterPro entry page for the death domain.
By examining the entry page for the death domain, we can see that this domain is a protein interaction module involved in the association of receptors so that they can signal downstream events. Other proteins in UniProtKB that InterPro predicts to contain this domain can be accessed by clicking the 'Proteins matched' link in the left-hand menu.
Navigating back to the search results page, we can examine the second domain that InterPro predicts the query protein to contain. Following the link to its InterPro entry page, shown in the following figure, we can find out more about this domain, including its involvement in signalling. As with the death domain entry page, more information on the proteins predicted to contain a TIR domain can be found using the left-hand menu.
InterPro entry page for the TIR domain.
Clicking on the family link takes us to the InterPro entry page for this family, shown in the figure below. This page explains what MyD88 is, what its functions are, and which GO terms can be applied to members of the family. The left-hand menu allows us to examine the different UniProt proteins that are predicted to belong to the family (Proteins matched), their different domain architectures (Domain organisations) and the species in which they are found (Species). Clicking on any of these menu items opens a new page with the relevant information and with links for downloading specific datasets (e.g., all the proteins in a family, all the proteins with a particular domain architecture, or all the family members from a particular kingdom, class or species).
InterPro entry page for the myeloid differentiation primary response protein MyD88 family of proteins.
It is possible to find all of the UniProtKB proteins that match a protein family by using the 'Proteins matched' link on the left-hand side of the family page in the figure above.
Following the ‘Proteins matched’ link takes us to a page like the one shown in figure below. This is a paginated list of UniProtKB proteins, showing their name, accession number, species and domain architectures (if known). Those sequences with a gold star are from Swiss-Prot, the manually annotated section of UniProtKB, whilst those with a silver star are from the automatically annotated section TrEMBL. Proteins for which structural information is available are shown with a ‘3D’ icon, next to their accession number. A FASTA file containing the sequences of all of the proteins can be downloaded by clicking the ‘Export FASTA’ button at the top right of the page.
The proteins matched page, displaying UniProtKB proteins belonging to the same protein family as the query sequence.
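Once the FASTA file has been exported from the 'Proteins matched' page, it can be summarised with a few lines of code. The sketch below counts how many sequences come from Swiss-Prot and how many from TrEMBL. It assumes the file uses UniProt-style headers that begin with `sp|` or `tr|`, which may not hold for every export; the records shown are hypothetical placeholders.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_sections(records):
    """Count records per UniProtKB section, based on the assumed header prefix."""
    labels = {"sp": "Swiss-Prot", "tr": "TrEMBL"}
    counts = Counter()
    for header, _sequence in records:
        prefix = header.split("|", 1)[0]
        counts[labels.get(prefix, "unknown")] += 1
    return counts


# Hypothetical records; the accessions and sequences are placeholders.
SAMPLE = """>sp|ACC0001|NAME1_HUMAN placeholder description
MKTAYIAK
QRQISFVK
>tr|ACC0002|ACC0002_MOUSE placeholder description
MKTAYIAR
>tr|ACC0003|ACC0003_DANRE placeholder description
MKSAYIAK
"""

print(count_sections(read_fasta(io.StringIO(SAMPLE))))
```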
To classify proteins into families and to predict the presence of important domains or sequence features, we require computational tools. One such set of tools is the predictive models known as protein signatures.
Signatures are built by the member databases in the InterPro consortium. Different member databases use different methods to construct their signatures, and they have their own particular focus of interest: structural and/or functional domains, protein families, or protein features such as active sites or binding sites.
More information is available from the side menu of an InterPro entry page, including links to the proteins matched by that entry and the species in which they are found. The side menu contains the following links:
- Proteins matched: more detailed information about the proteins matched by the entry.
- Domain organisation: the domain context, that is, the combinations of domains found in the matched proteins.
- Pathways and interactions: further details about the pathways in which the proteins take part.
- Species: the list of species that have proteins matching the entry.
- Structure: links to the structural databases.
- Literature: references for the entry.
- Cross-references: links to other resources, such as enzyme databases.
Beyond this, InterPro is useful because it reduces redundancy and simplifies protein sequence analysis by integrating signatures from different member databases that represent the same protein family, domain or site. It also unites the member databases, capitalising on their individual strengths to produce a powerful classification tool. InterPro provides a single, convenient, searchable location that allows all member databases to be queried simultaneously. It also adds information (including descriptive abstracts and Gene Ontology terms) to the signatures, which may be used to annotate the proteins they match.
InterPro is used to provide annotation for UniProtKB. The sequences stored in UniProtKB, the universal protein database, are analysed regularly using InterPro. In this way, InterPro helps to provide annotation for uncharacterised sequences in the UniProtKB database.
Figure: flow of data between UniProt and InterPro.
InterPro can be used to analyse any protein sequence. Users can also choose to analyse their own sequences for predictions about their function and/or the presence of certain domains and sequence features.
InterPro is useful when you have an amino acid sequence, or a set of sequences, and want to know what they are: the family to which they belong, what their function is and how it can be explained in structural terms. You can also use InterPro for a variety of other purposes, such as predicting GO terms for a set of sequences, or identifying all sequences in the UniProtKB database that are predicted to belong to a given family or to possess a particular domain.
InterPro is not suited to every task. It is not the right tool when you want to perform a structural alignment of protein sequences, or when you have a genomic DNA sequence and are interested in gene annotation (intron/exon predictions, identification of promoter regions, etc.).
Conclusion
In a nutshell, InterPro is a resource for classifying proteins into families and predicting domains, repeats and functional sites. InterPro integrates protein signatures from 14 member databases that are part of the InterPro consortium: CATH-Gene3D, CDD, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE Patterns, PROSITE Profiles, SFLD, SMART, SUPERFAMILY and TIGRFAMs. Protein signatures are predictive models based on similarities between proteins that share the same structure or function. InterPro can be searched with an amino acid sequence using the website. The website can also be searched with a word or phrase, an entry ID, UniProtKB accessions and identifiers, GO terms and structural identifiers. Batches of sequences can be searched against InterPro by downloading the InterProScan tool or by using the EBI's web services. The InterPro protein view gives a graphical representation of the signatures that match a particular protein, with information about protein family membership, sequence features, structural features and structural predictions for that protein.
References
Margaret B. (2002, September 1). Applications of InterPro in protein annotation and genome analysis. Retrieved from https://academic.oup.com/bib/article/3/3/285/239740
Nicola M. (2007, January). The new developments in interpro database. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1899100/
Alex M. (2014, November 26). The InterPro protein families database. Retrieved from https://academic.oup.com/nar/article/43/D1/D213/2439465
Sarah H. (2009, January). InterPro: The integrative protein signature database. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2686546/