Characterizing a protein using protein domain identification – Tutorial

March 13, 2024 By admin

Relevant websites:

Protein Information Tutorial

SMART (normal mode): http://smart.embl-heidelberg.de/

SMART (batch mode): http://smart.embl-heidelberg.de/smart/batch.pl

HMMER search: https://www.ebi.ac.uk/Tools/hmmer/

InterProScan: https://www.ebi.ac.uk/interpro/

DTU prediction server: https://services.healthtech.dtu.dk/

ELM (Eukaryotic Linear Motif): http://elm.eu.org/

PhosphoELM: http://phospho.elm.eu.org/

Characterizing a protein using protein domain identification and prediction servers on the web.

In this tutorial you will take known protein sequences and submit them to a variety of prediction servers to learn how to interpret the output from these servers.

Figure 1: Submission form for the SMART database.

Pay attention to the output from the various programs. If you do not understand it, look for help files or links to information explaining the output. For most of the DTU server programs, the output page includes a link that describes the output in detail and explains how to interpret it.

Use the SMART database in normal mode (single protein submission)

Search the SMART database using the human protein NEK2 with the Uniprot accession number P51955. Click on the Normal mode graphic and it will bring up a search window as shown in Figure 1. If you use Uniprot/SwissProt accession numbers, you can simply type in the accession number in the text box Sequence ID or ACC. Check the boxes PFAM domains and signal peptides. Then click the Sequence SMART button.

The output includes a graphic of any domains found in the protein as shown in Figure 2.

Figure 2: Domain found in the sequence Nek2_HUMAN (P51955). The vertical lines represent intron/exon boundaries.

The graphic is interactive. If you pause the mouse over one of the intron lines, it will display the position and reading frame. If you pause the mouse over the domain graphic, it will expand the display to show more information about the domain, including the position of the match and the E-value for the match. The domains are drawn on the protein sequence at the location where they are found. The bright green horizontal bars represent coiled-coil regions and the bright pink/magenta color represents regions of low complexity. If you click on the graphic, the window will change to show the feature details, including the alignment. At the top of the new window, there is a link labeled full annotation. Click on that and it will open a new window which shows the sequence of the domain with the catalytic residues highlighted in green (Figure 3).

This can be a very useful feature if you are interested in making constitutively active mutants or catalytically inactive mutants. The option of showing catalytic residues is not available for every domain.

Figure 3: SMART domain detail page. Catalytic residues are shown in green.

At the bottom of the SMART output there are also a number of expandable menus that contain links to additional information about the domain shown.
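SMART accepts an accession number directly, but several of the servers used later in this tutorial expect a pasted sequence. If the sequence is not already at hand, it can be retrieved by accession. The following is a minimal sketch, assuming UniProt's public REST endpoint for FASTA downloads; the function names are illustrative and not part of any server's API, and `fetch_sequence` needs network access.

```python
from urllib.request import urlopen

# Assumption: UniProt's public REST endpoint for single-entry FASTA downloads.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession: str) -> str:
    """Build the FASTA download URL for one UniProt accession."""
    return UNIPROT_FASTA_URL.format(accession=accession.strip().upper())


def parse_fasta(text: str) -> tuple[str, str]:
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like a FASTA record")
    header = lines[0][1:]            # drop the leading '>'
    sequence = "".join(lines[1:])    # join the wrapped sequence lines
    return header, sequence


def fetch_sequence(accession: str) -> tuple[str, str]:
    """Download and parse one record (requires network access)."""
    with urlopen(fasta_url(accession), timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))
```

Calling `fetch_sequence("P51955")` should return the NEK2 header and sequence, which can then be pasted into any of the sequence-based forms.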

BATCH submission using SMART database

    1. Click on the link for the SMART batch submission available on the Exercise 5 homepage.
    2. You can submit FASTA sequences, or Ensembl or UniProt protein IDs or accession numbers.
    3. Download the Excel file “SecretedProteins4ProteinInfoTutorial” from the Exercise 5 homepage. This worksheet has 61 proteins that were identified as secretory proteins, and most should be recognized by the SMART database.
    4. Copy and paste the UniProt IDs into the Identifiers box on the Batch retrieval page. Select the options to include PFAM domains and include signal peptides, then click the Submit button.
    5. The page that opens will list the IDs that had no matches and then give you a long list of matches with graphical output as shown in Figure 4.

Figure 4: SMART Batch output (partial)

If you click on a protein name (e.g. C1QT1_MOUSE), it will bring up a page with the same information as you would get from a single protein submission. This is not the most efficient way of looking at a list of proteins, but it does provide an immediate graphical look at the domains (for proteins available in the SMART database), and SMART is one of the few web-based servers that allow batch submission using protein IDs. You can do batch submission of protein sequences at HMMER, but the results take much longer (several days) to come back.
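Batch forms list rejected identifiers only after submission, so it can save time to tidy the list before pasting it. The sketch below assumes the identifiers are UniProt accession numbers (not entry names such as C1QT1_MOUSE) and checks them against the accession pattern given in the UniProt help pages. The function name `clean_identifiers` is illustrative.

```python
import re

# Accession pattern as published in the UniProt help pages.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def clean_identifiers(raw: str) -> tuple[list[str], list[str]]:
    """Return (valid, rejected) accessions, deduplicated, in input order."""
    valid, rejected, seen = [], [], set()
    # Accept whitespace, commas, or semicolons as separators.
    for token in re.split(r"[\s,;]+", raw.strip()):
        token = token.upper()
        if not token or token in seen:
            continue
        seen.add(token)
        if ACCESSION_RE.fullmatch(token):
            valid.append(token)
        else:
            rejected.append(token)
    return valid, rejected


valid, rejected = clean_identifiers("P51955, p51955\nP519955")
print("\n".join(valid))   # one accession per line, ready to paste
print("rejected:", rejected)
```

Here the duplicate is dropped and the seven-character string with an extra digit is rejected, which is the kind of typo that otherwise shows up as a "no match" entry in the batch output.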

HMMER database searches

Access to the PFAM database is available via InterProScan or via the HMMER search interface. The difference between them is that InterProScan combines searches of several different protein domain databases, while HMMER searches only the PFAM database.

Copy the CCR7 protein sequence from the Ex. 5 homepage and paste it into the search box for the HMMER search. Leave the default search against Reference proteomes. You should get a result that shows a match to the 7tm_1 domain. If you click the Show Hit Details, it will provide more detailed information about the hit as well as show a second domain hit. By default, it will show matches for transmembrane domains “TM” and signal peptides.

Figure 5: Output of HMMER search with CCR7 protein

Scroll down the window to see the taxonomic distribution of the hits.

Sequence submissions using InterProScan

InterProScan is well supported by EBI and is well integrated with their databases, particularly UniProt.

Submit the sequence for the CCR7 protein. Once the output comes back you will see that it includes matches to multiple protein domains, including Prosite and Prints. If you mouse over the different domain graphics, a pop-up window will provide information on the source and statistical evaluation of the match. The IPR#### entries represent different families of proteins that have the same domain content and are built using a combination of signatures from the different domain databases. Click on one of those to see what information is available.

Figure 6: InterProScan output for CCR7 protein

There is a LOT of information available, particularly for well annotated proteins from human and mouse. If you click the Search tab from the InterProScan homepage, there is an option to search using Domain architecture. So, if there is a particular combination of domains that is of interest, you can use this to see how many other proteins share the combination.

Using TMHMM program at the DTU prediction server

Submit the CCR7 sequence to the TMHMM program (linked via the DTU prediction server on the Exercise 5 homepage), using extensive graphics as the output. The output should look like that displayed in Figure 7.

The red bars at the top represent the positions of probable transmembrane-spanning domains. The pink or blue lines connecting the TM domains provide the topology relative to the cell membrane. If you wanted to design a peptide antibody to this protein, would you choose a region predicted to be on the inside or outside of the membrane?

Figure 7: TMHMM output for CCR7 protein from the CBS prediction server

Do these results concur with those obtained from InterProScan? What additional information do you get from this server that you do not from InterProScan? Submit the NEK2 protein sequence to the TMHMM prediction server. Does it predict the presence of any TM domains?
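To compare the TMHMM topology with other results, or to list candidate regions for a peptide antibody, it can help to turn the prediction into explicit residue ranges. The following is a minimal sketch, assuming the one-line topology string that TMHMM reports in its short output format (for example `Topology=o5-10i15-20o`, where `o` and `i` mark loops outside and inside the membrane and each `start-end` pair is a predicted helix). The topology string used here is invented for illustration and is not the CCR7 result.

```python
import re


def parse_topology(topology: str, length: int) -> list[tuple[str, int, int]]:
    """Expand a TMHMM-style topology string into (label, start, end) segments.

    Coordinates are 1-based and inclusive; `length` is the protein length.
    """
    labels = {"i": "inside", "o": "outside"}
    tokens = re.findall(r"[io]|\d+-\d+", topology)
    if "".join(tokens) != topology:
        raise ValueError(f"unexpected characters in topology: {topology!r}")

    segments = []
    position = 1      # first residue not yet assigned to a segment
    side = None       # side of the membrane for the current loop
    for token in tokens:
        if token in labels:
            side = labels[token]
            continue
        start, end = map(int, token.split("-"))
        if start > position:                      # loop before this helix
            segments.append((side, position, start - 1))
        segments.append(("TM helix", start, end))
        position = end + 1
    if position <= length:                        # tail after the last helix
        segments.append((side, position, length))
    return segments


def loops_on_side(segments, side: str) -> list[tuple[int, int]]:
    """Return the residue ranges of all loops on one side of the membrane."""
    return [(start, end) for label, start, end in segments if label == side]


# Invented example: two helices in a 25-residue protein.
segments = parse_topology("o5-10i15-20o", 25)
print(loops_on_side(segments, "outside"))
```

A protein with no predicted helices yields a single segment covering the whole sequence, which is the form you would expect for a soluble protein.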

Use the prediction servers to look for signal peptides and phosphorylation sites.

Scan either the CCR7 or NEK2 protein sequence using the SignalP program at the DTU Prediction Servers site to look for signal peptides.

Given the results of the TMHMM predictions, would you expect this protein to be secreted? This site also predicts signal anchor peptides, which is a more likely result given the number of predicted TM domains.

Use the NetPhos program at the DTU Prediction Servers site to predict the potential threonine and tyrosine phosphorylation sites for NEK2 using a threshold of 0.7. Then check the number of experimentally verified phosphorylation sites using Phospho.ELM (UniProt accession P51955). Did any of the sites in Phospho.ELM overlap with the predictions from NetPhos? While predictions are useful for informing possible experiments, keep in mind that not all of them have the same rate of true positives. Some motifs, functional domains, and signals are more readily identified computationally than others. TM domains are fairly straightforward to predict accurately, computationally speaking. Small linear peptides that are phosphorylated are not.
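The comparison between predicted and verified sites is a set intersection once both lists are written as (residue, position) pairs. The sketch below assumes you have copied the NetPhos predictions and the Phospho.ELM sites by hand into Python lists; the positions and scores shown are placeholders for illustration, not real results for NEK2.

```python
def overlap_with_verified(predicted, verified, threshold=0.7):
    """Compare thresholded predictions with experimentally verified sites.

    predicted: iterable of (residue, position, score) tuples
    verified:  iterable of (residue, position) tuples
    """
    confident = {(res, pos) for res, pos, score in predicted if score >= threshold}
    verified = set(verified)

    def by_position(sites):
        return sorted(sites, key=lambda site: site[1])

    return {
        "confirmed": by_position(confident & verified),       # predicted and verified
        "predicted_only": by_position(confident - verified),  # no experimental support
        "verified_only": by_position(verified - confident),   # missed by the predictor
    }


# Placeholder values only; replace with the sites from your own output.
predicted = [("T", 10, 0.91), ("Y", 25, 0.55), ("T", 40, 0.72)]
verified = [("T", 10), ("S", 33)]
print(overlap_with_verified(predicted, verified))
```

The sizes of the three groups give a rough feel for how many predictions are unsupported and how many known sites the predictor missed at the chosen threshold.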

The ELM (Eukaryotic Linear Motif) resource

This site is a repository of manually curated, experimentally validated short linear motifs (SLiMs). Try out the search using the CCR7 and NEK2 sequences from the exercise web page. Select Homo sapiens for the context and do not specify a cellular context. The output for the CCR7 protein is shown in Figure 8. There are many potential matches to various ELMs, but most do not have a favorable context. Mouse over one of the darker blue boxes and it should provide information on the conservation score. You can click on the link on the left side of the graphic to find more information about one of the matches.

Figure 8: ELM output for CCR7

The NEK2 protein is likely to be in the cytosol, whereas CCR7 is a transmembrane protein localized to either the plasma membrane or mitochondrial membrane. Try redoing the searches with a cellular context selected.
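ELM motif classes are defined as regular expressions, which helps explain why a search returns so many candidate matches: a pattern only a few residues long will match many sequences by chance. The sketch below scans a sequence with a pattern of your choice. The pattern `[ST]P` and the test sequence are simplified placeholders for illustration and are not copied from an ELM class definition.

```python
import re


def find_motif(sequence: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched text) for every match of `pattern`.

    A lookahead is used so that overlapping matches are also reported.
    """
    lookahead = re.compile(f"(?=({pattern}))")
    return [
        (match.start() + 1, match.group(1))
        for match in lookahead.finditer(sequence.upper())
    ]


# Placeholder sequence and pattern.
print(find_motif("AASPKTPSP", "[ST]P"))
```

Even this nine-residue placeholder sequence produces three matches, so filters such as cellular context and conservation are needed to decide which matches are worth following up.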

Other prediction tools

The tools listed here are just a few of the hundreds that are available. If you are interested in a particular feature of a protein, such as whether it has a myristylation moiety or is ubiquitinated at certain residues, there are prediction programs for that. Start with a Google search and you can probably find one or more sites that have a web interface to the prediction algorithm. Always keep in mind that a prediction does not mean that the event actually occurs in the cell. If there is more than one prediction algorithm, you might want to test your protein(s) with several of them to determine the overlap and try to find a consensus.
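One simple way to look for a consensus is to count how many programs report each site and keep the sites that reach a minimum number of votes. The following is a minimal sketch under that assumption; the tool names and positions are hypothetical placeholders, and a majority vote is only one of several reasonable ways to combine predictions.

```python
from collections import Counter


def consensus_sites(predictions: dict[str, set[int]], min_votes: int = 2) -> list[int]:
    """Return positions reported by at least `min_votes` prediction tools.

    predictions maps a tool name to the set of residue positions it predicts.
    """
    votes = Counter(
        position
        for sites in predictions.values()
        for position in set(sites)      # each tool votes once per position
    )
    return sorted(position for position, count in votes.items() if count >= min_votes)


# Hypothetical tools and positions, for illustration only.
predictions = {
    "tool_a": {12, 48, 90},
    "tool_b": {48, 90},
    "tool_c": {90, 130},
}
print(consensus_sites(predictions))               # sites with two or more votes
print(consensus_sites(predictions, min_votes=3))  # sites all three tools agree on
```

Agreement between programs raises confidence but is still a prediction, particularly when the programs were trained on similar data.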

 
