UniProt - The protein database - Bioinformatics
October 17, 2023
Introduction
There are many biological databases in the world, such as those hosted by NCBI and EMBL. Each of them holds a particular kind of data. For example, NCBI hosts databases in which information on proteins and nucleotides is readily available. Similarly, there is a database that caters specifically to protein molecules: the UniProt database. UniProt is known as a central hub for functional information on proteins, with accurate, consistent and rich annotation. Information such as the amino acid sequence, protein name, protein description, taxonomic data and citation information is available in this database. In addition, it provides biological ontologies, classifications and cross-references, as well as clear indications of annotation quality in the form of evidence attribution for experimental and computational data.
UniProt provides four core databases in which information is stored and can be explored (counting the two sections of UniProtKB separately): UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, UniParc and UniRef. UniProtKB is a protein database that is partially curated by experts and consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot contains manually annotated records with information extracted from the literature and from curator-evaluated computational analysis, whereas UniProtKB/TrEMBL contains computationally analysed records that still await full annotation. In short, the records in UniProtKB/Swiss-Prot are reviewed and manually annotated, while the records in UniProtKB/TrEMBL are unreviewed and automatically annotated.
UniParc, also known as the UniProt Archive, is another core database. It is comprehensive and non-redundant: it contains all the protein sequences from the main publicly available protein sequence databases. Because the same protein can exist in multiple copies across multiple databases, which is redundant and wastes resources, UniParc stores each unique protein sequence only once. Identical sequences are merged regardless of whether they come from the same or different species. The sequences are stored without annotations, and each one is given a unique identifier to ease identification. Whenever a stored sequence or its record in a source database changes, UniParc tracks the change and archives it. The source databases that UniParc follows include EMBL, DDBJ, GenBank, Ensembl, the EPO database and many more.
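The "store each unique sequence once" idea can be sketched in a few lines of Python. This is only a toy illustration, not how UniParc is implemented: the class name `SequenceArchive`, the `ARC...` identifier format, the use of SHA-256 as the checksum and the example source IDs are all assumptions made for this sketch.

```python
import hashlib


class SequenceArchive:
    """Toy non-redundant archive (hypothetical; not actual UniParc code)."""

    def __init__(self):
        self._by_checksum = {}  # checksum -> record
        self._counter = 0

    def add(self, sequence, source_db, source_id):
        """Store a sequence once and remember every source it came from."""
        seq = sequence.strip().upper()
        checksum = hashlib.sha256(seq.encode("ascii")).hexdigest()
        record = self._by_checksum.get(checksum)
        if record is None:
            # First time this exact sequence is seen: assign a new identifier.
            self._counter += 1
            record = {
                "id": f"ARC{self._counter:010d}",  # made-up identifier format
                "sequence": seq,
                "sources": [],
            }
            self._by_checksum[checksum] = record
        xref = (source_db, source_id)
        if xref not in record["sources"]:
            record["sources"].append(xref)
        return record["id"]

    def __len__(self):
        return len(self._by_checksum)


archive = SequenceArchive()
first = archive.add("ACDEFGHIKL", "EMBL", "example-1")
second = archive.add("acdefghikl", "Ensembl", "example-2")  # same sequence
third = archive.add("MKTAYIAK", "GenBank", "example-3")
```

Here `first` and `second` are the same identifier because the two submissions carry the same sequence, so the archive holds only two records even though three were added.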
The fourth core database in UniProt is the UniProt Reference Clusters (UniRef). It consists of three databases, UniRef100, UniRef90 and UniRef50, which contain clustered sets of protein sequences from UniProtKB and selected UniParc records. UniRef100 merges identical sequences (and sub-fragments) into a single entry, while UniRef90 and UniRef50 group sequences at 90% and 50% sequence identity, respectively, using a clustering algorithm (historically CD-HIT, more recently MMseqs2). Clustering significantly reduces the database size and enables quicker searches for the user.
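To make the idea of clustering at an identity threshold more concrete, here is a minimal greedy sketch in Python. It is not the algorithm UniRef uses: real tools align the sequences, whereas the `ungapped_identity` function below is a crude stand-in that compares sequences position by position without gaps. Both function names are hypothetical.

```python
def ungapped_identity(a, b):
    """Fraction of identical positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first representative it is similar enough to.

    Sequences are processed from longest to shortest, so the longest member
    of each cluster becomes its representative.
    """
    clusters = {}  # representative sequence -> list of member sequences
    for seq in sorted(sequences, key=len, reverse=True):
        for rep in clusters:
            if ungapped_identity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters[seq] = [seq]
    return clusters


toy = ["ACDEFGHIKL", "ACDEFGHIKV", "WWWWWWWWWW"]
at_90 = greedy_cluster(toy, 0.9)
at_100 = greedy_cluster(toy, 1.0)
```

With the toy input, the first two sequences differ in one of ten positions, so they share a cluster at the 0.9 threshold but fall into separate clusters at 1.0. This mirrors how a lower identity threshold gives fewer, larger clusters.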
In this exercise, we shall extract information from the protein database UniProt. This database is maintained jointly by the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), England, and Georgetown University, Washington DC, USA.
UniProt, http://www.uniprot.org/, consists of three parts:
- UniProt Knowledge-base (UniProtKB)
- protein sequences with annotation and references
- UniProt Reference Clusters (UniRef)
- homology-reduced database, where similar sequences (having a certain percentage identity) are merged into clusters, each with a representative sequence
- UniProt Archive (UniParc)
- an archive containing all versions of UniProt without annotations
Of these databases, the UniProt Knowledge-base is the most useful, and this is the database we shall be using today. It consists of two parts:
- UniProtKB/Swiss-Prot
- a manually annotated (reviewed) protein-database.
- UniProtKB/TrEMBL
- a computer-annotated supplement to Swiss-Prot, that contains all translations of EMBL nucleotide sequences not yet included in Swiss-Prot.
Simple text mining
First, we will find some UniProt entries using simple text mining. You are supposed to find the entry for human insulin.
- Open the UniProt home-page https://www.uniprot.org/
- Type human insulin in the search field at the top of the page. Leave the search menu on “UniProtKB”, which is the default. Press Enter or click the Search button.
- If you are new to UniProt, you will be asked whether you want to view your results as “Cards” or “Table”. Choose “Table”.
- QUESTION 1.1:
- How many hits do you find? (tip: See the number above the results list)
- How many of these hits are from Swiss-Prot? (tip: See under “Reviewed” at the top left)
- Can you identify the correct hit (i.e. see which one is actually human insulin and not something else)? If yes, write down its Accession code and Entry name (also called ID).
In this case, it was relatively easy to spot the correct hit, but sometimes it is more difficult. If you do not identify the correct hit immediately, it will often help to narrow down the search, and that is exactly what we ask you to do in the next four questions.
The first step is searching for proteins that actually come from the organism “human” and are named something containing the word “insulin”, as opposed to just containing the words “human” and “insulin” somewhere in the entry.
On the left, you can see a list of “Model organisms”. Try to click “Human”.
- QUESTION 1.2:
- How many hits are now left? How many of these are from Swiss-Prot?
However, to really solve the problem, we have to enter Advanced mode. Click on Advanced in the right part of the search field. Search for human in the Organism [OS] field, then click Add field and search for insulin in the Protein Name [DE] field.
- QUESTION 1.3:
- How many hits are now left? How many of these are from Swiss-Prot? And what has the search string in the text box at the top of the page now turned into?
Now, you should exclude proteins that are not insulin, but only insulin-like. Open the Advanced menu again, add a field, make sure it is combined by NOT instead of AND, and remove hits that have insulin-like in the protein name.
- QUESTION 1.4:
- How many hits are now left? How many of these are from Swiss-Prot? And what is the search string?
Note that you can also edit the search string directly, instead of going through the Advanced menu every time.
- Try now to exclude proteins that are insulin receptors (or substrates for insulin receptors).
- QUESTION 1.5:
- How did you do this?
- How many hits are now left? How many of these are from Swiss-Prot?
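The stepwise narrowing above (an AND between fields, then NOT for unwanted names) can also be expressed programmatically. The sketch below only builds the search string and a request URL; it does not contact UniProt. The field names `organism_name` and `protein_name`, the `rest.uniprot.org/uniprotkb/search` endpoint and its `query`, `format` and `size` parameters are assumptions based on the current UniProt website. The string the website shows you may look different, so treat this as an illustration rather than the expected answer.

```python
from urllib.parse import urlencode


def _term(field, value):
    """Format one field:value pair, quoting values that contain spaces."""
    if " " in value:
        value = f'"{value}"'
    return f"({field}:{value})"


def build_query(include, exclude=()):
    """Combine required terms with AND and unwanted terms with NOT."""
    if not include:
        raise ValueError("at least one required term is needed")
    query = " AND ".join(_term(f, v) for f, v in include)
    for field, value in exclude:
        query += " NOT " + _term(field, value)
    return query


def search_url(query, fmt="tsv", size=25):
    """Build a search URL (assumed endpoint); the request itself is not sent."""
    base = "https://rest.uniprot.org/uniprotkb/search"
    return base + "?" + urlencode({"query": query, "format": fmt, "size": size})


query = build_query(
    include=[("organism_name", "human"), ("protein_name", "insulin")],
    exclude=[("protein_name", "insulin-like"), ("protein_name", "insulin receptor")],
)
url = search_url(query)
```

Building the string in code makes it easy to add or drop one exclusion at a time and compare hit counts, which is what Questions 1.3 to 1.5 ask you to do by hand.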
The contents of UniProt
We shall now see what information is contained in a UniProt entry, and what further information is available as links in each entry.
Click on the accession-code or ID for insulin. This will take you to the insulin entry in the UniProtKB/Swiss-Prot database. Spend some time to get an overview of the page and the information it contains.
- Note that you can click on the headings in the left side of the page to scroll to different sections of the page. Try it!
- Note also that every time there is a small “i” after a term on the page, you can click it to get information about the term. Try it!
Now click on Publications in the top part of the window. Click on UniProtKB/Swiss-Prot under Source to show only those references that are part of the entry and exclude those that are “computationally mapped”. Note that it is indicated what each reference has contributed (“Cited for”). You can get to the PubMed literature database at NCBI by clicking the “PubMed” link for a reference; try this. The abstract of a publication can be read there (or directly in UniProt using the “View abstract” link), if the work is an actual published article and not a “direct submission”.
- QUESTION 2.1:
- How many references are there in the insulin entry?
- Why do you think insulin is such a highly investigated protein? (Hint: see other sections of the entry, e.g. Function and Disease & Drugs, especially the subsections Involvement in disease and Pharmaceutical)
- Scroll back to Function and read the free-text description at the top of the section. Also have a look at the controlled vocabulary annotations: “Gene Ontology” (GO) and Keywords. Note that both of these are split into two different aspects: Molecular function and Biological process.
- Now scroll to Subcellular Location and read what is written there. Note that you find another set of “Gene Ontology” (GO) and Keywords annotations here; this time labelled Cellular component.
- QUESTION 2.2:
- Where in the cell / outside the cell do you find insulin?
- Why do you think it is found there? (Hint: consider the function)
Just like in GenBank, a UniProt entry has a Feature Table containing annotations that are coupled to specific parts of the sequence. In the default view, the Feature Table is not so easy to spot, since it is split up under different sections corresponding to the biological significance of the various annotations. However, in the top part of the window you can click on Feature viewer, which shows the feature table information in a graphical form. Try it. Then click on Molecule processing to show the signal peptide and the propeptide.
Now switch back to the default (Entry) view. In the following, you will see some examples of Feature Table annotations.
- Under Disease & Drugs, the subsection Variants lists the variants (mutations) of insulin that have been described in the literature. Under the heading Change, it is indicated which amino acid is changed into which other amino acid. If the variant is known to be associated with a disease, this is indicated under the heading Description.
- Under PTM/Processing, the subsection Features shows that insulin has both a signal peptide and a pro-peptide. These are both cleaved off before secretion. The mature insulin (the A and B chains) is hence much smaller than what was shown under Sequences.
- QUESTION 2.3:
- How long are the signal peptide and the propeptide, respectively?
- Under Structure, the subsection Features shows the secondary structure elements “Helix” (α-helix), “Beta strand” (part of a β-pleated sheet) or “Turn”. The regions without specified secondary structure are often called “Loop” or “Coil”.
- QUESTION 2.4:
- Which positions are in β-sheet conformation in insulin?
Other databases linked from UniProt
UniProt has many useful links to other databases. In the graphical view, the cross-references are spread among several different headings, just like the feature table is.
Under the heading Sequence & Isoform, there is a sub-heading named Sequence databases. Here, you can e.g. find links to nucleotide sequences in the databases EMBL / GenBank / DDBJ. Try clicking one of the GenBank links marked “Genomic DNA”; that should take you to a page that looks like something you saw last week.
Under the heading Structure, there is an interactive window showing a three-dimensional structure of insulin. Note that you can rotate the structure with your mouse. Actually, this structure is not part of UniProt itself; it is a cross-link to the protein structure database PDB. Below the interactive window, you can see the actual cross-links to PDB. Note that PDB is not one single database. Just as was the case for the nucleotide databases, there is a European version (PDBe), an American version (RCSB-PDB), and a Japanese version (PDBj), but luckily, they contain the same data. We will work with the American version of PDB later in the course. As you can see, there are many PDB structures of insulin; in other words, the 3D structure of insulin has been determined several times.
Under the heading Family & Domains, there is a subsection named Family and domain databases. It has links to databases containing proteins that are similar (protein families). These have been collected using various techniques that you will hear about later in the course (multiple alignment). In some cases, the proteins are similar only in smaller parts (domains) but not in other parts, and in some cases the databases can tell which parts of the actual protein are known in other species. Some large proteins (not small ones like insulin) can contain several different parts (domains), each with its own evolutionary history. The most important of these databases is InterPro, because it collects the information from most of the other databases. Try to click on one of the InterPro links. This will take you to the InterPro page with lots of information about the protein family that insulin belongs to.
Text format
Until now, we have been working with the graphical user interface to UniProt. However, all the information is also available in plain text format, and that’s what you will be working with if you are going to analyze larger amounts of UniProt data later in your studies. For now, let’s just have a look at it.
Scroll to the top of the Human Insulin page and find the menu labeled Download. Click it, and then right-click the option Text and open it in a new tab. What you see here basically contains all the information you have seen in the graphical interface.
Scroll through the plain text file and see if you can find the same information that you just found in the graphical interface. Note that every line starts with a two-letter code specifying the type of the information in the line. Here are some examples:
- ID: Entry name (ID). There is only one ID.
- AC: Accession code. There may be more than one.
- DE: Description (protein names).
- GN: Gene Name
- OS: Organism/Species.
- OC: Organism Classification.
- OX: TaxID (as defined in the NCBI Taxonomy database).
- RN, RP, RX, RA, RT, RL: References.
- CC: Comments (annotations pertaining to the whole protein).
- DR: Cross-references to other databases.
- KW: Keywords.
- FT: Feature Table (annotations pertaining to specified parts of the sequence).
- SQ: Sequence header line.
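Because every line begins with a two-letter code, the text format is easy to process with a short script. The sketch below groups the lines of one entry by their code. It assumes the usual flat-file layout, in which the code occupies the first two columns and the content starts at column six, and that an entry ends with `//`. The `SAMPLE` text is a hand-shortened illustration, not a complete entry, and the function names are hypothetical.

```python
from collections import defaultdict

# Hand-shortened, illustrative excerpt; a real entry has many more lines.
SAMPLE = """\
ID   INS_HUMAN               Reviewed;         110 AA.
AC   P01308;
DE   RecName: Full=Insulin;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;
//
"""


def parse_flat_entry(text):
    """Group the lines of one flat-file entry by their two-letter code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, content = line[:2], line[5:].rstrip()
        if code.strip():
            fields[code].append(content)
        elif content:
            # Lines without a code hold the sequence itself, in spaced blocks.
            fields["sequence"].append(content.replace(" ", ""))
    return dict(fields)


def accessions(fields):
    """Return all accession codes from the AC line(s)."""
    joined = " ".join(fields.get("AC", []))
    return [acc.strip() for acc in joined.split(";") if acc.strip()]


entry = parse_flat_entry(SAMPLE)
entry_name = entry["ID"][0].split()[0]
```

For larger analyses an established parser (for example the one in Biopython) is usually a safer choice than hand-written code, but the sketch shows why the line codes make the format convenient to work with.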
Advanced search
The UniProt interface allows you to use most of the fields in the database for searching, not only the fields like name and organism, as we did previously, but also the functional and structural annotations. We shall now try a few of these.
- Go back to UniProt’s main page, http://www.uniprot.org/. Important: If the search string from the previous search is still shown in the search field, clear it. Then click Advanced to the right of the search field. This brings up a box with a new interface.
- Now we will find out how many proteins have signal peptides (just like insulin has). In the drop-down menu that appears in the box, select PTM/Processing, then select Molecule Processing, then select Signal peptide. In the empty field that now appears to the right of the word Signal, type a * (otherwise, it will not work). Click the Search button.
- QUESTION 3.1:
- How many proteins did you find, how many of them are from Swiss-Prot, and what was the search string (the text that appeared in the search field)?
- Evidence: The proteins we find in this way include proteins that are predicted to have signal peptides, without necessarily having any experimental evidence for the signal peptides. We will now limit the search to experimentally confirmed signal peptides. Click Advanced again (without erasing your previous search) and change the Evidence menu to Any experimental assertion.
- QUESTION 3.2:
- How many proteins do you find now, how many of them are from Swiss-Prot, and what has the search string changed into?
- Combining fields: How many experimentally confirmed signal peptides are found in humans? Click on Advanced Search again and click Add field to get a second search line. Leave the menu to the left on AND, select Organism [OS] in the drop-down menu, type human in the field Term, accept the suggestion “Homo sapiens (Human) [9606]” and click the Search button.
- QUESTION 3.3:
- How many proteins do you find now, and what is the search string? (Note that you can always perform the search by editing the text in the search field — however to do this you need to know the names for the fields).
About strains and subspecies
Let us now try something different. If you search for proteins from a microbial species, you may run into trouble, because each subspecies or strain has its own TaxID, and you probably want all possible strains. Let’s try an example (first, clear the previous search): Say you want all proteins from the bacterium Bacillus subtilis — a very important production organism in biotechnology. Try to type Bacillus subtilis in the Organism [OS] field: you will see a suggestion named “Bacillus subtilis [1423]” – accept that.
- QUESTION 3.4:
- How many proteins are there in UniProt from Bacillus subtilis with the default TaxID [1423]? How many of these are from Swiss-Prot? And what is the search string?
The number of entries in Swiss-Prot may seem low for such a well-studied organism. In addition, you may note that there is a link next to the total number of results saying “or expand search to "1423" to include lower taxonomic ranks”. Click it.
- QUESTION 3.5:
- How many proteins are there in UniProt from Bacillus subtilis in total (all strains and subspecies)? How many of these are from Swiss-Prot? And what is the search string?
In conclusion, use the field Taxonomy [OC] instead of Organism [OS] when working with microbial species where you want all strains.
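The difference between the two fields can be illustrated with a toy lineage lookup. An organism search matches only entries annotated with exactly that TaxID, whereas a taxonomy search also matches entries whose TaxID lies below it in the taxonomy tree. The small `PARENT` table is an illustrative assumption (one strain under one subspecies under the species 1423), and the query field names `organism_id` and `taxonomy_id` are assumed from the current UniProt website.

```python
# Toy child -> parent table; the strain and subspecies IDs are illustrative.
PARENT = {224308: 135461, 135461: 1423}


def lineage(tax_id):
    """Return the TaxID followed by all of its ancestors in the toy table."""
    chain = [tax_id]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def matches(entry_tax_id, query_tax_id, include_lower_ranks):
    """Mimic an organism search (exact) or a taxonomy search (with descendants)."""
    if include_lower_ranks:
        return query_tax_id in lineage(entry_tax_id)
    return entry_tax_id == query_tax_id


def taxon_query(tax_id, include_lower_ranks=False):
    """Build the corresponding search term (assumed field names)."""
    field = "taxonomy_id" if include_lower_ranks else "organism_id"
    return f"({field}:{tax_id})"
```

An entry annotated with the strain-level ID is missed by `matches(224308, 1423, False)` but found by `matches(224308, 1423, True)`, which is why the hit count grows when the search is expanded to lower taxonomic ranks.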
Searching for short proteins
- Numerical field: Now we will try to answer a completely different question: Which extremely short proteins are present in UniProt? Clear the previous search. In the advanced drop-down menu, select Sequence and then Sequence length. Now two new fields appear where you can define the lower and upper limits for the search. Type 1 and 10 and search. Note: in your answers to the questions below, include the search string just like you did in the questions above!
- QUESTION 3.6:
- How many proteins of maximum length 10 do you find?
- Extremely short proteins are often mistakes translated directly from a nucleotide sequence with no evidence for the sequences being protein coding. Limit your search to proteins that actually have evidence for their existence at the protein level (add a field, and set the drop-down menu to Protein existence [PE] and select Evidence at protein level).
- QUESTION 3.7:
- How many proteins are now left?
- A large fraction of the proteins identified in this way are fragments. Try to exclude fragments from the search. Add a field. In the drop-down menu, choose Sequence, then Fragment, then No.
- QUESTION 3.8:
- How many proteins are now left?
- And how many of these proteins are found in humans? Do as before…
- QUESTION 3.9:
- How many human non-fragment proteins of maximum length 10 do you find in UniProt?
- Finally you can save the results of your search. First, sort them by length by clicking on the column header. Then, click on Download above the list of results. You can now save the search results in the format you prefer (try FASTA (canonical) and click Preview).
- QUESTION 3.10:
- Copy the FASTA sequences to your report.
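Once the results are downloaded as FASTA, the header lines can be checked against the search criteria. The following sketch parses UniProt-style FASTA headers of the form `>db|accession|entry_name description`, where `sp` marks Swiss-Prot and `tr` marks TrEMBL, and pulls out the length, the `PE=` value and whether the description is flagged as a fragment. The two sample records are taken from the results reported later in this post; the function name and dictionary keys are hypothetical.

```python
import re

SAMPLE = """\
>sp|P0DJH3|VMP3A_DEIAC Zinc metalloproteinase-disintegrin-like AAV1 (Fragment) OS=Deinagkistrodon acutus OX=36307 PE=1 SV=1
DVVSPPVCGN
>sp|P0DKX2|VSP1_CRODM Thrombin-like enzyme Cdc SI (Fragment) OS=Crotalus durissus cumanensis OX=184542 PE=1 SV=1
VIGGDECNIN
"""

HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")


def parse_fasta(text):
    """Parse UniProt-style FASTA text into a list of record dictionaries."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = HEADER.match(line)
            if match is None:
                raise ValueError(f"unrecognised header: {line}")
            db, accession, entry_name, description = match.groups()
            pe = re.search(r"\bPE=(\d)", description)
            records.append({
                "reviewed": db == "sp",
                "accession": accession,
                "entry_name": entry_name,
                "fragment": "(Fragment" in description,
                "existence": int(pe.group(1)) if pe else None,
                "sequence": "",
            })
        elif records:
            # Sequence lines are appended to the most recent header.
            records[-1]["sequence"] += line
    return records


records = parse_fasta(SAMPLE)
by_length = sorted(records, key=lambda r: len(r["sequence"]))
non_fragments = [r for r in records if not r["fragment"]]
```

A check like `non_fragments` is a quick way to confirm that a downloaded result set really matches the filters you intended to apply.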
Results and Discussion
Question 1
- How many hits do you find?
- 4747 hits in total
- How many of these hits are from Swiss-Prot?
- 1597 reviewed by Swiss-Prot
- Can you identify the correct hit? If yes, write down its Accession code and Entry name.
- Example of one correct hit.
Accession Code: P01308;
Entry Name: INS_HUMAN
Question 2
- How many hits are left?
- 1617 hits (For Homo sapiens)
- How many of these are from Swiss-Prot?
- 1074 hits
Question 3
- How many hits are now left?
- 208 hits
- How many of these are from Swiss-Prot?
- 60 hits
Question 4
- How many hits are now left?
- 113 hits
Question 5
- How did you do this?
- Go to advanced search and select a new search field. Select NOT, Protein Name and type the term “insulin receptor”. Select search and the results produced will exclude any data regarding insulin receptors.
- How many hits are now left?
- 62 hits.
Question 6
- How many references are there in the insulin entry?
- 36 references.
- Why do you think insulin is such a highly investigated protein?
- Insulin is a very important protein in human life because it maintains the blood glucose concentration. A lack of this protein causes the disease commonly known as diabetes. This is only the most basic function of the protein; its functions are described in more detail in the UniProt entry. Furthermore, research on insulin is not limited to human physiology but also spans fields such as pathology and biotechnology. This research has enabled the pharmaceutical use of insulin, which has improved the lives of diabetic patients who cannot produce their own insulin.
Question 7
- Where in the cell / outside the cell do you find insulin?
- Insulin is produced in the beta cells of the pancreas. It diffuses into the bloodstream and enters a target cell via an insulin receptor, which is made up of two receptor subunits located on the outside of the cell membrane. From there it can be found in transport vesicles in the cell.
- Why do you think it is found there?
- When the insulin hormone enters a cell from the receptors in the cell membrane, it does so via receptor-mediated endocytosis, which results in a transport vesicle.
Question 8
- How long is the signal peptide and the propeptide, respectively?
- Signal peptide: positions 1 to 24 (length 24)
- Propeptide: positions 57 to 87 (length 31)
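Feature positions in UniProt are inclusive at both ends, so the length of a feature is the end position minus the start position plus one. A minimal check of the two lengths above (the function name is only illustrative):

```python
def feature_length(start, end):
    """Length of a feature whose start and end positions are both inclusive."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


signal_peptide = feature_length(1, 24)
propeptide = feature_length(57, 87)
```

Forgetting the "plus one" is a common off-by-one mistake when reading feature tables.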
Question 9
- Which positions are in β-sheet conformation in insulin?
- Positions 26 – 29, 48 – 50, 56 – 58, 74 – 76, 98 – 101.
Question 10
- How many proteins did you find, and what was the search string?
- Proteins found: 15566563 proteins
- Search string: annotation:(type:signal)
Question 11
- How many proteins do you find now, and what has the search string changed into?
- Proteins found: 3758
- Search string: annotation:(type:signal evidence:experimental)
Question 12
- How many proteins do you find now, and what is the search string?
- Proteins found: 729
- Search string: annotation:(type:signal evidence:experimental) AND organism:"Homo sapiens (Human) [9606]"
Question 13 A
- How many proteins are there in UniProt from Neisseria gonorrhoeae with the default TaxID [485]?
- 12597 proteins
Question 13 B
- How many proteins are there in UniProt from Neisseria gonorrhoeae in total (all strains and subspecies)?
- 28412 proteins
Question 13 C
- What does the search string look like now?
- From organism:"Neisseria gonorrhoeae [485]" to taxonomy:"Neisseria gonorrhoeae [485]"
Question 14
- How many proteins of maximum length 10 do you find?
- 17648
- Search string: length:[10 TO 10]
Question 15
- How many proteins are now left?
- 456 proteins
- Search string: length:[10 TO 10] existence:"Evidence at protein level [1]"
Question 16
- How many proteins are now left?
- 255 proteins
- Search string: length:[10 TO 10] existence:"Evidence at protein level [1]" fragment:no
Question 17
- How many human non-fragment proteins of maximum length 10 do you find in UniProt?
- 201 proteins
- Search string: length:[10 TO 10] existence:"Evidence at protein level [1]" fragment:yes
Question 18
- Copy the FASTA sequences to your report.
- >sp|P0DJH3|VMP3A_DEIAC Zinc metalloproteinase-disintegrin-like AAV1 (Fragment) OS=Deinagkistrodon acutus OX=36307 PE=1 SV=1
- DVVSPPVCGN
- >sp|P0DKX2|VSP1_CRODM Thrombin-like enzyme Cdc SI (Fragment) OS=Crotalus durissus cumanensis OX=184542 PE=1 SV=1
- VIGGDECNIN
- >sp|P85103|VMXP_PHIPA Snake venom metalloproteinase patagonfibrase (Fragment) OS=Philodryas patagoniensis OX=120310 PE=1 SV=2
- LSTDIVAPPV
- >sp|B3EWG2|LACC_LEPMG Laccase (Fragment) OS=Lepiota magnispora OX=182864 PE=1 SV=1
- VTIGKEGTLT
- >sp|P0DJE8|VSPDV_BOTJR Thrombin-like enzyme D-V (Fragment) OS=Bothrops jararacussu OX=8726 PE=1 SV=1
- VVGADNCNFN
- >sp|P11180|ODP2_BOVIN Dihydrolipoyllysine-residue acetyltransferase component of pyruvate dehydrogenase complex (Fragment) OS=Bos taurus OX=9913 GN=DLAT PE=1 SV=1
- VETDKATVGF
- >sp|B3A0L4|LAC1_HERCO Laccase (Fragment) OS=Hericium coralloides OX=100756 PE=1 SV=1
- AVGDDTPQLY
- >sp|B3EWG8|PA2A_PORNA Acidic phospholipase A2 PnPLA2 (Fragment) OS=Porthidium nasutum OX=74558 PE=1 SV=1
- DLLQFXDMMK
- >sp|C0HLB1|ATP5E_YARLI ATP synthase subunit epsilon, mitochondrial (Fragment) OS=Yarrowia lipolytica (strain CLIB 122 / E 150) OX=284591 GN=ATP15 PE=1 SV=1
- MSAWMSAGFS
- >sp|B3EWR3|3SAW_NAJNA Cytotoxin NN-32 (Fragment) OS=Naja naja OX=35670 PE=1 SV=1
- LKCNKLVPLF
Conclusion
The practical above gives the student an idea of how the UniProt database works. Each database has different functions and search properties, and UniProt has its own search options, which allow for extensive searches. In the practical exercise, the search string was noted at each step in order to see how each search option affects the overall results. The search engine of the database was explored and understood, as were other functions of the database. In conclusion, UniProt is a database that caters specifically to information on protein molecules, including hormones, receptors, enzymes and so on.
References
- Exercise: The protein database UniProt – teaching. (2017, February 10). UniProt. Retrieved October 26, 2021, from https://teaching.healthtech.dtu.dk/teaching/index.php/Exercise:_The_protein_database_UniProt
- UniProt. (n.d.). * in UniRef. Retrieved October 26, 2021, from https://www.uniprot.org/uniref/
- UniProt. (2021a). UniProt. Retrieved October 26, 2021, from https://www.uniprot.org/
- UniProt. (2021b, January 28). UniProtKB. Retrieved October 26, 2021, from https://www.uniprot.org/help/uniprotkb