Exploring Protein Information and Analysis with UniProt

March 15, 2024, by admin

This outline provides a comprehensive overview of UniProt, including data retrieval, analysis, integration with other tools, real-world applications, and future trends, with a focus on hands-on learning and practical skills development.

Introduction to UniProt

UniProt is a comprehensive resource for protein sequence and functional information. It consists of several databases, including the UniProt Knowledgebase (UniProtKB), UniProt Reference Clusters (UniRef), and the UniProt Archive (UniParc). Here’s an overview of each:

  1. UniProt Knowledgebase (UniProtKB): This is the central hub of the UniProt database. It provides a wealth of information on protein sequences and functional annotation. UniProtKB is divided into two sections:
    • Swiss-Prot: This section contains manually curated, high-quality protein sequences with detailed functional information.
    • TrEMBL: This section contains computationally analyzed protein sequences that await manual curation. It is larger than Swiss-Prot and is continuously updated with new protein sequences from various sources.
  2. UniProt Reference Clusters (UniRef): UniRef clusters protein sequences based on their similarity into three databases:
    • UniRef100: Combines identical sequences and sequence fragments (from any organism) into a single cluster.
    • UniRef90: Built by clustering UniRef100 sequences so that each member has at least 90% sequence identity to the cluster's representative sequence.
    • UniRef50: Built by clustering UniRef90 representative sequences so that each member has at least 50% sequence identity to the cluster's representative sequence.

    These databases help reduce redundancy and speed up sequence similarity searches.

  3. UniProt Archive (UniParc): UniParc is a comprehensive repository that contains all publicly available protein sequences, including those not yet integrated into UniProtKB. It serves as a historical record of all protein sequences and ensures that no sequence is lost, even if it is later removed from UniProtKB.

UniProt is an invaluable resource for researchers in bioinformatics, molecular biology, and related fields, providing easy access to up-to-date and curated information on protein sequences and functions.

Importance of UniProt in bioinformatics

UniProt plays a crucial role in bioinformatics for several reasons:

  1. Comprehensive Protein Information: UniProt provides a comprehensive collection of protein sequences and functional information, making it a valuable resource for researchers studying proteins and their functions.
  2. Standardized Data: UniProt ensures that protein data are standardized and organized, allowing for easier comparison and analysis across different species and datasets.
  3. Annotation: UniProt provides detailed annotations for proteins, including information on function, structure, subcellular location, post-translational modifications, and interactions. This annotation helps researchers understand the role of proteins in biological processes.
  4. Data Integration: UniProt integrates data from various sources, including experimental data, computational predictions, and literature curation, providing a unified and reliable source of protein information.
  5. Sequence Similarity Searches: UniProt offers tools for sequence similarity searches, such as BLAST, allowing researchers to identify similar sequences and infer evolutionary relationships.
  6. Protein Classification: UniProt classifies proteins into families and superfamilies based on sequence similarity and other criteria, helping researchers identify functionally related proteins.
  7. Proteome Analysis: UniProt provides proteomes for various organisms, allowing researchers to study the complete set of proteins encoded by a genome.
  8. Tool Integration: UniProt integrates with various bioinformatics tools and databases, enhancing its utility for researchers in analyzing protein data.

In summary, UniProt is an essential resource in bioinformatics, providing standardized, comprehensive, and up-to-date information on proteins that is crucial for understanding biological processes and advancing research in the field.

Accessing UniProt: website and API

Accessing UniProt can be done through its website and API:

  1. Website: The UniProt website (https://www.uniprot.org/) provides a user-friendly interface for searching and browsing protein data. Users can search for proteins by name, gene, keyword, or sequence. The website also offers tools for sequence similarity searches (BLAST), multiple sequence alignment (Align), peptide searches, and identifier mapping.
  2. API: The UniProt API (Application Programming Interface) allows programmatic access to UniProt data. This is particularly useful for developers and bioinformaticians who want to access UniProt data in an automated manner. The API provides access to a wide range of data, including protein sequences, annotations, and metadata.

To use the UniProt API, you need to make HTTP requests to the UniProt server. The API supports various query parameters and output formats, allowing you to customize your queries and retrieve data in a format that suits your needs. You can find more information about the UniProt API and how to use it in the UniProt documentation: https://www.uniprot.org/help/api_access
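As a rough illustration of what such a request looks like, the sketch below only builds a search URL; it does not send it. It assumes the REST service is hosted at rest.uniprot.org and accepts `query`, `format`, `size`, and `fields` parameters on the `/uniprotkb/search` endpoint. Check these names against the API documentation before relying on them. The helper name `build_search_url` is made up for this example.

```python
from urllib.parse import urlencode

# Assumed base URL of the UniProt REST service (verify against the API docs).
BASE_URL = "https://rest.uniprot.org"


def build_search_url(query, fmt="json", size=10, fields=None):
    """Return a UniProtKB search URL for the given query string."""
    params = {"query": query, "format": fmt, "size": size}
    if fields:
        # Several return fields are passed as one comma-separated value.
        params["fields"] = ",".join(fields)
    return f"{BASE_URL}/uniprotkb/search?{urlencode(params)}"


url = build_search_url("insulin", fmt="tsv", size=5, fields=["accession", "id"])
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client. Keeping URL construction separate from the download makes it easy to inspect the request before sending it.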

UniProt Data Retrieval

Searching UniProt: basic and advanced search options

UniProt offers both basic and advanced search options to help users find relevant protein information. Here’s an overview of each:

  1. Basic Search: The basic search option allows users to search for proteins using simple keywords. Users can enter protein names, gene names, accession numbers, keywords, or even sequences to retrieve matching results. The basic search is convenient for quickly finding specific proteins or protein families.
  2. Advanced Search: The advanced search option provides more refined search capabilities, allowing users to construct complex queries. Users can specify search criteria such as organism, protein function, subcellular location, and post-translational modifications. The advanced search is useful for narrowing down search results to find proteins with specific characteristics or properties.

Both basic and advanced search options can be accessed from the UniProt website’s search bar. Users can switch between the two modes depending on their search requirements. Additionally, UniProt provides search tips and examples to help users construct effective queries and retrieve relevant results.
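An advanced query is, in the end, just a text string of field-specific clauses joined by Boolean operators. The sketch below composes such a string. The field names `organism_name` and `protein_name` are assumed to be valid UniProtKB query fields, and the helper functions are hypothetical names used only for illustration.

```python
def field_clause(field, value):
    """Format one field:value clause, quoting values that contain spaces."""
    if " " in value:
        value = f'"{value}"'
    return f"({field}:{value})"


def build_query(must=(), must_not=()):
    """Join required clauses with AND, then append each exclusion with NOT."""
    query = " AND ".join(field_clause(f, v) for f, v in must)
    for f, v in must_not:
        query += " NOT " + field_clause(f, v)
    return query.strip()


query = build_query(
    must=[("organism_name", "mouse"), ("protein_name", "kinase")],
    must_not=[("protein_name", "receptor")],
)
print(query)
```

The same string can be typed directly into the search bar, which is often quicker than rebuilding a query in the Advanced dialog.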

Retrieving protein sequences and annotations

To retrieve protein sequences and annotations from UniProt, you can use the UniProt website or the UniProt API. Here’s how you can do it:

  1. Using the UniProt Website:
    • Go to the UniProt website (https://www.uniprot.org/).
    • Use the search bar to search for the protein of interest by name, gene, keyword, or sequence.
    • Click on the protein entry in the search results to view its details.
    • On the protein entry page, you can find the protein sequence under the “Sequence” section and annotations under various sections such as “Function,” “Subcellular location,” “Pathway,” etc.
    • You can also download the protein sequence in various formats (FASTA, XML, etc.) using the “Download” button on the protein entry page.
  2. Using the UniProt API:
    • Construct a query to retrieve protein information using the UniProt API. For example, to retrieve the sequence and annotations of a protein with accession number “P12345,” you can use the following API endpoint:
      https://rest.uniprot.org/uniprotkb/P12345.xml

      This endpoint will return the protein information in XML format. You can replace “P12345” with the accession number of the protein you are interested in.

    • You can also use the UniProt API to retrieve protein information in other formats, such as JSON, FASTA, plain text, or tab-separated values (TSV), by specifying the format in the API request.

Using either method, you can retrieve protein sequences and annotations from UniProt for further analysis and research.
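The API route can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, assuming entries are served at `https://rest.uniprot.org/uniprotkb/<accession>.<format>`. The function names are illustrative, and `fetch_entry` needs network access, so it is defined here but not called.

```python
from urllib.request import urlopen

# Assumed base URL for single-entry retrieval (verify against the API docs).
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession, fmt="fasta"):
    """Build the URL for one entry in the requested format (xml, fasta, txt, json)."""
    return f"{BASE_URL}/{accession}.{fmt}"


def fetch_entry(accession, fmt="fasta", timeout=30):
    """Download one entry and return it as text (requires network access)."""
    with urlopen(entry_url(accession, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(entry_url("P12345", "xml"))
# To download the entry: text = fetch_entry("P12345", "fasta")
```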

Downloading data: formats and options

When downloading data from UniProt, you have several options for formats and download types. Here are some common formats and options:

  1. Formats:
    • FASTA: A widely used format for representing nucleotide or protein sequences. Each sequence is represented by a header line starting with “>” followed by the sequence itself.
    • XML: A structured format that provides detailed information about proteins, including annotations, cross-references, and sequence data. XML is useful for parsing and extracting specific information.
    • TXT: A plain text format that may include various types of information about proteins, such as annotations, descriptions, and references.
    • HTML: A format that includes formatting for web display, suitable for viewing protein information in a web browser.
  2. Download Types:
    • Full Entry: Downloads the complete entry for a protein, including all available information such as sequence, annotations, and references.
    • Customized: Allows you to select specific fields or sections of the protein entry to download. This can be useful if you only need certain types of information.
  3. Other Options:
    • Compressed: Some formats, such as FASTA and XML, may be available for download in a compressed (e.g., gzip) format to reduce file size.
    • Batch Download: Allows you to download data for multiple proteins at once, either by specifying a list of accession numbers or by using a search query to retrieve matching entries.

When downloading data from UniProt, consider the format and download type that best suit your needs based on the information you require and how you plan to use it.
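Once a FASTA file has been downloaded, reading it takes only a few lines. The sketch below parses FASTA text and splits a UniProt-style header of the form `db|accession|entry_name description`. The record is a hand-written toy example, not a real UniProt entry, and the function names are illustrative.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_uniprot_header(header):
    """Split 'db|accession|entry_name description' into its parts."""
    identifier, _, description = header.partition(" ")
    db, accession, entry_name = identifier.split("|")
    return {"db": db, "accession": accession,
            "entry_name": entry_name, "description": description}


# Toy record, hand-written for illustration (not a real UniProt entry).
example = """>sp|X00001|TOY_HUMAN Toy protein OS=Homo sapiens OX=9606
MKTAYIAKQR
QISFVKSHFS
"""
records = parse_fasta(example)
header, sequence = records[0]
info = split_uniprot_header(header)
print(info["accession"], info["entry_name"], len(sequence))
```

For production work, an established parser such as the one in Biopython is usually a better choice; the point here is only to show how simple the format is.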

UniProt Data Analysis

Sequence analysis tools in UniProt

UniProt offers several sequence analysis tools that allow users to analyze protein sequences and perform various types of sequence-based analyses. Some of the key tools available on UniProt include:

  1. BLAST: The Basic Local Alignment Search Tool (BLAST) allows users to compare a protein sequence against a database of known protein sequences to identify similar sequences. BLAST helps in identifying homologous proteins and inferring functional relationships.
  2. Align: The Align tool allows users to align multiple protein sequences to identify conserved regions and analyze sequence similarities and differences. This tool is useful for phylogenetic analysis and identifying functional domains.
  3. Retrieve/ID mapping: This tool allows users to retrieve UniProtKB entries based on different types of identifiers, such as UniProtKB AC/ID, gene name, RefSeq, etc. It also provides mappings between different types of identifiers.
  4. Peptide search: The Peptide search tool allows users to search for peptide sequences within UniProtKB. This tool is useful for identifying proteins based on peptide sequences obtained from experimental data, such as mass spectrometry.
  5. Batch search: The Batch search tool allows users to search for multiple protein sequences at once. Users can upload a file containing protein sequences in FASTA format and perform various analyses on these sequences, such as BLAST searches or sequence annotation.
  6. Advanced search: The Advanced search tool provides a more comprehensive search interface, allowing users to search for proteins based on specific criteria such as keyword, organism, function, subcellular location, and more.

These tools are designed to help researchers analyze protein sequences, identify homologous proteins, and gain insights into protein function and evolution.
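To make the idea behind a peptide search concrete, the sketch below looks for exact occurrences of a short peptide in a small set of sequences held in memory. It illustrates the concept only and is not how UniProt implements its Peptide search tool. The sequences are toy data.

```python
def find_peptide(peptide, sequences):
    """Return {name: [1-based start positions]} for exact peptide matches."""
    hits = {}
    for name, sequence in sequences.items():
        positions = []
        start = sequence.find(peptide)
        while start != -1:
            positions.append(start + 1)  # report 1-based positions
            start = sequence.find(peptide, start + 1)
        if positions:
            hits[name] = positions
    return hits


# Toy sequences, hand-written for illustration.
toy = {"protA": "MKTAYIAKQRKTAY", "protB": "MSSGLLV"}
hits = find_peptide("KTAY", toy)
print(hits)
```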

Functional annotation tools and resources

Functional annotation of proteins is a critical step in understanding their biological roles and relationships. UniProt provides extensive functional annotation for proteins, including information on protein function, subcellular localization, post-translational modifications, protein-protein interactions, and pathways. Here are some key tools and resources for functional annotation:

  1. UniProtKB Annotations: UniProtKB provides annotations for proteins, including information on function, domain structure, subcellular location, and biological processes. In Swiss-Prot, these annotations are manually curated from the scientific literature and other databases; in TrEMBL, they are generated computationally.
  2. Gene Ontology (GO): UniProt annotates proteins with terms from the Gene Ontology (GO) project, which provides a structured vocabulary for describing the functions of genes and proteins in any organism. GO terms are used to annotate proteins with information about their molecular functions, biological processes, and cellular components.
  3. Enzyme Classification (EC) Numbers: UniProt annotates enzymes with EC numbers, which describe the reactions the enzyme catalyzes. EC numbers are part of the Enzyme Nomenclature system and help in understanding the biochemical functions of enzymes.
  4. Pathway Annotation: UniProt annotates proteins with information about their involvement in biological pathways. This information is sourced from pathway databases such as Reactome, KEGG, and BioCyc, providing insights into the role of proteins in various biological processes.
  5. Protein-Protein Interactions: UniProt annotates proteins with information about their interactions with other proteins. This information is sourced from interaction databases such as IntAct, BioGRID, and STRING, providing insights into the functional networks of proteins.
  6. Subcellular Localization: UniProt annotates proteins with information about their subcellular localization, indicating where the protein is located within the cell. This information is important for understanding the function and regulation of proteins.
  7. Post-Translational Modifications (PTMs): UniProt annotates proteins with information about post-translational modifications, such as phosphorylation, glycosylation, and acetylation. These modifications can affect the function, localization, and stability of proteins.

By leveraging these tools and resources, researchers can annotate proteins with detailed functional information, facilitating further analysis and interpretation of protein data in various biological contexts.
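These annotations can also be read programmatically from an entry retrieved in JSON format. The sketch below pulls GO terms out of a heavily abbreviated, hand-written sample. It assumes that cross-references are listed under `uniProtKBCrossReferences` and that each GO term carries a `GoTerm` property whose value is prefixed with C, F, or P. Treat that layout as an assumption and check it against a real entry.

```python
import json

# Abbreviated, hand-written sample in the assumed shape of a UniProtKB JSON entry.
sample_entry_json = """
{
  "primaryAccession": "X00001",
  "uniProtKBCrossReferences": [
    {"database": "GO", "id": "GO:0005576",
     "properties": [{"key": "GoTerm", "value": "C:extracellular region"}]},
    {"database": "GO", "id": "GO:0005179",
     "properties": [{"key": "GoTerm", "value": "F:hormone activity"}]},
    {"database": "PDB", "id": "0XXX", "properties": []}
  ]
}
"""

ASPECTS = {"C": "cellular component", "F": "molecular function", "P": "biological process"}


def go_terms(entry):
    """Return a list of (GO id, aspect, term name) tuples from one entry."""
    terms = []
    for xref in entry.get("uniProtKBCrossReferences", []):
        if xref.get("database") != "GO":
            continue
        for prop in xref.get("properties", []):
            if prop.get("key") == "GoTerm":
                code, _, name = prop["value"].partition(":")
                terms.append((xref["id"], ASPECTS.get(code, code), name))
    return terms


entry = json.loads(sample_entry_json)
terms = go_terms(entry)
for go_id, aspect, name in terms:
    print(go_id, aspect, name, sep="\t")
```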

Structural annotation and visualization tools

Structural annotation and visualization of proteins are crucial for understanding their 3D structure, function, and interactions. UniProt provides some basic structural information, such as protein domains and regions, but for more detailed structural annotation and visualization, researchers often use specialized tools and resources. Here are some commonly used tools for structural annotation and visualization:

  1. Protein Data Bank (PDB): The PDB is a repository of experimentally determined 3D structures of proteins and other biological molecules. Researchers can search the PDB to find structures of proteins and visualize them using molecular visualization software.
  2. PyMOL: PyMOL is a popular molecular visualization tool that allows users to visualize and analyze protein structures. It provides a wide range of features for manipulating and analyzing protein structures, such as measuring distances, calculating angles, and creating high-quality images.
  3. UCSF ChimeraX: UCSF ChimeraX is a next-generation molecular visualization program with advanced features for visualizing and analyzing molecular structures. It supports interactive exploration of large-scale molecular models and integrates with a variety of structural biology databases.
  4. Swiss-PdbViewer (DeepView): Swiss-PdbViewer is a tool for viewing and analyzing protein structures. It allows users to visualize protein structures in 3D, superimpose structures, and analyze protein-ligand interactions.
  5. RCSB PDB Protein Workshop: The RCSB PDB Protein Workshop is a web-based tool for visualizing and analyzing protein structures. It provides a range of tools for exploring protein structures, such as zooming, rotating, and coloring residues based on various properties.
  6. Coot: Coot is a molecular modeling program specifically designed for building and refining protein structures. It is often used in conjunction with crystallography and cryo-EM data to visualize and improve protein structures.
  7. Jmol: Jmol is an open-source Java viewer for chemical structures in 3D. It can read molecular files in various formats and allows users to visualize and manipulate protein structures.

These tools provide researchers with the ability to annotate and visualize protein structures, helping them understand the structure-function relationships of proteins and facilitating drug discovery and protein engineering efforts.

UniProt Tools Integration

Integrating UniProt data with other bioinformatics tools and databases

Integrating UniProt data with other bioinformatics tools and databases can provide researchers with a more comprehensive understanding of protein function, structure, and interactions. Here are some ways to integrate UniProt data with other tools and databases:

  1. Protein-Protein Interaction (PPI) Databases: Integrate UniProt data with PPI databases such as BioGRID, STRING, and IntAct to explore protein interactions and functional networks.
  2. Structural Databases: Combine UniProt data with structural databases like the Protein Data Bank (PDB) to visualize protein structures and understand their functional implications.
  3. Gene Expression Databases: Integrate UniProt data with gene expression databases such as Gene Expression Omnibus (GEO) or ArrayExpress to correlate protein function with gene expression patterns.
  4. Pathway Analysis Tools: Use UniProt data to annotate proteins involved in biological pathways and integrate this information with pathway analysis tools such as KEGG or Reactome to understand the role of proteins in cellular processes.
  5. Sequence Alignment Tools: Utilize UniProt data for sequence alignments and integrate it with tools like Clustal Omega or MUSCLE for comparative sequence analysis.
  6. Gene Ontology (GO) Annotation: Integrate UniProt data with GO annotations to understand the biological processes, molecular functions, and cellular components associated with proteins.
  7. Literature Databases: Link UniProt data to literature databases such as PubMed or Europe PMC to retrieve relevant publications and explore the functional context of proteins.

By integrating UniProt data with other bioinformatics tools and databases, researchers can gain a deeper understanding of protein function, interactions, and pathways, leading to new insights and discoveries in biology and medicine.

Using UniProt in conjunction with molecular modeling and simulation tools

Integrating UniProt data with molecular modeling and simulation tools can enhance the understanding of protein structure, function, and dynamics. Here’s how you can use UniProt in conjunction with these tools:

  1. Structural Modeling: Use UniProt data, such as protein sequences and functional annotations, as input for homology modeling tools like SWISS-MODEL or Modeller. These tools predict the 3D structure of a protein based on its sequence and known structures of homologous proteins.
  2. Molecular Dynamics (MD) Simulations: Use UniProt data, including structural annotations and post-translational modifications, to set up and analyze MD simulations. Tools like GROMACS or AMBER can simulate the dynamic behavior of proteins and their interactions with other molecules.
  3. Ligand Docking: Use UniProt data, such as binding sites and ligand interactions, to perform molecular docking simulations. Tools like AutoDock or AutoDock Vina can predict the binding modes of small molecules to protein targets.
  4. Analysis of Structural Features: Use UniProt data to analyze structural features of proteins, such as secondary structure elements, domains, and active sites. Tools like PyMOL or VMD can visualize these features and analyze their functional implications.
  5. Integration with Structural Databases: Integrate UniProt data with structural databases like the Protein Data Bank (PDB) to compare predicted or simulated structures with experimentally determined structures. This can help validate models and provide additional insights into protein structure and function.
  6. Function Prediction: Use UniProt data, including functional annotations and domain information, to predict the function of uncharacterized proteins. Tools like InterProScan or Pfam can annotate protein sequences with functional domains and predict protein function based on domain architecture.

By integrating UniProt data with molecular modeling and simulation tools, researchers can gain a more comprehensive understanding of protein structure, function, and dynamics, leading to insights into biological mechanisms and potential drug targets.

Workflow examples: from UniProt to functional analysis

Here are some workflow examples that demonstrate how to go from UniProt data to functional analysis using bioinformatics tools:

  1. Protein Function Prediction Workflow:
    • Step 1: Retrieve Protein Sequence from UniProt: Use the UniProt website or API to retrieve the protein sequence of interest.
    • Step 2: Domain Annotation: Use tools like InterProScan or Pfam to annotate protein domains and functional motifs in the sequence.
    • Step 3: Functional Annotation: Use tools like Blast2GO or DAVID to annotate the protein with Gene Ontology (GO) terms based on sequence similarity.
    • Step 4: Pathway Analysis: Use tools like KEGG or Reactome to analyze the protein’s involvement in biological pathways based on the annotated GO terms.
    • Step 5: Protein-Protein Interaction (PPI) Analysis: Use PPI databases like STRING or BioGRID to analyze the protein’s interactions with other proteins in the context of its function.
  2. Comparative Genomics Workflow:
    • Step 1: Retrieve Orthologous Proteins: Use UniProt to retrieve protein sequences of orthologs from different species.
    • Step 2: Multiple Sequence Alignment: Use tools like Clustal Omega or MAFFT to align the orthologous protein sequences.
    • Step 3: Phylogenetic Analysis: Use tools like PhyML or RAxML to construct a phylogenetic tree based on the aligned sequences.
    • Step 4: Functional Divergence Analysis: Use tools like DIVERGE or BUSTED to analyze the functional divergence of orthologous proteins across different species.
  3. Protein Structure-Function Relationship Workflow:
    • Step 1: Retrieve Protein Structure from UniProt or PDB: Use UniProt to retrieve the protein sequence and map it to a known 3D structure from the Protein Data Bank (PDB).
    • Step 2: Structural Analysis: Use tools like PyMOL or VMD to visualize the protein structure and analyze its structural features, such as domains, active sites, and binding pockets.
    • Step 3: Ligand Docking: Use tools like AutoDock or AutoDock Vina to dock small molecules to the protein structure and predict binding modes.
    • Step 4: Functional Implications: Analyze the structural features and ligand interactions to infer the protein’s function and potential drug targets.

These workflow examples illustrate how UniProt data can be used in conjunction with various bioinformatics tools to perform functional analysis of proteins.
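As a small worked example for the comparative workflow, the sketch below computes percent identity between two sequences that have already been aligned, for instance two rows taken from a Clustal Omega or MAFFT alignment. Definitions of identity vary between tools; this version assumes that columns containing a gap in either sequence are ignored. The sequences are toy data.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment: 7 ungapped columns, 6 of them identical.
identity = percent_identity("MKT-AYIA", "MKTQAYLA")
print(round(identity, 1))
```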

Case Studies and Applications

UniProt exercise

In this exercise, we shall extract information from the protein database UniProt. This database is administered in collaboration between the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), England, and Georgetown University, Washington DC, USA.

UniProt, http://www.uniprot.org/, consists of three parts:

  • UniProt Knowledge-base (UniProtKB)
    • protein sequences with annotation and references
  • UniProt Reference Clusters (UniRef)
    • homology-reduced database, where similar sequences (having a certain percentage identity) are merged into clusters, each with a representative sequence
  • UniProt Archive (UniParc)
    • an archive containing all versions of UniProt sequences, without annotations

Of these databases, UniProt Knowledge-base is the most useful, and this is the database we shall be using today. UniProt Knowledge-base consists of two parts:

  • UniProtKB/Swiss-Prot
    • a manually annotated (reviewed) protein-database.
  • UniProtKB/TrEMBL
    • a computer-annotated supplement to Swiss-Prot, that contains all translations of EMBL nucleotide sequences not yet included in Swiss-Prot.

Questions

Simple text mining

First, we will find some UniProt entries using simple text mining. You are supposed to find the entry for human insulin.

  • Open the UniProt home-page https://www.uniprot.org/
  • Type human insulin in the search field at the top of the page. Leave the search menu on “UniProtKB”, which is the default. Press Enter or click the Search button.
  • If you are new to UniProt, you will be asked whether you want to view your results as “Cards” or “Table”. Choose “Table”.

QUESTION 1.1:

  1. How many hits do you find? (tip: See the number above the results list)
  2. How many of these hits are from Swiss-Prot? (tip: See under “Reviewed” at the top left)
  3. Can you identify the correct hit (i.e. see which one is actually human insulin and not something else)? If yes, write down its Accession code and Entry name (also called ID).

In this case, it was relatively easy to spot the correct hit, but sometimes it is more difficult. If you do not identify the correct hit immediately, it will often help to narrow down the search, and that is exactly what we ask you to do in the next four questions.

The first step is searching for proteins that actually come from the organism “human” and are named something containing the word “insulin”, as opposed to just containing the words “human” and “insulin” somewhere in the entry.

On the left, you can see a list of “Model organisms”. Try to click “Human”.

QUESTION 1.2:

How many hits are now left? How many of these are from Swiss-Prot?

However, to really solve the problem, we have to enter Advanced mode. Click on Advanced in the right part of the search field. Search for human in the Organism [OS] field, then click Add field and search for insulin in the Protein Name [DE] field.

QUESTION 1.3:

How many hits are now left? How many of these are from Swiss-Prot? And what has the search string in the text box at the top of the page now turned into?

Now, you should exclude proteins that are not insulin, but only insulin-like. Open the Advanced menu again, add a field, make sure it is combined by NOT instead of AND, and remove hits that have insulin-like in the protein name.

QUESTION 1.4:

How many hits are now left? How many of these are from Swiss-Prot? And what is the search string?

Note that you can also edit the search string directly, instead of going through the Advanced menu every time.

  • Try now to exclude proteins that are insulin receptors (or substrates for insulin receptors).

QUESTION 1.5:

  1. How did you do this?
  2. How many hits are now left? How many of these are from Swiss-Prot?

The contents of UniProt

We shall now see what information is contained in a UniProt entry, and what further information is available as links in each entry.

Click on the accession-code or ID for insulin. This will take you to the insulin entry in the UniProtKB/Swiss-Prot database. Spend some time to get an overview of the page and the information it contains.

  • Note that you can click on the headings in the left side of the page to scroll to different sections of the page. Try it!
  • Note also that every time there is a small “i” after a term on the page, you can click it to get information about the term. Try it!

Now click on Publications in the top part of the window. Click on UniProtKB/Swiss-Prot under Source to show only those references that are part of the entry and exclude those that are “computationally mapped”. Note that it is indicated what each reference has contributed (“Cited for”). You can get to the PubMed literature database at NCBI by clicking the “PubMed” link for a reference; try this. The abstract of a publication can be read there (or directly in UniProt using the “View abstract” link), if the work is an actual published article and not a “direct submission”.

QUESTION 2.1:

  1. How many references are there in the insulin entry?
  2. Why do you think insulin is such a highly investigated protein? (Hint: see other sections of the entry, e.g. Function and Disease & Drugs, especially the subsections Involvement in disease and Pharmaceutical)
  • Scroll back to Function and read the free-text description at the top of the section. Also have a look at the controlled vocabulary annotations: “Gene Ontology” (GO) and Keywords. Note that both of these are split into two different aspects: Molecular function and Biological process.
  • Now scroll to Subcellular Location and read what is written there. Note that you find another set of “Gene Ontology” (GO) and Keywords annotations here; this time labelled Cellular component.

QUESTION 2.2:

  1. Where in the cell / outside the cell do you find insulin?
  2. Why do you think it is found there? (Hint: consider the function)

Just like in GenBank, a UniProt entry has a Feature Table containing annotations that are coupled to specific parts of the sequence. In the default view, the Feature Table is not so easy to spot, since it is split up under different sections corresponding to the biological significance of the various annotations. However, in the top part of the window you can click on Feature viewer, which shows the feature table information in a graphical form. Try it. Then click on Molecule processing to show the signal peptide and the propeptide.

Now switch back to the default (Entry) view. In the following, you will see some examples of Feature Table annotations.

  • Under Disease & Drugs, the subsection Variants lists the variants (mutations) of insulin that have been described in the literature. Under the heading Change, it is indicated which amino acid is changed into which other amino acid. If the variant is known to be associated with a disease, this is indicated under the heading Description.
  • Under PTM/Processing, the subsection Features shows that insulin has both a signal peptide and a pro-peptide. These are both cleaved off before secretion. The mature insulin (the A and B chains) is hence much smaller than what was shown under Sequences.

QUESTION 2.3:

How long are the signal peptide and the propeptide, respectively?

  • Under Structure, the subsection Features shows the secondary structure elements “Helix” (α-helix), “Beta strand” (part of a β-pleated sheet) or “Turn”. The regions without specified secondary structure are often called “Loop” or “Coil”.

QUESTION 2.4:

Which positions are in β-sheet conformation in insulin?

Other databases linked from UniProt

UniProt has many useful links to other databases. In the graphical view, the cross-references are spread among several different headings, just like the feature table is.

Under the heading Sequence & Isoform, there is a sub-heading named Sequence databases. Here, you can, e.g., find links to nucleotide sequences in the databases EMBL / GenBank / DDBJ. Try clicking one of the GenBank links marked “Genomic DNA”; that should take you to a page that looks like something you have seen last week.

Under the heading Structure, there is an interactive window showing a three-dimensional structure of insulin. Note that you can rotate the structure with your mouse. This structure is not part of UniProt itself; it is a cross-link to the protein structure database PDB. Below the interactive window, you can see the actual cross-links to PDB. Note that PDB is not one single database: just as was the case for the nucleotide databases, there is a European version (PDBe), an American version (RCSB-PDB), and a Japanese version (PDBj), but luckily, they contain the same data. We will work with the American version of PDB later in the course. As you can see, there are many PDB structures of insulin; in other words, the 3D structure of insulin has been determined several times.

Under the heading Family & Domains, there is a subsection named Family and domain databases. It has links to databases containing proteins that are similar (protein families). These have been collected using various techniques that you will hear about later in the course (multiple alignment). In some cases, the proteins are similar only in smaller parts (domains) but not in other parts, and in some cases the databases can tell which parts of the actual protein are known in other species. Some large proteins (not small ones like insulin) can contain several different parts (domains), each with its own evolutionary history. The most important of these databases is InterPro, because it collects the information from most of the other databases. Try to click on one of the InterPro links. This will take you to the InterPro page with lots of information about the protein family that insulin belongs to.

Text format

Until now, we have been working with the graphical user interface to UniProt. However, all the information is also available in plain text format, and that’s what you will be working with if you are going to analyze larger amounts of UniProt data later in your studies. For now, let’s just have a look at it.

Scroll to the top of the Human Insulin page and find the menu labeled Download. Click it, and then right-click the option Text and open it in a new tab. What you see here contains essentially all the information you have seen in the graphical interface.
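The same plain-text record can also be retrieved from a script. The sketch below assumes that UniProt's REST service at rest.uniprot.org serves the flat file of an entry at /uniprotkb/<accession>.txt; the base URL and the helper names entry_text_url and fetch_entry_text are illustrative and not taken from this tutorial.

```python
from urllib.request import urlopen

# Assumed base URL of the UniProt REST service (not stated in this tutorial).
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_text_url(accession):
    """Build the URL of the plain-text (flat file) version of one entry."""
    return f"{BASE_URL}/{accession}.txt"


def fetch_entry_text(accession, timeout=30):
    """Download the flat file for an accession and return it as a string."""
    with urlopen(entry_text_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_entry_text("P01308") should return the text of the human insulin entry, provided the service is reachable and the URL layout is as assumed.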

Scroll through the plain text file and see if you can find the same information that you just found in the graphical interface. Note that every line starts with a two-letter code specifying the type of the information in the line. Here are some examples:

  • ID: Entry name (ID). There is only one ID.
  • AC: Accession code. There may be more than one.
  • DE: Description (protein names).
  • GN: Gene Name
  • OS: Organism/Species.
  • OC: Organism Classification.
  • OX: TaxID (as defined in the NCBI Taxonomy database).
  • RN, RP, RX, RA, RT, RL: References.
  • CC: Comments (annotations pertaining to the whole protein).
  • DR: Cross-references to other databases.
  • KW: Keywords.
  • FT: Feature Table (annotations pertaining to specified parts of the sequence).
  • SQ: Sequence header line.
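Because every line starts with its two-letter code, the text format is easy to process with a few lines of code. The following is a minimal sketch, assuming that the code occupies the first two columns and the data starts in column 6. SAMPLE is a hand-shortened excerpt in the style of the insulin entry, not a complete record, and group_by_line_code is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-shortened excerpt in flat-file layout (illustration only).
SAMPLE = """\
ID   INS_HUMAN               Reviewed;         110 AA.
AC   P01308;
DE   RecName: Full=Insulin;
GN   Name=INS;
OS   Homo sapiens (Human).
OX   NCBI_TaxID=9606;
FT   SIGNAL          1..24
FT   PROPEP          57..87
//
"""


def group_by_line_code(text):
    """Collect the data part of each line under its two-letter line code."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # "//" terminates an entry
            break
        code = line[:2]
        if code.strip():  # skip lines that do not start with a code
            groups[code].append(line[5:].rstrip())
    return dict(groups)


groups = group_by_line_code(SAMPLE)
print(groups["ID"])
print(groups["FT"])
```

Grouping by line code keeps the sketch short; a real analysis would usually rely on an established parser rather than slicing columns by hand.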

Advanced search

The UniProt interface allows you to use most of the fields in the database for searching, not only the fields like name and organism, as we did previously, but also the functional and structural annotations. We shall now try a few of these.

  • Go back to UniProt’s main page, http://www.uniprot.org/. Important: If the search string from the previous search is still shown in the search field, clear it. Then click Advanced to the right of the search field. This brings up a box with a new interface.
  • Now we will find out how many proteins have signal peptides (just like insulin has). In the drop-down menu that appears in the box, select PTM/Processing, then select Molecule Processing, then select Signal peptide. In the empty field that now appears to the right of the word Signal, type a * (otherwise, it will not work). Click the Search button.

QUESTION 3.1:

How many proteins did you find, how many of them are from Swiss-Prot, and what was the search string (the text that appeared in the search field)?

  • Evidence: The proteins we find in this way include proteins that are predicted to have signal peptides, without necessarily having any experimental evidence for the signal peptides. We will now limit the search to experimentally confirmed signal peptides. Click Advanced again (without erasing your previous search) and change the Evidence menu to Any experimental assertion.

QUESTION 3.2:

How many proteins do you find now, how many of them are from Swiss-Prot, and what has the search string changed into?

  • Combining fields: How many experimentally confirmed signal peptides are found in humans? Click on Advanced Search again and click Add field to get a second search line. Leave the menu to the left on AND, select Organism [OS] in the drop-down menu, type human in the field Term, accept the suggestion “Homo sapiens (Human) [9606]” and click the Search button.

QUESTION 3.3:

How many proteins do you find now, and what is the search string? (Note that you can always perform the search by editing the text in the search field — however to do this you need to know the names for the fields).
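Editing the search text by hand amounts to joining field clauses with AND. The sketch below shows one way to assemble such a string and turn it into a search URL. The field names ft_signal and organism_id, the endpoint, and the parameters query, format and size are assumptions about the current UniProt query syntax and REST service; the strings you see in your own search field may be spelled differently.

```python
from urllib.parse import urlencode

# Assumed search endpoint of the UniProt REST service.
SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def combine_and(*clauses):
    """Join search clauses with AND, wrapping each clause in parentheses."""
    return " AND ".join(f"({clause})" for clause in clauses)


def search_url(query, fmt="tsv", size=25):
    """Build a search URL; urlencode percent-encodes the query text."""
    params = urlencode({"query": query, "format": fmt, "size": size})
    return f"{SEARCH_URL}?{params}"


# Assumed field names: ft_signal (signal peptide feature), organism_id (TaxID).
query = combine_and("ft_signal:*", "organism_id:9606")
print(query)
print(search_url(query))
```

Keeping each clause in its own parentheses avoids surprises with operator precedence when more fields are added later.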

About strains and subspecies

Let us now try something different. If you search for proteins from a microbial species, you may run into trouble, because each subspecies or strain has its own TaxID, and you probably want all possible strains. Let’s try an example (first, clear the previous search): Say you want all proteins from the bacterium Bacillus subtilis — a very important production organism in biotechnology. Try to type Bacillus subtilis in the Organism [OS] field: you will see a suggestion named “Bacillus subtilis [1423]” – accept that.

QUESTION 3.4:

How many proteins are there in UniProt from Bacillus subtilis with the default TaxID [1423]? How many of these are from Swiss-Prot? And what is the search string?

The number of entries in Swiss-Prot may seem low for such a well-studied organism. In addition, you may note that there is a link next to the total number of results saying “or expand search to “1423” to include lower taxonomic ranks”. Click it.

QUESTION 3.5:

How many proteins are there in UniProt from Bacillus subtilis in total (all strains and subspecies)? How many of these are from Swiss-Prot? And what is the search string?

In conclusion, use the field Taxonomy [OC] instead of Organism [OS] when working with microbial species where you want all strains.
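In the search string, the difference between the two fields comes down to which field name carries the TaxID. A minimal sketch, assuming the current query syntax uses organism_id for an exact TaxID match and taxonomy_id for the taxon plus all lower ranks, and reviewed:true for Swiss-Prot entries; species_query is a hypothetical helper, and the solutions further down show an older spelling of the same fields.

```python
def species_query(tax_id, include_lower_ranks=True, reviewed_only=False):
    """Build a query string for one species.

    Assumed field names: taxonomy_id matches the taxon and everything
    below it; organism_id matches only entries with exactly this TaxID.
    """
    field = "taxonomy_id" if include_lower_ranks else "organism_id"
    clauses = [f"({field}:{tax_id})"]
    if reviewed_only:
        clauses.append("(reviewed:true)")  # assumed filter for Swiss-Prot
    return " AND ".join(clauses)


# 1423 is the TaxID suggested for Bacillus subtilis in the exercise above.
print(species_query(1423, include_lower_ranks=False))
print(species_query(1423))
print(species_query(1423, reviewed_only=True))
```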

Searching for short proteins

  • Numerical field: Now we will try to answer a completely different question: Which extremely short proteins are present in UniProt? Clear the previous search. In the advanced drop-down menu, select Sequence and then Sequence length. Now two new fields appear where you can define the lower and upper limits for the search. Type 1 and 10 and search. Note: in your answers to the questions below, include the search string just like you did in the questions above!

QUESTION 3.6:

How many proteins of maximum length 10 do you find?

  • Extremely short proteins are often mistakes translated directly from a nucleotide sequence with no evidence for the sequences being protein coding. Limit your search to proteins that actually have evidence for their existence at the protein level (add a field, and set the drop-down menu to Protein existence [PE] and select Evidence at protein level).

QUESTION 3.7:

How many proteins are now left?

  • A large fraction of the proteins identified in this way are fragments. Try to exclude fragments from the search. Add a field. In the drop-down menu, choose Sequence, then Fragment, then No.

QUESTION 3.8:

How many proteins are now left?

  • And how many of these proteins are found in humans? Do as before.

QUESTION 3.9:

How many human non-fragment proteins of maximum length 10 do you find in UniProt?

  • Finally you can save the results of your search. First, sort them by length by clicking on the column header. Then, click on Download above the list of results. You can now save the search results in the format you prefer (try FASTA (canonical) and click Preview).

QUESTION 3.10:

Copy the FASTA sequences to your report.

On your own

QUESTION 4: Now that you are proficient in UniProt searches, try the following:

(As always, remember to write your search string in the answer).

  1. Find out how many proteins from Escherichia coli (all strains) there are in UniProt.
  2. How many of these are from the notorious pathogenic serotype O157:H7 (including its sub-strains)?
  3. Find insulin from as many organisms as possible, without including entries that are not insulin (Hint: If you attempt to do this with the Protein Name field only, it will require an unwieldy amount of kill-words. Therefore, take the gene name into account).
  4. Find alpha-globin (the alpha subunit of hemoglobin) from as many ruminants as possible (see the GenBank exercise).
  5. Find alpha-A globin and alpha-D globin from Columba livia (Hint: You can use a “*” to perform the search with one search string).

Solutions

Question 1

  1. How many hits do you find?
  • 4747 hits in total
  1. How many of these hits are from Swiss-Prot?
  • 1597 reviewed by Swiss-Prot
  1. Can you identify the correct hit? If yes, write down its Accession code and Entry name.
  • Example of one correct hit.

Accession Code: P01308;

Entry Name: INS_HUMAN

Question 2

  1. How many hits are left?
  • 1617 hits (For Homo sapiens)
  1. How many of these are from Swiss-Prot?
  • 1074 hits

Question 3

  1. How many hits are now left?
  • 208 hits
  1. How many of these are from Swiss-Prot?
  • 60 hits

Question 4

  1. How many hits are now left?
  • 113 hits

Question 5

  1. How did you do this?
  • Go to advanced search and select a new search field. Select NOT, Protein Name and type the term “insulin receptor”. Select search and the results produced will exclude any data regarding insulin receptors.
  1. How many hits are now left?
  • 62 hits.

Question 6

  1. How many references are there in the insulin entry?
  • 36 references.
  1. Why do you think insulin is such a highly investigated protein?
  • Insulin is a very important protein in human life because it maintains the glucose concentration in the body. Lack of this protein causes a disease commonly known as diabetes. This is only its most basic function; the UniProt entry describes its function in more detail. Furthermore, research on insulin is not limited to human physiology; it also spans fields such as pathology and biotechnology. This research has made the pharmaceutical use of insulin possible, which has improved the lives of diabetic patients who cannot produce their own insulin.

Question 7

  1. Where in the cell / outside the cell do you find insulin?
  • Insulin is produced in cells of the pancreas called beta cells. It diffuses into the bloodstream and enters a cell via an insulin receptor, which is made up of two receptor subunits located on the outside of the cell membrane. From there it can be found in transport vesicles in the cell.
  1. Why do you think it is found there?
  • When the insulin hormone enters a cell through the receptors in the cell membrane, it does so via receptor-mediated endocytosis, which results in a transport vesicle.

Question 8

  1. How long are the signal peptide and the propeptide, respectively?
  • Signal peptide: Position 1 to 24 [24 length]
  • Propeptide: Position 57 to 87 [31 length]
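The lengths follow from the positions because both the first and the last position belong to the feature, so the length is end minus start plus one. A small sketch of that arithmetic (feature_length is just an illustrative name):

```python
def feature_length(start, end):
    """Number of residues in a feature spanning start..end, both inclusive."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


print(feature_length(1, 24))   # signal peptide, prints 24
print(feature_length(57, 87))  # propeptide, prints 31
```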

Question 9

  1. Which positions are in β-sheet conformation in insulin?
  • Positions 26 – 29, 48 – 50, 56 – 58, 74 – 76, 98 – 101.

Question 10

  1. How many proteins did you find, and what was the search string?
  • Proteins found: 15566563 proteins
  • Search string: annotation:(type:signal)

Question 11

  1. How many proteins do you find now, and what has the search string changed into?
  • Proteins found: 3758
  • Search string: annotation:(type:signal evidence:experimental)

Question 12

  1. How many proteins do you find now, and what is the search string?
  • Proteins found: 729
  • Search string: annotation:(type:signal evidence:experimental) AND organism:"Homo sapiens (Human) [9606]"

Question 13 A

  1. How many proteins are there in UniProt from Neisseria gonorrhoeae with the default TaxID [485]?
  • 12597 proteins

Question 13 B

  1. How many proteins are there in UniProt from Neisseria gonorrhoeae in total (all strains and subspecies)?
  • 28412 proteins

Question 13 C

  1. What does the search string look like now?
  • From organism:"Neisseria gonorrhoeae [485]" to taxonomy:"Neisseria gonorrhoeae [485]"

Question 14

  1. How many proteins of maximum length 10 do you find?
  • 17648
  • Search string: length:[10 TO 10]

Question 15

  1. How many proteins are now left?
  • 456 proteins
  • Search string: length:[10 TO 10] existence:"Evidence at protein level [1]"

Question 16

  1. How many proteins are now left?
  • 255 proteins
  • Search string: length:[10 TO 10] existence:"Evidence at protein level [1]" fragment:no

Question 17

  1. How many human non-fragment proteins of maximum length 10 do you find in UniProt?
  • 201 proteins
  • Search string: length:[10 TO 10] existence:"Evidence at protein level [1]" fragment:yes

Question 18

  1. Copy the FASTA sequences to your report.
  • >sp|P0DJH3|VMP3A_DEIAC Zinc metalloproteinase-disintegrin-like AAV1 (Fragment) OS=Deinagkistrodon acutus OX=36307 PE=1 SV=1
  • DVVSPPVCGN
  • >sp|P0DKX2|VSP1_CRODM Thrombin-like enzyme Cdc SI (Fragment) OS=Crotalus durissus cumanensis OX=184542 PE=1 SV=1
  • VIGGDECNIN
  • >sp|P85103|VMXP_PHIPA Snake venom metalloproteinase patagonfibrase (Fragment) OS=Philodryas patagoniensis OX=120310 PE=1 SV=2
  • LSTDIVAPPV
  • >sp|B3EWG2|LACC_LEPMG Laccase (Fragment) OS=Lepiota magnispora OX=182864 PE=1 SV=1
  • VTIGKEGTLT
  • >sp|P0DJE8|VSPDV_BOTJR Thrombin-like enzyme D-V (Fragment) OS=Bothrops jararacussu OX=8726 PE=1 SV=1
  • VVGADNCNFN
  • >sp|P11180|ODP2_BOVIN Dihydrolipoyllysine-residue acetyltransferase component of pyruvate dehydrogenase complex (Fragment) OS=Bos taurus OX=9913 GN=DLAT PE=1 SV=1
  • VETDKATVGF
  • >sp|B3A0L4|LAC1_HERCO Laccase (Fragment) OS=Hericium coralloides OX=100756 PE=1 SV=1
  • AVGDDTPQLY
  • >sp|B3EWG8|PA2A_PORNA Acidic phospholipase A2 PnPLA2 (Fragment) OS=Porthidium nasutum OX=74558 PE=1 SV=1
  • DLLQFXDMMK
  • >sp|C0HLB1|ATP5E_YARLI ATP synthase subunit epsilon, mitochondrial (Fragment) OS=Yarrowia lipolytica (strain CLIB 122 / E 150) OX=284591 GN=ATP15 PE=1 SV=1
  • MSAWMSAGFS
  • >sp|B3EWR3|3SAW_NAJNA Cytotoxin NN-32 (Fragment) OS=Naja naja OX=35670 PE=1 SV=1
  • LKCNKLVPLF
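The FASTA header lines above follow a regular pattern: database, accession and entry name separated by "|", then a description followed by OS=, OX=, an optional GN=, PE= and SV= fields. The sketch below pulls these fields apart with a regular expression. The pattern is inferred from the headers in this list only, so it is an assumption that other headers follow the same layout, and parse_header is a hypothetical helper name.

```python
import re

# Pattern inferred from the headers listed above (an assumption, not a spec).
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<description>.*?)\s+OS=(?P<organism>.*?)\s+OX=(?P<tax_id>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?\s+PE=(?P<pe>\d)\s+SV=(?P<sv>\d+)$"
)


def parse_header(header):
    """Split one UniProt-style FASTA header into a dictionary of fields."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    fields = match.groupdict()  # gene is None when GN= is absent
    fields["tax_id"] = int(fields["tax_id"])
    fields["pe"] = int(fields["pe"])
    fields["is_fragment"] = fields["description"].endswith("(Fragment)")
    return fields


header = (">sp|P0DJH3|VMP3A_DEIAC Zinc metalloproteinase-disintegrin-like AAV1 "
          "(Fragment) OS=Deinagkistrodon acutus OX=36307 PE=1 SV=1")
record = parse_header(header)
print(record["accession"], record["organism"], record["is_fragment"])
```

A check like is_fragment is a quick way to confirm whether a downloaded set really excludes fragments.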

Future Directions and Advanced Topics

Recent advancements in the UniProt database and its tools have focused on improving data integration, annotation quality, and user accessibility. Some of the key advancements include:

  1. Integration of Data Resources: UniProt cross-references additional resources, such as Expression Atlas and the Human Protein Atlas, to provide more comprehensive information about protein expression levels in different tissues and conditions.
  2. Improved Functional Annotation: UniProt has improved its functional annotation pipelines to provide more accurate and detailed functional information about proteins, including their involvement in biological pathways and protein-protein interactions.
  3. Enhanced User Interface: UniProt has updated its website and tools to provide a more user-friendly interface, making it easier for researchers to search for and retrieve protein information.
  4. Increased Coverage and Quality: UniProt has continued to increase its coverage of protein sequences and improve the quality of its annotations, making it a more valuable resource for researchers.

Emerging trends in protein data analysis include:

  1. Integration of Multi-Omics Data: Researchers are increasingly integrating protein data with other omics data, such as genomics, transcriptomics, and metabolomics, to gain a more comprehensive understanding of biological systems.
  2. Machine Learning and AI: Machine learning and AI techniques are being used to analyze large-scale protein data sets, predict protein functions, and identify potential drug targets.
  3. Structural Bioinformatics: With the increasing availability of protein structures, there is a growing focus on structural bioinformatics to study protein structure-function relationships and predict protein structures.

Career opportunities in UniProt and bioinformatics include:

  1. Bioinformatics Research: Opportunities exist for bioinformaticians to conduct research in areas such as protein function prediction, structural bioinformatics, and systems biology.
  2. Database Curation: Bioinformaticians can work in database curation roles, such as curating protein sequences and annotations for databases like UniProt.
  3. Software Development: Bioinformaticians with programming skills can develop software tools and algorithms for analyzing protein data.
  4. Data Analysis: Bioinformaticians can work in data analysis roles, analyzing large-scale protein data sets to gain insights into biological systems.

To develop skills in UniProt and bioinformatics, it is important to have a strong background in biology, bioinformatics, and computer science. Additionally, staying updated with the latest advancements in the field and gaining hands-on experience with bioinformatics tools and databases, including UniProt, can help build a successful career in this field.

Conclusion

Summary of key learnings:

  1. UniProt Database: UniProt is a comprehensive resource for protein sequence and functional information, providing access to curated protein data and annotations.
  2. Functional Annotation: UniProt offers functional annotation for proteins, including information on protein domains, pathways, and interactions, which is essential for understanding protein function and biology.
  3. Integration with Tools: UniProt can be integrated with various bioinformatics tools and databases for functional analysis, structural modeling, and protein data analysis.
  4. Emerging Trends: Recent advancements in UniProt focus on data integration, annotation quality, and user accessibility. Emerging trends in protein data analysis include multi-omics integration, machine learning, and structural bioinformatics.

Resources for further exploration:

  1. UniProt Website: Explore the UniProt website (https://www.uniprot.org/) for access to protein data, tools, and resources.
  2. Bioinformatics Tools: Explore tools such as BLAST and Clustal Omega for protein sequence analysis, and PyMOL for visualizing protein structures.
  3. Bioinformatics Databases: Explore databases such as the Protein Data Bank (PDB), Gene Ontology (GO), and KEGG for additional protein data and annotations.