Using NCBI and the UCSC Genome Browser: Tutorial

March 13, 2024

Gene-specific information using NCBI and genome viewers

Web resources:

NCBI database: https://www.ncbi.nlm.nih.gov/

NCBI Gene database: https://www.ncbi.nlm.nih.gov/gene

UCSC Genome browser: https://genome.ucsc.edu/

PubChem database: https://pubchem.ncbi.nlm.nih.gov/

Exercise 1 homepage: http://biochem.slu.edu/bchm628/exercise1.html

Goals: Learn how to efficiently navigate the NCBI, PubChem, and UCSC Genome databases to find information on specific genes.

Background on nomenclature:

RefSeq refers to records that have been reviewed by the NCBI curation staff. The RefSeq database is a precursor to the Gene database and is available as a Limits option in the protein and nucleotide databases. Curated RefSeq records have the nomenclature NM_#### for mRNA and NP_#### for protein records. Model mRNAs start with XM_####, which indicates a lack of experimental evidence for the transcript. Other designations are described in the PDF file RefseqNomenclature.pdf available from the Exercise 1 homepage.
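As a quick way to check your reading of the nomenclature, the prefixes can be turned into a small lookup. This is only a sketch: it covers the prefixes mentioned in this tutorial (NM_, NP_, NG_, XM_, XP_), the function name classify_refseq is made up for illustration, and the accessions in the example are placeholders rather than real records.

```python
# Prefix meanings as described in this tutorial. Other RefSeq prefixes exist
# (see RefseqNomenclature.pdf) and are reported here as "unknown".
REFSEQ_PREFIXES = {
    "NM_": ("mRNA", "curated"),
    "NP_": ("protein", "curated"),
    "NG_": ("genomic", "curated"),
    "XM_": ("mRNA", "model"),
    "XP_": ("protein", "model"),
}


def classify_refseq(accession):
    """Return (molecule, status) based on the first three characters of an accession."""
    prefix = accession.strip().upper()[:3]
    return REFSEQ_PREFIXES.get(prefix, ("unknown", "unknown"))


if __name__ == "__main__":
    # Placeholder accessions, used only to exercise the lookup.
    for acc in ("NM_0000.0", "XM_0000.0", "ZZ_0000.0"):
        print(acc, classify_refseq(acc))
```

A lookup like this is handy when scanning a long list of accessions to separate curated records from model records.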

Conduct text-based searches of NCBI Gene Database:

This is an integrated database of information that is gene centric, but includes genomic, sequence, expression, structure, function, citation and homology data. It is directly linked to most of the databases that make up NCBI. All genes that have a RefSeq record are added to this database and each gene is assigned a unique GeneID. We will use these gene identifiers repeatedly over the next few weeks.

We will start with a simple search on a gene name. Open the NCBI Gene Database and type “p53” into the text box. How many results were returned? [22578 in May 2022].

Why so many results?

  1. By default, every field in the record is searched, not just the gene name.
  2. You are not using the official HGNC (HUGO Gene Nomenclature Committee) gene name and there are several different aliases for this gene.
  3. The p53 protein interacts with >100 other proteins, so there is a lot of literature that mentions this protein and thus the name will appear in the records of many other genes.

If you want to find the human gene, you can scan down and see if it shows up in the list or you can try limiting the search to human by typing “p53 AND human[organism]” into the text box. Now it should return about 2500 records and a link to the record for the human TP53 gene shows up in a box at the top as shown in Fig. 1. It is still searching all fields for p53 and since it interacts with so many different proteins, there are still a lot of records returned.
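The same query can also be built for scripted searches. The sketch below assumes you would send it to NCBI's E-utilities ESearch endpoint, which is not covered in this tutorial; the helper names build_gene_query and build_esearch_url are made up for illustration. It only constructs the query string and URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# ESearch endpoint of NCBI E-utilities (an assumption here; the tutorial itself
# only uses the web interface).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(term, organism=None):
    """Combine a search term with an optional organism limit, as typed in the text box."""
    if organism:
        return f"{term} AND {organism}[organism]"
    return term


def build_esearch_url(query, retmax=20):
    """Return an ESearch URL that searches the Gene database for the query."""
    params = {"db": "gene", "term": query, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    query = build_gene_query("p53", "human")
    print(query)
    print(build_esearch_url(query))
```

Pasting the printed URL into a browser should return the matching GeneIDs rather than the formatted results page. The number of hits will change over time, just as it does on the website.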

Click on the link for TP53 to open the record. There are multiple sections to the record, identified by grey bars at the top of each one. Not all records have information for all sections because some genes have less annotation associated with them. If you click on the arrows at the top of a section, it will collapse.

Figure 1: Search of Gene database using p53 and human [organism]

Fig. 2 shows the sections of data for TP53 fully collapsed. At the top of the record is a drop-down menu under the link Full Report. Here you can jump to different views or sections of the record. Spend some time familiarizing yourself with the features of this database.

Finding transcript information about a specific gene

Figure 2: Gene database sections

Human genes are complex and often have several transcript isoforms. The curation of gene models to identify all possible and expressed transcripts uses several computational and experimental techniques, including tissue specific RNAseq, which provides direct support for expression of exons.

The curation of genes at NCBI uses a single pipeline and collects the curated genomic, transcript and protein sequences into the RefSeq database. The nomenclature distinguishes those sequences that are considered Reference, NG_ (genomic), NM_ (mRNA) and NP_ (protein), from those with only computational support (XM_ or XP_). There is a PDF on the Exercise 1 homepage that describes the RefSeq nomenclature.

a) Within the NCBI gene record for the TP53 gene there are 2 sections that provide transcript/protein information: 1) Genomic regions, transcripts and products (Fig. 3) and 2) NCBI Reference Sequences (RefSeq).

In Fig. 3, note the menu bar above the chromosome ruler. In that bar, you will see a button with 3 colors on it. Click on that to change the view to show the proteins associated with each transcript. As you scroll further down in this section, you should see another set of transcripts corresponding to the Ensembl transcripts. Ensembl is maintained by the European Bioinformatics Institute (EBI), which runs its own pipeline for gene annotation. As a consequence, there can be some differences in the transcripts named/identified between NCBI and EBI. For the most part, the differences occur within model transcripts or ones that are only computationally predicted. You can turn off a track by clicking the red X in the top right of the track.

In the menu bar above the solid blue line, there is an option to Download a PDF of the graphic. Do that and save the PDF. You will need to do this for one section of your exercise 1 writeup.

Figure 3: Transcripts for TP53 in NCBI gene record Genomics section

You should note several things about these transcript searches:

  1. TP53 has a large number of transcript isoforms. Not all human genes have this many, but if you want to conduct a whole genome expression experiment, one consideration is whether to analyze the data on a gene or exon basis. Differential use of exons can identify novel transcripts but is more complex to analyze.

Exploring the genomic context of genes using NCBI and UCSC Genome browser.

The genomic context means where on the genome the gene is located. That is:

    • What chromosome number
    • Where on that chromosome
    • What strand
    • What genes are upstream/downstream

This is pretty much all of the genomic context provided in the Gene record. Thus, we will use the UCSC Genome browser to explore options for visualizing additional data for genes within a genomic context. These data are included as additional tracks of information (from a few to hundreds depending on the genome) and include such data as:

    • Location of repetitive sequences
    • Level of homology to other genomes
    • SNP or variants within the genome of interest
    • TF binding sites

The data behind a genome browser is enormous and can be quite complex to sort through. This amount of data can also be slow to load. Spend some time turning tracks on and off and following links or pop-ups that explain the different data sources.

Using the UCSC Genome browser

UCSC Genome browser: https://genome.ucsc.edu/

Open the UCSC genome browser from the link in this document or from the Exercise 1 homepage.

Below the headers is a dark blue bar with the link Genomes. Mouse over it and select human genome GRCh38/hg38. Alternatively, click the link and it will open a search window with the latest human assembly as the default option. Type TP53 into the search text box and it will list many possible matches. Select the second one, which corresponds to tumor protein p53 (from HGNC TP53). This should open a window that looks something like Fig. 4.

Figure 4: UCSC view of TP53 with default tracks

The gene size and coordinates of where this gene falls on Chr 17 should be very similar if not identical to the coordinates listed for the NCBI gene record.
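If you copy the position strings from both sites, a few lines of code can do the comparison for you. This is a minimal sketch under two assumptions: positions are written in the browser style chrom:start-end, and both coordinates are inclusive, as in the UCSC position box. The function names are made up for illustration and the coordinates in the example are invented placeholders, not the real TP53 location.

```python
import re

# Matches browser-style positions such as "chr17:1,000-2,500" (commas optional).
POSITION_RE = re.compile(r"^(chr\w+):([\d,]+)-([\d,]+)$")


def parse_position(position):
    """Split 'chrom:start-end' into (chrom, start, end) with integer coordinates."""
    match = POSITION_RE.match(position.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {position!r}")
    chrom, start, end = match.groups()
    start_i = int(start.replace(",", ""))
    end_i = int(end.replace(",", ""))
    if start_i > end_i:
        raise ValueError("start is greater than end")
    return chrom, start_i, end_i


def span_length(start, end):
    """Length in bases, treating both coordinates as inclusive."""
    return end - start + 1


def coordinate_offsets(position_a, position_b):
    """Return (start difference, end difference) for two positions on one chromosome."""
    chrom_a, start_a, end_a = parse_position(position_a)
    chrom_b, start_b, end_b = parse_position(position_b)
    if chrom_a != chrom_b:
        raise ValueError("positions are on different chromosomes")
    return start_b - start_a, end_b - end_a


if __name__ == "__main__":
    # Invented placeholder coordinates; paste in the values you see on each site.
    ucsc = "chr17:1,000-2,500"
    ncbi = "chr17:1,010-2,500"
    chrom, start, end = parse_position(ucsc)
    print(chrom, span_length(start, end))
    print(coordinate_offsets(ucsc, ncbi))
```

Small offsets between the two sources are usually explained by differences in which transcripts each site uses to define the ends of the gene. Make sure both positions come from the same assembly before comparing them.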

Scroll down through the graphics. Clicking on the graphic or on the name of a track will open a window with information about that track. Click on any single transcript to see details about the transcript.

A FEW of the questions you can ask with a genome browser include (depending on the genome and available track information):

  1. What genes are located near it or may share promoters?
  2. What SNPs are found in my gene and are they located in introns, promoters or exons?
  3. What strand is my gene encoded on?
  4. What regulatory elements are located within or near my gene?
  5. What clinical variants are associated with my gene?

A relatively new default track at UCSC is the gene expression data in different tissues from the NIH Genotype-Tissue Expression (GTEx) project. This project was created to establish a sample and data resource for studies on the relationship between genetic variation and gene expression in multiple human tissues. This track shows median gene expression levels in 54 tissues and 2 cell lines, based on RNA-seq data from the V8 GTEx data release (October 2019). This release includes data from 17382 tissue samples obtained from 948 adult post-mortem individuals.

Using PubChem to explore compound-protein interactions

PubChem database: https://pubchem.ncbi.nlm.nih.gov/

PubChem tutorials: https://pubchemdocs.ncbi.nlm.nih.gov/tutorials

Background paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4702940/

PubChem is the world’s largest collection of freely accessible chemical information. You can search chemicals by name, molecular formula, structure, and other identifiers, and find chemical and physical properties, biological activities, safety and toxicity information, patents, literature citations and more. As it is focused on chemical entities, it can be more challenging for biologists to navigate. My purpose for introducing this site is that drug discovery and drug interactions are becoming an increasingly large part of basic science research and this site provides a great deal of information about different chemical compounds. The amount of information about how they interact with different proteins is limited by the assays published and shared.

I will not provide an in-depth tutorial here but will walk you through finding information for a single protein. Extensive tutorials are provided in the link above if you think this site will be useful for your research.

Figure 5: Result of text search in PubChem

Type BRCA1 in the search box and click the search icon (magnifying glass) to the right of the search text box. It will return many records under different categories as shown in Fig. 5. Click on the Genes tab and you should find a link to human BRCA1 (Gene ID 672) at the top. Click on it and it will open a window that should resemble that found in Fig. 6.

Figure 6: Gene record for BRCA1 in PubChem

Using the Contents menu to the right, skip over the first 4 content lines and click on “4 Chemicals and Bioactivities”, which will jump you down to that part of the record. There are two tables of interest, Tested Compounds and 5.1 BioAssays. Here you can find compounds that have been demonstrated to have some sort of activity against BRCA1. To the right, above the column headers of the tables, is a link to download. It will give you different options, but choose CSV as it can be imported directly into Excel. Note that the CSV file for Tested Compounds (named GeneID_672_bioactivity_gene.csv) has >350,000 records so it will take a while to load into Excel.

What would you do with this file? The data file includes columns for activity (active, inactive, etc.), assay type and activity value (reported in µM), among other data types. You could sort by Activity, keeping only active compounds, and then by activity value (acvalue), keeping only those with a value less than 1-2 µM. This would narrow down a search for compounds or related compounds that may be of interest. The table 5.1 Bioassay (downloads as GeneID_672_bioassays.csv) contains only 11 rows listing the different types of bioassays referenced in this gene record. It is not the easiest to read because some of the data cells have paragraphs of text in them, but it does provide information/description on assays conducted with BRCA1, where they were done and the compounds tested in those assays. Again, the PubChem site is probably of more use to those who are working in labs with an active interest in identifying new therapeutic candidates. But it may be of use to know what kinds of assays have been developed for your protein of interest.
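Because the Tested Compounds file is large, the sorting and filtering described above may be easier to do with a short script than in Excel. The sketch below is one possible approach, not the only one. It assumes the CSV has columns named activity and acvalue (check the header row of your own download, since the names may differ), and the column cid and the rows in the example are invented for illustration.

```python
import csv
import io


def filter_active(rows, max_value=2.0, activity_col="activity", value_col="acvalue"):
    """Keep rows marked active with an activity value below max_value, sorted by value.

    The column names are assumptions; adjust them to match your downloaded file.
    """
    kept = []
    for row in rows:
        activity = (row.get(activity_col) or "").strip().lower()
        if activity != "active":
            continue
        try:
            value = float(row.get(value_col))
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric activity value
        if value < max_value:
            kept.append((value, row))
    kept.sort(key=lambda pair: pair[0])
    return [row for _, row in kept]


if __name__ == "__main__":
    # Invented rows standing in for the real download. To read the real file,
    # replace io.StringIO(sample) with open("GeneID_672_bioactivity_gene.csv", newline="").
    sample = (
        "cid,activity,acvalue\n"
        "1,Active,0.5\n"
        "2,Inactive,0.1\n"
        "3,Active,5.0\n"
        "4,Active,\n"
        "5,Active,1.5\n"
    )
    for row in filter_active(csv.DictReader(io.StringIO(sample))):
        print(row["cid"], row["acvalue"])
```

Reading the file row by row avoids loading all of the records into memory at once, which matters for a file of this size.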

Take-home points:

      1. There is a LOT of biomolecular data and much of it is shared between databases.
      2. The various web tools/interfaces provide different approaches to viewing and interacting with this data. There is likely to be more than one way to answer whatever questions you might have. The most important thing is to document what tools you used, when you accessed them and, for genomic data, what assembly/version of the genome you accessed.
      3. This tutorial provides only a sliver of the types of searches and questions you can ask. Take your time going through it and spend some extra time exploring what other features are there. All of the main websites will have readily available information and/or tutorials to explain what you are looking at.

 
