How to use and interpret NCBI BLAST?

December 3, 2024 By admin

Introduction to NCBI BLAST

The National Center for Biotechnology Information (NCBI) BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics tool that allows researchers to compare nucleotide or protein sequences against sequence databases. It is essential for tasks such as identifying genes, discovering proteins, analyzing evolutionary relationships, and finding functional annotations for sequences.

In this guide, we will explore:

How to search for genes and proteins in NCBI: Learn to locate sequence data for specific genes or proteins using NCBI’s databases.
How to use Nucleotide BLAST (BLASTn): Perform sequence alignment for nucleotide sequences to identify closely related sequences, find homologous genes, or determine sequence variations.
How to use Protein BLAST (BLASTp): Align protein sequences to explore functional similarities, identify conserved domains, and predict protein structures.
How to interpret BLAST results: Understand key components of the BLAST output, including alignment scores, E-values, query coverage, and sequence identity, to draw meaningful conclusions from your analysis.

This guide is designed to help beginners and intermediate users navigate NCBI BLAST efficiently and gain insights from their sequence analyses. Let’s get started!

Searching for Gene Information

NCBI Gene

To illustrate, we will find what we can about the gene that controls lactose digestion in humans.

We start with a simple search in PubMed for information on Lactose Intolerance.

1. Search PubMed with our term and click Search.

NCBI returns various PubMed articles that deal with lactose intolerance.

2. We then use the Find Related Data box at the bottom of the right-hand column. This returns the records in the Gene database for genes mentioned in the PubMed articles.

Change the database to Gene and click Find Item.

The resulting page includes the gene for lactase in humans, LCT. Be careful to pick the correct record by making sure that it is for the organism that you are interested in.

Clicking on LCT [Homo sapiens (human)] brings us to the Gene database record for LCT:

Alternatively, if we already knew the gene name, we could have started with a direct search of the Gene database:
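The same direct search can also be written as a request to NCBI's E-utilities ESearch service. The sketch below only builds the URL and sends nothing over the network. The function name `build_gene_search_url` is hypothetical, and the `[Gene Name]` and `[Organism]` field tags are assumed to behave as they do in the web search box.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(symbol: str, organism: str) -> str:
    """Return an ESearch URL that looks up a gene symbol in the Gene database."""
    # Field tags restrict each word to one index, as in the web search box.
    term = f"{symbol}[Gene Name] AND {organism}[Organism]"
    params = {"db": "gene", "term": term, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Build (but do not send) the query for the human lactase gene.
url = build_gene_search_url("LCT", "human")
print(url)
```

Opening the printed URL in a browser should list the matching Gene IDs, which can then be compared with the record found through the web interface.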

In either case, we will display the LCT gene record:

We now have detailed information about the gene, including:

Gene ID
Official Symbol and Full Name
Organism
Other names used for this gene
Other organisms that have this gene
Summary of the role this gene plays

The full record is fairly large, so the Table of Contents at the top of the right-hand column can be used as an index into the complete record.

The Table of Contents links to sections within the Gene record:

Genomic context: chromosomal location and Exon count
Genomic regions, transcripts, products: graphical view of gene features
Bibliography: related citations in PubMed
Variation: links to variants in ClinVar, dbVar
General gene information: markers, homology, clone names, gene ontology
General protein information: names and accession numbers of protein products
NCBI Reference sequences: links to curated and annotated reference sequence records for the gene (accession number prefix NG), mRNA (NM) and protein (NP).

NG accession number links to the GenBank record, FASTA sequence, and Sequence viewer in the Nucleotide database.
NM accession number links to the mRNA record in the Nucleotide database.
NP accession number links to the protein record in the Protein database.
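The prefix convention above is simple enough to encode in a few lines. The following is a minimal sketch covering only the three prefixes listed here. The function name is hypothetical and the accession in the usage note is a made-up placeholder.

```python
# Prefix-to-record mapping taken from the list above.
REFSEQ_PREFIXES = {
    "NG": ("genomic region", "Nucleotide"),
    "NM": ("mRNA", "Nucleotide"),
    "NP": ("protein", "Protein"),
}


def describe_refseq(accession: str) -> str:
    """Describe a RefSeq accession by the letters before the underscore."""
    prefix, sep, _ = accession.partition("_")
    if not sep or prefix.upper() not in REFSEQ_PREFIXES:
        return f"{accession}: prefix not covered by this sketch"
    molecule, database = REFSEQ_PREFIXES[prefix.upper()]
    return f"{accession}: {molecule} record in the {database} database"
```

For example, `describe_refseq("NM_000000.1")` reports an mRNA record in the Nucleotide database.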

Navigating to the NCBI Reference Sequence, we can click on the RefSeq number to see the curated Nucleotide GenBank record for this gene.

GenBank record:

We now have the latest curated, detailed information about the gene sequence, as well as sequence data for its products.

The Reference Sequence (RefSeq) number combines the accession number with the latest version number of the sequence data. It is important to use the RefSeq ID to search the Nucleotide database because the underlying sequence data changes over time. You always want to be using the latest sequence available.
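To make the accession-plus-version structure concrete, here is a small sketch that splits an identifier into its two parts and checks whether one identifier is a newer version of another. The helper names and the `NM_000000` accession are placeholders invented for illustration.

```python
from typing import NamedTuple


class RefSeqId(NamedTuple):
    accession: str  # stable part, e.g. "NM_000000"
    version: int    # number after the dot


def parse_refseq_id(identifier: str) -> RefSeqId:
    """Split 'ACCESSION.VERSION' into its two parts."""
    accession, sep, version = identifier.strip().rpartition(".")
    if not sep or not version.isdigit():
        raise ValueError(f"no version suffix in {identifier!r}")
    return RefSeqId(accession, int(version))


def is_newer(candidate: str, current: str) -> bool:
    """True if both IDs share an accession and candidate has a higher version."""
    a, b = parse_refseq_id(candidate), parse_refseq_id(current)
    return a.accession == b.accession and a.version > b.version
```

A check such as `is_newer("NM_000000.3", "NM_000000.2")` can flag that a saved identifier no longer points at the latest sequence.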

The complete number of base pairs (bp) is also returned.

Various sequence data can be viewed further down in this record:

Clicking on any of the features will highlight the feature’s sequence:

mRNA – will connect the individual exons into the resulting messenger RNA
exons – will highlight the sequence of this exon
CDS – Coding Sequence of the resultant protein product from this gene
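The relationship between these features can be sketched on a toy sequence. Everything below is invented for illustration (the 25-base "gene", the exon coordinates, and the CDS coordinates). The coordinates are 1-based and inclusive, as in GenBank feature tables.

```python
from typing import List, Tuple


def join_segments(sequence: str, segments: List[Tuple[int, int]]) -> str:
    """Concatenate 1-based, inclusive (start, end) segments of a sequence."""
    return "".join(sequence[start - 1:end] for start, end in segments)


# Toy genomic sequence: exon 1 (1..8), intron (9..14), exon 2 (15..25).
genomic = "GGATGGCCGTAAGTTTTGACTAACC"
exons = [(1, 8), (15, 25)]
cds = [(3, 8), (15, 23)]  # coding part only, from start codon to stop codon

# The mRNA is the exons joined together (shown in DNA letters, as in the record).
mrna_seq = join_segments(genomic, exons)
# The CDS is the protein-coding subset of the mRNA.
cds_seq = join_segments(genomic, cds)

print(mrna_seq)
print(cds_seq)
```

The CDS length is a multiple of three because it is read codon by codon, starting at ATG and ending at a stop codon.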

Searching for Protein Information

NCBI Protein

Continuing the example used in the Gene discussion, we will use the protein product that is expressed by the LCT gene.

We could use the same strategy as before (that is, search PubMed and then use Find Related Data to find Protein records), but this time we will start in the Protein database.

Our search is for lactase AND human[Organism]:

The results returned must be examined closely in order to ensure the best answer.

There are 3 records that are strong candidates. The way to pick the correct record is to look for the curated item: the only one with an accession number that begins with NP_.

Choosing #8 returns the following record:

This record shows both the version and the curated record for the mRNA (starting with NM_).

Using the Go to: link, you can now get the amino acid sequence:
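The amino acid sequence is typically delivered in FASTA format. As a rough sketch, the code below builds an E-utilities EFetch URL that requests a record as FASTA text and parses FASTA text into a dictionary. Nothing is downloaded. The function names are hypothetical, and the accession and sequence in the usage note are placeholders, not the real lactase record.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_fasta_url(accession: str, db: str = "protein") -> str:
    """Return an EFetch URL asking for one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are wrapped, so append each one to the record.
            records[header] += line
    return records
```

Calling `parse_fasta(">NP_000000.1 example protein\nMKTAYIAK\nQRQISFVK\n")` joins the wrapped lines into a single sequence string keyed by the header.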

We could also get here directly from the Gene record of the previous example:

Structure

If an experimentally determined structure is available for the protein, a thumbnail of the structure will appear.

BLAST Overview

NCBI provides a trademarked tool for sequence searching: the Basic Local Alignment Search Tool (BLAST).

BLAST can be used to:

Find similar database records for protein or nucleic acid sequences
Find Conserved Domains within sequences
Compare and align sequences

The BLAST home page: http://blast.ncbi.nlm.nih.gov/

The mathematics behind this resource are fairly complicated. Please keep in mind that, depending on the specific search, the results may take a long time to appear.

Protein Example

BLASTP Example

This illustration shows how to search for a protein sequence using BLASTP.

You can start from the NCBI BLAST home page: http://blast.ncbi.nlm.nih.gov/Blast.cgi

Or, you can run BLASTP directly from the RefSeq protein record as in the previous examples:

At the BLASTP page you can search by RefSeq for the protein or by amino acid sequence.

1. RefSeq:

Or,

2. Amino acid sequence. From the bottom of the Gene record:

Capture the amino acid sequence by clicking on the CDS link, then copy and paste it into the BLASTP search box:

In either case, choosing the non-redundant protein sequences (nr) database (the default) will return the largest candidate list.
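The same choices (program, database, query) map onto the parameters of NCBI's BLAST URL API. The sketch below only assembles the request parameters and submits nothing. The function names are hypothetical, the `https` form of the BLAST address is assumed, and the accession and request ID in the usage note are placeholders.

```python
from urllib.parse import urlencode

# Address of the BLAST service (https form assumed here).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def build_blast_submission(query: str, program: str = "blastp",
                           database: str = "nr") -> str:
    """Return the form-encoded parameters for submitting a search (CMD=Put)."""
    if program not in PROGRAMS:
        raise ValueError(f"unknown BLAST program: {program!r}")
    params = {"CMD": "Put", "PROGRAM": program,
              "DATABASE": database, "QUERY": query}
    return urlencode(params)


def build_blast_results_query(rid: str, format_type: str = "Text") -> str:
    """Return the parameters for fetching results of a finished search (CMD=Get)."""
    return urlencode({"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type})
```

A submission such as `build_blast_submission("NP_000000.1")` would be sent to `BLAST_URL`. The service answers with a request ID (RID), which is then passed to `build_blast_results_query` once the search has finished.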

Conserved Domains

An important part of the BLASTP search results is the set of Conserved Domains found for this protein:

BLASTN Nucleotide Search

BLASTN Example

This illustration demonstrates how to use BLASTN to search starting with a nucleotide sequence.

You can start from the NCBI BLAST home page: http://blast.ncbi.nlm.nih.gov/Blast.cgi

Or, you can start from a Nucleotide record. In this case, we start from the RefSeq mRNA link within the LCT Gene record:

Click on the NM_002299.2 record ID in order to display the Nucleotide record:

To search BLASTN using the mRNA RefSeq:

Or, from the Nucleotide record, find the base pair (bp) sequence:

Click on the CDS link to highlight the base sequence:

Copy the sequence and paste it into the BLASTN search box:
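Text copied from a GenBank-style record often carries position numbers and spaces along with the bases. As a hedged sketch, the helper below strips everything except letters, and a second helper makes a rough guess at whether the result is a nucleotide sequence (suitable for BLASTN) rather than a protein. Both function names and the 90% threshold are assumptions made for this example.

```python
import re


def clean_pasted_sequence(text: str) -> str:
    """Drop FASTA header lines, digits and whitespace; return upper-case letters."""
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    return re.sub(r"[^A-Za-z]", "", "".join(lines)).upper()


def looks_like_nucleotide(seq: str, threshold: float = 0.9) -> bool:
    """Heuristic: True if at least `threshold` of the letters are A, C, G, T, U or N."""
    if not seq:
        return False
    hits = sum(seq.count(base) for base in "ACGTUN")
    return hits / len(seq) >= threshold


pasted = "        1 atggcc ttt\n       11 gactaa\n"
cleaned = clean_pasted_sequence(pasted)
print(cleaned, looks_like_nucleotide(cleaned))
```

The heuristic is deliberately crude: a short protein made mostly of A, C, G and T residues would fool it, so treat the answer as a hint only.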

In either case, you can use the default database, Nucleotide Collection (nr/nt), for the widest search and then click BLAST to run the search:

BLASTN Result

The following is the top of the resulting nucleotide BLASTN search:

BLAST Results

The analysis of BLAST results is similar for both search types, but the details depend on whether the original search was for a nucleotide or an amino acid sequence.

Looking at the section “Sequences producing significant alignments” we see:

Amino Acid (Protein Result)

Nucleotide (mRNA)

In either case, the items of interest are:

Max[imum] Score: the highest alignment score, calculated from the sum of the rewards for matched nucleotides or amino acids and the penalties for mismatches and gaps.
Tot[al] Score: the sum of alignment scores of all segments from the same subject sequence.
Query Cover[age]: the percent of the query length that is included in the aligned segments.
E[xpect] Value: the number of alignments expected by chance with the calculated score or better. The expect value is the default sorting metric; for significant alignments the E value should be very close to zero.
Ident[ity]: the highest percent identity for a set of aligned segments to the same subject sequence.
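The definitions above can be worked through on a toy alignment. The scoring values below (match +2, mismatch -3, gap existence 5, gap extension 2) are illustrative assumptions, not necessarily the ones your search used. The alignment itself is invented. The E-value helper uses the textbook relation E = m × n × 2^(−S'), where S' is the bit score and m and n are the query and database lengths. BLAST itself works with adjusted "effective" lengths, so its reported values will differ somewhat.

```python
def alignment_stats(query_aln: str, subject_aln: str, query_length: int,
                    match: int = 2, mismatch: int = -3,
                    gap_open: int = 5, gap_extend: int = 2) -> dict:
    """Raw score, percent identity and query coverage of one gapped alignment."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    matches = 0
    score = 0
    in_gap = False
    for q, s in zip(query_aln, subject_aln):
        if q == "-" or s == "-":
            # A gap of length k costs gap_open + k * gap_extend.
            # Adjacent gap columns are treated as one gap in this sketch.
            score -= gap_extend + (0 if in_gap else gap_open)
            in_gap = True
            continue
        in_gap = False
        if q == s:
            matches += 1
            score += match
        else:
            score += mismatch
    aligned_query = sum(1 for q in query_aln if q != "-")
    return {
        "raw_score": score,
        "identity_pct": 100 * matches / len(query_aln),
        "query_cover_pct": 100 * aligned_query / query_length,
    }


def expect_from_bits(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Invented alignment: 10 matches, 1 mismatch, 1 single-base gap; query is 15 bases long.
stats = alignment_stats("ATGGCC-TTGAC", "ATGGACCTTGAC", query_length=15)
print(stats)
```

Because the E value scales with the size of the search space, the same alignment yields a larger E value against a larger database. That is why identical hits can look less significant when searching nr than when searching a small custom database.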

These results can be helpful in identifying what the searched sequence matches and which other species have similar sequences.
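When the results are downloaded as a tab-separated hit table, this kind of screening can be scripted. The sketch below assumes the standard 12-column tabular layout (query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, E value, bit score). The sample rows are invented, and the 1e-5 cutoff is a common but arbitrary choice.

```python
import csv
import io

# Column names of the standard 12-column BLAST tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def significant_hits(tabular_text: str, max_evalue: float = 1e-5,
                     min_identity: float = 0.0) -> list:
    """Return hits passing the E-value and identity cutoffs, best E value first."""
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for field in ("pident", "evalue", "bitscore"):
            hit[field] = float(hit[field])
        if hit["evalue"] <= max_evalue and hit["pident"] >= min_identity:
            hits.append(hit)
    return sorted(hits, key=lambda h: h["evalue"])


# Invented example rows, not real search output.
sample = (
    "query1\tsubjA\t99.50\t200\t1\t0\t1\t200\t1\t200\t1e-100\t365\n"
    "query1\tsubjB\t85.00\t180\t27\t0\t5\t184\t10\t189\t2e-40\t160\n"
    "query1\tsubjC\t40.00\t30\t18\t0\t50\t79\t3\t32\t4.5\t22.1\n"
)
for hit in significant_hits(sample):
    print(hit["sseqid"], hit["pident"], hit["evalue"])
```

The third row is dropped because an E value of 4.5 means several alignments that good are expected by chance alone.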

Clicking on the name of any of the results will, again, display a different view depending on the type of search:

Amino Acid (Protein Result)

The results show the amino acid matches.

Nucleotide (mRNA)

The results show the alignment of the base pairs.

Compare Multiple Sequences

Another form of searching is to compare two sequences to each other. The image below is from BLASTP, but BLASTN has a similar facility. This is activated by clicking the “align two or more sequences” link:
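To give a feel for what a pairwise local alignment computes, here is a sketch of the classic Smith-Waterman scoring recurrence. This is not how BLAST itself works internally (BLAST uses faster heuristics), and the scoring values are arbitrary choices for the example.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared stretch "ACGT" is found even though the flanking bases differ.
print(local_alignment_score("GGACGTGG", "TTACGTTT"))
```

The "local" in BLAST's name refers to this behaviour: the best-matching region is reported even when the rest of the two sequences has nothing in common.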

Taxonomy Tree

Another interesting result is the report of the taxonomy tree of the significant matching sequences. Once again the results are similar for BLASTN and BLASTP. The example shown is from BLASTP:
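A much simpler version of this summary can be produced from the hit descriptions themselves, since protein hit titles usually end with the organism name in square brackets. The sketch below counts hits per organism. The titles are invented, and the bracket convention is an assumption that does not hold for every record, nucleotide titles in particular.

```python
import re
from collections import Counter


def count_organisms(titles: list) -> Counter:
    """Count hits per organism, reading the trailing '[Organism]' of each title."""
    counts = Counter()
    for title in titles:
        match = re.search(r"\[([^\[\]]+)\]\s*$", title)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Invented hit titles for illustration only.
titles = [
    "example protein A [Homo sapiens]",
    "example protein B [Pan troglodytes]",
    "example protein C [Homo sapiens]",
    "title without an organism",
]
organism_counts = count_organisms(titles)
print(organism_counts.most_common())
```

Unlike the taxonomy report, this tally is flat: it does not group related species under their shared lineage.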