NCBI BLAST: Getting Started Tutorial
March 13, 2024
This tutorial aims to provide a comprehensive understanding of NCBI BLAST, covering basic to advanced features and practical applications. Participants will gain the skills needed to effectively use BLAST for sequence analysis in bioinformatics and molecular biology research.
Websites used in this tutorial:
BLAST homepage: http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST help page is accessible as a Tab from the main BLAST page. It has information about all the different BLAST programs and databases.
Activities:
Do a BLASTN and a BLASTP using the NCBI BLAST homepage. Restrict BLAST searches to particular species/taxonomic groups.
Use Blast2Sequences to compare sequences with different parameters
Follow ENTREZ links from BLAST output to find additional information about the query sequence. This information includes conserved protein domains (CDD), taxonomic reports, structure links and publications.
HINT: Sign in to your myNCBI account before starting your BLAST analyses. If you accidentally close a window, you can always return to the BLAST homepage and find your searches under the Recent Results tab.
Nucleotide BLAST:
In this section of the tutorial you will use Protein Kinase Inhibitor alpha (PKI alpha) from mouse as the query in the search. Go to Entrez and search for protein kinase inhibitor, NM_008862, in the core nucleotide database.
- Click on the record to bring up the GenBank entry for it.
- Switch Display setting to FASTA
- Highlight and copy (CTRL+C) the sequence, including the definition line (the line that begins with >gi).
Figure 1: Display options for GenBank record
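For reference, a FASTA record is simply a definition line that starts with > followed by one or more lines of sequence. The short Python sketch below splits such a record into its two parts. The record shown is a made-up placeholder, not the real NM_008862 sequence, and the function name parse_fasta is my own.

```python
def parse_fasta(text):
    """Split a single FASTA record into (definition_line, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with a '>' definition line")
    definition = lines[0][1:]               # drop the leading '>'
    sequence = "".join(lines[1:]).upper()   # join the wrapped sequence lines
    return definition, sequence


# Placeholder record for illustration only (NOT the real mouse Pkia sequence).
example = """>example_id hypothetical test sequence
ATGACTGATGTGGAAACTAC
ATATGCAGATTTTATTGCTT
"""

definition, sequence = parse_fasta(example)
print(definition)
print(len(sequence), "bases")
```

If the definition line is missing from what you copied, the function raises an error, which is a quick way to check that you copied the whole record.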
Now that we have a sequence to use for our search, open the BLAST homepage in your browser.
- Under Basic BLAST, click on the “nucleotide blast” link.
- Paste the sequence into the text window at the top of the page.
- Notice that as you tab out of the text box for the sequence, the definition line from the sequence is added to the job title box. Change this to a more descriptive title such as mousePKIalpha_nr .
- If you log into myNCBI before initiating BLAST searches, all previous searches are available for 36 hours.
- For Database, scroll down below the Others menu, and select Nucleotide collection (nr/nt).
- Under program selection, choose Highly similar sequences (megablast).
- To see the parameters of the different programs, click on the Algorithm parameters link below the BLAST button. Note the Word size, Match/mismatch scores, etc.
- NOTE that any time a default parameter is not used, that parameter is highlighted in yellow.
- Leave all other menu choices with default settings and launch the query by pressing the BLAST button.
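To make the Match/mismatch scores setting concrete, here is a minimal Python sketch of how a gap-free nucleotide alignment is scored. The two reward/penalty pairs used (1/-2 and 2/-3) are what I believe the megablast and blastn defaults to be; check the Algorithm parameters panel for the values actually in effect. The sequences are made up.

```python
def ungapped_score(query, subject, match, mismatch):
    """Raw score of two equal-length aligned sequences that contain no gaps."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return sum(match if q == s else mismatch for q, s in zip(query, subject))


query = "ACGTACGTAC"
subject = "ACGTTCGTAC"   # differs from the query at one position

# Same alignment scored under two match/mismatch settings.
print(ungapped_score(query, subject, match=1, mismatch=-2))
print(ungapped_score(query, subject, match=2, mismatch=-3))
```

A harsher mismatch penalty relative to the match reward favors alignments between very similar sequences, which is why the setting differs between programs.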
The next page returned is a status page with Job Title that contains the name of your query sequence or a job title if you typed one into the text box. The next 5 lines give you in order:
- Request ID: an 11-character string which you can use to retrieve results later
- Status of the search
- Time submitted
- Current time
- Time since submission
After some time, the results will appear, or you can go on and do another search while waiting. The blast output is divided into 4 sections. The bottom 3 can be expanded/collapsed with the blue arrows to the left of each section.
- Header region which lists the parameters of the search
- Graphical summary
- Descriptions
- Alignments
Figure 3: BLAST output sections
Figure 4: BLAST graphical summary
Within the graphical output:
- Bars correspond to regions of similarity.
- Color coding is based on alignment scores.
- Information is displayed by moving the mouse over the bars.
- Bars are hot links to the actual alignments displayed below on the page.
Figure 5: BLAST output descriptions section
Within the descriptions section:
- Each match has a link to the GenBank record.
- The links under the Score column take you to the corresponding alignment
- The E-value is a statistical measure of the significance of the alignment
- There are links to other NCBI database resources listed at the far right of each match. The legend for those links is shown at the top of the page.
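As a rough guide to what the E-value means, the expected number of chance hits with at least a given bit score S' is approximately E = m x n x 2^(-S'), where m is the query length and n is the total database length. The Python sketch below evaluates that formula. The lengths are hypothetical, and BLAST itself uses corrected (effective) lengths, so reported values will not match this exactly.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical query and database sizes, for illustration only.
for bits in (30, 50, 100):
    print(bits, expect_value(bits, query_len=100, db_len=1_000_000_000))
```

Note that the same bit score gives a larger E-value in a larger database, so E-values from searches against different databases are not directly comparable.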
Figure 6: BLAST alignments section
Within the alignments section:
- Each alignment with a score > threshold (up to limit defined) is shown.
- The sequence listed is the one which matched the query sequence
- In the red box are shown the Score, Expect value, Identities, Gaps and which strands of the query and database sequence aligned.
NOTE: Change the program selection from Megablast to BLASTN and rerun the search. Note how many more hits you get with this program than with Megablast. If you have a relatively short sequence (<500 bp) of unknown origin, then you would want to use BLASTN rather than Megablast to identify the sequence.
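Much of the difference in hit counts comes from the word size: a search only extends an alignment where the query and a database sequence share an exact word of that length. The Python sketch below is a simplified illustration of that idea, not BLAST's actual seeding code. The word sizes 11 and 28 are what I believe the BLASTN and Megablast defaults to be, and the sequences are made up.

```python
def shared_words(query, subject, word_size):
    """Return the exact words of length word_size that occur in both sequences."""
    def words(seq):
        return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}
    return words(query) & words(subject)


# Two 31-base sequences that differ at a single position in the middle.
query = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
subject = "ATGGCGTACGTTAGCTTAGGATCCGATTACA"

print(len(shared_words(query, subject, word_size=11)))  # short words still match
print(len(shared_words(query, subject, word_size=28)))  # no word this long survives the mismatch
```

With the longer word, a single mismatch in the middle of a short sequence is enough to leave no seed at all, so the match is never found.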
Protein BLAST:
Next we will do a BLASTP using the mouse PKI alpha protein sequence.
- Retrieve the protein record for mouse PKIa (NP_032888).
- Obtain the FASTA-formatted sequence and copy it.
- Return to the BLAST home page.
- Choose the protein blast link under the Basic BLAST section.
- Paste the mouse protein sequence into the text box.
- Select the database Non-redundant protein sequences (nr)
- Click on the arrow to the left of algorithm parameters to expand that area. The parameters should be set as shown in Figure 7.
- Make sure the algorithm blastp is selected and click the BLAST button at the bottom.
Figure 7: Default parameters for BLASTP
Figure 8: Initial status page of BLASTP with a graphic showing conserved protein domains identified in query protein sequence
When the search is done, an output page appears with the same sections as we saw with BLASTN. There are two differences from the BLASTN output sections. Within the graphical summary, there may be an additional graphic showing conserved protein domains found in the query. Within the alignments section, the top of each alignment has an additional score for the number of positives within the alignment. This represents matches within the sequence that have a positive BLOSUM score.
Figure 9: BLASTP alignment with scoring parameters.
If you want to know more about the conserved protein domains found in your query sequence, click on the graphics shown at the top of the BLAST output page. This will take you to the Conserved Domain database and display a summary page for that domain.
Figure 10: Conserved domain summary page.
On the conserved domain summary page, you can view the consensus sequence for that conserved protein domain. Clicking on the PKI graphic again opens a page from the CDD database with more comprehensive information about the conserved domain.
For PKI, this is shown in Figure 11.
Figure 11: Conserved Domain Database listing for PKI conserved domain.
Included on this page is the source for the Conserved Domain, which in this case is a PFAM model. A description of the conserved domain is given. Click on the [+] links located to the left of the Links, Statistics and Interactive view bars. Under the Links menu, the Taxa link shows the largest taxonomic group in which this domain has been found. In the case of PKI, the group listed is Amniota. Click on “Amniota” to view what taxonomic groups are included. The presence or absence of the protein domain in various taxonomic groups is also a clue as to the species in which you might expect to find homologous proteins.
To return to the BLAST output page, close the window that shows the Conserved Domain information or click on the tab with the BLAST results.
Note the following features of the sequences in the match list:
- Database-specific accession numbers and locus names (linked to Entrez).
- Truncated sequence descriptions.
- Scores in bits (linked to alignments).
- Score probabilities as E values.
- Hits are displayed if the E value is below the E threshold set (default is 10).
Follow the link to the highest scoring pair by clicking on the score and note:
- Complete description of the subject sequence.
- Exact sequence match to the probe.
- Score in bits with raw score in parenthesis.
- Expect value, shown in exponential notation (powers of 10).
- Identities = (100%), Positives = (100%).
- Alignment of query and subject sequences; this is a single, uninterrupted HSP. (HSP stands for High Scoring Pair).
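The bit score and the raw score in parentheses are related by S' = (lambda x S - ln K) / ln 2, where lambda and K are statistical parameters that depend on the scoring matrix and gap costs. The Python sketch below applies that conversion. The values lambda = 0.267 and K = 0.041 are ones I have seen quoted for BLOSUM62 with gap costs 11/1; treat them as assumptions and use the values listed in your own search summary.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Assumed parameters (see the note above); the raw scores are hypothetical.
LAMBDA = 0.267
K = 0.041

for raw in (100, 200, 400):
    print(raw, round(bit_score(raw, LAMBDA, K), 1))
```

Because the conversion removes the dependence on the scoring system, bit scores from searches run with different matrices can be compared, whereas raw scores cannot.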
Scroll down to the next few hits:
- These are very similar to the query sequence.
- The further down the list, the less similar the sequences are to the query sequence
Note the identities and positives. Identities correspond to exact matches and positives are similarities based on the scoring matrix used.
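The difference between identities and positives can be shown with a toy example in Python. The sketch below counts both for a pair of aligned peptides, using only a handful of BLOSUM62 entries rather than the full matrix. The peptides are made up, and any pair missing from the small table is treated as not positive, which is a simplification.

```python
# A few BLOSUM62 entries, enough for this toy example (not the full matrix).
SUBSTITUTION_SCORES = {
    frozenset("IV"): 3, frozenset("KR"): 2, frozenset("DE"): 2,
    frozenset("ST"): 1, frozenset("LD"): -4, frozenset("AP"): -1,
}


def identities_and_positives(query, subject):
    """Count exact matches and positive-scoring pairs in two aligned peptides."""
    identities = positives = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            continue                      # gap columns count as neither
        if q == s:
            identities += 1
            positives += 1                # identical residues always score positive
        elif SUBSTITUTION_SCORES.get(frozenset((q, s)), 0) > 0:
            positives += 1                # conservative substitution
    return identities, positives


print(identities_and_positives("MKDSIAL", "MRETVPD"))
```

Here only the first residue is identical, but four more columns are conservative substitutions (K/R, D/E, S/T, I/V), so the positives count is much higher than the identities count.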
Taxonomy Report:
Up to now, you’ve only looked at the search summary report. The BLAST search also returns 4 other reports: taxonomy, distance tree, related structures and multiple alignments.
Scroll up to the top of the BLAST report and click on “Taxonomy report”. It is located just above the Query information and is a single text link.
- Scroll down and look at the organisms represented in the taxonomy report. Note that most of the hits fall within rodents and primates, with fewer matches to lagomorphs, fish, birds and amphibians.
- Are there any sequences that do not fall under the taxonomy tree? What are they?
- Think about what is not represented in this list in terms of other eukaryotic species. Are they not there because there are no homologs in those species or because of limitations on the BLAST search? The default BLAST search returns only 100 matches. What is the highest E value listed in the report? Is it still lower than the default cut-off of 10? If so, the hit list was cut short by the limit on the number of matches rather than by the E value cut-off. To confirm the lack of a match, you should redo the BLAST search and either increase the number of hits returned or restrict the database by excluding groups that have a very large number of matches.
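The reasoning in the last question can be written as a small check in Python. This is only a sketch of the logic: the function name and the E values are hypothetical, and the limits passed in are the defaults described in this tutorial.

```python
def hit_list_may_be_truncated(evalues, max_target_seqs, expect_threshold):
    """True if the hit list filled up before the E value cut-off was reached."""
    if not evalues:
        return False
    return len(evalues) >= max_target_seqs and max(evalues) < expect_threshold


# Hypothetical search: 100 hits returned, the worst of them with E = 1e-5.
evalues = [1e-40] * 60 + [1e-12] * 39 + [1e-5]
print(hit_list_may_be_truncated(evalues, max_target_seqs=100, expect_threshold=10))
```

When the check is true, weaker matches in other taxonomic groups may exist but were never reported, so their absence from the taxonomy report proves nothing.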
Figure 12: List of taxonomic groups represented in BLAST search of Mouse PKI to NR database
There are several links on each line of the Tax BLAST Report. Within the Lineage report, clicking on the species name (a) will open up the Taxonomy browser for that species.
Clicking on the number of hits (b) will take you to the part of the BLAST output representing those hits. Clicking on the taxonomic group (c) will open the Taxonomy browser for that group. Clicking on the protein name (d) to the right of the taxonomic group will open the GenBank record for that protein.
If you scroll down further in the taxonomy report, you will see the Organism Report. This report lists the score for each organism, listed from the most similar to least similar.
Again, you can link to the actual alignment or to the GenBank record for each protein hit.
Scroll to the top of your blast output page. There should be a link “Edit and Resubmit”. Click this and then scroll down to the parameters section. Click on the + button to open the menu and change the Max Target Sequences from 100 to 1000. Click the BLAST button to start the search.
- Once the search is complete, examine the number of hits, their respective scores and look at the taxonomy report
- Do you now see matches to other non-vertebrate eukaryotes? How do the e-values compare to matches within vertebrate species? Would you call these homologs?
You can also leave the database set to NR and then restrict it to, or exclude, particular taxonomic groups. If you start to type vertebrate and pause, the form will complete the name for you. You can also type in mouse or fungi or bacteria or Gallus gallus or use the TaxID number. Restricting the search to a smaller database makes it run faster and leaves you with fewer hits to sort through. It also lets you determine whether there are matches in other taxonomic groups that may have been missed because of the limit on the number of search results returned.
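The same kind of restriction can also be written as a text term in Entrez query syntax, which is useful if you keep notes on exactly how a search was limited. The Python sketch below builds such a term from TaxID numbers. The TaxID 10090 (house mouse) appears later in this tutorial; 7742 (Vertebrata) and 9989 (Rodentia) are IDs I believe are correct, but confirm them in the Taxonomy browser. Whether a particular BLAST form accepts the term in exactly this form is an assumption.

```python
def organism_limit(include_taxids, exclude_taxids=()):
    """Build an Entrez-style organism limit from NCBI Taxonomy IDs."""
    term = " OR ".join(f"txid{taxid}[Organism]" for taxid in include_taxids)
    if exclude_taxids:
        excluded = " OR ".join(f"txid{taxid}[Organism]" for taxid in exclude_taxids)
        term = f"({term}) NOT ({excluded})"
    return term


print(organism_limit([10090]))                        # mouse only
print(organism_limit([7742], exclude_taxids=[9989]))  # vertebrates, excluding rodents
```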
Using Blast2Seq
Go to the BLAST homepage. At the bottom of the section titled Specialized BLAST is a link, “Align two sequences using BLAST (bl2seq).” Click on the Align link and it will open up the Blast2Sequences interface. We will use this to test the effect of changing various parameters on the alignment of 2 sequences.
The sequences we will use are the Toll-like receptor 3 from human, mouse and zebrafish.
Figure 13: BLASTN interface for Blast2Sequences.
Figure 14: DotMatrix View of the BLASTN alignment of human and mouse Tlr3
- Nucleotide records: NM_003265, NM_126166 & NM_001013269 (human, mouse & zebrafish)
- Protein records: NP_003256, NP_569054 & NP_001013287 (human, mouse & zebrafish)
You may want to check the accuracy of these accession numbers using Entrez.
The parameters that can be changed will depend on whether you want to align nucleotide or protein sequences. We will start with nucleotide sequences.
- Start out by leaving everything as default.
- Align the human and mouse mRNAs first by putting their accession numbers into the box: Enter accession or GI.
- You can also paste in the sequence in FASTA format or upload it from a text file.
- Click the Align button.
- On the subsequent page, click on the Dot Matrix view to see a graphic representation of how the 2 sequences are related.
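If you have not seen a dot matrix before, the idea is simple: one sequence runs along each axis, and a dot is drawn wherever a short window of one matches the other. The Python sketch below prints a text version for two made-up sequences. It uses exact window matches only, which is a simplification of what the BLAST Dot Matrix view plots.

```python
def dot_matrix(seq_a, seq_b, window=3):
    """Rows of a text dot plot: '*' where a window of seq_a equals a window of seq_b."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            row += "*" if seq_a[i:i + window] == seq_b[j:j + window] else "."
        rows.append(row)
    return rows


# The second sequence is the first with three bases (CCC) inserted.
for row in dot_matrix("GATTACAGGTCA", "GATTACACCCGGTCA"):
    print(row)
```

Related regions show up as diagonal lines. An insertion or deletion shifts the diagonal sideways, and unrelated regions leave the plot empty.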
Using the Edit and Resubmit button at the top of the report, return to the submission page and align the human and zebrafish mRNAs using the default parameters. Did you get a significant result? Change the program setting to optimize for more dissimilar sequences (discontiguous megablast) and resubmit. Now look at the Dot Matrix view of the result.
Notice that the region of alignment is much shorter than for the human and mouse. These two sequences have diverged a fair amount. Try changing some of the parameters to see if you can get more of these sequences to align.
Now align the human and mouse TLR3 protein and human and zebrafish TLR3 protein sequences. Compare the results of the nucleotide alignments to those of the protein alignments. Which is a more sensitive method to find related sequences?
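One reason the two kinds of alignment behave differently is that several codons encode the same amino acid, so two coding sequences can drift apart at the nucleotide level while the protein stays the same. The Python sketch below shows this with two made-up 15-base sequences and a partial codon table that covers only the codons used here.

```python
# Partial codon table: only the codons needed for this example.
CODONS = {
    "CTG": "L", "CTT": "L", "AAA": "K", "AAG": "K", "GGT": "G",
    "GGC": "G", "GAA": "E", "GAG": "E", "TCT": "S", "AGC": "S",
}


def translate(dna):
    """Translate a coding sequence, reading whole codons from the first base."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, usable, 3))


def identity_fraction(a, b):
    """Fraction of positions that are identical in two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna_a = "CTGAAAGGTGAATCT"
dna_b = "CTTAAGGGCGAGAGC"

print(translate(dna_a), translate(dna_b))
print(round(identity_fraction(dna_a, dna_b), 2))
print(identity_fraction(translate(dna_a), translate(dna_b)))
```

The two DNA sequences agree at only about half of their positions, yet they encode exactly the same peptide.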
You can compare the alignments of 2 proteins to alignments of other proteins using the % identity and % similarity as well as the bit scores and Expect values.
You could use BLASTN as part of Blast2seq with alternative transcripts to determine the region where they differ.
Using Primer-BLAST
Primer-BLAST can be used to design PCR primers as well as check their specificity. The interface is not quite as well developed as the other BLAST programs. The results are not stored the way other BLAST results are kept and there is no link for editing and resubmitting the same query. However, it does provide a quick way to check your primers. The primer design algorithm is Primer3, which is the standard algorithm used by most sequence analysis programs.
From the BLAST home page, click on the link Primer-BLAST under Specialized BLAST section. It brings up a window with 4 sections that have a number of options. The default options are set to design and test primers for human mRNAs.
Figure 14: Dot matrix output for BLASTN alignment for human and mouse Tlr3
Figure 15: Dot matrix output for BLASTN alignment for human and zebrafish Tlr3
Figure 16: Output for BLASTP alignment of human and zebrafish Tlr3 proteins.
Start out in the PCR Template section by designing primers to amplify a region of the mouse transcript for Pki-alpha. The parameters are that the PCR product be ~300 bp in length and that the primers span an exon-exon junction.
Type in or paste in the Refseq accession number for the mouse Pkia gene (NM_008862).
Define a range for the forward and reverse primers so that the product size comes close to 300. Generally, when designing primers, you define a range and let the primer-picking program select the actual primers, as the algorithm takes into account factors such as melting temperature which affect the ability of the 2 primers to work in the same reaction. I would suggest a range of ~200 bp for each primer. Skip the section titled Primer Parameters and in the Exon/intron selection, change the menu selection to Primer must span exon-exon junction. Under the section Primer Pair Specificity Checking, change the Organism to house mouse (taxid:10090).
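To see why melting temperature matters for primer compatibility, you can estimate it by hand with the Wallace rule, Tm = 2 x (A+T) + 4 x (G+C) in degrees C. This is a rough rule of thumb for short oligos and not the calculation Primer3 performs, which uses a more detailed thermodynamic model. The primer sequences in the Python sketch below are made up.

```python
def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2 * (A + T) + 4 * (G + C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Hypothetical primer pair, for illustration only.
forward = "AGCTGACCTGAAGGCTCTAC"
reverse = "TTATTAGCATTTCAAGTATA"

print(wallace_tm(forward), wallace_tm(reverse))
print("difference:", abs(wallace_tm(forward) - wallace_tm(reverse)))
```

A large difference between the two estimates suggests the primers would not anneal well at the same temperature, which is the kind of pair a primer-picking program rejects.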
Sometimes the primer design fails for the given parameters. Usually increasing the range for both primers will give the program enough leeway to find compatible primers. When you do get output, it should look something like Figure 17. I used 1-200 for primer 1 and 300-500 for primer 2. I suggested a product size of 80-450. It came back with 5 primer pairs, but I’m only showing the results for the first pair.
Figure 17: Primer-BLAST output
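It can help to work out in advance which product sizes a pair of primer ranges allows. The Python sketch below does the arithmetic for the ranges used above. It assumes 1-based inclusive coordinates, that each range must contain the whole primer, and a fixed primer length of 20 bases; all three are my assumptions rather than documented Primer-BLAST behavior.

```python
def product_size_bounds(fwd_range, rev_range, primer_len=20):
    """Smallest and largest product size allowed by two primer ranges."""
    fwd_lo, fwd_hi = fwd_range
    rev_lo, rev_hi = rev_range
    latest_fwd_start = fwd_hi - primer_len + 1   # forward primer pushed to the right
    earliest_rev_end = rev_lo + primer_len - 1   # reverse primer pushed to the left
    smallest = earliest_rev_end - latest_fwd_start + 1
    largest = rev_hi - fwd_lo + 1
    return smallest, largest


smallest, largest = product_size_bounds((1, 200), (300, 500))
print(smallest, largest)
```

Comparing these bounds with the requested product size range shows how much room the program has. If the two barely overlap, widening the primer ranges is the first thing to try.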
To test the specificity of already designed primers, enter the forward and reverse primers in the appropriate text boxes under the section titled Primer Parameters. Change the organism to whatever is appropriate and select whether you want to test specificity in transcripts (mRNA) or genomic. I’ve put a set of primers designed for Pki on the Exercise 3 homepage.
Leave the template and range boxes blank. Choose the database for checking specificity, i.e. what organism and whether you want to test specificity in transcripts or genomic sequence. Even though you are only testing specificity, click the Get Primers button to execute.
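The basic idea behind checking a primer pair against one template can be sketched in a few lines of Python: find the forward primer, find the reverse complement of the reverse primer downstream of it, and measure the distance. This sketch requires exact matches on a single made-up template, whereas Primer-BLAST searches a whole database and tolerates some mismatches.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def product_size(template, forward, reverse):
    """Amplicon length for exactly matching primers, or None if either is not found."""
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    rev_site = reverse_complement(reverse)       # how the reverse primer appears on this strand
    end = template.find(rev_site, start + len(forward))
    if end == -1:
        return None
    return end + len(rev_site) - start


# Made-up template and primers, for illustration only.
template = "TTTTACGTGACCCCCCCCTTGCACAAAA"
print(product_size(template, forward="ACGTGA", reverse="GTGCAA"))
```

Running the same pair against the mRNA and against the genomic sequence gives different product sizes whenever an intron lies between the primers.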
Test the tutorial primers against mouse mRNA and mouse genomic sequence. What were the expected product sizes? Did the primers cross an exon-exon boundary? Were they perfect matches in both the mRNA and genomic sequence?