Blast bioinformatics
October 17, 2023
Introduction
What is BLAST? BLAST stands for Basic Local Alignment Search Tool. It is both an algorithm and a program, commonly used to compare primary biological sequences such as the amino acid sequences of proteins and the nucleotide sequences of DNA and RNA. It is one of the most fundamental programs for sequence searching in bioinformatics, and researchers use it to address many basic questions in the field.
Before programs such as BLAST existed, sequence searching was tedious and time-consuming, because a full alignment had to be computed against every sequence in the database. BLAST was introduced in 1990, a few years after FASTA, another algorithm used for sequence searching. BLAST is much faster than FASTA because it concentrates on the more significant patterns in a sequence, while offering comparable sensitivity. The BLAST tool is now available on many websites, one of the most widely used being the NCBI website.
When a search is run, the type of BLAST to use depends on the query sequence. For example, when a DNA query is searched against a DNA database, nucleotide-nucleotide BLAST (blastn) is used. When a protein query is searched against a protein database, protein-protein BLAST (blastp) is used. The input sequence has to be in either FASTA or GenBank format. The database to search and other parameters can also be set to obtain a more precise search. The program then uses the query sequence to look for HSPs (High-scoring Segment Pairs): locally aligned segments of the query and a database sequence whose alignment score is high. The output can be returned in several formats, such as HTML, plain text and XML. Some websites also produce graphical results, which are easier to read.
There are many uses for the BLAST program. The main ones are identifying species, establishing phylogeny, locating domains, and DNA mapping and comparison. Given a DNA sequence from an unknown organism, the most similar database sequences found by BLAST usually give researchers enough information to identify the species. BLAST results can also be used to create a phylogenetic tree, a function that is available on the BLAST website. For a protein query, the program can locate known domains within the sequence. Finally, an unknown DNA sequence can be mapped to its chromosomal position by comparing it with the relevant sequences in the database.
Objective
- Explain what BLAST is and how it works
- Complete the BLAST exercises and answer the questions
In this exercise we will be using BLAST (Basic Local Alignment Search Tool) for searching sequence databases such as GenBank (DNA data) and UniProt (protein). When using BLAST for sequence searches it is of utmost importance to be able to critically evaluate the statistical significance of the results returned.
The BLAST software package is free to use (Open Source) and can be installed on a local system; it was originally written for UNIX-type operating systems. The package contains programs for performing the actual sequence searches against preexisting databases (e.g. “blastn” for DNA databases and “blastp” for protein databases), as well as a tool for creating new databases from scratch (the “formatdb” program, replaced by “makeblastdb” in the current BLAST+ package).
In this exercise we will be using the Web-interface to BLAST hosted by the NCBI. For our purpose there are several advantages to this approach:
- We don’t have to mess around with a UNIX command prompt.
- NCBI offers direct access to preformatted BLAST databases of all the data that they host:
- GenBank (+ derivatives)
- Full Genome database
- Protein database (Both from translated GenBank and UniProt)
It should be noted that running BLAST locally (for example at the super-computer cluster at DTU) offers much more fine-grained control of DATA and workflow (everything can be scripted/automated) than running BLAST through a web-interface.
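As a rough illustration of the scripting mentioned above, the sketch below assembles command lines for a local installation. It assumes the `makeblastdb` and `blastn` programs from the NCBI BLAST+ package are on the PATH; the file names (`my_sequences.fasta`, `query.fasta`, `my_db`, `hits.tsv`) are placeholders and not files from this exercise. Nothing is executed unless the program is actually found.

```python
import shutil
import subprocess


def makeblastdb_cmd(fasta_path, db_name, dbtype="nucl"):
    """Command line that builds a BLAST database from a FASTA file."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastn_cmd(query_path, db_name, evalue=10.0, out_path="hits.tsv"):
    """Command line that runs blastn and writes tabular output (-outfmt 6)."""
    return ["blastn", "-query", query_path, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6", "-out", out_path]


def run_if_available(cmd):
    """Run cmd only when its program is installed; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        return False
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    # Placeholder file names: print the commands instead of running them.
    print(" ".join(makeblastdb_cmd("my_sequences.fasta", "my_db")))
    print(" ".join(blastn_cmd("query.fasta", "my_db", evalue=1e-10)))
```

Keeping the command construction separate from the execution makes it easy to log or review exactly what will be run before a large batch is submitted.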
Links
- NCBI BLAST main page: http://blast.ncbi.nlm.nih.gov/
- Notice: There are links to “Nucleotide BLAST” (including “blastn”) and “Protein BLAST” (including “blastp”) from this page.
- NCBI BLAST help pages
- IMPORTANT: BLAST is a computationally intensive algorithm, and in recent years we have run into issues with overburdening the NCBI server when 150+ students submit jobs at the same time. We have therefore implemented a few optimizations/work-arounds that it is important you remember to follow. In some of the sections below, you will be asked to limit your search to a certain subset of the BLAST database (e.g. only search in the “bacterial” part of the NR database). This will limit the amount of data to search through, and will make the search finish faster.
Part 1: Your first BLAST search
- IMPORTANT: do NOT limit your search to “bacteria” in PART 1 (we are looking for insulin).
Below is the mRNA sequence for insulin from a South American rodent, the Degu (Octodon degus).
>gi|202471|gb|M57671.1|OCOINS Octodon degus insulin mRNA, complete cds GCATTCTGAGGCATTCTCTAACAGGTTCTCGACCCTCCGCCATGGCCCCGTGGATGCATCTCCTCACCGT GCTGGCCCTGCTGGCCCTCTGGGGACCCAACTCTGTTCAGGCCTATTCCAGCCAGCACCTGTGCGGCTCC AACCTAGTGGAGGCACTGTACATGACATGTGGACGGAGTGGCTTCTATAGACCCCACGACCGCCGAGAGC TGGAGGACCTCCAGGTGGAGCAGGCAGAACTGGGTCTGGAGGCAGGCGGCCTGCAGCCTTCGGCCCTGGA GATGATTCTGCAGAAGCGCGGCATTGTGGATCAGTGCTGTAATAACATTTGCACATTTAACCAGCTGCAG AACTACTGCAATGTCCCTTAGACACCTGCCTTGGGCCTGGCCTGCTGCTCTGCCCTGGCAACCAATAAAC CCCTTGAATGAG
We will now use a BLASTN search at NCBI to determine whether this sequence looks like the human mRNA for insulin. There are two ways we can do this:
- search the entire database and look for human hits in the results,
- specifically search the human part of the database.
We will try both of these possibilities.
Search against NR
- Follow the “nucleotide blast” link from the main BLAST page.
- In the section “Program Selection” select the option “Somewhat similar sequences (blastn)“
- Choose “Nucleotide collection (nr/nt)” as the search database. NR is the “Non Redundant” database, which contains all non-redundant (non-identical) sequences from GenBank and the full genome databases.
- Click the BLAST button to launch the search.
After the search has completed, make yourself familiar with the BLAST output page. After a header with some information about the search, there are three main parts:
- Graphic Summary
- each hit is represented by a line showing which part of the query sequence the alignment covers. The lines are coloured according to alignment score.
- Descriptions
- a table with a one-line description of each hit with some alignment statistics.
- Alignments
- the actual alignments between the query and the database hits.
Note that you can toggle between hiding and showing each part by clicking on the part title (try it!).
The columns in the Descriptions table are:
- Description — the description line from the database
- Max score — the alignment score of the best match (local alignment) between the query and the database hit
- Total score — the sum of alignment scores for all matches (alignments) between the query and the database hit (if there is only one match per hit, these two scores are identical)
- Query cover — the percentage of the query sequence that is covered by the alignment(s)
- E value — the Expect value calculated from the Max score (i.e. the number of unrelated hits with that score or better you would expect to find for random reasons)
- Ident — the percent identity in the alignment(s)
- Accession — the accession number of the database hit.
- QUESTION 1.2:
Answer the same questions as before about the hit you found now.
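To make the E value column described above more concrete, here is a minimal sketch of the standard relation between a bit score S', the query length m and the database length n: E = m · n · 2^(-S'). The function name and the example numbers are illustrative only. NCBI applies corrections (effective lengths) on top of this, so values from the web server will not match exactly.

```python
def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S') for bit score S', query length m, database length n."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the trend.
    for db_len in (1_000_000, 2_000_000, 4_000_000):
        print(db_len, expect_value(40.0, 25, db_len))
```

Each extra bit of score halves the expected number of chance hits, while the search-space size (m times n) scales it linearly.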
Search against Human G+T
Note: In this context, G+T does not mean Gin and Tonic.
Open a new window/tab with the BLAST home page. Make a new BLASTN search with the same query sequence, this time with Database set to Human genomic + transcript (Human G+T). Remember again to select Somewhat similar sequences (blastn) under Program Selection. Consider the best hit.
Note: even though you may not have found exactly the same database entry in the two searches, the alignment should be the same. Make sure this is the case by comparing the actual alignments in the two windows where you made the searches.
- QUESTION 1.3:
Answer the same questions as before about the best hit you found in this search.
Concerning database size and E-values
When answering the previous two questions, you may have noticed that the E-value changed, while the alignment score did not. We will now investigate this further.
- QUESTION 1.4:
What are the sizes (in basepairs) of the databases we used for the two BLAST searches? (Tip: Expand the “Search summary” section near the top by clicking it).
- QUESTION 1.5:
- Hint: remember, you can use Google as a calculator!
- What is the ratio between the database sizes in the two BLAST searches?
- What is the ratio between the E-values (for the best human hits) in the two BLAST searches?
- What is the relationship between database size and E-value for hits with identical alignment score?
- In conclusion: if the database size is doubled, what will happen to the E-value?
Part 2: Assessing the statistical significance of BLAST hits
- IMPORTANT: limit your search to “bacteria” (taxid: 2) in ALL of this section (PART 2) to make the BLAST searches run quicker.
As discussed in the lecture, there will be a risk of getting false positive results (hits to sequences that are not related to our input sequence) by purely stochastic means. In this first part of the exercise we will be investigating this further, by examining what happens when we submit randomly generated sequence to BLAST searches.
Rather than giving out a set of pre-generated DNA/Peptide sequences where you only have our word for their randomness, you’ll be generating your own random sequences with the Sequence Manipulation Suite. We previously used d4/d20 dice to generate these sequences manually, but we have decided to let the computer do the work in order for you to save some time. It is important to understand that these computer generated sequences are totally random, just as if you were rolling a die to determine each nucleotide/amino acid in each sequence.
Random DNA sequences and BLASTN
- Generate three DNA sequences of length 25bp using the random DNA generator from the Sequence Manipulation Suite. Note: three is not an option, so just generate ten sequences and copy the first three.
- QUESTION 2.1:
- Report the three sequences in FASTA format.
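For readers who prefer to script this step, the following sketch generates uniformly random sequences and prints them in FASTA format. It is not the Sequence Manipulation Suite's implementation; the function names and the `seq1`, `seq2`, `seq3` labels are made up for the example.

```python
import random


def random_sequence(length, alphabet="ACGT", rng=random):
    """Draw each position independently and uniformly from the alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text, wrapped at `width` columns."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    records = [(f"seq{i}", random_sequence(25)) for i in range(1, 4)]
    print(to_fasta(records), end="")
```

Passing the 20 amino acid letters as `alphabet` gives random peptides in which every residue has the same 5% probability.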
We will now do a BLASTN search using these three random sequences as queries. Follow the “nucleotide blast” link from the main BLAST page, and, as before, select the option “Somewhat similar sequences (blastn)” in the section “Program Selection“. Choose “Nucleotide Collection (nr/nt)” as the search database.
VERY IMPORTANT: For this special situation, where we BLAST small artificial sequences, we need to turn off some of the automatic adjustments NCBI applies when short sequences are detected. Otherwise we’ll not be able to see the intended results:
- Extend the “Algorithm parameters” section (see the screen shot below) in order to gain access to fine-tuning the options.
- Deselect the “Automatically adjust parameters for short input sequences” option.
- Set the E-value cut-off (“Expect threshold“) to 50
- QUESTION 2.2:
- Answer the following small questions, and document your findings by pasting in examples of alignments / text snippets from the overview table:
- Do you find any sequences that look like your input sequences (paste in a few example alignments in your report).
- What is the typical length of the hits (the alignment length)?
- What is the typical % identity?
- In what range are the bit-scores (“max score”)?
- Notice: This is conceptually the same as the “alignment score” we have already met in the pairwise alignment exercise.
- What is the range of the E-values?
- QUESTION 2.3:
- What is the biological significance of these hits / is there any biological meaning?
Random protein sequences and BLASTP
Now it’s time to work with a set of protein sequences: Generate three peptide sequences of length 25aa using the random protein generator.
- Notice 1: The distribution of amino acids will be equal (5% prob) and this is different from true biological sequences – however this is not important for this first part of the exercise.
- Notice 2: Please recall from the lecture that the way BLASTP selects candidate sequences for full Smith-Waterman alignment is different from BLASTN. (BLASTN – a single short (11 bp +) perfect match hit is needed. BLASTP – a pair of “near match” hits of 3 aa within a 40 aa window is needed).
- QUESTION 2.4:
- Report the sequences in FASTA format.
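The BLASTN seeding rule recalled in Notice 2 (a single perfect match of at least 11 bp) can be sketched as below. This is a simplified illustration of the seeding idea only, not NCBI's implementation: it does not extend the seeds into alignments, and it does not cover the two-hit rule used by BLASTP. The function name and the toy sequences are invented for the example.

```python
def exact_word_hits(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a word_size-mer matches exactly."""
    # Index every word of the query by its start position.
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    # Scan the subject and look each of its words up in the index.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # Toy sequences sharing a single 11-mer.
    print(exact_word_hits("ACGTACGTTAGC", "TTACGTACGTTAGG"))
```

Only database sequences that produce at least one such seed are passed on to the more expensive alignment stage, which is what makes the search fast.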
Locate the “Protein BLAST” page at NCBI and choose blastp as the algorithm to use.
Paste in your sequences in FASTA format, and choose the “NR” database (this is the protein version, consisting of translated CDS’es, UniProt etc).
VERY IMPORTANT: We also need to tweak the parameters this time – in the “Algorithm Parameters” section select BLOSUM62 as the alignment matrix to use and set the “Expect threshold” to 1000 (default: 10) – and DISABLE the “Short queries” parameters as we did in the DNA search a moment ago – otherwise our carefully tweaked parameters will be ignored.
- Perform the BLAST search.
- Inspect the results.
- QUESTION 2.5:
- (Remember to document your answers in the same manner as Q2.2)
- What is the typical length of the alignment and do they contain gaps?
- What is the range of E-values?
- Try to inspect a few of the alignments in details (“+” means similar sequences) – do you find any that look plausible, if we for a moment ignore the length/E-value?
- If we had used the default E-value cut-off of 10 would any hits have been found?
- QUESTION 2.6:
- If we compare the result from BLAST’ing random DNA sequences to random Peptide sequences – which kind of search has the higher risk of returning false positives (results that appear plausible, maybe even significant, but are truly unrelated)?
- Remember to take E-values into your consideration.
Part 3: using BLAST to transfer functional information by finding homologs
- IMPORTANT: limit your search to “bacteria” (taxid: 2) in ALL of this section (PART 3) to make the BLAST searches run quicker. (The organisms we’re looking for all belong to the “Bacteria” domain of life, so this restriction is OK).
Homo-, Ortho- and Paralogs
One of the most common ways to use BLAST as a tool, is in the situation where you have a sequence of unknown function, and want to find out which function it has. Since a large amount of sequence data has been gathered during the years, chances are that an evolutionarily related sequence with known function has already been identified. In general such a related sequence is known as a “homolog“.
Homo-, Ortho- and Paralogs:
- A Homolog is a general term that describes a sequence that is related by any evolutionary means.
- An Ortholog (“Ortho” = True) is a sequence that is “the same gene” in a different organism: the sequences share a single common ancestor sequence and have diverged through speciation (e.g. the Alpha-globin gene in Human and Mouse).
- A Paralog arises due to a gene duplication within a species. For example, Alpha- and Beta-globin are each other’s paralogs.
Notice that in both cases it’s possible to transfer information, for example information about gene family / protein domains. We have already touched upon comparison of (potentially) evolutionarily related sequences in the pairwise alignment exercise. However, this time we do not start out with two sequences we assume are related, but we rather start out with a single sequence (“query sequence”) which we will use to search the databases for homologs (we often informally speak of “BLAST hits”, when discussing the sequences found).
BLAST example 1
Let’s start out with a sequence that will produce some good hits in the database. The sequence below is a full-length transcript (mRNA) from a prokaryote. Let’s find out what it is.
>Unknown_transcript01 CCACTTGAAACCGTTTTAATCAAAAACGAAGTTGAGAAGATTCAGTCAACTTAACGTTAATATTTGTTTC CCAATAGGCAAATCTTTCTAACTTTGATACGTTTAAACTACCAGCTTGGACAAGTTGGTATAAAAATGAG GAGGGAACCGAATGAAGAAACCGTTGGGGAAAATTGTCGCAAGCACCGCACTACTCATTTCTGTTGCTTT TAGTTCATCGATCGCATCGGCTGCTGAAGAAGCAAAAGAAAAATATTTAATTGGCTTTAATGAGCAGGAA GCTGTTAGTGAGTTTGTAGAACAAGTAGAGGCAAATGACGAGGTCGCCATTCTCTCTGAGGAAGAGGAAG TCGAAATTGAATTGCTTCATGAATTTGAAACGATTCCTGTTTTATCCGTTGAGTTAAGCCCAGAAGATGT GGACGCGCTTGAACTCGATCCAGCGATTTCTTATATTGAAGAGGATGCAGAAGTAACGACAATGGCGCAA TCAGTGCCATGGGGAATTAGCCGTGTGCAAGCCCCAGCTGCCCATAACCGTGGATTGACAGGTTCTGGTG TAAAAGTTGCTGTCCTCGATACAGGTATTTCCACTCATCCAGACTTAAATATTCGTGGTGGCGCTAGCTT TGTACCAGGGGAACCATCCACTCAAGATGGGAATGGGCATGGCACGCATGTGGCCGGGACGATTGCTGCT TTAAACAATTCGATTGGCGTTCTTGGCGTAGCGCCGAGCGCGGAACTATACGCTGTTAAAGTATTAGGGG CGAGCGGTTCAGGTTCGGTCAGCTCGATTGCCCAAGGATTGGAATGGGCAGGGAACAATGGCATGCACGT TGCTAATTTGAGTTTAGGAAGCCCTTCGCCAAGTGCCACACTTGAGCAAGCTGTTAATAGCGCGACTTCT AGAGGGGTTCTTGTTGTAGCGGCATCTGGGAATTCAGGTGCAGGCTCAATCAGCTATCCGGCCCGTTATG CGAACGCAATGGCAGTCGGAGCGACTGACCAAAACAACAACCGCGCCAGCTTTTCACAGTATGGCGCAGG GCTTGACATTGTCGCACCAGGTGTAAACGTGCAGAGCACATACCCAGGTTCAACGTATGCCAGCTTAAAC GGTACATCGATGGCTACTCCTCATGTTGCAGGTGCAGCAGCCCTTGTTAAACAAAAGAACCCATCTTGGT CCAATGTACAAATCCGCAATCATCTAAAGAATACGGCAACGAGCTTAGGAAGCACGAACTTGTATGGAAG CGGACTTGTCAATGCAGAAGCGGCAACACGCTAATCAATAATAATAGGAGCTGTCCCAAAAGGTCATAGA TAAATGACCTTTTGGGGTGGCTTTTTTACATTTGGATAAAAAAGCACAAAAAAATCGCCTCATCGTTTAA AATGAAGGTACC
BLASTN search
Perform a BLAST search in the NR/NT database (BLASTN) using default settings. Remember to set Expect threshold back to the default value, 10. (2021 update: The new default is 0.05, that should work fine as well).
- QUESTION 3.1:
- (Once again remember to document your findings)
- Do we get any significant hits?
- What kind of genes (function) do we find?
BLASTP search
Now let’s try to do the same at the protein level.
- Find the longest ORF using VirtualRibosome (hint: remember to search all positive reading frames) and save or copy the sequence in FASTA format.
- BLAST the sequence (BLASTP) against the NR database.
- QUESTION 3.2:
- (Document!)
- Report your translated protein sequence in FASTA format.
- Do we find any conserved protein domains? (Click the Graphic Summary tab). Identifying known protein domains can provide important clues to the function of an unknown protein.
- Do we find any significant hits? (E-value?)
- Are all the best hits the same category of enzymes?
- From what you have seen, what is best for identifying intermediate quality hits – DNA or Protein BLAST?
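The ORF-finding step in the BLASTP search above can be sketched as follows. This is not how VirtualRibosome is implemented; it is a minimal version that assumes the standard genetic code, accepts only ATG as a start codon, and requires the ORF to end in a stop codon. All names are illustrative.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate from the first base; unknown codons become 'X', stops '*'."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def longest_orf(dna):
    """Longest ATG-to-stop ORF in the three forward frames, as a protein string."""
    best = ""
    for frame in range(3):
        protein = translate(dna[frame:])
        # Drop the last piece: it is not terminated by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTTAGGG"))
```

Only the three forward frames are searched because the query is known to be a transcript; for a sequence of unknown orientation the reverse complement would have to be searched as well.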
BLAST example 2
In the previous section we have been cheating a bit by using a sequence that was already in the database – let’s move on to the following sequence instead.
The sequence is a DNA fragment from an unknown non-cultivatable microorganism. It was cloned and sequenced directly from DNA extracted from a soil-sample, and it goes by the poetic name “CLONE12”. It was amplified using degenerated PCR primers that target the middle (“core cloning”) of the sequence of a group of known enzymes. (I can guarantee this particular sequence is not in the BLAST databases, since I have cloned and sequenced it myself, and it has never been submitted to GenBank).
LOCUS CLONE12.DNA 609 BP DS-DNA UPDATED 06/14/98 DEFINITION UWGCG file capture ACCESSION - KEYWORDS - SOURCE - COMMENT Non-sequence data from original file: BASE COUNT 174 A 116 C 162 G 157 T 0 OTHER ORIGIN ? clone12.dna Length: 609 Jun 13, 1998 - 03:39 PM Check: 6014 .. 1 AACGGGCACG GGACGCATGT AGCTGGAACA GTGGCAGCCG TAAATAATAA TGGTATCGGA 61 GTTGCCGGGG TTGCAGGAGG AAACGGCTCT ACCAATAGTG GAGCAAGGTT AATGTCCACA 121 CAAATTTTTA ATAGTGATGG GGATTATACA AATAGCGAAA CTCTTGTGTA CAGAGCCATT 181 GTTTATGGTG CAGATAACGG AGCTGTGATC TCGCAAAATA GCTGGGGTAG TCAGTCTCTG 241 ACTATTAAGG AGTTGCAGAA AGCTGCGATC GACTATTTCA TTGATTATGC AGGAATGGAC 301 GAAACAGGAG AAATACAGAC AGGCCCTATG AGGGGAGGTA TATTTATAGC TGCCGCCGGA 361 AACGATAACG TTTCCACTCC AAATATGCCT TCAGCTTATG AACGGGTTTT AGCTGTGGCC 421 TCAATGGGAC CAGATTTTAC TAAGGCAAGC TATAGCACTT TTGGAACATG GACTGATATT 481 ACTGCTCCTG GCGGAGATAT TGACAAATTT GATTTGTCAG AATACGGAGT TCTCAGCACT 541 TATGCCGATA ATTATTATGC TTATGGAGAG GGAACATCCA TGGCTTGTCC ACATGTCGCC 601 GGCGCCGCC //
- QUESTION 3.3 (Long question – read all):
- Your task is now to find out what kind of enzyme this sequence is likely to encode, using the methods you have learned.
INSTRUCTIONS: You are free to write the combined answer to this question in a free-style essay-like fashion – just be sure to include the subquestions in your answers. In an exam situation you will need to put all the clues together yourself, reason about the tools/databases to use, and document your findings.
STEP 1 – cleaning up the sequence:
The sequence is (more or less) in GenBank format and the NCBI BLAST server expects the input to be in FASTA format, or to be “raw” unformatted sequence.
- There are two solutions to this:
- Copy the sequence into a text-editor and manually create a FASTA file (“search and replace” and/or “rectangular selection” is useful for the reformatting). This is the most robust solution: it will always work. (Look at the JEdit exercise for a reminder of how to do this).
- Hope the creators of the web-server you’re using were kind enough to automatically remove non-DNA letters (paste in ONLY the DNA lines) – this turns out to be the case for both NCBI BLAST and VirtualRibosome, but it cannot be universally relied upon.
Subquestion: convert the sequence to FASTA format (manually, in JEdit) and quote it in your report.
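The same clean-up can also be scripted. The sketch below is one possible approach, not part of the original exercise: it takes ONLY the numbered sequence lines, drops the position numbers and spaces, and refuses input that contains letters other than A, C, G, T or N, so that pasted header text fails loudly instead of leaking into the sequence. The function name is hypothetical.

```python
def numbered_block_to_fasta(name, block, width=60):
    """Strip position numbers and spaces from sequence lines and wrap as FASTA."""
    seq = "".join(ch for ch in block.upper() if ch.isalpha())
    unexpected = set(seq) - set("ACGTN")
    if unexpected:
        raise ValueError(f"non-DNA letters in input: {sorted(unexpected)}")
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # First two numbered lines of the CLONE12 record.
    block = """
    1 AACGGGCACG GGACGCATGT AGCTGGAACA GTGGCAGCCG TAAATAATAA TGGTATCGGA
    61 GTTGCCGGGG TTGCAGGAGG AAACGGCTCT ACCAATAGTG GAGCAAGGTT AATGTCCACA
    """
    print(numbered_block_to_fasta("CLONE12", block), end="")
```

Note that header words can contain valid DNA letters (a, c, g, t, n), so simply deleting every non-DNA character from the whole record is not safe; restricting the input to the sequence lines is what makes the conversion reliable.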
STEP 2 – thinking about the task:
Consider the following before you start on solving this task:
- Based on the information given: is the sequence protein-coding?
- If it is, can you trust it will contain both a START and STOP codon?
- Do we know if the sequence is sense or anti-sense?
and think which consequences the answers to these questions should have for your choice of methods and parameters.
Subquestion: Give a summary of your considerations.
STEP 3 – Performing the database search:
Significance: We will put the criterion for significance at an E-value of 1e-10 (remember: the higher the E-value, the worse the significance).
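If the searches are run with a local BLAST+ installation instead of the web interface, the significance criterion can be applied automatically. The sketch below assumes the default 12-column tabular output (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name and the two example rows are invented for illustration and are not results from this exercise.

```python
def significant_hits(tabular_text, max_evalue=1e-10):
    """Return (subject_id, evalue, bitscore) for hits at or below max_evalue."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        if evalue <= max_evalue:
            hits.append((fields[1], evalue, float(fields[11])))
    return hits


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    example = (
        "query\thitA\t56.5\t200\t80\t2\t1\t200\t5\t204\t1e-60\t210\n"
        "query\thitB\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t30.1\n"
    )
    for hit in significant_hits(example):
        print(hit)
```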
Subquestion:
Cover the following in your answer:
- What tool(s) and database(s) will be relevant to use?
- Document the results from the different BLAST searches – what works and what does not work?
- You need to copy in small snippets of the BLAST results to document what you observe.
- In conclusion: What kind of enzyme is CLONE12? Gather as much evidence as possible.
Results & Answers
Part 1
Figure – the blastn search for mRNA sequence for insulin from a South American rodent, the Degu.
Figure – Shows the sequences list for the blastn search of the mRNA sequence for insulin from a South American rodent.
Question 1.1
What is the identifier?
- M57671.1
What is the alignment score (“max score”)?
- 780
What is the percent identity and query coverage?
- 100%
What is the E-value?
- 0.0
Are there any gaps in the alignment?
- No. (0%)
Figure – Shows the sequences that are related to the Homo sapiens
Question 1.2
What is the identifier (Accession)?
- NM_001185098.2
What is the alignment score (“max score”)?
- 205
What is the percent identity and query coverage?
- 74.49% and 76%
What is the E-value?
- 4e-48
Are there any gaps in the alignment?
- Yes, 15 of the 341 aligned positions (4%) are gaps.
Figure – Blastn search made for the same mRNA but using the Human G+T database
Question 1.3
What is the identifier (Accession)?
- NM_001185098.2
What is the alignment score (“max score”)?
- 205
What is the percent identity and query coverage?
- 74.49% and 76%
What is the E-value?
- 3e-50
Are there any gaps in the alignment?
- Yes, 15 of the 341 aligned positions (4%) are gaps.
Figure 5 – Shows the search summary of both blastn searches from above.
Question 1.4
The size of database in the first blast search – 531388148937
The size of database in the second blast search – 3864016495
Question 1.5
What is the ratio between the database sizes in the two BLAST searches?
- 531388148937 : 3864016495 = 137.5 : 1
What is the ratio between the E-values (for the best human hits) in the two BLAST searches?
- 4e-48 : 3e-50 ≈ 133 : 1
What is the relationship between database size and E-value for hits with identical alignment score?
- The E-value describes the number of hits with a given score that are expected by chance when searching a database of a particular size. For a fixed alignment score, it is directly proportional to the database size.
In conclusion: if the database size is doubled, what will happen to the E-value?
- The E-value will also be doubled.
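Recomputing the two ratios from the values reported above is a quick way to check the proportionality. The variable names are illustrative; the numbers are the ones given in the answers to Questions 1.2 to 1.4.

```python
db_nr = 531_388_148_937    # database size reported for the nr/nt search
db_human = 3_864_016_495   # database size reported for the Human G+T search
e_nr = 4e-48               # E-value of the best human hit, nr/nt search
e_human = 3e-50            # E-value of the same hit, Human G+T search

size_ratio = db_nr / db_human
evalue_ratio = e_nr / e_human

print(f"database size ratio: {size_ratio:.1f}")
print(f"E-value ratio:       {evalue_ratio:.1f}")
```

The two ratios come out at roughly 137.5 and 133. Because BLAST reports E-values to a single significant digit here, a difference of this size is within rounding, so the numbers are consistent with the E-value scaling linearly with database size.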
Part 2
Figure 6 – shows the 3 random sequences generated
Question 2.1
seq1 – GGACATTCGCGCGTACGCGGACTCA
seq2 – ACAGGCCTCTCAACCTGCTATCATA
seq3 – GGTGGCCTGAAGAAGCGGTTCTTAT
Figure 7 – Shows the blastn search for all 3 sequences.
Figure 8 – Shows the sequence results for seq 1
Figure 9 – shows the sequence results for Seq 2
Figure 10 – shows the sequence results for Seq 3
Question 2.2
Do you find any sequences that look like your input sequence?
What is the typical length of the hits (the alignment length)?
- 17 to 22 base pairs
What is the typical % identity?
- For all 3 sequences, around 90 to 100%
In what range is the bit-scores (“max score”)?
- For all 3 sequences, it is from the range of 35 to 39
What is the range of the E-values?
- Overall, it is around 9 to 35.
Question 2.3
What is the biological significance of the hits you found / is there any biological meaning?
- No. The query sequences are randomly generated, so they have no evolutionary relationship with the database sequences they hit. The hits are chance matches to sequences that happen to exist in the database, as their high E-values indicate.
Shows 3 more sequences generated
>seq1
DMNHNWNYFSKKENLRMTPFNRVNW
>seq2
NDSLRGCWCMKDVVWIISVRKHMFH
>seq3
FINNSMMLRVEDQHGFNCQFTMHGK
Figure 11 – shows the blastp search for the 3 sequences above
Question 2.5
What is the typical length of the alignment and do they contain gaps?
- Typically, from 5 to 20 and there weren’t many gaps found (rare).
What is the range of E-values?
- From 100 to 1000
Try to inspect a few of the alignments in details (“+” means similar sequences) – do you find any that look plausible, if we for a moment ignore the length/E-value?
- It is possible. The alignment shown below has 88% positives and 53% identity. However, it is too short, and its E-value too high, to be significant.
If we had used the default E-value cut-off of 10 would any hits have been found?
- No. I tried a cut-off of 10 and obtained no results.
Question 2.6
If we compare the result from BLAST’ing random DNA sequences to random Peptide sequences – which kind of search has the higher risk of returning false positives (results that appear plausible, maybe even significant, but are truly unrelated)?
- Based on the results above, the risk of getting a false positive is higher when working with DNA sequences. The random DNA queries produced hits with lower E-values (about 9 to 35) than the random peptide queries (100 to 1000), so an unrelated DNA hit is more likely to look plausible.
Part 3
Figure 12 – shows the blastn of a full-length transcript (mRNA) from a prokaryote
Question 3.1
Do we get any significant hits?
- Yes. More than 19 sequences had an E-value of 0.0. Furthermore, the first hit in the list has a query coverage of 100% and an identity of 100%.
What kind of genes (function) do we find?
- Based on the first hit which is shown below, the genes found are alkaline serine proteases from the genus Bacillus.
Figure 13 – Blastp search of the Virtual Ribosome translation
Question 3.2
Report your translated protein sequence in FASTA format.
Do we find any conserved protein domains? (Indicated at the very top of the result page, and during the search). Identifying known protein domains can provide important clues to the function of an unknown protein.
- Based on the results above, there are no conserved domains.
Do we find any significant hits? (E-value?)
- It can be said that it is a significant hit as the e-value is 0.002.
Are all the best hits the same category of enzymes?
- Yes, both are hypothetical proteins.
From what you have seen, what is best for identifying intermediate quality hits – DNA or Protein BLAST?
- High-quality hits can be identified with both blastn and blastp, as both provide significant information. For intermediate-quality hits, which tend to come from more distantly related sequences, blastp is more useful, because protein sequences stay recognisably similar over greater evolutionary distances than DNA sequences.
Question 3.3
- Cleaning the sequence
Genbank to FASTA
>UWGCG file capture
AACGGGCACGGGACGCATGTAGCTGGAACAGTGGCAGCCGTAAA
TAATAATGGTATCGGAGTTGCCGGGGTTGCAGGAGGAAACGGCTCTACCAATAGTGGAGC
AAGGTTAATGTCCACACAAATTTTTAATAGTGATGGGGATTATACAAATAGCGAAACTCT
TGTGTACAGAGCCATTGTTTATGGTGCAGATAACGGAGCTGTGATCTCGCAAAATAGCTG
GGGTAGTCAGTCTCTGACTATTAAGGAGTTGCAGAAAGCTGCGATCGACTATTTCATTGA
TTATGCAGGAATGGACGAAACAGGAGAAATACAGACAGGCCCTATGAGGGGAGGTATATT
TATAGCTGCCGCCGGAAACGATAACGTTTCCACTCCAAATATGCCTTCAGCTTATGAACG
GGTTTTAGCTGTGGCCTCAATGGGACCAGATTTTACTAAGGCAAGCTATAGCACTTTTGG
AACATGGACTGATATTACTGCTCCTGGCGGAGATATTGACAAATTTGATTTGTCAGAATA
CGGAGTTCTCAGCACTTATGCCGATAATTATTATGCTTATGGAGAGGGAACATCCATGGC
TTGTCCACATGTCGCCGGCGCCGCC
- Thinking about the task
Based on the information given: is the sequence protein-coding?
- Yes. The PCR primers used to amplify the fragment target the core of a group of known enzymes, so the fragment should lie within a protein-coding region.
If it is, can you trust it will contain both a START and STOP codon?
- No. The primers target the middle of the gene, so the amplified fragment will not include the start or stop codon.
Do we know if the sequence is sense or anti-sense?
- No. Had the cloned sequence been an RNA transcript, its orientation would be known. Since it is double-stranded DNA, the reported strand cannot be declared either sense or anti-sense.
- Performing the database search.
What tool(s) and database(s) will be relevant to use?
- The tools used are blastn, blastp and Virtual Ribosome (to translate the DNA into protein).
Document the results from the different BLAST searches – what works and what does not work?
Blastn is used first where all parameters are set to default. The tool has been set to use the NR database. The image below shows the graphic summary of the blast above.
The DNA sequence, converted to FASTA format, was submitted to Virtual Ribosome to obtain the translated protein sequence.
This is then converted to FASTA format
The blastp will use this sequence and the conditions are the same as for blastn.
The first hit seems to be an uncharacterised protein. Since it wouldn’t help in identifying the clone12, the second best hit will be selected.
The picture above shows the alignment with the S8 family serine peptidase. Even though it is not a perfect hit, the result is significant enough to be considered: it has an identity of 56.54% and 66% positives. Therefore, it can be concluded that CLONE12 belongs to the S8 family of serine peptidases.
Conclusion
In this exercise, several functions of the BLAST tool at NCBI were explored. BLAST was used to find homologous genes and protein products, including an unknown enzyme, and to identify sequences and their best hits. Terms such as E-value, alignment score and identifier (accession) were explored and their meanings understood. Both BLAST types, blastn and blastp, were used, and the results were analysed through the graphic summary, the description table and the alignments. In addition, the student learned how to convert GenBank format to FASTA and how to convert Virtual Ribosome output to FASTA.
References
- DTU Bioinformatics. (n.d.). SeqGen 1.0 Server. SeqGen 1.0 Server. Retrieved September 28, 2021, from http://www.cbs.dtu.dk/biotools/SeqGen-1.0/
- NCBI. (2007). NCBI Bookshelf. National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/books/NBK1734/
- NCBI. (2021). Basic Local Alignment Search Tool. National Center for Biotechnology Information. https://blast.ncbi.nlm.nih.gov/Blast.cgi
- Rasmus Wernersson, raz@cbs.dtu.dk. (n.d.). Virtual Ribosome – query result. Rasmus Wernersson. Virtual Ribosome – a Comprehensive DNA Translation Tool with Support for Integration of Sequence Feature Annotation. Retrieved September 28, 2021, from http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi?jobid=6152BA3600004A8EC67DB2EB&wait=20
- SMS Format Conversion. (2004). GenBank to FASTA. Sequence Manipulation Suite: https://www.bioinformatics.org/sms2/genbank_fasta.html
- Wernersson, R., & Nielsen., H. (n.d.). Exercise: BLAST – teaching. Exercise: BLAST – Teaching. Retrieved September 28, 2021, from https://teaching.healthtech.dtu.dk/teaching/index.php/Exercise:_BLAST