Exploring NCBI and GenBank: A Comprehensive Guide to Bioinformatics

March 16, 2024, by admin

Introduction to NCBI, GenBank, and Bioinformatics

Overview of NCBI: National Center for Biotechnology Information

The National Center for Biotechnology Information (NCBI) is a part of the United States National Library of Medicine (NLM), which is itself a branch of the National Institutes of Health (NIH). NCBI plays a crucial role in bioinformatics and computational biology research by providing access to a vast array of biological information and tools. Here’s an overview of its key components and services:

GenBank: NCBI hosts GenBank, a comprehensive database of nucleotide sequences from many different organisms. GenBank records hold DNA and RNA sequences, the protein translations of annotated coding regions, and associated metadata.
PubMed: NCBI provides access to PubMed, a database of biomedical literature, including research articles, reviews, and other publications. PubMed is a valuable resource for researchers seeking information on specific topics in the life sciences.
BLAST: The Basic Local Alignment Search Tool (BLAST) is a widely used tool for comparing biological sequences. NCBI provides a web interface for BLAST, allowing users to search for similarities between a query sequence and sequences in the NCBI databases.
Entrez: Entrez is a powerful search engine that allows users to search multiple NCBI databases simultaneously. Entrez provides access to a wide range of biological information, including sequences, structures, and literature.
Toolbox: NCBI offers a variety of bioinformatics tools and resources, including sequence analysis tools, genome browsers, and databases for specific types of data (e.g., protein structures, gene expression data).
Genome Database Resources: NCBI provides access to a range of genome databases, including the Reference Sequence (RefSeq) database, which provides curated, annotated sequences for well-studied organisms.
Data Submission: NCBI allows researchers to submit their own data to its databases, including sequence data, gene expression data, and more. This allows researchers to contribute to the growing body of biological knowledge.
Education and Training: NCBI offers a variety of educational resources and training materials to help researchers and students learn how to use its databases and tools effectively.

Overall, NCBI plays a critical role in advancing research in the life sciences by providing access to a wealth of biological information and tools. Its databases and services are widely used by researchers around the world to study genetics, genomics, and related fields.

Introduction to GenBank: history, purpose, and scope

GenBank is a comprehensive database of genetic sequences that is maintained by the National Center for Biotechnology Information (NCBI), a part of the United States National Library of Medicine (NLM), which is itself a branch of the National Institutes of Health (NIH). Here’s an introduction to GenBank, covering its history, purpose, and scope:

History: GenBank was established in 1982 as a public database of DNA sequences generated by researchers around the world. It was originally created as a collaboration between the Los Alamos National Laboratory and the National Institutes of Health. Over the years, GenBank has grown significantly in size and scope, becoming a crucial resource for researchers in the fields of genetics, genomics, and bioinformatics.

Purpose: The primary purpose of GenBank is to provide a freely accessible repository of genetic sequences that can be used by researchers for a variety of purposes, including:

Sequence Analysis: Researchers use GenBank to compare new DNA sequences with existing sequences to identify similarities and differences, which can provide insights into the function and evolution of genes.
Gene Discovery: GenBank contains a vast array of genetic sequences from many different organisms, including humans, animals, plants, and microbes. Researchers can use GenBank to discover new genes and genetic variants associated with various traits and diseases.
Evolutionary Studies: By comparing genetic sequences from different organisms, researchers can study the evolutionary relationships between species and trace the history of genetic changes over time.
Biomedical Research: GenBank is used in biomedical research to study the genetic basis of diseases, develop new diagnostic tools, and identify potential drug targets.

Scope: GenBank contains a wide variety of genetic sequences, including:

DNA Sequences: GenBank contains sequences of DNA molecules, including protein-coding regions, non-coding regions (such as introns and untranslated regions), and regulatory elements (such as promoters).
RNA Sequences: GenBank also includes sequences of RNA molecules, including messenger RNA (mRNA), ribosomal RNA (rRNA), and transfer RNA (tRNA).
Organismal Diversity: GenBank contains sequences from a wide range of organisms, including bacteria, viruses, plants, animals, and fungi. This diversity of sequences allows researchers to study genetic variation across different species.
Annotation and Metadata: GenBank provides annotations and metadata for each sequence, including information about the organism from which the sequence was derived, the sequence’s length, and any known functions or features.

In summary, GenBank is a vital resource for researchers in the life sciences, providing access to a vast collection of genetic sequences that are used to advance our understanding of genetics, genomics, and related fields.

Basics of bioinformatics: data types, formats, and analysis methods

Bioinformatics is an interdisciplinary field that combines biology, computer science, mathematics, and statistics to analyze and interpret biological data. Here are the basics of bioinformatics, including data types, formats, and analysis methods:

Data Types:

Sequence Data: This includes DNA sequences, RNA sequences, and protein sequences. Sequences are represented by strings of letters, with each letter corresponding to a different nucleotide (A, T, C, G for DNA; A, U, C, G for RNA) or amino acid (e.g., A, R, N for protein).
Structural Data: This includes information about the three-dimensional structure of biological molecules, such as proteins and nucleic acids. Structural data is represented by coordinates that describe the positions of atoms in the molecule.
Expression Data: This includes data on gene expression levels, which can be measured using techniques such as microarrays or RNA sequencing. Expression data is typically represented as numerical values indicating the abundance of transcripts or proteins.
Functional Data: This includes information about the functions and interactions of biological molecules, such as protein-protein interactions, metabolic pathways, and gene regulatory networks.
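To make the idea of sequences as strings of letters concrete, here is a minimal Python sketch. The function names (is_valid, transcribe) are hypothetical helpers written for this illustration and are not part of any NCBI tool; the sketch only handles the four plain nucleotide letters and ignores ambiguity codes such as N.

```python
# Alphabets for the plain (unambiguous) nucleotide letters described above.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")


def is_valid(sequence: str, alphabet: set) -> bool:
    """Return True if every letter of the sequence belongs to the alphabet."""
    return all(letter in alphabet for letter in sequence.upper())


def transcribe(dna: str) -> str:
    """Rewrite a DNA string in the RNA alphabet by replacing T with U."""
    if not is_valid(dna, DNA_ALPHABET):
        raise ValueError("expected a plain A/C/G/T DNA sequence")
    return dna.upper().replace("T", "U")


rna = transcribe("ATGGCCATT")
print(rna)                          # AUGGCCAUU
print(is_valid(rna, RNA_ALPHABET))  # True
```

Real-world sequences often contain ambiguity codes, so a production check would need a larger alphabet than the one assumed here.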

Data Formats:

FASTA: A simple text-based format for representing sequences, with each sequence starting with a header line that begins with a “>” symbol, followed by the sequence itself.
FASTQ: A format used to store both sequencing reads and their quality scores. It consists of four lines for each read: a header line beginning with “@”, the sequence, a separator line beginning with “+”, and the quality scores (one character per base).
SAM/BAM: The Sequence Alignment/Map (SAM) format is used to store sequence alignment data, including aligned reads and their mapping positions. The Binary Alignment/Map (BAM) format is a binary version of SAM that is more compact and efficient for large datasets.
PDB: The Protein Data Bank (PDB) format is used to store information about the three-dimensional structures of proteins and other biological macromolecules. It includes coordinates for atoms, as well as metadata about the structure.
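As an illustration of how simple the FASTA format is, the following Python sketch reads FASTA text into a dictionary. The function name parse_fasta and the example records are made up for this illustration; the sketch assumes each identifier is the first word after “>” and that identifiers are unique.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each record identifier (first word after '>') to its full sequence."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[current_id] += line
    return records


example = """>seq1 hypothetical example record
ATCGATCG
ATCG
>seq2
GGCC
"""
print(parse_fasta(example.splitlines()))
# {'seq1': 'ATCGATCGATCG', 'seq2': 'GGCC'}
```

For real projects an established library parser is usually a better choice than a hand-written one; the sketch is only meant to show the structure of the format.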

Analysis Methods:

Sequence Alignment: This involves comparing two or more sequences to identify similarities and differences. Alignment algorithms, such as BLAST (Basic Local Alignment Search Tool), are commonly used for this purpose.
Genome Assembly: This involves reconstructing the complete genome sequence of an organism from short sequencing reads. Assembly algorithms use overlap information between reads to build longer contiguous sequences (contigs) and ultimately reconstruct the entire genome.
Gene Prediction: This involves identifying the locations of genes within a genome. Gene prediction algorithms use statistical models and sequence similarity to predict the presence of protein-coding genes and other functional elements.
Phylogenetic Analysis: This involves reconstructing the evolutionary relationships between organisms based on their genetic sequences. Phylogenetic analysis methods use sequence alignment data to construct phylogenetic trees that depict the evolutionary history of the organisms.
Structural Bioinformatics: This involves analyzing and predicting the three-dimensional structures of biological molecules. Structural bioinformatics methods use computational modeling techniques to predict protein structures and study their functions.

These are just a few examples of the data types, formats, and analysis methods used in bioinformatics. The field is vast and constantly evolving, with new technologies and techniques being developed to analyze and interpret biological data.

Navigating NCBI and GenBank

Accessing NCBI resources: website, databases, and tools

Accessing resources from the National Center for Biotechnology Information (NCBI) can be done through its website, which provides access to various databases and tools. Here’s a brief overview of how to access NCBI resources:

Website: The NCBI website (https://www.ncbi.nlm.nih.gov/) serves as the primary gateway to NCBI’s resources. From the website, you can access various databases, tools, and educational resources related to biotechnology and life sciences.

Databases and search services:

PubMed: PubMed is a database of biomedical literature, including research articles, reviews, and other publications. You can search PubMed to find articles on specific topics in the life sciences.
GenBank: GenBank is a database of genetic sequences. You can search GenBank through the Nucleotide database to find DNA and RNA sequences from a wide range of organisms; protein sequences are searched through the separate Protein database.
BLAST: The Basic Local Alignment Search Tool (BLAST) is a tool for comparing biological sequences. You can use BLAST to search for similarities between your sequence of interest and sequences in the NCBI databases.
Entrez: Entrez is a search engine that allows you to search multiple NCBI databases simultaneously. You can use Entrez to search for sequences, structures, and other biological information.

Tools:

BLAST: The BLAST tool is available on the NCBI website for sequence comparison. You can use BLAST to search for similar sequences in the NCBI databases.
Genome Data Viewer: The NCBI Genome Data Viewer (GDV) is a genome browser that allows you to visualize and explore genomic data from various organisms. You can use it to view gene annotations, sequence variations, and other genomic features.
COBALT: COBALT is a tool for multiple sequence alignment. You can use COBALT to align multiple protein sequences and identify conserved regions.
Primer-BLAST: Primer-BLAST is a tool for designing PCR primers. You can use Primer-BLAST to design primers for amplifying specific DNA sequences.

These are just a few examples of the databases and tools available from NCBI. The NCBI website provides access to a wide range of resources for researchers, educators, and students in the life sciences.

Understanding GenBank records: metadata, sequence data, annotations

GenBank records contain important information about genetic sequences, including metadata, sequence data, and annotations. Here’s a breakdown of each component:

Metadata: Metadata in a GenBank record provides information about the sequence, such as its source, organism, sequence length, and other relevant details. It typically includes:
- LOCUS: This line provides information about the sequence, including its length, molecule type (e.g., DNA, RNA), and other key details.
- DEFINITION: A brief description of the sequence.
- ACCESSION: The unique accession number assigned to the sequence in the GenBank database.
- VERSION: The accession number followed by a version suffix (e.g., U49845.1). The suffix increases whenever the sequence itself is updated.
- KEYWORDS: Keywords describing the sequence, which are used for searching and categorization.
- SOURCE: Information about the organism from which the sequence was derived, including the organism’s scientific name and taxonomy.
- REFERENCE: References to the literature describing the sequence, including authors, title, journal, and publication date.
Sequence Data: The sequence data in a GenBank record is the nucleotide sequence itself, written as a string of lowercase letters (a, t, c, g); sequences derived from RNA are also written with t rather than u. Protein sequences appear as translations attached to coding-sequence features.
- ORIGIN: This section provides the actual sequence data in a readable format, with line breaks and numbering to indicate the sequence positions.
Annotations: Annotations in a GenBank record provide additional information about the sequence, such as the locations of genes, coding regions, and other features. Annotations are presented in a structured format:
- FEATURES: This section describes the features of the sequence, such as genes, coding sequences, regulatory regions, and other functional elements. Each feature is annotated with its location (start and end positions), type, and qualifiers (additional information).

Overall, GenBank records are structured documents that contain a wealth of information about genetic sequences. Researchers use GenBank records to study the structure, function, and evolution of genes and genomes.
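Feature locations are easiest to understand once they are turned into numbers. The sketch below, in Python, extracts the start and end positions from a location string. The function names and the example location are hypothetical; the sketch assumes plain “start..end” ranges (optionally wrapped in join(...)) with 1-based, inclusive coordinates, and it ignores strand information such as complement(...).

```python
import re
from typing import List, Tuple


def parse_location(location: str) -> List[Tuple[int, int]]:
    """Return the (start, end) pairs found in a feature location string."""
    pairs = re.findall(r"(\d+)\.\.(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def spliced_length(segments: List[Tuple[int, int]]) -> int:
    """Total number of bases covered; coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in segments)


segments = parse_location("join(10..50,80..120)")
print(segments)                  # [(10, 50), (80, 120)]
print(spliced_length(segments))  # 82
```

Because both ends of a range are included, a segment written as 10..50 covers 41 bases, not 40.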

GenBank format and FASTA format: understanding sequence data formats

GenBank format and FASTA format are two common formats used for representing biological sequence data, such as DNA, RNA, and protein sequences. Here’s a comparison of the two formats:

GenBank Format:

Structure: GenBank format is a structured format that includes metadata, sequence data, and annotations.
Metadata: GenBank format includes metadata such as LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, and REFERENCE, providing detailed information about the sequence.
Sequence Data: GenBank format includes the actual sequence data, typically in a readable format with line breaks and numbering to indicate the sequence positions.
Annotations: GenBank format includes annotations that describe features of the sequence, such as genes, coding regions, and other functional elements. Annotations are structured and include information about the location, type, and qualifiers of each feature.

Example of a GenBank Format Record:

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
...
ORIGIN
        1 gatcctggaa atgagaaaca caa...

FASTA Format:

Structure: FASTA format is a simple, text-based format that includes a header line starting with “>” followed by a description of the sequence, and the sequence data itself.
Metadata: FASTA format includes a header line that provides a brief description of the sequence, often including the sequence identifier and other relevant information.
Sequence Data: FASTA format includes the actual sequence data, represented as a string of letters (A, T, C, G for DNA; A, U, C, G for RNA; and amino acid codes for proteins).
Annotations: FASTA format does not include structured annotations. Additional information about the sequence is usually provided in the header line, but detailed annotations are not included.

Example of a FASTA Format Record:

shell

>sequence_name
 ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
 GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
 TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
 GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
 TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
 GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
 TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
 GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG

In summary, GenBank format is more structured and includes detailed metadata, sequence data, and annotations, making it suitable for storing and sharing comprehensive biological sequence information. FASTA format, on the other hand, is simpler and more lightweight, making it easier to work with for certain applications, such as sequence searches and alignments.
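The relationship between the two formats can be sketched in Python by converting a small GenBank-style record to FASTA. The function name genbank_to_fasta and the record below are made up for this illustration (it is not a real GenBank entry). The sketch assumes the usual flat-file layout, with keywords in the first 12 columns and the sequence after the ORIGIN line, and it keeps only the accession, the definition, and the sequence.

```python
def genbank_to_fasta(record_text: str) -> str:
    """Build a FASTA record from the ACCESSION, DEFINITION and ORIGIN sections."""
    accession = ""
    definition = []
    sequence = []
    section = ""
    for line in record_text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if section == "ORIGIN":
            # Keep the letters only; drop position numbers and spaces.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
            continue
        keyword = line[:12].strip()  # empty on continuation lines
        value = line[12:].strip()
        if keyword:
            section = keyword
        if section == "ACCESSION" and keyword and value:
            accession = value.split()[0]
        elif section == "DEFINITION":
            definition.append(value)
    header = f">{accession} {' '.join(definition)}".rstrip()
    residues = "".join(sequence).upper()
    wrapped = "\n".join(residues[i:i + 60] for i in range(0, len(residues), 60))
    return f"{header}\n{wrapped}\n"


record = """\
LOCUS       EXAMPLE1                  24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example record, not a real
            GenBank entry.
ACCESSION   EXAMPLE1
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""
print(genbank_to_fasta(record))
```

Note what is lost in the conversion: the FEATURES table and the references do not survive, which is exactly the difference between the two formats described above.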

Searching GenBank

Basic search techniques: keyword search, organism search, sequence ID search

Basic search techniques in bioinformatics allow researchers to retrieve specific information from databases like GenBank. Here are three common search techniques:

Keyword Search: This technique involves searching for sequences or records containing specific keywords. For example, you could search for sequences related to “insulin” or “E. coli” to find sequences associated with those terms.
Organism Search: This technique involves searching for sequences from a specific organism. For example, you could search for sequences from “Homo sapiens” to find sequences from humans.
Sequence ID Search: This technique involves searching for sequences using their unique identifiers. For example, you could search for a specific GenBank accession number to retrieve the sequence associated with that ID.

Each of these search techniques can be used alone or in combination to retrieve specific sequences or records from databases like GenBank.

Advanced search techniques: using filters, search fields, and Boolean operators

Advanced search techniques in bioinformatics allow researchers to perform more precise and complex searches in databases like GenBank. Here are some advanced techniques:

Filters: Filters allow users to refine their search results based on specific criteria, such as sequence length, publication date, or organism. For example, you could filter your search results to only include sequences longer than 1000 base pairs.
Search Fields: Search fields allow users to specify where their search terms should appear in the database records. Common search fields for sequence records include title, organism, author, and accession. For example, you could search for “insulin[Title]” to find records with “insulin” in the title.
Boolean Operators: Boolean operators (AND, OR, NOT) allow users to combine search terms to broaden or narrow their search results. For example, you could use OR to find records that contain either “insulin” or “glucose,” or you could use NOT to exclude records containing a certain term.
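The way search fields and Boolean operators combine can be sketched in Python by building a query string. The helper names (field, all_of, exclude) are hypothetical. The field tags used here ([Title], [Organism], and [SLEN] for a sequence-length range) are given as examples of Entrez field tags; check the Advanced Search page for the tags offered by the database you are searching.

```python
def field(term: str, tag: str) -> str:
    """Attach a field tag to a term, quoting terms that contain spaces."""
    quoted = f'"{term}"' if " " in term else term
    return f"{quoted}[{tag}]"


def all_of(*clauses: str) -> str:
    """Require every clause (Boolean AND)."""
    return "(" + " AND ".join(clauses) + ")"


def exclude(query: str, clause: str) -> str:
    """Remove records matching the clause (Boolean NOT)."""
    return f"{query} NOT {clause}"


query = exclude(
    all_of(
        field("insulin", "Title"),
        field("Homo sapiens", "Organism"),
        "1000:5000[SLEN]",  # sequence length between 1000 and 5000
    ),
    field("patent", "Title"),
)
print(query)
# (insulin[Title] AND "Homo sapiens"[Organism] AND 1000:5000[SLEN]) NOT patent[Title]
```

The resulting string can be pasted into the search box as it is.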

Using NCBI’s Entrez system for efficient data retrieval

NCBI’s Entrez system is a powerful tool for searching and retrieving biological data from various NCBI databases, including GenBank. Here’s how you can use Entrez for efficient data retrieval:

Access Entrez: Go to the NCBI website and navigate to the Entrez database you want to search (e.g., Nucleotide for GenBank sequences). You can also access Entrez directly at https://www.ncbi.nlm.nih.gov/gquery/.
Perform a Basic Search: Use the search bar to enter your query. You can search by keywords, accession numbers, organism names, or other relevant terms. Click “Search” to view the search results.
Refine Your Search: Use the filters on the left side of the search results page to refine your search. You can filter by organism, sequence length, publication date, and other criteria to narrow down your results.
View and Select Results: Review the search results to find the sequences or records you’re interested in. Click on a result to view more details about it.
Retrieve Data: To retrieve the sequence data or record, click on the “Send to” button near the top of the page and select the format in which you want to download the data (e.g., FASTA format for sequences).
Use Advanced Search: For more complex searches, use the Advanced Search feature. This allows you to use Boolean operators, search fields, and other advanced techniques to refine your search.
Save Searches: If you regularly perform similar searches, you can save your search query for future use. This can save you time when you need to retrieve the same type of data again.
Explore Related Information: Entrez provides links to related information, such as other sequences from the same organism or similar sequences from related organisms. This can help you explore additional data relevant to your research.

By using Entrez’s advanced search features and filters, you can efficiently retrieve the biological data you need from NCBI’s databases.
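The same kind of search can also be run from a script through NCBI's E-utilities, the programmatic interface to Entrez, which this guide does not otherwise cover. The Python sketch below only builds the search URL and does not contact NCBI. The function name esearch_url is hypothetical, and the parameter choices (db, term, retmax, retmode) are one reasonable configuration rather than the only one.

```python
from urllib.parse import urlencode

# Base address of the E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, db: str = "nucleotide", retmax: int = 20) -> str:
    """Build an ESearch URL that returns matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


print(esearch_url("insulin"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=insulin&retmax=20&retmode=json
```

When sending many requests, follow NCBI's usage guidelines for E-utilities (for example, limits on request frequency).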

Sequence Retrieval and Analysis

Retrieving sequences: by accession number, keyword, organism, etc.

In NCBI’s Entrez system, you can retrieve sequences from databases like GenBank using various search criteria. Here’s how you can retrieve sequences by accession number, keyword, organism, and other methods:

By Accession Number: If you know the accession number of the sequence you’re looking for, you can search directly using the accession number. Simply enter the accession number into the search bar and press Enter. This will retrieve the sequence associated with that accession number.
By Keyword: You can search for sequences containing specific keywords by entering the keywords into the search bar. For example, you could search for sequences related to “insulin” by entering “insulin” into the search bar. This will retrieve sequences that contain the keyword “insulin” in their metadata or annotations.
By Organism: You can search for sequences from a specific organism by entering the organism’s name into the search bar. For example, you could search for sequences from “Homo sapiens” by entering “Homo sapiens” into the search bar. This will retrieve sequences associated with that organism.
By Advanced Search: Use the Advanced Search feature to perform more complex searches using Boolean operators, search fields, and filters. This allows you to refine your search based on multiple criteria, such as keywords, organism, and sequence features.
By Database: You can specify the database you want to search (e.g., Nucleotide for nucleotide sequences, Protein for protein sequences) to retrieve sequences from a specific database.
By Related Information: You can also retrieve sequences related to a specific sequence or record by using the “Related Information” feature. This allows you to explore sequences that are similar or related to a sequence of interest.

By using these methods, you can retrieve sequences from NCBI’s databases based on various search criteria, making it easier to find the sequences you need for your research.
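Retrieval by accession number can be scripted as well. The following Python sketch uses the EFetch service of NCBI's E-utilities; the function names efetch_url and fetch_record are hypothetical. Defining fetch_record does not contact NCBI; a request is only made when the function is called.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession: str, rettype: str = "fasta", db: str = "nucleotide") -> str:
    """Build an EFetch URL; rettype 'fasta' or 'gb' selects the record format."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_record(accession: str, rettype: str = "fasta") -> str:
    """Download one record as text (requires network access)."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


print(efetch_url("AB001981", rettype="gb"))
# https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AB001981&rettype=gb&retmode=text
```

Calling fetch_record("AB001981", "gb") would return the same GenBank text that the website shows for that accession.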

Sequence alignment: using BLAST for sequence similarity search

Sequence alignment is a fundamental task in bioinformatics that involves comparing two or more sequences to identify similarities and differences. One of the most widely used tools for sequence alignment is BLAST (Basic Local Alignment Search Tool), which allows you to search for similar sequences in a database. Here’s how you can use BLAST for sequence similarity search:

Access BLAST: Go to the NCBI BLAST website (https://blast.ncbi.nlm.nih.gov/) to access the BLAST search tool.
Choose a BLAST Program: Select the appropriate BLAST program based on the type of sequences you’re working with. The most common options are:
- Nucleotide BLAST (blastn): For comparing nucleotide sequences (e.g., DNA sequences).
- Protein BLAST (blastp): For comparing protein sequences.
- BLASTX: For comparing a nucleotide query, translated in all six reading frames, to a protein sequence database.
- TBLASTN: For comparing a protein query to a nucleotide sequence database translated in all reading frames.
- TBLASTX: For comparing the six-frame translations of a nucleotide query to the six-frame translations of a nucleotide sequence database.
Enter Query Sequence: Paste your query sequence into the “Enter Query Sequence” box. You can also upload a file containing your sequence.
Select a Database: Choose the database against which you want to search. For protein searches the usual default is “nr” (non-redundant protein sequences); nucleotide searches use a nucleotide collection such as “nt”. Other specialized databases are also available.
Adjust Parameters: Optionally, you can adjust the search parameters, such as the scoring matrix, gap penalties, and expectation value (E-value) threshold, to customize the search settings.
Run BLAST: Click the “BLAST” button to run the search. BLAST will compare your query sequence against the selected database and return a list of similar sequences, along with alignment scores and statistical significance.
Analyze Results: Review the BLAST results to identify sequences that are similar to your query sequence. The results will include information about the alignment, such as the alignment score, E-value, and percent identity.
Refine Search: If necessary, you can refine your search by adjusting the search parameters and running BLAST again to further explore the similarity between your query sequence and the database sequences.

By using BLAST, you can quickly and efficiently identify sequences in a database that are similar to your query sequence, which can provide valuable insights into the function and evolutionary relationships of the sequences.
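Two of the steps above can be expressed compactly in Python: picking a BLAST program from the query and database types, and computing the percent identity reported in the results. This is only a sketch; the function names are hypothetical, and percent identity is computed here as identical positions divided by alignment length (gap columns included).

```python
# (query type, database type) -> BLAST program, following the list above.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_program(query_type: str, database_type: str, translate_both: bool = False) -> str:
    """Pick a BLAST program; translate_both selects tblastx for two nucleotide inputs."""
    if translate_both:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Identical, non-gap positions as a percentage of the alignment length."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and equally long")
    matches = sum(q == s and q != "-" for q, s in zip(aligned_query, aligned_subject))
    return 100.0 * matches / len(aligned_query)


print(choose_program("nucleotide", "protein"))                 # blastx
print(round(percent_identity("ACGT-ACGT", "ACGTTACGA"), 1))    # 77.8
```

In the example alignment, 7 of the 9 columns are identical, which gives the 77.8% shown.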

Multiple sequence alignment: tools and methods for alignment analysis

Multiple sequence alignment (MSA) is a technique used to align three or more sequences to identify regions of similarity that may indicate functional, structural, or evolutionary relationships between the sequences. Commonly used MSA tools include Clustal Omega, MUSCLE, MAFFT, and NCBI’s COBALT, which is available on the NCBI website for aligning protein sequences.
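Once an alignment has been produced, regions of similarity can be located column by column. The Python sketch below finds fully conserved columns in an existing alignment. It does not perform the alignment itself; the function name and the three short protein sequences are made up for this illustration.

```python
from typing import List


def conserved_columns(alignment: List[str]) -> List[int]:
    """Return 0-based indices of columns where all rows share one non-gap residue."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        index
        for index, column in enumerate(zip(*alignment))
        if len(set(column)) == 1 and column[0] != "-"
    ]


alignment = [
    "MKT-LLV",
    "MKTALLI",
    "MRT-LLV",
]
print(conserved_columns(alignment))  # [0, 2, 4, 5]
```

Columns containing a gap or more than one residue are treated as not conserved; real tools usually apply softer scoring that also rewards similar residues.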

Practical Applications and Case Studies

Introduction

This exercise has two main goals:

Introduction to the types of DNA data contained in the GenBank database (data format, visualization, cross-database links, how biological “features” such as genes are annotated and described as coordinates in the DNA sequence).
Practice searching the online version of GenBank hosted at the NCBI. Since the number of sequences in GenBank is HUGE, it’s critically important to be able to search and filter the information. Filtering out the unwanted sequences in particular can be a challenge, as we shall see.

Where to find GenBank

The GenBank database is hosted at NCBI (National Center for Biotechnology Information, USA) (Link: http://www.ncbi.nlm.nih.gov/ ). Besides the main GenBank database, NCBI also hosts a number of other biological databases (for example whole-genome databases for human, mouse, chimp etc.). In this particular exercise we will concentrate on the classical “GenBank” database.

Using the “Entrez” database browser

ALL the NCBI databases can be queried through a common search interface named Entrez. On nearly all NCBI webpages a search box can be found in the upper part of the page, allowing easy access for searching the individual databases (or searching across all databases). Click on the following link to open a new browser window with Entrez, where the focus is pre-set to search in the GenBank database:

http://www.ncbi.nlm.nih.gov/nucleotide

(Alternatively go to the main NCBI webpage and choose “Nucleotide” as the database).

Part 1: Concerning the DATA in GenBank

This part of the exercise is about the types of data hosted in GenBank.

Searching for a specific ID

The typical reasons for searching for a specific ID in GenBank are looking up information from the literature (e.g. a gene found in a study), following up on information from other databases, investigating lists of interesting genes, etc. In this part of the exercise we will be working with a set of alpha-globin genes.

Search for AB001981 – by default the result is shown in the GenBank format.

QUESTION 1.1:

1. How many genes are contained in this entry?
2. From which organism does the DNA originate?
3. What kind of information is contained within the HEADER and within the FEATURE block?

PubMed links

Notice that the publication from which the DNA sequence originates is cited (and linked via a PubMed ID) within the header. Sometimes multiple publications related to the same gene are listed. This is of great importance since it makes it possible to trace the source(s) of the DNA sequence and investigate whether the experiments carried out are to be trusted.

This can be of real importance if something seems “wrong” with the sequence (for example if this particular gene exhibits a really strange intron/exon structure compared to other closely related genes, or if it simply doesn’t match ANY other known genes of the same family). By investigating the original publication it’s possible to double-check the experimental procedure. It may be that the article correctly states the gene to be of type XXX, but when the data were submitted the gene was accidentally annotated as YYY (it is the original researchers’ responsibility to double-check this). There can also be more serious problems with the experiments, ranging from bad/wrong PCR primers to contamination with DNA from a different species during a cloning step.

NEVER FORGET: biological data CAN be wrong.

Investigate the PubMed link(s):

Follow the PubMed link from the sequence entry.

Observe that it is always possible to read the ABSTRACT of the publication in PubMed, even if access to the publication requires subscription. For most (new) publications there will also be a direct link to the publication itself.

Return to the sequence entry once again (or perform the search again if you closed the window).

GenBank vs. FASTA format

View the sequence entry in FASTA format (simply click on “FASTA” in the top part of the page, below the page title). The entire GenBank entry is now shown in FASTA format.

QUESTION 1.2:

What happened to the alpha-globin genes? Can they still be found?
Which part of the GenBank entry has been converted?

Observe that the name of the sequence is based on the name of the GenBank entry.

Go back to GenBank format (Click on “GenBank”)

TASK: Save the GenBank “raw data” on your own computer:

Click on “Send to” in the upper right part of the page.

Choose “Complete Record”, “File” and “GenBank (full)”, then click on “Create File”.

Locate the downloaded file on your own computer.

By default it has a pretty generic name (“sequence.gb”) – rename the file to “AB001981.gb”

Notice: The reason for renaming the file is simply good file management: just by skimming the filenames we can now guess that it’s a GenBank file (“*.gb”) and that it contains the “AB001981” entry.

Open it in Geany.

Notice: What we have now is the “raw” data behind the information shown online, with no fancy HTML formatting and cross-links.

Verify that the contents of the file are as expected by inspecting it in Geany (it should look exactly like the information shown online).

QUESTION 1.3: Does the downloaded file have UNIX or Windows line-endings?
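Besides inspecting the file in an editor, the line endings can be checked with a few lines of Python. This is a sketch only; the function names are hypothetical, and the file must be opened in binary mode so that Python does not translate the line endings while reading.

```python
def classify_line_endings(data: bytes) -> str:
    """Classify raw file content by the kind of line endings it contains."""
    crlf = data.count(b"\r\n")           # Windows-style endings
    lf_only = data.count(b"\n") - crlf   # UNIX-style endings
    if crlf and lf_only:
        return "mixed"
    if crlf:
        return "Windows (CRLF)"
    if lf_only:
        return "UNIX (LF)"
    return "no line endings found"


def detect_line_endings(path: str) -> str:
    """Read a file in binary mode and classify its line endings."""
    with open(path, "rb") as handle:
        return classify_line_endings(handle.read())


print(classify_line_endings(b"LOCUS\nORIGIN\n"))      # UNIX (LF)
print(classify_line_endings(b"LOCUS\r\nORIGIN\r\n"))  # Windows (CRLF)
```

To check the downloaded file, call detect_line_endings with the path to “AB001981.gb”.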

Exploring the genes defined in a GenBank entry

Go back to the GenBank entry in your browser. Click the first “CDS” element (Alpha-D)

CDS = CoDing Sequence: the PROTEIN-CODING part of a gene. Basically: the sequence you get when the CODING exons are concatenated (UTR regions are ignored). A complete CDS starts with a START codon and ends with a STOP codon; a CDS annotated as partial may lack one or both.

Hopefully it’s quite intuitive why some of the sequence is highlighted; otherwise discuss it within the group (or with the instructor). Repeat the same procedure for the other CDS (Alpha-A).

When looking at the FEATURE table, the first line of text in the definition of each CDS is as follows:

join(1104..1192,1306..1510,1614..1742)
join(4915..5009,5165..5369,5474..5602)

QUESTION 1.4: Based on your observations:

What do these numbers mean?
How many coding exons does each gene contain?

View both of the CDSs in FASTA format (click “Send to” in the upper right corner, choose “Coding Sequences” and set format to “FASTA”)

QUESTION 1.5: What do the numbers in the sequence title represent?

Switch to Graphic view (Click on Graphics at the top of the page)

An interactive graphical representation of the GenBank entry will now be shown. The upper part of the visualization shows the entire length of the entry (5,891 bp) with bars representing the individual exons within the two genes.

The zoomed view below can be changed by dragging the transparent box with the blue borders in the overview representation at the top of the page. The zoom level can also be changed.

By “mousing over” the bars additional information about that particular feature will be shown.

The graphical overview is mostly useful for inspecting GenBank entries with multiple genes (some entries have hundreds of embedded genes). Play around with the interface for a few minutes to see what functionality is offered.

Part 2: Searching GenBank

The key issue to keep in mind when searching GenBank is to avoid drowning in huge amounts of irrelevant data. It is therefore of great importance to filter out unwanted information, WITHOUT losing the relevant entries. Today we will work with searching the TEXTUAL annotation of GenBank entries (keywords, free text etc). We will later get back to sequence based searches (BLAST).

In the first part of the exercise we’ll investigate various ways to search using insulin as the example.

Naïve search

Search for GenBank entries containing the term “insulin”

Just do a simple search for INSULIN – don’t put anything else in the search box.

Observe the following:

A large number of entries are found.

Go through a few pages of results and notice that we are offered data from a diverse set of sources: Experimental work, Patent applications, predicted genes, partial genes etc.

QUESTION 2.1.1:

How many search results were returned?
Are they all from Human? If no, give a counterexample. (Would you have expected them to be all human?)
Are they all insulin? If no, give a counterexample.

By default the search term is matched against ALL POSSIBLE fields in the GenBank entries – including almost all text in the HEADER and FEATURE table. It’s even possible to pick up entries where the match is to one of the authors’ names and not a gene name! (Perhaps not an issue for insulin). Luckily it is possible to restrict the search to specific pre-indexed fields in the HEADER and FEATURE table (“Search fields”), which makes the search much more focused.

How the search is interpreted

When you do a naïve search (just write a few terms Google-style), GenBank tries to interpret what you most likely meant, and it has a behind-the-scenes scheme for sorting the results to push the most interesting ones to the top. It is actually possible to see exactly how your search query is interpreted by locating the SEARCH DETAILS box.
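The same interpretation can also be inspected outside the browser. The sketch below assumes NCBI's public E-utilities ESearch endpoint, the "nuccore" database name for nucleotide entries, and the JSON field names "count" and "querytranslation"; check the current E-utilities documentation before relying on any of them. The sample JSON is hand-written for illustration (the count is made up), and the function names are our own.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the NCBI E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nuccore"):
    """Build the ESearch URL for a search term (no network access)."""
    params = {"db": db, "term": term, "retmode": "json"}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def summarize_esearch(json_text):
    """Pull the hit count and the interpreted query out of an ESearch reply."""
    result = json.loads(json_text)["esearchresult"]
    return int(result["count"]), result.get("querytranslation", "")


def run_search(term, db="nuccore"):
    """Send the query to NCBI; this is the only function that needs a network."""
    with urllib.request.urlopen(build_esearch_url(term, db)) as response:
        return summarize_esearch(response.read().decode("utf-8"))


# Hand-written sample reply (hypothetical values), used instead of a live call.
SAMPLE_REPLY = (
    '{"esearchresult": {"count": "42", "idlist": [], '
    '"querytranslation": "example[All Fields]"}}'
)

print(build_esearch_url("human insulin"))
print(summarize_esearch(SAMPLE_REPLY))
```

Calling run_search("insulin") on a machine with internet access should return the number of hits together with the same expanded query that the SEARCH DETAILS box shows.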

QUESTION 2.1.2:

What has your search for “insulin” been expanded into?

Spend a few moments to investigate the HEADER section of the GenBank entry you have all received as a hand-out (X01831) to get an idea of how the data is related to specific sections (e.g. KEYWORDS and ORGANISM which we will use in a moment).

Try to find a search result that appears NOT to be the real insulin gene, and see why it was picked up by the search. If you have trouble finding one in your own result, search for DL142095.1 which came up around page 200 when the exercise was written.

The main issue here is that we find entries where “insulin” is mentioned anywhere in the entry, and sometimes it’s unrelated genes like “Insulin-receptor”, “Insulin inhibitor” etc.

Searching for human insulin

Search for human insulin and see what happens.

QUESTION 2.1.3:

How many search results were returned?
Can you find the human insulin entry? (If yes, write down its title and Accession)
How was your search interpreted by the system (the SEARCH DETAILS box)?

Advanced search

Looking at the SEARCH DETAILS from the naïve searches we have just performed gives us a good idea of how we can build our own more powerful searches. This can be done in two ways:

Simply writing the advanced search string yourself (e.g. “insulin[title]” – to search in the title field)
Using the “Search builder” to put together the query bit by bit.

But why did the naïve search for “human insulin” go so well?

If you just need a single (and well-known) gene from one of the well-known model organisms, a simple search will indeed work very well. (Much like when you do a Google search and get your desired hit on the first page).

However, there are some situations where it’s beneficial to specify the search in more detail – e.g. for building data sets of the same gene across multiple species, or just trying to locate a slightly more obscure gene. (Same as when the link you were looking for on Google was on page 10+ and you have to provide more accurate search terms).

Now we are going to narrow down the search to specific parts of the annotation. Click on Advanced in the top of the page.

This brings up a form with a “Search Builder” that can be used to select and combine terms restricted to specific fields.

It’s possible to restrict the search to specific fields in the GenBank entries (click to open the entire list)

Select “Organism” and enter human. Select “Title” and enter insulin.

Click “Search”

QUESTION 2.2:

1. How many hits do we have now?
2. Are they all from Human? If no, give a counterexample.
3. Do they all appear to be insulin genes? If no, give a counterexample.

Now use the “Search Builder” to search for insulin in other fields instead of “Title” (still with “Organism” set to human)

QUESTION 2.3:

How many hits are found when “Keyword” is set to insulin?
How many hits are found when “Protein Name” is set to insulin?
Find the correct Human Insulin gene entry (the correct hit). Click on it and write down its Accession codes (there are more than one!), Locus name and Definition (title).

Note that the “Search Builder” is simply a tool for filling out the search box. If you know the names of the available search fields, it is often more convenient to type your search with the field names manually. A schematic overview of the search fields can be found on the NCBI homepage: Search Fields and Qualifiers.

Combining search terms using boolean operators: NOT, AND and OR

Our next task will be to find full length insulin genes from as many different organisms as possible using the Title field. Note that it might have been easier to use the Protein name or Keyword fields, but with Title we can immediately see the results of what we are doing, so we are using it for pedagogical reasons. We will now type the searches directly into the Search Box without using the Search Builder.

Figure: Venn Diagrams for Boolean Logic

Let’s start out with a new clean search for Insulin:

Query:

insulin[title]

The number of hits is very high, and there are many partial genes and mRNA entries. Let’s now specify that the entries should be complete:

insulin[title] AND complete[title]

About the use of AND: The AND keyword is implicitly used whenever you enter more than one search term: “human globin” will be interpreted as “human AND globin” and only results where BOTH terms are found will be reported. We could therefore have omitted the “AND” in the previous query.

Observe that we still have many hits that are not actually insulin, so we want to add search terms to AVOID in order to bring down the false positive rate. By a brief inspection of some of the search hits, it turns out that some of them are, e.g., insulin receptors.

Let’s get rid of these with the NOT keyword:

insulin[title] complete[title] NOT receptor[title]

Conceptually what we are doing here is to conduct a number of searches that are either COMBINED or SUBTRACTED from each other. The “receptor[title]” search term finds all entries where this term is found. This list is then excluded from the combined “insulin[title] AND complete[title]” list by using the NOT operator.

The use of boolean operators can be visualized graphically using Venn diagrams (see the figure “Venn Diagrams for Boolean Logic”). A good strategy for narrowing down a GenBank search is to build a list of “kill words”/“filter words” (terms to avoid). More terms can be added to the list as search results are inspected and it becomes clear why strange entries appear on the result list.
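The combine-and-subtract idea maps directly onto set operations. The sketch below uses four invented titles (not real GenBank entries) and a hypothetical helper, matches(), to show AND as intersection, NOT as difference and OR as union. It only illustrates the logic and does not reproduce how NCBI indexes titles.

```python
# Four invented entry titles, keyed by a made-up ID.
titles = {
    1: "Toy species A insulin gene, complete cds",
    2: "Toy species B insulin receptor gene, complete cds",
    3: "Toy species C insulin mRNA, partial cds",
    4: "Toy species D actin gene, complete cds",
}


def matches(word):
    """Return the IDs of all entries whose title contains the word."""
    hits = set()
    for entry_id, title in titles.items():
        words = title.lower().replace(",", " ").split()
        if word.lower() in words:
            hits.add(entry_id)
    return hits


# insulin[title] AND complete[title]  ->  intersection
both = matches("insulin") & matches("complete")

# ... NOT receptor[title]  ->  set difference
filtered = both - matches("receptor")

# insulin[title] OR actin[title]  ->  union
either = matches("insulin") | matches("actin")

print("AND:", sorted(both))
print("AND ... NOT:", sorted(filtered))
print("OR:", sorted(either))
```

Each extra kill-word is simply one more set subtracted from the running result, which is why a kill-word that is too broad can remove entries we wanted to keep.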

A word of caution: Be careful not to throw the baby out with the bath water – don’t add kill-words that are so broad that they will actually exclude the gene(s) we are looking for. And don’t add kill-words without specifying a search field – e.g. the search

insulin[title] complete[title] NOT receptor

would exclude some real insulin hits that just happened to mention “receptor” in some reference!

The final part of the exercise is to continue finding terms to exclude on your own. The point is to bring down the number of search results to a level where it’s easy to pick the correct ones. Remember: the task is to find full length insulin genes from as many different organisms as possible using the Title field.

QUESTION 2.4:

Which search term did you end up using?
How many search results do you get now?

Notice: There are several possible answers to this question, as it is a balance between filtering out False Positives (things that are NOT insulin) and keeping as many True Positives (things that are actually insulin) as possible.

Free exercise

Now it’s time to perform a number of GenBank searches on your own. It’s important to think about the search strategy – discuss this within the group.

QUESTION 3: Do at least three of the below and report your findings. Remember to write down the search string you ended up using for each question.

1. Find the Rat and Mouse Insulin gene.
2. Find the alcohol-dehydrogenase gene from as many organisms as possible.
3. Find the alpha-globin gene from Capra hircus – (Remember: Alpha-globin is part of hemoglobin).
4. Find the alpha-globin gene from all ruminants – (hint: inspect the ORGANISM fields in a GenBank entry from an animal you know to be a ruminant, in order to pick up a good search term). If you want to go deeper into the taxonomy, the Tree of Life project has an entry on placental mammals: http://tolweb.org/tree?group=Eutheria&contgroup=Mammalia
5. Find the actin gene from as many organisms as possible. Avoid mRNA and entries that are part of whole chromosomes, cosmids etc.
6. Find the human insulin receptor gene. Avoid partial genes / single exons in the results.

Answers

Part 1

QUESTION 1.1

Inspecting the FEATURE table of the entry reveals that two CDS regions are defined; therefore there are two genes in this entry. As stated on the GenBank hand-out “CDS” is the most stable definition of a protein coding gene used in the GenBank format – sometimes “gene” will also be present, but CDS is more commonly used.
Columba livia (Rock pigeon / domestic pigeon)
The HEADER contains general information about the entry: Organism, publication references, keywords, accession-ID etc. The FEATURE table contains information that refers to coordinates in the DNA sequence – for example definition of CDS regions.

QUESTION 1.2

Since the FEATURE table has been thrown away, we no longer have the coordinates for the genes. As such they are “in there” somewhere, but we cannot find them without using external information.
The entire “ORIGIN” block (all the DNA sequence) has been converted to FASTA format. The FEATURE table is discarded. From the HEADER block the definition (title) and accession number is preserved, the rest is discarded.
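What is kept and what is discarded can be made concrete with a small converter. This is a minimal sketch, not NCBI's own conversion: it assumes a single well-formed record, takes the sequence name from the VERSION line, and ignores everything except DEFINITION, VERSION and ORIGIN. The record below is a hand-made toy (TOY0001 is not a real accession).

```python
# Hand-made toy record for illustration only; not a real GenBank entry.
TOY_RECORD = """\
LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record, not a real
            GenBank entry.
ACCESSION   TOY0001
VERSION     TOY0001.1
FEATURES             Location/Qualifiers
     CDS             join(1..6,13..18)
ORIGIN
        1 atggcctaag gtaccatgtt ttaa
//
"""


def genbank_to_fasta(genbank_text, width=70):
    """Convert one GenBank flat-file record to a FASTA string."""
    accession = ""
    definition_parts = []
    sequence_parts = []
    section = None  # which multi-line block we are currently inside

    for line in genbank_text.splitlines():
        if line.startswith("//"):
            break  # end of record
        if line.startswith("DEFINITION"):
            definition_parts.append(line[len("DEFINITION"):].strip())
            section = "definition"
        elif line.startswith("VERSION"):
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else ""
            section = None
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif section == "definition" and line.startswith(" "):
            definition_parts.append(line.strip())  # wrapped title line
        elif section == "origin":
            # Drop position numbers and spaces, keep only the letters.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        else:
            section = None  # any other block (FEATURES etc.) is discarded

    sequence = "".join(sequence_parts).upper()
    header = ">" + accession + " " + " ".join(definition_parts)
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks) + "\n"


print(genbank_to_fasta(TOY_RECORD))
```

Note that the CDS line from the FEATURE table never reaches the output, which is exactly why the gene coordinates are lost in the FASTA view.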

QUESTION 1.3

The downloaded file has Unix line endings. Remember from the jEdit exercise that line endings are indicated by the letters “U”, “W” or “M” in the lower right hand corner of the jEdit window.
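If no suitable editor is at hand, the line endings can also be checked by counting the raw bytes. The following is a small sketch; the function name and the returned labels are our own choice, and the file must be read in binary mode so that Python does not translate the line endings.

```python
def detect_line_endings(data):
    """Classify the line endings found in a bytes object."""
    crlf = data.count(b"\r\n")           # Windows: carriage return + line feed
    lf = data.count(b"\n") - crlf        # Unix: line feed on its own
    cr = data.count(b"\r") - crlf        # classic Mac: carriage return on its own

    if not (crlf or lf or cr):
        return "no line endings found"
    if crlf and not lf and not cr:
        return "Windows (CRLF)"
    if lf and not crlf and not cr:
        return "Unix (LF)"
    if cr and not crlf and not lf:
        return "Classic Mac (CR)"
    return "mixed"


print(detect_line_endings(b"LOCUS\nDEFINITION\n"))
print(detect_line_endings(b"LOCUS\r\nDEFINITION\r\n"))
```

For the downloaded file this could be called as detect_line_endings(open("AB001981.gb", "rb").read()), assuming the file is in the current directory.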

QUESTION 1.4

The “join” statement defines how to extract the coding sequence from the entire length of DNA in the entry: “join(1104..1192,1306..1510,1614..1742)” is basically a recipe stating to paste together the three intervals – and we’ll get the protein coding part of the gene: the coding exons glued together. The CDS will always start with a START codon (e.g. ATG) and end with a STOP codon (e.g. TAA).
Each gene contains three coding exons. Note: from a CDS definition we don’t get any information about the UnTranslated Regions (UTRs) that are often found before and after the coding region in the mRNA.
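The "recipe" can be followed mechanically. The sketch below parses the two join statements from the FEATURE table into intervals, counts the exons and adds up the CDS length. It is a simplification that only handles plain forward-strand joins like these two, not complement() or partial-location markers. The short DNA string at the end is invented to show how 1-based, inclusive GenBank coordinates turn into Python slices.

```python
import re


def parse_join(location):
    """Return a list of (start, end) pairs, 1-based and inclusive."""
    pairs = re.findall(r"(\d+)\.\.(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def cds_length(intervals):
    """Total number of bases covered by the intervals."""
    return sum(end - start + 1 for start, end in intervals)


def extract_cds(sequence, intervals):
    """Paste the intervals together.

    GenBank counts from 1 and includes the end position; Python slices
    count from 0 and exclude the end, hence start - 1.
    """
    return "".join(sequence[start - 1:end] for start, end in intervals)


alpha_d = parse_join("join(1104..1192,1306..1510,1614..1742)")
alpha_a = parse_join("join(4915..5009,5165..5369,5474..5602)")

for name, intervals in (("Alpha-D", alpha_d), ("Alpha-A", alpha_a)):
    length = cds_length(intervals)
    print(name, "exons:", len(intervals), "CDS length:", length,
          "codons:", length // 3)

# Invented toy sequence: two "exons" at positions 3..5 and 10..12.
toy_dna = "ccATGccccTAAcc"
print(extract_cds(toy_dna, [(3, 5), (10, 12)]))
```

Both CDS lengths come out as multiples of three, as they must for a reading frame that runs from a START codon to a STOP codon.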

QUESTION 1.5

The first number is the GI number (GenInfo Identifier, taken from the VERSION line in the header). The subsequent numbers are the positions (coordinates) in the original gene entry (taken from the join line).

Part 2

QUESTION 2.1.1

226,089 hits
No. For example, the first hit, M57671.1, “Octodon degus insulin mRNA, complete cds”, is from a Degu (http://en.wikipedia.org/wiki/Degu), a rat-like rodent from Chile. In fact, you can see in the right side of the results page that only 11,216 hits are from human. There is no reason to expect only human results from GenBank, since it is not a human-centric database.
No. There are many hits to complete or partial chromosome sequences which contain a lot of other genes. An example is JWIN03000075.1, “Camelus dromedarius breed African isolate Drom800 Contig74, whole genome shotgun sequence”.

QUESTION 2.1.2

In the Search details box, you find “insulin[All Fields]”.

QUESTION 2.1.3

18,111 hits.
Yes, it is among the hits on the first page of results.

Title: Homo sapiens insulin (INS) gene, complete cds

Accession: AH002844

(“Homo sapiens”[Organism] OR human[All Fields]) AND insulin[All Fields]

QUESTION 2.2

5548 hits.
Yes (except for 10 hits that are synthetic constructs, but based on human sequence). See the “Top Organisms” box on the right.
No.

There are many examples of insulin-degrading enzyme, insulin-like growth factor, insulin receptor and insulin-induced genes.

Many entries are mRNA and therefore not gene entries.

QUESTION 2.3

9 hits.
15 hits.
Accession codes: AH002844 J00265 J00268, Locus name: AH002844, Definition (title): “Human insulin gene, complete cds”.

QUESTION 2.4

The important thing here is not the precise search string, but that you understand the principle of using “kill-words”. One possible answer could be:

insulin[title] complete[title] NOT mRNA[title] NOT receptor[title] NOT receptor-like[title] NOT “insulin like”[title] NOT “insulin degrading”[title] NOT “growth factor”[title] NOT “family member”[title] NOT “insulin induced”[title] NOT “insulin dependent”[title] NOT “insulin promoter”[title]

which gives 19 hits, representing 13 organisms and some synthetic constructs.

Note: the use of double quotes (“”) to add two-word “kill phrases”.
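Long kill-word lists are easier to maintain as a list than as one hand-typed string. The sketch below uses hypothetical helper names (field_term, build_query) and assembles a query in the same shape as the one above. It assumes that multi-word phrases should be wrapped in plain ASCII double quotes and that every term gets the same search field.

```python
def field_term(term, field):
    """Tag a term with a search field, quoting it if it has several words."""
    if " " in term:
        term = '"' + term + '"'
    return term + "[" + field + "]"


def build_query(required, kill_words, field="title"):
    """Join required terms (implicit AND) and append one NOT per kill-word."""
    parts = [field_term(term, field) for term in required]
    parts += ["NOT " + field_term(term, field) for term in kill_words]
    return " ".join(parts)


kill_words = ["mRNA", "receptor", "insulin like", "growth factor"]
query = build_query(["insulin", "complete"], kill_words)
print(query)
```

When a strange entry shows up in the results, add one more item to kill_words and rebuild the query. Because every kill-word is tagged with a field, it cannot accidentally match text in, say, a reference.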

Note: don’t kill “insulin precursor”! Insulin is always synthesized as a precursor, preproinsulin, that contains a signal peptide, a propeptide, and the two mature chains. More about insulin in the exercises next week.

Part 3

QUESTION 3.1

It’s a good idea to separate the two logical parts of the search string: One for narrowing down the species:

(rat[ORGANISM] OR mouse[ORGANISM])

And one for actually searching for insulin:

insulin[KEYWORD]

They can then be AND’ed together:

(rat[ORGANISM] OR mouse[ORGANISM]) AND insulin[KEYWORD]
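The two-part structure (one group for the species, one for the gene) can be kept explicit when building the string. This is a small sketch with hypothetical helper names; the parentheses around the OR group matter because they keep the alternatives together when the group is combined with AND.

```python
def any_of(terms, field):
    """OR together several alternatives for one field, in parentheses."""
    return "(" + " OR ".join(term + "[" + field + "]" for term in terms) + ")"


def all_of(*parts):
    """AND together independently built parts of a query."""
    return " AND ".join(parts)


species = any_of(["rat", "mouse"], "ORGANISM")
gene = "insulin[KEYWORD]"
query = all_of(species, gene)
print(query)
```

Swapping the species list or the gene term then changes only one part of the query, which makes it easy to reuse the same strategy for the other free exercises.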

This gives 10 hits.

By manual inspection of the results, I then pick the following entries:

J00748 – Rat insulin II gene (ins-2) with two introns

J00747 – Rat insulin-I (ins-1) gene

X04724 – Mouse preproinsulin gene II

X04725 – Mouse preproinsulin gene I

Note: rodents have two copies of the insulin gene in their genomes.

Note: using “Protein Name” as field yields no results – you cannot assume that entries are always annotated with Protein Name.

QUESTION 3.2

It will never be possible to do this query perfectly – a good attempt could be:

“alcohol dehydrogenase”[title] complete[title] NOT mRNA[title] NOT synthetic[title]

which gives 2170 hits.

Note: as many as 360 of these hits are from one organism, Populus nigra (Poplar tree).

QUESTION 3.3

“Capra hircus”[ORGANISM] AND “alpha globin”[title]

This gives 6 hits. There are 2 alpha globin genes, HBAI and HBAII, and they are both present in two entries. Correct answers could be:

EU938074 Capra hircus I alpha globin (HBAI) gene, complete cds

EU938078 Capra hircus II alpha globin (HBAII) gene, complete cds

QUESTION 3.4

From Tree of Life we find that ruminants (Danish: “Drøvtyggere”) are contained in the taxon “Ruminantia”. Since we can search any level of taxonomy in the ORGANISM field we can use this:

Ruminantia[ORGANISM] AND “alpha globin”[title]

This yields 16 hits (which will need a bit of clean-up).

QUESTION 3.5

Like in 3.2, it will never be possible to do this query perfectly – a good attempt could be:

actin[title] AND actin[protein name] NOT mRNA[title] NOT partial[title]

which yields 414 hits.

Note that this will miss entries that are not annotated with “Protein name”. Alternatively, you could search with the “Title” field, but that requires a lot of “kill words”:

actin[title] complete[title] NOT mRNA[title] NOT pseudogene[title] NOT regulator[title] NOT binding[title] NOT associated[title] NOT related[title]

yields 934 hits and still requires some cleanup.