Identification of Genes in Prokaryotes and Eukaryotes: Tutorial
June 17, 2019
What is an Open Reading Frame (ORF)?
The region of a nucleotide sequence from the start codon (ATG) to the stop codon is called the open reading frame (ORF).
Background Theory:
DNA (deoxyribonucleic acid) is the genetic material that contains all the genetic information of a living organism. The information is stored as a genetic code written in four bases: adenine (A), guanine (G), cytosine (C) and thymine (T). Each base bonds with a sugar and a phosphate molecule to form a nucleotide. During transcription, DNA is transcribed to mRNA. A set of three nucleotides that codes for a particular amino acid during translation is called a codon. The region of a nucleotide sequence that starts with an initiation codon and ends with a stop codon is called an open reading frame (ORF). Proteins are translated from ORFs, so by analyzing an ORF we can predict the amino acid sequence that might be produced during translation. ORF Finder is a program available on the NCBI website. It identifies all ORFs, that is, all possible protein-coding regions, in the six reading frames.
DNA (deoxyribonucleic acid) is the genetic material that contains the information for development and helps maintain all the functions of a living organism. The information is stored as a genetic code using four different bases: adenine (A), guanine (G), cytosine (C) and thymine (T). Each base bonds with a sugar and a phosphate molecule to form a nucleotide. Between the two strands of DNA, adenine always pairs with thymine and guanine pairs with cytosine. This base pairing gives the two strands a ladder-like structure that twists into the double helix. In its bases, RNA differs from DNA in only one: it uses uracil (U) instead of thymine (T). mRNA (messenger RNA) is a type of RNA produced by transcription of DNA. In eukaryotes, DNA is transcribed to mRNA in the nucleus, and the mRNA moves to the cytoplasm through the nuclear pores. The mRNA is translated to protein in the cytoplasm with the help of ribosomes. The mRNA is read three nucleotides at a time, since a set of three nucleotides (referred to as a codon) codes for one amino acid. The region of a nucleotide sequence that starts with an initiation codon and ends with a stop codon is called an open reading frame (ORF). The initiation codon is the triplet that codes for the first amino acid in translation. Translation normally starts at the initiation codon ATG (AUG in the mRNA), which codes for the amino acid methionine, and stops when the ribosome reaches a stop codon. There are three stop codons: TAA (“ochre”), TAG (“amber”) and TGA (“opal” or “umber”). Any of these codons can stop translation. The four nucleotides can form 64 triplets (4^3): 61 code for amino acids and 3 are stop codons. Protein is formed from the ORF.
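The codon arithmetic above can be checked with a short Python sketch. It builds the standard genetic code (NCBI translation table 1) from the conventional TCAG ordering and translates a DNA coding strand until the first stop codon. The names CODON_TABLE and translate are illustrative only and are not part of any NCBI tool.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), listed in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4**3 = 64 triplets to a one-letter amino acid ('*' = stop).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding strand codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(len(CODON_TABLE), stop_codons)
print(translate("ATGGCCTAA"))
```

The sketch works in DNA notation (T rather than U), matching the way codons are written in this tutorial.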
ORF finding in Prokaryotes
Gene finding, especially in prokaryotes, starts with searching for open reading frames (ORFs). An ORF is a sequence of DNA that starts with a start codon (usually, but not always, ATG) and ends with any of the three termination codons (TAA, TAG, TGA). Depending on the starting point, there are six possible ways (three on the forward strand and three on the complementary strand) of translating any nucleotide sequence into an amino acid sequence according to the genetic code. These are called reading frames.
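As a minimal sketch of the six reading frames, the Python below takes a DNA string, computes its reverse complement, and returns the three forward and three reverse frames. The function names and the "+1" to "-3" labels are chosen for illustration and assume an input containing only A, C, G and T.

```python
# Translation table for complementing A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the complementary strand read in its own 5'->3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the six reading frames as a dict keyed by '+1'..'+3' and '-1'..'-3'."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = dna[offset:]   # forward strand
        frames[f"-{offset + 1}"] = rc[offset:]    # complementary strand
    return frames


for label, seq in six_frames("ATGAAATAG").items():
    print(label, seq)
```

Each frame is simply the same strand read from a start position shifted by zero, one or two nucleotides.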
Due to the absence of introns, protein-encoding genes in bacteria almost always possess a long and uninterrupted open reading frame. This is defined as a series of sense codons beginning with an initiation codon (usually ATG) and ending with a termination codon (TAA, TAG or TGA).
The simplest way to detect a long ORF is to carry out a six-frame translation of a query sequence using a program such as ORF Finder, which is available at the NCBI web site (http://www.ncbi.nih.gov). The genomic sequence is translated in all six possible reading frames, three forwards and three backwards. Long ORFs tend not to occur by chance so in practice almost always correspond to a gene. The majority of bacterial genes can be identified in this manner but very short genes tend to be missed because programs such as ORF finder require the user to specify the minimum size of the expected protein. Generally, the value is set at 300 nucleotides (100 amino acids) so proteins shorter than this will be ignored. Also, shadow genes (overlapping open reading frames on opposite DNA strands) can be difficult to detect. Content sensing algorithms, which use hidden Markov models to look at nucleotide frequency and dependency data, are useful for the identification of shadow genes.
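The six-frame search with a minimum length cutoff can be sketched in Python as follows. This is an illustration of the idea only, not the algorithm ORF Finder actually uses: find_orfs and its parameters are hypothetical names, ORFs are reported from the first start codon in a frame to the next in-frame stop, and coordinates are 0-based positions on the strand being scanned.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_len=300, starts=("ATG",)):
    """Scan all six frames; return (frame, start, end, sequence) for each ORF
    whose length in nucleotides (start codon through stop codon) is >= min_len."""
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for sign, seq in strands.items():
        for offset in range(3):
            start = None  # position of the first start codon since the last stop
            for i in range(offset, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon in starts:
                    start = i
                elif start is not None and codon in STOPS:
                    if i + 3 - start >= min_len:
                        orfs.append((f"{sign}{offset + 1}", start, i + 3, seq[start:i + 3]))
                    start = None
    return orfs


# Toy sequence with one short ORF in frame +3.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_len=9))
print(find_orfs(toy))  # default 300-nt cutoff: the short ORF is ignored
```

With the default cutoff of 300 nucleotides the toy ORF is discarded, which is the same effect that causes very short genes to be missed.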
However, as the number of completely sequenced bacterial genomes increases, it is becoming a more common practice to find such genes by comparing genomic sequences with known genes in other bacteria, using databank search algorithms such as BLAST or FASTA.
ORF Finder and similar programs also give the user a choice of standard or variant genetic codes, as minor variations are found among the prokaryotes and in mitochondria. The principles of searching for genes in mitochondrial and chloroplast genomes are much the same as in bacteria. Caution should be exercised, however, because some genes use quirky variations of the genetic code and may therefore be overlooked or incorrectly delimited. One example of this phenomenon is the use of non-standard initiation codons. In the standard genetic code, the initiation codon is ATG (AUG in the mRNA), but a significant number of bacterial genes begin with GUG, and there are examples where UUG, AUA, UUA and CUG are also used. In the initiator position these codons specify N-formylmethionine, whereas internally they specify different amino acids. Since ATG is also used as an internal codon, the misidentification of an internal ATG as the initiation codon in cases where GUG, etc. is the genuine initiator may lead to gene truncation from the 5′ end. The termination codon TGA is another example of such ambiguity. In genes encoding selenoproteins, TGA occurs in a special context and specifies incorporation of the unusual amino acid selenocysteine. If this variation is not recognized, the predicted gene may be truncated at the 3′ end. This also occurs if there is a suppressor tRNA which reads through one of the normal termination codons.
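The 5′ truncation described above can be illustrated with a small Python sketch. It scans a single reading frame and returns the first ORF, once with ATG as the only permitted start codon and once with GTG (GUG in the mRNA) also allowed. The function first_orf and the toy sequence are invented for this illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def first_orf(dna, starts):
    """Return the first ORF in frame +1, from the first permitted start codon
    to the next in-frame stop codon, or None if no complete ORF is found."""
    start = None
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon in starts:
            start = i
        elif start is not None and codon in STOPS:
            return dna[start:i + 3]
    return None


# Toy gene whose genuine initiator is GTG, with an internal ATG codon.
gene = "GTGAAAATGCCCTAA"
print(first_orf(gene, {"ATG"}))         # prediction truncated at the 5' end
print(first_orf(gene, {"ATG", "GTG"}))  # full-length prediction
```

Allowing extra start codons recovers the full gene here, but in practice it also increases the number of candidate ORFs, so the choice of start codon set is a trade-off.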
Tutorial:
Task: Find the gene in the following region (70000-72500 bp) of the DNA sequence from Haemophilus influenzae.
ORF Finder:
ORF Finder is a program available on the NCBI website. It identifies all the open reading frames, that is, the possible protein-coding regions, in a sequence. It shows six horizontal bars, each corresponding to one of the possible reading frames. Each direction of the DNA has three possible reading frames, so every DNA sequence has six in total: +1, +2 and +3 on the forward strand, and -1, -2 and -3 on the reverse strand. The resulting amino acid sequences can be saved and searched against various protein databases using BLAST to find similar sequences. The result displays the possible protein sequence, the length of the open reading frame and related details.
Using ORF Finder, you will discover that this region contains the coding sequence for the MutL gene, which is involved in DNA mismatch repair. To see this, do the following:
1. Cut and paste the FASTA file provided (which was retrieved from the Haemophilus influenzae genome sequence in ENTREZ GENOMES:MICROBES), or provide GI number 16271976, into the ORF Finder sequence box (http://www.ncbi.nlm.nih.gov/gorf/gorf.html), select “1 Standard” as the Genetic Code and click on the “OrfFind” button. See the resulting output of the ORF Finder algorithm. Note that there is only one very large open reading frame in the region.
2. Select the largest open reading frame, and when the screen refreshes, run a blastp search from ORF Finder using the default configuration.
3. Note that the resulting list of BLAST hits includes a result that indicates near identity with the well-characterized DNA mismatch repair protein MutL of Escherichia coli.
ORF finding in Eukaryotes
Eukaryotic gene finding is an altogether different task, because eukaryotic genes are not continuous: they are interrupted by intervening noncoding sequences called ‘introns’. Moreover, the organization of genetic information differs between eukaryotes and prokaryotes.
Gene-finding problem in Eukaryotes and approaches to improve it
All gene-finding algorithms must discriminate gene DNA from non-gene DNA, and this is achieved by the recognition of particular gene-specific features. Where simple detection of open reading frames is not sufficient, there are three types of feature that are recognized: signals, contents and homologies.
Two classic examples of signals identified by eukaryotic gene finders are CpG islands and poly(A) addition signals.
The most important features to identify are the splice junctions—the donor and acceptor sites. If these could be reliably detected from the genomic DNA the difficulty in identifying the coding regions would be greatly reduced because most genes could be recognized simply by finding the long ORFs. It would still be somewhat more difficult than for prokaryotes simply because genes are much less dense in eukaryotes, but a high degree of accuracy could be obtained easily. Unfortunately, splice junctions are not reliably detectable in the genomic sequence. The most common method for predicting them has been the “weight matrix.”
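A weight matrix can be sketched in a few lines of Python. The version below is one common formulation, assumed here for illustration: it counts the bases at each position of a set of aligned example sites, adds a pseudocount, and scores a candidate window as the sum of log-odds values against a uniform background. The four "donor sites" are made-up training data, and build_pwm, score and best_site are hypothetical names.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores for a window as wide as the matrix."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_site(pwm, dna):
    """Return the start index of the highest-scoring window in dna."""
    width = len(pwm)
    return max(range(len(dna) - width + 1), key=lambda i: score(pwm, dna[i:i + width]))


# Made-up aligned donor-site examples (intron side of the junction).
training_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
pwm = build_pwm(training_sites)
print(best_site(pwm, "CCCCGTAAGTCCCC"))
```

Because each position is scored independently, a matrix like this cannot capture dependencies between positions, which is one reason splice junctions remain hard to predict from sequence alone.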
Other signals can also be useful in predicting exons. (Although technically incorrect, I will use the convention of the gene-finding field for “exon” to mean only the protein-coding portion.) The start and stop codons are essential in predicting the correct gene. Unfortunately they are fairly uninformative without knowing the reading frame. But they are essential in categorizing exons into four classes: single exon genes that begin with a start codon and end with a stop codon; initial exons that begin with a start codon and end with a donor site; terminal exons that begin with an acceptor site and end with a termination codon; and internal exons that begin with an acceptor site and end with a donor site. Initial and terminal exons tend to be the most difficult to identify, both because the signals are less informative and because they are often much shorter than internal exons and therefore harder to identify by content measures. Some programs also look for sites associated with promoters, such as TATA boxes, transcription factor (TF) binding sites, and CpG islands. Although identifying promoters on their own is a difficult problem, they can sometimes add information that is useful for predicting genes. Poly(A) addition signals are also used sometimes to aid in identifying the proper carboxyl terminus of the gene. In general, the use of these other types of signals provides a marginal improvement over methods that do not use them.
What is a Coding Sequence (CDS)? How is it different from the ORF?
The coding sequence (CDS) is the actual region of DNA that is translated to form proteins. While the ORF may contain introns as well, the CDS refers to those nucleotides (the concatenated exons) that can be divided into codons which are actually translated into amino acids by the ribosomal translation machinery. In prokaryotes the ORF and the CDS are the same.
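The relationship between a genomic region and its CDS can be sketched in Python by concatenating exon slices. The toy sequence, the exon coordinates (0-based, half-open, forward strand) and the function name extract_cds are all assumptions made for this example.

```python
def extract_cds(genomic, exons):
    """Concatenate exon slices (0-based, half-open coordinates) into the CDS."""
    cds = "".join(genomic[start:end] for start, end in sorted(exons))
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return cds


# Toy gene: exon 1, a 12-nt intron (GT...AG), then exon 2.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTTCAG", "AAATAA"
genomic = exon1 + intron + exon2

cds = extract_cds(genomic, [(0, 6), (18, 24)])
print(len(genomic), len(cds))
print(cds)
```

In this toy case the genomic region is 24 nucleotides long while the CDS is only 12, because the intron is removed before translation.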