Decoding the Metagenome: Strategies for Accurate Gene Prediction and Identification
September 14, 2023 | Metagenome – Gene Prediction
Identifying Gene Components
Genes are the basic building blocks of a genome, and they can belong to larger functional systems such as operons or networks. Pinpointing these genetic elements in a DNA sample is often called gene prediction. How well it works depends on several variables, including whether the input consists of assembled sequences, raw sequencing reads, or a blend of both, and on the quality of that input.
Typically, there are two primary strategies used to identify genes: one that’s ‘evidence-based’ and another known as ‘ab initio.’
– The ‘evidence-based’ tactic involves comparing DNA sequences to previously identified genes to find matches.
– ‘Ab initio’ methods, by contrast, rely on intrinsic features of the DNA sequence itself to separate coding from non-coding regions, which allows the discovery of previously unidentified genes. Many of the computational techniques used here are rooted in statistical learning models, including Markov models and hidden Markov models (HMMs).
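To make the idea of "intrinsic features" concrete, here is a minimal sketch of the simplest sequence signal an ab initio method can use: open reading frames (ORFs) delimited by a start and a stop codon on either strand. The function names, the `min_len` threshold, and the restriction to `ATG` starts are assumptions made for illustration. Real gene finders layer statistical models on top of this and also handle genes that are cut off at the ends of a fragment, which this sketch ignores.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative starts (GTG, TTG) are ignored
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 30) -> List[Tuple[str, int, int]]:
    """Return (strand, start, end) for every complete ORF of at least min_len nt.

    Coordinates are 0-based and end-exclusive, relative to the scanned strand
    (for "-", relative to the reverse complement). An ORF starts at the first
    ATG in a frame and ends at the next in-frame stop codon (stop included).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    fragment = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
    for orf in find_orfs(fragment):
        print(orf)
```

Because only complete ORFs are reported, a scan like this will miss partial genes on short reads, which is one reason metagenome-specific tools exist.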
To improve accuracy, some tools are trained on pre-defined sets of known genes from related organisms. Others are self-training and build their model from the target DNA sequence itself.
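The training step can be sketched as follows, assuming a deliberately simplified model: a fixed-order Markov chain is estimated from known coding sequences, a second one from non-coding background, and a candidate region is scored by the log-odds between the two. The function names, the order of 2, and the add-one smoothing are illustrative assumptions and do not reproduce any tool listed in this article. Production gene finders typically use richer models, for example ones that account for codon position.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable

Model = Dict[str, Dict[str, float]]


def train_markov(seqs: Iterable[str], order: int = 2,
                 pseudocount: float = 1.0) -> Model:
    """Estimate log P(next base | previous `order` bases) with smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous bases such as N.
            if set(context + base) <= set("ACGT"):
                counts[context][base] += 1
    model: Model = {}
    for context in ("".join(p) for p in product("ACGT", repeat=order)):
        total = sum(counts[context].values()) + 4 * pseudocount
        model[context] = {
            b: math.log((counts[context][b] + pseudocount) / total)
            for b in "ACGT"
        }
    return model


def log_odds(seq: str, coding: Model, noncoding: Model, order: int = 2) -> float:
    """Sum of log P_coding - log P_noncoding over the sequence.

    Positive scores mean the sequence looks more like the coding training set.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        if context in coding and base in coding[context]:
            score += coding[context][base] - noncoding[context][base]
    return score


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    coding = train_markov(["ATG" + "GCT" * 20 + "TAA"])
    noncoding = train_markov(["AT" * 30])
    print(log_odds("GCTGCTGCTGCT", coding, noncoding))
    print(log_odds("ATATATAT", coding, noncoding))
```

A self-training variant could, under the same assumptions, replace the known gene set with long ORFs taken from the target sequence itself and re-estimate the coding model from those.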
The following table lists commonly used gene prediction tools.
Year | Tool | Short description | Link |
---|---|---|---|
2007 | FGENESH | FGENESH is an HMM-based gene structure prediction tool (multiple genes, both strands). | FGENESH |
2010 | FragGeneScan | FragGeneScan is an application for finding (fragmented) genes in short reads. | FragGeneScan |
2005 | GeneMark | GeneMark is a family of gene prediction programs developed at Georgia Institute of Technology. | GeneMark |
2009 | GENSCAN | GENSCAN can predict the locations and exon-intron structures of genes in genomic sequences from a variety of organisms. | GENSCAN |
2007 | Glimmer | Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria, archaea, and viruses. | Glimmer |
2012 | Glimmer-MG | Glimmer-MG is a system for finding genes in environmental shotgun DNA sequences. | Glimmer-MG |
2000 | HMMgene | HMMgene is a tool for predicting vertebrate and C. elegans genes. | HMMgene |
2007 | MED | MED is an unsupervised gene prediction algorithm for bacterial and archaeal genomes. | MED |
2008 | MetaGeneAnnotator | MetaGeneAnnotator is a gene-finding program for prokaryote and phage. | MetaGeneAnnotator |
2013 | MetaGUN | MetaGUN is a gene prediction method for metagenomic fragments based on a machine learning approach of SVM. | MetaGUN |
2013 | MGC | MGC is an application for finding complete and incomplete genes in metagenomic reads. | MGC |
2009 | Orphelia | Orphelia is a metagenomic ORF finding tool for the prediction of protein coding genes in short, environmental DNA sequences with unknown phylogenetic origin. | Orphelia |
2012 | MetaProdigal | MetaProdigal is the metagenomic mode of Prodigal, which can analyze sequences even when the source organism is unknown. | MetaProdigal |
Gene predictions on microbial metagenome data sets are less accurate than those on fully sequenced genomes. Common strategies for overcoming at least some of these limitations are combining multiple gene finders, screening intergenic regions for overlooked genes, and using dedicated frameshift detectors.
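Combining gene finders can be as simple as a vote. The sketch below assumes that each tool's output has already been parsed into `(contig, start, end, strand)` records. It treats two predictions as the same gene when they share contig, strand, and 3' end, since tools often disagree on the start site, and it keeps genes supported by at least `min_tools` tools. The `Gene` type, the matching rule, the choice to keep the longest variant, and the tool names are all illustrative assumptions, not the behavior of any specific pipeline.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Gene(NamedTuple):
    contig: str
    start: int   # 1-based, inclusive; start < end on both strands
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def stop_key(gene: Gene) -> Tuple[str, str, int]:
    """Identify a gene by contig, strand, and the coordinate of its 3' end."""
    three_prime = gene.end if gene.strand == "+" else gene.start
    return (gene.contig, gene.strand, three_prime)


def consensus(predictions: Dict[str, List[Gene]], min_tools: int = 2) -> List[Gene]:
    """Keep genes whose 3' end is reported by at least min_tools tools."""
    support: Dict[Tuple[str, str, int], Dict[str, Gene]] = defaultdict(dict)
    for tool, genes in predictions.items():
        for gene in genes:
            support[stop_key(gene)][tool] = gene
    kept = []
    for by_tool in support.values():
        if len(by_tool) >= min_tools:
            # Assumption: among agreeing tools, keep the longest variant.
            kept.append(max(by_tool.values(), key=lambda g: g.end - g.start))
    return sorted(kept)


if __name__ == "__main__":
    # Hypothetical tool names and coordinates, for illustration only.
    predictions = {
        "tool_a": [Gene("c1", 10, 100, "+"), Gene("c1", 200, 260, "-")],
        "tool_b": [Gene("c1", 25, 100, "+")],
        "tool_c": [Gene("c1", 200, 290, "-"), Gene("c1", 400, 460, "+")],
    }
    for gene in consensus(predictions):
        print(gene)
```

Raising `min_tools` trades sensitivity for precision: fewer spurious genes survive, but genes found by only one tool are dropped.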