Bioinformatics for analysing metagenomes
April 3, 2007
The isolation, archiving and analysis of environmental DNA (or so-called ‘metagenomes’) has enabled us to mine microbial diversity, allowing us to access their genomes, identify protein coding sequences and even to reconstruct biochemical pathways, providing insights into the properties and functions of these organisms. The generation and analysis of (meta)genomic libraries is thus a powerful approach to harvest and archive environmental genetic resources. It will enable us to identify which organisms are present, what they do, and how their genetic information can be beneficial to mankind.
The mining of genomes and metagenomic libraries will not only provide new enzymes for biotechnological processes and a basis to study new protein structures and catalytic mechanisms, but will also enable the functional assignment of many proteins found in abundance in databases and currently designated as ‘hypothetical’ or ‘conserved hypothetical’ proteins. The identification of novel catalysts will both improve existing processes and lead to the design of novel processes for making innovative products or high-value intermediates.
One of the main aims in analysing metagenomes is to use genomic analysis tools to find novel genes, discover novel pathways and functional groups, and carry out evolutionary studies.
Gene finding
Gene finding is a fundamental goal in virtually all metagenomics projects, regardless of whether complete genome sequences can be assembled or not.
>>Gene prediction can be done using GLIMMER, which is trained on long open reading frames.
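To make the idea of "long open reading frames" concrete, here is a minimal Python sketch that scans both strands of a sequence for ATG-to-stop ORFs above a length cutoff. It is not GLIMMER's algorithm, only a naive illustration of the kind of candidate regions such a tool can be trained on. The function name `long_orfs` and the `min_len` default are assumptions made for this example.

```python
# Illustrative only: a naive long-ORF scan, not GLIMMER itself.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def long_orfs(seq, min_len=300):
    """Return (strand, start, end, orf) tuples for ORFs of at least min_len nt.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon (stop included). Coordinates are 0-based, end-exclusive,
    and for the "-" strand they refer to the reverse-complemented string.
    """
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:   # keep only long ORFs
                        found.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ATG
    return found


# Toy sequence (hypothetical): one 36-nt ORF on the forward strand.
toy = "CC" + "ATG" + "GCT" * 10 + "TAA" + "GG"
orfs = long_orfs(toy, min_len=30)
```

Real metagenomic reads are often shorter than a gene, so many ORFs will lack a start or stop codon inside the read; this sketch ignores such truncated ORFs.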
Discovering novel pathways & functional groups
>>Predicted genes are searched with BLAST against the COG or KEGG databases.
>>Single-linkage hierarchical clustering is then performed (e.g. with Cluster & TreeView).
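As a rough illustration of the clustering step, the sketch below implements single-linkage agglomerative clustering in plain Python: at each round the two clusters whose closest members are nearest to each other are merged, until the smallest remaining distance exceeds a cutoff. The gene names, the distance values and the `threshold` parameter are made up for the example; in practice the distances would have to be derived from something like the BLAST results, and tools such as Cluster do considerably more than this.

```python
def single_linkage(items, dist, threshold):
    """Single-linkage agglomerative clustering, stopped at `threshold`.

    items:     list of hashable labels
    dist:      function (a, b) -> distance between two items
    threshold: merging stops once the closest pair of clusters is
               further apart than this value
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance = closest pair of members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical pairwise distances between four predicted genes.
pair_dist = {
    frozenset(("geneA", "geneB")): 0.1,
    frozenset(("geneB", "geneC")): 0.2,
    frozenset(("geneA", "geneC")): 0.9,
    frozenset(("geneC", "geneD")): 0.8,
    frozenset(("geneA", "geneD")): 0.9,
    frozenset(("geneB", "geneD")): 0.9,
}


def toy_dist(a, b):
    return pair_dist[frozenset((a, b))]


groups = single_linkage(["geneA", "geneB", "geneC", "geneD"], toy_dist, 0.5)
```

In this toy data geneA and geneC are far apart, yet they end up in the same group because each is close to geneB. This chaining behaviour is characteristic of single linkage and is worth keeping in mind when interpreting the resulting groups.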
Dealing with partial sequences
Many metagenomes contain partial sequences, which are an obstacle in phylogenetic studies. However, the problem can be addressed by aligning the partial sequences against complete ones and performing the phylogenetic assignment by finding the closest sequences in the database.
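One simple way to align a fragment against complete sequences is a pairwise alignment in which gaps at either end of either sequence cost nothing, so the fragment is not penalized for being short. The sketch below scores such an alignment with dynamic programming and picks the reference with the highest score. The scoring values, the function names and the toy sequences are assumptions for illustration, and using the raw alignment score as the measure of "closest" is a simplification of a real phylogenetic assignment.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b with free leading/trailing gaps."""
    prev = [0] * (len(b) + 1)          # row 0: leading gaps in a are free
    best_last_col = prev[-1]
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)       # column 0: leading gaps in b are free
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        best_last_col = max(best_last_col, cur[-1])
        prev = cur
    # Trailing gaps are free: take the best cell in the last row or column.
    return max(best_last_col, max(prev))


def closest_reference(fragment, references):
    """Return the name of the reference with the best score against fragment."""
    return max(references, key=lambda name: semiglobal_score(fragment, references[name]))


# Hypothetical complete sequences and one partial read.
refs = {
    "refA": "CCCGATTACACCC",
    "refB": "CCCGTTTTCACCC",
}
fragment = "GATTACA"
best = closest_reference(fragment, refs)
```

With these toy sequences the fragment matches refA exactly, so its score equals the fragment length. A global alignment with the same parameters would also charge for the six unaligned flanking bases of the reference.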
>>Perform a semi-global multiple alignment (i.e., terminal gaps are not penalized). The most widely used alignment tools are based on global or local alignments and do not correctly handle partial sequences.
>>Multiple alignment can be done with the MUSCLE tool. Although it is not optimized for partial sequences, MUSCLE does a reasonable job, as ascertained by several criteria: the number of internal gaps was small, sequences shorter than the read length had either no beginning gaps or no ending gaps (since the gene length is greater than the read length), and the total length was comparable to related proteins.
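The gap-based criteria above can be checked mechanically on any aligned sequence. The following sketch counts leading, internal and trailing gap characters in one row of an alignment; it assumes gaps are written as "-", which is the common convention in aligned FASTA output, and the function name `gap_profile` and the example row are invented for illustration.

```python
def gap_profile(aligned, gap="-"):
    """Count leading, internal and trailing gap characters in one aligned row."""
    without_leading = aligned.lstrip(gap)
    leading = len(aligned) - len(without_leading)
    core = without_leading.rstrip(gap)
    trailing = len(without_leading) - len(core)
    internal = core.count(gap)         # gaps between the first and last residue
    return {"leading": leading, "internal": internal, "trailing": trailing}


# Hypothetical aligned row for a partial protein sequence.
row = "---MKV-LA--"
profile = gap_profile(row)

# Criterion from the text: a short sequence should reach at least one end,
# i.e. have no beginning gaps or no ending gaps.
touches_an_end = profile["leading"] == 0 or profile["trailing"] == 0
```

For the example row the profile reports gaps on both sides, so `touches_an_end` is false. A row like that would be worth inspecting by hand.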