Keeping up with the Human genome – Tim Hubbard

January 10, 2007, by admin
Abstract from Tim Hubbard's talk:
The human genome is thirty times bigger than the worm genome we were only just getting to grips with, and it has far greater numbers of interested users. The Ensembl project was started from scratch to handle this data: a system to store the data in an RDBMS; a pipeline to generate a pre-computed set of analyses; an API to provide both web and programmatic access. Ensembl evolves continuously: a new release is made every 2 months, and in nearly every release the schema is updated to handle new data types. It now integrates more than thirty large genomes and provides researchers with a resource of >300 GB of data, all of which is free to download. The website alone generates >1 million page impressions per week. However, with genome sequencing output per machine having recently jumped 300-fold and costs having dropped 10-fold, with further drops promised, what Ensembl deals with now is tiny compared with what is to come.
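To make the "RDBMS plus programmatic access" point concrete, here is a minimal sketch of reading from Ensembl's public MySQL mirror directly. It is not the project's own API (that is a Perl object layer); the host name, anonymous user and the example core database name are assumptions, and database names change with every release, so check what is actually available first.

```python
# Sketch: read-only query against Ensembl's public MySQL mirror.
# Assumed details (not from the talk): host ensembldb.ensembl.org,
# user "anonymous", and a core database named like
# "homo_sapiens_core_<release>_<assembly>" -- the name below is illustrative.
import pymysql

conn = pymysql.connect(host="ensembldb.ensembl.org",
                       user="anonymous",
                       database="homo_sapiens_core_42_36d")  # hypothetical release
try:
    with conn.cursor() as cur:
        # The core schema keeps one row per annotated gene.
        cur.execute("SELECT COUNT(*) FROM gene")
        (n_genes,) = cur.fetchone()
        print(f"annotated genes: {n_genes}")
finally:
    conn.close()
```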

Despite all this data, we are far from understanding our genome. Given the complexity of the system it is probably only feasible to tackle it as a huge global collaborative project, making data integration and exchange critical. One of the most significance features of the genome sequence is that it provides a framework to organize other biological information. However, there’s a limit to how much can be usefully imported into a single database, especially as new resources spring up continuously and frequently are of unknown scientific value. The web has been constructed on links, however its hard to compare data unless it is easily aggregated. The Distributed Annotation System (DAS) is essentially a system of standardized web services: each provider runs a DAS server; DAS clients can aggregate data from as many servers as they wish around a single coordinate system, i.e. a genome sequence. Ensembl is both a DAS server and DAS client. There are analogies with layering data on maps.google.com and google earth, except that here the servers of different layers are distributed. However visual integration is only a first step: the genome is too big for researchers to explore manually. We are going need to computational guide researchers to the most interesting areas of the genome.
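A rough illustration of the DAS idea: a client issues the same standardized "features" request to several annotation servers and merges the XML replies onto one coordinate system. The server URLs below are placeholders (real DAS sources are published by their providers), and the segment string is just an example region of human chromosome 20.

```python
# Sketch of a tiny DAS client aggregating annotations from multiple servers.
# Server base URLs are hypothetical; the request/response shapes follow the
# DAS "features" command, which returns DASGFF XML containing <FEATURE> elements.
from urllib.request import urlopen
from xml.etree import ElementTree

SERVERS = [
    "http://example.org/das/source_a",   # hypothetical provider A
    "http://example.org/das/source_b",   # hypothetical provider B
]
SEGMENT = "20:100000,200000"             # chromosome:start,end on one genome assembly

features = []
for base in SERVERS:
    url = f"{base}/features?segment={SEGMENT}"
    with urlopen(url) as reply:
        tree = ElementTree.parse(reply)
    # Collect every annotation the server reports for this segment.
    for feat in tree.iter("FEATURE"):
        features.append((base, feat.get("id"), feat.get("label")))

# All layers are now in one list, keyed to the same coordinates,
# regardless of which distributed server supplied them.
for source, fid, label in features:
    print(source, fid, label)
```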
