
Big Data Meets Biology: Large-Scale Sequencing Projects

August 5, 2021 By admin

Biologists are now part of the big-data club. With the advent of high-throughput genomics, life scientists are beginning to contend with huge data volumes, meeting obstacles in handling, processing, and transporting data that were previously the domain of astronomers and high-energy physicists.

Sequencing the human genome twenty years ago was one of the most audacious scientific endeavours ever attempted. Yet its 3 billion DNA base pairs and approximately 20,000 genes appear insignificant in comparison to the more than 100 billion bases and millions of genes found in the bacteria that live in the human body.

The “Human Genome Project” (HGP) was formally launched in 1990, completed a first draft of the genome (90 percent completion) in 2001, and declared a “complete” sequence (99 percent of the euchromatic genome with 99.99 percent accuracy) in 2003. The HGP wrapped up its major gene sequencing and annotation efforts in 2006, when the sequence and annotations for chromosome 1 (the largest chromosome) were published. These efforts cost approximately $4.2 billion in today’s dollars. As a logical follow-up to the HGP, the “ENCODE” Project began in 2003 with the goal of identifying all functional elements in the human genome, such as transcripts, promoters, and long-range regulatory regions.

The “Roadmap Epigenomics Project” aims to map the human genome’s DNA methylation, histone modifications, chromatin accessibility, and small RNA transcripts. Both projects make extensive use of next-generation sequencing techniques such as RNA sequencing, ChIP sequencing, and bisulfite sequencing.

The US-funded “Cancer Genome Atlas” and the EU-funded “Cancer Genome Project” are sequencing the genomes, exomes, and transcriptomes of thousands of cancer samples in order to identify common cancer-causing mutations. The “HapMap Project” and its successor, the “1000 Genomes Project,” provide a detailed map of the human population’s genomic variations (SNPs and structural variants).

One of the largest completed human sequencing projects was Genomics England’s “100,000 Genomes Project,” which generated genomics, transcriptomics, and epigenomics datasets for 100,000 cancer and rare disease patients.

Apart from human genomes, sequences of plants, animals, fungi, and microbial communities are being compiled. The “Human Microbiome Project” profiled our closest companions, the microbes that live in our guts and on our skin, using 16S sequencing, whereas the “100K Foodborne Pathogen Genome Project” aims to sequence the genomes of hundreds of thousands of disease-causing bacteria and viruses. Several other large-scale metagenomics efforts are grouped under the “Earth Microbiome Project.”

In the case of plants, the first large-scale projects have already sequenced the genomes of more than 1,000 Arabidopsis varieties and 3,000 rice varieties (BGI extended the project to cover 10,000 varieties). Additionally, the University of Alberta intends to sequence 1,000 transcriptomes from various plant species, the JGI is conducting a “1000 Fungal Genome Project,” the “Genome10k” project will sequence 16,000 vertebrates, the “i5k” project will sequence 5,000 insect genomes, and the “Fish-T1K” project will sequence 1,000 fish transcriptomes.

Here is a selection of big data projects in the life sciences exploring health, the environment and beyond.

The 100,000 Genomes Project


The 100,000 Genomes Project is a now-completed UK Government project, managed by Genomics England, that sequenced whole genomes from National Health Service patients. The project focused on rare diseases, some common types of cancer, and infectious diseases.

Website link: https://www.genomicsengland.co.uk/about-genomics-england/the-100000-genomes-project/

The NIH Roadmap Epigenomics Mapping Consortium


The NIH Roadmap Epigenomics Program was established with the goal of elucidating how epigenetic processes contribute to human biology and disease. One of the major components of this programme consists of the Reference Epigenome Mapping Centers (REMCs), which systematically characterized the epigenomic landscapes of representative primary human tissues and cells. They used a diversity of assays, including chromatin immunoprecipitation (ChIP), DNA digestion by DNase I (DNase), bisulfite treatment, methylated DNA immunoprecipitation (MeDIP), methylation-sensitive restriction enzyme digestion (MRE), and RNA profiling, each followed by massively parallel short-read sequencing (-seq). The resulting data sets were assembled into publicly accessible websites and databases, which serve as a broadly useful resource for the scientific and biomedical community.

Website link: http://www.roadmapepigenomics.org/

The Cancer Genome Atlas


The Cancer Genome Atlas (TCGA), a ground-breaking cancer genomics initiative, has molecularly characterised approximately 20,000 primary cancer and matched normal samples from 33 different cancer types. This collaborative initiative between the National Cancer Institute and the National Human Genome Research Institute began in 2006 and brought together researchers from a variety of disciplines and organisations.

Over the following decade, TCGA generated about 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. The data, which have already improved our ability to identify, treat, and prevent cancer, will remain open for use by anybody in the research community.

Website link: https://portal.gdc.cancer.gov/

Encyclopedia of DNA Elements (ENCODE)


ENCODE has generated enormous volumes of data, which are freely available via the project’s public database, the ENCODE Portal. ENCODE’s “Encyclopedia” classifies these records according to two levels of annotation: 1) annotations at the integrative level, including a registry of putative cis-regulatory elements; and 2) annotations at the ground level, derived directly from experimental data.

ENCODE data are widely used as a result of outreach and collaboration. The ENCODE Portal lists publications that make use of ENCODE resources. It also provides data from the modENCODE, Roadmap Epigenomics, and Genomics of Gene Regulation projects, along with information on data standards and uniform data processing.

The ENCODE Project began in 2003 with the ENCODE Pilot Project, which focused on 1% of the human genome. Following that, two more phases (ENCODE 2 and ENCODE 3) were completed, which included whole-genome investigations of the human and mouse genomes. Parallel to this endeavour, the modENCODE Project focused on whole-genome analysis of the C. elegans and D. melanogaster genomes.

With the success of these three phases of the ENCODE Project and recognition of the need for additional effort to complete and understand the catalogue of candidate regulatory elements compiled, NHGRI funded the fourth phase of ENCODE (ENCODE 4) in February 2017 to continue and expand its work to understand the human and mouse genomes.

This map of the human genome’s functional elements (areas that regulate gene expression) comprises more than 15 terabytes of raw data.

Website link: https://www.encodeproject.org/

Human Microbiome Project


The Human Microbiome Project (HMP) is a research endeavour aimed at advancing our understanding of the microbial flora that contribute to human health and disease. The first phase (HMP1), which began in 2007, was devoted to discovering and characterising the human microbial flora. The second phase, dubbed the Integrative Human Microbiome Project (iHMP), began in 2014 with the goal of developing tools for characterising the microbiome and clarifying the involvement of microbes in health and disease states.

This initiative, one of several aimed at describing the microbiome in various areas of the body, has generated 18 terabytes of data — around 5,000 times the amount generated by the initial human genome project.
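As a rough sanity check on this comparison, the two quoted figures (18 terabytes, and a factor of around 5,000) imply a size for the data from the original human genome project. The short Python sketch below works that out. It assumes decimal units (1 TB = 10^12 bytes), and the variable names are illustrative only.

```python
# Back-of-the-envelope check of the HMP figures quoted in the text.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 GB = 10**9 bytes).
TB = 10**12
GB = 10**9

hmp_bytes = 18 * TB     # data generated by the Human Microbiome Project
ratio_to_hgp = 5_000    # "around 5,000 times" the original human genome project

# Size of the original human genome project data implied by the two figures.
implied_hgp_bytes = hmp_bytes / ratio_to_hgp
implied_hgp_gb = implied_hgp_bytes / GB

print(f"Implied human genome project data: about {implied_hgp_gb:.1f} GB")
```

The result, a few gigabytes, is in the range one would expect for a genome of roughly 3 billion base pairs if each base takes on the order of one byte to store (an assumption made here for illustration, not a figure from the project).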

Important components of the HMP were culture-independent methods of microbial community characterization, such as metagenomics (which provides a broad genetic perspective on a single microbial community), as well as extensive whole genome sequencing (which provides a “deep” genetic perspective on certain aspects of a given microbial community, i.e. of individual bacterial species). The latter served as reference genomic sequences — 3000 such sequences of individual bacterial isolates are currently planned — for comparison purposes during subsequent metagenomic analysis. The project also financed deep sequencing of bacterial 16S rRNA sequences amplified by polymerase chain reaction from human subjects.

Website link: https://hmpdacc.org/

The 100K Pathogen Genome Project


The 100K Pathogen Genome Project is producing draft and closed genome sequences from diverse pathogens. The project has expanded globally to include a snapshot of global bacterial genome diversity. The genomes form a sequence database that has a variety of uses, from systematics to public health.

The 100K Pathogen Genome Project is using next-generation and third-generation sequencing approaches to uncover the vast diversity of bacterial genotypes. These genotypes form the basis of identification and tracking using population genomic approaches that have only recently become available to microbiology. Continual bacterial genetic evolution hinders our ability to consistently detect and mitigate pathogens, which interferes with our preparedness to defend public health. By leveraging genome diversity, the project enables new approaches to diagnostics, phylogeny, and public health.

Website link: https://100kgenomes.org/

1001 Genomes


The first genome sequence of any plant was from a single inbred strain (accession) of A. thaliana. Its complete release in 2000 was a major milestone for biology. The 120 Mb genome sequence of the Columbia (Col-0) accession propelled A. thaliana to the forefront of efforts to understand the genetic basis of quantitative variation among natural accessions. A particular advantage for such analyses is that locally adapted lines collected from the wild are typically inbred, because the species is predominantly selfing. Together with partners from around the world, the 1001 Genomes Project was initiated with the goal of describing the whole-genome sequence variation in 1,001 accessions of A. thaliana.

Website link: https://1001genomes.org/

1000 Fungal Genomes Project


With an estimated 1.5 million species, Fungi represent one of the largest branches of the Tree of Life. They have an enormous impact on human affairs and ecosystem functioning, owing to their diverse activities as decomposers, pathogens, and mutualistic symbionts. And perhaps more than any other group of nonphotosynthetic organisms, fungi are essential biological components of the global carbon cycle. Collectively, they are capable of degrading almost any naturally occurring biopolymer and numerous human-made ones. As such, fungi hold considerable promise in the development of alternative fuels, carbon sequestration and bioremediation of contaminated ecosystems.

The use of fungi for the continued benefit of humankind, however, requires an accurate understanding of how they interact in natural and synthetic communities. The ability to sample environments for complex fungal metagenomes is rapidly becoming a reality and will play an important part in harnessing fungi for industrial, energy and climate management purposes. However, our ability to accurately analyze these data relies on well-characterized, foundational reference data of fungal genomes.

To bridge this gap in our understanding of fungal diversity, an international research team in collaboration with the Joint Genome Institute of the Department of Energy has embarked on a five-year project to sequence 1000 fungal genomes from across the Fungal Tree of Life.

Website link: http://1000.fungalgenomes.org/home/

Earth Microbiome Project


The Earth Microbiome Project (EMP) is a massively collaborative effort to characterise the planet’s microbial life. It employs DNA sequencing and mass spectrometry to decipher patterns in microbial ecology throughout the planet’s biomes and habitats. The EMP is a model of open research, employing a collaborative network of 500+ investigators, facilitating pre-publication data exchange, and crowdsourcing data analysis to enable the exploration of universal principles. Standardized data collection, curation, and analysis make the interpretation of ecological trends more robust.

A global effort to define microbial communities has resulted in the generation of 340 gigabytes of sequencing data, representing 1.7 billion sequences from over 20,000 samples and 42 biomes. By the time the project is complete, scientists anticipate that 15 terabytes of sequencing and other data will have been generated.
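The figures for the data generated so far (340 gigabytes, 1.7 billion sequences, 20,000 samples) can be turned into rough per-sequence and per-sample averages, which gives a feel for the scale of each dataset. The Python sketch below assumes decimal gigabytes (1 GB = 10^9 bytes) and treats the quoted totals as exact. Since the text says "over 20,000 samples", the per-sample values are upper bounds. The variable names are illustrative only.

```python
# Rough averages derived from the EMP totals quoted in the text.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 MB = 10**6 bytes).
GB = 10**9
MB = 10**6

total_bytes = 340 * GB            # sequencing data generated so far
total_sequences = 1_700_000_000   # 1.7 billion sequences
total_samples = 20_000            # "over 20,000 samples", taken as 20,000

bytes_per_sequence = total_bytes / total_sequences
sequences_per_sample = total_sequences / total_samples
mb_per_sample = total_bytes / total_samples / MB

print(f"Average storage per sequence: {bytes_per_sequence:.0f} bytes")
print(f"Average sequences per sample: {sequences_per_sample:,.0f}")
print(f"Average data per sample:      {mb_per_sample:.0f} MB")
```

Under these assumptions, each sequence averages a couple of hundred bytes, which fits short sequencing reads stored together with their metadata.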

Website link: https://earthmicrobiome.org/

Genome 10K


Genome 10K is a project that aims to sequence the genomes of at least one member of each vertebrate genus, a total of around 10,000 genomes. It is a critical step toward the Vertebrate Genomes Project’s goal of identifying and sequencing at least one individual from each of approximately 66,000 vertebrate species.

This project to sequence and assemble the DNA of 10,000 vertebrate species and investigate their evolutionary links will generate more than 1 petabyte of raw data.

Website link: https://genome10k.soe.ucsc.edu/

B10K Project


The Bird 10,000 Genomes (B10K) Project is an initiative to generate representative draft genome sequences from all extant bird species. The project builds on the success of the previous ordinal-level project, which provided the first proof of concept for carrying out large-scale sequencing of multiple representative species across a vertebrate class and a window into the types of discoveries that can be made with such genomes.

Website link: https://b10k.genomics.cn/

5,000 Insect Genome Project (i5k)


The i5k Initiative aims to sequence the genomes of 5,000 insects and other arthropods over the next five years in order to “improve our lives by contributing to a better understanding of insect biology and transforming our ability to manage arthropods that threaten our health, food supply, and economic security.”

Website link: http://arthropodgenomes.org/

Fish-T1K (Transcriptomes of 1,000 Fishes) Project


An international project known as the “Transcriptomes of 1,000 Fishes” (Fish-T1K) project has been established to generate RNA-seq transcriptome sequences for 1,000 diverse species of ray-finned fishes. The first phase of this project has produced transcriptomes from more than 180 ray-finned fishes, representing 142 species and covering 51 orders and 109 families.

Website link: https://db.cngb.org/fisht1k/

The Earth BioGenome Project (EBP)


The Earth BioGenome Project (EBP) is a ten-year endeavour aimed at sequencing and cataloguing the genomes of all currently identified eukaryotic species on Earth. The programme would create an open DNA database containing biological data that would serve as a platform for scientific inquiry and would also promote environmental and conservation efforts.

Website link: https://www.earthbiogenome.org/
