Ace of (data)base

October 10, 2006, by admin

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence.

For researchers to benefit from the data stored in a database, two additional requirements must be met:
1. Easy access to the information; and
2. A method for extracting only the information needed to answer a specific biological question.
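
The idea of a file of uniform records, queried for only the entries of interest, can be sketched in a few lines of Python. This is a minimal illustration, not how any real sequence database is implemented: the class name, field names, helper function, and all record contents are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """One record; every record carries the same set of fields."""
    contact_name: str
    sequence: str
    molecule_type: str
    organism: str
    citations: list[str] = field(default_factory=list)


# Toy, made-up records standing in for the contents of a database file.
RECORDS = [
    SequenceRecord("A. Researcher", "ATGGCCAAATAA", "DNA", "Escherichia coli"),
    SequenceRecord("B. Scientist", "AUGGCCAAAUAA", "mRNA", "Homo sapiens",
                   ["Hypothetical et al. (2006)"]),
    SequenceRecord("C. Student", "ATGAAATAG", "DNA", "Homo sapiens"),
]


def query(records, **criteria):
    """Return only the records whose fields equal every given criterion."""
    return [r for r in records
            if all(getattr(r, name) == value for name, value in criteria.items())]


# Extract only what answers the question "which human DNA sequences do we have?"
human_dna = query(RECORDS, organism="Homo sapiens", molecule_type="DNA")
print([r.sequence for r in human_dna])
```

Real databases replace the linear scan with indexes, but the principle is the same: a fixed record layout is what makes targeted queries possible.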

The principal requirements on the public data services are:

* Data quality – data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.
* Supporting data – database users will need to examine the primary experimental data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.
* Deep annotation – deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.
* Timeliness – the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.
* Integration – each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.
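
The integration requirement, following cross-references from one database to another, can be sketched as a small graph walk. Everything here is hypothetical: the three miniature "databases", their identifiers, and the function name are invented purely to show the idea.

```python
# Hypothetical miniature "databases": names, identifiers, and links are invented.
DATABASES = {
    "nucleotide": {
        "N0001": {"description": "toy gene sequence",
                  "xrefs": [("protein", "P0001")]},
    },
    "protein": {
        "P0001": {"description": "toy protein product",
                  "xrefs": [("structure", "S0001"), ("nucleotide", "N0001")]},
    },
    "structure": {
        "S0001": {"description": "toy 3-D structure", "xrefs": []},
    },
}


def follow_links(db, entry_id, seen=None):
    """Collect every (database, id) reachable from one entry via cross-references."""
    seen = set() if seen is None else seen
    key = (db, entry_id)
    # Stop on entries already visited (links can be circular) or unknown targets.
    if key in seen or entry_id not in DATABASES.get(db, {}):
        return seen
    seen.add(key)
    for target_db, target_id in DATABASES[db][entry_id]["xrefs"]:
        follow_links(target_db, target_id, seen)
    return seen


linked = follow_links("nucleotide", "N0001")
print(sorted(linked))
```

Tracking visited entries matters because cross-references often point in both directions, as the protein entry above does.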

Primary databases (consisting of data derived experimentally)
a.) Sequence databases
DNA / nucleotide databases
GenBank

GenBank (Genetic Sequence Databank) is one of the fastest-growing repositories of known genetic sequences. It has a flat-file structure: an ASCII text file, readable by both humans and computers. GenBank is part of the International Nucleotide Sequence Database Collaboration, which consists of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank (NCBI); the three exchange data on a daily basis. In addition to sequence data, GenBank files contain information such as accession numbers and gene names, phylogenetic classification, and references to published literature. As of 2006, GenBank contains publicly available DNA sequences for more than 170,000 different organisms, obtained primarily through submissions of sequence data from individual laboratories and batch submissions from large-scale sequencing projects.

http://www.ncbi.nlm.nih.gov
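
Because a GenBank entry is plain ASCII text, it can be read with very little code. The sketch below parses a made-up, heavily simplified record in the general style of a GenBank flat file (a keyword at the start of a line, indented continuation lines, an ORIGIN section holding the sequence, and `//` as terminator). The record contents and the function name are invented, and real entries contain further sections that this sketch makes no attempt to interpret.

```python
# A made-up, simplified record in the general style of a GenBank flat file.
TOY_RECORD = """\
LOCUS       TOY0001      12 bp    DNA
DEFINITION  Made-up example record,
            for illustration only.
ACCESSION   TOY0001
SOURCE      Escherichia coli
ORIGIN
        1 atggccaaat aa
//
"""


def parse_flat_file(text):
    """Parse keyword lines and the sequence from one simplified flat-file record."""
    record, sequence, current, in_origin = {}, [], None, False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:
            # Sequence lines mix position numbers, spaces, and bases; keep letters.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].strip():             # keyword starts in the first column
            current, _, value = line.partition(" ")
            record[current] = value.strip()
        elif current and line.strip():     # indented continuation of the last keyword
            record[current] += " " + line.strip()
    record["sequence"] = "".join(sequence)
    return record


entry = parse_flat_file(TOY_RECORD)
print(entry["ACCESSION"], entry["SOURCE"], entry["sequence"])
```

For real work, an established parser is the safer choice; the point here is only that the format is simple enough for both people and programs to read.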

EMBL

The EMBL Nucleotide Sequence Database is a comprehensive database of DNA and RNA sequences collected from the scientific literature and patent applications and directly submitted by researchers and sequencing groups. Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ). At the time of the figures quoted here (June 1994), the database was doubling in size every 18 months and contained nearly 2 million bases from 182,615 sequence entries.

http://www.ebi.ac.uk/embl/
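
A doubling time of 18 months means exponential growth: after t months, a starting size N0 becomes N0 × 2^(t/18). The short sketch below works through that arithmetic. It assumes the doubling time stays constant, and it applies the rate to the entry count purely for illustration; the function name is made up.

```python
def projected_size(initial_size, months_elapsed, doubling_months=18):
    """Size after exponential growth with a fixed doubling time (in months)."""
    return initial_size * 2 ** (months_elapsed / doubling_months)


# Three years is two doubling periods, so the size quadruples.
print(projected_size(182_615, 36))
```

Under these assumptions, 182,615 entries would become 730,460 after 36 months.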

DDBJ (DNA Data Bank of Japan)

DDBJ was established in 1986 at the National Institute of Genetics (NIG). It was reorganized as the Center for Information Biology and DNA Data Bank of Japan (CIB/DDBJ) in 2001.

http://www.ddbj.nig.ac.jp

Protein databases
SwissProt

SwissProt was established in 1986. It is maintained collaboratively by the EMBL Outstation (EBI) and the Swiss Institute of Bioinformatics (SIB). It is a protein sequence database that provides a high level of integration with other databases and has a very low level of redundancy (that is, few identical sequences are present in the database).

http://www.expasy.org/sprot

TrEMBL (Translation of EMBL Nucleotide Sequence Databases)

It was created in 1996 as a supplement to Swiss-Prot. It makes new sequences available as quickly as possible through computer-annotated entries derived from the translation of all coding sequences (CDS) in EMBL.

http://www.uniprot.org/database/knowledgebase.shtml
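
The core step behind such entries, translating a coding sequence into a protein sequence, can be sketched as follows. This is a simplified illustration using the standard genetic code, not TrEMBL's actual pipeline: it assumes the CDS starts in frame, stops at the first stop codon, and marks any unrecognised codon as "X". The function name and test sequence are invented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_cds(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")    # accept RNA as well as DNA
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unrecognised codon
        if amino_acid == "*":              # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCAAATAA"))
```

Here ATG, GCC, and AAA translate to M, A, and K, and TAA ends the protein.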

PIR (Protein Information Resource)
PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) and has been maintained by PIR-International since 1988. It is partitioned into four sections that differ in classification, annotation, redundancy, and cross-referencing to other biological databases.

http://pir.georgetown.edu
b.) Structure databases

PDB (Protein Data Bank)
The single worldwide repository for the processing and distribution of 3-D biological macromolecular structure data.

http://www.rcsb.org/pdb/

NDB (Nucleic Acid Database)
The Nucleic Acid Database Project (NDB) assembles and distributes structural information about nucleic acids. The data available consist of coordinates, experimental details used to determine the structures, and derived information about the geometry of the structures.

http://ndbserver.rutgers.edu/

CCDC / CSD (Cambridge Crystallographic Data Centre / Cambridge Structural Database)
The CCDC compiles a computerised database containing comprehensive data for organic and metal-organic compounds studied by X-ray and neutron diffraction.

http://www.ccdc.cam.ac.uk/prods/csd/csd.html

Secondary databases (derived information)

A secondary database contains information derived from a primary database, such as conserved sequences, signature sequences, and active-site residues of protein families, arrived at by multiple sequence alignment of a set of related proteins. A secondary structure database contains entries of the PDB in an organized way (for instance, by classifying all PDB entries according to structures such as alpha-helices or β-sheets) and also information on conserved secondary structure motifs of a particular protein.
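
As a rough sketch of how conserved positions fall out of a multiple sequence alignment, the snippet below scans a toy alignment column by column and reports the columns where every sequence carries the same residue. The alignment and the function name are invented; real secondary databases use far more tolerant measures of conservation than strict identity.

```python
# Toy, made-up alignment of three related sequences ("-" marks a gap).
ALIGNMENT = [
    "MKT-LLVAG",
    "MKTALLIAG",
    "MRT-LLVAG",
]


def conserved_positions(alignment):
    """Return (index, residue) for columns where all sequences share one residue."""
    conserved = []
    for index, column in enumerate(zip(*alignment)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            conserved.append((index, column[0]))
    return conserved


print(conserved_positions(ALIGNMENT))
```

Columns 1, 3, and 6 are excluded here because they contain a substitution or a gap.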

ProSite (Database of Protein Families and Domains)
It contains patterns and profiles specific for more than a thousand protein families or domains and also background information on the structure and function of these proteins.

http://www.expasy.org/prosite
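
Patterns of this kind map naturally onto regular expressions. The sketch below converts a small subset of PROSITE-style pattern syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends) into a Python regular expression. The example pattern and sequences are invented and do not correspond to any real ProSite entry, and the converter ignores less common syntax.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simplified PROSITE-style pattern into a regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        start = "^" if element.startswith("<") else ""   # N-terminal anchor
        end = "$" if element.endswith(">") else ""       # C-terminal anchor
        element = element.strip("<>")
        core, _, repeat = element.partition("(")
        if core == "x":                                  # any residue
            core = "."
        elif core.startswith("{"):                       # excluded residues
            core = "[^" + core[1:-1] + "]"
        count = "{" + repeat.rstrip(")") + "}" if repeat else ""
        regex.append(start + core + count + end)
    return "".join(regex)


# Hypothetical pattern: C, then 2-4 of anything, then D or E, then not P, then H.
toy_regex = prosite_to_regex("C-x(2,4)-[DE]-{P}-H")
print(toy_regex)
print(bool(re.search(toy_regex, "AACGGDAHKK")))
print(bool(re.search(toy_regex, "AACGGDPHKK")))
```

The first toy sequence matches; the second does not, because a proline sits in the excluded position.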

Pfam (Protein Families Database of Alignment and HMMs)
A large collection of multiple sequence alignments and hidden Markov models covering many protein domains and families. As of 2006, Pfam contains over 6,000 protein families and domains.

http://www.sanger.ac.uk/Software/Pfam/

Enzyme (Enzyme Nomenclature Database)
It is a repository of information relating to the nomenclature of enzymes, based primarily on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB).

http://www.expasy.org/enzyme/

REBase (Restriction Enzyme Database)
It is a collection of information about restriction enzymes and related proteins. As of 2006, over 4000 enzymes and over 7000 references are stored in REBASE.

http://rebase.neb.com/rebase/rebase.html

Genome-related Information
OMIM (Online Mendelian Inheritance in Man)
It is a catalog of human genes and genetic disorders that includes information on genetic variation in humans, along with textual information, pictures, and reference information. It is incorporated into NCBI’s Entrez system and can be queried using the same approach as the other Entrez databases such as PubMed and GenBank.

http://www.ncbi.nlm.nih.gov/Omim/

TransFac (Transcription Factor Database)
A database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors. It covers the whole range from yeast to human.

http://www.gene-regulation.com

Structure-related Information
HSSP (Homology-derived Secondary Structure of Proteins)
A database of homology-derived secondary structure of proteins (HSSP), built by aligning to each protein of known structure all sequences deemed homologous on the basis of a threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability and sequence profile. Tertiary structures of the aligned sequences are implied, but not modelled explicitly.

http://www.sander.ebi.ac.uk/hssp/
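
To give a feel for what a sequence profile and a variability measure look like, the sketch below computes per-column residue frequencies for a toy alignment and counts the distinct residues in each column. These are simplified stand-ins chosen for illustration, not HSSP's own definitions; the alignment and function names are invented.

```python
from collections import Counter

# Toy, made-up sequences aligned to one protein of known structure ("-" is a gap).
ALIGNED = ["MKV", "MRV", "MK-", "MKI"]


def sequence_profile(alignment):
    """Per-column residue frequencies, ignoring gap characters."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


def variability(alignment):
    """Number of distinct residues per column (a crude variability score)."""
    return [len(column) for column in sequence_profile(alignment)]


profile = sequence_profile(ALIGNED)
print(profile[1])
print(variability(ALIGNED))
```

A column with a single residue at frequency 1.0 is fully conserved; more distinct residues mean higher variability.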

FSSP (Fold classification based on Structure-Structure alignment of Proteins)
Based on an exhaustive all-against-all 3D structure comparison of the protein structures currently in the Protein Data Bank (PDB).

http://www.bioinfo.biocenter.helsinki.fi:8080/dali/

Pathway Information

KEGG (Kyoto Encyclopedia of Genes and Genomes)
It is a suite of databases and associated software that integrates knowledge of molecular interaction networks in biological processes, information about the universe of genes and proteins, and information about the universe of chemical compounds and reactions. It serves as a bioinformatics resource for understanding higher-order functional meanings and utilities of the cell or the organism from its genome information.

http://www.genome.ad.jp/kegg

Composite databases

A composite database joins a variety of different primary database sources, which obviates the need to search multiple resources.
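
One way to picture a composite database is as a merge of several primary sources in which identical sequences are stored once, with a note of where each came from. The sketch below assumes exact sequence identity as the merging rule; the source names, identifiers, and sequences are all invented, and real composite databases apply their own redundancy criteria.

```python
# Hypothetical entries from two primary sources; identifiers and sequences are invented.
SOURCE_A = [("A001", "MKTLLVAG"), ("A002", "MSDNE")]
SOURCE_B = [("B107", "MKTLLVAG"), ("B108", "MPQRW")]


def build_composite(**sources):
    """Merge sources, keeping one entry per distinct sequence and recording its origins."""
    composite = {}
    for source_name, entries in sources.items():
        for entry_id, sequence in entries:
            composite.setdefault(sequence, []).append((source_name, entry_id))
    return composite


composite = build_composite(source_a=SOURCE_A, source_b=SOURCE_B)
for sequence, origins in composite.items():
    print(sequence, origins)
```

A single search of the merged table then covers both sources, and the shared sequence appears only once.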

For more database listings, see:

http://www.oxfordjournals.org/nar/database/cap/
