
Overview of Molecular Structure Databases

June 6, 2021, by admin

  1. Protein Data Bank (PDB)

The Protein Data Bank (PDB) is a database of experimentally determined three-dimensional structures of biological macromolecules, most of them solved by X-ray crystallography. It is coordinated by a consortium with members in the USA, Europe, and Japan [wwpdb] (Berman et al. 2000). Probably the best-known web page of the PDB is that of the Research Collaboratory for Structural Bioinformatics [pdb]. The PDB was founded in 1971 at the Brookhaven National Laboratory, which is why the name Brookhaven Protein Data Bank is still frequently used. The stored structures are predominantly proteins, but also include DNA and RNA structures and protein–nucleic acid complexes. Structures of other macromolecules, e.g., glycopeptides or polysaccharides, constitute only a very small proportion.

The PDB offers a number of query options. A text-based search for a PDB ID or a keyword can be initiated on the main page. Furthermore, additional search options are available on the database search page, including detailed keyword and BLAST queries. A database record summarizes all of the information in the file, which is then elaborated on subsequent pages. In addition, the molecular structure can be visualized by means of different applets.
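The distinction between a PDB ID query and a keyword query can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the pattern covers the classic four-character IDs (a digit followed by three letters or digits), and the download URL reflects the RCSB file server layout as assumed at the time of writing.

```python
import re

# Classic PDB IDs: one digit followed by three letters or digits (e.g. 4HHB).
PDB_ID_PATTERN = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def is_pdb_id(query: str) -> bool:
    """Return True if the query looks like a classic four-character PDB ID."""
    return bool(PDB_ID_PATTERN.match(query.strip()))


def classify_query(query: str) -> str:
    """Decide whether a text query should be treated as an ID or a keyword."""
    return "pdb_id" if is_pdb_id(query) else "keyword"


def entry_file_url(pdb_id: str) -> str:
    """Build the assumed RCSB download URL for an entry in PDB format."""
    if not is_pdb_id(pdb_id):
        raise ValueError(f"not a valid PDB ID: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id.strip().upper()}.pdb"


if __name__ == "__main__":
    for text in ("4HHB", "hemoglobin"):
        print(text, "->", classify_query(text))
    print(entry_file_url("4hhb"))
```

The sketch performs no network access; it only shows how an ID can be recognized and turned into a file location.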

  2. SCOP

Proteins that perform a similar biological function and are evolutionarily related must have a similar structural organization, at least in the region of their active centers. It should, therefore, be possible to predict the function of an unknown protein by comparing its structural organization with that of known proteins. Two databases, SCOP and CATH, support such predictions. SCOP (Structural Classification Of Proteins) [scop] classifies proteins of known structure in a hierarchical manner. The three main classification levels are families, superfamilies, and folds. Families describe proteins with a clear evolutionary relationship to each other and are delimited by a sequence identity of at least 30% over the total length of the proteins. Nevertheless, proteins that fall below this limit can be included in a family if relatedness can be shown through proven similar structures and functions. Proteins with a very low sequence identity to one another, whose structural and functional properties nevertheless suggest a relationship, are assigned to superfamilies. Proteins that have the same arrangement of secondary structure elements in the same topology are classified as folds. Here it is unimportant whether the proteins have a functional relationship or whether the similarity of the fold is based on physico-chemical principles.
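The 30% identity criterion for families can be illustrated with a small Python sketch. It is not SCOP's actual procedure: the function names are hypothetical, the two sequences are assumed to be already aligned (gaps written as `-`), and "total length" is interpreted here as the length of the longer ungapped sequence.

```python
FAMILY_IDENTITY_THRESHOLD = 30.0  # percent, as stated in the text


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two pre-aligned sequences.

    Identical residues are counted per alignment column and divided by the
    length of the longer ungapped sequence (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    longer = max(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    if longer == 0:
        return 0.0
    return 100.0 * matches / longer


def is_family_candidate(aligned_a: str, aligned_b: str) -> bool:
    """True if the pair meets the 30% sequence identity criterion."""
    return percent_identity(aligned_a, aligned_b) >= FAMILY_IDENTITY_THRESHOLD


if __name__ == "__main__":
    print(percent_identity("ACDE-FGH", "ACDEKFGH"))
    print(is_family_candidate("ACDEFGHIKL", "ACXXXXXXXX"))
```

A pair that fails this test is not necessarily excluded from a family, since structural and functional evidence can still justify inclusion.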

3. CATH

The CATH database [cath] classifies protein structures hierarchically into four categories: Class (C), Architecture (A), Topology (T), and Homologous Superfamily (H).
Classification of proteins into the Class category is primarily automatic, but can be complemented manually when required. In the Class category, the proportion of secondary structure elements is taken into account without consideration of their arrangement or connections.

Four classes of proteins are distinguished: proteins that are composed mainly of helices (mainly-alpha), mainly of sheets (mainly-beta), or of both helices and sheets (alpha–beta), and, finally, proteins with a small number of secondary structure elements.

The Architecture category describes the arrangement of secondary structure elements relative to one another and is curated manually. Its categorization uses simple descriptors such as barrel, sandwich, beta-propeller, etc. In the Topology category, the shape of the protein and the interconnections of secondary structure elements are described. Its categorization is based on an algorithm that uses empirically derived domain classification parameters.
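How a Class assignment based purely on the proportion of secondary structure elements might look can be sketched as follows. The cut-off values and the function name are hypothetical and chosen only for illustration; they are not the parameters CATH actually uses. The input is assumed to be a per-residue secondary structure string in which `H` marks helix, `E` marks strand, and any other character marks neither.

```python
def assign_class(secondary_structure: str,
                 min_structured: float = 0.20,
                 minor_fraction: float = 0.10) -> str:
    """Assign one of the four classes from a per-residue H/E/other string.

    Both thresholds are illustrative assumptions, not CATH's real values.
    """
    length = len(secondary_structure)
    if length == 0:
        raise ValueError("empty secondary structure string")
    helix = secondary_structure.count("H") / length
    strand = secondary_structure.count("E") / length

    # Too little helix and strand overall: fourth class.
    if helix + strand < min_structured:
        return "few secondary structures"
    # Strand content negligible: helices dominate.
    if strand < minor_fraction:
        return "mainly-alpha"
    # Helix content negligible: sheets dominate.
    if helix < minor_fraction:
        return "mainly-beta"
    return "alpha-beta"


if __name__ == "__main__":
    print(assign_class("H" * 40 + "C" * 10))
    print(assign_class("H" * 20 + "E" * 20 + "C" * 10))
```

Only the composition enters the decision; the order and connectivity of the elements are deliberately ignored, which matches the description of the Class level.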

The Homologous Superfamily category encompasses homologous protein domains, i.e., domains with a common origin. Similarity is determined by sequence comparison, followed by a structure comparison based on the classification in the Topology category. In addition to these four categories (whose first letters form the database name), a fifth category, the Sequence Families, has been defined. Domains are classified here based on high sequence identity (at least 35% identity over 60% of the length of the larger domain) and, thus, will likely possess similar functions.
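The Sequence Family criterion combines an identity threshold with a coverage threshold, which can be written down as a short check. This is a sketch under stated assumptions: the function name is hypothetical, identity is computed over the aligned region, and coverage is the aligned length divided by the length of the larger domain.

```python
def same_sequence_family(identical_positions: int,
                         aligned_length: int,
                         length_a: int,
                         length_b: int,
                         min_identity: float = 0.35,
                         min_coverage: float = 0.60) -> bool:
    """Check the 35% identity / 60% coverage criterion for two domains."""
    if aligned_length <= 0:
        return False
    larger = max(length_a, length_b)
    identity = identical_positions / aligned_length   # fraction of identical residues
    coverage = aligned_length / larger                # share of the larger domain aligned
    return identity >= min_identity and coverage >= min_coverage


if __name__ == "__main__":
    # 40 identical residues in a 100-residue alignment, domains of 120 and 150 residues
    print(same_sequence_family(40, 100, 120, 150))
    # Same alignment, but the larger domain has 200 residues, so coverage is too low
    print(same_sequence_family(40, 100, 120, 200))
```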

4. PubChem

The PubChem database at NCBI [pubchem] stores small molecules and information about their biological activities. It consists of three components: PubChem Compound, PubChem Substance, and PubChem BioAssay.

A query can be performed graphically via a molecular structure editor that allows the desired (partial) structure to be drawn. Furthermore, PubChem Compound allows searching for molecules that meet certain physicochemical criteria, e.g., a particular molecular weight range, a given number of hydrogen bond acceptors or donors, a certain log P range, etc.
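A search by physicochemical criteria amounts to filtering records on a handful of numeric properties. The following sketch shows the idea on an in-memory list; the record type, the function names, and the three example records with their property values are invented placeholders, not data taken from PubChem.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class CompoundRecord:
    """Hypothetical, simplified compound record."""
    name: str
    molecular_weight: float
    h_bond_donors: int
    h_bond_acceptors: int
    log_p: float


def filter_compounds(compounds: Iterable[CompoundRecord],
                     mw_range: Tuple[float, float],
                     max_donors: int,
                     max_acceptors: int,
                     log_p_range: Tuple[float, float]) -> List[CompoundRecord]:
    """Keep the compounds whose properties fall inside all given limits."""
    return [
        c for c in compounds
        if mw_range[0] <= c.molecular_weight <= mw_range[1]
        and c.h_bond_donors <= max_donors
        and c.h_bond_acceptors <= max_acceptors
        and log_p_range[0] <= c.log_p <= log_p_range[1]
    ]


# Placeholder records with invented values, for illustration only.
EXAMPLES = [
    CompoundRecord("example_1", 180.2, 1, 4, 1.2),
    CompoundRecord("example_2", 650.8, 6, 12, 5.9),
    CompoundRecord("example_3", 310.4, 2, 5, 3.1),
]

if __name__ == "__main__":
    hits = filter_compounds(EXAMPLES, (150.0, 500.0), 5, 10, (-0.5, 5.0))
    print([c.name for c in hits])
```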

PubChem Substance permits the search for substances produced by various manufacturers, compounds of unknown composition, and natural substances of unknown 2D molecular structure. The records of both databases are linked to each other and, provided that the respective data are present, include a link to the third database, PubChem BioAssay. PubChem BioAssay records information on biological assays and on the molecules that have been tested in these systems. This database can be queried via text search in the Entrez system.

The PubChem databases have multiple applications due to internal and external database linking, including links to PubMed. For example, starting from a known enzyme inhibitor, it is possible to find other similar potential inhibitors. Furthermore, small molecules with different structures that have been shown to have similar effects in a biological test system can be identified.
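Finding compounds similar to a known inhibitor is commonly done by comparing structural fingerprints with the Tanimoto coefficient. The text does not say which measure is used here, so the sketch below is an assumption for illustration: fingerprints are represented as sets of feature indices, and all names, fingerprints, and the 0.5 cut-off are hypothetical.

```python
from __future__ import annotations


def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """Tanimoto coefficient: shared features divided by all features present."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


def rank_similar(query_fp: set[int],
                 library: dict[str, set[int]],
                 threshold: float = 0.5) -> list[tuple[str, float]]:
    """Return library entries at or above the threshold, most similar first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    known_inhibitor = {1, 2, 3, 4}
    # Invented fingerprints for three hypothetical library compounds.
    library = {
        "candidate_a": {1, 2, 3, 4, 5},
        "candidate_b": {1, 2, 8, 9},
        "candidate_c": {2, 3, 4},
    }
    for name, score in rank_similar(known_inhibitor, library):
        print(f"{name}: {score:.2f}")
```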
