Understanding NCBI: A Beginner’s Guide to Bioinformatics Tools and Resources
September 11, 2023
NCBI: Bioinformatics Resources
Besides being a data repository, NCBI offers a variety of user-friendly tools, educational materials, and even specialized software for the coding community. Ready to explore? Their official website is your gateway to these resources.
Meet Entrez: Your Bioinformatics Search Engine
Among NCBI’s invaluable tools is the Entrez database system. This is more than a simple search tool; it’s a hub linking 38 different databases containing a mind-boggling 2.5 billion entries. If you’re looking for information on genes, proteins, or even health-related literature, Entrez has you covered. It enables you to perform straightforward text searches, download useful data in different formats, and even find interrelated database records. If you’re a programmer, you can access this system through a special set of tools known as E-utilities.
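For programmers, a text search through E-utilities comes down to an HTTP request against the ESearch service. The following is a minimal sketch rather than official client code: the helper names build_esearch_url and parse_id_list are invented for this example, and the sample response is hand-written to mimic the JSON layout that ESearch returns (its IDs are not real records).

```python
import json
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Return an ESearch URL for a text query against one Entrez database."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_id_list(response_text):
    """Extract the list of record IDs from an ESearch JSON response."""
    payload = json.loads(response_text)
    return payload["esearchresult"]["idlist"]


# Hand-written sample mimicking the response shape; the IDs are invented.
sample_response = '{"esearchresult": {"count": "2", "idlist": ["111", "222"]}}'

url = build_esearch_url("gene", "BRCA1 AND human[Organism]", retmax=5)
ids = parse_id_list(sample_response)
print(url)
print(ids)
```

To run a live search, the URL could be opened with urllib.request.urlopen and the response body passed to parse_id_list. Building the URL separately from fetching keeps the example testable without network access.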
Tracing the Origins of NCBI Data
You may be wondering how NCBI accumulates all this data. Well, the answer is multi-faceted. One route is through individual submissions from researchers across the globe. Another is via collaborations with both national and international entities. Additionally, NCBI has a team dedicated to internally curating data to ensure its accuracy. For example, it manages GenBank, a significant public sequence database, and cooperates with international counterparts like the European Nucleotide Archive and the DNA Data Bank of Japan.
As you take your first steps into bioinformatics, NCBI is an indispensable resource to be aware of. It’s a haven for researchers, students, and developers alike. With its multifaceted Entrez database and rich tapestry of data sources, NCBI serves as a comprehensive platform for anyone curious about molecular biology and bioinformatics. Don’t wait any longer; dive deep into the NCBI universe today and uncover knowledge that might just revolutionize scientific understanding!
Unpacking the Latest in PubMed and Beyond
The Evolving World of PubMed
PubMed is constantly evolving. With a massive archive of over 28 million biomedical articles as of August 2018, improving search capabilities has been a never-ending project.
Introducing PubMed Labs
Late in 2017, a new initiative called PubMed Labs was launched. It serves as a testing ground for fresh search features and tools. For instance, it now shows helpful snippets from articles in your search results.
Expanding the Reach of PubMed Central
PubMed Central (PMC) hit a milestone in July 2018, crossing the 5 million-article mark. This was made possible through various efforts, including ongoing digitization projects, more research funding agencies mandating public access, and a growing number of journals wanting to be part of PMC. To help publishers understand what it takes to be included, PMC has released comprehensive guidelines focusing on scientific quality and editorial standards.
PMC’s Treasure Trove for Text Miners
Although many articles in PMC are under traditional copyright laws, there are collections where bulk downloading is allowed for text mining and similar activities. The biggest of these is the Open Access Subset, which surpassed 2 million articles in May 2018. These collections are incredibly diverse, containing centuries-old biomedical journals in machine-readable formats like XML or text.
What’s New in NCBI Bookshelf?
NCBI’s Bookshelf is another resource that’s been expanding its offerings. It now hosts more than 6,000 books and documents from over 150 sources in life sciences and healthcare. Bookshelf’s search capabilities have also been upgraded, now including Medical Subject Headings (MeSH) to make it easier for users to find exactly what they need. Moreover, an Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) service has been added, allowing even better access to the metadata and a subset of the full text of items in the archive.
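OAI-PMH requests are ordinary HTTP GET requests that carry a verb parameter, so harvesting metadata mostly means building the right URLs. The sketch below assumes a placeholder base address (example.org); the real Bookshelf endpoint is not given in this article and should be taken from NCBI's documentation. The function names are likewise invented for illustration.

```python
from urllib.parse import urlencode

# Placeholder only: the real Bookshelf OAI-PMH address must come from NCBI's documentation.
BASE_URL = "https://example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # selective harvesting by datestamp
    return base_url + "?" + urlencode(params)


def resume_url(base_url, token):
    """Build the follow-up request for the next page of a large result set."""
    # In the protocol, resumptionToken is an exclusive argument: only the verb accompanies it.
    return base_url + "?" + urlencode({"verb": "ListRecords", "resumptionToken": token})


first_page = list_records_url(BASE_URL, from_date="2018-01-01")
print(first_page)
```

A harvester would fetch the first page, read the resumptionToken element from the XML response if one is present, and keep calling resume_url until no token is returned.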
Genome revisions
Sequence database search
The National Center for Biotechnology Information (NCBI) is modernising the sequence search experience by adding support for more natural language queries for gene, transcript, protein, and assembly data and by returning results that emphasise high-value content. A new search service, operating in parallel with NCBI’s Entrez search, identifies queries that frequently fail to return results in Entrez or that only return results when run in particular Entrez databases. These include several query types: ‘organism-gene’ (e.g. human BRCA1), ‘organism-transcript’ (e.g. rodent p53 transcripts), and ‘organism-assembly’ (e.g. dog reference genome). In addition, NCBI now provides enhanced search capabilities for curated gene sets that are part of the RefSeq Targeted Loci Project. These gene sets include ribosomal RNA genes in bacteria, archaea, and fungi, as well as internal transcribed spacer regions in fungi and oomycetes. Featured search results are displayed at the top of the Gene, Nucleotide, Protein, Assembly, and Genome database pages, as well as the global search results page. Efforts are ongoing to further improve the search experience by refining query recognition and featured content.
NCBI has made several classes of viral and viroid genome data available through the NCBI Assembly resource (www.ncbi.nlm.nih.gov/assembly/) to make these genomes easier to view and download. The resource retrieves all nucleotide records that constitute a single genome. Individual segment sequences are aggregated into a single genome constellation represented by a single accession, which is especially advantageous for segmented viruses.
Viral assemblies can be found by searching the Assembly resource (www.ncbi.nlm.nih.gov/assembly) for viruses. These include viral RefSeq sequences (10) and GenBank sequences designated as viral species exemplars by the International Committee on Taxonomy of Viruses (ICTV) (11). A ‘reference’ subset of the RefSeq assemblies contains experimentally supported and manually curated annotations and is meant to provide high-quality reference templates for viral annotation. GenBank assemblies include the species exemplars chosen by the ICTV, assemblies in GenBank that were used to make RefSeq assemblies, and a collection of complete viral genomes that have been validated by NCBI procedures. The initial scope of these NCBI-validated GenBank genomes is limited to a few viral taxa, but it is being expanded and is intended to eventually encompass all viruses.
In addition to including more viral genomes, NCBI has made several enhancements to the Assembly resource that make it easier to locate and download genome data sets of interest. New annotation status filters let users select genome assemblies that have annotations. Further filters make it simple to restrict search results to assemblies derived from type strains or from ICTV species exemplars. UCSC assembly names, as used in the UCSC Genome Browser, have been added as searchable synonyms for the majority of recent assemblies. In addition, new file formats have been added to the ‘Download Assemblies’ interface, including a ‘Feature count’ file containing counts of specific gene, RNA, and CDS features and a ‘Translated CDS’ file containing conceptual translations of each CDS feature on the genome.
NCBI now employs the Average Nucleotide Identity method, with optimal threshold ranges for prokaryotic organisms, to assess all prokaryotic genome assemblies in GenBank and to correct misassigned names by comparison with genomes from type strains. This is the result of a 2015 NCBI workshop in which several members of the bacteriology community participated and was recently described in greater detail.
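A conceptual translation, as found in a ‘Translated CDS’ file, reads a coding sequence three bases at a time and maps each codon to an amino acid. The sketch below uses the standard genetic code written in the conventional TCAG order. It illustrates the idea only and is not NCBI's pipeline: it ignores alternative genetic codes, alternative start codons, and partial features, and the function name translate_cds is invented for this example.

```python
from itertools import product

# Standard genetic code, written in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a CDS codon by codon; unrecognised codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


protein = translate_cds("ATGGCCAAATAA")
print(protein)
```

The stop codon is kept as an asterisk here so that truncated or internal stops remain visible in the output.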
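The idea behind Average Nucleotide Identity can be shown with a toy calculation: percent identity is computed for each aligned fragment pair and then averaged. This sketch makes strong simplifying assumptions. Real ANI tools first fragment and align whole genomes, whereas the fragments here are assumed to be pre-aligned and of equal length. The threshold value is a hypothetical placeholder, not one of the ranges NCBI applies.

```python
def fragment_identity(a, b):
    """Percent identity between two pre-aligned, equal-length fragments."""
    if len(a) != len(b) or not a:
        raise ValueError("fragments must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def average_nucleotide_identity(fragment_pairs):
    """Mean percent identity across aligned fragment pairs."""
    identities = [fragment_identity(a, b) for a, b in fragment_pairs]
    return sum(identities) / len(identities)


# Hypothetical cut-off for illustration only; not an NCBI threshold.
HYPOTHETICAL_THRESHOLD = 96.0

pairs = [("ACGTACGTAC", "ACGTACGTAC"), ("ACGTACGTAC", "ACGTACGTAT")]
ani = average_nucleotide_identity(pairs)
matches_type_strain = ani >= HYPOTHETICAL_THRESHOLD
print(ani, matches_type_strain)
```

In this toy input one fragment is identical and the other differs at one of ten positions, so the assembly would fall below the placeholder cut-off and be flagged for a closer look at its assigned name.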
SKESA de-novo assemblies in SRA
SKESA is an NCBI-developed de Bruijn graph-based de-novo assembler for assembling Illumina reads of microbial genomes. NCBI utilises SKESA to support both SRA and the Pathogen Detection project (http://www.ncbi.nlm.nih.gov/pathogens/). Over 270 000 read sets within SRA now include SKESA assemblies, which are downloadable. On the Download pane of the SRA Run Browser, runs that contain SKESA assemblies have a run.realign file listed. For instance, the run SRR498276 has a file named SRR498276.realign listed on this page: trace.ncbi.nlm.nih.gov/Traces/sra?run=SRR498276. The SKESA source code is freely available at github.com/ncbi/SKESA/releases.
Genome data viewer
New features in NCBI’s genome browser, Genome Data Viewer (GDV) (www.ncbi.nlm.nih.gov/genome/gdv), provide additional ways to analyse genomic data. GDV has added options that make it easier to analyse user-provided data. In addition to uploading files, users can now connect to files hosted on remote servers or that are part of track hubs (15). Once connected, these externally supplied data appear as tracks alongside NCBI’s own track offerings and can be included in publication-quality PDF downloads from the browser. The BLAST widget incorporates genomic BLAST into GDV’s graphical display, enabling users to view their existing results as browser tracks or run new queries directly from the browser. The corresponding BLAST Alignment Inspector offers a graphical representation of the relationship between alignment results and NCBI RefSeq annotations. Additionally, BLAST result pages now include links to GDV views of sequences aligned to a genome assembly or to RefSeq annotations on an assembly.
Data browsers for the Genome and BioProject databases
The BioProject and Genome databases have redesigned interfaces and backends that make the data easier to browse. BioProject, which organises metadata associated with research projects, and Genome, which compiles data associated with genomes, serve as entry points to a multitude of other NCBI resources. The new targeted search interfaces for these two resources (www.ncbi.nlm.nih.gov/bioproject/browse and www.ncbi.nlm.nih.gov/genome/browse#!/overview/) share a similar look and feel, and allow users to begin exploring by text searching or by filtering the data according to relevant categories, such as taxonomic restrictions. The tabular display of the results is highly customizable, with sortable columns and a variety of parameters available for display. Each retrieved record is linked to other NCBI data, such as specific records in BioProject, Genome, Taxonomy, and PubMed, and results can be downloaded as tab-delimited files. The documentation provides additional information about the BioProject browser (www.ncbi.nlm.nih.gov/bioproject/docs/faq/#questions-about-the-browse-page).
dbSNP
The Database of Single Nucleotide Polymorphisms (dbSNP) is a repository of genetic variations shorter than 50 base pairs. The human data alone has more than quadrupled in size in less than a year, from 150 million Reference SNP (RS) records in Build 149 to over 650 million RS records in Build 151. Moreover, Build 151 contains frequency information for more than 580 million of these RS records. In the past year, two significant changes were made to dbSNP to address the challenges of processing, annotating, and exchanging a growing volume of data. First, dbSNP and EMBL-EBI signed a new agreement to share responsibility for managing data from worldwide genetic variation experiments. dbSNP now manages only human data, while data for all non-human organisms have been transferred to the EMBL-EBI European Variation Archive (EVA) (ncbiinsights.ncbi.nlm.nih.gov/2017/05/09/phasing-out-support-for-non-human-genome-organism-data-in-dbsnp-and-dbvar/). Second, dbSNP now represents variants using the new SPDI data model (www.ncbi.nlm.nih.gov/variation/notation) and provides a new API based on this data model. In addition, dbSNP released a new RefSNP page for displaying variants in web browsers (ncbiinsights.ncbi.nlm.nih.gov/2017/07/07/dbsnp-redesign-supports-future-data-expansion/).
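To give a feel for what a de Bruijn graph-based assembler works with, the sketch below builds a tiny graph from reads and walks an unambiguous path to produce a contig. It is a teaching example under simplified assumptions (error-free reads, no reverse complements, no coverage filtering) and does not represent SKESA's actual algorithm. The reads and function names are invented.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in seen:  # stop rather than loop forever on a cycle
            break
        contig += next_node[-1]
        seen.add(next_node)
        node = next_node
    return contig


# Invented, overlapping, error-free reads.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

The walk stops as soon as a node has zero or several successors, which is where a real assembler would need extra evidence such as read coverage to decide how to proceed.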
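An SPDI expression has four colon-separated parts: the sequence identifier, a 0-based position, the deleted sequence, and the inserted sequence. The sketch below parses such an expression and applies it to a reference string. It is an illustration only and does not use the dbSNP API; the identifier SEQ1, the reference sequence, and the function names are invented, and the deletion is assumed to be spelled out as bases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spdi:
    sequence: str   # sequence identifier
    position: int   # 0-based position where the deletion starts
    deletion: str   # bases removed from the reference
    insertion: str  # bases put in their place


def parse_spdi(text):
    """Split 'sequence:position:deletion:insertion' into its four parts."""
    parts = text.split(":")
    if len(parts) != 4:
        raise ValueError("expected four colon-separated fields: " + text)
    sequence, position, deletion, insertion = parts
    return Spdi(sequence, int(position), deletion, insertion)


def apply_spdi(reference, variant):
    """Return the reference with the variant applied, checking the deleted bases."""
    start = variant.position
    end = start + len(variant.deletion)
    if reference[start:end] != variant.deletion:
        raise ValueError("deleted sequence does not match the reference")
    return reference[:start] + variant.insertion + reference[end:]


# Invented identifier and sequence, for illustration only.
variant = parse_spdi("SEQ1:3:T:C")
alt = apply_spdi("ACGTACGT", variant)
print(variant, alt)
```

Because the deletion and insertion are both explicit, the same four fields can describe substitutions, pure insertions (empty deletion), and pure deletions (empty insertion).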
The NCBI dbVar Structural Variant database contains human genomic structural variants (SV) longer than 50 base pairs. Users can search, view, and download variant data from over 150 studies from the dbVar homepage (www.ncbi.nlm.nih.gov/dbvar), including 1000 Genomes Phase 3 (estd219), Simons Genome Diversity Project (nstd128), ClinGen (nstd45), ExAC (nstd151), and many others. Users can access variants via the graphical Study Browser or Genome Browser. Individual study and variant pages contain hyperlinks to raw data as well as relevant information from other NCBI and external resources. Downloads of bulk data are accessible through FTP (ftp.ncbi.nlm.nih.gov/pub/dbVar/data).
In 2018, dbVar released a new comprehensive set of non-redundant structural variants (NR SV) comprising unique insertions, duplications, and deletions. These compressed files can be used as references for the analysis of human SV, including filtering and annotating other SV datasets, SV discovery, and identification of rare and/or clinical SV. The dbVar NR SV set presently comprises more than 2.2 million deletions, 1.1 million insertions, and 300 thousand duplications, and it will be routinely updated as new variants are added to dbVar. Additional information about NR SV, as well as access to the NR SV FTP files, is available at github.com/ncbi/dbvar/tree/master/Structural_Variant_Sets.
BLAST revisions
NCBI has published a new version of the BLAST databases (version 5) that includes several improvements. First, the standalone BLAST+ executables (16) can now restrict searches by taxonomy without requiring the installation of additional files. Taxonomy can be used to include or exclude subject sequences from the search. Second, the new database version employs LMDB (Lightning Memory-Mapped Database) to execute sequence lookups by accession more quickly. The version 5 databases are only compatible with BLAST+ versions 2.8.0 and later.
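With a version 5 database, a taxonomy restriction is passed to the BLAST+ executables as a command-line option. The sketch below only assembles the argument list for blastn and does not run it. The query file, database name, and output file are hypothetical placeholders, while -taxids and -negative_taxids are the options for including or excluding subject sequences by taxonomy ID.

```python
def blastn_command(query, db, taxids=(), exclude=False, out="results.tsv"):
    """Assemble a blastn argument list, optionally restricted by taxonomy IDs."""
    cmd = ["blastn", "-query", query, "-db", db, "-outfmt", "6", "-out", out]
    if taxids:
        # Include only the listed taxa, or exclude them when exclude=True.
        flag = "-negative_taxids" if exclude else "-taxids"
        cmd += [flag, ",".join(str(t) for t in taxids)]
    return cmd


# Hypothetical file and database names; 9606 is the taxonomy ID for human.
include_human = blastn_command("query.fa", "my_v5_db", taxids=[9606])
exclude_human = blastn_command("query.fa", "my_v5_db", taxids=[9606], exclude=True)
print(" ".join(include_human))
```

On a machine with BLAST+ 2.8.0 or later and a version 5 database installed, the list could be handed to subprocess.run(cmd, check=True).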
IgBLAST
IgBLAST (17), the NCBI tool for analysing immunoglobulin and T cell receptor sequences, has undergone significant revisions over the past year. First, IgBLAST can now process a large volume of queries more efficiently using a multithreaded approach. Second, when the user specifies an SRA accession on the command line, IgBLAST can retrieve the reads directly from the SRA database, so the user no longer has to download the sequences first. Third, IgBLAST now supports the AIRR (Adaptive Immune Receptor Repertoire) rearrangement format. This format is a standard maintained by the AIRR community (docs.airr-community.org/en/latest/) and is intended for repertoire studies employing next-generation sequencing technology.
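AIRR rearrangement output is a tab-separated table with one row per sequence, so it can be read with standard tooling. The sketch below filters productive rearrangements from a small hand-written sample. The rows are invented, only a handful of the standard columns are shown, and the function name is made up for this example.

```python
import csv
import io

# Invented sample rows using a few AIRR rearrangement columns.
sample_tsv = (
    "sequence_id\tv_call\tj_call\tproductive\tjunction_aa\n"
    "read1\tIGHV1-2*02\tIGHJ4*02\tT\tCARDYW\n"
    "read2\tIGHV3-23*01\tIGHJ6*02\tF\t\n"
)


def productive_rearrangements(handle):
    """Return the rows whose 'productive' field is true (encoded as 'T')."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["productive"] == "T"]


rows = productive_rearrangements(io.StringIO(sample_tsv))
v_genes = [row["v_call"] for row in rows]
print(v_genes)
```

For real data, the same function could be given an open file handle to an IgBLAST result written in AIRR format instead of the in-memory sample.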
Protein advances
In April 2018, NCBI released an updated version of iCn3D (2.0) with additional features and enhanced performance. iCn3D provides equivalent functionality to Cn3D, NCBI’s standalone structure viewer, but operates directly in web browsers and does not require application installation. Molecular Modeling Database (MMDB) structure summary pages now embed interactive iCn3D views, and iCn3D visualises the results of 3D structure comparisons computed by VAST+. iCn3D can display 3D structures, 2D interaction schematics, and protein/nucleotide sequences simultaneously, as well as import annotations such as sequence variants, protein domains, and functional and binding sites. Interactions between the displays facilitate a variety of selection, highlighting, and analysis operations. iCn3D now supports the export of stereolithography (STL) or Virtual Reality Modeling Language (VRML) files for 3D printing and can generate shareable links for custom displays (e.g., https://d55qc.app.google.com/HDuWMFAVokxvHMKSA). The iCn3D source code can be found at https://github.com/ncbi/icn3d.
Chemical innovations
PubChem (18–20) (pubchem.ncbi.nlm.nih.gov) provides chemical information for over 96 million compounds compiled from over 620 data sources. PubChem has undergone several significant enhancements over the past year. A data contribution from Bio-Rad’s SpectraBase provided more than 630 000 spectra images for more than 225 000 compounds, along with relevant metadata. In addition, Springer Nature has contributed more than 28 million links between more than 610 000 compounds and more than four million scientific articles, with weekly updates. Two million of these links lead to over 350 thousand open-access or free-to-read articles.
A collection of dyad pages is now available (pubchemdocs.ncbi.nlm.nih.gov/bioactivity-dyad-pages) to facilitate access to bioactivity-related content. These pages provide fast access to bioactivity information for a particular chemical tested in an assay, as well as information valuable for interpreting the bioactivity of a compound or constructing a structure-activity relationship for a given gene or protein target.
New co-occurrence knowledge panels (pubchemdocs.ncbi.nlm.nih.gov/knowledge_panels) list the chemicals that most frequently co-occur with a given compound in PubMed articles. For further analysis, users can obtain the list of PubMed articles that mention both chemicals. PubChem’s blog (pubchemblog.ncbi.nlm.nih.gov) contains additional information regarding these and other recent developments.