Bioinformatics Tools for Sequence Analysis
August 16, 2023
1.0 Introduction
Bioinformatics is a new and fascinating field of study that combines computational and biological knowledge to create scientific tools and computational methods that make scientific research more efficient. These methods are also used to predict genes and proteins, establish evolutionary relationships between different types of organisms, analyze the sequences of biological molecules, and more.
Three scientific developments have advanced the study of genomics and proteomics. The first is high-performance computing technology: tools that can analyze biological samples quickly and with little manual effort. The second is in bioinformatics itself: bioinformaticians build computational tools to store, organize, and analyze massive amounts of data. The third is the formation of teams of specialists from different backgrounds who together cover computer science, mathematics, chemistry, and biology.
The computational power of bioinformatics tools, such as sequence analysis packages (software), enables the study of genes and how they interact with one another, as well as the comparison of genomes from various species. Systems biology aims to better understand natural biological systems by studying the interactions among a system's parts. This field generates large volumes of data, so databases such as PubMed and GenBank, maintained by the NCBI (National Center for Biotechnology Information), are essential for storing it.
2.0 Sequence Analysis Tools
Sequence analysis is a computational method of analyzing DNA, RNA, and peptide sequences to understand their features, functions, relationships, structures, and evolutionary changes. Tools made for efficient sequence analysis are called sequence analysis tools. They can be physical, like a DNA sequencing machine, or online software built from algorithms and blocks of code, like BLAST.
2.1 Pairwise sequence alignment
Pairwise means one pair at a time. Pairwise sequence alignment is used to quantify the similarity between pairs of DNA, RNA, or protein sequences. This similarity is measured using a matching score, which rewards matches and penalizes the mutations (mismatches and gaps) needed to transform one sequence into the other (Hou et al., 2016).
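As a minimal sketch of how such a matching score can be computed, the following uses the classic Needleman-Wunsch dynamic programming recurrence for global alignment. The function name and the scoring values (match +1, mismatch -1, gap -2) are illustrative assumptions, not values taken from any tool described here.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (Needleman-Wunsch).

    The scoring values are illustrative assumptions, not a standard.
    """
    # prev[j] is the best score for aligning the first i-1 letters of a
    # with the first j letters of b; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATACA"))
```

Only two rows of the table are kept in memory because the score alone is needed; recovering the alignment itself would require storing the full table and tracing back through it.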
2.1.1 Dotlet
Dotlet is an online software program that uses the diagonal plot method to compare DNA and protein sequences. It is based on Java and hosted on a web server maintained by the Swiss Institute of Bioinformatics (Junier & Pagni, 2000).
2.1.2 Dotlet examples
A dot plot is a schematic tool for comparing two biological sequences and distinguishing related regions. It is used to examine conserved regions, reverse matches, and repeats of sequences to study the evolutionary relationships of sequences (Krumsiek et al., 2007).
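The idea behind a dot plot can be sketched in a few lines. This is a simplified version that marks exact window matches only; Dotlet's own scoring is not described in this article, so the function names and the exact-match rule below are assumptions made for illustration.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return a boolean matrix for a simple dot plot.

    Cell [i][j] is True when the window starting at seq_y[i] is identical
    to the window starting at seq_x[j]. Exact matching is an assumption
    of this sketch.
    """
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [
        [seq_y[i:i + window] == seq_x[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(matrix):
    """Draw the matrix as text, one '*' per matching cell."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


print(render(dot_plot("GATTACAGATTACA", "GATTACA", window=3)))
```

Related regions show up as diagonal runs of dots, and a repeat in one sequence appears as parallel diagonals. Raising `window` suppresses the random single-letter matches that clutter plots of nucleotide sequences.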
2.1.3 Align
EBI maintains the pairwise sequence alignment web server, which is largely based on EMBOSS programs. It offers global alignment tools (Needle, Stretcher) and local alignment tools (Water, Matcher, LALIGN). In global alignment, the sequences are aligned over their entire length, so every amino acid or nucleotide takes part. Local alignment, by contrast, is more useful for finding similar sequence motifs within sequences that otherwise differ (Huang et al., 2019).
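To illustrate what makes an alignment local, here is a small sketch of the Smith-Waterman recurrence, the algorithm that Water is named after. It is not the EMBOSS implementation; the function name and scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score of the best local alignment of a and b (Smith-Waterman).

    Scoring values are illustrative assumptions. Scores are floored at
    zero, so an alignment can start and end anywhere in either sequence.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)  # the best cell may be anywhere
        prev = curr
    return best


# The shared motif ACGT is found even though the flanks are unrelated.
print(local_alignment_score("TTTTACGTGGGG", "CCCCACGTAAAA"))
```

The floor at zero and the search for the best cell anywhere in the table are the two features that distinguish this from a global alignment, where the score is read from the final cell only.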
2.2 Multiple sequence alignment
Multiple sequence alignment is a basic step in many bioinformatics pipelines, including phylogenetic estimation and analyses aimed at understanding proteins (Ute et al., 2019). It is the alignment of three or more protein or nucleic acid sequences, usually of similar length, to study evolutionary relationships and patterns between genes. Comparative structure and function analysis of biological sequences requires multiple sequence alignment programs.
2.2.1 COBALT
COBALT is a constraint-based alignment program that applies a general architecture for multiple protein sequence alignment. COBALT assembles a set of pairwise constraints based on database queries, sequence similarity, and user feedback, integrates them, and then incorporates them into a progressive multiple alignment.
2.2.2 Clustal Omega
Clustal Omega is a redesigned and updated version of the Clustal series of multiple sequence alignment programs. It uses the mBed algorithm for computing guide trees and is able to handle very large numbers of nucleotide or protein sequences.
2.2.3 Muscle
MUSCLE is a program, also available through a web server, that constructs multiple sequence alignments. It uses k-mer counting for quick sequence distance estimation, a profile function that calculates log-expectation scores, and tree-dependent partitioning of the sequences for refinement.
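The k-mer counting idea can be sketched as follows: two sequences that share many short words are probably similar, and this can be checked without aligning them. The distance formula below (one minus the fraction of shared k-mers in the shorter sequence) is an assumption made for illustration and is not claimed to be the exact measure MUSCLE uses.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Alignment-free distance in [0, 1] based on shared k-mers.

    The formula is a simple illustrative choice, not MUSCLE's own.
    """
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((counts_a & counts_b).values())
    smaller = min(sum(counts_a.values()), sum(counts_b.values()))
    if smaller == 0:
        raise ValueError("both sequences must be at least k letters long")
    return 1.0 - shared / smaller


print(kmer_distance("ACGTACGTAC", "ACGTTCGTAC"))
```

Because counting k-mers takes time proportional to sequence length, a full matrix of such distances is much cheaper to build than one based on pairwise alignments, which is why it suits a first, rough guide tree.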
2.2.4 MAFFT
MAFFT is a program that aligns multiple amino acid or nucleotide sequences using the fast Fourier transform (FFT). The FFT algorithm exploits the periodicity and symmetry of complex exponentials to compute correlations between sequences efficiently, which allows similar regions to be located quickly. For sequences of length n, the running time of an FFT-based correlation grows in proportion to n log n, rather than n squared as in a direct calculation.
2.2.5 KALIGN
Kalign is a multiple sequence alignment program that has been rewritten and updated. It is capable of aligning large numbers of protein or nucleotide sequences. EBI is responsible for maintaining this web server.
2.2.6 T-Coffee
T-Coffee is a web server built and maintained by Cedric Notredame at the Center for Genomic Regulation in Spain. It offers many resources for computing, analyzing, and modifying alignments of DNA and protein sequences and structures.
2.2.7 Weblogo
WebLogo is a web-based program that generates sequence logos. A sequence logo is a graphical depiction of a multiple sequence alignment of amino acids or nucleic acids. The University of California at Berkeley is in charge of its upkeep.
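In a sequence logo, the height of each letter stack is the information content of that alignment column, measured in bits. A minimal sketch of that calculation is given below; it assumes an ungapped DNA alignment and omits the small-sample correction that logo programs may apply. The function name is hypothetical.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=4):
    """Information content in bits for each column of an ungapped alignment.

    Uses log2(alphabet_size) minus the Shannon entropy of the column.
    No small-sample correction is applied (a simplifying assumption).
    """
    bits = []
    for column in zip(*alignment):
        total = len(column)
        entropy = -sum(
            (n / total) * math.log2(n / total)
            for n in Counter(column).values()
        )
        bits.append(math.log2(alphabet_size) - entropy)
    return bits


aligned = ["ACGT", "ACGA", "ACCC", "ATTG"]
for position, value in enumerate(column_information(aligned), start=1):
    print(position, round(value, 3))
```

A fully conserved DNA column reaches the maximum of 2 bits, while a column in which all four bases are equally frequent carries no information and would be drawn with zero height. For protein alignments, `alphabet_size` would be 20.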
2.2.8 Skylign
Skylign is a web server that has many benefits over other logo development tools. It can create a static picture file as well as an interactive web plot with zooming, scrolling, and value inspection for each letter stack (Wheeler et al., 2014). It also generates logos for both profile HMMs and sequence alignments in a single framework, which is a key feature of this service.
2.3 Sequence similarity search
Sequence similarity search is the first step in analyzing new protein, DNA, and RNA sequences: it finds related sequences in large sequence databases. The structural and functional annotations of related sequences may be used to infer the roles and structural features of new sequences (G. Hu & Kurgan, 2019).
2.3.1 NCBI BLAST
NCBI BLAST is a web server that finds regions of local similarity between protein or nucleotide sequences using its database search programs. The algorithm compares nucleotide or protein sequences to database sequences and calculates the statistical significance of the matches.
2.3.2 EBI Sequence Similarity Search
The EBI sequence similarity search homepage includes applications such as NCBI BLAST+, FASTA, and HMMER3, as well as annotation programs such as InterProScan 5.
2.3.3 Sanger BLAST
This is the Sanger Institute's entry page for the BLAST search services of its sequencing projects. It includes its own computing resources and special genome databases created from completed or ongoing sequencing projects.
2.3.4 Ensembl BLAST/BLAT
BLAST and BLAT are search tools found on the Ensembl platform. BLAT is a pairwise sequence alignment algorithm used to detect similarity in DNA and proteins, but it requires an exact or nearly exact match to get a hit, while BLAST finds regions of local similarity between protein or nucleotide sequences.
2.3.5 UCSC BLAT
UCSC BLAT is a web server that searches a genome for a query sequence (or multiple query sequences) to find its location.
2.3.6 HMMER Search
The HMMER Search program is a sequence search tool that is both fast and responsive. It enables the use of profile hidden Markov models for sequence analysis by combining advanced acceleration heuristics with mathematical and computational optimizations. EBI is in charge of maintaining it.
2.4 Web-based bioinformatics platforms
A web-based bioinformatics platform is software, written in one or more programming languages, that generates dynamic web pages and provides tools that are used over the internet. It is designed to extract information from biological databases such as those at NCBI and EBI in order to carry out sequence analysis, structural analysis, or biological data retrieval. The amount of data deposited in biological databases has led to the development of many bioinformatics platforms and programs to manage, validate, compare, and interpret this large volume of data (Paxman & Heras, 2017).
2.4.1 NCBI BLAST tools
This is the BLAST platform from NCBI, a suite of programs for aligning a nucleotide or protein query sequence against nucleotide or protein sequences in biological databases.
2.4.2 EBI EMBOSS tools
This is a web page for EMBOSS tools that includes pairwise sequence alignment, sequence translation, sequence statistics, and sequence format conversion tools.
2.4.3 EMBOSS Explorer
EMBOSS Explorer is a graphical user interface (GUI) to the EMBOSS bioinformatics tools, maintained by the Bioinformatics Center in the Netherlands.
2.4.4 WebLab
WebLab is a bioinformatics platform developed at Peking University. It is a data-centric knowledge-sharing portal that allows scientists to retrieve, manipulate, interpret, and exchange data. Databases are provided to store and manage input data as well as analysis results.
2.4.5 Pasteur Galaxy
Pasteur Galaxy is the Galaxy server developed by the Pasteur Institute in France. It is a web-based computational workbench for analyzing large biological datasets. The framework integrates analysis and visualization tools that can be accessed using a web browser.
2.5 Bioinformatics packages that can be downloaded and installed locally
Bioinformatics packages are software that provide computational tools for collecting and maintaining complex biological data. Specific algorithms have been developed and implemented in these packages to assist in understanding molecular biology in a technical way.
2.5.1 EMBOSS
EMBOSS stands for “The European Molecular Biology Open Software Suite”. It is an open-source software analysis package with tools that cope with data in different formats, together with a data retrieval system. It also allows software to be developed and released as open source.
2.5.2 InterProScan
InterProScan is a software package that uses scanning algorithms to identify protein families and domains by searching against the InterPro database.
2.5.3 HMMER
This is the web page for the HMMER package, which is used for sequence alignments and for searching sequence databases for homologs. It uses hidden Markov models, which are probabilistic models. HMMER is developed and maintained at the Howard Hughes Medical Institute.
2.5.4 Phylogeny software
This is a web page by Joe Felsenstein that lists phylogeny programs that meet some standard of quality or importance. It currently lists 392 phylogeny programs and more than 50 web servers.
2.6 Protein sequence analysis and function prediction
Protein sequence analysis is the process of analyzing the composition and functions of a protein or peptide sequence using computational methods. Sequence alignments and biological database searches are among the techniques used. Protein function prediction investigates protein function with computational tools that take a protein sequence as input. The output is a set of predicted Gene Ontology terms, grouped by functional category (Pourreza Shahri et al., 2019).
2.6.1 ProtParam
ProtParam is a program that calculates the physical and chemical parameters of a protein stored in Swiss-Prot or TrEMBL. Examples of the parameters are the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index, and grand average of hydropathicity (GRAVY).
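Two of these parameters are simple enough to sketch directly. The code below computes the amino acid composition and the aliphatic index using Ikai's formula, X(Ala) + 2.9 X(Val) + 3.9 (X(Ile) + X(Leu)), where X is the mole percent of each residue. The function names are hypothetical, and this is a reimplementation for illustration rather than ProtParam's own code.

```python
from collections import Counter


def amino_acid_composition(protein):
    """Mole percent of each residue present in the sequence."""
    counts = Counter(protein.upper())
    total = sum(counts.values())
    return {residue: 100.0 * n / total for residue, n in counts.items()}


def aliphatic_index(protein, a=2.9, b=3.9):
    """Aliphatic index: X(Ala) + a*X(Val) + b*(X(Ile) + X(Leu)).

    X is mole percent; a and b are the coefficients from Ikai's formula.
    """
    comp = amino_acid_composition(protein)
    return (
        comp.get("A", 0.0)
        + a * comp.get("V", 0.0)
        + b * (comp.get("I", 0.0) + comp.get("L", 0.0))
    )


sequence = "MKVLAAGIVALLLA"  # made-up example peptide
print(amino_acid_composition(sequence))
print(round(aliphatic_index(sequence), 2))
```

A higher aliphatic index reflects a larger relative volume of aliphatic side chains, which is generally associated with greater thermostability in globular proteins.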
2.6.2 ProtScale
ProtScale is a program that computes and plots the profile of a chosen amino acid scale along a selected protein, using a sliding-window method. Many different amino acid scales are available.
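The sliding-window method can be sketched as follows, using the Kyte-Doolittle hydropathy scale as the example scale. The unweighted average and the window size of 9 are assumptions made for this illustration; the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(protein, scale=KYTE_DOOLITTLE, window=9):
    """Average scale value over each window of consecutive residues.

    Returns one value per window position; every residue in the window
    is weighted equally (an assumption of this sketch).
    """
    values = [scale[residue] for residue in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


sequence = "MKTLLVLAVCLAAASAEDKRDE"  # made-up example peptide
for start, score in enumerate(scale_profile(sequence), start=1):
    print(start, round(score, 2))
```

In the example, windows over the hydrophobic stretch score high and those over the charged tail score low. A larger window smooths the profile and highlights longer features such as membrane-spanning segments.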
2.6.3 PredictProtein
PredictProtein is a web server that scans the current public databases to generate alignments and to predict the structure of proteins and their functions.
2.7 Primer Design
PCR and DNA replication require oligonucleotide primers that are complementary to the template DNA. Primers for PCR must be designed for the target and are then chemically synthesized from nucleotides. Web-based primer design tools produce feasible primer pairs for an experimental design by integrating with reference databases such as NCBI RefSeq (Chukwuemeka et al., 2020).
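Some of the basic checks behind primer design can be sketched in a few functions. The melting temperature here uses the Wallace rule (2 °C for each A or T, 4 °C for each G or C), which is only a rough guide for short oligonucleotides; the tools described in this section are not stated to use this rule, and the function names are hypothetical.

```python
def gc_content(primer):
    """Fraction of the primer made up of G and C."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees Celsius (Wallace rule)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq):
    """Reverse complement, as needed to design the reverse primer."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


template = "ATGGCTAGCTAGGATCCGATCGTACGATCGTTAGC"  # made-up template
forward = template[:18]
reverse = reverse_complement(template[-18:])
for name, primer in (("forward", forward), ("reverse", reverse)):
    print(name, primer, round(gc_content(primer), 2), wallace_tm(primer))
```

A practical design would go further, for example by checking that the two primers have similar melting temperatures and by screening for self-complementarity and off-target binding sites.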
2.7.1 Primer Design
Primer Design is an online primer design tool from the BioWeb web server, with information and tips for designing primers. The optimal primer sequence and the correct primer concentration are essential for maximum specificity and PCR efficiency.
2.7.2 Primer BLAST
Primer-BLAST is an NCBI primer design tool that combines primer design with a BLAST search to produce a full primer-target alignment. This gives it high sensitivity in detecting targets that have mismatches to the primers.
2.7.3 Web Primer
Web Primer is an online primer design tool from the yeast genome database that uses the Primer3-py package.
2.8 Motif finding
A motif is a region of a sequence that has a specific structure. Motif finding is an early step in most approaches to studying a gene's function, and it is one of the difficult problems in sequence analysis (Hashim et al., 2019).
2.8.1 SMART
SMART stands for Simple Modular Architecture Research Tool. It is a web server that identifies and annotates genetically mobile protein domains and analyzes domain architectures.
2.8.2 MEME
MEME stands for Multiple Expectation Maximization for Motif Elicitation. It is a tool for finding motifs in related DNA or protein sequences.
2.8.3 InterPro
InterPro is a secondary database that supports the functional analysis of proteins by classifying them into families and predicting their domains (Mitchell et al., 2015). It is maintained by EMBL-EBI. InterPro does not produce diagnostic models itself; rather, it groups one or more associated member database signatures and offers additional overarching functional annotations, including Gene Ontology (GO) terms wherever possible (Ashburner et al., 2000).
2.8.4 TMHMM
TMHMM is a web server for predicting membrane protein topology based on a statistical model called the hidden Markov model (HMM).
It predicts transmembrane helices and accurately distinguishes soluble proteins from membrane proteins (TMHMM Server v. 2.0 - Prediction of Transmembrane Helices in Proteins | HSLS, n.d.).
2.9 Gene identification
Gene identification is the process of identifying all the genes and their positions within an organism's genome (Baxevanis, 2004). It requires software that implements complex mathematical algorithms, and these algorithms are tested to determine their effectiveness. One example is disease gene identification, which seeks the genes responsible for a patient's disease (Heemskerk & Fischbeck, 2005). Such algorithms calculate the similarity of a candidate disease gene to sequences in protein or gene databases.
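The simplest form of locating candidate genes is scanning for open reading frames (ORFs). The sketch below searches the three forward reading frames for a start codon followed by an in-frame stop codon. It is a deliberately naive illustration: the gene finders described in this section rely on statistical models of gene structure, and the function name and the `min_codons` threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Find ORFs on the forward strand.

    Returns (start, end) index pairs, where dna[start:end] runs from an
    ATG to the first in-frame stop codon, stop included. ORFs shorter
    than min_codons codons (counting start and stop) are skipped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a new ORF at the first ATG seen
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next ATG
    return sorted(orfs)


sequence = "CCATGAAATGACC"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```

A fuller version would also scan the reverse complement strand. In eukaryotic genomes, ORF scanning alone is not enough because introns interrupt the coding sequence, which is one reason model-based gene predictors are needed.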
2.9.1 FGeneSh
FGeneSh is a web server program for HMM-based gene structure prediction. It has specific parameters for human, Drosophila, plants, yeast, and nematodes (Salamov & Solovyev, 2000). It uses an algorithm similar to GenScan. FGeneSh now has an updated version called FGeneSh+.
2.9.2 AUGUSTUS
AUGUSTUS is a program that uses algorithms based on comparative gene prediction to find the most probable gene structures in eukaryotic genomic sequences. AUGUSTUS exploits the fact that orthologous genes typically have congruent exon-intron structures. In comparative mode, AUGUSTUS simultaneously predicts the genes in all input genomes (Nachtweide & Stanke, 2019). The Bioinformatics Group of the Institute for Mathematics and Computer Science at the University of Greifswald in Germany is in charge of maintaining it.
2.9.3 GenScan
GenScan is a program on a web server that identifies the complete exon and intron structures of genes in genomic DNA. GenScan has higher accuracy than most methods of identifying the structure of genes when tested on standardized sets of human and vertebrate genes, with 75 to 80% of exons identified exactly (Burge & Karlin, 1997).
2.9.4 GenID
GenID is a web server application for gene prediction that is maintained by Pompeu Fabra University, Spain. It assembles the gene structure that maximizes the sum of the scores of the assembled exons (Parra et al., 2000). It needs very little memory and computing power to process large genomic sequences (Alioto et al., 2018).
2.9.5 HMMGene
HMMGene is a web server program that predicts genes in vertebrate and Caenorhabditis elegans sequences. It is based on a probabilistic model called a hidden Markov model. The gene model covers splice sites, coding regions, and intergenic regions (Krogh, 1997). The Technical University of Denmark is responsible for maintaining this web server.
2.9.6 GSDS
GSDS, the Gene Structure Display Server, is maintained by the Center of Bioinformatics, Peking University. It is used for drawing schematic diagrams of gene structures in PNG or SVG format. This helps in understanding the evolutionary relationships and functions of genes (B. Hu et al., 2015).
2.10 Protein Interactions
Protein interactions form a large, dynamic network of biochemical reactions that is critical to the regulation and execution of all biological systems (Garland et al., 2013). Understanding these interactions is important for discovering drugs that cure diseases, because it reveals how the mechanisms of infection work.
2.10.1 STRING
STRING stands for Search Tool for the Retrieval of Interacting Genes/Proteins. It is a database of predicted protein-protein interactions (PPI) that collects, scores, and integrates publicly available PPI data. A new feature visualizes subsets as interaction networks and performs gene-set enrichment analysis on the entire input (Szklarczyk et al., 2019). EMBL is in charge of maintaining this database.
2.10.2 STITCH
STITCH stands for Search Tool for Interactions of Chemicals. It is a database of chemical-protein interactions at EMBL that unites information about interactions from metabolic pathways, crystal structures, binding experiments, and drug-target relationships (Kuhn et al., 2008).
2.11 Molecular phylogeny
Molecular phylogeny is the branch of phylogeny that uses heritable molecular differences, mainly in DNA sequences, to establish the evolutionary relationships of organisms. It aids in assessing the mechanisms that lead to a species' diversity. It requires computational power to analyze gene and amino acid sequences (Macario & Conway de Macario, 2007).
2.11.1 MEGA
MEGA stands for Molecular Evolutionary Genetics Analysis. It is a software package that contains various analysis methods and tools for phylogenomics and phylomedicine (Kumar et al., 2018). This package was programmed in C++ and is maintained by Masatoshi Nei.
2.11.2 PHYLIP
PHYLIP stands for Phylogeny Inference Package. It has 35 programs that are categorized under molecular sequence methods (15), distance matrix methods (3), gene frequencies and continuous characters (3), discrete characters methods (9), and tree drawing, consensus, tree editing, and tree distances methods (5). These programs are used to determine the evolutionary relationships between organisms and were developed by Joseph Felsenstein.
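Distance matrix methods start from a table of pairwise evolutionary distances between aligned sequences. The sketch below builds such a table using the proportion of differing sites (the p-distance) and the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3). This is an illustration of the general idea, not PHYLIP's code; the function names and the choice of the Jukes-Cantor model are assumptions.

```python
import math


def p_distance(a, b):
    """Proportion of aligned, ungapped sites at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(a, b):
    """Jukes-Cantor distance: expected substitutions per site."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs):
    """Nested dict of Jukes-Cantor distances for a dict of name -> sequence."""
    names = list(seqs)
    return {x: {y: jukes_cantor(seqs[x], seqs[y]) for y in names} for x in names}


aligned = {  # made-up aligned sequences
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAT",
    "mouse": "ACGAACCTAT",
}
for name, row in distance_matrix(aligned).items():
    print(name, [round(d, 3) for d in row.values()])
```

The correction makes distances larger than the raw proportion of differences because some sites will have changed more than once. A tree-building method such as neighbor joining would then take this matrix as its input.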
2.11.3 Tree-Puzzle
Tree-Puzzle is a web server application that reconstructs phylogenetic trees from sequence data using the maximum likelihood method. It uses a complex search algorithm called quartet puzzling. This algorithm analyzes data sets and automatically assigns estimates of support to each internal branch (Schmidt et al., 2003).
2.11.4 PAML
PAML stands for Phylogenetic Analysis by Maximum Likelihood. It is a software package for the phylogenetic analysis of DNA and protein sequences, which fits models to the data using the maximum likelihood method. It is not suited to sequence alignment, gene prediction, or tree search in large data sets (Yang, 1997).
2.11.5 DAMBE
DAMBE stands for Data Analysis in Molecular Biology and Evolution. It is a software package that is used to analyze genes and phylogeny (Xia, 2018). The data analysis part of the software is used for descriptive and comparative sequence analysis. DAMBE was developed by Xuhua Xia and has been improved since then for fast and accurate estimation of statistical models.
2.11.6 iTOL
iTOL stands for Interactive Tree of Life. It is an online tool maintained by EMBL that is used for generating, manipulating, and annotating phylogenetic trees. It can work with data from Microsoft Excel, Google Sheets, or any other web data editing program. Trees can be manipulated by pruning branches and relocating the root (Letunic & Bork, 2007).
2.12 3D Visualization
3D visualization is the digital process in which objects or other content are generated in 3D using 3D software. High-resolution 3D rendering, computer-generated imagery, and 3D graphics are three of its main components. 3D visualization requires hardware with high computational and graphical computing power because of the rendering processes. Portable devices have low computing power (Ramos et al., 2014), which slows down visualization because they are not optimized for high-resolution rendering.
2.12.1 Swiss-PDB Viewer
Swiss-PDB Viewer is a large software package that analyzes and displays the structure of proteins. It has a graphical interface that is user-friendly. It has a database that consists of more than 3500 protein models that can be generated without any programming or data input (Guex & Peitsch, 1997).
2.12.2 Chimera
UCSF Chimera is software used for the visualization and modelling of biological structures at the molecular level. Chimera also includes sequence alignments and an integrated sequence and structure view (E. C. Meng et al., 2006).
2.12.3 PyMol
PyMOL is a graphical tool written in Python. It is mostly used to visualize the 3D structures of proteins and nucleic acids, and it is now widely used to produce images of macromolecules for publications. PyMOL also includes plugins for macromolecular analysis, homology modeling, protein–ligand docking, pharmacophore modeling, virtual screening (VS), and molecular dynamics (MD) simulations (Yuan et al., 2017).
2.12.4 VMD
VMD stands for Visual Molecular Dynamics. VMD is 3D software for modeling, visualizing, and analyzing molecular assemblies such as proteins and nucleic acids. It provides two interfaces: a graphical user interface for controlling the software, and a text interface that allows the use of complex scripts with variable substitution, loop control, and function calls (Humphrey et al., 1996).
2.12.5 YASARA
YASARA stands for Yet Another Scientific Artificial Reality Application. It is graphics and simulation software for molecular modeling. The software is based on the Portable Vector Language framework and uses graphics cards (graphics processing units, GPUs) to visualize proteins of all sizes. This implementation also allows accurate real-time simulations. Modern GPUs can handle a wide range of computations, including molecular modeling algorithms (Stone et al., 2010), because parallel processing carries out many calculations at the same time and so accelerates scientific computing.
2.12.6 Cn3D
Cn3D, pronounced “see in 3D”, is a helper application for visualization that is distributed as part of Entrez from NCBI. Cn3D uses the Entrez search system to find the 3D structures of known proteins (Y. Wang et al., 2000). Entrez has its own 3D structure database, the Molecular Modeling Database (MMDB). Cn3D also displays biological sequences and sequence alignments.
2.12.7 Kinemage
Kinemage means kinetic image. Mage was the predecessor of KiNG (Kinemage, Next Generation), a tool for molecular illustration. KiNG is open-source software for visualizing macromolecules and is written in Java. Its strengths are the molecule-agnostic kinemage graphics format, the quality of its color palette and depth cueing, and the tools and features it offers (Chen et al., 2009).
2.12.8 RasMol
RasMol is an open-source application for molecular visualization, originally created by Roger Sayle. RasMol needs little computing power and is highly efficient, which allows fast analysis of biological molecules. It has been modified to increase the computing speed and accuracy of structural analysis (Pikora & Gieldon, 2015).
2.13 Protein modeling
Protein modeling is the process of computing and visualizing the 3D structure of a macromolecule from its sequence. It is the digital construction of a protein at atomic resolution from its amino acid sequence and an experimental 3D structure of a template protein. This works because evolutionarily related proteins share similar structures (Sailapathi et al., 2021).
2.13.1 Phyre2
Phyre2 stands for Protein Homology/analogY Recognition Engine V 2.0. It is a web server application that analyzes and predicts the structure, mutations, and functions of proteins. Phyre was the predecessor of Phyre2. Phyre2 uses an optimal mesh algorithm to construct 3D models, predict ligand binding sites, and analyze the effect of amino acid variants (Kelley et al., 2015).
2.13.2 Swiss-Model
Swiss-Model is an integrated online modeling system. The system will identify suitable experimental protein structures for the target protein. The workspace assists the user in constructing protein homology models based on levels of complexity (Guex & Peitsch, 1997).
2.13.3 SwissDock
SwissDock is a web server docking simulation program for protein-ligand docking. Docking simulation has become an important tool for drug discovery (X.-Y. Meng et al., 2012). The simulation builds small molecules, which are then docked into the macromolecular structures and scored for their complementarity at the binding sites (Saikia & Bordoloi, 2018).
2.13.4 AutoDock
AutoDock is a program for automated protein/protein and protein/ligand docking that is mostly used in research and drug discovery (Forli et al., 2016). AutoDock uses a grid-based method, with multi-resolution grid data structures, to evaluate the binding energy of trial conformations (Morris et al., 2009).
2.13.5 CASP
CASP stands for Critical Assessment of Protein Structure Prediction. It is an organization that assesses and tracks the progress of protein structure modeling from amino acid sequences. It runs blind tests of structure prediction methods to determine how well they perform (Moult et al., 2014). This helps to find the bottlenecks of structure prediction tools.
2.14 RNA tools
RNA tools are computational tools that are specific to RNA. RNA (ribonucleic acid) is an important biological molecule built from the bases guanine, cytosine, adenine, and uracil. RNA is present in all biological cells, as it carries genetic information for protein synthesis (D. Wang & Farhana, 2021). These RNA tools offer many functions, such as RNA-RNA interaction prediction, sequence-structure alignment, and RNA visualization.
2.14.1 MFOLD
MFOLD is a web server application that is used for predicting RNA secondary structure. Its goal was to provide access to RNA and DNA folding and hybridization software on the World Wide Web. Folding a single sequence can generate detailed output in the form of structure plots with or without reliability information, single strand frequency plots, and ‘energy dot plots’ (Zuker, 2003).
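The principle of secondary structure prediction by dynamic programming can be sketched with a much simpler criterion than the one such software uses. The code below maximizes the number of nested base pairs (a Nussinov-style algorithm) instead of minimizing free energy, so it is a stand-in for illustration only. The function name and the minimum hairpin loop size of 3 are assumptions.

```python
# Watson-Crick pairs plus the G-U wobble pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    Nussinov-style dynamic programming. At least min_loop unpaired
    bases must separate the two partners of any pair.
    """
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] is the optimum for the subsequence rna[i..j] inclusive.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))
```

Energy-based methods follow the same table-filling pattern but score stacked pairs and loops with experimentally derived free energies, which is what makes their predictions more realistic than simple pair counting.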
2.14.2 RAIN
RAIN stands for RNA-protein Association and Interaction Networks. RAIN is a web server program that uses a scoring scheme to assign a confidence score to each RNA-protein interaction. The RAIN database includes ncRNA–RNA and ncRNA–protein associations, with protein–protein associations contained in the STRING database. This allows complex interaction networks to be explored through the STRING interface (Junge et al., 2017).
2.14.3 FoldAlign
FoldAlign is a web server program that is used to run structural alignments of RNA sequences (Havgaard et al., 2005). It uses a Sankoff-based algorithm, which folds and aligns the sequences simultaneously to give better predictions. This makes it slower, as it needs more computing power and memory. The sequences can be globally or locally aligned.
2.14.4 RNASNP
RNASNP, named for RNA and single nucleotide polymorphisms (SNPs), is a web server program that predicts the effect of SNPs on local RNA secondary structure. SNPs have an impact on an individual's susceptibility to various illnesses and even on how their body reacts to drugs (Alwi, 2005). RNASNP uses precomputed tables of the distribution of SNP effects as a function of sequence length and GC content (Sabarinathan et al., 2013).
2.14.5 RILogo
RILogo stands for RNA-RNA Interaction Logos. It is a web server application that generates RNA-RNA interaction logos for a pair of RNA sequences, which can show the formation of duplexes between the pair of RNAs. RILogo displays the alignments as sequence logos, together with the information for base-paired columns (Menzel et al., 2012).
2.15 List of bioinformatics tools at international bioinformatics centers
2.15.1 ExPASy tools
ExPASy is a World Wide Web server dedicated to the analysis of protein and nucleic acid sequences and 2-D PAGE. It contains online tools for analyzing and predicting protein properties. It allows access to various protein and proteomics databases such as SWISS-PROT, PROSITE, and ENZYME. Analysis tools and software packages such as Swiss-Model, Melanie, and TOOLS can be accessed as well. The analysis tools can be used for specific tasks relevant to proteomics, similarity searches, pattern and profile searches, post-translational modification prediction, topology prediction, primary, secondary, and tertiary structure analysis, and sequence alignment (Gasteiger et al., 2003). The databases are closely interlinked with the analysis tools to improve predictions.
2.15.2 Tsinghua
The Bioinformatics Division at Tsinghua University has developed software tools that are hosted on its own web server. The goal of developing the tools was to pursue excellence in scientific research and education on fundamental questions in life science with informatics and systems approaches (Research | Bioinformatics Division | CSSB, n.d.). The tools range from the analysis of gene expression patterns to the construction of synthetic molecular systems for studying complex biological phenomena and engineering genetic circuits for the manipulation of cells (Synthetic Biology | Bioinformatics Division | CSSB, n.d.).
2.15.3 DTU HealthTech Services
DTU HealthTech Services is a web server that hosts bioinformatics tools and datasets for gene finding and splice site prediction, genomic epidemiology, immunological features, post-translational modifications of proteins, protein function and structure, protein sorting, and small molecules (Services- DTU Health Tech, n.d.).
2.15.4 SMS
SMS stands for Sequence Manipulation Suite. It is a web server that contains JavaScript programs for generating, formatting, and analyzing short DNA and protein sequences. The suite is categorized into format conversion tools, sequence analysis tools, sequence figure tools, random sequence tools, and other miscellaneous tools (The Sequence Manipulation Suite, n.d.).
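To give a feel for the kind of operations these categories cover, here is a minimal Python sketch of three of them: a sequence manipulation (reverse complement), a random sequence generator, and a simple format conversion to FASTA. SMS itself is written in JavaScript; the function names and parameters below are hypothetical and are not taken from the suite.

```python
import random

# Translation table mapping each DNA base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def random_dna(length, seed=None):
    """Generate a random DNA sequence with equal base probabilities.

    Passing a seed makes the result reproducible.
    """
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def to_fasta(name, seq, width=60):
    """Format a sequence as a FASTA record wrapped at `width` characters."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

A typical use would be `to_fasta("random_1", random_dna(120, seed=1))`, which produces a two-line FASTA record of a reproducible random sequence.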
2.15.5 CoGe
CoGe stands for Comparative Genomics. This online platform is structured to manage and study genomic data, enabling both data-driven and hypothesis-driven comparative genomics. CoGe provides tools and information that can be used in organizing and analyzing publicly available as well as private genomic data from any species (Castillo et al., 2018).
3.0 Conclusion
Modern technologies have greatly improved the way bioinformatics scientists design computational tools, systems, and biological databases that are used in organizing, storing, and analyzing big data. There are still many big data problems in the field of bioinformatics that have yet to be formulated and solved, such as analyzing the relations among diseases and conducting evolutionary research. Hardware limitations play an important role in computation and visualization, as they determine how efficiently algorithms run and data are processed. Modern medicine depends heavily on bioinformatics tools, as these computational tools can identify susceptibility genes and pathogenic pathways for a disease. The most efficient way to find a cure for a disease is to first understand it.
Table: Bioinformatics tools for sequence analysis and their URLs
References
Alioto, T., Blanco, E., Parra, G., & Guigó, R. (2018). Using geneid to Identify Genes. Current Protocols in Bioinformatics, 64(1). https://doi.org/10.1002/cpbi.56
Alwi, Z. Bin. (2005). The Use of SNPs in Pharmacogenomics Studies. The Malaysian Journal of Medical Sciences : MJMS, 12(2), 4–12. http://www.ncbi.nlm.nih.gov/pubmed/22605952
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., & Sherlock, G. (2000). Gene ontology: Tool for the unification of biology. In Nature Genetics (Vol. 25, Issue 1, pp. 25–29). Nature Publishing Group. https://doi.org/10.1038/75556
Baxevanis, A. D. (2004). An Overview of Gene Identification: Approaches, Strategies, and Considerations. Current Protocols in Bioinformatics, 6(1), 4.1.1-4.1.9. https://doi.org/10.1002/0471250953.bi0401s6
Burge, C., & Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268(1), 78–94. https://doi.org/10.1006/jmbi.1997.0951
Castillo, A. I., Nelson, A. D. L., Haug-Baltzell, A. K., & Lyons, E. (2018). A tutorial of diverse genome analysis tools found in the CoGe web-platform using Plasmodium spp. as a model. Database, 2018(2018). https://doi.org/10.1093/database/bay030
Chen, V. B., Davis, I. W., & Richardson, D. C. (2009). KiNG (Kinemage, Next Generation): A versatile interactive molecular and scientific visualization program. Protein Science, 18(11), 2403–2409. https://doi.org/10.1002/pro.250
Chukwuemeka, P. O., Umar, H. I., Olukunle, O. F., Oretade, O. M., Olowosoke, C. B., Akinsola, E. O., Elabiyi, M. O., Kurmi, U. G., Eigbe, J. O., Oyelere, B. R., Isunu, L. E., & Oretade, O. J. (2020). In silico design and validation of a highly degenerate primer pair: a systematic approach. Journal of Genetic Engineering and Biotechnology, 18(1). https://doi.org/10.1186/s43141-020-00086-y
Forli, S., Huey, R., Pique, M. E., Sanner, M. F., Goodsell, D. S., & Olson, A. J. (2016). Computational protein-ligand docking and virtual drug screening with the AutoDock suite. Nature Protocols, 11(5), 905–919. https://doi.org/10.1038/nprot.2016.051
Garland, W., Benezra, R., & Chaudhary, J. (2013). Targeting protein-protein interactions to treat cancer-recent progress and future directions. In Annual Reports in Medicinal Chemistry (Vol. 48, pp. 227–245). Academic Press Inc. https://doi.org/10.1016/B978-0-12-417150-3.00015-6
Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R. D., & Bairoch, A. (2003). ExPASy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Research, 31(13), 3784–3788. https://doi.org/10.1093/nar/gkg563
Guex, N., & Peitsch, M. C. (1997). SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modeling. Electrophoresis, 18(15), 2714–2723. https://doi.org/10.1002/elps.1150181505
Hashim, F. A., Mabrouk, M. S., & Al-Atabany, W. (2019). Review of Different Sequence Motif Finding Algorithms. Avicenna Journal of Medical Biotechnology, 11(2), 130–148. http://www.ncbi.nlm.nih.gov/pubmed/31057715
Havgaard, J. H., Lyngsø, R. B., & Gorodkin, J. (2005). The Foldalign web server for pairwise structural RNA alignment and mutual motif search. Nucleic Acids Research, 33(SUPPL. 2). https://doi.org/10.1093/nar/gki473
Heemskerk, J., & Fischbeck, K. H. (2005). Therapeutics Development for Hereditary Neurological Diseases. In From NEUROSCIENCE To NEUROLOGY (pp. 285–291). Elsevier Inc. https://doi.org/10.1016/B978-012738903-5/50017-5
Hou, K., Wang, H., & Feng, W. C. (2016). AAlign: A SIMD Framework for Pairwise Sequence Alignment on x86-Based Multi-and Many-Core Processors. Proceedings – 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016, 780–789. https://doi.org/10.1109/IPDPS.2016.115
Hu, B., Jin, J., Guo, A. Y., Zhang, H., Luo, J., & Gao, G. (2015). GSDS 2.0: An upgraded gene feature visualization server. Bioinformatics, 31(8), 1296–1297. https://doi.org/10.1093/bioinformatics/btu817
Hu, G., & Kurgan, L. (2019). Sequence Similarity Searching. Current Protocols in Protein Science, 95(1), 1–19. https://doi.org/10.1002/cpps.71
Huang, M., Shah, N. D., & Yao, L. (2019). Evaluating global and local sequence alignment methods for comparing patient medical records. BMC Medical Informatics and Decision Making, 19(6), 1–13. https://doi.org/10.1186/s12911-019-0965-y
Humphrey, W., Dalke, A., & Schulten, K. (1996). VMD: Visual molecular dynamics. Journal of Molecular Graphics, 14(1), 33–38. https://doi.org/10.1016/0263-7855(96)00018-5
Junge, A., Refsgaard, J. C., Garde, C., Pan, X., Santos, A., Alkan, F., Anthon, C., Von Mering, C., Workman, C. T., Jensen, L. J., & Gorodkin, J. (2017). RAIN: RNA-protein association and interaction networks. Database, 2017(1), 167. https://doi.org/10.1093/database/baw167
Junier, T., & Pagni, M. (2000). Dotlet: Diagonal plots in a Web browser. Bioinformatics, 16(2), 178–179. https://doi.org/10.1093/bioinformatics/16.2.178
Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N., & Sternberg, M. J. E. (2015). The Phyre2 web portal for protein modeling, prediction and analysis. Nature Protocols, 10(6), 845–858. https://doi.org/10.1038/nprot.2015.053
Krogh, A. (1997). Two Methods for Improving Performance of a HMM and their Application for Gene Finding.
Krumsiek, J., Arnold, R., & Rattei, T. (2007). Gepard: A rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics, 23(8), 1026–1028. https://doi.org/10.1093/bioinformatics/btm039
Kuhn, M., von Mering, C., Campillos, M., Jensen, L. J., & Bork, P. (2008). STITCH: Interaction networks of chemicals and proteins. Nucleic Acids Research, 36(SUPPL. 1). https://doi.org/10.1093/nar/gkm795
Kumar, S., Stecher, G., Li, M., Knyaz, C., & Tamura, K. (2018). MEGA X: Molecular evolutionary genetics analysis across computing platforms. Molecular Biology and Evolution, 35(6), 1547–1549. https://doi.org/10.1093/molbev/msy096
Letunic, I., & Bork, P. (2007). Interactive Tree Of Life (iTOL): An online tool for phylogenetic tree display and annotation. Bioinformatics, 23(1), 127–128. https://doi.org/10.1093/bioinformatics/btl529
Macario, A. J. L., & Conway de Macario, E. (2007). Heat Shock Genes, Human. In Encyclopedia of Stress (pp. 284–288). Elsevier Inc. https://doi.org/10.1016/B978-012373947-6.00190-2
Meng, E. C., Pettersen, E. F., Couch, G. S., Huang, C. C., & Ferrin, T. E. (2006). Tools for integrated sequence-structure analysis with UCSF Chimera. BMC Bioinformatics, 7(1), 1–10. https://doi.org/10.1186/1471-2105-7-339
Meng, X.-Y., Zhang, H.-X., Mezei, M., & Cui, M. (2012). Molecular Docking: A Powerful Approach for Structure-Based Drug Discovery. Current Computer Aided-Drug Design, 7(2), 146–157. https://doi.org/10.2174/157340911795677602
Menzel, P., Seemann, S. E., & Gorodkin, J. (2012). RILogo: Visualizing RNA-RNA interactions. Bioinformatics, 28(19), 2523–2526. https://doi.org/10.1093/bioinformatics/bts461
Mitchell, A., Chang, H. Y., Daugherty, L., Fraser, M., Hunter, S., Lopez, R., McAnulla, C., McMenamin, C., Nuka, G., Pesseat, S., Sangrador-Vegas, A., Scheremetjew, M., Rato, C., Yong, S. Y., Bateman, A., Punta, M., Attwood, T. K., Sigrist, C. J. A., Redaschi, N., … Finn, R. D. (2015). The InterPro protein families database: The classification resource after 15 years. Nucleic Acids Research, 43(D1), D213–D221. https://doi.org/10.1093/nar/gku1243
Morris, G. M., Huey, R., Lindstrom, W., Sanner, M. F., Belew, R. K., Goodsell, D. S., & Olson, A. J. (2009). AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. Journal of Computational Chemistry, 30(16), 2785–2791. https://doi.org/10.1002/jcc.21256
Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T., & Tramontano, A. (2014). Critical assessment of methods of protein structure prediction (CASP) – round x. Proteins: Structure, Function and Bioinformatics, 82(SUPPL.2), 1–6. https://doi.org/10.1002/prot.24452
Nachtweide, S., & Stanke, M. (2019). Multi-genome annotation with AUGUSTUS. In Methods in Molecular Biology (Vol. 1962, pp. 139–160). Humana Press Inc. https://doi.org/10.1007/978-1-4939-9173-0_8
Parra, G., Blanco, E., & Guigó, R. (2000). GeneId in Drosophila. Genome Research, 10(4), 511–515. https://doi.org/10.1101/gr.10.4.511
Paxman, J. J., & Heras, B. (2017). Bioinformatics tools and resources for analyzing protein structures. Methods in Molecular Biology, 1549, 209–220. https://doi.org/10.1007/978-1-4939-6740-7_16
Pikora, M., & Gieldon, A. (2015). RASMOL AB – New functionalities in the program for structure analysis. Acta Biochimica Polonica, 62(3), 629–631. https://doi.org/10.18388/abp.2015_972
Pourreza Shahri, M., Srinivasan, M., Reynolds, G., Bimczok, D., Kahanda, I., & Kanewala, U. (2019). Metamorphic testing for quality assurance of protein function prediction tools. Proceedings – 2019 IEEE International Conference on Artificial Intelligence Testing, AITest 2019, 140–148. https://doi.org/10.1109/AITest.2019.00017
Ramos, F., Ripolles, O., & Chover, M. (2014). Efficient visualization of 3D models on hardware-limited portable devices. Multimedia Tools and Applications, 73(2), 961–976. https://doi.org/10.1007/s11042-012-1200-3
Research | Bioinformatics Division | CSSB. (n.d.). Retrieved May 21, 2021, from http://bioinfo.au.tsinghua.edu.cn/research/
Sabarinathan, R., Tafer, H., Seemann, S. E., Hofacker, I. L., Stadler, P. F., & Gorodkin, J. (2013). RNAsnp: Efficient detection of local RNA secondary structure changes induced by SNPs [Human Mutation 34, 4 (2013) 546-556] DOI 10.1002/humu.22273. In Human Mutation (Vol. 34, Issue 6, p. 925). https://doi.org/10.1002/humu.22323
Saikia, S., & Bordoloi, M. (2018). Molecular Docking: Challenges, Advances and its Use in Drug Discovery Perspective. Current Drug Targets, 20(5), 501–521. https://doi.org/10.2174/1389450119666181022153016
Sailapathi, A., Gunalan, S., Somarathinam, K., Kothandan, G., & Kumar, D. (2021). Importance of Homology Modeling for Predicting the Structures of GPCRs. In Homology Molecular Modeling – Perspectives and Applications. IntechOpen. https://doi.org/10.5772/intechopen.94402
Salamov, A. A., & Solovyev, V. V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Research, 10(4), 516–522. https://doi.org/10.1101/gr.10.4.516
Schmidt, H. A., Strimmer, K., Vingron, M., & Von Haeseler, A. (2003). TREE-PUZZLE: Maximum likelihood analysis for nucleotide, amino acid, and two-state data. http://www.tree-puzzle.de/
Services- DTU Health Tech. (n.d.). Retrieved May 21, 2021, from https://services.healthtech.dtu.dk/
Stone, J. E., Hardy, D. J., Ufimtsev, I. S., & Schulten, K. (2010). GPU-accelerated molecular modeling coming of age. Journal of Molecular Graphics and Modelling, 29(2), 116–125. https://doi.org/10.1016/j.jmgm.2010.06.010
Synthetic Biology | Bioinformatics Division | CSSB. (n.d.). Retrieved May 21, 2021, from http://bioinfo.au.tsinghua.edu.cn/research/synthetic-biology/
Szklarczyk, D., Gable, A. L., Lyon, D., Junge, A., Wyder, S., Huerta-Cepas, J., Simonovic, M., Doncheva, N. T., Morris, J. H., Bork, P., Jensen, L. J., & Von Mering, C. (2019). STRING v11: Protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Research, 47(D1), D607–D613. https://doi.org/10.1093/nar/gky1131
The Sequence Manipulation Suite. (n.d.). Retrieved May 21, 2021, from http://www.bioinformatics.org/sms2/
TMHMM Server v. 2.0 — Prediction of transmembrane helices in proteins | HSLS. (n.d.). Retrieved May 20, 2021, from https://www.hsls.pitt.edu/obrc/index.php?page=URL1164644151
Nute, M., Saleh, E., & Warnow, T. (2019). Evaluating Statistical Multiple Sequence Alignment in Comparison to Other Alignment Methods on Protein Data Sets. Systematic Biology, 68(3), 396–411. https://doi.org/10.1093/sysbio/syy068
Wang, D., & Farhana, A. (2021). Biochemistry, RNA Structure. In StatPearls. StatPearls Publishing. http://www.ncbi.nlm.nih.gov/pubmed/32644425
Wang, Y., Geer, L. Y., Chappey, C., Kans, J. A., & Bryant, S. H. (2000). Cn3D: Sequence and structure views for Entrez. In Trends in Biochemical Sciences (Vol. 25, Issue 6, pp. 300–302). Elsevier Ltd. https://doi.org/10.1016/S0968-0004(00)01561-9
Wheeler, T. J., Clements, J., & Finn, R. D. (2014). Skylign: A tool for creating informative, interactive logos representing sequence alignments and profile hidden Markov models. BMC Bioinformatics, 15(1), 1–9. https://doi.org/10.1186/1471-2105-15-7
Xia, X. (2018). DAMBE7: New and improved tools for data analysis in molecular biology and evolution. Molecular Biology and Evolution, 35(6), 1550–1552. https://doi.org/10.1093/molbev/msy073
Yang, Z. (1997). Paml: A program package for phylogenetic analysis by maximum likelihood. Bioinformatics, 13(5), 555–556. https://doi.org/10.1093/bioinformatics/13.5.555
Yuan, S., Chan, H. C. S., & Hu, Z. (2017). Using PyMOL as a platform for computational drug design. In Wiley Interdisciplinary Reviews: Computational Molecular Science (Vol. 7, Issue 2, p. e1298). Blackwell Publishing Inc. https://doi.org/10.1002/wcms.1298
Zuker, M. (2003). Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Research, 31(13), 3406–3415. https://doi.org/10.1093/nar/gkg595