Predicting motifs in a protein sequence: MyHits Motif Scan tutorial

June 11, 2019, by admin

INTRODUCTION

Protein motifs are most commonly extracted from an initial multiple sequence alignment. Sometimes, however, the training sequences are not strictly homologous, or they contain repeats, rearrangements, or other features that disrupt alignment-based approaches.

Motif Scan is a MyHits tool developed by the Swiss Institute of Bioinformatics (SIB). It searches motif databases from HAMAP, PROSITE, and Pfam for motifs that match the query sequence. The purpose of the tool is to identify the motifs or patterns found in a protein sequence. Knowing the motifs of a protein helps to classify it by family or domain.

METHODOLOGY

Motif Scan is a tool under MyHits. Like all MyHits tools, it runs on HitKeeper, which was mainly written by Marco Pagni and Jörg Hau. HitKeeper is a software package for hit list management. It consists of a collection of scripts that interact with a relational database management system (RDBMS), and it is mostly written in Perl.

With HitKeeper, the data is organized into three kinds. They are seq (biological sequences), mot (motifs), and cla (hierarchical classification). These kinds can be further organized into types. For seq, they are organized into pep (peptides) and nuc (nucleotides). For mot, they are arranged into pattern (protein patterns) and HMM (Hidden Markov Models). At the moment, cla is limited to taxonomy only.
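The kinds and types described above can be pictured as a small nested lookup table. The following is only an illustrative sketch in Python: the kind and type names come from the text, but the dictionary layout and the helper `types_for` are assumptions made for this example and are not part of HitKeeper itself (which is written mostly in Perl).

```python
# Kinds and their types, as named in the text above.
# The dictionary layout is an illustration, not HitKeeper's actual schema.
HITKEEPER_KINDS = {
    "seq": {
        "description": "biological sequences",
        "types": {"pep": "peptides", "nuc": "nucleotides"},
    },
    "mot": {
        "description": "motifs",
        "types": {"pattern": "protein patterns", "HMM": "Hidden Markov Models"},
    },
    "cla": {
        "description": "hierarchical classification",
        "types": {"taxonomy": "taxonomy"},
    },
}


def types_for(kind):
    """Return the type names registered for a kind (hypothetical helper)."""
    if kind not in HITKEEPER_KINDS:
        raise KeyError(f"unknown kind: {kind!r}")
    return list(HITKEEPER_KINDS[kind]["types"])


if __name__ == "__main__":
    for kind, info in HITKEEPER_KINDS.items():
        print(kind, "(" + info["description"] + "):", ", ".join(types_for(kind)))
```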

As can be seen, HitKeeper can handle several data sets of each kind. For each kind, multiple versions of a database can exist in a single “pipeline”. However, only the current database may be queried.

Several versions of these databases can exist in the same pipeline, but at different stages. To process all of these databases, three scripts run simultaneously: HKLoader, HKUpdater, and HKPublisher. HKLoader watches the source data files for changes by checking their date and time stamps. It parses and converts the raw data, detects redundancy, and transfers the “clean” data to the SQL database. HKUpdater is responsible for updating the hit list. When a motif database enters the ‘prepare’ state, the new motifs are computed against the sequences in the ‘current’ state, and vice versa. HKPublisher deploys the databases to external computing elements and promotes the databases flagged as ‘ready’ to ‘current’.
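To make the life cycle of a database version easier to follow, here is a minimal Python sketch of the states involved. It assumes that a version moves from ‘prepare’ to ‘ready’ and is then promoted to ‘current’, and that the version it replaces is retired. The class names, method names, and the ‘old’ state are hypothetical and chosen only for this illustration; they do not reproduce HitKeeper's actual scripts or schema.

```python
from dataclasses import dataclass


@dataclass
class Database:
    name: str
    version: str
    # 'prepare', 'ready' and 'current' are named in the text;
    # 'old' is a hypothetical state for a replaced version.
    state: str = "prepare"


class Pipeline:
    """Toy model of several versions of a database living side by side."""

    def __init__(self):
        self.databases = []

    def add(self, name, version):
        """Register a new version; it starts in the 'prepare' state."""
        db = Database(name, version)
        self.databases.append(db)
        return db

    def mark_ready(self, db):
        """Flag a prepared version as ready for publication."""
        if db.state != "prepare":
            raise ValueError(f"cannot mark {db.state!r} database as ready")
        db.state = "ready"

    def publish(self):
        """Promote every 'ready' version to 'current', retiring the one it replaces."""
        for db in self.databases:
            if db.state != "ready":
                continue
            for other in self.databases:
                if other.name == db.name and other.state == "current":
                    other.state = "old"
            db.state = "current"

    def current(self, name):
        """Return the only version that may be queried, or None."""
        for db in self.databases:
            if db.name == name and db.state == "current":
                return db
        return None


if __name__ == "__main__":
    pipeline = Pipeline()
    first = pipeline.add("pfam", "v1")
    pipeline.mark_ready(first)
    pipeline.publish()
    second = pipeline.add("pfam", "v2")  # coexists with v1 while being prepared
    print(pipeline.current("pfam").version)
    pipeline.mark_ready(second)
    pipeline.publish()
    print(pipeline.current("pfam").version)
```

In this sketch, a query always goes through `current()`, which mirrors the rule that only the current database may be queried even while newer versions are being prepared.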

HitKeeper was developed as the “back-end” of the MyHits website. In Figure 3, the tasks provided by HitKeeper are shown in blue. Services that provide infrastructure (MySQL, Apache) are displayed in green, while computing services are shown in pink. The tasks are distributed among different hosts, and HitKeeper controls their synchronization.

CASE STUDY TUTORIAL

The Motif Scan homepage is pretty straightforward. The query box is in the middle. The preferences are below it and the search button is on the right-hand side.

Before beginning the search, a protein sequence must be chosen. Note that Motif Scan accepts only protein sequences and does not scan nucleotide sequences. In this case, we will be using the Proprotein convertase subtilisin/kexin type 9 from the organism Lagothrix lagotricha (Brown woolly monkey). This sequence was obtained from UniProtKB. The sequence can be entered into the query box as a raw sequence, as an identifier, in FASTA format, or in Swiss-Prot format. Here, we will use the FASTA format since it is the most widely used sequence format in bioinformatics.
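Because Motif Scan only accepts protein sequences, it can help to check a FASTA record before pasting it into the query box. The Python sketch below reads the first record of a FASTA string and applies a simple heuristic: a sequence made up only of nucleotide letters is treated as a probable nucleotide sequence. The function names are hypothetical, the heuristic is an assumption rather than the check Motif Scan itself performs, and the sequence shown is a short toy example, not the actual monkey protein used in this tutorial.

```python
# One-letter amino acid codes plus common ambiguity/rare codes (B, Z, X, U, O).
AMINO_ACID_LETTERS = set("ACDEFGHIKLMNPQRSTVWY") | set("BZXUO")
# Letters that a plain nucleotide sequence is normally limited to.
NUCLEOTIDE_LETTERS = set("ACGTUN")


def parse_fasta(text):
    """Return (header, sequence) for the first record in a FASTA string."""
    header = None
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; stop reading
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header line (starting with '>') found")
    return header, "".join(chunks).upper()


def looks_like_protein(sequence):
    """Heuristic: valid amino acid letters, and not only nucleotide letters."""
    letters = set(sequence)
    if not letters or not letters <= AMINO_ACID_LETTERS:
        return False
    return not letters <= NUCLEOTIDE_LETTERS


if __name__ == "__main__":
    toy_record = """>toy_protein illustrative record, not a real entry
MKTAYIAKQR
QISFVKSHFS
"""
    header, sequence = parse_fasta(toy_record)
    print(header)
    print(sequence, looks_like_protein(sequence))
```

A very short peptide that happens to contain only the letters A, C, G, T, U, or N would be misclassified by this heuristic, so it is best treated as a quick sanity check rather than a validator.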

Below the query box are preferences that determine which database the tool will use during the scan. These settings also determine the type of motif that the tool will search for.

The preferences are listed as ‘mot_source’, which stands for motif source. By default, none of the options is selected. However, the website remembers the last selected preferences and preselects them on the next visit. At least one box must be selected before starting the search; otherwise, an error will occur. Selecting more options makes the search take longer, but it should also return more results.

With all of the mot_source options selected, the search takes roughly five to ten minutes at most.

Once everything has been set, just click the “Search” button to begin the process.

After several moments, the tool will bring up the results according to the preferences set along with any similarities the motifs have.

The Map Match shows the positions of the domains identified on the sequence. The Motif Scan tool usually displays motifs from the Pfam database here.

The “List of matches” category lists all the potential motifs found in the queried sequence. The name of each match is accompanied by its status. Strong matches are denoted with a [!] while weak matches are marked with [?].
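If the list of matches is copied out of the results page as plain text, the status markers make it easy to separate strong from weak matches. The Python sketch below assumes a hypothetical line layout in which each line is a match name followed by [!] or [?]; the real page layout may differ, and the match names used here are made up for illustration.

```python
import re

# Status markers as described in the text: [!] is strong, [?] is weak.
STATUS_LABELS = {"!": "strong", "?": "weak"}

# Assumed layout: "<match name> [!]" or "<match name> [?]" on each line.
_MATCH_LINE = re.compile(r"^(?P<name>.+?)\s*\[(?P<status>[!?])\]\s*$")


def split_matches(lines):
    """Group match names by status; lines without a marker are ignored."""
    grouped = {"strong": [], "weak": []}
    for line in lines:
        found = _MATCH_LINE.match(line.strip())
        if found is None:
            continue
        label = STATUS_LABELS[found.group("status")]
        grouped[label].append(found.group("name"))
    return grouped


if __name__ == "__main__":
    copied_lines = [
        "example_domain_a [!]",   # made-up names, not real results
        "example_pattern_b [?]",
        "example_domain_c [!]",
    ]
    print(split_matches(copied_lines))
```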

Below that, the “Detail of Matches” segment illustrates the domains and functional groups of the sequence. The position of each motif is displayed beside it, along with the status of the match. For matches with a “?” status, there is insufficient evidence for a confident identification. The tool still reports them, but the result is questionable.

Despite the weak evidence, some of these weak matches are still reported with a raw score, an N-score (normalized score), and an E-value. The E-value is the number of matches with an equal or better score that would be expected by chance, so a smaller value indicates a more significant match. Motifs with weak evidence tend to have high E-values, meaning that the match could easily have arisen by chance. The motif displayed for such a match is usually incomplete.

Motifs with strong evidence supporting the match are tagged with a “!” status. These motifs produce very small E-values, which indicate that the match is unlikely to have arisen by chance. The graphic that displays the structure of the motif is also complete, unlike in the weak results. This makes it easier to identify motifs and classify proteins.

 

CONCLUSION

Motif Scan is a tool that finds motifs in a protein sequence and attempts to identify them. It assists in the classification of proteins by identifying these motifs or patterns. The software uses HitKeeper to keep its data in sync. Not many researchers use Motif Scan because its function is so niche, but it is very useful for identifying functional groups, which can help scientists compare unknown proteins. The tool might see more use if more resources were available to help users navigate it.

 

REFERENCES

Scores and match significance. (n.d.). Retrieved March 24, 2019, from https://myhits.isb-sib.ch/cgi-bin/help?doc=scores.html

The HitKeeper Homepage. (n.d.). Retrieved March 26, 2019, from http://hitkeeper.sourceforge.net/

Hau, J., Muller, M., & Pagni, M. (2007). HitKeeper, a generic software package for hit list management. Source Code for Biology and Medicine, 2(1), 2. doi:10.1186/1751-0473-2-2
