1. Introduction



Definition

A short linear functional motif is a stretch of protein sequence that is sufficient to perform a distinct function, such as interaction with another molecule (which can in turn lead to modification, degradation, or relocation of the motif-containing protein). A motif is not restricted to the exact sequence found in any one protein, however: it is the set of sequence variants that retain the same specific function.

A motif instance is a specific occurrence of the motif in a specific protein.

Problem

A de novo search for motifs guided by the definition would involve resolving interactions experimentally, which is often too costly or not feasible at all (e.g., because the common interactor itself has not yet been identified). Computational approaches that dock molecules in 3D generally work well only with known 3D structures, and even then they struggle with conformational flexibility. Thus, it is practically impossible to search for motifs de novo directly in a set of proteins when the common interaction partner and resolved 3D structures are not known.

Tree model

Region model

To circumvent this problem, motifs are searched for indirectly, using simplified models of a motif. The most conventional model is the regular expression (regex), which describes the conserved positions of the motif as simple lists of allowed amino acids. Sequence profiles are more informative models, as they describe the amino acid frequencies (emission probabilities) at each position within the motif. However, both regexes and profiles imply that all motif instances are equivalent and that a high-quality alignment exists between any two instances. This assumption fits poorly with the principles of evolution. HH-MOTiF therefore introduces the tree model. In this model, one of the motif instances assumes the role of the 'root', while all other instances become 'leaves'. Each leaf exhibits strong similarity to the motif root (which results in a high-quality local alignment); a pair of leaves, however, need not exhibit strong similarity to each other.
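The difference between the tree model and a model that expects every pair of instances to align well can be illustrated with a toy example. The sketch below is not HH-MOTiF's implementation: it assumes equal-length, ungapped instances and uses plain sequence identity as a stand-in for the HMM-based similarity, and the sequences, function names, and threshold are all invented for illustration.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences.

    A deliberately simple stand-in for a real similarity score.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def satisfies_tree_model(root, leaves, threshold):
    """Tree model: every leaf must be similar to the root only."""
    return all(identity(root, leaf) >= threshold for leaf in leaves)


def satisfies_all_pairs_model(instances, threshold):
    """Stricter assumption: every pair of instances must be similar."""
    return all(
        identity(a, b) >= threshold for a, b in combinations(instances, 2)
    )


# Hypothetical instances: each leaf shares one half of the root.
root = "RGDSPK"
leaves = ["RGDAAA", "TTTSPK"]

print(satisfies_tree_model(root, leaves, threshold=0.5))
print(satisfies_all_pairs_model([root] + leaves, threshold=0.5))
```

Here both leaves resemble the root, but they share no position with each other, so the set is accepted under the tree model and rejected when all pairs are required to be similar.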

Proteome-wide search

Proteome-wide motif prediction is designed as an extension of the de novo search, although it can also be used independently. It performs a quick search for already known or already predicted motifs in the whole proteome(s) of the selected species. Such a search serves two goals: finding additional motif instances and checking whether the motif is enriched in the initial set.
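One common way to quantify such an enrichment is a one-sided hypergeometric test. This section does not state which statistic HH-MOTiF uses, so the sketch below is only an assumed illustration; the function name and the counts in the usage example are hypothetical.

```python
from math import comb


def enrichment_p_value(proteome_size, proteome_hits, set_size, set_hits):
    """One-sided hypergeometric p-value.

    Probability of observing at least `set_hits` motif-containing proteins
    when `set_size` proteins are drawn without replacement from a proteome
    of `proteome_size` proteins, `proteome_hits` of which contain the motif.
    """
    upper = min(set_size, proteome_hits)
    total = comb(proteome_size, set_size)
    tail = sum(
        comb(proteome_hits, k)
        * comb(proteome_size - proteome_hits, set_size - k)
        for k in range(set_hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 proteins in the proteome, 150 with a motif
# match, and 8 of the 12 proteins in the initial set carrying the motif.
p = enrichment_p_value(20000, 150, 12, 8)
print(f"p = {p:.3g}")
```

A small p-value would suggest that the motif occurs in the initial set more often than expected from its overall frequency in the proteome.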

2. Design goals



HH-MOTiF was designed to satisfy the following criteria:

  • Simplicity: one-click submission with only an input FASTA file must be possible (although additional features and parameters are available in the advanced mode)
  • Informativeness: detailed data on the interplay of sequence similarities must be available to the user for further consideration
  • Performance: it must outperform all other existing tools (the residue-wise F1 value on the ELM database is used as the benchmark)
  • Compatibility: the possibility to view predicted motifs in conventional forms (regex, aligned FASTA, and sequence logo) must be provided

The current version of HH-MOTiF satisfies all four criteria.
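The Performance criterion refers to a residue-wise F1 value. The exact benchmark protocol is not described in this section, so the following is only a sketch of the standard F1 definition applied to individual residues, assuming that both predictions and annotations are given as sets of (protein, position) pairs. The helper names and coordinates are hypothetical.

```python
def residues(protein_id, start, end):
    """Expand an inclusive residue range into (protein_id, position) pairs."""
    return {(protein_id, pos) for pos in range(start, end + 1)}


def residue_wise_f1(predicted, annotated):
    """F1 score computed over individual residues.

    predicted, annotated: sets of (protein_id, position) pairs.
    """
    true_positives = len(predicted & annotated)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(annotated)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the predicted motif overlaps the annotated one
# on 4 of 6 residues.
predicted = residues("P1", 10, 15)
annotated = residues("P1", 12, 17)
print(round(residue_wise_f1(predicted, annotated), 3))
```

Scoring per residue rather than per motif rewards predictions that partially overlap an annotated instance in proportion to the overlap.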

3. Workflow



Workflow scheme

The de novo motif prediction begins with an orthology search, a surface accessibility check, and the building of hidden Markov model (HMM) profiles for all input protein sequences. The resulting HMM profiles are then compared to each other in an all-against-all manner to detect pairwise similarity between the sequences. Next, the similarity statistics are accumulated residue-wise to identify motif roots: stretches in individual proteins that exhibit sufficiently high similarity to sequence stretches in a sufficiently large number of other proteins. The identified motif trees undergo an extensive multi-step validation procedure, during which motifs can be trimmed or discarded altogether. Finally, an interactive HTML output is written, containing information on the validated regions together with their FASTA and sequence logo representations.
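The residue-wise accumulation step can be pictured as counting, for each residue of a protein, how many other proteins have a similar stretch covering it. The sketch below is an assumed simplification rather than the actual HH-MOTiF procedure: hits are taken as already thresholded (start, end) ranges on the query protein, and the parameters `min_support` and `min_length` are hypothetical.

```python
def support_per_residue(length, hits):
    """Count the distinct other proteins whose hits cover each residue.

    hits: iterable of (other_protein_id, start, end) with 0-based,
    inclusive coordinates on the query protein (end < length).
    """
    supporters = [set() for _ in range(length)]
    for other_id, start, end in hits:
        for pos in range(start, end + 1):
            supporters[pos].add(other_id)
    return [len(s) for s in supporters]


def candidate_roots(support, min_support, min_length):
    """Return inclusive (start, end) stretches with enough support."""
    regions = []
    start = None
    for pos, count in enumerate(support):
        if count >= min_support:
            if start is None:
                start = pos
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos - 1))
            start = None
    # Close a stretch that runs to the end of the protein.
    if start is not None and len(support) - start >= min_length:
        regions.append((start, len(support) - 1))
    return regions


# Hypothetical hits on a 12-residue query from proteins B, C and D.
hits = [("B", 2, 7), ("C", 4, 9), ("D", 4, 6), ("B", 5, 8)]
support = support_per_residue(12, hits)
print(support)
print(candidate_roots(support, min_support=2, min_length=3))
```

Counting distinct proteins rather than raw hits keeps several overlapping hits from the same protein from inflating the support of a region.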

The proteome-wide search follows a simplified scheme, in which the HMM profile of the input template motif is compared in a profile-to-sequence, one-to-many manner with all sequences from a proteome. Matches with a high enough score and length are kept as results.
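The final filtering step can be sketched as follows, assuming that the profile-to-sequence comparison has already produced a list of scored matches. The `Match` record, the threshold names, and all values are hypothetical; the actual scores and cut-offs used by HH-MOTiF are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    protein_id: str
    start: int      # inclusive
    end: int        # inclusive
    score: float    # similarity score from the profile comparison

    @property
    def length(self):
        return self.end - self.start + 1


def keep_matches(matches, min_score, min_length):
    """Keep matches passing both thresholds, best score first."""
    kept = [
        m for m in matches
        if m.score >= min_score and m.length >= min_length
    ]
    return sorted(kept, key=lambda m: m.score, reverse=True)


# Hypothetical matches of one template motif against four proteins.
matches = [
    Match("P1", 10, 17, 35.2),
    Match("P2", 3, 5, 40.0),      # high score, but too short
    Match("P3", 100, 109, 12.0),  # long enough, but low score
    Match("P4", 7, 14, 50.1),
]
for m in keep_matches(matches, min_score=30.0, min_length=6):
    print(m.protein_id, m.start, m.end, m.score)
```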