Use of motifs in sequence analysis

Joaquín Dopazo.

4-2-2000

Motivation
Motifs: description and construction

Regular expressions or patterns
Profiles
HMMs

Databases & Tools
Use of motifs

To know the motifs that our problem sequence has: guessing about function
To know what sequences in a database have a given motif: looking for remote homologues.
Constructing a motif and learning about it.

Sources of information
APPENDIX: databases and tools.

1. Motivation

Two main motivations have driven the development of motif analysis methods and the construction of motif databases. Historically, computational motif analysis was used to detect related sequences whose similarity to the query sequence was too low to be detected by classical methods such as BLAST or FASTA. Motif analysis relies on the identification of small conserved regions in proteins that can still be recognised in remote homologues despite the lack of overall similarity. At the same time, motifs exist because of particular requirements on the structure of specific region(s) of a protein, which may be important, for example, for its binding properties or its enzymatic activity. These requirements impose very tight constraints on the evolution of those small but important portion(s) of a protein sequence. Motif analysis can therefore also be used to get an idea of the function that an unknown protein is likely to have.

2. Motifs: description and construction

Motifs describe conserved regions in alignments. There are several ways of describing motifs, but all of them are based on either regular expressions, profiles or HMMs.

2.1. Regular expressions

Let us assume that we have a partial alignment of three sequences around an active site residue (in this example a histidine whose position is marked with an asterisk) as shown below:

                    *
             ALRDFATHDDF
             SMTAEATHDSI
             ECDQAATHEAS

Here we would start scanning with the core pattern A-T-H-[DE] (this pattern is short and would probably pick up too many false positives).
Patterns are described using the following conventions:

The standard IUPAC one-letter codes for the amino acids are used.
The symbol `x' is used for a position where any amino acid is accepted.
Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets `[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
Each element in a pattern is separated from its neighbor by a `-'.
Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
When a pattern is restricted to either the N- or C-terminus of a sequence, that pattern either starts with a `<' symbol or, respectively, ends with a `>' symbol.
A period ends the pattern.

Examples:

PA [AC]-x-V-x(4)-{ED}.

This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

PA <A-x-[ST](2)-x(0,1)-V.

This pattern, which must be at the N-terminus of the sequence (`<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
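
The conventions listed above map almost one-to-one onto ordinary regular expressions. The following is a minimal sketch of such a translation in Python; the function name prosite_to_regex is an assumption made for illustration, the leading `PA' line code is expected to have been removed already, and rarer features of the pattern syntax are not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (without the 'PA' line code) into a Python regex.

    Illustrative sketch only: it covers the conventions listed in the text.
    Note that '.' in the result matches any character, not only amino acid codes.
    """
    pattern = pattern.strip().rstrip(".")      # a period ends the pattern
    anchored_start = pattern.startswith("<")   # restricted to the N-terminus
    anchored_end = pattern.endswith(">")       # restricted to the C-terminus
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        # Separate the element from an optional repetition such as (3) or (2,4).
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, low, high = m.groups()
        if core == "x":                        # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # residues that are NOT accepted
        # '[...]' and single letters are already valid regex syntax.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return ("^" if anchored_start else "") + "".join(parts) + ("$" if anchored_end else "")


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))       # [AC].V.{4}[^ED]
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V."))    # ^A.[ST]{2}.{0,1}V
```

The resulting string can be passed to re.search to scan a sequence, for example re.search(prosite_to_regex("A-T-H-[DE]."), "ALRDFATHDDF").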

Construction.

The first and most important criterion used in PROSITE is that a good signature pattern must be as short as possible, should detect all or most of the sequences it is designed to describe and should not give too many false positive results. In other words it must exhibit both high sensitivity and high specificity.
A number of the patterns described in PROSITE have been published. PROSITE developers have tested those patterns on the SWISS-PROT data bank to see whether the signature pattern was still specific to the group or family of proteins since the paper was published. If this was the case, PROSITE uses the published pattern as such; otherwise the pattern was updated using methods similar to those used to develop a new pattern, which are described below.

The creation of a new pattern generally starts by studying review(s) on a group or family of proteins. An alignment table of the proteins discussed in that review is built. If necessary, new published sequences relevant to the subject under consideration are added to this table. Using such alignment tables, particular attention is paid to the residues and regions thought or proved to be important to the biological function of that group of proteins. These biologically significant regions or residues are generally:

Enzyme catalytic sites.
Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin, etc).
Amino acids involved in binding a metal ion.
Cysteines involved in disulfide bonds.
Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein.

We then try to find a short (not more than four or five residues long) conserved sequence which is part of a region known to be important or which includes biologically significant residue(s). We call the pattern(s) created at this stage the `core' pattern(s). The most recent version of the SWISS-PROT Protein Sequence databank is then scanned with these core pattern(s). If a core pattern detects all the proteins under consideration and none (or very few) of the other proteins, we can stop at this stage and use the core pattern as a bona fide signature. In most cases we are not so lucky and we pick up a lot of extra sequences which clearly do not belong to the group of proteins under consideration. A further series of scans, involving a gradual increase in the size of the pattern, is then necessary. In some cases we never manage to find a good pattern and we have to retry with a core pattern from a different part of the sequence. It must also be noted that we take particular care to avoid `false' patterns. The alignment at the beginning of this section lends itself to such a `false' pattern.
According to the procedure outlined above, we would then have to extend the core pattern. But in this case, any extension would be artificial and would group together residues which have different properties and which are represented only once in a given position of the alignment. For example, we could scan with the pattern [RTD]-[DAQ]-[FEA]-A-T-H-[DE]. This pattern would probably only pick up the sequences which are in the alignment, but it would be biologically meaningless: there is no consensus in the first three positions of the pattern and the pattern does not even group residues with identical physicochemical properties. Consequently, this pattern would probably fail to detect a new sequence containing the same active site but having a different N-terminal sequence.
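
The danger of such an artificial extension can be illustrated with a few lines of Python. The core and extended patterns are written directly as regular expressions; the fourth sequence, new_member, is a hypothetical sequence invented here to stand for a new family member that has the same A-T-H-[DE] site but different residues upstream of it.

```python
import re

# Core pattern A-T-H-[DE] and its artificial extension, written as regular expressions.
core = re.compile(r"ATH[DE]")
extended = re.compile(r"[RTD][DAQ][FEA]ATH[DE]")

# The three sequences of the alignment shown earlier in this section.
aligned = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]

# Hypothetical new family member (invented for this illustration):
# same A-T-H-[DE] site, different residues in front of it.
new_member = "GKPLVATHDGW"

sequences = aligned + [new_member]
core_hits = [bool(core.search(s)) for s in sequences]
extended_hits = [bool(extended.search(s)) for s in sequences]

for seq, c, e in zip(sequences, core_hits, extended_hits):
    print(f"{seq}  core={c}  extended={e}")
```

The extended pattern recovers exactly the sequences it was built from and misses the new one, whereas the core pattern finds all four (at the price of more false positives in a real database).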

2.2. Profiles

A profile or weight matrix (the two terms are used synonymously here) is a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence. As with patterns, there may be several matches to a profile in one sequence, but multiple occurrences in the same sequence must be disjoint (non-overlapping) according to a specific definition included in the profile.

The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers(3). Additional parameters allow representation of other motif descriptors, including the currently popular hidden Markov models. A technical description of the profile structure and of the corresponding motif search method is given in the file PROFILE.TXT included in each PROSITE release.

Profiles can be constructed by a large variety of different techniques. The classical method developed by Gribskov and co-workers(4) requires a multiple sequence alignment as input and uses a symbol comparison table to convert residue frequency distributions into weights. The profiles included in the current PROSITE release were generated by this procedure applying recent modifications described by Luethy and co-workers(5). In the future, we intend to apply additional profile construction tools including structure-based approaches and methods involving machine learning techniques. We also consider the possibility of distributing published profiles developed by others in PROSITE format along with locally produced documentation entries.
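
The classical idea mentioned above, converting the residue frequencies of an alignment column into weights through a symbol comparison table, can be sketched as a frequency-weighted average of comparison scores. The three-letter alphabet and the scores in COMPARISON are invented for illustration (a real profile would use a full 20 x 20 amino acid comparison table), and sequence weighting and gap costs are left out, so this is a toy under those assumptions rather than the procedure used for PROSITE.

```python
from collections import Counter

# Toy alphabet and comparison scores, invented for illustration only.
ALPHABET = "LIK"
COMPARISON = {
    ("L", "L"): 4, ("L", "I"): 2, ("L", "K"): -2,
    ("I", "I"): 4, ("I", "K"): -3,
    ("K", "K"): 5,
}

def compare(a, b):
    """Look up the symmetric comparison score of two residues."""
    return COMPARISON[(a, b)] if (a, b) in COMPARISON else COMPARISON[(b, a)]

def column_weights(column):
    """Weight of residue a = sum over observed residues b of freq(b) * compare(a, b)."""
    counts = Counter(column)
    n = len(column)
    return {
        a: sum(counts[b] / n * compare(a, b) for b in counts)
        for a in ALPHABET
    }

# One alignment column in which L was seen three times and I once.
print(column_weights("LLLI"))   # {'L': 3.5, 'I': 2.5, 'K': -2.25}
# A column with only L: I was never observed but still gets a positive weight.
print(column_weights("LLLL"))   # {'L': 4.0, 'I': 2.0, 'K': -2.0}
```

Every residue of the alphabet receives a weight, including residues never observed in the column; residues similar to the observed ones score well and dissimilar ones are penalised.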

Unlike patterns, profiles are usually not confined to small regions with high sequence similarity. Rather they attempt to characterize a protein family or domain over its entire length. This can lead to specific problems not arising with PROSITE patterns. With a profile covering conserved as well as divergent sequence regions, there is a chance to obtain a significant similarity score even with a partially incorrect alignment. This possibility is taken into account by our quality evaluation procedures. In order to be acceptable, a profile must not only assign high similarity scores to true motif occurrences and low scores to false matches. In addition, it should correctly align those residues having analogous functions or structural properties according to experimental data.

Profiles are supposed to be more sensitive and more robust than patterns because they provide discriminatory weights not only for the residues already found at a given position of a motif but also for those not yet found. The weights for those not yet found are extrapolated from the observed amino acid compositions using empirical knowledge about amino acid substitutability. The effect of such a procedure is exemplified below.

Shown are a short alignment without gaps and the corresponding weighting table derived with our standard method.

                  F   K   L   L   S   H   C   L   L   V
                  F   K   A   F   G   Q   T   M   F   Q
                  Y   P   I   V   G   Q   E   L   L   G
                  F   P   V   V   K   E   A   I   L   K
                  F   K   V   L   A   A   V   I   A   D
                  L   E   F   I   S   E   C   I   I   Q
                  F   K   L   L   G   N   V   L   V   C

          A     -18 -10  -1  -8   8  -3   3 -10  -2  -8
          C     -22 -33 -18 -18 -22 -26  22 -24 -19  -7
          D     -35   0 -32 -33  -7   6 -17 -34 -31   0
          E     -27  15 -25 -26  -9  23  -9 -24 -23  -1
          F      60 -30  12  14 -26 -29 -15   4  12 -29
          G     -30 -20 -28 -32  28 -14 -23 -33 -27  -5
          H     -13 -12 -25 -25 -16  14 -22 -22 -23 -10
          I       3 -27  21  25 -29 -23  -8  33  19 -23
          K     -26  25 -25 -27  -6   4 -15 -27 -26   0
          L      14 -28  19  27 -27 -20  -9  33  26 -21
          M       3 -15  10  14 -17 -10  -9  25  12 -11
          N     -22  -6 -24 -27   1   8 -15 -24 -24  -4
          P     -30  24 -26 -28 -14 -10 -22 -24 -26 -18
          Q     -32   5 -25 -26  -9  24 -16 -17 -23   7
          R     -18   9 -22 -22 -10   0 -18 -23 -22  -4
          S     -22  -8 -16 -21  11   2  -1 -24 -19  -4
          T     -10 -10  -6  -7  -5  -8   2 -10  -7 -11
          V       0 -25  22  25 -19 -26   6  19  16 -16
          W       9 -25 -18 -19 -25 -27 -34 -20 -17 -28
          Y      34 -18  -1   1 -23 -12 -19   0   0 -18

Note that at certain positions, a residue not occurring in the alignment receives a higher score than one occurring in the alignment, as a result of other residues at that position. Thus A occurring in the third column has a lower score (-1) than M (+10) not occurring there but physicochemically similar to L, I, V, F found in the other sequences. Similar extrapolation procedures are used to derive position-specific insertion and deletion scores which further enhance the selectivity of the profile.
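
To make the use of such a table concrete, the following Python sketch scores ungapped ten-residue windows against the weights shown above. The weights are copied from the table; the function names score and best_window and the example sequence are assumptions made for illustration, and gap costs and the cut-off value, which a real profile also carries, are deliberately ignored.

```python
# Position-specific weights copied from the table above
# (rows: residues, columns: alignment positions 1 to 10).
TABLE = """
A -18 -10  -1  -8   8  -3   3 -10  -2  -8
C -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D -35   0 -32 -33  -7   6 -17 -34 -31   0
E -27  15 -25 -26  -9  23  -9 -24 -23  -1
F  60 -30  12  14 -26 -29 -15   4  12 -29
G -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I   3 -27  21  25 -29 -23  -8  33  19 -23
K -26  25 -25 -27  -6   4 -15 -27 -26   0
L  14 -28  19  27 -27 -20  -9  33  26 -21
M   3 -15  10  14 -17 -10  -9  25  12 -11
N -22  -6 -24 -27   1   8 -15 -24 -24  -4
P -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q -32   5 -25 -26  -9  24 -16 -17 -23   7
R -18   9 -22 -22 -10   0 -18 -23 -22  -4
S -22  -8 -16 -21  11   2  -1 -24 -19  -4
T -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V   0 -25  22  25 -19 -26   6  19  16 -16
W   9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y  34 -18  -1   1 -23 -12 -19   0   0 -18
"""

WIDTH = 10
WEIGHTS = {}
for row in TABLE.split("\n"):
    fields = row.split()
    if fields:
        WEIGHTS[fields[0]] = [int(value) for value in fields[1:]]

def score(segment):
    """Sum the weights of the residues found at each of the ten profile positions."""
    if len(segment) != WIDTH:
        raise ValueError("segment must be exactly 10 residues long")
    return sum(WEIGHTS[residue][position] for position, residue in enumerate(segment))

def best_window(sequence):
    """Return (offset, score) of the highest-scoring ungapped window of the sequence."""
    if len(sequence) < WIDTH:
        raise ValueError("sequence is shorter than the profile")
    best_score, best_offset = max(
        (score(sequence[i:i + WIDTH]), i) for i in range(len(sequence) - WIDTH + 1)
    )
    return best_offset, best_score

print(WEIGHTS["A"][2], WEIGHTS["M"][2])   # -1 10: the third-column example from the text
print(score("FKLLSHCLLV"))                # 221: first sequence of the alignment
print(best_window("GGFKLLSHCLLVGG"))      # (2, 221): invented sequence with that segment embedded
```

In a real search the best score would then be compared with the cut-off value stored in the profile to decide whether the window counts as a motif occurrence.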

3 Gribskov M., McLachlan A.D., Eisenberg D. Proc. Natl. Acad. Sci. U.S.A. 84:4355-4358(1987).
4 Gribskov M., Luethy R., Eisenberg D. Meth. Enzymol. 183:146-159(1990).
5 Luethy R., Xenarios I., Bucher P. Protein Sci. 3:139-146(1994).

2.3. Profile HMMs

Profile hidden Markov models (profile HMMs) are statistical models of the primary structure consensus of a sequence family. Anders Krogh, David Haussler, and co-workers at UC Santa Cruz introduced profile HMMs [Krogh et al., 1994], adopting HMM techniques which have been used for years in speech recognition. HMMs had been used in biology before the Krogh/Haussler work, but the Krogh paper had a particularly dramatic impact, because HMM technology was so well-suited to the popular "profile" methods for searching databases using multiple sequence alignments instead of single query sequences. Since then, several computational biology groups have rapidly adopted HMMs as the underlying formalism for sequence profile analysis.

"Profiles" were introduced by Gribskov and colleagues [Gribskov et al., 1987, Gribskov et al., 1990] at about the same time that other groups introduced similar approaches, such as "flexible patterns" [Barton, 1990] and "templates" [Bashford et al., 1987, Taylor, 1986]. The term "profile" has stuck. All of these are more or less statistical descriptions of the consensus of a multiple sequence alignment. They use position-specific scores for amino acids (or nucleotides) and position-specific scores for opening and extending an insertion or deletion. Traditional pairwise alignment (for example, BLAST [Altschul et al., 1990], FASTA [Pearson and Lipman, 1988], or the Smith/Waterman algorithm [Smith and Waterman, 1981]) uses position-independent scoring parameters. The position-specific scoring of profiles captures important information about the degree of conservation at various positions in the multiple alignment, and the varying degree to which gaps and insertions are permitted.

The advantage of using HMMs is that HMMs have a formal probabilistic basis. We can use Bayesian probability theory to guide how all the probability (scoring) parameters should be set. Though this might sound like a purely academic issue, this probabilistic basis lets us do things that the more heuristic methods cannot do easily. For example, an HMM can be trained from unaligned sequences, if a trusted alignment isn't yet known. Another consequence is that HMMs have a consistent theory behind gap and insertion scores. In most details, HMMs are a slight improvement over a carefully constructed profile - but far less skill and manual intervention is necessary to train a good HMM and use it. This allows us to make libraries of hundreds of profile HMMs and apply them on a very large scale to whole-genome or EST sequence analysis. One such database of protein domain models is Pfam [Sonnhammer et al., 1997]; the construction and use of Pfam is tightly tied to the HMMER software package.

HMMs do have important limitations. One is that HMMs do not capture any higher-order correlations. An HMM assumes that the identity of a particular position is independent of the identity of all other positions. HMMs make poor models of RNAs, for instance, because an HMM cannot describe base pairs. Also, compare protein "threading" methods, which include scoring terms for nearby amino acids in a three-dimensional protein structure.
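
The independence assumption can be seen in the simplest possible case, a model made of match states only. The Python sketch below estimates one emission distribution per alignment column and scores a segment as a sum of per-position log-odds terms. It is a deliberately reduced illustration: insert and delete states and transition probabilities are omitted, and the add-one pseudocount and the uniform background of 1/20 are assumptions chosen for simplicity, not what HMMER does.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_emissions(alignment, pseudocount=1.0):
    """Estimate one emission distribution per column of an ungapped alignment.

    The add-one pseudocount is an assumption made for this sketch.
    """
    model = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        model.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return model

def log_odds(model, segment, background=1.0 / 20):
    """Score a segment as a sum of independent per-position log-odds (in bits).

    Because each position contributes its own term, no correlation between
    positions can influence the score.
    """
    if len(segment) != len(model):
        raise ValueError("segment length must equal the number of match states")
    return sum(math.log2(column[res] / background) for column, res in zip(model, segment))

# Columns A-T-H-[DE] taken from the three-sequence alignment of section 2.1.
model = match_emissions(["ATHD", "ATHD", "ATHE"])
for segment in ("ATHD", "ATHE", "GGGG"):
    print(segment, round(log_odds(model, segment), 2))
```

Segments that agree with the alignment receive positive scores and unrelated segments negative ones, but a pair of positions that always vary together would be scored no differently from two positions that vary independently.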

A general definition of HMMs and an excellent tutorial introduction to their use has been written by Rabiner [Rabiner, 1989]. Throughout, I will often use "HMM" to refer to the specific case of profile HMMs as described by Krogh et al. [Krogh et al., 1994]. This shorthand usage is for convenience only. For a review of profile HMMs, see [Eddy, 1996], and for a complete book on the subject of probabilistic modeling in computational biology, see [Durbin et al., 1998].

3. Databases & tools

The accumulated knowledge of motifs that are characteristic of particular protein properties or functions led to the development of motif databases. These databases generally consist of two components: a discriminator section, which is used by a specifically designed program to detect whether a sequence belongs to a family or contains a specific domain; and a textual section that provides a description of the motif.
The most important motif databases are:
PROSITE: Describes over 1000 protein families and domains. The motif is described as a regular expression. Nevertheless, regular expressions have limitations for detecting proteins that belong to families whose members are extremely divergent. Profile methods are much more powerful in detecting remote homologies and have been included in PROSITE for many protein families. PROSITE motifs, as previously noted, are based on SWISS-PROT; that is, they do not include sequence information from other sources.
BLOCKS: A database composed of short stretches of multiple alignments corresponding to entries of PROSITE. BLOCKS is more a method for detecting motifs than a database in itself.
PRINTS: A database composed of series of conserved motifs. In this case, the motif does not need to be a contiguous stretch of the alignment, but can be composed of a series of stretches dispersed along the sequence. For example, the March 1999 version contained 1100 entries, 2 of them single-motif fingerprints and 1098 multiple-motif fingerprints (446 available in PROSITE). These were equivalent to 6510 single motifs. In some cases this methodology offers higher accuracy for detecting remote homologues.
Pfam: A database of multiple alignments of protein domains and conserved protein regions. Each entry is obtained from an alignment and used to derive an HMM statistical descriptor, which constitutes one of the most sensitive tools for the detection of remote homologues.
There are other databases, but most of the information they contain is redundant with that offered by those already described.
Summarising, there is a consensus on how sensitive the different tools and databases are at finding remote homologues. Profiles are considered more sensitive than regular expressions and, among profiles, HMM profiles are considered more sensitive than profiles generated by dynamic programming (e.g. PROSITE profiles).
There are also servers that host several databases together with the corresponding tools to query them. They are very convenient because they offer the best of all the databases and methodologies in a single tool. See, for example, how the GeneFind Family Identification System proceeds. Another good example is the MOTIF service of Tokyo University. See the appendix for a description of databases and tools.

4. Use of motifs

As previously noted, you can use motifs either to try to predict protein function or to look for remote homologues. This means using different tools and databases depending on the particular problem. You can find both tools and databases in the appendix.
There are servers that perform an exhaustive analysis using most of the available databases and tools. Their use is preferable whenever possible, because the results tend to be much more reliable and robust.

4.1. To know the motifs that our problem sequence has: guessing about function.

Search with a protein query sequence against a database of regular expressions (PROSITE) or profiles (PROSITE profiles or Pfam). The result provides a description of the motifs found in the sequence. In many cases the function of the protein can be inferred, or at least suspected, from them.
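
In miniature, such a search amounts to running every entry of a motif library over the query and reporting what matches. The Python sketch below uses two well-known PROSITE patterns, hand-translated into regular expressions; the two-entry MOTIF_LIBRARY, the scan function and the query sequence are all assumptions made for illustration, and a real service would of course use the complete database and its own programs.

```python
import re

# Two PROSITE patterns, hand-translated into Python regular expressions.
# A real search would cover the whole database, not this two-entry toy library.
MOTIF_LIBRARY = {
    "N-glycosylation site, N-{P}-[ST]-{P}": re.compile(r"N[^P][ST][^P]"),
    "ATP/GTP-binding site motif A (P-loop), [AG]-x(4)-G-K-[ST]": re.compile(r"[AG].{4}GK[ST]"),
}

def scan(query):
    """Return (motif description, 1-based start, matched residues) for every hit.

    finditer reports non-overlapping matches only, so overlapping occurrences
    of the same motif are not listed.
    """
    hits = []
    for description, regex in MOTIF_LIBRARY.items():
        for match in regex.finditer(query):
            hits.append((description, match.start() + 1, match.group(0)))
    return hits

# Hypothetical query sequence, invented for this illustration.
QUERY = "MSNGTLAGPSGSGKSTV"

for description, start, residues in scan(QUERY):
    print(f"{start:>3}  {residues:<10}  {description}")
```

Short patterns such as the glycosylation site match very often by chance, which is why hits of this kind only suggest a function rather than prove it.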

4.2. To know what sequences in a database have a given motif: looking for remote homologues.

In this case, one must search with a profile against protein sequence databases. Given a profile, the protein sequence databases available on the service you are using are searched to find the protein families that have the same motif. The profile, in either PROSITE or Pfam format, can be calculated from a multiple sequence alignment (see next section) or retrieved from a motif library such as PROSITE or Pfam. The Pfsearch program is used to search with a PROSITE-format profile, and Hmmsearch is used for a Pfam-format one.
Alternatively, you can search with a protein sequence pattern (regular expression) against sequence databases. The sequence pattern must be specified in the PROSITE pattern format.

4.3. Constructing a motif and learning about it

We may face another problem. Let us imagine that, after aligning a family of proteins, we discover a conserved region that was not found when we searched the motif databases with the sequences. In other words, we are looking at a putative domain. We may be interested in analysing the domain itself, or in using the domain to "fish" for other remote homologues. In both cases, two steps must be carried out: 1) construct a motif from the alignment and 2) search sequence databases using this motif.
This process can be repeated iteratively until no more sequences are found.
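
The build-search-repeat cycle can be sketched in Python as follows. Everything here is a toy stand-in: the motif is just the set of residues seen in each column, the search tolerates one mismatch per window (a crude substitute for the extrapolation a real profile or HMM provides), and the seed segments and the four-sequence database are invented for illustration.

```python
def column_sets(segments):
    """Step 1: build a toy motif, the set of residues observed in each column."""
    return [set(column) for column in zip(*segments)]

def search(columns, database, max_mismatches=1):
    """Step 2: collect every window that disagrees with the motif in at most
    max_mismatches columns (a crude stand-in for a profile search)."""
    width = len(columns)
    hits = set()
    for sequence in database.values():
        for i in range(len(sequence) - width + 1):
            window = sequence[i:i + width]
            mismatches = sum(res not in allowed for res, allowed in zip(window, columns))
            if mismatches <= max_mismatches:
                hits.add(window)
    return hits

def iterate(seed, database, max_rounds=10):
    """Repeat steps 1 and 2 until a round brings in no new segments."""
    members = set(seed)
    for _ in range(max_rounds):
        new_hits = search(column_sets(members), database) - members
        if not new_hits:
            break
        members |= new_hits
    return members

# Seed segments around the A-T-H-[DE] site of section 2.1, and an invented database.
SEED = ["FATHD", "EATHD", "AATHE"]
DATABASE = {
    "p1": "MKLFATHDGG",
    "p2": "QQVATHDSS",
    "p3": "TTVATHNPP",
    "p4": "GGGGGGGG",
}

print(sorted(search(column_sets(SEED), DATABASE)))   # first round: ['FATHD', 'VATHD']
print(sorted(iterate(SEED, DATABASE)))               # converged set also contains 'VATHN'
```

The segment VATHN is only reached in the second round, after VATHD has widened the first column. This is the benefit of iterating, and also its risk: each round loosens the motif, so new members should be inspected before they are added.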

5. Sources of information

Several public-domain sources of information have been used in preparing this modest tutorial. They are excellent, are updated on a regular basis and also contain references. Here you can find the links to the latest versions of the documents. These are: the PROSITE documentation and the accompanying document PROFILE.TXT; the document on methods for constructing profiles from the PROSITE database; the HMMER User's Guide; and the Motif search help page at GenomeNet (Japan).

Paulino Gomez Puertas 3-03-01