Use of motifs in sequence analysis
4-2-2000
  1. Motivation
  2. Motifs: description and construction
    1. Regular expressions or patterns
    2. Profiles
    3. HMMs
  3. Databases & Tools
  4. Use of motifs
    1. To know the motifs that our problem sequence has: guessing about function
    2. To know what sequences in a database have a given motif: looking for remote homologues.
    3. Constructing a motif and learning about it.
  5. Sources of information


1. Motivation

Two main motivations have driven the development of motif analysis methods and the construction of motif databases. Historically, computational motif analysis was used to detect related sequences whose similarity to the query sequence was too low to be detected with classical methods such as BLAST or FASTA. Motif analysis relies on the identification of small conserved regions that can still be recognised in remote homologues despite the lack of overall similarity. At the same time, motifs exist because of particular requirements on the structure of specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity. These requirements impose very tight constraints on the evolution of those limited (in size) but important portion(s) of a protein sequence. Motif analysis can therefore also be used to get an idea of the function that an unknown protein is likely to have.


2. Motifs: description and construction

Motifs describe, in different ways, conserved regions in alignments. There are several ways of describing motifs, but all of them are related to regular expressions, profiles or HMMs.

2.1. Regular expressions

Let us assume that we have a partial alignment of three sequences around an active site residue (in this example a histidine, whose position is marked with an asterisk) as shown below:

                    *
             ALRDFATHDDF
             SMTAEATHDSI
             ECDQAATHEAS

Here we would start scanning with a core pattern with the sequence A-T-H-[D or E] (this pattern is short and would probably pick up too many false positives).
Patterns are described using the following conventions:
 


   Examples:

   PA   [AC]-x-V-x(4)-{ED}.

   This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

   PA   <A-x-[ST](2)-x(0,1)-V.

   This pattern, which must be at the N-terminus of the sequence (`<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
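
The notation used in the two examples above maps quite directly onto ordinary regular expressions. The following is a minimal sketch in Python, assuming only the elements that appear in those examples (`x', `[..]', `{..}', repeat counts in parentheses, `<' and `>' anchors, `-' separators and the final period). The function name prosite_to_regex is hypothetical and is not part of any PROSITE tool; the pattern is passed without the leading `PA' line code.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Only the notation shown in the examples is handled; this is an
    illustrative sketch, not a complete parser.
    """
    pattern = pattern.strip().rstrip(".")          # drop the closing period
    parts = []
    for element in pattern.split("-"):             # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):                # '<' anchors at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):                  # '>' anchors at the C-terminus
            suffix, element = "$", element[:-1]
        # Split an element such as 'x(4)' or '[ST](2)' into core and repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                            # 'x' means any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"         # '{ED}' means anything but E or D
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))      # [AC].V.{4}[^ED]
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V."))   # ^A.[ST]{2}.{0,1}V
```

The resulting string can be handed to re.search or re.finditer to scan a protein sequence written in one-letter code.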
 

Construction.


The first and most important criterion used in PROSITE is that a good signature pattern must be as short as possible, should detect all or most of the sequences it is designed to describe and should not give too many false positive results. In other words, it must exhibit both high sensitivity and high specificity.
A number of the patterns described in PROSITE have been published. PROSITE developers have tested those patterns on the SWISS-PROT data bank to see whether the signature pattern was still specific to the group or family of proteins since the paper was published. If this was the case, PROSITE uses the published pattern as such; otherwise the pattern was updated using methods similar to those used to develop a new pattern, which are described below.

The creation of a new pattern generally starts by studying review(s) on a group or family of proteins. An alignment table of the proteins discussed in that review is built. If necessary, new published sequences relevant to the subject under consideration are added to this table. Using such alignment tables, particular attention is paid to the residues and regions thought or proved to be important to the biological function of that group of proteins. These biologically significant regions or residues are generally:

We then try to find a short (not more than four or five residues long) conserved sequence which is part of a region known to be important or which includes biologically significant residue(s). We call the pattern(s) created at this stage the `core' pattern(s). The most recent version of the SWISS-PROT Protein Sequence databank is then scanned with these core pattern(s). If a core pattern detects all the proteins under consideration and none (or very few) of the other proteins, we can stop at this stage and use the core pattern as a bona fide signature. In most cases we are not so lucky and we pick up a lot of extra sequences which clearly do not belong to the group of proteins under consideration. A further series of scans, involving a gradual increase in the size of the pattern, is then necessary. In some cases we never manage to find a good pattern and we have to retry with a core pattern from a different part of the sequence. It must also be noted that we pay particular attention to avoiding `false' patterns. The example at the beginning of this section illustrates how a `false' pattern can arise:
According to the procedure outlined above, we would then have to extend the core pattern. But in this case, any extension would be artificial and would group together residues which have different properties and which are represented only once in a given position of the alignment. For example, we could scan with the pattern [R, T or D]-[D, A or Q]-[F, E or A]-A-T-H-[D or E]. This pattern would probably only pick up the sequences which are in the alignment, but it would be biologically meaningless; there is no consensus in the first three positions of the pattern and the pattern does not even group residues with identical physicochemical properties. Consequently, this pattern would probably fail to detect a new sequence containing the same active site but having a different N-terminal sequence.
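
The difference between the core pattern and the artificially extended one can be checked directly on the three aligned sequences. The sketch below is only an illustration: both patterns are written by hand as Python regular expressions, and the fourth sequence (new_member) is an invented example of a protein carrying the same active site with different residues in front of it.

```python
import re

# The three sequences from the alignment at the start of this section.
alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]

# Hypothetical new family member: same A-T-H-D active site, different N-terminal side.
new_member = "GKPLWATHDNV"

core = re.compile(r"ATH[DE]")                          # A-T-H-[DE]
extended = re.compile(r"[RTD][DAQ][FEA]ATH[DE]")       # [RTD]-[DAQ]-[FEA]-A-T-H-[DE]

for seq in alignment + [new_member]:
    print(seq,
          "core:", bool(core.search(seq)),
          "extended:", bool(extended.search(seq)))
```

Both patterns find the three aligned sequences, but only the core pattern finds the invented fourth one, which is the failure mode described above.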
 
 

2.2. Profiles


A profile or weight matrix (the two terms are used synonymously here) is a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence. As with patterns, there may be several matches to a profile in one sequence, but multiple occurrences in the same sequence must be disjoint (non-overlapping) according to a specific definition included in the profile.

The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers(3). Additional parameters allow representation of other motif descriptors, including the currently popular hidden Markov models. A technical description of the profile structure and of the corresponding motif search method is given in the file PROFILE.TXT included in each PROSITE release.

Profiles can be constructed by a large variety of different techniques. The classical method developed by Gribskov and co-workers(4) requires a multiple sequence alignment as input and uses a symbol comparison table to convert residue frequency distributions into weights. The profiles included in the current PROSITE release were generated by this procedure, applying recent modifications described by Luethy and co-workers(5). In the future, we intend to apply additional profile construction tools including structure-based approaches and methods involving machine learning techniques. We also consider the possibility of distributing published profiles developed by others in PROSITE format along with locally produced documentation entries.

Unlike patterns, profiles are usually not confined to small regions with high sequence similarity. Rather, they attempt to characterize a protein family or domain over its entire length. This can lead to specific problems not arising with PROSITE patterns. With a profile covering conserved as well as divergent sequence regions, there is a chance of obtaining a significant similarity score even with a partially incorrect alignment. This possibility is taken into account by our quality evaluation procedures. To be acceptable, a profile must not only assign high similarity scores to true motif occurrences and low scores to false matches; it should also correctly align those residues having analogous functions or structural properties according to experimental data.

Profiles are supposed to be more sensitive and more robust than patterns because they provide discriminatory weights not only for the residues already found at a given position of a motif but also for those not yet found. The weights for those not yet found are extrapolated from the observed amino acid compositions using empirical knowledge about amino acid substitutability. The effect of such a procedure is exemplified below.

Shown are a short alignment without gaps and the corresponding weighting table derived with our standard method.

                  F   K   L   L   S   H   C   L   L   V
                  F   K   A   F   G   Q   T   M   F   Q
                  Y   P   I   V   G   Q   E   L   L   G
                  F   P   V   V   K   E   A   I   L   K
                  F   K   V   L   A   A   V   I   A   D
                  L   E   F   I   S   E   C   I   I   Q
                  F   K   L   L   G   N   V   L   V   C

          A     -18 -10  -1  -8   8  -3   3 -10  -2  -8
          C     -22 -33 -18 -18 -22 -26  22 -24 -19  -7
          D     -35   0 -32 -33  -7   6 -17 -34 -31   0
          E     -27  15 -25 -26  -9  23  -9 -24 -23  -1
          F      60 -30  12  14 -26 -29 -15   4  12 -29
          G     -30 -20 -28 -32  28 -14 -23 -33 -27  -5
          H     -13 -12 -25 -25 -16  14 -22 -22 -23 -10
          I       3 -27  21  25 -29 -23  -8  33  19 -23
          K     -26  25 -25 -27  -6   4 -15 -27 -26   0
          L      14 -28  19  27 -27 -20  -9  33  26 -21
          M       3 -15  10  14 -17 -10  -9  25  12 -11
          N     -22  -6 -24 -27   1   8 -15 -24 -24  -4
          P     -30  24 -26 -28 -14 -10 -22 -24 -26 -18
          Q     -32   5 -25 -26  -9  24 -16 -17 -23   7
          R     -18   9 -22 -22 -10   0 -18 -23 -22  -4
          S     -22  -8 -16 -21  11   2  -1 -24 -19  -4
          T     -10 -10  -6  -7  -5  -8   2 -10  -7 -11
          V       0 -25  22  25 -19 -26   6  19  16 -16
          W       9 -25 -18 -19 -25 -27 -34 -20 -17 -28
          Y      34 -18  -1   1 -23 -12 -19   0   0 -18

Note that at certain positions, a residue not occurring in the alignment receives a higher score than one occurring in the alignment, as a result of the other residues at that position. Thus A, occurring in the third column, has a lower score (-1) than M (+10), which does not occur there but is physicochemically similar to L, I, V and F found in the other sequences. Similar extrapolation procedures are used to derive position-specific insertion and deletion scores which further enhance the selectivity of the profile.
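
To make the use of such a table concrete, the following sketch scores sequence windows against the weighting table above. It is a simplification under stated assumptions: only ungapped matches are considered (the position-specific insertion and deletion scores of a real profile are ignored), and the function names score_window and best_match, as well as the test sequence, are made up for this illustration.

```python
# Weighting table copied from the text: one row per residue, one column per motif position.
WEIGHT_TABLE = """
A -18 -10  -1  -8   8  -3   3 -10  -2  -8
C -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D -35   0 -32 -33  -7   6 -17 -34 -31   0
E -27  15 -25 -26  -9  23  -9 -24 -23  -1
F  60 -30  12  14 -26 -29 -15   4  12 -29
G -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I   3 -27  21  25 -29 -23  -8  33  19 -23
K -26  25 -25 -27  -6   4 -15 -27 -26   0
L  14 -28  19  27 -27 -20  -9  33  26 -21
M   3 -15  10  14 -17 -10  -9  25  12 -11
N -22  -6 -24 -27   1   8 -15 -24 -24  -4
P -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q -32   5 -25 -26  -9  24 -16 -17 -23   7
R -18   9 -22 -22 -10   0 -18 -23 -22  -4
S -22  -8 -16 -21  11   2  -1 -24 -19  -4
T -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V   0 -25  22  25 -19 -26   6  19  16 -16
W   9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y  34 -18  -1   1 -23 -12 -19   0   0 -18
"""

WEIGHTS = {}
for line in WEIGHT_TABLE.strip().splitlines():
    residue, *scores = line.split()
    WEIGHTS[residue] = [int(s) for s in scores]

WIDTH = 10  # number of motif positions (columns) in the table


def score_window(window: str) -> int:
    """Sum the position-specific weights for an ungapped window of WIDTH residues."""
    return sum(WEIGHTS[res][pos] for pos, res in enumerate(window))


def best_match(sequence: str):
    """Slide the table along the sequence; return (offset, score) of the best window."""
    windows = range(len(sequence) - WIDTH + 1)
    return max(((i, score_window(sequence[i:i + WIDTH])) for i in windows),
               key=lambda hit: hit[1])


print(score_window("FKLLSHCLLV"))      # first sequence of the alignment: 221
print(best_match("MMFKLLSHCLLVAA"))    # (2, 221)
```

In a real search, a window would be reported as a motif occurrence only if its score reached the cut-off value stored with the profile.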
 

3  Gribskov M., McLachlan A.D., Eisenberg D. Proc. Natl. Acad. Sci. U.S.A. 84:4355-4358(1987).
4  Gribskov M., Luethy R., Eisenberg D. Meth. Enzymol. 183:146-159(1990).
5  Luethy R., Xenarios I., Bucher P. Protein Sci. 3:139-146(1994).
 

2.3. Profile HMMs

Profile hidden Markov models (profile HMMs) are statistical models of the primary structure consensus of a sequence family. Anders Krogh, David Haussler, and co-workers at UC Santa Cruz introduced profile HMMs [Krogh et al., 1994], adopting HMM techniques which have been used for years in speech recognition. HMMs had been used in biology before the Krogh/Haussler work, but the Krogh paper had a particularly dramatic impact, because HMM technology was so well-suited to the popular ``profile'' methods for searching databases using multiple sequence alignments instead of single query sequences. Since then, several computational biology groups have rapidly adopted HMMs as the underlying formalism for sequence profile analysis.

``Profiles'' were introduced by Gribskov and colleagues [Gribskov et al., 1987, Gribskov et al., 1990] at about the same time that other groups introduced similar approaches, such as ``flexible patterns'' [Barton, 1990], and ``templates'' [Bashford et al., 1987, Taylor, 1986]. The term ``profile'' has stuck. All of these are more or less statistical descriptions of the consensus of a multiple sequence alignment. They use position-specific scores for amino acids (or nucleotides) and position-specific scores for opening and extending an insertion or deletion. Traditional pairwise alignment (for example, BLAST [Altschul et al., 1990], FASTA [Pearson and Lipman, 1988], or the Smith/Waterman algorithm [Smith and Waterman, 1981]) uses position-independent scoring parameters. The position-specific scoring of profiles, in contrast, captures important information about the degree of conservation at various positions in the multiple alignment, and the varying degree to which gaps and insertions are permitted.

The advantage of using HMMs is that HMMs have a formal probabilistic basis. We can use Bayesian probability theory to guide how all the probability (scoring) parameters should be set. Though this might sound like a purely academic issue, this probabilistic basis lets us do things that the more heuristic methods cannot do easily. For example, an HMM can be trained from unaligned sequences, if a trusted alignment isn't yet known. Another consequence is that HMMs have a consistent theory behind gap and insertion scores. In most details, HMMs are a slight improvement over a carefully constructed profile - but far less skill and manual intervention is necessary to train a good HMM and use it. This allows us to make libraries of hundreds of profile HMMs and apply them on a very large scale to whole-genome or EST sequence analysis. One such database of protein domain models is Pfam [Sonnhammer et al., 1997]; the construction and use of Pfam is tightly tied to the HMMER software package.
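
As a rough illustration of what a probabilistic basis means for the scores, the sketch below estimates, for one column of the ungapped alignment shown in section 2.2, the probability of each amino acid and the corresponding log-odds score. It is not the procedure used by HMMER or Pfam: the add-one pseudocount (a very simple prior), the uniform background frequency of 1/20 and the function names are all assumptions made for this example, and transitions (gaps and insertions) are left out entirely.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# The ungapped alignment from section 2.2, one string per sequence.
ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
    "FKLLGNVLVC",
]


def emission_probabilities(column, pseudocount=1.0):
    """Estimate residue probabilities for one column, adding a pseudocount to every residue."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def log_odds(probabilities, background=1.0 / len(AMINO_ACIDS)):
    """Convert probabilities into log2-odds scores against a uniform background."""
    return {aa: math.log2(p / background) for aa, p in probabilities.items()}


first_column = [seq[0] for seq in ALIGNMENT]        # F, F, Y, F, F, L, F
probs = emission_probabilities(first_column)
scores = log_odds(probs)

for aa in "FYLW":
    print(aa, round(probs[aa], 3), round(scores[aa], 2))
```

Because of the pseudocount, a residue never seen in the column (W here) still gets a small non-zero probability and therefore a finite, negative score instead of being forbidden outright.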

HMMs do have important limitations. One is that HMMs do not capture any higher-order correlations. An HMM assumes that the identity of a particular position is independent of the identity of all other positions. HMMs make poor models of RNAs, for instance, because an HMM cannot describe base pairs. Also, compare protein ``threading'' methods, which include scoring terms for nearby amino acids in a three-dimensional protein structure.

A general definition of HMMs and an excellent tutorial introduction to their use has been written by Rabiner [Rabiner, 1989]. Throughout, I will often use ``HMM'' to refer to the specific case of profile HMMs as described by Krogh et al. [Krogh et al., 1994]. This shorthand usage is for convenience only. For a review of profile HMMs, see [Eddy, 1996], and for a complete book on the subject of probabilistic modeling in computational biology, see [Durbin et al., 1998].


3. Databases & tools

The observation of a number of motifs characteristic of particular properties or functions of proteins led to the development of motif databases. These databases generally consist of two components: a discriminator section, used by a specifically designed program to detect whether a sequence belongs to a family or contains a specific domain; and a textual section that provides a description of the motif.
The most important motif databases are:
PROSITE: Describes over 1000 protein families and domains. The motif is described as a regular expression. Regular expressions, however, have limitations for detecting proteins belonging to families whose members are extremely divergent. Profile methods are much more powerful in detecting remote homologies and have been included in PROSITE for many protein families. PROSITE motifs, as previously commented, are based on SWISS-PROT. That is, they do not include sequence information coming from other sources.
BLOCKS: A database composed of small stretches of multiple alignments corresponding to entries of PROSITE. BLOCKS is more a method of detection of motifs than a database itself.
PRINTS: A database composed of series of conserved motifs. In this case, the motif does not need to be a contiguous sequence in the alignment, but can be composed of a series of stretches dispersed along the sequence. For example, the version of March 1999 contained 1100 entries, 2 of them being single-motif fingerprints and 1098 multiple-motif fingerprints (446 available in PROSITE). These were equivalent to 6510 single motifs. This methodology offers in some cases a higher accuracy for detecting remote homologous sequences.
Pfam: A database of multiple alignments of protein domains and conserved protein regions. Each entry is obtained from an alignment and used to derive an HMM statistical descriptor that constitutes one of the most sensitive tools for the detection of remote homologues.
There are other databases but, for the most part, the information they contain is redundant with that offered by the ones already described.
Summarising, there is a consensus on how sensitive the different tools and databases are at finding remote homologous sequences. Profiles are considered to be more sensitive than regular expressions and, within profiles, HMM profiles are considered to be more sensitive than profiles generated by dynamic programming (e.g. PROSITE profiles).
You can also find servers that contain several databases and the corresponding tools to query them. They are very convenient because they offer the best of all the databases and methodologies in one single tool. See, for example, how the GeneFind Family Identification System proceeds. Another good example is the MOTIF service of Tokyo University. See the appendix for a description of databases and tools.


4. Use of motifs

As previously commented, you can use motifs both for trying to predict protein function and for looking for remote homologues. This implies using different tools and databases depending on the particular problem. You can find both tools and databases in the appendix.
There are servers that perform an exhaustive analysis using most of the available databases and tools. Their use is preferable when possible, since the results tend to be much more reliable and robust.

4.1. To know the motifs that our problem sequence has: guessing about function.

Search with a protein query sequence against a database of regular expressions (PROSITE) or profiles (PROSITE profiles or Pfam). The result provides a description of the motifs found in the sequence. In many cases the function of the protein can be inferred, or at least suspected, from them.
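
Conceptually, this kind of search runs every motif in a library against the one query sequence and reports what matched and where. The sketch below mimics this with a toy library of three regular expressions taken from the examples of section 2.1; the library, the entry names, the function find_motifs and the query sequence are all invented for illustration and do not correspond to real PROSITE entries or to the output format of any real server.

```python
import re

# Toy motif library: hypothetical names, patterns taken from the examples in section 2.1.
PATTERN_LIBRARY = {
    "example_active_site": r"ATH[DE]",             # A-T-H-[DE]
    "example_pattern":     r"[AC].V.{4}[^ED]",     # [AC]-x-V-x(4)-{ED}
    "example_n_terminal":  r"^A.[ST]{2}.{0,1}V",   # <A-x-[ST](2)-x(0,1)-V
}


def find_motifs(sequence: str):
    """Return (motif name, start, end, matched text) for every non-overlapping match.

    Positions are 1-based and inclusive, as is usual for protein sequences.
    """
    hits = []
    for name, regex in PATTERN_LIBRARY.items():
        for match in re.finditer(regex, sequence):
            hits.append((name, match.start() + 1, match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


query = "ASTTVKLATHDCGVLMNPQ"   # invented query sequence
for hit in find_motifs(query):
    print(hit)
```

Each reported hit would then be looked up in the textual section of the database to see what function, if any, the motif suggests.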

4.2. To know what sequences in a database have a given motif: looking for remote homologues.

In this case, one must search with a profile against protein sequence databases. Given a profile, the protein sequence databases available on the service you are using are searched to find the proteins that contain the same motif. The profile, in either PROSITE or Pfam format, can be calculated from a multiple sequence alignment (see next section) or retrieved from a motif library such as PROSITE or Pfam. The pfsearch program is used to search with a PROSITE-format profile and hmmsearch is used for a Pfam-format one.
Alternatively, you can search a protein sequence pattern (regular expression) against sequence databases. This allows you to search protein sequence libraries with your own patterns. The sequence pattern must be specified in the PROSITE pattern format.

4.3. Constructing a motif and learning about it

We may face a different problem. Imagine that, after aligning a family of proteins, we discover a conserved region, and that this region was not found when we searched the motif databases with the sequences. In other words, we are looking at a putative new domain. We may be interested in analysing the domain itself, or in using the domain to "fish" for other remote homologues. In both cases, two steps must be carried out: 1) construct a motif from the alignment and 2) search sequence databases using this motif. The process is described here.
This process can be repeated iteratively until no more sequences are found.
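
The build-and-search cycle can be sketched in a few lines. Everything in the example below is a deliberately simplified assumption: the motif is represented as one set of allowed residues per column (instead of a real profile or HMM), a window counts as a hit if it disagrees with the motif at no more than one position, and the seed motif and the four-sequence "database" are invented.

```python
def build_pattern(instances):
    """Step 1: derive a motif from aligned instances, as a set of allowed residues per column."""
    return [set(column) for column in zip(*instances)]


def scan(database, pattern, max_mismatches=1):
    """Step 2: collect every window that disagrees with the motif at no more than max_mismatches positions."""
    width = len(pattern)
    hits = set()
    for sequence in database:
        for i in range(len(sequence) - width + 1):
            window = sequence[i:i + width]
            mismatches = sum(res not in allowed for res, allowed in zip(window, pattern))
            if mismatches <= max_mismatches:
                hits.add(window)
    return hits


def iterative_search(seed, database, max_rounds=10):
    """Repeat steps 1 and 2 until a round finds no new motif instances."""
    known = set(seed)
    for _ in range(max_rounds):
        pattern = build_pattern(sorted(known))
        new_hits = scan(database, pattern) - known
        if not new_hits:          # converged: nothing new was found
            break
        known |= new_hits         # add the new instances and rebuild the motif
    return known


# Invented toy data.
SEED = ["ATHD", "ATHE"]
DATABASE = ["GKPLWATHDNV", "MMSTHEKK", "QQSSHELL", "PPWWYYGG"]

print(sorted(iterative_search(SEED, DATABASE)))
# ['ATHD', 'ATHE', 'SSHE', 'STHE']
```

In this toy run, `SSHE' is only found in the second round, after `STHE' has been added to the motif. The same behaviour means the motif becomes more permissive with every round, so newly recruited sequences should be inspected before they are accepted into the alignment.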


5. Sources of information


Several public domain sources of information have been used in the preparation of this modest tutorial. They are excellent, are updated on a regular basis and also contain references. Here you can find the links to the latest versions of the documents. These are: the PROSITE documentation and the accompanying document PROFILE.TXT; the document on methods for constructing profiles from the PROSITE database; the HMMER User's Guide; and the Motif search help page at GenomeNet (Japan).





 

Joaquín Dopazo (CNIO).