*
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
Here we would start scanning with the core pattern
A-T-H-[DE], that is, Ala-Thr-His followed by Asp or Glu (this pattern is
short and would probably pick up too many false positive results).
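The scan with the core pattern can be sketched in a few lines of Python. This is only an illustration, not the PROSITE search program: it assumes the core pattern can be written as the regular expression `ATH[DE]`, and the names `SEQUENCES`, `CORE` and `find_core` are made up for this example.

```python
import re

# The three aligned sequences from the example above.
SEQUENCES = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]

# Core pattern A-T-H-[D or E], written as a regular expression (assumed equivalent).
CORE = re.compile(r"ATH[DE]")

def find_core(sequence):
    """Return (start, matched_text) for every occurrence of the core pattern."""
    return [(m.start(), m.group()) for m in CORE.finditer(sequence)]

for seq in SEQUENCES:
    print(seq, find_core(seq))
```

Positions are 0-based. A pattern this short matches all three sequences, but it would also match many unrelated proteins, which is why it needs to be extended.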
Patterns are described using the following conventions:
Examples:
PA [AC]-x-V-x(4)-{ED}.
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
PA <A-x-[ST](2)-x(0,1)-V.
This pattern, which must be at
the N-terminus of the sequence (`<'), is translated as: Ala-any-[Ser
or Thr]-[Ser or Thr]-(any or none)-Val
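The two translations above can be reproduced mechanically. The following is a minimal sketch, assuming that only the notation seen in these two examples needs to be handled (`[..]`, `{..}`, `x`, repeats `(n)` and `(n,m)`, the leading `<` and the final period). The function name `prosite_to_regex` is hypothetical and is not part of any PROSITE tool.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only the notation used in the two examples above.
    """
    pattern = pattern.strip().rstrip(".")    # drop the final period
    prefix = ""
    if pattern.startswith("<"):              # '<' anchors the match at the N-terminus
        prefix, pattern = "^", pattern[1:]
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(0,1)" into its core "x" and its repeat counts.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.groups()
        if core == "x":                      # x: any amino acid
            core = "."
        elif core.startswith("{"):           # {ED}: any residue except E or D
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # (4) -> {4}, (0,1) -> {0,1}
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return prefix + "".join(parts)

print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))     # [AC].V.{4}[^ED]
print(prosite_to_regex("<A-x-[ST](2)-x(0,1)-V."))  # ^A.[ST]{2}.{0,1}V
```

The resulting strings can be passed to `re.search` to scan a sequence written in one-letter code.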
The first and most important criterion
used in PROSITE is that a good signature pattern must
be as short as possible, should detect all or most of the sequences
it is designed to describe and should not give too many false
positive results. In other words it must exhibit both high sensitivity
and high specificity.
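For reference, the two quantities can be computed from the outcome of a database scan as follows. This sketch uses the standard definitions of sensitivity and specificity; the counts in the usage lines are invented for illustration and do not come from any PROSITE entry.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of the true family members that the pattern detects."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of the unrelated sequences that the pattern correctly rejects."""
    return true_neg / (true_neg + false_pos)

# Hypothetical scan: 95 of 100 family members found,
# 20 of 1000 unrelated sequences matched by chance.
print(sensitivity(true_pos=95, false_neg=5))
print(specificity(true_neg=980, false_pos=20))
```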
A number of the patterns described in PROSITE have been
published. PROSITE developers have tested those patterns on
the SWISS-PROT data bank to see whether the signature pattern was still
specific to the group or family of proteins since the paper
was published. If so, PROSITE uses the published pattern
as such; otherwise the pattern is updated using methods similar to those
used to develop a new pattern, which are
described in the following sub-section.
The creation of a new pattern generally starts by studying review(s) on a group or family of proteins. An alignment table of the proteins discussed in that review is built. If necessary, newly published sequences relevant to the subject under consideration are added to this table. Using such alignment tables, particular attention is paid to the residues and regions thought or proved to be important to the biological function of that group of proteins. These biologically significant regions or residues are generally:
A profile or weight matrix (the two terms are used
synonymously here) is a table of position-specific amino acid weights
and gap costs. These numbers (also referred to as scores) are used
to calculate a similarity score for any alignment between a profile
and a sequence, or parts of a profile and a sequence. An alignment with
a similarity score higher than or equal to a given cut-off
value constitutes a motif occurrence. As with patterns,
there may be several matches to a profile in one sequence,
but multiple occurrences in the same sequence must
be disjoint (non-overlapping) according to a specific definition
included in the profile.
The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers(3). Additional parameters allow representation of other motif descriptors, including the currently popular hidden Markov models. A technical description of the profile structure and of the corresponding motif search method is given in the file PROFILE.TXT included in each PROSITE release.
Profiles can be constructed by a large variety of different techniques. The classical method developed by Gribskov and co-workers(4) requires a multiple sequence alignment as input and uses a symbol comparison table to convert residue frequency distributions into weights. The profiles included in the current PROSITE release were generated by this procedure applying recent modifications described by Luethy and co-workers(5). In the future, we intend to apply additional profile construction tools including structure-based approaches and methods involving machine learning techniques. We also consider the possibility of distributing published profiles developed by others in PROSITE format along with locally produced documentation entries.
Unlike patterns, profiles are usually not confined to small regions with high sequence similarity. Rather they attempt to characterize a protein family or domain over its entire length. This can lead to specific problems not arising with PROSITE patterns. With a profile covering conserved as well as divergent sequence regions, there is a chance to obtain a significant similarity score even with a partially incorrect alignment. This possibility is taken into account by our quality evaluation procedures. In order to be acceptable, a profile must not only assign high similarity scores to true motif occurrences and low scores to false matches. In addition, it should correctly align those residues having analogous functions or structural properties according to experimental data.
Profiles are supposed to be more sensitive and more robust than patterns because they provide discriminatory weights not only for the residues already found at a given position of a motif but also for those not yet found. The weights for residues not yet found are extrapolated from the observed amino acid compositions using empirical knowledge about amino acid substitutability. The effect of such a procedure is exemplified below.
Shown are a short alignment without gaps and the corresponding weighting table derived with our standard method.
F K L L S H C L L V
F K A F G Q T M F Q
Y P I V G Q E L L G
F P V V K E A I L K
F K V L A A V I A D
L E F I S E C I I Q
F K L L G N V L V C
     1   2   3   4   5   6   7   8   9  10
A  -18 -10  -1  -8   8  -3   3 -10  -2  -8
C  -22 -33 -18 -18 -22 -26  22 -24 -19  -7
D  -35   0 -32 -33  -7   6 -17 -34 -31   0
E  -27  15 -25 -26  -9  23  -9 -24 -23  -1
F   60 -30  12  14 -26 -29 -15   4  12 -29
G  -30 -20 -28 -32  28 -14 -23 -33 -27  -5
H  -13 -12 -25 -25 -16  14 -22 -22 -23 -10
I    3 -27  21  25 -29 -23  -8  33  19 -23
K  -26  25 -25 -27  -6   4 -15 -27 -26   0
L   14 -28  19  27 -27 -20  -9  33  26 -21
M    3 -15  10  14 -17 -10  -9  25  12 -11
N  -22  -6 -24 -27   1   8 -15 -24 -24  -4
P  -30  24 -26 -28 -14 -10 -22 -24 -26 -18
Q  -32   5 -25 -26  -9  24 -16 -17 -23   7
R  -18   9 -22 -22 -10   0 -18 -23 -22  -4
S  -22  -8 -16 -21  11   2  -1 -24 -19  -4
T  -10 -10  -6  -7  -5  -8   2 -10  -7 -11
V    0 -25  22  25 -19 -26   6  19  16 -16
W    9 -25 -18 -19 -25 -27 -34 -20 -17 -28
Y   34 -18  -1   1 -23 -12 -19   0   0 -18
Note that at certain positions, a residue
not occurring in the alignment receives a higher score
than one occurring in the alignment, as a result of other residues
at that position. Thus A occurring in the third column has
a lower score (-1) than
M (+10) not occurring there but physicochemically
similar to L, I, V, F found in the other
sequences. Similar extrapolation procedures are used
to derive position-specific insertion and deletion scores which
further enhance the selectivity of the profile.
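How such a weighting table is used to score a sequence can be sketched as follows. This is a simplified illustration, not the PROSITE search method: it assumes ungapped matches only (the insertion and deletion scores are ignored), the cut-off of 200 is an arbitrary value chosen for the example, and the names `WEIGHTS`, `window_score` and `scan` are hypothetical.

```python
# Position-specific weights copied from the table above:
# one entry per amino acid, one value per alignment column (1 to 10).
WEIGHTS = {
    "A": [-18, -10, -1, -8, 8, -3, 3, -10, -2, -8],
    "C": [-22, -33, -18, -18, -22, -26, 22, -24, -19, -7],
    "D": [-35, 0, -32, -33, -7, 6, -17, -34, -31, 0],
    "E": [-27, 15, -25, -26, -9, 23, -9, -24, -23, -1],
    "F": [60, -30, 12, 14, -26, -29, -15, 4, 12, -29],
    "G": [-30, -20, -28, -32, 28, -14, -23, -33, -27, -5],
    "H": [-13, -12, -25, -25, -16, 14, -22, -22, -23, -10],
    "I": [3, -27, 21, 25, -29, -23, -8, 33, 19, -23],
    "K": [-26, 25, -25, -27, -6, 4, -15, -27, -26, 0],
    "L": [14, -28, 19, 27, -27, -20, -9, 33, 26, -21],
    "M": [3, -15, 10, 14, -17, -10, -9, 25, 12, -11],
    "N": [-22, -6, -24, -27, 1, 8, -15, -24, -24, -4],
    "P": [-30, 24, -26, -28, -14, -10, -22, -24, -26, -18],
    "Q": [-32, 5, -25, -26, -9, 24, -16, -17, -23, 7],
    "R": [-18, 9, -22, -22, -10, 0, -18, -23, -22, -4],
    "S": [-22, -8, -16, -21, 11, 2, -1, -24, -19, -4],
    "T": [-10, -10, -6, -7, -5, -8, 2, -10, -7, -11],
    "V": [0, -25, 22, 25, -19, -26, 6, 19, 16, -16],
    "W": [9, -25, -18, -19, -25, -27, -34, -20, -17, -28],
    "Y": [34, -18, -1, 1, -23, -12, -19, 0, 0, -18],
}
WIDTH = 10  # number of columns in the alignment

def window_score(segment):
    """Sum the position-specific weights of an ungapped 10-residue segment."""
    return sum(WEIGHTS[residue][col] for col, residue in enumerate(segment))

def scan(sequence, cutoff):
    """Return (start, score) for every ungapped window scoring >= cutoff."""
    hits = []
    for start in range(len(sequence) - WIDTH + 1):
        score = window_score(sequence[start:start + WIDTH])
        if score >= cutoff:
            hits.append((start, score))
    return hits

print(window_score("FKLLSHCLLV"))        # first sequence of the alignment: 221
print(WEIGHTS["A"][2], WEIGHTS["M"][2])  # column 3: A scores -1, M scores 10
print(scan("GGFKLLSHCLLVGG", cutoff=200))
```

Note that `scan` may report overlapping windows. A real profile search also applies the gap costs and the disjointness rule stored in the profile, both of which are left out of this sketch.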
3 Gribskov M., McLachlan A.D., Eisenberg D. Proc. Natl. Acad. Sci. U.S.A. 84:4355-4358(1987).
4 Gribskov M., Luethy R., Eisenberg D. Meth. Enzymol. 183:146-159(1990).
5 Luethy R., Xenarios I., Bucher P. Protein Sci. 3:139-146(1994).
``Profiles'' were introduced by Gribskov and colleagues [Gribskov et al., 1987, Gribskov et al., 1990] at about the same time that other groups introduced similar approaches, such as ``flexible patterns'' [Barton, 1990] and ``templates'' [Bashford et al., 1987, Taylor, 1986]. The term ``profile'' has stuck. All of these are more or less statistical descriptions of the consensus of a multiple sequence alignment. They use position-specific scores for amino acids (or nucleotides) and position-specific scores for opening and extending an insertion or deletion. Traditional pairwise alignment (for example, BLAST [Altschul et al., 1990], FASTA [Pearson and Lipman, 1988], or the Smith/Waterman algorithm [Smith and Waterman, 1981]), by contrast, uses position-independent scoring parameters. The position-specific scoring of profiles captures important information about the degree of conservation at various positions in the multiple alignment, and the varying degree to which gaps and insertions are permitted.
The advantage of using HMMs is that HMMs have a formal probabilistic basis. We can use Bayesian probability theory to guide how all the probability (scoring) parameters should be set. Though this might sound like a purely academic issue, this probabilistic basis lets us do things that the more heuristic methods cannot do easily. For example, an HMM can be trained from unaligned sequences, if a trusted alignment isn't yet known. Another consequence is that HMMs have a consistent theory behind gap and insertion scores. In most details, HMMs are a slight improvement over a carefully constructed profile - but far less skill and manual intervention is necessary to train a good HMM and use it. This allows us to make libraries of hundreds of profile HMMs and apply them on a very large scale to whole-genome or EST sequence analysis. One such database of protein domain models is Pfam [Sonnhammer et al., 1997]; the construction and use of Pfam are tightly tied to the HMMER software package.
HMMs do have important limitations. One is that HMMs do not capture any higher-order correlations. An HMM assumes that the identity of a particular position is independent of the identity of all other positions. HMMs make poor models of RNAs, for instance, because an HMM cannot describe base pairs. Also, compare protein ``threading'' methods, which include scoring terms for nearby amino acids in a three-dimensional protein structure.
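The consequence of the independence assumption can be seen in a toy example. The sketch below is not a profile HMM and does not use HMMER: it assumes a two-column model with position-specific emission frequencies only, trained on made-up, perfectly complementary RNA base pairs. The names `TRAINING`, `COLUMN_COUNTS` and `independent_prob` are hypothetical.

```python
from collections import Counter

# Made-up training data: two paired RNA positions, always complementary.
TRAINING = ["AU", "UA", "GC", "CG"] * 5

# Position-specific base counts, estimated separately for each column.
COLUMN_COUNTS = [Counter(pair[i] for pair in TRAINING) for i in range(2)]

def independent_prob(pair):
    """Probability of a two-base string when the columns are treated as independent."""
    prob = 1.0
    for col, base in enumerate(pair):
        prob *= COLUMN_COUNTS[col][base] / len(TRAINING)
    return prob

# A valid base pair and an impossible one receive the same probability,
# because the model never looks at the two positions together.
print(independent_prob("AU"), independent_prob("AA"))
```

Every base is equally frequent in each column, so the model cannot tell complementary pairs from mismatches even though the training data contain no mismatches at all.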
A general definition of HMMs and an excellent tutorial introduction to their use has been written by Rabiner [Rabiner, 1989]. Throughout, I will often use ``HMM'' to refer to the specific case of profile HMMs as described by Krogh et al. [Krogh et al., 1994]. This shorthand usage is for convenience only. For a review of profile HMMs, see [Eddy, 1996], and for a complete book on the subject of probabilistic modeling in computational biology, see [Durbin et al., 1998].
Several public-domain sources of information have
been used in preparing this modest tutorial. They are excellent,
are updated on a regular basis and also contain references. Here you can
find the links to the latest versions of the documents. These are: the PROSITE
documentation and the accompanying document PROFILE.TXT;
the document on methods
for constructing profiles from the PROSITE database; the HMMER
User's Guide; and the Motif
search help page at GenomeNet (Japan).
Paulino Gomez Puertas 3-03-01