ScanProsite tool manual

1. Pattern syntax rules

1.1. Pattern syntax used in the PROSITE database:

The standard IUPAC one-letter codes for the amino acids are used.
The symbol `x' is used for a position where any amino acid is accepted.
Ambiguities are indicated by listing the acceptable amino acids for a given position, between square brackets `[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
Each element in a pattern is separated from its neighbor by a `-'.
Repetition of an element of the pattern can be indicated by following that element with a numerical value or, if it is a gap ('x'), by a numerical range between parentheses.

Examples:
x(3) corresponds to x-x-x
x(2,4) corresponds to x-x or x-x-x or x-x-x-x
A(3) corresponds to A-A-A
Note: You can only use a range with 'x', i.e. A(2,4) is not a valid pattern element.
When a pattern is restricted to either the N- or C-terminus of a sequence, the pattern starts with a `<' symbol or ends with a `>' symbol, respectively. In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' is accepted.

1.2. Extended syntax allowed in the ScanProsite tool:

If your pattern consists of one-letter amino acid codes only, without any ambiguous residues, you need not specify the '-', i.e. you can directly copy/paste peptide sequences into the text field.
Example: M-A-S-K-E can be written as MASKE.
To search for all sequences that do not contain a certain amino acid, e.g. Cys, you can use <{C}*>.

1.3. Examples

[AC]-x-V-x(4)-{ED}
This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

< A-x-[ST](2)-x(0,1)-V
This pattern, which must be at the N-terminus of the sequence (`<'), is translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val

<{C}*>
This pattern describes all sequences which do not contain any cysteine.

IIRIFHLRNI
This pattern describes all sequences which contain the subsequence 'IIRIFHLRNI'.
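
The syntax rules of section 1 map fairly directly onto regular expressions. The following Python sketch illustrates one possible translation; it is not the ScanProsite implementation. The function name prosite_to_regex and the tokenizer _TOKEN are hypothetical, and the sketch assumes upper-case one-letter residue codes, a lower-case 'x' for gaps, and '>' as the only anchor allowed inside square brackets.

```python
import re

# Hypothetical tokenizer: one pattern element with an optional repetition,
# or an anchor ('<' / '>'), or the '-' separator.
_TOKEN = re.compile(
    r"(?P<elem>x|[A-Z]|\[[A-Z>]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\)|(?P<star>\*))?"
    r"|(?P<anchor>[<>])|(?P<sep>-)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    text = "".join(pattern.split())  # whitespace is not significant here
    pos, parts = 0, []
    while pos < len(text):
        m = _TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        if m.group("sep"):
            continue  # '-' only separates elements
        if m.group("anchor"):
            parts.append("^" if m.group("anchor") == "<" else "$")
            continue
        elem = m.group("elem")
        if elem == "x":
            piece = "."  # any amino acid
        elif elem.startswith("{"):
            piece = "[^" + elem[1:-1] + "]"  # excluded residues
        elif elem.startswith("[") and ">" in elem:
            # e.g. [G>]: either one of the residues or the end of the sequence
            residues = elem[1:-1].replace(">", "")
            piece = "(?:[" + residues + "]|$)" if residues else "$"
        else:
            piece = elem  # a single residue or a [..] class is already valid
        if m.group("hi") is not None:
            if elem != "x":
                raise ValueError("a range is only allowed after 'x'")
            piece += "{%s,%s}" % (m.group("lo"), m.group("hi"))
        elif m.group("lo") is not None:
            piece += "{%s}" % m.group("lo")
        elif m.group("star"):
            piece += "*"  # extended ScanProsite syntax, as in <{C}*>
        parts.append(piece)
    return "".join(parts)


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}"))      # [AC].V.{4}[^ED]
print(prosite_to_regex("< A-x-[ST](2)-x(0,1)-V"))  # ^A.[ST]{2}.{0,1}V
print(prosite_to_regex("<{C}*>"))                  # ^[^C]*$
```

The resulting string can be passed to re.search to test a sequence. Because elements without ambiguities translate to themselves, 'M-A-S-K-E' and 'MASKE' produce the same expression.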

2. Pattern matching mode

Three parameters allow fine-tuning of the behaviour of the pattern-matching engine:

greed: extend variable-length pattern elements as far as possible
overlap: allow partially overlapping matches
include: allow matches included within one another (implies overlap)

By default, matching is greedy and allows overlapping matches but not included ones. This means that, of two overlapping matches, the one that is entirely contained within the other is rejected.

For example, consider the sequence 'ABACADAEAFA' and the simple pattern 'A-x(1,3)-A'. The six valid combinations of the switches (include=1 requires overlap=1) produce the following results:

greed=1, overlap=1, include=0 (default) : 4 matches

  ABACADAEAFA
  ooooo......
  ..ooooo....
  ....ooooo..
  ......ooooo

greed=1, overlap=1, include=1 : 5 matches

  ABACADAEAFA
  ooooo......
  ..ooooo....
  ....ooooo..
  ......ooooo
  ........ooo

greed=1, overlap=0 : 2 matches

  ABACADAEAFA
  ooooo......
  ......ooooo

greed=0, overlap=1, include=0 or 1 : 5 matches

  ABACADAEAFA
  ooo........
  ..ooo......
  ....ooo....
  ......ooo..
  ........ooo

greed=0, overlap=0 : 3 matches

  ABACADAEAFA
  ooo........
  ....ooo....
  ........ooo
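
The counts above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions, not the engine's actual algorithm: the function find_matches is hypothetical, greed is modelled by choosing a greedy or a lazy regular expression for 'A-x(1,3)-A', candidate start positions are tried from left to right, and a match counts as "included" when it ends no later than a previously accepted match.

```python
import re

GREEDY = re.compile(r"A.{1,3}A")   # models greed=1
LAZY = re.compile(r"A.{1,3}?A")    # models greed=0


def find_matches(sequence, regex, overlap=True, include=False):
    """Return (start, end) spans, end exclusive, under the given switches."""
    if include:
        overlap = True  # include implies overlap
    matches = []
    furthest_end = 0  # right-most end among accepted matches
    start = 0
    while start < len(sequence):
        m = regex.match(sequence, start)  # anchored at 'start'
        if m is None:
            start += 1
            continue
        # Starts only increase, so a new match lies inside an earlier one
        # exactly when it does not reach past the furthest accepted end.
        contained = m.end() <= furthest_end
        if include or not contained:
            matches.append((m.start(), m.end()))
            furthest_end = max(furthest_end, m.end())
        # Without overlap, resume after the match; otherwise move one residue.
        start = start + 1 if overlap else max(m.end(), start + 1)
    return matches


seq = "ABACADAEAFA"
print(len(find_matches(seq, GREEDY)))                 # 4
print(len(find_matches(seq, GREEDY, include=True)))   # 5
print(len(find_matches(seq, GREEDY, overlap=False)))  # 2
print(len(find_matches(seq, LAZY)))                   # 5
print(len(find_matches(seq, LAZY, overlap=False)))    # 3
```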

3. Database randomization

It is often useful to search a pattern against a random database in order to evaluate its specificity. The database should not be completely random, but comparable to the databases to be scanned in terms of amino acid frequency and local compositional bias. ScanProsite can randomize the scanned databases on the fly, using one of two methods:

reverse sequences - randomize by taking the reverse sequence of each individual entry
shuffle - randomize by local shuffling of the residues in windows of 20 residues

The reverse sequences method is generally recommended, but it is not suitable for patterns which are strongly enriched in one amino acid (e.g. C-C-C-[LIV]) or which are palindromic (e.g. M-L-L-M). Sample randomized databases and the scripts used to generate them are available at ftp://ftp.isrec.isb-sib.ch/pub/databases/shuffled/.
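
As a rough illustration of the two methods, here is a minimal Python sketch. It is not the script distributed at the address above; the function names are hypothetical, and the shuffle assumes consecutive, non-overlapping windows, which the description does not specify.

```python
import random


def reverse_sequence(seq):
    """Randomize by taking the reverse of the sequence."""
    return seq[::-1]


def shuffle_locally(seq, window=20, rng=None):
    """Shuffle residues within consecutive windows (assumed non-overlapping).

    Each window keeps its own composition, so local compositional bias
    is preserved while the residue order is destroyed.
    """
    rng = rng or random.Random()
    chunks = []
    for start in range(0, len(seq), window):
        residues = list(seq[start:start + window])
        rng.shuffle(residues)
        chunks.append("".join(residues))
    return "".join(chunks)


print(reverse_sequence("MASKE"))  # EKSAM
print(reverse_sequence("MLLM"))   # MLLM: a palindrome is unchanged
```

The second call shows why reversal is unsuitable for palindromic patterns: a reversed database contains such a match wherever the original does.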

Last modified 12/Apr/2002 by AGA
