The following is an excerpt from the chapter "Protein Identification and
Analysis Tools in the ExPASy Server" by Wilkins et al., published in the book
"2-D Proteome Analysis Protocols" (1998), editor A.J. Link, Humana Press,
New Jersey.
The TagIdent tool [reference] serves two main purposes. Firstly, it can create
lists of proteins from one or more organisms that are within a
user-specified pI or Mw range. This is useful to find proteins from the
database that may be in a region of interest on a 2-D gel. Secondly, the
program can identify proteins from 2-D gels by virtue of their estimated pI
and Mw, and a short protein "sequence tag" of up to 6 amino acids. The
sequence tag can be derived from protein N-termini, C-termini, or
internally, and generated by chemical or mass spectrometric sequencing
techniques. As sequence tags are highly specific (e.g. there are 160,000
different combinations of 4 amino acid sequence tags) they represent a form
of protein identification that is useful for organisms that are molecularly
well-defined and have a relatively small number of proteins (e.g. E. coli or
S. cerevisiae). Interestingly, we have shown that C-terminal sequence tags
are more specific than N-terminal tags (Wilkins et al. (4)). However, it
remains technically more difficult to generate high-quality C-terminal
protein sequence data.
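The figure of 160,000 combinations quoted above follows from the 20 standard amino acids raised to the power of the tag length. The short Python sketch below reproduces that arithmetic for tags of 1 to 6 residues. It assumes only the 20 standard residues and no ambiguity codes; the function name is chosen for illustration and is not part of TagIdent.

```python
# The 20 standard amino acids in single-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tag_combinations(length):
    """Return how many different sequence tags of `length` residues exist."""
    return len(AMINO_ACIDS) ** length


if __name__ == "__main__":
    # TagIdent accepts tags of up to 6 residues.
    for n in range(1, 7):
        print(n, tag_combinations(n))
```

For a tag of 4 residues this gives 20 x 20 x 20 x 20 = 160,000.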
Use of TagIdent to list proteins in a defined pI and / or Mw region
To use TagIdent to generate a list of proteins in a pI and Mw range of
interest, proceed as follows:
If desired, a name can be given to the query, which will appear
as the subject of the e-mail message. This is useful for archiving purposes
or if many different queries are to be submitted to the program at the same
time. You should then specify the pI and Mw regions within which you would
like to search (e.g. pI of 5.5 ± 0.5 units and Mw 20000 ± 10%).
If you would
like to search using only one of the pI or Mw parameters, you can specify an
unrestricted window to cover all possibilities for the other parameter. For
example, a search where pI is set to 7.0 ± 7 units but where a Mw window of
20000 ± 10% is used will return all proteins of sizes 18000 to 22000 Mw,
regardless of their pI.
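The window logic described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the TagIdent implementation: the Protein record, the in_window function and the three example proteins are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    name: str   # hypothetical identifier
    pi: float   # theoretical isoelectric point
    mw: float   # theoretical molecular weight


def in_window(protein, pi, pi_delta, mw, mw_percent):
    """True if the protein lies within pi +/- pi_delta and mw +/- mw_percent %."""
    mw_delta = mw * mw_percent / 100.0
    pi_ok = pi - pi_delta <= protein.pi <= pi + pi_delta
    mw_ok = mw - mw_delta <= protein.mw <= mw + mw_delta
    return pi_ok and mw_ok


# Made-up example data.
proteins = [
    Protein("PROT_A", 5.2, 19500.0),
    Protein("PROT_B", 9.8, 21000.0),
    Protein("PROT_C", 5.6, 30000.0),
]

# pI 5.5 +/- 0.5 units and Mw 20000 +/- 10% (i.e. 18000 to 22000).
hits = [p.name for p in proteins if in_window(p, 5.5, 0.5, 20000, 10)]

# pI 7.0 +/- 7 units covers the whole pI scale, so only Mw restricts the search.
unrestricted = [p.name for p in proteins if in_window(p, 7.0, 7.0, 20000, 10)]

print(hits)          # ['PROT_A']
print(unrestricted)  # ['PROT_A', 'PROT_B']
```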
Next, you can specify one or more keywords
matching those in the Swiss-Prot OS (species) or OC (classification) lines
to limit the search to one organism, or a range of organisms. Thus if you
want to investigate proteins exclusively from S. cerevisiae, you can specify
"CEREVISIAE". This is better than specifying "YEAST", a word common to the
classification of many yeasts, which would return not only proteins from S.
cerevisiae but also those from Candida albicans and Schizosaccharomyces
pombe. The same applies for Homo sapiens, where "SAPIENS" will search only
for human proteins whilst "HUMAN" will include proteins from human viruses.
If you would like to investigate proteins from a broader range of species,
it is possible, for example, to specify a classification like "MAMMALIA"
which will return all mammalian proteins within the specified pI and Mw
region. Use of the word "ALL" will search all species in the database.
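The effect of choosing a narrow or a broad organism keyword can be illustrated with a small Python sketch. It assumes that the keyword is compared, without regard to case, against the text of the OS and OC lines; this is an assumption about the matching, and the entry names and line contents below are made up for illustration rather than taken from Swiss-Prot.

```python
# Hypothetical entries: (entry name, OS line text, OC line text).
ENTRIES = [
    ("ENTRY_1", "Saccharomyces cerevisiae (Baker's yeast)",
     "Eukaryota; Fungi; Ascomycota"),
    ("ENTRY_2", "Candida albicans (Yeast)",
     "Eukaryota; Fungi; Ascomycota"),
    ("ENTRY_3", "Homo sapiens (Human)",
     "Eukaryota; Metazoa; Chordata; Mammalia"),
    ("ENTRY_4", "Human immunodeficiency virus 1",
     "Viruses; Retroviridae"),
    ("ENTRY_5", "Mus musculus (Mouse)",
     "Eukaryota; Metazoa; Chordata; Mammalia"),
]


def select_by_organism(entries, keyword):
    """Return names of entries whose OS or OC text contains the keyword."""
    kw = keyword.upper()
    if kw == "ALL":  # "ALL" searches every species
        return [name for name, _, _ in entries]
    return [
        name
        for name, os_line, oc_line in entries
        if kw in os_line.upper() or kw in oc_line.upper()
    ]


print(select_by_organism(ENTRIES, "CEREVISIAE"))  # ['ENTRY_1']
print(select_by_organism(ENTRIES, "YEAST"))       # ['ENTRY_1', 'ENTRY_2']
print(select_by_organism(ENTRIES, "SAPIENS"))     # ['ENTRY_3']
print(select_by_organism(ENTRIES, "HUMAN"))       # ['ENTRY_3', 'ENTRY_4']
print(select_by_organism(ENTRIES, "MAMMALIA"))    # ['ENTRY_3', 'ENTRY_5']
```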
If desired, searches can also be restricted through use of a Swiss-Prot
keyword, such as 'plasmid' or 'AIDS'. A document containing a full list
of all Swiss-Prot keywords is available.
Finally, select the "Start TagIdent" button to submit the request to ExPASy.
Results will be displayed immediately or, for longer jobs where you have
specified an e-mail address, sent to that address within a few minutes. A
typical output is shown in figure 4A.
Use of TagIdent to identify proteins from a 2-D gel
TagIdent can identify proteins by matching sequence tags against proteins
in Swiss-Prot/TrEMBL from one or more species within a specified pI and Mw range.
To use TagIdent for identification purposes, first specify the pI and Mw of
the protein of interest as estimated from the 2-D gel. Then specify error
margins that reflect the known accuracy of these estimates. See Wilkins et
al. (1) for an example of how pI and Mw ranges can be defined. Next, specify
the species in the database to match against, select the "Tagging" option by
clicking in the small box, and enter the sequence tag in single-letter amino
acid code in the "Tag" text box. Note that the sequence tag can contain one
or more 'X' characters where an amino acid is unknown.
You should then specify the source of your protein sequence (N-,
C-terminal or internal), such that the program can show the protein area of
interest in the output. Thus, for example, if you have generated an
N-terminal protein sequence tag by Edman degradation, you should request
the program to show predicted protein N-termini. Finally, submit the search
to the ExPASy server by selecting the "Start TagIdent" button.
Results will be displayed immediately or, for longer jobs where you have
specified an e-mail address, sent to that address within a few minutes.
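How a tag containing 'X' can be matched against a protein sequence is sketched below in Python. This is a minimal illustration of wildcard matching, not the TagIdent code: the function name find_tag and the example sequence are hypothetical, and the sketch simply reports every position at which the tag occurs.

```python
import re


def find_tag(sequence, tag):
    """Return 1-based start positions where `tag` occurs in `sequence`.

    An 'X' in the tag matches any single residue.
    """
    pattern = "".join(
        "." if residue == "X" else re.escape(residue)
        for residue in tag.upper()
    )
    # A lookahead is used so that overlapping matches are also reported.
    return [
        match.start() + 1
        for match in re.finditer("(?=" + pattern + ")", sequence.upper())
    ]


sequence = "MKTAYIAKQR"  # made-up sequence for illustration

print(find_tag(sequence, "TAYI"))  # [3]
print(find_tag(sequence, "AXI"))   # [4]
print(find_tag(sequence, "KXX"))   # [2, 8]
```

A position near the start or end of the sequence would correspond to an N- or C-terminal tag respectively; candidates would first be narrowed by species and by the pI and Mw window.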
Interpretation of TagIdent results for protein identification
Accurate identification of proteins with sequence tags relies on all
proteins from an organism being in sequence databases. When this is the case
and only one protein within a given pI and Mw range is found to contain a
certain N-, C-terminal or internal sequence tag, one can be confident that
there is no other, as yet undescribed, protein that could otherwise match the
tag. In fully sequenced organisms the procedure is thus self-checking.
Because of this, the TagIdent approach is very useful for organisms such as
Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii,
Escherichia coli, and even the eukaryote Saccharomyces cerevisiae, whose
genomes are known. A TagIdent output for a protein from E. coli is shown in
figure 4B, and illustrates the specificity of the approach. Caution is
advised when using TagIdent to identify proteins from organisms that are
poorly defined at the molecular level, or that contain large numbers of
proteins (e.g. human). A four amino acid sequence tag (one of 160,000
possible combinations) can be unique in microorganisms that have a total
protein count of 500 to 6000, but is less useful in organisms such as humans,
which have 50,000 to 100,000 different proteins. If protein identification results with
TagIdent show more than one protein carrying the sequence tag in the
expected region, the same sequence tag, pI and Mw data can be used in
conjunction with protein AA composition for identification with the
AACompIdent tool.
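The reason a four-residue tag is more discriminating in a small proteome than in a large one can be shown with a rough calculation. The Python sketch below estimates how many proteins would carry a given tag at one fixed position (for example the N-terminus) purely by chance. It is a crude estimate under simplifying assumptions that are ours and not part of TagIdent: all 20 amino acids are taken to be equally frequent, and the pI and Mw window, which narrows the candidates further, is ignored.

```python
def expected_chance_matches(n_proteins, tag_length=4, alphabet_size=20):
    """Expected number of proteins carrying a given tag at one fixed position
    by chance, assuming equally frequent and independent residues."""
    return n_proteins / alphabet_size ** tag_length


if __name__ == "__main__":
    # Protein counts quoted in the text for microorganisms and for human.
    for count in (500, 6000, 50000, 100000):
        print(count, expected_chance_matches(count))
```

Under these assumptions the expected number of chance matches stays well below one for 500 to 6000 proteins, but approaches one for 100,000 proteins.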
Comments
- Protein pI and Mw in TagIdent are calculated as described for Compute pI / Mw.
- Care must be taken in the use of pI and Mw estimates from 2-D gels as
part of protein identification strategies. Windows around these estimates
that are too narrow can exclude the correct identification from the list of
candidate identifications. As a general rule, we use windows of pI ± 0.5
units for proteins from bacteria and yeast, and pI ± 1.0 units for mammalian
proteins. We generally use a Mw window of ± 20%, but for proteins larger
than 60000 Mw a window of ± 10% is sufficient. If proteins are thought to be
highly post-translationally modified, very large pI and/or Mw windows may be
needed.
- TagIdent is extremely useful for searching proteins in the database for
the presence of sequence tags, as it can search in a species-specific manner
and with pI and Mw parameters. It thus avoids the drawbacks of the widely
used FASTA (2) and BLAST (3) programs which are global searching tools that
either cannot search with small sequence tags or return lists containing
many irrelevant proteins.
- If you specify parameters that generate an extremely large TagIdent
output (>1 megabyte), only the first 1000 lines will be sent by e-mail.
This is to avoid problems that can arise when large messages arrive at some
e-mail sites.
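The rule of thumb for window sizes given in the comments above can be written as a small helper. The Python sketch below encodes only the values stated there; the function name and the organism group labels are hypothetical, and highly modified proteins may need much wider windows than it returns.

```python
def default_windows(organism_group, mw):
    """Return (pI window in units, Mw window in percent) for a gel estimate.

    organism_group: "bacteria", "yeast" or "mammal" (labels chosen for
    this sketch).
    mw: estimated molecular weight from the 2-D gel.
    """
    if organism_group in ("bacteria", "yeast"):
        pi_delta = 0.5
    elif organism_group == "mammal":
        pi_delta = 1.0
    else:
        raise ValueError("unknown organism group: " + organism_group)
    # +/- 20% in general, +/- 10% for proteins larger than 60000 Mw.
    mw_percent = 10.0 if mw > 60000 else 20.0
    return pi_delta, mw_percent


print(default_windows("yeast", 20000))   # (0.5, 20.0)
print(default_windows("mammal", 75000))  # (1.0, 10.0)
```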
References:
1. Wilkins, M.R., Pasquali, C., Appel, R.D., Ou, K., Golaz, O., Sanchez,
J.-C., Yan, J.X., Gooley, A.A., Hughes, G., Humphery-Smith, I., Williams,
K.L. and Hochstrasser, D.F. (1996) From Proteins to Proteomes: Large scale
protein identification by two-dimensional electrophoresis and amino acid
analysis. Bio/Technology 14, 61-65.
2. Pearson, W.R. (1991) Searching protein sequence libraries: comparison of
the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms.
Genomics 11, 635-650.
3. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J.
(1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
4. Wilkins, M.R., Gasteiger, E., Appel, R.D. and Hochstrasser, D.F. (1998)
Protein identification with N and C-terminal sequence tags in proteome
projects. J. Mol. Biol. 278, 599-608.
Last modified 27/Jan/2003 by CHH