Protein Identification and Analysis Tools in the ExPASy Server

TagIdent Tool

The following is an excerpt from the chapter
Protein Identification and Analysis Tools in the ExPASy Server by Wilkins et al.

published in the book
2-D Proteome Analysis Protocols (1998). Editor A.J. Link. Humana Press, New Jersey.

The TagIdent tool [reference] serves two main purposes. First, it can list the proteins from one or more organisms that fall within a user-specified pI or Mw range, which is useful for finding database proteins that may lie in a region of interest on a 2-D gel. Second, it can identify proteins from 2-D gels by their estimated pI and Mw together with a short protein "sequence tag" of up to 6 amino acids. The sequence tag can come from the protein N-terminus, the C-terminus, or an internal region, and can be generated by chemical or mass spectrometric sequencing techniques. Sequence tags are highly specific (e.g. a 4 amino acid tag has 20^4 = 160,000 possible sequences), so they are a useful means of protein identification for organisms that are molecularly well-defined and have a relatively small number of proteins (e.g. E. coli or S. cerevisiae). Interestingly, we have shown that C-terminal sequence tags are more specific than N-terminal tags (Wilkins et al. (4)); however, it remains technically more difficult to generate high-quality C-terminal protein sequence data.

Use of TagIdent to list proteins in a defined pI and / or Mw region

To use TagIdent to generate a list of proteins in a pI and Mw range of interest, proceed as follows. If desired, give the query a name, which will appear as the subject of the e-mail message. This is useful for archiving purposes or if many different queries are to be submitted to the program at the same time. Then specify the pI and Mw regions within which you would like to search (e.g. pI of 5.5 ± 0.5 units and Mw 20000 ± 10%).

If you would like to search using only one of the pI or Mw parameters, you can specify an unrestricted window that covers all possibilities for the other parameter. For example, a search where pI is set to 7.0 ± 7 units (i.e. pI 0 to 14) but where a Mw window of 20000 ± 10% is used will return all proteins with a Mw of 18000 to 22000, regardless of their pI.
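The window arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration, not TagIdent's actual code: the `Protein` records and their names are made up, and treating the window bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    name: str   # hypothetical identifier, not a real Swiss-Prot entry
    pi: float   # isoelectric point
    mw: float   # molecular weight


def pi_window(pi, units):
    """Absolute window: pI plus or minus a number of pH units."""
    return pi - units, pi + units


def mw_window(mw, percent):
    """Relative window: Mw plus or minus a percentage of Mw."""
    delta = mw * percent / 100.0
    return mw - delta, mw + delta


def in_window(value, window):
    low, high = window
    return low <= value <= high  # inclusive bounds are an assumption


def list_proteins(proteins, pi, pi_units, mw, mw_percent):
    """Keep the proteins that fall inside both the pI and the Mw window."""
    pw = pi_window(pi, pi_units)
    mwin = mw_window(mw, mw_percent)
    return [p for p in proteins if in_window(p.pi, pw) and in_window(p.mw, mwin)]


proteins = [
    Protein("PROT_A", 4.2, 18500.0),
    Protein("PROT_B", 9.8, 21900.0),
    Protein("PROT_C", 6.1, 35000.0),
]

# pI 7.0 +/- 7 spans the whole pI scale, so only the Mw window restricts the list.
hits = list_proteins(proteins, pi=7.0, pi_units=7.0, mw=20000.0, mw_percent=10.0)
print(mw_window(20000.0, 10.0))
print([p.name for p in hits])
```

Here PROT_A and PROT_B are kept despite their very different pI values, while PROT_C is excluded on Mw alone.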

Next, you can specify one or more keywords matching those in the Swiss-Prot OS (species) or OC (classification) lines to limit the search to one organism or a range of organisms. Thus, if you want to investigate proteins exclusively from S. cerevisiae, you can specify "CEREVISIAE". This is better than specifying "YEAST", a word common to many yeasts in the database, which matches not only proteins from S. cerevisiae but also those from Candida albicans and Schizosaccharomyces pombe. The same applies to Homo sapiens, where "SAPIENS" will search only human proteins whilst "HUMAN" will also include proteins from human viruses. To investigate proteins from a broader range of species, you can specify a classification such as "MAMMALIA", which will return all mammalian proteins within the specified pI and Mw region. Use of the word "ALL" will search all species in the database.

If desired, searches can also be restricted through use of a Swiss-Prot keyword, such as 'plasmid' or 'AIDS'. A document containing a full list of all Swiss-Prot keywords is available. Finally, select the "Start TagIdent" button to submit the request to ExPASy. Results are displayed immediately or, if you supplied an e-mail address (for longer jobs), sent to that address within a few minutes. A typical output is shown in figure 4A.
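The difference between a narrow and a broad organism keyword can be pictured with a small Python sketch. It assumes the keyword is matched as a case-insensitive substring of the OS and OC text, which is a guess at the behaviour rather than a description of TagIdent itself. The records are abbreviated toy entries written for this example, and the `select` helper is hypothetical.

```python
# Toy records: abbreviated, illustrative OS/OC text, not copied from Swiss-Prot.
ENTRIES = [
    {"id": "ENTRY_1", "os": "Saccharomyces cerevisiae (Baker's yeast)", "oc": "Eukaryota; Fungi"},
    {"id": "ENTRY_2", "os": "Candida albicans (Yeast)", "oc": "Eukaryota; Fungi"},
    {"id": "ENTRY_3", "os": "Schizosaccharomyces pombe (Fission yeast)", "oc": "Eukaryota; Fungi"},
    {"id": "ENTRY_4", "os": "Homo sapiens (Human)", "oc": "Eukaryota; Metazoa; Mammalia"},
    {"id": "ENTRY_5", "os": "Human immunodeficiency virus type 1", "oc": "Viruses"},
]


def matches_organism(entry, keyword):
    """True if the keyword occurs in the OS or OC text (assumed substring match)."""
    kw = keyword.upper()
    if kw == "ALL":  # "ALL" selects every species
        return True
    return kw in entry["os"].upper() or kw in entry["oc"].upper()


def select(keyword, entries=ENTRIES):
    """Return the ids of the entries selected by one organism keyword."""
    return [e["id"] for e in entries if matches_organism(e, keyword)]


for keyword in ("CEREVISIAE", "YEAST", "SAPIENS", "HUMAN", "MAMMALIA", "ALL"):
    print(keyword, select(keyword))
```

In this toy set "CEREVISIAE" selects one entry where "YEAST" selects three, and "HUMAN" also picks up the viral entry that "SAPIENS" leaves out.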

Use of TagIdent to identify proteins from a 2-D gel

TagIdent can identify proteins by matching sequence tags against proteins in Swiss-Prot/TrEMBL from one or more species within a specified pI and Mw range. To use TagIdent for identification purposes, first specify the pI and Mw of the protein of interest as estimated from the 2-D gel. Then specify error margins that reflect the known accuracy of these estimates. See Wilkins et al. (1) for an example of how pI and Mw ranges can be defined. Next, specify the species in the database to match against, select the "Tagging" option by clicking in the small box, and enter the sequence tag in single-letter amino acid code in the "Tag" text box. Note that the sequence tag can contain one or more 'X' characters where an amino acid is unknown. You should also specify the source of your protein sequence (N-terminal, C-terminal or internal), so that the program can show the protein area of interest in the output. For example, if you have generated an N-terminal protein sequence tag by Edman degradation, you should request the program to show predicted protein N-termini. Finally, submit the search to the ExPASy server by selecting the "Start TagIdent" button. Results are displayed immediately or, if you supplied an e-mail address (for longer jobs), sent to that address within a few minutes.
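Tag matching with 'X' as a wildcard can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the example sequence is invented, the tag is searched at every position of the sequence, and any processing of the protein termini (for example signal peptide removal) is ignored.

```python
MAX_TAG_LENGTH = 6  # the text above describes tags of up to 6 amino acids


def find_tag(sequence, tag):
    """Return 0-based start positions where the tag matches the sequence.

    An 'X' in the tag matches any residue at that position.
    """
    sequence, tag = sequence.upper(), tag.upper()
    if not 0 < len(tag) <= MAX_TAG_LENGTH:
        raise ValueError("tag must contain 1 to 6 residues")
    positions = []
    for start in range(len(sequence) - len(tag) + 1):
        window = sequence[start:start + len(tag)]
        if all(t == "X" or t == s for t, s in zip(tag, window)):
            positions.append(start)
    return positions


def terminal_region(sequence, source, length):
    """Return the stretch of sequence to display for an 'N' or 'C' terminal tag."""
    if source == "N":
        return sequence[:length]
    if source == "C":
        return sequence[-length:]
    raise ValueError("source must be 'N' or 'C'")


seq = "MKTAYIAKQR"  # invented sequence, for illustration only

print(find_tag(seq, "MKTA"))   # tag located at the N-terminus
print(find_tag(seq, "KXAY"))   # internal tag with one unknown residue
print(find_tag(seq, "AKQR"))   # tag located at the C-terminus
print(terminal_region(seq, "C", 4))
```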

Interpretation of TagIdent results for protein identification

Accurate identification of proteins with sequence tags relies on all proteins from an organism being present in the sequence databases. When this is the case, if only one protein within a given pI and Mw range is found to contain a certain N-terminal, C-terminal or internal sequence tag, one can be confident that there is no other, as yet undescribed, protein that could also match the tag. In fully sequenced organisms the procedure is thus self-checking. Because of this, the TagIdent approach is very useful for organisms such as Haemophilus influenzae, Mycoplasma genitalium, Methanococcus jannaschii, Escherichia coli, and even the eukaryote Saccharomyces cerevisiae, whose genomes are known. A TagIdent output for a protein from E. coli is shown in figure 4B, and illustrates the specificity of the approach. Caution is advised when using TagIdent to identify proteins from organisms that are poorly defined at the molecular level, or from organisms that contain large numbers of proteins (e.g. human). A four amino acid sequence tag (one of 160,000 possible combinations) can be unique in microorganisms that have a total protein count of 500 to 6000, but is less useful in, for example, humans, which have 50,000 to 100,000 different proteins. If TagIdent results show more than one protein carrying the sequence tag in the expected region, the same sequence tag, pI and Mw data can be used in conjunction with protein amino acid composition for identification with the AACompIdent tool.
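A back-of-the-envelope calculation shows why the same four-residue tag is far more discriminating in a small proteome. The sketch below is a rough estimate only. It assumes that all 20 amino acids are equally likely and independent at each position, and that the tag is compared at a single fixed position per protein (as for a terminal tag). Real sequences do not satisfy either assumption exactly, and the pI and Mw windows reduce the candidate set further.

```python
def tag_combinations(tag_length, alphabet_size=20):
    """Number of distinct tags of a given length over the amino acid alphabet."""
    return alphabet_size ** tag_length


def expected_chance_matches(n_proteins, tag_length):
    """Expected number of proteins matching a terminal tag purely by chance,
    under the simplifying uniform-frequency assumption described above."""
    return n_proteins / tag_combinations(tag_length)


print(tag_combinations(4))
for n_proteins in (500, 6000, 50000, 100000):
    print(n_proteins, expected_chance_matches(n_proteins, 4))
```

Under these assumptions a proteome of 6000 proteins gives well under one expected chance match, whereas a proteome of 100,000 proteins gives a value approaching one, which is why additional data such as amino acid composition become valuable.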

Comments

Protein pI and Mw in TagIdent are calculated as described for Compute pI / Mw.
Care must be taken in the use of pI and Mw estimates from 2-D gels as part of protein identification strategies. Windows around these estimates that are too narrow can exclude the correct identification from the list of candidate identifications. As a general rule, we use windows of pI ± 0.5 units for proteins from bacteria and yeast, and pI ± 1.0 units for mammalian proteins. We generally use a Mw window of ± 20%, but for proteins larger than 60000 Mw a window of ± 10% is sufficient. If proteins are thought to be highly post-translationally modified, very large pI and/or Mw windows may be needed.
TagIdent is extremely useful for searching proteins in the database for the presence of sequence tags, as it can search in a species-specific manner and with pI and Mw parameters. It thus avoids the drawbacks of the widely used FASTA (2) and BLAST (3) programs, which are global searching tools that either cannot search with small sequence tags or return lists containing many irrelevant proteins.
If you specify parameters that generate an extremely large TagIdent output (>1 megabyte), only the first 1000 lines will be sent by e-mail. This is to avoid problems that can arise when large messages arrive at some e-mail sites.
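The rule of thumb for window sizes given in these comments can be written down as a small helper. This is only a sketch of the stated guideline: the function name and the organism group labels are hypothetical, and the guideline does not cover other organism groups or heavily modified proteins, for which wider windows may be needed.

```python
def default_windows(organism_group, mw):
    """Return (pI window in units, Mw window in percent) per the rule of thumb.

    organism_group: "bacteria", "yeast" or "mammal" (labels chosen for this sketch).
    mw: estimated molecular weight from the 2-D gel.
    """
    if organism_group in ("bacteria", "yeast"):
        pi_units = 0.5
    elif organism_group == "mammal":
        pi_units = 1.0
    else:
        raise ValueError("no rule of thumb given for this organism group")

    # +/- 20% in general, +/- 10% is sufficient for proteins larger than 60000 Mw
    mw_percent = 10.0 if mw > 60000 else 20.0
    return pi_units, mw_percent


print(default_windows("bacteria", 30000))
print(default_windows("mammal", 80000))
```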

References:

1. Wilkins, M.R., Pasquali, C., Appel, R.D., Ou, K., Golaz, O., Sanchez, J.-C., Yan, J.X., Gooley, A.A., Hughes, G., Humphery-Smith, I., Williams, K.L. and Hochstrasser, D.F. (1996) From Proteins to Proteomes: Large scale protein identification by two-dimensional electrophoresis and amino acid analysis. Bio/Technology 14, 61-65.
2. Pearson, W.R. (1991) Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 11, 635-650.
3. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-410.
4. Wilkins, M.R., Gasteiger, E., Appel, R.D. and Hochstrasser, D.F. (1998) Protein Identification with N and C-terminal Sequence Tags in Proteome Projects. J. Mol. Biol. 278, 599-608.

Last modified 27/Jan/2003 by CHH