Swiss-Prot Release notes 41

Swiss-Prot Protein Knowledgebase
Release Notes

Release 41, February 2003

Table of contents

1   Introduction
2   Description of the changes made to Swiss-Prot since release 40
3   Forthcoming changes
4   Status of the documentation files
5   The ExPASy World-Wide Web server
6   TrEMBL - a supplement to Swiss-Prot
7   FTP access to Swiss-Prot and TrEMBL
8   ENZYME and PROSITE
9   We need your help!
      Appendix A
      Appendix B

1 Introduction

Release 41.0 of Swiss-Prot contains 122'564 sequence entries, comprising 44'986'459 amino acids abstracted from 103'486 references. This represents an increase of 20% over release 40.0. The growth of the database is summarized below.

Release	Date	Number of entries	Number of amino acids
2.0	09/86	3'939	900'163
3.0	11/86	4'160	969'641
4.0	04/87	4'387	1'036'010
5.0	09/87	5'205	1'327'683
6.0	01/88	6'102	1'653'982
7.0	04/88	6'821	1'885'771
8.0	08/88	7'724	2'224'465
9.0	11/88	8'702	2'498'140
10.0	03/89	10'008	2'952'613
11.0	07/89	10'856	3'265'966
12.0	10/89	12'305	3'797'482
13.0	01/90	13'837	4'347'336
14.0	04/90	15'409	4'914'264
15.0	08/90	16'941	5'486'399
16.0	11/90	18'364	5'986'949
17.0	02/91	20'024	6'524'504
18.0	05/91	20'772	6'792'034
19.0	08/91	21'795	7'173'785
20.0	11/91	22'654	7'500'130
21.0	03/92	23'742	7'866'596
22.0	05/92	25'044	8'375'696
23.0	08/92	26'706	9'011'391
24.0	12/92	28'154	9'545'427
25.0	04/93	29'955	10'214'020
26.0	07/93	31'808	10'875'091
27.0	10/93	33'329	11'484'420
28.0	02/94	36'000	12'496'420
29.0	06/94	38'303	13'464'008
30.0	10/94	40'292	14'147'368
31.0	02/95	43'470	15'335'248
32.0	11/95	49'340	17'385'503
33.0	02/96	52'205	18'531'384
34.0	10/96	59'021	21'210'389
35.0	11/97	69'113	25'083'768
36.0	07/98	74'019	26'840'295
37.0	12/98	77'977	28'268'293
38.0	07/99	80'000	29'085'965
39.0	05/00	86'593	31'411'114
40.0	10/01	101'602	37'315'215
41.0	02/03	122'564	44'986'459
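
The growth figure quoted above can be checked from the last two rows of the table. The following is a minimal sketch; the variable and function names are ours and are not part of any Swiss-Prot tool.

```python
# Entry and amino-acid counts for the two most recent releases,
# copied from the last two rows of the table above.
releases = {
    "40.0": (101_602, 37_315_215),
    "41.0": (122_564, 44_986_459),
}


def percent_increase(old: int, new: int) -> float:
    """Return the relative growth from old to new, in percent."""
    return (new - old) / old * 100.0


entries_growth = percent_increase(releases["40.0"][0], releases["41.0"][0])
residues_growth = percent_increase(releases["40.0"][1], releases["41.0"][1])
print(f"entries: {entries_growth:.1f}%, amino acids: {residues_growth:.1f}%")
```

Both values should come out slightly above 20%, which is consistent with the rounded figure given in the introduction.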

2 Description of the changes made to Swiss-Prot since release 40

: 2.1 Sequences and annotations

21'133 sequences have been added since release 40, the sequence data of 3'251 existing entries has been updated and the annotations of 57'525 entries have been revised.

: 2.2 The HPI project

The Human Proteomics Initiative (HPI) is a major effort to annotate all known human sequences according to the quality standards of Swiss-Prot. This means that, for each known protein, a wealth of information is provided, including the description of its function, domain structure, subcellular location, post-translational modifications (PTMs), variants and similarities to other proteins. This implies not only the annotation of newly detected proteins, but also the integration of new research data into existing entries by specialized biologists who are in close contact with experts all over the world.

There are currently 9'172 annotated human sequences in Swiss-Prot. Up-to-date detailed statistics concerning the HPI project are available at:

http://www.expasy.org/sprot/hpi/hpi_stat.html

At the same time, two further efforts have been stepped up: the description of human diseases associated with deficiencies in a protein, and the annotation of mammalian orthologs of human proteins at a level equivalent to that of the cognate human sequences.

For all aspects of the HPI project, we would appreciate the help and collaboration of the scientific community. Information concerning the human proteome is highly critical to a large section of the life science community. We therefore appeal to the user community to fully participate in this initiative by providing all the necessary information to define and to speed up the comprehensive annotation of the human proteome.

For a detailed description of the HPI project please consult:

http://www.expasy.org/sprot/hpi/

: 2.3 The HAMAP project

The first complete microbial genome sequence was that of the bacterium Haemophilus influenzae, which became available in 1995. Since then, more than 100 bacterial and archaeal genomes have been sequenced and many more sequencing projects of pathogenic and nonpathogenic microbes are in progress. To date, the publicly available microbial genomes encode more than 230'000 different proteins.

In order to handle the large amount of "raw" data coming from microbial genome sequencing, the High quality Automated Microbial Annotation of Proteomes (HAMAP) project was initiated. The project aims to automatically annotate a significant percentage of the protein sequences that originate from microbial genome sequencing projects.

To maintain a high level of annotation quality, specific tools are being developed to deal with two completely separate subsets of bacterial and archaeal proteins: proteins that have no recognizable similarity to any other microbial or non-microbial proteins ("ORFans") and proteins that are part of well-defined families or subfamilies. This is done by using a rule system that describes the level and extent of annotation that can be assigned by similarity to a manually annotated prototype entry. The result is a curated entry whose quality is identical to that produced manually by an expert annotator.

Programs under development are designed to recognize protein peculiarities, and only proteins which match the defined criteria are processed automatically. Protein sequences which fail to fit into the rule system are further analyzed by Swiss-Prot expert annotators.

For a detailed description of the HAMAP project and its current status please consult:

http://www.expasy.org/sprot/hamap/

and:

Gattiker A., Michoud K., Rivoire C., Auchincloss A.H., Coudert E., Lima T., Kersey P., Pagni M., Sigrist C.J.A., Lachaize C., Veuthey A.-L., Gasteiger E., Bairoch A.
Automatic annotation of microbial proteomes in Swiss-Prot.
Comput. Biol. Chem. 27:49-58(2003).

: 2.4 What's happening with the model organisms?

We have selected a number of organisms that are the target of genome sequencing and/or mapping projects and for which we intend to:

- be as complete as possible. All sequences available at a given time should be immediately included in Swiss-Prot. This also includes sequence corrections and updates;
- provide a higher level of annotation;
- provide cross-references to specialized database(s) that contain, among other data, some information about the genes that code for these proteins;
- provide specific indexes and documents.

From our efforts to annotate human sequence entries as completely as possible arose the HPI project (see 2.2), and the bacterial model organisms became the focus of the HAMAP project (see 2.3). Here is the current status of the model organisms which are not covered by these two projects:

Organism	Database cross-references	Index file	Number of sequences
A.thaliana	None yet	arath.txt	1'952
C.albicans	None yet	calbican.txt	264
C.elegans	Wormpep	celegans.txt	2'291
D.discoideum	DictyDB	dicty.txt	316
D.melanogaster	FlyBase	fly.txt	1'764
M.musculus	MGD	mgdtosp.txt	6'169
S.cerevisiae	SGD	yeast.txt	4'892
S.pombe	GeneDB_SPombe	pombe.txt	2'116

: 2.5 'Nucleomorph' added to the OrGanelle (OG) line

The OG (OrGanelle) line indicates from which genome a gene for a protein originates. Until now, the defined terms in the OG line were 'Chloroplast', 'Cyanelle', 'Mitochondrion' and 'Plasmid'. The term 'Nucleomorph' has been added; a nucleomorph is the residual nucleus of an algal endosymbiont that resides inside its host cell.

: 2.6 Progress in the conversion of Swiss-Prot to mixed-case characters

We are gradually converting Swiss-Prot entries from all 'UPPER CASE' to 'MiXeD CaSe'. With this release the RC (Reference Comment) line topic STRAIN and the CC line topic 'CATALYTIC ACTIVITY' have been converted.

As described in section 3.2, the process of converting all of Swiss-Prot to mixed case continues.

: 2.7 Multiple RP lines

Starting with release 41, there can be more than one RP (Reference Position) line per reference in a Swiss-Prot entry. The RP line describes the extent of the work carried out by the authors of the reference, e.g. the type of molecule that has been sequenced, protein characterization, PTM characterization, protein structure analysis, variation detection, etc.

As the number of experimental results per publication has increased over the years, the limitation to a single RP line per reference no longer allowed us to add all the information while maintaining a consistent format. We therefore decided to permit multiple RP lines.

Example:

RP   SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND
RP   CHARACTERIZATION.
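
For programs that read Swiss-Prot entries, this means that the RP lines of one reference should be joined before they are interpreted. The following is a minimal sketch; the function name join_rp_lines is ours, and the line layout (two-letter line code followed by three spaces) is assumed from the example above.

```python
def join_rp_lines(lines):
    """Join the RP lines of one reference into a single string.

    Lines that do not start with the 'RP' line code are ignored.
    """
    parts = []
    for line in lines:
        if line.startswith("RP   "):
            # Drop the line code and the padding that follows it.
            parts.append(line[5:].strip())
    return " ".join(parts)


reference_lines = [
    "RP   SEQUENCE FROM N.A., SEQUENCE OF 23-42 AND 351-365, AND",
    "RP   CHARACTERIZATION.",
]
print(join_rp_lines(reference_lines))
```

A parser that previously read exactly one RP line per reference would silently lose the second line of this example.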

: 2.8 Changes concerning cross-references (DR line)

: 2.8.1 Schizosaccharomyces pombe GeneDB database

We have added cross-references to the Schizosaccharomyces pombe GeneDB database (available at http://www.genedb.org/genedb/pombe/index.jsp), which contains all S. pombe known and predicted protein coding genes, pseudogenes and tRNAs. It is hosted by the Sanger Institute.

The identifiers of the appropriate DR line are:

Data bank identifier:	GeneDB_SPombe
Primary identifier:	GeneDB's unique identifier for a S. pombe gene.
Secondary identifier:	None; a dash '-' is stored in that field.
Example:	DR GeneDB_SPombe; SPAC9E9.12c; -.

: 2.8.2 Genew

We have added cross-references to the Human Gene Nomenclature Database Genew (available at http://www.gene.ucl.ac.uk/nomenclature/searchgenes.pl), which provides data for all human genes which have approved symbols. It is managed by the HUGO Gene Nomenclature Committee (HGNC).

The identifiers of the appropriate DR line are:

Data bank identifier:	Genew
Primary identifier:	HGNC's unique identifier for a human gene
Secondary identifier:	HGNC's approved gene symbol.
Example:	DR Genew; HGNC:5217; HSD3B1.

: 2.8.3 Gramene

We have added cross-references to the Gramene database, a comparative mapping resource for grains (available at http://www.gramene.org/). The format of the explicit links is:

Data bank identifier:	Gramene
Primary identifier:	Unique identifier for a protein, which is identical to the Swiss-Prot primary AC number of that protein.
Secondary identifier:	None; a dash '-' is stored in that field.
Example:	DR Gramene; Q06967; -.

: 2.8.4 HAMAP

We have added cross-references to the collection of orthologous microbial protein families, generated manually by expert curators of the HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes) project in the framework of the Swiss-Prot protein knowledgebase. The data is accessible at http://www.expasy.org/sprot/hamap/families.html.

The identifiers of the appropriate DR line are:

Data bank identifier:	HAMAP
Primary identifier:	HAMAP unique identifier for a microbial protein family
Secondary identifier:	One of the values '-', 'fused', 'atypical' or 'atypical/fused'. The value '-' is a placeholder for an empty field; the value 'fused' indicates that the family rule does not cover the entire protein; the value 'atypical' indicates that the protein is divergent in sequence or has mutated functional sites, and should not be included in family datasets. The value 'atypical/fused' indicates that both of these apply.
Tertiary identifier:	Number of domains found in the protein, generally '1', rarely '2' for the fusion of 2 identical domains.
Example:	DR HAMAP; MF_00012; -; 1.

: 2.8.5 Phosphorylation Site Database

We have added cross-references to the Phosphorylation Site Database, PhosSite (available at http://vigen.biochem.vt.edu/xpd/xpd.htm), which provides access to information from scientific literature concerning prokaryotic proteins that undergo covalent phosphorylation on the hydroxyl side chains of serine, threonine or tyrosine residues. The identifiers of the appropriate DR line are:

Data bank identifier:	PhosSite
Primary identifier:	Unique identifier for a phosphoprotein, which is identical to the Swiss-Prot primary AC number of that protein.
Secondary identifier:	None; a dash '-' is stored in that field.
Example:	DR PhosSite; P00955; -.

: 2.8.6 TIGRFAMs

We have added cross-references to TIGRFAMs, a protein family database available at http://www.tigr.org/TIGRFAMs/. The identifiers of the appropriate DR line are:

Data bank identifier:	TIGRFAMs
Primary identifier:	TIGRFAMs unique identifier for a protein family.
Secondary identifier:	TIGRFAMs entry name for a protein family.
Tertiary identifier:	Number of hits found in the sequence.
Example:	DR TIGRFAMs; TIGR00630; uvra; 1.
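
The DR lines introduced in sections 2.8.1 to 2.8.6 all follow the same pattern: fields separated by semicolons and a terminating period, with either two or three identifiers after the data bank identifier. The following is a minimal parsing sketch based only on the examples above; the names CrossRef and parse_dr_line are ours, and the sketch assumes that no field contains a semicolon.

```python
from typing import NamedTuple, Optional


class CrossRef(NamedTuple):
    database: str
    primary: str
    secondary: str
    tertiary: Optional[str]  # only present for some databases, e.g. HAMAP


def parse_dr_line(line: str) -> CrossRef:
    """Split one DR line into its identifiers."""
    if not line.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # remove the terminating period
    fields = [field.strip() for field in body.split(";")]
    if len(fields) == 3:
        return CrossRef(fields[0], fields[1], fields[2], None)
    if len(fields) == 4:
        return CrossRef(*fields)
    raise ValueError(f"unexpected number of fields: {line!r}")


for example in (
    "DR GeneDB_SPombe; SPAC9E9.12c; -.",
    "DR Genew; HGNC:5217; HSD3B1.",
    "DR HAMAP; MF_00012; -; 1.",
    "DR TIGRFAMs; TIGR00630; uvra; 1.",
):
    print(parse_dr_line(example))
```

Returning the dash placeholder unchanged keeps the sketch neutral; a program may prefer to map '-' to an empty value.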

: 2.8.7 CarbBank

We have removed the Swiss-Prot cross-references to CarbBank.

: 2.8.8 GCRDb

We have removed the Swiss-Prot cross-references to GCRDb.

: 2.8.9 Mendel

We have removed the Swiss-Prot cross-references to Mendel.

: 2.8.10 YEPD

We have removed the Swiss-Prot cross-references to the yeast electrophoresis protein database (YEPD).

: 2.9 Explicit links to dbSNP in FT VARIANT lines of human sequence entries

In human protein sequence entries we have introduced explicit links to the Single Nucleotide Polymorphism database (dbSNP) from the feature description of FT VARIANT keys. The format of such links is:

FT   VARIANT    from     to       description (IN dbSNP:accession_number).
FT                                /FTId=VAR_number.

Example:

FT   VARIANT      65     65       T -> I (IN dbSNP:1065419).
FT                                /FTId=VAR_012009.
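
A program can extract the dbSNP accession number from such a feature description with a regular expression. This is a minimal sketch; the function name is ours, and we assume, from the example above only, that the accession number consists of digits.

```python
import re

# Matches the explicit link "(IN dbSNP:accession_number)".
# Assumption: the accession number is purely numeric, as in the example.
DBSNP_LINK = re.compile(r"\(IN dbSNP:(\d+)\)")


def dbsnp_accession(description: str):
    """Return the dbSNP accession number in a VARIANT description, or None."""
    match = DBSNP_LINK.search(description)
    return match.group(1) if match else None


print(dbsnp_accession("T -> I (IN dbSNP:1065419)."))
```

Variants without a dbSNP link yield None, so the same function can be applied to every FT VARIANT description of an entry.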

: 2.10 Feature key 'SIMILAR' became obsolete

The feature key 'SIMILAR' was used to describe the extent of a similarity with another protein sequence. Nowadays, most domains with similarity to other proteins are known regions described in domain and family databases, which are annotated in Swiss-Prot with the feature key 'DOMAIN' or 'REPEAT' and the comment (CC) line topic 'SIMILARITY'. The feature key 'SIMILAR' has therefore become obsolete and will no longer be used.

: 2.11 Version of SP in XML format

A distribution version of Swiss-Prot and TrEMBL in XML format is being developed. The first draft of the XML specification was released for public review on February 21, 2002.

For more information see http://www.ebi.ac.uk/swissprot/SP-ML/.

Please send comments and suggestions by electronic mail to sp-ml@ebi.ac.uk.

3 Forthcoming changes

Please note that these are the last release notes in this format. In future, forthcoming changes and recent modifications will also be announced to users between major Swiss-Prot releases. The individual sections of this document will move to the following locations:

2. Description of the changes made to Swiss-Prot since the last release: http://www.expasy.org/sprot/relnotes/sp_news.html. This new document contains all recent modifications in Swiss-Prot, including minor changes with no impact on the work of software developers. It therefore contains more information than the document 'sp_soon.html' (see below).
3. Forthcoming changes: http://www.expasy.org/sprot/relnotes/sp_soon.html. All modifications that have an impact on the Swiss-Prot format are announced in this document.
4. Status of the documentation files: http://www.expasy.org/sprot/userman.html#documentation
5. The ExPASy World-Wide Web server:
- Explicit general and continuously updated documentation: http://www.expasy.org/doc/expasy.pdf
- History of changes, improvements and new features: http://www.expasy.org/history.html
- Swiss-Flash, a service that reports news of databases, software and service developments: http://www.expasy.org/swiss-flash/
6. TrEMBL - a supplement to Swiss-Prot: ftp://ftp.ebi.ac.uk/pub/databases/trembl/relnotes.txt
7. FTP access to Swiss-Prot and TrEMBL: http://www.expasy.org/sprot/userman.html#ftp and http://www.expasy.org/sprot/download.html
8. ENZYME and PROSITE: ENZYME release notes (not yet available) and http://www.expasy.org/prosite/psrelnot.html

Appendix A (Release statistics): http://www.expasy.org/sprot/relnotes/relstat.html
Appendix B (Relationships between Swiss-Prot and some biomolecular databases): http://www.expasy.org/sprot/userman.html#relship

: 3.1 Extension of the entry name format

We endeavor to assign meaningful entry names that facilitate the identification of the proteins and the species of origin. Currently the entry name consists of up to ten uppercase alphanumeric characters. Swiss-Prot uses a general purpose naming convention that can be symbolized as X_Y, where X is a mnemonic code of at most 4 alphanumeric characters representing the protein name, the '_' sign serves as a separator, and the Y is a mnemonic species identification code of at most 5 alphanumeric characters representing the biological source of the protein.

We plan to extend the mnemonic code for the protein name from at most 4 characters to at most 5 characters. For example, the mnemonic code for the meiotic recombination protein rec10 is currently 'RE10'. After the introduction of extended entry names it could be changed to the 5-character code 'REC10'.
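
Programs that validate entry names with a fixed pattern will need to be adapted. The following is a minimal sketch of such a check with the length of the protein mnemonic as a parameter; the function name is ours, and the species code 'SCHPO' is used for illustration only, since the text above does not give the full entry name.

```python
import re


def is_valid_entry_name(name: str, max_protein_len: int = 4) -> bool:
    """Check a name against the X_Y convention described above.

    X: 1 to max_protein_len uppercase alphanumeric characters.
    Y: 1 to 5 uppercase alphanumeric characters.
    """
    pattern = rf"[A-Z0-9]{{1,{max_protein_len}}}_[A-Z0-9]{{1,5}}"
    return re.fullmatch(pattern, name) is not None


# Current format: the protein mnemonic has at most 4 characters.
print(is_valid_entry_name("RE10_SCHPO"))
print(is_valid_entry_name("REC10_SCHPO"))
# Planned format: the protein mnemonic has at most 5 characters.
print(is_valid_entry_name("REC10_SCHPO", max_protein_len=5))
```

Software that stores entry names in fixed-width fields of ten characters would likewise need one more character.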

: 3.2 Continuation of the conversion of Swiss-Prot to mixed-case characters

We will continue to convert Swiss-Prot entries from all 'UPPER CASE' to 'MiXeD CaSe'. We are proceeding with the conversion of the CC (Comment) lines and will start to convert the GN (Gene Name) lines to mixed case, but any other line type might also be affected.

: 3.3 Reference Comment (RC) line topics may span lines

The RC (Reference Comment) lines store comments relevant to the reference cited, currently in 5 distinct topics: PLASMID, SPECIES, STRAIN, TISSUE and TRANSPOSON. It is not always possible to list all the information within one line. We will therefore allow multiple RC lines, so that a single topic may span more than one line. Example:

RC   STRAIN=Various strains;

could become

RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,
RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,
RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;
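
A parser can cope with this change by joining all RC lines of a reference before splitting them into topics. The following is a minimal sketch; the function name is ours, and we assume that each topic has the form TOPIC=value, that topics end with a semicolon, and that values contain no semicolon.

```python
def parse_rc_lines(lines):
    """Return a dict mapping RC topics to their values.

    All RC lines of one reference are joined first, so that a topic
    spanning several lines is read as a single value.
    """
    text = " ".join(
        line[5:].strip() for line in lines if line.startswith("RC   ")
    )
    topics = {}
    for token in text.split(";"):
        token = token.strip()
        if not token:
            continue
        name, _, value = token.partition("=")
        topics[name.strip()] = value.strip()
    return topics


rc_lines = [
    "RC   STRAIN=AZ.026, DC.005, GA.039, GA2181, IL.014, IN.018, KY.172, KY2.37,",
    "RC   LA.013, MN.001, MNb027, MS.040, NY.016, OH.036, TN.173, TN2.38,",
    "RC   UT.002, AL.012, AZ.180, MI.035, VA.015, and IL2.17;",
]
print(parse_rc_lines(rc_lines)["STRAIN"])
```

Splitting on the semicolon rather than on line ends is what makes the result independent of where a line break falls.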

: 3.4 New format of comment line (CC) topics

We are continuing a major overhaul of various comment line topics. We would like the majority of the information stored to be usable by computer programs (while remaining human-readable). We are therefore standardizing the format of the topics.

: 3.4.1 ALTERNATIVE PRODUCTS

We are gradually restructuring the CC (comment) line topic ALTERNATIVE PRODUCTS and introducing unique identifiers for each described isoform. The qualifiers that will be introduced are described in the table below:

Topic	Description
Event	Biological process that results in the production of the alternative forms (Alternative promoter, Alternative splicing, Alternative initiation). Format: Event=controlled vocabulary; Example: Event=Alternative splicing;
Named isoforms	Number of isoforms listed in the topics 'Name' below the topic 'Event=Alternative splicing'. Format: Named isoforms=number; Example: Named isoforms=6;
Comment	Any comments concerning one or more isoforms; optional; may be longer than 1 line. Format: Comment=free text; Example: Comment=Experimental confirmation may be lacking for some isoforms;
Name	A common name for an isoform used in the literature or assigned by Swiss-Prot (currently only available for spliced isoforms). Format: Name=common name; Example: Name=Alpha;
Synonyms	Synonyms for an isoform as used in the literature; optional. Format: Synonyms=synonym_1[, synonym_n]; Example: Synonyms=B, KL5;
IsoId	Unique identifier for an isoform, consisting of the Swiss-Prot accession number, followed by a dash and an identifier for this isoform. Format: IsoId=acc#-isoform_number[, acc#-isoform_number]; Example: IsoId=P05067-1;
Sequence	Lists all FT VARSPLIC identifiers (VSP_#), which are needed to build the sequence for a specific isoform. If the accession number of the IsoId does not correspond to the accession number of the current entry, this topic contains the term 'External'. Format: Sequence=VSP_#[, VSP_#]|Displayed|External|Not described; Example: Sequence=Displayed; Example: Sequence=VSP_000013, VSP_000014;
Note	Notes concerning current isoform; optional; Format: Note=free text; Example: Note=Predicted;

In the case of 'Alternative initiation' the topic 'Event' can be followed by a 'Comment' of free text. Format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative initiation;
CC         Comment=Optional free text with information on alternative 
CC         initiation or the products retrieved from this event. In the 
CC         case of alternative initiation there will be no other topics;

In the case of 'Alternative splicing' the topic 'Event' can be followed by a 'Comment' of free text and a listing of all described isoforms. Format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing;
CC         Comment=Optional free text with information on alternative 
CC         splicing or the products retrieved from this event;
CC       Name=isoform_1; Synonyms=synonym_1[, synonym_n];
CC         IsoId=isoform_identifier_1[, isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Optional note concerning isoform_1;
CC       Name=isoform_n; Synonyms=synonym_1[, synonym_n];
CC         IsoId=isoform_identifier_1[, isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Optional note concerning isoform_n;

Example for new format of the CC lines and the corresponding FT lines for an entry with alternative splicing:

...
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=9;
CC         Comment=Additional isoforms seem to exist. APP695, APP751 and
CC         APP770 are the major isoforms. The L-isoforms are referred to as
CC         appicans. Experimental confirmation may be lacking for some
CC         isoforms;
CC       Name=APP770; Synonyms=Prea4 770;
CC         IsoId=P05067-1; Sequence=Displayed;
CC       Name=APP305;
CC         IsoId=P05067-2; Sequence=VSP_000005, VSP_000006;
CC       Name=L-APP677;
CC         IsoId=P05067-3; Sequence=VSP_000002, VSP_000004, VSP_000009;
CC       Name=APP695; Synonyms=Prea4 695;
CC         IsoId=P05067-4; Sequence=VSP_000002, VSP_000004;
CC       Name=L-APP696;
CC         IsoId=P05067-5; Sequence=VSP_000002, VSP_000003, VSP_000009;
CC       Name=APP714;
CC         IsoId=P05067-6; Sequence=VSP_000002, VSP_000003;
CC       Name=L-APP733;
CC         IsoId=P05067-7; Sequence=VSP_000007, VSP_000008, VSP_000009;
CC       Name=APP751; Synonyms=Prea4 751;
CC         IsoId=P05067-8; Sequence=VSP_000007, VSP_000008;
CC       Name=L-APP752;
CC         IsoId=P05067-9; Sequence=VSP_000009;
...
FT   VARSPLIC    289    289       E -> V (in isoform APP695, isoform
FT                                L-APP696, isoform L-APP677 and isoform
FT                                APP714).
FT                                /FTId=VSP_000002.
FT   VARSPLIC    290    345       Missing (in isoform L-APP696 and isoform
FT                                APP714).
FT                                /FTId=VSP_000003.
FT   VARSPLIC    290    364       Missing (in isoform APP695 and isoform
FT                                L-APP677).
FT                                /FTId=VSP_000004.
FT   VARSPLIC    290    305       VCSEQAETGPCRAMIS -> KWYKEVHSGQARWLML (in
FT                                isoform APP305).
FT                                /FTId=VSP_000005.
FT   VARSPLIC    306    770       Missing (in isoform APP305).
FT                                /FTId=VSP_000006.
FT   VARSPLIC    345    345       M -> I (in isoform L-APP733 and isoform
FT                                APP751).
FT                                /FTId=VSP_000007.
FT   VARSPLIC    346    364       Missing (in isoform L-APP733 and isoform
FT                                APP751).
FT                                /FTId=VSP_000008.
FT   VARSPLIC    637    654       Missing (in isoform L-APP677, isoform
FT                                L-APP696, isoform L-APP733 and isoform
FT                                L-APP752).
FT                                /FTId=VSP_000009.
...
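
Because every qualifier has the form Key=value and ends with a semicolon, the new topic can be read without regard to line breaks. The following is a minimal parsing sketch; the function name is ours, and we assume that the lines passed in contain only the ALTERNATIVE PRODUCTS topic and that no value contains a semicolon. The sample lines are an excerpt of the example above.

```python
def parse_alternative_products(cc_lines):
    """Parse an ALTERNATIVE PRODUCTS topic into (header, isoforms).

    header:   qualifiers that precede the first 'Name' (Event, Comment, ...).
    isoforms: one dict per 'Name', holding the qualifiers that follow it.
    """
    text = " ".join(
        line[2:].strip() for line in cc_lines if line.startswith("CC")
    )
    marker = "-!- ALTERNATIVE PRODUCTS:"
    start = text.find(marker)
    if start < 0:
        raise ValueError("no ALTERNATIVE PRODUCTS topic found")
    text = text[start + len(marker):]

    header, isoforms = {}, []
    for token in text.split(";"):
        key, sep, value = token.strip().partition("=")
        if not sep:
            continue  # empty remainder after the last semicolon
        if key == "Name":
            isoforms.append({"Name": value})
        elif isoforms:
            isoforms[-1][key] = value
        else:
            header[key] = value

    # Qualifiers that hold comma-separated lists.
    for isoform in isoforms:
        for key in ("Synonyms", "IsoId", "Sequence"):
            if key in isoform:
                isoform[key] = [v.strip() for v in isoform[key].split(",")]
    return header, isoforms


sample = [
    "CC   -!- ALTERNATIVE PRODUCTS:",
    "CC       Event=Alternative splicing; Named isoforms=9;",
    "CC       Name=APP770; Synonyms=Prea4 770;",
    "CC         IsoId=P05067-1; Sequence=Displayed;",
    "CC       Name=APP305;",
    "CC         IsoId=P05067-2; Sequence=VSP_000005, VSP_000006;",
]
header, isoforms = parse_alternative_products(sample)
print(header)
for isoform in isoforms:
    print(isoform)
```

The VSP identifiers collected under 'Sequence' can then be looked up in the /FTId qualifiers of the FT VARSPLIC lines to find the changes that define each isoform.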

: 3.4.2 PATHWAY

We are gradually structuring the comment line topic PATHWAY. To describe the biochemical pathway in which the protein is involved, we use the following format:

CC   -!- PATHWAY: biochemical pathway; n^th step.[ Comment.]

Example:

CC   -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step.
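
The structured format can be read with a single regular expression. This is a minimal sketch based on the format line above; the function name is ours, and we assume that the pathway name contains no semicolon, that the step description contains no period, and that the topic fits on one line.

```python
import re

# pathway; step.[ Comment.]
PATHWAY_LINE = re.compile(
    r"^CC\s+-!- PATHWAY:\s*(?P<pathway>[^;]+);\s*(?P<step>[^.]+)\."
    r"(?:\s+(?P<comment>.+))?$"
)


def parse_pathway(line: str):
    """Return pathway, step and optional comment, or None if no match."""
    match = PATHWAY_LINE.match(line)
    return match.groupdict() if match else None


print(parse_pathway("CC   -!- PATHWAY: Coenzyme A (CoA) biosynthesis; first step."))
```

Lines that have not yet been converted to the structured format simply fail to match and return None.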

: 3.4.3 COFACTOR

The comment line topic COFACTOR is gradually being modified to the following format:

CC   -!- COFACTOR: cofactor1[, cofactor2 and cofactor3].[ Comment.]

Examples:

CC   -!- COFACTOR: Magnesium.
CC   -!- COFACTOR: Copper, Manganese and Nickel.
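
The list of cofactors can be recovered by splitting on commas and on the word 'and'. The following is a minimal sketch; the function name is ours, and we assume that cofactor names contain neither a period nor the separate word 'and', which may not hold for every entry.

```python
import re


def parse_cofactors(line: str):
    """Return (list of cofactors, optional comment) for one COFACTOR line."""
    prefix = "CC   -!- COFACTOR:"
    if not line.startswith(prefix):
        raise ValueError(f"not a COFACTOR line: {line!r}")
    body = line[len(prefix):].strip()
    # The cofactor list ends at the first period; the rest is the comment.
    names, _, comment = body.partition(".")
    cofactors = [
        name.strip()
        for name in re.split(r",\s*|\s+and\s+", names)
        if name.strip()
    ]
    return cofactors, (comment.strip() or None)


print(parse_cofactors("CC   -!- COFACTOR: Magnesium."))
print(parse_cofactors("CC   -!- COFACTOR: Copper, Manganese and Nickel."))
```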

: 3.5 Changes concerning cross-references (DR line)

We will add cross-references to the Gene Ontology (GO) database (available at http://www.geneontology.org/), which provides controlled vocabularies for the description of the molecular function, biological process and cellular component of gene products.

The identifiers of the appropriate DR line are:

Data bank identifier:	GO
Primary identifier:	GO's unique identifier for a GO term.
Secondary identifier:	A 1-letter abbreviation for one of the 3 ontology aspects, separated from the GO term by a colon. If the term is longer than 45 characters, the first 43 characters are given, followed by 3 dots ('...'). The abbreviations for the 3 distinct aspects of the ontology are P (biological Process), F (molecular Function) and C (cellular Component).
Tertiary identifier:	3-character GO evidence code.
Example:	DR GO; GO:0003677; F:DNA binding; TAS.
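
The secondary identifier combines two pieces of information, so a program will usually split it at the first colon. The following is a minimal sketch based on the example above; the names are ours, and we assume that GO terms contain no semicolon.

```python
# The three ontology aspects and their 1-letter abbreviations.
GO_ASPECTS = {
    "P": "biological process",
    "F": "molecular function",
    "C": "cellular component",
}


def parse_go_dr_line(line: str) -> dict:
    """Split a GO cross-reference into identifier, aspect, term and evidence."""
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]  # remove the terminating period only
    fields = [field.strip() for field in body.split(";")]
    if len(fields) != 4 or fields[0] != "GO":
        raise ValueError(f"not a GO cross-reference: {line!r}")
    aspect, _, term = fields[2].partition(":")
    return {
        "id": fields[1],
        "aspect": GO_ASPECTS[aspect],
        "term": term,  # may end in '...' if the term was truncated
        "evidence": fields[3],
    }


print(parse_go_dr_line("DR GO; GO:0003677; F:DNA binding; TAS."))
```

Because long terms are truncated in the DR line, the GO identifier rather than the term text should be used as the key when linking to the GO database.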

: 3.6 Modifications concerning the feature table (FT line)

We are investing a major effort in the annotation of posttranslational modifications, which has an effect on various feature keys and feature descriptions. Major format changes are described below.

: 3.6.1 New feature key 'CROSSLNK'

The feature key 'CROSSLNK' will be introduced to describe bonds between amino acids that are formed posttranslationally within a peptide or between peptides, such as isopeptidic bonds, carbon-carbon linkages, carbon-nitrogen linkages and backbone condensations. It will also cover the description of thioether bonds and thiolester bonds, and thus the feature keys 'THIOETH' and 'THIOLEST' will be removed.

Note: Disulfide bonds occur so often in proteins, that we will keep the special feature key 'DISULFID' to describe this kind of linkage.

Format:

FT   CROSSLNK    from     to      Description.

: 3.6.2 Removal of the feature key 'THIOETH'

See section 3.6.1.

: 3.6.3 Removal of the feature key 'THIOLEST'

See section 3.6.1.

4 Status of the documentation files

Swiss-Prot is distributed with a large number of documentation files. Some of these files have been available for a long time (the user manual, release notes, the various indexes for authors, citations, keywords, etc.), but many have been created recently and we are continuously adding new files, and updating and modifying existing files. Please note that the header in many documentation files has changed. The following table lists all the documents that are currently available.

See also section 7.3 for information on how to access updated versions of all documents between major releases.

userman.txt	User manual
relnotes.txt	Release notes for the current release (41)
shortdes.txt	Short description of entries in Swiss-Prot

jourlist.txt	List of cited journals
keywlist.txt	List of keywords
plasmid.txt	List of plasmids
speclist.txt	List of organism (species) identification codes
tisslist.txt	List of tissues
experts.txt	List of on-line experts for PROSITE and Swiss-Prot
dbxref.txt	List of databases cross-referenced in Swiss-Prot
submit.txt	Submission of sequence data to Swiss-Prot

acindex.txt	Accession number index
autindex.txt	Author index
citindex.txt	Citation index
keyindex.txt	Keyword index
speindex.txt	Species index
deleteac.txt	Deleted accession number index

7tmrlist.txt	List of 7-transmembrane G-linked receptor entries
aatrnasy.txt	List of aminoacyl-tRNA synthetases
allergen.txt	Nomenclature and index of allergen sequences
annbioch.txt	Swiss-Prot annotation: how is biochemical information assigned to sequence entries
arath.txt	Index of Arabidopsis thaliana entries and their corresponding gene designations [see 2]
bacsu.txt	Index of Bacillus subtilis strain 168 chromosomal entries and their corresponding SubtiList cross-references [see 1]
bloodgrp.txt	Blood group antigen proteins
bucai.txt	Index of Buchnera aphidicola (subsp. Acyrthosiphon pisum) entries [see 2]
bucap.txt	Index of Buchnera aphidicola (subsp. Schizaphis graminum) entries [see 2]
calbican.txt	Index of Candida albicans entries and their corresponding gene designations
cdlist.txt	CD nomenclature for surface proteins of human leucocytes
celegans.txt	Index of Caenorhabditis elegans entries and their corresponding gene designations and WormPep cross-references
dicty.txt	Index of Dictyostelium discoideum entries and their corresponding gene designations and DictyDB cross-references
ec2dtosp.txt	Index of Escherichia coli Gene-protein database (ECO2DBASE) entries referenced in Swiss-Prot
ecoli.txt	Index of Escherichia coli strain K12 chromosomal entries and their corresponding EcoGene cross-references
embltosp.txt	Index of EMBL Nucleotide Sequence Database entries referenced in Swiss-Prot
extradom.txt	Nomenclature of extracellular domains
fly.txt	Index of Drosophila entries and their corresponding FlyBase cross-references
glycosid.txt	Classification of glycosyl hydrolase families and index of glycosyl hydrolase entries in Swiss-Prot
haein.txt	Index of Haemophilus influenzae strain Rd chromosomal entries [see 1]
helpy.txt	Index of Helicobacter pylori strain 26695 chromosomal entries [see 1]
hoxlist.txt	Vertebrate homeotic Hox proteins: nomenclature and index
humchr01.txt	Index of proteins encoded on human chromosome 1
humchr02.txt	Index of proteins encoded on human chromosome 2
humchr03.txt	Index of proteins encoded on human chromosome 3
humchr04.txt	Index of proteins encoded on human chromosome 4
humchr05.txt	Index of proteins encoded on human chromosome 5
humchr06.txt	Index of proteins encoded on human chromosome 6
humchr07.txt	Index of proteins encoded on human chromosome 7
humchr08.txt	Index of proteins encoded on human chromosome 8
humchr09.txt	Index of proteins encoded on human chromosome 9
humchr10.txt	Index of proteins encoded on human chromosome 10
humchr11.txt	Index of proteins encoded on human chromosome 11
humchr12.txt	Index of proteins encoded on human chromosome 12
humchr13.txt	Index of proteins encoded on human chromosome 13
humchr14.txt	Index of proteins encoded on human chromosome 14
humchr15.txt	Index of proteins encoded on human chromosome 15
humchr16.txt	Index of proteins encoded on human chromosome 16
humchr17.txt	Index of proteins encoded on human chromosome 17
humchr18.txt	Index of proteins encoded on human chromosome 18
humchr19.txt	Index of proteins encoded on human chromosome 19
humchr20.txt	Index of proteins encoded on human chromosome 20
humchr21.txt	Index of proteins encoded on human chromosome 21
humchr22.txt	Index of proteins encoded on human chromosome 22
humchrx.txt	Index of proteins encoded on human chromosome X
humchry.txt	Index of proteins encoded on human chromosome Y
humpvar.txt	Index of human proteins with sequence variants
initfact.txt	List and index of translation initiation factors
intein.txt	Index of intein-containing entries referenced in Swiss-Prot
metallo.txt	Classification of metallothioneins and index of the entries in Swiss-Prot
metja.txt	Index of Methanococcus jannaschii entries [see 1]
mgdtosp.txt	Index of MGD entries referenced in Swiss-Prot
mimtosp.txt	Index of MIM entries referenced in Swiss-Prot
mycge.txt	Index of Mycoplasma genitalium strain G-37 chromosomal entries [see 1]
mycpn.txt	Index of Mycoplasma pneumoniae strain M129 chromosomal entries [see 2]
ngr234.txt	Table of predicted proteins in Rhizobium plasmid pNGR234a
nomlist.txt	List of nomenclature related references for proteins
pdbtosp.txt	Index of Protein Data Bank (PDB) entries referenced in Swiss-Prot
peptidas.txt	Classification of peptidase families and index of peptidase entries in Swiss-Prot
plastid.txt	List of chloroplast and cyanelle encoded proteins
pombe.txt	Index of Schizosaccharomyces pombe entries and their corresponding gene designations
restric.txt	List of restriction enzyme and methylase entries
ribosomp.txt	Index of ribosomal proteins classified by families on the basis of sequence similarities
ricpr.txt	Index of Rickettsia prowazekii strain Madrid E entries [see 1]
salty.txt	Index of Salmonella typhimurium strain LT2 chromosomal entries and their corresponding StyGene cross-references
syny3.txt	Index of Synechocystis sp. strain PCC 6803 entries [see 1]
upflist.txt	List of UPF (Uncharacterized Protein Families) and index of members
yeast.txt	Index of Saccharomyces cerevisiae entries in Swiss-Prot and their corresponding gene designations
yeast1.txt	Yeast chromosome I entries
yeast2.txt	Yeast chromosome II entries
yeast3.txt	Yeast chromosome III entries
yeast5.txt	Yeast chromosome V entries
yeast6.txt	Yeast chromosome VI entries
yeast7.txt	Yeast chromosome VII entries
yeast8.txt	Yeast chromosome VIII entries
yeast9.txt	Yeast chromosome IX entries
yeast10.txt	Yeast chromosome X entries
yeast11.txt	Yeast chromosome XI entries
yeast13.txt	Yeast chromosome XIII entries
yeast14.txt	Yeast chromosome XIV entries

Notes:

1) The indexes of microbe-specific entries have been renamed; each filename is now composed of the 5-letter code used for the species in the Swiss-Prot entry name and the extension 'txt'. This modification concerns the following files:

'bacsu.txt' (formerly: 'subtilis.txt'), 'haein.txt' (formerly: 'haeinflu.txt'), 'helpy.txt' (formerly: 'hpylori.txt'), 'metja.txt' (formerly: 'mjannasc.txt'), 'mycge.txt' (formerly: 'mgenital.txt'), 'ricpr.txt' (formerly: 'rprowaze.txt'), 'syny3.txt' (formerly: 'pcc6803.txt').

2) The files 'arath.txt', 'bucai.txt', 'bucap.txt' and 'mycpn.txt' are new documents introduced since release 40.

3) We have continued to include in some Swiss-Prot documentation files the references to Web sites relevant to the subject under consideration. There are now 89 documents that include such links.
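
For sites that keep local copies of the indexes under their former names, the renaming described in the notes above can be applied mechanically. The following Python sketch is only an illustration: the old and new names are taken from the list above, while the function name rename_local_indexes and the idea of a local documentation directory are assumptions.

```python
from pathlib import Path

# Former name -> new name, as listed in the note on renamed index files.
RENAMED_INDEXES = {
    "subtilis.txt": "bacsu.txt",
    "haeinflu.txt": "haein.txt",
    "hpylori.txt": "helpy.txt",
    "mjannasc.txt": "metja.txt",
    "mgenital.txt": "mycge.txt",
    "rprowaze.txt": "ricpr.txt",
    "pcc6803.txt": "syny3.txt",
}


def rename_local_indexes(doc_dir):
    """Rename old-style index files found in doc_dir (a hypothetical local copy).

    Returns the list of (old, new) names actually renamed. Files that are
    absent, or whose new name already exists, are left untouched.
    """
    doc_dir = Path(doc_dir)
    renamed = []
    for old, new in RENAMED_INDEXES.items():
        src, dst = doc_dir / old, doc_dir / new
        if src.is_file() and not dst.exists():
            src.rename(dst)
            renamed.append((old, new))
    return renamed
```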

5 New features of the ExPASy World-Wide Web server related to Swiss-Prot

Explicit general and continuously updated documentation about the ExPASy server is available at http://www.expasy.org/doc/expasy.pdf.

ExPASy is constantly modified and improved. If you wish to be informed of the changes made to the server you can either:

Read the document 'History of changes, improvements and new features' which is available at the address: http://www.expasy.org/history.html
Subscribe to Swiss-Flash, a service that reports news of databases, software and service developments. By subscribing to this service, you will automatically get Swiss-Flash bulletins by electronic mail. To subscribe, use the address: http://www.expasy.org/swiss-flash/.

Among all the improvements and the new features introduced since the last Swiss-Prot release, here are those that we believe are specifically useful to Swiss-Prot users:

The NiceProt view of Swiss-Prot has been further improved: access to documentation has been facilitated by adding "mouse-over" hypertext links from various sections in NiceProt to the corresponding information in the user manual. Those hypertext links, which give access to documentation rather than the data related to the protein entry, are visually different from the ordinary hyperlinks. While they are not immediately recognizable as such, the user can see that they are clickable by moving the mouse pointer over the section headings such as "References" or "Keywords". A short description of the linked information appears at the bottom of the web browser, and when clicked, a small additional window is opened with related information extracted from the user manual.
Similarly, in the "Cross-references" section, the names of the databases to which an entry is cross-referenced are linked to the corresponding sections in the document dbxref.txt (List of databases cross-referenced in Swiss-Prot).
Implicit links have been added to the resources AraC-XylS, Ensembl and ModBase. We have removed the implicit links to DOMO, which is no longer maintained.
For more details on Swiss-Prot cross-references, implicit and explicit links, you can read:
Gasteiger E., Jung E., Bairoch A.
Swiss-Prot: connecting biological knowledge via a protein database.
Curr. Issues Mol. Biol. 3:47-55(2001)
A few improvements have been applied to the pages describing the Human Proteomics Initiative (HPI). For each human chromosome a link is provided to the corresponding index of Swiss-Prot entries, to relevant information in the EBI Proteome database, in Ensembl, in the Human Genome Resources at NCBI and in euGenes at Indiana University.

The HPI status report has been modified to include, for each of the counted items (e.g. splice variants, variants, references) not only the absolute number, but also the maximal and average number of occurrences per entry, and the number of entries concerned by the counted item.

6 TrEMBL - a supplement to Swiss-Prot

The ongoing genome sequencing and mapping projects have dramatically increased the number of protein sequences to be incorporated into Swiss-Prot. Since we do not want to dilute the quality standards of Swiss-Prot by incorporating sequences into the database without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. But as we also want to make the sequences available as quickly as possible, we introduced in 1995 a computer-annotated supplement to Swiss-Prot. This supplement consists of entries in Swiss-Prot-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except those already included in Swiss-Prot.

This supplement is named TrEMBL (Translation from EMBL). It can be considered as a preliminary section of Swiss-Prot. This Swiss-Prot release is supplemented by TrEMBL release 21.

TrEMBL is available by FTP from the EBI and ExPASy servers in the directory '/databases/trembl'. It can be queried on the WWW through the EBI and ExPASy SRS servers. It is distributed with its own set of release notes.

7 FTP access to Swiss-Prot and TrEMBL

: 7.1 Generalities

Swiss-Prot is available for download on the following anonymous FTP servers:

Organization	Swiss Institute of Bioinformatics (SIB)
Address	ftp.expasy.org, au.expasy.org, bo.expasy.org, ca.expasy.org, cn.expasy.org, kr.expasy.org, tw.expasy.org, us.expasy.org
Directory	/databases/swiss-prot/

Organization	European Bioinformatics Institute (EBI)
Address	ftp.ebi.ac.uk
Directory	/pub/databases/swissprot/

: 7.2 Non-redundant database

On the ExPASy and EBI FTP servers we distribute files that make up a non-redundant and complete protein sequence database consisting of three components:

1) Swiss-Prot
2) TrEMBL
3) New entries to be integrated later into TrEMBL (hereafter known as TrEMBL_New)

Every week three files are completely rebuilt. These files are named: sprot.dat.gz, trembl.dat.gz and trembl_new.dat.gz. As indicated by their '.gz' extension, these are gzip-compressed files which, when decompressed, produce ASCII files in Swiss-Prot format.
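
The compressed files can be read without first decompressing them to disk. The Python sketch below streams a gzip-compressed flat file and yields one entry at a time; it relies on the Swiss-Prot format convention that every entry ends with a '//' line. The function name iter_entries is an assumption made for this illustration.

```python
import gzip


def iter_entries(path):
    """Yield each entry of a gzip-compressed Swiss-Prot flat file as a list of lines.

    Assumes every entry is terminated by a line consisting of '//'.
    """
    entry = []
    # The decompressed files are ASCII, so decoding as ASCII is sufficient.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line == "//":
                yield entry
                entry = []
            else:
                entry.append(line)
```

With a previously downloaded copy of sprot.dat.gz in the current directory, sum(1 for _ in iter_entries("sprot.dat.gz")) would count the entries it contains.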

Three other files are also available (sprot.fas.gz, trembl.fas.gz and trembl_new.fas.gz) which are compressed 'fasta' format sequence files useful for building the databases used by FASTA, BLAST and other sequence similarity search programs. Please do not use these files for any other purpose, as you will lose all annotations by using this stripped-down format.

The files for the non-redundant database are stored in the directory '/databases/sp_tr_nrdb' on the ExPASy FTP server (ftp.expasy.org) and in the directory '/pub/databases/sp_tr_nrdb' on the EBI FTP server (ftp.ebi.ac.uk).
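
As an illustration, the three weekly files could be fetched with a short script. The Python sketch below uses the standard ftplib module with the host names and directories quoted above; the function name fetch_nrdb and the choice to write into the current directory are assumptions, and the sketch presumes the servers are reachable at those addresses.

```python
from ftplib import FTP

# Hosts and directories as given in the text above.
NRDB_LOCATIONS = {
    "expasy": ("ftp.expasy.org", "/databases/sp_tr_nrdb"),
    "ebi": ("ftp.ebi.ac.uk", "/pub/databases/sp_tr_nrdb"),
}
NRDB_FILES = ("sprot.dat.gz", "trembl.dat.gz", "trembl_new.dat.gz")


def fetch_nrdb(site="expasy", files=NRDB_FILES):
    """Download the non-redundant database files into the current directory."""
    host, directory = NRDB_LOCATIONS[site]
    with FTP(host) as ftp:
        ftp.login()  # no arguments: anonymous login
        ftp.cwd(directory)
        for name in files:
            with open(name, "wb") as out:
                # Binary transfer: the files are gzip-compressed.
                ftp.retrbinary("RETR " + name, out.write)
```

Calling fetch_nrdb("ebi") would use the EBI server instead of ExPASy.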

Additional notes:

The Swiss-Prot file continuously grows as new annotated sequences are added.

The TrEMBL file decreases in size as sequences are annotated and moved out of that section into Swiss-Prot. Four times a year a new release of TrEMBL is built at EBI; at that point the TrEMBL file increases in size, as it then includes all of the new data (see next section) that has accumulated since the last release.

The TrEMBL_New file starts as a very small file and grows in size until a new release of TrEMBL is available.

Swiss-Prot and TrEMBL share the same system of accession numbers. Therefore you will not find any primary accession number duplicated between the two sections. A TrEMBL entry (and its associated accession number(s)) can either move to Swiss-Prot as a new entry or be merged with an existing Swiss-Prot entry. In the latter case, the accession number(s) of that TrEMBL entry are added to those of the Swiss-Prot entry.
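
The merging case can be pictured with a small Python sketch. It is not the procedure actually used to build the database: the function name merge_accessions is hypothetical, and keeping the Swiss-Prot accession numbers first and appending the TrEMBL ones in their original order is an assumption made for illustration.

```python
def merge_accessions(swissprot_acs, trembl_acs):
    """Return the accession list of a Swiss-Prot entry after a TrEMBL entry is merged in.

    Assumption: the Swiss-Prot accession numbers keep their order and the
    TrEMBL accession numbers not already present are appended after them.
    """
    merged = list(swissprot_acs)
    for ac in trembl_acs:
        if ac not in merged:
            merged.append(ac)
    return merged
```

For example, merging a TrEMBL entry carrying the made-up accession number Q99999 into a Swiss-Prot entry carrying P12345 gives the list P12345, Q99999.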

TrEMBL_New does not have real accession numbers. However it was necessary to have an 'AC' line so as to be able to use it with different software products. This AC line contains a temporary identifier which consists of the protein_ID (protein sequence identifier) of the coding sequence in the parent nucleotide sequence.

TrEMBL_New is quite messy! You will of course find new sequence entries but you will also encounter sequences that are going to be used to update existing TrEMBL or Swiss-Prot entries. None of the "cleaning" steps that are applied to produce a TrEMBL release are run on TrEMBL_New nor are any of the computer-annotation software tools that are used to enhance the information content of TrEMBL. TrEMBL_New is provided only so that users can be sure not to miss any important new sequences when they run similarity searches.

While these three files allow you to build what we call a 'non-redundant' database, this description is not strictly accurate. Without going into a long explanation, we can say that this is currently the best attempt at providing a complete selection of protein sequence entries while trying to eliminate redundancies. While Swiss-Prot is almost completely (99.994%!) non-redundant, TrEMBL is far from being non-redundant, and the combination of Swiss-Prot and TrEMBL is even less so.

To describe to your users the version of the non-redundant database that you are providing them with, you should use a statement of the form:

: 7.3 Weekly updates of Swiss-Prot documents

Until now, the ExPASy FTP server only provided the Swiss-Prot documents and indexes in their versions at the time of the last full release. All documents are now updated with every weekly release of Swiss-Prot and are available for FTP download from the directory /databases/swiss-prot/updated_doc/.

: 7.4 Weekly updates of Swiss-Prot

Weekly updates of Swiss-Prot are available by anonymous FTP. Three files are generated at each update:

new_seq.dat	Contains all the new entries since the last full release;
upd_seq.dat	Contains the entries for which the sequence data has been updated since the last release;
upd_ann.dat	Contains the entries for which one or more annotation fields have been updated since the last release.

Important notes

Although we try to follow a regular schedule, we do not promise to update these files every week. In most cases two weeks may elapse between two updates.
Instead of using the above files, you can, every week, download an updated copy of the Swiss-Prot database. This file is available in the directory containing the non-redundant database (see section 7.2).
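
One way to use the three update files is to overlay them on a local copy of the last full release. The Python sketch below keys entries by their primary accession number, taken to be the first accession number on the first 'AC' line. The function names are hypothetical, and the sketch deliberately ignores entries that were deleted or merged since the release, which a real procedure would have to handle.

```python
def primary_accession(entry_lines):
    """Return the first accession number on the first 'AC' line of an entry."""
    for line in entry_lines:
        if line.startswith("AC"):
            return line[2:].strip().split(";")[0].strip()
    raise ValueError("entry has no AC line")


def apply_updates(release, *update_sets):
    """Overlay update entries on a full release, keyed by primary accession.

    'release' and each update set are iterables of entries, each entry being
    a list of lines. Later update sets override earlier ones.
    """
    current = {primary_accession(entry): entry for entry in release}
    for updates in update_sets:
        for entry in updates:
            current[primary_accession(entry)] = entry
    return current
```

The entries of new_seq.dat, upd_seq.dat and upd_ann.dat would be passed as the update sets, in any order, since an entry is assumed to appear in only one of them.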

8 ENZYME and PROSITE

: 8.1 The ENZYME nomenclature database

Release 30.0 of the ENZYME nomenclature database is distributed with release 41 of Swiss-Prot. ENZYME release 30.0 contains information on 4'136 enzymes. In this release, we have added a significant number of new entries and we have also updated many entries.

: 8.2 The PROSITE database

PROSITE now comes with its own release notes.

9 We need your help!

We welcome feedback from our users. We would especially appreciate your notifying us if you find that sequences belonging to your field of expertise are missing from the database. We also would like to be notified about annotations to be updated, if, for example, the function of a protein has been clarified or if new information about post-translational modifications has become available. To facilitate this feedback we offer, on the ExPASy WWW server, a form that allows the submission of updates and/or corrections to Swiss-Prot:

http://www.expasy.org/sprot/update.html

It is also possible, from any entry in Swiss-Prot displayed by the ExPASy server, to submit updates and/or corrections for that particular entry. Finally, you can also send your comments by electronic mail to the address:

swiss-prot@expasy.org

Note that all update requests are assigned a unique identifier of the form UR-Xnnnn (example: UR-A0123). This identifier is used internally by the Swiss-Prot staff at SIB and EBI to track requests and is also used in e-mail exchanges with the persons who have submitted a request.
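
Users who track their own correspondence may want to recognize these identifiers automatically. The Python sketch below assumes, on the basis of the single example UR-A0123, that 'X' stands for one uppercase letter and 'nnnn' for exactly four digits; the function name is hypothetical.

```python
import re

# Assumption: 'X' is one uppercase letter and 'nnnn' is four digits,
# as suggested by the example UR-A0123.
UPDATE_REQUEST_ID = re.compile(r"UR-[A-Z][0-9]{4}")


def is_update_request_id(text):
    """Return True if text has the form UR-Xnnnn."""
    return UPDATE_REQUEST_ID.fullmatch(text) is not None
```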

APPENDIX A: Some statistics

: A.1 Amino acid composition

: A.1.1 Composition in percent for the complete database

   Ala (A) 7.72   Gln (Q) 3.92   Leu (L) 9.56   Ser (S) 6.98
   Arg (R) 5.24   Glu (E) 6.54   Lys (K) 5.96   Thr (T) 5.51
   Asn (N) 4.28   Gly (G) 6.90   Met (M) 2.36   Trp (W) 1.18
   Asp (D) 5.27   His (H) 2.26   Phe (F) 4.06   Tyr (Y) 3.13
   Cys (C) 1.60   Ile (I) 5.88   Pro (P) 4.88   Val (V) 6.66

   Asx (B) 0.000  Glx (Z) 0.000  Xaa (X) 0.01

: A.1.2 Classification of the amino acids by their frequency

   Leu, Ala, Ser, Gly, Val, Glu, Lys, Ile, Thr, Asp, Arg, Pro, Asn, Phe,
   Gln, Tyr, Met, His, Cys, Trp
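
The classification in A.1.2 follows directly from the percentages in A.1.1. As a small check, the Python sketch below sorts the twenty standard amino acids by decreasing percentage; the values are copied from the table above and the function name is an assumption.

```python
# Percentages copied from table A.1.1 (standard amino acids only).
COMPOSITION = {
    "Ala": 7.72, "Gln": 3.92, "Leu": 9.56, "Ser": 6.98,
    "Arg": 5.24, "Glu": 6.54, "Lys": 5.96, "Thr": 5.51,
    "Asn": 4.28, "Gly": 6.90, "Met": 2.36, "Trp": 1.18,
    "Asp": 5.27, "His": 2.26, "Phe": 4.06, "Tyr": 3.13,
    "Cys": 1.60, "Ile": 5.88, "Pro": 4.88, "Val": 6.66,
}


def rank_by_frequency(composition):
    """Return the amino acid codes ordered from most to least frequent."""
    return sorted(composition, key=composition.get, reverse=True)


print(", ".join(rank_by_frequency(COMPOSITION)))
```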

: A.2 Taxonomic origin

Total number of species represented in this release of Swiss-Prot: 7'778
The first twenty species represent 51'656 sequences: 42.1% of the total number of entries.

: A.2.1 Table of the frequency of occurrence of species

        Species represented 1x: 3679
                            2x: 1206
                            3x:  619
                            4x:  403
                            5x:  273
                            6x:  251
                            7x:  192
                            8x:  146
                            9x:  120
                           10x:   66
                       11- 20x:  331
                       21- 50x:  250
                       51-100x:   84
                         >100x:  158
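
The classes in this table partition the species, so their counts should add up to the total number of species given in A.2. A minimal Python check, with the counts copied from the table and the variable names chosen for illustration:

```python
# (occurrence class, number of species), copied from table A.2.1.
SPECIES_FREQUENCY = [
    ("1x", 3679), ("2x", 1206), ("3x", 619), ("4x", 403), ("5x", 273),
    ("6x", 251), ("7x", 192), ("8x", 146), ("9x", 120), ("10x", 66),
    ("11-20x", 331), ("21-50x", 250), ("51-100x", 84), (">100x", 158),
]


def total_species(table):
    """Sum the number of species over all occurrence classes."""
    return sum(count for _, count in table)


print(total_species(SPECIES_FREQUENCY))
```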

: A.2.2 Table of the most represented species

  ------  ---------  --------------------------------------------
  Number  Frequency  Species
  ------  ---------  --------------------------------------------
       1       9172  Homo sapiens (Human)
       2       6169  Mus musculus (Mouse)
       3       4892  Saccharomyces cerevisiae (Baker's yeast)
       4       4832  Escherichia coli
       5       3442  Rattus norvegicus (Rat)
       6       2402  Bacillus subtilis
       7       2291  Caenorhabditis elegans
       8       2116  Schizosaccharomyces pombe (Fission yeast)
       9       1952  Arabidopsis thaliana (Mouse-ear cress)
      10       1773  Haemophilus influenzae
      11       1764  Drosophila melanogaster (Fruit fly)
      12       1529  Methanococcus jannaschii
      13       1485  Escherichia coli O157:H7
      14       1389  Bos taurus (Bovine)
      15       1371  Mycobacterium tuberculosis
      16       1240  Salmonella typhimurium
      17       1062  Gallus gallus (Chicken)
      18        942  Shigella flexneri
      19        919  Synechocystis sp. (strain PCC 6803)
      20        914  Escherichia coli O6
      21        876  Archaeoglobus fulgidus
      22        839  Pseudomonas aeruginosa
      23        838  Xenopus laevis (African clawed frog)
      24        822  Sus scrofa (Pig)
      25        771  Salmonella typhi
      26        716  Aquifex aeolicus
      27        704  Oryctolagus cuniculus (Rabbit)
      28        687  Mycoplasma pneumoniae
      29        670  Rhizobium meliloti (Sinorhizobium meliloti)
      30        609  Vibrio cholerae
      31        599  Treponema pallidum
      32        581  Mycobacterium leprae
      33        572  Buchnera aphidicola (subsp. Acyrthosiphon pisum) 
      34        560  Buchnera aphidicola (subsp. Schizaphis graminum)
      35        536  Helicobacter pylori (Campylobacter pylori)
      36        535  Rickettsia prowazekii
      37        524  Yersinia pestis
      38        519  Helicobacter pylori J99 (Campylobacter pylori J99)
      39        519  Streptomyces coelicolor
      40        494  Bacillus halodurans
      41        491  Zea mays (Maize)
      42        491  Methanobacterium thermoautotrophicum
      43        486  Mycoplasma genitalium
      44        480  Pasteurella multocida
      45        454  Anabaena sp. (strain PCC 7120)
      46        432  Lactococcus lactis (subsp. lactis) (Streptococcus lactis)
      47        419  Thermotoga maritima
      48        416  Oryza sativa (Rice)
      49        405  Borrelia burgdorferi (Lyme disease spirochete)
      50        404  Chlamydia trachomatis
      51        403  Rhizobium sp. (strain NGR234)
      52        393  Canis familiaris (Dog)
      53        391  Chlamydia pneumoniae (Chlamydophila pneumoniae)
      54        390  Neisseria meningitidis (serogroup B)
      55        386  Neisseria meningitidis (serogroup A)
      56        381  Chlamydia muridarum
      57        366  Caulobacter crescentus
      58        365  Pyrococcus horikoshii
      59        359  Listeria monocytogenes
      60        359  Clostridium acetobutylicum
      61        357  Pyrococcus abyssi
      62        354  Ralstonia solanacearum (Pseudomonas solanacearum)
      63        352  Listeria innocua
      64        352  Rhizobium loti (Mesorhizobium loti)
      65        350  Streptococcus pneumoniae
      66        346  Agrobacterium tumefaciens (strain C58 / ATCC 33970)
      67        341  Nicotiana tabacum (Common tobacco)
      68        337  Xylella fastidiosa
      69        335  Deinococcus radiodurans
      70        332  Ovis aries (Sheep)
      71        326  Xanthomonas campestris (pv. campestris)
      72        325  Halobacterium sp. (strain NRC-1)
      73        320  Staphylococcus aureus (strain N315)
      74        320  Campylobacter jejuni
      75        317  Staphylococcus aureus (strain Mu50 / ATCC 700699)
      76        316  Dictyostelium discoideum (Slime mold)
      77        311  Clostridium perfringens
      78        299  Sulfolobus solfataricus
      79        297  Staphylococcus aureus (strain MW2)
      80        290  Corynebacterium glutamicum (Brevibacterium flavum)
      81        288  Pisum sativum (Garden pea)
      82        287  Xanthomonas axonopodis (pv. citri)
      83        285  Streptococcus pyogenes
      84        283  Aeropyrum pernix
      85        278  Pyrococcus furiosus
      86        278  Staphylococcus aureus
      87        269  Brucella melitensis
      88        268  Bacteriophage T4
      89        266  Neurospora crassa
      90        265  Triticum aestivum (Wheat)
      91        264  Candida albicans (Yeast)
      92        261  Rickettsia conorii
      93        258  Hordeum vulgare (Barley)
      94        254  Vaccinia virus (strain Copenhagen)
      95        251  Glycine max (Soybean)
      96        250  Lycopersicon esculentum (Tomato)
      97        248  Rhodobacter capsulatus (Rhodopseudomonas capsulata)
      98        247  Thermoanaerobacter tengcongensis
      99        246  Solanum tuberosum (Potato)
     100        244  Pseudomonas putida
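
The share of the first twenty species quoted in A.2 can be recomputed from this table. The Python sketch below sums the first twenty frequencies and divides by the 122'564 entries of the release (the figure given in the Introduction); the function name is an assumption.

```python
# Frequencies of the twenty most represented species, copied from table A.2.2.
TOP_20 = [
    9172, 6169, 4892, 4832, 3442, 2402, 2291, 2116, 1952, 1773,
    1764, 1529, 1485, 1389, 1371, 1240, 1062, 942, 919, 914,
]
TOTAL_ENTRIES = 122564  # number of entries in release 41.0


def share_of_database(counts, total_entries):
    """Return (number of sequences, percentage of the database to one decimal)."""
    sequences = sum(counts)
    return sequences, round(100 * sequences / total_entries, 1)


print(share_of_database(TOP_20, TOTAL_ENTRIES))
```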

: A.2.3 Taxonomic distribution of the sequences

   Kingdom       Sequences (% of the database)
    Archaea            7119 (  6%)
    Bacteria          46344 ( 38%)
    Eukaryota         60623 ( 49%)
    Viruses            8478 (  7%)

   Within Eukaryota:

    Category            sequences (% of Eukaryota) (% of the complete database)
     Human                   9172 ( 15%)           (  7%)
     Other Mammalia         16041 ( 26%)           ( 13%)
     Other Vertebrata        5806 ( 10%)           (  5%)
     Viridiplantae           9581 ( 16%)           (  8%)
     Fungi                   9337 ( 15%)           (  8%)
     Insecta                 3352 (  6%)           (  3%)
     Nematoda                2504 (  4%)           (  2%)
     Other                   4830 (  8%)           (  4%)
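
The percentages in these two tables are the sequence counts divided by the relevant total and rounded to the nearest integer. The Python sketch below reproduces them from the counts above; the dictionary and function names are chosen for illustration.

```python
# Counts copied from table A.2.3.
KINGDOMS = {"Archaea": 7119, "Bacteria": 46344, "Eukaryota": 60623, "Viruses": 8478}
EUKARYOTA = {
    "Human": 9172, "Other Mammalia": 16041, "Other Vertebrata": 5806,
    "Viridiplantae": 9581, "Fungi": 9337, "Insecta": 3352,
    "Nematoda": 2504, "Other": 4830,
}


def percentages(counts, total=None):
    """Return each count as a whole-number percentage of total.

    If total is omitted, the sum of the counts is used.
    """
    if total is None:
        total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


database_total = sum(KINGDOMS.values())
print(percentages(KINGDOMS))
print(percentages(EUKARYOTA))                  # % of Eukaryota
print(percentages(EUKARYOTA, database_total))  # % of the complete database
```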

: A.3 Sequence size

: A.3.1 Distribution of the sequences by size (excluding fragments)

               From   To  Number             From   To   Number
                  1-  50    2283             1001-1100     1127
                 51- 100    8420             1101-1200      796
                101- 150   12542             1201-1300      550
                151- 200   11267             1301-1400      379
                201- 250   11387             1401-1500      305
                251- 300   10019             1501-1600      213
                301- 350   10039             1601-1700      166
                351- 400    9804             1701-1800      118
                401- 450    7435             1801-1900      128
                451- 500    6547             1901-2000      106
                501- 550    5067             2001-2100       59
                551- 600    3400             2101-2200       96
                601- 650    2753             2201-2300       99
                651- 700    2015             2301-2400       57
                701- 750    1766             2401-2500       56
                751- 800    1474             >2500          326
                801- 850    1101
                851- 900    1142
                901- 950     817
                951-1000     704

: A.3.2 Longest and shortest sequences

   The shortest sequence is  GRWM_HUMAN (P24272) :     3 amino acids.
   The longest sequence is   NEBU_HUMAN (P20929) :  6669 amino acids.

: A.4 Journal citations

Note: the following citation statistics reflect the number of distinct journal citations.

Total number of journals cited in this release of Swiss-Prot: 1'316

: A.4.1 Table of the frequency of journal citations

        Journals cited 1x:  496
                       2x:  167
                       3x:   84
                       4x:   61
                       5x:   46
                       6x:   47
                       7x:   26
                       8x:   25
                       9x:   22
                      10x:   11
                  11- 20x:   98
                  21- 50x:   98
                  51-100x:   39
                    >100x:   96

: A.4.2 List of the most cited journals in Swiss-Prot

   Nb    Citations   Journal name
   --    ---------   -------------------------------------------------------------
    1         9138   Journal of Biological Chemistry
    2         5013   Proceedings of the National Academy of Sciences of the U.S.A.
    3         3631   Nucleic Acids Research
    4         3612   Journal of Bacteriology
    5         3381   Gene
    6         2663   FEBS Letters
    7         2598   Biochemical and Biophysical Research Communications
    8         2429   European Journal of Biochemistry
    9         2383   Biochemistry
   10         2171   The EMBO Journal
   11         2045   Nature
   12         2024   Biochimica et Biophysica Acta
   13         1821   Journal of Molecular Biology
   14         1752   Genomics
   15         1579   Cell
   16         1542   Molecular and Cellular Biology
   17         1243   Biochemical Journal
   18         1146   Science
   19         1123   Plant Molecular Biology
   20         1117   Molecular and General Genetics
   21         1068   Molecular Microbiology
   22          855   Journal of Biochemistry
   23          830   Virology
   24          748   Human Molecular Genetics
   25          693   Journal of Cell Biology
   26          645   Nature Genetics
   27          597   Journal of Virology
   28          588   Plant Physiology
   29          582   Human Mutation
   30          579   Genes and Development
   31          550   Oncogene
   32          538   The American Journal of Human Genetics
   33          530   Infection and Immunity
   34          529   Yeast
   35          516   Journal of Immunology
   36          494   Journal of General Virology
   37          469   Archives of Biochemistry and Biophysics
   38          454   Structure
   39          446   FEMS Microbiology Letters
   40          433   Microbiology
   41          394   Development
   42          379   Human Genetics
   43          376   Current Genetics
   44          376   Nature Structural Biology
   45          347   Genetics
   46          343   Molecular and Biochemical Parasitology
   47          335   Blood
   48          317   Applied and Environmental Microbiology
   49          313   Journal of Clinical Investigation
   50          299   Molecular Endocrinology
   51          283   DNA and Cell Biology
   52          282   Protein Science
   53          281   Journal of Molecular Evolution
   54          276   Developmental Biology
   55          276   Mammalian Genome
   56          271   Biological Chemistry Hoppe-Seyler
   57          251   Cancer Research
   58          248   Journal of Experimental Medicine
   59          246   Neuron
   60          241   Immunogenetics
   61          240   Mechanisms of Development
   62          229   Journal of General Microbiology
   63          228   Endocrinology
   64          221   DNA Sequence
   65          217   Acta Crystallographica, Section D
   66          213   Hoppe-Seyler's Zeitschrift fur Physiologische Chemie
   67          209   Molecular Biology of the Cell
   68          207   The Plant Cell
   69          203   Journal of Cell Science
   70          191   Molecular Biology and Evolution
   71          190   Brain Research. Molecular Brain Research
   72          187   The Plant Journal
   73          183   Journal of Neurochemistry
   74          180   Journal of Neuroscience
   75          160   Comparative Biochemistry and Physiology
   76          158   Cytogenetics and Cell Genetics
   77          156   DNA
   78          154   Bioscience, Biotechnology, and Biochemistry
   79          152   The Journal of Clinical Endocrinology and Metabolism
   80          145   Toxicon
   81          144   Molecular Pharmacology
   82          143   Antimicrobial Agents and Chemotherapy
   83          140   American Journal of Physiology
   84          131   Biochimie
   85          127   Bioorganicheskaia Khimiia
   86          125   Virus Research
   87          125   Proteins
   88          122   DNA Research
   89          121   Molecular Plant-Microbe Interactions
   90          119   Hemoglobin
   91          116   Peptides
   92          114   Agricultural and Biological Chemistry
   93          112   Current Biology
   94          111   Journal of Investigative Dermatology
   95          110   Molecular and Cellular Endocrinology
   96          106   Genome Research

: A.5 Statistics for some line types

The following table summarizes the total number of some Swiss-Prot lines, as well as the number of entries with at least one such line, and the frequency of the lines.

                                   Total    Number of  Average
Line type / subtype                number   entries    per entry
---------------------------------  -------- ---------  ---------

References (RL)                     232571              1.90
   Journal                          195556    111991    1.60
   Submitted to EMBL/GenBank/DDBJ    34500     27873    0.28
   Unpublished observations            536       532   <0.01
   Submitted to Swiss-Prot             464       462   <0.01
   Plant Gene Register                 463       453   <0.01
   Book citation                       460       450   <0.01
   Thesis                              190       188   <0.01
   Submitted to other databases        190       189   <0.01
   Unpublished results                 123       121   <0.01
   Patent                               87        86   <0.01
   Worm Breeder's Gazette                2         2   <0.01

Comments (CC)                       405433              3.31
   SIMILARITY                       117866    103489    0.96
   FUNCTION                          77092     75796    0.63
   SUBCELLULAR LOCATION              55038     55038    0.45
   CATALYTIC ACTIVITY                39528     37138    0.32
   SUBUNIT                           33846     33846    0.28
   PATHWAY                           17449     16966    0.14
   TISSUE SPECIFICITY                13626     13626    0.11
   COFACTOR                          12141     12141    0.10
   MISCELLANEOUS                      7816      7190    0.06
   PTM                                7140      6571    0.06
   ALTERNATIVE PRODUCTS               3946      3946    0.03
   INDUCTION                          3558      3558    0.03
   DOMAIN                             3535      3241    0.03
   DEVELOPMENTAL STAGE                3362      3362    0.03
   CAUTION                            3342      3172    0.03
   DISEASE                            2244      1868    0.02
   ENZYME REGULATION                  1753      1753    0.01
   MASS SPECTROMETRY                   893       810    0.01
   DATABASE                            818       751    0.01
   POLYMORPHISM                        343       334   <0.01
   BIOTECHNOLOGY                        50        50   <0.01
   PHARMACEUTICAL                       47        47   <0.01

Features (FT)                       655938              5.35
   DOMAIN                            95401     28727    0.78
   TRANSMEM                          77067     16988    0.63
   CONFLICT                          47337     16661    0.39
   CARBOHYD                          45507     11138    0.37
   DISULFID                          41846     10872    0.34
   TURN                              39177      2956    0.32
   METAL                             36827     10004    0.30
   STRAND                            36304      2644    0.30
   HELIX                             27742      2845    0.23
   ACT_SITE                          24322     15216    0.20
   CHAIN                             23456     19176    0.19
   VARIANT                           23307      4423    0.19
   REPEAT                            22336      3704    0.18
   NP_BIND                           15500     10893    0.13
   SIGNAL                            14828     14826    0.12
   MOD_RES                           13336      7528    0.11
   NON_TER                           10321      7875    0.08
   BINDING                            8145      6285    0.07
   ZN_FING                            7821      2770    0.06
   VARSPLIC                           6951      3249    0.06
   SITE                               6265      4319    0.05
   INIT_MET                           5574      5545    0.05
   PROPEP                             4686      4026    0.04
   MUTAGEN                            4273      1337    0.03
   DNA_BIND                           4193      3949    0.03
   CA_BIND                            4049      1149    0.03
   LIPID                              2946      2395    0.02
   TRANSIT                            2582      2562    0.02
   PEPTIDE                            2517      1001    0.02
   NON_CONS                            804       411    0.01
   UNSURE                              290       123   <0.01
   SE_CYS                              111        73   <0.01
   THIOETH                              94        32   <0.01
   THIOLEST                             23        23   <0.01

Cross-references (DR)               999237              8.15
   EMBL                             230657    116257    1.88
   InterPro                         195677    104236    1.60
   Pfam                             133012     99557    1.09
   PROSITE                          105218     66696    0.86
   PIR                               47040     35736    0.38
   PRINTS                            39413     34822    0.32
   SMART                             38729     29473    0.32
   HSSP                              38069     38069    0.31
   TIGRFAMs                          31394     29063    0.26
   ProDom                            30120     28820    0.25
   HAMAP                             23868     23778    0.19
   PDB                               11737      3547    0.10
   TIGR                              11065     11020    0.09
   MIM                                8171      7086    0.07
   Genew                              7836      7788    0.06
   MGD                                5820      5805    0.05
   SGD                                4936      4882    0.04
   EcoGene                            4228      4226    0.03
   MEROPS                             3316      3222    0.03
   TRANSFAC                           2464      2214    0.02
   WormPep                            2413      2239    0.02
   SubtiList                          2362      2361    0.02
   FlyBase                            2236      2173    0.02
   GeneDB_SPombe                      2131      2101    0.02
   TubercuList                        1400      1363    0.01
   StyGene                            1196      1193    0.01
   SWISS-2DPAGE                        810       809    0.01
   ListiList                           712       658    0.01
   Leproma                             585       581   <0.01
   Gramene                             411       411   <0.01
   MaizeDB                             405       401   <0.01
   HIV                                 370       354   <0.01
   REBASE                              358       353   <0.01
   ECO2DBASE                           351       299   <0.01
   DictyDb                             319       316   <0.01
   GlycoSuiteDB                        259       259   <0.01
   ZFIN                                225       225   <0.01
   PHCI-2DPAGE                         211       211   <0.01
   MypuList                            131       131   <0.01
   Aarhus/Ghent-2DPAGE                 128        98   <0.01
   Siena-2DPAGE                        104       104   <0.01
   HSC-2DPAGE                           85        85   <0.01
   PhosSite                             53        53   <0.01
   COMPLUYEAST-2DPAGE                   50        50   <0.01
   PMMA-2DPAGE                          47        47   <0.01
   Maize-2DPAGE                         39        39   <0.01
   SagaList                             25        25   <0.01
   ANU-2DPAGE                           15        15   <0.01
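
The last column of the tables above appears to be the first column (total occurrences) divided by the 122'564 entries in release 41.0, rounded to two decimals and shown as <0.01 when it would round to zero. That reading is an assumption inferred from the figures themselves, not a stated definition; the helper name below is hypothetical. A minimal sketch:

```python
# Assumption: last column = total occurrences / number of entries in release 41.0.
TOTAL_ENTRIES = 122_564  # entry count given in the Introduction


def per_entry_average(total: int, entries: int = TOTAL_ENTRIES) -> str:
    """Format the average number of occurrences per entry as the tables do."""
    avg = total / entries
    # Values that would round to 0.00 are reported as "<0.01".
    if avg < 0.005:
        return "<0.01"
    return f"{avg:.2f}"


# Rows taken from the tables above: (label, total occurrences)
for label, total in [("Cross-references (DR)", 999237),
                     ("EMBL", 230657),
                     ("ListiList", 712),
                     ("Leproma", 585)]:
    print(f"{label:<24}{total:>8}  {per_entry_average(total):>6}")
```

Under this assumption the sketch reproduces the tabulated values for the sampled rows, for example 8.15 for all cross-references and 1.88 for EMBL.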

A.6 Miscellaneous statistics

Total number of distinct authors cited in Swiss-Prot: 164'410

Total number of chloroplast-encoded sequences: 3'131
Total number of mitochondrial-encoded sequences: 2'385
Total number of cyanelle-encoded sequences: 145
Total number of plasmid-encoded sequences: 2'624

Number of additional sequences encoded in splice variants: 5'661

APPENDIX B: Relationships between Swiss-Prot and some biomolecular databases

The current status of the relationships (cross-references) between Swiss-Prot and some biomolecular databases is shown in the following schema: