VARSPLIC utility and splicing isoforms files
--------------------------------------------

NOTE

This program has been substantially revised 25th March 2003. All users are
strongly advised to read this documentation, especially the section
'CHANGES FOR VERSION 2' and subsequent sections.

INTRODUCTION

Many proteins exist in more than one isoform, one cause of which is
alternative (differential) splicing. Splice isoforms may differ
considerably from one another, with potentially less than 50% sequence
similarity between isoforms. In the Swiss-Prot database, one sequence
(usually that of the longest isoform) is displayed for each protein. Known
variations of this sequence are recorded in the feature table (using the
VARSPLIC key), together with the name(s) of the isoform(s) in which each
variant occurs.

The results of database-wide sequence comparisons (such as FASTA or BLAST)
may, in some cases, be dependent on which splice isoform's sequence is used
for each protein in the database. A more informative set of results might
be obtained if such comparisons were run against a database containing all
known isoforms of proteins originating from genes undergoing alternative
splicing.

The program varsplic.pl has been written to generate additional records
from Swiss-Prot and TrEMBL, one for each splice isoform of each protein.
The default output is in FASTA format. There is also an option to generate
statistics on the number of new records generated, and the % sequence
change of each new record from its 'parent' sequence. Additionally, the
user can choose to print new records only for isoforms whose difference
from their 'parent' exceeds a specified threshold.

varsplic.pl is written in PERL. It requires the Swissknife package. Both
are available from

    ftp://ftp.ebi.ac.uk/pub/software/swissprot/

The output files obtained from running the program (specifying the 'fasta'
option, with no threshold set) on Swiss-Prot and TrEMBL are available.
The output files are named as follows:

    varsplic_sprot.fas.Z
    varsplic_trembl.fas.Z

As indicated by their ".Z" extension, these are Unix "compress" format
files which, when decompressed, will produce ASCII files in FASTA format.
New versions of these output files are rebuilt at each weekly release and
are included with the other Swiss-Prot and TrEMBL non-redundant database
files. They are available at:

    ftp://ftp.expasy.ch/databases/sp_tr_nrdb/
    ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/

IMPORTANT WARNING

It is intended that the original Swiss-Prot and TrEMBL entries should
remain the primary source of information about all splice isoforms of each
protein, and that the additional records generated by this program will be
used only in sequence similarity comparisons. For this reason, only output
in FASTA format is being distributed with this release. Please refer to the
original Swiss-Prot records for further details concerning each splice
isoform.

USAGE

For usage of varsplic.pl see the program documentation (in the program, or
type: perldoc varsplic.pl)

NOTES

The 'which' option allows the user to specify one of three alternative
output options:

1) 'full', i.e. a new record is generated for every existing sequence in
   the database, plus one new record for each alternative isoform.

2) 'allforms', i.e. a new record is generated for every existing sequence
   for which alternatively spliced isoforms exist, plus one new record for
   each alternative.

3) default (no option specified): new records are NOT generated for
   isoforms whose sequence is displayed in an existing record. New records
   are only generated for alternative isoforms.
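
The three output modes can be illustrated with a small Python sketch. This
is not code from varsplic.pl: the function name, the use of None for the
default mode, and the (accession, number of alternative isoforms) input
shape are assumptions made purely for illustration, and the accessions are
placeholders.

```python
# Toy illustration of the three 'which' output modes described above.
# Everything here (names, input shape) is hypothetical, not varsplic.pl code.

def records_to_emit(entries, which=None):
    """Return (accession, kind) pairs for the records a mode would produce.

    entries: list of (accession, n_alternative_isoforms) tuples.
    which:   'full', 'allforms', or None for the default behaviour.
    """
    emitted = []
    for accession, n_alternatives in entries:
        has_alternatives = n_alternatives > 0
        # 'full' keeps every displayed sequence; 'allforms' keeps it only
        # when the entry has alternative isoforms; the default never does.
        if which == "full" or (which == "allforms" and has_alternatives):
            emitted.append((accession, "displayed"))
        # Every mode emits one record per alternative isoform.
        for i in range(1, n_alternatives + 1):
            emitted.append((accession, f"alternative {i}"))
    return emitted


# Placeholder entries: ACC_A has two alternative isoforms, ACC_B has none.
toy = [("ACC_A", 2), ("ACC_B", 0)]
full = records_to_emit(toy, "full")
allforms = records_to_emit(toy, "allforms")
default = records_to_emit(toy)
print(len(full), len(allforms), len(default))  # prints: 4 3 2
```

With these placeholder entries, 'full' yields four records, 'allforms'
three, and the default two, since only the last mode omits the displayed
sequence of ACC_A.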

For new records in FASTA format, new IDs have been derived as follows:

1) for original records without alternatively spliced isoforms, no changes
   have been made

2) for original records with alternatively spliced isoforms, the original
   ID has been replaced by the primary accession number, followed by a
   hyphen and the number '00'

3) for new variants, the accession number of the original record, followed
   by a hyphen, followed by a number has been used as an ID. A different
   number has been used for each isoform.

For example, record P05067 has three isoforms. If the user specifies 'full'
or 'allforms', three FASTA records will be generated from this record, with
IDs P05067-00 (the FASTA record of the original sequence), P05067-01 and
P05067-02. Currently, the largest number of isoforms described for any
protein in Swiss-Prot or TrEMBL is 13.

For new records in FASTA format, the AC of the 'parent' record has been
retained as the AC number.

The program can generate statistics, counting the number of variants whose
sequence has been changed from the parental record within a certain range.
The 'statsfile' option allows the user to specify these ranges in a file:
otherwise default values are used.

The percentage change of each newly generated isoform from its parent
isoform is calculated as follows:

    1 - (length of sequence common to both isoforms /
         length of their gapped alignment)

The alignment length is determined as follows. A gap is introduced into the
alignment at each position where variation is recorded between isoforms,
i.e. mismatches are not allowed at any position, and no attempt is made to
align divergent sequences with one another. See diagram below.
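
The calculation can be sketched in Python as follows. The segment
representation (a list of (owner, length) pairs) and the function name are
assumptions made for this illustration and are not taken from varsplic.pl;
the segment lengths are read off the two-isoform diagram that follows.

```python
def fraction_changed(segments):
    """Return 1 - (shared length / gapped alignment length).

    segments: list of (owner, length) pairs, where owner is "shared",
    "iso1" or "iso2" (hypothetical labels used only in this sketch).
    """
    shared = sum(length for owner, length in segments if owner == "shared")
    # Each variable region is aligned against a gap, so every segment adds
    # its full length to the alignment, whichever isoform it belongs to.
    alignment = sum(length for _, length in segments)
    return 1 - shared / alignment


# Segment lengths taken from the diagram: 14 shared residues in total,
# 11 unique to isoform 1 and 16 unique to isoform 2.
example = [
    ("shared", 5), ("iso1", 6), ("iso2", 12),
    ("shared", 6), ("iso1", 5), ("shared", 3), ("iso2", 4),
]
change = fraction_changed(example)
print(round(change, 2))  # prints: 0.66
```

Because mismatches are never aligned, the alignment length is simply the
sum of all segment lengths, which is why no real alignment step is needed
in this sketch.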

Key:

    +  Shared sequence
    *  Sequence unique to isoform 1
    ^  Sequence unique to isoform 2

Isoform 1:  +++++******++++++*****+++
Isoform 2:  +++++^^^^^^^^^^^^+++++++++^^^^

These two isoforms would be aligned as below:

Isoform 1:  +++++******            ++++++*****+++
Isoform 2:  +++++      ^^^^^^^^^^^^++++++     +++^^^^

Here 14 positions are shared and the gapped alignment is 41 positions long,
so a change of 1 - (14/41) = 0.66 (i.e. 66%) would be recorded for
isoform 2, with respect to isoform 1.

CHANGES FOR VERSION 2

In Swiss-Prot release 41.01 (and in the accompanying TrEMBL release), a new
format was introduced for "CC ALTERNATIVE PRODUCTS" lines. The new format
is more structured than the previous format. Associated with these changes
are the introduction of stable identifiers for each named splice isoform in
all entries that describe more than one splice isoform, and the extension
of feature identifiers, previously only used for HUMAN VARIANT features, to
VARSPLIC features in entries from all species.

The effects of these changes on varsplic.pl are as follows:

(i)   the program has been substantially rewritten

(ii)  the new program is now reliant on Swissknife v1.3 or later, which has
      also been rewritten to deal with the new format. This is now
      available from the EBI's ftp site at
      ftp://ftp.ebi.ac.uk/pub/software/swissprot/

(iii) there are some slight changes in the output format of files produced
      by varsplic.pl. In particular, if only splice variants are expanded,
      the program now displays the (stable) isoform identifier associated
      with each isoform.

(iv)  there has been some cleaning of options and available output formats

Full details of how to run the revised program, and the output it produces,
are available as perldoc.

For information, the following text on the new style "CC ALTERNATIVE
PRODUCTS" lines has been reproduced from the release notes associated with
Swiss-Prot release 41.

ALTERNATIVE PRODUCTS

We are gradually restructuring the CC (comment) line topic ALTERNATIVE
PRODUCTS and introducing unique identifiers for each described isoform.
The qualifiers that will be introduced are described in the table below:

Topic            Description

Event            Biological process that results in the production of the
                 alternative forms (Alternative promoter, Alternative
                 splicing, Alternative initiation).
                 Format:  Event=controlled vocabulary;
                 Example: Event=Alternative splicing;

Named isoforms   Number of isoforms listed in the topics 'Name' below the
                 topic 'Event=Alternative splicing'.
                 Format:  Named isoforms=number;
                 Example: Named isoforms=6;

Comment          Any comments concerning one or more isoforms; optional;
                 may be longer than 1 line.
                 Format:  Comment=free text;
                 Example: Comment=Experimental confirmation may be lacking
                          for some isoforms;

Name             A common name for an isoform used in the literature or
                 assigned by Swiss-Prot (currently only available for
                 spliced isoforms).
                 Format:  Name=common name;
                 Example: Name=Alpha;

Synonyms         Synonyms for an isoform as used in the literature;
                 optional.
                 Format:  Synonyms=synonym_1[, synonym_n];
                 Example: Synonyms=B, KL5;

IsoId            Unique identifier for an isoform, consisting of the
                 Swiss-Prot accession number, followed by a dash and an
                 identifier for this isoform.
                 Format:  IsoId=acc#-isoform_number[,acc#-isoform_number];
                 Example: IsoId=P05067-1;

Sequence         Lists all FT VARSPLIC identifiers (VSP_#), which are
                 needed to build the sequence for a specific isoform. If
                 the accession number of the IsoId does not correspond to
                 the accession number of the current entry, this topic
                 contains the term 'External'.
                 Format:  Sequence=VSP_#[,VSP_#]|Displayed|External|Not
                          described;
                 Example: Sequence=Displayed;
                 Example: Sequence=VSP_000013, VSP_000014;

Note             Notes concerning current isoform; optional.
                 Format:  Note=free text;
                 Example: Note=Predicted;

In the case of 'Alternative initiation' the topic 'Event' can be followed
by a 'Comment' of free text.

Format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative initiation;
CC         Comment=Optional free text with information on alternative
CC         initiation or the products retrieved from this event.
CC         In the case of alternative initiation there will be no other
CC         topics;

In the case of 'Alternative splicing' the topic 'Event' can be followed by
a 'Comment' of free text and a listing of all described isoforms.

Format:

CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing;
CC         Comment=Optional free text with information on alternative
CC         splicing or the products retrieved from this event;
CC       Name=isoform_1; Synonyms=synonym_1[, synonym_n];
CC         IsoId=isoform_identifier_1[, isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Optional note concerning isoform_1;
CC       Name=isoform_n; Synonyms=synonym_1[, synonym_n];
CC         IsoId=isoform_identifier_1[, isoform_identifier_n];
CC         Sequence=VSP_identifier_1 [, VSP_identifier_n];
CC         Note=Optional note concerning isoform_n;

Example for new format of the CC lines and the corresponding FT lines for
an entry with alternative splicing:

...
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=9;
CC         Comment=Additional isoforms seem to exist. APP695, APP751 and
CC         APP770 are the major isoforms. The L-isoforms are referred to as
CC         appicans. Experimental confirmation may be lacking for some
CC         isoforms;
CC       Name=APP770; Synonyms=Prea4 770;
CC         IsoId=P05067-1; Sequence=Displayed;
CC       Name=APP305;
CC         IsoId=P05067-2; Sequence=VSP_000005, VSP_000006;
CC       Name=L-APP677;
CC         IsoId=P05067-3; Sequence=VSP_000002, VSP_000004, VSP_000009;
CC       Name=APP695; Synonyms=Prea4 695;
CC         IsoId=P05067-4; Sequence=VSP_000002, VSP_000004;
CC       Name=L-APP696;
CC         IsoId=P05067-5; Sequence=VSP_000002, VSP_000003, VSP_000009;
CC       Name=APP714;
CC         IsoId=P05067-6; Sequence=VSP_000002, VSP_000003;
CC       Name=L-APP733;
CC         IsoId=P05067-7; Sequence=VSP_000007, VSP_000008, VSP_000009;
CC       Name=APP751; Synonyms=Prea4 751;
CC         IsoId=P05067-8; Sequence=VSP_000007, VSP_000008;
CC       Name=L-APP752;
CC         IsoId=P05067-9; Sequence=VSP_000009;
...
FT   VARSPLIC    289    289       E -> V (in isoform APP695, isoform
FT                                L-APP696, isoform L-APP677 and isoform
FT                                APP714).
FT                                /FTId=VSP_000002.
FT   VARSPLIC    290    345       Missing (in isoform L-APP696 and isoform
FT                                APP714).
FT                                /FTId=VSP_000003.
FT   VARSPLIC    290    364       Missing (in isoform APP695 and isoform
FT                                L-APP677).
FT                                /FTId=VSP_000004.
FT   VARSPLIC    290    305       VCSEQAETGPCRAMIS -> KWYKEVHSGQARWLML (in
FT                                isoform APP305).
FT                                /FTId=VSP_000005.
FT   VARSPLIC    306    770       Missing (in isoform APP305).
FT                                /FTId=VSP_000006.
FT   VARSPLIC    345    345       M -> I (in isoform L-APP733 and isoform
FT                                APP751).
FT                                /FTId=VSP_000007.
FT   VARSPLIC    346    364       Missing (in isoform L-APP733 and isoform
FT                                APP751).
FT                                /FTId=VSP_000008.
FT   VARSPLIC    637    654       Missing (in isoform L-APP677, isoform
FT                                L-APP696, isoform L-APP733 and isoform
FT                                L-APP752).
FT                                /FTId=VSP_000009.
...

BUG REPORTS

Please report any bugs to pkersey@ebi.ac.uk

REFERENCE

Kersey P., Hermjakob H., Apweiler R.
VARSPLIC: alternatively-spliced protein sequences derived from Swiss-Prot
and TrEMBL.
Bioinformatics 16:1048-1049(2000).

Documentation last updated 25th March 2003
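
APPENDIX: ILLUSTRATIVE SKETCH

As an illustration of how the VARSPLIC features named in a 'Sequence='
topic determine an isoform sequence, the following Python sketch applies a
set of VARSPLIC-style edits to a parent sequence. It is not code from
varsplic.pl or Swissknife: the function name, the (start, end, replacement)
tuple layout and the toy sequence are assumptions made for illustration.
Positions are taken to be 1-based and inclusive, as in the FT lines, a
'Missing' feature is represented by an empty replacement, and features are
assumed not to overlap.

```python
def apply_varsplic(parent, features):
    """Return the isoform sequence obtained by applying features to parent.

    features: iterable of (start, end, replacement) tuples using 1-based,
    inclusive parent coordinates; replacement "" stands for 'Missing'.
    Hypothetical helper for illustration only; assumes no overlaps.
    """
    residues = parent
    # Apply edits from the C-terminal end backwards, so that coordinates of
    # the remaining (earlier) features are not shifted by length changes.
    for start, end, replacement in sorted(features, reverse=True):
        residues = residues[:start - 1] + replacement + residues[end:]
    return residues


# Placeholder ten-letter 'sequence' (not a real protein). The first two
# features mimic the shape of VSP_000002 and VSP_000004: a single-residue
# substitution followed directly by a deletion.
toy_parent = "ABCDEFGHIJ"
toy_features = [(3, 3, "X"), (4, 6, ""), (9, 10, "YZW")]
isoform = apply_varsplic(toy_parent, toy_features)
print(isoform)  # prints: ABXGHYZW
```

Sorting the features in descending order of position is one simple way to
keep parent coordinates valid while the sequence length changes; the real
program may handle this differently.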