ENCODE Isoforms   PDG   CNB   CSIC

PFAM Domains and Sequence Data
Data for PFAM Domain - Isoform Relationship.
- Domains Isoforms Data

Alternative Splicing and PFAM Domains

There are 994 sequences for which at least one Pfam (or Pfam-B) domain can be found through association by BLAST. According to Pfam, these 994 sequences contain a total of 3110 domains, just over three domains per protein.

How you calculate the percentage of broken domains depends on your point of view. Here we have provided as many calculations as possible.

The 3110 domains are split on 485 occasions, by trimming of the N-terminus, trimming of the C-terminus, deletions or insertions. That is roughly 16% of the domains. Of the 994 sequences with Pfam domains, 42.5% (423 sequences) have at least one broken domain. In 53 sequences two domains are broken, and in three sequences three domains are broken.
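
The four ways of breaking a domain can be told apart from residue coordinates alone. The following is a minimal Python sketch of that idea, not the procedure used to produce the counts on this page. The `Domain` class, the function names and the example coordinates are all hypothetical, and positions are assumed to be 1-based and inclusive on the reference isoform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    start: int  # first residue of the domain (1-based, inclusive)
    end: int    # last residue of the domain (inclusive)


def classify_deletion(domain: Domain, del_start: int, del_end: int) -> str:
    """Describe what removing residues del_start..del_end does to a domain."""
    if del_end < domain.start or del_start > domain.end:
        return "intact"              # deletion does not overlap the domain
    if del_start <= domain.start and del_end >= domain.end:
        return "lost"                # whole domain removed rather than split
    if del_start <= domain.start:
        return "n_terminal_trim"     # start of the domain is missing
    if del_end >= domain.end:
        return "c_terminal_trim"     # end of the domain is missing
    return "internal_deletion"       # a piece from the middle is missing


def classify_insertion(domain: Domain, after_residue: int) -> str:
    """Describe an insertion placed between after_residue and after_residue + 1."""
    if domain.start <= after_residue < domain.end:
        return "insertion"           # new residues land inside the domain
    return "intact"


# Hypothetical domain covering residues 100-250 of a reference isoform.
example = Domain("hypothetical_domain", 100, 250)
results = {
    "n_trim": classify_deletion(example, 80, 120),
    "c_trim": classify_deletion(example, 230, 300),
    "internal": classify_deletion(example, 150, 160),
    "outside": classify_deletion(example, 260, 300),
    "insert": classify_insertion(example, 175),
}
print(results)
```

A deletion that removes a whole domain is reported as "lost" rather than as a split, which keeps domain loss separate from the broken domains counted above.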

However, many of the sequences are fragments of other, larger sequences, and there is only weak evidence for their extension. In the absence of further evidence, these sequences should be treated not as separate, whole sequences, but as if they were identical to the larger sequences. When these sequence fragments are removed, only 304 sequences contain split domains. This is 39.5% of the sequences with Pfam domains, because in this case there are 772 different sequences with Pfam domains (not 994).

When only those isoforms that vary in sequence are considered (identical sequences and loci with just a single representative are ignored), this percentage rises to 46.5% (271 sequences with broken domains out of 584 non-identical sequence variants).

Another way to calculate the percentage of sequences with broken domains is to look only at those sequences that are splice variants of the primary sequence. There are 210 loci with at least one splice variant whose sequence is distinct from the primary sequence. Discounting the primary sequence in each locus, there are a total of 373 non-identical splice variants, of which 261 have broken domains (in 10 of the loci with sequence-distinct splice variants, all the sequences, including the primary sequences, split a domain). So 70% of the sequence-distinct splice variants recorded in the 1% of the human genome we have looked at split at least one Pfam domain.
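
The sequence-level percentages differ only in which sequences are counted in the denominator. As a rough check, the Python sketch below recomputes each ratio directly from the counts quoted above. The dictionary labels and the `percentage` helper are illustrative names, not part of any existing tool. The figures in the text appear to be quoted to roughly the nearest half percent, so the recomputed values can differ slightly in the last digit.

```python
# Counts quoted in the text:
# (sequences with at least one split domain, sequences considered).
viewpoints = {
    "all sequences with Pfam domains": (423, 994),
    "fragments removed": (304, 772),
    "non-identical sequence variants": (271, 584),
    "splice variants, primary sequence excluded": (261, 373),
}


def percentage(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


summary = {
    label: percentage(broken, total)
    for label, (broken, total) in viewpoints.items()
}

for label, pct in summary.items():
    print(f"{label}: {pct}%")
```

Keeping the numerator and denominator together for each viewpoint makes it explicit that the rise from about 40% to 70% comes from narrowing the set of sequences considered, not from finding more broken domains.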

Of course, there are caveats. In 30 of the 304 cases of domain splitting, the insertion or deletion is four residues or fewer. Some of the domains are Pfam-B domains, and so are less well defined, and many of the split domains have no equivalent PDB structure, so it is impossible to know whether it would make sense to split the Pfam-defined domain in these cases.

However, there are examples of domains being split where we do have structural information.

Separate work using comparisons with the nearest PDB structure suggests that at least 50% of these splits will not allow the domain to refold as in the PDB structural template.

GO Functional Terms and Alternative Splicing

Based on the results from GO it is possible to look for interesting cases of alternative splicing. Initial results suggest that alternative splicing may lead to domain recombinations on at least three separate occasions:

AC012630.1-008 - the sequence is identical to alternative splice variant AC012630.1-001 (RHG26_HUMAN), but has a 56-residue chunk missing. The missing chunk (699-754) is an interdomain segment according to the Pfam domain classification. AC012630.1-008 is recognised as an alternative splice variant of RHG26_HUMAN in UniProt (Q9UNA1-2). This pair of sequences is one of the three cases where separate PDB structures exist for two different parts of the sequence - the second (C-terminal end). Interestingly, BLAST version 2.9 could not find the 699 residue N-terminal fragment of RHG26_HUMAN, even though it is identical over all 699 residues. This is a bug in the older (well, last year actually) versions of BLAST.

AC015691.11-001 - this sequence seems to be a composite of two homologous proteins. It is composed of the first four domains of TRIM6_HUMAN (N-terminal end), plus the whole of TRI34_HUMAN. Both sequences align at 100%. The domain of TRIM6_HUMAN that is left out is the SPRY domain (residues 352-487 according to Pfam). The sequence is identical to TRIM6_HUMAN for its first 300 residues (TRIM6_HUMAN domain 4 stops at 299) and there is a 26-residue linker between TRIM6_HUMAN and TRI34_HUMAN. While there are no variants for this sequence, it is related to sequences AC015691.10-001 (TRI34_HUMAN itself), AC015691.12-001 (TRIM5_HUMAN, a homologue of TRIM6_HUMAN) and AC015691.9-001 (TRIM6_HUMAN itself) and its variants.

(Note that this splice form has already been documented in a paper by Orimo et al. (Genomics, 2000)).

AP000295.6-004 and AP000295.6-001 - these two sequences are identical. They have the same first two domains as alternative splice variant AP000295.6-002 (INAR2_HUMAN) and an extra 51 C-terminal residues. The N-terminal 280 residues are identical to AP000295.6-002; the sequence of AP000295.6-002 is cut exactly between the 2nd and 3rd domains at residue 280. The C-terminal mini-domain is similar to two small Pfam-B ALU1_HUMAN domains. Note that this isoform is already recorded in UniProt.
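
Statements such as "the N-terminal 280 residues are identical" or "a 56 residue chunk is missing" can be checked by comparing two isoforms from both ends. The Python sketch below is one simple way to do this, assuming the two isoforms differ in a single contiguous region. The function names and the toy sequences are hypothetical and are not the real ENCODE sequences.

```python
def common_prefix_length(a: str, b: str) -> int:
    """Number of leading residues shared by two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def describe_difference(reference: str, variant: str) -> dict:
    """Locate the single region in which two isoforms differ."""
    prefix = common_prefix_length(reference, variant)
    suffix = common_prefix_length(reference[::-1], variant[::-1])
    # The shared suffix must not overlap the shared prefix in either sequence.
    suffix = min(suffix, min(len(reference), len(variant)) - prefix)
    return {
        "shared_n_terminal": prefix,
        "shared_c_terminal": suffix,
        "reference_only": reference[prefix:len(reference) - suffix],
        "variant_only": variant[prefix:len(variant) - suffix],
    }


# Toy reference sequence (hypothetical, 22 residues).
reference = "MKTAYIAKQRQISFVKSHFSRQ"

# Variant that shares the first 10 residues, then has a different C-terminus.
replaced_tail = describe_difference(reference, reference[:10] + "GGSW")

# Variant with residues 9-12 of the reference deleted.
internal_gap = describe_difference(reference, reference[:8] + reference[12:])

print(replaced_tail)
print(internal_gap)
```

Comparing the reported boundaries with the Pfam domain coordinates then shows whether the difference falls between domains or inside one.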

 

Michael Tress
mtress @cnb.uam.es
Protein Design Group
Centro Nacional de Biotecnología (CNB-CSIC)
Calle Darwin, Campus de la Universidad Autónoma de Madrid, Cantoblanco.
28049 MADRID.
Tel: (+34) 91 585 4676   Fax: (+34) 91 585 4506

Apologies to Osvaldo Graña from whom I totally stole the page design.

Thanks to Petr Sobola for the skater(s).