|
|
| Isoforms, Loci and Clones |
|
|
There are 1097 sequences from 434 separate loci in the GENCODE set that was frozen in October 2005. The 434 loci come from 183 separate clones.
In each identifier in the set, the locus number lies between the full stop and the dash that follows it, while the isoform number comes after that dash.
663 of the sequences are alternative splice variants; that is, they come from the same locus as one of the 434 primary sequences.
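The naming convention above can be sketched as a small parser. This is a minimal illustration, not the project's own code: the function name `parse_identifier` and the three-digit isoform suffix in the example (`-001`) are assumptions, and the pattern relies only on the rule stated above (locus number after the full stop, isoform number after the following dash).

```python
import re

# Assumed identifier layout: <clone>.<locus number>-<isoform number>.
# Clone names may themselves contain dashes (e.g. RP1-309K20), so the
# pattern anchors on the full stop and the dash that follows the locus number.
ID_PATTERN = re.compile(r"^(?P<clone>.+)\.(?P<locus>\d+)-(?P<isoform>\d+)$")


def parse_identifier(identifier):
    """Split an identifier into (clone, locus name, isoform number)."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    clone = match.group("clone")
    locus = f"{clone}.{match.group('locus')}"
    return clone, locus, match.group("isoform")


# Hypothetical isoform suffix "001" attached to a locus named in the text.
print(parse_identifier("RP1-309K20.2-001"))
```

Keeping the isoform number as a string preserves any leading zeros in the original identifier.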
The number of isoforms per locus varies considerably. There are 182 loci that have only one isoform, while there is one
locus that has 17 different splice variants (RP1-309K20.2). See the diagram below:

Thirty-three clones have more than one locus, with the most populous clone being AC104389 with 8 separate loci.
From this we have also calculated the frequency of isoforms per clone. The clone with the highest number of isoforms is AC012314 with 39,
while AP001462 has 32 and RP1-309K20 has 27. At the other end there are thirty clones that only have one isoform.
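Per-locus and per-clone tallies of this kind can be sketched with `collections.Counter`. The identifiers below are made up for illustration (`CLONEA`, `CLONEB` are not real clones), and the function name `tally` is an assumption; the splitting follows the identifier convention described earlier.

```python
from collections import Counter


def tally(identifiers):
    """Count isoforms per locus, isoforms per clone and loci per clone."""
    isoforms_per_locus = Counter()
    isoforms_per_clone = Counter()
    loci_by_clone = {}
    for identifier in identifiers:
        # Text before the last dash is the locus; text before the
        # last full stop of the locus is the clone.
        locus, _, _isoform = identifier.rpartition("-")
        clone, _, _number = locus.rpartition(".")
        isoforms_per_locus[locus] += 1
        isoforms_per_clone[clone] += 1
        loci_by_clone.setdefault(clone, set()).add(locus)
    loci_per_clone = Counter(
        {clone: len(loci) for clone, loci in loci_by_clone.items()}
    )
    return isoforms_per_locus, isoforms_per_clone, loci_per_clone


# Hypothetical identifiers, for illustration only.
example_ids = ["CLONEA.1-001", "CLONEA.1-002", "CLONEA.2-001", "CLONEB.1-001"]
per_locus, per_clone, loci_per_clone = tally(example_ids)
print(per_clone.most_common(1))  # clone with the most isoforms
```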
Many of the variants are partial fragments; 345 isoforms have N- or C-termini that cannot be determined
with certainty. If a partial variant is a fragment of a larger variant
from the same locus, it has been treated as if it were sequence-identical to the larger variant.
When the partial fragments are treated in this way, the number of sequences in the 434 loci drops to just 860. Of the loci, 216 have more than one variant (so 218 effectively have just one CDS). The highest number of isoforms per locus is now 9 (AC003987.1).
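The treatment of partial fragments described above could look roughly like the sketch below. It assumes that "a fragment of a larger variant" means the partial sequence occurs as a contiguous substring of a longer sequence from the same locus; the text does not spell out the exact criterion, so this is one possible reading. The function name and the toy sequences are hypothetical.

```python
def collapse_locus(variants):
    """Drop partial variants that are fragments of a larger variant.

    variants: list of (sequence, is_partial) tuples from a single locus.
    Assumption: a partial variant counts as a fragment when it is a
    contiguous substring of a different, longer sequence in the same locus.
    """
    kept = []
    for sequence, is_partial in variants:
        is_fragment = is_partial and any(
            sequence != other and sequence in other for other, _ in variants
        )
        if not is_fragment:
            kept.append(sequence)
    return kept


# Toy sequences, for illustration only.
locus_variants = [
    ("MKTAYIAKQR", False),  # complete variant
    ("TAYIA", True),        # partial, contained in the first variant
    ("MKTAYL", True),       # partial, but not contained in any other
]
print(collapse_locus(locus_variants))
```

Under this reading a partial variant that diverges from every larger variant is still counted separately.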
|
|
111 of the 1097 sequences duplicate another sequence in the set, matching it in both amino acid sequence and length. Virtually all of these identical sequences are alternative splice variants in the same locus, though there are some exceptions.
There are 5 sets of four identical sequences, 23 sets of three identical proteins and 50 sets of two identical sequences.
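As a quick arithmetic check, the set counts above can be related to the figure of 111. The sketch below assumes that one member of each set is taken as the reference and the remaining members are counted as duplicates; that reading is an inference from the numbers, not something the text states.

```python
# Number of sets of identical sequences, keyed by set size (from the text).
sets_by_size = {4: 5, 3: 23, 2: 50}

# Every sequence that belongs to some set of identical sequences.
sequences_in_sets = sum(size * count for size, count in sets_by_size.items())

# Duplicates left over once one reference per set is set aside (assumed reading).
redundant_copies = sum((size - 1) * count for size, count in sets_by_size.items())

print(sequences_in_sets)  # 189
print(redundant_copies)   # 111
```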
Four of the five isoforms from the locus XX-FW83563B9.2 are identical, while the locus AP006216.8 has four isoforms and all four are identical sequences.
This is not an isolated case: in exactly one in 18 of all loci with multiple isoforms, all the isoforms are identical sequences.
Besides AP006216.8, thirteen other loci (RP3-477O4.6, AP000302.59, AP000287.4, AC132217.5, AC015691.10, AC012314.5, AC012314.4, AC012314.14, AC009955.4,
AC008937.1, AC006037.1, AC004500.2, and AC004079.3) have two or three isoforms and in each case the isoforms are identical.
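The "one in 18" ratio can be reproduced from figures given elsewhere in this section, assuming that "loci with multiple isoforms" means the 434 loci minus the 182 single-isoform loci. The variable names are illustrative.

```python
from fractions import Fraction

total_loci = 434
single_isoform_loci = 182
multi_isoform_loci = total_loci - single_isoform_loci  # 252

# AP006216.8 plus the thirteen other loci listed above.
all_identical_loci = 1 + 13

ratio = Fraction(all_identical_loci, multi_isoform_loci)
print(ratio)  # 1/18
```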
The sequence coding for TEX28_HUMAN turns up four times: twice in the same locus (AC092402.6), once in another locus from
the same clone (AC092402.4) and a fourth time in a completely different clone (Z68193.2). There are five other cases where
an identical protein turns up in a different locus, though in each of these five cases (Z84721, AC011330, AF277315 twice and AC092402 again)
both sequences come from the same clone.
|
|
Of the 1097 CDS sequences, 661 were from regions selected manually and 436 were from regions selected at random.
BLAST finds human homologues for all but 22 of the CDS sequences.
There are 1003 sequences to which GO terms can easily be assigned (through BLAST).
13 sequences have their whole sequence covered by a PDB structure, though 7 of them are identical sequences. One locus
has isoforms covered by two different PDB structures.
BLAST finds at least one template structure for 587 sequences.
|
|