ENCODE GO TERMS   PDG   CNB   CSIC

GO Data
Data for All Sequences
- Go Terms
- Go Term Frequency

GO Functional Terms

There are 1003 sequences that have GO terms easily associated with them (by BLAST). In 71 cases no sequence found by BLAST has a GO annotation. The GO term does not necessarily belong to the sequence with the best BLAST score: in several cases the best protein found by BLAST (often with 100% identity) is a hypothetical protein that does not have a GO term, and in several others the best protein found by BLAST simply is not annotated in GO.
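The fallback described above (use the best-scoring hit that actually carries a GO annotation) can be sketched as follows. This is only an illustration: the exact procedure used for this page is not stated, and the hit identifiers, scores, and the `first_annotated_hit` helper are hypothetical.

```python
def first_annotated_hit(hits, go_annotations):
    """Return (hit_id, go_terms) for the best-scoring hit that has GO terms.

    hits: list of (hit_id, blast_score) tuples (hypothetical input format).
    go_annotations: dict mapping hit_id to a list of GO IDs.
    Returns None when no hit has any GO annotation.
    """
    # Rank hits from highest to lowest BLAST score.
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    for hit_id, _score in ranked:
        terms = go_annotations.get(hit_id)
        if terms:  # skips hits that are unannotated or have an empty list
            return hit_id, terms
    return None


# Made-up example: the top hit is a hypothetical protein without GO terms.
hits = [("HYPO_1", 950.0), ("PROT_A", 930.0), ("PROT_B", 400.0)]
go_annotations = {"PROT_A": ["GO:0005515"], "PROT_B": ["GO:0016021"]}
result = first_annotated_hit(hits, go_annotations)
print(result)  # ('PROT_A', ['GO:0005515'])
```

Here the annotation is taken from the second-best hit because the best one has no GO terms, which mirrors the hypothetical-protein cases mentioned above.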

GO terms were totalled for the 1003 sequences that contained at least one term. One GO ID (GO:0005515, protein binding) turned up 203 times, while GO:0016021 (membrane associated) was present in 193 sequences and GO:0004872 (receptor activity) in 117.
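A tally of this kind can be sketched with a simple counter. The sketch assumes each GO ID is counted at most once per sequence (the page does not say whether repeated terms were collapsed), and the sequence names and the `tally_go_terms` helper are made up for illustration.

```python
from collections import Counter


def tally_go_terms(seq_to_terms):
    """Count in how many sequences each GO ID appears.

    seq_to_terms: dict mapping a sequence name to its list of GO IDs.
    A GO ID listed twice for the same sequence is counted once (assumption).
    """
    counts = Counter()
    for terms in seq_to_terms.values():
        counts.update(set(terms))
    return counts


# Toy input, not the real ENCODE data.
seq_to_terms = {
    "seq1": ["GO:0005515", "GO:0016021"],
    "seq2": ["GO:0005515", "GO:0004872", "GO:0005515"],
    "seq3": ["GO:0016021"],
}
counts = tally_go_terms(seq_to_terms)
for go_id, n in counts.most_common():
    print(go_id, n)
```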

Nucleus (128) and nucleotide binding proteins (79) are also well represented. Another membrane classification (GO:0016020) is represented 80 times, just fewer than the GO terms GO:0006355 (regulation of transcription, DNA-dependent, 82 times), GO:0046872 (metal ion binding, 82 times) and GO:0007165 (signal transduction, 81 times).

GO Functional Terms in Manual and Randomly Selected Regions

Of the 1097 CDS sequences, 661 were from regions selected manually and 436 from "randomly" selected regions. Thirty of the 44 regions chosen to be included in the ENCODE experiment were selected randomly from groups pre-classified by non-exonic conservation and gene density. "Randomly" chosen regions contained fewer CDS sequences than those chosen for their biological interest, though no inference can be made from this comparison because of the nature of the "random" selection process.

Regions ENr112 (chromosome 2), ENr311 (chromosome 14) and ENr313 (chromosome 16) do not have any CDS sequences, while regions ENr113, ENr114, ENr211, ENr213, and ENr312 have just one sequence plus alternative splice variants.

In contrast, none of the manually chosen regions has fewer than 2 sequences plus splice isoforms. Region ENm012 has the fewest (just 4 sequences), while ENm006 has 118 sequences.

"Protein binding" is also the most frequent term in the randomly chosen sequences, where it turns up 82 times. Second is "nucleus", which turns up 71 times. "Membrane-associated" is the third most frequent term (59), "regulation of transcription, DNA-dependent" is fourth with 44 mentions and "membrane" fifth with 39. "Nucleotide-binding" appears 32 times, and three terms each have 25 mentions: "ATP-binding", "Ca-binding" and "transporter activity".

In comparison with the manually chosen sequences, there are proportionally more "nucleus" sequences and fewer membrane-related sequences. The terms "signal transduction" (14 mentions), "receptor activity" (9 mentions) and "metal ion binding" (19 mentions) are much less frequent in the random sample.
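The proportions behind this comparison can be worked through from the counts quoted on this page. The sketch below rests on two assumptions that are not stated here: that the counts for all sequences are the sum of the manual and random counts (so the manual count is the difference), and that the group sizes of 661 and 436 CDS sequences are reasonable denominators. The `group_percentages` helper is hypothetical.

```python
# Counts over all sequences and over the randomly chosen regions, as quoted above.
TOTAL = {
    "protein binding": 203,
    "membrane-associated": 193,
    "nucleus": 128,
    "receptor activity": 117,
    "metal ion binding": 82,
    "signal transduction": 81,
}
RANDOM = {
    "protein binding": 82,
    "membrane-associated": 59,
    "nucleus": 71,
    "receptor activity": 9,
    "metal ion binding": 19,
    "signal transduction": 14,
}
N_MANUAL, N_RANDOM = 661, 436  # CDS sequences per group


def group_percentages(total, random_counts, n_manual, n_random):
    """Return {term: (percent in manual, percent in random)}.

    Manual counts are derived as total minus random (assumption).
    """
    rows = {}
    for term, all_count in total.items():
        rand = random_counts[term]
        manual = all_count - rand
        rows[term] = (100 * manual / n_manual, 100 * rand / n_random)
    return rows


rows = group_percentages(TOTAL, RANDOM, N_MANUAL, N_RANDOM)
for term, (manual_pct, random_pct) in rows.items():
    print(f"{term:22s} manual {manual_pct:5.1f}%   random {random_pct:5.1f}%")
```

Under these assumptions "nucleus" comes out at roughly 16% of the random sequences against roughly 9% of the manual ones, while "receptor activity" drops from about 16% to about 2%, which agrees with the trend described above.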

The sequences from the manually selected regions also have considerably more GO terms per sequence. The mean number of GO terms per sequence is 5.51 for the manually chosen regions but just 4.63 for the randomly chosen regions. In addition, 11.7% of the randomly chosen sequences did not have GO terms at all, compared to just 6.5% of the sequences from the manually chosen regions.
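As a quick consistency check, the percentages above can be converted back into approximate sequence counts. The percentages are rounded on this page, so the derived counts are estimates rather than the original figures.

```python
# (number of CDS sequences, percent without any GO term), as quoted on this page.
regions = {"manual": (661, 6.5), "random": (436, 11.7)}

# Approximate number of sequences without GO terms in each group.
without = {name: round(n * pct / 100) for name, (n, pct) in regions.items()}

total_sequences = sum(n for n, _pct in regions.values())
with_terms = total_sequences - sum(without.values())

print(without)     # {'manual': 43, 'random': 51}
print(with_terms)  # 1003
```

The result, about 1003 sequences with at least one GO term out of 1097, matches the total given in the first section.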

GO Functional Terms and Alternative Splicing

Based on the results from GO it is possible to look for interesting cases of alternative splicing. Initial results suggest that alternative splicing may lead to domain recombinations on at least three separate occasions:

AC012630.1-008 - the sequence is identical to alternative splice variant AC012630.1-001 (RHG26_HUMAN), but has a 56 residue chunk missing. The missing chunk (699-754) is an interdomain segment according to the Pfam domain classification. AC012630.1-008 is recognised as an alternative splice variant of RHG26_HUMAN in UniProt (Q9UNA1-2). This pair of sequences is one of the three cases where separate PDB structures exist for two different parts of the sequence - the second (C-terminal end). Interestingly, BLAST could not find the 699 residue N-terminal fragment of RHG26_HUMAN, even though it is identical over all 699 residues.
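A missing chunk like this one can be located by comparing a variant with its full-length isoform. The sketch below is not the method used for this page; it assumes the variant differs from the full sequence by exactly one contiguous deletion, and the toy sequences and the `find_deleted_segment` helper are hypothetical.

```python
def find_deleted_segment(full, variant):
    """Return 1-based (start, end) of the segment of `full` missing from `variant`.

    Assumes `variant` equals `full` with one contiguous block removed.
    Returns None if that assumption does not hold.
    """
    if len(variant) >= len(full):
        return None
    # Length of the shared N-terminal prefix.
    prefix = 0
    while prefix < len(variant) and full[prefix] == variant[prefix]:
        prefix += 1
    # Length of the shared C-terminal suffix, not overlapping the prefix.
    suffix = 0
    while (suffix < len(variant) - prefix
           and full[-1 - suffix] == variant[-1 - suffix]):
        suffix += 1
    # Prefix and suffix must account for the whole variant.
    if prefix + suffix != len(variant):
        return None
    return prefix + 1, len(full) - suffix


# Toy example: residues 6-9 of the full sequence are missing in the variant.
full = "MKTAYIAKQRQISFVK"
variant = full[:5] + full[9:]
segment = find_deleted_segment(full, variant)
print(segment)  # (6, 9)
```

If the residues flanking the deletion happen to repeat, the reported coordinates can shift by a few positions while still describing an equivalent deletion. For the case above, a segment reported as (699, 754) would have length 754 - 699 + 1 = 56 residues.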

AC015691.11-001 - this sequence seems to be a composite of two homologous proteins. It is composed of the first 4 domains of TRIM6_HUMAN (N-terminal end), plus the whole of TRI34_HUMAN. Both sequences align with 100% identity. The domain of TRIM6_HUMAN that is left out is the SPRY domain (residues 352-487 according to Pfam). The sequence is identical to TRIM6_HUMAN for its first 300 residues (TRIM6_HUMAN domain 4 stops at 299) and there is a 26 residue linker between TRIM6_HUMAN and TRI34_HUMAN. While there are no variants for this sequence, it is related to sequences AC015691.10-001 (TRI34_HUMAN itself), AC015691.12-001 (TRIM5_HUMAN, a homologue of TRIM6_HUMAN) and AC015691.9-001 (TRIM6_HUMAN itself) and its variants.

(Note that this splice form has already been documented in a paper by Orimo et al., Genomics, 2000.)

AP000295.6-004 and AP000295.6-001 - these two sequences are identical. They have the same first two domains as alternative splice variant AP000295.6-002 (INAR2_HUMAN), and have an extra 51 C-terminal residues. The N-terminal 280 residues are identical to AP000295.6-002; the sequence of AP000295.6-002 is cut perfectly between the 2nd and 3rd domains at residue 280. The C-terminal mini-domain is similar to two small Pfam-B ALU1_HUMAN domains. Note that this isoform is already recorded in UniProt.

 

Michael Tress
mtress @cnb.uam.es
Protein Design Group
Centro Nacional de Biotecnología (CNB-CSIC)
Calle Darwin, Campus de la Universidad Autónoma de Madrid, Cantoblanco.
28049 MADRID.
Tel: (+34) 91 585 4676   Fax: (+34) 91 585 4506

Apologies to Osvaldo Graña from whom I totally stole the page design.

Thanks to Petr Sobola for the skater(s).