
Manual vs. Random Selection

Of the 44 regions included in the ENCODE experiment, 30 were selected randomly from groups pre-classified by non-exonic conservation and gene density. Although there were more randomly selected regions than manually selected regions, the two sets were of approximately the same total size. All the randomly selected regions were 500 kb long, while the manually selected regions varied in size, with some as large as 2 Mb.

Of the 1097 CDS sequences, 661 came from manually selected regions and 436 from "randomly" selected regions. So while it is true that the "randomly" chosen regions contained fewer CDS than those chosen for their biological interest, no extrapolations can be made from this comparison: the "random" regions were drawn from groups pre-classified by conservation and gene density, not from the genome as a whole.
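The split quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts given in the text (44 regions, 30 of them random, 661 and 436 CDS sequences); the names `share` and `summarize` are illustrative, and the per-region averages are a rough descriptive figure, not a statistic reported by the original analysis.

```python
# Counts taken from the text; nothing here is measured independently.
TOTAL_REGIONS = 44
RANDOM_REGIONS = 30
MANUAL_REGIONS = TOTAL_REGIONS - RANDOM_REGIONS  # 14 manually selected regions

CDS_MANUAL = 661
CDS_RANDOM = 436
CDS_TOTAL = CDS_MANUAL + CDS_RANDOM  # should equal the stated total of 1097


def share(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


def summarize() -> dict:
    """Percentage of CDS per set and mean CDS per region (illustrative only)."""
    return {
        "manual_share_pct": round(share(CDS_MANUAL, CDS_TOTAL), 1),
        "random_share_pct": round(share(CDS_RANDOM, CDS_TOTAL), 1),
        "manual_cds_per_region": round(CDS_MANUAL / MANUAL_REGIONS, 1),
        "random_cds_per_region": round(CDS_RANDOM / RANDOM_REGIONS, 1),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Because the random regions were sampled from pre-classified groups, these figures describe the two sets only and should not be read as genome-wide estimates.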

Regions ENr112 (chromosome 2), ENr311 (chromosome 14) and ENr313 (chromosome 16) do not have any CDS sequences, while regions ENr113, ENr114, ENr211, ENr213 and ENr312 have just one sequence plus alternative splicing variants.

In contrast, none of the manually chosen regions has fewer than 2 sequences plus splice isoforms. Region ENm012 has the fewest (just 4 sequences), while ENm006 has 118 sequences.

GO Functional Terms

"Protein binding" is the most frequent term in the randomly chosen sequences, turning up 82 times. Second is "nucleus", which turns up 71 times. "Membrane-associated" is the third most frequent term (59 mentions), "regulation of transcription, DNA-dependent" is fourth with 44 mentions and "membrane" is fifth with 39. "Nucleotide-binding" appears 32 times, and three terms have 25 mentions each: "ATP-binding", "Ca-binding" and "transporter activity".

In comparison with the manually chosen sequences, there are proportionally more "nucleus" sequences and fewer membrane-related sequences. The terms "signal transduction" (14 mentions), "receptor activity" (9 mentions) and "metal ion binding" (19 mentions) are also much less frequent in the random sample.
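A proportional comparison between two sets of different size needs the raw mention counts to be normalised first. As a minimal sketch, the counts for the randomly chosen sequences can be expressed as mentions per 100 sequences, assuming the 436 CDS sequences from the random regions are the right denominator. The function names are illustrative. The counts for the manually chosen set are not listed on this page, so only the random set is shown.

```python
# Mention counts for the randomly chosen sequences, copied from the text.
RANDOM_TERM_COUNTS = {
    "protein binding": 82,
    "nucleus": 71,
    "membrane-associated": 59,
    "regulation of transcription, DNA-dependent": 44,
    "membrane": 39,
    "nucleotide-binding": 32,
    "ATP-binding": 25,
    "Ca-binding": 25,
    "transporter activity": 25,
}

# Assumption: the denominator is the 436 CDS sequences from the random regions.
N_RANDOM_SEQUENCES = 436


def mentions_per_100(count: int, n_sequences: int) -> float:
    """Scale a raw mention count to mentions per 100 sequences."""
    if n_sequences <= 0:
        raise ValueError("n_sequences must be positive")
    return 100.0 * count / n_sequences


def normalized(term_counts: dict, n_sequences: int) -> dict:
    """Return each term's count as mentions per 100 sequences, rounded to 0.1."""
    return {
        term: round(mentions_per_100(count, n_sequences), 1)
        for term, count in term_counts.items()
    }


if __name__ == "__main__":
    for term, rate in normalized(RANDOM_TERM_COUNTS, N_RANDOM_SEQUENCES).items():
        print(f"{term}: {rate} per 100 sequences")
```

Passing the corresponding counts for the manually chosen set, with 661 as the denominator, would give figures that are directly comparable.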

It is also noticeable that the sequences from the manually selected regions have considerably more GO terms per sequence. The mean number of GO terms per sequence is 5.51 for the manually chosen regions, but just 4.63 for the randomly chosen regions. In addition, 11.7% of the sequences from the randomly chosen regions did not have GO terms at all, compared to just 6.5% of the manually chosen sequences.
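One question these figures raise is whether the lower mean in the random set is explained entirely by its larger share of unannotated sequences. The sketch below is a rough check, not part of the original analysis. It assumes that the reported means were computed over all sequences, including those with no GO terms; the page does not state this. Under that assumption, dividing each mean by the annotated fraction gives the mean among annotated sequences only.

```python
# Figures copied from the text.
SETS = {
    "manual": {"n_sequences": 661, "mean_terms": 5.51, "pct_without_terms": 6.5},
    "random": {"n_sequences": 436, "mean_terms": 4.63, "pct_without_terms": 11.7},
}


def unannotated_count(n_sequences: int, pct_without_terms: float) -> int:
    """Approximate number of sequences with no GO terms."""
    return round(n_sequences * pct_without_terms / 100.0)


def mean_among_annotated(mean_terms: float, pct_without_terms: float) -> float:
    """Mean GO terms per annotated sequence.

    ASSUMPTION: `mean_terms` was averaged over all sequences, so sequences
    with zero terms are counted in the denominator.
    """
    annotated_fraction = 1.0 - pct_without_terms / 100.0
    return mean_terms / annotated_fraction


if __name__ == "__main__":
    for name, s in SETS.items():
        missing = unannotated_count(s["n_sequences"], s["pct_without_terms"])
        adjusted = mean_among_annotated(s["mean_terms"], s["pct_without_terms"])
        print(f"{name}: ~{missing} unannotated, "
              f"{adjusted:.2f} terms per annotated sequence")
```

If the assumption holds, the manually chosen sequences still carry more terms per annotated sequence (about 5.9 against about 5.2), so the difference is not only due to sequences that lack annotation.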

 

Michael Tress
Protein Design Group, Centro Nacional de Biotecnología (CNB-CSIC)
Calle Darwin, Cantoblanco, 28049 MADRID.
Tel: (+34) 91 585 4676   Fax: (+34) 91 585 4506

Thanks to Petr Sobola for the skater.
