ENCODE Homologues

CDS Sequence Data
Data for All Sequences
- Human, Mouse, Zebra Fish, GO Terms and PDB Structures

The Data

The file containing the CDS amino acid sequences (CODING_gencode_annotations_44regions_hg17_CHR_coord_oct2005.gtf_CDS.pep) holds 1097 sequences. Of these, 663 appear to be alternative splice variants, that is, they are near-repeats of another sequence in the file. Of these alternative splice variants, 111 are identical sequences of the same length.
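Counts of this kind could be approximated along the following lines. This is a minimal sketch, not the procedure actually used for the figures above. It assumes that the .pep file is in FASTA format, that the first token of each header is an identifier such as AC012314.14-001, and that the part before the final hyphen names the locus while the suffix numbers the variant. The function names are hypothetical.

```python
from collections import defaultdict


def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {identifier: sequence}."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the identifier is the first whitespace-delimited token.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def locus_of(identifier):
    """Drop the trailing variant suffix, e.g. 'AC012314.14-001' -> 'AC012314.14'."""
    return identifier.rsplit("-", 1)[0]


def summarise_variants(records):
    """Return (sequences at loci with more than one variant,
    sequences that exactly duplicate another variant at the same locus)."""
    by_locus = defaultdict(list)
    for identifier, seq in records.items():
        by_locus[locus_of(identifier)].append(seq)

    in_multi_variant_loci = 0
    exact_duplicates = 0
    for seqs in by_locus.values():
        if len(seqs) > 1:
            in_multi_variant_loci += len(seqs)
            # Every copy beyond the first of a given sequence counts as a duplicate.
            exact_duplicates += len(seqs) - len(set(seqs))
    return in_multi_variant_loci, exact_duplicates
```

An open file handle can be passed straight to read_fasta, since it only iterates over lines. Grouping by identifier prefix is only one way to define a splice variant; grouping by sequence similarity would give different counts.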

BLAST finds human homologues for all but 22 of the CDS sequences.

Of the 1097 CDS sequences, 661 were selected manually and 436 at random.

There are 1003 sequences that have GO terms easily associated with them (through BLAST).

13 sequences have their whole sequence covered by six different PDB structures.

587 sequences find at least one template structure with BLAST.

Searching Against Human Sequence Databases
- Human Homologues

BLAST finds human homologues for all but 22 of the CDS sequences. Eight of these 22 sequences have fewer than 50 residues, and several are identical to human proteins but are simply too short to be recognised by BLAST. Most of these are partial variants.

Of the remaining 1075 CDS sequences, 606 match a human sequence with 100% identity over their entire length. A further 229 sequences match a human sequence with 100% identity over part of their length. Another 173 sequences have at least 97% identity to human sequences, giving 1008 of 1075 that match at 97% identity or better.
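The categories in the paragraph above amount to a simple binning rule, sketched here. The function name and its inputs (percent identity, aligned length and query length taken from the best BLAST hit) are assumptions for illustration; only the 100% and 97% thresholds and the counts come from the text.

```python
def classify_hit(percent_identity, aligned_length, query_length):
    """Place a best BLAST hit into one of the categories used in the text."""
    if percent_identity == 100.0 and aligned_length >= query_length:
        return "100% over entire length"
    if percent_identity == 100.0:
        return "100% over part of length"
    if percent_identity >= 97.0:
        return "at least 97%"
    return "below 97%"


# Counts quoted in the text, keyed by category.
counts = {
    "100% over entire length": 606,
    "100% over part of length": 229,
    "at least 97%": 173,
}
total_at_least_97 = sum(counts.values())  # 606 + 229 + 173 = 1008
with_human_hit = 1097 - 22                # 1075 sequences with a human homologue
```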

Mouse/Zebra Fish Homologues

BLAST was not able to find a mouse or zebra fish homologue for 16 of the 563 sequences that were covered entirely by a human protein. Five of the 191 sequences that match 100% with a human sequence over part of their length do not have a mouse or zebra fish BLAST homologue either.

The sequences with 100% identity to a human sequence, but without a mouse/zebra fish homologue or GO term, are:

AC012314.1-003, AC021607.2-001, AC021607.2-002, AC129929.2-001, AC116366.4-002,
AC129929.3-001, AF277315.18-001, RP1-127L4.6-001, AF121781.16-017 (31 residues),
RP11-398K22.8-001, RP1-149A16.4-001, RP1-149A16.4-002, RP11-90M5.1-001, 
RP3-477O4.9-001, RP3-477O4.9-002, RP3-477O4.9-003, AP003774.4-001,
Z84723.1-001

These 5 have no mouse/zebra fish homologue but do have GO terms:

AC000123.1-007, LL22NC03-44A4.2-002, RP11-517O1.1-007 (40 residues), U52112.11-001, U52112.11-002 (33 residues)

26 mouse homologues are 100% identical to the human sequences, though they are not always of the same length. Here is a list of the 26 with their GO-defined function in brackets:

AC000357.2-001 *** (ADP-ribosylation factor 5)
AC003987.1-009 AC003987.1-016 *** (FAM4A1 splice variant a)
AC004080.4-005 *** (HOXA-9B)
AC008440.7-001 *** (Voltage-dependent calcium channel gamma-7 subunit)
AC012314.14-001 AC012314.14-003 AC012314.14-010 *** (40S ribosomal protein S9)
AC018512.5-001 AC018512.5-002 AC018512.5-005 *** (Splice Isoform Short of Small EDRK-rich factor 1a)
AP001462.1-005 AP001462.1-006 *** (Splice Isoform 3 of Menin)
LA16c-OS12.1-001 LA16c-OS12.1-002 *** (Splice Isoform 2 of Putative RNA-binding protein Luc7-like 1)
LL22NC03-113A11.1-012 *** (Splice Isoform 5 of DEP domain containing protein 5)
RP11-353C18.54-001 RP11-353C18.54-002 RP11-353C18.54-003 RP11-353C18.54-004 *** (Protein C20orf52)
RP11-505P4.2-015 *** (Eukaryotic translation elongation factor 1 alpha-like 3)
RP11-517O1.1-011 RP11-517O1.1-013 RP11-517O1.1-014 *** (Cohesin subunit SA-2)
RP3-429G5.4-001 *** (Sorting nexin 3)
RP4-696P19.3-011 *** (Splice Isoform 2 of Transcription factor EB)

The mean % identity between human and the nearest mouse homologues for the 563 complete sequences is 82%, that is, the human sequences differ from their mouse homologues by 18% on average. Mouse homologues cover an average of 95% of the human sequences.

Just as in mouse, the sequence AC012314.14-001 and its variants are highly conserved in zebra fish. BLAST finds homologues with 95% sequence identity to the human sequence (40S ribosomal protein S9). The mean % identity between human and the nearest zebra fish homologues for the 563 complete sequences is 53%. Zebra fish homologues cover an average of 82% of the human sequences.
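The mean identity and coverage figures quoted above for mouse and zebra fish could be derived from per-sequence best hits roughly as follows. This is a sketch under assumptions: the function name is hypothetical, each hit is taken to be a tuple of (percent identity, number of human residues aligned, length of the human sequence), and coverage is taken to mean aligned residues as a percentage of the human sequence length, which may not be the exact definition used here.

```python
from statistics import mean


def mean_identity_and_coverage(hits):
    """hits: iterable of (percent_identity, aligned_query_residues, query_length).

    Returns (mean percent identity, mean percent coverage of the query).
    """
    hits = list(hits)
    if not hits:
        raise ValueError("no hits supplied")
    identity = mean(h[0] for h in hits)
    # Assumption: coverage is the aligned fraction of the query, as a percentage.
    coverage = mean(100.0 * h[1] / h[2] for h in hits)
    return identity, coverage
```

Averaging per-sequence percentages weights every sequence equally regardless of length; pooling residues across sequences before dividing would weight long sequences more heavily and give a different figure.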

 

Michael Tress
mtress @cnb.uam.es
Protein Design Group
Centro Nacional de Biotecnología (CNB-CSIC)
Calle Darwin, Campus de la Universidad Autónoma de Madrid, Cantoblanco.
28049 MADRID.
Tel: (+34) 91 585 4676   Fax: (+34) 91 585 4506

Apologies to Osvaldo Graña from whom I totally stole the page design.

Thanks to Petr Sobola for the skater(s).