Of the 8488 fragments in the six sets, 2099 found at least one homologue with BLAST. Some of the sequences found by the BLAST searches were undoubtedly false positives caused by low-complexity regions of sequence, but most of the finds were sequences that had 100% identity. Note that fragments of fewer than 25 residues were ignored for the sake of speed and because BLAST is unlikely to give a good score to fragments that small.
There were fragments with homologues in all six sets, but practically all the finds were in the "gencode exons". There were only 54 in total from the other 5 regions.
When all six reading frames were taken into account, 2306 of the 50948 translations had homologues. There were 167 fragments that had two or more translations with human homologues. The record was held by fragment chr21_33843808_33849255, for which all six frames were over 1650 residues long and all six translations found homologues. However, only one (the 1815-residue translation that had 99% identity to SON_HUMAN over 1782 residues) is likely to code for a protein. The other five translations all have much lower percentage identities and are aligned in low-complexity regions. The three upstream fragments and nine downstream fragments also have approximately 100% identity to SON_HUMAN.
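The six-frame translation described above can be sketched as follows. This is a minimal illustration in Python, not the pipeline actually used: the function names, the frame numbering (1 to 3 on the forward strand, -1 to -3 on the reverse complement), the decision to keep stop codons as `*`, and the point at which the 25-residue cut-off is applied are all assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Cut-off quoted in the text; applying it per translation is an assumption.
MIN_RESIDUES = 25


def translate(dna):
    """Translate from the first base, dropping any trailing partial codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X marks codons with ambiguous bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return {frame: peptide} for frames 1, 2, 3, -1, -2 and -3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


def searchable_frames(dna, min_residues=MIN_RESIDUES):
    """Keep only the translations long enough to be worth searching."""
    return {
        frame: peptide
        for frame, peptide in six_frame_translations(dna).items()
        if len(peptide) >= min_residues
    }
```

Each fragment yields up to six peptides, which is why the number of translations searched is several times the number of fragments.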
However, there are also cases where more than one of the six frames has a human homologue and the percentage identities to the proteins are high. Fragment chr22_31579718_31581440, for example, finds four human homologues, three of which have approximately 100% identity. Frames 1 and 3 find homologues over the same 55-residue N-terminal stretch, and frames -1 and -3 find homologues for segments lying between residue 130 and residue 280. Frame -1 of the fragment finds homology with protein k222 (a 138-residue protein) between residues 149 and 229, and frame -3 between residues 229 and 284. So the homologous regions for frames -1 and -3 do not overlap each other and do not overlap those of frames 1 and 3. However, frame 1 does overlap with frame 3. Frame 1 fits with the upstream regions: the homologous protein is TIMP3_HUMAN, and there are four more TIMP3_HUMAN sections upstream from this one. Frame 3, however, is homologous to Q5THV5, a protein that is identical to TIMP3_HUMAN over the 147 N-terminal residues but that differs over the C-terminal 55 or so residues. If this is true and this region codes for a protein, it would mean that the C-terminal end of the protein is determined by which frame is chosen for translation. In this case the C-terminal 60+ amino acids of the two proteins would be encoded by the same 180+ nucleotides, with the choice of C-terminus determined by the frame shift.
(According to the human chromosome 22 article in Nature (Dunham et al., 1999): "The 61-kb TIMP3 gene, which is involved in Sorsby fundus macular degeneration, lies within a 268-kb intron of the large SYN3 gene". There is no mention of Q5THV5 in this paper, however; Q5THV5 was deposited in May 2005 by the Havana group.)
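To illustrate how a single stretch of nucleotides can encode two different peptides depending on the reading frame, the sketch below translates one toy sequence in frames 1 and 3. The sequence is invented for illustration and is not taken from TIMP3 or Q5THV5; the function name and the 1-based frame numbering are likewise assumptions.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(dna, frame):
    """Translate the forward strand in frame 1, 2 or 3."""
    start = frame - 1  # frame 1 starts at the first base
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Invented sequence, not real genomic data.
TOY_DNA = "GCTGCAGGTCCTCAA"

for frame in (1, 3):
    print(frame, translate_frame(TOY_DNA, frame))
```

The same bases read as AAGPQ in frame 1 and as CRSS in frame 3. Because each residue takes three nucleotides, a shared stretch of 180 nucleotides corresponds to 60 residues in either frame.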
Of the 2306 peptide fragments that found homologues, 782 found a homologue that covered 100% of the sequence and had 100% identity. A further 1011 fragments had 100% identity to a human sequence over at least part of the fragment. In total, 2000 of these 2306 homologues had at least 90% identity.
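This breakdown amounts to sorting each hit by its percentage identity and by how much of the fragment the alignment covers. A hypothetical sketch of that bookkeeping follows; the function names, argument names and category labels are illustrative assumptions, and only the 100% and 90% thresholds come from the text.

```python
from collections import Counter


def classify_hit(identity_pct, aligned_residues, fragment_residues):
    """Place one hit in an identity/coverage category."""
    full_coverage = aligned_residues >= fragment_residues
    if identity_pct >= 100.0 and full_coverage:
        return "identical over whole fragment"
    if identity_pct >= 100.0:
        return "identical over part of fragment"
    if identity_pct >= 90.0:
        return "90% to <100% identity"
    return "below 90% identity"


def summarise(hits):
    """Count hits per category.

    Each hit is a tuple of (identity_pct, aligned_residues, fragment_residues).
    """
    return Counter(classify_hit(*hit) for hit in hits)
```

Under this scheme, the count of hits with at least 90% identity is the sum of the first three categories. From the figures quoted in the text, the first two categories account for 782 + 1011 = 1793 hits.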
Of the five non-exonic sections, the "other EST fragments" find homologues for 14 fragments, all but one with high % identity. Three segments match GDIA_HUMAN on adjacent fragments of chromosome X, but they make up only a third of the entire sequence of GDIA_HUMAN. Two fragments match Q96N32 on the same chromosome, but again they make up only 50% of the protein. chr19_59470577_59470727_4 matches LIRB4_HUMAN at 100% identity, as do many of the exons in the same region of the chromosome. In this case it may be part of an alternative splice variant, because one of the exons also covers a very similar region of the sequence.
The intronic proximal fragments find homologues in 22 cases, most of which are likely to be false positives caused by low complexity. There are five fragments with high % identity, but only three of them might be protein coding. chr11_1866951_1867136_4 and a nearby fragment appear to have 100% identity to the ENSEMBL protein ENSP00000328502 and cover most of the sequence of the protein. chr19_59464094_59464166_1 has a 100% identical 25-residue chunk of YMEL1_HUMAN as a homologue. The upstream and downstream exons also have YMEL1_HUMAN as a homologue at 100%, although the whole sequence is not covered.
The intronic distal fragments find homologues in only three cases, all likely false positives.
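Low-complexity hits of the kind suspected here can often be flagged by the compositional entropy of the aligned region. The sketch below computes the Shannon entropy of a peptide's amino-acid composition. It is a simple stand-in, not the filter that BLAST itself applies, and the function names and the 2.2-bit threshold are arbitrary assumptions that would need tuning on real data.

```python
from collections import Counter
from math import log2


def shannon_entropy(peptide):
    """Entropy, in bits per residue, of the amino-acid composition."""
    if not peptide:
        return 0.0
    total = len(peptide)
    counts = Counter(peptide)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def looks_low_complexity(peptide, threshold_bits=2.2):
    """Flag peptides whose composition is dominated by a few residue types.

    The threshold is an illustrative assumption, not a published value.
    """
    return shannon_entropy(peptide) < threshold_bits
```

A peptide built from one or two residue types scores close to zero bits, whereas a peptide using all twenty amino acids equally scores about 4.3 bits, so alignments confined to low-scoring stretches deserve extra scrutiny.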
The intergenic distal regions find 9 homologues, all from chromosome 11 and very close to each other. Most of the homologues have high % identity. Seven of the homologues come from just two fragments: one of the fragments has four possible translations, the other three. Of the nine frames with homologues, four are homologous to Q6ZTP7 and three to Q6ZQS6. All have very high percentage identity. The fragments do not cover the entire sequence, but it may be that the missing fragments were less than 25 residues in size.
The intergenic proximal fragments find just 6 homologues, three with high % identity, though all the fragments are very small.
This work was carried out by Michael Tress at the PDG and Julien Lagarde at IMIM.