ENCODE Transfrag Sequences   PDG   CNB   CSIC

Transfrag Homologues

Besides the CDS sequences there are other regions of potential interest. Here we investigate the possibility that the transfrag regions have homology to known proteins. The transfrags have been split into six sections, and many of the peptides coded for by the exonic transfrags are indeed identical to known human proteins.

Transfrags Sequences

- 6-frame Translations of Transfrags

Transfrags have been partitioned into exonic, intronic distal/proximal, intergenic distal/proximal.

- '*.fa' files are nucleotide sequences in fasta format, named after their 'chrom', 'start' and 'stop'.
- '*.pep' files are 6-frame translations of each sequence entry present in the corresponding '*.fa' file, also in fasta format. The names of the resulting protein sequences are formed from the name of the input nucleic acid sequence with '_' and the translation frame appended to it. Thus a nucleic acid sequence with the name 'XYZ' translated in all 6 frames would produce protein sequences with the names: 'XYZ_1' (frame +1), 'XYZ_2' (frame +2), 'XYZ_3' (frame +3), 'XYZ_4' (frame -1), 'XYZ_5' (frame -2), 'XYZ_6' (frame -3).
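The naming scheme above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and assumes that reverse frame -k starts at offset k-1 of the reverse complement; the tool that produced the '*.pep' files is not stated here and may number the reverse frames differently. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame(name, seq):
    """Map 'name_1'..'name_6' to the six translations of seq.

    Suffixes 1-3 are frames +1..+3; suffixes 4-6 are frames -1..-3
    (assumed here to start at offsets 0..2 of the reverse complement).
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"{name}_{offset + 1}"] = translate(seq[offset:])
        frames[f"{name}_{offset + 4}"] = translate(rc[offset:])
    return frames


peptides = six_frame("XYZ", "ATGGCCTAA")
print(sorted(peptides))
```

Stop codons are written as '*' in this sketch, which is why a translated fragment can contain a stop and still align to a homologue on either side of it.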

Transfrags Data

- Transfrags with Human Homologues

This data is arranged in 13 columns. The first column is the transfrag identifier; the final number in the identifier is the translation frame suffix described above. The second column is the length of the fragment, the third the nearest human homologue and the fourth the sequence length of the homologue. Subsequent columns are the e-value between query and homologue, the percentage identity between the two, the coverage of the query fragment by the homologue, the start and end residues of the query in the alignment, and the start and end points of the homologue in the alignment. The last two columns are the source of the fragment (exonic, intronic etc.) and whether the fragment contains a stop codon.
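As a rough illustration of the column layout, the sketch below reads one row into a record. It assumes the rows are whitespace-separated text, which is not stated on this page; the field names are hypothetical, and the example row is a made-up placeholder rather than real data.

```python
from dataclasses import dataclass


@dataclass
class HomologueRow:
    transfrag: str          # identifier without the frame suffix
    frame: int              # signed frame: +1..+3 or -1..-3
    fragment_length: int
    homologue: str
    homologue_length: int
    e_value: float
    identity: float         # percentage identity
    coverage: float         # coverage of the query fragment
    query_start: int
    query_end: int
    homologue_start: int
    homologue_end: int
    source: str             # exonic, intronic etc.
    stop_codon: str         # kept as raw text; its encoding is not stated


def signed_frame(suffix):
    """Convert the 1..6 name suffix to a signed frame (4 -> -1, 5 -> -2, 6 -> -3)."""
    if not 1 <= suffix <= 6:
        raise ValueError(f"frame suffix out of range: {suffix}")
    return suffix if suffix <= 3 else -(suffix - 3)


def parse_row(line):
    """Parse one 13-column row (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) != 13:
        raise ValueError(f"expected 13 columns, got {len(fields)}")
    transfrag, suffix = fields[0].rsplit("_", 1)
    return HomologueRow(
        transfrag=transfrag,
        frame=signed_frame(int(suffix)),
        fragment_length=int(fields[1]),
        homologue=fields[2],
        homologue_length=int(fields[3]),
        e_value=float(fields[4]),
        identity=float(fields[5].rstrip("%")),
        coverage=float(fields[6].rstrip("%")),
        query_start=int(fields[7]),
        query_end=int(fields[8]),
        homologue_start=int(fields[9]),
        homologue_end=int(fields[10]),
        source=fields[11],
        stop_codon=fields[12],
    )


# Placeholder row for illustration only; not taken from the real tables.
row = parse_row("chrN_100_250_4 50 EXAMPLE_HUMAN 300 1e-20 98.0 100 1 50 10 59 exonic no")
print(row.transfrag, row.frame)
```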

Although many fragments include a stop codon, the stop codons do not always affect the alignment between fragment and homologue, because they are often upstream or downstream of the alignment. However, the stop codons in the more distant homologues often do fall within the alignment.
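Whether a stop codon lies inside or outside the aligned region can be checked from the query start and end columns. The sketch below assumes that stops are written as '*' in the translated peptide and that the alignment coordinates are 1-based and inclusive; both are assumptions, and the function name is hypothetical.

```python
def stops_in_alignment(peptide, query_start, query_end):
    """Return 1-based positions of '*' that fall inside the aligned query region."""
    return [
        position
        for position, residue in enumerate(peptide, start=1)
        if residue == "*" and query_start <= position <= query_end
    ]


# Made-up peptide with stops at positions 3 and 7.
print(stops_in_alignment("MA*KLV*", 4, 6))  # alignment lies between the stops
print(stops_in_alignment("MA*KLV*", 2, 5))  # alignment spans the first stop
```

An empty list corresponds to the first case described above, where the stop codons sit upstream or downstream of the alignment.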

Sequences with stop codons are underlined, sequences with low % identity are in italics. Transfrags that find more than one homologue are in colour. Transfrags where a homologue can be found with two different frames are in red, transfrags where a homologue can be found for three different frames are in orange and transfrags for which a homologue can be found for four different frames are in blue. There is one case where all six frames have homologues, but that is due to low complexity and I forget what colour that is in.

BLAST Homologues Found

2099 of the 8488 fragments in the six sets found at least one homologue with BLAST. Some of the sequences found by BLAST searches were undoubtedly false positives caused by low-complexity regions of sequence, but most of the finds were sequences that had 100% identity. Note that fragments of less than 25 residues were ignored for the sake of speed and because BLAST is unlikely to give a good score to fragments that small.

There were fragments with homologues in all six sets, but practically all the finds were in the "gencode exons". There were only 54 in total from the other 5 regions.

When all six reading frames were taken into account, 2306 of the 50948 translations had homologues. There were 167 fragments that had two or more translations with human homologues. The record was fragment chr21_33843808_33849255, where all six of the frames were over 1650 residues long and homologues were found for all six reads. However, only one (the 1815 residues that had 99% identity to SON_HUMAN over 1782 residues) is likely to code for a protein. The other five reads all have much lower percentage identities and are aligned in low-complexity regions. The three upstream fragments and nine downstream fragments also have approximately 100% identity to SON_HUMAN.
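Counting fragments with hits in more than one frame amounts to grouping the translation names by their fragment part. A minimal sketch follows, assuming each hit is identified by a name of the form 'fragment_N' with N from 1 to 6; the function names are hypothetical, and the identifiers are the ones discussed on this page.

```python
from collections import defaultdict


def frames_per_fragment(hit_ids):
    """Group frame suffixes (1..6) by fragment name."""
    frames = defaultdict(set)
    for hit_id in hit_ids:
        fragment, suffix = hit_id.rsplit("_", 1)
        frames[fragment].add(int(suffix))
    return dict(frames)


def multi_frame_fragments(hit_ids, minimum=2):
    """Return fragments with homologues in at least `minimum` different frames."""
    return {
        fragment: sorted(suffixes)
        for fragment, suffixes in frames_per_fragment(hit_ids).items()
        if len(suffixes) >= minimum
    }


hits = [f"chr21_33843808_33849255_{n}" for n in range(1, 7)]
hits += [f"chr22_31579718_31581440_{n}" for n in (1, 3, 4, 6)]  # frames 1, 3, -1, -3
hits += ["chr19_59470577_59470727_4"]
print(multi_frame_fragments(hits))
```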

However, there are also cases where more than one of the six frames has a human homologue and the percentage identities to the proteins are high. chr22_31579718_31581440, for example, finds four human homologues, three of which have approximately 100% identity. Frames 1 and 3 find homologues over the same 55-residue N-terminal stretch, and frames -1 and -3 find homologues for segments between residue 130 and residue 280. Frame -1 of the fragment finds homology with protein k222 (a 138-residue protein) between residues 149 and 229 and frame -3 between residues 229 and 284. So the homologous regions for frames -1 and -3 do not overlap each other and do not overlap those for frames 1 and 3. However, frame 1 does overlap with frame 3. Frame 1 fits with the upstream regions: the homologous protein is TIMP3_HUMAN, and there are 4 more TIMP3_HUMAN sections upstream from this one. However, frame 3 is homologous to Q5THV5, a protein that is identical to TIMP3_HUMAN over the 147 N-terminal residues, but that differs over the C-terminal 55 or so residues. If this is true and this codes for a protein, it would mean that the C-terminal end of the protein is determined by which frame is chosen for translation. In this case the C-terminal 60+ amino acids of the two proteins would be encoded by the same 180+ nucleotides and the choice of C-terminal determined by the frame shift.

(According to the Human Chromosome 22 article in Nature (Dunham et al., 1999): "The 61-kb TIMP3 gene, which is involved in Sorsby fundus macular degeneration, lies within a 268-kb intron of the large SYN3 gene". But there is no mention of Q5THV5 in this paper; Q5THV5 was deposited in May 2005 by the Havana group.)

Of the 2306 peptide fragments that find homologues, 782 found a homologue that covered 100% of the sequence and had 100% identity. A further 1011 fragments had 100% identity to a human sequence over at least part of the fragment. In total 2000 of these 2306 homologues had at least 90% identity.
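The three groups in the paragraph above can be expressed as a simple classification on the identity and coverage columns. This is only a sketch: it assumes both values are percentages on a 0-100 scale, and the category labels and function name are hypothetical.

```python
def classify_hit(identity, coverage):
    """Bin a hit by percentage identity and query coverage (both assumed 0-100)."""
    if identity >= 100.0 and coverage >= 100.0:
        return "identical over whole fragment"
    if identity >= 100.0:
        return "identical over part of fragment"
    if identity >= 90.0:
        return "at least 90% identity"
    return "below 90% identity"


# Made-up (identity, coverage) pairs for illustration only.
for identity, coverage in [(100.0, 100.0), (100.0, 62.0), (93.5, 100.0), (41.0, 80.0)]:
    print(identity, coverage, classify_hit(identity, coverage))
```

Note that the first two categories are both subsets of the "at least 90% identity" total quoted in the text, so a tally made with this function would need to sum the first three bins to reproduce that figure.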

Of the five non-exonic sections, the "other EST fragments" find 14 fragments, all but one with high % identity. There are three segments that match GDIA_HUMAN on adjacent fragments of chromosome X, but they make up only a third of the entire sequence of GDIA_HUMAN. There are two fragments that match Q96N32 on the same chromosome, but again they only make up 50% of the protein. chr19_59470577_59470727_4 matches LIRB4_HUMAN at 100% as do many of the exons on the same region of the chromosome. In this case it may be part of an alternative splice variant because one of the exons also covers a very similar region of the sequence.

The intronic proximal fragments find homologues in 22 cases, most of which are likely to be false positives caused by low complexity. There are 5 fragments with high % identity, but only three of them might be protein coding: chr11_1866951_1867136_4 and a nearby fragment seem to have 100% identity to the ENSEMBL protein ENSP00000328502 and cover most of the sequence of the protein. chr19_59464094_59464166_1 has a 100% identical 25-residue chunk of YMEL1_HUMAN as a homologue. The upstream and downstream exons also have YMEL1_HUMAN as a homologue at 100%, although the whole sequence is not covered.

The intronic distal set finds homologues for only three fragments, all likely false positives.

The intergenic distal regions find 9 homologues, all from chromosome 11 and very close to each other. Most of the homologues have high % identity. Seven of the homologues are found from just two fragments: one of the fragments has four possible translations, the other three. Of the nine frames with homologues, four are homologous to Q6ZTP7 and three to Q6ZQS6. All have very high percentage identity. The fragments do not cover the entire sequence, but it may be that the missing fragments were less than 25 residues in size. The intergenic proximal fragments find just 6 homologues, three with high % identity, though all the fragments are very small.

This work was carried out by Michael Tress at the PDG and Julien Lagarde at IMIM.

 

These pages hosted at the Protein Design Group, CNB-CSIC.