The complexity of eukaryotic gene structure

ABSTRACT

Current gene prediction programs, among other bioinformatics tools, systematically ignore the complexity of eukaryotic gene structure. This complexity arises from alternatively spliced gene structures, from non-canonical signals (which affect either splicing or translation), and from promoters, the regions that control gene transcription, which are not yet well understood. Here we give an overview of some of the limitations and future directions in the gene prediction field.

Eukaryotic gene structure

In short, what follows is an idealized model of a two-exon gene. Make sure all the elements pictured are clear to you. Note the signals that define coding exons.

Key Points

Signals                  Exons      Lengths
Start (ATG)              Single     Exons
STOPs (TGA, TAA, TAG)    First      Introns
Donor (GT)               Internal   Intergenic
Acceptor (AG)            Terminal   UTRs
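
To make the signals listed above concrete, here is a minimal sketch that reports every position where each motif occurs in a DNA string. The function name `find_signals` and the example sequence are illustrative assumptions, not part of any gene prediction program. The scan covers only the forward strand, and most hits are expected to be false positives, since such short motifs occur by chance all over a genome.

```python
import re

# Signal motifs from the Key Points table (DNA alphabet, forward strand only).
SIGNALS = {
    "start": ["ATG"],
    "stop": ["TGA", "TAA", "TAG"],
    "donor": ["GT"],
    "acceptor": ["AG"],
}


def find_signals(seq):
    """Return {signal name: sorted 0-based positions} for every motif occurrence."""
    seq = seq.upper()
    hits = {}
    for name, motifs in SIGNALS.items():
        positions = []
        for motif in motifs:
            # A lookahead pattern also reports overlapping occurrences.
            pattern = "(?=" + motif + ")"
            positions.extend(m.start() for m in re.finditer(pattern, seq))
        hits[name] = sorted(positions)
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to exercise the function.
    for name, positions in find_signals("ATGGTAAGTAG").items():
        print(name, positions)
```

Even on a sequence of eleven bases the scan finds several candidate donors, acceptors and stops, which illustrates why signals alone are not enough to define exons.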

Promoter region

Promoter regions, located upstream of genes, are made of short sequence units (termed promoter elements) that are bound by transcription factors (proteins) which modulate transcription. Accurate promoter prediction would aid gene prediction. However, current methods are not reliable enough for high-throughput annotation of promoter regions.

Alternative splicing

Recent data indicate that 40-50% of human genes produce alternative mRNA forms through splicing. How do you think current gene prediction results are affected by this fact? Do you think predicted gene models can contain exons that are never found together in the same mRNA molecule?
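
One way to explore the second question is to compare a predicted gene model against the isoforms actually observed. The sketch below is a toy illustration under stated assumptions: exons are represented by hypothetical labels (`E1`, `E2a`, ...), and `never_cooccurring_pairs` is an invented helper name, not a function of any existing tool.

```python
from itertools import combinations


def never_cooccurring_pairs(predicted_exons, observed_isoforms):
    """Return the exon pairs of a predicted model that no observed isoform contains together."""
    isoform_sets = [set(isoform) for isoform in observed_isoforms]
    return [
        (a, b)
        for a, b in combinations(predicted_exons, 2)
        if not any(a in isoform and b in isoform for isoform in isoform_sets)
    ]


if __name__ == "__main__":
    # Hypothetical gene with two mutually exclusive middle exons.
    isoforms = [["E1", "E2a", "E3"], ["E1", "E2b", "E3"]]
    # A single-model predictor might merge every exon into one gene structure.
    predicted = ["E1", "E2a", "E2b", "E3"]
    print(never_cooccurring_pairs(predicted, isoforms))
```

In this toy case the merged prediction contains both `E2a` and `E2b`, although no observed mRNA carries the two together.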

Non-canonical splice sites

The canonical splice sites are the donor GT dinucleotide and the acceptor AG dinucleotide. However, non-canonical splice sites also exist. The most relevant are GC-AG (0.69%) and AT-AC (0.05%) (donor-acceptor pairs). Be aware that standard gene prediction programs usually take into account only the canonical splice site signals.
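
The donor-acceptor pairs named above can be checked directly on an intron sequence. The following is a minimal sketch, assuming the intron is given as a DNA string that starts at the donor dinucleotide and ends at the acceptor dinucleotide; `classify_intron` is an illustrative name, not part of any existing program.

```python
# Donor-acceptor pairs mentioned in the text.
SPLICE_SITE_CLASSES = {
    ("GT", "AG"): "canonical GT-AG",
    ("GC", "AG"): "non-canonical GC-AG",
    ("AT", "AC"): "non-canonical AT-AC",
}


def classify_intron(intron):
    """Classify an intron by its first two (donor) and last two (acceptor) bases."""
    intron = intron.upper()
    if len(intron) < 4:
        raise ValueError("intron too short to contain both splice sites")
    pair = (intron[:2], intron[-2:])
    return SPLICE_SITE_CLASSES.get(pair, "other")


if __name__ == "__main__":
    # Hypothetical intron sequences, far shorter than real introns.
    for intron in ["GTAAGTCCCAG", "GCAAGTTTTAG", "ATATCCTTTAC", "GGAAAATT"]:
        print(intron, "->", classify_intron(intron))
```

A predictor that accepts only the first class would miss every intron falling into the other two.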

Alternative translation

First of all, let's introduce the translation step, in which RNA is translated into protein according to a code that maps each triplet (codon) to an amino acid.
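
The codon-to-amino-acid mapping can be sketched in a few lines. This example assumes the standard genetic code, reads the sequence from its first base in frame 0, and stops at the first stop codon; the function name `translate` and the input sequence are illustrative. Input may be written as RNA (U) or DNA (T), and any other character raises a `KeyError`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a coding sequence codon by codon until the first stop codon."""
    seq = seq.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("AUGGCCUGGUAA"))  # AUG GCC UGG UAA -> MAW
```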

Did you know that the same mRNA can have two or more translation initiation sites (ATG)? Or even different polyadenylation sites? Or that the meaning of a codon can be recoded? Or that frameshifting exists? In the light of current data, these events seem to be rare (but bear in mind that alternative splicing was itself considered rare a few years ago). As an example, we will study further a recoding process that occurs in a peculiar family of proteins that incorporate selenium: selenoproteins.
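
To see why recoding matters for gene prediction, the sketch below translates the same sequence twice: once reading TGA as a stop, and once reading it as selenocysteine (one-letter code U). It is deliberately simplified, and the simplification is an assumption of this example: every in-frame TGA is recoded when `recode_tga` is set, whereas in a cell only specific UGA codons are recoded. The function name and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_with_recoding(seq, recode_tga=False):
    """Translate in frame 0; optionally read TGA as selenocysteine (U) instead of stop."""
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if recode_tga and codon == "TGA":
            # Simplifying assumption: recode every in-frame TGA.
            protein.append("U")
            continue
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    toy = "ATGTGAGCCTAA"  # hypothetical sequence with an in-frame TGA
    print(translate_with_recoding(toy))                   # standard reading
    print(translate_with_recoding(toy, recode_tga=True))  # TGA read as U
```

Under the standard reading the protein is truncated at the TGA, which is how a conventional gene predictor would treat a selenoprotein gene.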