Gene annotation of the human UROD gene

Practical exercise

Enrique Blanco - eblanco@imim.es
Roderic Guigo - rguigo@imim.es


Abstract: In this exercise, the human uroporphyrinogen decarboxylase gene (URO-D) will be used to measure the accuracy of different gene-finding approaches. GENSCAN, geneid, FGENESH and GENEWISE, among others, will be used to annotate the sequence. Methods based on search by signal, search by content and homology (to protein and cDNA sequences) will be combined in order to improve the ab initio results. Because Start codons are weakly conserved, the initial exon is predicted incorrectly in most cases.


    A. Gene annotation

    Step 1. Using a genome annotation browser
    • Open the UCSC Genome Browser

    • Select the Genome Browser link to locate the UROD gene

    • Select human genome

    • Type UROD in the position area

    • Submit the query

    • Click on the Known Gene link

    • Identify the known gene in the picture:
      • how many exons does the UROD gene have annotated in human?
      • how can you extract the genomic sequence of the gene?
      • what genes are annotated in the surrounding area?
      • are there any homologs annotated in other species?




    Step 2. Accessing the NCBI database to retrieve the gene sequence
    • Go to the NCBI database

    • Select Nucleotide searches

    • Search for the entry U30787

    • Press Go button

    • Click on the query result

    • Have a look at the different entry fields: exons, UTR, promoter, ...


    • Select the FASTA format on the Display text box

    • Select this page to text on the Send text box

    • Press Send button to see the sequence on the screen in plain text

    • This is the sequence in FASTA format
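The same retrieval can be scripted. The following is a minimal sketch, assuming the NCBI E-utilities `efetch` service is reachable and that the entry U30787 is served from the `nuccore` database; the function names are illustrative and not part of the original exercise.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public NCBI E-utilities efetch endpoint.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build a URL that asks efetch for a nucleotide entry as FASTA text."""
    params = {"db": "nuccore", "id": accession,
              "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def parse_fasta(text):
    """Return (header, sequence) for the first record found in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record starts here; keep only the first
            header = line[1:]
        elif header is not None:
            chunks.append(line.upper())
    if header is None:
        raise ValueError("no FASTA header found")
    return header, "".join(chunks)


def fetch_fasta(accession):
    """Download one entry and parse it (requires network access)."""
    with urlopen(efetch_url(accession)) as response:
        return parse_fasta(response.read().decode("ascii"))
```

Calling `fetch_fasta("U30787")` should return the header line and the sequence as one uppercase string, ready to be pasted into the gene-finding servers used below.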

    B. Exploring ab initio gene prediction

    Step 2. Running geneid
    • Connect to the geneid server

    • Paste the FASTA sequence

    • Choose geneid output format

    • Run geneid with different parameters:

      1. Searching signals: Select acceptors, donors, start and stop codons. Look for them in the real annotation of the sequence

      2. Searching exons: Select All exons and try to find the real ones

      3. Finding genes: You do not need to select any option (default behaviour). Compare the predicted gene with the real gene

      Figure 1. Signals, exons and genes predicted by geneid in the sequence U30787
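To get a feeling for why the signal search returns so many candidates, the sketch below lists every occurrence of the minimal signal motifs on the forward strand. It is only an illustration: it assumes the canonical GT donor and AG acceptor dinucleotides, and it does not score the candidates as a gene finder would. The names `SIGNALS` and `find_signals` are hypothetical.

```python
import re

# Minimal motifs only; real gene finders score the surrounding context too.
SIGNALS = {
    "donor": "GT",
    "acceptor": "AG",
    "start": "ATG",
    "stop": "TAA|TAG|TGA",
}


def find_signals(seq):
    """Return {signal name: [1-based positions]} on the forward strand."""
    seq = seq.upper()
    found = {}
    for name, pattern in SIGNALS.items():
        # A lookahead reports overlapping occurrences as well.
        regex = re.compile("(?=(" + pattern + "))")
        found[name] = [match.start() + 1 for match in regex.finditer(seq)]
    return found


if __name__ == "__main__":
    for name, positions in find_signals("ATGGTAAGTAG").items():
        print(name, positions)
```

Almost all of these candidates are false positives, which is why the exon and gene steps above need coding-content information in addition to the signals.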


    Step 3. Running other genefinders

    Since several alternative programs are available to analyze a DNA sequence, we can run each of them and look for the parts that their predictions have in common.

    1. GENSCAN:
      • Connect to the GENSCAN server

      • Paste DNA sequence

      • Press Run Genscan button

      • Compare annotations and predictions

    2. FGENESH:
      • Connect to Softberry homepage

      • On the left frame, select GENE FINDING in Eukaryota

      • Select the program FGENESH

      • Paste DNA sequence

      • Press Search button

      • Compare annotations and predictions

    NOTE: The first exon is missed in all the predictions, and the donor site of exon 5 is difficult to detect. Poor detection of Start codons is a serious drawback of current gene finding programs (see Figure 2). However, this problem can be overcome by using homology information to complete the gene prediction.

    Figure 2. EMBL annotation and genes predicted by Grail, GENSCAN, geneid and FGENESH in the sequence U30787
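The comparison between annotation and predictions can be made quantitative at the exon level. The sketch below assumes that an exon counts as correct only when both of its boundaries match the annotation exactly, and it uses the convention common in gene-finding evaluation where specificity is the fraction of predicted exons that are correct. The coordinates in the usage example are invented for illustration and are not the UROD exons.

```python
def exon_accuracy(annotated, predicted):
    """Exon-level (sensitivity, specificity); exons are (start, end) tuples."""
    annotated, predicted = set(annotated), set(predicted)
    exact = annotated & predicted  # both boundaries must match
    sensitivity = len(exact) / len(annotated) if annotated else 0.0
    specificity = len(exact) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


def missed_exons(annotated, predicted):
    """Annotated exons that were not predicted exactly, in genomic order."""
    predicted = set(predicted)
    return [exon for exon in sorted(annotated) if exon not in predicted]


if __name__ == "__main__":
    # Hypothetical coordinates, NOT taken from the U30787 annotation.
    annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
    predicted = [(300, 400), (500, 610), (700, 800)]
    print(exon_accuracy(annotated, predicted))
    print(missed_exons(annotated, predicted))
```

In this toy case the first exon is missing and one donor site is shifted, which mirrors the two kinds of error described in the note above.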


    C. Using EST/cDNA homology information

    Step 4. Using FGENESH-C + alignment of transcripts

    Using blastn, we can search the database est_human for ESTs that support the predictions. Filter the output to select non-overlapping ESTs that together could form a complete cDNA sequence (see Figure 4). ESTs that align to the genomic sequence as a single block should be rejected: only ESTs split into two or more pieces, and therefore spanning at least one pair of splice sites, provide evidence of splicing.
    • Connect to the FGENESH-C server (on Gene finding with similarity menu)

    • Paste the sequence U30787

    • Paste the cDNA sequence or EST you have selected

    • Press the search button

    • Notice that the predicted gene will necessarily be supported by homology information, so it will most likely be mapped only to the genomic region that overlaps your EST query.

    Figure 4. Best human ESTs in the alignment mapped on the genomic sequence U30787
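The EST filtering described in this step can be sketched as follows. The example assumes that the blastn output has already been reduced to a dictionary mapping each EST name to the list of genomic blocks (start, end) it aligns to; that data structure, the function names and the EST names are hypothetical. Spliced ESTs are kept, and a simple greedy pass then picks ESTs whose genomic spans do not overlap.

```python
def spliced_ests(hits):
    """Keep ESTs whose alignment is split into two or more genomic blocks."""
    return {name: sorted(blocks)
            for name, blocks in hits.items() if len(blocks) >= 2}


def span(blocks):
    """Genomic interval covered by all blocks of one EST."""
    return min(start for start, _ in blocks), max(end for _, end in blocks)


def select_non_overlapping(hits):
    """Greedily choose spliced ESTs with disjoint spans, earliest end first."""
    candidates = sorted(spliced_ests(hits).items(),
                        key=lambda item: span(item[1])[1])
    chosen, last_end = [], 0
    for name, blocks in candidates:
        start, end = span(blocks)
        if start > last_end:  # does not overlap anything already chosen
            chosen.append(name)
            last_end = end
    return chosen


if __name__ == "__main__":
    # Hypothetical hits, NOT real blastn results for U30787.
    hits = {
        "est_a": [(100, 200), (300, 400)],
        "est_b": [(350, 450), (600, 700)],
        "est_c": [(500, 900)],            # unspliced, rejected
        "est_d": [(720, 800), (900, 950)],
    }
    print(select_non_overlapping(hits))
```

The greedy rule is one reasonable choice among several; selecting by alignment score or by total coverage would be equally valid and may give a different set.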

    D. Using protein homology information

    Step 5. Spliced alignment

    Spliced alignment is very useful when we have additional information (a putative homologous protein sequence) about the content of the sequence. Thus, gene prediction is guided by fitting the protein sequence into the best splice sites predicted in the genomic sequence.
    • Open the NCBI blast server

    • Choose blastx program (genomic query versus protein database)

    • Paste the genomic sequence and press the Blast! and Format! buttons

    • Select the first protein and display its FASTA sequence. As expected, it is the real protein annotated in the genomic sequence.

    • Open genewise web server to use this protein to predict the best gene structure

    • Paste both protein and genomic sequences and run the program

    • Compare the predicted gene (end of the file) with the annotation: look at the splice sites at both ends of each intron to check that the exon boundaries are correct

      Figure 5. Best HSPs obtained using blastx, representing homologous proteins similar to the genomic sequence U30787
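Checking the exon boundaries of a predicted gene can also be automated. The sketch below assumes a forward-strand gene, exon coordinates given as 1-based inclusive (start, end) pairs, and the canonical GT...AG intron rule; the function names and the toy sequence are illustrative only.

```python
def intron_boundaries(seq, exons):
    """Yield (intron_start, intron_end, donor, acceptor) between exons.

    Exons are 1-based inclusive (start, end) pairs on the forward strand.
    """
    seq = seq.upper()
    exons = sorted(exons)
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        intron = seq[prev_end:next_start - 1]  # bases between the two exons
        yield prev_end + 1, next_start - 1, intron[:2], intron[-2:]


def non_canonical_introns(seq, exons):
    """Return introns that do not start with GT and end with AG."""
    return [(start, end)
            for start, end, donor, acceptor in intron_boundaries(seq, exons)
            if (donor, acceptor) != ("GT", "AG")]


if __name__ == "__main__":
    # Toy sequence: exon, GT...AG intron, exon, non-canonical intron, exon.
    seq = "ATGCC" + "GTAAACAG" + "GGTTT" + "GCAAATCC" + "TGA"
    exons = [(1, 5), (14, 18), (27, 29)]
    print(non_canonical_introns(seq, exons))
```

An intron reported by this check is not necessarily wrong, since a small fraction of real introns use other splice sites, but it marks an exon boundary worth inspecting by eye.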



    E. Results

    Here you can find the solutions to every exercise:

    UCSC graphical annotation
    NCBI annotation
    NCBI annotation (plain text)
    FASTA sequence
    geneid results: signals
    geneid results: exons
    geneid results: genes
    GENSCAN results
    FGENESH results
    Blastn + human ESTs results
    Blastx + protein results
    Genewise (human protein)


    F. Bibliography
    1. J.F. Abril and R. Guigó. gff2ps: visualizing genomic annotations. Bioinformatics 16:743-744 (2000).

    2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. J. Mol. Biol. 215:403-410 (1990).

    3. Burge, C. and Karlin, S. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94 (1997).

    4. E. Blanco, G. Parra and R. Guigó. Using geneid to Identify Genes. In A. D. Baxevanis and D. B. Davison, chief editors: Current Protocols in Bioinformatics. Volume 1, Unit 4.3. John Wiley & Sons Inc., New York. ISBN: 0-471-25093-7 (2002).

    5. Asaf A. Salamov and Victor V. Solovyev. Ab initio Gene Finding in Drosophila Genomic DNA. Genome Res. 10:516-522 (2000).

    6. D. Hyatt, J. Snoddy, D. Schmoyer, G. Chen, K. Fischer, M. Parang, I. Vokler, S. Petrov, P. Locascio, V. Olman, Miriam Land, M. Shah, and E. Uberbacher. Improved Analysis and Annotation Tools for Whole-Genome Computational Annotation and Analysis: GRAIL-EXP Genome Analysis Toolkit and Related Analysis Tools. Genome Sequencing & Biology Meeting (2000).

    7. Ewan Birney and Richard Durbin. Using GeneWise in the Drosophila Annotation Experiment. Genome Res. 10: 547-548 (2000).