Course Protein Function Prediction
Practical session/Exercises
Practical lesson 1
Gene context, gene fusions and phylogenetic profiles.

By F. Abascal,  PDG, CNB, CSIC.
  • Given this protein (yebC),
    >YEBC_ECOLI|P24237|Protein yebC.
    MAGHSKWANT RHRKAAQDAK RGKIFTKIIR ELVTAAKLGG GDPDANPRLR AAVDKALSNN
    MTRDTLNRAI ARGVGGDDDA NMETIIYEGY GPGGTAIMIE CLSDNRNRTV AEVRHAFSKC
    GGNLGTDGSV AYLFSKKGVI SFEKGDEDTI MEAALEAGAE DVVTYDDGAI DVYTAWEEMG
    KVRDALEAAG LKADSAEVSM IPSTKADMDA ETAPKLMRLI DMLEDCDDVQ EVYHNGEISD
    EVAATL
     
  • Let's try to obtain information about its function using BLAST, Pfam or other tools to find homology relationships.

  • Can you find some clues about its function?
    Now, let's try to use comparative genomics information to find some clues.
     
  • Look at the species distribution in Pfam. Is this protein present in a wide or a narrow phylogenetic range?
  • The COG database provides information about the gene neighborhood of the genes belonging to each COG. Access the COG database ("COGs - the initial version") and find the COG corresponding to yebC.
    • -Search by gene name ("yebC").
      -Once in yebC's COG, look at the species distribution: is yebC always present in bacterial genomes? What does that indicate?
      -Check the "genome context" of the yebC gene. Is there a gene that appears frequently in yebC's neighborhood? What do you think this means? Does it give us any clue about yebC's function?
  • We can see that knowledge about protein families, even those composed of proteins of unknown function, is essential for comparative genomics. Many databases are devoted to comparative genomics. Two that are widely used, and that may provide information similar to that contained in COGs, are:

    • MBGD, Microbial Genome Database
    • KEGG, Kyoto Encyclopedia of Genes and Genomes
  • Now use the STRING system (Search Tool for the Retrieval of Interacting Genes/Proteins): it looks for conservation of gene order, common phylogenetic patterns and gene fusions.
      -Find the corresponding STRING entry (by text/by sequence).
      -What associations are predicted by gene neighborhood?
      -What associations are predicted by gene fusion?
      -What associations are predicted based on common phylogenetic patterns (Phylogeny)?
      -Take a look at the "Summary Network". By changing the DEPTH of the network, it is possible to identify which COGs are functionally related to those represented in the original graph. By doing this we are in fact travelling along a giant network of functional interactions. Some regions of this giant network may correspond to specific regulatory circuits or metabolic pathways.
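The idea behind the "common phylogenetic patterns" evidence can be illustrated with a toy example. The sketch below is not STRING's actual scoring method: it simply assumes that each gene is described by a presence/absence profile across a set of genomes and ranks candidate partners by Jaccard similarity. The gene names (geneA to geneD) and all profile values are invented for illustration.

```python
# Toy presence/absence profiles: 1 = gene found in that genome, 0 = absent.
# Gene names, genome order and all values are invented for illustration only.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1],
    "geneB": [1, 1, 0, 1, 0, 1],
    "geneC": [1, 0, 1, 0, 1, 0],
    "geneD": [1, 1, 0, 1, 1, 1],
}


def jaccard(p, q):
    """Jaccard similarity between two presence/absence profiles."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def rank_partners(query, all_profiles):
    """Rank every other gene by profile similarity to the query gene."""
    scores = [
        (name, jaccard(all_profiles[query], profile))
        for name, profile in all_profiles.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


for name, score in rank_partners("geneA", profiles):
    print(f"{name}\t{score:.2f}")
```

Genes whose profiles closely match that of the query (present and absent in the same genomes) are candidates for a functional link; a real analysis would use many more genomes and a statistically grounded score.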
The main goal of the practical session is to make course participants familiar with existing text mining and IR tools for biology literature.

Practical 2: Gene Ontology, describing and annotating gene products.

Gene Ontology (GO) aims to provide standardized concepts or terms to describe relevant biological aspects. Try using GO to retrieve a set of terms: apoptosis, caspase, glycogenin, transcription factor (or, if you are interested in a particular function/process/compartment, use your own query instead). What did you retrieve? Browse through the results and visualize the corresponding ontology graphs. What kinds of relationships between terms did you find? What are the advantages of using this method?
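To make the notion of an ontology graph more concrete, here is a minimal sketch of how term relationships can be stored and traversed. The graph is a simplified, hand-written fragment that may not match the current ontology, and the function name ancestors is only an illustrative choice. Each term points to its parents through a typed relationship (is_a or part_of).

```python
# Simplified, hand-written fragment of a GO-like graph (illustrative only;
# it may not reflect the current structure of the real ontology).
# Each term maps to a list of (relationship, parent term) pairs.
parents = {
    "execution phase of apoptosis": [("part_of", "apoptotic process")],
    "apoptotic process": [("is_a", "programmed cell death")],
    "programmed cell death": [("is_a", "cell death")],
    "cell death": [("is_a", "biological_process")],
    "biological_process": [],
}


def ancestors(term, graph):
    """Return every term reachable by following parent links upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in graph.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("apoptotic process", parents)))
```

Because annotations made to a specific term also apply to all of its ancestors, a query with a general term can recover gene products annotated with any of its more specific descendants.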

Try to explore annotation for a set of proteins, namely:

1) CASP9_HUMAN (P55211), formerly known as ICE9_HUMAN
2) Y1333_MYCTU (P64811), formerly known as YD33_MYCTU
3) RPE_YEAST (P46969)

by searching the Gene Ontology Annotation (GOA) database. These proteins were used in the earlier practical. What is one of the weak points of using GO annotations for bioinformatics annotation? (Hint: think about domains.)

iHOP. This tool was developed in our group (PDG) at the CNB. Create a gene model for your query gene, check the results carefully, and surf through the virtual gene network of iHOP. What kind of results does iHOP return? What are the advantages/disadvantages of using iHOP instead of a PubMed retrieval search?


Practical 3: Deriving protein interactions through literature mining
Much of the function of many proteins comes from interactions with other biomolecules. Use different text mining tools that try to extract protein interactions for a given query protein (caspase, glycogenin, p53, etc.) from text: iHOP, Chilibot.
Compare your results with entries in interaction databases: BIND, DIP, GRID, HPRD, IntAct, MINT and STRING.
What kind of output is produced by each tool? What differences do you find? What difficulties do these tools face?
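One simple way to organize such a comparison is to treat the partners reported by each resource as a set and look at the overlap. The sketch below assumes you have copied the partner names by hand from a text mining tool and from a database; the partner names are placeholders, not real results, and treating the database as the reference is itself an assumption, since curated databases are also incomplete.

```python
# Hypothetical partner lists for one query protein. The names are
# placeholders, not real output from any tool or database.
text_mining = {"partnerA", "partnerB", "partnerC", "partnerD"}
curated_db = {"partnerB", "partnerC", "partnerE"}


def compare(predicted, reference):
    """Summarize the overlap between two sets of interaction partners."""
    shared = predicted & reference
    return {
        "shared": shared,
        "only_predicted": predicted - reference,
        "only_reference": reference - predicted,
        # Fraction of predicted partners that the reference also lists.
        "precision": len(shared) / len(predicted) if predicted else 0.0,
        # Fraction of reference partners that were also predicted.
        "recall": len(shared) / len(reference) if reference else 0.0,
    }


summary = compare(text_mining, curated_db)
for key, value in summary.items():
    print(key, value)
```

Before comparing, remember that the same protein may appear under different names or synonyms in different resources, so the names usually need to be mapped to a common identifier first.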


Practical 4: Deriving functional information from structure
Since structural genomics programs are generating an ever-increasing number of hypothetical structures, more and more servers that predict function for structures of unknown function are being developed.

Function here is not limited to a protein's broad functional category, but can include cellular location, prediction of binding sites and prediction of interacting surface patches.

Compare the predictions from the following servers for the known structure 1t70. Do the predictions help you to guess the function?
  • ProtFun - queries other prediction servers to obtain the cellular role, enzyme class, etc.
  • ProFunc - uses both sequence- and structure-based methods to deduce function.
  • PPI-PRED - predicts residues involved in interactions between proteins.