(updated 23rd July, 2003)

Task 2: Functional Curation

The second task will address the automatic assignment of GO annotations (http://www.geneontology.org/) to human proteins. This is currently done by curators at Swiss-Prot for the human genome, who have agreed to make these annotations available for use as training and test sets. This task requires that participants build an automated system that identifies the relevant information for each protein and produces the required GO annotations. These can be validated against the GO annotations in Swiss-Prot. As part of the task, the systems will be asked to provide a pointer into the text of the article that provides supporting evidence for the annotation. The Swiss-Prot curators will inspect these evidence links to determine their utility.

The data for the GO annotation of the human proteome are taken from the GOA project at the EBI (this information is freely available; for more details see http://www.ebi.ac.uk/GOA/release.html).

As training data we will use the subset of these annotations that are associated with experimental evidence, that is, those with the evidence codes IPI, IGI, ISS, NAS, TAS, IDA, IMP and IEP. Not included in this list are IC (which is based on curator judgment) and ND (which means "no data"), because such annotations would be very difficult to trace back to a specific publication. Currently there are almost 8000 human proteins annotated in this way, and they are cited in about 9500 publications. The total number of publications provided in the training set will be smaller (most likely several hundred) because full text is only available for a limited number of journals.

The sub-parts of task 2 will be:

  1. 'Recover' text that proves the GO annotation:
    We will provide the protein, its GO annotation and the associated publication and the participants will have to provide a part of the document that would (to a human expert) prove the original annotation.

  2. Provide GO annotation for human proteins:
    We will provide the protein and the associated publication, and the participants will have to automatically 'annotate' the protein according to the information in this paper and provide a part of the document to prove the annotation.

  3. Selection of relevant papers:
    We will provide a protein and a (probably high) number of papers of which most are irrelevant for the protein. The participants will have to detect which papers are relevant for a protein in the sense that they contain information that would be suitable to derive a GO annotation and also provide these parts of the papers as evidence text that would be useful for annotation.

  4. Selection of relevant papers from the BMC journals:
    Similar to the previous task. The participants will have to detect which papers are relevant for a protein searching the content of the BMC journals (currently about 2550 publications).
    The content of the BMC journals is freely available on the BMC web-site in XML format [see http://www.biomedcentral.com/info/about/datamining/ for downloading instructions]. We believe that text mining will be one of the driving forces to change publishing policies and allow free access to scientific material, and that text mining will in turn greatly benefit from general free access to the full content of scientific publications. Therefore we included this exploratory task to assess how much valuable information (only according to our narrow definition in BioCreative) can be extracted from this already completely freely available content.

The evaluation of the first and second sub-parts is rather clear: participants have to provide supporting text from the documents and, in the second sub-part, correctly predict the GO annotations for the proteins. The third and fourth sub-parts will be evaluated in a more shallow way by the annotators, based on the 'usefulness' of the information that is extracted.

We designed this part of the competition in that way to:

Many of the issues of "noisy training data" will also apply to the Swiss-Prot task. For example, the Swiss-Prot group annotates "by protein" rather than by paper. Therefore a paper may or may not be fully annotated for all proteins occurring in it. This means that a system might generate annotations for a protein that had not been curated for that paper. This would look like a false positive, since there will be no GO annotations for that protein in the "answer key". We will take this into account in the evaluation of the results so that these apparent false positives are not penalized.
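The point about apparent false positives can be illustrated with a small Python sketch. It is only one possible way to handle the problem and not the organizers' evaluation procedure, which is not specified here; the function name, the input shapes and the three result categories are assumptions made for illustration. A prediction is set aside as "unjudged" whenever the answer key holds no curated annotation for that protein in that paper.

```python
def split_predictions(predictions, answer_key):
    """Sort predictions into correct, wrong and unjudged.

    predictions: iterable of (protein, pmid, go_id) tuples (hypothetical shape).
    answer_key:  dict mapping (protein, pmid) to the set of curated GO ids.
    """
    correct, wrong, unjudged = [], [], []
    for protein, pmid, go_id in predictions:
        curated = answer_key.get((protein, pmid))
        if curated is None:
            # The protein was never curated for this paper, so a missing
            # annotation says nothing about whether the prediction is right.
            unjudged.append((protein, pmid, go_id))
        elif go_id in curated:
            correct.append((protein, pmid, go_id))
        else:
            wrong.append((protein, pmid, go_id))
    return correct, wrong, unjudged
```

Only the "correct" and "wrong" lists would then enter a score, while the "unjudged" list could be passed to a curator for inspection.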

Evaluation set-up

For the Functional annotation task, the participants will be provided with a set of full text articles from a restricted number of journals that were used by the Swiss-Prot team to annotate the functions of human proteins; we will also provide a sufficiently large set of human proteins with their corresponding GO codes (and their associations to the articles). The test data will consist of blind (not yet released) annotations and the corresponding set of articles. The participants will build systems capable of identifying the relevant information for each protein in order to produce the required GO annotations and select the evidence text. The annotations will be validated by comparing to the GO annotations provided by the Swiss-Prot team.

Data

EBI GO annotations are freely available but one has to sign an agreement to obtain the full-text documents.

After signing the data access agreement, the participants will be provided with the full-text articles (most likely in SGML format) that have been used by Swiss-Prot curators to annotate human proteins according to GO. The second part of the data consists of the actual annotations for the proteins, which the participants download from the GOA project at the EBI in the following format (for further details see http://www.ebi.ac.uk/GOA/release.html):

P41220 RGS2_HUMAN GO:0005096 PMID:10747990 TAS F Regulator of G-protein signaling 2 IPI00013177

The relevant parts are the Swiss-Prot accession number and ID (first two columns), the GO annotation code and the PubMed ID of the corresponding publication. This file has to be filtered so that only lines remain that carry one of the evidence codes used in the evaluation (mentioned before) and that contain a reference to a PubMed ID.
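As a minimal sketch, the filtering of a single line could look as follows in Python. It assumes that the fields are separated by whitespace and appear in the order of the example line above (accession, ID, GO code, reference, evidence code); the downloaded file may use a different column layout, in which case the unpacking has to be adapted. The names Annotation and parse_annotation are ours and not part of any GOA tool.

```python
from typing import NamedTuple, Optional

# Evidence codes used in the evaluation (IC and ND are excluded).
ACCEPTED_CODES = {"IPI", "IGI", "ISS", "NAS", "TAS", "IDA", "IMP", "IEP"}


class Annotation(NamedTuple):
    accession: str  # Swiss-Prot accession number
    entry_id: str   # Swiss-Prot ID
    go_id: str      # GO annotation code
    pmid: str       # PubMed ID without the "PMID:" prefix
    evidence: str   # GO evidence code


def parse_annotation(line: str) -> Optional[Annotation]:
    """Return the relevant fields of a line, or None if the line is filtered out."""
    fields = line.split()
    if len(fields) < 5:
        return None
    # ASSUMPTION: column order as in the example line of this document.
    accession, entry_id, go_id, reference, evidence = fields[:5]
    if not reference.startswith("PMID:"):
        return None  # no reference to a PubMed ID
    if evidence not in ACCEPTED_CODES:
        return None  # evidence code not used in the evaluation
    return Annotation(accession, entry_id, go_id, reference[len("PMID:"):], evidence)


example = ("P41220 RGS2_HUMAN GO:0005096 PMID:10747990 TAS F "
           "Regulator of G-protein signaling 2 IPI00013177")
print(parse_annotation(example))
```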

Furthermore we provide the links between the PMIDs and the file names of the articles.

The participants should keep in mind the following details:

Test data and Evaluation

As described in the introduction the goals of the evaluation are to:

For the first part we will evaluate how close the prediction is in the GO hierarchy to the correct annotation. This will be measured as a semantic distance similar to that of Phillip Lord et al. (see http://www.russet.org.uk/); in other words, the closer the prediction is to the correct annotation, the better. We will use an "intelligent" measure to express this distance.
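To give an intuition of what "close in the GO hierarchy" can mean, the following Python sketch counts the edges on the shortest path between two terms through a shared ancestor. This is a deliberately simple stand-in: it is not the measure of Lord et al. and not necessarily the measure that will be used in the evaluation. The small hierarchy and its term names are hypothetical placeholders, not real GO terms.

```python
from collections import deque

# Toy hierarchy, child -> list of parents. A term may have several parents,
# as in GO. All identifiers are hypothetical.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "AB": ["A", "B"],
}


def ancestor_depths(term, parents):
    """Map every ancestor of term (and term itself) to its minimum edge count from term."""
    depths = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in depths:  # breadth-first, so the first visit is the shortest
                depths[parent] = depths[current] + 1
                queue.append(parent)
    return depths


def edge_distance(predicted, correct, parents):
    """Fewest edges between two terms via a shared ancestor, or None if there is none."""
    up_predicted = ancestor_depths(predicted, parents)
    up_correct = ancestor_depths(correct, parents)
    shared = set(up_predicted) & set(up_correct)
    if not shared:
        return None
    return min(up_predicted[t] + up_correct[t] for t in shared)
```

With this toy hierarchy an exact match has distance 0, a prediction of the direct parent has distance 1, and two sibling terms have distance 2. A measure based on information content would additionally take into account how specific the shared ancestor is.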

For the second to fourth parts, the curators from Swiss-Prot will evaluate how "useful" the extracted text is for deriving the correct annotation. This measure is somewhat subjective, but we think that this is the only realistic way of evaluating this part, because we want to know how useful a system is under these settings and not how good it is in reproducing a certain way of annotating the training data.

We are aware that the task defined here may be difficult or even impossible in some cases. In the following we give some examples of what the results can look like:

1. Clear cases:

2. Indirect cases:

One would first have to establish the relation between the individual proteins and the fact that they are all RGS proteins, and then interpret from the second sentence later in the text that these proteins are related to G protein signaling.

The first sentence establishes that CCR1 is related to a G-protein coupled receptor pathway and the second sentence states that MIP-1alpha binds to this receptor and it can be deduced that it is therefore also related to this process.

In this case one would have to know (from an external source) that CCR1 belongs to these classes of receptors to deduce the relation to immune response (that is questionable from this sentence anyway).

These examples show that it is feasible to retrieve GO annotations from the publications. There are many indirect ways to express the facts that are easy for a human expert to understand but difficult for an automatic system. Depending on the final results, we may take partial results into account if they are judged to be useful for a further annotation process.

The evaluation will be based on both parts (prediction of the GO annotation AND the text to support the prediction), and the participants will have to do well in both of them. Providing only the prediction for GO (step 1) is not valid, nor is providing only text with no (or a completely incorrect) GO annotation.

Summary: the training data

GOA data

Download all files in the directory ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/. These give you the associations between Swiss-Prot IDs, the GO annotations and the PubMed identifiers (PMIDs) on the one hand, and the mapping of Swiss-Prot IDs to other databases such as HUGO on the other.

There are in total 10 experimental evidence codes in GO: IPI, IGI, ISS, NAS, TAS, IC, ND, IDA, IMP and IEP (indicated in the file gene_association.goa_human.gz). We will use all evidence codes related to experimental evidence except IC (based on curator judgment) and ND (no data). Other evidence codes referring to non-experimental methods such as computer-based predictions (e.g. IEA, inferred from electronic annotation) will not be used. For a more detailed explanation of the GO evidence codes refer to http://www.geneontology.org/doc/GO.evidence.html.

Take the lines from gene_association.goa_human.gz that carry the tags mentioned in the previous paragraph and whose PMIDs are in our list (see below).
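This selection step could be sketched in Python as shown here. The sketch reads the compressed file directly and groups the retained annotations by PMID, which is convenient for pairing them with the articles later on. It assumes whitespace-separated fields in the order accession, ID, GO code, reference, evidence code; if the downloaded file has a different column layout the unpacking must be changed. The function names are ours.

```python
import gzip
from collections import defaultdict

# Evidence codes used in the evaluation (IC, ND and IEA are excluded).
ACCEPTED_CODES = {"IPI", "IGI", "ISS", "NAS", "TAS", "IDA", "IMP", "IEP"}


def index_by_pmid(lines, wanted_pmids):
    """Group (accession, GO code, evidence code) triples by PubMed ID."""
    by_pmid = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 5:
            continue
        # ASSUMPTION: accession, ID, GO code, reference, evidence code.
        accession, _entry_id, go_id, reference, evidence = fields[:5]
        if not reference.startswith("PMID:"):
            continue
        pmid = reference[len("PMID:"):]
        if evidence in ACCEPTED_CODES and pmid in wanted_pmids:
            by_pmid[pmid].append((accession, go_id, evidence))
    return dict(by_pmid)


def index_goa_file(path, wanted_pmids):
    """Apply index_by_pmid to a gzip-compressed association file."""
    # Text mode ("rt") decompresses and decodes the file line by line.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return index_by_pmid(handle, wanted_pmids)
```

A call such as index_goa_file("gene_association.goa_human.gz", pmids), where pmids is the set of PMIDs from our list held as strings, would then return one entry per publication.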

GO data

Go to http://www.geneontology.org/doc/index.shtml and download the three GO ontologies (Molecular Function, Biological Process and Cellular Component) to map the GO codes from the GOA file (gene_association.goa_human.gz) to the GO terms.
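A possible way to build the mapping from GO codes to GO terms is sketched below in Python. It assumes that the ontology file consists of OBO-style stanzas, that is, "[Term]" blocks with "id:" and "name:" lines. This is an assumption about the file format: the ontology files you download may be in a different flat-file layout, in which case the parsing has to be adapted. The function name is ours.

```python
def read_go_names(lines):
    """Return a dict mapping GO codes to term names from OBO-style lines."""
    names = {}
    in_term = False
    current_id = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas describe GO terms.
            in_term = (line == "[Term]")
            current_id = None
        elif in_term and line.startswith("id:"):
            current_id = line[len("id:"):].strip()
        elif in_term and line.startswith("name:") and current_id:
            names[current_id] = line[len("name:"):].strip()
    return names


sample = """\
[Term]
id: GO:0005096
name: GTPase activator activity
"""
print(read_go_names(sample.splitlines()))
```

The resulting dictionary can be used to look up the term for each GO code found in gene_association.goa_human.gz.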

Gene/protein names

We do not provide a fixed name dictionary for the contest. The participants should collect their own lists of gene and protein names and synonyms from the publicly available sources like Swiss-Prot, HUGO or LocusLink.

List of available documents

Only part of the documents that we received are relevant for the evaluation task (the documents that were used by the GO annotators at Swiss-Prot for their annotations). We selected these documents and provide them to the participants.

Get the full text of the documents provided by the Journal of Biological Chemistry

Send an e-mail to Christian Blaschke with a copy of the non-distribution agreement of the data [link to MITRE], stating that you accept the conditions, and fill in the form. We will then answer this message with a user name and password for downloading the data.

The format of the files is SGML (Standard Generalized Markup Language), which is similar to XML (but more general). You can find the HighWire DTD descriptor file to interpret the document structure of the files at the HighWire web-site.

Get the full text of the documents from the BMC journals

The BMC content is freely available. Please check http://www.biomedcentral.com/info/about/datamining/ for downloading instructions to obtain the content of the BMC journals in XML format.

Contact

Please contact blaschke@cnb.uam.es for specific inquiries regarding task 2 and BioCreative@gredos.cnb.uam.es for general questions about the evaluation procedure, or post a message to the discussion list [to be announced] if the issue is of general interest and you would like to reach all the participants or start a discussion.

Summary: test data and results

The test data will be available sometime in November (the exact date will be influenced by, amongst other factors, how the evaluation is going and the completeness of the test data).

The exact format of the results will be defined later on.