The second task will address the automatic assignment of GO annotations (http://www.geneontology.org/) to human proteins. This is currently done by curators at Swiss-Prot for the human genome, who have agreed to make these annotations available for use as training and test sets. This task requires that participants build an automated system that identifies the relevant information for each protein and produces the required GO annotations. These can be validated against the GO annotations in Swiss-Prot. As part of the task, the systems will be asked to provide a pointer into the text of the article that provides supporting evidence for the annotation. The Swiss-Prot curators will inspect these evidence links to determine their utility.
The data for the GO annotation of the human proteome are taken from the GOA project at the EBI (this information is freely available; for more details see http://www.ebi.ac.uk/GOA/release.html).
As training data we will use a subset of these annotations that are associated with experimental evidence. These are IPI, IGI, ISS, NAS, TAS, IDA, IMP and IEP. Not included in this list are IC (which is based on curator judgment) and ND (which means "no data"), because they would be very difficult to trace back to a specific publication. Currently there are almost 8000 human proteins annotated in this way, and they are cited in about 9500 publications. The total number of publications provided in the training set will be smaller (most likely several hundred) because full text is only available for a limited number of journals.
The sub-parts of task 2 will be:
'Recover' text that proves the GO annotation:
We will provide the protein, its GO annotation and the associated publication
and the participants will have to provide a part of the document that would
(to a human expert) prove the original annotation.
Provide GO annotation for human proteins:
We will provide the protein and the associated publication and the
participants will have to 'annotate' automatically the protein according
to the information in this paper and provide a part of the document to
prove the annotation.
Selection of relevant papers:
We will provide a protein and a (probably high) number of papers of
which most are irrelevant for the protein. The participants will have to
detect which papers are relevant for a protein in the sense that they contain
information that would be suitable to derive a GO annotation and also provide
these parts of the papers as evidence text that would be useful for annotation.
Selection of relevant papers from the BMC journals:
Similar to the previous task. The participants will have to detect
which papers are relevant for a protein searching the content of the BMC
journals (currently about 2550 publications).
The content of the BMC journals is freely available on the BMC
web-site in XML format [see http://www.biomedcentral.com/info/about/datamining/
for downloading instructions]. We believe that text mining will be one
of the driving forces to change publishing policies and allow free access
to scientific material and on the other hand text mining will greatly benefit
from a general free access to the full content of scientific publications.
Therefore we included this exploratory task to assess how much valuable
information (only according to our narrow definition in BioCreative) can be
extracted from this already completely freely available content.
The evaluation of the first and second sub-parts is rather clear: participants have to provide supporting text from the documents and, in the second sub-part, correctly predict the GO annotations for the proteins. The third and fourth sub-parts will be evaluated in a more shallow way by the annotators, based on the 'usefulness' of the information that is extracted.
We designed this part of the competition in this way to:
allow the application of a wide range of text mining and information extraction techniques (from purely statistical to sophisticated NLP methods) and make the task open to a large number of groups working in this field.
evaluate the results according to an established standard (GO).
address a real problem in the field of biological database curation.
Many of the issues of "noisy training data" will also apply to the Swiss-Prot task. For example, the Swiss-Prot group annotates "by protein" rather than by paper. Therefore a paper may or may not be fully annotated for all proteins occurring in the paper. This means that a system might generate annotations for a protein that had not been curated for that paper. This would look like a false positive, since there will be no GO annotations for that protein in the "answer key". We will take this into account in the evaluation of the results to not penalize these false 'false positives'.
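As an illustration only, one possible reading of this rule is sketched below: a wrong prediction counts as a false positive only when the (protein, paper) pair was actually curated, and predictions for uncurated pairs are set aside rather than penalized. The function name, the tuple layout and the sample data are assumptions made for this sketch and do not describe the actual evaluation software or answer key.

```python
def score_predictions(predicted, answer_key, curated_pairs):
    """Split predictions into true positives, false positives and unscored.

    predicted, answer_key: sets of (protein, pmid, go_id) tuples.
    curated_pairs: set of (protein, pmid) pairs that the curators annotated.
    """
    true_pos = predicted & answer_key
    wrong = predicted - answer_key
    # Predictions for a protein that was never curated for that paper are
    # "false false positives": they are set aside instead of being penalized.
    unscored = {p for p in wrong if (p[0], p[1]) not in curated_pairs}
    false_pos = wrong - unscored
    return true_pos, false_pos, unscored


# Illustrative data only (not the real answer key).
answer_key = {("RGS4", "10747990", "GO:0005516")}
curated_pairs = {("RGS4", "10747990")}
predicted = {
    ("RGS4", "10747990", "GO:0005516"),   # matches the answer key
    ("RGS4", "10747990", "GO:0006915"),   # curated pair, wrong GO code
    ("RGS10", "10747990", "GO:0005516"),  # pair not curated for this paper
}
tp, fp, unscored = score_predictions(predicted, answer_key, curated_pairs)
```

Under this reading, only the second prediction would lower a system's precision.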
For the Functional annotation task, the participants will be provided with a set of full text articles from a restricted number of journals that were used by the Swiss-Prot team to annotate the functions of human proteins; we will also provide a sufficiently large set of human proteins with their corresponding GO codes (and their associations to the articles). The test data will consist of blind (not yet released) annotations and the corresponding set of articles. The participants will build systems capable of identifying the relevant information for each protein in order to produce the required GO annotations and select the evidence text. The annotations will be validated by comparing to the GO annotations provided by the Swiss-Prot team.
EBI GO annotations are freely available but one has to sign an agreement to obtain the full-text documents.
After signing the data access agreement the participants will be provided with the full-text articles most likely in SGML format that have been used by Swiss-Prot curators to annotate human proteins according to GO. The second part of the data consists of the actual annotations for the proteins that the participants download from the GOA project at the EBI in the following format (for further details see http://www.ebi.ac.uk/GOA/release.html):
P41220 RGS2_HUMAN GO:0005096 PMID:10747990 TAS F Regulator of G-protein signaling 2 IPI00013177
The relevant parts are the Swiss-Prot accession number and ID (first two columns), the GO annotation code and the PubMed id of the corresponding publication. This file has to be filtered by the evidence codes that are used in the evaluation (mentioned before) and for lines that contain a reference to a PubMed ID.
Furthermore we provide the links between the PMIDs and the file names of the articles.
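The filtering step described above can be sketched as follows. This is a minimal example that assumes the whitespace-separated layout of the sample line shown above; the real GOA release file may use a different delimiter or additional columns, so the positional parsing would need to be adapted. The function names and the optional `pmids` argument are hypothetical.

```python
# Evidence codes used in the evaluation (IC, ND and IEA are excluded).
ALLOWED_EVIDENCE = {"IPI", "IGI", "ISS", "NAS", "TAS", "IDA", "IMP", "IEP"}


def parse_goa_line(line):
    """Parse one line in the layout of the sample above; None if too short."""
    fields = line.split()
    if len(fields) < 6:
        return None
    accession, entry_id, go_id, reference, evidence, aspect = fields[:6]
    return {
        "accession": accession,   # Swiss-Prot accession number
        "entry_id": entry_id,     # Swiss-Prot ID
        "go_id": go_id,           # GO annotation code
        "reference": reference,   # e.g. "PMID:10747990"
        "evidence": evidence,     # GO evidence code
        "aspect": aspect,         # one-letter code for the part of GO
    }


def keep_for_task(record, pmids=None):
    """Keep records with an allowed evidence code and a PubMed reference.

    If pmids is given, the PubMed ID must also be in that set.
    """
    if record is None or record["evidence"] not in ALLOWED_EVIDENCE:
        return False
    if not record["reference"].startswith("PMID:"):
        return False
    if pmids is not None and record["reference"][len("PMID:"):] not in pmids:
        return False
    return True


example = ("P41220 RGS2_HUMAN GO:0005096 PMID:10747990 TAS F "
           "Regulator of G-protein signaling 2 IPI00013177")
record = parse_goa_line(example)
```

Passing the set of PubMed IDs for which full text was provided as `pmids` restricts the result to annotations that can be traced to an available article.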
The participants should keep in mind the following details:
we do not provide protein name dictionaries, i.e. the name of a protein in the GOA file may not be used in the associated documents but rather a synonym that may be found in Swiss-Prot or in other databases. It is the responsibility of the participants to collect synonym lists to detect the protein names correctly in the documents;
GO consists of three (non overlapping) parts (molecular function, biological process, cellular component) that are treated separately;
one protein can have many functions (be part of many processes, be localized in different places in the cell) and can therefore appear many times in the corresponding parts of GO;
the function of a protein (its molecular processes, cellular components) can be described in many different articles;
one article can describe different functions (processes, components) of the same protein AND/OR mention a number of proteins of which all or just a subset are relevant in our evaluation task;
full-text articles are long and in general only a (small) section of the whole paper is relevant for classification of a certain protein (maybe a paragraph or two).
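To illustrate the first and last points, the sketch below looks up a protein by any of its known names and returns only the paragraphs that mention it. It assumes the article has already been converted to plain text with blank lines between paragraphs (stripping the SGML markup is not shown), and the synonym table is a hand-written stand-in for a list that participants would compile themselves; the function name is hypothetical.

```python
import re

# Hand-written stand-in for a synonym list collected by the participant.
SYNONYMS = {
    "RGS2_HUMAN": ["RGS2", "Regulator of G-protein signaling 2"],
}


def candidate_paragraphs(text, entry_id, synonyms=SYNONYMS):
    """Return the paragraphs of text that mention entry_id by any synonym."""
    names = synonyms.get(entry_id, [])
    if not names:
        return []
    # Longest names first, and no word character on either side, so that
    # "RGS2" does not match inside a longer token such as "RGS20".
    alternatives = "|".join(
        re.escape(name) for name in sorted(names, key=len, reverse=True)
    )
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [p for p in paragraphs if pattern.search(p)]


article = (
    "We report that calmodulin binds in a Ca2+-dependent manner to all RGS "
    "proteins we tested, including RGS1, RGS2, RGS4, RGS10, RGS16, and GAIP\n"
    "\n"
    "Induction of Apoptosis by Thrombin"
)
hits = candidate_paragraphs(article, "RGS2_HUMAN")
```

Simple string matching of this kind will miss spelling variants and indirect references, so it is at best a first filter before any further analysis.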
As described in the introduction the goals of the evaluation are to:
provide text from documents that support GO annotations (applies to parts 1-4);
predict GO annotations of proteins for all the papers that are provided with that protein (these can be different annotations for a protein in each of the papers; only part 2);
select papers that contain information relevant for GO annotation (applies to parts 3 and 4);
For the first part we will evaluate how close the prediction is in the GO hierarchy to the correct annotation. This will be measured in semantic distance similar to Phillip Lord et al. (see http://www.russet.org.uk/), or, in other words, the closer the prediction is to the correct annotation, the better. We will use an "intelligent" measure to express this distance.
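To make the idea of distance in the hierarchy concrete, the sketch below counts the is-a steps between two terms via their closest common ancestor. This is only a simple edge-counting stand-in chosen for illustration; it is not the measure of Lord et al. and not necessarily the measure that will be used in the evaluation. The hierarchy is a small hand-written fragment, not loaded from the GO files, and the function names are hypothetical.

```python
from collections import deque

# Hand-written toy fragment of an is-a hierarchy (child -> list of parents).
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "protein binding": ["binding"],
    "calmodulin binding": ["protein binding"],
    "catalytic activity": ["molecular_function"],
}


def ancestor_steps(term):
    """Map each ancestor of term (and term itself) to its minimum is-a steps."""
    steps = {term: 0}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in PARENTS.get(current, []):
            if parent not in steps:
                steps[parent] = steps[current] + 1
                queue.append(parent)
    return steps


def edge_distance(term_a, term_b):
    """Shortest path between two terms through a common ancestor, or None."""
    steps_a, steps_b = ancestor_steps(term_a), ancestor_steps(term_b)
    common = set(steps_a) & set(steps_b)
    if not common:
        return None
    return min(steps_a[t] + steps_b[t] for t in common)


near = edge_distance("calmodulin binding", "protein binding")
far = edge_distance("calmodulin binding", "catalytic activity")
```

In this toy fragment, predicting "protein binding" for a protein annotated with "calmodulin binding" is one step away, whereas predicting "catalytic activity" is four steps away, so the former would count as the closer prediction.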
For the second to fourth parts, the curators from Swiss-Prot will evaluate how "useful" the extracted text is for deriving the correct annotation. This measure is somewhat subjective, but we think that this is the only realistic way of evaluating this part, because we want to know how useful a system is under these settings and not how good it is in reproducing a certain way of annotating the training data.
We are aware that the task defined here may be difficult or even impossible in some cases. In the following we give some examples of what the results can look like:
1. Clear cases:
Protein: RGS4
GO:0005516 – calmodulin binding activity
PMID: 10747990
Text: 'Indeed, Ca2+/calmodulin binds a complex of RGS4 and a transition state analog of Galpha i1-GDP-AlF4-.'
Protein: p21waf/cip1
GO:0008285 – negative regulation of cell proliferation
PMID: 10692450
Text: 'The p21waf/cip1 protein is a universal inhibitor of cyclin kinases and plays an important role in inhibiting cell proliferation.'
Protein: Thrombin
GO:0006915 – apoptosis
PMID: 10692450
Text: 'Induction of Apoptosis by Thrombin'
2. Indirect cases:
Proteins: RGS1,RGS2,RGS4,RGS16
GO:0008277 – regulation of G-protein coupled receptor protein signaling pathway
PMID: 10747990
Text: 'We report that calmodulin binds in a Ca2+-dependent manner to all RGS proteins we tested, including RGS1, RGS2, RGS4, RGS10, RGS16, and GAIP' and later in the text 'To investigate the role of Ca2+ in feedback regulation of G protein signaling by RGS proteins, we characterized ...'.
One would first have to establish the relation between the individual proteins and the fact that they are all RGS proteins, and then infer from the second sentence later in the text that these proteins are related to G protein signaling.
Protein: MIP-1alpha
GO:0007186 – G-protein coupled receptor protein signaling pathway
PMID: 10734056
Text: 'Taken together, these results indicate that CCR1-mediated responses are regulated at several steps in the signaling pathway, by receptor phosphorylation at the level of receptor/G protein coupling and by an unknown mechanism at the level of phospholipase C activation' and later 'In this study, the CCR1 receptor, which binds RANTES, MIP-1alpha , MCP-2, and MCP-3 with high affinity'.
The first sentence establishes that CCR1 is related to a G-protein coupled receptor pathway and the second sentence states that MIP-1alpha binds to this receptor and it can be deduced that it is therefore also related to this process.
Protein: CCR1
GO:0006955 – immune response
PMID: 10734056
Text: 'Thus, the ability of such classes of CC and CXC chemokine receptors to selectively cross-regulate each other at multiple levels may be physiologically relevant in controlling immune response'
In this case one would have to know (from an external source) that CCR1 belongs to these classes of receptors to deduce the relation to immune response (that is questionable from this sentence anyway).
These examples show that it is feasible to retrieve GO annotations from the publications. There are many indirect ways to express the facts that are easy for a human expert to understand but difficult for an automatic system. Depending on the final results, we may take partial results into account if they are judged to be useful for a further annotation process.
The evaluation will be based on both parts (prediction of the GO annotation AND the text to support the prediction), and the participants will have to do well in both of them. Providing only the prediction for GO (step 1) is not valid, as would be only providing text but no (or completely incorrect) annotation in GO.
Download all files in the directory ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/. These give you, on the one hand, the associations between Swiss-Prot IDs, the GO annotations and the PubMed identifiers (PMID) and, on the other hand, the mapping of Swiss-Prot IDs to other databases such as HUGO.
There are in total 10 experimental evidence codes in GO: IPI, IGI, ISS, NAS, TAS, IC, ND, IDA, IMP and IEP (indicated in the file gene_association.goa_human.gz). We will use all evidence codes related to experimental evidence except IC (based on curator judgment) and ND (no data). Other evidence codes referring to non-experimental methods such as computer-based predictions (e.g. IEA - inferred from electronic annotation) will not be used. For a more detailed explanation of the GO evidence codes refer to http://www.geneontology.org/doc/GO.evidence.html.
Take the lines from gene_association.goa_human.gz that carry one of the evidence codes selected in the previous paragraph and whose PMIDs are in our list (see below).
Go to http://www.geneontology.org/doc/index.shtml and download the three GO ontologies (Molecular Function, Biological Process and Cellular Component) to map the GO codes from the GOA file (gene_association.goa_human.gz) to the GO terms.
We do not provide a fixed name dictionary for the contest. The participants should collect their own lists of gene and protein names and synonyms from the publicly available sources like Swiss-Prot, HUGO or LocusLink.
Only part of the documents that we received are relevant for the evaluation task (the documents that were used by the GO annotators at Swiss-Prot for their annotations). We selected these documents and provide them to the participants.
Send an e-mail to Christian Blaschke with a copy of the non-distribution agreement of the data [link to MITRE], stating that you accept the conditions, and fill in the form. We will then answer this message with a user name and password for downloading the data.
The format of the files is SGML (Standard Generalized Markup Language), which is similar to XML (but more general). You can find the HighWire DTD descriptor file to interpret the document structure of the files at the HighWire web-site.
The BMC content is freely available. Please check http://www.biomedcentral.com/info/about/datamining/ for downloading instructions to obtain the content of the BMC journals in XML format.
Please contact blaschke@cnb.uam.es for specific inquiries regarding task 2 and BioCreative@gredos.cnb.uam.es for general questions about the evaluation procedure, or post a message in the discussion list [to be announced] if the issue is of general interest and you would like to reach all the participants or start a discussion.
The test data will be available sometime in November (the exact time will be influenced, among other factors, by how the evaluation is going and by the completeness of the test data).
The exact format of the results will be defined later on.