CNB - Protein Design Group · The MITRE Corporation
Please refer to the new BioCreAtIvE homepage for information on this text mining challenge. This is the original homepage of the first BioCreAtIvE challenge.
Many groups are now working in the area of text mining. However, despite increased activity in this area, there are no common standards or shared evaluation criteria to enable comparison among the different approaches. The various groups are addressing different problems, often using private data sets, and as a result, it is impossible to determine how good the existing systems are, whether they will scale to real applications, and what performance can be expected.
This is similar to the situation in text processing in the late eighties, prior to the introduction of the Message Understanding Conferences (MUC). Researchers in natural language processing (NLP) and information extraction (IE) have, for many years now, used common evaluations such as MUC and the Text Retrieval Conferences (TREC) to accelerate their research progress. These evaluations led to the formulation of common goals and gave a certain transparency to the field. With a common evaluation and standardized evaluation metrics, it became possible to compare approaches, to assess which techniques did and did not work, and to make progress. This progress resulted in the creation of standard tools available to the general research community.
The field of bioinformatics also has a tradition of competitions, for example, in protein structure prediction (CASP [CASP]) or gene predictions in entire genomes (at the "Genome Based Gene Structure Determination" symposium held on the Wellcome Trust Genome Campus).
As mentioned above, there has been a lot of activity in the field of text mining in biology (sessions at the Pacific Symposium on Biocomputing, as well as workshops [1,2] and sessions on language and biology in computational linguistics [3,4]), but thorough evaluation of text mining systems in biology has only just begun (see the recent evaluation for the KDD cup [KDD] and the genomics track at this year's TREC conference [TREC]). We have therefore decided to set up an assessment of text mining systems in biology by defining a common task, common data sets and a clearly defined evaluation.
We believe that the field is ripe for this experiment. Therefore the BioLINK group (Biological Literature, Information and Knowledge, [BioLINK]) is organizing a CASP-like evaluation for the text data mining community applied to biology: BioCreAtIvE - Critical Assessment of Information Extraction systems in Biology. Following the experience of CASP, the emphasis will be more on the comparison of methods and the community assessment of scientific progress, rather than on the purely competitive aspects.
Our aim is to define "biologically meaningful" tasks - tasks that would be recognized by biologists as a contribution to their work and that constitute a meaningful challenge for current text mining systems. We have selected tasks where "gold standard" data for training and test can be made available in sufficiently large quantities with modest investment. This has led us to focus on the use of existing expert-curated data from existing biological databases as a source for gold-standard data.
Through discussion at various meetings over the past several years, we have identified two classes of tasks of interest to both researchers and practitioners. The first of these is "entity identification" in text -- that is, the ability to find mentions of relevant biological entities (genes, proteins, small molecules, chemicals, tissues, etc.) in running text. This task enables accurate indexing of entities within articles; it also takes the first step towards the more ambitious task of relation extraction and, eventually, pathway discovery from the literature. The second task is more ambitious and focuses on the automatic functional annotation of proteins using the Gene Ontology (GO) classes [GO].
In principle, we do not limit the data sources that systems may use, because we want this to be a test of a real-world situation and do not want to exclude any method (text-based or not) or any data source. However, since this is an evaluation of automated processing, manually generated information that is specifically about any of the texts in the test set is not allowed. Note that this includes MEDLINE/PubMed information such as MeSH headings and enzyme numbers.
To ensure that the results come from an automated system (as opposed to smart biologists doing the annotations by hand), we have designed the following procedure:
Participants will have about 3-4 months to build a system, and the final blind test data will be released in the fall (November 2003). The groups will return their results a few days later. The results will be automatically evaluated and the evidence will be checked by Swiss-Prot annotators. After the evaluation, the results will be discussed in a workshop (spring 2004 in Granada, Spain). The focus of the meeting will be the discussion of the different methods and the results obtained in the experiment.
For any further questions regarding the competition please contact one of the organisers: Christian Blaschke, Lynette Hirschman or Alfonso Valencia.
Conference e-mail: BioCreative@gredos.cnb.uam.es
Mailing list for discussion of the competition: to be announced in the BioLink mailing list, contact BioLink-owner@listas.cnb.uam.es to be included in the list.
The goal in defining this task was to provide a way of assessing the ability of an automated system to identify the genes (or proteins, where there is ambiguity) mentioned in text. The "natural language processing" or MUC version of this task has required that a system identify each mention of a gene-or-protein in the text. We take here a more biologically-motivated approach that measures the ability of a system to identify the list of genes mentioned in text, where each gene mention is mapped to its unique name or symbol. This task has the advantage that it is performed by expert human curators in many databases. This means that it is easy to obtain "ground truth" data for training, by downloading sets of curated articles and the corresponding gene list for each article. This task assumes that we will make available a (reasonably) comprehensive lexicon of standard gene names and their synonyms. There will be data and resources for identifying gene lists in text from three model organisms: Fly, Yeast and Mouse.
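The gene-list idea described above can be illustrated with a minimal sketch. It assumes a toy lexicon that maps each unique identifier to its names and synonyms; the identifiers, synonyms, normalization rule and function names are all hypothetical and are not the lexicon or the method provided by the organizers.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: unique identifier -> known names and synonyms.
# These entries are invented for illustration only.
LEXICON = {
    "GENE:0001": ["abc1", "abc-1", "alpha binding component 1"],
    "GENE:0002": ["xyz"],
}


def normalize(name):
    """Lower-case and drop hyphens and whitespace so spelling variants compare equal."""
    return re.sub(r"[\s\-]+", "", name.lower())


def build_index(lexicon):
    """Map each normalized synonym to the set of identifiers that use it."""
    index = defaultdict(set)
    for gene_id, names in lexicon.items():
        for name in names:
            index[normalize(name)].add(gene_id)
    return index


def gene_list(text, index, max_tokens=4):
    """Return the sorted list of unique identifiers whose synonyms occur in text."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", text)
    found = set()
    for start in range(len(tokens)):
        for length in range(1, max_tokens + 1):
            candidate = normalize(" ".join(tokens[start:start + length]))
            ids = index.get(candidate, set())
            if len(ids) == 1:  # skip synonyms shared by several genes
                found |= ids
    return sorted(found)


if __name__ == "__main__":
    index = build_index(LEXICON)
    print(gene_list("Abc-1 interacts with the Alpha binding component 1 and XYZ.", index))
```

Note that the output is one identifier per gene, however many times or in however many spellings the gene is mentioned; a name that is missing from the lexicon is simply not found by a lookup of this kind.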
For more details and how to obtain the training data go to http://www.mitre.org/public/biocreative/
The second task will address the assignment of GO annotations to human proteins [GOA]. This is currently done by curators at Swiss-Prot [SWP] for the human genome, who have agreed to make these annotations available for use as training and test sets.
For this task the full text of the journal articles will be used, because most of the information is contained in the body of the paper and not in the abstract alone. The number of publications provided in the training set may be small (most likely a few hundred) because full text will only be provided by a limited number of journals.
The sub-parts of task 2 will be:
For the second sub-part we will evaluate how close the prediction is in the GO hierarchy to the correct annotation. For all parts the curators from Swiss-Prot will evaluate how "useful" the extracted text is for deriving the correct annotation. This usefulness measure is somewhat subjective, but we think it is realistic: we want to know how useful a system is in this setting, not how well it reproduces a particular way of annotating the training data.
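One possible reading of "how close the prediction is in the GO hierarchy" is the number of is-a steps between the predicted and the correct term through their nearest shared ancestor. The sketch below assumes that reading; the toy hierarchy, the term names and the function names are hypothetical, and the measure actually used in the evaluation is not specified here.

```python
from collections import deque

# Hypothetical toy is-a hierarchy (child -> parents); these are not real GO terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "protein binding": ["binding"],
    "dna binding": ["binding"],
    "kinase activity": ["catalysis"],
}


def ancestor_distances(term, parents):
    """Return {ancestor: fewest is-a steps from term}, with term itself at 0."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents[node]:
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def hierarchy_distance(predicted, correct, parents):
    """Fewest is-a steps linking the two terms via a shared ancestor, or None."""
    up_pred = ancestor_distances(predicted, parents)
    up_corr = ancestor_distances(correct, parents)
    shared = set(up_pred) & set(up_corr)
    if not shared:
        return None
    return min(up_pred[a] + up_corr[a] for a in shared)


if __name__ == "__main__":
    print(hierarchy_distance("protein binding", "dna binding", PARENTS))
```

Under this assumed measure an exact match scores 0, a prediction of the parent term scores 1, and sibling terms score 2, so smaller values mean predictions closer to the curated annotation.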
The evaluation will be based on the evidence text provided and, in sub-part 2, also on the predicted GO code. In that sub-part participants must do well on both: a GO prediction without evidence text is not valid, and neither is evidence text with no (or a completely incorrect) prediction of the GO code.
For more details and how to obtain the training data go to http://www.pdg.cnb.uam.es/BioLink/BioCreative_task2.html
The evaluation period for all tasks has been set for Dec 1 - Dec 5.
Participants wishing to obtain the evaluation data should send an email message to the administrators if they have not already done so and explain which part or parts they would like to participate in (Task 1A, Task 1B, and Task 2). Participants will be given access to the data on Monday, December 1, via an email message advising them how to obtain the data and how to submit results.* No human should examine the evaluation test set or the submitted results until after the evaluation deadline. The final deadline for submission will be Friday, 5 December, 23:59 GMT.
Details about how to submit results will be included with links to the evaluation data. We will most likely provide a unique user name and password to each participant to a server where participants will be able to place their submissions.
Results on system performance will be returned to participants in January. All participants will be invited to attend the evaluation workshop to be held at the end of March in Granada, Spain and will be given approximately a month to draft an abstract for the workshop.
We hope that this will provide enough time for everyone to participate.
*Special note/reminder: the documents which will be part of Task 2.3 and Task 2.4 are already available for download. The list of proteins will be emailed on November 13. The texts for Task 2.3 (Journal of Biological Chemistry, ~30,000 documents) are here: http://www.pdg.cnb.uam.es/BioLINK/data/JournalEditions/JBC/ and the texts for Task 2.4 (BioMedCentral, ~3,500 documents) are here: http://www.pdg.cnb.uam.es/BioLINK/data/JournalEditions/BioMedCentral/
The choice of full paper vs. abstract affects cost, data quantity, and data quality. Abstracts are readily available (in large quantities) via PubMed, in a standard ASCII format. On the other hand, most (though not all) curation is done on full text, not just the abstract. However, full text is more difficult to obtain (because of copyright issues and the absence of a central repository like PubMed), and it comes with many problems: differences in typography, as well as differing document structures and formats (PDF, HTML, XML, ...). Thus abstracts are much easier to work with, but full text is probably more realistic.
Each database has its specific focus, e.g., the Mouse database [MUS] integrates biological information about mouse genes and mouse mutant allele phenotypes. Basic information about orthologous genes in other mammalian species is included, although not extensively curated. In FlyBase [DROS], gene expression information (information about a gene's associated transcripts and proteins) is curated for a gene's wild-type alleles, but not for the gene's mutant alleles. In general, genes from other organisms or genes mentioned as background do not appear in the gene lists and are not curated.
This raises the possibility that an automated system might correctly detect a gene mentioned in an abstract, but this gene might not appear in the gene list generated by the curators. The test sets will be hand-annotated to avoid these problems; however, the training data will be provided with "noisy" annotation -- annotations derived from the curated databases through application of some simple filtering rules. We may be able to "clean" these annotations further as the evaluation progresses.
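The effect of "noisy" annotation on scoring can be made concrete with a small sketch. It assumes that gene lists are compared by precision and recall, which is an assumption made here for illustration rather than the stated scoring procedure; the identifiers and the function name are hypothetical.

```python
def precision_recall(predicted, gold):
    """Compare two gene lists as sets and return (precision, recall)."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: the system finds three genes that are really in the text,
# but the curated (noisy) list omits one of them, e.g. a gene mentioned only as background.
system_output = {"GENE:0001", "GENE:0002", "GENE:0003"}
noisy_gold = {"GENE:0001", "GENE:0002"}
clean_gold = {"GENE:0001", "GENE:0002", "GENE:0003"}

if __name__ == "__main__":
    print(precision_recall(system_output, noisy_gold))
    print(precision_recall(system_output, clean_gold))
```

Against the noisy list the correctly detected third gene counts as an error and lowers precision, while against a hand-annotated list the same output is scored as fully correct.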
Genes are constantly being discovered and named. To assist developers, we have taken a snapshot of the nomenclature resources of the three databases that we are using and used this to assemble a lexicon: a comprehensive list of gene names and synonyms for each organism. However, because it is only a snapshot, it is almost immediately out of date. This means that there may be novel names that are not in the lexicon provided, and that therefore have no canonical form. This is an artifact of lexicon generation. The model organism databases address this problem in varying ways. MGI, for example, works with the International Committee on Standardized Nomenclature in Mouse to meet and develop nomenclature standards (in conjunction, also, with the HUGO nomenclature committee for human genes). MGI also updates the files on its FTP server nightly, to provide AccID, symbol, name, synonyms, sequence IDs, ortholog IDs and other information.
BioCreative is, we believe, the first evaluation of text mining systems that is strongly biologically motivated. We are aware, however, that only history can judge the impact of our efforts; it may be the first, but it will certainly not be the last initiative of this type.
Some relevant problems that we do not address are: detection of other entities like chemical substances or tissue and cell types, detection of entity relations like protein-protein, gene-disease or disease-symptom and tasks going beyond fact extraction like classification and summarization of information for a given protein.
Until very recently, text mining in biology has been performed solely on abstracts. We hope to demonstrate that a lot of critical information is contained in the body of the publications and that access to full text will be crucial for the further development of text mining systems.
[1] Language Modeling of Biological Data, University of Pennsylvania, February 2001: http://www.ircs.upenn.edu/modeling2001/
[2] Workshop on Natural language Processing and Ontology Building, University of Tokyo, February 2002: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/WS.html
[3] Special Session on Language Processing and Biological Data, Human Language Technology Workshop, March 2002.
[4] Workshops on Natural Language Processing in the Biomedical Domain, Association of Computational Linguistics, July 2002: http://www.ccs.neu.edu/home/futrelle/bionlp/acl02/BIO/ and July 2003: http://www-tsujii.is.s.u-tokyo.ac.jp/ACL03/bionlp.htm
[CASP] Critical Assessment of techniques for protein Structure Prediction, http://predictioncenter.llnl.gov/
[KDD] http://www.biostat.wisc.edu/~craven/kddcup/
[TREC] http://trec.nist.gov/
[BioLINK] Biological Literature, Information and Knowledge, http://www.pdg.cnb.uam.es/BioLINK/
[GO] http://www.geneontology.org/
[GOA] http://www.ebi.ac.uk/GOA/release.html
[SWP] http://www.ebi.ac.uk/swissprot/
[MUS] http://www.informatics.jax.org
MGI curates expression data, and also provides detailed reports for over 10,000 mutant alleles for mouse genes and detailed annotations for all genes related to map location, molecular attributes, sequence representations, and phenotypes. MGI provides a gene index with official nomenclature for all mouse genes. It currently represents over 32,000 genes (out of a total of over 53,000 genetic markers). MGI works collaboratively with genome centers to integrate the mouse biology data with the genome browsers (such as the UCSC browser), and MGI also works in collaboration with the human and rat gene nomenclature committees to provide standard nomenclature for orthologs. These official nomenclatures are incorporated into NCBI, SWISS-PROT and other resources and are updated daily. The confusion in nomenclature arises from the fact that journals, with some exceptions, do not require official nomenclature or a representative accession ID for the genes described in the literature.
[DROS] http://flybase.bio.indiana.edu/