A critical assessment of text mining methods in molecular biology 

Granada, Spain, March 28 - March 31, 2004

Organizers
BioCreative Organizers
 

 
 
  • Alfonso Valencia, CNB/CSIC, Universidad Autónoma de Madrid
  • Christian Blaschke, CNB/CSIC, Universidad Autónoma de Madrid
  • Lynette Hirschman, The MITRE Corporation
  • Alex Yeh, The MITRE Corporation
  • Alex Morgan, The MITRE Corporation
  • Marc Colosimo, The MITRE Corporation
  • Jeff Colombe, The MITRE Corporation

Our collaborators

We are very grateful to the people who supported us in preparing the training data and in evaluating the results. We especially want to thank Lorrie Tanabe and John Wilbur from NCBI for providing data for task 1A, and Evelyn Camon and her team at the EBI for providing us with data for task 2 and examining the results.
Overview
Progress in automating experimental approaches in recent years, such as yeast two-hybrid based methods, DNA expression arrays and mass spectrometry determination of the components of macromolecular complexes, has created new opportunities and challenges for biologists in generating and analyzing massive amounts of data. At the same time, the available literature is growing constantly (MEDLINE alone, the archive of abstracts of biomedical publications, already contains more than 11 million entries). It has become clear that methods for the large-scale analysis of experimental data are not enough: the existing knowledge locked in the scientific literature also has to be extracted and collected in order to understand this new kind of data.
Given the sheer amount of available literature, this is a time-consuming task with many uncertainties and drawbacks, e.g. finding the right articles, finding the right conclusions, finding a way to share the knowledge with co-workers, etc. Systems for the automatic processing and analysis of scientific literature are therefore needed to save valuable time and resources and to provide a way of identifying important data. Since the 1990s this has been a growing field with a considerable number of publications, associations of active researchers (e.g. the BioLINK group, with a special workshop dedicated to information extraction and text mining at ISMB, and workshops at the Association for Computational Linguistics) and dedicated web resources (BioNLP.org).
The bad news is that the field has remained very academic: systems have been developed for different purposes, applied to different data sets and evaluated according to different measures, which makes them fundamentally incomparable. We believe that the technology is progressing, but until these methods are applied to real biological data sets and evaluated on biologically relevant tasks, it will be difficult to convince biologists that the technology can assist them in accessing the vast biological literature. There is therefore an urgent need to introduce standards in the field, both to compare different methods and systems and to make them more useful and applicable for biologists. Furthermore, training opportunities have to be created to introduce people to existing systems, and a strong and active community has to be built to guide the development and application of this technology.
We have therefore decided to set up an assessment of text mining systems in biology. By defining a common task, common data sets and a clearly defined evaluation, we aim to give clear answers about the performance of these techniques in a real-world scenario relevant to the biological community.
We will build on the experience in bioinformatics, where a number of competitions revolutionized entire fields by agreeing on common targets and evaluation rules. This happened, for example, for protein structure prediction (CASP: Critical Assessment of techniques for protein Structure Prediction) and for gene prediction in entire genomes (at the "Genome Based Gene Structure Determination" symposium held on the Wellcome Trust Genome Campus). Researchers in natural language processing (NLP) and information extraction (IE) have likewise used common evaluations for many years to accelerate their research progress (e.g., via the Message Understanding Conferences - MUCs, and the Text Retrieval Conferences - TREC).
Text mining and information extraction systems are only starting to be evaluated in a serious manner on biologically relevant problems. The recent KDD Cup 2002 was a data mining competition held in conjunction with the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The competition included two tasks that involved data mining in molecular biology domains. The first task focused on constructing models that can assist genome annotators by automatically extracting information from scientific articles. The second task focused on learning models that characterize the behavior of individual genes in a hidden experimental setting. The first meeting of the TREC (Text REtrieval Conference) Genomics Track this year featured two tasks: ad hoc retrieval and information extraction. Both tasks centered around the Gene Reference into Function (GeneRIF) resource of the National Library of Medicine, which was used both as pseudorelevance judgments for ad hoc document retrieval and as target text for information extraction.
Goals
We are organizing this workshop in conjunction with the ongoing BioCreative evaluation of text mining systems. The workshop will allow participants in BioCreative to present their results and discuss their approaches. There will also be a considerable number of invited talks from respected experts in the field and time for open discussions about what has been learnt and where we might go in the future.
Handout
 
Download handout here

Programme

In general terms the program will consist of the following parts:
  • training part: Talks from the invited experts about the state of the art of Text Mining in general and in its application to molecular biology
  • outreach to define what is useful for biologists: evaluation of the performance of the current methods, based on results of the competition
    • talks by the organizers summarizing the results
    • talks by developers of selected systems
    • posters from all the participants to present their methods and results
  • community building: open discussion on various aspects of the competition (if the goals were biologically relevant and feasible, what the results mean for the community and for biology in general, ...) and conclusions for the future (how and if this project should be continued, challenges for the next years, ...).
The sessions will be structured in the following way:
  • presentation of the results by the organizers
  • talks by the invited speakers about various aspects of text mining methods
  • talks by a number of participants in the evaluation that will be selected by the organizers because of performance, biological relevance and originality of their methods.
  • discussion rounds with invited speakers and selected participants in the evaluation
  • poster sessions where all the participants of the evaluation can present their methods
Download detailed programme here.
Invited Speakers
  • Alfonso Valencia (organizer), CNB/CSIC, Universidad Autónoma de Madrid
  • Lynette Hirschman (organizer), The MITRE Corporation
  • Alex Yeh (organizer), The MITRE Corporation
  • Rolf Apweiler, European Bioinformatics Institute
  • Les Grivell, EMBO
  • Sophia Ananiadou, University of Salford
  • Andrey Rzhetsky, Columbia Genome Center and Department of Medical Informatics
  • Mark Craven, University of Wisconsin
  • See-Kiong Ng, Laboratories for Information Technology, Singapore
  • Éric Gaussier, Xerox Research Centre Europe
  • William Hersh, Oregon Health & Science University
Location
Granada lies in the heart of the south of Spain and is one of the most recognized Spanish university cities. Because of its beauty and architectural splendor the entire city of Granada has been declared a national monument. The Alhambra near the city center is the most extensive and best preserved medieval Islamic palace in existence and was declared a UNESCO world heritage site in 1984.
The meeting will be hosted at the Hotel Alixares near the center of the old town.


Travel Details

How to get to Granada from Granada airport or Malaga airport. Click Here
Application Details
Number of Participants
The number of participants will be limited to about 50-60 (including speakers) to allow extensive discussions and scientific exchange. We will reserve places for one person from each group that participated in the BioCreative evaluation, as well as additional places for young scientists with experience in the field (students and postdocs). The remaining places are open to interested people who were not active in the evaluation and will be filled from the applications received.
Contact
Applicants should send their application by email to Christian Blaschke, with the subject line "EMBO-BioCreative Workshop-[first name last name of applicant]", no later than JANUARY 16, 2004. Successful applicants will be notified by January 30, 2004. Please include the following information:
  • 1) First Name, Last Name, Full mailing address, Tel and Fax N°, E-mail address.
  • 2) Whether you participated in BioCreative.
  • 3) Your status (student, postdoctoral, young scientist, ...).
  • 4) A brief curriculum vitae.
  • 5) Summary of research interests.
  • 6) Whether you come from academia or a commercial organization.
Selection of Participants
The following criteria will be applied to select participants after application:
  • Participation in BioCreative
  • Previous activity and research in the field
  • Status (Some preference will be given to students and post-docs)
  • Number of participants from the respective research group attending
After the application deadline the organizers will discuss each application based on the above criteria. A number of participants will be chosen to give short oral presentations; the rest will present their work as a poster.
Costs
  • BioCreative Participants: 250 Euros
  • Non-Participants: 450 Euros
The costs will cover hotel, breakfasts, coffee breaks, lunches and dinners for the workshop from Sunday dinner, March 28 through Wednesday afternoon, March 31. Printed material (handouts etc.) is also included. Payment will be requested once applicants have been accepted for the workshop.
Funding
This workshop was funded by the European Commission as part of the E-BioSci and ORIEL projects, contract nos. QLRI-CT-2001-30266 and IST-2001-32688, under the RTD Programmes "Quality of Life and Management of Living Resources" and "Multimedia Content and Tools (KA3)". The MITRE contributions to BioCreAtIvE have been funded in part through a grant from the US National Science Foundation.