
Organizers:
Christian Blaschke, CNB/CSIC, Madrid, Spain.
Ian Donaldson, Samuel Lunenfeld Research Institute, Canada
Carol Friedman, Columbia University
Rob Gaizauskas, University of Sheffield
Lynette Hirschman, The MITRE Corporation
Larry Hunter, University of Colorado
Marc Light, University of Iowa
Barend Mons, Erasmus Medical Center
See-Kiong Ng, Institute for Infocomm Research, Singapore
Andrey Rzhetsky, Columbia Genome Center
Alfonso Valencia, CNB/CSIC Madrid
For any further questions, contact Lynette Hirschman (lynette@mitre.org) or Christian Blaschke (blaschke@cnb.uam.es), or see http://www.pdg.cnb.uam.es/BioLink/

With the increasing availability of textual information related to biology, including MedLine abstracts and full-text journal articles, research on information extraction is rapidly becoming an essential component of various bioinformatics applications. It is expected that text mining in general, and information extraction in particular, will provide tools to facilitate the annotation of vast amounts of molecular information, including gene sequences, transcription profiles and biological pathways.
The Special Interest Group on Text Mining (or BioLINK) was created to address the need for communication and exchange of ideas in the field of text mining and information extraction applied to biology and biomedicine. Information extraction (IE) is an outgrowth of work in automated natural language processing, which began in the 1950s with work on transformational grammar by Zellig Harris and later Noam Chomsky. Information extraction technology made rapid progress starting in the late 1980s, thanks to a series of conferences focused on evaluation of IE: the Message Understanding Conferences (MUCs). There is also a long history of research on applications in medicine. Applications to the medical field focus on two distinct sub-problems: improved access to the medical literature and extraction of information from patient records.
Despite these successes in other fields, Natural Language Processing (NLP) techniques were not introduced in biology until the late 1990s (first publication around 1997, at ISMB97). The field is dominated by two, not necessarily convergent, views:
Application-oriented, where simple (perhaps too simple) methods are used to address "real" biological problems;
Tool-oriented, where complex, state-of-the-art NLP methods are used to address problems that are not always relevant to biologists.
Even if this view is exaggerated, there is a problem of misunderstanding and misinterpretation in this field that hinders the development of powerful text mining systems accepted by biologists (the actual users).
To improve this situation, we hold regular open meetings to bring together researchers from the field to exchange ideas and share them with a wider community interested in the latest developments (see events). To go a step further and formulate common goals, standard datasets and uniform evaluation criteria, we plan to organize a critical assessment of text mining methods later this year (see news), inspired by the CASP evaluations.

| Start | End | Speakers/Authors | Affiliation | Title |
| 09:00 | 09:15 | Christian Blaschke | CNB, Spain | Welcome |
| 09:15 | 09:45 | Robert Stevens | U. Manchester, UK | Report from BioOntologies SIG |
| 09:45 | 10:15 | Ian Donaldson, Joel Martin, Berry de Bruijn, Christopher W.V. Hogue | U. Toronto, CA | PreBIND and Textomy - Mining the Biomedical Literature for Protein-Protein Interactions Using a Support Vector Machine |
| 10:15 | 10:45 | Andy Fulmer, Jun Xu, Steven Zhao | Procter & Gamble, US | An Overview of Text Mining in the Biology Domain at P&G |
| 10:45 | 11:00 | BREAK | | |
| 11:00 | 11:15 | George Demetriou, Robert Gaizauskas | U. Sheffield, UK | Corpus Resources for Development and Evaluation of a Biological Text Mining System |
| 11:15 | 11:30 | Yuka Tateisi, Tomoko Ohta, Jin-dong Kim, Huaquing Hong, Su Jian, Jun-ichi Tsujii | CREST, Japan Science and Technology Corporation | The GENIA Corpus: MEDLINE Abstracts Annotated with Linguistic Information |
| 11:30 | 11:45 | Seth Kulick, Mark Liberman, Andrew Schein | U. Pennsylvania, US | Shallow Semantic Annotation of Biomedical Corpora for Information Extraction |
| 11:45 | 12:00 | | | Discussion about the creation and use of text corpora |
| 12:00 | 13:00 | LUNCH | | |
| 13:00 | 13:30 | Maria Samsonova | St.Petersburg State Polytechnical University | Processing of the Natural Language Queries to a Relational Database |
| 13:30 | 13:45 | Tony C. Smith, John G. Cleary | U. of Waikato, NZ | Automatically Linking MEDLINE Abstracts to the Gene Ontology |
| 13:45 | 14:00 | Yoshimasa Tsuruoka, Teruyoshi Hishiki, Osamu Ogasawara, Kousaku Okubo | CREST, Japan Science and Technology Corporation | Integration of Diverse Knowledge and Data into Biomedical Knowledge Matrices |
| 14:00 | 14:15 | Eunji Yi, Gary G. Lee, Soo-Jun Park | Pohang U. of Science and Technology, Korea | HMM-based Protein Name Recognition with Edit Distance Using Automatically Annotated Corpus |
| 14:15 | 14:30 | | | Discussion |
| 14:30 | 15:00 | BREAK | | |
| 15:00 | 15:15 | Francisco M. Couto, Mario J. Silva, Pedro Coutinho | U of Lisbon, Portugal | Curating Extracted Information Through the Correlation between Structure and Function |
| 15:15 | 15:30 | Rune Linding, Peter O'Hanlon, Ulrich Reincke, Toby Gibson | EMBL | Profiling and Classification of Scientific Documents with SAS Text Miner |
| 15:30 | 15:45 | | | Discussion |
| 15:45 | 16:15 | Alex Yeh, Lynette Hirschman, Alex Morgan | MITRE, US | BioCreAtIvE: Entity Extraction |
| 16:15 | 16:45 | Christian Blaschke, Alfonso Valencia | CNB, Spain | BioCreAtIvE: Functional Extraction |
| 16:45 | 17:30 | | | Summary and discussion |

BioCreAtIvE - Critical Assessment of Information Extraction systems in Biology
Many groups are now working in the area of text mining; there have been sessions devoted to this topic at the Pacific Symposium on Biocomputing for the past four years, as well as workshops [1] and sessions on language and biology in computational linguistics [2]. However, despite the increased activity in this area, it is not possible to point to any successfully deployed text mining tools. And it is still not possible to compare the different approaches, because the various groups involved are addressing different problems, often using private data sets.
This is similar to the situation in text processing in the early nineties, prior to the introduction of the Message Understanding Conferences. With the introduction of a common evaluation and standardized evaluation metrics, it became possible to compare approaches, to assess what techniques did and didn't work, and to make progress. There are now commercial tools that can be used to extract names from different kinds of text. The field of biology is ripe for a similar experiment.
BioCreAtIvE is being organized by the BioLINK group (Biological Literature, Information and Knowledge, http://www.pdg.cnb.uam.es/BioLINK/). Our purpose is to hold a CASP-like evaluation (CASP: Critical Assessment of techniques for protein Structure Prediction - http://predictioncenter.llnl.gov/) for the text data mining community.

Database curators:
R. Apweiler: SWISS-PROT
C. Wu: PIR
J. Blake: MGI
I. Donaldson: BIND
Text mining researchers:
A. Valencia and C. Blaschke: CNB
L. Hirschman and A. Yeh: MITRE
L. Hunter: U. of Colorado
S-K Ng: Kent Ridge Digital Labs, Singapore
C. Friedman: Columbia
For any further questions regarding the competition please contact one of the organisers: Christian Blaschke (blaschke@cnb.uam.es), Lynette Hirschman (lynette@mitre.org) or Alfonso Valencia (valencia@cnb.uam.es).
Conference e-mail: BioCreative@gredos.cnb.uam.es
Mailing list for discussion of the competition: to be announced in the BioLink mailing list, contact BioLink-owner@listas.cnb.uam.es to be included in the list.

Researchers in natural language processing (NLP) and information extraction (IE) have, for many years now, used common evaluations to accelerate their research progress, e.g., via the Message Understanding Conferences (MUCs) and the Text Retrieval Conferences (TREC). This not only results in the formulation of common goals but also makes it possible to compare different systems and gives a certain transparency to the field. The field of bioinformatics also has a tradition of competitions, for example, in protein structure prediction (CASP) or gene predictions in entire genomes (at the "Genome Based Gene Structure Determination" symposium held on the Wellcome Trust Genome Campus).
Current systems are developed for different purposes, applied to different data sets and evaluated according to different measures, which makes them fundamentally incomparable; thorough evaluation of text mining systems in biology has only just begun (see the recent evaluation for the KDD cup - http://www.biostat.wisc.edu/~craven/kddcup/). We have therefore decided to set up a competition of text mining systems in biology, by defining a common task, common data sets and a clearly defined evaluation.
Following the experience of CASP, the emphasis will be more on the comparison of methods and the community assessment of scientific progress, than on the purely competitive aspects.

July 2003: Release initial training data set and initial task guidelines, in conjunction with ISMB
September 2003: Release full training set and revised task guidelines
November 2003: Release test data and receive results
December 2003: Tabulate results
April 2004: Convene final workshop

Our aim is to define "biologically meaningful" tasks - tasks that would be recognized by biologists as a contribution to their work and that constitute a meaningful challenge for current text mining systems. We have selected tasks where "gold standard" data for training and test can be made available in sufficiently large quantities with modest investment. This has led us to focus on the use of existing expert-curated data from existing biological databases as a source for gold-standard data.
Through discussion at various meetings over the past several years, we have identified two classes of task of interest to both researchers and practitioners. The first of these is "entity identification" in text - the ability to find mentions of relevant biological entities (genes, proteins, small molecules, chemicals, tissues, etc.) in running text, such as abstracts, full text articles and comment fields in curated databases. This task enables accurate indexing of entities within articles; it also takes the first step towards the more ambitious task of relation extraction and, eventually, pathway discovery from the literature. The second task is more ambitious and focuses on the automatic functional annotation of proteins using the Gene Ontology (GO) classes.

Our goal in defining this task was to provide a way of assessing the ability of an automated system to identify the genes (or proteins, where there is ambiguity) mentioned in text. The "natural language processing" or MUC version of this task has required that a system identify each mention of a gene-or-protein in the text [3]. We take here a more biologically motivated approach that measures the ability of a system to identify the list of genes mentioned in text, where each gene mention is mapped to its unique name or symbol [4]. This task has the advantage that it is a task performed by expert human curators in many databases. This means that it is easy to obtain "ground truth" data for training, by downloading sets of curated articles and the corresponding gene list for each article. This task assumes that we will make available a (reasonably) comprehensive lexicon of standard gene names and their synonyms. Our plan is to provide data and resources for identifying gene lists in text from three model organisms: Fly, Yeast and Mouse. The test data will be drawn from these same organisms (and each text will be marked for the organism it discusses).
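As a rough illustration of the gene list idea, the sketch below maps synonyms found in a text to standard symbols using a small lexicon. The lexicon entries, the function names (`build_synonym_index`, `gene_list`) and the simple string matching are assumptions made for this example; they are not the resources or the method the organizers will provide.

```python
import re

# Hypothetical toy lexicon: standard gene symbol -> known synonyms.
# A real lexicon for Fly, Yeast or Mouse would be far larger.
LEXICON = {
    "CDC28": ["CDC28", "CDK1", "cyclin-dependent kinase 1"],
    "ACT1": ["ACT1", "actin"],
}

def build_synonym_index(lexicon):
    """Map each lower-cased synonym to its standard gene symbol."""
    index = {}
    for symbol, synonyms in lexicon.items():
        for synonym in synonyms:
            index[synonym.lower()] = symbol
    return index

def gene_list(text, index):
    """Return the sorted list of standard symbols whose synonyms occur in text."""
    found = set()
    lowered = text.lower()
    for synonym, symbol in index.items():
        # Require non-word characters on both sides so that "actin"
        # does not match inside a longer word such as "lactin".
        pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.add(symbol)
    return sorted(found)

abstract = "Cdk1 phosphorylates several substrates; actin levels were unchanged."
print(gene_list(abstract, build_synonym_index(LEXICON)))
```

Plain dictionary lookup of this kind ignores ambiguity (a synonym shared by several genes, or a gene name that is also a common English word), which is part of what makes the task a meaningful challenge.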
In addition to the primary gene list task, we plan to define a second task, with a much smaller data set. This task will require that, given a text and a standard gene name, a system identify all references to that particular gene. This task focuses on one gene (or one entity) at a time, but it requires that both explicit mentions (names) and indirect mentions, via pronouns ("it") or classifiers ("the protein"), be included [5]. There is no "naturally occurring" gold standard for this task; therefore producing the gold standard requires detailed expert annotation. However, we believe that it is worth assessing system performance on this task, because it is a precursor to the identification of relations among entities in text, where a gene mention may occur in many forms, such as:
Tyrosine phosphorylation of VE-cadherin was maximal at 30 minutes after the addition of the growth factor. At this time point, the protein could be coimmunoprecipitated with the adaptor protein Shc.
Here "the protein" in the second sentence counts as a mention of "VE-cadherin" and participates with Shc in a relation of "coimmunoprecipitation".

The second task will address the automatic assignment of GO annotations (http://www.geneontology.org/) to human proteins. This is currently done by curators at SWISS PROT for the human genome, who have agreed to make these annotations available for use as training and test sets. This task requires that participants build an automated system that identifies the relevant information for each protein and produces the required GO annotations. These can be validated against the GO annotations in SWISS PROT. As part of the task, the systems will be asked to provide a pointer into the text of the article that provides supporting evidence for the annotation. The SWISS PROT curators will inspect these evidence links to determine their utility. SWISS PROT will make available 9,000 human proteins annotated with 17,000 GO terms associated with 15,000 papers for the training corpus. The test data will consist of blind (not yet released) annotations on a new set of articles.
This task has several advantages:
The application of a wide range of text mining and information extraction techniques (from purely statistical to sophisticated NLP methods) makes the task open to a large number of groups working in this field;
Evaluation of the results according to an established standard (GO);
Application of the systems to a real problem in the field of biological database curation.
Many of the issues of "noisy training data" will also apply to the SWISS PROT task. For example, the SWISS PROT group annotates "by protein" rather than by paper. Therefore a paper may or may not be fully annotated for all proteins occurring in the paper. This means that a system might generate annotations for a protein that had not been curated for that paper. This would look like a false positive, since there would be no GO annotations for that protein in the "answer key".
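The effect of an incomplete answer key can be made concrete with a small sketch. It sorts predicted annotations into three bins, so that predictions for a protein that was never curated for a paper are kept apart from genuine disagreements. The function name `bin_predictions`, the data layout and the identifiers are hypothetical, and this is only one possible way of looking at the problem, not the evaluation procedure that will be used.

```python
from collections import Counter

def bin_predictions(predictions, answer_key):
    """Sort predicted (paper, protein, go_term) triples into three bins.

    answer_key maps (paper, protein) -> set of curated GO terms. A pair that
    is absent from the key was not curated for that paper.
    """
    bins = Counter()
    for paper, protein, go_term in predictions:
        curated = answer_key.get((paper, protein))
        if curated is None:
            bins["uncurated"] += 1   # cannot be judged against the key
        elif go_term in curated:
            bins["match"] += 1       # agrees with the curators
        else:
            bins["mismatch"] += 1    # protein was curated, term disagrees
    return bins

# Hypothetical identifiers, for illustration only.
answer_key = {("paper1", "P1"): {"GO:term_a", "GO:term_b"}}
predictions = [
    ("paper1", "P1", "GO:term_a"),
    ("paper1", "P1", "GO:term_c"),
    ("paper1", "P2", "GO:term_a"),  # P2 was not curated for paper1
]
print(dict(bin_predictions(predictions, answer_key)))
```

A naive scorer would count both the "mismatch" and the "uncurated" predictions as false positives, even though only the first kind actually contradicts the curators.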

For the Entity annotation task, participants will be provided with thousands of abstracts from several domains (Fly, Mouse, Yeast), along with lexical resources (normalized gene names and synonym lists) and the corresponding lists of genes identified by curators from the associated articles [6]. The test data will consist of abstracts from the same three domains. The participants will build systems that generate, for each abstract, the appropriate list of curatable genes, along with annotations of the evidence for gene mentions from the abstract. The list and the annotation of evidence will be validated by comparison to the lists of genes produced by the curators of the respective databases.
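Comparing a system's gene list with the curators' list can be sketched as a set comparison. The text above does not specify which measures will be used; precision, recall and the balanced F-measure are assumed here because they are common in information extraction evaluations. The function name `score_gene_list` and the example lists are hypothetical.

```python
def score_gene_list(predicted, curated):
    """Return (precision, recall, F-measure) for one abstract's gene list."""
    predicted, curated = set(predicted), set(curated)
    true_positives = len(predicted & curated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Balanced F-measure: harmonic mean of precision and recall.
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative lists only: two of three predictions are correct,
# and two of four curated genes are found.
system_output = ["ACT1", "CDC28", "TUB1"]
curator_list = ["ACT1", "CDC28", "CLN2", "SWI4"]
print(score_gene_list(system_output, curator_list))
```

Because the comparison is made on normalized symbols rather than on text spans, a system is not penalized for missing repeated mentions of a gene it has already listed.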
For the Functional annotation task, the participants will be provided with a set of full text articles from a restricted number of journals that were used by the Swiss-Prot team to annotate the functions of human proteins; we will also provide a sufficiently large set of human proteins with their corresponding GO codes (and their associations to the articles). The test data will consist of blind (not yet released) annotations on a new set of articles. The participants will build systems capable of identifying the relevant information for each protein in order to produce the required GO annotations. The annotations will be validated by comparing to the GO annotations provided by the Swiss-Prot team.
For both evaluations, to ensure that the results come from an automated system (as opposed to smart biologists doing the annotations by hand), we have designed the following procedure:
TIME: The time for producing the automatically annotated data should be short (on the order of days).
VOLUME: There should be at least a few hundred papers or thousands of abstracts to annotate.
EVIDENCE: The system must supply the phrase (or short extracts) from the text that justifies the selection of the annotation. These will be manually checked by the appropriate curators for correctness.
Participants will have about four months to build a system, and the final blind test data will be released in the fall (November 2003). The groups will return their results a few days later. The results will be automatically evaluated and the evidence will be checked by the appropriate curators. After the evaluation, the results will be discussed in a workshop (spring 2004 in Granada, Spain). The focus of the meeting will be the discussion of the different methods and the results obtained in the experiment.

Because of these restrictions, we do not (in principle) limit the data sources that systems are allowed to use: we want this to be a test of a real-world situation and do not want to exclude any method (text-based or not) or any data source. However, since this is an evaluation of automated processing, manually generated information that is specifically about any of the test set texts is not allowed. Note that this includes MEDLINE/PubMed information such as MeSH headings and enzyme numbers.

Robert Stevens
BioHealth Informatics Group
Department of Computer Science
University of Manchester
Oxford Road
Manchester, UK
Contact: robert.stevens@cs.man.ac.uk
Abstract: NA
Presentation: Ontologies in Bioinformatics.
Ian Donaldson* (1) , Joel Martin (2), Berry de Bruijn (2), Christopher W.V. Hogue (1,3)
1: Samuel Lunenfeld Research Institute, Toronto, M5G 1X5, Canada
2: Institute for Information Technology, National Research Council of Canada, Ottawa, K1A 0R6, Canada
3: Dept. of Biochemistry, University of Toronto, Canada
Contact*: ian.donaldson@utoronto.ca
Abstract
The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles where they are inaccessible to computational methods. The Biomolecular Interaction Network Database (BIND) seeks to capture these data in a machine-readable format. PreBIND and Textomy are two components of a literature-mining system designed to find protein-protein interaction information and present this to curators or public users for review and submission to the BIND database. This system couples a co-occurrence network of protein names with Support Vector Machine (SVM) technology that identifies abstracts describing biomolecular interactions.
Performance analyses estimated that the SVM F-measure was 92% and that the system would be able to recall up to 60% of all non-high throughput interactions present in the MIPS yeast-protein interaction database. Finally, this system was applied to a real-world curation problem and its use was found to reduce the task duration by 70%.
Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse and yeast protein-interaction information. Performance of the system and on-going work to improve it will be discussed.
Presentation:
PreBIND and Textomy - mining the biomedical literature for protein-protein interactions using a Support Vector Machine.
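The co-occurrence idea mentioned in the abstract above can be illustrated with a few lines of code that count how many abstracts mention each pair of protein names. This is only a sketch under stated assumptions: the function name `cooccurrence_counts` and the input format are hypothetical, protein name recognition is taken as already done, and the SVM classification step is not shown. It is not the PreBIND implementation.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(abstract_proteins):
    """Count how many abstracts mention each unordered pair of protein names.

    abstract_proteins: iterable of collections of protein names, one
    collection per abstract (names are assumed to be already normalized).
    """
    counts = Counter()
    for names in abstract_proteins:
        # Deduplicate and sort so each pair is counted once per abstract
        # and always appears in the same order.
        for pair in combinations(sorted(set(names)), 2):
            counts[pair] += 1
    return counts

# Placeholder names standing in for recognized proteins in three abstracts.
abstracts = [["A", "B"], ["B", "A", "C"], ["C"]]
for pair, count in sorted(cooccurrence_counts(abstracts).items()):
    print(pair, count)
```

Pairs with high counts are candidate edges of a co-occurrence network; co-occurrence alone does not establish that two proteins interact, which is why a further filtering step is needed.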
Andy Fulmer, Jun Xu, Steven Zhao
Miami Valley Labs
The Procter & Gamble Company
Cincinnati, Ohio USA
Contact: fulmer.aw@pg.com
Abstract
P&G conducts research in biology to support its businesses in pharmaceutical drugs, personal health care, pet care, and consumer products. We have established the capabilities to conduct high quality, high throughput Affymetrix gene chip expression studies on our biological models, where a single experiment generates data on ~10,000 different genes. The demand to interpret these datasets in a timely manner has motivated our entry into the text mining field.
We began with a "keep it simple" application oriented approach in 1999, when we launched the development of a tool called PathBinder with a group of computer scientists at Iowa State University. PathBinder has evolved into a robust Medline miner used to harvest signal transduction / gene regulatory pathway interactions. These interactions are curated by the project biologists into a Pathways Knowledge Base, built on a simple logical interaction model, with a separate suite of tools to build and analyze pathways.
Although PathBinder maintains >80% precision in identifying bona fide pathway interactions at the sentence level processing unit, we estimate that recall falls in the range of only ~30-50%. To increase recall, we have launched a collaboration with a group of linguists at Los Alamos National Labs (LANL) to develop better NLP filters to post-process the PathBinder sentences.
In 2001, with the initial pathway text mining initiatives underway, we began turning our attention to mining "functional context" to characterize members of a statistically manicured list of genes from a gene chip study. One approach is to develop tools to annotate the genes to GO terms using information mined from Medline, then "cluster" the gene list in ontology space. An ontology clustering tool is under development with LANL, while the GOMedlineMiner is still in the intellectual incubator.
Another way to cluster a gene list in functional space is based on term extraction, directly from Medline. We have recently launched a pilot project with ALMA Bioinformatica to evaluate various approaches to clustering the TextMiner bags of words associated with members of the gene list.
George Demetriou, Robert Gaizauskas
U. Sheffield, UK
Contact: robertg@dcs.shef.ac.uk
Abstract:
It is now widely accepted that the development and evaluation of robust language processing systems is heavily dependent on the availability of suitable annotated corpus resources. For the creation of text mining systems, these resources serve the purposes of:
Task specification and refinement
System development
System evaluation
In this paper we describe the acquisition and annotation of texts and adaptation of a scoring procedure as part of the process of development and evaluation of the PASTA system. PASTA – the Protein Active Site Template Acquisition system – is a text mining system for the automatic extraction of information relating to protein structures from the biological literature. The aim of this description is a) to inform others of these resources, should they wish to use them or build on them; and b) to highlight issues encountered and lessons learned in the process which may benefit others setting out to build similar resources. A key constraint on this effort was the limited manpower available to the PASTA project. The resources that were created during the process of the PASTA project and which are described in this paper are freely available at http://nlp.shef.ac.uk/pasta.
Yuka Tateisi, Tomoko Ohta, Jin-dong Kim (1)
Huaquing Hong, Su Jian (2)
Jun-ichi Tsujii (1,3)
1: CREST, Japan Science and Technology Corporation
2: Institute for Infocomm Research, Singapore
3: Department of Computer Science, Faculty of Information Science and Technology, University of Tokyo
Contact: tsujii@is.s.u-tokyo.ac.jp
Abstract
We describe the GENIA corpus, in which a set of MEDLINE abstracts is annotated with various kinds of information. A 2000-abstract corpus annotated with biologically meaningful terms, their semantic classes, and part-of-speech tags has been made available. The annotation of partial parse tree structure and coreference is under way.
Seth Kulick (1), Mark Liberman (1), Andrew Schein (2)
1: Institute for Research in Cognitive Science, 3401 Walnut Street, Suite 400A, Philadelphia, PA 19104
2: Dept. of Computer and Information Science, Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104
Contact: skulick@linc.cis.upenn.edu
Abstract
Work over the last few years in literature data mining for biology has progressed from linguistically unsophisticated models to the adaptation of Natural Language Processing (NLP) techniques that use full parsers and coreference to extract relations that span multiple sentences. However, there has been a lack of annotated corpora that can fuel further work in this direction in the same way that the development of syntactically annotated corpora such as the Penn Treebank led to the development of statistical language parsers.
To address this situation, we are developing new linguistic resources in three categories: a large corpus of biomedical text annotated with syntactic structure (Treebank) and predicate-argument structure ("proposition bank" or Propbank); a large set of biomedical abstracts and full-text articles annotated with entities and relations of interest to researchers, such as enzyme inhibition, or mutation/cancer connections (Factbanks); and broad-coverage lexicons and tools for the analysis of biomedical texts. We are also developing and adapting software tools that allow human experts to annotate biomedical texts for entity tagging, as well as for treebanking and propbanking. We are focusing initially on two applications: drug development, in collaboration with researchers in the Knowledge Integration and Discovery Systems group at GlaxoSmithKline, and pediatric oncology, in collaboration with researchers in the eGenome group at Children's Hospital of Philadelphia. These applications, worthwhile in their own right, provide excellent test beds for broader research efforts in natural language processing and data integration.
M. Samsonova, A. Pisarev and M. Blagov
Department of Computational Biology
St.Petersburg State Polytechnical University
office 210, 21, Tikhoretsky Ave.
St.Petersburg 194064, Russia
Contact: samson@spbcas.ru
Abstract
Motivation. A new method has been developed to query a relational database in natural language.
Results. The method, based on a semantic approach, interprets grammatical and lexical units of a natural language as concepts of the subject domain, which are given in a conceptual scheme. The conceptual scheme is mapped formally onto the logical scheme. We applied the method to query the FlyEx database in natural language. FlyEx contains information on the expression of segmentation genes in Drosophila. The method allows queries to be formulated in various natural languages simultaneously, and adapts to changes in the knowledge domain and in users' views. It provides optimal transformation of queries from natural language to SQL, as well as visualization of information as a hyperscheme. The method does not require all possible language constructions to be specified, nor does it require NL queries to be grammatically exact.
Availability: http://urchin.spbcas.ru/NLP/NLP.html
Presentation:
Processing of the natural language queries to a relational database
Tony C. Smith (1), John G. Cleary (2)
1: Department of Computer Science, University of Waikato, New Zealand, tcs@cs.waikato.ac.nz
2: Reel Two Ltd, New Zealand, jcleary@reeltwo.com
Abstract
This paper introduces the "Gene Ontology Knowledge Discovery System" (GO-KDS), which automatically connects MEDLINE abstracts to relevant Gene Ontology terms. GO-KDS uses a machine learning algorithm to discover models for 3,700 of the categories in GO. These models are then used to select relevant documents from all 6 million entries in the MEDLINE corpus of abstracts. The underlying learning algorithm is described together with some performance figures obtained using leave-one-out evaluation. The major bottleneck in the extension of the system is obtaining curated documents to use as training data. We speculate that this problem can be solved by using GO-KDS to bootstrap off initial (small) training sets.
Presentation:
Automatically Linking MEDLINE Abstracts to the Gene Ontology.
Yoshimasa Tsuruoka (1), Teruyoshi Hishiki (2), Osamu Ogasawara (3), Kousaku Okubo (4)
1: CREST, Japan Science and Technology Corporation
2: Biological Information Research Center, National Institute of Advanced Industrial Science and Technology (AIST)
3: Information and Mathematical Science Laboratory
4: Institute for Biomedical Regulation, Kyushu University
Contact: tsujii@is.s.u-tokyo.ac.jp
Abstract
Following the completion of the draft human genome sequence, increasing effort is going into mapping data-driven patterns onto background knowledge, in the hope of efficiently producing hypotheses from the flood of data. Here we propose a framework for biomedical data and knowledge that is highly adaptable to automated data interpretation. We then show that biomedical databases with heterogeneous scopes and structures can be converted to this format, and discuss possible roles for an ontology of biomedical objects combined with natural language processing techniques. Lastly, we present applications of the formatted biomedical knowledge to scientific discovery.
Presentation:
Integration of diverse knowledge and data into biomedical knowledge matrices.
Eunji Yi*, Gary Geunbae Lee, Soo-Jun Park
Natural Language Processing Laboratory
Department of Computer Science and Engineering
Pohang University of Science and Technology (POSTECH)
San 31, Hyoja-Dong,
Pohang, 790-784, Korea (South)
Contact*: imew@nlp.postech.ac.kr
Abstract:
In the biomedical domain, various machine learning approaches have been applied to the named entity recognition task, and there are two major unsolved problems: lack of annotated training corpora and extensive spelling variation in names. The machine learning based approaches have shown relatively low accuracy compared to the hand-built rule-based ones, mainly because of the lack of named entity annotated corpora for training in the biomedical domain. Annotated corpora are essential to achieve sufficiently good results for named entity recognition, especially for the machine learning based approaches. But it is very difficult and time consuming to build such corpora by human annotation. To overcome this problem in this paper, we explore the use of an automatically annotated corpus.
Moreover, one of the main difficulties in named entity recognition is that there are many variant forms for each named entity in biomedical articles. So it is difficult to recognize them even if we meet a named entity already defined in the named entity dictionary. Edit distance, a useful metric to measure the similarity between two strings, will be applied to help with those kinds of problems in this paper. Altogether, we propose a modified HMM-based named entity recognition method, which combines the HMM with an edit-distance metric and which uses an automatically annotated corpus.
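For readers unfamiliar with the metric, the following sketch computes the standard Levenshtein edit distance and uses it to look up a spelling variant in a name dictionary. The abstract does not say how the distance is combined with the HMM, so that part is not shown; the helper `closest_entry`, its `max_distance` threshold and the example names are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,           # deletion
                               current[j - 1] + 1,        # insertion
                               previous[j - 1] + cost))   # substitution or match
        previous = current
    return previous[-1]

def closest_entry(name, dictionary, max_distance=2):
    """Return the dictionary entry nearest to name (case-insensitive),
    or None if the dictionary is empty or no entry is close enough."""
    if not dictionary:
        return None
    best = min(dictionary,
               key=lambda entry: edit_distance(name.lower(), entry.lower()))
    if edit_distance(name.lower(), best.lower()) <= max_distance:
        return best
    return None

names = ["NF-kappa B", "interleukin-2"]
print(closest_entry("NF-kappaB", names))
```

A fixed threshold such as `max_distance=2` is a crude choice: short names differ from each other by only a few characters, so a practical system would probably scale the threshold with name length.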
Francisco M. Couto, Mario J. Silva, Pedro Coutinho
Francisco José Moreira Couto
Informatics Department
Faculty of Sciences
University of Lisbon
Bloco C5 - Piso 1 - Sala 5.1.17 - Campo Grande
1749-016 Lisboa, Portugal
Contact: fjmc@di.fc.ul.pt
Abstract:
We propose to apply the correlation between structure and function of gene products to curate information automatically extracted from biological literature. This can be achieved by automatically validating extracted information that satisfies the correlation, since it has strong evidence of being correct.
We applied a semantic similarity measure (SSM) to identify a correlation between the modular structures of glycoside hydrolases (GHs) and functional terms extracted from associated literature. The source of GHs was CAZy, a database of carbohydrate-active enzymes classified into families by their modular structure. We retrieved the literature associated with each GH and extracted Gene Ontology (GO) functional terms from it. We implemented an SSM on GO to measure the relatedness between the extracted GO terms. Finally, we identified the correlation by comparing the probability of extracting similar terms within a family with the probability of extracting similar terms from different families.
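The final comparison step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the hierarchy, the term labels (`A1`, `B2`, ...), the family contents, the depth-based similarity score, and the 0.5 threshold are all hypothetical stand-ins for GO, CAZy families, and the SSM actually used.

```python
from itertools import combinations, product

# Toy is-a hierarchy (child -> parent); a stand-in for GO, not real GO terms.
PARENT = {"A": "root", "B": "root",
          "A1": "A", "A2": "A", "B1": "B", "B2": "B"}

# Hypothetical families with the terms extracted from their literature.
FAMILIES = {"fam1": ["A1", "A2", "A1"],
            "fam2": ["B1", "B2", "A2"]}


def ancestors(term):
    """Path from `term` up to the root, starting with the term itself."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def similarity(x, y):
    """Simple depth-based score in [0, 1]: twice the depth of the deepest
    common ancestor divided by the summed depths of x and y (root depth 0)."""
    path_x, path_y = ancestors(x), ancestors(y)
    depth_x, depth_y = len(path_x) - 1, len(path_y) - 1
    if depth_x + depth_y == 0:
        return 1.0  # both terms are the root
    common = set(path_y)
    lca = next(t for t in path_x if t in common)  # deepest shared ancestor
    return 2 * (len(ancestors(lca)) - 1) / (depth_x + depth_y)


def similar_fraction(pairs, threshold):
    """Fraction of term pairs whose similarity reaches the threshold."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    hits = sum(1 for x, y in pairs if similarity(x, y) >= threshold)
    return hits / len(pairs)


def inside_outside(families, threshold=0.5):
    """Estimate the probability of similar terms inside vs. across families."""
    inside = [p for terms in families.values()
              for p in combinations(terms, 2)]
    outside = [p for f, g in combinations(families, 2)
               for p in product(families[f], families[g])]
    return (similar_fraction(inside, threshold),
            similar_fraction(outside, threshold))


if __name__ == "__main__":
    p_in, p_out = inside_outside(FAMILIES)
    print(f"inside: {p_in:.2f}  outside: {p_out:.2f}")
```

A higher fraction inside families than across them would be consistent with the structure-function correlation the abstract describes; the toy data here is chosen only to make that contrast visible.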
Rune Linding, EMBL, linding@embl.de
Peter O'Hanlon, SAS Australia, peter.ohanlon@oz.sas.com
Ulrich Reincke, SAS Germany, ulrich.reincke@ger.sas.com
Toby Gibson, EMBL, gibson@embl.de
Abstract:
The automatic classification of documents into categories is an increasingly important task. As scientific document collections in the life sciences continue to grow exponentially, retrieving and classifying the appropriate documents by hand can become unmanageable. In fact, it is impossible to follow a scientific field by manual methods.
SAS Institute (www.sas.com) and the European Molecular Biology Laboratory (EMBL)/the ELM Consortium (http://elm.eu.org) are cooperating on the development of a text-mining application for the automated identification and ranking of scientific articles. The so-called topic scoring engine is based on SAS Text Miner(TM). It identifies documents with similar contents and creates search profiles that reflect what those documents have in common.
The topic scoring engine replaces keyword querying of bibliographic databases such as PubMed with a structured, automated process of "document based retrieval". This will reduce research time while improving the quality of the results. The outstanding feature of the topic scoring engine is that, unlike a search engine, it does not look for a pre-defined vocabulary. Instead, the tool uses different types of singular value decomposition to test all possible information resolutions of the concepts underlying the text. Through a complex iteration of correlation analysis, variable clustering and selection, an optimum set of concepts is generated, which enters different types of predictive models to train a search profile for each topic. These profiles are subsequently applied as filters to new publications. This allows the user to seek publications matching these profiles without having to submit complex queries. Furthermore, users can receive weekly or even daily updates about relevant new publications and research topics, which makes scientific literature research much more convenient. Finally, the topic scoring engine helps to overcome the barrier of false or mismatching keywords. SAS and EMBL plan to provide this as a public service to the scientific community after a trial period.
Christian Blaschke (1), Alexander Yeh (2)
1: Protein Design Group, CNB/CSIC Universidad Autonoma Cantoblanco, 28049 Madrid, Spain, blaschke@cnb.uam.es
2: The MITRE Corporation, Information Technology Center, US, asy@mitre.org
Abstract:
Many groups are now working in the area of text mining; there have been sessions devoted to this topic at the Pacific Symposium on Biocomputing for the past four years, as well as workshops and sessions on language and biology in computational linguistics. However, despite the increased activity in this area, it is not possible to point to any successfully deployed text mining tools. It is also still not possible to compare the different approaches, because the various groups involved are addressing different problems, often using private data sets.
This is similar to the situation in text processing in the early nineties, prior to the introduction of the Message Understanding Conferences. With the introduction of a common evaluation and standardized evaluation metrics, it became possible to compare approaches, to assess what techniques did and didn't work, and to make progress. There are now commercial tools that can be used to extract names from different kinds of text. The field of biology is ripe for a similar experiment.
BioCreAtIvE is being organized by the BioLINK group (Biological Literature, Information and Knowledge, http://www.pdg.cnb.uam.es/BioLINK/). Our purpose is to hold a CASP-like evaluation (CASP: Critical Assessment of techniques for protein Structure Prediction - http://predictioncenter.llnl.gov/) for the text data mining community.
1 Language Modeling of Biological Data, University of Pennsylvania, February 2001: http://www.ircs.upenn.edu/modeling2001/. Also the Workshop on Natural Language Processing and Ontology Building, University of Tokyo, February 2002: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/WS.html
2 Special Session on Language Processing and Biological Data, Human Language Technology Workshop, March 2002. Also the Workshops on Natural Language Processing in the Biomedical Domain, Association for Computational Linguistics, July 2002: http://www.ccs.neu.edu/home/futrelle/bionlp/acl02/BIO/ and July 2003: http://www-tsujii.is.s.u-tokyo.ac.jp/ACL03/bionlp.htm.
3 The MUC "named entity" task requires that the system identify all mentions of genes (or proteins or…) in a text; this is generally done as in-line mark-up of the occurrences of these names, as in the Genia annotated corpus (http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/; also see data from Tanabe and Wilbur on gene-or-protein names in selected sentences). It is labor-intensive to provide consistent annotation for this task, because of questions about how much of the name to include ("feline homologue of CD2"), what to do with abbreviations embedded in a compound name ("VE growth factor (VEGF) receptor-2 (VEGFR-2)"), and what to do with names that may be compound names or conjoined names (MEK-1/-2). Interannotator agreement for this task varies somewhat, but for biology texts, it seems to be around 85%-90%.
4 This requires that the system produce a list of unique genes in the text. This corresponds roughly to the "template element" task evaluated in MUC. This requires the existence of a set of canonical names and a lexicon/synonym list to facilitate the mapping. This is a "naturally occurring" task done by database curators.
5 This resembles the ACE (Automated Content Extraction) Entity Detection Task (EDT) (http://www.itl.nist.gov/iaui/894.01/tests/ace/phase1/doc/).
6 We will attempt to "clean" this list to correspond to genes mentioned in the abstract, to maximize the utility of these lists for automated training. However, we plan to do the cleaning of the training data automatically, so there may still be some residual "noise" in the data: genes mentioned in the abstract but not curated; and genes listed as curated that may be mentioned only in the full paper.
7 Abstracts of all talks will be made available on the BioLINK web site (http://www.pdg.cnb.uam.es/BioLINK/) following the workshop. At this time, we were only able to include abstracts submitted in ASCII or Word formats.