Protein Descriptions in Sentences Corpus
PRODISEN

PRODISEN corpus

In bioinformatics, and especially in text and literature mining, freely available corpora for testing purposes are scarce. Although the number of corpora for Named Entity Recognition (NER) has grown in recent years, and several interesting and useful ones have been published (GENIA, YAPEX, BioCreative, TREC), corpora for other tasks such as Information Extraction remain very few. For instance, for the extraction of protein-protein interactions, one of the newest tasks in text mining, almost no good corpora have been developed so far. The same holds for other areas of the field. One resource that the literature mining community has largely overlooked is a corpus for the functional description of proteins and genes from the literature. The Protein Descriptions in Sentences (PRODISEN) corpus is a useful resource for the automatic identification and construction of text-based protein description records using information extraction and text classification techniques. It was built following basic guidelines and criteria relevant to constructing a text corpus of functional descriptions of genes and proteins.



© by Martin Krallinger 2006