BioCreative data resources


    During the BioCreative assessment, a large amount of data and results was produced, both by the organizers (mainly in Task I) and by the participants themselves (mainly in Task II). Part of these data are valuable resources in their own right, independently of BioCreative. We have therefore put these data together and distribute them as resources that we hope will be useful to researchers in the field for developing and testing their text mining and information extraction systems.

    The goal of Task I was to assess the ability of automated systems to identify genes and related entities mentioned in text. The training and test sets contain annotated text taken from Medline abstracts in which the genes and related entities are marked. The task has two parts. Task IA is concerned with finding the mentions of genes and related entities in sentences. Task IB is concerned with producing, for each abstract, a list of the normalized names of the genes mentioned in it. For more information about this task, see: http://www.mitre.org/public/biocreative/
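The difference between the two parts of Task I can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the identifiers `ID_1` and `ID_2`, the example text, and the naive substring matching are hypothetical and do not reflect the actual data format or any participant's method.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon mapping surface forms to normalized identifiers.
LEXICON: Dict[str, str] = {
    "geneA": "ID_1",
    "gene-alpha": "ID_1",  # synonym of geneA in this toy example
    "geneB": "ID_2",
}


def find_mentions(text: str, lexicon: Dict[str, str]) -> List[Tuple[int, int, str]]:
    """Task IA-style output: (start, end, surface text) for every mention."""
    lowered = text.lower()
    mentions = []
    for surface in lexicon:
        needle = surface.lower()
        start = lowered.find(needle)
        while start != -1:
            end = start + len(needle)
            mentions.append((start, end, text[start:end]))
            start = lowered.find(needle, end)
    return sorted(mentions)


def normalized_gene_list(text: str, lexicon: Dict[str, str]) -> List[str]:
    """Task IB-style output: unique normalized identifiers found in the text."""
    lowered = text.lower()
    found = {gene_id for surface, gene_id in lexicon.items()
             if surface.lower() in lowered}
    return sorted(found)


# Hypothetical text standing in for a Medline abstract.
text = "GeneA interacts with geneB. Gene-alpha is also known as geneA."
mentions = find_mentions(text, LEXICON)
gene_ids = normalized_gene_list(text, LEXICON)
```

In this sketch the Task IA-style result keeps every occurrence with its position, while the Task IB-style result collapses synonyms and repeats into one identifier per gene.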

The goal of Task II was to provide "functional annotations" for proteins. Systems had to identify the proteins in the text, check whether any functional information was present, and return the GO code representing this information together with the text that supports the annotation. The resources provided here contain triplets of protein, GO code, and evidence text (mostly short passages of a few sentences), plus the evaluation by the curators who checked the results. The data therefore include both positive and negative examples, which should be especially useful for training machine learning methods. For more information about this task, see: http://www.pdg.cnb.uam.es/BioLINK/BioCreative_task2.html
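As a rough sketch of how such triplets could be prepared for a learning method, the example below separates curator-accepted records from rejected ones. The record layout, the field names, the single boolean judgement, and all values are assumptions made for illustration; the distributed files may use a different format and a finer-grained evaluation scale.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Annotation:
    protein: str       # protein identifier (placeholder value below)
    go_code: str       # GO code assigned by a system (placeholder value below)
    evidence: str      # text passage offered as support
    accepted: bool     # assumed: curator judgement reduced to accept/reject


def split_examples(
    records: List[Annotation],
) -> Tuple[List[Annotation], List[Annotation]]:
    """Return (positive, negative) examples according to the curator judgement."""
    positives = [r for r in records if r.accepted]
    negatives = [r for r in records if not r.accepted]
    return positives, negatives


# Placeholder records; none of these values come from the real resource.
records = [
    Annotation("PROT_A", "GO:0000000", "Placeholder evidence passage one.", True),
    Annotation("PROT_A", "GO:0000000", "Placeholder evidence passage two.", False),
    Annotation("PROT_B", "GO:0000000", "Placeholder evidence passage three.", True),
]

positives, negatives = split_examples(records)
```

Keeping the rejected predictions alongside the accepted ones is what makes the resource usable as labeled training data rather than only as a list of correct annotations.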
 

Download the BioCreative corpus and BioCreative task resources.