Biomedical Text corpora and related data collection resources.

  • TREC Genomics Track: A set of data collections provided by the TREC Genomics Track, useful for developing and evaluating retrieval and text categorization strategies in the biomedical domain.
  • BioCreative corpus: Dataset produced by the BioCreative assessment: text passages relevant for GO annotations of human proteins.
  • GENIA corpus: Annotated corpus of literature related to the MeSH terms: Human, Blood Cells, and Transcription Factors.
  • Yapex corpus: Training and test data for the protein tagger (NER) YAPEX.
  • PASBio: Predicate-argument structures of biomedical literature.
  • LLL05 dataset: Genic Interaction Extraction Challenge: protein/gene interactions IE data set
  • IEPA corpus: The Interaction Extraction Performance Assessment corpus
  • BioText Data: Dataset for extraction of disease/treatment entities relations
  • BioText NC Semantics Dataset: Dataset of Noun Compound Semantics used in experiments described in articles
  • PennBioIE: UPenn Biomedical Information Extraction datasets of annotated PubMed abstracts: CYP450 domain and oncology domain
  • Medstract corpus: Biomedical annotation corpus useful for acronym definition and coreference resolution
  • OHSUMED text collection: Document collection used for the TREC-9 contest.
  • BMC corpus: Open access corpus of full text articles provided by BioMed Central.
  • FetchProt corpus: Full text journal articles from the biological domain analyzed for experiments on proteins.
  • PDG Bio-sentence splitter corpus: Small collection of text data sets derived from PubMed abstracts to develop and assess sentence splitting tools.
  • Bio1 corpus: Annotated corpus in the same field as GENIA, but annotated against a small top-level ontology.




If you know of additional data collections or corpora of interest for biomedical text mining, please contact me at martink@cnb.uam.es. Doing so helps make this list of text mining resources for biology and biomedicine more complete. The number of data collections available for training and testing biomedical text mining and Bio-NLP tools is still very small, so any additional data resource is especially valuable for research in this field.
Refer to my links section for other useful resources for Bio-NLP.






