Text mining lecture

Lecture	TEXT MINING
Course	Bioinformatics and computational biology (Bioinformatica y biologia computacional)
Course type	Summer school (curso de verano)
Institution	Complutense University of Madrid (UCM)
Date	15th of July 2005
Lecture Description	The aim of this course is to familiarize participants with basic aspects of Natural Language Processing (NLP) and text mining as applied to the biomedical and molecular biology domains. Some of the most relevant topics in NLP, including information extraction (IE) and information retrieval (IR), will be discussed. The most important data resource and biomedical article collection, namely Medline (PubMed), as well as dictionaries and ontologies of functional terms and concepts used to describe gene products, will be introduced. The basic NLP application types developed for the biology domain, as well as the historical development of this field, will be presented. In the practical exercises some of the existing systems will be tested and their use in assisting efficient knowledge access discussed.
Presentation	Lecture presentation formats: html, pdf, OpenOffice, PowerPoint
Practical session/Exercises	The main goal of the practical session is to familiarize course participants with existing text mining and IR tools for the biological literature.
EXERCISE 1: PubMed, the main biomedical literature database. Efficient use of the PubMed database is crucial for retrieving the information available in the literature.
1.1. Carry out a PubMed search for 'HIV' using the 'Limits' option. How many articles did you retrieve? Now follow the research interest in HIV over time through the associated publications deposited in PubMed by constructing a table of 'Publication time period' vs. 'number of retrieved publications'. Start from 1980 and use time intervals of 5 years (e.g. 1980-1984, 1985-1989, ...). Describe your results. Comment: The aim of this search exercise is to explore an easy way to monitor research interest in a given topic. For instance, pharmaceutical companies are often interested in monitoring the research interests of other companies to obtain competitive intelligence.
1.2. Retrieve articles from PubMed for the Escherichia coli gene TRME_ECOLI (previously used in other sessions of this summer school). How many articles did you retrieve? Which problems did you encounter? Describe the obtained results.
1.3. (OPTIONAL) Perform the same search for the Escherichia coli gene MRAZ_ECOLI (notice that you worked with that protein before, in the "Redes de Interaccion de Proteinas" (protein interaction networks) session) and for the yeast gene RPE_YEAST (used in the Analisis de Secuencias (sequence analysis) session). What difficulties did you encounter? How many documents did you retrieve? (Comment: it is really worth taking some time to follow the tutorial provided by the PubMed database.)
EXERCISE 2: Google Scholar, alternative IR and document indexing and ranking tools. Google developed Google Scholar to provide a search engine specifically for academic and research users. Try out the search queries proposed in exercises 1.2 and 1.3
using the advanced Scholar Search. Compare the results with those of PubMed. What are the advantages and disadvantages of using Google Scholar?
EXERCISE 3: SDI, perform periodic IR searches. Set up your own selective dissemination of information (SDI) query using the My NCBI Cubby service.
EXERCISE 4: eTBlast, get similar documents. While writing a scientific article, report or grant application, people often want to retrieve a set of documents that are related or relevant to that work. What could or should you do in such situations? A PubMed search using alternative Boolean queries? Typically, people use Boolean queries against PubMed to obtain their set of references. You can use eTBlast instead and upload or paste your free text to obtain similar articles. You can even iterate the search by selecting a subset of the relevant documents retrieved in the first eTBlast round. If you have your own input document or are interested in a certain PubMed article, you can use it as your query text (or else try one of the following files: etblast_sample1.txt, etblast_sample1_trmE.txt). Notice that eTBlast is relatively slow. In the advanced search mode you can try out different metrics for calculating the document similarity. You can also upload your own stop word file (stop_word_list.txt) so that those words are filtered out when calculating the document similarity. Explain the output (ranked list). Compare the list of similar documents for a given abstract in PubMed (related article search) with the results of eTBlast. What are the advantages and disadvantages of using eTBlast? Are the highlighted words (those with high weight) relevant and discriminative in your opinion?
EXERCISE 5: Gene Ontology, describing and annotating gene products. Gene Ontology (GO) aims to provide standardized concepts or terms to describe relevant biological aspects.
Try to use GO to retrieve the ontology sub-structure for a set of terms: apoptosis, caspase, glycogenin, transcription factor (or, if you are interested in a particular function/process/compartment, use your own query instead). What did you retrieve? Browse through the results and visualize the corresponding ontology graphs. What kinds of relationships between terms did you find? What are the advantages of using such an ontology? Try to explore the annotations for a set of proteins, namely 1) CASP9_HUMAN (P55211), formerly known as ICE9_HUMAN, 2) Y1333_MYCTU (P64811), formerly known as YD33_MYCTU, and 3) RPE_YEAST (P46969), by searching the Gene Ontology Annotation database (GOA). Those proteins were used in the practical part of the Patrones, perfiles y dominios (patterns, profiles and domains) session. What is one of the weak points of using GO annotations for bioinformatics annotations? (Hint: think about domains.)
EXERCISE 6: Bio-NER, identification of genes and proteins in text. Retrieve an abstract from PubMed by searching for genes of your own research interest or, alternatively, for some of the following gene names: Caspase-9 (CASP-9, APAF-3), RPE1 (EPI1, POS18), Orc-1, Bcl-2, glycogenin, p53. Then try to tag gene and protein names in some of those abstracts using different gene/protein NER tools and compare the results. If you need GenBank ids (e.g. gi:20986531) or SwissProt accession numbers (Q07817 / BCLX_HUMAN), use NCBI or UniProt. Use some of the online applications NLProt, GAPSCORE, Yapex or BioNE recognizer (you can also download ABNER). How do they perform? What are the common errors? Which differences do you encounter? What are the main difficulties? Which taggers do you think are useful in practice? Explore iHOP for some of the gene symbols previously used (e.g. CASP-9, RPE1, mraZ, Orc-1, glycogenin, p53) or for genes of your own research interest. This tool was developed in our group (PDG) at the CNB.
Create a gene model for your query gene, check the results carefully, and surf through the virtual gene network of iHOP. What kind of results does iHOP return? What are the advantages and disadvantages of using iHOP instead of other bio-NER tools or a PubMed retrieval search?
EXERCISE 7: From sequence to abstracts. You have been using protein sequences for a range of analysis purposes in previous lectures of this course. Traditionally, to obtain information related to a query sequence, you would run a sequence search (e.g. Blast against NCBI), retrieve the matching genes, extract their gene names or symbols and search PubMed with those names to obtain the associated literature. This is a lot of work involving many steps. Those steps are integrated in the MedBlast tool. Let's try to obtain the corresponding literature for some of the protein sequences used in other lectures (or your own query sequence of interest) for this exercise:
>TRME_ECOLI
MSDNDTIVAQATPPGRGGVGILRISGFKAREVAETVLGKLPKPRYADYLPFKDADGSVLD
QGIALWFPGPNSFTGEDVLELQGHGGPVILDLLLKRILTIPGLRIARPGEFSERAFLNDK
LDLAQAEAIADLIDASSEQAARSALNSLQGAFSARVNHLVEALTHLRIYVEAAIDFPDEE
IDFLSDGKIEAQLNDVIADLDAVRAEARQGSLLREGMKVVIAGRPNAGKSSLLNALAGRE
AAIVTDIAGTTRDVLREHIHIDGMPLHIIDTAGLREASDEVERIGIERAWQEIEQADRVL
FMVDGTTTDAVDPAEIWPEFIARLPAKLPITVVRNKADITGETLGMSEVNGHALIRLSAR
TGEGVDVLRNHLKQSMGFDTNMEGGFLARRRHLQALEQAAEHLQQGKAQLLGAWAGELLA
EELRLAQQNLSEITGEFTSDDLLGRIFSSFCIGK
>RPE_YEAST
MVKPIIAPSILASDFANLGCECHKVINAGADWLHIDVMDGHFVPNITLGQPIVTSLRRSV
PRPGDASNTEKKPTAFFDCHMMVENPEKWVDDFAKCGADQFTFHYEATQDPLHLVKLIKS
KGIKAACAIKPGTSVDVLFELAPHLDMALVMTVEPGFGGQKFMEDMMPKVETLRAKFPHL
NIQVDGGLGKETIPKAAKAGANVIVAGTSVFTAADPHDVISFMKEEVSKELRSRDLLD
>Y1333_MYCTU
MNSITDVGGIRVGHYQRLDPDASLGAGWACGVTVVLPPPGTVGAVDCRGGAPGTRETDLL
DPANSVRFVDALLLAGGSAYGLAAADGVMRWLEEHRRGVAMDSGVVPIVPGAVIFDLPVG
GWNCRPTADFGYSACAAAGVDVAVGTVGVGVGARAGALKGGVGTASATLQSGVTVGVLAV
VNAAGNVVDPATGLPWMADLVGEFALRAPPAEQIAALAQLSSPLGAFNTPFNTTIGVIAC
DAALSPAACRRIAIAAHDGLARTIRPAHTPLDGDTVFALATGAVAVPPEAGVPAALSPET
QLVTAVGAAAADCLARAVLAGVLNAQPVAGIPTYRDMFPGAFGS
>MRAZ_ECOLI
MFRGATLVNLDSKGRLSVPTRYREQLLENAAGQMVCTIDIYHPCLLLYPLPEWEIIEQKL
SRLSSMNPVERRVQRLLLGHASECQMDGAGRLLIAPVLRQHAGLTKEVMLVGQFNKFELW
DETTWHQQVKEDIDAEQLATGDLSERLQDLSL
Use MedBlast, an NLP-based retrieval system, to return relevant articles for your sequence. Notice that this system is slow and sensitive to server overload! Describe the obtained results. What are the main difficulties when linking a query sequence to scientific articles?
Supplementary exercises
EXERCISE 8: Extracting protein function annotation from text. The functional annotations contained in databases such as Gene Ontology Annotation (GOA) were directly or indirectly extracted from the literature. Several applications have been developed to associate proteins with functional terms. Try to use text mining applications and GOA annotations to find functional information for your query proteins: GOAnnotator, iHOP, GOA.
EXERCISE 9: Deriving protein interactions through literature mining. Proteins carry out their function through interactions with other biomolecules. Use different text mining tools that try to extract protein interactions for a given query protein or proteins (caspase, glycogenin, p53, etc.) from texts: iHOP, PreBIND, Chilibot. Compare your results with entries in interaction databases: BIND, DIP, GRID, HPID, HPRD, IntAct, MINT and STRING. What kind of output is produced by each tool? Which differences do you encounter? What are the difficulties encountered by those tools?
EXERCISE 10: IR performance. You are looking for documents related to the E. coli mraZ gene. Carry out a PubMed search and a Google Scholar search and calculate the precision and recall for the first 20 retrieved documents. After evaluating whether the retrieved documents are relevant to the search query, construct a precision-recall curve.
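For Exercise 10, the precision and recall calculation could be sketched roughly as follows. This is a minimal illustration, not part of the original course material: the function name precision_recall_curve, the relevance judgments and the assumed total number of relevant documents are all hypothetical. In practice the true number of relevant documents is unknown, so recall can only be estimated, for example against the pooled set of relevant hits from both search engines.

```python
from typing import List, Tuple


def precision_recall_curve(
    relevance: List[bool], total_relevant: int
) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each rank of a ranked result list.

    relevance[i] is True if the document at rank i + 1 was judged relevant.
    total_relevant is the number of relevant documents assumed to exist
    (an estimate, e.g. pooled from both search engines).
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    if sum(relevance) > total_relevant:
        raise ValueError("more relevant hits than total_relevant")
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
        # recall = relevant retrieved / all relevant
        # precision = relevant retrieved / all retrieved so far
        points.append((hits / total_relevant, hits / rank))
    return points


# Hypothetical manual judgments for the first 20 hits of one search engine
# (illustrative values only, not real search results).
judgments = [True, True, False, True, False, True, False, False, True, False,
             False, True, False, False, False, True, False, False, False, False]
ASSUMED_TOTAL_RELEVANT = 10  # assumed size of the relevant set

curve = precision_recall_curve(judgments, ASSUMED_TOTAL_RELEVANT)
recall_at_20, precision_at_20 = curve[-1]
print(f"P@20 = {precision_at_20:.2f}, R@20 = {recall_at_20:.2f}")

# One row per rank: plotting precision against recall gives the curve.
print("rank\trecall\tprecision")
for rank, (recall, precision) in enumerate(curve, start=1):
    print(f"{rank}\t{recall:.2f}\t{precision:.2f}")
```

With these made-up judgments, 7 of the 20 retrieved documents are relevant, so precision at rank 20 is 0.35 and the estimated recall is 0.70. Running the same function on the judgments for the PubMed and Google Scholar result lists gives two curves that can be compared directly.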
Links	Collection of useful links (resources, tools, glossary...)
References	General statistical NLP book: Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.
Computational linguistics book: Roland Hausser, Foundations of Computational Linguistics: Human-Computer Communication in Natural Language, Springer.
Bio-NLP book chapter: M. Krallinger and A. Valencia. Applications of Text Mining in Molecular Biology, from name recognition to Protein interaction maps. In Data Analysis and Visualization in Genomics and Proteomics, chapter 4, Wiley.
Information retrieval and indexing: I. H. Witten, A. Moffat and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. International Thomson Publishing, London (1994).
Bio-NLP review articles:
(1) R. Hoffmann, M. Krallinger, E. Andres, J. Tamames, C. Blaschke and A. Valencia. Text Mining for Metabolic Pathways, Signaling Cascades, and Protein Networks. Science STKE 283, pe21 (2005).
(2) M. Krallinger, R. Alonso-Allende Erhardt and A. Valencia. Text-mining approaches in molecular biology and biomedicine. Drug Discovery Today 10, 439-445 (2005).
(3) M. Krallinger and A. Valencia. Text mining and information retrieval services for Molecular Biology. Genome Biology 6 (7), 224 (2005).
Contact person	Martin Krallinger: [e-mail],[URL]