| |
| Lecture |
TEXT MINING
|
|
Course
|
Bioinformatics and computational biology (Bioinformatica y biologia computacional)
|
|
Course type
|
Summer school (curso de verano)
|
|
Institution
|
University Complutense of Madrid (UCM)
|
|
Date
|
15th of July 2005
|
|
Lecture Description
|
This course aims to familiarize participants with basic aspects of Natural Language Processing
(NLP) and text mining as applied to the biomedical and molecular biology domain. Some of the most relevant NLP topics, including
information extraction (IE) and information retrieval (IR), will be discussed. The most important data resource and biomedical article
collection, Medline (PubMed), will be introduced, along with dictionaries and ontologies of functional terms and concepts used to describe gene
products. The basic types of NLP application developed for the biology domain and the historical development
of the field will also be presented.
In the practical exercises, some existing systems will be tested and their use in supporting efficient knowledge access will be discussed.
|
|
Presentation
|
Lecture presentation formats:
html,
pdf,
OpenOffice,
PowerPoint
|
|
Practical session/Exercises
|
The main goal of the practical session is to make course participants familiar with existing text mining
and IR tools for the biology literature.
EXERCISE 1: PubMed, the main biomedical literature database.
Efficient use of the PubMed database is crucial for retrieving existing information from the literature.
1.1. Carry out a PubMed search
for 'HIV' using the 'Limits' option.
How many articles did you retrieve?
Now try to follow the research interest in HIV over time
through the associated publications deposited in PubMed by constructing a 'Publication time period' vs 'number of retrieved publications'
table. Start from 1980 and use time intervals of 5 years (e.g. 1980-1984, 1985-1989, ...).
Describe your results.
Comment: The aim of this search exercise is to explore an easy way to monitor research interests related to a certain topic of research.
For instance pharmaceutical companies are often interested in monitoring research interests of other companies to obtain competitive
intelligence.
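The table in exercise 1.1 can also be prepared programmatically. The following is a minimal sketch, assuming the PubMed publication-date range syntax `first:last[dp]` and the NCBI E-utilities `esearch` endpoint with `rettype=count`. The helper names (`year_intervals`, `pubmed_term`, `count_url`) are illustrative and not part of the course material, and the sketch only builds the queries without sending them.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (esearch). No request is sent by this sketch.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def year_intervals(start, end, width=5):
    """Return non-overlapping (first_year, last_year) tuples covering start..end."""
    intervals = []
    first = start
    while first <= end:
        last = min(first + width - 1, end)
        intervals.append((first, last))
        first = last + 1
    return intervals


def pubmed_term(topic, first_year, last_year):
    """Build a PubMed query restricted to a publication-date range ([dp] field)."""
    return f"{topic} AND {first_year}:{last_year}[dp]"


def count_url(term):
    """Build an esearch URL that asks only for the number of matching records."""
    params = {"db": "pubmed", "term": term, "rettype": "count"}
    return ESEARCH_URL + "?" + urlencode(params)


# One row per time period: the period, the query string, and the URL to request.
for first, last in year_intervals(1980, 2004):
    term = pubmed_term("HIV", first, last)
    print(f"{first}-{last}\t{term}\t{count_url(term)}")
```

Pasting each query string into the PubMed search box gives the same counts as the 'Limits' option, so the two approaches can be cross-checked.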
1.2. Retrieve articles from PubMed for the Escherichia coli gene TRME_ECOLI
(previously used in other sessions of this summer school).
How many articles did you retrieve?
Which problems did you encounter?
Describe the obtained results.
1.3. (OPTIONAL*) Perform the same search for the Escherichia coli gene MRAZ_ECOLI,
which you worked with before in the "Redes de Interaccion de Proteinas" (Protein Interaction Networks) session,
and for the yeast gene RPE_YEAST
(used in the "Analisis de Secuencias" (Sequence Analysis) session).
What difficulties did you encounter? How many documents did you retrieve?
(Comment: it is well worth taking some time to follow the tutorial
provided by the PubMed database.)
EXERCISE 2: Google Scholar, alternative IR and document indexing and ranking tools.
Google developed Google Scholar to provide a search engine
specifically for academic and research users. Try out the search queries proposed in exercises 1.2 and 1.3 using the advanced Scholar Search.
Compare the results with those from PubMed. What are the advantages and disadvantages of using Google Scholar?
EXERCISE 3: SDI, perform periodic IR searches.
Set up your own selective dissemination of information (SDI) query using the
My NCBI Cubby service.
EXERCISE 4: eTBlast, get similar documents.
When writing a scientific article, report or grant application, people often want to retrieve a set of documents related or relevant
to the work at hand. What could or should you do in such situations? Run a PubMed search using alternative Boolean queries?
Typically people use Boolean queries against PubMed to obtain their set of references.
You can use eTBlast instead and upload or paste your free text to obtain similar articles.
You can even iterate the search by selecting a subset of the relevant documents retrieved in the first eTBlast round.
If you have your own input document or are interested in a certain PubMed article, you can use it as your query text
(or else try some of the following files: etblast_sample1.txt,
etblast_sample1_trmE.txt).
Note that eTBlast is relatively slow. In the advanced search mode you can try out different metrics for calculating the document similarity.
You can also try uploading your own stop word file (stop_word_list.txt) so that those words are filtered out when
calculating the document similarity.
Explain the output (ranked list). Compare the list of similar documents for a given abstract in PubMed (related article search) with the results of
eTBlast. What are the advantages and disadvantages of using eTBlast? In your opinion, are the highlighted words (those with high weight)
relevant and discriminative?
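To make the idea of a document similarity metric and of stop word filtering concrete, here is a minimal sketch of one common metric, cosine similarity over raw term counts. It is not a description of how eTBlast works internally. The function names, the tiny stop word set and the example sentences are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text, stop_words):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stop_words]


def cosine_similarity(text_a, text_b, stop_words=frozenset()):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(tokenize(text_a, stop_words))
    b = Counter(tokenize(text_b, stop_words))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents, stop_words=frozenset()):
    """Return (score, document) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query, d, stop_words), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Invented toy texts, only to show the effect of the stop word list.
stop_words = {"the", "of", "in", "a", "and", "is"}
query = "the function of the trmE gene in Escherichia coli"
documents = [
    "the trmE gene of Escherichia coli",
    "the history of the city and the river",
]
for score, doc in rank(query, documents, stop_words):
    print(f"{score:.2f}\t{doc}")
```

Running the ranking once with and once without the stop word set shows how frequent function words can inflate the similarity of unrelated texts.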
EXERCISE 5: Gene Ontology, describing and annotating gene products.
Gene Ontology (GO) aims to provide standardized concepts or terms to describe relevant biological aspects.
Try to use GO to retrieve the ontology sub-structure for a set of terms: apoptosis, caspase, glycogenin,
transcription factor (or, if you are interested in a particular function/process/compartment, use your own query instead).
What did you retrieve? Browse through the results and visualize the corresponding ontology graphs. What kind of relationships
between terms did you find? What are the advantages of using such an ontology?
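One advantage of an ontology is that relationships between terms can be followed automatically, for example to collect all more general terms implied by a specific one. The sketch below uses a hand-made, simplified is_a fragment with term names only. It is an illustrative assumption, not an extract of the actual GO graph, and should be compared with what the GO browser returns.

```python
# Toy is_a graph: each term maps to its direct parents.
# Simplified and hand-made for illustration; check the real structure in GO.
PARENTS = {
    "apoptosis": ["programmed cell death"],
    "programmed cell death": ["cell death"],
    "cell death": ["biological process"],
    "biological process": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


# A protein annotated with a specific term is implicitly annotated
# with all of that term's ancestors as well.
print(sorted(ancestors("apoptosis", PARENTS)))
```

Because a term may have several parents in a real ontology, the traversal keeps a set of visited terms so that shared ancestors are reported only once.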
Try to explore annotations for a set of proteins, namely:
1) CASP9_HUMAN (P55211), formerly known as ICE9_HUMAN,
2) Y1333_MYCTU (P64811), formerly known as YD33_MYCTU,
3) RPE_YEAST (P46969),
by searching the Gene Ontology Annotation database (GOA).
Those proteins have been used in the practical part of the
"Patrones, perfiles y dominios" (Patterns, Profiles and Domains) session. What is one of the weak points of using GO annotations for bioinformatics annotation? (Hint: think about domains.)
EXERCISE 6: Bio-NER, identification of genes and proteins in text.
Retrieve abstracts from PubMed
by searching for genes of your own research interest or, alternatively, for some of the following gene names:
Caspase-9 (CASP-9, APAF-3), RPE1 (EPI1, POS18), Orc-1, Bcl-2, glycogenin, p53.
Then try to tag gene and protein names in some of those abstracts using different gene/protein NER tools and compare
the results. If you need GenBank ids (e.g. gi:20986531) or SwissProt accession numbers (Q07817 / BCLX_HUMAN),
use NCBI
or UniProt.
Use some of the online applications NLProt,
GAPSCORE,
Yapex or
BioNE recognizer
(you can also download ABNER).
How do they perform? What are the common errors? Which differences do you encounter? What are the main difficulties? Which taggers do you think are useful
in practice?
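As a baseline to compare the online taggers against, a naive dictionary lookup can be sketched in a few lines. The synonym lists are taken from the gene names given in this exercise. The dictionary layout, the function name `tag_genes` and the example sentence are assumptions for illustration, and none of the tools listed above is claimed to work this way.

```python
import re

# Synonyms copied from the exercise text; the data structure is an assumption.
LEXICON = {
    "Caspase-9": ["Caspase-9", "CASP-9", "APAF-3"],
    "RPE1": ["RPE1", "EPI1", "POS18"],
}


def tag_genes(text, lexicon):
    """Return (start, end, matched_text, canonical_name) for each dictionary hit."""
    hits = []
    for canonical, names in lexicon.items():
        for name in names:
            # Require a non-alphanumeric character (or the text boundary) on both sides.
            pattern = r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])"
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((m.start(), m.end(), m.group(0), canonical))
    return sorted(hits)


# Invented example sentence: one spelling is in the lexicon, one variant is not.
sentence = "We measured caspase-9 and CASP9 levels."
for start, end, matched, canonical in tag_genes(sentence, LEXICON):
    print(start, end, matched, "->", canonical)
```

The unhyphenated variant in the example sentence is missed because it is not listed in the dictionary, which illustrates one of the typical difficulties (spelling variation) to look for when comparing the taggers.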
Explore iHOP for some of the gene symbols previously used (e.g. CASP-9, RPE1, mraZ, Orc-1, glycogenin, p53) or for genes of your own research interest.
This tool was developed in our group (PDG) at the CNB. Create a gene model for your query
gene, check the results carefully, and surf through the virtual gene network of iHOP.
What kind of results are obtained by iHOP?
What are the advantages/disadvantages of using iHOP instead of other bio-NER tools or a PubMed retrieval search?
EXERCISE 7: From sequence to abstracts
You have been using protein sequences for a range of analysis purposes in previous lectures of this course.
Traditionally, to obtain information related to a query sequence you would run a sequence search
(e.g. Blast against NCBI), retrieve the matching genes, extract their gene names or symbols and then search PubMed
with those names to obtain the associated literature. This is a lot of work involving many separate steps.
Those steps are integrated in the MedBlast tool.
Let's try to obtain the corresponding literature for some of the protein sequences used in other
lectures (or your own query sequence of interest) for this exercise:
>TRME_ECOLI
MSDNDTIVAQATPPGRGGVGILRISGFKAREVAETVLGKLPKPRYADYLPFKDADGSVLD
QGIALWFPGPNSFTGEDVLELQGHGGPVILDLLLKRILTIPGLRIARPGEFSERAFLNDK
LDLAQAEAIADLIDASSEQAARSALNSLQGAFSARVNHLVEALTHLRIYVEAAIDFPDEE
IDFLSDGKIEAQLNDVIADLDAVRAEARQGSLLREGMKVVIAGRPNAGKSSLLNALAGRE
AAIVTDIAGTTRDVLREHIHIDGMPLHIIDTAGLREASDEVERIGIERAWQEIEQADRVL
FMVDGTTTDAVDPAEIWPEFIARLPAKLPITVVRNKADITGETLGMSEVNGHALIRLSAR
TGEGVDVLRNHLKQSMGFDTNMEGGFLARRRHLQALEQAAEHLQQGKAQLLGAWAGELLA
EELRLAQQNLSEITGEFTSDDLLGRIFSSFCIGK
>RPE_YEAST
MVKPIIAPSILASDFANLGCECHKVINAGADWLHIDVMDGHFVPNITLGQPIVTSLRRSV
PRPGDASNTEKKPTAFFDCHMMVENPEKWVDDFAKCGADQFTFHYEATQDPLHLVKLIKS
KGIKAACAIKPGTSVDVLFELAPHLDMALVMTVEPGFGGQKFMEDMMPKVETLRAKFPHL
NIQVDGGLGKETIPKAAKAGANVIVAGTSVFTAADPHDVISFMKEEVSKELRSRDLLD
>Y1333_MYCTU
MNSITDVGGIRVGHYQRLDPDASLGAGWACGVTVVLPPPGTVGAVDCRGGAPGTRETDLL
DPANSVRFVDALLLAGGSAYGLAAADGVMRWLEEHRRGVAMDSGVVPIVPGAVIFDLPVG
GWNCRPTADFGYSACAAAGVDVAVGTVGVGVGARAGALKGGVGTASATLQSGVTVGVLAV
VNAAGNVVDPATGLPWMADLVGEFALRAPPAEQIAALAQLSSPLGAFNTPFNTTIGVIAC
DAALSPAACRRIAIAAHDGLARTIRPAHTPLDGDTVFALATGAVAVPPEAGVPAALSPET
QLVTAVGAAAADCLARAVLAGVLNAQPVAGIPTYRDMFPGAFGS
>MRAZ_ECOLI
MFRGATLVNLDSKGRLSVPTRYREQLLENAAGQMVCTIDIYHPCLLLYPLPEWEIIEQKL
SRLSSMNPVERRVQRLLLGHASECQMDGAGRLLIAPVLRQHAGLTKEVMLVGQFNKFELW
DETTWHQQVKEDIDAEQLATGDLSERLQDLSL
Use MedBlast, an
NLP-based retrieval system, to return relevant articles for your sequence. Note that this
system is slow and sensitive to server overload! Describe the
obtained results. What are the main difficulties when linking a query sequence to
scientific articles?
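If you want to submit the sequences one at a time, the multi-sequence listing above first has to be split into individual records. The following is a minimal sketch of a FASTA parser. The function name `parse_fasta` is an assumption, and the sketch covers only this first step, not the search and literature linking that MedBlast performs.

```python
def parse_fasta(text):
    """Return {identifier: sequence} for FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(parts) for n, parts in records.items()}


# The MRAZ_ECOLI entry from the exercise, used as sample input.
sample = """>MRAZ_ECOLI
MFRGATLVNLDSKGRLSVPTRYREQLLENAAGQMVCTIDIYHPCLLLYPLPEWEIIEQKL
SRLSSMNPVERRVQRLLLGHASECQMDGAGRLLIAPVLRQHAGLTKEVMLVGQFNKFELW
DETTWHQQVKEDIDAEQLATGDLSERLQDLSL
"""
for identifier, sequence in parse_fasta(sample).items():
    print(identifier, len(sequence))
```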
Supplementary exercises
EXERCISE 8: Extracting protein function annotation from text
The functional annotations contained in databases such as Gene Ontology Annotation (GOA)
were directly or indirectly extracted from the literature. Several applications have been developed to associate proteins with
functional terms. Try to use text mining applications and GOA annotations to find functional information for your query proteins:
GOAnnotator,
iHOP,
GOA
EXERCISE 9: Deriving protein interactions through literature mining
Proteins carry out their function through interactions with other biomolecules. Use different text
mining tools that try to extract protein interactions for a given query protein (caspase, glycogenin, p53, etc.)
from texts: iHOP,
PreBIND,
Chilibot
Compare your results with entries in interaction databases: BIND,
DIP,
GRID,
HPID,
HPRD,
IntAct,
MINT and
STRING.
What kind of output is produced by each tool? Which differences do you encounter?
What are the difficulties encountered by those tools?
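When comparing text-mined interactions with database entries, it helps to treat each interaction as an unordered pair so that "A interacts with B" and "B interacts with A" count as the same record. The sketch below shows one way to do this. The function names and the placeholder protein names are invented for illustration, and real comparisons also need identifier mapping between resources, which is not covered here.

```python
def normalize(pairs):
    """Treat interactions as unordered, case-insensitive protein pairs."""
    return {frozenset(name.upper() for name in pair) for pair in pairs}


def compare(text_mined, database):
    """Split interactions into shared, text-mining-only and database-only sets."""
    mined = normalize(text_mined)
    known = normalize(database)
    return {
        "shared": mined & known,
        "only_text_mining": mined - known,
        "only_database": known - mined,
    }


# Placeholder names, not real results from any of the tools or databases.
text_mined = [("protA", "protB"), ("protC", "protA")]
database = [("PROTB", "PROTA"), ("protD", "protE")]
for label, pairs in compare(text_mined, database).items():
    print(label, [sorted(p) for p in pairs])
```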
EXERCISE 10: IR performance.
You are looking for documents related to the E. coli mraZ gene.
Carry out a PubMed search and a Google Scholar search and calculate the precision and recall for the first 20 retrieved documents.
After evaluating whether each retrieved document is relevant to the search query, construct a precision-recall curve.
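Precision at rank k is the fraction of the first k retrieved documents that are relevant; recall at rank k is the fraction of all relevant documents that appear among the first k. The following minimal sketch computes both for every rank, which gives the points of a precision-recall curve. The function name is an assumption, the relevance judgements in the example are invented, and the total number of relevant documents has to be estimated by you (for instance by pooling the relevant hits of both search engines).

```python
def precision_recall_points(relevance, total_relevant):
    """Return one (recall, precision) point per rank.

    relevance: list of booleans in rank order (True = document is relevant).
    total_relevant: number of relevant documents in the whole collection.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))
    return points


# Invented judgements for the first five hits, with four relevant documents overall.
judgements = [True, False, True, True, False]
for rank, (recall, precision) in enumerate(precision_recall_points(judgements, 4), start=1):
    print(f"rank {rank}: recall={recall:.2f} precision={precision:.2f}")
```

Plotting recall on the x-axis against precision on the y-axis for ranks 1 to 20 gives the curve asked for in the exercise, one per search engine.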
|
|
Links
|
Collection of useful links (resources, tools, glossary...)
|
|
References
|
General statistical NLP Book:
Chris Manning and Hinrich Schütze, Foundations of Statistical Natural
Language Processing, MIT Press. Cambridge, MA: May 1999.
Computational linguistics book: Roland Hausser, Foundations of
Computational Linguistics : Human-Computer Communication in Natural Language,
Springer.
Bio-NLP book chapter: M. Krallinger and A. Valencia. Applications of Text Mining in Molecular Biology, from name recognition to Protein interaction maps.
In Data Analysis and Visualization in Genomics and
Proteomics, chapter 4, Wiley.
Information retrieval and indexing: Witten, I. H., Moffat, A. and Bell, T. C. (1994)
Managing Gigabytes, Compressing and Indexing Documents and
Images.
International Thomson Publishing, London.
Bio-NLP review articles:
(1) R. Hoffmann, M. Krallinger, E. Andres, J. Tamames, C. Blaschke and A. Valencia.
Text Mining for Metabolic Pathways, Signaling Cascades, and Protein Networks. Science STKE 283, pe21 (2005)
[PDF].
(2) M. Krallinger, R. Alonso-Allende Erhardt and A. Valencia. Text-mining approaches in molecular biology and biomedicine.
Drug Discovery Today 10, 439-445 (2005)
[PDF].
(3) M. Krallinger and A. Valencia. Text mining and information retrieval services for Molecular Biology.
Genome Biology, 6 (7), 224 (2005).
[PDF].
|
|
Contact person
|
Martin Krallinger: [e-mail],[URL]
|
|
|