Evaluating gene/protein name tagging and mapping for article retrieval

Chong Min Lee, Manabu Torii, Jinesh Shah, Yi Ting Tsai, Zhang Zhi Hu, Hongfang Liu

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation


Background: Tagging gene/protein names in text and mapping them to database entries are critical tasks in biological literature mining. Most existing tagging and normalization approaches, however, have not been evaluated for practical use in article retrieval to support efficient biocuration.

Results: Using the literature cross-reference information provided by the NCBI Entrez Gene database, we found that gene/protein databases cover around 94% of the gene/protein names found in text. The upper bound on recall in retrieving MEDLINE citations by gene/protein names is around 70-80% when citations cross-referenced by many genes are disregarded and flexible matching of names is used. Of the genes/proteins that failed to be retrieved by name, over 30% of the failures occur because the citations do not discuss the cross-referenced genes/proteins in their abstracts, and around 60% are attributable to the gene/protein name tagging system trained on the BioCreAtIvE II gene mention corpus.

Conclusions: The study demonstrates that existing gene/protein databases have decent coverage of the gene/protein names used in MEDLINE abstracts. Approaches and data resources for gene/protein tagging and mapping need to be selected appropriately for each practical task.
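
To make the recall figure in the abstract concrete, the following is a minimal sketch of how recall of name-based citation retrieval could be computed against gene-to-citation cross-references. It is not the authors' implementation. The function names, the data layout, the normalization rule standing in for "flexible matching", and the `max_genes_per_citation` cutoff are all assumptions made for illustration; the abstract does not specify any of them.

```python
import re


def normalize(name):
    """Hypothetical 'flexible matching': lowercase and drop non-alphanumerics,
    so that 'IL-2', 'IL 2' and 'il2' all compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def retrieval_recall(cross_refs, tagged_names, gene_names, max_genes_per_citation=10):
    """Fraction of (citation, gene) cross-reference pairs recovered by name.

    cross_refs:   dict citation_id -> set of gene_ids (gold cross-references)
    tagged_names: dict citation_id -> list of names tagged in the abstract
    gene_names:   dict gene_id -> list of database names and synonyms
    max_genes_per_citation: assumed cutoff; citations cross-referenced by
        more genes than this are skipped entirely.
    """
    hits = 0
    total = 0
    for citation_id, gene_ids in cross_refs.items():
        if len(gene_ids) > max_genes_per_citation:
            continue  # skip citations cross-referenced by many genes
        found = {normalize(n) for n in tagged_names.get(citation_id, [])}
        found.discard("")  # ignore names that normalize to nothing
        for gene_id in gene_ids:
            total += 1
            synonyms = gene_names.get(gene_id, [])
            if any(normalize(s) in found for s in synonyms):
                hits += 1
    return hits / total if total else 0.0
```

Under these assumptions, a gene counts as retrieved for a citation when any of its database names matches a tagged name after normalization. A missed pair can then be inspected to see whether the abstract never mentions the gene or whether the tagger failed to mark it.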

Original language: English (US)
Pages (from-to): 104-109
Number of pages: 6
Journal: CEUR Workshop Proceedings
State: Published - 2010
Event: 4th International Symposium on Semantic Mining in Biomedicine, SMBM 2010 - Cambridge, United Kingdom
Duration: Oct 25 2010 to Oct 26 2010

ASJC Scopus subject areas

  • General Computer Science

