TY - JOUR
T1 - Detecting concept mentions in biomedical text using hidden Markov model
T2 - Multiple concept types at once or one at a time?
AU - Torii, Manabu
AU - Wagholikar, Kavishwar
AU - Liu, Hongfang
N1 - Funding Information:
This article has been published as part of the thematic series “Semantic Mining of Languages in Biology and Medicine” of the Journal of Biomedical Semantics. An early version of this paper was presented at the Fourth International Symposium on Languages in Biology and Medicine (LBM 2011), held in Singapore in 2011. De-identified clinical records used in this research were provided by the i2b2 National Center for Biomedical Computing, funded by U54LM008748, and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organized by Dr. Ozlem Uzuner, i2b2 and SUNY. The authors also thank all the other researchers and developers who made their software resources and annotated corpora available to the research community. The authors acknowledge funding from the National Science Foundation (ABI: 0845523) and the National Institutes of Health (R01LM009959A1).
Publisher Copyright:
© 2014 Torii et al.; licensee BioMed Central Ltd.
PY - 2014/1/17
Y1 - 2014/1/17
N2 - Background: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance. Results: Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted, and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those for the latter by 0.9 to 2.6% on the i2b2/VA corpus and by 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy. Conclusions: The current results suggest that detection of concept phrases could be improved by simultaneously tackling multiple concept types. They also suggest that multiple concept types should be annotated when developing a new corpus for machine learning models. Further investigation is expected to yield insights into the underlying mechanism by which considering multiple concept types achieves good performance.
AB - Background: Identifying phrases that refer to particular concept types is a critical step in extracting information from documents. Provided with annotated documents as training data, supervised machine learning can automate this process. When building a machine learning model for this task, the model may be built to detect all types simultaneously (all-types-at-once) or for one or a few selected types at a time (one-type- or a-few-types-at-a-time). It is of interest to investigate which strategy yields better detection performance. Results: Hidden Markov models using the different strategies were evaluated on a clinical corpus annotated with three concept types (i2b2/VA corpus) and a biology literature corpus annotated with five concept types (JNLPBA corpus). Ten-fold cross-validation tests were conducted, and the experimental results showed that models trained for multiple concept types consistently yielded better performance than those trained for a single concept type. F-scores observed for the former strategies were higher than those for the latter by 0.9 to 2.6% on the i2b2/VA corpus and by 1.4 to 10.1% on the JNLPBA corpus, depending on the target concept types. Improved boundary detection and reduced type confusion were observed for the all-types-at-once strategy. Conclusions: The current results suggest that detection of concept phrases could be improved by simultaneously tackling multiple concept types. They also suggest that multiple concept types should be annotated when developing a new corpus for machine learning models. Further investigation is expected to yield insights into the underlying mechanism by which considering multiple concept types achieves good performance.
KW - Data mining
KW - Electronic health records
KW - Information storage and retrieval
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=84920719766&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84920719766&partnerID=8YFLogxK
U2 - 10.1186/2041-1480-5-3
DO - 10.1186/2041-1480-5-3
M3 - Article
AN - SCOPUS:84920719766
SN - 2041-1480
VL - 5
JO - Journal of Biomedical Semantics
JF - Journal of Biomedical Semantics
IS - 1
M1 - 3
ER -