An exploratory study of a text classification framework for Internet-based surveillance of emerging epidemics

Manabu Torii; Lanlan Yin; Thang Nguyen; Chand T. Mazumdar; Hongfang Liu; David M. Hartley; Noele P. Nelson

doi:10.1016/j.ijmedinf.2010.10.015

Digital Health Sciences

Research output: Contribution to journal › Article › peer-review

35 Scopus citations

Abstract

Purpose: Early detection of infectious disease outbreaks is crucial to protecting the public health of a society. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined the use of randomly sampled unlabeled articles as well as labeled relevant articles.

Methods: Naïve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant and 149 or more randomly sampled unlabeled articles. Diverse classifiers were trained by varying the number of sampled unlabeled articles and also the number of word features. The trained classifiers were applied to 15 thousand articles published over 15 days. Top-ranked articles from each classifier were pooled and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers.

Results: Daily averages of areas under ROC curves (AUCs) over the 15-day evaluation period were 0.841 and 0.836, respectively, for the naïve Bayes and SVM classifiers. We referenced a database of disease outbreak reports to confirm that the evaluation data set resulting from the pooling method indeed covered incidents recorded in the database during the evaluation period.

Conclusions: The proposed text classification framework utilizing randomly sampled unlabeled articles can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and using articles in non-English languages.
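The training setup in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, Laplace smoothing, toy articles, and the function names `train_nb`, `score`, and `auc` are all assumptions made for illustration. The one idea taken from the abstract is that randomly sampled unlabeled articles stand in for the negative class, and that ranking quality is measured by the area under the ROC curve.

```python
import math
from collections import Counter


def train_nb(relevant_docs, unlabeled_docs):
    """Fit a multinomial naive Bayes model.

    Unlabeled articles are treated as if they were irrelevant (class 0),
    even though a random sample may contain a few relevant ones.
    """
    counts = {1: Counter(), 0: Counter()}
    for doc in relevant_docs:
        counts[1].update(doc.lower().split())
    for doc in unlabeled_docs:
        counts[0].update(doc.lower().split())
    vocab = set(counts[1]) | set(counts[0])
    totals = {c: sum(counts[c].values()) for c in (0, 1)}
    n = len(relevant_docs) + len(unlabeled_docs)
    prior = {1: len(relevant_docs) / n, 0: len(unlabeled_docs) / n}
    return counts, totals, vocab, prior


def score(model, doc):
    """Return the log-odds that doc is relevant (higher = more relevant)."""
    counts, totals, vocab, prior = model
    v = len(vocab)
    s = math.log(prior[1]) - math.log(prior[0])
    for word in doc.lower().split():
        if word in vocab:  # words never seen in training are ignored
            p1 = (counts[1][word] + 1) / (totals[1] + v)  # Laplace smoothing
            p0 = (counts[0][word] + 1) / (totals[0] + v)
            s += math.log(p1) - math.log(p0)
    return s


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three labeled relevant articles and four
# randomly sampled unlabeled ones, one of which happens to be on-topic.
relevant = [
    "cholera outbreak reported in the region",
    "new cases of avian flu outbreak confirmed",
    "health officials confirm measles outbreak",
]
unlabeled = [
    "stock markets rally after earnings report",
    "local team wins the championship game",
    "officials report flu cases rising",
    "new phone released this week",
]
model = train_nb(relevant, unlabeled)

# Held-out articles with expert labels, used only for evaluation.
test_docs = [
    "measles outbreak confirmed by health officials",
    "championship game this week",
    "cholera cases reported",
    "stock markets report",
]
test_labels = [1, 0, 1, 0]
scores = [score(model, d) for d in test_docs]
print("AUC on toy evaluation set:", auc(scores, test_labels))
```

In the study the evaluation was repeated per day and the AUCs averaged over the 15-day period; the same `auc` helper could be applied to each day's scored articles and the results averaged.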

Original language	English (US)
Pages (from-to)	56-66
Number of pages	11
Journal	International Journal of Medical Informatics
Volume	80
Issue number	1
DOIs	https://doi.org/10.1016/j.ijmedinf.2010.10.015
State	Published - Jan 2011

Keywords

Biosurveillance
Disease notification
Disease outbreaks
Information storage and retrieval
Internet
Medical informatics applications
Natural language processing

ASJC Scopus subject areas

Health Informatics

Cite this

@article{fb871605b4dc4780923b364a96c615d8,

title = "An exploratory study of a text classification framework for Internet-based surveillance of emerging epidemics",

abstract = "Purpose: Early detection of infectious disease outbreaks is crucial to protecting the public health of a society. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined the use of randomly sampled unlabeled articles as well as labeled relevant articles. Methods: Na{\"i}ve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant and 149 or more randomly sampled unlabeled articles. Diverse classifiers were trained by varying the number of sampled unlabeled articles and also the number of word features. The trained classifiers were applied to 15 thousand articles published over 15 days. Top-ranked articles from each classifier were pooled and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers. Results: Daily averages of areas under ROC curves (AUCs) over the 15-day evaluation period were 0.841 and 0.836, respectively, for the na{\"i}ve Bayes and SVM classifier. We referenced a database of disease outbreak reports to confirm that this evaluation data set resulted from the pooling method indeed covered incidents recorded in the database during the evaluation period. Conclusions: The proposed text classification framework utilizing randomly sampled unlabeled articles can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and using articles in non-English languages.",

keywords = "Biosurveillance, Disease notification, Disease outbreaks, Information storage and retrieval, Internet, Medical informatics applications, Natural language processing",

author = "Manabu Torii and Lanlan Yin and Thang Nguyen and Mazumdar, {Chand T.} and Hongfang Liu and Hartley, {David M.} and Nelson, {Noele P.}",

note = "Funding Information: This research and development project was conducted by Georgetown University and is made possible by a contract awarded and administered by the U.S. Army Medical Research and Materiel Command (USAMRMC) and the Telemedicine and Advanced Technology Research Center (TATRC), Fort Detrick, Maryland 21702, under contract number W81XWH-04-1-0857. The views, opinions and findings contained in this research are those of the authors and do not necessarily reflect the views of the Department of Defense and should not be construed as an official DoD/Army policy unless so designated by other documentation. No official endorsement should be made. We thank our sponsors and also the members of Project Argus, especially Mr. Dan Ji and Mr. Peter Li for their technical support and Dr. Kevin Jones for stimulating discussion.",

year = "2011",

month = jan,

doi = "10.1016/j.ijmedinf.2010.10.015",

language = "English (US)",

volume = "80",

pages = "56--66",

journal = "International Journal of Medical Informatics",

issn = "1386-5056",

publisher = "Elsevier Ireland Ltd",

number = "1",

}

TY - JOUR

T1 - An exploratory study of a text classification framework for Internet-based surveillance of emerging epidemics

AU - Torii, Manabu

AU - Yin, Lanlan

AU - Nguyen, Thang

AU - Mazumdar, Chand T.

AU - Liu, Hongfang

AU - Hartley, David M.

AU - Nelson, Noele P.

N1 - Funding Information: This research and development project was conducted by Georgetown University and is made possible by a contract awarded and administered by the U.S. Army Medical Research and Materiel Command (USAMRMC) and the Telemedicine and Advanced Technology Research Center (TATRC), Fort Detrick, Maryland 21702, under contract number W81XWH-04-1-0857. The views, opinions and findings contained in this research are those of the authors and do not necessarily reflect the views of the Department of Defense and should not be construed as an official DoD/Army policy unless so designated by other documentation. No official endorsement should be made. We thank our sponsors and also the members of Project Argus, especially Mr. Dan Ji and Mr. Peter Li for their technical support and Dr. Kevin Jones for stimulating discussion.

PY - 2011/1

Y1 - 2011/1

N2 - Purpose: Early detection of infectious disease outbreaks is crucial to protecting the public health of a society. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined the use of randomly sampled unlabeled articles as well as labeled relevant articles. Methods: Naïve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant and 149 or more randomly sampled unlabeled articles. Diverse classifiers were trained by varying the number of sampled unlabeled articles and also the number of word features. The trained classifiers were applied to 15 thousand articles published over 15 days. Top-ranked articles from each classifier were pooled and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers. Results: Daily averages of areas under ROC curves (AUCs) over the 15-day evaluation period were 0.841 and 0.836, respectively, for the naïve Bayes and SVM classifier. We referenced a database of disease outbreak reports to confirm that this evaluation data set resulted from the pooling method indeed covered incidents recorded in the database during the evaluation period. Conclusions: The proposed text classification framework utilizing randomly sampled unlabeled articles can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and using articles in non-English languages.

AB - Purpose: Early detection of infectious disease outbreaks is crucial to protecting the public health of a society. Online news articles provide timely information on disease outbreaks worldwide. In this study, we investigated automated detection of articles relevant to disease outbreaks using machine learning classifiers. In a real-life setting, it is expensive to prepare a training data set for classifiers, which usually consists of manually labeled relevant and irrelevant articles. To mitigate this challenge, we examined the use of randomly sampled unlabeled articles as well as labeled relevant articles. Methods: Naïve Bayes and Support Vector Machine (SVM) classifiers were trained on 149 relevant and 149 or more randomly sampled unlabeled articles. Diverse classifiers were trained by varying the number of sampled unlabeled articles and also the number of word features. The trained classifiers were applied to 15 thousand articles published over 15 days. Top-ranked articles from each classifier were pooled and the resulting set of 1337 articles was reviewed by an expert analyst to evaluate the classifiers. Results: Daily averages of areas under ROC curves (AUCs) over the 15-day evaluation period were 0.841 and 0.836, respectively, for the naïve Bayes and SVM classifier. We referenced a database of disease outbreak reports to confirm that this evaluation data set resulted from the pooling method indeed covered incidents recorded in the database during the evaluation period. Conclusions: The proposed text classification framework utilizing randomly sampled unlabeled articles can facilitate a cost-effective approach to training machine learning classifiers in a real-life Internet-based biosurveillance project. We plan to examine this framework further using larger data sets and using articles in non-English languages.

KW - Biosurveillance

KW - Disease notification

KW - Disease outbreaks

KW - Information storage and retrieval

KW - Internet

KW - Medical informatics applications

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=78650283817&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=78650283817&partnerID=8YFLogxK

U2 - 10.1016/j.ijmedinf.2010.10.015

DO - 10.1016/j.ijmedinf.2010.10.015

M3 - Article

C2 - 21134784

AN - SCOPUS:78650283817

SN - 1386-5056

VL - 80

SP - 56

EP - 66

JO - International Journal of Medical Informatics

JF - International Journal of Medical Informatics

IS - 1

ER -
