TY - GEN
T1 - Evaluating the impact of data representation on EHR-based analytic tasks
AU - Oh, Wonsuk
AU - Steinbach, Michael S.
AU - Castro, M. Regina
AU - Peterson, Kevin A.
AU - Kumar, Vipin
AU - Caraballo, Pedro J.
AU - Simon, Gyorgy J.
N1 - Funding Information:
This work was supported by NIH award LM011972 and NSF awards IIS 1602394 and IIS 1602198. The views expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
Publisher Copyright:
© 2019 International Medical Informatics Association (IMIA) and IOS Press. This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
PY - 2019/8/21
Y1 - 2019/8/21
N2 - Different analytic techniques operate optimally with different types of data. As the use of EHR-based analytics expands to newer tasks, data will have to be transformed into different representations, so the tasks can be optimally solved. We classified representations into broad categories based on their characteristics, and proposed a new knowledge-driven representation for clinical data mining as well as trajectory mining, called Severity Encoding Variables (SEVs). Additionally, we studied which characteristics make representations most suitable for particular clinical analytics tasks including trajectory mining. Our evaluation shows that, for regression, most data representations performed similarly, with SEV achieving a slight (albeit statistically significant) advantage. For patients at high risk of diabetes, it outperformed the competing representation by (relative) 20%. For association mining, SEV achieved the highest performance. Its ability to constrain the search space of patterns through clinical knowledge was key to its success.
KW - Data Mining
KW - Data Science
KW - Electronic Health Records
UR - http://www.scopus.com/inward/record.url?scp=85071512591&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85071512591&partnerID=8YFLogxK
U2 - 10.3233/SHTI190229
DO - 10.3233/SHTI190229
M3 - Conference contribution
C2 - 31437931
AN - SCOPUS:85071512591
T3 - Studies in Health Technology and Informatics
SP - 288
EP - 292
BT - MEDINFO 2019
A2 - Seroussi, Brigitte
A2 - Ohno-Machado, Lucila
PB - IOS Press
T2 - 17th World Congress on Medical and Health Informatics, MEDINFO 2019
Y2 - 25 August 2019 through 30 August 2019
ER -