TY - JOUR
T1 - Privacy-preserving predictive modeling
T2 - Harmonization of contextual embeddings from different sources
AU - Huang, Yingxiang
AU - Lee, Junghye
AU - Wang, Shuang
AU - Sun, Jimeng
AU - Liu, Hongfang
AU - Jiang, Xiaoqian
N1 - Funding Information:
This research was supported in part by the National Institutes of Health under award numbers R01GM114612, R01GM118574, R00HG008175, and U01EB023685, and by NLM T15 grant NIH/NLM T15LM011271 to the University of California, San Diego. It was also supported in part by grant U01TR002062 and OHNLP U01.
Publisher Copyright:
© 2018 JMIR Publications Inc. All rights reserved.
PY - 2018/4
Y1 - 2018/4
N2 - Background: Data sharing has been a major challenge in biomedical informatics because of privacy concerns. Contextual embedding models have demonstrated a strong representational capability to describe medical concepts (and their context), and they have shown promise as an alternative way to support deep-learning applications without the need to disclose original data. However, contextual embedding models acquired from individual hospitals cannot be directly combined because their embedding spaces are different, and naive pooling renders combined embeddings useless. Objective: The aim of this study was to present a novel approach to address these issues and to promote sharing representations without sharing data. Without sacrificing privacy, we also aimed to build a global model from representations learned from local private data and synchronize information from multiple sources. Methods: We propose a methodology that harmonizes different local contextual embeddings into a global model. We used Word2Vec to generate contextual embeddings from each source and Procrustes to fuse different vector models into one common space by using a list of corresponding pairs as anchor points. We performed prediction analysis with harmonized embeddings. Results: We used sequential medical events extracted from the Medical Information Mart for Intensive Care III database to evaluate the proposed methodology in predicting the next likely diagnosis of a new patient using either structured data or unstructured data. Under different experimental scenarios, we confirmed that the global model built from harmonized local models achieves more accurate prediction than local models and global models built from naive pooling. Conclusions: Such aggregation of local models using our unique harmonization can serve as a proxy for a global model, combining information from a wide range of institutions and information sources. It allows information unique to a certain hospital to become available to other sites, increasing the fluidity of information flow in health care.
KW - Contextual embedding
KW - Interoperability
KW - Patient data privacy
KW - Predictive models
UR - http://www.scopus.com/inward/record.url?scp=85056933801&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85056933801&partnerID=8YFLogxK
U2 - 10.2196/medinform.9455
DO - 10.2196/medinform.9455
M3 - Article
AN - SCOPUS:85056933801
SN - 2291-9694
VL - 6
JO - JMIR Medical Informatics
JF - JMIR Medical Informatics
IS - 2
M1 - e33
ER -