Leveraging genetic reports and electronic health records for the prediction of primary cancers: Algorithm development and validation study

Nansu Zong; Victoria Ngo; Daniel J. Stone; Andrew Wen; Yiqing Zhao; Yue Yu; Sijia Liu; Ming Huang; Chen Wang; Guoqian Jiang

doi:10.2196/23586

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Background: Precision oncology has the potential to leverage clinical and genomic data in advancing disease prevention, diagnosis, and treatment. A key research area focuses on the early detection of primary cancers and potential prediction of cancers of unknown primary in order to facilitate optimal treatment decisions.

Objective: This study presents a methodology to harmonize phenotypic and genetic data features to classify primary cancer types and predict cancers of unknown primaries.

Methods: We extracted genetic data elements from oncology genetic reports of 1011 patients with cancer and their corresponding phenotypical data from Mayo Clinic’s electronic health records. We modeled both genetic and electronic health record data with HL7 Fast Healthcare Interoperability Resources. The semantic web Resource Description Framework was employed to generate the network-based data representation (ie, patient-phenotypic-genetic network). Based on the Resource Description Framework data graph, Node2vec graph-embedding algorithm was applied to generate features. Multiple machine learning and deep learning backbone models were compared for cancer prediction performance.

Results: With 6 machine learning tasks designed in the experiment, we demonstrated the proposed method achieved favorable results in classifying primary cancer types (area under the receiver operating characteristic curve [AUROC] 96.56% for all 9 cancer predictions on average based on the cross-validation) and predicting unknown primaries (AUROC 80.77% for all 8 cancer predictions on average for real-patient validation). To demonstrate the interpretability, 17 phenotypic and genetic features that contributed the most to the prediction of each cancer were identified and validated based on a literature review.

Conclusions: Accurate prediction of cancer types can be achieved with existing electronic health record data with satisfactory precision. The integration of genetic reports improves prediction, illustrating the translational values of incorporating genetic tests early at the diagnosis stage for patients with cancer.
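
Illustrative sketch (not part of the published record): the abstract describes a patient-phenotypic-genetic network embedded with Node2vec. The Python below is a minimal, standard-library sketch of that general idea, not the authors' implementation. The triples, predicate names, node identifiers, and parameter values are all hypothetical. It collapses RDF-style triples into an undirected graph and samples the second-order biased random walks that Node2vec uses; in a full pipeline these walks would be passed to a skip-gram model to produce the node embeddings used as features.

```python
import random
from collections import defaultdict

# Hypothetical RDF-style triples (subject, predicate, object).
# Every identifier here is illustrative and not taken from the study.
TRIPLES = [
    ("patient:1", "hasPhenotype", "phenotype:A"),
    ("patient:1", "hasVariant", "gene:X"),
    ("patient:2", "hasPhenotype", "phenotype:A"),
    ("patient:2", "hasVariant", "gene:Y"),
    ("patient:3", "hasPhenotype", "phenotype:B"),
    ("patient:3", "hasVariant", "gene:X"),
]


def build_graph(triples):
    """Collapse triples into an undirected adjacency map, ignoring predicates."""
    graph = defaultdict(set)
    for subject, _predicate, obj in triples:
        graph[subject].add(obj)
        graph[obj].add(subject)
    return {node: sorted(neighbors) for node, neighbors in graph.items()}


def biased_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one walk using Node2vec's return (p) and in-out (q) weights."""
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = graph[current]
        if not neighbors:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)      # stay one hop from the previous node
            else:
                weights.append(1.0 / q)  # move two hops from the previous node
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def generate_walks(graph, walks_per_node=2, length=5, p=1.0, q=1.0, seed=0):
    """Start walks_per_node walks from every node, reproducibly for a given seed."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for node in sorted(graph):
            walks.append(biased_walk(graph, node, length, p, q, rng))
    return walks


GRAPH = build_graph(TRIPLES)
WALKS = generate_walks(GRAPH)
```

In this toy graph, patients that share a phenotype or gene node end up in the same walks, which is the property that lets an embedding place similar patients close together. Smaller `p` makes a walk more likely to backtrack, while smaller `q` pushes it outward.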

Original language	English (US)
Article number	e23586
Journal	JMIR Medical Informatics
Volume	9
Issue number	5
DOIs	https://doi.org/10.2196/23586
State	Published - May 2021

Keywords

Electronic health records
FHIR
Fast Healthcare Interoperability Resources
Genetic reports
Predicting primary cancers
RDF
Resource Description Framework

ASJC Scopus subject areas

Health Informatics
Health Information Management

Cite this

@article{5e57fa62018744c3bb1b8dc0fff7d0b5,

title = "Leveraging genetic reports and electronic health records for the prediction of primary cancers: Algorithm development and validation study",

abstract = "Background: Precision oncology has the potential to leverage clinical and genomic data in advancing disease prevention, diagnosis, and treatment. A key research area focuses on the early detection of primary cancers and potential prediction of cancers of unknown primary in order to facilitate optimal treatment decisions. Objective: This study presents a methodology to harmonize phenotypic and genetic data features to classify primary cancer types and predict cancers of unknown primaries. Methods: We extracted genetic data elements from oncology genetic reports of 1011 patients with cancer and their corresponding phenotypical data from Mayo Clinic{\textquoteright}s electronic health records. We modeled both genetic and electronic health record data with HL7 Fast Healthcare Interoperability Resources. The semantic web Resource Description Framework was employed to generate the network-based data representation (ie, patient-phenotypic-genetic network). Based on the Resource Description Framework data graph, Node2vec graph-embedding algorithm was applied to generate features. Multiple machine learning and deep learning backbone models were compared for cancer prediction performance. Results: With 6 machine learning tasks designed in the experiment, we demonstrated the proposed method achieved favorable results in classifying primary cancer types (area under the receiver operating characteristic curve [AUROC] 96.56% for all 9 cancer predictions on average based on the cross-validation) and predicting unknown primaries (AUROC 80.77% for all 8 cancer predictions on average for real-patient validation). To demonstrate the interpretability, 17 phenotypic and genetic features that contributed the most to the prediction of each cancer were identified and validated based on a literature review. Conclusions: Accurate prediction of cancer types can be achieved with existing electronic health record data with satisfactory precision. 
The integration of genetic reports improves prediction, illustrating the translational values of incorporating genetic tests early at the diagnosis stage for patients with cancer.",

keywords = "Electronic health records, FHIR, Fast Healthcare Interoperability Resources, Genetic reports, Predicting primary cancers, RDF, Resource Description Framework",

author = "Nansu Zong and Victoria Ngo and Stone, {Daniel J.} and Andrew Wen and Yiqing Zhao and Yue Yu and Sijia Liu and Ming Huang and Chen Wang and Guoqian Jiang",

note = "Publisher Copyright: {\textcopyright}Nansu Zong, Victoria Ngo, Daniel J Stone, Andrew Wen, Yiqing Zhao, Yue Yu, Sijia Liu, Ming Huang, Chen Wang, Guoqian Jiang.",

year = "2021",

month = may,

doi = "10.2196/23586",

language = "English (US)",

volume = "9",

journal = "JMIR Medical Informatics",

issn = "2291-9694",

publisher = "JMIR Publications Inc.",

number = "5",

}

TY - JOUR

T1 - Leveraging genetic reports and electronic health records for the prediction of primary cancers

T2 - Algorithm development and validation study

AU - Zong, Nansu

AU - Ngo, Victoria

AU - Stone, Daniel J.

AU - Wen, Andrew

AU - Zhao, Yiqing

AU - Yu, Yue

AU - Liu, Sijia

AU - Huang, Ming

AU - Wang, Chen

AU - Jiang, Guoqian

N1 - Publisher Copyright: ©Nansu Zong, Victoria Ngo, Daniel J Stone, Andrew Wen, Yiqing Zhao, Yue Yu, Sijia Liu, Ming Huang, Chen Wang, Guoqian Jiang.

PY - 2021/5

Y1 - 2021/5

N2 - Background: Precision oncology has the potential to leverage clinical and genomic data in advancing disease prevention, diagnosis, and treatment. A key research area focuses on the early detection of primary cancers and potential prediction of cancers of unknown primary in order to facilitate optimal treatment decisions. Objective: This study presents a methodology to harmonize phenotypic and genetic data features to classify primary cancer types and predict cancers of unknown primaries. Methods: We extracted genetic data elements from oncology genetic reports of 1011 patients with cancer and their corresponding phenotypical data from Mayo Clinic’s electronic health records. We modeled both genetic and electronic health record data with HL7 Fast Healthcare Interoperability Resources. The semantic web Resource Description Framework was employed to generate the network-based data representation (ie, patient-phenotypic-genetic network). Based on the Resource Description Framework data graph, Node2vec graph-embedding algorithm was applied to generate features. Multiple machine learning and deep learning backbone models were compared for cancer prediction performance. Results: With 6 machine learning tasks designed in the experiment, we demonstrated the proposed method achieved favorable results in classifying primary cancer types (area under the receiver operating characteristic curve [AUROC] 96.56% for all 9 cancer predictions on average based on the cross-validation) and predicting unknown primaries (AUROC 80.77% for all 8 cancer predictions on average for real-patient validation). To demonstrate the interpretability, 17 phenotypic and genetic features that contributed the most to the prediction of each cancer were identified and validated based on a literature review. Conclusions: Accurate prediction of cancer types can be achieved with existing electronic health record data with satisfactory precision. 
The integration of genetic reports improves prediction, illustrating the translational values of incorporating genetic tests early at the diagnosis stage for patients with cancer.

KW - Electronic health records

KW - FHIR

KW - Fast Healthcare Interoperability Resources

KW - Genetic reports

KW - Predicting primary cancers

KW - RDF

KW - Resource Description Framework

UR - http://www.scopus.com/inward/record.url?scp=85103687759&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85103687759&partnerID=8YFLogxK

U2 - 10.2196/23586

DO - 10.2196/23586

M3 - Article

AN - SCOPUS:85103687759

SN - 2291-9694

VL - 9

JO - JMIR Medical Informatics

JF - JMIR Medical Informatics

IS - 5

M1 - e23586

ER -
