Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm

Yoonjung Yoonie Joo; Jennifer A. Pacheco; William K. Thompson; Laura J. Rasmussen-Torvik; Luke V. Rasmussen; Frederick T.J. Lin; Mariza de Andrade; Kenneth M. Borthwick; Erwin Bottinger; Andrew Cagan; David S. Carrell; Joshua C. Denny; Stephen B. Ellis; Omri Gottesman; James G. Linneman; Jyotishman Pathak; Peggy L. Peissig; Ning Shang; Gerard Tromp; Annapoorani Veerappan; Maureen E. Smith; Rex L. Chisholm; Andrew J. Gawron; M. Geoffrey Hayes; Abel N. Kho

doi:10.1371/journal.pone.0283553

Quantitative Health Sciences

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ∼50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD by applying a natural language processing (NLP) technique to multiple electronic health record (EHR) data sources covering 91,166 multi-ancestry participants.

Materials and methods: We developed an NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African, and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid or pleiotropic effects on clinical phenotypes.

Results: Our algorithm significantly improved patient classification performance for DD analysis (algorithm positive predictive values [PPVs] ≥ 0.94), with up to a 3.5-fold increase in the number of identified patients compared with the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis in the identified subjects replicated the well-established associations of ARHGAP15 loci with DD, with overall stronger GWAS signals in diverticulitis patients than in diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes.

Discussion: In this first multi-ancestry GWAS-PheWAS study, we showed that heterogeneous EHR data can be mapped through an integrative analytical pipeline to reveal significant genotype-phenotype associations with clinical interpretation.

Conclusion: A systematic framework for processing unstructured EHR data with NLP could advance deep and scalable phenotyping for better patient identification and facilitate etiological investigation of a disease using multilayered data.
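
The abstract reports algorithm performance as positive predictive value (PPV). As a minimal sketch of how that metric is computed, assuming a manual chart review that labels each algorithm-flagged patient as confirmed or not, the calculation is simply true positives divided by all flagged patients. The function name and the counts below are hypothetical illustrations and are not taken from the study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Return PPV = TP / (TP + FP).

    TP: algorithm-flagged patients confirmed as cases on chart review.
    FP: algorithm-flagged patients not confirmed on chart review.
    """
    flagged = true_positives + false_positives
    if flagged == 0:
        # With no flagged patients there is nothing to validate.
        raise ValueError("PPV is undefined when no patients were flagged")
    return true_positives / flagged


# Hypothetical chart-review counts (illustration only, not study data).
ppv = positive_predictive_value(true_positives=47, false_positives=3)
print(f"PPV = {ppv:.2f}")  # 47 / 50 = 0.94
```

PPV depends on how common the condition is in the reviewed sample, so values from different sites or cohorts are not directly comparable without that context.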

Original language	English (US)
Article number	e0283553
Journal	PloS one
Volume	18
Issue number	5 MAY
DOIs	https://doi.org/10.1371/journal.pone.0283553
State	Published - May 2023

ASJC Scopus subject areas

General

Access to Document

10.1371/journal.pone.0283553

Cite this

Joo, Y. Y., Pacheco, J. A., Thompson, W. K., Rasmussen-Torvik, L. J., Rasmussen, L. V., Lin, F. T. J., de Andrade, M., Borthwick, K. M., Bottinger, E., Cagan, A., Carrell, D. S., Denny, J. C., Ellis, S. B., Gottesman, O., Linneman, J. G., Pathak, J., Peissig, P. L., Shang, N., Tromp, G., ... Kho, A. N. (2023). Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm. PloS one, 18(5 MAY), Article e0283553. https://doi.org/10.1371/journal.pone.0283553

Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm. / Joo, Yoonjung Yoonie; Pacheco, Jennifer A.; Thompson, William K. et al.
In: PloS one, Vol. 18, No. 5 MAY, e0283553, 05.2023.

Joo, YY, Pacheco, JA, Thompson, WK, Rasmussen-Torvik, LJ, Rasmussen, LV, Lin, FTJ, de Andrade, M, Borthwick, KM, Bottinger, E, Cagan, A, Carrell, DS, Denny, JC, Ellis, SB, Gottesman, O, Linneman, JG, Pathak, J, Peissig, PL, Shang, N, Tromp, G, Veerappan, A, Smith, ME, Chisholm, RL, Gawron, AJ, Hayes, MG & Kho, AN 2023, 'Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm', PloS one, vol. 18, no. 5 MAY, e0283553. https://doi.org/10.1371/journal.pone.0283553

@article{f2847b00d25a4afaa8fb159626678f30,

title = "Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm",

abstract = "Objective Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ∼50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD, leveraging multiple electronic health record (EHR) data sources of 91,166 multi-ancestry participants with a Natural Language Processing (NLP) technique. Materials and methods We developed a NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid/pleiotropic effects in clinical phenotypes. Results Our developed algorithm showed a significant improvement in patient classification performance for DD analysis (algorithm PPVs ≥ 0.94), with up to a 3.5 fold increase in terms of the number of identified patients than the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis of the identified subjects replicated the well-established associations between ARHGAP15 loci with DD, showing overall intensified GWAS signals in diverticulitis patients compared to diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes. Discussion As the first multi-ancestry GWAS-PheWAS study, we showcased that heterogenous EHR data can be mapped through an integrative analytical pipeline and reveal significant genotype- phenotype associations with clinical interpretation. 
Conclusion A systematic framework to process unstructured EHR data with NLP could advance a deep and scalable phenotyping for better patient identification and facilitate etiological investigation of a disease with multilayered data.",

author = "Joo, {Yoonjung Yoonie} and Pacheco, {Jennifer A.} and Thompson, {William K.} and Rasmussen-Torvik, {Laura J.} and Rasmussen, {Luke V.} and Lin, {Frederick T.J.} and {de Andrade}, Mariza and Borthwick, {Kenneth M.} and Erwin Bottinger and Andrew Cagan and Carrell, {David S.} and Denny, {Joshua C.} and Ellis, {Stephen B.} and Omri Gottesman and Linneman, {James G.} and Jyotishman Pathak and Peissig, {Peggy L.} and Ning Shang and Gerard Tromp and Annapoorani Veerappan and Smith, {Maureen E.} and Chisholm, {Rex L.} and Gawron, {Andrew J.} and Hayes, {M. Geoffrey} and Kho, {Abel N.}",

note = "Publisher Copyright: {\textcopyright} 2023 Joo et al.",

year = "2023",

month = may,

doi = "10.1371/journal.pone.0283553",

language = "English (US)",

volume = "18",

journal = "PloS one",

issn = "1932-6203",

publisher = "Public Library of Science",

number = "5 MAY",

}

TY - JOUR

T1 - Multi-ancestry genome- and phenome-wide association studies of diverticular disease in electronic health records with natural language processing enriched phenotyping algorithm

AU - Joo, Yoonjung Yoonie

AU - Pacheco, Jennifer A.

AU - Thompson, William K.

AU - Rasmussen-Torvik, Laura J.

AU - Rasmussen, Luke V.

AU - Lin, Frederick T.J.

AU - de Andrade, Mariza

AU - Borthwick, Kenneth M.

AU - Bottinger, Erwin

AU - Cagan, Andrew

AU - Carrell, David S.

AU - Denny, Joshua C.

AU - Ellis, Stephen B.

AU - Gottesman, Omri

AU - Linneman, James G.

AU - Pathak, Jyotishman

AU - Peissig, Peggy L.

AU - Shang, Ning

AU - Tromp, Gerard

AU - Veerappan, Annapoorani

AU - Smith, Maureen E.

AU - Chisholm, Rex L.

AU - Gawron, Andrew J.

AU - Hayes, M. Geoffrey

AU - Kho, Abel N.

PY - 2023/5

Y1 - 2023/5

N2 - Objective Diverticular disease (DD) is one of the most prevalent conditions encountered by gastroenterologists, affecting ∼50% of Americans before the age of 60. Our aim was to identify genetic risk variants and clinical phenotypes associated with DD, leveraging multiple electronic health record (EHR) data sources of 91,166 multi-ancestry participants with a Natural Language Processing (NLP) technique. Materials and methods We developed a NLP-enriched phenotyping algorithm that incorporated colonoscopy or abdominal imaging reports to identify patients with diverticulosis and diverticulitis from multicenter EHRs. We performed genome-wide association studies (GWAS) of DD in European, African and multi-ancestry participants, followed by phenome-wide association studies (PheWAS) of the risk variants to identify their potential comorbid/pleiotropic effects in clinical phenotypes. Results Our developed algorithm showed a significant improvement in patient classification performance for DD analysis (algorithm PPVs ≥ 0.94), with up to a 3.5 fold increase in terms of the number of identified patients than the traditional method. Ancestry-stratified analyses of diverticulosis and diverticulitis of the identified subjects replicated the well-established associations between ARHGAP15 loci with DD, showing overall intensified GWAS signals in diverticulitis patients compared to diverticulosis patients. Our PheWAS analyses identified significant associations between the DD GWAS variants and circulatory system, genitourinary, and neoplastic EHR phenotypes. Discussion As the first multi-ancestry GWAS-PheWAS study, we showcased that heterogenous EHR data can be mapped through an integrative analytical pipeline and reveal significant genotype- phenotype associations with clinical interpretation. 
Conclusion A systematic framework to process unstructured EHR data with NLP could advance a deep and scalable phenotyping for better patient identification and facilitate etiological investigation of a disease with multilayered data.

UR - http://www.scopus.com/inward/record.url?scp=85159769375&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85159769375&partnerID=8YFLogxK

U2 - 10.1371/journal.pone.0283553

DO - 10.1371/journal.pone.0283553

M3 - Article

C2 - 37196047

AN - SCOPUS:85159769375

SN - 1932-6203

VL - 18

JO - PloS one

JF - PloS one

IS - 5 MAY

M1 - e0283553

ER -
