Natural language processing and machine learning for identifying incident stroke from electronic health records: Algorithm development and validation

Yiqing Zhao; Sunyang Fu; Suzette J. Bielinski; Paul A. Decker; Alanna M. Chamberlain; Veronique L. Roger; Hongfang Liu; Nicholas B. Larson

doi:10.2196/22951

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Stroke is an important clinical outcome in cardiovascular research. However, incident stroke is typically ascertained through time-consuming manual chart abstraction. Current efforts to phenotype stroke from electronic health records focus on case ascertainment rather than incident disease, which requires knowledge of the temporal sequence of events.

Objective: The aim of this study was to develop a machine learning-based phenotyping algorithm for incident stroke ascertainment based on diagnosis codes, procedure codes, and clinical concepts extracted from clinical notes using natural language processing.

Methods: The algorithm was trained and validated using an existing epidemiology cohort of 4914 patients with atrial fibrillation (AF) and manually curated incident stroke events. Various combinations of feature sets and machine learning classifiers were compared. Using a heuristic rule based on the composition of concepts and codes, we further detected the subtype (ischemic stroke/transient ischemic attack or hemorrhagic stroke) of each identified stroke. The algorithm was further validated using a cohort (n=150) drawn by stratified sampling from a population in Olmsted County, Minnesota (N=74,314).

Results: Among the 4914 patients with AF, 740 had validated incident stroke events. The best-performing stroke phenotyping algorithm used clinical concepts, diagnosis codes, and procedure codes as features in a random forest classifier. Among patients with stroke codes in the general population sample, the best-performing model achieved a positive predictive value of 86% (43/50; 95% CI 74%-93%) and a negative predictive value of 96% (96/100). For subtype identification, we achieved an accuracy of 83% in the AF cohort and 80% in the general population sample.

Conclusions: We developed and validated a machine learning-based algorithm that performed well for identifying incident stroke and for determining the type of stroke. The algorithm also performed well on a sample from a general population, further demonstrating its generalizability and potential for adoption by other institutions.
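
The predictive values reported in the Results follow directly from the stated counts (43/50 and 96/100). The sketch below recomputes them and attaches a 95% confidence interval to the positive predictive value. The abstract does not say which interval method the authors used; the Wilson score interval is an assumption here, chosen because it happens to reproduce the reported bounds to two decimals. The function name `wilson_interval` is illustrative, not taken from the paper.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Counts as reported in the abstract for the general population sample.
ppv = 43 / 50    # predicted incident strokes confirmed by chart review
npv = 96 / 100   # predicted non-events confirmed by chart review

# Assumption: Wilson interval; the source does not name the method.
low, high = wilson_interval(43, 50)

print(f"PPV = {ppv:.2f}, 95% CI {low:.2f}-{high:.2f}")
print(f"NPV = {npv:.2f}")
```

With only 50 predicted positives, the interval spans roughly 19 percentage points, which is worth keeping in mind when comparing this result with other phenotyping algorithms.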

Original language	English (US)
Article number	e22951
Journal	Journal of Medical Internet Research
Volume	23
Issue number	3
DOIs	https://doi.org/10.2196/22951
State	Published - Mar 2021

Keywords

Electronic health records
Machine learning
Natural language processing
Stroke

ASJC Scopus subject areas

Health Informatics

Cite this

@article{1a4e399f3ffd47e78794b5e83edd546c,

title = "Natural language processing and machine learning for identifying incident stroke from electronic health records: Algorithm development and validation",

abstract = "Background: Stroke is an important clinical outcome in cardiovascular research. However, the ascertainment of incident stroke is typically accomplished via time-consuming manual chart abstraction. Current phenotyping efforts using electronic health records for stroke focus on case ascertainment rather than incident disease, which requires knowledge of the temporal sequence of events. Objective: The aim of this study was to develop a machine learning-based phenotyping algorithm for incident stroke ascertainment based on diagnosis codes, procedure codes, and clinical concepts extracted from clinical notes using natural language processing. Methods: The algorithm was trained and validated using an existing epidemiology cohort consisting of 4914 patients with atrial fibrillation (AF) with manually curated incident stroke events. Various combinations of feature sets and machine learning classifiers were compared. Using a heuristic rule based on the composition of concepts and codes, we further detected the stroke subtype (ischemic stroke/transient ischemic attack or hemorrhagic stroke) of each identified stroke. The algorithm was further validated using a cohort (n=150) stratified sampled from a population in Olmsted County, Minnesota (N=74,314). Results: Among the 4914 patients with AF, 740 had validated incident stroke events. The best-performing stroke phenotyping algorithm used clinical concepts, diagnosis codes, and procedure codes as features in a random forest classifier. Among patients with stroke codes in the general population sample, the best-performing model achieved a positive predictive value of 86% (43/50; 95% CI 0.74-0.93) and a negative predictive value of 96% (96/100). For subtype identification, we achieved an accuracy of 83% in the AF cohort and 80% in the general population sample. 
Conclusions: We developed and validated a machine learning-based algorithm that performed well for identifying incident stroke and for determining type of stroke. The algorithm also performed well on a sample from a general population, further demonstrating its generalizability and potential for adoption by other institutions.",

keywords = "Electronic health records, Machine learning, Natural language processing, Stroke",

author = "Yiqing Zhao and Sunyang Fu and Bielinski, {Suzette J.} and Decker, {Paul A.} and Chamberlain, {Alanna M.} and Roger, {Veronique L.} and Hongfang Liu and Larson, {Nicholas B.}",

year = "2021",

month = mar,

doi = "10.2196/22951",

language = "English (US)",

volume = "23",

journal = "Journal of Medical Internet Research",

issn = "1439-4456",

publisher = "JMIR Publications",

number = "3",

pages = "e22951",

}

TY - JOUR

T1 - Natural language processing and machine learning for identifying incident stroke from electronic health records

T2 - Algorithm development and validation

AU - Zhao, Yiqing

AU - Fu, Sunyang

AU - Bielinski, Suzette J.

AU - Decker, Paul A.

AU - Chamberlain, Alanna M.

AU - Roger, Veronique L.

AU - Liu, Hongfang

AU - Larson, Nicholas B.

PY - 2021/3

Y1 - 2021/3

AB - Background: Stroke is an important clinical outcome in cardiovascular research. However, the ascertainment of incident stroke is typically accomplished via time-consuming manual chart abstraction. Current phenotyping efforts using electronic health records for stroke focus on case ascertainment rather than incident disease, which requires knowledge of the temporal sequence of events. Objective: The aim of this study was to develop a machine learning-based phenotyping algorithm for incident stroke ascertainment based on diagnosis codes, procedure codes, and clinical concepts extracted from clinical notes using natural language processing. Methods: The algorithm was trained and validated using an existing epidemiology cohort consisting of 4914 patients with atrial fibrillation (AF) with manually curated incident stroke events. Various combinations of feature sets and machine learning classifiers were compared. Using a heuristic rule based on the composition of concepts and codes, we further detected the stroke subtype (ischemic stroke/transient ischemic attack or hemorrhagic stroke) of each identified stroke. The algorithm was further validated using a cohort (n=150) stratified sampled from a population in Olmsted County, Minnesota (N=74,314). Results: Among the 4914 patients with AF, 740 had validated incident stroke events. The best-performing stroke phenotyping algorithm used clinical concepts, diagnosis codes, and procedure codes as features in a random forest classifier. Among patients with stroke codes in the general population sample, the best-performing model achieved a positive predictive value of 86% (43/50; 95% CI 0.74-0.93) and a negative predictive value of 96% (96/100). For subtype identification, we achieved an accuracy of 83% in the AF cohort and 80% in the general population sample. Conclusions: We developed and validated a machine learning-based algorithm that performed well for identifying incident stroke and for determining type of stroke. 
The algorithm also performed well on a sample from a general population, further demonstrating its generalizability and potential for adoption by other institutions.

KW - Electronic health records

KW - Machine learning

KW - Natural language processing

KW - Stroke

UR - http://www.scopus.com/inward/record.url?scp=85102688011&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85102688011&partnerID=8YFLogxK

U2 - 10.2196/22951

DO - 10.2196/22951

M3 - Article

C2 - 33683212

AN - SCOPUS:85102688011

SN - 1439-4456

VL - 23

JO - Journal of Medical Internet Research

JF - Journal of Medical Internet Research

IS - 3

M1 - e22951

ER -
