A large language model–based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records

Chia Chun Chiang; Man Luo; Gina Dumkrieger; Shubham Trivedi; Yi Chieh Chen; Chieh Ju Chao; Todd J. Schwedt; Abeed Sarker; Imon Banerjee

doi:10.1111/head.14702

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: To develop a natural language processing (NLP) algorithm that can accurately extract headache frequency from free-text clinical notes.

Background: Headache frequency, defined as the number of days with any headache in a month (or 4 weeks), remains a key parameter in the evaluation of treatment response to migraine preventive medications. However, because clinicians document it in varied and inconsistent ways, accurately extracting headache frequency from the electronic health record (EHR) with traditional NLP algorithms remains a significant challenge.

Methods: This was a retrospective cross-sectional study with patients identified from two tertiary headache referral centers, Mayo Clinic Arizona and Mayo Clinic Rochester. All neurology consultation notes written by 15 specialized clinicians (11 headache specialists and 4 nurse practitioners) between 2012 and 2022 were extracted, and 1915 notes were used for model fine-tuning (90%) and testing (10%). We employed four different NLP frameworks: (1) a ClinicalBERT (Bidirectional Encoder Representations from Transformers) regression model, (2) a zero-shot Generative Pre-Trained Transformer-2 (GPT-2) Question Answering (QA) model, (3) a GPT-2 QA model fine-tuned on clinical notes with few-shot training, and (4) a GPT-2 generative model fine-tuned on clinical notes with few-shot training to generate the answer by considering the context of the included text.

Results: The mean (standard deviation) headache frequencies of our training and testing datasets were 13.4 (10.9) and 14.4 (11.2), respectively. The GPT-2 generative model was the best-performing model, with an accuracy of 0.92 (95% confidence interval [CI] 0.91, 0.93) and an R² score of 0.89 (95% CI 0.87, 0.90), and all GPT-2–based models outperformed the ClinicalBERT model in terms of exact matching accuracy. Although the ClinicalBERT regression model had the lowest accuracy, 0.27 (0.26, 0.28), it demonstrated a high R² score of 0.88 (0.85, 0.89), suggesting that it can reasonably predict the headache frequency to within ±3 days. Its R² score was also higher than those of the zero-shot GPT-2 QA model and the fine-tuned few-shot GPT-2 QA model.

Conclusion: We developed a robust information extraction model based on a state-of-the-art large language model, a GPT-2 generative model that can extract headache frequency from EHR free-text clinical notes with high accuracy and R² score. It overcame several challenges, arising from the different ways clinicians document headache frequency, that traditional NLP models do not easily handle. We also showed that GPT-2–based frameworks outperformed ClinicalBERT in terms of accuracy in extracting headache frequency from clinical notes. To facilitate research in the field, we released the GPT-2 generative model and inference code on GitHub under an open-source license for community use. Additional fine-tuning of the algorithm might be required when it is applied to different health-care systems for various clinical use cases.
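
The abstract reports two metrics that can diverge sharply: exact matching accuracy and the R² score. The sketch below is an illustration only, not the authors' evaluation code. It uses made-up toy labels and predictions (all values and function names are hypothetical) to show how a regression-style model that is consistently off by about two days can score zero on exact matching while still earning a high R² and staying within a ±3 day tolerance.

```python
from typing import Sequence


def exact_match_accuracy(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Fraction of predictions equal to the label after rounding to whole days."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if round(p) == round(t))
    return hits / len(y_true)


def within_tolerance(y_true: Sequence[float], y_pred: Sequence[float],
                     tol: float = 3.0) -> float:
    """Fraction of predictions within +/- tol days of the label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(p - t) <= tol)
    return hits / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy data: headache days per month for six notes.
labels = [0, 4, 8, 15, 20, 30]
# Assumed regression-style output: every prediction is off by two days.
regression_like = [2, 6, 6, 17, 18, 28]
# Assumed generative-style output: exact on five notes, wrong on one.
generative_like = [0, 4, 8, 15, 20, 28]

for name, preds in [("regression-like", regression_like),
                    ("generative-like", generative_like)]:
    print(name,
          "exact:", exact_match_accuracy(labels, preds),
          "within 3 days:", within_tolerance(labels, preds),
          "R2:", round(r2_score(labels, preds), 3))
```

On this toy data the regression-like predictions never match exactly, yet all fall within three days of the label and R² is about 0.96, which mirrors the pattern the abstract describes for ClinicalBERT.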

Original language	English (US)
Pages (from-to)	400-409
Number of pages	10
Journal	Headache
Volume	64
Issue number	4
DOIs	https://doi.org/10.1111/head.14702
State	Published - Apr 2024

Keywords

artificial intelligence
headache frequency
large language model
migraine
natural language processing

ASJC Scopus subject areas

Neurology
Clinical Neurology

Cite this

Chiang, C. C., Luo, M., Dumkrieger, G., Trivedi, S., Chen, Y. C., Chao, C. J., Schwedt, T. J., Sarker, A., & Banerjee, I. (2024). A large language model–based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records. Headache, 64(4), 400-409. https://doi.org/10.1111/head.14702

@article{aec764f328aa4699a9d7e7a1579a5fa8,

title = "A large language model–based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records",

keywords = "artificial intelligence, headache frequency, large language model, migraine, natural language processing",

author = "Chiang, {Chia Chun} and Man Luo and Gina Dumkrieger and Shubham Trivedi and Chen, {Yi Chieh} and Chao, {Chieh Ju} and Schwedt, {Todd J.} and Abeed Sarker and Imon Banerjee",

note = "Publisher Copyright: {\textcopyright} 2024 American Headache Society.",

year = "2024",

month = apr,

doi = "10.1111/head.14702",

language = "English (US)",

volume = "64",

pages = "400--409",

journal = "Headache",

issn = "0017-8748",

publisher = "Wiley-Blackwell",

number = "4",

}

TY - JOUR

T1 - A large language model–based generative natural language processing framework fine-tuned on clinical notes accurately extracts headache frequency from electronic health records

AU - Chiang, Chia Chun

AU - Luo, Man

AU - Dumkrieger, Gina

AU - Trivedi, Shubham

AU - Chen, Yi Chieh

AU - Chao, Chieh Ju

AU - Schwedt, Todd J.

AU - Sarker, Abeed

AU - Banerjee, Imon

PY - 2024/4

Y1 - 2024/4

KW - artificial intelligence

KW - headache frequency

KW - large language model

KW - migraine

KW - natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85189497347&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85189497347&partnerID=8YFLogxK

U2 - 10.1111/head.14702

DO - 10.1111/head.14702

M3 - Article

C2 - 38525734

AN - SCOPUS:85189497347

SN - 0017-8748

VL - 64

SP - 400

EP - 409

JO - Headache

JF - Headache

IS - 4

ER -
