Sublanguage Characteristics of Clinical Documents

Sungrim Moon; Huan He; Hongfang Liu

doi:10.1109/BIBM55620.2022.9995620

Digital Health Sciences

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Understanding the common or different characteristics of sublanguages in clinical documents through corpus analysis is essential for downstream applications of clinical natural language processing (NLP). Here, we conducted a sublanguage analysis, with respect to clinical sections, of a corpus consisting of 500,000 clinical documents. We analyzed sublanguage characteristics per practice setting or document type for the ten most frequent clinical sections. Named entities (NEs) for problem, test, and treatment concepts were extracted using a fine-tuned bio-clinical Bidirectional Encoder Representations from Transformers (BERT) model. Fast clustering using sentence-BERT was applied, and the clustering results for a case study of terms containing 'pain' were visualized using SandDance. Our results confirmed that document types with a narrow scope (e.g., limited evaluation) presented higher term frequencies in diverse disjoint clusters than document types with a broad scope (e.g., Discharge Summary). Family Medicine and Primary Care practice settings presented similar cluster distributions (i.e., the frequent use of similar words co-occurring with 'pain'), implying a similar sublanguage. In contrast, Emergency Medicine showed a distinct sublanguage, with higher term frequencies in disjoint clusters than other practices. These findings suggest that analyzing term distribution with respect to different combinations of section, practice setting, and document type provides important information when developing or implementing NLP systems.
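
The clustering-and-comparison step described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not give implementation details, so everything below is an assumption for illustration. In particular, `embed` is a bag-of-words placeholder standing in for sentence-BERT embeddings, `fast_cluster` is a simple greedy threshold-based grouping in the spirit of fast clustering, `cluster_distribution` is a hypothetical helper, and the `records` are invented toy data, not results from the paper.

```python
from collections import Counter
from math import sqrt


def embed(term):
    """Placeholder embedding: bag-of-words counts (an assumption, NOT sentence-BERT)."""
    return Counter(term.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def fast_cluster(terms, threshold=0.6, min_size=1):
    """Greedy threshold-based clustering that yields disjoint clusters.

    Each term proposes a group of all terms at least `threshold` similar to it.
    Larger groups are accepted first; terms already assigned are dropped from
    later groups, so no term ends up in two clusters.
    """
    vecs = [embed(t) for t in terms]
    candidates = []
    for vi in vecs:
        members = [j for j, vj in enumerate(vecs) if cosine(vi, vj) >= threshold]
        if len(members) >= min_size:
            candidates.append(members)
    candidates.sort(key=len, reverse=True)

    assigned, clusters = set(), []
    for members in candidates:
        free = [j for j in members if j not in assigned]
        if len(free) >= min_size:
            clusters.append([terms[j] for j in free])
            assigned.update(free)
    return clusters


def cluster_distribution(records, clusters):
    """Count, per practice setting, how many term mentions fall in each cluster."""
    cluster_of = {term: k for k, cluster in enumerate(clusters) for term in cluster}
    dist = {}
    for setting, term in records:
        dist.setdefault(setting, Counter())[cluster_of[term]] += 1
    return dist


# Invented toy mentions of terms containing 'pain' (hypothetical data).
records = [
    ("Family Medicine", "chest pain"),
    ("Family Medicine", "back pain"),
    ("Primary Care", "chest pain"),
    ("Primary Care", "low back pain"),
    ("Emergency Medicine", "abdominal pain"),
    ("Emergency Medicine", "chest pain"),
]

terms = sorted({term for _, term in records})
clusters = fast_cluster(terms)
dist = cluster_distribution(records, clusters)

for setting, counts in dist.items():
    print(setting, dict(counts))
```

Comparing the per-setting counters is one simple way to see whether two practice settings spread their 'pain' terms over the same clusters. With real data, the placeholder `embed` would be swapped for sentence embeddings and the threshold tuned accordingly.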

Original language	English (US)
Title of host publication	Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022
Editors	Donald Adjeroh, Qi Long, Xinghua Shi, Fei Guo, Xiaohua Hu, Srinivas Aluru, Giri Narasimhan, Jianxin Wang, Mingon Kang, Ananda M. Mondal, Jin Liu
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	3280-3286
Number of pages	7
ISBN (Electronic)	9781665468190
DOIs	https://doi.org/10.1109/BIBM55620.2022.9995620
State	Published - 2022
Event	2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022 - Las Vegas, United States; Duration: Dec 6 2022 → Dec 8 2022

Publication series

Name	Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022

Conference

Conference	2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022
Country/Territory	United States
City	Las Vegas
Period	12/6/22 → 12/8/22

Keywords

clinical documents
clinical section
clustering
document type
named entity recognition
natural language processing
practice setting
sublanguage analysis

ASJC Scopus subject areas

Psychiatry and Mental health
Information Systems and Management
Biomedical Engineering
Medicine (miscellaneous)
Cardiology and Cardiovascular Medicine
Health Informatics

Access to Document

10.1109/BIBM55620.2022.9995620

Cite this

Moon, S., He, H., & Liu, H. (2022). Sublanguage Characteristics of Clinical Documents. In D. Adjeroh, Q. Long, X. Shi, F. Guo, X. Hu, S. Aluru, G. Narasimhan, J. Wang, M. Kang, A. M. Mondal, & J. Liu (Eds.), Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022 (pp. 3280-3286). (Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BIBM55620.2022.9995620

Sublanguage Characteristics of Clinical Documents. / Moon, Sungrim; He, Huan; Liu, Hongfang.
Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022. ed. / Donald Adjeroh; Qi Long; Xinghua Shi; Fei Guo; Xiaohua Hu; Srinivas Aluru; Giri Narasimhan; Jianxin Wang; Mingon Kang; Ananda M. Mondal; Jin Liu. Institute of Electrical and Electronics Engineers Inc., 2022. p. 3280-3286 (Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022).

Moon, S, He, H & Liu, H 2022, Sublanguage Characteristics of Clinical Documents. in D Adjeroh, Q Long, X Shi, F Guo, X Hu, S Aluru, G Narasimhan, J Wang, M Kang, AM Mondal & J Liu (eds), Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022. Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022, Institute of Electrical and Electronics Engineers Inc., pp. 3280-3286, 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022, Las Vegas, United States, 12/6/22. https://doi.org/10.1109/BIBM55620.2022.9995620

Moon S, He H, Liu H. Sublanguage Characteristics of Clinical Documents. In Adjeroh D, Long Q, Shi X, Guo F, Hu X, Aluru S, Narasimhan G, Wang J, Kang M, Mondal AM, Liu J, editors, Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022. Institute of Electrical and Electronics Engineers Inc. 2022. p. 3280-3286. (Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022). doi: 10.1109/BIBM55620.2022.9995620

Moon, Sungrim ; He, Huan ; Liu, Hongfang. / Sublanguage Characteristics of Clinical Documents. Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022. editor / Donald Adjeroh ; Qi Long ; Xinghua Shi ; Fei Guo ; Xiaohua Hu ; Srinivas Aluru ; Giri Narasimhan ; Jianxin Wang ; Mingon Kang ; Ananda M. Mondal ; Jin Liu. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 3280-3286 (Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022).

@inproceedings{a0c57ff633f046ac895eee84dce652f7,

title = "Sublanguage Characteristics of Clinical Documents",

keywords = "clinical documents, clinical section, clustering, document type, named entity recognition, natural language processing, practice setting, sublanguage analysis",

author = "Sungrim Moon and Huan He and Hongfang Liu",

note = "Publisher Copyright: {\textcopyright} 2022 IEEE.; 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022 ; Conference date: 06-12-2022 Through 08-12-2022",

year = "2022",

doi = "10.1109/BIBM55620.2022.9995620",

language = "English (US)",

series = "Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "3280--3286",

editor = "Donald Adjeroh and Qi Long and Xinghua Shi and Fei Guo and Xiaohua Hu and Srinivas Aluru and Giri Narasimhan and Jianxin Wang and Mingon Kang and Mondal, {Ananda M.} and Jin Liu",

booktitle = "Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022",

}

TY - GEN

T1 - Sublanguage Characteristics of Clinical Documents

AU - Moon, Sungrim

AU - He, Huan

AU - Liu, Hongfang

PY - 2022

Y1 - 2022

KW - clinical documents

KW - clinical section

KW - clustering

KW - document type

KW - named entity recognition

KW - natural language processing

KW - practice setting

KW - sublanguage analysis

UR - http://www.scopus.com/inward/record.url?scp=85146711579&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85146711579&partnerID=8YFLogxK

U2 - 10.1109/BIBM55620.2022.9995620

DO - 10.1109/BIBM55620.2022.9995620

M3 - Conference contribution

AN - SCOPUS:85146711579

T3 - Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022

SP - 3280

EP - 3286

BT - Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022

A2 - Adjeroh, Donald

A2 - Long, Qi

A2 - Shi, Xinghua

A2 - Guo, Fei

A2 - Hu, Xiaohua

A2 - Aluru, Srinivas

A2 - Narasimhan, Giri

A2 - Wang, Jianxin

A2 - Kang, Mingon

A2 - Mondal, Ananda M.

A2 - Liu, Jin

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022

Y2 - 6 December 2022 through 8 December 2022

ER -
