Sublanguage Characteristics of Clinical Documents

Sungrim Moon, Huan He, Hongfang Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Understanding the shared and distinct characteristics of sublanguages in clinical documents through corpus analysis is essential for downstream applications of clinical natural language processing (NLP). Here, we conducted a sublanguage analysis of clinical sections in a corpus of 500,000 clinical documents. We analyzed sublanguage characteristics per practice setting and per document type for the ten most frequent clinical sections. Named entities (NEs) for problem, test, and treatment concepts were extracted using a fine-tuned bio-clinical Bidirectional Encoder Representations from Transformers (BERT) model. Fast clustering with Sentence-BERT was applied, and the clustering results for a case study of terms containing 'pain' were visualized using SandDance. Our results confirmed that document types with a narrow scope (e.g., limited evaluation) showed higher term frequencies across diverse disjoint clusters than document types with a broad scope (e.g., Discharge Summary). Family Medicine and Primary Care practice settings showed similar cluster distributions (i.e., frequent use of similar words co-occurring with 'pain'), implying a similar sublanguage. In contrast, Emergency Medicine showed a distinct sublanguage, with higher term frequencies in disjoint clusters than other practice settings. These findings suggest that analyzing term distributions across different combinations of section, practice setting, and document type provides important information for developing or implementing NLP systems.

Original language: English (US)
Title of host publication: Proceedings - 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022
Editors: Donald Adjeroh, Qi Long, Xinghua Shi, Fei Guo, Xiaohua Hu, Srinivas Aluru, Giri Narasimhan, Jianxin Wang, Mingon Kang, Ananda M. Mondal, Jin Liu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 7
ISBN (Electronic): 9781665468190
State: Published - 2022
Event: 2022 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2022 - Las Vegas, United States
Duration: Dec 6 2022 to Dec 8 2022



Keywords

  • clinical documents
  • clinical section
  • clustering
  • document type
  • named entity recognition
  • natural language processing
  • practice setting
  • sublanguage analysis

ASJC Scopus subject areas

  • Psychiatry and Mental health
  • Information Systems and Management
  • Biomedical Engineering
  • Medicine (miscellaneous)
  • Cardiology and Cardiovascular Medicine
  • Health Informatics

