TY - JOUR
T1 - Evaluating Large Language Model (LLM) Performance on Established Breast Classification Systems
AU - Haider, Syed Ali
AU - Pressman, Sophia M.
AU - Borna, Sahar
AU - Gomez-Cabello, Cesar A.
AU - Sehgal, Ajai
AU - Leibovich, Bradley C.
AU - Forte, Antonio Jorge
N1 - Publisher Copyright: © 2024 by the authors.
PY - 2024/7
Y1 - 2024/7
N2 - Medical researchers are increasingly using advanced LLMs such as ChatGPT-4 and Gemini to enhance diagnostic processes. This research focuses on their ability to comprehend and apply complex medical classification systems for breast conditions, a capability that could help plastic surgeons make informed diagnostic and treatment decisions and ultimately improve patient outcomes. Fifty clinical scenarios were created to evaluate the classification accuracy of each LLM across five established breast-related classification systems. Responses were scored from 0 to 2 to denote incorrect, partially correct, or completely correct classifications, and descriptive statistics were used to compare the performance of ChatGPT-4 and Gemini. Gemini exhibited superior overall performance, achieving 98% accuracy compared with ChatGPT-4’s 71%. While both models performed well in the Baker classification for capsular contracture and the UTSW classification for gynecomastia, Gemini consistently outperformed ChatGPT-4 in the other systems: the Fischer Grade Classification for gender-affirming mastectomy, the Kajava Classification for ectopic breast tissue, and the Regnault Classification for breast ptosis. With further development, integrating LLMs into plastic surgery practice will likely enhance diagnostic support and decision making.
AB - Medical researchers are increasingly using advanced LLMs such as ChatGPT-4 and Gemini to enhance diagnostic processes. This research focuses on their ability to comprehend and apply complex medical classification systems for breast conditions, a capability that could help plastic surgeons make informed diagnostic and treatment decisions and ultimately improve patient outcomes. Fifty clinical scenarios were created to evaluate the classification accuracy of each LLM across five established breast-related classification systems. Responses were scored from 0 to 2 to denote incorrect, partially correct, or completely correct classifications, and descriptive statistics were used to compare the performance of ChatGPT-4 and Gemini. Gemini exhibited superior overall performance, achieving 98% accuracy compared with ChatGPT-4’s 71%. While both models performed well in the Baker classification for capsular contracture and the UTSW classification for gynecomastia, Gemini consistently outperformed ChatGPT-4 in the other systems: the Fischer Grade Classification for gender-affirming mastectomy, the Kajava Classification for ectopic breast tissue, and the Regnault Classification for breast ptosis. With further development, integrating LLMs into plastic surgery practice will likely enhance diagnostic support and decision making.
KW - artificial intelligence
KW - breast
KW - breast ptosis
KW - capsular contracture
KW - ectopic breast tissue
KW - gender-affirming mastectomy
KW - gynecomastia
KW - large language models
KW - machine learning
KW - plastic surgery
UR - http://www.scopus.com/inward/record.url?scp=85199618550&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85199618550&partnerID=8YFLogxK
U2 - 10.3390/diagnostics14141491
DO - 10.3390/diagnostics14141491
M3 - Article
AN - SCOPUS:85199618550
SN - 2075-4418
VL - 14
JO - Diagnostics
JF - Diagnostics
IS - 14
M1 - 1491
ER -