TY - JOUR
T1 - Evaluating large language models on a highly-specialized topic, radiation oncology physics
AU - Holmes, Jason
AU - Liu, Zhengliang
AU - Zhang, Lian
AU - Ding, Yuzhen
AU - Sio, Terence T.
AU - McGee, Lisa A.
AU - Ashman, Jonathan B.
AU - Li, Xiang
AU - Liu, Tianming
AU - Shen, Jiajian
AU - Liu, Wei
N1 - Publisher Copyright:
Copyright © 2023 Holmes, Liu, Zhang, Ding, Sio, McGee, Ashman, Li, Liu, Shen and Liu.
PY - 2023
Y1 - 2023
N2 - Purpose: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs. Methods: We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by being asked to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with “None of the above choices is the correct answer.”). A majority vote analysis was used to approximate how well each group could score when working together. Results: ChatGPT GPT-4 outperformed all other LLMs and medical physicists, on average, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In evaluating deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring based on a majority vote across trials. In contrast, a team of medical physicists were able to greatly outperform ChatGPT (GPT-4) using a majority vote. Conclusion: This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
AB - Purpose: We present the first study to investigate Large Language Models (LLMs) in answering radiation oncology physics questions. Because popular exams like AP Physics, LSAT, and GRE have large test-taker populations and ample test preparation resources in circulation, they may not allow for accurately assessing the true potential of LLMs. This paper proposes evaluating LLMs on a highly-specialized topic, radiation oncology physics, which may be more pertinent to scientific and medical communities in addition to being a valuable benchmark of LLMs. Methods: We developed an exam consisting of 100 radiation oncology physics questions based on our expertise. Four LLMs, ChatGPT (GPT-3.5), ChatGPT (GPT-4), Bard (LaMDA), and BLOOMZ, were evaluated against medical physicists and non-experts. The performance of ChatGPT (GPT-4) was further explored by being asked to explain first, then answer. The deductive reasoning capability of ChatGPT (GPT-4) was evaluated using a novel approach (substituting the correct answer with “None of the above choices is the correct answer.”). A majority vote analysis was used to approximate how well each group could score when working together. Results: ChatGPT GPT-4 outperformed all other LLMs and medical physicists, on average, with improved accuracy when prompted to explain before answering. ChatGPT (GPT-3.5 and GPT-4) showed a high level of consistency in its answer choices across a number of trials, whether correct or incorrect, a characteristic that was not observed in the human test groups or Bard (LaMDA). In evaluating deductive reasoning ability, ChatGPT (GPT-4) demonstrated surprising accuracy, suggesting the potential presence of an emergent ability. Finally, although ChatGPT (GPT-4) performed well overall, its intrinsic properties did not allow for further improvement when scoring based on a majority vote across trials. In contrast, a team of medical physicists were able to greatly outperform ChatGPT (GPT-4) using a majority vote. Conclusion: This study suggests a great potential for LLMs to work alongside radiation oncology experts as highly knowledgeable assistants.
KW - ChatGPT
KW - artificial intelligence
KW - large language model
KW - medical physics
KW - natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85166245782&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85166245782&partnerID=8YFLogxK
U2 - 10.3389/fonc.2023.1219326
DO - 10.3389/fonc.2023.1219326
M3 - Article
AN - SCOPUS:85166245782
SN - 2234-943X
VL - 13
JO - Frontiers in Oncology
JF - Frontiers in Oncology
M1 - 1219326
ER -