Integrating large language models in systematic reviews: a framework and case study using ROBINS-I for risk of bias assessment

Bashar Hasan; Samer Saadi; Noora S. Rajjoub; Moustafa Hegazi; Mohammad Al-Kordi; Farah Fleti; Magdoleen Farah; Irbaz B. Riaz; Imon Banerjee; Zhen Wang; Mohammad H Murad

doi:10.1136/bmjebm-2023-112597

Research output: Contribution to journal › Article › peer-review

Abstract

Large language models (LLMs) may facilitate and expedite systematic reviews, although the approach to integrate LLMs in the review process is unclear. This study evaluates GPT-4 agreement with human reviewers in assessing the risk of bias using the Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool and proposes a framework for integrating LLMs into systematic reviews. The case study demonstrated that raw per cent agreement was the highest for the ROBINS-I domain of 'Classification of Intervention'. Kendall agreement coefficient was highest for the domains of 'Participant Selection', 'Missing Data' and 'Measurement of Outcomes', suggesting moderate agreement in these domains. Raw agreement about the overall risk of bias across domains was 61% (Kendall coefficient=0.35). The proposed framework for integrating LLMs into systematic reviews consists of four domains: rationale for LLM use, protocol (task definition, model selection, prompt engineering, data entry methods, human role and success metrics), execution (iterative revisions to the protocol) and reporting. We identify five basic task types relevant to systematic reviews: selection, extraction, judgement, analysis and narration. Considering the agreement level with a human reviewer in the case study, pairing artificial intelligence with an independent human reviewer remains required.
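
As an illustration of the raw per cent agreement metric mentioned in the abstract, the following is a minimal sketch and not the authors' analysis code. The function name `raw_agreement` and the judgement lists are hypothetical, invented for this example; the study's own data are not reproduced here. The Kendall coefficient is not computed, since the abstract does not specify which variant was used.

```python
def raw_agreement(human, model):
    """Return the fraction of studies where both raters gave the same judgement."""
    if len(human) != len(model):
        raise ValueError("both raters must judge the same set of studies")
    if not human:
        raise ValueError("at least one study is required")
    matches = sum(h == m for h, m in zip(human, model))
    return matches / len(human)


# Hypothetical ROBINS-I judgements for one domain across five studies
# (illustrative values only, not taken from the case study).
human = ["low", "moderate", "serious", "moderate", "low"]
model = ["low", "serious", "serious", "moderate", "moderate"]

print(f"Raw agreement: {raw_agreement(human, model):.0%}")
```

Raw agreement ignores how far apart two differing judgements are, which is presumably why the abstract reports it alongside a rank-based coefficient.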

Original language	English (US)
Journal	BMJ evidence-based medicine
DOIs	https://doi.org/10.1136/bmjebm-2023-112597
State	Accepted/In press - 2024

Keywords

Evidence-Based Practice
Methods
Systematic Reviews as Topic

ASJC Scopus subject areas

General Medicine

Access to Document

10.1136/bmjebm-2023-112597

Cite this

@article{84798f7ad6904483817cb955bf783681,

title = "Integrating large language models in systematic reviews: a framework and case study using ROBINS-I for risk of bias assessment",

abstract = "Large language models (LLMs) may facilitate and expedite systematic reviews, although the approach to integrate LLMs in the review process is unclear. This study evaluates GPT-4 agreement with human reviewers in assessing the risk of bias using the Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool and proposes a framework for integrating LLMs into systematic reviews. The case study demonstrated that raw per cent agreement was the highest for the ROBINS-I domain of 'Classification of Intervention'. Kendall agreement coefficient was highest for the domains of 'Participant Selection', 'Missing Data' and 'Measurement of Outcomes', suggesting moderate agreement in these domains. Raw agreement about the overall risk of bias across domains was 61% (Kendall coefficient=0.35). The proposed framework for integrating LLMs into systematic reviews consists of four domains: rationale for LLM use, protocol (task definition, model selection, prompt engineering, data entry methods, human role and success metrics), execution (iterative revisions to the protocol) and reporting. We identify five basic task types relevant to systematic reviews: selection, extraction, judgement, analysis and narration. Considering the agreement level with a human reviewer in the case study, pairing artificial intelligence with an independent human reviewer remains required.",

keywords = "Evidence-Based Practice, Methods, Systematic Reviews as Topic",

author = "Bashar Hasan and Samer Saadi and Rajjoub, {Noora S.} and Moustafa Hegazi and Mohammad Al-Kordi and Farah Fleti and Magdoleen Farah and Riaz, {Irbaz B.} and Imon Banerjee and Zhen Wang and Murad, {Mohammad H}",

note = "Publisher Copyright: {\textcopyright} Author(s) (or their employer(s)) 2024. No commercial re-use. See rights and permissions. Published by BMJ.",

year = "2024",

doi = "10.1136/bmjebm-2023-112597",

language = "English (US)",

journal = "BMJ evidence-based medicine",

issn = "2515-446X",

publisher = "BMJ Publishing Group",

}

TY - JOUR

T1 - Integrating large language models in systematic reviews

T2 - a framework and case study using ROBINS-I for risk of bias assessment

AU - Hasan, Bashar

AU - Saadi, Samer

AU - Rajjoub, Noora S.

AU - Hegazi, Moustafa

AU - Al-Kordi, Mohammad

AU - Fleti, Farah

AU - Farah, Magdoleen

AU - Riaz, Irbaz B.

AU - Banerjee, Imon

AU - Wang, Zhen

AU - Murad, Mohammad H

PY - 2024

Y1 - 2024

N2 - Large language models (LLMs) may facilitate and expedite systematic reviews, although the approach to integrate LLMs in the review process is unclear. This study evaluates GPT-4 agreement with human reviewers in assessing the risk of bias using the Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool and proposes a framework for integrating LLMs into systematic reviews. The case study demonstrated that raw per cent agreement was the highest for the ROBINS-I domain of 'Classification of Intervention'. Kendall agreement coefficient was highest for the domains of 'Participant Selection', 'Missing Data' and 'Measurement of Outcomes', suggesting moderate agreement in these domains. Raw agreement about the overall risk of bias across domains was 61% (Kendall coefficient=0.35). The proposed framework for integrating LLMs into systematic reviews consists of four domains: rationale for LLM use, protocol (task definition, model selection, prompt engineering, data entry methods, human role and success metrics), execution (iterative revisions to the protocol) and reporting. We identify five basic task types relevant to systematic reviews: selection, extraction, judgement, analysis and narration. Considering the agreement level with a human reviewer in the case study, pairing artificial intelligence with an independent human reviewer remains required.

AB - Large language models (LLMs) may facilitate and expedite systematic reviews, although the approach to integrate LLMs in the review process is unclear. This study evaluates GPT-4 agreement with human reviewers in assessing the risk of bias using the Risk Of Bias In Non-randomised Studies of Interventions (ROBINS-I) tool and proposes a framework for integrating LLMs into systematic reviews. The case study demonstrated that raw per cent agreement was the highest for the ROBINS-I domain of 'Classification of Intervention'. Kendall agreement coefficient was highest for the domains of 'Participant Selection', 'Missing Data' and 'Measurement of Outcomes', suggesting moderate agreement in these domains. Raw agreement about the overall risk of bias across domains was 61% (Kendall coefficient=0.35). The proposed framework for integrating LLMs into systematic reviews consists of four domains: rationale for LLM use, protocol (task definition, model selection, prompt engineering, data entry methods, human role and success metrics), execution (iterative revisions to the protocol) and reporting. We identify five basic task types relevant to systematic reviews: selection, extraction, judgement, analysis and narration. Considering the agreement level with a human reviewer in the case study, pairing artificial intelligence with an independent human reviewer remains required.

KW - Evidence-Based Practice

KW - Methods

KW - Systematic Reviews as Topic

UR - http://www.scopus.com/inward/record.url?scp=85185924195&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85185924195&partnerID=8YFLogxK

U2 - 10.1136/bmjebm-2023-112597

DO - 10.1136/bmjebm-2023-112597

M3 - Article

C2 - 38383136

AN - SCOPUS:85185924195

SN - 2515-446X

JO - BMJ evidence-based medicine

JF - BMJ evidence-based medicine

ER -
