TY - JOUR
T1 - Use of artificial intelligence to support the assessment of the methodological quality of systematic reviews
AU - Marques-Cruz, Manuel
AU - Pinto, Filipe
AU - Vieira, Rafael José
AU - Bognanni, Antonio
AU - Perestrelo, Paula
AU - Gil-Mata, Sara
AU - Duarte, Vítor Henrique
AU - Barbosa, José Pedro
AU - Cardoso-Fernandes, António
AU - Martinho-Dias, Daniel
AU - Franco-Pego, Francisco
AU - Germini, Federico
AU - Arienti, Chiara
AU - Chu, Alexandro W.L.
AU - Riera-Serra, Pau
AU - Jemioło, Paweł
AU - Rodrigues, Pedro Pereira
AU - Fonseca, João A.
AU - Azevedo, Luís Filipe
AU - Schünemann, Holger J.
AU - Cruz-Correia, Ricardo
AU - Jankin, Slava
AU - Sousa-Pinto, Bernardo
N1 - Publisher Copyright: © 2025 The Author(s)
PY - 2025/11
Y1 - 2025/11
AB - Objectives: Published systematic reviews vary in methodological quality, which can impact decision-making. Large language models (LLMs) can support the assessment of the methodological quality of systematic reviews and make it more efficient, aiding the incorporation of their evidence into guideline recommendations. We aimed to develop an LLM-based tool for supporting the assessment of the methodological quality of systematic reviews. Methods: We assessed the performance of eight LLMs in evaluating the methodological quality of systematic reviews. In particular, we provided 100 systematic reviews to eight LLMs (five base models and three fine-tuned models), which evaluated their methodological quality using a 27-item validated tool (Reported Methodological Quality (ReMarQ)). The fine-tuned models had been trained on a different sample of 300 manually assessed systematic reviews. We compared the answers provided by the LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient and F1-score for this comparison. Results: The best-performing LLM was a fine-tuned GPT-3.5 model (mean accuracy = 96.5% [95% CI = 89.9%–100%]; mean kappa coefficient = 0.90 [95% CI = 0.71–1.00]; mean F1-score = 0.91 [95% CI = 0.83–1.00]). This model displayed an accuracy >80% and a kappa coefficient >0.60 for all individual items. When this LLM assessed the same set of systematic reviews 60 times, answers to 18 of 27 items were always consistent (i.e., always the same), and only 11% of assessed systematic reviews showed inconsistency. Conclusion: Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.
KW - Artificial intelligence
KW - Automated evaluation
KW - Evidence appraisal
KW - Large language models
KW - Meta-research
KW - Methodological quality
KW - Systematic reviews
UR - https://www.scopus.com/pages/publications/105016834130
U2 - 10.1016/j.jclinepi.2025.111944
DO - 10.1016/j.jclinepi.2025.111944
M3 - Article
C2 - 40865587
AN - SCOPUS:105016834130
SN - 0895-4356
VL - 187
JO - Journal of Clinical Epidemiology
JF - Journal of Clinical Epidemiology
M1 - 111944
ER -