
Use of artificial intelligence to support the assessment of the methodological quality of systematic reviews

  • Manuel Marques-Cruz
  • Filipe Pinto
  • Rafael José Vieira
  • Antonio Bognanni
  • Paula Perestrelo
  • Sara Gil-Mata
  • Vítor Henrique Duarte
  • José Pedro Barbosa
  • António Cardoso-Fernandes
  • Daniel Martinho-Dias
  • Francisco Franco-Pego
  • Federico Germini
  • Chiara Arienti
  • Alexandro W.L. Chu
  • Pau Riera-Serra
  • Paweł Jemioło
  • Pedro Pereira Rodrigues
  • João A. Fonseca
  • Luís Filipe Azevedo
  • Holger J. Schünemann
  • Ricardo Cruz-Correia
  • Slava Jankin
  • Bernardo Sousa-Pinto
  • University of Porto
  • Local Health Unit Trás-os-Montes e Alto Douro
  • Knok Healthcare
  • McMaster University
  • IRCCS Istituto Clinico Humanitas - Rozzano (Milano)
  • Local Health Unit of Alto Minho
  • Almada-Seixal Local Health Unit
  • Hospital Universitario Son Espases
  • AGH University of Krakow
  • Jagiellonian University Medical College
  • University of Birmingham

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Objectives: Published systematic reviews vary in methodological quality, which can affect decision-making. Large language models (LLMs) can support the assessment of the methodological quality of systematic reviews and make it more efficient, aiding the incorporation of their evidence into guideline recommendations. We aimed to develop an LLM-based tool to support the assessment of the methodological quality of systematic reviews.

Methods: We assessed the performance of eight LLMs (five base models and three fine-tuned models) in evaluating the methodological quality of systematic reviews. In particular, we provided 100 systematic reviews to these LLMs, which evaluated their methodological quality based on a 27-item validated tool, Reported Methodological Quality (ReMarQ). The fine-tuned models had been trained on a different sample of 300 manually assessed systematic reviews. We compared the answers provided by the LLMs with those independently provided by human reviewers, computing the accuracy, kappa coefficient, and F1-score for this comparison.

Results: The best-performing LLM was a fine-tuned GPT-3.5 model (mean accuracy = 96.5% [95% CI = 89.9%–100%]; mean kappa coefficient = 0.90 [95% CI = 0.71–1.00]; mean F1-score = 0.91 [95% CI = 0.83–1.00]). This model displayed an accuracy >80% and a kappa coefficient >0.60 for all individual items. When this LLM assessed the same set of systematic reviews 60 times, answers to 18 of the 27 items were always consistent (ie, always the same), and only 11% of the assessed systematic reviews showed inconsistency.

Conclusion: Overall, LLMs have the potential to accurately support the assessment of the methodological quality of systematic reviews based on a validated tool comprising dichotomous items.
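To make the comparison described in the Methods concrete, the following is a minimal sketch of how accuracy, a kappa coefficient, and an F1-score could be computed for a single dichotomous item, together with a check of whether repeated runs gave the same answer. It is not the authors' code. The function names, the 1 = "yes" / 0 = "no" coding, the choice of Cohen's kappa, the choice of "yes" as the positive class for F1, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def agreement_metrics(human: Sequence[int], model: Sequence[int]) -> Dict[str, float]:
    """Accuracy, Cohen's kappa and F1 for one dichotomous item.

    Assumed coding (hypothetical): 1 = "yes", 0 = "no"; human answers are
    treated as the reference standard.
    """
    if len(human) != len(model) or len(human) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(human)
    pairs = list(zip(human, model))
    tp = sum(1 for h, m in pairs if h == 1 and m == 1)
    tn = sum(1 for h, m in pairs if h == 0 and m == 0)
    fp = sum(1 for h, m in pairs if h == 0 and m == 1)
    fn = sum(1 for h, m in pairs if h == 1 and m == 0)

    accuracy = (tp + tn) / n

    # Agreement expected by chance, from each rater's marginal proportions.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)
    p_no = ((tn + fp) / n) * ((tn + fn) / n)
    p_e = p_yes + p_no
    # Kappa is undefined when chance agreement is 1 (both raters gave one
    # identical answer throughout); report NaN in that case.
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else float("nan")

    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    return {"accuracy": accuracy, "kappa": kappa, "f1": f1}


def is_consistent(repeated_answers: Sequence[int]) -> bool:
    """True if every repeated run produced the same answer for one item."""
    return len(set(repeated_answers)) <= 1


# Toy data for one item across ten reviews (invented for illustration).
human_answers = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
model_answers = [1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
metrics = agreement_metrics(human_answers, model_answers)
print(metrics)
print(is_consistent([1] * 60), is_consistent([1] * 59 + [0]))
```

Per-item values computed this way could then be averaged across the 27 items to obtain mean figures of the kind reported in the Results; the abstract does not specify how the means and confidence intervals were derived, so that step is left out here.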

Original language: English
Article number: 111944
Journal: Journal of Clinical Epidemiology
Volume: 187
State: Published - Nov 2025

Keywords

  • Artificial intelligence
  • Automated evaluation
  • Evidence appraisal
  • Large language models
  • Meta-research
  • Methodological quality
  • Systematic reviews
