Can We Employ LLM to Meta-Evaluate LLM-Based Evaluators? A Preliminary Study

  • Huilin Wang
  • Lei Yu*
  • *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Large language models (LLMs) are frequently employed to evaluate content generated by other LLMs. A number of recent works focus on the meta-evaluation of LLM-based evaluation, aiming to understand the efficacy of LLMs as evaluators. Conventional meta-evaluation techniques depend heavily on existing benchmarks, yet they overlook the explanations and analyses generated by evaluators, leaving the meta-evaluation incomplete. Incorporating these critical insights requires human annotation, which is costly, inefficient, and difficult to scale. These issues highlight the need for a complete, scalable, and cost-effective meta-evaluation method. While LLMs have demonstrated exceptional capabilities across a variety of tasks, their potential for automating meta-evaluation is relatively underexplored. To fill this gap, we investigate the potential of LLMs to conduct meta-evaluation. To this end, we propose several innovative meta-evaluation frameworks using LLMs within two distinct paradigms: pairwise comparison (JDEval and MDEval) and individual scoring (JDEval-i and BSMEval). Our experiments reveal that the pairwise comparison paradigm is better suited for meta-evaluation than the individual scoring approach. Both JDEval and MDEval exhibit strong performance in meta-evaluation tasks, showing a high level of agreement with human annotations. Specifically, MDEval achieves a consistency rate of 81.7% with manual annotations.
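To illustrate how agreement with human annotations can be quantified in the pairwise comparison paradigm, the sketch below computes a simple consistency rate between an LLM meta-evaluator's verdicts and manual annotations. This is a minimal, hypothetical example: the verdict labels, the data, and the consistency_rate function are placeholders and not the paper's actual implementation.

```python
from typing import Sequence

# Hypothetical labels for a pairwise meta-evaluation verdict:
# which of two candidate evaluations (A or B) is better, or a tie.
VALID_LABELS = {"A", "B", "tie"}

def consistency_rate(llm_verdicts: Sequence[str],
                     human_verdicts: Sequence[str]) -> float:
    """Fraction of items where the LLM meta-evaluator's pairwise verdict
    matches the human annotation (a plain agreement rate)."""
    if len(llm_verdicts) != len(human_verdicts) or not human_verdicts:
        raise ValueError("verdict lists must be non-empty and equal in length")
    matches = sum(
        1 for llm, human in zip(llm_verdicts, human_verdicts)
        if llm in VALID_LABELS and llm == human
    )
    return matches / len(human_verdicts)

# Toy usage: 4 of 5 verdicts agree, giving a consistency rate of 0.8.
print(consistency_rate(["A", "B", "tie", "A", "B"],
                       ["A", "B", "tie", "B", "B"]))
```

Read this way, the 81.7% figure reported for MDEval would correspond to such an agreement rate with manual annotations, though the paper's exact protocol may differ.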

Original language: English
Title of host publication: Advanced Intelligent Computing Technology and Applications - 21st International Conference, ICIC 2025, Proceedings
Editors: De-Shuang Huang, Chuanlei Zhang, Qinhu Zhang, Yijie Pan
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 161-172
Number of pages: 12
ISBN (Print): 9789819500130
DOIs
State: Published - 2025
Event: 21st International Conference on Intelligent Computing, ICIC 2025 - Ningbo, China
Duration: 26 Jul 2025 – 29 Jul 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 15864 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 21st International Conference on Intelligent Computing, ICIC 2025
Country/Territory: China
City: Ningbo
Period: 26/07/25 – 29/07/25

Keywords

  • Evaluators
  • LLMs
  • Meta-evaluate
