TY - GEN
T1 - Can We Employ LLM to Meta-Evaluate LLM-Based Evaluators? A Preliminary Study
AU - Wang, Huilin
AU - Yu, Lei
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
PY - 2025
Y1 - 2025
N2 - Large language models (LLMs) are frequently employed to evaluate content generated by LLMs. A number of recent works focus on the meta-evaluation of LLM-based evaluation, aiming to understand the efficacy of LLMs as evaluators. Conventional meta-evaluation techniques rely heavily on existing benchmarks yet overlook the explanations or analyses generated by evaluators, rendering the meta-evaluation incomplete. Incorporating these critical insights requires human annotation, which is costly, inefficient, and does not scale. These issues highlight the need for a complete, scalable, and cost-effective meta-evaluation method. While LLMs have demonstrated exceptional capabilities across a variety of tasks, their potential for automating meta-evaluation remains relatively underexplored. To fill this gap, we investigate the potential of LLMs to conduct meta-evaluation. To this end, we propose several novel meta-evaluation frameworks using LLMs within two distinct paradigms: pairwise comparison (JDEval and MDEval) and individual scoring (JDEval-i and BSMEval). Our experiments reveal that the pairwise comparison paradigm is better suited for meta-evaluation than the individual scoring approach. Both JDEval and MDEval exhibit strong performance in meta-evaluation tasks, showing a high level of agreement with human annotations. Specifically, MDEval achieves a consistency rate of 81.7% with manual annotations.
AB - Large language models (LLMs) are frequently employed to evaluate content generated by LLMs. A number of recent works focus on the meta-evaluation of LLM-based evaluation, aiming to understand the efficacy of LLMs as evaluators. Conventional meta-evaluation techniques rely heavily on existing benchmarks yet overlook the explanations or analyses generated by evaluators, rendering the meta-evaluation incomplete. Incorporating these critical insights requires human annotation, which is costly, inefficient, and does not scale. These issues highlight the need for a complete, scalable, and cost-effective meta-evaluation method. While LLMs have demonstrated exceptional capabilities across a variety of tasks, their potential for automating meta-evaluation remains relatively underexplored. To fill this gap, we investigate the potential of LLMs to conduct meta-evaluation. To this end, we propose several novel meta-evaluation frameworks using LLMs within two distinct paradigms: pairwise comparison (JDEval and MDEval) and individual scoring (JDEval-i and BSMEval). Our experiments reveal that the pairwise comparison paradigm is better suited for meta-evaluation than the individual scoring approach. Both JDEval and MDEval exhibit strong performance in meta-evaluation tasks, showing a high level of agreement with human annotations. Specifically, MDEval achieves a consistency rate of 81.7% with manual annotations.
KW - Evaluators
KW - LLMs
KW - Meta-evaluate
UR - https://www.scopus.com/pages/publications/105027771880
U2 - 10.1007/978-981-95-0014-7_14
DO - 10.1007/978-981-95-0014-7_14
M3 - Conference contribution
AN - SCOPUS:105027771880
SN - 9789819500130
T3 - Lecture Notes in Computer Science
SP - 161
EP - 172
BT - Advanced Intelligent Computing Technology and Applications - 21st International Conference, ICIC 2025, Proceedings
A2 - Huang, De-Shuang
A2 - Zhang, Chuanlei
A2 - Zhang, Qinhu
A2 - Pan, Yijie
PB - Springer Science and Business Media Deutschland GmbH
T2 - 21st International Conference on Intelligent Computing, ICIC 2025
Y2 - 26 July 2025 through 29 July 2025
ER -