EvalLLM 2025, Workshop on Evaluation of Generative Models and Challenges, co-located with TALN, 30 June 2025, Marseille, France
Evaluating the factual correctness of large language models (LLMs) is vital for many applications. But are our evaluation tools themselves trustworthy? Despite the rise of factuality-based metrics, their sensitivity and reliability remain underexplored. This paper introduces a meta-evaluation framework that systematically tests these metrics using controlled corruptions of gold-standard answers. Our method generates ranked outputs with known degrees of degradation to probe how well metrics capture nuanced changes in truthfulness. Our experiments reveal that pipeline-based methods, such as RAGAS's factual correctness metric, track degradation better than LLM-as-judge approaches. We also propose a new variant of the factual correctness metric that provides a competitive and cost-efficient alternative.
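To make the meta-evaluation idea concrete, the sketch below illustrates one possible way to probe a factuality metric with controlled corruptions of a gold answer and check that its scores fall as degradation increases. It is not the paper's code: `corrupt_answer` and `factual_correctness_score` are hypothetical placeholders standing in for the corruption procedure and the metric under test, and the sensitivity check uses Kendall's tau between corruption level and score.

```python
# Minimal sketch (not the paper's implementation): probe whether a factuality
# metric's scores decrease as a gold answer is progressively corrupted.
# `corrupt_answer` and `factual_correctness_score` are hypothetical stand-ins.
import random
from scipy.stats import kendalltau


def corrupt_answer(gold: str, level: int, rng: random.Random) -> str:
    """Hypothetical corruption: randomly drop `level` words from the gold answer."""
    words = gold.split()
    keep = max(1, len(words) - level)
    idx = sorted(rng.sample(range(len(words)), keep))
    return " ".join(words[i] for i in idx)


def factual_correctness_score(candidate: str, gold: str) -> float:
    """Hypothetical stand-in for a factuality metric (simple token overlap)."""
    gold_tokens, cand_tokens = set(gold.split()), set(candidate.split())
    return len(gold_tokens & cand_tokens) / max(1, len(gold_tokens))


def degradation_sensitivity(gold: str, max_level: int = 5, seed: int = 0) -> float:
    """Kendall's tau between known corruption level and metric score.
    A metric that tracks degradation well should yield a strongly negative tau."""
    rng = random.Random(seed)
    levels = list(range(max_level + 1))
    scores = [
        factual_correctness_score(corrupt_answer(gold, lvl, rng), gold)
        for lvl in levels
    ]
    tau, _ = kendalltau(levels, scores)
    return tau


if __name__ == "__main__":
    gold = "The Eiffel Tower was completed in 1889 and stands in Paris France"
    print(f"Kendall's tau (corruption level vs. score): {degradation_sensitivity(gold):.2f}")
```

In this toy setup, a metric that reliably captures factual degradation would assign monotonically lower scores to more heavily corrupted answers, giving a tau close to -1; a metric insensitive to the corruptions would yield a tau near zero.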
Type:
Poster / Demo
City:
Marseille
Date:
2025-06-30
Department:
Data Science
Eurecom Ref:
8291
Copyright:
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in EvalLLM 2025, Workshop on Evaluation of Generative Models and Challenges, co-located with TALN, 30 June 2025, Marseille, France and is available at: