EvalLLM 2025, Workshop on Evaluation of Generative Models and Challenges, co-located with TALN, 30 June 2025, Marseille, France
Evaluating the factual correctness of large language models (LLMs) is vital for many applications. But are our evaluation tools themselves trustworthy? Despite the rise of factuality-based metrics, their sensitivity and reliability remain underexplored. This paper introduces a meta-evaluation framework that systematically tests these metrics using controlled corruptions of gold-standard answers. Our method generates ranked outputs with known degrees of degradation to probe how well metrics capture nuanced changes in truthfulness. Our experiments reveal that pipeline-based methods, such as RAGAS's factual correctness metric, track degradation better than LLM-as-judge approaches. We also propose a new variant of the factual correctness metric that provides a competitive and cost-efficient alternative.
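To make the meta-evaluation idea concrete, the sketch below illustrates one possible way to probe a factuality metric with controlled corruptions of a gold answer and check that its scores fall as degradation increases. It is not the paper's code: `corrupt_answer` and `factual_correctness_score` are hypothetical placeholders standing in for the corruption procedure and the metric under test, and the sensitivity check uses Kendall's tau between corruption level and score.

```python
# Minimal sketch (not the paper's implementation): probe whether a factuality
# metric's scores decrease as a gold answer is progressively corrupted.
# `corrupt_answer` and `factual_correctness_score` are hypothetical stand-ins.
import random
from scipy.stats import kendalltau


def corrupt_answer(gold: str, level: int, rng: random.Random) -> str:
    """Hypothetical corruption: randomly drop `level` words from the gold answer."""
    words = gold.split()
    keep = max(1, len(words) - level)
    idx = sorted(rng.sample(range(len(words)), keep))
    return " ".join(words[i] for i in idx)


def factual_correctness_score(candidate: str, gold: str) -> float:
    """Hypothetical stand-in for a factuality metric (simple token overlap)."""
    gold_tokens, cand_tokens = set(gold.split()), set(candidate.split())
    return len(gold_tokens & cand_tokens) / max(1, len(gold_tokens))


def degradation_sensitivity(gold: str, max_level: int = 5, seed: int = 0) -> float:
    """Kendall's tau between known corruption level and metric score.
    A metric that tracks degradation well should yield a strongly negative tau."""
    rng = random.Random(seed)
    levels = list(range(max_level + 1))
    scores = [
        factual_correctness_score(corrupt_answer(gold, lvl, rng), gold)
        for lvl in levels
    ]
    tau, _ = kendalltau(levels, scores)
    return tau


if __name__ == "__main__":
    gold = "The Eiffel Tower was completed in 1889 and stands in Paris France"
    print(f"Kendall's tau (corruption level vs. score): {degradation_sensitivity(gold):.2f}")
```

In this toy setup, a metric that reliably captures factual degradation would assign monotonically lower scores to more heavily corrupted answers, giving a tau close to -1; a metric insensitive to the corruptions would yield a tau near zero.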
Type:
Poster / Demo
City:
Marseille
Date:
2025-06-30
Department:
Data Science
Eurecom Ref:
8291
Copyright:
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in EvalLLM 2025, Workshop on Evaluation of Generative Models and Challenges, co-located with TALN, 30 June 2025, Marseille, France and is available at: