AlignLLM: alignment-based evaluation using ensemble of LLMs-as-judges for Q&A.

Abeyratne, Ramitha; Wiratunga, Nirmalie; Martin, Kyle; Nkisi-Orji, Ikechukwu; Jayawardena, Lasal

doi:10.1007/978-3-031-96559-3_2

Authors

Ramitha Abeyratne r.abeyratne@rgu.ac.uk
Research Student

Professor Nirmalie Wiratunga n.wiratunga@rgu.ac.uk
Associate Dean for Research

Dr Kyle Martin k.martin3@rgu.ac.uk
Senior Lecturer

Dr Ikechukwu Nkisi-Orji i.nkisi-orji@rgu.ac.uk
Chancellor's Fellow

Lasal Jayawardena

Abstract

Evaluating responses generated by large language models (LLMs) is challenging in the absence of ground-truth knowledge, particularly in specialised domains such as law. Increasingly, LLMs themselves are used to evaluate the responses they generate; however, this approach is prone to bias and inherent errors. To address these issues, we propose an unsupervised ensemble method that employs multiple general-purpose LLMs as a 'collective judge', rather than relying on a single model. Here we introduce a novel application of case alignment as an aggregation mechanism, achieving higher correlation with supervised metrics than unsupervised LLM-as-a-judge baselines. Specifically, we construct two spaces for the ensemble: one for the questions reconstructed by the ensemble from the model's original responses ('problem-space'), and another for the set of answers generated in response to those reconstructed questions ('solution-space'). By applying similarity-based alignment metrics across these two spaces, we gauge how closely our ensemble-based evaluation metric correlates with accuracy-based metrics that rely on ground-truth data. Our results on two legal Q&A datasets show significant correlations using this alignment strategy, suggesting that it can effectively evaluate LLM-generated responses even when ground truth is unavailable.
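As a rough illustration of the idea in the abstract, the sketch below scores how well similarity in a problem-space (reconstructed questions) is mirrored by similarity in a solution-space (answers to those questions). It is not the paper's implementation: the function names `jaccard` and `alignment_score`, the neighbourhood size `k`, the token-overlap similarity, and the neighbour-weighted averaging are all assumptions made for this example. A real system would most likely use embedding-based similarity over the outputs of the LLM ensemble.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a hypothetical stand-in for an
    embedding-based similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def alignment_score(questions: List[str], answers: List[str], k: int = 2) -> float:
    """Average, over all cases, of the solution-space similarity to the k
    nearest problem-space neighbours, weighted by problem-space similarity.

    questions[i] and answers[i] are assumed to come from the same ensemble
    member (one reconstructed question and the answer it produced).
    """
    n = len(questions)
    if n != len(answers):
        raise ValueError("questions and answers must have the same length")
    if n < 2:
        raise ValueError("at least two cases are needed to measure alignment")
    k = min(k, n - 1)

    per_case = []
    for i in range(n):
        # Rank the other cases by similarity to case i in the problem-space.
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: jaccard(questions[i], questions[j]),
            reverse=True,
        )[:k]
        weights = [jaccard(questions[i], questions[j]) for j in neighbours]
        sims = [jaccard(answers[i], answers[j]) for j in neighbours]
        total = sum(weights)
        if total == 0:
            # No problem-space overlap at all: fall back to an unweighted mean.
            per_case.append(sum(sims) / len(sims))
        else:
            per_case.append(sum(w * s for w, s in zip(weights, sims)) / total)
    return sum(per_case) / n
```

Under these assumptions, a score near 1 means that ensemble members who reconstruct similar questions also produce similar answers, while a score near 0 means the two spaces disagree. How such a score relates to ground-truth accuracy is the empirical question the paper investigates.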

Citation

ABEYRATNE, R., WIRATUNGA, N., MARTIN, K., NKISI-ORJI, I. and JAYAWARDENA, L. 2025. AlignLLM: alignment-based evaluation using ensemble of LLMs-as-judges for Q&A. In Bichindaritz, I. and López, B. (eds.) Case-based reasoning research and development: proceedings of the 33rd International conference on case-based reasoning 2025 (ICCBR 2025), 30 June - 3 July 2025, Biarritz, France. Lecture notes in computer science (LNCS), 15662. Cham: Springer [online], pages 21-36. Available from: https://doi.org/10.1007/978-3-031-96559-3_2

Presentation Conference Type	Conference Paper (published)
Conference Name	33rd International conference on case-based reasoning 2025 (ICCBR 2025)
Start Date	Jun 30, 2025
End Date	Jul 3, 2025
Acceptance Date	Mar 14, 2025
Online Publication Date	Jun 23, 2025
Publication Date	Jun 23, 2025
Deposit Date	Mar 17, 2025
Publicly Available Date	Jun 24, 2026
Publisher	Springer
Peer Reviewed	Peer Reviewed
Volume	15662
Pages	21-36
Series Title	Lecture notes in computer science (LNCS)
Series ISSN	0302-9743; 1611-3349
Book Title	Case-based reasoning research and development
ISBN	9783031965586
DOI	https://doi.org/10.1007/978-3-031-96559-3_2
Keywords	LLMs-as-Judges; Case-alignment; Legal Q&A
Public URL	https://rgu-repository.worktribe.com/output/2754840
Related Public URLs	https://rgu-repository.worktribe.com/output/2754880 (Link to code and datasets associated with this output)