TY - GEN
T1 - Gaps or Hallucinations? Scrutinizing Machine-Generated Legal Analysis for Fine-grained Text Evaluations
AU - Hou, Abe Bohan
AU - Jurayj, William
AU - Holzenberger, Nils
AU - Blair-Stanek, Andrew
AU - Van Durme, Benjamin
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways that are difficult for non-professionals and existing text evaluation metrics to recognize. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps - as opposed to hallucinations in a strictly erroneous sense - to refer to the differences between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves a 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find that around 80% contain hallucinations of different kinds.
AB - Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways that are difficult for non-professionals and existing text evaluation metrics to recognize. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps - as opposed to hallucinations in a strictly erroneous sense - to refer to the differences between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves a 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find that around 80% contain hallucinations of different kinds.
UR - https://www.scopus.com/pages/publications/85213123067
M3 - Conference contribution
AN - SCOPUS:85213123067
T3 - NLLP 2024 - Natural Legal Language Processing Workshop 2024, Proceedings of the Workshop
SP - 280
EP - 302
BT - NLLP 2024 - Natural Legal Language Processing Workshop 2024, Proceedings of the Workshop
A2 - Aletras, Nikolaos
A2 - Chalkidis, Ilias
A2 - Barrett, Leslie
A2 - Goanta, Catalina
A2 - Preotiuc-Pietro, Daniel
A2 - Spanakis, Gerasimos
PB - Association for Computational Linguistics (ACL)
T2 - 6th Natural Legal Language Processing Workshop 2024, NLLP 2024, co-located with the 2024 Conference on Empirical Methods in Natural Language Processing
Y2 - 16 November 2024
ER -