
CLERC: A Dataset for U. S. Legal Case Retrieval and Retrieval-Augmented Analysis Generation

  • Abe Bohan Hou
  • Orion Weller
  • Guanghui Qin
  • Eugene Yang
  • Dawn Lawrie
  • Nils Holzenberger
  • Andrew Blair-Stanek
  • Benjamin Van Durme

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to create a colossal dataset supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset, CLERC (Case Law Evaluation and Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models only achieve 48.3% recall@1000.
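
For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of how recall@k is commonly computed for a single query. The abstract does not specify the exact evaluation procedure used for CLERC, so this shows only the standard definition; the function name `recall_at_k` and the document identifiers are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Return the fraction of relevant documents found in the top-k results.

    ranked_ids: document ids ordered from most to least relevant by a retriever.
    relevant_ids: ids of the documents that are actually relevant (e.g. cited cases).
    k: cutoff rank (the abstract reports k = 1000).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        raise ValueError("relevant_ids must be non-empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


if __name__ == "__main__":
    # Hypothetical ranking for one query; ids are illustrative only.
    ranking = ["case_a", "case_b", "case_c", "case_d"]
    cited = {"case_a", "case_d", "case_z"}
    print(recall_at_k(ranking, cited, k=2))
    print(recall_at_k(ranking, cited, k=4))
```

Under this definition, a corpus-level figure such as the reported 48.3% would typically be the mean of the per-query values, though the averaging scheme is an assumption here.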

Original language: English
Title of host publication: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
Subtitle of host publication: Proceedings of the Conference Findings, NAACL 2025
Editors: Luis Chiruzzo, Alan Ritter, Lu Wang
Publisher: Association for Computational Linguistics (ACL)
Pages: 7913-7928
Number of pages: 16
ISBN (Electronic): 9798891761957
DOIs
Publication status: Published - 1 Jan 2025
Event: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025 - Albuquerque, United States
Duration: 29 Apr 2025 → 4 May 2025

Publication series

Name: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Proceedings of the Conference Findings, NAACL 2025

Conference

Conference: 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, NAACL 2025
Country/Territory: United States
City: Albuquerque
Period: 29/04/25 → 4/05/25
