Ref-Long: Benchmarking the Long-context Referencing Capability
of Long-context Language Models
Hong Kong University of Science and Technology · Carnegie Mellon University · Yale University

Summary of Our Research

Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing—a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data—remains underexplored. To bridge this gap, this paper proposes Referencing Evaluation for Long-context Language Models (Ref-Long), a novel benchmark designed to assess the long-context referencing capability of LCLMs. Specifically, Ref-Long requires LCLMs to identify the indexes of documents that reference a specific key, emphasizing contextual relationships between the key and the documents over simple retrieval. Based on this task design, we construct three subsets ranging from synthetic to realistic scenarios to form the Ref-Long benchmark. Experimental results on 13 LCLMs reveal significant shortcomings in long-context referencing, even among advanced models such as GPT-4o. To further investigate these challenges, we conduct comprehensive analyses, including human evaluations, task format adjustments, fine-tuning experiments, and error analyses, leading to several key insights.

Existing long-context evaluation benchmarks are costly, biased, or insufficiently challenging.

  1. Many general long-context benchmarks are created by artificially inserting irrelevant texts into short-context NLP tasks. This often leads to unrealistic context distributions and introduces evaluation biases. Alternatively, building benchmarks from scratch with human annotation demands substantial resources and complex manual effort.
  2. Retrieval-based benchmarks, such as Needle-in-a-Haystack, tend to overlook the nuanced relationships between retrieved passages and their surrounding context, making them overly simplistic and insufficient for comprehensive evaluation.

To address the above issues, we propose Ref-Long, which is specifically designed to assess the long-context referencing capability of LCLMs.

Ref-Long has several advantages:

  1. Ref-Long tasks consider the relationship information between the specific key and its surrounding context, which forces LCLMs to genuinely understand long contexts instead of simply relying on shortcuts to retrieve the key.
  2. Ref-Long tasks can be constructed cost-efficiently, as only the locations of specific keys are required (a construction sketch follows this list).
  3. Ref-Long tasks remain manageable for human annotators, allowing their difficulty level to be estimated.
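
Concretely, an instance of the synthetic subset can be pictured as a pool of indexed documents into which a designated key is planted at known positions, with the gold answer being exactly those document indexes. The Python sketch below is a hypothetical illustration of this construction: the document pool, key format, planted sentence, and prompt wording are our assumptions, not the benchmark's actual data or pipeline.

```python
import random

def build_ref_long_instance(documents, key, num_referencing=4, seed=0):
    """Sketch: plant `key` into a random subset of documents and record
    the 1-based indexes of those documents as the gold answer."""
    rng = random.Random(seed)
    gold_indexes = sorted(rng.sample(range(1, len(documents) + 1), num_referencing))

    numbered_docs = []
    for i, doc in enumerate(documents, start=1):
        suffix = f" The key {key} is mentioned in this document." if i in gold_indexes else ""
        numbered_docs.append(f"Document {i}:\n{doc}{suffix}")

    prompt = (
        "\n\n".join(numbered_docs)
        + f"\n\nWhich document indexes reference the key '{key}'? "
        + "Answer with the list of indexes."
    )
    return prompt, gold_indexes
```

Because the planted locations are recorded during construction, the gold answer comes for free, which is what keeps annotation cheap; difficulty can then be scaled through the number of documents, the context length, and how many documents reference each key.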

Surprisingly, even the strongest LCLMs face significant challenges in the multi-hard setting of our first subset, where the context length is only 24K—well below their maximum capacity.

[Figure: User input example]
[Table: Model evaluation results]
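
Since the expected output is a set of document indexes, a natural way to score a prediction is set-level comparison against the gold indexes, e.g. exact match and F1. The function below is a minimal sketch under that assumption, not the paper's official metric implementation.

```python
def score_prediction(predicted_indexes, gold_indexes):
    """Sketch: set-level exact match and F1 over predicted document indexes."""
    pred, gold = set(predicted_indexes), set(gold_indexes)
    exact_match = float(pred == gold)
    if not pred or not gold:
        return {"exact_match": exact_match, "f1": exact_match}
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"exact_match": exact_match, "f1": f1}

# Example: missing one gold index keeps F1 high but drops exact match to 0.
# score_prediction([2, 5, 9], [2, 5, 7, 9]) -> {'exact_match': 0.0, 'f1': 0.857...}
```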

Yet Ref-Long tasks are relatively easy for humans.

[Figure: Human performance on Ref-Long]

We conducted several analyses and found the following:

  1. Guiding GPT-4o with human strategies improves its performance, but the results are still far from satisfactory.
  2. Modifying the format of the keys does enhance LCLMs’ performance, yet the improvements remain limited.
  3. Fine-tuning also fails to offer a robust solution for addressing Ref-Long tasks.

These findings suggest that the challenges LCLMs face in Ref-Long tasks cannot be resolved merely by adjusting superficial factors during inference.

Finally, we extend Ref-Long to more realistic scenarios by introducing two additional subsets. We observe that the lack of referencing capability in LCLMs persists across both incoherent and coherent documents, as well as unrealistic and realistic settings—further emphasizing the severity of this issue.

[Figure: Realistic subset example 1]
[Figure: Realistic subset example 2]