CacheNotes: Task-aware key-value cache compression for reasoning-intensive knowledge tasks

Corallo, Giulio; Weller, Orion; Petroni, Fabio; Papotti, Paolo
EACL 2026, 19th Conference of the European Chapter of the Association for Computational Linguistics, 24-29 March 2026, Rabat, Morocco

Integrating external knowledge into Large Language Models (LLMs) is crucial for many real-world applications, yet current methods like Retrieval-Augmented Generation (RAG) face limitations with broad, multi-source queries, while long-context models are computationally prohibitive. We introduce CacheNotes: Task-Aware Key-Value Cache Compression. Given a task description and a corpus, CacheNotes first generates a sequence of Compression-Planning-Tokens (CPTs) in an offline, task-focused distillation pass that identifies and organizes key information from the corpus. These CPTs then guide a one-time compression of the corpus into a compact, reusable KV cache, which is used alone at inference time to efficiently answer diverse, reasoning-intensive queries, eliminating repeated retrieval or context expansion. Experiments on LongBench show that, on question-answering tasks at 20× compression, CacheNotes outperforms RAG by over 8 F1 points and reduces latency by over . On RULER, it surpasses previous query-agnostic compression methods by 55 points, narrowing the gap to query-aware compression approaches. Additional results on real-world enterprise and synthetic datasets demonstrate strong performance on multi-hop and broad-coverage queries.


Type: Conference
City: Rabat
Date: 2026-03-24
Department: Data Science
Eurecom Ref: 8688
Copyright: Copyright ACL. Personal use of this material is permitted. The definitive version of this paper was published in EACL 2026, 19th Conference of the European Chapter of the Association for Computational Linguistics, 24-29 March 2026, Rabat, Morocco and is available at: http://dx.doi.org/10.18653/v1/2026.eacl-long.309

PERMALINK : https://www.eurecom.fr/publication/8688