agent-memory-bench: open-source benchmark to detect the real memory failures of AI agents

A developer releases an offline, dependency-free benchmark that measures the four critical failure modes of AI agent memory systems: retraction, collision, recall and conflict. Traditional retrieval metrics hide vast differences in real performance (from 23% to 92%).
By GitHub (Kausha3/agent-memory-bench) · June 27, 2026.
**The problem it aims to solve**
Almost all modern AI agents incorporate some "memory" module to retain information between turns or sessions. However, the usual way of evaluating that memory is superficial: did it retrieve the relevant chunk? According to the author of this repository, that criterion hides the failures that really degrade agents in production. An agent does not fail because it can't find a piece of data; it fails because it retrieves an outdated piece of data, confuses two similar entities, loses a fact buried under noise, or returns information that contradicts what it already believed to be true. Those four cases are what `agent-memory-bench` puts under the microscope.
**The four failure categories**
The benchmark precisely defines four failure modes:

- **Retraction**: a fact is updated; the system must return the new value and not leak the old one.
- **Collision**: two similar entities exist; the system must answer about the one asked without confusing them.
- **Recall**: a fact was stated early in the conversation, is needed late, and there is intervening noise. It includes a multi-hop reasoning boundary case that none of the three current baselines solve.
- **Conflict**: a fact is explicitly contradicted within the text itself; the system must resolve which is the current value.

The full definitions, worked examples, and the rationale for why each mode is hard are documented in `TAXONOMY.md`.
**Technical architecture: offline, no API key, reproducible**
One of the project's explicit goals is frictionless reproducibility. The benchmark runs entirely offline, with no external dependencies and no need for an API key. Two commands print the leaderboard, and a third runs the test suite:

```
npm install
npm run bench   # prints the leaderboard
npm test        # adversarial suite over the scoring core and the baselines
```

It is written entirely in TypeScript (100% of the repository). The scenarios are ordered scripts of `remember` and `query` events. Each `query` declares the substring the correct answer must contain **and** the outdated substrings that must not appear, so returning a stale value counts as a failure, not a partial hit.
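
The pass/fail rule described above can be sketched as follows. This is a minimal illustration, not the repository's code: the `ScenarioEvent` shape, the field names `mustContain` and `mustNotContain`, and the `judge` function are all assumptions made for the example, and the article does not say whether the real scorer normalizes case or whitespace.

```typescript
// HYPOTHETICAL shapes: the article only says scenarios are ordered scripts of
// `remember` and `query` events; these field names are assumptions.
type ScenarioEvent =
  | { kind: "remember"; text: string }
  | { kind: "query"; question: string; mustContain: string; mustNotContain: string[] };

interface Expectation {
  mustContain: string;
  mustNotContain: string[];
}

// A query passes only if the expected substring is present AND no stale
// substring leaks into the answer.
function judge(answer: string, expected: Expectation): boolean {
  const hit = answer.includes(expected.mustContain);
  const leaked = expected.mustNotContain.some((stale) => answer.includes(stale));
  return hit && !leaked;
}

// A retraction-style scenario: the city changes, so the old one must not appear.
const retraction: ScenarioEvent[] = [
  { kind: "remember", text: "Alice lives in Boston." },
  { kind: "remember", text: "Alice moved to Denver." },
  { kind: "query", question: "Where does Alice live?", mustContain: "Denver", mustNotContain: ["Boston"] },
];

const expected: Expectation = { mustContain: "Denver", mustNotContain: ["Boston"] };
console.log(judge("Denver", expected));              // true
console.log(judge("Boston", expected));              // false: stale value
console.log(judge("Boston, then Denver", expected)); // false: correct value present, but the old one leaked
```

The third call is the interesting one: an answer that contains both values would be a partial hit under a retrieval metric, but here it counts as a failure.
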
The harness (`src/harness.ts`) resets the system, replays the scenario, and judges each query in isolation. The scoring (`src/score.ts`, `src/report.ts`) aggregates per-category and global rates and renders the leaderboard.
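
The reset, replay, judge loop can be sketched like this. Only the `MemorySystem` interface comes from the article; the `Scenario` and `ScenarioEvent` shapes, the `runScenario` function, and the toy `LastWins` system are assumptions for illustration and do not reproduce `src/harness.ts`.

```typescript
// Interface as quoted in the article (src/types.ts).
interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

// HYPOTHETICAL event and scenario shapes; the field names are assumptions.
type ScenarioEvent =
  | { kind: "remember"; text: string }
  | { kind: "query"; question: string; mustContain: string; mustNotContain: string[] };

interface Scenario {
  category: string;
  events: ScenarioEvent[];
}

// Reset the system, replay the events in order, and judge every query on its own.
async function runScenario(system: MemorySystem, scenario: Scenario): Promise<boolean[]> {
  await system.reset();
  const verdicts: boolean[] = [];
  for (const event of scenario.events) {
    if (event.kind === "remember") {
      await system.remember(event.text);
    } else {
      const answer = await system.query(event.question);
      const hit = answer.includes(event.mustContain);
      const leaked = event.mustNotContain.some((stale) => answer.includes(stale));
      verdicts.push(hit && !leaked);
    }
  }
  return verdicts;
}

// Toy system for the demo: always answers with the last remembered sentence.
class LastWins implements MemorySystem {
  readonly name = "last-wins";
  private last = "";
  reset(): void { this.last = ""; }
  remember(text: string): void { this.last = text; }
  query(_question: string): string { return this.last; }
}

const retractionScenario: Scenario = {
  category: "retraction",
  events: [
    { kind: "remember", text: "Alice lives in Boston." },
    { kind: "remember", text: "Alice moved to Denver." },
    { kind: "query", question: "Where does Alice live?", mustContain: "Denver", mustNotContain: ["Boston"] },
  ],
};

runScenario(new LastWins(), retractionScenario).then((verdicts) => console.log(verdicts)); // [ true ]
```

Awaiting every call lets synchronous and asynchronous systems share one code path, which matches the interface's `void | Promise<void>` return types.
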
**The current leaderboard (v0.1)**
The repository includes three reference baselines over 13 scenarios across 4 categories:

| System | Retraction | Collision | Recall | Conflict | Global |
|---|---|---|---|---|---|
| typed-constraint | 100% | 100% | 75% | 100% | **92%** |
| keyword | 0% | 100% | 75% | 0% | **46%** |
| recency | 100% | 0% | 0% | 0% | **23%** |

The author emphasizes that this should be read as **a map of where each strategy fails**, not as a product ranking:

- **keyword** (similarity-based retrieval, no time model) hits collision but fails completely on retraction and conflict: with no notion of time, it cheerfully returns the value the user has already changed.
- **recency** (the most recent token wins) fixes retraction but collapses on collision and recall: it drifts toward the most recent lookalike, which is usually the wrong entity.
- **typed-constraint** models time (facts are retracted) and identity (facts are linked to an entity), which lets it pass three categories. It fails on the single multi-hop recall scenario, a deliberate boundary item that no baseline solves, ensuring the benchmark is not saturated.

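
The opposite failure patterns of the two simpler strategies can be reproduced with toy versions. These are illustrative re-creations, not the repository's baselines: `keywordAnswer` and `recencyAnswer` are hypothetical names, and both work at sentence level under the assumption that word overlap stands in for similarity.

```typescript
// Toy re-creations for illustration only; NOT the repository's baseline code.
const tokenize = (text: string): string[] =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0);

// Keyword-style: return the stored sentence sharing the most words with the
// question. There is no notion of time, so on a tie the older sentence wins.
function keywordAnswer(memories: string[], question: string): string {
  const wanted = new Set(tokenize(question));
  let best = "";
  let bestScore = 0;
  for (const memory of memories) {
    const score = tokenize(memory).filter((t) => wanted.has(t)).length;
    if (score > bestScore) {
      best = memory;
      bestScore = score;
    }
  }
  return best;
}

// Recency-style: return whatever was stored last, regardless of the entity asked about.
function recencyAnswer(memories: string[], _question: string): string {
  return memories.length > 0 ? memories[memories.length - 1] : "";
}

const question = "What is Alice's phone?";
const retraction = ["Alice's phone is 555-0100.", "Alice's phone is now 555-0199."];
const collision = ["Alice's phone is 555-0100.", "Alicia's phone is 555-0177."];

console.log(keywordAnswer(retraction, question)); // Alice's phone is 555-0100.     (stale value)
console.log(recencyAnswer(retraction, question)); // Alice's phone is now 555-0199. (current value)
console.log(keywordAnswer(collision, question));  // Alice's phone is 555-0100.     (right entity)
console.log(recencyAnswer(collision, question));  // Alicia's phone is 555-0177.    (wrong entity)
```

Each toy passes exactly the case the other fails, which is the pattern the leaderboard shows for retraction and collision.
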
The key takeaway: conventional retrieval-quality metrics would score the three systems similarly, whereas their actual correctness ranges from 23% to 92%. That gap is precisely the point.
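
As a sanity check on the Global column, the sketch below recomputes it from the per-category rates. It assumes that every scenario counts equally and that the 13 scenarios split 3/3/4/3 across retraction, collision, recall, and conflict. That split is an inference from the published percentages, not a figure stated by the repository.

```typescript
// ASSUMPTION: per-category scenario counts inferred from the published
// percentages (they sum to the 13 scenarios the article mentions).
const scenarioCounts = { retraction: 3, collision: 3, recall: 4, conflict: 3 };
type Category = keyof typeof scenarioCounts;
const categories: Category[] = ["retraction", "collision", "recall", "conflict"];

// Per-category pass rates as published in the v0.1 leaderboard.
const leaderboard: { name: string; rates: Record<Category, number> }[] = [
  { name: "typed-constraint", rates: { retraction: 1, collision: 1, recall: 0.75, conflict: 1 } },
  { name: "keyword", rates: { retraction: 0, collision: 1, recall: 0.75, conflict: 0 } },
  { name: "recency", rates: { retraction: 1, collision: 0, recall: 0, conflict: 0 } },
];

// Global rate = passed scenarios / total scenarios, rounded to a whole percent.
function globalRate(rates: Record<Category, number>): number {
  const total = categories.reduce((sum, c) => sum + scenarioCounts[c], 0);
  const passed = categories.reduce((sum, c) => sum + rates[c] * scenarioCounts[c], 0);
  return Math.round((passed / total) * 100);
}

for (const row of leaderboard) {
  console.log(`${row.name}: ${globalRate(row.rates)}%`);
}
// typed-constraint: 92%
// keyword: 46%
// recency: 23%
```

Under these assumptions the three systems pass 12, 6, and 3 of the 13 scenarios respectively.
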
**How to add your own system**
The interface an external system must implement is minimal (`src/types.ts`):

```typescript
interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}
```

The methods can be asynchronous, so an embedding store, a hosted memory product, or an LLM-backed extractor plugs in exactly like the pure-code baselines. You just add the class in `src/systems/` and include it in `src/run.ts`.
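
A custom system might look like the sketch below. `LatestPerSubject` is a hypothetical class invented for this example, with an in-memory `Map` standing in for whatever store a real adapter would call. How the class is registered in `src/run.ts` is not shown because the article does not describe it.

```typescript
// Interface as quoted in the article (src/types.ts).
interface MemorySystem {
  readonly name: string;
  reset(): void | Promise<void>;
  remember(text: string): void | Promise<void>;
  query(question: string): string | Promise<string>;
}

const words = (text: string): string[] =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);

// HYPOTHETICAL system: keeps only the latest sentence per subject, where the
// subject is crudely taken to be the first word of the sentence.
class LatestPerSubject implements MemorySystem {
  readonly name = "latest-per-subject";
  private readonly facts = new Map<string, string>();

  reset(): void {
    this.facts.clear();
  }

  // Async on purpose: a real adapter might await a database or an API here.
  async remember(text: string): Promise<void> {
    const subject = words(text)[0];
    if (subject !== undefined) this.facts.set(subject, text);
  }

  // Returns the stored fact for the first question word that is a known subject.
  async query(question: string): Promise<string> {
    for (const word of words(question)) {
      const fact = this.facts.get(word);
      if (fact !== undefined) return fact;
    }
    return ""; // nothing known about any word in the question
  }
}

async function demo(): Promise<string> {
  const memory = new LatestPerSubject();
  await memory.reset();
  await memory.remember("Alice lives in Boston.");
  await memory.remember("Alice moved to Denver.");
  return memory.query("Where does Alice live?");
}

demo().then((answer) => console.log(answer)); // Alice moved to Denver.
```

The class mixes a synchronous `reset` with asynchronous `remember` and `query`, which the interface's union return types allow.
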
**Status and roadmap**
The current version is v0.1: 4 categories, 13 scenarios, 3 reference baselines. The roadmap plans to expand each category with more scenarios and harder distractors, add temporal-drift and preference categories, incorporate an LLM-judge mode for free-form answers, and publish a contribution guide so external memory systems can submit to the leaderboard. The author notes that the most valuable contributions are new adversarial scenarios that break the `typed-constraint` baseline.
**Implications for agentic AI**
This benchmark touches on a significant blind spot in the current ecosystem. In general, agent memory research and products (from classic RAG to systems like MemGPT, Zep, or the built-in memory features of OpenAI and Anthropic) are evaluated almost exclusively with metrics inherited from information retrieval: precision, chunk-level recall, MRR, NDCG. These metrics do not capture whether the agent will end up confusing two users, returning an old phone number, or simultaneously asserting two contradictory facts.
The approach of `agent-memory-bench` is more aligned with how agents fail in real continuous-use scenarios: personal assistant tasks, conversational CRM, support agents, or any application where the state of the world changes over time and the agent must maintain coherence. Retraction and conflict are especially critical in domains where data changes frequently (prices, availability, user preferences, project status).
**Limitations of the project in its current state**
The repository currently has 0 stars, 0 forks, and 0 comments in its Hacker News discussion, which suggests a very recent, single-contributor project. The suite of 13 scenarios is a modest, though deliberate, starting point: the benchmark stays unsaturated precisely because it includes a boundary case that none of the current baselines solves. No scenario yet exercises a real language model (the LLM-judge mode is on the roadmap, not implemented), which currently limits its direct applicability to production systems that use embeddings or LLMs for memory management.
**Sector context**
Rigorous evaluation of agent memory systems is an emerging area that is drawing increasing attention. Projects like MemoryBench (different authors, different scope) or long-dialogue reasoning benchmarks (such as LongMemEval) address similar facets, albeit with different approaches. The `agent-memory-bench` proposal stands out for its emphasis on specific failure modes and its fully local, reproducible execution, which makes it easy to integrate into CI pipelines for projects building their own memory systems.
**Regulatory perspective**
From the standpoint of the EU AI Act, the memory failures this benchmark describes, especially retraction and conflict, have direct implications for high-risk systems where factual accuracy is critical. An agent that returns outdated medical, legal, or financial information because its memory mishandles retractions could fall short of the Regulation's accuracy and robustness requirements. Benchmarks that target these failure modes can support the technical documentation the Act requires for high-risk AI systems.