Agent Memory Benchmark
Overview
Most AI agents fail not because of model quality, but because they forget. This benchmark measures how well agents remember information across long conversations (100+ turns, 4+ hours).
Why This Matters
When agents run for hours or days, they need to:
- Remember user preferences and corrections
- Recall past decisions and outcomes
- Retrieve relevant context quickly
- Manage limited context windows
Most teams don't measure this systematically. This benchmark provides a standard way to compare memory systems.
Methodology
Test Scenarios
We test agents across 5 scenarios:
1. Software Development (200 turns, 6 hours)
- Agent helps debug a complex codebase
- Must remember file locations, error patterns, user preferences
- Tests: recall of past fixes, code context retention
2. Customer Support (150 turns, 4 hours)
- Agent handles multiple customer issues
- Must remember customer history, past solutions, escalation patterns
- Tests: customer context retention, solution recall
3. Data Analysis (180 turns, 5 hours)
- Agent analyzes datasets and generates reports
- Must remember data schema, past queries, user insights
- Tests: schema retention, query pattern recall
4. Project Management (220 turns, 8 hours)
- Agent tracks tasks, deadlines, and dependencies
- Must remember task status, blockers, team context
- Tests: task retention, dependency recall
5. Research Assistant (250 turns, 10 hours)
- Agent helps with literature review and synthesis
- Must remember papers read, key findings, connections
- Tests: paper retention, finding recall, synthesis quality
Metrics
For each scenario, we measure six metrics (a scoring sketch follows the list):
1. Memory Retention Half-Life
- Time until 50% of critical information is forgotten
- Measured by asking agent to recall facts from earlier in conversation
- Higher is better (target: >4 hours)
2. Retrieval Precision@5
- % of top-5 retrieved memories that are relevant
- Measured by comparing retrieved context to ground truth
- Higher is better (target: >90%)
3. Retrieval Recall@5
- % of relevant memories found in top-5 results
- Measured by checking if all relevant context is retrieved
- Higher is better (target: >80%)
4. Context Utilization
- % of context window actually used for relevant information
- Measured by analyzing token usage vs. relevance
- Target: 60-80% (too low = wasted space, too high = no buffer)
5. Task Drift Rate
- % of tasks where agent loses focus or forgets objective
- Measured by comparing agent actions to task requirements
- Lower is better (target: <5%)
6. Retrieval Latency (P95)
- 95th percentile time to retrieve relevant memories
- Measured in milliseconds
- Lower is better (target: <100ms)
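The retrieval metrics follow the usual precision@k / recall@k definitions, and the half-life is estimated from how recall accuracy decays across checkpoints. A minimal scoring sketch in Python (function names and the interpolation scheme are our illustration, not necessarily the exact scorer used by the harness):

```python
# Minimal scoring sketch. Function and field names are illustrative,
# not the benchmark repo's actual API.

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved memories that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for m in top_k if m in relevant_ids) / len(top_k)

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant memories that appear in the top-k results."""
    if not relevant_ids:
        return 1.0
    top_k = set(retrieved_ids[:k])
    return sum(1 for m in relevant_ids if m in top_k) / len(relevant_ids)

def retention_half_life(checkpoints):
    """Estimate the elapsed time at which recall accuracy drops below 50%.

    `checkpoints` is a list of (elapsed_hours, fraction_recalled) pairs in
    time order. Linearly interpolates between the two checkpoints that
    straddle 0.5; returns the last timestamp if recall never drops below it.
    """
    prev_t, prev_r = 0.0, 1.0
    for t, r in checkpoints:
        if r < 0.5:
            return prev_t + (prev_r - 0.5) / (prev_r - r) * (t - prev_t)
        prev_t, prev_r = t, r
    return prev_t

# Examples:
# precision_at_k(["m3", "m7", "m1", "m9", "m2"], {"m3", "m1", "m5"})  -> 0.4
# recall_at_k(["m3", "m7", "m1", "m9", "m2"], {"m3", "m1", "m5"})     -> 0.667
# retention_half_life([(1, 0.9), (2, 0.7), (4, 0.4)])                 -> ~3.33 hours
```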
Test Protocol
1. Setup: Initialize the agent with the system prompt and task description
2. Execution: Run the agent through the scenario with predefined user inputs
3. Measurement: At turns 50, 100, 150, 200, and 250 (where the scenario runs that long):
   - Ask the agent to recall 10 random facts from earlier in the conversation
   - Measure retrieval accuracy, latency, and context usage
   - Check whether the agent is still on task
4. Analysis: Calculate the metrics and compare against the baseline (a checkpoint-measurement sketch follows)
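As a rough illustration of step 3, each checkpoint amounts to sampling probe questions from the ground truth and timing the agent's answers. The agent interface below (`ask`, `context_tokens_used`, `context_window`, `is_on_task`) and the fact fields are hypothetical placeholders, not the repo's actual API:

```python
import random
import statistics
import time

CHECKPOINT_TURNS = [50, 100, 150, 200, 250]

def measure_checkpoint(agent, ground_truth_facts, turn, n_probes=10):
    """Probe the agent at one checkpoint turn (hypothetical interfaces)."""
    probes = random.sample(ground_truth_facts, k=n_probes)
    recalled, latencies = 0, []

    for fact in probes:
        start = time.perf_counter()
        answer = agent.ask(fact["question"])                     # recall probe
        latencies.append((time.perf_counter() - start) * 1000)   # milliseconds
        if fact["expected"].lower() in answer.lower():           # simple containment check
            recalled += 1

    return {
        "turn": turn,
        "recall_accuracy": recalled / n_probes,
        "p95_latency_ms": statistics.quantiles(latencies, n=20)[18],
        "context_utilization": agent.context_tokens_used() / agent.context_window(),
        "on_task": agent.is_on_task(),                           # feeds the drift rate
    }
```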
Ground Truth
For each scenario, we have:
- 100+ predefined facts that should be remembered
- Expected retrieval results for each query
- Task completion criteria
- Acceptable drift thresholds
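For illustration, a ground-truth fact and a retrieval query might look like the records below. The field names are assumptions for this sketch; the released JSONL files define the actual schema.

```python
# Hypothetical record shapes; the dataset files define the real schema.

fact = {
    "id": "fact-042",
    "turn_introduced": 37,
    "statement": "The user prefers pytest over unittest",
    "question": "Which test framework does the user prefer?",
    "expected": "pytest",
}

retrieval_query = {
    "id": "query-011",
    "turn_asked": 120,
    "query": "What testing conventions has the user asked for?",
    "relevant_fact_ids": ["fact-042", "fact-057"],  # ground-truth targets for top-5 scoring
}
```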
Results
r3 Memory System (v1.0)
Tested on 2025-10-28 with GPT-4 as the base model.
Software Development Scenario:
- Memory Retention Half-Life: 6.2 hours ✅
- Retrieval Precision@5: 94% ✅
- Retrieval Recall@5: 87% ✅
- Context Utilization: 68% ✅
- Task Drift Rate: 3% ✅
- Retrieval Latency P95: 85ms ✅
Customer Support Scenario:
- Memory Retention Half-Life: 5.8 hours ✅
- Retrieval Precision@5: 92% ✅
- Retrieval Recall@5: 84% ✅
- Context Utilization: 72% ✅
- Task Drift Rate: 4% ✅
- Retrieval Latency P95: 78ms ✅
Data Analysis Scenario:
- Memory Retention Half-Life: 5.5 hours ✅
- Retrieval Precision@5: 91% ✅
- Retrieval Recall@5: 82% ✅
- Context Utilization: 65% ✅
- Task Drift Rate: 5% ✅
- Retrieval Latency P95: 92ms ✅
Project Management Scenario:
- Memory Retention Half-Life: 7.1 hours ✅
- Retrieval Precision@5: 95% ✅
- Retrieval Recall@5: 89% ✅
- Context Utilization: 70% ✅
- Task Drift Rate: 2% ✅
- Retrieval Latency P95: 88ms ✅
Research Assistant Scenario:
- Memory Retention Half-Life: 6.8 hours ✅
- Retrieval Precision@5: 93% ✅
- Retrieval Recall@5: 86% ✅
- Context Utilization: 67% ✅
- Task Drift Rate: 3% ✅
- Retrieval Latency P95: 95ms ✅
Baseline (No Memory System)
Tested with GPT-4 using only conversation history (no external memory).
Software Development Scenario:
- Memory Retention Half-Life: 1.2 hours ❌
- Retrieval Precision@5: 45% ❌
- Retrieval Recall@5: 38% ❌
- Context Utilization: 95% ⚠️ (context overflow)
- Task Drift Rate: 28% ❌
- Retrieval Latency P95: N/A
Average Across All Scenarios:
- Memory Retention Half-Life: 1.4 hours ❌
- Retrieval Precision@5: 48% ❌
- Retrieval Recall@5: 41% ❌
- Context Utilization: 94% ⚠️
- Task Drift Rate: 25% ❌
Key Findings
1. Memory Systems Dramatically Improve Long-Horizon Performance
Averaged across scenarios, r3 achieved 4-5x longer memory retention (about 6.3 hours vs 1.4 hours for the baseline) and roughly 2x better retrieval precision@5 (93% vs 48%).
2. Task Drift Correlates with Memory Retention
Agents with poor memory drift off task 5-10x more often (25% vs 3%). Memory isn't just about recall—it's about staying focused.
3. Sub-100ms Retrieval is Achievable
r3 achieved P95 retrieval latency of 78-95ms across all scenarios. This is fast enough for real-time agent interactions.
4. Context Utilization Sweet Spot is 60-80%
Too low (<50%) means wasted context space. Too high (>90%) means no buffer for new information. r3's 65-72% utilization is optimal.
5. Retrieval Precision Matters More Than Recall
High precision (94%) with moderate recall (87%) works better than moderate precision (75%) with high recall (95%). Quality over quantity.
Datasets
All datasets are available for download:
1. Software Development Dataset
- 200 turns, 6 hours of conversation
- 120 facts to remember (file locations, error patterns, fixes)
- 50 retrieval queries with ground truth
- Format: JSONL
- Size: 2.5 MB
- Download
2. Customer Support Dataset
- 150 turns, 4 hours of conversation
- 100 facts to remember (customer history, solutions, escalations)
- 40 retrieval queries with ground truth
- Format: JSONL
- Size: 1.8 MB
- Download
3. Data Analysis Dataset
- 180 turns, 5 hours of conversation
- 110 facts to remember (schema, queries, insights)
- 45 retrieval queries with ground truth
- Format: JSONL
- Size: 2.2 MB
- Download
4. Project Management Dataset
- 220 turns, 8 hours of conversation
- 140 facts to remember (tasks, deadlines, dependencies)
- 55 retrieval queries with ground truth
- Format: JSONL
- Size: 3.1 MB
- Download
5. Research Assistant Dataset
- 250 turns, 10 hours of conversation
- 160 facts to remember (papers, findings, connections)
- 60 retrieval queries with ground truth
- Format: JSONL
- Size: 3.8 MB
- Download
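Each dataset is plain JSONL (one JSON object per line), so it can be inspected with a few lines of Python. A loading sketch (the filename here is an assumption; use whichever file you downloaded):

```python
import json

records = []
# Hypothetical filename; substitute the dataset file you downloaded.
with open("software_development.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

print(f"Loaded {len(records)} records")
print(records[0])  # inspect the first turn / fact / query
```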
Reproducing Results
Requirements
- Python 3.9+
- OpenAI API key (or other LLM provider)
- Redis (for r3 memory system)
- 16GB RAM
Installation
```bash
# Clone the benchmark repo
git clone https://github.com/n3wth/agent-memory-benchmark
cd agent-memory-benchmark

# Install dependencies
pip install -r requirements.txt

# Set up environment
export OPENAI_API_KEY=your_key_here
export REDIS_URL=redis://localhost:6379
```
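Before a long run it can be worth confirming that Redis is reachable. A quick smoke test, assuming the `redis` Python client is available (install it separately if requirements.txt does not pull it in):

```python
import os
import redis  # pip install redis, if not already present

client = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))
client.ping()  # raises an exception if Redis is unreachable
print("Redis connection OK")
```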
Running Benchmarks
```bash
# Run all scenarios
python run_benchmark.py --all

# Run a specific scenario
python run_benchmark.py --scenario software-dev

# Run with a custom memory system
python run_benchmark.py --memory-system custom --config config.yaml

# Generate a report
python generate_report.py --output results/
```
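The `--memory-system custom` option implies a plugin that provides store and retrieval operations; the exact interface is defined in the benchmark repo. As a rough sketch of the shape such an adapter usually takes (class, method, and field names below are hypothetical):

```python
# Hypothetical adapter shape for a custom memory system; check the repo's
# plugin interface for the actual class and method names.
from dataclasses import dataclass


@dataclass
class Memory:
    id: str
    text: str
    turn: int


class CustomMemorySystem:
    def __init__(self, config: dict):
        self.config = config
        self._memories: list[Memory] = []

    def store(self, memory: Memory) -> None:
        """Persist a memory extracted from the conversation."""
        self._memories.append(memory)

    def retrieve(self, query: str, k: int = 5) -> list[Memory]:
        """Return the top-k memories relevant to the query.

        Naive keyword-overlap ranking here; swap in embeddings, BM25, or
        whatever your system actually uses.
        """
        words = set(query.lower().split())
        scored = [(len(words & set(m.text.lower().split())), m) for m in self._memories]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:k] if score > 0]
```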
Expected Runtime
- Each scenario: 30-60 minutes
- All scenarios: 3-4 hours
- Parallel execution: 1-2 hours (with 5 workers)
Submitting Results
Want to benchmark your own memory system? Submit results as follows:
1. Run the benchmarks with your system
2. Generate a report: `python generate_report.py`
3. Open a PR against the benchmark repo
4. Include: system name, version, config, and results JSON
We'll review your submission and add it to the public leaderboard.
Roadmap
Q1 2026:
- Add multimodal scenarios (images, code, documents)
- Expand to 500+ turn conversations
- Add cost metrics (tokens, compute, storage)
Q2 2026:
- Add adversarial scenarios (misleading info, contradictions)
- Benchmark memory compression strategies
- Add real-time streaming scenarios
Q3 2026:
- Add multi-agent scenarios (shared memory)
- Benchmark privacy-preserving memory systems
- Add domain-specific scenarios (medical, legal, finance)
Citation
If you use this benchmark in your research, please cite:
```bibtex
@misc{newth2025agentmemory,
  title={Agent Memory Benchmark: Measuring Memory Retention in Long-Running AI Agents},
  author={Newth, Oliver},
  year={2025},
  url={https://benchmarks.newth.ai/agent-memory}
}
```
Contact
Questions or suggestions? Reach out: