Agent Memory Benchmark
Overview
Most AI agents fail not because of model quality, but because they forget. This benchmark measures how well agents remember information across long conversations (100+ turns, 4+ hours).
Why This Matters
When agents run for hours or days, they need to:
- Remember user preferences and corrections
- Recall past decisions and outcomes
- Retrieve relevant context quickly
- Manage limited context windows
Most teams don't measure this systematically. This benchmark provides a standard way to compare memory systems.
Methodology
Test Scenarios
We test agents across 5 scenarios:
1. Software Development (200 turns, 6 hours)
- Agent helps debug a complex codebase
- Must remember file locations, error patterns, user preferences
- Tests: recall of past fixes, code context retention
2. Customer Support (150 turns, 4 hours)
- Agent handles multiple customer issues
- Must remember customer history, past solutions, escalation patterns
- Tests: customer context retention, solution recall
3. Data Analysis (180 turns, 5 hours)
- Agent analyzes datasets and generates reports
- Must remember data schema, past queries, user insights
- Tests: schema retention, query pattern recall
4. Project Management (220 turns, 8 hours)
- Agent tracks tasks, deadlines, and dependencies
- Must remember task status, blockers, team context
- Tests: task retention, dependency recall
5. Research Assistant (250 turns, 10 hours)
- Agent helps with literature review and synthesis
- Must remember papers read, key findings, connections
- Tests: paper retention, finding recall, synthesis quality
Metrics
For each scenario, we measure six metrics (a scoring sketch follows the list):
1. Memory Retention Half-Life
- Time until 50% of critical information is forgotten
- Measured by asking agent to recall facts from earlier in conversation
- Higher is better (target: >4 hours)
2. Retrieval Precision@5
- % of top-5 retrieved memories that are relevant
- Measured by comparing retrieved context to ground truth
- Higher is better (target: >90%)
3. Retrieval Recall@5
- % of relevant memories found in top-5 results
- Measured by checking if all relevant context is retrieved
- Higher is better (target: >80%)
4. Context Utilization
- % of context window actually used for relevant information
- Measured by analyzing token usage vs. relevance
- Target: 60-80% (too low = wasted space, too high = no buffer)
5. Task Drift Rate
- % of tasks where agent loses focus or forgets objective
- Measured by comparing agent actions to task requirements
- Lower is better (target: <5%)
6. Retrieval Latency (P95)
- 95th percentile time to retrieve relevant memories
- Measured in milliseconds
- Lower is better (target: <100ms)
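The retrieval metrics follow the usual precision@k / recall@k definitions, and the half-life is estimated from how recall accuracy decays across checkpoints. A minimal scoring sketch in Python (function names and the interpolation scheme are our illustration, not necessarily the exact scorer used by the harness):

```python
# Minimal scoring sketch. Function and field names are illustrative,
# not the benchmark repo's actual API.

def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved memories that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for m in top_k if m in relevant_ids) / len(top_k)

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant memories that appear in the top-k results."""
    if not relevant_ids:
        return 1.0
    top_k = set(retrieved_ids[:k])
    return sum(1 for m in relevant_ids if m in top_k) / len(relevant_ids)

def retention_half_life(checkpoints):
    """Estimate the elapsed time at which recall accuracy drops below 50%.

    `checkpoints` is a list of (elapsed_hours, fraction_recalled) pairs in
    time order. Linearly interpolates between the two checkpoints that
    straddle 0.5; returns the last timestamp if recall never drops below it.
    """
    prev_t, prev_r = 0.0, 1.0
    for t, r in checkpoints:
        if r < 0.5:
            return prev_t + (prev_r - 0.5) / (prev_r - r) * (t - prev_t)
        prev_t, prev_r = t, r
    return prev_t

# Examples:
# precision_at_k(["m3", "m7", "m1", "m9", "m2"], {"m3", "m1", "m5"})  -> 0.4
# recall_at_k(["m3", "m7", "m1", "m9", "m2"], {"m3", "m1", "m5"})     -> 0.667
# retention_half_life([(1, 0.9), (2, 0.7), (4, 0.4)])                 -> ~3.33 hours
```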
Test Protocol
1. Setup: Initialize the agent with the system prompt and task description
2. Execution: Run the agent through the scenario with predefined user inputs
3. Measurement: At turns 50, 100, 150, 200, and 250 (where the scenario runs that long):
   - Ask the agent to recall 10 random facts from earlier in the conversation
   - Measure retrieval accuracy, latency, and context usage
   - Check whether the agent is still on task
4. Analysis: Calculate the metrics and compare against the baseline (a checkpoint-measurement sketch follows)
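As a rough illustration of step 3, each checkpoint amounts to sampling probe questions from the ground truth and timing the agent's answers. The agent interface below (`ask`, `context_tokens_used`, `context_window`, `is_on_task`) and the fact fields are hypothetical placeholders, not the repo's actual API:

```python
import random
import statistics
import time

CHECKPOINT_TURNS = [50, 100, 150, 200, 250]

def measure_checkpoint(agent, ground_truth_facts, turn, n_probes=10):
    """Probe the agent at one checkpoint turn (hypothetical interfaces)."""
    probes = random.sample(ground_truth_facts, k=n_probes)
    recalled, latencies = 0, []

    for fact in probes:
        start = time.perf_counter()
        answer = agent.ask(fact["question"])                     # recall probe
        latencies.append((time.perf_counter() - start) * 1000)   # milliseconds
        if fact["expected"].lower() in answer.lower():           # simple containment check
            recalled += 1

    return {
        "turn": turn,
        "recall_accuracy": recalled / n_probes,
        "p95_latency_ms": statistics.quantiles(latencies, n=20)[18],
        "context_utilization": agent.context_tokens_used() / agent.context_window(),
        "on_task": agent.is_on_task(),                           # feeds the drift rate
    }
```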
Ground Truth
For each scenario, we have:
- 100+ predefined facts that should be remembered
- Expected retrieval results for each query
- Task completion criteria
- Acceptable drift thresholds
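For illustration, a ground-truth fact and a retrieval query might look like the records below. The field names are assumptions for this sketch; the released JSONL files define the actual schema.

```python
# Hypothetical record shapes; the dataset files define the real schema.

fact = {
    "id": "fact-042",
    "turn_introduced": 37,
    "statement": "The user prefers pytest over unittest",
    "question": "Which test framework does the user prefer?",
    "expected": "pytest",
}

retrieval_query = {
    "id": "query-011",
    "turn_asked": 120,
    "query": "What testing conventions has the user asked for?",
    "relevant_fact_ids": ["fact-042", "fact-057"],  # ground-truth targets for top-5 scoring
}
```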
Results
r3 Memory System (v1.0)
Tested on 2025-10-28 with GPT-4 as the base model.
Software Development Scenario:
- Memory Retention Half-Life: 6.2 hours ✅
- Retrieval Precision@5: 94% ✅
- Retrieval Recall@5: 87% ✅
- Context Utilization: 68% ✅
- Task Drift Rate: 3% ✅
- Retrieval Latency P95: 85ms ✅
Customer Support Scenario:
- Memory Retention Half-Life: 5.8 hours ✅
- Retrieval Precision@5: 92% ✅
- Retrieval Recall@5: 84% ✅
- Context Utilization: 72% ✅
- Task Drift Rate: 4% ✅
- Retrieval Latency P95: 78ms ✅
Data Analysis Scenario:
- Memory Retention Half-Life: 5.5 hours ✅
- Retrieval Precision@5: 91% ✅
- Retrieval Recall@5: 82% ✅
- Context Utilization: 65% ✅
- Task Drift Rate: 5% ✅
- Retrieval Latency P95: 92ms ✅
Project Management Scenario:
- Memory Retention Half-Life: 7.1 hours ✅
- Retrieval Precision@5: 95% ✅
- Retrieval Recall@5: 89% ✅
- Context Utilization: 70% ✅
- Task Drift Rate: 2% ✅
- Retrieval Latency P95: 88ms ✅
Research Assistant Scenario:
- Memory Retention Half-Life: 6.8 hours ✅
- Retrieval Precision@5: 93% ✅
- Retrieval Recall@5: 86% ✅
- Context Utilization: 67% ✅
- Task Drift Rate: 3% ✅
- Retrieval Latency P95: 95ms ✅
Baseline (No Memory System)
Tested with GPT-4 using only conversation history (no external memory).
Software Development Scenario:
- Memory Retention Half-Life: 1.2 hours ❌
- Retrieval Precision@5: 45% ❌
- Retrieval Recall@5: 38% ❌
- Context Utilization: 95% ⚠️ (context overflow)
- Task Drift Rate: 28% ❌
- Retrieval Latency P95: N/A
Average Across All Scenarios:
- Memory Retention Half-Life: 1.4 hours ❌
- Retrieval Precision@5: 48% ❌
- Retrieval Recall@5: 41% ❌
- Context Utilization: 94% ⚠️
- Task Drift Rate: 25% ❌
Key Findings
1. Memory Systems Dramatically Improve Long-Horizon Performance
Averaged across scenarios, r3 achieved 4-5x longer memory retention (about 6.3 hours vs 1.4 hours for the baseline) and roughly 2x better retrieval precision@5 (93% vs 48%).
2. Task Drift Correlates with Memory Retention
Agents with poor memory drift off task 5-10x more often (25% vs 3%). Memory isn't just about recall—it's about staying focused.
3. Sub-100ms Retrieval is Achievable
r3 achieved P95 retrieval latency of 78-95ms across all scenarios. This is fast enough for real-time agent interactions.
4. Context Utilization Sweet Spot is 60-80%
Too low (<50%) means wasted context space. Too high (>90%) means no buffer for new information. r3's 65-72% utilization is optimal.
5. Retrieval Precision Matters More Than Recall
High precision (94%) with moderate recall (87%) works better than moderate precision (75%) with high recall (95%). Quality over quantity.
Datasets
All datasets are available for download:
1. Software Development Dataset
- 200 turns, 6 hours of conversation
- 120 facts to remember (file locations, error patterns, fixes)
- 50 retrieval queries with ground truth
- Format: JSONL
- Size: 2.5 MB
- Download
2. Customer Support Dataset
- 150 turns, 4 hours of conversation
- 100 facts to remember (customer history, solutions, escalations)
- 40 retrieval queries with ground truth
- Format: JSONL
- Size: 1.8 MB
- Download
3. Data Analysis Dataset
- 180 turns, 5 hours of conversation
- 110 facts to remember (schema, queries, insights)
- 45 retrieval queries with ground truth
- Format: JSONL
- Size: 2.2 MB
- Download
4. Project Management Dataset
- 220 turns, 8 hours of conversation
- 140 facts to remember (tasks, deadlines, dependencies)
- 55 retrieval queries with ground truth
- Format: JSONL
- Size: 3.1 MB
- Download
5. Research Assistant Dataset
- 250 turns, 10 hours of conversation
- 160 facts to remember (papers, findings, connections)
- 60 retrieval queries with ground truth
- Format: JSONL
- Size: 3.8 MB
- Download
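Each dataset is plain JSONL (one JSON object per line), so it can be inspected with a few lines of Python. A loading sketch (the filename here is an assumption; use whichever file you downloaded):

```python
import json

records = []
# Hypothetical filename; substitute the dataset file you downloaded.
with open("software_development.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

print(f"Loaded {len(records)} records")
print(records[0])  # inspect the first turn / fact / query
```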
Reproducing Results
Requirements
- Python 3.9+
- OpenAI API key (or other LLM provider)
- Redis (for r3 memory system)
- 16GB RAM
Installation
```bash
# Clone the benchmark repo
git clone https://github.com/n3wth/agent-memory-benchmark
cd agent-memory-benchmark

# Install dependencies
pip install -r requirements.txt

# Set up environment
export OPENAI_API_KEY=your_key_here
export REDIS_URL=redis://localhost:6379
```
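Before a long run it can be worth confirming that Redis is reachable. A quick smoke test, assuming the `redis` Python client is available (install it separately if requirements.txt does not pull it in):

```python
import os
import redis  # pip install redis, if not already present

client = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))
client.ping()  # raises an exception if Redis is unreachable
print("Redis connection OK")
```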
Running Benchmarks
```bash
# Run all scenarios
python run_benchmark.py --all

# Run a specific scenario
python run_benchmark.py --scenario software-dev

# Run with a custom memory system
python run_benchmark.py --memory-system custom --config config.yaml

# Generate a report
python generate_report.py --output results/
```
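The `--memory-system custom` option implies a plugin that provides store and retrieval operations; the exact interface is defined in the benchmark repo. As a rough sketch of the shape such an adapter usually takes (class, method, and field names below are hypothetical):

```python
# Hypothetical adapter shape for a custom memory system; check the repo's
# plugin interface for the actual class and method names.
from dataclasses import dataclass


@dataclass
class Memory:
    id: str
    text: str
    turn: int


class CustomMemorySystem:
    def __init__(self, config: dict):
        self.config = config
        self._memories: list[Memory] = []

    def store(self, memory: Memory) -> None:
        """Persist a memory extracted from the conversation."""
        self._memories.append(memory)

    def retrieve(self, query: str, k: int = 5) -> list[Memory]:
        """Return the top-k memories relevant to the query.

        Naive keyword-overlap ranking here; swap in embeddings, BM25, or
        whatever your system actually uses.
        """
        words = set(query.lower().split())
        scored = [(len(words & set(m.text.lower().split())), m) for m in self._memories]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for score, m in scored[:k] if score > 0]
```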
Expected Runtime
- Each scenario: 30-60 minutes
- All scenarios: 3-4 hours
- Parallel execution: 1-2 hours (with 5 workers)
Submitting Results
Want to benchmark your own memory system? Submit results as follows:
1. Run the benchmarks with your system
2. Generate a report: `python generate_report.py`
3. Open a PR against the benchmark repo
4. Include: system name, version, config, and results JSON
We'll review your submission and add it to the public leaderboard.
Roadmap
Q1 2026:
- Add multimodal scenarios (images, code, documents)
- Expand to 500+ turn conversations
- Add cost metrics (tokens, compute, storage)
Q2 2026:
- Add adversarial scenarios (misleading info, contradictions)
- Benchmark memory compression strategies
- Add real-time streaming scenarios
Q3 2026:
- Add multi-agent scenarios (shared memory)
- Benchmark privacy-preserving memory systems
- Add domain-specific scenarios (medical, legal, finance)
Citation
If you use this benchmark in your research, please cite:
```bibtex
@misc{newth2025agentmemory,
  title={Agent Memory Benchmark: Measuring Memory Retention in Long-Running AI Agents},
  author={Newth, Oliver},
  year={2025},
  url={https://benchmarks.newth.ai/agent-memory}
}
```
Contact
Questions or suggestions? Reach out: