Memory Budgeting for Long-Horizon Tasks

The hardest problem in building AI agents isn't the model—it's memory. When agents run for hours or days, context windows fill up. Most teams handle this poorly and wonder why their agents drift off task.

The Memory Budget Problem

A typical agent has:

System prompt: 500-1000 tokens (fixed)
Task context: 1000-2000 tokens (grows slowly)
Conversation history: Unbounded (grows fast)
Retrieved context: 2000-4000 tokens per retrieval

With an 8K context window, you run out of space in ~20 turns. With a 128K window (like GPT-4) you have more room but retrieval quality degrades.

The Framework: Three-Tier Memory

Tier 1: Working Memory (Hot)

What the agent needs right now. Keep this under 4K tokens.

Current task and immediate context
Last 5-10 conversation turns
Active tool outputs

Eviction policy: FIFO after 10 turns, keep task context pinned

Tier 2: Session Memory (Warm)

What the agent might need soon. Store in fast retrieval (Redis, Pinecone, or Weaviate).

Compressed conversation history
Completed subtasks and outcomes
User preferences and corrections

Eviction policy: LRU after 1 hour, compress before eviction

Tier 3: Long-Term Memory (Cold)

What the agent rarely needs but shouldn't forget. Store in durable storage.

Full conversation logs
Historical task outcomes
Learned patterns and failures

Eviction policy: Never delete, but compress aggressively

Implementation with r3

This is why I built r3. It implements three-tier memory with:

Sub-100ms retrieval from warm tier
Automatic compression and eviction
Semantic search across all tiers
MCP protocol integration

Learn more in the Agent Memory Benchmark.

Memory Budget Allocation

For an 8K context window:

System prompt: 1K (12%)
Task context: 1K (12%)
Working memory: 4K (50%)
Retrieved context: 2K (25%)

For a 128K context window:

System prompt: 1K (1%)
Task context: 2K (2%)
Working memory: 8K (6%)
Retrieved context: 16K (12%)
Buffer for retrieval: 100K (78%)

Real-World Example

At Google, we built ML deployment agents that ran for 6-8 hours:

Working memory: current deployment step + last 3 logs
Session memory: all previous steps + error patterns
Long-term memory: historical deployments + success patterns

Result: Agents stayed on task for full deployment cycles, with <2% drift rate.

Measuring Success

Track these metrics:

Memory Retention Half-life: How long before critical info is forgotten
Retrieval Precision@5: % of retrieved memories that are relevant
Context Utilization: % of context window actually used
Task Drift Rate: % of tasks where agent loses focus

Target: Half-life > 4 hours, P@5 > 90%, Utilization 60-80%, Drift < 5%

Memory Budgeting for Long-Horizon Tasks

The Memory Budget Problem

A typical agent has:

System prompt: 500-1000 tokens (fixed)
Task context: 1000-2000 tokens (grows slowly)
Conversation history: Unbounded (grows fast)
Retrieved context: 2000-4000 tokens per retrieval

With an 8K context window, you run out of space in ~20 turns. With a 128K window (like GPT-4) you have more room but retrieval quality degrades.

The Framework: Three-Tier Memory

Tier 1: Working Memory (Hot)

What the agent needs right now. Keep this under 4K tokens.

Current task and immediate context
Last 5-10 conversation turns
Active tool outputs

Eviction policy: FIFO after 10 turns, keep task context pinned

Tier 2: Session Memory (Warm)

What the agent might need soon. Store in fast retrieval (Redis, Pinecone, or Weaviate).

Compressed conversation history
Completed subtasks and outcomes
User preferences and corrections

Eviction policy: LRU after 1 hour, compress before eviction

Tier 3: Long-Term Memory (Cold)

What the agent rarely needs but shouldn't forget. Store in durable storage.

Full conversation logs
Historical task outcomes
Learned patterns and failures

Eviction policy: Never delete, but compress aggressively

Implementation with r3

This is why I built r3. It implements three-tier memory with:

Sub-100ms retrieval from warm tier
Automatic compression and eviction
Semantic search across all tiers
MCP protocol integration

Learn more in the Agent Memory Benchmark.

Memory Budget Allocation

For an 8K context window:

System prompt: 1K (12%)
Task context: 1K (12%)
Working memory: 4K (50%)
Retrieved context: 2K (25%)

For a 128K context window:

System prompt: 1K (1%)
Task context: 2K (2%)
Working memory: 8K (6%)
Retrieved context: 16K (12%)
Buffer for retrieval: 100K (78%)

Real-World Example

At Google, we built ML deployment agents that ran for 6-8 hours:

Working memory: current deployment step + last 3 logs
Session memory: all previous steps + error patterns
Long-term memory: historical deployments + success patterns

Result: Agents stayed on task for full deployment cycles, with <2% drift rate.

Measuring Success

Track these metrics:

Memory Retention Half-life: How long before critical info is forgotten
Retrieval Precision@5: % of retrieved memories that are relevant
Context Utilization: % of context window actually used
Task Drift Rate: % of tasks where agent loses focus

Target: Half-life > 4 hours, P@5 > 90%, Utilization 60-80%, Drift < 5%

Loading...

Memory Budgeting for Long-Horizon Tasks

Memory Budgeting for Long-Horizon Tasks

The Memory Budget Problem

The Framework: Three-Tier Memory

Tier 1: Working Memory (Hot)

Tier 2: Session Memory (Warm)

Tier 3: Long-Term Memory (Cold)

Implementation with r3

Memory Budget Allocation

Real-World Example

Measuring Success

When to Use

Common Failure Modes

Instrumentation Checklist

Related Frameworks

Agent Reliability Patterns

RAG Observability Triad

Want help implementing this framework?

Memory Budgeting for Long-Horizon Tasks

Memory Budgeting for Long-Horizon Tasks

The Memory Budget Problem

The Framework: Three-Tier Memory

Tier 1: Working Memory (Hot)

Tier 2: Session Memory (Warm)

Tier 3: Long-Term Memory (Cold)

Implementation with r3

Memory Budget Allocation

Real-World Example

Measuring Success

When to Use

Common Failure Modes

Instrumentation Checklist

Related Frameworks

Agent Reliability Patterns

RAG Observability Triad

Want help implementing this framework?