Open Benchmarks

Reproducible benchmarks for AI systems. Public datasets, clear methodology, and open results. Built to advance the field, not just promote products.

1 benchmark

5 datasets available

Quarterly updates

Featured Benchmarks

v1.0

Agent Memory Benchmark

How well do AI agents remember across long conversations?

A reproducible benchmark for measuring memory retention, retrieval accuracy, and context management in long-running AI agents. Based on the r3 memory system.

Methodology

We test agents across 5 realistic scenarios (software dev, customer support, data analysis, project management, research) with 100-250 turns over 4-10 hours. At regular intervals, we measure memory retention, retrieval accuracy, context utilization, task drift, and latency. All datasets and ground truth are publicly available for reproducibility.
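As an illustration of how retention measured at regular intervals could be reduced to a single half-life number, here is a minimal sketch. It assumes retention decays roughly exponentially with turn count; that decay model, the function name `retention_half_life`, and the sample numbers are all assumptions for illustration, not the benchmark's published definition.

```python
import math
from typing import Sequence


def retention_half_life(turns: Sequence[float], retention: Sequence[float]) -> float:
    """Estimate the turn count at which retention falls to 50%.

    ASSUMPTION: retention(t) ~= exp(-lam * t), with retention 1.0 at turn 0.
    lam is fitted by least squares on -log(retention), forced through the origin.
    """
    if len(turns) != len(retention) or len(turns) == 0:
        raise ValueError("turns and retention must be non-empty and equal length")

    num = 0.0  # sum of t * (-log r)
    den = 0.0  # sum of t^2
    for t, r in zip(turns, retention):
        if not 0.0 < r <= 1.0:
            raise ValueError("retention values must be in (0, 1]")
        num += t * -math.log(r)
        den += t * t

    if den == 0.0:
        raise ValueError("at least one checkpoint must have a non-zero turn count")

    lam = num / den
    if lam <= 0.0:
        return math.inf  # no measurable decay
    return math.log(2) / lam


# Hypothetical checkpoints: fraction of ground-truth facts still recalled.
checkpoints = [50, 100, 150, 200]
recalled = [0.5 ** (t / 100) for t in checkpoints]
half_life = retention_half_life(checkpoints, recalled)
```

A single half-life is convenient for comparing agents, but it hides the shape of the curve, so reporting the raw per-checkpoint retention alongside it is a reasonable choice.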

Key Metrics

Memory Retention Half-Life, Retrieval Precision@5, Retrieval Recall@5, and 3 more
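For readers unfamiliar with the two retrieval metrics, the sketch below shows the conventional definitions of Precision@k and Recall@k with k = 5. The function names, the memory IDs, and the assumption that retrieved IDs are unique are illustrative; the benchmark's own evaluation code may differ in details.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved items that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of all relevant items that appear in the top-k retrieved items."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention chosen here: nothing to recall scores 0
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / len(relevant)


# Hypothetical example: ranked memory IDs returned by an agent vs. ground truth.
retrieved_ids = ["m1", "m7", "m3", "m9", "m2", "m4"]
ground_truth = {"m1", "m2", "m3", "m4"}

p5 = precision_at_k(retrieved_ids, ground_truth)  # 3 of the top 5 are relevant
r5 = recall_at_k(retrieved_ids, ground_truth)     # 3 of the 4 relevant items were found
```

Precision@5 penalizes irrelevant memories in the top five, while Recall@5 penalizes relevant memories that were missed, so the two are best read together.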

5 datasets available

Updated 10/28/2025

Why Open Benchmarks?

Most AI benchmarks are marketing tools. They're designed to make specific products look good, not to advance the field. The methodology is hidden, the datasets are proprietary, and the results can't be reproduced.

I'm building open benchmarks because I believe the field needs better standards. These benchmarks are:

✓ Reproducible: Full methodology, public datasets, and open-source evaluation code
✓ Practical: Based on real-world scenarios, not academic toy problems
✓ Evolving: Quarterly updates as the field advances
✓ Community-driven: Submit your results, suggest improvements, and help set standards

If you're building AI systems, these benchmarks give you a way to measure progress objectively. If you're evaluating vendors, they give you a common standard to compare against.

Want to contribute or use these benchmarks?

All benchmarks are open source. Submit results, suggest improvements, or use them to evaluate your systems.

