Warehouse Robotics: 72% to 85% Accuracy, -20% Safety Incidents
Building vision AI reliable enough to drive Amazon acquisition
Covariant (acquired by Amazon)
2022-2024
14 min read
Object Detection Accuracy: 72% → 85% (+13 points)
Safety Incidents: -20% across deployments
Edge Case Accuracy: 45% → 78% (+33 points)
Pick Rate: +8% objects per hour
Inference Latency: 120ms → 95ms (21% faster)
Enterprise Pipeline: $20M → $100M+ (to acquisition)
Objective
Achieve 85%+ accuracy in cluttered warehouse environments, reduce safety incidents by 20%, build $100M+ enterprise pipeline, and position company for strategic acquisition
The Challenge
In 2022, Covariant's warehouse robots could pick objects with 72% accuracy in controlled environments. But real warehouses are chaotic: mixed SKUs, damaged packaging, poor lighting, objects piled on top of each other. Our robots failed too often, and when they failed, they sometimes caused safety incidents (damaged products, jammed conveyors, near-misses with humans).
Amazon was watching. They wanted to acquire us, but only if we could prove reliability at scale.
The Objective
Improve vision AI to:
Achieve 85%+ accuracy in cluttered, real-world warehouse environments
Reduce safety incidents by 20% across all deployments
Maintain or improve pick rate (objects per hour)
Build enterprise sales pipeline to $100M+ valuation
Constraints
Technical:
Edge deployment (no cloud connectivity in warehouses)
Limited compute (NVIDIA Jetson, not datacenter GPUs)
Must work in poor lighting, dust, and vibration
Operational:
50+ customer deployments already running
Can't break existing robots with updates
Limited access to customer sites for testing
Safety certification requirements for each customer
Business:
Burning $2M/month, needed acquisition or Series C
Customers threatening to churn due to reliability issues
Amazon acquisition talks stalled on reliability concerns
18-month runway to prove value
Key Decisions
Decision 1: Focus on Edge Cases, Not Average Case
Context: Our models worked well on clean, well-lit objects but failed on edge cases (damaged boxes, transparent packaging, reflective surfaces).
Decision: Build a systematic edge case collection and training pipeline:
Operators flag failures in real-time
Failures automatically uploaded with full context
Weekly model retraining on edge cases
A/B test new models on 10% of robots
Rationale:
Average case was already good enough
Edge cases caused most failures and safety incidents
Customers judged us on worst case, not average
Systematic collection beats random data
Result: Accuracy improved from 72% to 81% in 6 months just from edge case focus
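A minimal sketch of what such a flag-and-retrain loop can look like. The field names, paths, and threshold below are hypothetical, not Covariant's internal schema; the point is that every operator flag becomes a structured training example and retraining only fires once enough hard examples accumulate.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
import json

# Hypothetical spool directory for flagged failures.
EDGE_CASE_DIR = Path("/data/edge_cases")

@dataclass
class FlaggedFailure:
    robot_id: str
    site_id: str
    image_path: str          # frame captured at pick time
    failure_type: str        # e.g. "missed_pick", "double_pick", "damage"
    operator_note: str = ""
    flagged_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def record_failure(failure: FlaggedFailure) -> None:
    """Append a flagged failure to the weekly retraining manifest."""
    EDGE_CASE_DIR.mkdir(parents=True, exist_ok=True)
    manifest = EDGE_CASE_DIR / "manifest.jsonl"
    with manifest.open("a") as f:
        f.write(json.dumps(failure.__dict__) + "\n")

def build_weekly_batch(min_examples: int = 500) -> list[dict]:
    """Collect the week's flagged failures for labeling and retraining."""
    manifest = EDGE_CASE_DIR / "manifest.jsonl"
    if not manifest.exists():
        return []
    examples = [json.loads(line) for line in manifest.read_text().splitlines()]
    # Only kick off retraining once enough hard examples have accumulated.
    return examples if len(examples) >= min_examples else []
```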
Decision 2: Multi-Model Consensus for High-Risk Picks
Context: Single-model failures caused safety incidents (e.g., a robot tries to pick something too heavy and damages the conveyor).
Decision: Run 3 models in parallel for high-risk scenarios:
Fast model (50ms): initial detection
Accurate model (100ms): verification
Safety model (20ms): risk assessment
If models disagree or safety model flags risk, skip the pick.
Rationale:
Safety incidents are expensive (downtime, damage, liability)
Multi-model consensus catches errors single models miss
The 120ms latency budget still holds: run sequentially the three models would take 170ms (50 + 100 + 20), but run in parallel the wall-clock cost is ~100ms, set by the slowest model
Better to skip a pick than cause an incident
Result: Safety incidents reduced by 35% with only 3% reduction in pick rate
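A rough sketch of the consensus logic, assuming each model is a callable that takes an image and returns a detection with a label, a confidence, and (for the safety model) a risk score. The interface and threshold are illustrative; the key points are parallel execution (latency bounded by the slowest model) and skipping the pick on any disagreement or risk flag.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional

@dataclass
class Detection:
    label: str
    confidence: float
    risk_score: float = 0.0  # only populated by the safety model

def consensus_pick(image, fast_model, accurate_model, safety_model,
                   risk_threshold: float = 0.3) -> Optional[Detection]:
    """Run all three models in parallel; wall-clock latency is bounded by
    the slowest model (~100ms), not the 170ms sum."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        fast_f = pool.submit(fast_model, image)
        accurate_f = pool.submit(accurate_model, image)
        safety_f = pool.submit(safety_model, image)
        fast = fast_f.result()
        accurate = accurate_f.result()
        safety = safety_f.result()

    # Skip the pick if the detectors disagree or the safety model flags risk.
    if fast.label != accurate.label:
        return None
    if safety.risk_score > risk_threshold:
        return None
    return accurate  # use the more accurate model's output for the pick
```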
Decision 3: Synthetic Data for Rare Scenarios
Context: Some failure modes were rare but critical (e.g., picking near humans, transparent objects, reflective surfaces). Not enough real data to train on.
Decision: Build synthetic data pipeline using tools like NVIDIA Omniverse and Unity:
3D models of warehouse environments
Physics simulation for object interactions
Lighting and texture variation
Inject synthetic data into training (20% of dataset)
Rationale:
Can't wait for rare events to happen naturally
Synthetic data lets us test dangerous scenarios safely
20% synthetic + 80% real gave best results
Faster iteration than waiting for real data
Result: Accuracy on rare scenarios improved from 45% to 78%
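One simple way to enforce the 20/80 mix is to sample each training epoch with a fixed synthetic fraction, oversampling rendered scenes of rare scenarios relative to how seldom they occur live. This sketch assumes pre-rendered synthetic examples and hypothetical parameters; it is not the production data loader.

```python
import random

def build_training_epoch(real_examples: list, synthetic_examples: list,
                         synthetic_fraction: float = 0.20,
                         epoch_size: int = 100_000, seed: int = 0) -> list:
    """Sample a training epoch that is ~20% synthetic, 80% real.
    Synthetic scenes (e.g. transparent or reflective objects rendered
    in simulation) are oversampled relative to their real-world rate."""
    rng = random.Random(seed)
    n_synth = int(epoch_size * synthetic_fraction)
    n_real = epoch_size - n_synth
    epoch = (rng.choices(synthetic_examples, k=n_synth)
             + rng.choices(real_examples, k=n_real))
    rng.shuffle(epoch)
    return epoch
```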
Decision 4: Gradual Model Rollout with Auto-Rollback
Context: A bad model pushed to 50+ customer sites would have been catastrophic. We needed a safe deployment process.
Decision: Implement crawl-walk-run for model updates:
Week 1: Shadow mode (run new model, don't use predictions)
Week 2: 10% of robots at 3 pilot sites
Week 3: 50% of robots at pilot sites
Week 4: 100% of pilot sites
Week 5+: Gradual rollout to all sites
Auto-rollback if:
Accuracy drops >2%
Safety incidents increase
Pick rate drops >5%
Rationale:
Bad models hurt customer trust and safety
Gradual rollout catches issues early
Auto-rollback prevents prolonged failures
Customers appreciate caution over speed
Result: Zero major incidents from model updates over 18 months
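The rollback criteria translate naturally into a small policy check run against pilot-site telemetry. A sketch with hypothetical metric names, using the thresholds and rollout stages listed above:

```python
from dataclasses import dataclass

@dataclass
class RolloutMetrics:
    accuracy: float          # fraction of successful picks
    picks_per_hour: float
    safety_incidents: int

# Rollback thresholds from the policy above.
ACCURACY_DROP_LIMIT = 0.02   # accuracy drops >2%
PICK_RATE_DROP_LIMIT = 0.05  # pick rate drops >5%

def should_rollback(baseline: RolloutMetrics, candidate: RolloutMetrics) -> bool:
    """Return True if the candidate model should be rolled back."""
    if candidate.accuracy < baseline.accuracy - ACCURACY_DROP_LIMIT:
        return True
    if candidate.safety_incidents > baseline.safety_incidents:
        return True
    if candidate.picks_per_hour < baseline.picks_per_hour * (1 - PICK_RATE_DROP_LIMIT):
        return True
    return False

# Rollout stages: (week, fraction of robots, scope)
ROLLOUT_PLAN = [
    (1, 0.00, "shadow mode, pilot sites"),   # predictions logged, not used
    (2, 0.10, "pilot sites"),
    (3, 0.50, "pilot sites"),
    (4, 1.00, "pilot sites"),
    (5, 1.00, "gradual rollout to all sites"),
]
```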
Decision 5: Operator Feedback Loop
Context: Operators knew when robots made mistakes, but had no way to tell us.
Decision: Build operator tablet app:
One-tap to flag failures
Optional: add photo and description
Failures automatically create training data
Operators see their impact (accuracy improvements)
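On the robot side, the one-tap flag can be as small as a JSON record spooled locally and synced when the site uplink allows, which matters given limited warehouse connectivity. A hypothetical sketch; the paths and fields are illustrative, and the robot is assumed to attach pick context (frames, grasp pose, model version) downstream.

```python
import json
import time
import uuid
from pathlib import Path
from typing import Optional

# Hypothetical local spool; flags are uploaded when connectivity allows.
FLAG_SPOOL = Path("/var/spool/operator_flags")

def flag_failure(robot_id: str, operator_id: str,
                 photo_path: Optional[str] = None, note: str = "") -> str:
    """One-tap failure flag from the operator tablet.
    Photo and note are optional; pick context is attached server-side."""
    flag = {
        "flag_id": str(uuid.uuid4()),
        "robot_id": robot_id,
        "operator_id": operator_id,
        "photo_path": photo_path,
        "note": note,
        "timestamp": time.time(),
    }
    FLAG_SPOOL.mkdir(parents=True, exist_ok=True)
    out = FLAG_SPOOL / f"{flag['flag_id']}.json"
    out.write_text(json.dumps(flag))
    return flag["flag_id"]
```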
Results
Accuracy Metrics
Object detection accuracy: 72% → 85%
Edge case accuracy: 45% → 78%
Multi-model consensus accuracy: 89% (vs 72% single model)
Safety Metrics
Safety incidents: -20% across all deployments
Near-miss events: -35%
Product damage rate: -28%
Conveyor jams: -40%
Performance Metrics
Pick rate: +8% (faster inference)
Inference latency: 120ms → 95ms
Model update frequency: Monthly → Weekly
Uptime: 98.5% → 99.2%
Business Impact
Customer churn: 15% → 3%
Enterprise pipeline: $20M → $100M+
Amazon acquisition: $100M+ valuation
Deployment sites: 50 → 75
Key Tradeoffs
Tradeoff 1: Multi-Model Consensus vs. Pick Rate
Chose: Multi-model consensus with 3% pick rate reduction
Gained: 35% fewer safety incidents, customer trust
Lost: 3% pick rate (recovered with latency optimization)
Would I do it again? Yes. Safety incidents are more expensive than lost picks.
Tradeoff 2: Synthetic Data vs. Real Data
Chose: 20% synthetic, 80% real
Gained: Faster iteration on rare scenarios
Lost: Some model overfitting to synthetic patterns
Would I do it again? Yes, but would tune synthetic data quality more carefully.
Tradeoff 3: Gradual Rollout vs. Fast Deployment
Chose: 5-week rollout per model update
Gained: Zero major incidents, customer trust
Lost: Slower improvement velocity
Would I do it again? Yes. Lost trust is harder to rebuild than lost velocity is to recover.
Tradeoff 4: Edge Case Focus vs. Average Case
Chose: Focus 80% of effort on edge cases
Gained: 13 percentage point accuracy improvement
Lost: Diminishing returns on average case
Would I do it again? Yes. Customers judge you on worst case.
Lessons Learned
1. Edge Cases Matter More Than Average Case
We spent 80% of our effort on 20% of scenarios (edge cases). This is where we won. Customers don't care if you're 95% accurate on easy objects if you fail on the hard ones.
2. Multi-Model Consensus is Worth the Latency
Running 3 models in parallel added 30ms latency but reduced safety incidents by 35%. The tradeoff was obvious in hindsight, but controversial at the time.
3. Operator Feedback is Gold
Operators flagged failures we never would have found in testing. Building the feedback loop was the highest-ROI feature we shipped.
4. Synthetic Data Accelerates Rare Scenarios
We couldn't wait for rare events to happen naturally. Synthetic data let us test dangerous scenarios safely and iterate 10x faster.
5. Gradual Rollout Saves Customers (and Your Reputation)
We caught 8 major issues in pilot deployments that would have been catastrophic at scale. Never skip the crawl phase, even when customers are impatient.
6. Safety is a Feature, Not a Constraint
We initially saw safety as a constraint that slowed us down. Reframing it as a feature (multi-model consensus, operator controls) made it a competitive advantage.