ML Platform: 6 Hours to 80 Minutes Deployment, 99.99% Uptime
The Challenge
In 2023, deploying an ML model at Google took 6 hours on average. Product teams waited days for approvals, ran manual tests, and dealt with cryptic errors. The ML platform team (my team) was a bottleneck for 30+ product teams trying to ship AI features.
Meanwhile, platform uptime was 99.5%: good, but not good enough when you're serving billions of users. Every outage blocked dozens of teams.
The Objective
Transform ML deployment to:
- Reduce deployment time from 6 hours to <90 minutes
- Achieve 99.99% platform uptime (a ~50x reduction in downtime)
- Enable 30+ product teams to self-serve
- Integrate DeepMind models into production surfaces
Timeline: 18 months
Constraints
Technical:
- Legacy deployment system built in 2018
- 50+ different model formats and frameworks
- Strict security and privacy requirements
- Must support TensorFlow, PyTorch, JAX, and custom models
Organizational:
- 30+ product teams with different needs
- Competing priorities with other infrastructure teams
- Limited headcount (team of 12 engineers)
- Can't break existing deployments
Operational:
- Serving billions of users across 100+ countries
- 99.99% uptime SLA (4.38 minutes downtime/month)
- Must support gradual rollout and instant rollback
- 24/7 on-call rotation
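For context, the 4.38-minute figure is simply the monthly error budget implied by the SLA. A quick back-of-the-envelope check, assuming an average month of ~730 hours:

```python
# Monthly downtime budget implied by an availability SLO,
# using an average month of 365.25 / 12 days (~730.5 hours).
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60   # ~43,830 minutes

def downtime_budget_minutes(slo: float) -> float:
    """Allowed downtime per month for a given availability target."""
    return MINUTES_PER_MONTH * (1 - slo)

print(f"99.5%  -> {downtime_budget_minutes(0.995):.1f} min/month")   # ~219 min
print(f"99.99% -> {downtime_budget_minutes(0.9999):.2f} min/month")  # ~4.38 min
```

Moving from 99.5% to 99.99% shrinks the budget roughly 50x, from about 219 minutes to about 4.4 minutes per month.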
Key Decisions
Decision 1: Automate the Critical Path
Context: Manual steps (approvals, testing, config validation) took 4 of the 6 hours.
Decision: Build automated pipeline:
- Auto-validation of model configs
- Automated testing (unit, integration, load)
- Auto-approval for low-risk changes
- One-click deployment for approved models
Rationale:
- Humans are slow and make mistakes
- Automation is consistent and scalable
- Can always add human review for high-risk changes
- Faster feedback loop improves quality
Result: Deployment time dropped from 6 hours to 2 hours (67% reduction)
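The original pipeline code isn't reproduced here; as a rough illustration, the auto-approval step can be thought of as a risk gate along these lines (all field names and thresholds are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ModelChange:
    """Metadata the pipeline extracts from a deployment request (illustrative)."""
    traffic_share: float      # fraction of user traffic the model serves
    config_diff_lines: int    # size of the config change
    validation_passed: bool   # schema / compatibility checks
    tests_passed: bool        # unit, integration, and load tests

def risk_score(change: ModelChange) -> int:
    """Crude additive risk score; a real system would weight many more signals."""
    score = 0
    if change.traffic_share > 0.10:
        score += 2            # high-traffic surfaces are riskier
    if change.config_diff_lines > 50:
        score += 1            # large config changes are riskier
    if not (change.validation_passed and change.tests_passed):
        score += 10           # failed checks always block auto-approval
    return score

def route(change: ModelChange) -> str:
    """Auto-approve low-risk changes; send everything else to human review."""
    return "auto-approve" if risk_score(change) <= 1 else "human-review"

print(route(ModelChange(0.02, 10, True, True)))   # auto-approve
print(route(ModelChange(0.50, 120, True, True)))  # human-review
```

The key property is that failed validation or tests can never be auto-approved; change size and traffic share only decide whether a passing change also needs a human.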
Decision 2: Standardize on Container-Based Deployment
Context: 50+ model formats meant 50+ deployment paths. Each path had unique bugs.
Decision: Standardize on containers:
- All models packaged as Docker containers
- Standard interface (REST API + gRPC)
- Automatic health checks and monitoring
- Framework-agnostic (TensorFlow, PyTorch, JAX all work)
Rationale:
- One deployment path = one set of bugs to fix
- Containers are portable and reproducible
- Industry standard (easy to hire for)
- Enables gradual rollout and rollback
Result: Deployment time dropped from 2 hours to 80 minutes (33% reduction)
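A minimal sketch of what the standard serving contract might look like, using FastAPI purely for illustration (the platform also exposed gRPC; the endpoint names and schemas here are assumptions):

```python
# Every model container exposes the same /predict and /healthz endpoints,
# regardless of whether the model inside is TensorFlow, PyTorch, or JAX.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    inputs: list[list[float]]    # batched feature vectors

class PredictResponse(BaseModel):
    outputs: list[list[float]]

def load_model():
    """Placeholder: each team plugs in its own framework-specific loader."""
    return lambda batch: [[sum(row)] for row in batch]  # dummy model

model = load_model()

@app.get("/healthz")
def healthz():
    # Probed by the platform for automatic health checks.
    return {"status": "ok"}

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest):
    return PredictResponse(outputs=model(req.inputs))
```

Because every container speaks the same contract, the rollout, monitoring, and rollback machinery never needs to know which framework is inside.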
Decision 3: Build Self-Service Tooling
Context: Product teams waited for ML platform team to deploy models. We were the bottleneck.
Decision: Build self-service deployment UI:
- Drag-and-drop model upload
- Automatic config generation
- Built-in testing and validation
- One-click rollout with safety checks
Rationale:
- Product teams know their models best
- Self-service scales better than tickets
- Reduces our operational load
- Faster iteration for product teams
Result: 30+ teams self-serving, platform team freed up for infrastructure work
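To make "automatic config generation" concrete, here is a hypothetical sketch of the idea: the team supplies a few fields and the platform derives a validated spec with safety rails on by default (all keys, defaults, and heuristics below are invented for illustration):

```python
def generate_deploy_config(model_name: str, framework: str, qps_estimate: int) -> dict:
    """Derive a deployment spec from a few user-supplied fields.

    The self-service principle: teams provide the minimum (artifact,
    framework, expected traffic) and the platform fills in validated defaults.
    """
    replicas = max(2, qps_estimate // 500)          # assumed capacity heuristic
    return {
        "model": model_name,
        "runtime": f"model-server-{framework}",     # standard container image per framework
        "replicas": replicas,
        "resources": {"cpu": "4", "memory": "16Gi"},
        "rollout": {                                # safety rails are on by default
            "stages": [0.01, 0.10, 0.50, 1.00],
            "auto_rollback_threshold": 0.02,
        },
        "health_check": {"path": "/healthz", "interval_s": 10},
    }

config = generate_deploy_config("ranker-v3", "jax", qps_estimate=2_000)
print(config["replicas"], config["rollout"]["stages"])
```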
Decision 4: Implement Chaos Engineering
Context: We didn't know what would break until it broke in production.
Decision: Run chaos experiments weekly:
- Randomly kill pods in production
- Inject latency and errors
- Simulate datacenter failures
- Test rollback procedures
Rationale:
- Better to find bugs in controlled chaos than real outages
- Builds confidence in system resilience
- Forces us to fix weak points
- Trains on-call team for real incidents
Result: Uptime improved from 99.5% to 99.99% (downtime cut roughly 50x)
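A minimal pod-killing experiment in this spirit, built on the official kubernetes Python client (namespace, label selector, and kill fraction are hypothetical; a real setup would also enforce blast-radius limits and run only inside agreed experiment windows):

```python
import random
from kubernetes import client, config

def kill_random_pods(namespace: str, label_selector: str, fraction: float = 0.1) -> None:
    """Delete a random fraction of matching pods to test self-healing."""
    config.load_kube_config()           # or load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector).items
    if not pods:
        return
    victims = random.sample(pods, max(1, int(len(pods) * fraction)))
    for pod in victims:
        print(f"chaos: deleting {pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, namespace)

if __name__ == "__main__":
    kill_random_pods("model-serving", "app=model-server", fraction=0.1)
```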
Decision 5: Gradual Rollout with Automatic Rollback
Context: Bad deployments caused outages. Manual rollback took 15-30 minutes.
Decision: Implement automatic gradual rollout:
- Deploy to 1% → 10% → 50% → 100% over 2 hours
- Monitor error rate, latency, and quality metrics
- Auto-rollback if metrics degrade >2%
- Manual override for emergencies
Rationale:
- Catches bad deployments before they affect everyone
- Automatic rollback is faster than manual (2 min vs 15 min)
- Reduces blast radius of failures
- Builds confidence in deployment process
Result: Zero major outages from bad deployments in 2 years
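A sketch of the rollout controller's core loop, assuming the stages and 2% threshold described above; the metric names, soak time, and callback interfaces are illustrative:

```python
import time

STAGES = [0.01, 0.10, 0.50, 1.00]     # traffic fractions from the rollout policy
DEGRADATION_LIMIT = 0.02              # auto-rollback if metrics degrade >2%

def degraded(baseline: dict, candidate: dict) -> bool:
    """True if any watched metric is more than 2% worse than baseline."""
    worse_error = candidate["error_rate"] > baseline["error_rate"] * (1 + DEGRADATION_LIMIT)
    worse_latency = candidate["p99_latency_ms"] > baseline["p99_latency_ms"] * (1 + DEGRADATION_LIMIT)
    return worse_error or worse_latency

def rollout(set_traffic, read_metrics, rollback, soak_seconds=1800):
    """Ramp traffic through the stages, watching metrics at each step."""
    baseline = read_metrics("stable")
    for fraction in STAGES:
        set_traffic(fraction)
        time.sleep(soak_seconds)
        if degraded(baseline, read_metrics("candidate")):
            rollback()                # automatic, minutes instead of a manual 15-30
            return "rolled-back"
    return "fully-rolled-out"
```

Four 30-minute soaks give the roughly two-hour ramp described above.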
The Execution
Phase 1: Automation (Months 1-6)
- Built automated validation and testing pipeline
- Implemented auto-approval for low-risk changes
- Created one-click deployment workflow
- Migrated 10 pilot teams to new system
Key metrics:
- Deployment time: 6h → 2h
- Manual steps: 12 → 3
- Deployment success rate: 85% → 95%
Phase 2: Standardization (Months 7-12)
- Standardized on container-based deployment
- Built framework-agnostic interfaces
- Implemented gradual rollout and auto-rollback
- Migrated 20 more teams to new system
Key metrics:
- Deployment time: 2h → 80min
- Deployment paths: 50+ → 1
- Rollback time: 15min → 2min
Phase 3: Self-Service (Months 13-18)
- Built self-service deployment UI
- Created documentation and training
- Implemented chaos engineering
- Migrated remaining teams to new system
Key metrics:
- Self-service adoption: 0% → 100%
- Platform uptime: 99.5% → 99.99%
- Support tickets: -70%
The Results
Deployment Metrics
- Deployment time: 6 hours → 80 minutes (78% reduction)
- Deployment success rate: 85% → 98%
- Deployments per week: 50 → 200 (4x increase)
- Rollback time: 15 minutes → 2 minutes (87% reduction)
Reliability Metrics
- Platform uptime: 99.5% → 99.99% (downtime cut roughly 50x)
- Mean time to recovery (MTTR): 30 min → 5 min
- Incidents per month: 8 → 1
- Blast radius (users affected): 100M → 10M (90% reduction)
Team Productivity
- Self-service adoption: 0% → 100% (30+ teams)
- Support tickets: 200/month → 60/month (70% reduction)
- Platform team capacity: 50% ops → 90% building
- Product team velocity: +40% (faster iteration)
Business Impact
- DeepMind integration: Enabled production deployment
- AI feature launches: 3/year → 12/year (4x increase)
- User-facing AI features: 10 → 40
- Revenue impact: Enabled $100M+ in AI-driven features
Key Tradeoffs
Tradeoff 1: Automation vs. Human Review
Chose: Automate low-risk, human review for high-risk
Gained: 67% faster deployments, consistent quality
Lost: Some edge cases slip through automation
Would I do it again? Yes. Automation scales, humans don't.
Tradeoff 2: Standardization vs. Flexibility
Chose: Standardize on containers, framework-agnostic
Gained: One deployment path, easier to maintain
Lost: Some teams wanted custom deployment flows
Would I do it again? Yes. Standardization is worth the constraint.
Tradeoff 3: Self-Service vs. Control
Chose: Self-service with safety rails
Gained: Product teams unblocked, platform team freed up
Lost: Some control over deployment quality
Would I do it again? Yes. Safety rails prevent most issues.
Tradeoff 4: Chaos Engineering vs. Stability
Chose: Weekly chaos experiments in production
Gained: uptime improved from 99.5% to 99.99%, better incident response
Lost: Some controlled downtime during experiments
Would I do it again? Yes. Controlled chaos beats uncontrolled outages.
Lessons Learned
1. Automate the Critical Path First
We automated the slowest, most error-prone steps first (validation, testing, approvals). This gave us the biggest wins early.
2. Standardization Scales, Customization Doesn't
Supporting 50+ deployment paths was unsustainable. Standardizing on containers let us focus on making one path excellent.
3. Self-Service Requires Safety Rails
We couldn't just give teams access and hope for the best. Automatic validation, testing, and rollback made self-service safe.
4. Chaos Engineering Finds Bugs Before Users Do
Running chaos experiments weekly found dozens of bugs we never would have caught in testing. Controlled chaos beats uncontrolled outages.
5. Gradual Rollout is Non-Negotiable
Automatic gradual rollout with auto-rollback prevented every major outage over 2 years. Never deploy to 100% at once.
6. Documentation is a Product Feature
We spent 20% of our time on docs and training. This enabled self-service and reduced support load by 70%.
7. Metrics Drive Behavior
We made deployment time and uptime visible to all teams. This created healthy competition and drove continuous improvement.
What I'd Do Differently
1. Build Self-Service UI Sooner
We built this in month 13. Should have been in month 6. Would have unblocked teams faster.
2. Invest More in Observability
Our monitoring was good but not great. Should have built better dashboards and alerting from day one.
3. Run Chaos Engineering from Day One
We started chaos experiments in month 12. Should have been in month 1. Would have found bugs earlier.
4. Build Better Rollback Testing
We tested rollback manually. Should have automated rollback testing and run it weekly.
Frameworks Used
This case study demonstrates several frameworks in action:
- Crawl-Walk-Run Ladder: Pilot teams → 20 teams → all teams
- Latency-Learning Flywheel: Faster deployments → more iterations → better models
- Agent Reliability Patterns: Graceful degradation, automatic rollback
- Safety SLO Ladder: Bronze → Silver → Gold deployment safety
Takeaways for Your Product
If you're building ML infrastructure:
- Automate the critical path first (validation, testing, approvals)
- Standardize on containers for portability and consistency
- Build self-service with safety rails (validation, rollback)
- Run chaos engineering weekly to find bugs before users do
- Implement gradual rollout with automatic rollback
- Invest in documentation and training (20% of effort)
- Make metrics visible to drive continuous improvement
If you're scaling infrastructure:
- Automation scales, humans don't
- Standardization is worth the constraint
- Self-service requires safety rails
- Controlled chaos beats uncontrolled outages
- Gradual rollout is non-negotiable
- Documentation is a product feature
- Metrics drive behavior