ML Platform: 6 Hours to 80 Minutes Deployment, 99.99% Uptime
Accelerating ML deployment for 30+ product teams across Google
Google
2025-Present
13 min read
Deployment Time
6 hours → 80 minutes
78% reduction
Platform Uptime
99.5% → 99.99%
50x less downtime
Deployments per Week
50 → 200
4x increase
Mean Time to Recovery
30 min → 5 min
83% reduction
Support Tickets
200/month → 60/month
70% reduction
AI Feature Launches
3/year → 12/year
4x increase
Objective
Reduce deployment time from 6 hours to under 90 minutes, achieve 99.99% platform uptime, enable 30+ product teams to self-serve deployments, and integrate DeepMind models into production
The Challenge
In 2023, deploying an ML model at Google took 6 hours on average. Product teams waited days for approvals, ran manual tests, and dealt with cryptic errors. The ML platform team (my team) was a bottleneck for 30+ product teams trying to ship AI features.
Meanwhile, platform uptime was 99.5% - good, but not good enough when you're serving billions of users. Every outage blocked dozens of teams.
The Objective
Transform ML deployment to:
Reduce deployment time from 6 hours to <90 minutes
Achieve 99.99% platform uptime (50x less downtime than 99.5%)
Enable 30+ product teams to self-serve
Integrate DeepMind models into production surfaces
Timeline: 18 months
Constraints
Technical:
Legacy deployment system built in 2018
50+ different model formats and frameworks
Strict security and privacy requirements
Must support TensorFlow, PyTorch, JAX, and custom models
Organizational:
30+ product teams with different needs
Competing priorities with other infrastructure teams
Limited headcount (team of 12 engineers)
Can't break existing deployments
Operational:
Serving billions of users across 100+ countries
99.99% uptime SLA (4.38 minutes downtime/month; see the quick check after this list)
Must support gradual rollout and instant rollback
24/7 on-call rotation
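The 4.38-minute figure falls straight out of the SLA; here is a quick check, with the only assumption being an average month of 365.25 / 12 ≈ 30.44 days:

```python
# Error budget implied by a 99.99% uptime SLA.
# Assumes an average month of 365.25 / 12 ≈ 30.44 days.
minutes_per_month = (365.25 / 12) * 24 * 60        # ≈ 43,830 minutes
downtime_budget = minutes_per_month * (1 - 0.9999)
print(f"{downtime_budget:.2f} minutes of downtime per month")  # → 4.38
```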
Key Decisions
Decision 1: Automate the Critical Path
Context: Manual steps (approvals, testing, config validation) took 4 of the 6 hours.
Decision: Build an automated pipeline (sketched below):
Auto-validation of model configs
Automated testing (unit, integration, load)
Auto-approval for low-risk changes
One-click deployment for approved models
Rationale:
Humans are slow and make mistakes
Automation is consistent and scalable
Can always add human review for high-risk changes
Faster feedback loop improves quality
Result: Deployment time dropped from 6 hours to 2 hours (67% reduction)
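To make the shape of this concrete, here is a minimal sketch of the validation and risk-gating step. All names and thresholds (ModelConfig, needs_human_review, the 5% traffic cutoff) are illustrative assumptions, not the actual internal pipeline:

```python
from dataclasses import dataclass

SUPPORTED_FRAMEWORKS = {"tensorflow", "pytorch", "jax", "custom"}

@dataclass
class ModelConfig:
    name: str
    framework: str            # one of SUPPORTED_FRAMEWORKS
    traffic_fraction: float   # share of traffic at full rollout
    touches_user_data: bool   # privacy-sensitive changes always get human review

def validate(config: ModelConfig) -> list[str]:
    """Return validation errors; an empty list means the config can proceed."""
    errors = []
    if config.framework not in SUPPORTED_FRAMEWORKS:
        errors.append(f"unsupported framework: {config.framework!r}")
    if not 0.0 < config.traffic_fraction <= 1.0:
        errors.append("traffic_fraction must be in (0, 1]")
    return errors

def needs_human_review(config: ModelConfig) -> bool:
    """Auto-approve only low-risk changes; everything else goes to a reviewer."""
    return config.touches_user_data or config.traffic_fraction > 0.05

if __name__ == "__main__":
    cfg = ModelConfig("ranker-v42", "jax", traffic_fraction=0.01, touches_user_data=False)
    assert validate(cfg) == []
    print("auto-approved" if not needs_human_review(cfg) else "needs human review")
```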
Decision 2: Standardize on Container-Based Deployment
Context: 50+ model formats meant 50+ deployment paths. Each path had unique bugs.
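To show the shape of the standardization, here is a hedged sketch of a single serving contract that every model container could implement, regardless of framework. The class and method names (ModelServer, load, predict) are illustrative, not the actual serving API:

```python
from abc import ABC, abstractmethod
from typing import Any

class ModelServer(ABC):
    """The one contract every model container implements, whatever the framework."""

    @abstractmethod
    def load(self, model_dir: str) -> None:
        """Load weights from the standard model directory baked into the image."""

    @abstractmethod
    def predict(self, request: dict[str, Any]) -> dict[str, Any]:
        """Serve one inference request as framework-neutral dicts."""

    def healthy(self) -> bool:
        """Readiness probe used by the rollout controller."""
        return True

class TorchScriptServer(ModelServer):
    """Example adapter: a TorchScript model behind the common contract."""

    def load(self, model_dir: str) -> None:
        import torch                                   # lazy import keeps other adapters light
        self.model = torch.jit.load(f"{model_dir}/model.pt")  # illustrative file layout
        self.model.eval()

    def predict(self, request: dict[str, Any]) -> dict[str, Any]:
        import torch
        inputs = torch.tensor(request["inputs"])
        with torch.no_grad():
            return {"outputs": self.model(inputs).tolist()}
```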
Blast radius (users affected): 100M → 10M (90% reduction)
Team Productivity
Self-service adoption: 0% → 100% (30+ teams)
Support tickets: 200/month → 60/month (70% reduction)
Platform team time: 50% on ops → 90% on building
Product team velocity: +40% (faster iteration)
Business Impact
DeepMind integration: Enabled production deployment
AI feature launches: 3/year → 12/year (4x increase)
User-facing AI features: 10 → 40
Revenue impact: Enabled $100M+ in AI-driven features
Key Tradeoffs
Tradeoff 1: Automation vs. Human Review
Chose: Automate low-risk, human review for high-risk
Gained: 67% faster deployments, consistent quality
Lost: Some edge cases slip through automation
Would I do it again? Yes. Automation scales, humans don't.
Tradeoff 2: Standardization vs. Flexibility
Chose: Standardize on containers, framework-agnostic
Gained: One deployment path, easier to maintain
Lost: Some teams wanted custom deployment flows
Would I do it again? Yes. Standardization is worth the constraint.
Tradeoff 3: Self-Service vs. Control
Chose: Self-service with safety rails
Gained: Product teams unblocked, platform team freed up
Lost: Some control over deployment quality
Would I do it again? Yes. Safety rails prevent most issues.
Tradeoff 4: Chaos Engineering vs. Stability
Chose: Weekly chaos experiments in production
Gained: 50x less downtime, better incident response
Lost: Some controlled downtime during experiments
Would I do it again? Yes. Controlled chaos beats uncontrolled outages.
Lessons Learned
1. Automate the Critical Path First
We automated the slowest, most error-prone steps first (validation, testing, approvals). This gave us the biggest wins early.
2. Standardization Scales, Customization Doesn't
Supporting 50+ deployment paths was unsustainable. Standardizing on containers let us focus on making one path excellent.
3. Self-Service Requires Safety Rails
We couldn't just give teams access and hope for the best. Automatic validation, testing, and rollback made self-service safe.
4. Chaos Engineering Finds Bugs Before Users Do
Running chaos experiments weekly found dozens of bugs we never would have caught in testing. Controlled chaos beats uncontrolled outages.
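For flavor, here is a minimal sketch of what a weekly experiment of this kind can look like: kill a small slice of serving replicas and check that the error budget holds. The control-plane and monitoring calls are stand-in stubs, not real platform tooling:

```python
import random

def kill_replica(replica: str) -> None:
    """Stub for the control-plane call that terminates one serving replica."""
    print(f"killing {replica}")

def observed_error_rate() -> float:
    """Stub for a monitoring query over the last few minutes of traffic."""
    return 0.0002

def weekly_chaos_experiment(replicas: list[str], kill_fraction: float = 0.1,
                            error_budget: float = 0.001) -> bool:
    """Kill a random slice of replicas and check the service stays within budget."""
    victims = random.sample(replicas, max(1, int(len(replicas) * kill_fraction)))
    for replica in victims:
        kill_replica(replica)
    # In the real experiment you would wait for load balancing and
    # autoscaling to react before reading the metric.
    passed = observed_error_rate() <= error_budget
    print("experiment passed" if passed else "SLO breached: file a bug, page on-call")
    return passed

if __name__ == "__main__":
    weekly_chaos_experiment([f"replica-{i}" for i in range(20)])
```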
5. Gradual Rollout is Non-Negotiable
Automatic gradual rollout with auto-rollback prevented every major outage over 2 years. Never deploy to 100% at once.
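To make the mechanism concrete, here is a minimal sketch of staged rollout with automatic rollback. The stage fractions, the SLO threshold, and the traffic-splitting and metric calls are illustrative stubs, not the actual deployment controller:

```python
import time

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 0.50, 1.00]   # fraction of traffic per stage

def set_traffic(version: str, fraction: float) -> None:
    """Stub for the traffic-splitting call in the serving layer."""
    print(f"routing {fraction:.0%} of traffic to {version}")

def error_rate(version: str) -> float:
    """Stub for a per-version monitoring query."""
    return 0.0002

def rollout(new_version: str, old_version: str,
            slo: float = 0.001, soak_seconds: float = 1.0) -> bool:
    """Advance stage by stage; roll back to the old version the moment the SLO is breached."""
    for fraction in ROLLOUT_STAGES:
        set_traffic(new_version, fraction)
        time.sleep(soak_seconds)                  # soak at this stage before advancing
        if error_rate(new_version) > slo:
            set_traffic(old_version, 1.0)         # instant rollback
            return False
    return True

if __name__ == "__main__":
    print("rollout succeeded" if rollout("ranker-v42", "ranker-v41") else "rolled back")
```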
6. Documentation is a Product Feature
We spent 20% of our time on docs and training. This enabled self-service and reduced support load by 70%.
7. Metrics Drive Behavior
We made deployment time and uptime visible to all teams. This created healthy competition and drove continuous improvement.
What I'd Do Differently
1. Build Self-Service UI Sooner
We built this in month 13. Should have been in month 6. Would have unblocked teams faster.
2. Invest More in Observability
Our monitoring was good but not great. Should have built better dashboards and alerting from day one.
3. Run Chaos Engineering from Day One
We started chaos experiments in month 12. Should have been in month 1. Would have found bugs earlier.
4. Build Better Rollback Testing
We tested rollback manually. Should have automated rollback testing and run it weekly.
Frameworks Used
This case study demonstrates several frameworks in action: