The AI Product Trilemma: Cost, Latency, and Quality
Every AI product manager faces the same impossible choice: you can have fast, cheap, or good—pick two. This isn't a temporary limitation of current technology. It's a fundamental tradeoff that defines AI product strategy.
After shipping AI products at Google, Meta, and in robotics, I've learned that the teams who succeed are those who understand this trilemma deeply and make deliberate choices about which tradeoffs to accept. This article provides a framework for product leaders to navigate these tradeoffs strategically.
The Trilemma Explained
Cost: How much does each inference cost?
- Model size (larger models cost more to run)
- Compute requirements (GPUs, TPUs, CPUs)
- API costs (if using third-party models)
- Infrastructure overhead (serving, monitoring, storage)
Typical range: $0.0001 to $0.10 per inference
Latency: How fast does the model respond?
- Inference time (how long to generate a response)
- Network latency (time to reach the model)
- Queueing delays (waiting for available compute)
- Post-processing time (formatting, safety checks)
Typical range: 50ms to 10 seconds
Quality: How good are the model's outputs?
- Accuracy (% of correct predictions)
- Relevance (how well outputs match user intent)
- Coherence (how well-structured are outputs)
- Safety (absence of harmful, biased, or hallucinated content)
Typical range: 70% to 95% task success rate
Why You Can't Have All Three
The trilemma exists because of fundamental constraints:
Constraint 1: Bigger Models Are Better But Slower and More Expensive
The Physics:
- GPT-4 (rumored to be ~1.7T parameters): High quality, high cost ($0.03/1K tokens), slow (2-5 seconds)
- GPT-3.5 (175B parameters): Medium quality, medium cost ($0.002/1K tokens), fast (500ms)
- Distilled models (7B parameters): Lower quality, low cost ($0.0001/1K tokens), very fast (50ms)
The Tradeoff: You can improve quality by using bigger models, but you'll pay more and wait longer.
Constraint 2: Faster Inference Requires More Compute or Simpler Models
The Physics:
- Parallel processing (multiple GPUs) reduces latency but increases cost
- Smaller models are faster but lower quality
- Caching helps but only for repeated queries
The Tradeoff: You can reduce latency by throwing more compute at the problem (expensive) or using simpler models (lower quality).
Constraint 3: Higher Quality Requires More Compute or Slower Inference
The Physics:
- Larger context windows improve quality but increase latency and cost
- Ensemble methods (multiple models voting) improve quality but multiply cost and latency
- Chain-of-thought reasoning improves quality but adds latency
The Tradeoff: You can improve quality by using more sophisticated techniques, but you'll pay more and wait longer.
The Three Strategies: Which Two Do You Choose?
Strategy 1: Fast + Cheap (Sacrifice Quality)
When to use:
- High-volume, low-value use cases (e.g., spam detection, simple classification)
- Use cases where "good enough" is acceptable (e.g., autocomplete suggestions)
- Cost-sensitive applications (e.g., consumer apps with thin margins)
Implementation:
- Use small, distilled models (7B-13B parameters)
- Aggressive caching of common queries
- Simple prompts, no chain-of-thought
- Minimal post-processing
Real-World Example: Instagram Content Moderation
At Meta, we used small, fast models for initial content screening:
- Cost: $0.0001 per image
- Latency: 50ms P95
- Quality: 85% accuracy (good enough for first pass)
We accepted lower quality because:
1. High volume (billions of images per day)
2. False positives reviewed by humans
3. Speed critical (users expect instant uploads)
Result: Saved $50M annually vs. using GPT-4 for all moderation.
Strategy 2: Fast + Good (Sacrifice Cost)
When to use:
- User-facing features where latency kills engagement (e.g., search, recommendations)
- High-value use cases where quality matters (e.g., customer support, medical diagnosis)
- Competitive markets where speed is a differentiator
Implementation:
- Use large models with massive parallel compute
- Edge deployment to reduce network latency
- Speculative execution (start inference before user finishes typing)
- Expensive but fast infrastructure (A100 GPUs, custom ASICs)
Real-World Example: Google Search
Google uses massive compute to deliver high-quality results in <200ms:
- Cost: $0.01-$0.05 per search (estimated)
- Latency: P95 = 150ms
- Quality: 95%+ relevance
They accept high cost because:
1. Search is core to the business model (ads pay for it)
2. Latency directly impacts engagement (a 100ms delay ≈ 1% revenue loss)
3. Quality is a competitive moat
Result: Maintains search dominance despite high infrastructure costs.
Strategy 3: Cheap + Good (Sacrifice Latency)
When to use:
- Batch processing where latency doesn't matter (e.g., overnight data analysis)
- Asynchronous workflows (e.g., email summarization, report generation)
- Use cases where users expect to wait (e.g., complex analysis, creative generation)
Implementation:
- Use large models with limited compute (queue requests)
- Batch processing to amortize overhead
- Asynchronous APIs (return results later)
- Optimize for throughput, not latency
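The core move in this strategy is trading per-request latency for throughput. Below is a minimal batching sketch in Python; the batch size, the stand-in model function, and the timings are placeholders for illustration, not measurements from any product mentioned here.

```python
import time
from typing import Callable, List

def run_in_batches(
    requests: List[str],
    model_fn: Callable[[List[str]], List[str]],  # one batched model call
    batch_size: int = 32,
) -> List[str]:
    """Process requests in large batches to maximize throughput.

    Each item waits for its batch, so per-request latency gets worse, but
    fixed overhead (model load, network round trip, kernel launch) is paid
    once per batch, so cost per request drops.
    """
    results: List[str] = []
    for start in range(0, len(requests), batch_size):
        batch = requests[start:start + batch_size]
        results.extend(model_fn(batch))  # one forward pass / API call per batch
    return results

if __name__ == "__main__":
    # Stand-in for a real batched inference call (hypothetical).
    def fake_summarizer(batch: List[str]) -> List[str]:
        time.sleep(0.5)  # fixed overhead, paid once per batch rather than per item
        return [f"summary of: {text}" for text in batch]

    docs = [f"document {i}" for i in range(100)]
    started = time.time()
    outputs = run_in_batches(docs, fake_summarizer, batch_size=32)
    print(f"{len(outputs)} results in {time.time() - started:.1f}s")
```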
Real-World Example: GitHub Copilot
GitHub Copilot uses large models but accepts higher latency:
- Cost: $0.001 per completion (estimated)
- Latency: 1-3 seconds
- Quality: 90%+ acceptance rate
They accept higher latency because:
1. Developers expect to wait for AI suggestions
2. Quality is critical (bad suggestions waste time)
3. Cost must be low to support the $10/month subscription
Result: Profitable at scale with high user satisfaction.
Advanced Strategies: Dynamic Tradeoffs
The best AI products don't pick one strategy—they dynamically adjust based on context.
Strategy 4: Adaptive Quality (Cascade Models)
Concept: Start with fast, cheap models. Escalate to slow, expensive models only when needed.
Implementation:
1. Tier 1: Small model (7B params) handles 80% of queries
2. Tier 2: Medium model (70B params) handles 15% of queries (when Tier 1 confidence < 0.8)
3. Tier 3: Large model (GPT-4) handles 5% of queries (when Tier 2 confidence < 0.8)
A minimal routing sketch appears after the example below.
Real-World Example: Customer Support Chatbot
- Tier 1: Handles simple FAQs (cost: $0.0001, latency: 100ms, quality: 85%)
- Tier 2: Handles complex questions (cost: $0.001, latency: 500ms, quality: 92%)
- Tier 3: Handles edge cases (cost: $0.01, latency: 2s, quality: 97%)
Result: blended across the tier mix above, cost comes out to roughly $0.0007 per query, latency to roughly 250ms, and quality to roughly 87%.
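Here is a minimal sketch of the routing logic, assuming each tier exposes a call that returns an answer plus a confidence score. The Tier class, the threshold, and the per-tier costs are illustrative, not a production design.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Tier:
    name: str
    run: Callable[[str], Tuple[str, float]]  # returns (answer, confidence)
    cost_per_query: float                    # dollars

def cascade(query: str, tiers: List[Tier], threshold: float = 0.8) -> Tuple[str, float]:
    """Try the cheapest tier first; escalate only while confidence is below threshold.

    Note that an escalated query pays for every tier it touched, which is why
    the blended cost is slightly higher than a simple weighted average.
    """
    answer, total_cost = "", 0.0
    for tier in tiers:
        answer, confidence = tier.run(query)
        total_cost += tier.cost_per_query
        if confidence >= threshold:
            break  # good enough; stop escalating
    return answer, total_cost

# Example with stub models standing in for the 7B / 70B / GPT-4-class tiers.
tiers = [
    Tier("small",  lambda q: ("canned FAQ answer", 0.95), 0.0001),
    Tier("medium", lambda q: ("reasoned answer", 0.90),   0.001),
    Tier("large",  lambda q: ("expert answer", 0.99),     0.01),
]
print(cascade("How do I reset my password?", tiers))  # handled by the small tier
```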
Strategy 5: Adaptive Latency (Speculative Execution)
Concept: Start with fast model, run slow model in parallel, use slow model only if fast model fails.
Implementation:
1. Fast path: Small model returns result in 100ms
2. Slow path: Large model runs in parallel, returns result in 2s
3. Decision: If fast model confidence > 0.9, use it. Otherwise, wait for slow model.
A minimal sketch of this pattern appears after the example below.
Real-World Example: Search Autocomplete
- Fast path: Simple prefix matching (latency: 50ms, quality: 80%)
- Slow path: LLM-based suggestions (latency: 500ms, quality: 95%)
- Decision: Use fast path for common queries, slow path for rare queries
Result: P50 latency 50ms, P95 latency 500ms, average quality 92%.
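A minimal sketch of the decision logic using a thread pool, assuming each model call returns an answer plus a confidence score; the threshold and model functions are placeholders.

```python
import concurrent.futures as cf
from typing import Callable, Tuple

ModelFn = Callable[[str], Tuple[str, float]]  # returns (answer, confidence)

_pool = cf.ThreadPoolExecutor(max_workers=8)  # shared pool, not torn down per request

def speculative_answer(query: str, fast: ModelFn, slow: ModelFn,
                       confidence_threshold: float = 0.9) -> str:
    """Launch both models at once; keep the fast answer only if it is
    confident enough, otherwise wait for the slow one."""
    fast_future = _pool.submit(fast, query)
    slow_future = _pool.submit(slow, query)  # speculative: its result may be discarded

    fast_answer, fast_confidence = fast_future.result()
    if fast_confidence >= confidence_threshold:
        slow_future.cancel()  # best effort; if already running it finishes in the background
        return fast_answer
    slow_answer, _ = slow_future.result()
    return slow_answer
```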
Strategy 6: Adaptive Cost (Caching + Precomputation)
Concept: Precompute expensive queries, cache results, serve from cache when possible.
Implementation:
1. Precompute: Run expensive model on common queries overnight
2. Cache: Store results in fast key-value store (Redis, Memcached)
3. Serve: Check cache first, fall back to live inference if miss
A minimal cache-then-fallback sketch appears after the example below.
Real-World Example: Product Recommendations
- Precompute: Generate recommendations for top 10M users overnight (cost: $10K, latency: 8 hours)
- Cache: Store in Redis (cost: $100/month, latency: 5ms)
- Serve: 95% cache hit rate, 5% live inference
Result: Average cost $0.0001, average latency 10ms, quality 95%.
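A minimal read-through cache sketch, assuming a local Redis instance and a placeholder generate_recommendations function standing in for the expensive model; the key scheme and TTL are illustrative.

```python
import json
import redis  # pip install redis; assumes a Redis instance on localhost

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 24 * 3600  # refreshed by the nightly precompute job

def generate_recommendations(user_id: str) -> list:
    """Placeholder for the expensive model call. The nightly batch job calls
    this for the top users and writes results into the same cache keys."""
    return [f"item-{i}-for-{user_id}" for i in range(10)]

def get_recommendations(user_id: str) -> list:
    key = f"recs:{user_id}"
    cached = r.get(key)  # fast path: ~5ms cache hit
    if cached is not None:
        return json.loads(cached)
    recs = generate_recommendations(user_id)           # slow path: live inference on a miss
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(recs))  # write back for next time
    return recs
```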
Decision Framework: How to Choose Your Strategy
Step 1: Define Your Constraints
Questions to ask:
1. What's your budget per inference? (e.g., $0.001, $0.01, $0.10)
2. What's your latency requirement? (e.g., <100ms, <1s, <10s)
3. What's your quality bar? (e.g., 80%, 90%, 95% task success rate)
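It helps to capture these answers as explicit, machine-checkable targets that every model experiment is evaluated against. A minimal sketch, with illustrative numbers for a hypothetical user-facing chat feature:

```python
from dataclasses import dataclass

@dataclass
class ProductConstraints:
    max_cost_per_inference: float  # dollars
    max_p95_latency_ms: float      # milliseconds
    min_task_success_rate: float   # fraction, 0-1

def meets_constraints(cost: float, p95_ms: float, success_rate: float,
                      c: ProductConstraints) -> bool:
    """A model candidate passes only if all three dimensions hit their targets."""
    return (cost <= c.max_cost_per_inference
            and p95_ms <= c.max_p95_latency_ms
            and success_rate >= c.min_task_success_rate)

# Illustrative targets for a user-facing chat feature.
chat_constraints = ProductConstraints(
    max_cost_per_inference=0.01,
    max_p95_latency_ms=1000,
    min_task_success_rate=0.90,
)
print(meets_constraints(0.004, 850, 0.93, chat_constraints))  # True
```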
Step 2: Identify Your Use Case Type
High-volume, low-value: Fast + Cheap (Strategy 1)
- Examples: Spam detection, simple classification, autocomplete
User-facing, high-value: Fast + Good (Strategy 2)
- Examples: Search, recommendations, real-time chat
Batch processing, high-value: Cheap + Good (Strategy 3)
- Examples: Data analysis, report generation, creative work
Mixed workload: Adaptive (Strategies 4-6)
- Examples: Customer support, content moderation, personalization
Step 3: Measure Your Baseline
Metrics to track:
1. Cost: Average cost per inference, total monthly cost
2. Latency: P50, P95, P99 latency
3. Quality: Task success rate, user satisfaction, error rate
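For latency in particular, make sure the dashboard reports percentiles rather than just the mean. A minimal sketch using only the standard library; the sample data is made up.

```python
import statistics
from typing import Dict, List

def latency_summary(latencies_ms: List[float]) -> Dict[str, float]:
    """Averages hide the long tail, so report P50/P95/P99 alongside the mean."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "mean_ms": statistics.fmean(latencies_ms),
    }

# Made-up sample: 900 fast requests plus a slow 10% tail.
sample = [120.0] * 900 + [2500.0] * 100
print(latency_summary(sample))  # mean ~358ms looks fine; P95/P99 expose the 2.5s tail
```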
Step 4: Experiment with Tradeoffs
Process:
1. Baseline: Measure current cost, latency, quality
2. Experiment: Try different models, architectures, strategies
3. Measure: Track impact on cost, latency, quality
4. Decide: Choose the strategy that best fits your constraints
Step 5: Optimize Continuously
Tactics:
- Model distillation: Train smaller models to mimic larger models (improve cost + latency)
- Quantization: Reduce model precision (improve cost + latency, slight quality loss)
- Pruning: Remove unnecessary model weights (improve cost + latency, slight quality loss)
- Caching: Store common queries (improve cost + latency, no quality loss)
- Batching: Process multiple requests together (improve cost, slight latency increase)
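Each of these tactics is a deep topic on its own. As one concrete illustration, here is a minimal sketch of post-training dynamic quantization in PyTorch, with a toy Linear-heavy model standing in for a real network; the layer sizes are arbitrary and the quality impact is task-dependent.

```python
import torch
import torch.nn as nn

# Toy stand-in for a Linear-heavy model (e.g. a transformer's feed-forward blocks).
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).eval()

# Dynamic quantization stores Linear weights in int8 and quantizes activations
# at runtime: smaller memory footprint and faster CPU inference, with a small
# (task-dependent) quality hit.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
with torch.no_grad():
    print(model(x).shape, quantized(x).shape)  # same interface, cheaper inference
```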
Common Mistakes and How to Avoid Them
Mistake 1: Optimizing for One Dimension Without Considering Others
Symptom: You improve quality but cost explodes or latency becomes unacceptable.
Example: We switched from GPT-3.5 to GPT-4 to improve quality. Quality went from 85% to 92%, but cost increased 15x and latency doubled. Users churned due to slow responses.
Fix: Always measure all three dimensions. Set acceptable ranges for each (e.g., cost < $0.01, latency < 1s, quality > 90%).
Mistake 2: Not Segmenting by Use Case
Symptom: You use the same model for all use cases, even though they have different requirements.
Example: We used GPT-4 for both simple FAQs and complex analysis. FAQs didn't need GPT-4's quality, but we paid for it anyway.
Fix: Segment use cases by requirements. Use fast, cheap models for simple cases and slower, more expensive models for complex cases.
Mistake 3: Ignoring the Long Tail
Symptom: Your average metrics look good, but P95/P99 latency or error rate is unacceptable.
Example: Our P50 latency was 200ms, but P95 was 5 seconds. 5% of users had a terrible experience and churned.
Fix: Track and optimize for P95/P99, not just averages. Set SLAs for worst-case performance.
Mistake 4: Not Accounting for Scale
Symptom: Your solution works at small scale but breaks at large scale.
Example: Our model cost $0.01 per inference. At 1M requests/day, that's $10K/day or $3.6M/year. We didn't budget for it.
Fix: Model costs at target scale before committing. Multiply cost per inference by expected daily volume.
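The back-of-the-envelope math is trivial, but it is worth making explicit and rerunning whenever per-inference cost or volume assumptions change:

```python
def annual_inference_cost(cost_per_inference: float, requests_per_day: int) -> float:
    """Project inference spend at target scale before committing to a model."""
    return cost_per_inference * requests_per_day * 365

# The example above: $0.01 per inference at 1M requests/day.
print(f"${annual_inference_cost(0.01, 1_000_000):,.0f} per year")  # $3,650,000 per year
```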
Mistake 5: Premature Optimization
Symptom: You optimize for cost/latency before proving product-market fit.
Example: We spent 3 months optimizing latency from 1s to 200ms. But users didn't care—they valued quality more.
Fix: Start with quality. Optimize for cost/latency only after proving users value the feature.
Real-World Case Studies
Case Study 1: Instagram Calling Quality Prediction
Goal: Predict call quality before the call starts to improve user experience.
Constraints:
- Cost: <$0.001 per prediction (billions of calls per year)
- Latency: <100ms (users expect instant call setup)
- Quality: >85% accuracy (false positives hurt engagement)
Strategy: Fast + Cheap (sacrifice quality)
Implementation:
- Small logistic regression model (1M parameters)
- Features: network metrics, device type, location
- Deployed on edge servers (reduce network latency)
Results:
- Cost: $0.0001 per prediction ✅
- Latency: P95 = 50ms ✅
- Quality: 85% accuracy ✅
Tradeoff: Accepted 85% accuracy (vs. 92% with GPT-4) to hit cost and latency targets.
Case Study 2: Covariant Warehouse Robotics Vision
Goal: Detect objects in cluttered warehouse bins for robotic picking.
Constraints:
- Cost: <$0.01 per pick (thin margins in logistics)
- Latency: <200ms (robots can't wait)
- Quality: >85% accuracy (errors are expensive)
Strategy: Fast + Good (sacrifice cost)
Implementation:
- Large vision model (500M parameters)
- Custom ASICs for fast inference
- Edge deployment on robots
Results:
- Cost: $0.002 per pick ✅
- Latency: P95 = 120ms ✅
- Quality: 85% accuracy ✅
Tradeoff: Accepted higher cost (custom hardware) to hit latency and quality targets.
Case Study 3: GitHub Copilot Code Suggestions
Goal: Generate high-quality code suggestions for developers.
Constraints:
- Cost: <$0.01 per suggestion (support $10/month subscription)
- Latency: <3 seconds (developers will wait for good suggestions)
- Quality: >90% acceptance rate (bad suggestions waste time)
Strategy: Cheap + Good (sacrifice latency)
Implementation:
- Large language model (GPT-4 class)
- Asynchronous API (return suggestions after 1-3 seconds)
- Caching for common patterns
Results:
- Cost: $0.001 per suggestion ✅
- Latency: P95 = 2 seconds ✅
- Quality: 90%+ acceptance rate ✅
Tradeoff: Accepted higher latency (developers expect to wait) to hit cost and quality targets.
Conclusion: Making Strategic Tradeoffs
The AI Product Trilemma is not a problem to solve—it's a reality to navigate. The teams that succeed are those who:
1. Understand the tradeoffs. You can't have fast, cheap, and good. Pick two.
2. Choose deliberately. Match your strategy to your use case and constraints.
3. Measure continuously. Track cost, latency, and quality for every model change.
4. Optimize dynamically. Use adaptive strategies (cascade, speculative execution, caching) to get the best of all three.
5. Segment by use case. Different use cases have different requirements. Don't use one-size-fits-all.
The framework I've outlined provides a roadmap for product leaders to navigate these tradeoffs strategically. The key insight: there's no universal "best" strategy—only the strategy that best fits your constraints and use case.
Start by defining your constraints (budget, latency requirement, quality bar). Identify your use case type (high-volume/low-value, user-facing/high-value, batch processing). Choose your strategy (fast + cheap, fast + good, cheap + good, or adaptive). Measure your baseline. Experiment with tradeoffs. Optimize continuously.
The teams that master this framework will ship AI products that are fast enough, cheap enough, and good enough to win in their market. The teams that ignore it will waste time and money chasing impossible goals.
What are your constraints? Which two dimensions matter most for your use case? Use this framework to make informed tradeoffs and ship AI products that actually work.