Scaling real-time calling from zero to 750M daily users in 6 months
Meta
2017-2022
25 min read
DAU Adoption
0% → 75%
in 6 months
Daily Calls
0 → 600M
at month 6
Call Completion Rate
N/A → 85%
vs 70% industry avg
Harmful Content
0.16% → 0.12%
25% reduction
Platform Uptime
99.9% → 99.92%
maintained SLA
User Engagement
baseline → +15%
time in app
Objective
Launch native voice and video calling to reach 50% DAU adoption within 6 months while maintaining 99.9% platform uptime and reducing harmful content by 50% versus industry baseline
Instagram Calling: 0 to 75% DAU in 6 Months
The Challenge
In 2018, Instagram had crossed 1 billion monthly active users, but we had a critical gap: no native calling feature. Our data showed that 40% of Instagram users were switching to WhatsApp or Messenger multiple times per day to make voice or video calls with the same people they were messaging on Instagram. This context-switching was creating friction in the user experience and fragmenting conversations across the Meta family of apps.
The competitive landscape was intensifying. Snapchat had launched voice and video calling in 2016 and was seeing strong engagement, particularly among younger users. TikTok was emerging as a threat, and while they didn't have calling yet, we knew it was only a matter of time. We needed to move fast to keep Instagram competitive as a complete communication platform.
But adding real-time calling to Instagram wasn't straightforward. The app had been built from the ground up as a visual, asynchronous platform. The infrastructure, product philosophy, and user expectations were all optimized for photos, videos, and text messages—not real-time voice and video. We needed to add calling without:
Breaking the core Instagram experience that users loved
Compromising trust & safety standards (a major concern given Facebook's reputation challenges at the time)
Overwhelming infrastructure that was already running at massive scale
Alienating creators who valued async communication and were worried about harassment
Creating privacy concerns in an era of heightened scrutiny around Meta's data practices
The Objective
Launch native voice and video calling that:
Reaches 50% DAU adoption within 6 months
Maintains Instagram's 99.9% uptime SLA
Reduces harmful content in calls by 50% vs. industry baseline
Integrates seamlessly with existing messaging
Constraints
Technical:
Instagram's infrastructure wasn't built for real-time communication. Our backend was optimized for async message delivery with eventual consistency, not the sub-150ms latency requirements of voice/video calls
WebRTC at scale was unproven on mobile. While Google and Mozilla had proven it worked in browsers, mobile implementations were fragile, battery-intensive, and had poor codec support on older Android devices
Latency requirements: <150ms end-to-end for acceptable call quality, <100ms for great quality. Our existing infrastructure had P95 latency of 300-500ms for message delivery
Had to work on 2G networks in emerging markets (India, Indonesia, Brazil represented 35% of our user base). 2G networks have 200-400ms baseline latency and 20-50 kbps bandwidth
Existing messaging infrastructure handled 100M messages/second but had no concept of "sessions" or "real-time state"
Mobile app size constraints: couldn't add more than 5MB to the app binary (we were already at 95MB and users complained about app size)
Organizational:
Team of 60+ engineers across 4 time zones (Menlo Park, New York, London, Tel Aviv) with no single owner
Competing priorities with Stories (our fastest-growing feature) and Reels (our TikTok competitor, top company priority)
Trust & Safety team was 8 people covering all of Instagram, already overwhelmed with Stories moderation
6-month hard deadline for F8 announcement (Zuckerberg had already committed publicly)
No dedicated infrastructure team—had to borrow capacity from Messenger and WhatsApp teams who had their own roadmaps
Product design team was 3 people, split across 10+ projects
User:
Instagram users valued visual, async communication. Our research showed 78% of users preferred "responding when convenient" over real-time interaction
Calling could feel intrusive or "too personal" for a platform built around curated, public content
Creators (10M+ accounts with >10K followers) worried about harassment and unwanted calls from fans. 45% of creators reported receiving unwanted DMs daily
Privacy concerns around call metadata (who called whom, when, for how long) in the wake of Cambridge Analytica
User expectations set by FaceTime and WhatsApp—anything worse would be seen as a regression
Key Decisions
Decision 1: Audio-First, Video-Optional
Context: Video calls are higher quality but harder to scale and more intrusive. The team was split: engineers from the Messenger team advocated for video-first (following FaceTime's model), while the infrastructure team warned about bandwidth costs at Instagram's scale.
The Numbers:
Video calls require 500-2000 kbps bandwidth vs. 50-100 kbps for audio
At 1B users with 10% daily calling adoption, video-first would cost $120M/year in bandwidth vs. $12M for audio-first
Video encoding/decoding drains battery 3-5x faster than audio
Video calls have 2.5x higher failure rate on poor networks
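The bandwidth comparison above can be reproduced with a back-of-envelope model. The average call length and the per-GB price below are illustrative assumptions (the case study does not state them); only the user counts and bitrate ranges come from the text, so the absolute dollar figures are order-of-magnitude, while the 10x audio/video ratio falls straight out of the bitrates.

```python
# Back-of-envelope bandwidth cost model for the video-vs-audio decision.
# SECONDS_PER_CALL and COST_PER_GB are illustrative assumptions, not
# figures from the launch; users and bitrates come from the case study.

SECONDS_PER_CALL = 5 * 60                 # assumed average call length
CALLS_PER_DAY = 1_000_000_000 * 0.10      # 1B users, 10% daily calling adoption
COST_PER_GB = 0.09                        # assumed blended egress price, USD

def annual_cost(kbps: float) -> float:
    """Yearly bandwidth cost in USD for a given average stream bitrate."""
    bytes_per_call = kbps * 1000 / 8 * SECONDS_PER_CALL
    gb_per_year = bytes_per_call * CALLS_PER_DAY * 365 / 1e9
    return gb_per_year * COST_PER_GB

video = annual_cost(750)   # mid-range of the 500-2000 kbps video figure
audio = annual_cost(75)    # mid-range of the 50-100 kbps audio figure
print(f"video-first: ${video/1e6:.0f}M/yr, audio-first: ${audio/1e6:.0f}M/yr")
```

With these assumptions the model lands in the same order of magnitude as the $120M vs. $12M comparison; the ratio is fixed by the bitrates, the absolutes by the assumed price and duration.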
Alternatives Considered:
Video-first (like FaceTime): Higher "wow factor" but 10x infrastructure cost and worse reliability
Audio-only (like phone calls): Cheapest and most reliable but less differentiated from competitors
Audio-first with video opt-in: Balanced approach—start with audio, let users upgrade to video mid-call
Decision: Launch with audio as default, video as opt-in upgrade during the call.
Rationale:
Audio has 10x lower bandwidth requirements (50-100 kbps vs. 500-2000 kbps)
Users more comfortable with audio-first interaction (less pressure to "look good")
Easier to moderate (fewer edge cases like nudity, violence)
Faster time to market (audio codecs more mature, fewer device compatibility issues)
Better reliability on 2G/3G networks (35% of our user base)
Could always add video later, but couldn't easily remove it
Implementation Details:
Built adaptive bitrate audio codec (Opus) with 3 quality tiers: 16 kbps (2G), 32 kbps (3G), 64 kbps (4G/WiFi)
Video upgrade button appears 5 seconds into call (after audio connection stabilizes)
Automatic fallback to audio-only if video fails or network degrades
UI clearly shows "Audio Call" vs. "Video Call" state
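The three-tier Opus mapping above can be sketched as a small selection function. The function name, the 50% headroom factor, and the 8 kbps floor are illustrative assumptions; the 16/32/64 kbps tiers and network classes come from the implementation notes.

```python
# Map a detected network class to the Opus bitrate tiers described above,
# then clamp to what the link can actually sustain. Helper name, headroom
# factor, and the 8 kbps floor are illustrative, not Instagram's code.

OPUS_TIERS_KBPS = {"2g": 16, "3g": 32, "4g": 64, "wifi": 64}

def pick_audio_bitrate(network: str, measured_kbps: float) -> int:
    tier = OPUS_TIERS_KBPS.get(network, 16)   # unknown networks get the safe floor
    sustainable = int(measured_kbps * 0.5)    # leave ~50% headroom for overhead/jitter
    return max(8, min(tier, sustainable))     # Opus stays usable down to ~8 kbps

print(pick_audio_bitrate("wifi", 5000))   # → 64
print(pick_audio_bitrate("2g", 30))       # → 15
```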
Result: 60% of calls stayed audio-only, reducing infrastructure cost by 40% ($48M/year savings). Call completion rate was 85% vs. projected 65% for video-first. User satisfaction scores were 4.3/5, same as WhatsApp's video-first approach.
Decision 2: Crawl-Walk-Run Rollout
Context: Launching to 1B users at once would be catastrophic if anything broke. We had seen other Meta products (Facebook Live, Instagram Stories) have major incidents during launches because they ramped too quickly. The infrastructure team was adamant: "If we go straight to 100%, we'll take down Instagram."
The Risk:
Instagram's infrastructure handled 100M messages/second. Calling would add real-time sessions, persistent connections, and media streaming—completely different load patterns
A single bug affecting 1% of users would impact 10M people
Rollout gating: weekly go/no-go meetings with engineering, product, trust & safety, and leadership before each ramp stage
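Crawl-walk-run ramps are commonly implemented as deterministic user-ID bucketing, so each ramp stage only ever adds users. This is a sketch of that general pattern under assumed names, not Meta's actual gating system.

```python
import hashlib

# Deterministic percentage gate for a staged rollout: each user hashes to
# a stable bucket 0-9999, so raising the percentage only ever adds users
# (nobody flaps in and out between sessions). A sketch of the common
# pattern, not Meta's internal gating infrastructure.

def in_rollout(user_id: int, feature: str, percent: float) -> bool:
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # 10k buckets → 0.01% granularity
    return bucket < percent * 100

# Ramping 1% → 10% keeps every user who was already enabled at 1%.
cohort_1pct = {u for u in range(100_000) if in_rollout(u, "calling", 1)}
cohort_10pct = {u for u in range(100_000) if in_rollout(u, "calling", 10)}
print(len(cohort_1pct), len(cohort_10pct), cohort_1pct <= cohort_10pct)
```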
What We Caught:
Shadow mode: Discovered that our load balancers couldn't handle persistent WebRTC connections (designed for short HTTP requests). Had to rewrite load balancing logic.
1% rollout: Found that iPhone X had a bug causing calls to drop after 60 seconds. Apple fixed it in iOS 11.3.
10% rollout: Discovered that calls in India were failing 40% of the time due to carrier-level NAT traversal issues. Built custom TURN server infrastructure.
25% rollout: ML models started flagging legitimate calls as spam (false positive rate spiked to 8%). Retrained models with real data.
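The NAT traversal fix at 10% follows the standard ICE-style fallback: try direct (host or STUN-discovered) paths first, and relay media through a TURN server only when they fail. A minimal sketch with hypothetical names; real ICE runs connectivity checks on candidate pairs in parallel.

```python
# Simplified ICE-style connection fallback behind the India fix above:
# direct paths first, TURN relays only when NAT traversal fails.
# Function and server names are hypothetical.

def connect_call(candidates, turn_servers, try_path):
    """candidates: peer addresses from STUN; turn_servers: nearest relays.
    try_path: callable returning True if a media path works."""
    for addr in candidates:          # direct paths: cheapest, lowest latency
        if try_path(addr):
            return ("direct", addr)
    for relay in turn_servers:       # symmetric/carrier NAT: relay the media
        if try_path(relay):
            return ("relayed", relay)
    return ("failed", None)

# Example: direct paths blocked by carrier NAT, Mumbai relay succeeds.
reachable = {"turn-mumbai"}
print(connect_call(["host-a", "srflx-b"], ["turn-mumbai", "turn-blr"],
                   lambda a: a in reachable))   # → ('relayed', 'turn-mumbai')
```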
Result: Zero major incidents during rollout, smooth ramp to 100%. The crawl-walk-run approach caught 12 critical bugs that would have affected millions of users. Total rollout took 12 weeks vs. planned 8 weeks, but we avoided a potential Instagram-wide outage.
Decision 3: ML-Based Harmful Content Detection
Context: Manual moderation doesn't scale for real-time calls. At projected scale (600M daily calls), we'd need 50,000 human moderators working 24/7 to review even 1% of calls. The Trust & Safety team had only 8 people. We needed an automated approach, but audio moderation is notoriously difficult and privacy-sensitive.
The Challenge:
Can't record and store all calls (privacy violation, GDPR non-compliant, storage cost prohibitive)
Can't have humans listen to calls in real-time (scale impossible, privacy concerns)
Audio analysis is computationally expensive (speech-to-text costs $0.02/minute at scale)
False positives would block legitimate calls and erode trust
False negatives would allow harassment, bullying, and illegal activity
Alternatives Considered:
No moderation: Fastest to ship but unacceptable risk (harassment, illegal content)
Post-call user reports only: Reactive, not proactive. Bad actors could make hundreds of calls before being caught
Full audio recording + analysis: Most effective but privacy nightmare and cost prohibitive ($12M/year)
Metadata-based detection + opt-in audio: Balanced approach using behavioral signals
Decision: Build ML models to detect harmful content patterns in call metadata (duration, frequency, user reports, behavioral signals) and audio analysis only when users explicitly opt in by reporting a call.
Rationale:
Metadata patterns signal issues without invading privacy:
Very short calls (<10 seconds) followed by blocks = likely harassment
High frequency calls (>20/day to different users) = potential spam
Calls followed by immediate reports = harmful content
One-sided calls (one person talks 95%+ of time) = potential scam
Audio analysis only with explicit consent (when user reports a call)
Audio Analysis Model: Analyzes reported calls for hate speech, threats, sexual content (precision: 92%, recall: 85%)
Metadata signals tracked: call duration, frequency, time of day, user reports, block rate, previous violations
Graduated response system:
First offense: Warning message
Second offense: 24-hour calling restriction
Third offense: 7-day calling ban
Fourth offense: Permanent ban from calling
Human review for permanent bans (to avoid false positives)
Appeals process for users who believe they were wrongly banned
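The metadata signals and the graduated ladder above can be sketched together. The weights, thresholds, and field names here are illustrative stand-ins (the real detection used trained ML models, not hand-set rules); the signal patterns and the four-step ladder come from the text.

```python
# Sketch of metadata-only risk scoring plus the graduated response ladder.
# Weights, thresholds, and field names are illustrative; the production
# system used trained models on these behavioral signals.

def risk_score(call: dict) -> float:
    score = 0.0
    if call["duration_s"] < 10 and call["blocked_after"]:
        score += 0.5        # very short call then a block: harassment pattern
    if call["caller_calls_today"] > 20 and call["distinct_callees_today"] > 20:
        score += 0.3        # high-frequency fan-out: spam pattern
    if call["reported"]:
        score += 0.4        # immediate user report: strong signal
    if call["talk_ratio"] > 0.95:
        score += 0.2        # one-sided call: potential scam
    return min(score, 1.0)

LADDER = ["warning", "24h_restriction", "7d_ban", "permanent_ban"]

def enforce(prior_offenses: int) -> str:
    action = LADDER[min(prior_offenses, len(LADDER) - 1)]
    if action == "permanent_ban":
        return "queue_for_human_review"   # humans confirm permanent bans
    return action

spam = {"duration_s": 5, "blocked_after": True, "caller_calls_today": 40,
        "distinct_callees_today": 35, "reported": True, "talk_ratio": 0.98}
print(risk_score(spam), enforce(3))
```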
Training Data:
Used 10M anonymized calls from Messenger/WhatsApp (with user consent)
Collected 500K labeled examples from beta testing
Continuously retrained models with new data (weekly updates)
Result: Reduced harmful content by 25% vs. baseline (0.12% vs. 0.16%), maintained <2% false positive rate. Detected and blocked 15,000 spam accounts in first 6 months. User reports per 1000 calls dropped from 1.2 to 0.8. Creator harassment rate stayed below 0.1% (vs. 0.3% on competing platforms).
Decision 4: Graceful Degradation for Network Quality
Context: Many users on 2G/3G networks with unstable connections. India, Indonesia, and Brazil represented 35% of our user base, and 60% of users in these markets were on 2G/3G networks. Early testing showed that calls failed 70% of the time on poor networks without adaptive quality.
Implementation Details:
Mid-call network change: User switches from WiFi to cellular. System adapts within 5 seconds
Battery optimization: On low battery (<20%), automatically disable video to extend call time
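The two behaviors above can be sketched as one tier-selection loop: evaluate the worst recent bandwidth sample so a WiFi-to-cellular switch settles within ~5 seconds, and skip video tiers on low battery. The tier table and thresholds are illustrative, not the shipped values.

```python
# Sketch of the mid-call adaptation loop described above. Tier floors and
# labels are illustrative assumptions.

TIERS = [  # (min sustained kbps, label, is_video)
    (500, "hd_video", True),
    (200, "sd_video", True),
    (48, "audio_64", False),
    (24, "audio_32", False),
    (0, "audio_16", False),
]

def pick_tier(kbps_samples, battery_pct):
    """Use the worst of the last 5 one-second samples, so a WiFi→cellular
    switch degrades within ~5 seconds instead of oscillating."""
    sustained = min(kbps_samples[-5:])
    for floor, label, is_video in TIERS:
        if sustained >= floor:
            if is_video and battery_pct < 20:
                continue                  # low battery: skip video tiers
            return label
    return "audio_16"

print(pick_tier([900, 900, 900, 120, 60], battery_pct=80))   # → 'audio_64'
print(pick_tier([900] * 5, battery_pct=15))                  # → 'audio_64'
```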
Result: Call completion rate 85% even on 2G networks (vs. 40% without degradation, 70% industry average). User satisfaction on poor networks: 3.8/5 (vs. 2.1/5 without degradation). Average call duration increased by 40% on 2G/3G networks because calls stayed connected longer.
Decision 5: Creator Controls First
Context: Creators worried about harassment and unwanted calls. Our research showed that 45% of creators (accounts with >10K followers) received unwanted DMs daily, and 68% were concerned that calling would make harassment worse. If creators disabled calling or left the platform, it would hurt Instagram's ecosystem and signal to users that calling wasn't safe.
The Stakes:
Creators drive engagement: accounts with >10K followers generate 40% of Instagram's content consumption
Creator exodus risk: If calling enabled harassment, creators would disable it or leave for platforms with better controls
Perception matters: If high-profile creators complained about harassment, it would damage Instagram's reputation
Asymmetric power dynamic: Fans feel entitled to access creators, creators feel vulnerable
The Research:
Surveyed 5,000 creators about calling concerns:
68% worried about harassment from fans
52% worried about spam calls
41% worried about calls at inappropriate times (3am, during work, etc.)
35% worried about stalking/doxxing
Interviewed 50 top creators (>1M followers):
"I love connecting with fans, but I need boundaries"
"If anyone can call me, I'll have to disable it"
"I want to choose who can reach me in real-time"
Alternatives Considered:
Open by default (anyone can call anyone): Maximum discoverability but high harassment risk
Mutual follows only: Balanced but limits creator-fan interaction
Creator controls (strong defaults, customizable): Safest but potentially limits adoption
Verified-only calling: Only verified accounts can call. Too restrictive, excludes 99% of users
Decision: Ship with strong controls before general rollout:
Default: only people you follow can call you (most restrictive, safest)
Option: only close friends can call (for creators who want even more control)
Option: nobody can call (messaging only) (complete opt-out)
Easy blocking and reporting (one-tap block, report goes to Trust & Safety)
Quiet hours: Automatically silence calls during specified hours (e.g., 10pm-8am)
Call screening: See who's calling before answering, with option to decline and send message
Rationale:
Creators are power users and influencers—their experience sets the tone for everyone
Bad creator experience would kill adoption (creators have large audiences and amplify complaints)
Strong controls build trust—can always loosen restrictions later, but can't easily tighten them
Default to safe, let users opt into more openness (not vice versa)
Give creators tools to manage their accessibility
Implementation Details:
Built granular privacy controls:
Who can call me: Everyone I follow / Close friends only / Nobody
Quiet hours: Specify hours when calls are silenced (default: 10pm-8am local time)
Call screening: See caller name/photo before answering, with "Decline" and "Decline + Message" options
Blocked callers: Automatically reject calls from blocked accounts
Creator-specific features:
Business hours: Creators can set "available for calls" hours (e.g., 2pm-5pm weekdays)
Auto-reply messages: "I'm not available right now, but send me a DM!"
Call limits: Limit number of calls per day (e.g., max 10 calls/day)
Easy access to controls:
Privacy settings accessible from profile (2 taps)
In-call blocking (block caller mid-call if needed)
Post-call reporting (report harassment after call ends)
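The layered controls above amount to an ordered permission check: blocks first, then the audience setting, then quiet hours, then the daily limit. A minimal sketch; the settings field names and return codes are hypothetical, and the rule ordering mirrors the product description rather than actual Instagram code.

```python
from datetime import time

# Sketch of the layered call-permission check described above. Field
# names and return codes are hypothetical.

def can_call(caller_id, settings, now: time):
    if caller_id in settings["blocked"]:
        return (False, "reject_silently")
    audience = settings["who_can_call"]   # "following" | "close_friends" | "nobody"
    if audience == "nobody":
        return (False, "calls_disabled")
    allowed = settings["close_friends"] if audience == "close_friends" else settings["following"]
    if caller_id not in allowed:
        return (False, "not_in_audience")
    start, end = settings["quiet_hours"]  # e.g. (time(22), time(8)), wraps midnight
    in_quiet = (start <= now or now < end) if start > end else (start <= now < end)
    if in_quiet:
        return (False, "quiet_hours")     # silenced; shows as a missed call
    if settings["calls_today"] >= settings["daily_call_limit"]:
        return (False, "daily_limit_reached")
    return (True, "ring")

s = {"blocked": set(), "who_can_call": "following", "following": {42},
     "close_friends": set(), "quiet_hours": (time(22), time(8)),
     "calls_today": 0, "daily_call_limit": 10}
print(can_call(42, s, time(14)))   # → (True, 'ring')
print(can_call(42, s, time(23)))   # → (False, 'quiet_hours')
```

Putting the block check first matters: a blocked caller should see silent rejection, never a signal (like a quiet-hours message) that reveals the recipient's settings.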
Rollout Strategy:
Shipped controls 2 weeks before general rollout
Proactively messaged all creators (>10K followers) about new controls
Created help center articles and video tutorials
Monitored creator feedback closely during beta
Result: 90% of creators kept calling enabled (vs. projected 60-70%), <1% harassment reports (vs. 3-5% on competing platforms). Creator satisfaction with calling: 4.5/5. Top creator feedback: "I love that I have control over who can reach me." Call adoption among creators: 85% (higher than general population at 75%).
The Execution
Phase 1: Foundation (Months 1-2)
Goal: Build the technical foundation and validate core assumptions with a small beta group.
Infrastructure Work:
Built WebRTC signaling server on top of existing messaging infrastructure (reused message delivery system for call setup)
Implemented STUN/TURN servers for NAT traversal (deployed in 15 regions globally for <100ms latency)
Created media relay infrastructure (built on top of Facebook's existing CDN)
Integrated Opus audio codec (variable bitrate: 16-64 kbps) and VP8 video codec (variable bitrate: 200-2000 kbps)
Built connection quality monitoring system (tracks bandwidth, latency, packet loss, jitter in real-time)
Implemented graceful degradation logic (5 quality tiers, automatic switching based on network conditions)
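The quality monitor tracked four signals; a score that combines them might look like the sketch below. The weights and penalty shapes are made up for illustration (production systems typically use MOS/E-model style estimates), but the inputs and the 100ms/150ms quality bars come from the constraints above.

```python
# Illustrative call-quality score from the four signals the monitoring
# system tracked (bandwidth, latency, packet loss, jitter). Weights are
# invented for the sketch; real systems use MOS/E-model style estimates.

def quality_score(latency_ms, jitter_ms, loss_pct, kbps) -> float:
    """Return 0-5, where >4 ≈ great, >3 ≈ acceptable."""
    score = 5.0
    score -= max(0, (latency_ms - 100) / 100)   # penalty past the 100ms 'great' bar
    score -= jitter_ms / 30                     # jitter hurts perceived smoothness
    score -= loss_pct * 0.5                     # each 1% loss costs half a point
    score -= max(0, (64 - kbps) / 32)           # starved below the 64 kbps top tier
    return max(0.0, min(5.0, score))

print(quality_score(latency_ms=80, jitter_ms=10, loss_pct=0.5, kbps=64))
print(quality_score(latency_ms=350, jitter_ms=40, loss_pct=4, kbps=24))
```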
Trust & Safety Work:
Created 3 ML models for spam, harassment, and audio content detection
Collected 10M training examples from Messenger/WhatsApp (anonymized, with user consent)
Built metadata collection pipeline (call duration, frequency, user reports, behavioral signals)
Created Trust & Safety dashboard for human reviewers
Product & Design Work:
Designed calling UI with 3 iterations based on user feedback
Built call controls (mute, speaker, video toggle, end call)
Created privacy controls (who can call, quiet hours, call screening)
Designed network quality indicators and degradation messaging
Tested with 1000 beta users (500 creators, 500 regular users)
Beta Testing Results:
Call completion rate: 80% (target: 75%)
Call quality score: 4.1/5 (target: 4.0)
P95 latency: 130ms (target: 150ms)
User satisfaction: 4.2/5
Top feedback: "Love the audio-first approach" and "Privacy controls are great"
Issues found: 15 bugs (all fixed before Phase 2)
Key Decisions Made:
Confirmed audio-first approach was right (users preferred it 2:1 over video-first)
Validated graceful degradation (call completion rate 80% vs. 45% without it)
Confirmed creator controls were sufficient (90% of beta creators kept calling enabled)
Phase 2: Shadow Mode (Weeks 1-2)
Goal: Validate infrastructure at scale without exposing users to potential failures.
What is Shadow Mode?
Shadow mode means the calling infrastructure runs in parallel with the production messaging system, processing real call requests but not actually connecting calls. This lets us measure performance, identify bottlenecks, and catch bugs before users are affected.
Infrastructure Testing:
Deployed calling infrastructure to production (15 regions, 500 servers)
Processed 10M simulated call requests per day (equivalent to 10% of projected load)
Measured end-to-end latency from call initiation to connection establishment
Tested load balancer behavior under sustained load
Tested failure scenarios: server crashes, network partitions, database outages
Validated automatic failover and recovery
Tested graceful degradation under extreme load
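The mechanics of shadow mode reduce to request mirroring: duplicate a fraction of real traffic to the new stack, record its latency and outcome, and always discard its response. A sketch under assumed names, not the actual Meta service interfaces:

```python
import random

# Minimal sketch of shadow mode: mirror a fraction of real requests to
# the new calling stack, record outcomes, but always return only the
# production response so users never see shadow failures. Names are
# illustrative.

def handle_request(request, production, shadow, metrics, sample_rate=0.10):
    response = production(request)            # the only response users see
    if random.random() < sample_rate:         # mirror ~10% of projected load
        try:
            result = shadow(request)          # exercised, then thrown away
            metrics.append(("shadow_ok", result))
        except Exception as exc:              # shadow failures never surface
            metrics.append(("shadow_error", repr(exc)))
    return response

def failing_shadow(request):
    raise RuntimeError("LB dropped long-lived conn")   # the kind of bug this catches

metrics = []
out = handle_request({"caller": 1, "callee": 2},
                     production=lambda r: "message_delivered",
                     shadow=failing_shadow,
                     metrics=metrics,
                     sample_rate=1.0)
print(out, metrics)   # production result unaffected; error captured for analysis
```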
Critical Issues Found:
Load balancer bug: Couldn't handle persistent WebRTC connections (designed for short HTTP requests). Had to rewrite load balancing logic to support long-lived connections.
Database bottleneck: Call metadata writes were creating hotspots. Sharded database by user ID to distribute load.
CDN routing issue: Media relay was routing through suboptimal paths, adding 50-100ms latency. Reconfigured CDN routing tables.
Week 3-4: 1% Rollout (10M users)
On-call: 24/7 on-call rotation with 5-minute response time SLA
Communication: In-app announcement: "Try the new calling feature!"
What We Learned:
Engagement exceeded expectations: Users made 0.8 calls/day vs. 0.5 target. Calling was more popular than predicted.
Network quality was the #1 complaint: 15% of users complained about call quality on poor networks. Graceful degradation was working but UI messaging needed improvement.
iPhone X bug: Calls dropped after 60 seconds on iPhone X (iOS 11.2). Apple acknowledged bug and fixed in iOS 11.3. We added workaround for older iOS versions.
Creator adoption was strong: 85% of creators in 1% group enabled calling, 70% made at least one call.
Week 5-6: 10% Rollout (100M users)
Audience: All countries, iOS + Android, all ages
New challenges: Android fragmentation (1000+ device types), international networks (2G/3G in emerging markets), language/cultural differences
Critical Issues Found:
India network issue: Calls failing 40% of the time in India due to carrier-level NAT traversal issues. Built custom TURN server infrastructure in Mumbai and Bangalore. Improved completion rate from 60% to 82%.
Low-end Android devices: Devices with <2GB RAM were crashing during video calls. Added memory monitoring and automatic video disabling on low-memory devices.
ML false positives: Models flagged 8% of legitimate calls as spam (vs. 2% in beta). Issue: models trained on Messenger data didn't generalize to Instagram usage patterns. Retrained with Instagram-specific data, reduced false positives to 1.8%.
Language/cultural issues: In some cultures (Japan, Korea), calling without prior arrangement is considered rude. Added "Request to call" feature where caller sends request first, recipient approves.
Week 7-12: Ramp to 100%
Week 8: Added "Request to call" feature for cultural sensitivity (adopted by 12% of users in Japan/Korea)
Week 9: Optimized ML models (reduced false positives from 1.8% to 1.2%)
Week 10: Added call history and missed call notifications (increased call-back rate by 25%)
Week 11: Improved low-end Android performance (reduced crashes by 40%)
Week 12: Added group calling for up to 4 people (requested by 35% of users)
Incidents & Resolutions:
Week 8, Day 3: Latency spike to 250ms in EU region. Root cause: CDN routing issue. Fixed in 45 minutes. Affected 5M users.
Week 10, Day 2: Error rate spike to 2.5% in India. Root cause: carrier-level network issue (not our fault). Worked with carrier to resolve. Fixed in 3 hours.
Week 11, Day 5: ML model false positive spike to 5%. Root cause: model drift (real-world data distribution changed). Retrained model. Fixed in 2 hours.
Key Metrics Achieved:
75% DAU adoption by month 6 ✅ (750M daily active users making calls)
99.92% uptime maintained ✅ (exceeded 99.9% SLA)
Harmful content 25% below baseline ✅ (0.12% vs. 0.16%)
Call completion rate: 85% ✅ (vs. 70% industry average)
Infrastructure cost: 40% below budget ✅ ($72M/year vs. $120M budgeted)
Team Retrospective:
What went well: Crawl-walk-run rollout, strong creator controls, graceful degradation
What could be better: Should have built better operator tools, tested on more device types earlier
Lessons learned: Shadow mode is essential, metrics drive decisions, cross-functional coordination is the bottleneck
The Results
Adoption Metrics
The adoption curve exceeded all projections, reaching 75% DAU in 6 months against a 50% target.
Calling drove significant business value beyond direct engagement:
User Engagement:
+15% time in app (from 28 min/day to 32 min/day)
+12% daily active users (calling brought back lapsed users)
+8% weekly retention (users who called were more likely to return)
+25% cross-feature usage (calling users also used Stories, Reels, messaging more)
Messaging Growth:
+40% messages sent (calling drove messaging, not cannibalized it)
+35% new conversations started (calling broke the ice, led to more messaging)
+20% group chat creation (users formed groups after calls)
Creator Impact:
+8% creator retention (creators stayed on platform longer)
+12% creator content production (creators posted more after connecting with fans)
+18% fan engagement (fans who called creators engaged more with their content)
New creator revenue stream: Enabled paid 1-on-1 calls (launched 6 months later, $50M GMV in first year)
Infrastructure & Cost:
40% below budget ($72M/year actual vs. $120M budgeted)
Audio-first strategy saved $48M/year in bandwidth costs
Graceful degradation reduced support load by 30% (fewer "call failed" complaints)
Reused 60% of Messenger/WhatsApp infrastructure (saved $20M in development costs)
Competitive Impact:
Reduced switching to WhatsApp/Messenger by 35% (users stayed in Instagram for calls)
Slowed Snapchat growth by 8% (Instagram calling was competitive with Snapchat's offering)
Increased Instagram's "stickiness" (harder for users to leave when all communication is in one app)
Key Tradeoffs
Tradeoff 1: Audio-First vs. Video-First
Chose: Audio-first with video opt-in
Gained: Lower cost, faster rollout, better reliability
Lost: Less differentiation vs. competitors, lower "wow factor"
Would I do it again? Yes. Audio-first was the right call for scale.
Tradeoff 2: Privacy vs. Safety
Chose: Metadata-based detection with opt-in audio analysis
Gained: User trust, GDPR compliance, scalable moderation
Lost: Some harmful content slipped through (couldn't analyze audio)
Would I do it again? Yes, but would invest more in metadata signals.
Tradeoff 3: Speed vs. Perfection
Chose: Ship in 6 months with 85% call completion vs. wait for 95%
Gained: First-mover advantage, faster learning, earlier revenue
Lost: Some user frustration, higher support load initially
Would I do it again? Yes. 85% was good enough, and we hit 92% within 3 months.
Tradeoff 4: Creator Controls vs. Discoverability
Chose: Strong default controls (only followers can call)
Gained: Creator trust, low harassment, high adoption
Lost: Harder for fans to reach creators, less spontaneous connection
Would I do it again? Yes. Lost trust is hard to rebuild; strict defaults are easy to loosen later.
Lessons Learned
1. Start with the Constraint, Not the Feature
We didn't start with "build the best calling experience." We started with "how do we add calling without breaking Instagram?" That constraint led to better decisions (audio-first, gradual rollout, strong controls).
2. Crawl-Walk-Run Saves You Every Time
The gradual rollout caught 12 major issues that would have been catastrophic at 100% traffic. Shadow mode alone found 3 critical bugs. Never skip the crawl phase.
3. Trust & Safety is a Product Feature, Not an Afterthought
We built safety controls before general rollout. This made creators comfortable and prevented a harassment crisis. Safety should be in the MVP, not v2.
4. Graceful Degradation > Perfect Quality
Users preferred a working audio call over a broken video call. Build systems that degrade gracefully, not fail catastrophically.
5. Metrics Drive Decisions, Not Opinions
We had 60+ engineers with strong opinions. Metrics (latency, completion rate, safety) cut through debate and aligned the team. Instrument everything.
6. Cross-Functional Coordination is the Bottleneck
With 60+ people across 4 time zones, coordination was harder than the technical work. Weekly syncs, clear DRIs, and written decision docs were essential.
7. Users Surprise You
We thought video would dominate. Users preferred audio (60% of calls). We thought creators would disable calling. 90% kept it on. Always validate assumptions with real users.
What I'd Do Differently
1. Invest More in Network Quality Prediction
We built reactive degradation (drop quality when network fails). Should have built predictive degradation (drop quality before it fails). Would have improved completion rate by 5-10%.
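One way the predictive idea could work: extrapolate a short-horizon trend from recent bandwidth samples and downgrade when the forecast crosses the tier floor, before the call actually starves. Purely illustrative, not something that shipped:

```python
# Sketch of predictive degradation: downgrade when the *forecast* crosses
# the tier floor, not just when the current sample does. Illustrative only.

def forecast_kbps(samples, horizon=3):
    """Linear extrapolation over the last few one-second samples."""
    recent = samples[-5:]
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)
    return recent[-1] + slope * horizon

def should_downgrade(samples, tier_floor_kbps):
    # Reactive check: already below the floor. Predictive: will be soon.
    return samples[-1] < tier_floor_kbps or forecast_kbps(samples) < tier_floor_kbps

falling = [400, 340, 280, 220, 160]       # WiFi fading as the user walks away
print(should_downgrade(falling, tier_floor_kbps=100))    # → True (purely predictive)
print(should_downgrade([300] * 5, tier_floor_kbps=100))  # → False
```

In the falling example the current sample (160 kbps) is still above the floor, so a reactive system would wait for the call to break; the forecast fires three seconds early.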
2. Ship Creator Analytics Sooner
Creators wanted to know who called them, when, and for how long. We shipped this in month 4. Should have been in MVP. Would have increased creator adoption faster.
3. Build Better Operator Tools
Our trust & safety team had basic dashboards. Should have built real-time intervention tools (pause calls, send warnings, etc.). Would have reduced harmful content by another 10%.
4. Test on More Device Types
We tested on flagship devices and missed issues on low-end Android phones (30% of users). Should have tested on 20+ device types before rollout.
Frameworks Used
This case study demonstrates several frameworks in action: