Scaling real-time calling from zero to 750M daily users in 6 months
Meta
2017-2022
25 min read
DAU Adoption
0% → 75%
in 6 months
Daily Calls
0 → 600M
at month 6
Call Completion Rate
N/A → 85%
vs 70% industry avg
Harmful Content
0.16% → 0.12%
25% reduction
Platform Uptime
99.9% → 99.92%
maintained SLA
User Engagement
baseline → +15%
time in app
Objective
Launch native voice and video calling to reach 50% DAU adoption within 6 months while maintaining 99.9% platform uptime and reducing harmful content by 50% versus industry baseline
Instagram Calling: 0 to 75% DAU in 6 Months
The Challenge
In 2018, Instagram had crossed 1 billion monthly active users, but we had a critical gap: no native calling feature. Our data showed that 40% of Instagram users were switching to WhatsApp or Messenger multiple times per day to make voice or video calls with the same people they were messaging on Instagram. This context-switching was creating friction in the user experience and fragmenting conversations across the Meta family of apps.
The competitive landscape was intensifying. Snapchat had launched voice and video calling in 2016 and was seeing strong engagement, particularly among younger users. TikTok was emerging as a threat, and while they didn't have calling yet, we knew it was only a matter of time. We needed to move fast to keep Instagram competitive as a complete communication platform.
But adding real-time calling to Instagram wasn't straightforward. The app had been built from the ground up as a visual, asynchronous platform. The infrastructure, product philosophy, and user expectations were all optimized for photos, videos, and text messages—not real-time voice and video. We needed to add calling without:
Breaking the core Instagram experience that users loved
Compromising trust & safety standards (a major concern given Facebook's reputation challenges at the time)
Overwhelming infrastructure that was already running at massive scale
Alienating creators who valued async communication and were worried about harassment
Creating privacy concerns in an era of heightened scrutiny around Meta's data practices
The Objective
Launch native voice and video calling that:
Reaches 50% DAU adoption within 6 months
Maintains Instagram's 99.9% uptime SLA
Reduces harmful content in calls by 50% vs. industry baseline
Integrates seamlessly with existing messaging
Constraints
Technical:
Instagram's infrastructure wasn't built for real-time communication. Our backend was optimized for async message delivery with eventual consistency, not the sub-150ms latency requirements of voice/video calls
WebRTC at scale was unproven on mobile. While Google and Mozilla had proven it worked in browsers, mobile implementations were fragile, battery-intensive, and had poor codec support on older Android devices
Latency requirements: <150ms end-to-end for acceptable call quality, <100ms for great quality. Our existing infrastructure had P95 latency of 300-500ms for message delivery
Had to work on 2G networks in emerging markets (India, Indonesia, Brazil represented 35% of our user base). 2G networks have 200-400ms baseline latency and 20-50 kbps bandwidth
Existing messaging infrastructure handled 100M messages/second but had no concept of "sessions" or "real-time state"
Mobile app size constraints: couldn't add more than 5MB to the app binary (we were already at 95MB and users complained about app size)
Organizational:
Team of 60+ engineers across 4 time zones (Menlo Park, New York, London, Tel Aviv) with no single owner
Competing priorities with Stories (our fastest-growing feature) and Reels (our TikTok competitor, top company priority)
Trust & Safety team was 8 people covering all of Instagram, already overwhelmed with Stories moderation
6-month hard deadline for F8 announcement (Zuckerberg had already committed publicly)
No dedicated infrastructure team—had to borrow capacity from Messenger and WhatsApp teams who had their own roadmaps
Product design team was 3 people, split across 10+ projects
User:
Instagram users valued visual, async communication. Our research showed 78% of users preferred "responding when convenient" over real-time interaction
Calling could feel intrusive or "too personal" for a platform built around curated, public content
Creators (10M+ accounts with >10K followers) worried about harassment and unwanted calls from fans. 45% of creators reported receiving unwanted DMs daily
Privacy concerns around call metadata (who called whom, when, for how long) in the wake of Cambridge Analytica
User expectations set by FaceTime and WhatsApp—anything worse would be seen as a regression
Key Decisions
Decision 1: Audio-First, Video-Optional
Context: Video calls are higher quality but harder to scale and more intrusive. The team was split: engineers from the Messenger team advocated for video-first (following FaceTime's model), while the infrastructure team warned about bandwidth costs at Instagram's scale.
The Numbers:
Video calls require 500-2000 kbps bandwidth vs. 50-100 kbps for audio
At 1B users with 10% daily calling adoption, video-first would cost $120M/year in bandwidth vs. $12M for audio-first
Video encoding/decoding drains battery 3-5x faster than audio
Video calls have 2.5x higher failure rate on poor networks
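The bandwidth comparison above can be reproduced with a back-of-envelope model. The average call length and the per-GB price below are illustrative assumptions (the case study does not state them); only the user counts and bitrate ranges come from the text, so the absolute dollar figures are order-of-magnitude, while the 10x audio/video ratio falls straight out of the bitrates.

```python
# Back-of-envelope bandwidth cost model for the video-vs-audio decision.
# SECONDS_PER_CALL and COST_PER_GB are illustrative assumptions, not
# figures from the launch; users and bitrates come from the case study.

SECONDS_PER_CALL = 5 * 60                 # assumed average call length
CALLS_PER_DAY = 1_000_000_000 * 0.10      # 1B users, 10% daily calling adoption
COST_PER_GB = 0.09                        # assumed blended egress price, USD

def annual_cost(kbps: float) -> float:
    """Yearly bandwidth cost in USD for a given average stream bitrate."""
    bytes_per_call = kbps * 1000 / 8 * SECONDS_PER_CALL
    gb_per_year = bytes_per_call * CALLS_PER_DAY * 365 / 1e9
    return gb_per_year * COST_PER_GB

video = annual_cost(750)   # mid-range of the 500-2000 kbps video figure
audio = annual_cost(75)    # mid-range of the 50-100 kbps audio figure
print(f"video-first: ${video/1e6:.0f}M/yr, audio-first: ${audio/1e6:.0f}M/yr")
```

With these assumptions the model lands in the same order of magnitude as the $120M vs. $12M comparison; the ratio is fixed by the bitrates, the absolutes by the assumed price and duration.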
Alternatives Considered:
Video-first (like FaceTime): Higher "wow factor" but 10x infrastructure cost and worse reliability
Audio-only (like phone calls): Cheapest and most reliable but less differentiated from competitors
Audio-first with video opt-in: Balanced approach—start with audio, let users upgrade to video mid-call
Decision: Launch with audio as default, video as opt-in upgrade during the call.
Rationale:
Audio has 10x lower bandwidth requirements (50-100 kbps vs. 500-2000 kbps)
Users more comfortable with audio-first interaction (less pressure to "look good")
Easier to moderate (fewer edge cases like nudity, violence)
Faster time to market (audio codecs more mature, fewer device compatibility issues)
Better reliability on 2G/3G networks (35% of our user base)
Could always add video later, but couldn't easily remove it
Implementation Details:
Built adaptive bitrate audio codec (Opus) with 3 quality tiers: 16 kbps (2G), 32 kbps (3G), 64 kbps (4G/WiFi)
Video upgrade button appears 5 seconds into call (after audio connection stabilizes)
Automatic fallback to audio-only if video fails or network degrades
UI clearly shows "Audio Call" vs. "Video Call" state
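The three-tier Opus mapping above can be sketched as a small selection function. The function name, the 50% headroom factor, and the 8 kbps floor are illustrative assumptions; the 16/32/64 kbps tiers and network classes come from the implementation notes.

```python
# Map a detected network class to the Opus bitrate tiers described above,
# then clamp to what the link can actually sustain. Helper name, headroom
# factor, and the 8 kbps floor are illustrative, not Instagram's code.

OPUS_TIERS_KBPS = {"2g": 16, "3g": 32, "4g": 64, "wifi": 64}

def pick_audio_bitrate(network: str, measured_kbps: float) -> int:
    tier = OPUS_TIERS_KBPS.get(network, 16)   # unknown networks get the safe floor
    sustainable = int(measured_kbps * 0.5)    # leave ~50% headroom for overhead/jitter
    return max(8, min(tier, sustainable))     # Opus stays usable down to ~8 kbps

print(pick_audio_bitrate("wifi", 5000))   # → 64
print(pick_audio_bitrate("2g", 30))       # → 15
```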
Result: 60% of calls stayed audio-only, reducing infrastructure cost by 40% ($48M/year savings). Call completion rate was 85% vs. projected 65% for video-first. User satisfaction scores were 4.3/5, same as WhatsApp's video-first approach.
Decision 2: Crawl-Walk-Run Rollout
Context: Launching to 1B users at once would be catastrophic if anything broke. We had seen other Meta products (Facebook Live, Instagram Stories) have major incidents during launches because they ramped too quickly. The infrastructure team was adamant: "If we go straight to 100%, we'll take down Instagram."
The Risk:
Instagram's infrastructure handled 100M messages/second. Calling would add real-time sessions, persistent connections, and media streaming—completely different load patterns
A single bug affecting 1% of users would impact 10M people
Rollout gating: weekly go/no-go meetings with engineering, product, trust & safety, and leadership before each ramp stage
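Crawl-walk-run ramps are commonly implemented as deterministic user-ID bucketing, so each ramp stage only ever adds users. This is a sketch of that general pattern under assumed names, not Meta's actual gating system.

```python
import hashlib

# Deterministic percentage gate for a staged rollout: each user hashes to
# a stable bucket 0-9999, so raising the percentage only ever adds users
# (nobody flaps in and out between sessions). A sketch of the common
# pattern, not Meta's internal gating infrastructure.

def in_rollout(user_id: int, feature: str, percent: float) -> bool:
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000          # 10k buckets → 0.01% granularity
    return bucket < percent * 100

# Ramping 1% → 10% keeps every user who was already enabled at 1%.
cohort_1pct = {u for u in range(100_000) if in_rollout(u, "calling", 1)}
cohort_10pct = {u for u in range(100_000) if in_rollout(u, "calling", 10)}
print(len(cohort_1pct), len(cohort_10pct), cohort_1pct <= cohort_10pct)
```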
What We Caught:
Shadow mode: Discovered that our load balancers couldn't handle persistent WebRTC connections (designed for short HTTP requests). Had to rewrite load balancing logic.
1% rollout: Found that iPhone X had a bug causing calls to drop after 60 seconds. Apple fixed it in iOS 11.3.
10% rollout: Discovered that calls in India were failing 40% of the time due to carrier-level NAT traversal issues. Built custom TURN server infrastructure.
25% rollout: ML models started flagging legitimate calls as spam (false positive rate spiked to 8%). Retrained models with real data.
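The NAT traversal fix at 10% follows the standard ICE-style fallback: try direct (host or STUN-discovered) paths first, and relay media through a TURN server only when they fail. A minimal sketch with hypothetical names; real ICE runs connectivity checks on candidate pairs in parallel.

```python
# Simplified ICE-style connection fallback behind the India fix above:
# direct paths first, TURN relays only when NAT traversal fails.
# Function and server names are hypothetical.

def connect_call(candidates, turn_servers, try_path):
    """candidates: peer addresses from STUN; turn_servers: nearest relays.
    try_path: callable returning True if a media path works."""
    for addr in candidates:          # direct paths: cheapest, lowest latency
        if try_path(addr):
            return ("direct", addr)
    for relay in turn_servers:       # symmetric/carrier NAT: relay the media
        if try_path(relay):
            return ("relayed", relay)
    return ("failed", None)

# Example: direct paths blocked by carrier NAT, Mumbai relay succeeds.
reachable = {"turn-mumbai"}
print(connect_call(["host-a", "srflx-b"], ["turn-mumbai", "turn-blr"],
                   lambda a: a in reachable))   # → ('relayed', 'turn-mumbai')
```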
Result: Zero major incidents during rollout, smooth ramp to 100%. The crawl-walk-run approach caught 12 critical bugs that would have affected millions of users. Total rollout took 12 weeks vs. planned 8 weeks, but we avoided a potential Instagram-wide outage.
Decision 3: ML-Based Harmful Content Detection
Context: Manual moderation doesn't scale for real-time calls. At projected scale (600M daily calls), we'd need 50,000 human moderators working 24/7 to review even 1% of calls. The Trust & Safety team had only 8 people. We needed an automated approach, but audio moderation is notoriously difficult and privacy-sensitive.
The Challenge:
Can't record and store all calls (privacy violation, GDPR non-compliant, storage cost prohibitive)
Can't have humans listen to calls in real-time (scale impossible, privacy concerns)
Audio analysis is computationally expensive (speech-to-text costs $0.02/minute at scale)
False positives would block legitimate calls and erode trust
False negatives would allow harassment, bullying, and illegal activity
Alternatives Considered:
No moderation: Fastest to ship but unacceptable risk (harassment, illegal content)
Post-call user reports only: Reactive, not proactive. Bad actors could make hundreds of calls before being caught
Full audio recording + analysis: Most effective but privacy nightmare and cost prohibitive ($12M/year)
Metadata-based detection + opt-in audio: Balanced approach using behavioral signals
Decision: Build ML models to detect harmful content patterns in call metadata (duration, frequency, user reports, behavioral signals) and audio analysis only when users explicitly opt in by reporting a call.
Rationale:
Metadata patterns signal issues without invading privacy:
Very short calls (<10 seconds) followed by blocks = likely harassment
High frequency calls (>20/day to different users) = potential spam
Calls followed by immediate reports = harmful content
One-sided calls (one person talks 95%+ of time) = potential scam
Audio analysis only with explicit consent (when user reports a call)
Audio Analysis Model: Analyzes reported calls for hate speech, threats, sexual content (precision: 92%, recall: 85%)
Metadata signals tracked: call duration, frequency, time of day, user reports, block rate, previous violations
Graduated response system:
First offense: Warning message
Second offense: 24-hour calling restriction
Third offense: 7-day calling ban
Fourth offense: Permanent ban from calling
Human review for permanent bans (to avoid false positives)
Appeals process for users who believe they were wrongly banned
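The metadata signals and the graduated ladder above can be sketched together. The weights, thresholds, and field names here are illustrative stand-ins (the real detection used trained ML models, not hand-set rules); the signal patterns and the four-step ladder come from the text.

```python
# Sketch of metadata-only risk scoring plus the graduated response ladder.
# Weights, thresholds, and field names are illustrative; the production
# system used trained models on these behavioral signals.

def risk_score(call: dict) -> float:
    score = 0.0
    if call["duration_s"] < 10 and call["blocked_after"]:
        score += 0.5        # very short call then a block: harassment pattern
    if call["caller_calls_today"] > 20 and call["distinct_callees_today"] > 20:
        score += 0.3        # high-frequency fan-out: spam pattern
    if call["reported"]:
        score += 0.4        # immediate user report: strong signal
    if call["talk_ratio"] > 0.95:
        score += 0.2        # one-sided call: potential scam
    return min(score, 1.0)

LADDER = ["warning", "24h_restriction", "7d_ban", "permanent_ban"]

def enforce(prior_offenses: int) -> str:
    action = LADDER[min(prior_offenses, len(LADDER) - 1)]
    if action == "permanent_ban":
        return "queue_for_human_review"   # humans confirm permanent bans
    return action

spam = {"duration_s": 5, "blocked_after": True, "caller_calls_today": 40,
        "distinct_callees_today": 35, "reported": True, "talk_ratio": 0.98}
print(risk_score(spam), enforce(3))
```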
Training Data:
Used 10M anonymized calls from Messenger/WhatsApp (with user consent)
Collected 500K labeled examples from beta testing
Continuously retrained models with new data (weekly updates)
Result: Reduced harmful content by 25% vs. baseline (0.12% vs. 0.16%), maintained <2% false positive rate. Detected and blocked 15,000 spam accounts in first 6 months. User reports per 1000 calls dropped from 1.2 to 0.8. Creator harassment rate stayed below 0.1% (vs. 0.3% on competing platforms).
Decision 4: Graceful Degradation for Network Quality
Context: Many users on 2G/3G networks with unstable connections. India, Indonesia, and Brazil represented 35% of our user base, and 60% of users in these markets were on 2G/3G networks. Early testing showed that calls failed 70% of the time on poor networks without adaptive quality.
Implementation Details:
Mid-call network change: User switches from WiFi to cellular. System adapts within 5 seconds
Battery optimization: On low battery (<20%), automatically disable video to extend call time
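The two behaviors above can be sketched as one tier-selection loop: evaluate the worst recent bandwidth sample so a WiFi-to-cellular switch settles within ~5 seconds, and skip video tiers on low battery. The tier table and thresholds are illustrative, not the shipped values.

```python
# Sketch of the mid-call adaptation loop described above. Tier floors and
# labels are illustrative assumptions.

TIERS = [  # (min sustained kbps, label, is_video)
    (500, "hd_video", True),
    (200, "sd_video", True),
    (48, "audio_64", False),
    (24, "audio_32", False),
    (0, "audio_16", False),
]

def pick_tier(kbps_samples, battery_pct):
    """Use the worst of the last 5 one-second samples, so a WiFi→cellular
    switch degrades within ~5 seconds instead of oscillating."""
    sustained = min(kbps_samples[-5:])
    for floor, label, is_video in TIERS:
        if sustained >= floor:
            if is_video and battery_pct < 20:
                continue                  # low battery: skip video tiers
            return label
    return "audio_16"

print(pick_tier([900, 900, 900, 120, 60], battery_pct=80))   # → 'audio_64'
print(pick_tier([900] * 5, battery_pct=15))                  # → 'audio_64'
```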
Result: Call completion rate 85% even on 2G networks (vs. 40% without degradation, 70% industry average). User satisfaction on poor networks: 3.8/5 (vs. 2.1/5 without degradation). Average call duration increased by 40% on 2G/3G networks because calls stayed connected longer.
Decision 5: Creator Controls First
Context: Creators worried about harassment and unwanted calls. Our research showed that 45% of creators (accounts with >10K followers) received unwanted DMs daily, and 68% were concerned that calling would make harassment worse. If creators disabled calling or left the platform, it would hurt Instagram's ecosystem and signal to users that calling wasn't safe.
The Stakes:
Creators drive engagement: accounts with >10K followers generate 40% of Instagram's content consumption
Creator exodus risk: If calling enabled harassment, creators would disable it or leave for platforms with better controls
Perception matters: If high-profile creators complained about harassment, it would damage Instagram's reputation
Asymmetric power dynamic: Fans feel entitled to access creators, creators feel vulnerable
The Research:
Surveyed 5,000 creators about calling concerns:
68% worried about harassment from fans
52% worried about spam calls
41% worried about calls at inappropriate times (3am, during work, etc.)
35% worried about stalking/doxxing
Interviewed 50 top creators (>1M followers):
"I love connecting with fans, but I need boundaries"
"If anyone can call me, I'll have to disable it"
"I want to choose who can reach me in real-time"
Alternatives Considered:
Open by default (anyone can call anyone): Maximum discoverability but high harassment risk
Mutual follows only: Balanced but limits creator-fan interaction
Creator controls (strong defaults, customizable): Safest but potentially limits adoption
Verified-only calling: Only verified accounts can call. Too restrictive, excludes 99% of users
Decision: Ship with strong controls before general rollout:
Default: only people you follow can call you (most restrictive, safest)
Option: only close friends can call (for creators who want even more control)
Option: nobody can call (messaging only) (complete opt-out)
Easy blocking and reporting (one-tap block, report goes to Trust & Safety)
Quiet hours: Automatically silence calls during specified hours (e.g., 10pm-8am)
Call screening: See who's calling before answering, with option to decline and send message
Rationale:
Creators are power users and influencers—their experience sets the tone for everyone
Bad creator experience would kill adoption (creators have large audiences and amplify complaints)
Strong controls build trust—can always loosen restrictions later, but can't easily tighten them
Default to safe, let users opt into more openness (not vice versa)
Give creators tools to manage their accessibility
Implementation Details:
Built granular privacy controls:
Who can call me: Everyone I follow / Close friends only / Nobody
Quiet hours: Specify hours when calls are silenced (default: 10pm-8am local time)
Call screening: See caller name/photo before answering, with "Decline" and "Decline + Message" options
Blocked callers: Automatically reject calls from blocked accounts
Creator-specific features:
Business hours: Creators can set "available for calls" hours (e.g., 2pm-5pm weekdays)
Auto-reply messages: "I'm not available right now, but send me a DM!"
Call limits: Limit number of calls per day (e.g., max 10 calls/day)
Easy access to controls:
Privacy settings accessible from profile (2 taps)
In-call blocking (block caller mid-call if needed)
Post-call reporting (report harassment after call ends)
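The layered controls above amount to an ordered permission check: blocks first, then the audience setting, then quiet hours, then the daily limit. A minimal sketch; the settings field names and return codes are hypothetical, and the rule ordering mirrors the product description rather than actual Instagram code.

```python
from datetime import time

# Sketch of the layered call-permission check described above. Field
# names and return codes are hypothetical.

def can_call(caller_id, settings, now: time):
    if caller_id in settings["blocked"]:
        return (False, "reject_silently")
    audience = settings["who_can_call"]   # "following" | "close_friends" | "nobody"
    if audience == "nobody":
        return (False, "calls_disabled")
    allowed = settings["close_friends"] if audience == "close_friends" else settings["following"]
    if caller_id not in allowed:
        return (False, "not_in_audience")
    start, end = settings["quiet_hours"]  # e.g. (time(22), time(8)), wraps midnight
    in_quiet = (start <= now or now < end) if start > end else (start <= now < end)
    if in_quiet:
        return (False, "quiet_hours")     # silenced; shows as a missed call
    if settings["calls_today"] >= settings["daily_call_limit"]:
        return (False, "daily_limit_reached")
    return (True, "ring")

s = {"blocked": set(), "who_can_call": "following", "following": {42},
     "close_friends": set(), "quiet_hours": (time(22), time(8)),
     "calls_today": 0, "daily_call_limit": 10}
print(can_call(42, s, time(14)))   # → (True, 'ring')
print(can_call(42, s, time(23)))   # → (False, 'quiet_hours')
```

Putting the block check first matters: a blocked caller should see silent rejection, never a signal (like a quiet-hours message) that reveals the recipient's settings.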
Rollout Strategy:
Shipped controls 2 weeks before general rollout
Proactively messaged all creators (>10K followers) about new controls
Created help center articles and video tutorials
Monitored creator feedback closely during beta
Result: 90% of creators kept calling enabled (vs. projected 60-70%), <1% harassment reports (vs. 3-5% on competing platforms). Creator satisfaction with calling: 4.5/5. Top creator feedback: "I love that I have control over who can reach me." Call adoption among creators: 85% (higher than general population at 75%).
The Execution
Phase 1: Foundation (Months 1-2)
Goal: Build the technical foundation and validate core assumptions with a small beta group.
Infrastructure Work:
Built WebRTC signaling server on top of existing messaging infrastructure (reused message delivery system for call setup)
Implemented STUN/TURN servers for NAT traversal (deployed in 15 regions globally for <100ms latency)
Created media relay infrastructure (built on top of Facebook's existing CDN)
Integrated Opus audio codec (variable bitrate: 16-64 kbps) and VP8 video codec (variable bitrate: 200-2000 kbps)
Built connection quality monitoring system (tracks bandwidth, latency, packet loss, jitter in real-time)
Implemented graceful degradation logic (5 quality tiers, automatic switching based on network conditions)
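The quality monitor tracked four signals; a score that combines them might look like the sketch below. The weights and penalty shapes are made up for illustration (production systems typically use MOS/E-model style estimates), but the inputs and the 100ms/150ms quality bars come from the constraints above.

```python
# Illustrative call-quality score from the four signals the monitoring
# system tracked (bandwidth, latency, packet loss, jitter). Weights are
# invented for the sketch; real systems use MOS/E-model style estimates.

def quality_score(latency_ms, jitter_ms, loss_pct, kbps) -> float:
    """Return 0-5, where >4 ≈ great, >3 ≈ acceptable."""
    score = 5.0
    score -= max(0, (latency_ms - 100) / 100)   # penalty past the 100ms 'great' bar
    score -= jitter_ms / 30                     # jitter hurts perceived smoothness
    score -= loss_pct * 0.5                     # each 1% loss costs half a point
    score -= max(0, (64 - kbps) / 32)           # starved below the 64 kbps top tier
    return max(0.0, min(5.0, score))

print(quality_score(latency_ms=80, jitter_ms=10, loss_pct=0.5, kbps=64))
print(quality_score(latency_ms=350, jitter_ms=40, loss_pct=4, kbps=24))
```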
Trust & Safety Work:
Created 3 ML models for spam, harassment, and audio content detection
Collected 10M training examples from Messenger/WhatsApp (anonymized, with user consent)
Built metadata collection pipeline (call duration, frequency, user reports, behavioral signals)
Created Trust & Safety dashboard for human reviewers
Product & Design Work:
Designed calling UI with 3 iterations based on user feedback
Built call controls (mute, speaker, video toggle, end call)
Created privacy controls (who can call, quiet hours, call screening)
Designed network quality indicators and degradation messaging
Tested with 1000 beta users (500 creators, 500 regular users)
Beta Testing Results:
Call completion rate: 80% (target: 75%)
Call quality score: 4.1/5 (target: 4.0)
P95 latency: 130ms (target: 150ms)
User satisfaction: 4.2/5
Top feedback: "Love the audio-first approach" and "Privacy controls are great"
Issues found: 15 bugs (all fixed before Phase 2)
Key Decisions Made:
Confirmed audio-first approach was right (users preferred it 2:1 over video-first)
Validated graceful degradation (call completion rate 80% vs. 45% without it)
Confirmed creator controls were sufficient (90% of beta creators kept calling enabled)
Phase 2: Shadow Mode (Weeks 1-2)
Goal: Validate infrastructure at scale without exposing users to potential failures.
What is Shadow Mode?
Shadow mode means the calling infrastructure runs in parallel with the production messaging system, processing real call requests but not actually connecting calls. This lets us measure performance, identify bottlenecks, and catch bugs before users are affected.
Infrastructure Testing:
Deployed calling infrastructure to production (15 regions, 500 servers)
Processed 10M simulated call requests per day (equivalent to 10% of projected load)
Measured end-to-end latency from call initiation to connection establishment
Tested load balancer behavior under sustained load
Tested failure scenarios: server crashes, network partitions, database outages
Validated automatic failover and recovery
Tested graceful degradation under extreme load
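The mechanics of shadow mode reduce to request mirroring: duplicate a fraction of real traffic to the new stack, record its latency and outcome, and always discard its response. A sketch under assumed names, not the actual Meta service interfaces:

```python
import random

# Minimal sketch of shadow mode: mirror a fraction of real requests to
# the new calling stack, record outcomes, but always return only the
# production response so users never see shadow failures. Names are
# illustrative.

def handle_request(request, production, shadow, metrics, sample_rate=0.10):
    response = production(request)            # the only response users see
    if random.random() < sample_rate:         # mirror ~10% of projected load
        try:
            result = shadow(request)          # exercised, then thrown away
            metrics.append(("shadow_ok", result))
        except Exception as exc:              # shadow failures never surface
            metrics.append(("shadow_error", repr(exc)))
    return response

def failing_shadow(request):
    raise RuntimeError("LB dropped long-lived conn")   # the kind of bug this catches

metrics = []
out = handle_request({"caller": 1, "callee": 2},
                     production=lambda r: "message_delivered",
                     shadow=failing_shadow,
                     metrics=metrics,
                     sample_rate=1.0)
print(out, metrics)   # production result unaffected; error captured for analysis
```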
Critical Issues Found:
Load balancer bug: Couldn't handle persistent WebRTC connections (designed for short HTTP requests). Had to rewrite load balancing logic to support long-lived connections.
Database bottleneck: Call metadata writes were creating hotspots. Sharded database by user ID to distribute load.
CDN routing issue: Media relay was routing through suboptimal paths, adding 50-100ms latency. Reconfigured CDN routing tables.
Week 3-4: 1% Rollout (10M users)
On-call: 24/7 on-call rotation with 5-minute response time SLA
Communication: In-app announcement: "Try the new calling feature!"
What We Learned:
Engagement exceeded expectations: Users made 0.8 calls/day vs. 0.5 target. Calling was more popular than predicted.
Network quality was the #1 complaint: 15% of users complained about call quality on poor networks. Graceful degradation was working but UI messaging needed improvement.
iPhone X bug: Calls dropped after 60 seconds on iPhone X (iOS 11.2). Apple acknowledged bug and fixed in iOS 11.3. We added workaround for older iOS versions.
Creator adoption was strong: 85% of creators in 1% group enabled calling, 70% made at least one call.
Week 5-6: 10% Rollout (100M users)
Audience: All countries, iOS + Android, all ages
New challenges: Android fragmentation (1000+ device types), international networks (2G/3G in emerging markets), language/cultural differences
Critical Issues Found:
India network issue: Calls failing 40% of the time in India due to carrier-level NAT traversal issues. Built custom TURN server infrastructure in Mumbai and Bangalore. Improved completion rate from 60% to 82%.
Low-end Android devices: Devices with <2GB RAM were crashing during video calls. Added memory monitoring and automatic video disabling on low-memory devices.
ML false positives: Models flagged 8% of legitimate calls as spam (vs. 2% in beta). Issue: models trained on Messenger data didn't generalize to Instagram usage patterns. Retrained with Instagram-specific data, reduced false positives to 1.8%.
Language/cultural issues: In some cultures (Japan, Korea), calling without prior arrangement is considered rude. Added "Request to call" feature where caller sends request first, recipient approves.
Week 7-12: Ramp to 100%
Week 8: Added "Request to call" feature for cultural sensitivity (adopted by 12% of users in Japan/Korea)
Week 9: Optimized ML models (reduced false positives from 1.8% to 1.2%)
Week 10: Added call history and missed call notifications (increased call-back rate by 25%)
Week 11: Improved low-end Android performance (reduced crashes by 40%)
Week 12: Added group calling for up to 4 people (requested by 35% of users)
Incidents & Resolutions:
Week 8, Day 3: Latency spike to 250ms in EU region. Root cause: CDN routing issue. Fixed in 45 minutes. Affected 5M users.
Week 10, Day 2: Error rate spike to 2.5% in India. Root cause: carrier-level network issue (not our fault). Worked with carrier to resolve. Fixed in 3 hours.
Week 11, Day 5: ML model false positive spike to 5%. Root cause: model drift (real-world data distribution changed). Retrained model. Fixed in 2 hours.
Key Metrics Achieved:
75% DAU adoption by month 6 ✅ (750M daily active users making calls)
99.92% uptime maintained ✅ (exceeded 99.9% SLA)
Harmful content 25% below baseline ✅ (0.12% vs. 0.16%)
Call completion rate: 85% ✅ (vs. 70% industry average)
Infrastructure cost: 40% below budget ✅ ($72M/year vs. $120M budgeted)
Team Retrospective:
What went well: Crawl-walk-run rollout, strong creator controls, graceful degradation
What could be better: Should have built better operator tools, tested on more device types earlier
Lessons learned: Shadow mode is essential, metrics drive decisions, cross-functional coordination is the bottleneck
The Results
Adoption Metrics
The adoption curve exceeded all projections, reaching 75% DAU in 6 months against a 50% target.
Calling drove significant business value beyond direct engagement:
User Engagement:
+15% time in app (from 28 min/day to 32 min/day)
+12% daily active users (calling brought back lapsed users)
+8% weekly retention (users who called were more likely to return)
+25% cross-feature usage (calling users also used Stories, Reels, messaging more)
Messaging Growth:
+40% messages sent (calling drove messaging, not cannibalized it)
+35% new conversations started (calling broke the ice, led to more messaging)
+20% group chat creation (users formed groups after calls)
Creator Impact:
+8% creator retention (creators stayed on platform longer)
+12% creator content production (creators posted more after connecting with fans)
+18% fan engagement (fans who called creators engaged more with their content)
New creator revenue stream: Enabled paid 1-on-1 calls (launched 6 months later, $50M GMV in first year)
Infrastructure & Cost:
40% below budget ($72M/year actual vs. $120M budgeted)
Audio-first strategy saved $48M/year in bandwidth costs
Graceful degradation reduced support load by 30% (fewer "call failed" complaints)
Reused 60% of Messenger/WhatsApp infrastructure (saved $20M in development costs)
Competitive Impact:
Reduced switching to WhatsApp/Messenger by 35% (users stayed in Instagram for calls)
Slowed Snapchat growth by 8% (Instagram calling was competitive with Snapchat's offering)
Increased Instagram's "stickiness" (harder for users to leave when all communication is in one app)
Key Tradeoffs
Tradeoff 1: Audio-First vs. Video-First
Chose: Audio-first with video opt-in
Gained: Lower cost, faster rollout, better reliability
Lost: Less differentiation vs. competitors, lower "wow factor"
Would I do it again? Yes. Audio-first was the right call for scale.
Tradeoff 2: Privacy vs. Safety
Chose: Metadata-based detection with opt-in audio analysis
Gained: User trust, GDPR compliance, scalable moderation
Lost: Some harmful content slipped through (couldn't analyze audio)
Would I do it again? Yes, but would invest more in metadata signals.
Tradeoff 3: Speed vs. Perfection
Chose: Ship in 6 months with 85% call completion vs. wait for 95%
Gained: First-mover advantage, faster learning, earlier revenue
Lost: Some user frustration, higher support load initially
Would I do it again? Yes. 85% was good enough, and we hit 92% within 3 months.
Tradeoff 4: Creator Controls vs. Discoverability
Chose: Strong default controls (only followers can call)
Gained: Creator trust, low harassment, high adoption
Lost: Harder for fans to reach creators, less spontaneous connection
Would I do it again? Yes. Lost trust is hard to rebuild; strict defaults are easy to loosen later.
Lessons Learned
1. Start with the Constraint, Not the Feature
We didn't start with "build the best calling experience." We started with "how do we add calling without breaking Instagram?" That constraint led to better decisions (audio-first, gradual rollout, strong controls).
2. Crawl-Walk-Run Saves You Every Time
The gradual rollout caught 12 major issues that would have been catastrophic at 100% traffic. Shadow mode alone found 3 critical bugs. Never skip the crawl phase.
3. Trust & Safety is a Product Feature, Not an Afterthought
We built safety controls before general rollout. This made creators comfortable and prevented a harassment crisis. Safety should be in the MVP, not v2.
4. Graceful Degradation > Perfect Quality
Users preferred a working audio call over a broken video call. Build systems that degrade gracefully, not fail catastrophically.
5. Metrics Drive Decisions, Not Opinions
We had 60+ engineers with strong opinions. Metrics (latency, completion rate, safety) cut through debate and aligned the team. Instrument everything.
6. Cross-Functional Coordination is the Bottleneck
With 60+ people across 4 time zones, coordination was harder than the technical work. Weekly syncs, clear DRIs, and written decision docs were essential.
7. Users Surprise You
We thought video would dominate. Users preferred audio (60% of calls). We thought creators would disable calling. 90% kept it on. Always validate assumptions with real users.
What I'd Do Differently
1. Invest More in Network Quality Prediction
We built reactive degradation (drop quality when network fails). Should have built predictive degradation (drop quality before it fails). Would have improved completion rate by 5-10%.
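One way the predictive idea could work: extrapolate a short-horizon trend from recent bandwidth samples and downgrade when the forecast crosses the tier floor, before the call actually starves. Purely illustrative, not something that shipped:

```python
# Sketch of predictive degradation: downgrade when the *forecast* crosses
# the tier floor, not just when the current sample does. Illustrative only.

def forecast_kbps(samples, horizon=3):
    """Linear extrapolation over the last few one-second samples."""
    recent = samples[-5:]
    slope = (recent[-1] - recent[0]) / (len(recent) - 1)
    return recent[-1] + slope * horizon

def should_downgrade(samples, tier_floor_kbps):
    # Reactive check: already below the floor. Predictive: will be soon.
    return samples[-1] < tier_floor_kbps or forecast_kbps(samples) < tier_floor_kbps

falling = [400, 340, 280, 220, 160]       # WiFi fading as the user walks away
print(should_downgrade(falling, tier_floor_kbps=100))    # → True (purely predictive)
print(should_downgrade([300] * 5, tier_floor_kbps=100))  # → False
```

In the falling example the current sample (160 kbps) is still above the floor, so a reactive system would wait for the call to break; the forecast fires three seconds early.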
2. Ship Creator Analytics Sooner
Creators wanted to know who called them, when, and for how long. We shipped this in month 4. Should have been in MVP. Would have increased creator adoption faster.
3. Build Better Operator Tools
Our trust & safety team had basic dashboards. Should have built real-time intervention tools (pause calls, send warnings, etc.). Would have reduced harmful content by another 10%.
4. Test on More Device Types
We tested on flagship devices and missed issues on low-end Android phones (30% of users). Should have tested on 20+ device types before rollout.
Frameworks Used
This case study demonstrates several frameworks in action: