Design For Millions: Creating Services That Serve Millions of Concurrent Users

Check out the slide deck: Design For Millions - Mebin J Thattil

Introduction

Designing services that can serve millions of users concurrently is nothing short of an engineering marvel. The engineering behind such systems often goes unnoticed, but if you need to build one, you have to look under the hood.

Core Principles

When building services that need to scale to millions of users, there are several fundamental objectives that must be achieved:

1. Scalability

What is scalability? Simply put, it's how well your system can handle growth. This growth can manifest in several ways: more users, more data, and more requests per second.

There are two primary approaches to scaling:

Vertical Scaling: Beefing up your current machine by upgrading its specs. Think of it as upgrading from a 2GB RAM server to 4GB, then to 8GB. While straightforward, this approach hits physical and economic limits quickly.

Horizontal Scaling: In this approach we add more machines to distribute the load. Instead of one powerful server, you have multiple servers working together. This is the approach that enables true large-scale systems.

2. Compatibility

The internet runs on protocols - mutual agreements on how systems communicate. As technology evolves, new protocols emerge that offer better performance, compression, or security. However, this creates a dilemma: the newest protocol is rarely supported by every client, but dropping older protocols locks out real users.

So we try to balance optimization with backward compatibility.

3. Reliability

Reliability means keeping systems running at all times and preventing failures. This involves redundancy, isolating failures, and recovering automatically.

If all of that fails, the system must have a plan B where core services stay up even if other services go down.

4. Observability

"If you can't measure it, you can't improve it." Observability is about understanding what's happening in your system at any given moment. This is achieved through:

The LGTM stack (Loki, Grafana, Tempo, Mimir) is a popular open-source solution for observability:
- Loki: Log aggregation
- Grafana: Visualization and dashboards
- Tempo: Distributed tracing
- Mimir: Long-term metrics storage
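
As a toy illustration of the metrics half of this, here is a minimal sketch assuming the `prometheus_client` Python package and a Prometheus-compatible backend (such as Mimir) scraping the endpoint; the metric names are made up for the example:

```python
from prometheus_client import Counter, Histogram, start_http_server
import random, time

# Hypothetical metrics for a web service (names are illustrative only).
REQUESTS = Counter("app_requests_total", "Total HTTP requests", ["endpoint"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency", ["endpoint"])

def handle_request(endpoint: str) -> None:
    with LATENCY.labels(endpoint).time():       # record how long the handler took
        REQUESTS.labels(endpoint).inc()         # count the request
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a Prometheus/Mimir scraper
    while True:
        handle_request("/blog")
```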

5. Failover Mechanisms

When failures occur (and they will), you need strategies to detect them quickly, shift traffic to healthy components, and recover without losing data.

6. Reduced Costs

At scale, even small inefficiencies become expensive. Cost optimization involves right-sizing infrastructure, using cheaper capacity (such as spot instances) where it fits, and eliminating waste.

7. Performance

Performance isn't just about speed - it's about consistent, predictable behavior under all conditions. This includes low latency, high throughput, and staying predictable even at peak load.

Understanding Scalability

Let's dive deeper into scalability since it's the foundation of everything else.

Types of Scaling

Vertical Scaling (Scaling Up)

When your system starts running out of resources, the simplest solution is to upgrade the hardware. However, this approach has significant limitations: hardware only gets so big, prices climb steeply at the high end, and you are still left with a single point of failure.

Horizontal Scaling (Scaling Out)

Adding more machines to distribute load is the path to true scalability.

The core idea: at some point it becomes cheaper to run two 4GB machines than one 8GB machine, and you get better reliability as a bonus.

Bob's Journey: A Simplified Story

To understand these principles practically, here is a small story that's hopefully more relatable.

Chapter 1: The Beginning

Bob starts simple:
- Rents a T2 micro EC2 instance
- Deploys a blogging website
- Everything works perfectly... initially

Chapter 2: First Success

Bob's blogs go viral! Suddenly:
- Traffic increases 10x
- Memory fills up
- Website becomes painfully slow
- His friends yell at him

Bob's Solution: Upgrade to a 4GB instance (vertical scaling). Cost: $1/month increase.

Chapter 3: Even More Success

The blogs become even more popular:
- Traffic increases another 5x
- The 4GB instance is maxed out
- Two options:
  - 4GB → 8GB upgrade: $5/month
  - Get a second 4GB instance: $2/month (2 × $1)

Bob's Realization: Horizontal scaling is more cost-effective! But this introduces a new problem: How to distribute traffic?

Chapter 4: Load Balancing

With two servers, Bob needs a load balancer:
- Health checks: Monitors if servers are responsive
- Traffic distribution: Spreads requests across healthy servers
- Automatic failover: Routes traffic away from failing servers

There are many ways to load balance an application; some common methods are round robin, least connections, and IP hashing.
Which method you choose depends on the use case.
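
To make those strategies concrete, here is a minimal sketch of round robin and least connections in Python (the server addresses are made up):

```python
from itertools import cycle

SERVERS = ["app-1:8080", "app-2:8080"]  # hypothetical backend addresses

# Round robin: hand out servers in a fixed rotation.
_rr = cycle(SERVERS)
def round_robin() -> str:
    return next(_rr)

# Least connections: pick the server with the fewest active connections.
active = {s: 0 for s in SERVERS}
def least_connections() -> str:
    return min(active, key=active.get)

# A real load balancer would also run health checks and skip unhealthy servers.
if __name__ == "__main__":
    for _ in range(4):
        print("round robin ->", round_robin())
    active["app-1:8080"] = 3
    print("least connections ->", least_connections())  # picks app-2
```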

Chapter 5: Global Reach

Bob's friend in the US complains the site is slow. Why?

The Problem: Physical distance. Data travels at the speed of light, but:
- Network hops add latency
- Long distances mean longer round-trip times
- International routing can be inefficient

The Solution: CDN (Content Delivery Network)

A CDN caches content at edge locations worldwide:
- Static content (images, CSS, JS) served from nearby servers
- Reduced latency for users
- Lower bandwidth costs for origin servers
- DDoS protection as a bonus
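
Conceptually, an edge node is just a cache sitting in front of the origin. A tiny sketch of that idea (the TTL is an arbitrary example):

```python
import time, urllib.request

CACHE: dict[str, tuple[float, bytes]] = {}  # url -> (expiry timestamp, body)
TTL_SECONDS = 60                            # illustrative TTL for static assets

def edge_fetch(url: str) -> bytes:
    """Serve from the edge cache if fresh, otherwise fall back to the origin."""
    now = time.time()
    hit = CACHE.get(url)
    if hit and hit[0] > now:
        return hit[1]                          # cache hit: no trip to the origin
    body = urllib.request.urlopen(url).read()  # cache miss: fetch from origin
    CACHE[url] = (now + TTL_SECONDS, body)
    return body
```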

Chapter 6: Live Streaming Challenges

Bob wants to add live streaming to his site. He immediately faces multiple challenges: encoding and transcoding video, distributing it with low latency, and handling far more bandwidth than a blog ever needed.

Bob realizes he needs to study how the experts do it. Enter: Hotstar.

The Hotstar Case Study

Hotstar (now JioHotstar) is one of the world's largest video streaming platforms. During the 2019 ICC Cricket World Cup, they achieved something unprecedented: they served a maximum of 25.3 million concurrent viewers 🤯. No platform had come close to this scale before.

The Concurrency Graph for the 2019 World Cup

(Image: concurrency graph)

1 -> Toss (beginning): 5-10M viewers tune in
2 -> Match Starts: Gradual climb to 15M viewers
3 -> Rain delay: 15M → 3M (most switch away)
4 -> Match Resumes: 3M → 15M in minutes (massive spike!)
5 -> Innings Break: Small dip
6 -> India batting: 15M → 20M+ (peak engagement)
7 -> Critical moments: Can spike to 25M+
8 -> Match ends: Rapid drop-off

The hardest part isn't handling 25M concurrent users - it's handling the rapid spikes and drops:

The Dhoni Effect: When Thala got out, viewership dropped from 25M to 1M in minutes. The system had to gracefully handle both the peak and the sudden decline.

Rain Resumes: When play resumed after rain, viewership jumped from 3M to 15M almost instantly. The system had to scale from serving 3M to 15M in under a minute.

Architecture Overview

Hotstar's architecture is designed with multiple layers of redundancy and optimization:

(Image: Hotstar's architecture)

External API Gateway (Akamai CDN):
- Security checks
- Rate limiting
- DDoS protection
- Request routing

Internal API Gateway:
- Sits behind Application Load Balancers
- Routes requests to appropriate backend services
- Handles authentication and authorization

Backend Services:
- Microservices architecture

Databases:
- A mix of self-hosted and managed database instances

API Segregation

A eureka moment - realizing that not all requests need to be treated the same way - helped optimize their systems.

The team realized they could segregate APIs into two categories:

Cacheable Content:
- Scorecards
- Concurrency metrics
- Key moments/highlights
- Match statistics

Non-cacheable Content:
- User-specific data
- Personalized recommendations
- Authentication tokens
- Real-time user interactions

By treating these differently, they could:
- Apply lighter security and rate controls to cacheable content
- Reduce stress on compute capacity
- Increase overall throughput

They created a separate CDN domain specifically for cacheable content with:
- Optimized configuration rules for cacheable paths
- Better isolation from core services
- Protection against misconfigurations affecting critical services
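
A rough sketch of what that segregation could look like at the routing layer (the paths and header values are illustrative, not Hotstar's actual configuration):

```python
# Hypothetical path prefixes for the two categories of APIs.
CACHEABLE_PREFIXES = ("/scorecard", "/concurrency", "/key-moments", "/stats")

def response_headers(path: str) -> dict[str, str]:
    """Pick cache behaviour based on whether the path serves shared or user-specific data."""
    if path.startswith(CACHEABLE_PREFIXES):
        # Shared content: let the CDN cache it and absorb most of the traffic.
        return {"Cache-Control": "public, max-age=5"}
    # User-specific content: must always hit the backend.
    return {"Cache-Control": "private, no-store"}

print(response_headers("/scorecard/live"))  # cacheable at the edge
print(response_headers("/user/watchlist"))  # never cached
```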

Infrastructure Challenges

1. NAT Gateway Bottleneck

The Problem: At just 1/10th of peak traffic load, a single Kubernetes cluster was using 50% of the NAT Gateway's network throughput capacity.

Standard Setup:
- One NAT Gateway per AWS Availability Zone (AZ)
- All subnets route external traffic through that gateway
- This became a bottleneck at scale

Solution: Provision NAT Gateways at the subnet level instead of AZ level. This provided more network paths and distributed the throughput load.

2. Kubernetes Node Throughput

The Problem: Network throughput analysis revealed:
- Some services consumed 8-9 Gbps per node
- Nodes running multiple internal API Gateway pods simultaneously created throughput spikes
- Uneven distribution caused some nodes to max out while others were underutilized

Solution:
1. Deploy high-throughput nodes (minimum 10 Gbps network capacity)
2. Implement topology spread constraints to ensure each node hosted only a single API Gateway pod
3. Result: Throughput per node stayed at 2-3 Gbps even during peak load
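
The topology spread constraint from step 2 normally lives in the pod spec; here it is sketched as a Python dict mirroring that YAML, just to show the shape (the `app: internal-api-gateway` label is a made-up example):

```python
# Pod-spec fragment (as a dict): spread API Gateway pods evenly across nodes,
# so each node ends up hosting at most one of them when enough nodes exist.
topology_spread = {
    "topologySpreadConstraints": [{
        "maxSkew": 1,                             # pod counts between nodes may differ by at most 1
        "topologyKey": "kubernetes.io/hostname",  # treat every node as its own topology domain
        "whenUnsatisfiable": "DoNotSchedule",     # refuse to schedule rather than pile onto one node
        "labelSelector": {"matchLabels": {"app": "internal-api-gateway"}},
    }]
}

print(topology_spread)
```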

3. The IP Address Crisis

The Setup: Before the 2023 World Cup:
- 3 subnets across multiple AZs
- Each with /20 CIDR blocks (~4,090 IPs per subnet)
- Total: ~12,300 IP addresses

The Problem: They needed 400 Kubernetes nodes but couldn't scale beyond 350. Why? They ran out of IP addresses!

But wait - 12,300 IPs should be enough for 400 nodes, right? Not quite.

Understanding AWS ENI (Elastic Network Interface):
- Each Kubernetes node can run multiple pods
- Each pod needs its own IP address
- AWS pre-allocates IPs for faster pod startup (propagation takes time)
- This means each node requires multiple IPs to be reserved

Default AWS CNI Settings:
- MINIMUM_IP_TARGET: 35 IPs per node
- WARM_IP_TARGET: 10 additional IPs pre-allocated

So each node actually required ~45 IP addresses!

With 350 nodes × 45 IPs = 15,750 IPs required (they only had 12,300 available)

Solution:
1. Modified VPC CNI configuration:
- MINIMUM_IP_TARGET: 35 → 20
- WARM_IP_TARGET: 10 → 5
2. Added new subnets with larger /19 CIDR blocks (~8,190 IPs per subnet)
3. Result: ~48,000 additional IP addresses available
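
A quick back-of-the-envelope check of those numbers (a rough model that assumes every node reserves MINIMUM_IP_TARGET + WARM_IP_TARGET addresses):

```python
def ips_needed(nodes: int, minimum_ip_target: int, warm_ip_target: int) -> int:
    # Rough model: every node reserves the minimum pool plus the warm pool.
    return nodes * (minimum_ip_target + warm_ip_target)

available_before = 3 * 4090                       # three /20 subnets, ~4,090 usable IPs each
print(ips_needed(350, 35, 10), available_before)  # 15750 needed vs ~12270 available -> stuck
print(ips_needed(400, 20, 5), available_before)   # 10000 needed vs ~12270 available -> 400 nodes fit,
                                                  # even before adding the larger /19 subnets
```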

Testing at Scale: Project Hulk

You can't just hope your system works at scale - you need to test it. Hotstar built an in-house load testing system called Project Hulk:

Specifications:
- C5.9xlarge EC2 instances
- Distributed across 8 AWS regions
- 108,000 CPUs
- 216 TB RAM
- 200 Gbps network output capacity

Purpose:
- Simulate millions of concurrent users
- Test system behavior under various load patterns
- Identify bottlenecks before the actual event
- Verify autoscaling and failover mechanisms

Proactive Scaling

The Math: Imagine 1 million users join per minute:
- Application boot time: 1 minute
- EC2 instance provisioning + Load Balancer health check: 5-6 minutes
- By the time new instances are ready, 5-6 million more users have arrived
- Reactive scaling can't keep up!

Solution: Proactive scaling with buffers

Instead of reacting to load, Hotstar:
- Predicted traffic patterns based on historical data
- Pre-scaled infrastructure before expected spikes
- Maintained buffer capacity above predicted peak
- Had "ladders" defined: For X million users → provision Y servers

Why Not AWS Autoscaling?

AWS Autoscaling has limitations at extreme scale:

  1. Insufficient Capacity Errors: When requesting a large number of instances, AWS may have run out of that specific instance type
  2. Single Instance Type per Auto Scaling Group: Limits flexibility - e.g., if one instance type is unavailable, you might want to spin up a different one
  3. Step Size Limitations: Can't scale fast enough for rapid spikes; it usually increments in steps of 10-20 instances
  4. Retry Logic: Exponential backoffs and AZ unavailability create large queues, which eventually makes things worse for the system

Hotstar's Custom Solution:
- Scales based on RPS (Requests Per Second) and RPM (Requests Per Minute) instead of CPU/RAM
- Uses "ladders": Pre-defined rules like "for 10M users → 500 servers"
- Secondary scaling groups with more powerful instances as fallback
- Utilizes Spot Instances for cost savings
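
A toy version of the "ladder" idea (the 10M → 500 rung comes from the example above; the other thresholds, counts, and the buffer are invented for illustration):

```python
# Each rung: (concurrent users up to this many, servers to have running).
LADDER = [
    (1_000_000, 50),
    (5_000_000, 200),
    (10_000_000, 500),
    (25_000_000, 1_300),
]
BUFFER = 1.2  # keep ~20% headroom, since reactive scaling can't catch a sudden spike

def target_servers(predicted_users: int) -> int:
    """Pick the first rung that covers the predicted concurrency, plus buffer."""
    for ceiling, servers in LADDER:
        if predicted_users <= ceiling:
            return int(servers * BUFFER)
    return int(LADDER[-1][1] * BUFFER)  # beyond the top rung, use the largest one

print(target_servers(8_000_000))  # falls on the 10M rung -> 600 servers
```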

Panic Mode: Graceful Degradation

When things go wrong (and they will), you need a plan. Hotstar implements "Panic Mode":

Priority System:
- P-0 Services: Critical (must ALWAYS be up)
  - Video playback
  - Authentication
  - Basic navigation
- P-1 Services: Important but not critical
  - Detailed statistics
  - Commentary
  - Social features
- P-2+ Services: Nice-to-have
  - Recommendations
  - User profiles
  - Historical data

During Panic Mode:
1. Turn off non-critical services (P-1 and below)
2. Redirect all resources to P-0 services
3. Bypass certain checks if necessary to keep core services running
4. User experience degrades (fewer features) but doesn't disrupt (service stays up)

This ensures that even in worst-case scenarios, users can still watch the stream.
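
One simple way to model Panic Mode is priority-tagged feature flags; a minimal sketch (the service names and priorities below are illustrative):

```python
# Priority per feature: 0 = P-0 (must always be up); higher numbers get shed first.
FEATURES = {
    "video_playback": 0,
    "authentication": 0,
    "commentary": 1,
    "detailed_stats": 1,
    "recommendations": 2,
}

panic_level = None  # None = normal operation; 1 = "only P-0 stays on"

def is_enabled(feature: str) -> bool:
    """During Panic Mode, any feature at or above the panic level is switched off."""
    return panic_level is None or FEATURES[feature] < panic_level

panic_level = 1
print([f for f in FEATURES if is_enabled(f)])  # only video_playback and authentication remain
```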

Video Streaming Architecture

Let's understand how video streaming actually works, since this is core to Hotstar's service.

The Streaming Pipeline

  1. Capture Input: Video source (camera, screen capture, file)
  2. Media Encoding: Convert raw video to digital format
  3. Transcoding: Convert to multiple quality levels (240p, 360p, 480p, 720p, 1080p, 4K)
  4. Segmentation: Break video into small chunks (typically 2-10 seconds)
  5. Manifest Generation: Create playlist files (.m3u8) that tell clients how to fetch segments
  6. Distribution: Send segments (.ts files) to CDN
  7. Delivery: Serve via HTTP to clients
  8. Client Playback: Player downloads segments and stitches them together
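
Steps 4 and 5 are easier to picture with an example: a minimal sketch that builds an HLS-style media playlist for a few 6-second segments (the segment file names are placeholders):

```python
def build_playlist(segment_names: list[str], segment_seconds: int = 6) -> str:
    """Build a bare-bones HLS media playlist (.m3u8) for pre-cut .ts segments."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{segment_seconds}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for name in segment_names:
        lines.append(f"#EXTINF:{segment_seconds:.1f},")  # duration of the next segment
        lines.append(name)                               # the .ts chunk the player will fetch
    lines.append("#EXT-X-ENDLIST")                       # omit this tag for a live stream
    return "\n".join(lines)

print(build_playlist(["seg_000.ts", "seg_001.ts", "seg_002.ts"]))
```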

Adaptive Bitrate Streaming (ABR)

The magic of modern streaming is ABR - the ability to adapt video quality based on network conditions:

How It Works:
1. Encode the same video at multiple bitrates (qualities)
2. Client player monitors available bandwidth
3. Player automatically switches between quality levels
4. Seamless viewing experience without buffering

Example: If you're watching on fast WiFi at 1080p and switch to cellular, the player automatically drops to 720p or lower to prevent buffering.
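
A client-side sketch of that switching logic (the bitrate ladder and safety factor below are generic examples, not any particular player's values):

```python
# Available renditions, best first: (label, bandwidth needed in kbps).
RENDITIONS = [("1080p", 5000), ("720p", 2800), ("480p", 1400), ("360p", 800), ("240p", 400)]
SAFETY = 0.8  # only budget ~80% of measured bandwidth so small dips don't cause rebuffering

def pick_rendition(measured_kbps: float) -> str:
    """Choose the best quality level that fits within the bandwidth budget."""
    budget = measured_kbps * SAFETY
    for label, required in RENDITIONS:
        if required <= budget:
            return label
    return RENDITIONS[-1][0]  # worst case: lowest quality rather than stalling

print(pick_rendition(8000))  # fast Wi-Fi -> "1080p"
print(pick_rendition(4000))  # cellular   -> "720p"
```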

Transport Protocols

HLS (HTTP Live Streaming):
- Developed by Apple
- Uses standard HTTP
- Works with any CDN
- Good device compatibility
- Segment-based delivery (.ts files)
- Manifest files (.m3u8) describe playlist

RTMP (Real-Time Messaging Protocol):
- Legacy protocol
- Used primarily for ingestion (from encoder to server)
- Being phased out in favor of modern alternatives

Video Codecs

There are various codecs available, such as H.264 (AVC), H.265 (HEVC), VP9, and AV1.

The Compatibility-Performance Trade-off

This is where the compatibility principle comes into play: newer codecs like AV1 compress far better, but many older devices can't decode them, so you end up encoding in several codecs and serving each client the best one it supports.

At Hotstar's scale, even small bandwidth savings multiply enormously:
- 1 MB saved per user per hour
- 25M concurrent users
- 3-hour match
- = 75 TB of bandwidth saved!
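
The back-of-the-envelope math, spelled out:

```python
saved_mb_per_user_per_hour = 1
concurrent_users = 25_000_000
match_hours = 3

total_mb = saved_mb_per_user_per_hour * concurrent_users * match_hours
print(total_mb / 1_000_000, "TB saved")  # 75.0 TB
```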

Key Takeaways

Building systems for millions requires thinking differently:

1. Plan for Scale from Day One

Re-architecting a system once you're knee-deep is much harder than building it right initially.

2. Measure Everything

You can't improve what you can't measure. Try to set up observability tools early on.

3. Design for Failure

Failures will happen. Design systems that:
- Degrade gracefully
- Isolate failures
- Recover automatically
- Maintain core functionality

4. Optimize Costs Continuously

At scale, small inefficiencies become expensive. Try to be as efficient as possible.

5. Test at Scale

You can't know how your system behaves under load until you test it. After testing, identify the bottlenecks and fix them.

6. Segregate by Requirements

Not all requests are created equal:
- Identify cacheable vs. non-cacheable content
- Apply appropriate security based on sensitivity
- Create isolation between critical and non-critical services
- Optimize each category separately

7. Automate

Manual scaling doesn't work at this level:
- Automate provisioning and deployment
- Implement intelligent auto-scaling
- Build self-healing systems

Conclusion

The engineering behind services like Hotstar is a testament to careful planning, continuous optimization, and learning from real-world challenges. While not every application needs to handle 25 million concurrent users, the principles remain the same.

Both Bob's simple blog and Hotstar's massive infrastructure illustrate that scaling is a continuous process. You start simple, identify bottlenecks, solve them systematically, and iterate.
No one builds a perfect system on day 1. Build systems that evolve.


Additional Resources

Try It Yourself

Want to experience scaling challenges firsthand? Check out this cool game I found (disclaimer: it's super addictive):
Server Survival Game

This blog post is based on the presentation "Design For Millions" I gave at Talks@ACM 2026.




Powered by Not An SSG 😎