Delivering Millions of Messages within seconds @ Duolingo

Delivering Millions of Messages within seconds @ Duolingo, 08/04/2024
Presented by: Vitor Pellegrino, Zhen Zhou

Duolingo’s mission is to develop the best education in the
world and make it universally available.

Our social presence

123.4M

Philip Pacheco / Getty Images

5 seconds

4 million users

What we will cover
• General Architecture
• Deep Dive: Superb Owl

General Architecture

100s of microservices
10s of millions of requests/min

Technology stack
• Amazon RDS
• Amazon DynamoDB
• Amazon Elastic Container Service

Why ECS? Why not Kubernetes?
• Reduced complexity was very important in the beginning
• AWS native constructs
• Size of the platform team

CI/CD Galaxy Apps gRPC/OpenAPI Shareable terraform modules

Zombie mode

Jeeves
• Zendesk (CS reports, Twitter)
• AppFigures (reviews)
• Jira (str tickets)
• Shake to report
Diagram labels: external format, internal format, S3, ElasticSearch

Freeze Gun

TODO: Architecture diagram https://blog.duolingo.com/dogfooding-app/

Everything was fine… for a while

Introducing Vitor: The Marketing Manager

Oh I forgot… there is one last thing

Deep Dive: Superb Owl

Where did the “impossible” task come from?

To send 4M notifications in 5 seconds…
• Speed (How fast?)
• Scale (How big?)
• Timing (What time?)
Operating principle: Embrace challenges!

Challenge 1: Speed

Challenge 2: Scale

Challenge 3: Timing

Technical Requirements
• How do we send notifications at the target rate?
• How do we get all the cloud resources we need in time?
• How do we ensure resiliency/idempotency?

Our solution: Asynchronous!

System Diagram

Use case A: upload a campaign
1. Engineer creates a campaign with a list of user IDs.
2. Server acknowledges the request, then asynchronously starts fetching data from DynamoDB.
3. Server puts the parsed data in S3 and logs the result to CloudWatch.
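The acknowledge-then-process pattern in use case A can be sketched as follows. This is a minimal illustration, not the talk's implementation: the names `upload_campaign` and `build_campaign` are hypothetical, and plain dictionaries stand in for DynamoDB and S3.

```python
import asyncio

# Hypothetical in-memory stand-ins for DynamoDB and S3 (assumptions for this sketch).
FAKE_DYNAMODB = {"u1": ["device-a"], "u2": ["device-b", "device-c"]}
FAKE_S3 = {}


async def build_campaign(campaign_id, user_ids):
    """Background job: look up each user's devices and store the parsed result."""
    parsed = {}
    for user_id in user_ids:
        await asyncio.sleep(0)  # yield control, as a real network call would
        parsed[user_id] = FAKE_DYNAMODB.get(user_id, [])
    FAKE_S3[f"campaigns/{campaign_id}.json"] = parsed
    # Stand-in for logging the result to CloudWatch.
    print(f"campaign {campaign_id}: stored {len(parsed)} users")


async def upload_campaign(campaign_id, user_ids):
    """Request handler: schedule the work, then acknowledge immediately."""
    task = asyncio.create_task(build_campaign(campaign_id, user_ids))
    ack = {"status": "accepted", "campaign_id": campaign_id}
    return ack, task


async def main():
    ack, task = await upload_campaign("demo", ["u1", "u2", "u3"])
    # The acknowledgement is returned before the parsed data exists.
    ready_at_ack = "campaigns/demo.json" in FAKE_S3
    await task
    return ack, ready_at_ack
```

Running `asyncio.run(main())` returns the acknowledgement together with a flag showing that the data was not yet stored when the request was acknowledged.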

Use case B: prepare the send
1. Cloud operation admin scales up the ASG.
2. Engineer scales up workers in the ECS console.
3. Workers fetch data from S3 and store it in memory (a mapping of user -> deviceID).
4. Workers log their “complete” status to CloudWatch.
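Step 3 of use case B, loading the campaign data into an in-memory user -> deviceID mapping, might look like the sketch below. The JSON-lines layout and the field names `user_id` and `device_id` are assumptions; the slides do not describe the stored format.

```python
import json


def load_device_map(payload):
    """Parse a JSON-lines payload into a dict of user ID -> list of device IDs.

    The payload format is hypothetical: one JSON object per line with
    "user_id" and "device_id" keys.
    """
    mapping = {}
    for line in payload.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        mapping.setdefault(record["user_id"], []).append(record["device_id"])
    return mapping


# Example payload, as a worker might receive it after downloading the object.
SAMPLE = "\n".join([
    '{"user_id": "u1", "device_id": "device-a"}',
    '{"user_id": "u2", "device_id": "device-b"}',
    '{"user_id": "u1", "device_id": "device-c"}',
])
```

Keeping the mapping in memory means the workers need no database lookup at send time, which is the point of preparing ahead of the "GO" button.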

Use case C: hit the GO button
1. Marketing admin hits the “GO” button to send out the campaign.
2. API server dispatches 50+ messages to a FIFO SQS queue.
3. Interim workers dispatch 10k+ SQS messages to the next SQS queue.
4. Notification workers send notifications by calling the batch APIs of APNS/FCM.
5. Workers log their processing time to CloudWatch.
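The two-stage fan-out in use case C (a few dozen coarse messages, each expanded into many finer ones) can be sketched as below. The split counts of 50 and 200 are hypothetical values picked to line up with the "50+" and "10k+" figures on the slide; the function names are illustrative and no real SQS calls are made.

```python
from math import ceil


def split_evenly(items, parts):
    """Split items into at most `parts` contiguous slices of near-equal size."""
    if not items:
        return []
    size = ceil(len(items) / parts)
    return [items[i:i + size] for i in range(0, len(items), size)]


def fan_out(user_ids, first_stage=50, second_stage_per_message=200):
    """Return (coarse, fine) message lists for a two-stage dispatch.

    Stage 1: the API server splits the audience into coarse messages.
    Stage 2: each interim worker splits its coarse message into finer ones.
    Both split counts are assumptions for this sketch.
    """
    coarse = split_evenly(user_ids, first_stage)
    fine = []
    for share in coarse:
        fine.extend(split_evenly(share, second_stage_per_message))
    return coarse, fine
```

With these assumed counts, an audience of 4M users would yield 50 coarse messages and 10,000 fine messages of 400 users each, so the API server only has to enqueue a handful of messages when "GO" is pressed.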

Does it send notifications at the target rate?
• Desired send rate: 4M/5s = 800,000 messages/second
• SQS in-flight message limit: 120,000 messages/second
• What’s the solution? Batching (size differs by iOS/Android)
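The arithmetic on this slide can be checked with a short sketch. The batch size of 400 is a hypothetical value used only for illustration; the slide says only that the batch size differs by iOS/Android.

```python
from math import ceil

TOTAL_NOTIFICATIONS = 4_000_000  # from the slide
WINDOW_SECONDS = 5               # from the slide
BATCH_SIZE = 400                 # hypothetical, not stated in the talk


def required_rate(total, window_seconds):
    """Notifications per second needed to finish within the window."""
    return total / window_seconds


def queue_messages_needed(total, batch_size):
    """Queue messages needed when each one carries `batch_size` notifications."""
    return ceil(total / batch_size)


def queue_rate(total, window_seconds, batch_size):
    """Queue messages per second once notifications are batched."""
    return queue_messages_needed(total, batch_size) / window_seconds
```

Under these assumptions, batching reduces the queue traffic from 800,000 individual notifications per second to 2,000 queue messages per second, which is why batching brings the load under the queue limit.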

Can we provision all the cloud resources we need in time?
1. A technical contact from AWS
2. An IEM (Infrastructure Event Management) document
3. Spot instances
4. Dedicated ECS cluster

Can we ensure resiliency?
We leveraged a FIFO queue from the AWS SQS service:
• Dedupes by message identifiers
• Has a deduplication window of 5 minutes
• Limited capacity (300 requests/second)
Alternative: a cache/a table
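The "cache" alternative mentioned on this slide could behave like the sketch below: a message ID is accepted once, and repeats within a 5-minute window are rejected. This illustrates the idea only; it is not how SQS implements deduplication, and the class name `DedupCache` is hypothetical.

```python
import time


class DedupCache:
    """Accept each message ID at most once per deduplication window."""

    def __init__(self, window_seconds=300, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock  # injectable so the window can be tested without waiting
        self.seen = {}      # message ID -> time it was first accepted

    def accept(self, message_id):
        """Return True if the message is new, False if it is a duplicate."""
        now = self.clock()
        # Forget IDs whose window has expired.
        self.seen = {m: t for m, t in self.seen.items() if now - t < self.window}
        if message_id in self.seen:
            return False
        self.seen[message_id] = now
        return True
```

A worker would call `accept(message_id)` before sending and skip the send when it returns `False`, so a retried message does not produce a second notification. A shared store would be needed once more than one worker process is involved, which is where the "table" option comes in.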

System Diagram
1. Send notifications at the target rate?
2. Can we provision all the cloud resources we need in time?
3. Resiliency?

How we tested it
Operating principle: Test it first!
1. Throughput tests
2. Cloud resource tests
3. Testing with real users

How we tested it: throughput
This is “how we tested the MVP”
1. Test with silent notifications:
   a. Bottleneck -> Thread count
2. Test with # of Threads:
   a. 10 -> 5 -> 1
   b. Bottleneck: ???
3. Test with a larger audience:
   a. 500k -> 3M
   b. Bottleneck -> task count
4. Test with # of Processes:
   a. 1 -> 4 -> 2

How we tested it: cloud resources
This is “how we tested the Cloud”
1. Can we scale up superb-owl?
2. Can we scale up backend?
3. Can we scale up both?
4. Can we scale up both in <3 hours?

How we tested it: real users
• October: 1M users
• November: 2M users
• January: 4M users
Thanks to Zombie mode, we are more comfortable testing with real users.
Lesson learned: Send yourself a copy before sending to the users.

How we tested it… (continued)
1. Make the system foolproof + write a foolproof plan

Day of Super Bowl: The playbook

Campaign Results
• 99% of notifications were out in 5.7 seconds, 95% in 3.9 seconds
• Figma blog post: The anatomy of a Super Bowl ad
• Wall Street Journal: Duolingo’s Mascot Has a 'Buttception' in a Five-Second Regional Super Bowl Ad
…

Summary

Lessons learned
1. Building on a solid foundation is important.
2. Be open-minded about design, rigorous with testing.
3. Always build systems with resilience + robustness.
4. Things can always go wrong, so accept it.

What would we do differently?
1. Python -> Async Python? -> Go?
2. Optimize memory -> Less cost and intervention
3. More automation -> Fewer errors

What is next for us?
• Platform as a Product
• Edge computing
• MLOps

Q&A
Join a life-changing mission that brings us together! Visit careers.duolingo.com for more information.

Stay in touch!
LinkedIn: linkedin.com/in/zhen-zhou-cmuzz
LinkedIn: linkedin.com/in/vitorpellegrino

gracias!
