Delivering Millions of Messages within seconds @ Duolingo

Building a notification system may seem trivial, but what about building one that could reach millions of users within a few seconds? What about doing that right after your advertisement airs?

Event-based notification systems are no longer uncommon, but cost-effective examples of an on-demand, highly parallel notification system are rare. The complexity of building such a system comes from the intersection of system design, site reliability, and cloud resource management. All of that has to happen under the pressure of an unhinged marketing campaign running over TV and the Web.

Vitor Pellegrino

April 18, 2024

Transcript

  1. Duolingo’s mission is to develop the best education in the world and make it universally available.
  2. Why ECS? Why not Kubernetes? Reduced complexity was very important in the beginning; AWS native constructs; size of the platform team.
  3. Jeeves (diagram labels): Zendesk (CS reports, Twitter), AppFigures (reviews), Jira (str tickets), external format, internal format, S3, ElasticSearch, Shake to report.
  4. To send 4M notifications in 5 seconds… Speed (How fast?), Scale (How big?), Timing (What time?). Operating principle: Embrace challenges!
  5. Technical Requirements: How do we send notifications at the target rate? How do we get all the cloud resources we need in time? How do we ensure resiliency/idempotency?
  6. Use case A: upload a campaign. 1. Engineer creates a campaign with a list of user IDs. 2. Server acknowledges the request, then asynchronously starts fetching data from DynamoDB. 3. Server puts the parsed data in S3 and logs the result to CloudWatch.
  7. Use case B: prepare the send. 1. Cloud operation admin scales up the ASG. 2. Engineer scales up workers in the ECS console. 3. Workers fetch data from S3 and store it in memory (a mapping of user -> deviceID). 4. Workers log their “complete” status to CloudWatch.
  8. Use case C: hit the GO button. 1. Marketing admin hits the “GO” button to send out the campaign. 2. API server dispatches 50+ messages to a FIFO SQS queue. 3. Interim workers dispatch 10k+ SQS messages to the next SQS queue. 4. Notification workers send notifications by calling the APNS/FCM batch APIs. 5. Workers log their processing time to CloudWatch.
  9. Does it send notifications at the target rate? Desired send rate: 4M/5s = 800,000 messages/second. SQS in-flight message limit: 120,000 messages/second. What’s the solution? Batching (size differs by iOS/Android).
  10. Can we provision all the cloud resources we need in time? 1. A technical contact from AWS 2. An IEM (Infrastructure Event Management) document 3. Spot instances 4. Dedicated ECS cluster
  11. Can we ensure resiliency? We leveraged a FIFO queue from the AWS SQS service: • Dedupe by message identifiers • Has a deduplication window of 5 minutes • Limited capacity (300 requests/second). Alternative: a cache/a table
  12. System Diagram: 1. Send notifications at the target rate? 2. Can provision all the cloud resources we need in time? 3. Resiliency?
  13. How we tested it. Operating principle: Test it first! 1. Throughput tests 2. Cloud resource tests 3. Testing with real users
  14. How we tested it: throughput. This is “how we tested the MVP”. 1. Test with silent notifications: a. Bottleneck -> Thread count. 2. Test with # of Threads: a. 10 -> 5 -> 1 b. Bottleneck: ??? 3. Test with a larger audience: a. 500k -> 3M b. Bottleneck -> task count. 4. Test with # of Processes: a. 1 -> 4 -> 2
  15. How we tested it: cloud resources. This is “how we tested the Cloud”. 1. Can we scale up superb-owl? 2. Can we scale up backend? 3. Can we scale up both? 4. Can we scale up both in <3 hours?
  16. How we tested it: real users. October: 1M users; November: 2M users; January: 4M users. Thanks to Zombie mode, we are more comfortable with testing with real users. Lesson learned: Send yourself a copy before sending to the users.
  17. Campaign Results: 99% of notifications were out in 5.7 seconds, 95% in 3.9 seconds. Figma blog post: The anatomy of a Super Bowl ad. Wall Street Journal: Duolingo’s Mascot Has a 'Buttception' in a Five-Second Regional Super Bowl Ad …
  18. Lessons learned: 1. Building on a solid foundation is important. 2. Be open-minded about design, rigorous with testing. 3. Always build systems with resilience + robustness. 4. Things can always go wrong, so accept it.
  19. What would we do differently? 1. Python -> Async Python? -> Go? 2. Optimize memory -> Less cost and intervention 3. More automation -> Less error
  20. Q&A. Join a life-changing mission that brings us together! Visit careers.duolingo.com for more information.
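
The arithmetic on slide 9 can be made concrete with a small sketch. It computes the required send rate and shows how batching shrinks the number of queue messages needed. The batch sizes used here (500 and 1,000) are purely illustrative assumptions; the slide only says that batch size differs between iOS and Android.

```python
import math


def required_rate(total_notifications: int, window_seconds: float) -> float:
    """Notifications per second needed to finish within the window."""
    return total_notifications / window_seconds


def queue_messages_needed(total_notifications: int, batch_size: int) -> int:
    """Queue messages needed when each message carries one batch."""
    return math.ceil(total_notifications / batch_size)


# Figures from the slide: 4M notifications in 5 seconds.
TOTAL = 4_000_000
WINDOW_SECONDS = 5

# Hypothetical batch sizes, chosen only for illustration.
ASSUMED_BATCH_SIZES = {"ios": 500, "android": 1_000}

rate = required_rate(TOTAL, WINDOW_SECONDS)
messages_per_platform = {
    platform: queue_messages_needed(TOTAL, size)
    for platform, size in ASSUMED_BATCH_SIZES.items()
}
```

With one notification per queue message the queue would have to carry 800,000 messages per second. Under the assumed batch sizes the same campaign fits in a few thousand queue messages, which is why batching moves the bottleneck away from the queue.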
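
The two-stage fan-out in slide 8 (50+ messages from the API server, then 10k+ messages from the interim workers) might look roughly like the sketch below. It only shows the partitioning logic on plain lists. The function names, the shard count, and the batch size are assumptions for illustration, and no real SQS calls are made.

```python
import math
from typing import Iterator, List, Sequence


def chunk(items: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def split_into_shards(user_ids: Sequence[int], shard_count: int) -> List[Sequence[int]]:
    """Stage 1 (API server): a small number of coarse shards."""
    shard_size = max(1, math.ceil(len(user_ids) / shard_count))
    return list(chunk(user_ids, shard_size))


def fan_out(shard: Sequence[int], batch_size: int) -> List[Sequence[int]]:
    """Stage 2 (interim worker): split one shard into send-sized batches."""
    return list(chunk(shard, batch_size))


# Toy numbers: 1,000 users, 50 shards, batches of 4 users each.
user_ids = list(range(1000))
shards = split_into_shards(user_ids, shard_count=50)
batches = [batch for shard in shards for batch in fan_out(shard, batch_size=4)]
```

Splitting in two stages keeps the first dispatch small and fast, while the expensive expansion into many messages runs in parallel across the interim workers.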
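
Slide 11 mentions "a cache/a table" as an alternative to FIFO-queue deduplication. A minimal sketch of that idea follows, assuming a 5-minute window keyed by message identifier. The `DedupeWindow` class is hypothetical and keeps its state in process memory; a real deployment would need a store shared by all workers.

```python
import time
from typing import Callable, Dict


class DedupeWindow:
    """Remembers message IDs for a fixed window and rejects repeats."""

    def __init__(
        self,
        window_seconds: float = 300.0,  # 5 minutes, matching the slide
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}

    def first_time(self, message_id: str) -> bool:
        """Return True if the ID was not seen within the window."""
        now = self._clock()
        # Drop entries whose window has elapsed.
        expired = [k for k, t in self._seen.items() if now - t >= self._window]
        for key in expired:
            del self._seen[key]
        if message_id in self._seen:
            return False
        self._seen[message_id] = now
        return True
```

A worker would call `first_time(message_id)` before sending and skip the send when it returns `False`, so a redelivered queue message does not produce a second notification.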