
Scaling The Monolith

Mateus Guimarães

January 26, 2023

Transcript

  1. Code Scalability Come on, Mateus. This term does not exist…

    We usually worry about the infrastructure side of things, but often neglect the quality and our code’s ability to grow, whether in functionality, modifications, or processing power. Well-tested, properly designed code = easier to grow and to modify
  2. No.

  3. Giving some context In 2019, I worked on an SMS

    app. In short: • Clients uploaded huge lists of customers • Campaigns were created targeting segmented people based on those lists • The platform would then automatically send SMS messages, respecting the limits of each number provider, etc.
  4. Application Modules List uploading Campaign Creation Message Sending • Lots

    of data (100M+) • Expensive geographic lookup • Slow • Expensive queries • Fairly database • Complex processes needed to run during the actual creation of each pending message record • Need to be FAST • High number of operations per second • The maximum amount of messages per minute per account needed to be calculated during runtime • We need to support dozens of SMS providers!
  5. Application Modules Response Handling Automatic Replies (Drips) • Simple logic,

    but expensive operations at high scale • “Stop word” verifications against the database • Would possibly trigger another command to block a number from receiving messages • Would happen after receiving a “reply keyword” (e.g. “yes”) • Adds a pending message to the database
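
A minimal sketch of that reply-handling flow as a queued Laravel job. The class, table, and column names (`stop_words`, `drips`, `pending_messages`) are illustrative assumptions, not the production schema:

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\DB;

class HandleInboundReply implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function __construct(
        public string $from,  // contact's phone number
        public string $body,  // raw reply text
    ) {}

    public function handle(): void
    {
        $keyword = strtolower(trim($this->body));

        // "Stop word" verification: block the number from receiving further messages.
        if (DB::table('stop_words')->where('word', $keyword)->exists()) {
            DB::table('contacts')->where('phone', $this->from)->update(['blocked' => true]);
            return;
        }

        // "Reply keyword" (e.g. "yes"): add the next drip message as a pending record.
        if (DB::table('drips')->where('keyword', $keyword)->exists()) {
            DB::table('pending_messages')->insert([
                'phone'      => $this->from,
                'keyword'    => $keyword,
                'created_at' => now(),
            ]);
        }
    }
}
```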
  6. The Stack • Laravel • Mongo • Vue.js • Go

    Most of the app was written in Laravel. To send messages, there was a simple Go script that fetched pending messages from the database and sent them.
  7. Problem: list uploading • Extremely slow • Queued, but jobs

    would frequently time out several times and then be discarded
  8. Problem: contact creation • Complex and intensive • Expensive geolocation

    data generation through the contact’s phone number • High number of queries to create each contact • High number of contacts being created at any given minute
  9. Problem: message sending • High number of operations per second

    • Needed to verify whether the contact had already received a message from that specific sender over the last 24 hours
  10. Problem: requests We had two types of webhooks: delivery reports

    and text replies. The number of delivery reports was 1:1 with sent messages: for each message we sent, we would receive one delivery report. After campaigns had been sending for a while, we would get flooded with requests and start responding very slowly and, after some time, with 500s.
  11. Problem: Mongo Mongo worked very well… until the collections got

    fairly large and it didn’t. The fact that pausing/resuming campaigns and sending messages involved *moving* data between collections obviously did not help.
  12. Problem: Message processing The Go script that processed pending

    messages did not have a UI, logs, or anything. Monitoring was very complicated. Adding new drivers was very complicated too, since code needed to be added to Laravel *and* to the Go script.
  13. The state of things • ~100M contacts per day •

    1-2M messages sent per day • Servers cost five figures a month • Frequent failures • Daily data cleanup
  14. A New Hope • Laravel • MySQL • Laravel Horizon

    • Redis We decided that I would build a new MVP, alone, to try to correct the problems we had. I decided to use a very simple stack:
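
As a minimal sketch of how Horizon ties the Redis-backed queues together, here is a hypothetical `config/horizon.php` supervisor entry. Queue names and worker counts are assumptions, and exact option names vary by Horizon version:

```php
<?php

// config/horizon.php (excerpt)
return [
    'environments' => [
        'production' => [
            'supervisor-1' => [
                'connection'   => 'redis',
                'queue'        => ['imports', 'messages', 'webhooks', 'default'],
                'balance'      => 'auto', // shift workers toward the busiest queues
                'minProcesses' => 1,
                'maxProcesses' => 20,
                'tries'        => 3,
            ],
        ],
    ],
];
```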
  15. Figure out the basics It’s easy to think about all

    the fancy tooling and hyped tech and forget about the basics, sometimes optimizing things that were never a problem. • Does your database have the necessary indexes? • Does your database have the correct data types? (looking at you, VARCHAR(255); a migration sketch follows below) • Are your server and process manager (NGINX and PHP-FPM, for example) correctly configured? • Are your queries optimized? • Do you have tools to show you where the bottleneck is? (observability)
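
A hypothetical migration illustrating the first two checks: columns sized for the data they hold instead of a blanket VARCHAR(255), plus indexes for the lookups that actually run hot. Table and column names are assumptions, not the real schema:

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::create('contacts', function (Blueprint $table) {
            $table->id();
            $table->string('phone', 20);            // not the default VARCHAR(255)
            $table->unsignedBigInteger('list_id');
            $table->timestamp('created_at')->nullable();

            $table->index(['list_id', 'phone']);    // supports the per-list lookups
            $table->index('created_at');            // supports the daily cleanup
        });
    }

    public function down(): void
    {
        Schema::dropIfExists('contacts');
    }
};
```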
  16. Fixing the processing problems Since we had lots of intensive

    and/or frequent processes, the solution was, in short, to queue everything we could and read against a cache whenever possible.
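
A minimal sketch of that approach applied to the sending module, assuming a hypothetical `SendSmsMessage` job: the "has this sender already messaged this contact in the last 24 hours?" check from the sending problem above is answered from the cache rather than the database.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\Cache;

class SendSmsMessage implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public function __construct(
        public string $sender,
        public string $phone,
        public string $body,
    ) {}

    public function handle(): void
    {
        // Cache::add() only writes if the key does not exist yet, so it doubles
        // as the "one message per sender per contact per 24 hours" guard.
        $key = "sent:{$this->sender}:{$this->phone}";

        if (! Cache::add($key, true, now()->addDay())) {
            return; // already messaged by this sender in the last 24 hours
        }

        // ... hand the message off to the provider driver here ...
    }
}
```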
  17. Things are going to break. Accept it. Things break often

    in programming. They break even more often when they’re external. And even more often in high-scale environments. Accept that they *are* going to break, and instead focus on writing good countermeasures to those problems and ensuring you have good observability — that is, you can spot them as soon as they happen.
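
In Laravel terms, one way that can look for an external provider call, sketched with a hypothetical `DeliverToProvider` job: bounded retries with backoff so transient failures recover on their own, and a `failed()` hook so permanent ones become visible the moment they happen.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\Log;
use Throwable;

class DeliverToProvider implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    public int $tries = 5;                  // retry instead of silently losing the message
    public array $backoff = [10, 60, 300];  // wait longer between each attempt

    public function __construct(public int $messageId) {}

    public function handle(): void
    {
        // ... call the external SMS provider; an exception here triggers a retry ...
    }

    public function failed(Throwable $e): void
    {
        // Observability: make the failure visible as soon as it happens.
        Log::error('Provider delivery failed', [
            'message_id' => $this->messageId,
            'error'      => $e->getMessage(),
        ]);
    }
}
```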
  18. Know your tools We all leverage lots of abstraction, but

    it is important to have an understanding of how some things work behind the scenes. For example…
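
One illustration of that kind of detail (my own example, not necessarily the one from the talk): a queued job that type-hints an Eloquent model does not serialize the whole row. With `SerializesModels`, only the model's key goes into the queue payload, and the worker re-fetches the model from the database when the job runs, which matters at this scale.

```php
<?php

namespace App\Jobs;

use App\Models\Campaign; // hypothetical model name
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class ProcessCampaign implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // Only $campaign->getKey() ends up in the queue payload; the full model
    // is loaded again (one extra query) when a worker picks this job up.
    public function __construct(public Campaign $campaign) {}

    public function handle(): void
    {
        // ...
    }
}
```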
  19. Results! • 15M+ sent messages a day without breaking a

    sweat • 300M+ imported contacts a day • 500M+ jobs processed in a 12h timeframe • 30,000+ requests/minute • Infrastructure cost under $1000/month • Easily horizontally scalable • Easy queue management and monitoring through Horizon • Easy to add new providers • Well-tested, easy to maintain, extend, and modify codebase