
Lessons We Learned Through Hell When Scaling Blibli.com

Various lessons we learned while scaling the site

Alex Xandra Albert Sim

October 29, 2019

Transcript

  1. Disclaimer Presentations are intended for educational purposes only and not

    to replace independent professional judgment. The views and opinions expressed in this presentation do not necessarily reflect the official policy or position of blibli.com. Audience discretion is advised.
  2. Who am I? • Alex Xandra Albert Sim • Lead

    Principal R&D Engineer at blibli.com • [email protected] • bertzzie(.sim)
  3. The “Usual” Backend Architecture Gateway (Reverse Proxy) Server / Servlet

    (Tomcat, Netty, etc.) App (Spring, JavaEE, etc.) Database Cache Other Services (Internal / External)
  4. Threads • Basic computational concept to handle concurrency Service Queue

    Task 1 Task 2 … Thread Pool Thread 1 Thread 2 Thread 3 … The Java Threading Model
  5. Optimizing Thread Utilization • For applications that you don’t write

    (i.e. Tomcat), there’s usually a config for it • Too many threads will result in performance degradation due to thread switching cost • The common formula [0] for max thread pool size is: [0] Goetz, Brian. Java Concurrency in Practice Num of Threads = Num of Cores * (1 + Wait Time / Service Time) Wait time: time spent waiting for IO bound task to complete Service time: time spent processing • Remember, this is oversimplified. We usually have multiple thread pools (HTTP, JDBC, etc.) with different workload requirements
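The Goetz formula from the slide can be sketched as a small helper. The class and method names here are illustrative, not any real API; wait time and service time are averages you measure by profiling your own workload.

```java
// Sketch of the thread-sizing formula from Java Concurrency in Practice:
// threads = cores * (1 + waitTime / serviceTime)
public class ThreadSizing {
    static int optimalThreads(int cores, double waitTimeMs, double serviceTimeMs) {
        // IO-heavy work (high wait/service ratio) justifies a larger pool,
        // because threads spend most of their time parked, not computing
        return (int) (cores * (1 + waitTimeMs / serviceTimeMs));
    }

    public static void main(String[] args) {
        // e.g. 8 cores, 50ms waiting on IO for every 10ms of CPU work
        System.out.println(optimalThreads(8, 50, 10)); // → 48
        // pure CPU work: pool size collapses to the core count
        System.out.println(optimalThreads(8, 0, 10));  // → 8
    }
}
```

As the slide warns, treat the result as a starting point for load testing, not a final setting, since each pool (HTTP, JDBC, …) has its own wait/service profile.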
  6. Know Your Limit • Check your backend capacity with Little’s

    Law: Concurrency (in-flight requests) = Average Arrival Rate × Average Latency • Arrival rate is measured in requests per second • This formula tells you how many requests you can serve with a stable response time • Repeatedly perf test your backend with these rough calculations to see its real capacity
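Little's Law proper reads L = λ × W: the average number of in-flight requests equals the arrival rate times the average latency. A minimal sketch of the two directions of the calculation (names are illustrative):

```java
public class LittlesLaw {
    // L = lambda * W: in-flight requests = arrival rate (req/s) * avg latency (s)
    static double concurrentRequests(double arrivalRatePerSec, double avgLatencySec) {
        return arrivalRatePerSec * avgLatencySec;
    }

    // Inverted: the max sustainable arrival rate for a given thread budget
    static double maxArrivalRate(int threads, double avgLatencySec) {
        return threads / avgLatencySec;
    }

    public static void main(String[] args) {
        // 200 req/s at 250ms average latency keeps 50 requests in flight
        System.out.println(concurrentRequests(200, 0.25)); // → 50.0
        // so a pool of 50 threads caps out at roughly 200 req/s
        System.out.println(maxArrivalRate(50, 0.25));      // → 200.0
    }
}
```

If measured concurrency exceeds your thread budget, requests start to queue, which is exactly the failure mode the later slides describe.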
  7. Handling Concurrency On Your Own • For your own code,

    it’s usually better and easier to use a higher-level abstraction than threads • Battle-tested, with lots of examples, and easier to learn • Examples: RxJava, Reactor, Akka • We chose RxJava (old projects) and Reactor (new projects)
  8. RxJava • Reactive Extensions, help us in processing and composing

    events • Single-threaded by default, could easily be made concurrent • Concurrency is achieved via Schedulers • Changing Scheduler can have a major performance impact, depending on your use case • Remember: test, test, test!
  9. Common Reactive Schedulers • Immediate Scheduler – Blocks current task

    and run task immediately on the same thread • Single Scheduler – Run task on another thread, but only one thread is provided • Computation Scheduler – RxJava only: run task on other threads. Thread count == CPU core count • IO Scheduler – RxJava only: run task on other threads. Threads are unbounded • Bounded Elastic – Reactor only: like IO Scheduler, but with cap on max thread count
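The scheduler types above map roughly onto `java.util.concurrent` executor shapes. This stdlib sketch is only a mental model of how the reactive libraries size their pools (the `10 * cores` cap mirrors Reactor's default for `boundedElastic`); the libraries manage these pools themselves, so none of this is their actual API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SchedulerAnalogy {
    static final int CORES = Runtime.getRuntime().availableProcessors();

    // Single Scheduler: one dedicated worker thread
    static ExecutorService single() {
        return Executors.newSingleThreadExecutor();
    }

    // Computation Scheduler: thread count == CPU core count
    static ExecutorService computation() {
        return Executors.newFixedThreadPool(CORES);
    }

    // IO Scheduler: unbounded, reuses idle threads
    static ExecutorService io() {
        return Executors.newCachedThreadPool();
    }

    // Bounded Elastic: like IO but capped; 10 * cores mirrors Reactor's default cap
    static ExecutorService boundedElastic() {
        return Executors.newFixedThreadPool(10 * CORES);
    }
}
```

The analogy also explains the slide-8 warning: switching, say, from a computation-style pool to an unbounded IO-style pool changes both throughput and thread-switching cost, so always re-test after changing schedulers.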
  10. Lesson 1 • Understand your basics, know your libraries •

    No shortcut in performance tuning, it’s typically: test, test, test!
  11. Background Story • Incoming traffic so huge it’s indistinguishable from a

    DDoS • At the gateway level: port exhaustion keeps happening • At the server level: never-ending thread exhaustion • At the application level: timeouts, timeouts, timeouts
  12. Circuit Breaker • B-but we have circuit breaker? • First

    things first: what’s a circuit breaker? Normal Request Flow Service A Service B HTTP Request HTTP Response HTTP Library Service Layer With Circuit Breaker Service A Service B HTTP Request HTTP Response Circuit Breaker Service Layer HTTP Library
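A minimal sketch of the breaker's core behaviour, assuming a consecutive-failure threshold. Real libraries (Hystrix, Resilience4j) add half-open probing, time windows, and metrics; this only shows the fail-fast idea that protects Service A's threads from a slow Service B.

```java
import java.util.function.Supplier;

// Minimal circuit-breaker sketch: trip OPEN after N consecutive failures,
// then reject calls immediately instead of tying up a thread.
public class CircuitBreaker {
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    CircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    boolean isOpen() {
        return consecutiveFailures >= failureThreshold;
    }

    <T> T call(Supplier<T> remoteCall, T fallback) {
        if (isOpen()) {
            return fallback;            // fail fast: no thread spent waiting
        }
        try {
            T result = remoteCall.get();
            consecutiveFailures = 0;    // success resets the breaker
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;      // count the failure toward the threshold
            return fallback;
        }
    }
}
```

The crucial point for the next slides: the breaker only helps once it has tripped; while it is still closed, slow calls can still pile up in queues.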
  13. Queue? Service A Service B (Slow Response) HTTP Request Request

    1 Come Request 2 Come Request 3 Come (Thread Exhaustion)
  14. Queue? Service A Service B (Slow Response) HTTP Request Request

    1 Come Request 2 Come Request 3 Come (Thread Exhaustion) Request 4 Queued
  15. Queue? Service A Service B (Slow Response) HTTP Request Request

    1 Come Request 2 Come Request 3 Come (Thread Exhaustion) Request 4 Queued Request 5 Queued
  16. Queue? Service A Service B (Slow Response) HTTP Request Request

    1 Come Request 2 Come Request 3 Come (Thread Exhaustion) Request 4 Queued Request 5 Queued Request 6 Queued (Max Queue Reached)
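The exhaustion sequence in slides 13-16 can be reproduced with a deliberately tiny `ThreadPoolExecutor`: one worker thread plus a queue of two, so the fourth concurrent task is rejected outright. The class and method names are illustrative.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueExhaustion {
    // One worker + queue of 2: returns true because the 4th task is rejected.
    static boolean fourthTaskRejected() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(2));
        CountDownLatch block = new CountDownLatch(1);
        Runnable slow = () -> {
            try { block.await(); } catch (InterruptedException ignored) { }
        };

        pool.execute(slow); // taken by the single worker (thread exhaustion)
        pool.execute(slow); // request 2: queued
        pool.execute(slow); // request 3: queued (max queue reached)
        boolean rejected;
        try {
            pool.execute(slow); // request 4
            rejected = false;
        } catch (RejectedExecutionException e) {
            rejected = true;    // dropped fast instead of waiting forever
        }

        block.countDown();
        pool.shutdownNow();
        return rejected;
    }

    public static void main(String[] args) {
        System.out.println(fourthTaskRejected()); // → true
    }
}
```

An immediate rejection like this is visible and cheap; an over-large queue hides the same overload until timeouts cascade, which is the trap the next slide describes.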
  17. Cascading • This Queue Problem is cascading! Tomcat Hystrix Threading

    in General <Executor maxQueueSize= <Connector acceptCount= hystrix.threadpool.HystrixThreadPoolKey.maxQueueSize • There are caches involved in every layer – which made us late to notice!
  18. Lesson 2 • Timeouts, managed incorrectly, can eat resources VERY

    fast • Be careful with your queues and caches! • Drop the requests you can’t handle ASAP • Plan for and anticipate cascading failures
  19. Monitoring Applications • In a high-scale application, monitoring is

    a very crucial tool for operations and optimization • Usually we can peek into performance details without much impact • Example trace:
  20. Enter Distributed Tracing Distributed tracing, also called distributed request tracing,

    is a method used to profile and monitor applications, especially those built using a microservices architecture. Distributed tracing helps pinpoint where failures occur and what causes poor performance. https://opentracing.io/docs/overview/what-is-tracing/
  21. Service B Common Transactions Flow in Distributed System Service A

    Executor Thread 1 Thread 2 Thread 3 … Request Executor Thread 1 Thread 2 Thread 3 … Request
  22. Tracing Request Flow Service A Executor Thread 1 (Span 1)

    Thread 2 (Span 3) Thread 3 (Span 2) Thread 4 (Span 4) Request 1 Trace ID
  23. Traces and Spans • Trace ID represents the lifetime of

    a request • Trace ID is the same across services, covering the whole request-response flow • Span ID is an individual unit of work • Span ID must contain a Trace ID to create a relationship between spans and trace • Span ID can be continued to link between spans • 1 Trace ID can have multiple Span IDs • To trace a request we: – mark every request with a Trace ID – create Spans for each thread or unit of work – mark every external request with the trace and span id created earlier
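The trace/span rules above can be sketched as a tiny propagation helper. The header names follow the B3 convention used by Zipkin and Spring Cloud Sleuth; everything else (class names, ID format) is illustrative, not the deck's actual tracing stack.

```java
import java.util.Map;
import java.util.UUID;

public class Tracing {
    // One Trace ID spans the whole request; each unit of work gets its own Span ID.
    static final class SpanContext {
        final String traceId;
        final String spanId;
        final String parentSpanId; // null for the root span

        SpanContext(String traceId, String spanId, String parentSpanId) {
            this.traceId = traceId;
            this.spanId = spanId;
            this.parentSpanId = parentSpanId;
        }
    }

    // Mark an incoming request with a fresh Trace ID and root Span ID
    static SpanContext startTrace() {
        return new SpanContext(newId(), newId(), null);
    }

    // New unit of work: same Trace ID, fresh Span ID, linked to the parent span
    static SpanContext childSpan(SpanContext parent) {
        return new SpanContext(parent.traceId, newId(), parent.spanId);
    }

    // Attach the IDs to an outgoing request (B3-style header names)
    static Map<String, String> toHeaders(SpanContext ctx) {
        return Map.of("X-B3-TraceId", ctx.traceId, "X-B3-SpanId", ctx.spanId);
    }

    private static String newId() {
        return UUID.randomUUID().toString().replace("-", "");
    }
}
```

The downstream service reads those headers and continues the trace with its own child spans, which is how one Trace ID ends up covering the whole cross-service request-response flow.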
  24. Lesson 3 • To save yourselves from headaches, make sure

    your monitoring tools work well • Use a standard architecture and tools whenever possible • Future discussion: baggage items, logging, tagging
  25. Closing • The only way to know your real performance is

    by testing (on production) • Manage your threads, caches, and queues VERY carefully • Failures could cascade if you are not careful • Your Metrics and Instrumentation tools should be prepared for your architecture