

Lesson learned from the adoption of Armeria to LINE's authentication system

Presentation at the TWJUG August 2019 meetup, in Taipei on 2019/8/15
https://www.meetup.com/taiwanjug/events/263411288/

LINE Developers

August 15, 2019

Transcript

  1. About me • Masahiro Ide • Software Engineer on LINE's

    server development team • Messaging backend server, Redis cluster management • Armeria adoption • Contributor to Armeria, line-bot-sdk-java • https://github.com/line/armeria • https://github.com/line/line-bot-sdk-java
  2. Agenda • User request authentication at LINE • Inside

    our auth system • Various issues faced during the Armeria adoption • Future work
  3. What is the LINE Authentication service? • Authenticates requests coming

    to the LINE messaging service • Text messages, broadcast bot messages, sticker purchases, etc. • Server-side authentication ↔ client-side authentication [Diagram: Client → Reverse Proxy → Talk servers / StickerShop / Bot backend, all authenticating via the Auth server]
  4. Auth Server Components • Netty 4.1.0-beta8 • epoll, HTTP/2 •

    Jedis (Redis client for Java) • In-house Redis cluster • MyBatis + mysql-connector-java • With an in-house sharding system • L4 Load Balancer (LB) [Diagram: Apps → L4-LB → Auth Server (Netty, Jedis, MyBatis) → Redis cluster and MySQL]
  5. Requirement • Fast • Latency SLO: 99% of requests served

    in < 10ms • Reliable • Handles 600K req/sec within reasonable time • Maintainable • Easy to modify by anyone • Efficiency • HTTP/2 + L4 LB causes load imbalance
  6. HTTP/2 + L4 LB causes load imbalance? • With HTTP/2,

    you can multiplex requests onto a single connection • To maximize this benefit, HTTP/2 clients and servers try to keep the connection alive as long as possible [Diagram: one Client↔Server connection carrying stream1, stream2, stream3, …]
  7. HTTP/2 + L4 LB causes load imbalance? • With HTTP/2,

    you can multiplex requests onto a single connection • To maximize this benefit, HTTP/2 clients and servers try to keep the connection alive as long as possible [Diagram: two Client↔Server pairs, each reusing one connection]
  8. HTTP/2 + L4 LB causes load imbalance? • With HTTP/2 + LB,

    the server, client, and LB keep using the connection created once at startup, so a client may unintentionally communicate only with the same server • In the worst case, all requests go to one server [Diagram: clients pinned through the LB to a single server]
  9. HTTP/2 + L4 LB causes load imbalance? • Disconnecting aged

    connections can relax the imbalance • But it is not a perfect solution var conn = connectionPool.get(); if (conn.maxAge >= 5 minutes) { conn.disconnect(); conn = connectionPool.get(); } [Chart: per-server request rate, around 7K req/sec]
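The max-age disconnect on this slide can be sketched in plain Java. `ManagedConnection` and `ConnectionRecycler` are hypothetical names for illustration, not Armeria or Netty API; the 5-minute threshold matches the slide's pseudocode:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical wrapper that records when a connection was opened.
class ManagedConnection {
    final Instant openedAt = Instant.now();
    boolean closed = false;

    boolean olderThan(Duration maxAge) {
        return Duration.between(openedAt, Instant.now()).compareTo(maxAge) >= 0;
    }

    void disconnect() { closed = true; }
}

class ConnectionRecycler {
    private static final Duration MAX_AGE = Duration.ofMinutes(5);
    private ManagedConnection conn = new ManagedConnection();

    // Called before each request: drop a connection older than MAX_AGE
    // so the L4 LB gets a chance to route the new one to another backend.
    ManagedConnection checkout() {
        if (conn.olderThan(MAX_AGE)) {
            conn.disconnect();
            conn = new ManagedConnection();
        }
        return conn;
    }
}
```

As the slide says, this only relaxes the imbalance: between recycles the client is still pinned to whichever server the LB picked.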
  10. Requirement • Fast • Latency SLO: 99% of requests served

    in < 10ms • Reliable • Handles 600K req/sec within reasonable time • Maintainable • Easy to modify by anyone • Efficiency • HTTP/2 + L4 LB Q. Can Armeria solve these requirements?
  11. Armeria - Our RPC layer • Asynchronous RPC/REST library •

    built on top of Java 8, Netty, HTTP/2, Thrift, and gRPC • Takes care of common functionality for microservices • Client-side LB • Circuit Breaker / Retry / Throttling • Tracing (Zipkin) / Monitoring integration • etc. https://line.github.io/armeria/
  12. Requirements for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • HTTP/2 + L4 LB → HTTP/2 + Client-side LB
  13. Requirements for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • HTTP/2 + L4 LB → HTTP/2 + Client-side LB
  14. Easy to write, easy to maintain your service • Armeria

    provides usability similar to existing web frameworks https://line.github.io/armeria/server-annotated-service.html
  15. Requirements for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • HTTP/2 + L4 LB → HTTP/2 + Client-side LB
  16. Client-side load balancing • Problem: Proxy-based LB + HTTP/2 causes

    load imbalance if connections are kept alive • Solution? • Disconnect every time a client receives a response? • Connect client and server directly + load balance on the client side [Diagram: proxy-based LB (clients → LB → servers) vs. client-side LB (clients connect to servers directly)]
  17. Client-side load balancing • Load balancing on the client side resolves

    the load imbalance • But… how can the client know the endpoint locations? • Hardcode the endpoint list into the app? • Register the locations of all service instances in a service registry and have clients query the registry [Diagram: a server adds "10.120.16.2:8080" to "myService"; clients query and receive {"myService": ["10.120.16.2:8080", "10.120.16.2:8080"]}]
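The registry-plus-client-side-LB idea can be sketched in plain Java. The `Map` stands in for a registry query and `RoundRobinSelector` is a hypothetical name; Armeria's real endpoint groups and registry integrations are much richer than this:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal client-side round-robin over endpoints fetched from a
// service registry (here simulated by a Map lookup).
class RoundRobinSelector {
    private final List<String> endpoints;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinSelector(Map<String, List<String>> registry, String service) {
        this.endpoints = registry.get(service);
    }

    // Each call picks the next endpoint, spreading load across servers
    // even though every connection is long-lived HTTP/2.
    String select() {
        int i = Math.floorMod(next.getAndIncrement(), endpoints.size());
        return endpoints.get(i);
    }
}
```

A real client would also re-query (or watch) the registry so instances added or removed at runtime show up in the endpoint list.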
  18. Client-side load balancing • Client-side LB creates more connections

    between client and server • … but provides balanced load across servers even with HTTP/2 [Chart: per-server load before and after applying client-side LB]
  19. Requirements for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • → HTTP/2 + Client-side LB
  20. Everything goes well? • No, it was not ready for

    production yet [Table: requirements (Fast, Reliable, Maintainable, Efficiency, Fault tolerance) vs. Legacy / Armeria / Armeria+]
  21. Why do we need to care about fault tolerance? • Because… •

    Authentication is a Single Point Of Failure in our system: you can’t do anything at LINE until the service comes back • An outage affects the user experience: a user cannot send a sticker to a friend via LINE, cannot share a photo with family, etc. • We lose an opportunity to earn
  22. Canary release • A way to apply a few % of

    real load to a newly released binary • Useful for checking regressions without affecting all users • If no issue is found, increase the load • If you find anything, roll back easily
  23. Canary release solves all issues? • Consider a large-scale

    web application • It takes 4 mins per server to restart, and there are 1000 servers • That takes 25 mins even if we parallelize the operation • Under low load a new release may show no issues, but applying it to all servers may reveal performance problems • Resolving these takes a lot of time • We may solve this with fine-grained canary releases, but is there a way to release more flexibly?
  24. Flag-based feature rollout • Merge all features into the release

    binary • Enable/disable each feature with a property file
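A minimal sketch of property-file-driven flags in plain Java. The `<feature>.rolloutPercent` key and the per-user bucket in [0, 100) are illustrative assumptions, not the actual flag format used at LINE:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Every feature ships in the binary; a property file decides which
// features are on and for what percentage of users.
class FeatureFlags {
    private final Properties props = new Properties();

    void load(Reader source) throws IOException {
        props.load(source);
    }

    // userBucket is a stable hash of the user ID mapped into [0, 100),
    // so the same user keeps getting the same decision.
    boolean enabled(String feature, int userBucket) {
        int percent = Integer.parseInt(
                props.getProperty(feature + ".rolloutPercent", "0"));
        return userBucket < percent;
    }
}
```

Raising the rollout to 100% or back to 0% is then a config change, not a redeploy, which is the point of the next slides.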
  25. How to update the flags at runtime? • All

    flag users need to know the latest values for rollouts to go smoothly • But there is no general way to achieve this • Distribute the property file to each server? Ansible, Chef, or rsync + inotify? • Make a new management API? Do we need new APIs every time we add a flag or service?
  26. Central Dogma • Repository service for textual configuration • JSON,

    YAML, XML … • Highly available • multi-master, eventually consistent • Provides an API to notify users of changes https://line.github.io/centraldogma/
  27. Rollout features using Central Dogma • Roll out to production within 1

    minute of a change • Deploy a YAML/JSON file to all our production servers via Central Dogma • Record changes in the Central Dogma commit log [Diagram: a developer commits "CacheV2 rollout: 30%" as YAML to Central Dogma; services pull & reload the config]
  28. Rollout features using Central Dogma • Roll out to production within 1

    minute of a change • Deploy a YAML/JSON file to all our production servers via Central Dogma • Record changes in the Central Dogma commit log
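Central Dogma's Java client exposes a watch/notify API; the pull-and-reload pattern it enables can be sketched with a stdlib-only stand-in that reloads a properties file whenever its contents change. This is not the Central Dogma client API, just the shape of the pattern:

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Stand-in for the watch-and-reload pattern: re-parse the config only
// when its content (the "commit") actually changes.
class ReloadingConfig {
    private final Path file;
    private volatile Properties current = new Properties();
    private volatile String lastSeen = "";

    ReloadingConfig(Path file) { this.file = file; }

    Properties get() throws Exception {
        String text = Files.readString(file);
        if (!text.equals(lastSeen)) {  // new revision -> reload
            Properties p = new Properties();
            p.load(new StringReader(text));
            current = p;
            lastSeen = text;
        }
        return current;
    }
}
```

With the real client, the reload is pushed by the server's change notification instead of being polled, which is how a commit reaches every production server within a minute.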
  29. Considerations for flag-based rollout • Regression tests are required for halfway

    rollouts • One halfway rollout requires two regression tests: new feature + old feature • If you have tens of halfway rollouts, then… • Hard to guarantee flag completeness • A feature might be revealed partially • Sometimes hard to control by flag • e.g. SDK, JVM, or library upgrades
  30. Everything goes well? • We finally succeeded in rolling out, but failed

    to roll out 4 times along the way [Table: requirements (Fast, Reliable, Maintainable, Efficiency, Fault tolerance) vs. Legacy Impl / Armeria / Armeria+ / Armeria++]
  31. How did it lead to success? • Patiently look for

    anything unusual • Check various kinds of metrics • CPU, GC, thread utilization, API latency, heap dumps, histograms, etc. • Find bottlenecks using a profiler
  32. How did it lead to success? • Find anomalies in

    metrics • On the client side, the new client's max latency was always 5 sec • But the server side seemed normal
  33. How did it lead to success? • Observed that the HTTP

    client hit DNS every time it sent a request • Q. Why does it need to ask DNS? → The DNS cache was disabled + we forgot to use the pre-resolved IP addresses for endpoints
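On the JVM side, per-request DNS lookups can be avoided with the standard `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` security properties (values in seconds, effective if set before the first lookup). The TTL values below are illustrative, not the ones used at LINE:

```java
import java.security.Security;

// Configure the JVM's DNS cache so successful lookups are reused
// instead of hitting the resolver on every request.
class DnsCacheConfig {
    static void apply() {
        // Cache successful name resolutions for 30 seconds.
        Security.setProperty("networkaddress.cache.ttl", "30");
        // Cache failed resolutions briefly too, to avoid hammering DNS.
        Security.setProperty("networkaddress.cache.negative.ttl", "5");
    }
}
```

The complementary fix from the slide, using pre-resolved IP addresses for endpoints, skips name resolution on the request path entirely.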
  34. How did it lead to success? • Find bottlenecks using

    a profiler • For a Java app, use async-profiler • https://github.com/jvm-profiling-tools/async-profiler • Once you solve one bottleneck, it's time to find the next
  35. How did it lead to success? • Find bottlenecks using

    a profiler • For a Java app, use async-profiler • Once you solve one bottleneck, it's time to find the next
  36. Future work • Shave off more inefficiencies • Many connections are

    created under high load, which may cause GC pressure on the client/server side just to manage the connections https://github.com/line/armeria/issues/816 https://github.com/line/armeria/pull/1886 • Long-polling-based server healthiness notification • Today a client sends a health-check request periodically to learn whether a server is unhealthy, so the client keeps sending requests to an unhealthy server until its next health check https://github.com/line/armeria/issues/1756 https://github.com/line/armeria/pull/1878
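The health-check gap described on this slide can be seen in a stdlib-only sketch of periodic checking (names are hypothetical, not Armeria API): between two checks the client keeps a stale view of an endpoint's health, which is exactly what the long-polling proposal closes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Periodic health checking: the client's view is only as fresh as the
// last run of checkAll(), so failures in between go unnoticed.
class HealthCheckedEndpoints {
    private final Map<String, Boolean> healthy = new ConcurrentHashMap<>();

    // Run on a timer every N seconds in a real client; `probe` is a
    // stand-in for an HTTP health-check request.
    void checkAll(Iterable<String> endpoints, Predicate<String> probe) {
        for (String e : endpoints) {
            healthy.put(e, probe.test(e));
        }
    }

    boolean isHealthy(String endpoint) {
        return healthy.getOrDefault(endpoint, true); // optimistic default
    }
}
```

With long polling, the server holds the health-check request open and answers as soon as its state changes, so the client learns about an unhealthy server immediately instead of on the next timer tick.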
  37. Conclusion • Our system is built on top of our

    own OSS • Armeria https://line.github.io/armeria/ • Central Dogma https://github.com/line/centraldogma • … which are still evolving • Trouble may happen • Build a system well prepared for trouble • Let's do smooth releases