Lesson learned from the adoption of Armeria to LINE's authentication system

Presentation at the TWJUG August 2019 meetup in Taipei on 2019/8/15
https://www.meetup.com/taiwanjug/events/263411288/

LINE Developers

August 15, 2019

Transcript

  1. Lesson learned from the adoption of Armeria to LINE's authentication

    system. Masahiro IDE, LINE CORP
  2. About me • Masahiro Ide • Software Engineer at LINE

    server development • Messaging backend server, Redis cluster management • Armeria adoption • Contributor to Armeria and line-bot-sdk-java • https://github.com/line/armeria • https://github.com/line/line-bot-sdk-java
  3. Agenda • User request authentication at LINE • Inside

    our auth system • Various issues faced during Armeria adoption • Future work
  4. What is LINE Authentication service? • Authenticates requests coming to

    the LINE messaging service • Text messages, broadcast bot messages, sticker purchases, etc. • Server-side authentication ↔ Client-side authentication [Diagram: Client → Reverse Proxy → Talk servers / StickerShop / Bot backend → Auth server]
  5. Auth Server Components • Netty 4.1.0-beta8 • epoll, HTTP/2 •

    Jedis (Redis client for Java) • In-house Redis cluster • MyBatis + mysql-connector-java • With in-house sharding system • L4 Load Balancer (LB) [Diagram: Apps → L4-LB → Auth Server (Netty, Jedis, MyBatis) → Redis cluster / MySQL]
  6. Requirement • Fast • Latency SLO: 99% of requests served

    in < 10ms • Reliable • Handles 600K req/sec within reasonable time • Maintainable • Easy to modify by anyone • Efficiency • HTTP/2 + L4 LB causes load imbalance
  7. HTTP/2 + L4 LB causes load imbalance? • With HTTP/2,

    you can multiplex requests onto a single connection • To maximize this benefit, HTTP/2 clients and servers try to keep a connection alive as long as possible [Diagram: Client ↔ Server with stream1, stream2, stream3 multiplexed]
  8. HTTP/2 + L4 LB causes load imbalance? • With HTTP/2,

    you can multiplex requests onto a single connection • To maximize this benefit, HTTP/2 clients and servers try to keep a connection alive as long as possible [Diagram: Client ↔ Server pairs, each keeping a single connection]
  9. HTTP/2 + L4 LB causes load imbalance? • With HTTP/2 + LB,

    because the server, client, and LB keep reusing the connection created once at startup, a client may unintentionally communicate only with the same server • In the worst case, all requests go to one server [Diagram: Clients → LB → Servers, with traffic pinned to one server]
  10. HTTP/2 + L4 LB causes load imbalance? • Disconnecting aged

    connections can relax the imbalance • But it is not a perfect solution: val conn = getConnectionPool(); if (conn.maxAge >= 5 min) { conn.disconnect(); conn = getConnectionPool(); } [Graph: ~7K req/sec]
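The disconnect-aged-connection workaround shown on this slide can be sketched in plain Java. `ConnectionPool`-style names here are hypothetical stand-ins for illustration, not the actual implementation:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the slide's workaround: recycle any connection older than a
// maximum age, so that over time load re-spreads across servers behind the LB.
class AgedConnectionRecycler {
    static final Duration MAX_AGE = Duration.ofMinutes(5);

    // Minimal stand-in for a pooled connection (not a real API).
    static class Connection {
        final Instant createdAt;
        boolean closed;
        Connection(Instant createdAt) { this.createdAt = createdAt; }
        void disconnect() { closed = true; }
    }

    // Returns the same connection while it is young enough; otherwise
    // disconnects it and hands out a fresh one, which may land on another server.
    static Connection getOrRecycle(Connection conn, Instant now) {
        if (Duration.between(conn.createdAt, now).compareTo(MAX_AGE) >= 0) {
            conn.disconnect();
            return new Connection(now);
        }
        return conn;
    }
}
```

As the slide notes, this only relaxes the imbalance: between recycles a client is still pinned to one server.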
  11. Requirement • Fast • Latency SLO: 99% of requests served

    in < 10ms • Reliable • Handles 600K req/sec within reasonable time • Maintainable • Easy to modify by anyone • Efficiency • HTTP/2 + L4 LB Q. Can Armeria solve the Requirements?
  12. Armeria - Our RPC layer • Asynchronous RPC/REST library •

    built on top of Java 8, Netty, HTTP/2, Thrift and gRPC • Takes care of common functionality for microservices • Client-side LB • Circuit Breaker / Retry / Throttling • Tracing (Zipkin) / Monitoring integration • etc. https://line.github.io/armeria/
  13. Requirement for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • HTTP/2 + L4 LB → HTTP/2 + Client-side LB
  14. Requirement for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • HTTP/2 + L4 LB → HTTP/2 + Client-side LB
  15. Easy to write, easy to maintain your service • Armeria

    provides usability similar to existing web frameworks https://line.github.io/armeria/server-annotated-service.html
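The annotated-service style the slide links to looks roughly like this. This is a sketch assuming the Armeria dependency is on the classpath and a recent Armeria version; the route and handler names are made up for illustration:

```java
import com.linecorp.armeria.server.Server;
import com.linecorp.armeria.server.annotation.Get;
import com.linecorp.armeria.server.annotation.Param;

public class AnnotatedServiceSketch {
    public static void main(String[] args) {
        Server server = Server.builder()
                .http(8080)
                // Annotated services map methods to routes, much like
                // familiar web frameworks such as Spring MVC.
                .annotatedService(new Object() {
                    @Get("/hello/:name")
                    public String hello(@Param("name") String name) {
                        return "Hello, " + name + '!';
                    }
                })
                .build();
        server.start().join();
    }
}
```

Behind this simple surface the service still runs on Armeria's asynchronous Netty/HTTP/2 stack.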
  16. Requirement for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • HTTP/2 + L4 LB → HTTP/2 + Client-side LB
  17. Client-side load balancing • Problem: Proxy-based LB + HTTP/2 causes

    load imbalance if connections are kept alive • Solution? • Disconnect every time a client receives a response? • Connect client and server directly + load balance on the client side [Diagram: Proxy-based LB (Clients → LB → Servers) vs Client-side LB (Clients → Servers directly)]
  18. Client-side load balancing • Load balancing on the client side resolves

    the load imbalance • But… how can the client know the endpoint locations? • Hardcode the endpoint list into the app? • Register the locations of all service instances in a service registry, and have clients query the registry [Diagram: a server registers "10.120.16.2:8080" under "myService"; clients query and get {"myService": [ "10.120.16.2:8080", "10.120.16.2:8080"]}]
  19. Client-side load balancing • Client-side LB creates more connections

    between client and server • … but provides balanced load to servers even with HTTP/2 [Graph: server load evens out after applying client-side LB]
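The idea on these slides can be sketched as a tiny round-robin endpoint selector in plain Java. Armeria ships its own endpoint-group abstraction for this; the class below is a simplified illustration, not Armeria's API:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified client-side load balancer: the client holds the endpoint list
// (e.g. fetched from a service registry) and picks a server per request,
// instead of relying on a proxy LB in the middle.
class RoundRobinEndpoints {
    private final List<String> endpoints;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinEndpoints(List<String> endpoints) {
        this.endpoints = List.copyOf(endpoints);
    }

    // Each request takes the next endpoint in turn, so load spreads evenly
    // even when every connection is a long-lived HTTP/2 one.
    String pick() {
        int i = Math.floorMod(next.getAndIncrement(), endpoints.size());
        return endpoints.get(i);
    }
}
```

A real implementation would also refresh the endpoint list from the registry and skip unhealthy endpoints.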
  20. Requirement for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • → HTTP/2 + Client-side LB
  21. Everything goes well? [Table: Requirement (Fast, Reliable, Maintainable, Efficiency) × Legacy vs Armeria]

  22. Everything goes well? • No, it was not ready for

    production yet [Table: Requirement (Fast, Reliable, Maintainable, Efficiency, Fault tolerance) × Legacy vs Armeria vs Armeria+]
  23. Why do we need to care about fault tolerance? • Because… •

    Authentication is a Single Point of Failure in our system: you can't do anything at LINE until the service comes back • An outage affects user experience: a user cannot send a sticker to a friend via LINE, cannot share a photo with family, etc. • We lose an opportunity to earn
  24. How to reduce # of outages? • Canary release •

    Flag-based feature rollout
  25. Canary release • A way to apply a few % of

    real load to a newly released binary • Useful for checking regressions without affecting all users • If no issue is found, increase the load • If you find anything, roll back easily
  26. Canary release solves all issues? • Case of a large-scale

    web application • It takes 4 mins per server to restart, and there are 1000 servers • It takes 25 mins even if we parallelize the operation • Under low load a new release may show no issue, but when applied to all servers we may find a performance issue • Resolving that takes a lot of time • Fine-grained canary releases may help, but is there any way to release more flexibly?
  27. Flag-based feature rollout • Merge all features into the release binary

    • Enable/disable a feature with a property file
  28. Flag-based feature rollout Example.1 • Case1: Percentage rollout 5% 10%
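Percentage rollout is typically implemented by hashing a stable key (e.g. a user ID) into a bucket and comparing it with the configured percentage. A minimal sketch — the hashing scheme is an assumption for illustration, not LINE's actual implementation:

```java
// Sketch of percentage-based rollout: a stable hash of the user ID maps each
// user into one of 100 buckets; users in buckets below the configured
// percentage see the new feature. Because the hash is stable, raising the
// flag from 5% to 10% keeps the original 5% of users enabled.
class PercentageRollout {
    private final int percentage; // 0..100

    PercentageRollout(int percentage) {
        if (percentage < 0 || percentage > 100) {
            throw new IllegalArgumentException("percentage must be 0..100");
        }
        this.percentage = percentage;
    }

    boolean isEnabledFor(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < percentage;
    }
}
```

Keying on the user ID (rather than random sampling per request) ensures each user gets a consistent experience during the rollout.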

  29. Flag-based feature rollout Example.2 • Case2: Change new feature visibility

  30. How to update the flag at runtime? • All the

    flag users need to know the latest values to make rollouts smooth • But there is no general way to achieve this • Distribute a property file to each server? Ansible, Chef, or rsync + inotify? • Make a new management API? Do we need new APIs every time we add a flag or service?
  31. Central Dogma • Repository service for textual configuration • JSON,

    YAML, XML … • Highly available • multi-master, eventual consistency • Provides an API to notify users of changes https://line.github.io/centraldogma/
  32. Rollout features using CentralDogma • Roll out to production within 1

    minute of a change • Deploy a YAML/JSON file to all our production servers via CentralDogma • Record changes in the CentralDogma commit log [Diagram: Developer commits YAML ("CacheV2 rollout: 30%") to CentralDogma; services pull & reload the config]
  33. Rollout features using CentralDogma • Roll out to production within 1

    minute of a change • Deploy a YAML/JSON file to all our production servers via CentralDogma • Record changes in the CentralDogma commit log
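The pattern CentralDogma automates — watch a config file and push new values to in-process listeners — can be sketched as follows. The classes and names are hypothetical; CentralDogma's real Java client provides a watcher API that delivers updates like this:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Sketch of the watch-and-reload pattern: the holder keeps the latest flag
// value and notifies listeners whenever a new revision of the watched config
// file arrives, so every server picks up a rollout change at runtime.
class FeatureFlagHolder {
    private volatile int rolloutPercentage;
    private final List<Consumer<Integer>> listeners = new CopyOnWriteArrayList<>();

    FeatureFlagHolder(int initial) { this.rolloutPercentage = initial; }

    int rolloutPercentage() { return rolloutPercentage; }

    void addListener(Consumer<Integer> listener) { listeners.add(listener); }

    // Called when the watched config file changes (e.g. a new commit is pulled).
    void onConfigUpdate(int newPercentage) {
        rolloutPercentage = newPercentage;
        for (Consumer<Integer> l : listeners) {
            l.accept(newPercentage);
        }
    }
}
```

The `volatile` field makes the latest value visible to request threads without locking, while listeners let components react to a change immediately.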
  34. How to control the flag by CentralDogma? CentralDogma /myFeatureFlag.json

  35. How to control the flag by CentralDogma? CentralDogma /myFeatureFlag.json

  36. How to control the flag by CentralDogma? CentralDogma /myFeatureFlag.json

  37. Consideration for Flag-based rollout • Regression tests required for halfway

    rollouts • One halfway rollout requires 2 regression tests: new feature + old feature • If you have tens of halfway rollouts, the combinations explode • Hard to guarantee flag completeness • A feature might be revealed partially • Some changes are hard to control by flag • e.g. SDK, JVM, or library upgrades
  38. Everything goes well? • We finally succeeded in rolling out, but failed

    4 times along the way [Table: Requirement (Fast, Reliable, Maintainable, Efficiency, Fault tolerance) × Legacy Impl vs Armeria vs Armeria+ vs Armeria++]
  39. How did it lead to success? • Patiently look for anything unusual

    • Check various kinds of metrics • CPU, GC, thread utilization, API latency, heap dumps, histograms, etc. • Find bottlenecks using a profiler
  40. How did it lead to success? • Spot the unusual in

    metrics • On the client side, the max latency of the new client was always 5 sec • But the server side looked normal
  41. How did it lead to success? • Observed that the client

    spends time on DNS resolution
  42. How did it lead to success? • Observed that the HTTP

    client hit DNS every time it sent a request • Q. Why did it need to ask DNS? → The DNS cache was disabled, and we forgot to use pre-resolved IP addresses for the endpoints
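The fix boils down to resolving endpoint hostnames once (or on a refresh interval) and reusing the addresses, instead of hitting DNS on every request. A sketch with hypothetical names:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the fix: resolve each endpoint hostname once and cache the
// result, instead of asking DNS on every request (which can add seconds
// of tail latency when the resolver is slow).
class ResolvedEndpointCache {
    private final Map<String, InetAddress> cache = new ConcurrentHashMap<>();

    InetAddress resolve(String host) {
        return cache.computeIfAbsent(host, h -> {
            try {
                return InetAddress.getByName(h); // DNS round-trip only on first use
            } catch (UnknownHostException e) {
                throw new IllegalStateException("cannot resolve " + h, e);
            }
        });
    }
}
```

A production version would also expire entries so that DNS changes (e.g. server replacement) are eventually picked up.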
  43. How did it lead to success? • Find bottlenecks using a profiler • For a Java app,

    use async-profiler • https://github.com/jvm-profiling-tools/async-profiler • Once you solve one bottleneck, it's time to find the next
  44. How did it lead to success? • Find bottlenecks using a profiler • For a Java app,

    use async-profiler • Once you solve one bottleneck, it's time to find the next
  45. Future work • Shave off more inefficiencies • Many connections are

    created under high load, which may cause GC pressure on the client/server side for managing them https://github.com/line/armeria/issues/816 https://github.com/line/armeria/pull/1886 • Long-polling-based server health notification • Currently a client sends a health check request periodically to learn whether a server is healthy, so it keeps sending requests to an unhealthy server until the next health check https://github.com/line/armeria/issues/1756 https://github.com/line/armeria/pull/1878
  46. Conclusion • Our system is built on top of our

    own OSS projects • Armeria https://line.github.io/armeria/ • CentralDogma https://github.com/line/centraldogma • … and they are still evolving • Trouble may happen • Build a system well prepared for trouble • Let's do a smooth release
  47. Q & A