Lesson learned from the adoption of Armeria to LINE's authentication system

Presentation at the TWJUG August 2019 meetup in Taipei on 2019/8/15
https://www.meetup.com/taiwanjug/events/263411288/

LINE Developers

August 15, 2019

Transcript

  1. Lesson learned from the adoption of Armeria to LINE's authentication

    system. Masahiro IDE, LINE CORP
  2. About me • Masahiro Ide • Software Engineer at LINE

    server development • Messaging backend server, Redis cluster management • Armeria adoption • Contributor to Armeria and line-bot-sdk-java • https://github.com/line/armeria • https://github.com/line/line-bot-sdk-java
  3. Agenda • User request authentication at LINE • Inside

    our auth system • Various issues faced during Armeria adoption • Future work
  4. What is LINE Authentication service? • Authenticates requests coming to

    the LINE messaging service • Text messages, broadcast bot messages, sticker purchases, etc. • Server-side authentication ↔ Client-side authentication [Diagram: Client → Reverse Proxy → Talk servers / StickerShop / Bot backend → Auth server]
  5. Auth Server Components • Netty 4.1.0-beta8 • epoll, HTTP/2 •

    Jedis (Redis client for Java) • In-house Redis cluster • MyBatis + mysql-connector-java • With in-house sharding system • L4 Load Balancer (LB) [Diagram: Apps → L4-LB → Auth Server (Netty, Jedis, MyBatis) → Redis cluster / MySQL]
  6. Requirement • Fast • Latency SLO: 99% of requests served

    in < 10ms • Reliable • Handles 600K req/sec within reasonable time • Maintainable • Easy to modify by anyone • Efficiency • HTTP/2 + L4 LB causes load imbalance
  7. HTTP/2 + L4 LB causes load imbalance? • With HTTP/2,

    you can multiplex requests onto a single connection • To maximize this benefit, HTTP/2 clients and servers try to keep a connection alive as long as possible [Diagram: Client ↔ Server with stream1, stream2, stream3 multiplexed]
  8. HTTP/2 + L4 LB causes load imbalance? • With HTTP/2,

    you can multiplex requests onto a single connection • To maximize this benefit, HTTP/2 clients and servers try to keep a connection alive as long as possible [Diagram: Client ↔ Server pairs, each keeping a single connection]
  9. HTTP/2 + L4 LB causes load imbalance? • With HTTP/2 + LB,

    because the server, client, and LB keep reusing the connection created once at startup, a client may unintentionally communicate only with the same server • In the worst case, all requests go to one server [Diagram: Clients → LB → Servers, with traffic pinned to one server]
  10. HTTP/2 + L4 LB causes load imbalance? • Disconnecting aged

    connections can relax the imbalance • But it is not a perfect solution: val conn = getConnectionPool(); if (conn.maxAge >= 5 min) { conn.disconnect(); conn = getConnectionPool(); } [Graph: ~7K req/sec]
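The disconnect-aged-connection workaround shown on this slide can be sketched in plain Java. `ConnectionPool`-style names here are hypothetical stand-ins for illustration, not the actual implementation:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the slide's workaround: recycle any connection older than a
// maximum age, so that over time load re-spreads across servers behind the LB.
class AgedConnectionRecycler {
    static final Duration MAX_AGE = Duration.ofMinutes(5);

    // Minimal stand-in for a pooled connection (not a real API).
    static class Connection {
        final Instant createdAt;
        boolean closed;
        Connection(Instant createdAt) { this.createdAt = createdAt; }
        void disconnect() { closed = true; }
    }

    // Returns the same connection while it is young enough; otherwise
    // disconnects it and hands out a fresh one, which may land on another server.
    static Connection getOrRecycle(Connection conn, Instant now) {
        if (Duration.between(conn.createdAt, now).compareTo(MAX_AGE) >= 0) {
            conn.disconnect();
            return new Connection(now);
        }
        return conn;
    }
}
```

As the slide notes, this only relaxes the imbalance: between recycles a client is still pinned to one server.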
  11. Requirement • Fast • Latency SLO: 99% of requests served

    in < 10ms • Reliable • Handles 600K req/sec within reasonable time • Maintainable • Easy to modify by anyone • Efficiency • HTTP/2 + L4 LB Q. Can Armeria solve the Requirements?
  12. Armeria - Our RPC layer • Asynchronous RPC/REST library •

    built on top of Java 8, Netty, HTTP/2, Thrift and gRPC • Takes care of common functionality for microservices • Client-side LB • Circuit Breaker / Retry / Throttling • Tracing (Zipkin) / Monitoring integration • etc. https://line.github.io/armeria/
  13. Requirement for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • HTTP/2 + L4 LB → HTTP/2 + Client-side LB
  14. Requirement for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • HTTP/2 + L4 LB → HTTP/2 + Client-side LB
  15. Easy to write, easy to maintain your service • Armeria

    provides usability similar to existing web frameworks https://line.github.io/armeria/server-annotated-service.html
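The annotated-service style the slide links to looks roughly like this. This is a sketch assuming the Armeria dependency is on the classpath and a recent Armeria version; the route and handler names are made up for illustration:

```java
import com.linecorp.armeria.server.Server;
import com.linecorp.armeria.server.annotation.Get;
import com.linecorp.armeria.server.annotation.Param;

public class AnnotatedServiceSketch {
    public static void main(String[] args) {
        Server server = Server.builder()
                .http(8080)
                // Annotated services map methods to routes, much like
                // familiar web frameworks such as Spring MVC.
                .annotatedService(new Object() {
                    @Get("/hello/:name")
                    public String hello(@Param("name") String name) {
                        return "Hello, " + name + '!';
                    }
                })
                .build();
        server.start().join();
    }
}
```

Behind this simple surface the service still runs on Armeria's asynchronous Netty/HTTP/2 stack.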
  16. Requirement for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • HTTP/2 + L4 LB → HTTP/2 + Client-side LB
  17. Client-side load balancing • Problem: Proxy-based LB + HTTP/2 causes

    load imbalance if connections are kept alive • Solution? • Disconnect every time a client receives a response? • Connect client and server directly + load balance on the client side [Diagram: Proxy-based LB (Clients → LB → Servers) vs Client-side LB (Clients → Servers directly)]
  18. Client-side load balancing • Load balancing on the client side resolves

    the load imbalance • But… how can the client know the endpoint locations? • Hardcode the endpoint list into the app? • Register the locations of all service instances in a service registry, and have clients query the registry [Diagram: a server registers "10.120.16.2:8080" under "myService"; clients query and get {"myService": [ "10.120.16.2:8080", "10.120.16.2:8080"]}]
  19. Client-side load balancing • Client-side LB creates more connections

    between client and server • … but provides balanced load to servers even with HTTP/2 [Graph: server load evens out after applying client-side LB]
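The idea on these slides can be sketched as a tiny round-robin endpoint selector in plain Java. Armeria ships its own endpoint-group abstraction for this; the class below is a simplified illustration, not Armeria's API:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified client-side load balancer: the client holds the endpoint list
// (e.g. fetched from a service registry) and picks a server per request,
// instead of relying on a proxy LB in the middle.
class RoundRobinEndpoints {
    private final List<String> endpoints;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinEndpoints(List<String> endpoints) {
        this.endpoints = List.copyOf(endpoints);
    }

    // Each request takes the next endpoint in turn, so load spreads evenly
    // even when every connection is a long-lived HTTP/2 one.
    String pick() {
        int i = Math.floorMod(next.getAndIncrement(), endpoints.size());
        return endpoints.get(i);
    }
}
```

A real implementation would also refresh the endpoint list from the registry and skip unhealthy endpoints.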
  20. Requirement for Armeria adoption • Fast and Reliable • Netty

    + epoll + HTTP/2 (mostly the same as the previous implementation) • Maintainable • → Easy to modify by anyone • Succeeded in cleaning up tons of forked code & Netty handlers • Efficiency • → HTTP/2 + Client-side LB
  21. Everything goes well? [Table: Requirement (Fast, Reliable, Maintainable, Efficiency) × Legacy vs Armeria]

  22. Everything goes well? • No, it was not ready for

    production yet [Table: Requirement (Fast, Reliable, Maintainable, Efficiency, Fault tolerance) × Legacy vs Armeria vs Armeria+]
  23. Why do we need to care about fault tolerance? • Because… •

    Authentication is a Single Point of Failure in our system: you can't do anything at LINE until the service comes back • An outage affects user experience: a user cannot send a sticker to a friend via LINE, cannot share a photo with family, etc. • We lose an opportunity to earn
  24. How to reduce # of outages? • Canary release •

    Flag-based feature rollout
  25. Canary release • A way to apply a few % of

    real load to a newly released binary • Useful for checking regressions without affecting all users • If no issue is found, increase the load • If you find anything, roll back easily
  26. Canary release solves all issues? • Case of a large-scale

    web application • It takes 4 mins per server to restart, and there are 1000 servers • It takes 25 mins even if we parallelize the operation • Under low load a new release may show no issue, but when applied to all servers we may find a performance issue • Resolving that takes a lot of time • Fine-grained canary releases may help, but is there any way to release more flexibly?
  27. Flag-based feature rollout • Merge all features into the release binary

    • Enable/disable a feature with a property file
  28. Flag-based feature rollout Example.1 • Case1: Percentage rollout 5% 10%
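Percentage rollout is typically implemented by hashing a stable key (e.g. a user ID) into a bucket and comparing it with the configured percentage. A minimal sketch — the hashing scheme is an assumption for illustration, not LINE's actual implementation:

```java
// Sketch of percentage-based rollout: a stable hash of the user ID maps each
// user into one of 100 buckets; users in buckets below the configured
// percentage see the new feature. Because the hash is stable, raising the
// flag from 5% to 10% keeps the original 5% of users enabled.
class PercentageRollout {
    private final int percentage; // 0..100

    PercentageRollout(int percentage) {
        if (percentage < 0 || percentage > 100) {
            throw new IllegalArgumentException("percentage must be 0..100");
        }
        this.percentage = percentage;
    }

    boolean isEnabledFor(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < percentage;
    }
}
```

Keying on the user ID (rather than random sampling per request) ensures each user gets a consistent experience during the rollout.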

  29. Flag-based feature rollout Example.2 • Case2: Change new feature visibility

  30. How to update the flag at runtime? • All the

    flag users need to know the latest values to make rollouts smooth • But there is no general way to achieve this • Distribute a property file to each server? Ansible, Chef, or rsync + inotify? • Make a new management API? Do we need new APIs every time we add a flag or service?
  31. Central Dogma • Repository service for textual configuration • JSON,

    YAML, XML … • Highly available • multi-master, eventual consistency • Provides an API to notify users of changes https://line.github.io/centraldogma/
  32. Rollout features using CentralDogma • Roll out to production within 1

    minute of a change • Deploy a YAML/JSON file to all our production servers via CentralDogma • Record changes in the CentralDogma commit log [Diagram: Developer commits YAML ("CacheV2 rollout: 30%") to CentralDogma; services pull & reload the config]
  33. Rollout features using CentralDogma • Roll out to production within 1

    minute of a change • Deploy a YAML/JSON file to all our production servers via CentralDogma • Record changes in the CentralDogma commit log
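The pattern CentralDogma automates — watch a config file and push new values to in-process listeners — can be sketched as follows. The classes and names are hypothetical; CentralDogma's real Java client provides a watcher API that delivers updates like this:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Sketch of the watch-and-reload pattern: the holder keeps the latest flag
// value and notifies listeners whenever a new revision of the watched config
// file arrives, so every server picks up a rollout change at runtime.
class FeatureFlagHolder {
    private volatile int rolloutPercentage;
    private final List<Consumer<Integer>> listeners = new CopyOnWriteArrayList<>();

    FeatureFlagHolder(int initial) { this.rolloutPercentage = initial; }

    int rolloutPercentage() { return rolloutPercentage; }

    void addListener(Consumer<Integer> listener) { listeners.add(listener); }

    // Called when the watched config file changes (e.g. a new commit is pulled).
    void onConfigUpdate(int newPercentage) {
        rolloutPercentage = newPercentage;
        for (Consumer<Integer> l : listeners) {
            l.accept(newPercentage);
        }
    }
}
```

The `volatile` field makes the latest value visible to request threads without locking, while listeners let components react to a change immediately.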
  34. How to control the flag by CentralDogma? CentralDogma /myFeatureFlag.json

  35. How to control the flag by CentralDogma? CentralDogma /myFeatureFlag.json

  36. How to control the flag by CentralDogma? CentralDogma /myFeatureFlag.json

  37. Consideration for Flag-based rollout • Regression tests required for halfway

    rollouts • One halfway rollout requires 2 regression tests: new feature + old feature • If you have tens of halfway rollouts, the combinations explode • Hard to guarantee flag completeness • A feature might be revealed partially • Some changes are hard to control by flag • e.g. SDK, JVM, or library upgrades
  38. Everything goes well? • We finally succeeded in rolling out, but failed

    4 times along the way [Table: Requirement (Fast, Reliable, Maintainable, Efficiency, Fault tolerance) × Legacy Impl vs Armeria vs Armeria+ vs Armeria++]
  39. How did it lead to success? • Patiently look for anything unusual

    • Check various kinds of metrics • CPU, GC, thread utilization, API latency, heap dumps, histograms, etc. • Find bottlenecks using a profiler
  40. How did it lead to success? • Spot the unusual in

    metrics • On the client side, the max latency of the new client was always 5 sec • But the server side looked normal
  41. How did it lead to success? • Observed that the client

    spends time on DNS resolution
  42. How did it lead to success? • Observed that the HTTP

    client hit DNS every time it sent a request • Q. Why did it need to ask DNS? → The DNS cache was disabled, and we forgot to use pre-resolved IP addresses for the endpoints
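The fix boils down to resolving endpoint hostnames once (or on a refresh interval) and reusing the addresses, instead of hitting DNS on every request. A sketch with hypothetical names:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the fix: resolve each endpoint hostname once and cache the
// result, instead of asking DNS on every request (which can add seconds
// of tail latency when the resolver is slow).
class ResolvedEndpointCache {
    private final Map<String, InetAddress> cache = new ConcurrentHashMap<>();

    InetAddress resolve(String host) {
        return cache.computeIfAbsent(host, h -> {
            try {
                return InetAddress.getByName(h); // DNS round-trip only on first use
            } catch (UnknownHostException e) {
                throw new IllegalStateException("cannot resolve " + h, e);
            }
        });
    }
}
```

A production version would also expire entries so that DNS changes (e.g. server replacement) are eventually picked up.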
  43. How did it lead to success? • Find bottlenecks using a profiler • For a Java app,

    use async-profiler • https://github.com/jvm-profiling-tools/async-profiler • Once you solve one bottleneck, it's time to find the next
  44. How did it lead to success? • Find bottlenecks using a profiler • For a Java app,

    use async-profiler • Once you solve one bottleneck, it's time to find the next
  45. Future work • Shave off more inefficiencies • Many connections are

    created under high load, which may cause GC pressure on the client/server side for managing them https://github.com/line/armeria/issues/816 https://github.com/line/armeria/pull/1886 • Long-polling-based server health notification • Currently a client sends a health check request periodically to learn whether a server is healthy, so it keeps sending requests to an unhealthy server until the next health check https://github.com/line/armeria/issues/1756 https://github.com/line/armeria/pull/1878
  46. Conclusion • Our system is built on top of our

    own OSS projects • Armeria https://line.github.io/armeria/ • CentralDogma https://github.com/line/centraldogma • … and they are still evolving • Trouble may happen • Build a system well prepared for trouble • Let's do a smooth release
  47. Q & A