Two Troubleshooting Cases from LINE's B2B Platform

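LINE has a wide variety of B2B products, such as LINE Official Accounts and advertising across the many services built around the LINE app, together with the platforms that support them. These are integrated with many internal and external systems and handle large-scale traffic and data.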

In this talk I would like to present, in a lighthearted way, a few of the "problems" that came up while operating this B2B platform and how they were troubleshot.

Some of these problems probably occur only in our environment, but I hope the troubleshooting process itself proves useful.

Speaker: Ryosuke Hasebe

These slides were presented at JJUG CCC 2022 Spring.
https://fortee.jp/jjug-ccc-2022-spring/proposal/730d46e2-a295-45c2-abfa-bb7bf13ad7c9

LINE Developers

June 19, 2022

Transcript

  1. Two Troubleshooting Cases from LINE's B2B Platform. Ryosuke Hasebe, 2022.06.19
  2. Speaker: Ryosuke Hasebe. LINE (2013/9). OA Dev / OA SRE. Products named on the slide: LINE Login (OAuth2/OIDC), LINE Profile+, LINE Notify. Technologies: Java/Kotlin, Reactive Streams / Kotlin Coroutines, K8s. GitHub: be-hase, Twitter: be_hasee
  3. Agenda: 1. About LINE's B2B Platform; 2. Case 1: Slow latency issue after updating to Lettuce v6; 3. Case 2: Direct buffer OOME issue due to bad usage of Spring WebClient
  4. About LINE's B2B Platform

  5. LINE B2B: CRM, API, BOT, Talk Head View
  6. Scale: CPU 4,500 core; (request/sec) 10; Memory 14 TB. ※ As of September 2021
  7. (B2B)

  8. Technology stack:
    Kotlin/Java, Spring Boot, Armeria, gRPC/Thrift
    MySQL, HBase, Redis, Kafka, Elasticsearch, Centraldogma, nginx, fluentd
    Verda (OpenStack based Private Cloud) VM/PM, Kubernetes
    Prometheus, Grafana, IU, Kibana, IMON
    GHE, Jenkins, Drone, Circle CI, Ansible, ArgoCD
  9. Case 1: Slow latency issue after updating to Lettuce v6
  10. After updating Lettuce from v4.5.0 to v6.0.0, the 99.9 percentile latency worsened (about 1 sec). Lettuce = Redis client library for Java. Environment as listed on the slide: spring-data-redis, Kafka Consumer, 96-node Redis Cluster, 1 3K commands/sec, HGETALL
  11. Workaround: stay on v5 (v5.3.5). The v4 -> v6 update was narrowed down to the step v5.3.5 -> v6.0.0, so the problem lies in v6
  12. 
 (1 )

  13. Lettuce 5.3 is EOL 😨
    > 5.3.x is EOL (end-of-life) as of June 2021.
    https://github.com/lettuce-io/lettuce-core/wiki/Lettuce-Versions
    Context on the slide: EOL, Spring4Shell, Lettuce v6.1.6
  14. client-side (Lettuce) / server-side (Redis): only the Lettuce version changed, which points to the client side. Redis server-side latency and SLOWLOG were checked as well. The suspect is therefore client-side (= java application = Lettuce)
  15. Stop The World (STW): could GC or another STW pause explain the 99.9 percentile latency, rather than Lettuce?
    GC time: checked with Micrometer. GC cause: HeapDump analysed with Eclipse Memory Analyzer.
    STW other than GC: checked with the Unified JVM Logging safepoint log, for example:
      [2022-03-14T17:30:16.483+0900][192775.478][info ][safepoint] Total time for which application threads were stopped: 0.xxx seconds, Stopping threads took: 0.xxx seconds
    About the safepoint log: https://krzysztofslusarski.github.io/2020/11/13/stw.html
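
The safepoint check on slide 15 can be automated with a few lines of Java. The following is a minimal sketch, not the speaker's tooling: it assumes log lines in the format quoted on the slide, and the class name, the threshold, and the sample values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SafepointLogScan {
    // Captures the number of seconds in "... threads were stopped: <n> seconds".
    private static final Pattern STOPPED = Pattern.compile(
            "Total time for which application threads were stopped: ([0-9]+(?:\\.[0-9]+)?) seconds");

    /** Returns the log lines whose stop-the-world time is at least thresholdSeconds. */
    static List<String> longPauses(List<String> lines, double thresholdSeconds) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STOPPED.matcher(line);
            if (m.find() && Double.parseDouble(m.group(1)) >= thresholdSeconds) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample lines; only the layout follows the slide.
        List<String> sample = List.of(
                "[2022-03-14T17:30:16.483+0900][192775.478][info ][safepoint] Total time for which application threads were stopped: 0.002 seconds, Stopping threads took: 0.001 seconds",
                "[2022-03-14T17:31:20.101+0900][192839.096][info ][safepoint] Total time for which application threads were stopped: 1.204 seconds, Stopping threads took: 0.900 seconds");
        // Prints only the second line, the one with a pause of 0.5 s or more.
        longPauses(sample, 0.5).forEach(System.out::println);
    }
}
```

If no line crosses a threshold close to the observed latency spike (about 1 sec here), STW pauses are an unlikely explanation and the search can move on.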
  16. JVM / Redis. v5.3.5 vs v6.0.0: Try & Error. Local
  17. Comparison setup: a Kafka Consumer in a separate Consumer Group running Lettuce 6.1.16. Chart labels: Lettuce v5.3.15, Lettuce v6.1.16
  18. Hypothesis: RESP3. Lettuce v6 supports RESP3, the protocol that arrived with Redis v6 (https://github.com/antirez/RESP3/blob/master/spec.md). Lettuce v6 tries RESP3 and falls back to RESP2 when the Redis server does not support it. To rule RESP3 out, RESP2 was forced:
      ClusterClientOptions
          .builder()
          // use RESP2 only
          .protocolVersion(ProtocolVersion.RESP2)
          .build()
  19. Hypothesis: Big Keys. Does Lettuce 6 handle Big Keys badly with HGETALL? About Redis Big Keys: https://www.alibabacloud.com/blog/a-detailed-explanation-of-the-detection-and-processing-of-bigkey-and-hotkey-in-redis_598143. Hash size and latency were logged and compared (chart: Big Keys, v5.3.15 vs v6.1.16)
  20. CPU profiling: flame graphs taken with async-profiler (v5.3.15 vs v6.1.16)

  21. On Lettuce's event-loop (non-blocking) thread, ClusterTopologyRefresh.getNodeSpecificViews stands out in the flame graph. Is Cluster Topology Refresh running that often? (async-profiler, v5.3.15 vs v6.1.16)
  22. Lettuce Cluster Topology Refresh (CTR): with Redis Cluster, the client side has to know which node serves each key (slot). Lettuce keeps that mapping current through Cluster Topology Refresh (CTR), which runs CLUSTER NODES once every 60 seconds and also relates to MOVED redirection. https://github.com/lettuce-io/lettuce-core/issues/339
  23. Example CLUSTER NODES output:
      e7d1eecce10fd6bb5eb35b9f99a514335d9ba9ca 127.0.0.1:30001@31001 master - 0 0 1 connected 0-5460
      67ed2db8d677e59ec4a4cefb06858cf2a1a89fa1 127.0.0.1:30002@31002 master - 0 1426238316232 2 connected 5461-10922
      292f8b365bb7edb5e285caf0b7e6ddc7265d2f4f 127.0.0.1:30003@31003 master - 0 1426238318243 3 connected 10923-16383
      07c37dfeb235213a872192d90877d0cd55635b91 127.0.0.1:30004@31004 slave e7d1eecce10fd6bb5eb35b9f99a514335d9ba9ca 0 1426238317239 4 connected
      6ec23923021cf3ffec47632106199cb7f496ce01 127.0.0.1:30005@31005 slave 67ed2db8d677e59ec4a4cefb06858cf2a1a89fa1 0 1426238316232 5 connected
      824fe116063bc5fcf9f4ffd895bc17aee7731ac3 127.0.0.1:30006@31006 slave 292f8b365bb7edb5e285caf0b7e6ddc7265d2f4f 0 1426238317741 6 connected
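
To make "key (slot)" concrete: the Redis Cluster specification maps every key to one of 16384 slots with CRC16 (XMODEM variant) modulo 16384, hashing only the hash tag between `{` and `}` when a non-empty one is present. The sketch below is a standalone illustration of that rule, not Lettuce's own code; the class and method names are assumptions.

```java
import java.nio.charset.StandardCharsets;

class RedisKeySlot {
    static final int SLOT_COUNT = 16384;

    /** CRC16/XMODEM over data[from, to): polynomial 0x1021, initial value 0. */
    static int crc16(byte[] data, int from, int to) {
        int crc = 0;
        for (int i = from; i < to; i++) {
            crc ^= (data[i] & 0xFF) << 8;
            for (int bit = 0; bit < 8; bit++) {
                crc = ((crc & 0x8000) != 0) ? (crc << 1) ^ 0x1021 : crc << 1;
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    /** Slot of a key, honouring the hash-tag rule of the cluster specification. */
    static int slot(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        int from = 0;
        int to = bytes.length;
        int open = indexOf(bytes, (byte) '{', 0);
        if (open >= 0) {
            int close = indexOf(bytes, (byte) '}', open + 1);
            if (close > open + 1) { // only a non-empty tag replaces the full key
                from = open + 1;
                to = close;
            }
        }
        return crc16(bytes, from, to) % SLOT_COUNT;
    }

    private static int indexOf(byte[] bytes, byte target, int start) {
        for (int i = start; i < bytes.length; i++) {
            if (bytes[i] == target) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(slot("foo")); // 12182
        System.out.println(slot("{user1000}.following") == slot("{user1000}.followers")); // true
    }
}
```

In the example CLUSTER NODES output on slide 23, slot 12182 falls in the range 10923-16383, so a client with a current topology would send commands for `foo` to 127.0.0.1:30003 directly.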
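  25. Why does CTR show up in the flame graph when it runs only once every 60 seconds? According to the flame graph, the processing is O(n^2) in the number of nodes n; with 96 nodes it takes about 1 sec. From Lettuce v6.0.0 this work runs on the event loop (← EpollEventLoop.run)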
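  26. The number of event loops is CPU x 2. When one event loop is occupied for about 1 sec by CTR, the Redis command I/O assigned to that event loop has to wait (hence the 99.9 percentile latency of about 1 sec)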
  27. (Diagram: CTR and latency)

  28. (Diagram: CTR occupying an event-loop, and the resulting latency)
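
The stall described on slides 26 to 28 can be reproduced in miniature without Redis. The sketch below uses a single-threaded `ExecutorService` as a stand-in for one event loop and a `Thread.sleep` as a stand-in for the long topology refresh; both stand-ins and all names are assumptions for illustration, not Lettuce or netty code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class EventLoopStall {
    /**
     * Submits a blocking task to a single-threaded "event loop", then a trivial
     * "command", and returns how long the command took to complete (ms).
     */
    static long commandLatencyMillis(long blockingMillis) {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the topology refresh: keeps the only thread busy.
            eventLoop.submit(() -> {
                try {
                    Thread.sleep(blockingMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            long start = System.nanoTime();
            // Stand-in for a Redis command queued on the same event loop.
            Future<String> reply = eventLoop.submit(() -> "OK");
            reply.get(blockingMillis + 5_000, TimeUnit.MILLISECONDS);
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            eventLoop.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("latency behind a 1000 ms task: " + commandLatencyMillis(1000) + " ms");
        System.out.println("latency on an idle loop: " + commandLatencyMillis(0) + " ms");
    }
}
```

The command itself costs nothing; its latency is almost entirely the time spent queued behind the blocking task. Only the commands that land on the busy event loop during the refresh are affected, which would explain why the problem is visible at the 99.9 percentile rather than on average.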

  29. Filed an Issue and a PR to Lettuce. Issue: https://github.com/lettuce-io/lettuce-core/issues/2045. PR: https://github.com/lettuce-io/lettuce-core/pull/2048. Released in 6.1.8: https://github.com/lettuce-io/lettuce-core/releases/tag/6.1.8.RELEASE
  30. Other Solution: turn off CTR's dynamic refresh sources (a Lettuce option, enabled by default). When it is on, CLUSTER NODES is sent to every discovered node; when it is off, only to the Initial Seed Nodes. Caveat: if the Cluster changes frequently or the Initial Seed Nodes go down, CTR can no longer work properly.
      // The sources can be limited to the nodes specified here (Initial Seed Nodes)
      RedisURI node1 = RedisURI.create("node1", 6379);
      RedisURI node2 = RedisURI.create("node2", 6379);
      RedisClusterClient clusterClient = RedisClusterClient.create(Arrays.asList(node1, node2));
  31. Summary: Redis Cluster; Lettuce version 6.1.8; Local
  32. Case 2: Direct buffer OOME issue due to bad usage of Spring WebClient
  33. Out of Memory Error (OOME) in a service that handles CSV. Server spec: 20 core, 64 GB Mem, -Xmx24g
  34. Workaround 2,3 ( ): with -XX:+ExitOnOutOfMemoryError the JVM exits on OOME, and supervisord restarts the JVM
  35. Is the heap (-Xmx24g) being exhausted by the CSV processing? Checked with Eclipse Memory Analyzer
  36. OutOfMemoryError: Direct buffer memory (native memory rather than the Java heap):
      Caused by: java.lang.OutOfMemoryError: Direct buffer memory
          at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
          at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
          at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
          at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:645)
          at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:621)
    ※ Note: from Java 13 the error message is clearer and more helpful. https://bugs.openjdk.java.net/browse/JDK-8048192
  37. Direct buffer usage as a metric (Micrometer): https://github.com/micrometer-metrics/.../JvmMemoryMetrics.java

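  38. Where the limit comes from in the JDK: https://github.com/openjdk/jdk/blob/jdk-11+28/src/java.base/share/classes/java/nio/Bits.java#L175. Unless -XX:MaxDirectMemorySize is given, the limit is Runtime.getRuntime().maxMemory(), that is, roughly -Xmx (so the JVM as a whole may use about -Xmx * 2). Allocating beyond the limit raises OOME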
  39. The stack trace points at netty. Two libraries in the application use netty: Spring WebClient (WebFlux) and Lettuce (Redis Client). Which is it, Spring WebClient or Lettuce?
  40. Spring WebClient or Lettuce? Spring WebClient is the suspect: the CSV is received as Mono<byte[]> instead of being streamed as a Reactor Flux<DataBuffer>. Spring WebClient (Spring Boot 2.1) has an in-memory limit of 256 KB, but it had been set to unlimited.
      webClient.get()
          .uri(uri)
          .retrieve()
          .bodyToMono(byte[].class) // suspicious
          .block();

      WebClient.builder()
          .codecs(configurer -> configurer.defaultCodecs()
              .maxInMemorySize(-1)) // had been set to unlimited
          .build();
  41. Reproduction setup (diagram): 2 ..., 300 MB text file

  42. Reproducing the OOME: with -XX:MaxDirectMemorySize=1g, 4 parallel requests (4 * 300MB) end in OOME.
      seq 4 | xargs -P 4 -I{} curl localhost:8080/mono

      Caused by: java.lang.OutOfMemoryError: Direct buffer memory
          at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
          at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
          at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
          at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:648)
          at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:623)
          at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:202)
          at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:186)
          at io.netty.buffer.PoolArena.allocate(PoolArena.java:136)
          at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
          at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:394)
          at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
          at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
  43. After the OOME, direct buffer usage stays at the limit, even after forcing a GC (jcmd PID GC.run):
      watch curl localhost:8080/mem
      name=direct, count=76, memoryUsed=1008MB, totalCapacity=1008MB
      name=direct, count=76, memoryUsed=1008MB, totalCapacity=1008MB
      …
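
The slides do not show how the `/mem` endpoint is implemented. One plausible way to produce a line in that shape is the JDK's `BufferPoolMXBean`; the sketch below is an assumption along those lines, with hypothetical class and method names. It only sees memory that the JDK accounts for, which covers allocations made through `ByteBuffer.allocateDirect` as in the stack traces above.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

class DirectBufferUsage {
    /** Finds the JDK's "direct" buffer pool bean. */
    static BufferPoolMXBean directPool() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool;
            }
        }
        throw new IllegalStateException("this JVM reports no 'direct' buffer pool");
    }

    /** Formats the pool in the same shape as the output quoted on the slide. */
    static String describe(BufferPoolMXBean pool) {
        long mb = 1024L * 1024L;
        return String.format("name=%s, count=%d, memoryUsed=%dMB, totalCapacity=%dMB",
                pool.getName(), pool.getCount(),
                pool.getMemoryUsed() / mb, pool.getTotalCapacity() / mb);
    }

    public static void main(String[] args) {
        // Hold a 16 MB direct buffer so the pool has something to report.
        ByteBuffer held = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        System.out.println(describe(directPool()));
        System.out.println("still holding " + held.capacity() + " bytes");
    }
}
```

Polling such a value makes a direct buffer leak visible long before the OOME, which the heap-oriented tools on slide 35 cannot do.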
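  44. Why the OOME went unnoticed: -XX:+ExitOnOutOfMemoryError does not take effect for this OOME!! The flag is handled in report_java_out_of_memory (https://github.com/openjdk/jdk11u/.../src/hotspot/share/utilities/debug.cpp#L319), whereas the Direct buffer memory OOME is thrown from Java code (java.nio.Bits.reserveMemory). Also on the slide: Systemd, Micrometer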
  45. Fix: receive the body as Flux<DataBuffer> (streaming). Under the same load, direct buffer usage stays low:
      seq 4 | xargs -P 4 -I{} curl localhost:8080/flux
      curl localhost:8080/mem
      name=direct, count=17, memoryUsed=80MB, totalCapacity=80MB
  46. Before / After: no more OOME!! Usage now stays within about 570MB. It looks like this year's New Year holidays will be peaceful 🙌

  47. Aside: netty's PooledByteBufAllocator is based on jemalloc. https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf https://www.facebook.com/notes/10158791475077200/
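  48. Why production hit the limit: WebClient (reactor-netty) runs as many I/O threads as there are CPU cores, 20 here. If all of them receive a 2GB CSV at the same time: 20 * 2GB = 40GB, which exceeds 24GB (= -Xmx or -XX:MaxDirectMemorySize)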
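
The estimate on slide 48 and the reproduction on slide 42 follow the same arithmetic: when every in-flight response is aggregated in memory, the worst case is the number of concurrent downloads times the body size. A tiny sketch of that check, with hypothetical names and the simplifying assumption that each body is held once, in full:

```java
class DirectMemoryEstimate {
    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    /** Worst case when each concurrent download is fully aggregated in memory. */
    static long worstCaseBytes(int concurrentDownloads, long bodyBytes) {
        return Math.multiplyExact((long) concurrentDownloads, bodyBytes);
    }

    /** True when the worst case does not fit under the direct memory limit. */
    static boolean exceeds(long neededBytes, long limitBytes) {
        return neededBytes > limitBytes;
    }

    public static void main(String[] args) {
        long production = worstCaseBytes(20, 2 * GB);  // slide 48: 20 * 2GB
        System.out.println(production / GB + "GB needed, over 24GB: " + exceeds(production, 24 * GB));

        long repro = worstCaseBytes(4, 300 * MB);      // slide 42: 4 * 300MB
        System.out.println(repro / MB + "MB needed, over 1g: " + exceeds(repro, 1 * GB));
    }
}
```

Both cases exceed their limit, which matches the OOME seen in production and in the local reproduction. Streaming as Flux<DataBuffer> breaks the dependence on body size, so the same concurrency stays within a small, roughly constant amount of direct memory.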
  49. Summary: JVM OOME; -XX:+ExitOnOutOfMemoryError; Reactive Streams; netty

  50. !!

  51. We're Hiring!! LINE B2B: SRE (https://linecorp.com/ja/career/position/3112) / (https://linecorp.com/ja/career/position/2316)