
Two Troubleshooting Cases from LINE's B2B Platform

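LINE has a wide variety of B2B products, such as LINE Official Accounts and advertising across the many services built around the LINE app, along with the platforms that support them. These platforms integrate with many internal and external systems and handle large-scale traffic and data.

This talk presents, in a lighthearted way, several "problems" that occurred while operating these B2B platforms and how they were troubleshot.

Some of these problems probably occur only in our environment, but we hope the troubleshooting process itself is a useful reference.

Speaker: Ryosuke Hasebe

These slides were presented at JJUG CCC 2022 Spring.
https://fortee.jp/jjug-ccc-2022-spring/proposal/730d46e2-a295-45c2-abfa-bb7bf13ad7c9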

LINE Developers

June 19, 2022

Transcript

  1. Two Troubleshooting Cases from LINE's B2B Platform

    Ryosuke Hasebe
    2022.06.19

  2. Speaker

    Ryosuke Hasebe
    GitHub: be-hase
    Twitter: be_hasee

    LINE (2013-09)
    OA Dev2 / OA SRE
    LINE Login (OAuth2/OIDC)
    LINE Profile+
    LINE Notify
    Java/Kotlin
    (Reactive Streams / Kotlin Coroutines), K8s

  3. Agenda

    1. About LINE's B2B Platform
    2. Case 1: Slow latency issue after updating to Lettuce v6
    3. Case 2: Direct buffer OOME issue due to bad usage of Spring WebClient

  4. About LINE's B2B Platform

  5. LINE B2B
    LINE
    LINE /CRM
    API BOT
    LINE
    LINE
    LINE LINE Talk Head View

  6. LINE
    CPU: 4,500 core
    (request/sec) 10
    Memory: 14 TB
    * As of September 2021

  7. (B2B)

  8. Technology stack
    Kotlin/Java, Spring Boot, Armeria, gRPC/Thrift
    MySQL, HBase, Redis, Kafka, Elasticsearch, Centraldogma, nginx, fluentd
    Verda (OpenStack based Private Cloud) VM/PM, Kubernetes
    Prometheus, Grafana, IU, Kibana, IMON
    GHE, Jenkins, Drone, Circle CI, Ansible, ArgoCD

  9. Case 1: Slow latency issue after updating to Lettuce v6

  10. Lettuce v4.5.0 -> v6.0.0: 99.9 percentile latency degraded (1 sec)

    Lettuce = Redis client library for Java, also used by spring-data-redis

    Kafka Consumer
    Redis Cluster (96 nodes)
    1 3K commands/sec HGETALL

  11. Workaround
    v5 (v5.3.5)

    v4 -> v6
    v5.3.5 -> v6.0.0

    v6


  12. (1 )


  13. Lettuce 5.3 EOL 😨
    > 5.3.x is EOL (end-of-life) as of June 2021.
    https://github.com/lettuce-io/lettuce-core/wiki/Lettuce-Versions

    EOL
    Spring4Shell
    Lettuce v6.1.6

  14. (Lettuce version client-side)
    Redis server-side latency: SLOWLOG
    client-side (= java application = Lettuce)
    client-side (Lettuce) / server-side (Redis)

  15. GC STW, 99.9 percentile latency
    Lettuce: GC (STW) or GC time
    Micrometer: GC
    HeapDump: Eclipse Memory Analyzer
    GC STW: Unified JVM Logging safepoint log
    STW
    Stop The World (STW)

    safepoint log
    [2022-03-14T17:30:16.483+0900][192775.478][info ][safepoint] Total time for which application threads were stopped: 0.xxx seconds, Stopping threads took: 0.xxx seconds
    https://krzysztofslusarski.github.io/2020/11/13/stw.html

  16. JVM Redis
    v5.3.5 v6.0.0
    Try & Error
    Local

  17. Kafka Consumer
    Consumer Group
    Lettuce 6.1.16
    Lettuce v5.3.15
    Lettuce v6.1.16

  18. RESP3
    Lettuce v6 supports RESP3
    RESP3 is the protocol introduced with Redis v6
    https://github.com/antirez/RESP3/blob/master/spec.md

    Lettuce v6 tries RESP3 first and falls back to RESP2 when Redis does not support RESP3

    ClusterClientOptions
        .builder()
        // use RESP2 only
        .protocolVersion(ProtocolVersion.RESP2)
        .build()

  19. Big Keys
    Lettuce 6, Big Keys, HGETALL
    Redis Big Keys
    https://www.alibabacloud.com/blog/a-detailed-explanation-of-the-detection-and-processing-of-bigkey-and-hotkey-in-redis_598143

    Log: Hash, Latency
    v5.3.15 v6.1.16

  20. async-profiler
    CPU
    async-profiler
    flame graph
    v5.3.15 v6.1.16

  21. async-profiler
    Lettuce event-loop (non-blocking)
    ClusterTopologyRefresh.getNodeSpecificViews shows up frequently in the flame graph
    Cluster Topology Refresh
    v5.3.15 v6.1.16

  22. Lettuce Cluster Topology Refresh (CTR)
    In Redis Cluster, the client decides which node each key (slot) is sent to (client-side)
    Lettuce keeps this mapping up to date with Cluster Topology Refresh (CTR)
    CLUSTER NODES
    Once every 60 seconds
    MOVED redirection

    https://github.com/lettuce-io/lettuce-core/issues/339

  23. Lettuce Cluster Topology Refresh (CTR)
    CLUSTER NODES

    e7d1eecce10fd6bb5eb35b9f99a514335d9ba9ca 127.0.0.1:30001@31001 master - 0 0 1 connected 0-5460
    67ed2db8d677e59ec4a4cefb06858cf2a1a89fa1 127.0.0.1:30002@31002 master - 0 1426238316232 2 connected 5461-10922
    292f8b365bb7edb5e285caf0b7e6ddc7265d2f4f 127.0.0.1:30003@31003 master - 0 1426238318243 3 connected 10923-16383
    07c37dfeb235213a872192d90877d0cd55635b91 127.0.0.1:30004@31004 slave e7d1eecce10fd6bb5eb35b9f99a514335d9ba9ca 0 1426238317239 4 connected
    6ec23923021cf3ffec47632106199cb7f496ce01 127.0.0.1:30005@31005 slave 67ed2db8d677e59ec4a4cefb06858cf2a1a89fa1 0 1426238316232 5 connected
    824fe116063bc5fcf9f4ffd895bc17aee7731ac3 127.0.0.1:30006@31006 slave 292f8b365bb7edb5e285caf0b7e6ddc7265d2f4f 0 1426238317741 6 connected

  24. Lettuce Cluster Topology Refresh(CTR)

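  25. Why does it show up in the flame graph?
    CTR runs only once every 60 seconds, yet it stands out in the flame graph
    For n nodes, the processing is O(n^2)
    With 96 nodes it takes 1 sec
    Since Lettuce v6.0.0 this runs on the event loop
    ← EpollEventLoop.run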
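To make the quadratic growth on the slide above concrete, here is a small sketch. It assumes a simplified model that is not taken from the Lettuce source: each of the n nodes returns a CLUSTER NODES view listing all n nodes, so one refresh touches n * n entries. The class and method names are hypothetical.

```java
public class CtrCost {
    /** Entries handled in one refresh if each of n nodes reports all n nodes (assumed model). */
    static long entriesPerRefresh(int nodes) {
        return (long) nodes * nodes;
    }

    public static void main(String[] args) {
        // Compare a few cluster sizes, ending with the 96-node cluster from the talk.
        for (int n : new int[] {6, 24, 96}) {
            System.out.println(n + " nodes -> " + entriesPerRefresh(n) + " entries per refresh");
        }
    }
}
```

Under this model a 96-node cluster handles 256 times as many entries per refresh as a 6-node one, which is why a cost that is invisible on a small cluster can occupy an event loop on a large one.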

  26. CPU x 2
    One event loop is occupied for 1 sec (CTR)
    Redis command I/O
    (99.9 percentile latency: 1 sec)

  27. CTR
    CTR latency
    latency

  28. event-loop
    CTR
    CTR event-loop
    latency

  29. Lettuce Issue & PR
    Issue
    https://github.com/lettuce-io/lettuce-core/issues/2045

    PR
    https://github.com/lettuce-io/lettuce-core/pull/2048

    6.1.8
    https://github.com/lettuce-io/lettuce-core/releases/tag/6.1.8.RELEASE

  30. Other Solution
    Disable the CTR dynamic refresh source option (default: enabled)
    > CLUSTER NODES
    With dynamic refresh source disabled, only the Initial Seed Nodes are used
    Caveat: if the Cluster changes frequently or the Initial Seed Nodes go down, CTR is affected when dynamic refresh source is disabled

    // CTR can be limited to the nodes specified here (Initial Seed Nodes)
    RedisURI node1 = RedisURI.create("node1", 6379);
    RedisURI node2 = RedisURI.create("node2", 6379);
    RedisClusterClient clusterClient = RedisClusterClient.create(Arrays.asList(node1, node2));
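The slide shows only the seed nodes, so the following is a minimal sketch of how the option itself might be set with Lettuce 6. The host names come from the slide. The 60-second period is an assumption based on the refresh interval mentioned earlier, and the class and method names are hypothetical. It needs lettuce-core on the classpath.

```java
import io.lettuce.core.RedisURI;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import io.lettuce.core.cluster.RedisClusterClient;

import java.time.Duration;
import java.util.Arrays;

public class SeedOnlyTopologyRefresh {
    static RedisClusterClient createClient() {
        // Initial Seed Nodes: with dynamic refresh sources off, only these are queried.
        RedisURI node1 = RedisURI.create("node1", 6379);
        RedisURI node2 = RedisURI.create("node2", 6379);

        ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofSeconds(60)) // assumed interval
                .dynamicRefreshSources(false)                  // do not query every discovered node
                .build();

        RedisClusterClient clusterClient = RedisClusterClient.create(Arrays.asList(node1, node2));
        clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());
        return clusterClient;
    }
}
```

The trade-off is the one the slide warns about: the refresh now depends on the seed nodes being reachable, so the seed list should contain more than one node.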

  31. /
    Redis Cluster
    Lettuce version 6.1.8
    Local

  32. Case 2: Direct buffer OOME issue due to bad usage of Spring WebClient

  33. Out of Memory Error (OOME)
    CSV
    spec (20 core, 64 GB Mem, -Xmx24g)

  34. Workaround
    2,3
    Add the JVM option -XX:+ExitOnOutOfMemoryError
    On OOME the JVM exits and supervisord restarts it

  35. (-Xmx24g) CSV
    ?
    Eclipse Memory Analyzer

  36. OutOfMemoryError: Direct buffer memory
    (native <-> )

    Caused by: java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
        at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:645)
        at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:621)

    * Note: from Java 13 onward, the error message is clearer and more helpful
    https://bugs.openjdk.java.net/browse/JDK-8048192

  37. (Micrometer)
    https://github.com/micrometer-metrics/.../JvmMemoryMetrics.java

  38. JDK
    https://github.com/openjdk/jdk/blob/jdk-11+28/src/java.base/share/classes/java/nio/Bits.java#L175

    The default limit is Runtime.getRuntime().maxMemory(), roughly -Xmx
    (so the JVM as a whole can use about -Xmx * 2)
    It can be set with -XX:MaxDirectMemorySize
    Exceeding it results in OOME
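The arithmetic behind the "-Xmx * 2" remark can be sketched as follows. This assumes -XX:MaxDirectMemorySize is not set, so the direct buffer limit defaults to the value of Runtime.getRuntime().maxMemory(). The class and method names are hypothetical.

```java
public class DirectMemoryBudget {
    /** Worst-case footprint when heap and direct buffers are both filled to their limits. */
    static long worstCaseBytes(long maxHeapBytes, long maxDirectBytes) {
        return maxHeapBytes + maxDirectBytes;
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory(); // roughly -Xmx
        // Assumption: -XX:MaxDirectMemorySize is not set, so the direct limit equals maxHeap.
        long maxDirect = maxHeap;
        long mb = 1024L * 1024L;
        System.out.println("max heap:   " + maxHeap / mb + " MB");
        System.out.println("worst case: " + worstCaseBytes(maxHeap, maxDirect) / mb + " MB");
    }
}
```

With -Xmx24g, as in the talk, heap plus direct buffers could in principle reach about 48 GB on a 64 GB host, before counting metaspace, thread stacks, and other native memory.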




  39. Spring WebClient / Lettuce
    netty
    Two netty users were found:
    Spring WebClient (WebFlux)
    Lettuce (Redis Client)

  40. Spring WebClient / Lettuce
    Spring WebClient: CSV
    Mono
    Reactor Flux
    Spring WebClient (Spring Boot 2.1): in-memory buffer limit of 256 KB

    webClient.get()
        .uri(uri)
        .retrieve()
        .bodyToMono(byte[].class) // suspicious
        .block();

    WebClient.builder()
        .codecs(configurer -> configurer.defaultCodecs()
            .maxInMemorySize(-1)) // the limit had been set to unlimited
        .build();

  41. 2
    300 MB text file

  42. OOME with 4 parallel requests
    4 * 300MB
    OOME

    -XX:MaxDirectMemorySize=1g
    seq 4 | xargs -P 4 -I{} curl localhost:8080/mono

    Caused by: java.lang.OutOfMemoryError: Direct buffer memory
        at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
        at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
        at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
        at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:648)
        at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:623)
        at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:202)
        at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:186)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:136)
        at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
        at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:394)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
        at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)



  43. (jcmd PID GC.run)
    OOME

    watch curl localhost:8080/mem
    name=direct, count=76, memoryUsed=1008MB, totalCapacity=1008MB
    name=direct, count=76, memoryUsed=1008MB, totalCapacity=1008MB
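The talk does not show how the /mem endpoint is implemented. One way to obtain the same three numbers is the JDK's BufferPoolMXBean, sketched below with hypothetical class and method names. The output format imitates the lines above.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectBufferStats {
    /** Formats the JVM "direct" buffer pool in the same shape as the /mem output above. */
    static String directPoolSummary() {
        long mb = 1024L * 1024L;
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return "name=" + pool.getName()
                        + ", count=" + pool.getCount()
                        + ", memoryUsed=" + pool.getMemoryUsed() / mb + "MB"
                        + ", totalCapacity=" + pool.getTotalCapacity() / mb + "MB";
            }
        }
        return "name=direct, unavailable";
    }

    public static void main(String[] args) {
        // Hold 16 MB of direct memory so the pool is not empty when it is printed.
        ByteBuffer held = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        System.out.println(directPoolSummary());
        System.out.println("held capacity: " + held.capacity());
    }
}
```

Polling such a summary while replaying the requests makes it easy to see whether direct memory is returned after each request or keeps sitting near the limit.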

  44. OOME, -XX:+ExitOnOutOfMemoryError
    report_java_out_of_memory
    https://github.com/openjdk/jdk11u/.../src/hotspot/share/utilities/debug.cpp#L319

    Systemd
    Micrometer
    -XX:+ExitOnOutOfMemoryError !!

  45. Flux
    seq 4 | xargs -P 4 -I{} curl localhost:8080/flux
    curl localhost:8080/mem
    name=direct, count=17, memoryUsed=80MB, totalCapacity=80MB
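The code behind the /flux endpoint is not included in the transcript. The sketch below shows one way to consume a WebClient response as a Flux of DataBuffer chunks and write it to a file, so the whole body is never aggregated in memory. The class and method names are hypothetical, and it needs Spring WebFlux 5.2 or later on the classpath.

```java
import org.springframework.core.io.buffer.DataBuffer;
import org.springframework.core.io.buffer.DataBufferUtils;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

import java.net.URI;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class StreamingDownload {
    /** Requests the URI and streams the body to a file chunk by chunk. */
    static void download(WebClient webClient, URI uri, Path destination) {
        Flux<DataBuffer> body = webClient.get()
                .uri(uri)
                .retrieve()
                .bodyToFlux(DataBuffer.class); // chunks, not one large byte[]
        writeTo(body, destination);
    }

    /** Writes the chunks to the file; DataBufferUtils.write releases each buffer once written. */
    static void writeTo(Flux<DataBuffer> body, Path destination) {
        DataBufferUtils.write(body, destination,
                        StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING,
                        StandardOpenOption.WRITE)
                .block();
    }
}
```

Because each chunk is released as soon as it is written, the direct memory held per request stays close to the chunk size rather than the file size, which matches the much lower usage reported on this slide.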

  46. OOME
    Before
    After !!

    Now stays within about 570 MB
    Looks like this year's year-end and New Year holidays can be spent in peace 🙌



  47. netty PooledByteBufAllocator
    jemalloc
    https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf
    https://www.facebook.com/notes/10158791475077200/

  48. WebClient (reactor-netty): CPU, I/O
    CPU 20, 2GB CSV
    20 * 2GB = 40GB > 24GB
    (24GB = -Xmx or -XX:MaxDirectMemorySize)

  49. /
    JVM
    OOME: -XX:+ExitOnOutOfMemoryError
    Reactive Streams, netty

  50. !!


  51. We're Hiring !!
    LINE
    B2B
    SRE (https://linecorp.com/ja/career/position/3112)
    (https://linecorp.com/ja/career/position/2316)
