Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Log message with JSON item count for root cause analysis in microservices

Log message with JSON item count for root cause analysis in microservices

CIoT-2023 / @Lisbon,Portugal / Mar. 20, 2023

Tomoyuki KOYAMA

March 22, 2023
Tweet

More Decks by Tomoyuki KOYAMA

Other Decks in Research

Transcript

  1. Log message with JSON item count
    for root cause analysis in microservices
    Tomoyuki KOYAMA, Takayuki KUSHIDA
    Graduate School of Bionics, Computer and Media Sciences
    Tokyo University of Technology
    March 20, 2023
    1
    CIoT 2023 @Lisbon

    View Slide

  2. Background
    • A web service “Doktor”

    Provide keyword search and download for end users

    Record access history into logs on receiving accesses
    • End users
    • Access the web service for technical reports download
    • Monitoring server
    • Send HTTP requests to Doktor*
    • Detect error response counts increase

    Send an alert to system administrator
    • System administrator

    Analyze root cause of error response by log search
    • Log server

    Receive and store logs from microservices

    Provide log search for system administrator
    2
    web service
    (Doktor)
    log server
    system
    administrators
    HTTP request
    HTTP response
    End users
    monitoring
    server
    send alert log search
    logs
    m
    onitor
    Error response
    counts increases
    Microservice
    Logs are utilized for root cause analysis.
    Error
    * https://github.com/cdsl-research/doktor-v2

    View Slide

  3. Issue 3
    • Microservices: front, fulltext search
    • Provide keyword search on Doktor
    • An application bug in “front” by developer
    • Conditional error occurance on “front”
    • An error occurs on search keyword “recall”
    • Error occurance depends on
    HTTP response body structure from “fulltext”
    • System administrator
    • Find log messages corresponding to root cause of
    error response
    • Take time to diagnose condition that error occurs
    ⇒ Not enough information for system administrator
    to obtain root cause from logs
    user A
    front
    fulltext
    search
    {"found": [
    {"title": "xx"},
    {"title": "yy"},
    ]}
    log server
    HTTP response body strucutre is dynamically changed by HTTP requests.
    Keyword
    “recall”
    Keyword
    “IoT”
    web service (Doktor)
    {"found": [ ]}
    Microservice
    HTTP request
    HTTP response
    logs
    bugs
    system
    administrator
    log search
    Error
    user B
    [2023-01-28T03:22:39.272Z] "GET /fulltext/e0ac896 ..." 200 ...
    Example log

    View Slide

  4. Related studies
    • There are related studies for root cause analysis or fault localization.
    • MicroRCA [Wu et al. 2020]
    • Proposed graph based anomaly propagation model to detect performance issues
    • Included “total sent bytes” as an metric for fault localization
    ⇒ ”total sent bytes” is different in HTTP requests even if URL endpoint is the same
    among requests.
    • CloudRanger [Wang et al. 2018]
    • Utilized response time and throughput as anomaly detection metrics
    ⇒This approach cannot detect HTTP response body structure changes.
    4

    View Slide

  5. Proposed method
    • Response Body Summary Index (RBSI)
    • Represent item counts per depth from HTTP response body
    • Provide observability for HTTP response body structure changes (e.g. item counts)
    • Support system administrator in finding log messages corresponding to errors
    • RBSI is added to the end of each log message.
    5
    Goal: System administrator can obtain information for root cause analysis
    from log messages.
    [2022-11-27T08:07:21...] "POST /search ..." 200 … 1, 2
    [2022-11-27T08:07:24...] "POST /search ..." 500 … 1, 0
    [2022-11-27T08:07:25...] "POST /search ..." 200 … 1, 8
    Example: RBSI in log messages {
    ”fulltexts”: [
    {“paper_id”: “xxx”, ”page_n”: 1, …},
    {“paper_id”: “yyy”, ”page_n”: 3, …}
    ]
    }
    RBSI HTTP response body in JSON format
    RBSI
    1,2

    View Slide

  6. Proposed method
    • RBSI is obtained from RBSA (Response Body Summary Algorithm)
    6
    {
    “fulltexts”: [
    {“paper_id”: “xxx”, “page_n”: 1, …},
    {“paper_id”: “yyy”, “page_n”: 3, …}
    ]
    }
    1,2
    RBSI
    Count
    keys
    1
    2 Count
    items
    Example: HTTP response body in JSON format
    • Logs with RBSI represent changes in HTTP response body structure.
    Input
    Output
    RBSA
    ...08:07:21...]
    "POST /search ..." 200 … 1, 2
    ...08:07:24...]
    "POST /search ..." 200 … 1, 0
    ...08:07:25...]
    "POST /search ..." 200 … 1, 8
    ...08:07:29...]
    "POST /search ..." 200 … 1, 9
    ...08:07:32...]
    "POST /search ..." 200 … 1, 0
    system
    administrator
    ...08:07:21...]
    ”GET / ..." 200 …
    ...08:07:24...]
    ”GET /?s=drbd ..."
    500

    ...08:07:25...]
    ”GET /?s=WiFi ..." 200 …
    ...08:07:29...]
    ”GET /?s=IoT ..." 200 …
    ...08:07:32...]
    ”GET /?s=recall ..."
    500

    Join by
    request ID
    “fulltext-search”
    microservice logs
    “front” microservice logs
    Realize
    empty HTTP
    responses

    View Slide

  7. Usecase
    • RBSI is used for web service.
    • Web service consists of microservices .
    • Web service adopts service mesh (Istio).
    • Proxy servers for service mesh
    • Intercept HTTP requests/responses
    • Generate logs when requests is received
    • Add RBSA to proxy servers
    • System administrator finds log messages related to errors.
    • RBSI represents HTTP response body structure changes.
    • RBSI allows system administrator to reduce the time to analyze
    root cause of errors through log messages.
    7
    web service
    app1
    microservice 1
    app2
    microservice 2
    [2022-11-27T08:07:21...] "POST /search ..." 200 … 1, 2
    end user
    add generate
    log message
    RBSA
    proxy
    server
    RBSI
    system
    administrator
    watch

    View Slide

  8. Experimental environment
    • Kubernetes Cluster (K3s, 1 virtual machine, Istio)
    • Customized proxy: “Istio proxy+”
    • Include RBSA by Lua
    • Application: “Doktor” (6 microservices)
    • Traffic dataset: 17,845 logs over 3 weeks
    • Source: “Doktor” production environment
    • Collect HTTP Method & Path
    • Simulate end user’s access patterns
    • HTTP request sender: “loader”
    • Measure HTTP response time
    • Log collection: ”kubectl logs” command
    • Measure log message length
    8
    microservices(Doktor)
    virtual machine
    production
    logs
    paper
    author
    stats
    fulltext
    thumbnail
    Istio
    gateway
    Istio proxy+
    microservice
    loader
    kubectl logs …
    Istio proxy logs
    persistent volume
    log message length
    response time
    front
    load logs
    Experiemntal environement architecture

    View Slide

  9. Experimental methods
    • Evaluate advantage and disadvantage about RBSI
    • [experiment 1] HTTP response time ⇒ RBSI calculation time overheads
    • Comparison targets: 3 log message formats
    • [experiment 2] Log message length ⇒ RBSI effect on storage capacity
    • Comparison targets: 3 log message formats
    9
    Itsio default logging format
    proposed logging format
    Response body full logging format
    Default values Additional values
    Full HTTP response body
    RBSI

    View Slide

  10. Experimental results 10
    [experiment 1] HTTP response time
    Average response time for “proposed logging” increases by 7% compared with
    ”default logging”
    Log message
    length increase
    [experiment 2] Log message length
    Median log message length for “proposed logging” increases by 7 bytes (0.02%)
    compared with ”default logging”
    0%
    20%
    40%
    60%
    80%
    100%
    120%
    default logging response body
    full logging
    proposed
    logging
    Response time increase
    HTTP response
    time overheads
    [experiment 1]
    [experiment 2]
    7%
    better
    better

    View Slide

  11. Discussion – RBSI Expressiveness 11
    • [Fault case] Order bugs on web application by complex SQL queries
    • Change item orders in HTTP response body
    • RBSI doesn’t represent list item order changes
    • The same RBSI values: “Normal response”

    “Item order changes”
    • [Solution] Use hash algorithm (e.g. SHA-1) to represent the order of items
    • Use short hash value such as Git commit ID to reduce log message length
    {"found": [
    {"publish_year": 2020},
    {"publish_year": 2022},
    {"publish_year": 2019},
    ]}
    {"found": [
    {"publish_year": 2022},
    {"publish_year": 2020},
    {"publish_year": 2019},
    ]}

    Normal response

    Item order changes (fault)
    1,3
    1,3
    HTTP response body
    (JSON format)
    6aaad70
    e179be7
    SHA-1 (short) SHA-1 (short) RBSI
    RBSI

    View Slide

  12. Conclusion
    Goal • System administrator can obtain information for root cause
    analysis from log messages.
    Issue • System administrator takes time to diagnose a condition that error
    occurs by logs.
    ⇒ Values in log messages are insufficient for diagnosis.
    Proposed
    method
    • Propose identifier "RBSI" for diagnostics with log messages
    • RBSI represent HTTP response body structure changes
    Experimental
    results
    • Average response time:
    proposed log format increases 7% than “default log format”
    • Log message length:
    proposed log format increases 0.02% than ”default log format”
    12

    View Slide