Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Log message with JSON item count for root cause analysis in microservices

Log message with JSON item count for root cause analysis in microservices

CIoT-2023 / @Lisbon,Portugal / Mar. 20, 2023

Tomoyuki KOYAMA

March 22, 2023
Tweet

More Decks by Tomoyuki KOYAMA

Other Decks in Research

Transcript

  1. Log message with JSON item count for root cause analysis

    in microservices Tomoyuki KOYAMA, Takayuki KUSHIDA Graduate School of Bionics, Computer and Media Sciences Tokyo University of Technology March 20, 2023 1 CIoT 2023 @Lisbon
  2. Background • A web service “Doktor” • Provide keyword search

    and download for end users • Record access history into logs on receiving accesses • End users • Access the web service for technical reports download • Monitoring server • Send HTTP requests to Doktor* • Detect error response counts increase ⇒ Send an alert to system administrator • System administrator • Analyze root cause of error response by log search • Log server • Receive and store logs from microservices • Provide log search for system administrator 2 web service (Doktor) log server system administrators HTTP request HTTP response End users monitoring server send alert log search logs m onitor Error response counts increases Microservice Logs are utilized for root cause analysis. Error * https://github.com/cdsl-research/doktor-v2
  3. Issue 3 • Microservices: front, fulltext search • Provide keyword

    search on Doktor • An application bug in “front” by developer • Conditional error occurance on “front” • An error occurs on search keyword “recall” • Error occurance depends on HTTP response body structure from “fulltext” • System administrator • Find log messages corresponding to root cause of error response • Take time to diagnose condition that error occurs ⇒ Not enough information for system administrator to obtain root cause from logs user A front fulltext search {"found": [ {"title": "xx"}, {"title": "yy"}, ]} log server HTTP response body strucutre is dynamically changed by HTTP requests. Keyword “recall” Keyword “IoT” web service (Doktor) {"found": [ ]} Microservice HTTP request HTTP response logs bugs system administrator log search Error user B [2023-01-28T03:22:39.272Z] "GET /fulltext/e0ac896 ..." 200 ... Example log
  4. Related studies • There are related studies for root cause

    analysis or fault localization. • MicroRCA [Wu et al. 2020] • Proposed graph based anomaly propagation model to detect performance issues • Included “total sent bytes” as an metric for fault localization ⇒ ”total sent bytes” is different in HTTP requests even if URL endpoint is the same among requests. • CloudRanger [Wang et al. 2018] • Utilized response time and throughput as anomaly detection metrics ⇒This approach cannot detect HTTP response body structure changes. 4
  5. Proposed method • Response Body Summary Index (RBSI) • Represent

    item counts per depth from HTTP response body • Provide observability for HTTP response body structure changes (e.g. item counts) • Support system administrator in finding log messages corresponding to errors • RBSI is added to the end of each log message. 5 Goal: System administrator can obtain information for root cause analysis from log messages. [2022-11-27T08:07:21...] "POST /search ..." 200 … 1, 2 [2022-11-27T08:07:24...] "POST /search ..." 500 … 1, 0 [2022-11-27T08:07:25...] "POST /search ..." 200 … 1, 8 Example: RBSI in log messages { ”fulltexts”: [ {“paper_id”: “xxx”, ”page_n”: 1, …}, {“paper_id”: “yyy”, ”page_n”: 3, …} ] } RBSI HTTP response body in JSON format RBSI 1,2
  6. Proposed method • RBSI is obtained from RBSA (Response Body

    Summary Algorithm) 6 { “fulltexts”: [ {“paper_id”: “xxx”, “page_n”: 1, …}, {“paper_id”: “yyy”, “page_n”: 3, …} ] } 1,2 RBSI Count keys 1 2 Count items Example: HTTP response body in JSON format • Logs with RBSI represent changes in HTTP response body structure. Input Output RBSA ...08:07:21...] "POST /search ..." 200 … 1, 2 ...08:07:24...] "POST /search ..." 200 … 1, 0 ...08:07:25...] "POST /search ..." 200 … 1, 8 ...08:07:29...] "POST /search ..." 200 … 1, 9 ...08:07:32...] "POST /search ..." 200 … 1, 0 system administrator ...08:07:21...] ”GET / ..." 200 … ...08:07:24...] ”GET /?s=drbd ..." 500 … ...08:07:25...] ”GET /?s=WiFi ..." 200 … ...08:07:29...] ”GET /?s=IoT ..." 200 … ...08:07:32...] ”GET /?s=recall ..." 500 … Join by request ID “fulltext-search” microservice logs “front” microservice logs Realize empty HTTP responses
  7. Usecase • RBSI is used for web service. • Web

    service consists of microservices . • Web service adopts service mesh (Istio). • Proxy servers for service mesh • Intercept HTTP requests/responses • Generate logs when requests is received • Add RBSA to proxy servers • System administrator finds log messages related to errors. • RBSI represents HTTP response body structure changes. • RBSI allows system administrator to reduce the time to analyze root cause of errors through log messages. 7 web service app1 microservice 1 app2 microservice 2 [2022-11-27T08:07:21...] "POST /search ..." 200 … 1, 2 end user add generate log message RBSA proxy server RBSI system administrator watch
  8. Experimental environment • Kubernetes Cluster (K3s, 1 virtual machine, Istio)

    • Customized proxy: “Istio proxy+” • Include RBSA by Lua • Application: “Doktor” (6 microservices) • Traffic dataset: 17,845 logs over 3 weeks • Source: “Doktor” production environment • Collect HTTP Method & Path • Simulate end user’s access patterns • HTTP request sender: “loader” • Measure HTTP response time • Log collection: ”kubectl logs” command • Measure log message length 8 microservices(Doktor) virtual machine production logs paper author stats fulltext thumbnail Istio gateway Istio proxy+ microservice loader kubectl logs … Istio proxy logs persistent volume log message length response time front load logs Experiemntal environement architecture
  9. Experimental methods • Evaluate advantage and disadvantage about RBSI •

    [experiment 1] HTTP response time ⇒ RBSI calculation time overheads • Comparison targets: 3 log message formats • [experiment 2] Log message length ⇒ RBSI effect on storage capacity • Comparison targets: 3 log message formats 9 Itsio default logging format proposed logging format Response body full logging format Default values Additional values Full HTTP response body RBSI
  10. Experimental results 10 [experiment 1] HTTP response time Average response

    time for “proposed logging” increases by 7% compared with ”default logging” Log message length increase [experiment 2] Log message length Median log message length for “proposed logging” increases by 7 bytes (0.02%) compared with ”default logging” 0% 20% 40% 60% 80% 100% 120% default logging response body full logging proposed logging Response time increase HTTP response time overheads [experiment 1] [experiment 2] 7% better better
  11. Discussion – RBSI Expressiveness 11 • [Fault case] Order bugs

    on web application by complex SQL queries • Change item orders in HTTP response body • RBSI doesn’t represent list item order changes • The same RBSI values: “Normal response” ↔ “Item order changes” • [Solution] Use hash algorithm (e.g. SHA-1) to represent the order of items • Use short hash value such as Git commit ID to reduce log message length {"found": [ {"publish_year": 2020}, {"publish_year": 2022}, {"publish_year": 2019}, ]} {"found": [ {"publish_year": 2022}, {"publish_year": 2020}, {"publish_year": 2019}, ]} ◆ Normal response ◆ Item order changes (fault) 1,3 1,3 HTTP response body (JSON format) 6aaad70 e179be7 SHA-1 (short) SHA-1 (short) RBSI RBSI
  12. Conclusion Goal • System administrator can obtain information for root

    cause analysis from log messages. Issue • System administrator takes time to diagnose a condition that error occurs by logs. ⇒ Values in log messages are insufficient for diagnosis. Proposed method • Propose identifier "RBSI" for diagnostics with log messages • RBSI represent HTTP response body structure changes Experimental results • Average response time: proposed log format increases 7% than “default log format” • Log message length: proposed log format increases 0.02% than ”default log format” 12