Slide 1

Slide 1 text

Log message with JSON item count for root cause analysis in microservices Tomoyuki KOYAMA, Takayuki KUSHIDA Graduate School of Bionics, Computer and Media Sciences Tokyo University of Technology March 20, 2023 1 CIoT 2023 @Lisbon

Slide 2

Slide 2 text

Background • A web service “Doktor” • Provide keyword search and download for end users • Record access history into logs on receiving accesses • End users • Access the web service for technical reports download • Monitoring server • Send HTTP requests to Doktor* • Detect error response counts increase ⇒ Send an alert to system administrator • System administrator • Analyze root cause of error response by log search • Log server • Receive and store logs from microservices • Provide log search for system administrator 2 web service (Doktor) log server system administrators HTTP request HTTP response End users monitoring server send alert log search logs m onitor Error response counts increases Microservice Logs are utilized for root cause analysis. Error * https://github.com/cdsl-research/doktor-v2

Slide 3

Slide 3 text

Issue 3 • Microservices: front, fulltext search • Provide keyword search on Doktor • An application bug in “front” by developer • Conditional error occurance on “front” • An error occurs on search keyword “recall” • Error occurance depends on HTTP response body structure from “fulltext” • System administrator • Find log messages corresponding to root cause of error response • Take time to diagnose condition that error occurs ⇒ Not enough information for system administrator to obtain root cause from logs user A front fulltext search {"found": [ {"title": "xx"}, {"title": "yy"}, ]} log server HTTP response body strucutre is dynamically changed by HTTP requests. Keyword “recall” Keyword “IoT” web service (Doktor) {"found": [ ]} Microservice HTTP request HTTP response logs bugs system administrator log search Error user B [2023-01-28T03:22:39.272Z] "GET /fulltext/e0ac896 ..." 200 ... Example log

Slide 4

Slide 4 text

Related studies • There are related studies for root cause analysis or fault localization. • MicroRCA [Wu et al. 2020] • Proposed graph based anomaly propagation model to detect performance issues • Included “total sent bytes” as an metric for fault localization ⇒ ”total sent bytes” is different in HTTP requests even if URL endpoint is the same among requests. • CloudRanger [Wang et al. 2018] • Utilized response time and throughput as anomaly detection metrics ⇒This approach cannot detect HTTP response body structure changes. 4

Slide 5

Slide 5 text

Proposed method • Response Body Summary Index (RBSI) • Represent item counts per depth from HTTP response body • Provide observability for HTTP response body structure changes (e.g. item counts) • Support system administrator in finding log messages corresponding to errors • RBSI is added to the end of each log message. 5 Goal: System administrator can obtain information for root cause analysis from log messages. [2022-11-27T08:07:21...] "POST /search ..." 200 … 1, 2 [2022-11-27T08:07:24...] "POST /search ..." 500 … 1, 0 [2022-11-27T08:07:25...] "POST /search ..." 200 … 1, 8 Example: RBSI in log messages { ”fulltexts”: [ {“paper_id”: “xxx”, ”page_n”: 1, …}, {“paper_id”: “yyy”, ”page_n”: 3, …} ] } RBSI HTTP response body in JSON format RBSI 1,2

Slide 6

Slide 6 text

Proposed method • RBSI is obtained from RBSA (Response Body Summary Algorithm) 6 { “fulltexts”: [ {“paper_id”: “xxx”, “page_n”: 1, …}, {“paper_id”: “yyy”, “page_n”: 3, …} ] } 1,2 RBSI Count keys 1 2 Count items Example: HTTP response body in JSON format • Logs with RBSI represent changes in HTTP response body structure. Input Output RBSA ...08:07:21...] "POST /search ..." 200 … 1, 2 ...08:07:24...] "POST /search ..." 200 … 1, 0 ...08:07:25...] "POST /search ..." 200 … 1, 8 ...08:07:29...] "POST /search ..." 200 … 1, 9 ...08:07:32...] "POST /search ..." 200 … 1, 0 system administrator ...08:07:21...] ”GET / ..." 200 … ...08:07:24...] ”GET /?s=drbd ..." 500 … ...08:07:25...] ”GET /?s=WiFi ..." 200 … ...08:07:29...] ”GET /?s=IoT ..." 200 … ...08:07:32...] ”GET /?s=recall ..." 500 … Join by request ID “fulltext-search” microservice logs “front” microservice logs Realize empty HTTP responses

Slide 7

Slide 7 text

Usecase • RBSI is used for web service. • Web service consists of microservices . • Web service adopts service mesh (Istio). • Proxy servers for service mesh • Intercept HTTP requests/responses • Generate logs when requests is received • Add RBSA to proxy servers • System administrator finds log messages related to errors. • RBSI represents HTTP response body structure changes. • RBSI allows system administrator to reduce the time to analyze root cause of errors through log messages. 7 web service app1 microservice 1 app2 microservice 2 [2022-11-27T08:07:21...] "POST /search ..." 200 … 1, 2 end user add generate log message RBSA proxy server RBSI system administrator watch

Slide 8

Slide 8 text

Experimental environment • Kubernetes Cluster (K3s, 1 virtual machine, Istio) • Customized proxy: “Istio proxy+” • Include RBSA by Lua • Application: “Doktor” (6 microservices) • Traffic dataset: 17,845 logs over 3 weeks • Source: “Doktor” production environment • Collect HTTP Method & Path • Simulate end user’s access patterns • HTTP request sender: “loader” • Measure HTTP response time • Log collection: ”kubectl logs” command • Measure log message length 8 microservices(Doktor) virtual machine production logs paper author stats fulltext thumbnail Istio gateway Istio proxy+ microservice loader kubectl logs … Istio proxy logs persistent volume log message length response time front load logs Experiemntal environement architecture

Slide 9

Slide 9 text

Experimental methods • Evaluate advantage and disadvantage about RBSI • [experiment 1] HTTP response time ⇒ RBSI calculation time overheads • Comparison targets: 3 log message formats • [experiment 2] Log message length ⇒ RBSI effect on storage capacity • Comparison targets: 3 log message formats 9 Itsio default logging format proposed logging format Response body full logging format Default values Additional values Full HTTP response body RBSI

Slide 10

Slide 10 text

Experimental results 10 [experiment 1] HTTP response time Average response time for “proposed logging” increases by 7% compared with ”default logging” Log message length increase [experiment 2] Log message length Median log message length for “proposed logging” increases by 7 bytes (0.02%) compared with ”default logging” 0% 20% 40% 60% 80% 100% 120% default logging response body full logging proposed logging Response time increase HTTP response time overheads [experiment 1] [experiment 2] 7% better better

Slide 11

Slide 11 text

Discussion – RBSI Expressiveness 11 • [Fault case] Order bugs on web application by complex SQL queries • Change item orders in HTTP response body • RBSI doesn’t represent list item order changes • The same RBSI values: “Normal response” ↔ “Item order changes” • [Solution] Use hash algorithm (e.g. SHA-1) to represent the order of items • Use short hash value such as Git commit ID to reduce log message length {"found": [ {"publish_year": 2020}, {"publish_year": 2022}, {"publish_year": 2019}, ]} {"found": [ {"publish_year": 2022}, {"publish_year": 2020}, {"publish_year": 2019}, ]} ◆ Normal response ◆ Item order changes (fault) 1,3 1,3 HTTP response body (JSON format) 6aaad70 e179be7 SHA-1 (short) SHA-1 (short) RBSI RBSI

Slide 12

Slide 12 text

Conclusion Goal • System administrator can obtain information for root cause analysis from log messages. Issue • System administrator takes time to diagnose a condition that error occurs by logs. ⇒ Values in log messages are insufficient for diagnosis. Proposed method • Propose identifier "RBSI" for diagnostics with log messages • RBSI represent HTTP response body structure changes Experimental results • Average response time: proposed log format increases 7% than “default log format” • Log message length: proposed log format increases 0.02% than ”default log format” 12