Log message with JSON item count for root cause analysis in microservices Tomoyuki KOYAMA, Takayuki KUSHIDA Graduate School of Bionics, Computer and Media Sciences Tokyo University of Technology March 20, 2023 1 CIoT 2023 @Lisbon
Background • A web service “Doktor” • Provide keyword search and download for end users • Record access history into logs on receiving accesses • End users • Access the web service for technical reports download • Monitoring server • Send HTTP requests to Doktor* • Detect error response counts increase ⇒ Send an alert to system administrator • System administrator • Analyze root cause of error response by log search • Log server • Receive and store logs from microservices • Provide log search for system administrator 2 web service (Doktor) log server system administrators HTTP request HTTP response End users monitoring server send alert log search logs m onitor Error response counts increases Microservice Logs are utilized for root cause analysis. Error * https://github.com/cdsl-research/doktor-v2
Issue 3 • Microservices: front, fulltext search • Provide keyword search on Doktor • An application bug in “front” by developer • Conditional error occurance on “front” • An error occurs on search keyword “recall” • Error occurance depends on HTTP response body structure from “fulltext” • System administrator • Find log messages corresponding to root cause of error response • Take time to diagnose condition that error occurs ⇒ Not enough information for system administrator to obtain root cause from logs user A front fulltext search {"found": [ {"title": "xx"}, {"title": "yy"}, ]} log server HTTP response body strucutre is dynamically changed by HTTP requests. Keyword “recall” Keyword “IoT” web service (Doktor) {"found": [ ]} Microservice HTTP request HTTP response logs bugs system administrator log search Error user B [2023-01-28T03:22:39.272Z] "GET /fulltext/e0ac896 ..." 200 ... Example log
Related studies • There are related studies for root cause analysis or fault localization. • MicroRCA [Wu et al. 2020] • Proposed graph based anomaly propagation model to detect performance issues • Included “total sent bytes” as an metric for fault localization ⇒ ”total sent bytes” is different in HTTP requests even if URL endpoint is the same among requests. • CloudRanger [Wang et al. 2018] • Utilized response time and throughput as anomaly detection metrics ⇒This approach cannot detect HTTP response body structure changes. 4
Proposed method • Response Body Summary Index (RBSI) • Represent item counts per depth from HTTP response body • Provide observability for HTTP response body structure changes (e.g. item counts) • Support system administrator in finding log messages corresponding to errors • RBSI is added to the end of each log message. 5 Goal: System administrator can obtain information for root cause analysis from log messages. [2022-11-27T08:07:21...] "POST /search ..." 200 … 1, 2 [2022-11-27T08:07:24...] "POST /search ..." 500 … 1, 0 [2022-11-27T08:07:25...] "POST /search ..." 200 … 1, 8 Example: RBSI in log messages { ”fulltexts”: [ {“paper_id”: “xxx”, ”page_n”: 1, …}, {“paper_id”: “yyy”, ”page_n”: 3, …} ] } RBSI HTTP response body in JSON format RBSI 1,2
Usecase • RBSI is used for web service. • Web service consists of microservices . • Web service adopts service mesh (Istio). • Proxy servers for service mesh • Intercept HTTP requests/responses • Generate logs when requests is received • Add RBSA to proxy servers • System administrator finds log messages related to errors. • RBSI represents HTTP response body structure changes. • RBSI allows system administrator to reduce the time to analyze root cause of errors through log messages. 7 web service app1 microservice 1 app2 microservice 2 [2022-11-27T08:07:21...] "POST /search ..." 200 … 1, 2 end user add generate log message RBSA proxy server RBSI system administrator watch
Discussion – RBSI Expressiveness 11 • [Fault case] Order bugs on web application by complex SQL queries • Change item orders in HTTP response body • RBSI doesn’t represent list item order changes • The same RBSI values: “Normal response” ↔ “Item order changes” • [Solution] Use hash algorithm (e.g. SHA-1) to represent the order of items • Use short hash value such as Git commit ID to reduce log message length {"found": [ {"publish_year": 2020}, {"publish_year": 2022}, {"publish_year": 2019}, ]} {"found": [ {"publish_year": 2022}, {"publish_year": 2020}, {"publish_year": 2019}, ]} ◆ Normal response ◆ Item order changes (fault) 1,3 1,3 HTTP response body (JSON format) 6aaad70 e179be7 SHA-1 (short) SHA-1 (short) RBSI RBSI
Conclusion Goal • System administrator can obtain information for root cause analysis from log messages. Issue • System administrator takes time to diagnose a condition that error occurs by logs. ⇒ Values in log messages are insufficient for diagnosis. Proposed method • Propose identifier "RBSI" for diagnostics with log messages • RBSI represent HTTP response body structure changes Experimental results • Average response time: proposed log format increases 7% than “default log format” • Log message length: proposed log format increases 0.02% than ”default log format” 12