sound when the speed exceeds 100 km/h ➤ This is alerting ➤ When a traffic accident happens, logs such as those from a drive recorder are necessary ➤ That is troubleshooting ➤ Analyzing the driver's steering and pedal operation may prevent a potential traffic accident ➤ That is discovering
Avg. HTTP response time of front-end servers ➤ Avg. HTTP response time of each (micro)service ➤ Database response time ➤ Each of these metrics serves a Monitoring / Alerting or Troubleshooting purpose
HTTP access count of front-end servers ➤ Both too many and too few requests are problems ➤ HTTP access count of each (micro)service ➤ HTTP status of front-end servers ➤ HTTP status of each (micro)service ➤ Each of these metrics serves a Monitoring / Alerting, Monitoring, or Troubleshooting purpose
Elasticsearch ➤ Open source full-text search engine ➤ Useful as a data store for logs and metrics ➤ Can store, search and aggregate logs quickly ➤ Kibana ➤ Open source visualizer for Elasticsearch ➤ View / create charts, maps, tables, etc. in a web browser
See all the information in one tool ➤ The monitoring environment should be like a baseball scoreboard ➤ Viewed by all members ➤ Shows the current situation at a glance ➤ Don't break the combination of Monitoring, Alerting, Troubleshooting and Discovering
➤ Just run Metricbeat (works out of the box, no configuration) ➤ Container (Docker) monitoring ➤ Enable Metricbeat's modules.d/docker.yml (see the snippet below) ➤ Run Metricbeat on Docker ➤ Metricbeat (auto)discovers containers and retrieves container metrics
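As an illustration (not taken from the deck): enabling the Docker module typically means running `metricbeat modules enable docker`, which activates modules.d/docker.yml; its contents look roughly like the following sketch (metricsets and period are the usual defaults, adjust as needed):

  # modules.d/docker.yml -- rough sketch of the stock module configuration
  - module: docker
    metricsets:
      - "container"
      - "cpu"
      - "diskio"
      - "memory"
      - "network"
    hosts: ["unix:///var/run/docker.sock"]   # Metricbeat reads metrics from the Docker socket
    period: 10s
    enabled: true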
(1/2) ➤ Add the dependency to the pom.xml of the Spring Boot application: <dependency> <groupId>co.elastic.apm</groupId> <artifactId>apm-agent-attach</artifactId> <version>1.9.0</version> </dependency>
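The second step (2/2), not shown in this excerpt, is presumably the programmatic attach. A minimal sketch, assuming the standard co.elastic.apm.attach.ElasticApmAttacher API that this dependency provides:

  import co.elastic.apm.attach.ElasticApmAttacher;
  import org.springframework.boot.SpringApplication;
  import org.springframework.boot.autoconfigure.SpringBootApplication;

  @SpringBootApplication
  public class Application {
      public static void main(String[] args) {
          // Attach the Elastic APM agent to this JVM before the application starts
          ElasticApmAttacher.attach();
          SpringApplication.run(Application.class, args);
      }
  }

The agent then reads settings such as the APM Server URL from an elasticapm.properties classpath resource or from environment variables.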
Logs must be parsed before storing to Elasticsearch ➤ A log line looks like: 2016-02-26 11:15:47.561 INFO [service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- [nio-8081-exec-1] i.s.c.sleuth.docs.service1.Application : Hello from service1. Calling service2 ➤ The grok pattern looks like: %{TIMESTAMP_ISO8601:timestamp}\s+%{LOGLEVEL:severity}\s+\[%{DATA:service},%{DATA:trace},%{DATA:span},%{DATA:exportable}\]\s+%{DATA:pid}\s+---\s+\[%{DATA:thread}\]\s+%{DATA:class}\s+:\s+%{GREEDYDATA:rest} ➤ Multi-line logs are hell ➤ Handling them is like aggregating stream events
JSON at the log output time! ➤ Access Log ➤ Write the configuration by hand (a sample entry is shown below): server.tomcat.accesslog.pattern={"@timestamp":"%{yyyy-MM-dd'T'HH:mm:ss.SSSZ}t","service":"${spring.application.name}","type":"access","method":"%m","remote":"%a","path":"%U","query":"%q","duration":"%D","status":"%s","bytes":"%B","user-agent":"%{User-Agent}i","referer":"%{Referer}i","session-id":"%S"}
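With that pattern, each access-log entry comes out as one JSON document per line, roughly like this (all values are made up for illustration):

  {"@timestamp":"2019-06-01T12:00:00.000+0900","service":"service1","type":"access","method":"GET","remote":"127.0.0.1","path":"/hello","query":"","duration":"12","status":"200","bytes":"42","user-agent":"curl/7.64.0","referer":"-","session-id":"-"}

The shipper can then send each line as-is and Elasticsearch indexes the fields without any grok parsing.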
In a monolith, application behaviors can be traced from application logs using the thread-id ➤ Logs of microservices are micro-partitioned ➤ There are no correlation IDs for a single request! ➤ Distributed tracing helps us trace the behavior of a request ➤ A Trace-ID is like a thread-id or request-id for microservices ➤ It is passed via HTTP headers, AMQP headers, etc. (see the sketch below)
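To make the idea concrete, here is a hand-rolled sketch of what trace-ID propagation amounts to; in practice Sleuth or Elastic APM do this automatically. The header name X-B3-TraceId is the B3 propagation format used by Spring Cloud Sleuth; the MDC key "traceId" is an assumption for illustration:

  import java.io.IOException;
  import java.util.UUID;
  import javax.servlet.FilterChain;
  import javax.servlet.ServletException;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;
  import org.slf4j.MDC;
  import org.springframework.stereotype.Component;
  import org.springframework.web.filter.OncePerRequestFilter;

  @Component
  public class TraceIdFilter extends OncePerRequestFilter {
      @Override
      protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                      FilterChain chain) throws ServletException, IOException {
          // Reuse the trace id propagated by the caller, or start a new one
          String traceId = request.getHeader("X-B3-TraceId");
          if (traceId == null || traceId.isEmpty()) {
              traceId = UUID.randomUUID().toString().replace("-", "");
          }
          MDC.put("traceId", traceId);           // visible to every log statement of this request
          response.setHeader("X-B3-TraceId", traceId);
          try {
              chain.doFilter(request, response);
          } finally {
              MDC.remove("traceId");
          }
      }
  }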
Sleuth ➤ Depends on Spring DI and AOP ➤ Available only for Spring Boot applications ➤ Elastic APM ➤ Uses bytecode instrumentation via Byte Buddy ➤ Available for all Java applications
Elastic APM ➤ Spring Cloud Sleuth ➤ Supports most Spring modules ➤ RestTemplate, WebFlux, Netty … ➤ Spring Cloud Stream, Spring RabbitMQ, Spring Kafka … ➤ Elastic APM ➤ Supports a limited set of libraries ➤ Servlet, JAX-RS, Apache HttpClient, JDBC … ➤ Spring MVC, RestTemplate … ➤ Does NOT support WebFlux, RabbitMQ, Kafka …
➤ Add the dependency to the pom.xml of the Spring Boot application: <dependency> <groupId>org.springframework.cloud</groupId> <artifactId>spring-cloud-starter-sleuth</artifactId> </dependency>
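No tracing-specific code is needed after that. A plain controller like the sketch below (service name, port and URL are assumptions) already produces log lines in the [service,trace,span,exportable] format shown earlier, and the trace id is propagated on the outgoing call as long as the RestTemplate is a Spring bean:

  import org.slf4j.Logger;
  import org.slf4j.LoggerFactory;
  import org.springframework.web.bind.annotation.GetMapping;
  import org.springframework.web.bind.annotation.RestController;
  import org.springframework.web.client.RestTemplate;

  @RestController
  public class Service1Controller {
      private static final Logger log = LoggerFactory.getLogger(Service1Controller.class);
      private final RestTemplate restTemplate;   // must be a bean so Sleuth can instrument it

      public Service1Controller(RestTemplate restTemplate) {
          this.restTemplate = restTemplate;
      }

      @GetMapping("/hello")
      public String hello() {
          // Sleuth prefixes this log line with [service1,<traceId>,<spanId>,<exportable>]
          log.info("Hello from service1. Calling service2");
          return restTemplate.getForObject("http://localhost:8082/hello", String.class);
      }
  }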
Threshold-based alerting only detects predictable problems ➤ If we only collect CPU usage and memory usage, we cannot notice a full disk ➤ If we only collect disk volume usage, we cannot notice a partition running out of space ➤ If we only collect disk size, we cannot notice inode exhaustion ➤ Not all problems can be predicted from the beginning ➤ When an unknown problem occurs, it should be fed back into the monitoring, because the same problem may occur again in the near future
To discover unknown problems … ➤ Filter out all "known" logs in Kibana (a hypothetical query is sketched below) ➤ Only unknown logs are displayed ➤ Unknown logs may indicate unknown problems ➤ This creates more work, but makes the system more stable
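As a hypothetical illustration, such a filter can be a saved search in Kibana whose query excludes the patterns already classified as known (the patterns below are made up; real ones come from your own inventory of known logs):

  NOT (message:"health check" OR message:"Started Application" OR message:"Scheduled job finished")

Each time a new log is investigated and understood, its pattern is added to this exclusion list, so the view keeps showing only genuinely unknown logs.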
A (micro)service can seem healthy but actually be down ➤ It returns 200 OK to health check requests ➤ Metrics are normal ➤ But it returns no response at all ➤ The last update time of each service's log should be monitored (see the sketch below) ➤ A living application produces some log output ➤ If there is no log, it may be dead
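In practice this can be an Elasticsearch query for the latest @timestamp per service. As a minimal self-contained sketch of the idea (path and threshold are assumptions), a watchdog can simply check how long a log file has been silent:

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.nio.file.Paths;
  import java.time.Duration;
  import java.time.Instant;

  public class LogFreshnessCheck {
      public static void main(String[] args) throws IOException {
          Path logFile = Paths.get("/var/log/app/service1.log");   // assumed log location
          Instant lastWrite = Files.getLastModifiedTime(logFile).toInstant();
          Duration silence = Duration.between(lastWrite, Instant.now());
          // A living service keeps writing logs; prolonged silence is a strong hint it is stuck
          if (silence.compareTo(Duration.ofMinutes(5)) > 0) {
              System.out.println("ALERT: no log output from service1 for " + silence.toMinutes() + " minutes");
          }
      }
  }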
(1) Time from when a problem occurs to when it is detected ➤ (2) # of problems detected by alerting / Total # of problems ➤ (3) # of monitoring improvements / Total # of problems ➤ (4) (# of problems detected by alerting + # of monitoring improvements) / Total # of problems ➤ We can clarify whether the monitoring environment has become better or not ➤ Monitoring improvement leads to more stable system operation
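A worked example with hypothetical numbers: if 10 problems occurred in a quarter, 6 were detected by alerting, and 3 of the remaining ones led to monitoring improvements, then (2) = 6/10 = 0.6, (3) = 3/10 = 0.3 and (4) = (6 + 3)/10 = 0.9; (4) trending toward 1.0 over time indicates that the monitoring environment is improving.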