Monitoring and Visualizing Your (Micro)services

Slide 1

Slide 1 text

MONITORING AND VISUALIZING YOUR (MICRO)SERVICES Shin TANIMOTO (@cero_t)  Everforth / Acroquest Technology

Slide 2

Slide 2 text

DEMO SOURCE CODE http://bit.ly/jc2019monitoring

Slide 3

Slide 3 text

WHO AM I? ➤ Shin Tanimoto (Twitter: @cero_t) ➤ Senior Solution Architect / Troubleshooter ➤ Everforth Co.,LTD. ➤ Acroquest Technology Co.,LTD. ➤ Leader of Japan Java User Group (JJUG) ➤ Java Champion ➤ Oracle Groundbreaker Ambassador ➤ Fighting Games / BABYMETAL

Slide 4

Slide 4 text

MONITORING / VISUALIZING ➤ When it comes to monitoring,  what do you think of? ➤ Tools? ➤ Charts? ➤ Alerts?

Slide 5

Slide 5 text

USE-CASE FIRST ➤ We should start from use-cases. ➤ Monitoring ➤ Alerting ➤ Troubleshooting ➤ Discovering

Slide 6

Slide 6 text

USE-CASE FIRST ➤ We should start from use-cases. ➤ Monitoring → What’s going on “now” ➤ Alerting → Is it okay “now”? ➤ Troubleshooting → What happened in the “past” ➤ Discovering → To know the unknown

Slide 7

Slide 7 text

A METAPHOR ➤ A car's speedometer is a simple dial gauge (or a digital meter showing a single number) ➤ That is enough to see the current status, isn’t it?

Slide 8

Slide 8 text

A METAPHOR ➤ Japanese cars used to sound a warning chime when the speed exceeded 100 km/h ➤ This is alerting ➤ When a traffic accident happens, logs such as drive recorder (dashcam) footage are necessary ➤ That is troubleshooting ➤ Analyzing the driver’s steering and pedal operation may prevent a potential traffic accident ➤ That is discovering

Slide 9

Slide 9 text

AGENDA 1. Collecting data 2. Building monitoring environment 3. Tips and pitfalls 4. Improving monitoring

Slide 10

Slide 10 text

COLLECTING DATA

Slide 11

Slide 11 text

MONITORING TARGET ➤ E-Commerce microservices built with Spring Boot. [Diagram components: UI (Vue.js / Nginx), store-web (Spring Boot), item-service (Spring Boot), stock-service (Spring Boot), cart-service (Spring Boot), order-service (Spring Boot), payment-service (Spring Boot)]

Slide 12

Slide 12 text

DEMO

Slide 13

Slide 13 text

WHAT SHOULD BE COLLECTED ➤ Metrics ➤ Time-series numerical data ➤ In most cases, sampling-based ➤ Logs ➤ Text messages with timestamp ➤ In most cases, event-driven

Slide 14

Slide 14 text

WHAT SHOULD BE COLLECTED ➤ Metrics to collect ➤ Resources ➤ Performance ➤ Health

Slide 15

Slide 15 text

COLLECTING DATA / METRICS ➤ Resource metrics ➤ Server / Container resources ➤ CPU usage, memory usage, disk usage ➤ Disk I/O, network I/O ➤ JVM resources ➤ Heap usage, non-heap usage ➤ GC pause, GC count ➤ Thread count
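
The JVM resource metrics listed above can be read from inside the process through the standard `java.lang.management` MXBeans. The following is a minimal sketch of one sample, not what Elastic APM or Metricbeat actually do internally. The class name `JvmMetricsSnapshot` and the metric key names are hypothetical, chosen only for this illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

class JvmMetricsSnapshot {

    /** Takes one sample of heap, non-heap, GC and thread metrics. */
    static Map<String, Long> sample() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        long gcCount = 0;
        long gcTimeMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            gcCount += Math.max(0, gc.getCollectionCount());
            // Approximate accumulated collection time, used here as a stand-in for GC pause.
            gcTimeMillis += Math.max(0, gc.getCollectionTime());
        }

        // Metric key names below are illustrative assumptions.
        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("heap.used.bytes", memory.getHeapMemoryUsage().getUsed());
        metrics.put("nonheap.used.bytes", memory.getNonHeapMemoryUsage().getUsed());
        metrics.put("gc.count", gcCount);
        metrics.put("gc.time.ms", gcTimeMillis);
        metrics.put("thread.count", (long) ManagementFactory.getThreadMXBean().getThreadCount());
        return metrics;
    }

    public static void main(String[] args) {
        sample().forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Because metrics are sampling-based, a real agent would call something like `sample()` on a fixed interval and ship each result to the data store.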

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Slide 18

Slide 18 text

COLLECTING DATA / METRICS ➤ Performance metrics ➤ Avg. HTTP response time of front-end servers ➤ Avg. HTTP response time of each (micro)service ➤ Database response time

Slide 19

Slide 19 text

COLLECTING DATA / METRICS ➤ Performance metrics ➤ Avg. HTTP response time of front-end servers ➤ Avg. HTTP response time of each (micro)service ➤ Database response time [Use-case labels on the slide: Troubleshooting, Troubleshooting, Monitoring / Alerting]

Slide 20

Slide 20 text

COLLECTING DATA / METRICS ➤ Health metrics ➤ HTTP access count of front-end servers ➤ HTTP access count of each (micro)service ➤ HTTP status of front-end servers ➤ HTTP status of each (micro)service

Slide 21

Slide 21 text

Slide 22

Slide 22 text

COLLECTING DATA / METRICS ➤ Health metrics ➤ HTTP access count of front-end servers ➤ Both too many and too few are problems ➤ HTTP access count of each (micro)service ➤ HTTP status of front-end servers ➤ HTTP status of each (micro)service [Use-case labels on the slide: Monitoring / Alerting, Monitoring, Troubleshooting, Troubleshooting]
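
The point that both too many and too few accesses are problems means an alert on access count needs a lower bound as well as an upper bound. A minimal sketch of such a band check follows. The class name `AccessCountCheck` and the numeric bounds are hypothetical; real bounds would come from the observed traffic of the system.

```java
class AccessCountCheck {

    enum Status { TOO_FEW, OK, TOO_MANY }

    /** Classifies a per-minute access count against a lower and an upper bound (both inclusive). */
    static Status classify(long accessesPerMinute, long lowerBound, long upperBound) {
        if (lowerBound > upperBound) {
            throw new IllegalArgumentException("lowerBound must not exceed upperBound");
        }
        if (accessesPerMinute < lowerBound) {
            return Status.TOO_FEW;   // e.g. the front-end is unreachable or traffic is being dropped
        }
        if (accessesPerMinute > upperBound) {
            return Status.TOO_MANY;  // e.g. a traffic spike or a retry storm
        }
        return Status.OK;
    }

    public static void main(String[] args) {
        // Illustrative bounds only: 50 to 5000 requests per minute.
        for (long count : new long[] {3, 1200, 9000}) {
            System.out.println(count + " req/min -> " + classify(count, 50, 5000));
        }
    }
}
```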

Slide 23

Slide 23 text

COLLECTING DATA / LOGS ➤ Logs ➤ Access logs of front-end server ➤ Access logs of each (micro)service ➤ Application logs of each (micro)service ➤ SQL logs ➤ GC logs [Use-case label on the slide: Troubleshooting]

Slide 24

Slide 24 text

COLLECTING DATA / LOGS ➤ Logs ➤ Access logs of front-end server ➤ Access logs of each (micro)service ➤ Application logs of each (micro)service ➤ SQL logs ➤ GC logs [Use-case labels on the slide: Troubleshooting; Alert ERRORs?]

Slide 25

Slide 25 text

BUILDING  MONITORING ENVIRONMENT

Slide 26

Slide 26 text

MONITORING ENVIRONMENT / ARCHITECTURE [Diagram: each server / container runs a Spring Boot application and an agent; the agents send metrics and logs to a data store; the visualizer / alerting component reads and summarizes the data store.]

Slide 27

Slide 27 text

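MONITORING ENVIRONMENT / AGENTS [Diagram: in a server / container, an in-app agent inside the Spring Boot application sends JVM metrics, an in-server agent sends server metrics, and a log shipper agent reads the logs and sends them on; all three send to the data store.]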

Slide 28

Slide 28 text

MONITORING ENVIRONMENT / ELASTICSEARCH ➤ Elasticsearch + Kibana ➤ Elasticsearch ➤ Open source full-text search engine ➤ Useful as a data store for logs and metrics ➤ Can store, search and aggregate logs quickly ➤ Kibana ➤ Open source visualizer for Elasticsearch ➤ View / create charts, maps, tables, etc. in a web browser

Slide 29

Slide 29 text

MONITORING ENVIRONMENT / SPOG ➤ Single Pane of Glass ➤ See all the information in one tool ➤ Monitoring environment should be like a baseball scoreboard. ➤ Viewed by all members ➤ To see the current situation at a glance ➤ Don't separate Monitoring, Alerting, Troubleshooting and Discovering from each other

Slide 30

Slide 30 text

MONITORING ENVIRONMENT / AGENTS ➤ Agents (Data collectors) ➤ Metricbeat ➤ Server / Container resources ➤ Filebeat or Fluentd ➤ Log shipper ➤ Elastic APM / Elastic APM Java Agent ➤ APM (Application Performance Monitoring) ➤ JVM resources

Slide 31

Slide 31 text

MONITORING ENVIRONMENT / AGENTS [Diagram: on a server, the Elastic APM Java Agent inside the Spring Boot application sends JVM metrics, Metricbeat sends server metrics, and Filebeat ships the log files; all go to Elasticsearch, which Kibana reads and summarizes.]

Slide 32

Slide 32 text

MONITORING ENVIRONMENT / AGENTS [Diagram: in a container, the Elastic APM Java Agent inside the Spring Boot application sends JVM metrics, Metricbeat sends container metrics, and Fluentd receives the logs through the fluentd log-driver; all go to Elasticsearch, which Kibana reads and summarizes.]

Slide 33

Slide 33 text

DEMO

Slide 34

Slide 34 text

(APPENDIX)  CONFIGURATIONS  OF AGENTS

Slide 35

Slide 35 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Metricbeat ➤ Server monitoring ➤ Just run Metricbeat (works out of the box, no configuration) ➤ Container (Docker) monitoring ➤ Enable metricbeat/modules.d/docker.yml ➤ Run Metricbeat on Docker ➤ Metricbeat (auto)discovers containers and retrieves container metrics

Slide 36

Slide 36 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Elastic APM Java Agent (1/2) ➤ Add dependency to the pom.xml of Spring Boot application
<dependency>
  <groupId>co.elastic.apm</groupId>
  <artifactId>apm-agent-attach</artifactId>
  <version>1.9.0</version>
</dependency>

Slide 37

Slide 37 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Elastic APM Java Agent (2/2) ➤ Add elasticapm.properties
service_name=store-web
➤ Add “ElasticApmAttacher.attach();” to the main method
public static void main(String[] args) {
    ElasticApmAttacher.attach();
    SpringApplication.run(StoreApplication.class, args);
}

Slide 38

Slide 38 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Fluentd ➤ Run (micro)services with fluentd log-driver ➤ Run fluentd on Docker
store-web:
  # snip
  logging:
    driver: "fluentd"
    options:
      tag: "docker.services"

Slide 39

Slide 39 text

TIPS AND PITFALLS

Slide 40

Slide 40 text

TIPS #1  LOGS SHOULD BE  JSON FORMATTED

Slide 41

Slide 41 text

TIPS #1 / JSON FORMAT LOGS ➤ Logs should be parsed before being stored in Elasticsearch ➤ Log is like
2016-02-26 11:15:47.561 INFO [service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- [nio-8081-exec-1] i.s.c.sleuth.docs.service1.Application : Hello from service1. Calling service2
➤ Grok pattern is like
%{TIMESTAMP_ISO8601:timestamp}\s+%{LOGLEVEL:severity}\s+\[%{DATA:service},%{DATA:trace},%{DATA:span},%{DATA:exportable}\]\s+%{DATA:pid}\s+---\s+\[%{DATA:thread}\]\s+%{DATA:class}\s+:\s+%{GREEDYDATA:rest}
➤ Multi-line logs are hell ➤ Like aggregation of stream events
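
To show what parsing a plain-text log line involves, here is a rough Java regular-expression equivalent of the Grok pattern above, applied to the sample line from the slide. It is a sketch only: the timestamp and severity sub-patterns are simplified assumptions rather than the real Grok definitions, and the class name `PlainTextLogParser` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainTextLogParser {

    // Field names mirror the captures in the Grok pattern.
    private static final String[] FIELDS = {
        "timestamp", "severity", "service", "trace", "span",
        "exportable", "pid", "thread", "class", "rest"
    };

    // Simplified stand-ins: DATA becomes ".*?", GREEDYDATA becomes ".*".
    private static final Pattern LINE = Pattern.compile(
        "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s+"
        + "(?<severity>[A-Z]+)\\s+"
        + "\\[(?<service>.*?),(?<trace>.*?),(?<span>.*?),(?<exportable>.*?)\\]\\s+"
        + "(?<pid>.*?)\\s+---\\s+"
        + "\\[(?<thread>.*?)\\]\\s+"
        + "(?<class>.*?)\\s+:\\s+"
        + "(?<rest>.*)");

    /** Returns the parsed fields, or empty if the line does not match the expected layout. */
    static Optional<Map<String, String>> parse(String line) {
        Matcher matcher = LINE.matcher(line);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String field : FIELDS) {
            fields.put(field, matcher.group(field));
        }
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        String sample = "2016-02-26 11:15:47.561 INFO "
            + "[service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- "
            + "[nio-8081-exec-1] i.s.c.sleuth.docs.service1.Application : "
            + "Hello from service1. Calling service2";
        System.out.println(parse(sample));
        // A stack-trace continuation line does not match, which is the multi-line problem.
        System.out.println(parse("\tat com.example.Foo.bar(Foo.java:42)"));
    }
}
```

Any change to the log layout breaks the pattern, and continuation lines have to be stitched back onto their parent event, which is why emitting structured logs at the source is attractive.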

Slide 42

Slide 42 text

TIPS #1 / JSON FORMAT LOGS ➤ Format logs as JSON at log output time! ➤ Application Log ➤ logstash-logback-encoder
UTC { "severity": "%level", "service": "${springAppName:-}", "type": "application",

Slide 43

Slide 43 text

TIPS #1 / JSON FORMAT LOGS ➤ Format logs as JSON at log output time! ➤ Access Log ➤ Write configuration by hand
server.tomcat.accesslog.pattern={"@timestamp":"%{yyyy-MM-dd'T'HH:mm:ss.SSSZ}t","service":"${spring.application.name}","type":"access","method":"%m","remote":"%a","path":"%U","query":"%q","duration":"%D","status":"%s","bytes":"%B","user-agent":"%{User-Agent}i","referer":"%{Referer}i","session-id":"%S"}

Slide 44

Slide 44 text

TIPS #2  LOGS SHOULD SUPPORT  DISTRIBUTED TRACING

Slide 45

Slide 45 text

TIPS #2 / DISTRIBUTED TRACING ➤ In monolithic Java applications, application behavior can be traced from application logs using the “thread-id”. ➤ Logs of microservices are micro-partitioned ➤ No correlation IDs for a single request! ➤ Distributed tracing helps us trace the behavior for a request ➤ A Trace-ID plays the role of a thread-id or request-id across the microservices ➤ It is passed via HTTP header, AMQP header, etc.
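
The idea of passing a Trace-ID via an HTTP header can be sketched without any tracing library. The example below is an illustration under stated assumptions, not how Spring Cloud Sleuth or Elastic APM implement it: headers are modeled as a plain `Map` (real HTTP header names are case-insensitive), the class name `TraceIdPropagation` and its methods are hypothetical, and `X-B3-TraceId` is the header name from the B3 propagation format.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

class TraceIdPropagation {

    // Header name of the B3 propagation format.
    static final String TRACE_HEADER = "X-B3-TraceId";

    /** Creates a random 64-bit trace ID as 16 lowercase hex characters. */
    static String newTraceId() {
        return String.format("%016x", ThreadLocalRandom.current().nextLong());
    }

    /** Reuses the incoming trace ID if there is one, otherwise starts a new trace. */
    static String resolveTraceId(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(TRACE_HEADER);
        return (existing == null || existing.isEmpty()) ? newTraceId() : existing;
    }

    /** Builds the headers for a call to the next service, carrying the same trace ID. */
    static Map<String, String> outgoingHeaders(String traceId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_HEADER, traceId);
        return headers;
    }

    /** Every log line includes the trace ID so logs from all services can be joined on it. */
    static String logLine(String service, String traceId, String message) {
        return "[" + service + "," + traceId + "] " + message;
    }

    public static void main(String[] args) {
        // The front-end service receives a request without a trace ID and starts a trace.
        String traceId = resolveTraceId(new HashMap<>());
        System.out.println(logLine("store-web", traceId, "Calling item-service"));

        // The downstream service reads the same ID from the propagated header.
        String downstreamId = resolveTraceId(outgoingHeaders(traceId));
        System.out.println(logLine("item-service", downstreamId, "Handling request"));
    }
}
```

Searching the log store for one trace ID then returns the lines from every service that handled the request.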

Slide 46

Slide 46 text

TIPS #2 / DISTRIBUTED TRACING ➤ Tools ➤ Spring Cloud Sleuth ➤ Depends on Spring DI, AOP ➤ Available only for Spring Boot applications ➤ Elastic APM ➤ Uses bytecode instrumentation with Byte Buddy ➤ Available for all Java applications

Slide 47

Slide 47 text

TAKE A LOOK AT  THE LOGS

Slide 48

Slide 48 text

TIPS #2 / DISTRIBUTED TRACING ➤ Spring Cloud Sleuth vs Elastic APM ➤ Spring Cloud Sleuth ➤ Supports most of Spring modules ➤ RestTemplate, WebFlux, Netty … ➤ Spring Cloud Stream, Spring RabbitMQ, Spring Kafka … ➤ Elastic APM ➤ Supports limited libraries ➤ Servlet, JAX-RS, Apache HttpClient, JDBC … ➤ Spring MVC, RestTemplate … ➤ Does NOT support WebFlux, RabbitMQ, Kafka …

Slide 49

Slide 49 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Spring Cloud Sleuth configuration ➤ Add dependency to the pom.xml of Spring Boot application
<dependency>
  <groupId>org.springframework.cloud</groupId>
  <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>

Slide 50

Slide 50 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Spring Cloud Sleuth configuration ➤ Modify logback-spring.xml if you customized the log output format
{
  "severity": "%level",
  "service": "${springAppName:-}",
  "type": "application",
  "trace": "%X{X-B3-TraceId:-}",
  "span": "%X{X-B3-SpanId:-}",
  "parent": "%X{X-B3-ParentSpanId:-}",
  "exportable": "%X{X-Span-Export:-}",

Slide 51

Slide 51 text

From 17:00 at Room 401: “How to Properly Blame Things for Causing Latency: An Introduction to Distributed Tracing and Zipkin” by Adrian Cole

Slide 52

Slide 52 text

PITFALL #1  ALERTING CAN ONLY DETECT  PREDICTABLE PROBLEMS

Slide 53

Slide 53 text

PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS ➤ Threshold-based alerting only detects predictable problems ➤ If we only collect CPU usage and memory usage, we cannot notice a full disk. ➤ If we only collect disk volume usage, we cannot notice a partition running out of space. ➤ If we only collect disk size, we cannot notice inode exhaustion. ➤ Not all problems can be predicted from the beginning. ➤ If an unknown problem occurs, it should be fed back into the monitoring, because the same problem may occur again in the near future

Slide 54

Slide 54 text

PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS ➤ To discover unknown problems … ➤ Filter out all “known” logs on Kibana ➤ Only unknown logs are displayed ➤ Unknown logs may indicate unknown problems ➤ This leads to more tasks, but also to a more stable system
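
The talk does this filtering in Kibana. The same idea, excluding every message that matches a known pattern and looking at what remains, can be sketched in plain Java. The class name `UnknownLogFilter`, the patterns and the sample messages are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class UnknownLogFilter {

    /** Keeps only the messages that match none of the known patterns. */
    static List<String> unknownOnly(List<String> messages, List<Pattern> knownPatterns) {
        return messages.stream()
            .filter(message -> knownPatterns.stream()
                .noneMatch(pattern -> pattern.matcher(message).find()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Patterns for messages that have already been reviewed (illustrative only).
        List<Pattern> known = Arrays.asList(
            Pattern.compile("^Started \\w+ in "),
            Pattern.compile("^Order \\d+ accepted$"));

        List<String> messages = Arrays.asList(
            "Started StoreApplication in 4.2 seconds",
            "Order 1001 accepted",
            "Connection pool exhausted, waiting for a free connection");

        unknownOnly(messages, known).forEach(System.out::println);
    }
}
```

Each time an unknown message has been investigated, its pattern joins the known list, so the remaining view keeps shrinking toward genuinely new behavior.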

Slide 55

Slide 55 text

PITFALL #2  LIVING DEAD PROCESS

Slide 56

Slide 56 text

PITFALL #2 / LIVING DEAD PROCESS ➤ A (micro)service seems healthy but is actually down ➤ Returns 200 OK to the health check request ➤ Metrics are normal ➤ But it returns no response at all to real requests ➤ The last update time of each service's log should be monitored ➤ A living application produces some log output ➤ If there is no log output, it may be dead
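
Monitoring the last log update time can be sketched as a staleness check: for each service, compare the timestamp of its most recent log entry with the current time. This is a minimal illustration. The class name `StaleLogDetector`, the five-minute limit and the timestamps are assumptions, and in practice the last timestamps would come from a query against the log store.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class StaleLogDetector {

    /** Returns the names of services whose last log entry is older than maxSilence, sorted by name. */
    static List<String> silentServices(Map<String, Instant> lastLogTime, Instant now, Duration maxSilence) {
        return lastLogTime.entrySet().stream()
            .filter(entry -> Duration.between(entry.getValue(), now).compareTo(maxSilence) > 0)
            .map(Map.Entry::getKey)
            .sorted()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2019-05-18T10:00:00Z");  // fixed time for a reproducible run

        Map<String, Instant> lastLogTime = new HashMap<>();
        lastLogTime.put("store-web", now.minusSeconds(20));
        lastLogTime.put("item-service", now.minusSeconds(45));
        lastLogTime.put("payment-service", now.minusSeconds(1800));  // silent for 30 minutes

        // Assumed limit: a living service logs at least once every five minutes.
        System.out.println(silentServices(lastLogTime, now, Duration.ofMinutes(5)));
    }
}
```

The silence limit has to fit each service's normal traffic; a service that is legitimately idle overnight would need a longer limit or a periodic heartbeat log line.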

Slide 57

Slide 57 text

IMPROVE MONITORING

Slide 58

Slide 58 text

KPI OF  MONITORING ITSELF

Slide 59

Slide 59 text

MONITORING AND ALERTING / METRICS ➤ Monitoring KPIs
(1) The mean time from when a problem occurs to when it is detected
(2) # of problems detected by alerting / Total # of problems
(3) # of monitoring improvements / Total # of problems
(4) (# of problems detected by alerting + # of monitoring improvements) / Total # of problems
➤ These KPIs show whether the monitoring environment is getting better or not ➤ Monitoring improvement leads to more stable system operation
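
The four KPIs are simple arithmetic over incident records. The sketch below computes them for made-up numbers. The class name `MonitoringKpis` and all figures are hypothetical, and KPI (4) is read here as the sum of the two counts from (2) and (3) divided by the total number of problems, which is an interpretation of the slide.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

class MonitoringKpis {

    /** KPI (1): mean time in seconds from occurrence to detection; 0 if there are no samples. */
    static double meanTimeToDetectSeconds(List<Duration> detectionDelays) {
        return detectionDelays.stream()
            .mapToLong(Duration::getSeconds)
            .average()
            .orElse(0.0);
    }

    /** KPIs (2) to (4): a count divided by the total number of problems. */
    static double ratio(long count, long totalProblems) {
        if (totalProblems <= 0) {
            throw new IllegalArgumentException("totalProblems must be positive");
        }
        return (double) count / totalProblems;
    }

    public static void main(String[] args) {
        // Illustrative figures for one review period.
        List<Duration> delays = Arrays.asList(
            Duration.ofMinutes(1), Duration.ofMinutes(2), Duration.ofMinutes(5));
        long totalProblems = 10;
        long detectedByAlerting = 6;
        long monitoringImprovements = 3;

        System.out.println("(1) mean time to detect [s]: " + meanTimeToDetectSeconds(delays));
        System.out.println("(2) detected by alerting:    " + ratio(detectedByAlerting, totalProblems));
        System.out.println("(3) led to improvements:     " + ratio(monitoringImprovements, totalProblems));
        System.out.println("(4) combined:                "
            + ratio(detectedByAlerting + monitoringImprovements, totalProblems));
    }
}
```

Tracking these values per period makes it visible whether changes to the monitoring setup are paying off.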

Slide 60

Slide 60 text

ENJOY MONITORING YOUR SYSTEM! ➤ Demo application and monitoring environment ➤ https://github.com/cero-t/spring-store-2019/ ➤ My Twitter (@cero_t) ➤ https://twitter.com/cero_t