Slide 1

Slide 1 text

 Everforth / Acroquest Technology

Slide 2

Slide 2 text


Slide 3

Slide 3 text

WHO AM I? ➤ Shin Tanimoto (Twitter: @cero_t) ➤ Senior Solution Architect / Troubleshooter ➤ Everforth Co.,LTD. ➤ Acroquest Technology Co.,LTD. ➤ Leader of Japan Java User Group (JJUG) ➤ Java Champion ➤ Oracle Groundbreaker Ambassador ➤ Fighting Games / BABYMETAL

Slide 4

Slide 4 text

MONITORING / VISUALIZING ➤ When it comes to monitoring,
 what do you think of? ➤ Tools? ➤ Charts? ➤ Alerts?

Slide 5

Slide 5 text

USE-CASE FIRST ➤ We should start from use-cases. ➤ Monitoring ➤ Alerting ➤ Troubleshooting ➤ Discovering

Slide 6

Slide 6 text

USE-CASE FIRST ➤ We should start from use-cases. ➤ Monitoring → What’s going on “now” ➤ Alerting → Is it okay “now”? ➤ Troubleshooting → What happened in the “past” ➤ Discovering → To know the unknown

Slide 7

Slide 7 text

A METAPHOR ➤ A car's speedometer is a simple dial gauge
 (or a digital meter showing a single number)
 ➤ It’s enough to get the current status, isn’t it?

Slide 8

Slide 8 text

A METAPHOR ➤ Japanese cars used to sound a warning when the speed exceeded 100 km/h ➤ This is alerting ➤ When a traffic accident happens, records such as those from a drive recorder (dashcam) are necessary ➤ That is troubleshooting ➤ Analyzing the driver’s steering and pedal operation may prevent potential traffic accidents ➤ That is discovering

Slide 9

Slide 9 text

AGENDA 1. Collecting data 2. Building monitoring environment 3. Tips and pitfalls 4. Improving monitoring

Slide 10

Slide 10 text


Slide 11

Slide 11 text

MONITORING TARGET ➤ E-Commerce microservices built with Spring Boot.
 [Diagram: UI (vue.js / Nginx), store-web (Spring Boot), item-service (Spring Boot), stock-service (Spring Boot), cart-service (Spring Boot), order-service (Spring Boot), payment-service (Spring Boot)]

Slide 12

Slide 12 text


Slide 13

Slide 13 text

WHAT SHOULD BE COLLECTED ➤ Metrics ➤ Time-series numerical data ➤ In most cases, sampling-based ➤ Logs ➤ Text messages with timestamp ➤ In most cases, event-driven

Slide 14

Slide 14 text

WHAT SHOULD BE COLLECTED ➤ Metrics to collect ➤ Resources ➤ Performance ➤ Health

Slide 15

Slide 15 text

COLLECTING DATA / METRICS ➤ Resource metrics ➤ Server / Container resources ➤ CPU usage, memory usage, disk usage ➤ Disk I/O, network I/O ➤ JVM resources ➤ Heap usage, non-heap usage ➤ GC pause, GC count ➤ Thread count

Slide 16

Slide 16 text

COLLECTING DATA / METRICS ➤ Resource metrics ➤ Server / Container resources ➤ CPU usage, memory usage, disk usage (Capacity) ➤ Disk I/O, network I/O (Performance) ➤ JVM resources ➤ Heap usage, non-heap usage (Capacity) ➤ GC pause, GC count (Performance) ➤ Thread count (Capacity)

Slide 17

Slide 17 text

COLLECTING DATA / METRICS ➤ Resource metrics ➤ Server / Container resources ➤ CPU usage, memory usage, disk usage (Monitoring / Alerting) ➤ Disk I/O, network I/O (Troubleshooting) ➤ JVM resources ➤ Heap usage, non-heap usage (Monitoring / Alerting) ➤ GC pause, GC count (Troubleshooting) ➤ Thread count (Monitoring)
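
The JVM resource metrics listed above can be sampled from inside the application with the standard `java.lang.management` MXBeans. The following is only a minimal sketch: the class name and the metric names are made up for illustration, and the accumulated GC time is used as a rough stand-in for "GC pause" (it is not an exact pause time for concurrent collectors).

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: takes one sample of heap, non-heap, GC and thread metrics.
class JvmMetricsSampler {

    static Map<String, Long> sample() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        long gcCount = 0;
        long gcTimeMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMillis += Math.max(0, gc.getCollectionTime());
        }

        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("jvm.heap.used.bytes", memory.getHeapMemoryUsage().getUsed());
        metrics.put("jvm.nonheap.used.bytes", memory.getNonHeapMemoryUsage().getUsed());
        metrics.put("jvm.gc.count", gcCount);
        metrics.put("jvm.gc.time.ms", gcTimeMillis);
        metrics.put("jvm.thread.count",
                (long) ManagementFactory.getThreadMXBean().getThreadCount());
        return metrics;
    }

    public static void main(String[] args) {
        // A real agent would call sample() on a fixed interval and ship the result.
        sample().forEach((name, value) -> System.out.println(name + "=" + value));
    }
}
```

Because metrics are sampling-based, the interval between calls decides how short a spike can be and still show up on a chart.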

Slide 18

Slide 18 text

COLLECTING DATA / METRICS ➤ Performance metrics ➤ Avg. HTTP response time of front-end servers ➤ Avg. HTTP response time of each (micro)service ➤ Database response time

Slide 19

Slide 19 text

COLLECTING DATA / METRICS ➤ Performance metrics ➤ Avg. HTTP response time of front-end servers ➤ Avg. HTTP response time of each (micro)service ➤ Database response time [Use-case labels: Troubleshooting; Troubleshooting; Monitoring / Alerting]

Slide 20

Slide 20 text

COLLECTING DATA / METRICS ➤ Health metrics ➤ HTTP access count of front-end servers ➤ HTTP access count of each (micro)service ➤ HTTP status of front-end servers ➤ HTTP status of each (micro)service

Slide 21

Slide 21 text

COLLECTING DATA / METRICS ➤ Health metrics ➤ HTTP access count of front-end servers ➤ HTTP access count of each (micro)service ➤ HTTP status of front-end servers ➤ HTTP status of each (micro)service [Use-case labels: Monitoring / Alerting; Monitoring; Troubleshooting; Troubleshooting]

Slide 22

Slide 22 text

COLLECTING DATA / METRICS ➤ Health metrics ➤ HTTP access count of front-end servers ➤ Both too many and too few accesses are problems ➤ HTTP access count of each (micro)service ➤ HTTP status of front-end servers ➤ HTTP status of each (micro)service [Use-case labels: Monitoring / Alerting; Monitoring; Troubleshooting; Troubleshooting]
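
Since both too many and too few accesses are problems, an access-count check needs a lower bound as well as an upper bound. A minimal sketch of such a two-sided check follows; the class name, the per-minute unit and the threshold values are assumptions chosen only for illustration.

```java
// Hypothetical two-sided check for the HTTP access count of a front-end server.
class AccessCountCheck {

    enum Status { TOO_FEW, OK, TOO_MANY }

    /** Classifies a requests-per-minute sample against a lower and an upper bound. */
    static Status check(long requestsPerMinute, long min, long max) {
        if (requestsPerMinute < min) {
            return Status.TOO_FEW;   // e.g. upstream outage or broken routing
        }
        if (requestsPerMinute > max) {
            return Status.TOO_MANY;  // e.g. traffic spike or retry storm
        }
        return Status.OK;
    }

    public static void main(String[] args) {
        // Illustrative thresholds: alert below 10 or above 5000 requests per minute.
        System.out.println(check(3, 10, 5000));
        System.out.println(check(1200, 10, 5000));
        System.out.println(check(9000, 10, 5000));
    }
}
```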

Slide 23

Slide 23 text

COLLECTING DATA / LOGS ➤ Logs ➤ Access logs of front-end servers ➤ Access logs of each (micro)service ➤ Application logs of each (micro)service ➤ SQL logs ➤ GC logs [Use-case label: Troubleshooting]

Slide 24

Slide 24 text

COLLECTING DATA / LOGS ➤ Logs ➤ Access logs of front-end servers ➤ Access logs of each (micro)service ➤ Application logs of each (micro)service ➤ SQL logs ➤ GC logs [Use-case labels: Troubleshooting; Alert on ERRORs?]

Slide 25

Slide 25 text


Slide 26

Slide 26 text

MONITORING ENVIRONMENT / ARCHITECTURE
 [Diagram: on each server / container, an agent collects metrics and logs from the Spring Boot application and sends them to a data store; a visualizer / alerting tool reads and summarizes the stored data.]

Slide 27

Slide 27 text

MONITORING ENVIRONMENT / AGENTS
 [Diagram: on a server / container running a Spring Boot application, an in-app agent sends JVM metrics, an in-server agent sends server metrics, and a log shipper agent sends logs to the data store.]

Slide 28

Slide 28 text

MONITORING ENVIRONMENT / ELASTICSEARCH ➤ Elasticsearch + Kibana ➤ Elasticsearch ➤ Open source full-text search engine ➤ Useful as a data store for logs and metrics ➤ Can store, search and aggregate logs quickly ➤ Kibana ➤ Open source visualizer for Elasticsearch ➤ View / create charts, maps, tables, etc. in a web browser

Slide 29

Slide 29 text

MONITORING ENVIRONMENT / SPOG ➤ Single Pane of Glass ➤ See all the information in one tool ➤ The monitoring environment should be like a baseball scoreboard. ➤ Viewed by all members ➤ Shows the current situation at a glance ➤ Don't split Monitoring, Alerting, Troubleshooting and Discovering across separate tools

Slide 30

Slide 30 text

MONITORING ENVIRONMENT / AGENTS ➤ Agents (Data collectors) ➤ Metricbeat ➤ Server / Container resources ➤ Filebeat or Fluentd ➤ Log shipper ➤ Elastic APM / Elastic APM Java Agent ➤ APM (Application Performance Monitoring) ➤ JVM resources

Slide 31

Slide 31 text

MONITORING ENVIRONMENT / AGENTS
 [Diagram: on a server, the Elastic APM Java Agent inside the Spring Boot application sends JVM metrics, Metricbeat sends server metrics, and Filebeat ships the logs to Elasticsearch; Kibana reads from Elasticsearch.]

Slide 32

Slide 32 text

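MONITORING ENVIRONMENT / AGENTS
 [Diagram: in a container, the Elastic APM Java Agent inside the Spring Boot application sends JVM metrics, Metricbeat sends container metrics, and Fluentd receives the logs through the fluentd log-driver and ships them to Elasticsearch; Kibana reads from Elasticsearch.]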

Slide 33

Slide 33 text


Slide 34

Slide 34 text


Slide 35

Slide 35 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Metricbeat ➤ Server monitoring ➤ Just run Metricbeat (out-of-the-box, no configuration) ➤ Container (Docker) monitoring ➤ Enable metricbeat/modules.d/docker.yml ➤ Run Metricbeat on Docker ➤ Metricbeat (auto)discovers containers and retrieves container metrics

Slide 36

Slide 36 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Elastic APM Java Agent (1/2) ➤ Add the dependency to the pom.xml of the Spring Boot application
 <dependency>
   <groupId>co.elastic.apm</groupId>
   <artifactId>apm-agent-attach</artifactId>
 </dependency>

Slide 37

Slide 37 text

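MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Elastic APM Java Agent (2/2) ➤ Add elasticapm.properties
 service_name=store-web
 ➤ Add “ElasticApmAttacher.attach();” to the main method
 public static void main(String[] args) {
     ElasticApmAttacher.attach();
     // The application class name is missing in the source; StoreWebApplication is illustrative.
     SpringApplication.run(StoreWebApplication.class, args);
 }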

Slide 38

Slide 38 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Fluentd ➤ Run (micro)services with the fluentd log-driver ➤ Run Fluentd on Docker
 store-web:
   # snip
   logging:
     driver: "fluentd"
     options:
       tag: ""

Slide 39

Slide 39 text


Slide 40

Slide 40 text


Slide 41

Slide 41 text

TIPS #1 / JSON FORMAT LOGS ➤ Logs should be parsed before being stored in Elasticsearch ➤ A log line looks like:
 2016-02-26 11:15:47.561 INFO [service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- [nio-8081-exec-1] : Hello from service1. Calling service2
 ➤ The Grok pattern looks like:
 %{TIMESTAMP_ISO8601:timestamp}\s+%{LOGLEVEL:severity}\s+\[%{DATA:service},%{DATA:trace},%{DATA:span},%{DATA:exportable}\]\s+%{DATA:pid}\s+---\s+\[%{DATA:thread}\]\s+%{DATA:class}\s+:\s+%{GREEDYDATA:rest}
 ➤ Multi-line logs are hell ➤ Like aggregating stream events
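
To show what such parsing involves, here is a rough Java regex equivalent of the Grok pattern above. It is a sketch, not how Logstash or an ingest pipeline implements Grok. The class name is hypothetical, and the sample line in `main` adds an illustrative logger name (`c.e.Service1Application`) because the pattern expects a class field before the colon.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the plain-text log format shown on the slide.
class SleuthLogLineParser {

    private static final String[] FIELDS = {
        "timestamp", "severity", "service", "trace", "span",
        "exportable", "pid", "thread", "className", "rest"
    };

    private static final Pattern LINE = Pattern.compile(
          "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s+"
        + "(?<severity>[A-Z]+)\\s+"
        + "\\[(?<service>.*?),(?<trace>.*?),(?<span>.*?),(?<exportable>.*?)\\]\\s+"
        + "(?<pid>\\S+)\\s+---\\s+"
        + "\\[(?<thread>.*?)\\]\\s+"
        + "(?<className>.*?)\\s+:\\s+"
        + "(?<rest>.*)");

    /** Returns the extracted fields, or empty if the line does not match the format. */
    static Optional<Map<String, String>> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String field : FIELDS) {
            fields.put(field, m.group(field));
        }
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        String line = "2016-02-26 11:15:47.561 INFO "
            + "[service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- "
            + "[nio-8081-exec-1] c.e.Service1Application : Hello from service1. Calling service2";
        System.out.println(parse(line));
        // A stack-trace continuation line does not match on its own.
        System.out.println(parse("\tat com.example.Foo.bar(Foo.java:42)"));
    }
}
```

The second call illustrates why multi-line logs hurt: a continuation line carries no timestamp or trace ID, so it has to be stitched back onto the previous event before it can be parsed.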

Slide 42

Slide 42 text

TIPS #1 / JSON FORMAT LOGS ➤ Format logs as JSON at log output time! ➤ Application Log ➤ logstash-logback-encoder (timestamps in UTC), with a pattern like:
 { "severity": "%level", "service": "${springAppName:-}", "type": "application",
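
The benefit of writing JSON at output time can be seen in a small hand-rolled sketch. This is not what logstash-logback-encoder does internally, and it is not a complete JSON encoder. The class name and the `message` field are assumptions; only `severity`, `service` and `type` come from the pattern above.

```java
// Hypothetical formatter that emits one JSON object per log event.
class JsonLogLine {

    /** Escapes the characters that would otherwise break a JSON string or the line. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    static String format(String severity, String service, String message) {
        return "{\"severity\":\"" + escape(severity) + "\","
             + "\"service\":\"" + escape(service) + "\","
             + "\"type\":\"application\","
             + "\"message\":\"" + escape(message) + "\"}";
    }

    public static void main(String[] args) {
        // A multi-line stack trace still ends up as a single line, i.e. a single event.
        System.out.println(format("ERROR", "store-web", "boom\n\tat Foo.bar(Foo.java:42)"));
    }
}
```

With this shape, the log shipper needs no Grok pattern, and a stack trace stays inside one event because its line breaks are escaped.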

Slide 43

Slide 43 text

TIPS #1 / JSON FORMAT LOGS ➤ Format logs as JSON at log output time! ➤ Access Log ➤ Write the configuration by hand
 server.tomcat.accesslog.pattern={"@timestamp":"%{yyyy-MM-dd'T'HH:mm:ss.SSSZ}t","service":"${}","type":"access","method":"%m","remote":"%a","path":"%U","query":"%q","duration":"%D","status":"%s","bytes":"%B","user-agent":"%{User-Agent}i","referer":"%{Referer}i","session-id":"%S"}

Slide 44

Slide 44 text


Slide 45

Slide 45 text

TIPS #2 / DISTRIBUTED TRACING ➤ In monolithic Java applications, application behavior can be traced from application logs using the “thread-id”. ➤ Logs of microservices are micro-partitioned ➤ No correlation IDs for a single request! ➤ Distributed tracing helps us trace the behavior for a request ➤ A Trace-ID is like a thread-id or request-id across the microservices ➤ It is passed via an HTTP header, an AMQP header, etc.
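
The idea of passing a Trace-ID via an HTTP header can be sketched in a few lines. This is not how Spring Cloud Sleuth or Elastic APM is implemented. The class and method names are hypothetical, the header name `X-B3-TraceId` follows the B3 convention, and headers are modelled as a plain case-sensitive map for simplicity (real HTTP header names are case-insensitive).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of trace ID propagation between two services.
class TraceIdPropagation {

    static final String TRACE_HEADER = "X-B3-TraceId";

    /** Reuses the caller's trace ID if present, otherwise starts a new trace. */
    static String resolveTraceId(Map<String, String> incomingHeaders) {
        String existing = incomingHeaders.get(TRACE_HEADER);
        if (existing != null && !existing.isEmpty()) {
            return existing;
        }
        // 32 lower-case hex characters.
        return UUID.randomUUID().toString().replace("-", "");
    }

    /** Headers to attach to every outgoing call made while handling the request. */
    static Map<String, String> outgoingHeaders(String traceId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_HEADER, traceId);
        return headers;
    }

    /** Every log line carries the trace ID so logs can be correlated later. */
    static String logLine(String service, String traceId, String message) {
        return "[" + service + "," + traceId + "] " + message;
    }

    public static void main(String[] args) {
        // store-web receives a request without a trace ID and starts a trace.
        String traceId = resolveTraceId(new HashMap<>());
        System.out.println(logLine("store-web", traceId, "calling item-service"));

        // item-service receives the propagated header and reuses the same ID.
        String downstream = resolveTraceId(outgoingHeaders(traceId));
        System.out.println(logLine("item-service", downstream, "handling request"));
    }
}
```

Searching the log store for one trace ID then returns the lines from every service that took part in the request.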

Slide 46

Slide 46 text

TIPS #2 / DISTRIBUTED TRACING ➤ Tools ➤ Spring Cloud Sleuth ➤ Depends on Spring DI, AOP ➤ Available only for Spring Boot applications ➤ Elastic APM ➤ Uses bytecode instrumentation via Byte Buddy ➤ Available for all Java applications

Slide 47

Slide 47 text


Slide 48

Slide 48 text

TIPS #2 / DISTRIBUTED TRACING ➤ Spring Cloud Sleuth vs Elastic APM ➤ Spring Cloud Sleuth ➤ Supports most Spring modules ➤ RestTemplate, WebFlux, Netty … ➤ Spring Cloud Stream, Spring RabbitMQ, Spring Kafka … ➤ Elastic APM ➤ Supports a limited set of libraries ➤ Servlet, JAX-RS, Apache HttpClient, JDBC … ➤ Spring MVC, RestTemplate … ➤ Does NOT support WebFlux, RabbitMQ, Kafka …

Slide 49

Slide 49 text

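MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Spring Cloud Sleuth configuration ➤ Add the dependency to the pom.xml of the Spring Boot application
 <dependency>
   <groupId>org.springframework.cloud</groupId>
   <artifactId>spring-cloud-starter-sleuth</artifactId>
 </dependency>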

Slide 50

Slide 50 text

MONITORING ENVIRONMENT / AGENT CONFIGURATIONS ➤ Spring Cloud Sleuth configuration ➤ Modify logback-spring.xml if you customized the log output format
 {
   "severity": "%level",
   "service": "${springAppName:-}",
   "type": "application",
   "trace": "%X{X-B3-TraceId:-}",
   "span": "%X{X-B3-SpanId:-}",
   "parent": "%X{X-B3-ParentSpanId:-}",
   "exportable": "%X{X-Span-Export:-}",

Slide 51

Slide 51 text

From 17:00 at Room 401 How to Properly Blame Things for Causing Latency:
 An Introduction to Distributed Tracing and Zipkin By Adrian Cole

Slide 52

Slide 52 text


Slide 53

Slide 53 text

PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS ➤ Threshold-based alerting only detects predictable problems ➤ If we only collect CPU usage and memory usage, we cannot notice a full disk. ➤ If we only collect disk volume usage, we cannot notice a partition running out of space. ➤ If we only collect disk size, we cannot notice inode exhaustion. ➤ Not all problems can be predicted from the beginning. ➤ If an unknown problem occurs, it should be fed back into the monitoring, because the same problem may occur again in the near future

Slide 54

Slide 54 text

PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS ➤ To discover unknown problems … ➤ Filter out all “known” logs on Kibana ➤ Only unknown logs are displayed ➤ Unknown logs may indicate unknown problems ➤ This leads to more improvement tasks and a more stable system
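
The "filter out everything known" idea is done with Kibana filters in the talk. The same logic, expressed as a minimal Java sketch with a hypothetical class name and made-up log messages, looks like this:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical filter: keeps only the messages that match none of the known patterns.
class UnknownLogFilter {

    static List<String> unknownOnly(List<String> messages, List<Pattern> knownPatterns) {
        return messages.stream()
                .filter(msg -> knownPatterns.stream().noneMatch(p -> p.matcher(msg).find()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Illustrative patterns for messages the team already understands.
        List<Pattern> known = Arrays.asList(
                Pattern.compile("^Started application"),
                Pattern.compile("^Order \\d+ accepted$"));

        List<String> messages = Arrays.asList(
                "Started application in 3.2 seconds",
                "Order 1042 accepted",
                "Connection pool exhausted");

        // Only the message nobody has classified yet remains.
        System.out.println(unknownOnly(messages, known));
    }
}
```

Each message that survives the filter is either added to the known list or turned into a new monitoring rule, so the unknown set shrinks over time.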

Slide 55

Slide 55 text


Slide 56

Slide 56 text

PITFALL #2 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS ➤ A (micro)service seems healthy but is actually down ➤ Returns 200 OK to the health check request ➤ Metrics are normal ➤ But returns no response at all ➤ The last log update time of each service should be monitored ➤ A living application produces some logs ➤ If there are no logs, it may be dead
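
Monitoring the last log update time boils down to comparing each service's newest log timestamp with the current time. A minimal sketch, assuming the last timestamps have already been fetched from the log store; the class name, the five-minute limit and the timestamps are illustrative only:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical check: which services have been silent for longer than allowed?
class SilentServiceDetector {

    static List<String> silentServices(Map<String, Instant> lastLogTime,
                                       Instant now,
                                       Duration maxSilence) {
        return lastLogTime.entrySet().stream()
                .filter(e -> Duration.between(e.getValue(), now).compareTo(maxSilence) > 0)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2019-05-18T10:00:00Z");
        Map<String, Instant> lastLogTime = new HashMap<>();
        lastLogTime.put("cart-service", Instant.parse("2019-05-18T09:59:30Z"));
        lastLogTime.put("order-service", Instant.parse("2019-05-18T09:50:00Z"));

        // order-service has logged nothing for ten minutes, so it is reported.
        System.out.println(silentServices(lastLogTime, now, Duration.ofMinutes(5)));
    }
}
```

A suitable silence limit depends on how chatty each service normally is; a service with little traffic may need a periodic heartbeat log for this check to work.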

Slide 57

Slide 57 text


Slide 58

Slide 58 text


Slide 59

Slide 59 text

MONITORING AND ALERTING / METRICS ➤ Monitoring KPIs (1) The mean time from when a problem occurs
 to when it is detected (2) # of problems detected by alerting / Total # of problems (3) # of monitoring improvements / Total # of problems (4) (# of problems detected by alerting + # of monitoring improvements) / Total # of problems, i.e. (2) + (3) ➤ These KPIs show whether the monitoring environment has improved or not ➤ Monitoring improvement leads to more stable system operation
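
These KPIs are simple to compute once each problem is recorded with its detection delay and outcome. The sketch below uses a hypothetical class name and made-up numbers, and it assumes KPI (4) means the sum of the two numerators divided by the total number of problems.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.List;

// Hypothetical KPI calculator for a monitoring review.
class MonitoringKpis {

    /** KPI (1): mean time from occurrence to detection, in seconds. */
    static double meanTimeToDetectSeconds(List<Duration> detectionDelays) {
        return detectionDelays.stream()
                .mapToLong(Duration::getSeconds)
                .average()
                .orElse(0.0);
    }

    /** KPIs (2) to (4): a count divided by the total number of problems. */
    static double ratio(long count, long totalProblems) {
        return totalProblems == 0 ? 0.0 : (double) count / totalProblems;
    }

    public static void main(String[] args) {
        // Illustrative period: 10 problems, 6 caught by alerts, 3 led to improvements.
        long total = 10;
        long detectedByAlerting = 6;
        long improvements = 3;
        List<Duration> delays = Arrays.asList(
                Duration.ofSeconds(60), Duration.ofSeconds(120), Duration.ofSeconds(300));

        System.out.println("(1) " + meanTimeToDetectSeconds(delays) + " s");
        System.out.println("(2) " + ratio(detectedByAlerting, total));
        System.out.println("(3) " + ratio(improvements, total));
        System.out.println("(4) " + ratio(detectedByAlerting + improvements, total));
    }
}
```

Tracking the same four numbers period over period is what shows whether the monitoring environment is getting better.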

Slide 60

Slide 60 text

ENJOY MONITORING YOUR SYSTEM! ➤ Demo application and monitoring environment ➤ spring-store-2019/ ➤ My Twitter (@cero_t) ➤