Monitoring and Visualizing Your (Micro)services

JCConf Taiwan 2019

Shin Tanimoto

October 04, 2019

Transcript

  1. MONITORING AND VISUALIZING
    YOUR (MICRO)SERVICES
    Shin TANIMOTO (@cero_t)

    Everforth / Acroquest Technology

  2. DEMO SOURCE CODE
    http://bit.ly/jc2019monitoring

  3. WHO AM I?
    ➤ Shin Tanimoto (Twitter: @cero_t)
    ➤ Senior Solution Architect / Troubleshooter
    ➤ Everforth Co.,LTD.
    ➤ Acroquest Technology Co.,LTD.
    ➤ Leader of Japan Java User Group (JJUG)
    ➤ Java Champion
    ➤ Oracle Groundbreaker Ambassador
    ➤ Fighting Games / BABYMETAL

  4. MONITORING / VISUALIZING
    ➤ When it comes to monitoring, what do you think of?
    ➤ Tools?
    ➤ Charts?
    ➤ Alerts?

  5. USE-CASE FIRST
    ➤ We should start from use-cases.
    ➤ Monitoring
    ➤ Alerting
    ➤ Troubleshooting
    ➤ Discovering

  6. USE-CASE FIRST
    ➤ We should start from use-cases.
    ➤ Monitoring → What’s going on “now”
    ➤ Alerting → Is it okay “now”?
    ➤ Troubleshooting → What happened in the “past”
    ➤ Discovering → To know the unknown

  7. A METAPHOR
    ➤ A car’s speedometer is an analog dial
    (or a digital meter showing a single number)

    ➤ It’s enough to get the current status, isn’t it?

  8. A METAPHOR
    ➤ Japanese cars were equipped with a warning chime that sounded when
    the speed exceeded 100 km/h
    ➤ This is alerting
    ➤ When a traffic accident happens, logs such as drive recorder (dashcam)
    footage are necessary
    ➤ That is troubleshooting
    ➤ Analyzing the driver’s steering and pedal operation may prevent
    potential traffic accidents
    ➤ That is discovering

  9. AGENDA
    1. Collecting Data
    2. Building monitoring environment
    3. Tips and pitfalls
    4. Improving monitoring

  10. COLLECTING DATA

  11. MONITORING TARGET
    ➤ E-Commerce Microservices built with Spring Boot.
    UI (vue.js / Nginx)
    store-web (Spring Boot)
    item-service (Spring Boot)
    stock-service (Spring Boot)
    cart-service (Spring Boot)
    order-service (Spring Boot)
    payment-service (Spring Boot)

  12. WHAT SHOULD BE COLLECTED
    ➤ Metrics
    ➤ Time-series numerical data
    ➤ In most cases, sampling-based
    ➤ Logs
    ➤ Text messages with timestamp
    ➤ In most cases, event-driven

  13. WHAT SHOULD BE COLLECTED
    ➤ Metrics to collect
    ➤ Resources
    ➤ Performance
    ➤ Health

  14. COLLECTING DATA / METRICS
    ➤ Resource metrics
    ➤ Server / Container resources
    ➤ CPU usage, memory usage, disk usage
    ➤ Disk IO, network IO
    ➤ JVM resource
    ➤ Heap usage, non-heap usage
    ➤ GC pause, GC count
    ➤ Thread count

  15. COLLECTING DATA / METRICS
    ➤ Resource metrics
    ➤ Server / Container resources
    ➤ CPU usage, memory usage, disk usage → Capacity
    ➤ Disk IO, network IO → Performance
    ➤ JVM resource
    ➤ Heap usage, non-heap usage → Capacity
    ➤ GC pause, GC count → Performance
    ➤ Thread count → Capacity

  16. COLLECTING DATA / METRICS
    ➤ Resource metrics
    ➤ Server / Container resources
    ➤ CPU usage, memory usage, disk usage → Monitoring / Alerting
    ➤ Disk IO, network IO → Troubleshooting
    ➤ JVM resource
    ➤ Heap usage, non-heap usage → Monitoring / Alerting
    ➤ GC pause, GC count → Troubleshooting
    ➤ Thread count → Monitoring

  17. COLLECTING DATA / METRICS
    ➤ Performance metrics
    ➤ Avg. HTTP response time of front-end servers
    ➤ Avg. HTTP response time of each (micro)service
    ➤ Database response time

  18. COLLECTING DATA / METRICS
    ➤ Performance metrics
    ➤ Avg. HTTP response time of front-end servers
    ➤ Avg. HTTP response time of each (micro)service
    ➤ Database response time
    (Use-case labels on the slide: Troubleshooting, Troubleshooting, Monitoring / Alerting)

  19. COLLECTING DATA / METRICS
    ➤ Health metrics
    ➤ HTTP access count of front-end servers
    ➤ HTTP access count of each (micro)service
    ➤ HTTP status of front-end servers
    ➤ HTTP status of each (micro)service

  20. COLLECTING DATA / METRICS
    ➤ Health metrics
    ➤ HTTP access count of front-end servers
    ➤ HTTP access count of each (micro)service
    ➤ HTTP status of front-end servers
    ➤ HTTP status of each (micro)service
    (Use-case labels on the slide: Monitoring / Alerting, Monitoring, Troubleshooting, Troubleshooting)

  21. COLLECTING DATA / METRICS
    ➤ Health metrics
    ➤ HTTP access count of front-end servers
    ➤ Both too many and too few are problems
    ➤ HTTP access count of each (micro)service
    ➤ HTTP status of front-end servers
    ➤ HTTP status of each (micro)service
    (Use-case labels on the slide: Monitoring / Alerting, Monitoring, Troubleshooting, Troubleshooting)
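
    The point that both too many and too few accesses are problems can be sketched as a band check rather than a single upper threshold. This is only an illustration: the class name, the per-minute unit and the bounds are assumptions, not part of the demo application.

```java
// Hypothetical sketch: an access count is healthy only inside an expected band.
final class AccessCountCheck {
    enum Status { TOO_FEW, NORMAL, TOO_MANY }

    /** Classifies a request rate against an expected range (both bounds inclusive). */
    static Status check(long requestsPerMinute, long expectedMin, long expectedMax) {
        if (requestsPerMinute < expectedMin) {
            return Status.TOO_FEW;   // e.g. the front-end is unreachable
        }
        if (requestsPerMinute > expectedMax) {
            return Status.TOO_MANY;  // e.g. a traffic spike or a retry storm
        }
        return Status.NORMAL;
    }

    public static void main(String[] args) {
        // The bounds are made-up numbers for illustration only.
        System.out.println(check(0, 50, 5_000));
        System.out.println(check(1_200, 50, 5_000));
        System.out.println(check(90_000, 50, 5_000));
    }
}
```

    A lower bound matters because a silent front-end produces no errors at all, so an upper-threshold alert alone would never fire.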

  22. COLLECTING DATA / LOGS
    ➤ Logs
    ➤ Access logs of front-end server
    ➤ Access logs of each (micro)service
    ➤ Application logs of each (micro)service
    ➤ SQL logs
    ➤ GC logs
    Troubleshooting

  23. COLLECTING DATA / LOGS
    ➤ Logs
    ➤ Access logs of front-end server
    ➤ Access logs of each (micro)service
    ➤ Application logs of each (micro)service
    ➤ SQL logs
    ➤ GC logs
    Troubleshooting
    Alert ERRORs?

  24. BUILDING MONITORING ENVIRONMENT

  25. MONITORING ENVIRONMENT / ARCHITECTURE
    [Architecture diagram]
    Server / Container (Spring Boot Application + Agent) → Metrics, Logs → Data Store
    Server / Container (Spring Boot Application + Agent) → Metrics, Logs → Data Store
    ... (further servers / containers)
    Visualizer / Alerting ← Read Summary ← Data Store

  26. MONITORING ENVIRONMENT / AGENTS
    [Diagram: agents on one Server / Container]
    Spring Boot Application → In-app Agent → JVM Metrics → Data Store
    Server / Container → In-server Agent → Server Metrics → Data Store
    Logs → Log Shipper Agent → Logs → Data Store

  27. MONITORING ENVIRONMENT / ELASTICSEARCH
    ➤ Elasticsearch + Kibana
    ➤ Elasticsearch
    ➤ Open source full-text search engine
    ➤ Useful for logs and metrics data store
    ➤ Can store, search and aggregate logs quickly
    ➤ Kibana
    ➤ Open source visualizer for Elasticsearch
    ➤ View / create charts, maps, tables, etc. in web browser

  28. MONITORING ENVIRONMENT / SPOG
    ➤ Single Pane of Glass
    ➤ See all the information in one tool
    ➤ Monitoring environment should be like a baseball scoreboard.
    ➤ Viewed by all members
    ➤ To see the current situation at a glance
    ➤ Don't break the combination between Monitoring, Alerting,
    Troubleshooting and Discovering

  29. MONITORING ENVIRONMENT / AGENTS
    ➤ Agents (Data collectors)
    ➤ Metricbeat
    ➤ Server / Container resources
    ➤ Filebeat or Fluentd
    ➤ Log shipper
    ➤ Elastic APM / Elastic APM Java Agent
    ➤ APM (Application Performance Monitoring)
    ➤ JVM resources

  30. MONITORING ENVIRONMENT / AGENTS
    [Diagram: agents on a Server]
    Spring Boot Application → Elastic APM Java Agent → JVM Metrics → Elasticsearch
    Server → Metricbeat → Server Metrics → Elasticsearch
    Logs → Filebeat → Logs → Elasticsearch
    Kibana ← Read Summary ← Elasticsearch

  31. MONITORING ENVIRONMENT / AGENTS
    [Diagram: agents on a Container]
    Spring Boot Application → Elastic APM Java Agent → JVM Metrics → Elasticsearch
    Container → Metricbeat → Container Metrics → Elasticsearch
    Logs (fluentd Log-driver) → Fluentd → Logs → Elasticsearch
    Kibana ← Read Summary ← Elasticsearch

  32. (APPENDIX) CONFIGURATIONS OF AGENTS

  33. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Metricbeat
    ➤ Server monitoring
    ➤ Just run metricbeat (out-of-the-box, no configuration)
    ➤ Container (Docker) monitoring
    ➤ Enable metricbeat/module.d/docker.yml
    ➤ Run metricbeat on docker
    ➤ Metricbeat (auto)discovers containers and retrieves
    container metrics

  34. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Elastic APM Java Agent (1/2)
    ➤ Add dependency to the pom.xml of Spring Boot application
    <dependency>
        <groupId>co.elastic.apm</groupId>
        <artifactId>apm-agent-attach</artifactId>
        <version>1.9.0</version>
    </dependency>

  35. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Elastic APM Java Agent (2/2)
    ➤ Add elasticapm.properties
    service_name=store-web
    ➤ Add “ElasticApmAttacher.attach();” to the main method
    public static void main(String[] args) {
        ElasticApmAttacher.attach();
        SpringApplication.run(StoreApplication.class, args);
    }

  36. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Fluentd
    ➤ Run (micro)services with fluentd log-driver
    ➤ Run fluentd on Docker
    store-web:
      # snip
      logging:
        driver: "fluentd"
        options:
          tag: "docker.services"

  37. TIPS AND PITFALLS

  38. TIPS #1 LOGS SHOULD BE JSON FORMATTED

  39. TIPS #1 / JSON FORMAT LOGS
    ➤ Logs should be parsed before storing them in Elasticsearch
    ➤ Log is like
    2016-02-26 11:15:47.561 INFO [service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- [nio-8081-exec-1] i.s.c.sleuth.docs.service1.Application : Hello from service1. Calling service2
    ➤ Grok pattern is like
    %{TIMESTAMP_ISO8601:timestamp}\s+%{LOGLEVEL:severity}\s+\[%{DATA:service},%{DATA:trace},%{DATA:span},%{DATA:exportable}\]\s+%{DATA:pid}\s+---\s+\[%{DATA:thread}\]\s+%{DATA:class}\s+:\s+%{GREEDYDATA:rest}
    ➤ Multi line logs are hell
    ➤ Like aggregation of stream events
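
    To make the parsing cost concrete, here is a rough Java sketch of what the Grok pattern above has to do. It uses a hand-written java.util.regex approximation of the pattern (not Grok itself), and the class name SleuthLogParser is hypothetical. The second call in main shows why multi-line logs hurt: a stack-trace continuation line does not match and yields no fields.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: parse one plain-text Sleuth-style log line into fields.
final class SleuthLogParser {
    // Field names mirror the names used in the Grok pattern on the slide.
    private static final String[] FIELDS = {
        "timestamp", "severity", "service", "trace", "span",
        "exportable", "pid", "thread", "class", "rest"
    };

    // Hand-written approximation of the Grok pattern, not an exact translation.
    private static final Pattern LINE = Pattern.compile(
        "(?<timestamp>\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3})\\s+"
        + "(?<severity>[A-Z]+)\\s+"
        + "\\[(?<service>[^,\\]]*),(?<trace>[^,\\]]*),(?<span>[^,\\]]*),(?<exportable>[^,\\]]*)\\]\\s+"
        + "(?<pid>\\d+)\\s+---\\s+"
        + "\\[(?<thread>[^\\]]*)\\]\\s+"
        + "(?<class>\\S+)\\s+:\\s+"
        + "(?<rest>.*)");

    /** Returns the parsed fields, or an empty map when the line does not match. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LINE.matcher(line);
        if (m.matches()) {
            for (String name : FIELDS) {
                fields.put(name, m.group(name));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "2016-02-26 11:15:47.561 INFO "
            + "[service1,2485ec27856c56f4,2485ec27856c56f4,true] 68058 --- "
            + "[nio-8081-exec-1] i.s.c.sleuth.docs.service1.Application : "
            + "Hello from service1. Calling service2";
        System.out.println(parse(line));
        // A continuation line of a stack trace carries none of the fields.
        System.out.println(parse("\tat com.example.Foo.bar(Foo.java:42)"));
    }
}
```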

  40. TIPS #1 / JSON FORMAT LOGS
    ➤ Format logs as JSON at log output time!
    ➤ Application Log
    ➤ logstash-logback-encoder
    UTC
    {
      "severity": "%level",
      "service": "${springAppName:-}",
      "type": "application",

  41. TIPS #1 / JSON FORMAT LOGS
    ➤ Format logs as JSON at log output time!
    ➤ Access Log
    ➤ Write configuration by hand
    server.tomcat.accesslog.pattern={"@timestamp":"%{yyyy-MM-dd'T'HH:mm:ss.SSSZ}t","service":"${spring.application.name}","type":"access","method":"%m","remote":"%a","path":"%U","query":"%q","duration":"%D","status":"%s","bytes":"%B","user-agent":"%{User-Agent}i","referer":"%{Referer}i","session-id":"%S"}
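
    As a rough illustration of why JSON output removes the parsing problem, the sketch below writes one log event as a single JSON line. In practice logstash-logback-encoder does this for you; the hand-rolled escaping, the class name JsonLogLine and the "message" field name are assumptions made only for this example.

```java
// Hypothetical sketch: emit one log event as one JSON object per line.
final class JsonLogLine {
    /** Escapes a string so it can be placed inside a JSON string literal. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    /** Builds a single-line JSON log entry; "message" is an assumed field name. */
    static String format(String severity, String service, String message) {
        return "{\"severity\":\"" + escape(severity) + "\","
            + "\"service\":\"" + escape(service) + "\","
            + "\"type\":\"application\","
            + "\"message\":\"" + escape(message) + "\"}";
    }

    public static void main(String[] args) {
        // A multi-line stack trace stays inside one JSON line.
        System.out.println(format("ERROR", "store-web",
            "boom\n\tat Foo.bar(Foo.java:1)"));
    }
}
```

    Because the newline and tab are escaped, a multi-line stack trace arrives in the data store as one event, with no Grok pattern and no multi-line aggregation.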

  42. TIPS #2 LOGS SHOULD BE DISTRIBUTED TRACED

  43. TIPS #2 / DISTRIBUTED TRACING
    ➤ In monolithic Java applications, application behaviors can be
    traced from application logs using “thread-id”.
    ➤ Logs of microservices are micro-partitioned
    ➤ No correlation ids for a single request!
    ➤ Distributed tracing helps us trace the behaviors for a request
    ➤ Trace-ID is like thread-id or request-id in the microservices
    ➤ It is passed via HTTP header or AMQP header etc.
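
    The idea of passing a Trace-ID along with each request can be sketched in plain Java as follows. This is not how Spring Cloud Sleuth or Elastic APM are implemented; it only shows the principle. The class and method names are hypothetical, headers are modelled as a simple map (real HTTP header names are case-insensitive), and the header name X-B3-TraceId is the B3 name that also appears as the MDC key in the Sleuth logback configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of trace ID propagation between services.
final class TracePropagation {
    static final String TRACE_HEADER = "X-B3-TraceId";

    /** Reuses the caller's trace ID when present, otherwise starts a new trace. */
    static String resolveTraceId(Map<String, String> incomingHeaders) {
        String traceId = incomingHeaders.get(TRACE_HEADER);
        if (traceId == null || traceId.isEmpty()) {
            // 16 lowercase hex characters, similar in shape to the IDs in the sample log.
            traceId = UUID.randomUUID().toString().replace("-", "").substring(0, 16);
        }
        return traceId;
    }

    /** Headers to attach to every downstream call made while handling the request. */
    static Map<String, String> outgoingHeaders(String traceId) {
        Map<String, String> headers = new HashMap<>();
        headers.put(TRACE_HEADER, traceId);
        return headers;
    }

    /** Every log line carries the trace ID, so logs can be joined across services. */
    static String logLine(String service, String traceId, String message) {
        return "[" + service + "," + traceId + "] " + message;
    }

    public static void main(String[] args) {
        // store-web receives a request without a trace ID and starts a trace.
        String traceId = resolveTraceId(new HashMap<>());
        System.out.println(logLine("store-web", traceId, "calling item-service"));

        // item-service receives the propagated header and logs the same ID.
        String downstream = resolveTraceId(outgoingHeaders(traceId));
        System.out.println(logLine("item-service", downstream, "handling request"));
    }
}
```

    Searching the log store for one trace ID then returns the lines from every service that took part in the request.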

  44. TIPS #2 / DISTRIBUTED TRACING
    ➤ Tools
    ➤ Spring Cloud Sleuth
    ➤ Depends on Spring DI, AOP
    ➤ Available only for Spring Boot applications
    ➤ Elastic APM
    ➤ Uses bytecode instrumentation (Byte Buddy)
    ➤ Available for all Java applications

  45. TAKE A LOOK AT THE LOGS

  46. TIPS #2 / DISTRIBUTED TRACING
    ➤ Spring Cloud Sleuth vs Elastic APM
    ➤ Spring Cloud Sleuth
    ➤ Supports most of Spring modules
    ➤ RestTemplate, WebFlux, Netty …
    ➤ Spring Cloud Stream, Spring RabbitMQ, Spring Kafka …
    ➤ Elastic APM
    ➤ Supports limited libraries
    ➤ Servlet, JAX-RS, Apache HttpClient, JDBC …
    ➤ Spring MVC, RestTemplate …
    ➤ Does NOT support WebFlux, RabbitMQ, Kafka …

  47. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Spring Cloud Sleuth configuration
    ➤ Add dependency to the pom.xml of Spring Boot application
    <dependency>
        <groupId>org.springframework.cloud</groupId>
        <artifactId>spring-cloud-starter-sleuth</artifactId>
    </dependency>

  48. MONITORING ENVIRONMENT / AGENT CONFIGURATIONS
    ➤ Spring Cloud Sleuth configuration
    ➤ Modify logback-spring.xml if you customized the log output format
    {
      "severity": "%level",
      "service": "${springAppName:-}",
      "type": "application",
      "trace": "%X{X-B3-TraceId:-}",
      "span": "%X{X-B3-SpanId:-}",
      "parent": "%X{X-B3-ParentSpanId:-}",
      "exportable": "%X{X-Span-Export:-}",

  49. From 17:00 at Room 401
    How to Properly Blame Things for Causing Latency: An Introduction to Distributed Tracing and Zipkin
    By Adrian Cole

  50. PITFALL #1 ALERTING CAN ONLY DETECT PREDICTABLE PROBLEMS

  51. PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS
    ➤ Threshold-based alerting only detects predictable problems
    ➤ If we only collect CPU usage and memory usage, we cannot
    notice a full disk.
    ➤ If we only collect disk volume usage, we cannot notice a
    partition running out of space.
    ➤ If we only collect disk size, we cannot notice inode
    exhaustion.
    ➤ Not all problems can be predicted from the beginning.
    ➤ If an unknown problem occurs, it should be fed back into the
    monitoring, because the same problem may occur again in the near future
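
    A minimal sketch of this pitfall, assuming hypothetical metric names such as "cpu.pct" and "disk.pct" and made-up thresholds: a threshold rule can only fire for a metric that is actually being collected, so a rule without data stays silent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of threshold-based alerting over collected metrics.
final class ThresholdAlerts {
    /** Returns one message per collected metric that exceeds its threshold. */
    static List<String> evaluate(Map<String, Double> collected, Map<String, Double> thresholds) {
        List<String> alerts = new ArrayList<>();
        // TreeMap gives a stable, alphabetical rule order.
        for (Map.Entry<String, Double> rule : new TreeMap<>(thresholds).entrySet()) {
            Double value = collected.get(rule.getKey());
            if (value == null) {
                continue; // metric is not collected, so this rule can never fire
            }
            if (value > rule.getValue()) {
                alerts.add(rule.getKey() + "=" + value + " exceeds " + rule.getValue());
            }
        }
        return alerts;
    }

    public static void main(String[] args) {
        Map<String, Double> thresholds =
            Map.of("cpu.pct", 90.0, "memory.pct", 90.0, "disk.pct", 90.0);

        // Only CPU and memory are collected: a full disk goes unnoticed.
        System.out.println(evaluate(Map.of("cpu.pct", 35.0, "memory.pct", 60.0), thresholds));

        // After feeding the incident back into monitoring, disk usage is collected too.
        System.out.println(evaluate(
            Map.of("cpu.pct", 35.0, "memory.pct", 60.0, "disk.pct", 97.0), thresholds));
    }
}
```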

  52. PITFALL #1 / ALERTS ARE ONLY FOR PREDICTABLE PROBLEMS
    ➤ To discover unknown problems …
    ➤ Filter out all “known” logs on Kibana
    ➤ Only unknown logs are displayed
    ➤ Unknown logs may indicate unknown problems
    ➤ This creates more tasks, and makes the system more stable
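
    The same filtering idea, sketched outside Kibana: keep a list of patterns for messages that are already understood and show only what matches none of them. The class name, the patterns and the sample messages are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: hide "known" log messages so only unknown ones remain.
final class UnknownLogFilter {
    /** Returns the messages that match none of the known patterns. */
    static List<String> unknownOnly(List<String> messages, List<Pattern> known) {
        return messages.stream()
            .filter(m -> known.stream().noneMatch(p -> p.matcher(m).find()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Pattern> known = List.of(
            Pattern.compile("^Started \\w+ in "),
            Pattern.compile("Connection reset by peer"));

        List<String> messages = List.of(
            "Started StoreApplication in 4.2 seconds",
            "Connection reset by peer",
            "Too many open files");

        // Only the message nobody has classified yet is left for investigation.
        System.out.println(unknownOnly(messages, known));
    }
}
```

    Each time an unknown message is investigated, its pattern moves to the known list (or becomes an alert rule), so the remaining view keeps shrinking.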

  53. PITFALL #2 LIVING DEAD PROCESS

  54. PITFALL #2 / LIVING DEAD PROCESS
    ➤ A (micro)service seems healthy but is actually down
    ➤ Returns 200 OK to the health check request
    ➤ Metrics are normal
    ➤ But returns no response at all
    ➤ The last update time of each service’s log should be monitored
    ➤ A living application produces some logs
    ➤ If there is no log, it may be dead
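
    One way to sketch the "last log time" check, assuming the latest log timestamp per service has already been fetched from the log store. The class name, the five-minute limit and the timestamps are illustrative assumptions.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: flag services whose logs have gone silent.
final class SilentServiceDetector {
    /** Returns the services whose newest log entry is older than maxSilence. */
    static List<String> silentServices(Map<String, Instant> lastLogTime,
                                       Instant now, Duration maxSilence) {
        List<String> silent = new ArrayList<>();
        // TreeMap keeps the result in a stable, alphabetical order.
        for (Map.Entry<String, Instant> e : new TreeMap<>(lastLogTime).entrySet()) {
            Duration silence = Duration.between(e.getValue(), now);
            if (silence.compareTo(maxSilence) > 0) {
                silent.add(e.getKey());
            }
        }
        return silent;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2019-10-04T09:00:00Z");
        Map<String, Instant> lastLogTime = Map.of(
            "cart-service", Instant.parse("2019-10-04T08:59:30Z"),
            "order-service", Instant.parse("2019-10-04T08:40:00Z"));

        // order-service has logged nothing for 20 minutes and is suspicious.
        System.out.println(silentServices(lastLogTime, now, Duration.ofMinutes(5)));
    }
}
```

    A sensible limit depends on how chatty each service normally is, so the threshold would probably need tuning per service.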

  55. IMPROVE MONITORING

  56. KPI OF MONITORING ITSELF

  57. MONITORING AND ALERTING / METRICS
    ➤ Monitoring KPIs
    (1) The mean time from when a problem occurs to when it is detected
    (2) # of problems detected by alerting / Total # of problems
    (3) # of monitoring improvements / Total # of problems
    (4) (2) + (3) / Total # of problems
    ➤ We can tell whether the monitoring environment has improved or not
    ➤ Monitoring improvement leads to more stable system operation
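
    KPIs (1) to (3) are simple enough to compute from an incident list. The sketch below assumes each incident has a recorded detection delay and that the counts are tracked elsewhere; the class and method names are hypothetical. KPI (4) is left out because it is a combination of (2) and (3).

```java
import java.time.Duration;
import java.util.List;

// Hypothetical sketch of the monitoring KPIs (1), (2) and (3).
final class MonitoringKpi {
    /** KPI (1): mean time from occurrence to detection; zero when there are no incidents. */
    static Duration meanTimeToDetect(List<Duration> detectionDelays) {
        if (detectionDelays.isEmpty()) {
            return Duration.ZERO;
        }
        Duration total = Duration.ZERO;
        for (Duration delay : detectionDelays) {
            total = total.plus(delay);
        }
        return total.dividedBy(detectionDelays.size());
    }

    /** KPIs (2) and (3): a count divided by the total number of problems. */
    static double ratio(long count, long totalProblems) {
        return totalProblems == 0 ? 0.0 : (double) count / totalProblems;
    }

    public static void main(String[] args) {
        // Made-up numbers: 4 problems, 3 caught by alerts, 2 led to monitoring improvements.
        List<Duration> delays = List.of(
            Duration.ofMinutes(2), Duration.ofMinutes(10),
            Duration.ofMinutes(3), Duration.ofMinutes(45));
        System.out.println("(1) " + meanTimeToDetect(delays));
        System.out.println("(2) " + ratio(3, 4));
        System.out.println("(3) " + ratio(2, 4));
    }
}
```

    Tracking these values per month, rather than once, is what shows whether the monitoring environment is getting better.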

  58. ENJOY MONITORING YOUR SYSTEM!
    ➤ Demo application and monitoring environment
    ➤ https://github.com/cero-t/spring-store-2019/
    ➤ My Twitter (@cero_t)
    ➤ https://twitter.com/cero_t
