Necessary tooling and monitoring for performance critical applications - RootConf 2017

Manan
May 11, 2017

Presented at RootConf on May 11, 2017

Transcript

  1. NECESSARY TOOLING AND MONITORING FOR
    PERFORMANCE CRITICAL APPLICATIONS
    @mananbharara

  3. OTTO
    http://www.forbes.com/sites/adamtanner/2014/03/05/amazons-war-on-germanys-18-billion-patriarch/#7989b6eb4162

  8. OTTO
    ▸ 1 million visitors / day
    ▸ ~ 2 orders / second
    30 customers
    × 8 screens
    × 2 orders = 480 PIs (page impressions) / second
    ~ 1000 PIs / second
    http://dev.otto.de/2016/03/20/why-microservices/#more-2320
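
The 480 figure on this slide is plain multiplication. A minimal sketch in Python, assuming "PI" means page impression and reading the three factors as roughly 30 browsing customers per order, 8 screens viewed per customer, and 2 orders per second. That reading and the variable names are assumptions, not something the slide states:

```python
# Hypothetical names; the numbers are the figures shown on the slide.
customers_per_order = 30   # assumed reading of "30 customers"
screens_per_customer = 8   # assumed reading of "8 screens"
orders_per_second = 2      # "~ 2 orders / second"

page_impressions_per_second = (
    customers_per_order * screens_per_customer * orders_per_second
)
print(page_impressions_per_second)  # 480
```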

  16. WHY?
    WHAT MAKES THIS DIFFERENT
    ▸ We’re talking money…
    ▸ Building more features
    ▸ Decision making
    ▸ Verify business assumptions
    => Adding more value => more money
    300 billion
    3 trillion

  25. REQUIREMENTS
    I NEED…
    ▸ What every other monitoring tool does
    - Database monitoring
    - Monitoring standard server metrics - Req/Sec, Response time, Throughput…
    - Alerting upon exceptions
    ▸ Measuring the state of the system
    - Orders/sec, Users on website
    ▸ Narrow down sources of bottlenecks
    ▸ Validate business assumptions

  27. LOGGING
    LOGGING IS ABOUT SCATTERED INCIDENTS IN THE SYSTEM

  29. CAPTURE
    METRICS HELP IN UNDERSTANDING THE STATE OF THE SYSTEM AND
    GAIN VALUABLE INSIGHTS… ANYTIME

  30. WHY?
    Collecting data is cheap, but not having it when you
    need it can be expensive, so you should instrument
    everything, and collect all the useful data you
    reasonably can…
    https://www.datadoghq.com

  31. METRICS
    Measure the behavior of critical components in your
    production environment…
    CAPTURE
    http://metrics.dropwizard.io/3.1.0/

  43. CAPTURE
    ▸ Counters
    - HTTP/any connection pools
    ▸ Gauges
    - Used disk space
    ▸ Meters
    - Requests per second
    - Number of personalized users
    - Number of cached products
    - Orders per second
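
The three metric types listed above can be sketched in a few lines of plain Python. This is an illustration of the concepts only, not the Dropwizard Metrics API; the class names, method names, and the injectable `clock` parameter are all hypothetical:

```python
import shutil
import time


class Counter:
    """A value that goes up and down, e.g. connections checked out of a pool."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n

    def dec(self, n=1):
        self.count -= n


class Gauge:
    """An instantaneous reading, obtained by calling a function on demand."""

    def __init__(self, read):
        self._read = read

    def value(self):
        return self._read()


class Meter:
    """Counts events and reports the mean rate per second since creation."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self._clock() - self._start
        return self.count / elapsed if elapsed > 0 else 0.0


# Counter: two connections acquired, one released.
pool = Counter()
pool.inc()
pool.inc()
pool.dec()

# Gauge: used disk space in bytes, read only when asked.
disk = Gauge(lambda: shutil.disk_usage("/").used)

# Meter: a fake clock makes the rate deterministic for this example.
now = [0.0]
orders = Meter(clock=lambda: now[0])
orders.mark(20)
now[0] = 10.0
print(orders.mean_rate())  # 2.0
```

A real meter would usually also report moving averages over recent windows; the mean rate since start is the simplest form and is enough to show the idea.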

  51. CAPTURE
    ▸ Histograms
    - Mean size of response bodies
    - Percentile of customers getting 5 personal recommendations or
    more
    ▸ Timers
    - Page rendering time
    - It took 80ms to give recommendations to 99% customers for 300
    req/sec
    - But took 200ms at 800 req/sec
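
Statements such as "80ms for 99% of customers" come from percentiles over recorded values. A minimal sketch of a histogram and a timer in plain Python, assuming a nearest-rank percentile and keeping every sample in memory (a production library would sample instead). All names and the latency values are hypothetical:

```python
import math
import time
from contextlib import contextmanager


class Histogram:
    """Stores every observed value; acceptable for a sketch only."""

    def __init__(self):
        self.values = []

    def update(self, value):
        self.values.append(value)

    def mean(self):
        return sum(self.values) / len(self.values)

    def percentile(self, p):
        """Nearest-rank percentile for 0 < p <= 100."""
        ordered = sorted(self.values)
        rank = math.ceil(p * len(ordered) / 100)
        return ordered[rank - 1]


class Timer:
    """Records durations in milliseconds into a histogram."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.durations_ms = Histogram()

    @contextmanager
    def time(self):
        start = self._clock()
        try:
            yield
        finally:
            self.durations_ms.update((self._clock() - start) * 1000)


# Hypothetical recommendation latencies in ms.
latencies = Histogram()
for ms in [40, 45, 50, 55, 60, 65, 70, 75, 80, 500]:
    latencies.update(ms)
print(latencies.percentile(90))  # 80
print(latencies.percentile(99))  # 500

# Timing a block of code; the loop stands in for page rendering.
render = Timer()
with render.time():
    sum(range(1000))
```

Recording the same timer at different request rates is what lets the two statements on the slide (80ms at 300 req/sec, 200ms at 800 req/sec) be compared.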

  53. AGGREGATE
    MAKE SENSE OF DATA

  58. AGGREGATE
    ▸ Query DSL
    ▸ Applying functions
    ▸ Scalable
    ▸ Up to date
    [Diagram labels: API, METRIC, STORAGE, CACHE]
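
"Applying functions" in an aggregation layer usually means grouping stored points into time windows and reducing each window with a function. A minimal sketch in Python, assuming points are `(timestamp_seconds, value)` pairs; the `aggregate` helper and the sample data are hypothetical and not tied to any particular metric store:

```python
from collections import defaultdict
from statistics import mean


def aggregate(points, bucket_seconds, fn):
    """Group (timestamp, value) points into fixed windows and apply fn to each."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        window_start = timestamp - timestamp % bucket_seconds
        buckets[window_start].append(value)
    return {start: fn(values) for start, values in sorted(buckets.items())}


# Hypothetical response times in ms, keyed by second offsets.
points = [(0, 80), (5, 90), (12, 200), (18, 220), (21, 85)]

per_window_mean = aggregate(points, 10, mean)
per_window_max = aggregate(points, 10, max)
print(per_window_max)  # {0: 90, 10: 220, 20: 85}
```

A query DSL in a real store expresses the same thing declaratively: choose a series, a window size, and a function.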

  59. VISUALIZE
    - https://docs.influxdata.com/influxdb/v1.2/concepts/storage_engine/

  64. VISUALIZE
    OBSERVE VARIATIONS
    ‣ Drill down
    ‣ Interactive
    ‣ Query language and editor
    ‣ Correlate

  69. VISUALIZE
    OSCILLATOR
    DEMO

  70. VISUALIZE
    ANNOTATIONS

  79. LEARNINGS
    ▸ Choose the right metrics - Disk utilization, User logins
    ▸ Mean can be absurd - Choose wisely
    ▸ Choose the right tool for the job: histograms, counters, meters, timers.
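
The point about the mean is easy to see with a single outlier. A small sketch in Python with made-up response times (nine fast requests and one very slow one):

```python
from statistics import mean, median

# Hypothetical response times in ms.
response_times_ms = [20, 21, 22, 22, 23, 24, 25, 25, 26, 2000]

print(mean(response_times_ms))    # 220.8
print(median(response_times_ms))  # 23.5
print(max(response_times_ms))     # 2000
```

No request in this sample took anywhere near 220 ms, so the mean describes none of them. The median together with a high percentile or the maximum gives a more honest picture.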

  88. ALERT
    ALERTING
    ▸ Of extreme importance
    ▸ Definition of done
    ▸ Make it visible

  93. SUMMING IT UP
    THE COMPLETE SOLUTION
    CAPTURE AGGREGATE VISUALIZE ALERT

  94. CONTINUOUS DELIVERY
    CONTINUOUS DELIVERY & MONITORING

  99. CURRENT SETUP
    SOFTWARE THAT CLAIMS TO MONITOR EVERYTHING FOR YOU
    ▸ Hooks into individual function calls
    ▸ Could be too expensive
    ▸ A great monitoring and alerting setup is possible for free

  100. THANKS!
    @mananbharara
    READ MORE
    ▸ Metrics, Metrics, Everywhere - https://www.youtube.com/watch?v=czes-oa0yik
    ▸ Oscillator - https://github.com/otto-de/oscillator
    ▸ Xray - https://github.com/otto-de/tesla-xray
    ▸ OTTO Dev blog - http://dev.otto.de/
