Monitoring in the time of Cloud Native

Cindy Sridharan

October 04, 2017
Transcript

  1. Monitoring in the time of Cloud Native, Velocity 2017, New York, NY

  2. @copyconstruct

  3. Cloud Native

  4. containers, Kubernetes, service meshes, microservices, immutable infrastructure, ...

  8. ☹ ☹ ☹

  10. an embarrassment of riches!

  11. Decision Making in the time of Cloud Native

  12. It’s tempting, especially when enamored by a new piece of technology that promises the moon, to retrofit our problem space with the solution space of said technology, however minimal or non-existent the intersection.

  13. Goal of Talk

  14. A field guide for evaluation

  15. o strengths and weaknesses of each category of tools
    o problems they solve
    o tradeoffs they make
    o ease of adoption/integration into an existing infrastructure

  16. What to “monitor” and how in a cloud native environment?
  17. Monitoring in The time of Cloud native

  20. monitoring

  26. As we adopt increasingly complex architectures, the number of “things that can go wrong” increases exponentially

  27. era of embracing failure

  28. era of complexity

  29. how do we design monitoring for such systems? how do we design these systems themselves?

  30. The goal of “monitoring” hasn’t changed, even if the scope has shrunk. The challenge now lies in identifying and minimizing the bits of “monitoring” that still remain human centric.

  31. infrastructure management is becoming more automated; application lifecycle management is becoming harder

  32. Observability is about being able to understand how a system is behaving in production

  33. Monitoring is being on the lookout for failures, which in turn requires us to be able to predict these failures proactively
  34. interlude

  35. blackbox monitoring

  37. “it’s so nice being in an org that communicates quantitatively about systems”

  38. whitebox monitoring

  41. Data are simply facts or figures: bits of information, but not information itself. When data are processed, interpreted, organized, structured or presented so as to make them meaningful or useful, they are called information. Information provides context for data.

  43. purpose driven, not origin driven

  56. The Three Pillars of Observability

  58. logs

  61. both traces and metrics are an abstraction built on top of logs that pre-process and encode information along two orthogonal axes, one being request centric, the other being system centric

  62. Traces

  64. Instrument specific points in your application, proxy, framework, library, middleware and anything else that might lie in the path of execution of a request
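
To make "instrument specific points" concrete, here is a minimal sketch. It assumes a hypothetical in-process `span` helper and an in-memory `FINISHED_SPANS` list; neither comes from the talk or from any real tracing library, and a real tracer would ship spans to a collector instead.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink, used only for illustration.
FINISHED_SPANS = []


@contextmanager
def span(name, trace_id, parent_id=None):
    """Record how long one instrumented point took within a request."""
    record = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        # Runs even if the wrapped code raises, so failed work is still recorded.
        record["duration_s"] = time.perf_counter() - start
        FINISHED_SPANS.append(record)


def handle_request():
    """Hypothetical request handler with two instrumented points."""
    trace_id = uuid.uuid4().hex
    with span("http.request", trace_id) as root:
        with span("db.query", trace_id, parent_id=root["span_id"]):
            time.sleep(0.001)  # stand-in for real work
    return trace_id
```

Each instrumented point shares the request's `trace_id` and names its parent, which is what lets the spans be reassembled into one request's path later.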
  68. metrics

  69. “a set of numbers that give information about a particular process or activity”

  70. “a list of numbers relating to a particular activity, which is recorded at regular periods of time and then studied. Time series are typically used to study, for example, sales, orders, income, etc.”

  74. evaluation

  75. logs

  80. +1 easy to instrument and generate
    +1 provides rich local context
    -1 performance of logging libraries
    -1 no guaranteed delivery
    -1 application performance
  81. “A fun thing I had seen while at [redacted] was that turning off most logging almost doubled performance on the instances we were running on because logs ate through AWS’ EC2 classic’s packet allocations like mad. It was interesting for us to discover that more than 50% of our performance would be lost to trying to control and monitor performance”
  82. -1 no dynamic sampling
  87. -1 buffering might be required
    -1 quotas/rate limits
    -1 “actionable data”
    -1 ELK
    -1 $$$$
  88. metrics

  95. +1 metrics transfer and storage have a constant overhead
    +1 cheap
    +1 statistical & probabilistic analysis
    +1 alerting
    -1 system scoped
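
The "constant overhead" point can be illustrated with a small sketch. It assumes a hypothetical `RequestCounter` that aggregates in process and exports only totals; the route and status values are made up for the example. The exported payload stays the same size as traffic grows, while a log-per-request approach grows linearly. The overhead is constant only with respect to traffic: it still grows with the number of distinct label combinations.

```python
from collections import Counter


class RequestCounter:
    """Aggregates events in process; only the totals are exported."""

    def __init__(self):
        self._counts = Counter()

    def observe(self, route, status):
        self._counts[(route, status)] += 1

    def export(self):
        # One sample per distinct (route, status) pair,
        # however many requests were observed.
        return [
            {"route": route, "status": status, "count": count}
            for (route, status), count in sorted(self._counts.items())
        ]


def simulate(n_requests):
    """Hypothetical traffic: every tenth request fails."""
    counter, log_lines = RequestCounter(), []
    for i in range(n_requests):
        status = 500 if i % 10 == 0 else 200
        counter.observe("/checkout", status)
        log_lines.append(f"GET /checkout {status}")  # one log line per request
    return counter.export(), log_lines
```

Calling `simulate(10)` and `simulate(1000)` both yield two metric samples, but 10 and 1000 log lines respectively.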
  97. traces

  99. +1 captures the lifetime of requests as they flow through the various components of a distributed system
    -1 hard to instrument
  100. “We’ve been implementing a request tracing service for over a year and it’s not complete yet. The challenge with these types of tools is that we need to add code around each span to truly understand what’s happening during the lifetime of our requests. The frustrating part is that if the code is not instrumented or the header is not carrying the ID, that code becomes a risky blind spot for operations”
  102. -1 depends on how causality is tracked
    -1 request scoped
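
The quote's point about a header "not carrying the id" can be sketched as follows. This assumes a hypothetical header name, `X-Trace-Id`, and plain dictionaries for headers; it is not the API of any particular tracing system, and real HTTP headers are case-insensitive, which this sketch ignores.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for illustration


def ensure_trace_id(headers):
    """Reuse the caller's trace id, or start a new trace if none was sent."""
    return headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(incoming_headers, extra=None):
    """Build headers for a downstream call that carry the trace id forward."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = ensure_trace_id(incoming_headers)
    return headers
```

Any hop that builds its downstream headers without a step like `outgoing_headers` breaks the chain, which is the blind spot the quote describes.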
  103. Best practices

  104. Logs

  107. o Quotas
    o Dynamic Sampling
    o Logging is a Stream Processing Problem
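
One way to read "dynamic sampling" is that the sampling rate depends on the kind of event. The sketch below assumes a hypothetical `DynamicSampler` with per-key rates (the `"ok"` and `"error"` keys are invented for the example). It samples deterministically, keeping every Nth event, so the behaviour is reproducible; production samplers are more often randomized.

```python
import itertools


class DynamicSampler:
    """Keep 1 in N events, where N depends on the event's key."""

    def __init__(self, rates, default_rate=1):
        self._rates = rates            # e.g. {"ok": 100, "error": 1}
        self._default = default_rate
        self._seen = {}                # per-key counters of events seen so far

    def sample(self, key, event):
        rate = self._rates.get(key, self._default)
        counter = self._seen.setdefault(key, itertools.count())
        if next(counter) % rate != 0:
            return None                # dropped
        # Record the rate so consumers can re-weight the kept events.
        return {**event, "sample_rate": rate}
```

With rates `{"ok": 100, "error": 1}`, every error is kept and one in a hundred successes is kept; summing `sample_rate` over the kept events recovers an estimate of the original volume.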
  108. Filter to outlier countries from where users viewed this article fewer than 100 times in total

  109. Filter to outlier page loads that performed more than 100 database queries. Or, show me only page loads from Indonesia that took more than 10 seconds to load

  110. Enriched events: business event + timer/counter/histogram
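
The queries on the preceding slides become simple filters once each event carries its own context. The sketch below assumes a hypothetical event shape (`country`, `duration_s`, `db_queries`) and made-up sample data; the thresholds are parameters, so the slide's "more than 100 queries" or "more than 10 seconds" are just particular arguments.

```python
from collections import Counter

# Hypothetical enriched events: a business event plus its timer and counter values.
EVENTS = [
    {"event": "page_load", "country": "ID", "duration_s": 12.4, "db_queries": 37},
    {"event": "page_load", "country": "ID", "duration_s": 1.2, "db_queries": 140},
    {"event": "page_load", "country": "US", "duration_s": 0.8, "db_queries": 12},
    {"event": "page_load", "country": "US", "duration_s": 0.9, "db_queries": 9},
    {"event": "page_load", "country": "IS", "duration_s": 2.1, "db_queries": 15},
]


def slow_loads(events, country, min_seconds):
    """Page loads from one country that took longer than min_seconds."""
    return [
        e for e in events
        if e["country"] == country and e["duration_s"] > min_seconds
    ]


def query_heavy_loads(events, min_queries):
    """Page loads that performed more than min_queries database queries."""
    return [e for e in events if e["db_queries"] > min_queries]


def rare_countries(events, max_views):
    """Countries with fewer than max_views events in total."""
    views = Counter(e["country"] for e in events)
    return sorted(c for c, n in views.items() if n < max_views)
```

For example, `slow_loads(EVENTS, "ID", 10)` answers the Indonesia question and `query_heavy_loads(EVENTS, 100)` the database-query one, without either question having been anticipated when the events were emitted.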

  113. A new hope for the future: OpenLogging/OpenEvent

  114. metrics

  115. “Prometheus is much more than just the server. I see Prometheus as a set of standards and projects, with the server being just one part of a much greater whole”

  119. traces

  121. conclusion

  132. Choose your own Observability Adventure!

  134. Thank you @copyconstruct