Cloud Monitoring - AWS Pop-up Loft Stockholm Nov 2 2018

Slide 1

Cloud Monitoring in the era of Microservices and Distributed Architectures
AWS Pop-up Loft Stockholm
Gunnar Grosch @gunnargrosch

Slide 2

ABOUT ME
• Cloud Evangelist at Opsio (www.opsio.se).
• Have been creating chaos with computers since the '80s.
• Background in development and operations.
• My own three chaos monkeys at home.
• Skateboarder at heart.
• Serverless aficionado.

Slide 3

ABOUT OPSIO
• Opsio specializes in the operation and support of public, private and hybrid cloud platforms.
• Delivering everything from magical customer support to fully managed cloud services.
• Helping customers find the right level of monitoring and observability.
STRATEGY | ARCHITECTURE | MIGRATION | CLOUDOPS | DEVOPS | 24/7 SUPPORT | OPTIMIZATION | MANAGED SERVICES

Slide 4

Goal of this session

Slide 5

GOAL OF THIS SESSION
• Look at why monitoring is complex.
• See the difference between monitoring and observability.
• Explore the three pillars of observability.
• Touch on the whats and whys.
• Leave you with more questions than answers.

Slide 6

Where are we at?

Slide 7

WHERE ARE WE AT?

Slide 8

WHERE ARE WE AT?
• The infrastructure space is in the midst of a change.
• The way we build and operate systems has evolved.
• Containers, Kubernetes, microservices, serverless, and service meshes change the way we operate software.
• The systems we build have become more distributed and more ephemeral.

Slide 9

WHERE ARE WE AT?
• The tools will only keep getting better with time.
• We have a responsibility to ensure our application is good enough.
• No service is going to fix our software for us.
• Most failures will come from the application layer, or from the interactions between different applications or services.

Slide 10

WHERE ARE WE AT?
• We can focus on the performance and business logic of our application.
• We can focus on making our application more robust.
• Having visibility is more important than ever before to understand, operate, maintain and evolve applications.

Slide 11

WHERE ARE WE AT?
• Many tools are just as bleeding edge as the infrastructure.
• There has been a surge in both open source and SaaS tooling.

Slide 12

Decision making

Slide 13

DECISION MAKING
Oh, we have a new problem.

Slide 14

Slide 15

DECISION MAKING
• How do we choose the best tool for our needs?
• How do we tell the difference between these tools?
• Should we stop monitoring?
• Should we only focus on observability?
• What’s the difference between logs and metrics?
• Do we really need all three pillars?

Slide 16

DECISION MAKING
• Don’t just replace your problem with a tool.
• Evaluate the value it can provide for your challenges.
• The tool should solve your problems.
• We want to evolve instead of starting over.

Slide 17

DECISION MAKING
• What are the strengths of the tool?
• What are the weaknesses of the tool?
• What problems does it solve?
• What tradeoffs does it make?
• How easy is it to adopt and integrate into an existing infrastructure?

Slide 18

What to monitor and how?

Slide 19

WHAT TO MONITOR AND HOW?
• Monitoring and observability are not the same.
• Everything is changing. Almost.
• To follow the change, we need visibility.
Diagram: Better visibility · Better understanding · Better systems · Better instrumentation

Slide 20

WHAT TO MONITOR AND HOW?
Monitoring is being on the lookout for failures.

Slide 21

WHAT TO MONITOR AND HOW?
• We develop, test, deploy and monitor.
• We monitor, waiting for something to happen.
• We monitor because we expect failure.
• A failure-centric approach to monitoring becomes a problem when the number of failure modes increases.
Diagram: Dev → Test → QA → Prod → Monitor

Slide 22

WHAT TO MONITOR AND HOW?
• More complex architectures have more possible failures.
Chart: Possible failures vs. Complexity

Slide 23

WHAT TO MONITOR AND HOW?
• Embracing failure means we need to design services to behave gracefully when they fail.
• Use resilience techniques such as retries, timeouts and rate limiting so that services degrade gracefully.
• This increases overall complexity.
• Monitoring is human-centric.
• Automation and self-healing platforms make it less human-centric.
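Retries and timeouts are easier to reason about with a concrete shape. Below is a minimal Python sketch, assuming a hypothetical helper `call_with_retries` that retries a failing call with exponential backoff and full jitter, and gives up once an overall time budget would be exceeded. The parameter values are illustrative assumptions, not recommendations, and the budget does not interrupt a call already in progress; a real per-call timeout would be set on the client itself (for example an HTTP client's timeout setting).

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    budget: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying failures with jittered exponential backoff inside a time budget."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    start = time.monotonic()
    attempt = 1
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= max_attempts:
                raise  # out of attempts: let the caller see the failure
            # The backoff ceiling doubles each attempt (0.1 s, 0.2 s, ...); full jitter
            # spreads retries out so many clients don't hit a struggling service in sync.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            if time.monotonic() - start + delay > budget:
                raise  # retrying would exceed the time budget: fail fast instead
            sleep(delay)
            attempt += 1
```

Every retry path is itself a new behavior to observe, which is part of the added complexity the slide mentions.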

Slide 24

WHAT TO MONITOR AND HOW?
• So how do we design monitoring for these systems?
• The key is still the design of the systems themselves.
• Monitoring should remain failure-centric and human-centric.
• Systems will continue to get more complex.
• Monitoring every individual failure becomes unnecessary.

Slide 25

WHAT TO MONITOR AND HOW?
Observability is understanding how a system behaves.

Slide 26

WHAT TO MONITOR AND HOW?
• Observability provides insights into the behavior of systems, with context.
• It helps us find answers to questions not yet formulated.
• Whitebox monitoring gives us data.
• When data is processed, interpreted, organized and structured, it becomes information.
• Observability gives us information.
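To make the data-versus-information distinction concrete, here is a small Python sketch that turns raw access-log lines (data) into per-endpoint error rates (information). The log format, the sample lines and the `summarize` function are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical raw access-log lines: "<method> <path> <status> <latency_ms>"
RAW = [
    "GET /cart 200 12",
    "GET /cart 500 340",
    "POST /checkout 200 85",
    "POST /checkout 502 1200",
    "POST /checkout 200 90",
]


def summarize(lines: List[str]) -> Dict[str, float]:
    """Turn raw events (data) into per-endpoint server error rates (information)."""
    total: Counter = Counter()
    errors: Counter = Counter()
    for line in lines:
        method, path, status, _latency = line.split()
        key = f"{method} {path}"
        total[key] += 1
        if int(status) >= 500:  # count server-side errors only
            errors[key] += 1
    return {key: errors[key] / total[key] for key in total}


print(summarize(RAW))
```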

Slide 27

WHAT TO MONITOR AND HOW?
• Observability is our ability to easily find information in data when we need it.
• That information can be derived on the fly or pre-processed.

Slide 28

WHAT TO MONITOR AND HOW?
Diagram: Monitoring and Observability, covering Alerting, Overview, Debugging, Profiling and Dependencies

Slide 29

WHAT TO MONITOR AND HOW?
• Observability isn’t about data collection alone.
• Raw data creates overhead.
• How data is gathered, processed and stored is a key consideration.
• Having data and information doesn’t solve problems by itself.
• Intuition and knowledge give better observability.

Slide 30

WHAT TO MONITOR AND HOW?
Observability is more about what we do with the data than where it comes from.

Slide 31

WHAT TO MONITOR AND HOW?
We can use the information to:
• Know the overall health of a system.
• Alert based on symptoms.
• Debug failures.
• Better understand our system.
• Understand dependencies.
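Alerting on symptoms means paging on what users actually experience, such as failed requests, rather than on possible causes such as high CPU. A minimal sketch, assuming a hypothetical `should_alert` check over a window of request counts, with an illustrative 1% error threshold and a minimum-traffic guard:

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one evaluation window (e.g. the last 5 minutes)."""
    requests: int
    errors: int


def should_alert(window: Window, max_error_ratio: float = 0.01, min_requests: int = 100) -> bool:
    """Symptom-based check: alert when users see too many errors, not on causes like CPU."""
    if window.requests < min_requests:
        return False  # too little traffic for the ratio to be meaningful
    return window.errors / window.requests > max_error_ratio


print(should_alert(Window(requests=1000, errors=20)))
```

The same symptom check covers many underlying failure modes at once, which keeps the set of alerts small.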

Slide 32

The three pillars of observability

Slide 33

THE THREE PILLARS OF OBSERVABILITY
• Logs
• Traces
• Metrics

Slide 34

THE THREE PILLARS OF OBSERVABILITY

Slide 35

The three pillars of observability: Logs

Slide 36

THE THREE PILLARS OF OBSERVABILITY
• Logs are timestamped records of discrete events that happened over time.
• Logs excel at providing valuable insight along with ample context.
• Incredibly useful when we are searching at a very fine level of granularity.
• The data is rich, but without further processing it can be overwhelming.
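Logs are far easier to search and aggregate when each event is structured and carries its context with it. A minimal sketch using Python's standard `logging` module, where the `JsonFormatter` class, the `ctx` key and the field names are hypothetical choices for this example:

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object so it can be searched and aggregated."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        # Merge structured context passed via extra={"ctx": {...}} (hypothetical convention).
        event.update(getattr(record, "ctx", {}))
        return json.dumps(event)


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One event with enough context to answer questions later (which order, which user).
logger.info("payment failed", extra={"ctx": {"order_id": "o-123", "user_id": "u-42", "status": 502}})
```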

Slide 37

THE THREE PILLARS OF OBSERVABILITY
• What we’re looking for is information.
• The number of data points we can collect is endless.

Slide 38

The three pillars of observability: Traces

Slide 39

THE THREE PILLARS OF OBSERVABILITY
• A trace represents a series of events in an end-to-end request.
• Provides visibility into both the path and the structure of a request.
• Complex applications can benefit from tracing.
• Specific points in the execution of a request are represented.
• Instrumentation is added in code, and the trace context is propagated along the request path.
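Propagation is the core of tracing: every service continues the same trace and records which span called it. The sketch below is a simplified, hypothetical in-process model; the header names `x-trace-id` and `x-parent-span-id` are made up for illustration, whereas real systems use a standard such as W3C Trace Context (`traceparent`) or AWS X-Ray's `X-Amzn-Trace-Id` header.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None


FINISHED: List[Span] = []  # stand-in for a tracing backend


def start_span(name: str, headers: Optional[Dict[str, str]] = None) -> Span:
    """Continue the trace from incoming headers, or start a new trace if there are none."""
    headers = headers or {}
    trace_id = headers.get("x-trace-id", uuid.uuid4().hex)
    return Span(trace_id, uuid.uuid4().hex[:16], headers.get("x-parent-span-id"), name)


def finish(span: Span) -> None:
    span.end = time.monotonic()
    FINISHED.append(span)


def outgoing_headers(span: Span) -> Dict[str, str]:
    """Propagate the trace context to the next service on the request path."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


def service_b(headers: Dict[str, str]) -> None:
    span = start_span("service-b: query db", headers)
    finish(span)


def service_a() -> None:
    span = start_span("service-a: handle /checkout")
    service_b(outgoing_headers(span))  # the downstream call carries the context
    finish(span)


service_a()
```

A backend can rebuild the request tree from the `parent_id` links and compare span durations to find bottlenecks.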

Slide 40

THE THREE PILLARS OF OBSERVABILITY
• We understand the lifecycle of a request better.
• We can debug requests spanning multiple services to find bottlenecks.
• Traces help us understand the which and sometimes even the why.

Slide 41

The three pillars of observability: Metrics

Slide 42

THE THREE PILLARS OF OBSERVABILITY
• Metrics are a numeric representation of data measured over intervals of time.
• We can use mathematical modeling and prediction to understand the behavior of a system as a time series.
• Optimized for storage, processing, compression and retrieval.
• Enables long retention of data as well as easy querying.
• Allows data resolution to be reduced over time (downsampling).
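Reducing resolution over time usually means rolling raw samples up into coarser buckets. A minimal Python sketch, assuming hypothetical 10-second latency samples rolled up into 60-second buckets that keep the mean and the maximum:

```python
from statistics import mean
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, value)


def downsample(samples: List[Sample], bucket_seconds: int) -> List[Tuple[int, float, float]]:
    """Roll raw samples up into fixed time buckets, keeping (bucket_start, mean, max)."""
    buckets: Dict[int, List[float]] = {}
    for ts, value in samples:
        start = ts - ts % bucket_seconds  # align each sample to its bucket start
        buckets.setdefault(start, []).append(value)
    return [(start, mean(vals), max(vals)) for start, vals in sorted(buckets.items())]


# Hypothetical latency samples every 10 s over two minutes.
raw = [(t, 100.0 + (t % 60)) for t in range(0, 120, 10)]
print(downsample(raw, 60))
```

Keeping the max alongside the mean is a design choice: averages alone hide latency spikes once data has been downsampled.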

Slide 43

THE THREE PILLARS OF OBSERVABILITY
• Modern monitoring systems use labels (additional key-value pairs), allowing a high degree of dimensionality in the data model.
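A sketch of what labels mean for the data model, in the style of Prometheus-like systems: every unique combination of label values becomes its own time series, and queries aggregate across the series whose labels match. `LabeledCounter` and its methods are hypothetical, not any product's API.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


class LabeledCounter:
    """A counter where each unique combination of label values is a separate series."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.series: Dict[LabelSet, float] = defaultdict(float)

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        self.series[frozenset(labels.items())] += amount

    def sum_where(self, **match: str) -> float:
        """Aggregate across every series whose labels include all of `match`."""
        wanted = set(match.items())
        return sum(v for labels, v in self.series.items() if wanted <= labels)


requests = LabeledCounter("http_requests_total")
requests.inc(method="GET", path="/cart", status="200")
requests.inc(method="GET", path="/cart", status="200")
requests.inc(method="POST", path="/checkout", status="500")
requests.inc(method="POST", path="/checkout", status="200")
print(requests.sum_where(path="/checkout"), requests.sum_where(status="200"))
```

Because each combination creates a new series, high-cardinality labels such as user IDs multiply storage quickly.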

Slide 44

Best practices

Slide 45

BEST PRACTICES
• Log everything, or log selectively.
• Use quotas instead of log levels.
• Establish service tiers with quotas and priorities.
• Adjust log rates dynamically.
• Identify a small set of hard failure modes.
• Don’t monitor more than needed.
• Avoid noisy monitoring.
• Trace through service meshes.
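One way to read "quotas instead of log levels" is a per-tier budget on how many log lines a service may emit, enforced with a token bucket whose rate can be changed at runtime. The `LogQuota` class, the tier names and the rates below are hypothetical assumptions for illustration.

```python
import time
from typing import Callable, Dict


class LogQuota:
    """Token bucket: allow up to `rate` log lines per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()
        self.dropped = 0  # count drops so the quota itself is observable

    def _refill(self) -> None:
        now = self.clock()
        # Add tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def allow(self) -> bool:
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        self.dropped += 1
        return False

    def set_rate(self, rate: float) -> None:
        """Adjust the log rate at runtime, e.g. raise it temporarily while debugging."""
        self._refill()  # settle tokens earned at the old rate first
        self.rate = rate


# Hypothetical service tiers: critical services get a larger log budget.
QUOTAS: Dict[str, LogQuota] = {
    "tier-1": LogQuota(rate=100, burst=200),
    "tier-3": LogQuota(rate=5, burst=10),
}
```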

Slide 46

Questions?

Slide 47

Thank you for attending!

Slide 48

Tynäsgatan 12, 65216 Karlstad
+46 10 252 55 01
www.opsio.se
Gunnar Grosch @gunnargrosch