Service Mesh - Observability

Building Cloud-Native App Series - Part 13 of 15
Microservices Architecture Series
Service Mesh - Observability
- Zipkin
- Prometheus
- Grafana
- Kiali

Araf Karsh Hamid

June 01, 2022


  1. Microservice Architecture Series: Building Cloud Native Apps, Part 13 of 15

    Service Mesh / Istio: Zipkin / Prometheus / Grafana / Kiali. Monitoring / Observability.
    A tech presentorial: a combination of presentation and tutorial.
    Araf Karsh Hamid, Co-Founder / CTO, MetaMagic Global Inc., NJ, USA (@arafkarsh)
    Architecting & Building Apps: 8 Years Network & Security, 6+ Years Microservices, Blockchain, 8 Years Cloud Computing, 8 Years Distributed Computing
  2. Slides are color coded based on the topic colors.

    1. Monitoring / Observability
    2. Kubernetes Auditing
    3. Zipkin, Prometheus, Grafana / Kiali
    4. ML / AI
  3. Application Modernization – 3 Transformations

    1. Architecture: Monolithic → SOA → Microservice
    2. Infrastructure: Physical Server → Virtual Machine → Cloud
    3. Delivery: Waterfall → Agile → DevOps
    Source: IBM: Application Modernization > https://www.youtube.com/watch?v=RJ3UQSxwGFY
  4. Developer Journey

    Monolithic: Domain Driven Design; Event Sourcing and CQRS; Waterfall; Agile Scrum (4-6 Weeks); Optional Design Patterns; Continuous Integration (CI); 6/12 Months; Enterprise Service Bus; Relational Database [SQL] / NoSQL; Development, QA / QC, Ops
    Microservices: Domain Driven Design; Event Sourcing and CQRS; Scrum / Kanban (1-5 Days); Mandatory Design Patterns; Infrastructure Design Patterns; CI; DevOps; Event Streaming / Replicated Logs; SQL; NoSQL; CD; Container Orchestrator; Service Mesh
  5. Monitoring & Observability

    • Challenges in Monitoring
    • Monitoring Vs. Observability
    • ML / AI-based Analytics
  6. Challenges in Monitoring

    • Blind Spot: Container / Pod disposability increases portability and scalability. However, this creates blind spots in monitoring.
    • Need to Record: Portability of inter-dependent components creates an increased need to maintain and record telemetry data with traceability to ensure observability.
    • Visualization: The scale and complexity introduced by containers and container orchestration require good tools to visualize and analyze the data generated.
    • Don't Leave DevOps in the Dark: Application performance is critical for the Ops team, as containers can be scaled up and down at lightning speed.
    Source: A Beginners guide to Kubernetes Monitoring by Splunk
  7. Monitoring Vs. Observability

    1. Monitoring: Says whether the system is working or not. Observability: Says why it's not working.
    2. Monitoring: Collects metrics and logs from a system. Observability: Actionable insights gained from the metrics.
    3. Monitoring: Failure centric. Observability: Overall behavior of the system.
    4. Monitoring: Is "the How" of something you do. Observability: Is "the Process" of something you have.
    5. Monitoring: I monitor you. Observability: You make yourself observable.
    Source: A Beginners guide to Observability by Splunk
  8. Observability

    • Monitoring: Predictable failures
    • Testing: Best effort verification of correctness; best effort simulation of failure modes
    • All possible permutations of full and partial failure
    Source: A Beginners guide to Observability by Splunk
  9. Benefits of Observability

    1. Better understanding of complex microservices communication and end-user usage patterns
    2. Helps in faster troubleshooting and shorter MTTR (Mean Time To Recovery)
    3. Better understanding of incidents
    4. Better uptime and performance
    5. Happier customers and more revenue
    Source: A Beginners guide to Observability by Splunk
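MTTR, mentioned in the benefits above, is simply the average of the recovery durations over a set of incidents. A minimal sketch follows, assuming each incident is recorded as a hypothetical `(failed_at, recovered_at)` pair of timestamps; the function name and data shape are illustrative and not from the deck.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery duration over (failed_at, recovered_at) datetime pairs."""
    durations = [recovered - failed for failed, recovered in incidents]
    if not durations:
        raise ValueError("at least one incident is required")
    # Start the sum from a zero timedelta so timedeltas add up correctly.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents lasting 30, 60 and 90 minutes.
start = datetime(2022, 6, 1, 9, 0)
incidents = [
    (start, start + timedelta(minutes=30)),
    (start, start + timedelta(minutes=60)),
    (start, start + timedelta(minutes=90)),
]
print(mean_time_to_recovery(incidents))
```

Tracking this value before and after an observability rollout is one way to check whether troubleshooting is actually getting faster.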
  10. Pillars of Observability

    • Logs / events: Immutable records of discrete events that happen over time
    • Metrics: Numbers describing a particular process or activity measured over intervals of time
    • Traces: Data that shows, for each invocation of each downstream service, which instance was called, which method within that instance was invoked, how the request performed, and what the results were
    Source: A Beginners guide to Observability by Splunk
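The three pillars above can be pictured as three small record types. The sketch below is only an illustration: the class and field names (`LogEvent`, `MetricSample`, `Span`, `failed_calls`) are hypothetical and do not correspond to any specific tool's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogEvent:
    """Immutable record of a discrete event at a point in time."""
    timestamp: float
    service: str
    message: str


@dataclass(frozen=True)
class MetricSample:
    """A number describing an activity, measured over an interval."""
    name: str
    value: float
    interval_seconds: float


@dataclass(frozen=True)
class Span:
    """One invocation of a downstream service within a trace."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    instance: str   # which instance was called
    method: str     # which method was invoked
    duration_ms: float  # how the request performed
    status: str     # what the result was


def failed_calls(spans, trace_id):
    """Return (service, instance, method) for every failed call in one trace."""
    return [
        (s.service, s.instance, s.method)
        for s in spans
        if s.trace_id == trace_id and s.status != "OK"
    ]


# Hypothetical trace: gateway -> orders -> inventory
spans = [
    Span("t1", "a", None, "gateway", "gateway-0", "GET /order", 120.0, "OK"),
    Span("t1", "b", "a", "orders", "orders-2", "getOrder", 95.0, "OK"),
    Span("t1", "c", "b", "inventory", "inventory-1", "checkStock", 80.0, "ERROR"),
]
print(failed_calls(spans, "t1"))
```

The `parent_id` link is what lets a tracing backend rebuild the call tree for a single request.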
  11. Events / Logs

    Event Sources
    • System and Server logs (syslog)
    • Firewall and IDS/IPS logs
    • Container / Pod Logs
    • Application / Service / Database logs (log4j, log4net, Apache, MySQL, AWS)
  12. Metrics

    Metric Sources
    • Infrastructure Metrics (Node, K8s)
    • System Metrics (CPU, Memory, Disk)
    • Service Metrics (Envoy Proxy)
    • Network Metrics (Packets, Bytes)
    • Business Metrics (revenue, customer sign-ups, bounce rate, cart abandonment)
    • UI Metrics (Google Analytics, Digital Experience Management)
  13. Traces

    Specific parts of a user's journey are collected into traces, showing
    • which services were invoked,
    • which containers / hosts / instances they were running on, and
    • what the results of each call were.
  14. Kubernetes Auditing

    Auditing provides logs on what's happening within the cluster. Scope and levels of detail are configurable.
    A forensic review of the Kubernetes logs shows the following:
    o What happened?
    o When did it happen?
    o Who initiated it?
    o On what did it happen?
    o Where was it observed?
    o From where was it initiated?
    o To where was it going?
    Source: https://kubernetes.io/docs/tasks/debug-application-cluster/audit/
  15. Kubernetes Audit Stages

    • Request Received (RequestReceived): The stage for events generated as soon as the audit handler receives the request.
    • Response Started (ResponseStarted): Once the response headers are sent, but before the response body is sent.
    • Response Complete (ResponseComplete): The response body has been completed and no more bytes will be sent.
    • Panic: Events generated when a panic occurred.
    Source: https://kubernetes.io/docs/tasks/debug-application-cluster/audit/
  16. Kubernetes Audit Policy

    • None: Don't log events that match this rule.
    • Metadata: Log request metadata (requesting user, timestamp, resource, verb, etc.) but not request or response body.
    • Request: Log event metadata and request body but not response body. This does not apply for non-resource requests.
    • RequestResponse: Log event metadata, request and response bodies. This does not apply for non-resource requests.
    Source: https://kubernetes.io/docs/tasks/debug-application-cluster/audit/
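The four audit levels above differ only in which parts of an event get recorded. The sketch below models that as a lookup; it is not Kubernetes code, the function name `recorded_parts` is hypothetical, and treating a non-resource request as metadata-only at the `Request` and `RequestResponse` levels is an assumption made here to represent "does not apply for non-resource requests".

```python
AUDIT_LEVELS = ("None", "Metadata", "Request", "RequestResponse")


def recorded_parts(level, is_resource_request=True):
    """Return the set of event parts an audit rule at `level` would record."""
    if level not in AUDIT_LEVELS:
        raise ValueError(f"unknown audit level: {level!r}")
    if level == "None":
        return set()
    parts = {"metadata"}
    # Assumption: bodies are only recorded for resource requests.
    if is_resource_request:
        if level in ("Request", "RequestResponse"):
            parts.add("request_body")
        if level == "RequestResponse":
            parts.add("response_body")
    return parts


for level in AUDIT_LEVELS:
    print(level, sorted(recorded_parts(level)))
```

Choosing the lowest level that still answers the forensic questions keeps audit log volume manageable, since request and response bodies can be large.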
  17. Kubernetes Native Monitoring

    • Application Logs (L7 Logs)
    • Container / Pod Logs: Process, System Calls, Network Logs, File System Logs
    • Kubernetes Logs: Network Flow Logs, Audit Logs, DNS Logs
    • Host OS Logs: SSH Logs, OS Audit Logs
    • Cloud Infra Logs
    Diagram layers (Host / K8s Node): App Server /bin, Container Runtime, Host OS, Kubernetes, Cloud Hardware
  18. Kubernetes Node

    Diagram: Namespaces contain Services, each backed by Pods; eBPF programs run in the Linux kernel and feed the observability tools.
    • K-Probe: Source IP Address & Port, Destination IP Address & Port, Protocol
    • Network Flow Log: Adds bytes and packet count to the K-Probe data for a connection
    • Log Collector: Adds K8s metadata like Namespace, Service Name, etc.
    • Prometheus: Collects system & service metrics
    • Connection Tracker: Tracks the state of TCP / UDP connections. It can be used as NAT and stateful firewall.
    • Envoy Proxy: Routes traffic, performs load balancing, applies policies, handles secure communication
    • FluentD: Runs as a sidecar to collect logs from various sources
  19. Data Collection

    • K-Probe: Source IP Address, Source Port, Destination IP Address, Destination Port, Protocol
    • NF Log: Adds bytes and packets count for the above five attributes for a connection
    • Log Collector: Adds Kubernetes metadata to the above data, like Namespace, Service, Pod, etc.
    • Prometheus: Collects system and service metrics
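The collection stages above form an enrichment pipeline: a five-attribute connection record gains counters, then Kubernetes metadata. The sketch below shows that flow under stated assumptions: `FlowRecord`, `add_counters`, `add_k8s_metadata` and the IP-keyed `pod_index` lookup are all hypothetical names, not the API of any real collector.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FlowRecord:
    # Stage 1 (K-Probe): the five connection attributes
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    # Stage 2 (network flow log): traffic counters
    byte_count: int = 0
    packet_count: int = 0
    # Stage 3 (log collector): Kubernetes metadata
    namespace: Optional[str] = None
    service: Optional[str] = None
    pod: Optional[str] = None


def add_counters(flow, byte_count, packet_count):
    """Stage 2: attach bytes and packets for the connection."""
    return replace(flow, byte_count=byte_count, packet_count=packet_count)


def add_k8s_metadata(flow, pod_index):
    """Stage 3: look up the destination IP in a hypothetical IP -> metadata map."""
    meta = pod_index.get(flow.dst_ip, {})
    return replace(
        flow,
        namespace=meta.get("namespace"),
        service=meta.get("service"),
        pod=meta.get("pod"),
    )


pod_index = {
    "10.0.1.7": {"namespace": "shop", "service": "orders", "pod": "orders-2"},
}
flow = FlowRecord("10.0.0.5", 43512, "10.0.1.7", 8080, "TCP")
flow = add_counters(flow, byte_count=18432, packet_count=24)
flow = add_k8s_metadata(flow, pod_index)
print(flow)
```

Keeping each stage a pure function over an immutable record makes it easy to see which component contributed which field.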
  20. Kubernetes Metrics Server

    • Metrics Server is a cluster-wide aggregator of resource usage data.
    • CPU is reported as the average usage, in CPU cores, over a period of time.
    • Memory is reported as the working set, in bytes, at the instant the metric was collected.
    Source: https://kubernetes.io/docs/tasks/debug-application-cluster/resource-metrics-pipeline/
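"Average usage in CPU cores over a period of time" can be read as a rate over a cumulative CPU-time counter. The sketch below works through that arithmetic, assuming the underlying source exposes cumulative CPU seconds sampled at two points in time; the function names and sample values are hypothetical.

```python
def average_cpu_cores(cpu_seconds_start, cpu_seconds_end, window_seconds):
    """Average cores used = CPU time consumed / wall-clock time elapsed."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (cpu_seconds_end - cpu_seconds_start) / window_seconds


def to_millicores(cores):
    """Kubernetes commonly expresses CPU in millicores (1 core = 1000m)."""
    return round(cores * 1000)


# A container consumed 15 CPU-seconds during a 30-second window.
cores = average_cpu_cores(1200.0, 1215.0, 30.0)
print(cores, to_millicores(cores))
```

Memory needs no such rate calculation because, as stated above, it is an instantaneous working-set value rather than a counter.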
  21. ML/AI Driven Analytics

    o Enrich: Adding context to events to make them informative and actionable
    o Reduce Duplicate: Automatically concealing duplicate events to focus on relevant ones and reducing alert storms
    o Reduce False +ve: Reducing event clutter and false positives with multivariate anomaly detection
    o Filter / Tag / Sort: Easily sifting through vast amounts of events by filtering, tagging and sorting
    Source: A Beginners guide to Observability by Splunk
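The "Reduce Duplicate" step above can be illustrated with a plain rule-based filter. This is a simple stand-in, not an ML technique and not how any particular product does it: events are assumed to be dictionaries with hypothetical `timestamp`, `source` and `kind` keys, and repeats of the same `(source, kind)` within a time window are concealed.

```python
def suppress_duplicates(events, window_seconds=60.0):
    """Keep the first event per (source, kind); hide repeats inside the window."""
    last_kept = {}  # (source, kind) -> timestamp of the last event we kept
    kept = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["source"], event["kind"])
        previous = last_kept.get(key)
        if previous is None or event["timestamp"] - previous >= window_seconds:
            kept.append(event)
            last_kept[key] = event["timestamp"]
    return kept


events = [
    {"timestamp": 0.0, "source": "orders-2", "kind": "HighLatency"},
    {"timestamp": 5.0, "source": "cart-1", "kind": "HighLatency"},
    {"timestamp": 10.0, "source": "orders-2", "kind": "HighLatency"},  # duplicate
    {"timestamp": 70.0, "source": "orders-2", "kind": "HighLatency"},  # new window
]
for event in suppress_duplicates(events):
    print(event["timestamp"], event["source"])
```

A real system would likely group on richer attributes, but even this fixed-window rule shows how an alert storm collapses into a few actionable events.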
  22. Anomalous Events

    • IP Sweep Detection: Pods sending many packets to many destinations
    • Port Scan Detection: Pods sending packets to one destination on multiple ports
    • HTTP Spike: Services that get too many inbound HTTP connections
    • DNS Latency: Too high latency for DNS requests
    • L7 Latency: Pods with too high latency for L7 requests
    Source: Kubernetes Security and Observability: Brendan Creane & Amit Gupta
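The first two detections above reduce to counting distinct destinations per source and distinct ports per source/destination pair. The sketch below uses fixed thresholds over hypothetical `(source, destination, port)` flow tuples; the function name, thresholds and sample data are illustrative assumptions, and a production detector would also account for time windows and baselines.

```python
from collections import defaultdict


def detect_scans(flows, sweep_threshold=3, port_threshold=3):
    """Return (ip_sweep_sources, port_scan_pairs) from (src, dst, port) tuples."""
    destinations_by_source = defaultdict(set)
    ports_by_pair = defaultdict(set)
    for src, dst, port in flows:
        destinations_by_source[src].add(dst)
        ports_by_pair[(src, dst)].add(port)
    # IP sweep: one source touching many distinct destinations.
    sweeps = sorted(
        src for src, dsts in destinations_by_source.items()
        if len(dsts) >= sweep_threshold
    )
    # Port scan: one source touching many ports on a single destination.
    scans = sorted(
        pair for pair, ports in ports_by_pair.items()
        if len(ports) >= port_threshold
    )
    return sweeps, scans


flows = [
    ("pod-a", "10.0.0.1", 80),
    ("pod-a", "10.0.0.2", 80),
    ("pod-a", "10.0.0.3", 80),
    ("pod-b", "10.0.0.9", 22),
    ("pod-b", "10.0.0.9", 80),
    ("pod-b", "10.0.0.9", 443),
]
print(detect_scans(flows))
```

The enriched flow records from the data collection stage are the natural input here, since they already carry source, destination and port.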
  23. Design Patterns

    Design Patterns are solutions to general problems that software developers face during software development.
  24. Thank you

    DREAM | AUTOMATE | EMPOWER
    Araf Karsh Hamid: India: +91.999.545.8627
    http://www.slideshare.net/arafkarsh
    https://speakerdeck.com/arafkarsh
    https://www.linkedin.com/in/arafkarsh/
    https://www.youtube.com/user/arafkarsh/playlists
    http://www.arafkarsh.com/
    @arafkarsh
  35. Testing tools

    1. Simoorg: LinkedIn's own failure inducer framework. It was designed to be easy to extend and most of the important components are pluggable.
    2. Pumba: A chaos testing and network emulation tool for Docker.
    3. Chaos Lemur: Self-hostable application to randomly destroy virtual machines in a BOSH-managed environment, as an aid to resilience testing of high-availability systems.
    4. Chaos Lambda: Randomly terminate AWS ASG instances during business hours.
    5. Blockade: Docker-based utility for testing network failures and partitions in distributed applications.
    6. Chaos-http-proxy: Introduces failures into HTTP requests via a proxy server.
    7. Monkey-ops: A simple service implemented in Go, which is deployed into an OpenShift V3.X and generates some chaos within it. Monkey-Ops seeks some OpenShift components like Pods or Deployment Configs and randomly terminates them.
    8. Chaos Dingo: Currently supports performing operations on Azure VMs and VMSS deployed to an Azure Resource Manager-based resource group.
    9. Tugbot: Testing in Production (TiP) framework for Docker.