Thanks, Keptn Obvious! Observable SLOs with Prometheus and Keptn

April 20, 2022

> whoami --henrik @HRexed https://www.linkedin.com/in/hrexed/ henrikrexed

> whoami --oleg @oleg_nenashev oleg-nenashev #StayAtHome :(

If you stay with us, …
• The core SRE principle: SLI/SLO
• The importance of observability
• Introduction to Keptn
• Demo

Why do we need SRE?
• Developers generally focus on innovation and agility
• Operations focus on availability and stability
• SRE was created to make our systems more reliable and to avoid conflict between Developers and Operations

SLI: Service Level Indicator
An indicator for understanding the state of your system or of your users.
SLI = Good Events / Valid Events × 100 %
Example: HTTP request latency
SLI = (# of HTTP requests with <= 5 sec response time) / (total # of requests) × 100 %

SLO: Service Level Objective
An objective associated with your indicator, on a scale from 0 % to 100 %.
Example: request latency will be <= 5 secs for 95% of requests, where the indicator is the # of HTTP requests with <= 5 sec response time.
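The SLI ratio and the 95% objective from these two slides can be sketched in a few lines of Python. This is only an illustration: the function names and the sample latencies are made up for the example and are not part of the talk.

```python
def sli_percent(good_events: int, valid_events: int) -> float:
    """SLI = good events / valid events * 100."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return 100.0 * good_events / valid_events


def meets_slo(sli: float, target_percent: float = 95.0) -> bool:
    """An SLO is met when the indicator reaches the target."""
    return sli >= target_percent


# Hypothetical response times in seconds (illustrative data only).
latencies = [0.8, 1.2, 4.9, 5.0, 7.3, 0.4, 2.1, 3.3, 0.9, 1.0]

# A request is a "good event" when it is answered within 5 seconds.
good = sum(1 for t in latencies if t <= 5.0)
sli = sli_percent(good, len(latencies))

print(f"SLI = {sli:.1f}%, SLO met: {meets_slo(sli)}")
```

With this sample, 9 of 10 requests are fast enough, so the SLI is 90% and the 95% objective is missed.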

SLI/SLO help you define objectives
• Product owners define the objectives for each service at a very early stage
• SLI/SLO help validate:
  ◦ Availability
  ◦ Performance
  ◦ …etc
• SLI/SLO help you detect issues before your end users do
• Your objectives need to be achievable, because your error budget is based on them
Diagram labels: production, performance, availability
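The link between an objective and its error budget can be made concrete with a small calculation. The sketch below assumes the common definition (the budget is the share of events that may fail while the SLO still holds). The function names and numbers are hypothetical.

```python
def error_budget(slo_percent: float, valid_events: int) -> float:
    """Number of events that may fail while the SLO is still met."""
    return valid_events * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, valid_events: int, bad_events: int) -> float:
    """Budget left after subtracting the failures already observed."""
    return error_budget(slo_percent, valid_events) - bad_events


# Hypothetical month: 10,000 requests, 95% objective, 120 slow requests so far.
budget = error_budget(95.0, 10_000)
left = budget_remaining(95.0, 10_000, 120)

print(f"budget = {budget:.0f} requests, remaining = {left:.0f}")
```

An objective of 100% would leave a budget of zero, which is why the objectives have to be achievable.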

Remove toil
A typical day for an SRE: Operations 50%, Dev 50%

??? 25%

Observability

The pillars of observability: logs, events, metrics, traces

The CNCF landscape: https://landscape.cncf.io/

The reality… https://twitter.com/dastbe/status/1303858170155081728

Open Observability

Open Observability: standards

Prometheus: metrics provider

The Prometheus architecture
Diagram: kube-state-metrics, Node exporter, and cAdvisor are scraped by the Prometheus server, which is queried with PromQL and sends alerts to Alertmanager.
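As a rough illustration of how a latency SLI could be read from the Prometheus server with PromQL, the sketch below builds an instant-query URL for the `/api/v1/query` endpoint and parses a response of the documented shape. The server address, the histogram metric name `http_request_duration_seconds`, and the existence of a bucket with upper bound 5 are assumptions for the example, not details from the talk.

```python
import json
from urllib.parse import urlencode

# Assumption: a Prometheus server reachable on the default local port.
PROMETHEUS_URL = "http://localhost:9090"

# Hypothetical histogram metric; the share of requests served within 5 s.
LATENCY_SLI_QUERY = (
    'sum(rate(http_request_duration_seconds_bucket{le="5"}[5m]))'
    " / sum(rate(http_request_duration_seconds_count[5m])) * 100"
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL of an instant query against the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def first_value(response_body: str) -> float:
    """Extract the first sample value from an instant-vector response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    result = payload["data"]["result"]
    if not result:
        raise LookupError("query returned no series")
    # Each sample is [unix_timestamp, "value as string"].
    return float(result[0]["value"][1])


# Canned response in the API's format, so the sketch runs without a server.
sample = (
    '{"status":"success","data":{"resultType":"vector",'
    '"result":[{"metric":{},"value":[1650441600,"97.5"]}]}}'
)

print(instant_query_url(PROMETHEUS_URL, LATENCY_SLI_QUERY))
print(first_value(sample))
```

Against a live server, the URL could be fetched with `urllib.request.urlopen` and the body passed to `first_value`.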

Prometheus is a standard
• Databases: CouchDB, MySQL, Oracle, PostgreSQL, MongoDB, …
• Hardware: Netgear, Windows, IBM Z, Nvidia, …etc
• Messaging: MQ, Kafka, MQTT, RabbitMQ, …etc
• Storage: Tivoli, Hadoop, NetApp, ScaleIO
• Other: Jira, Jenkins, GitHub, Fluentd, Nagios, …etc

Automate: the key for your SREs

Modern DevOps systems
• CI/CD
• Production
• Staging

Operations at scale
• Complex configurations
• Repetition
• Configuration drift
Maintenance is costly
Automation spaghetti

Keptn
Data-driven delivery and operations for your cloud-native applications
https://keptn.sh

Orchestration for your applications: Control Plane, CloudEvents

Keptn
• A CNCF project
• Control plane, admin frontend/CLI
• Observability, dashboards & alerting
• SLO-driven multistage delivery
• Operations & remediation

SLO Evaluation & Monitoring 4,000+ apps Metrics / SLI Providers
Notifications Auto-remediation

Keptn: SLO-driven automation for DevOps & SREs
• You (Dev/Ops/SRE) bring your configuration, pick your use case, and connect your tools
• Use cases: SLO quality gates, progressive delivery, auto-remediation
• Declarative, GitOps-style standards: shipyard, SLI/SLO, runbook
• Keptn automates the configuration and provides self-service for monitoring, delivery, reliability, and remediation
• Based on event-driven process orchestration

Triggers an automation sequence
Orchestration, monitoring, deployment, testing, SLO evaluation & remediation
Use your tools without configuration or integration effort
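Since the Keptn control plane communicates through CloudEvents, triggering a sequence amounts to sending one event. The sketch below builds an event with the attributes the CloudEvents 1.0 specification requires (`specversion`, `id`, `source`, `type`). The event type string and the `project` / `stage` / `service` payload follow Keptn's naming conventions as an assumption. Check them against the Keptn event specification before relying on them.

```python
import json
import uuid
from datetime import datetime, timezone


def make_cloudevent(event_type: str, source: str, data: dict) -> dict:
    """Build a CloudEvents 1.0 envelope in structured JSON form."""
    return {
        "specversion": "1.0",            # required by the CloudEvents spec
        "id": str(uuid.uuid4()),         # required, unique per event
        "source": source,                # required, identifies the producer
        "type": event_type,              # required, describes what happened
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": data,
    }


# Assumption: type and payload fields modeled on Keptn conventions;
# the project, stage, and service names are hypothetical.
event = make_cloudevent(
    event_type="sh.keptn.event.dev.delivery.triggered",
    source="example-pipeline",
    data={"project": "demo", "stage": "dev", "service": "my-app"},
)

print(json.dumps(event, indent=2))
```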

Keptn is extensible https://artifacthub.io/packages/search?ts_query_web=Keptn

Keptn integrates with other tools: CLI / REST API

Demo Time!

Keptn and Prometheus: SLO evaluation & monitoring, Prometheus integration service, your app, auto-remediation loop
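To give an idea of what an SLO evaluation step does with the SLI values it receives, here is a simplified quality-gate sketch. It is not Keptn's implementation: the scoring rule (full weight for a pass, half weight for a warning), the thresholds, and all names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Objective:
    sli: str            # name of the indicator
    pass_max: float     # value at or below this passes
    warning_max: float  # value at or below this is a warning
    weight: int = 1


def evaluate(objectives, sli_values, pass_score=90.0, warning_score=75.0):
    """Score each objective, then compare the total against the gate."""
    earned = 0.0
    total = 0
    for obj in objectives:
        value = sli_values[obj.sli]
        total += obj.weight
        if value <= obj.pass_max:
            earned += obj.weight          # full points
        elif value <= obj.warning_max:
            earned += obj.weight * 0.5    # half points
    score = 100.0 * earned / total
    if score >= pass_score:
        return score, "pass"
    if score >= warning_score:
        return score, "warning"
    return score, "fail"


# Hypothetical objectives and measurements.
objectives = [
    Objective("response_time_p95", pass_max=5.0, warning_max=8.0, weight=2),
    Objective("error_rate", pass_max=2.0, warning_max=5.0, weight=1),
]
measured = {"response_time_p95": 4.2, "error_rate": 3.1}

score, result = evaluate(objectives, measured)
print(f"score = {score:.1f}, result = {result}")
```

Here the latency objective passes and the error-rate objective only reaches the warning band, so the gate returns a warning rather than a pass.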

Quickstart Guide: keptn.sh/docs/quickstart
Prerequisites:
• Docker+K3D or K3s
• 12GB+ RAM

Is It Observable
• If you are looking for educational content on observability, check out Is It Observable

It is observable
Thanks, Keptn Obvious! Observable SLOs with Prometheus and Keptn
keptn.sh

A lot of manual work
• ~90% of test reruns
• 9:1 ratio of script maintenance vs. creation
• Only 10% of projects performance tested
• ~80% of time spent in manual ... (test result analysis, monitoring configuration, script creation, SLO report generation)
• 15-20 tests / year
• < 5 apps
"We are limited in scaling SRE due to manual expert tasks!" Roman Ferstl, Managing Director

Adopters

Keptn is for engineers https://keptn.sh

DevOps & SREs: reliability and efficiency
• DevOps: automate speed of delivery
• SRE: automate resiliency of operations
• Deployment Frequency: how often an organization successfully releases to production
• Lead Time for Changes: the amount of time it takes a commit to get into production
• Change Failure Rate: the percentage of deployments causing a failure in production
• Time to Restore Service: how long it takes an organization to recover from a failure in production
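The four delivery metrics on this slide are simple to compute once deployments are recorded. The sketch below assumes a hypothetical `Deployment` record with commit, deploy, and restore timestamps. The data and all names are illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def deployment_frequency(deployments, period_days: int) -> float:
    """Deployments per day over the observed period."""
    return len(deployments) / period_days


def mean_lead_time_hours(deployments) -> float:
    """Average time from commit to production, in hours."""
    return mean((d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deployments)


def change_failure_rate(deployments) -> float:
    """Percentage of deployments that caused a failure."""
    return 100.0 * sum(d.failed for d in deployments) / len(deployments)


def mean_time_to_restore_hours(deployments) -> float:
    """Average time from a failed deployment to recovery, in hours."""
    return mean(
        (d.restored_at - d.deployed_at).total_seconds() / 3600
        for d in deployments
        if d.failed and d.restored_at is not None
    )


# Hypothetical week with four deployments, one of which failed.
t0 = datetime(2022, 4, 1, 9, 0)
h, day = timedelta(hours=1), timedelta(days=1)
deployments = [
    Deployment(t0, t0 + 2 * h),
    Deployment(t0 + day, t0 + day + 4 * h, failed=True, restored_at=t0 + day + 5 * h),
    Deployment(t0 + 2 * day, t0 + 2 * day + 6 * h),
    Deployment(t0 + 3 * day, t0 + 3 * day + 4 * h),
]

print(f"frequency: {deployment_frequency(deployments, 7):.2f} per day")
print(f"lead time: {mean_lead_time_hours(deployments):.1f} h")
print(f"change failure rate: {change_failure_rate(deployments):.0f}%")
print(f"time to restore: {mean_time_to_restore_hours(deployments):.1f} h")
```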

Is it observable?

Learning from Google's SRE Practices
• Service Level Indicators (SLIs)
  ◦ Definition: Measurable metrics as the base for evaluation
  ◦ Example: Error rate of login requests
• Service Level Objectives (SLOs)
  ◦ Definition: Binding targets for Service Level Indicators
  ◦ Example: Login error rate must be less than 2% over a 30 day period
• Service Level Agreements (SLAs)
  ◦ Definition: Business agreement between consumer and provider, typically based on SLOs
  ◦ Example: Logins must be reliable & fast (error rate, response time, throughput) 99% within a 30 day window
• Google Cloud YouTube Video
  ◦ SLIs, SLOs, SLAs, oh my! (class SRE implements DevOps): https://www.youtube.com/watch?v=tEylFyxbDLE
SLIs drive SLOs, which inform SLAs

Takeaways
• Keptn: data-driven delivery and operations for your cloud-native apps
• Keptn is not a CI/CD tool. It takes observability to the next level
• Keptn ❤ Prometheus
• Keptn handles the automation for your apps

References
• Website: keptn.sh
• Quickstart guide: keptn.sh/docs/quickstart
• Tutorials: tutorials.keptn.sh
• Keptn in French: https://www.youtube.com/playlist?list=PL6i801Rjt9DbMZMaRxkbXS7AC5nQy5ipz

More tutorials!
• Prometheus
• Dynatrace
• ArgoCD
• Jenkins
• Coming soon: Datadog
tutorials.keptn.sh

More about Keptn Techworld with Nana, March 2022 https://www.youtube.com/watch?v=3EEZmSwMXp8

Get it at keptn.sh!!! Keptn 100% OFF* * unlimited offer

Subscribe! isitobservable.io

keptn.sh/community/#slack

BACKUP

Our example
• Dynatrace is a Software Intelligence company
• We run services at large scale, SaaS included
• We have adopted Site Reliability Engineering (SRE)
• Automation everywhere

Why did we start Keptn? Application monitoring (APM), automation, alerts, our systems

Our problems
• System scale
• Thousands of metrics (1000+)
• Complex Service Level Indicators (SLIs)
• Complex decision-making logic
• Complex integration testing

Our systems

Keptn architecture: Architecture | keptn | Cloud-native application life-cycle orchestration

Contributing to Keptn
• We are looking for contributors!
• keptn.sh/community/contributing
  ◦ K8s, Golang, Javascript, Documentation, etc.
  ◦ SRE and Operations
• We participate in Google Summer of Code
• Slack: keptn.sh/community/#slack

Join us online • Zoom => CNCF Community Portal •
community.cncf.io/keptn-community • Powered by Bevy • Videos go to YouTube

… and at KubeCon! May 16-20, 2022 https://docs.google.com/presentation/d/1SzwJD_1f9ufy_hbHGJD8N6nSKtuj80ysI_Srmx-Or5M/edit?usp=sharing


Agenda
• Observability 101
• SLIs / SLOs
• Introduction to Keptn
• Demo

SRE: recipe for success
1. Observability
2. Metrics
3. Automation


Our objectives: SRE / SLI / SLO




The End!

