Slide 1

Slide 1 text

Thanks, Keptn Obvious! Observable SLOs with Prometheus and Keptn. April 20, 2022

Slide 2

Slide 2 text

> whoami --henrik @HRexed https://www.linkedin.com/in/hrexed/ henrikrexed

Slide 3

Slide 3 text

> whoami --oleg @oleg_nenashev oleg-nenashev #StayAtHome :(

Slide 4

Slide 4 text

If you stay with us, …
● The core principle of SRE: SLI/SLO
● The importance of observability
● Introduction to Keptn
● Demo

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Why do we need SRE?
● Developers are generally focused on innovation and agility
● Ops focus on availability and stability
● SRE aims to increase the reliability of our systems and to avoid conflicts between Devs and Ops

Slide 7

Slide 7 text

SLI: Service Level Indicator
An indicator that lets you understand the state of your system or of your users.
SLI = (Good Events / Valid Events) × 100 %
Example: HTTP request latency = (# of HTTP requests with <= 5 sec response time / total # of requests) × 100 %
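The ratio on this slide can be sketched in a few lines of Python. This is a minimal illustration and not code from the talk: the function name `latency_sli`, its parameters, and the sample latencies are assumptions; only the 5-second threshold and the good/valid formula come from the slide.

```python
def latency_sli(latencies_seconds, threshold_seconds=5.0):
    """Return the SLI in percent: good events / valid events * 100.

    Every measured request counts as a valid event (an assumption);
    a request is "good" when it completes within the threshold.
    """
    if not latencies_seconds:
        raise ValueError("the SLI is undefined without valid events")
    good = sum(1 for latency in latencies_seconds if latency <= threshold_seconds)
    return 100.0 * good / len(latencies_seconds)


# Hypothetical sample: three of the four requests finish within 5 seconds.
sample = [0.4, 1.2, 5.0, 7.5]
print(latency_sli(sample))  # 75.0
```

In practice the counts would come from a metrics backend rather than an in-memory list; the list only keeps the sketch self-contained.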

Slide 8

Slide 8 text

SLO: Service Level Objective
An objective associated with your indicator.
Example: request latency will be <= 5 secs for 95% of requests.
[Gauge from 0 % to 100 % with the SLO marked at 95 %: # of HTTP requests with <= 5 sec response time]
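Checking an objective then comes down to comparing the measured indicator with the target. The sketch below assumes the good and valid event counts are already known; `slo_met` and the sample counts are hypothetical, and the 95 % target is the one from the slide.

```python
def slo_met(good_events, valid_events, target_percent=95.0):
    """Return True when the measured SLI reaches the SLO target."""
    if valid_events <= 0:
        raise ValueError("the SLO cannot be evaluated without valid events")
    sli_percent = 100.0 * good_events / valid_events
    return sli_percent >= target_percent


# Hypothetical counts of requests answered within 5 seconds.
print(slo_met(good_events=960, valid_events=1000))  # True: 96 % >= 95 %
print(slo_met(good_events=900, valid_events=1000))  # False: 90 % < 95 %
```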

Slide 9

Slide 9 text

SLI/SLO helps you define objectives
● Product owners define the objectives for each service at a very early stage
● SLI/SLO helps to validate:
○ Availability
○ Performance
○ …etc
● SLI/SLO helps you detect issues before your end users do
● Your objectives need to be achievable because your error budget is based on them
[Figure labels: Production, performance, availability]
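The slide ties the error budget to the objective without spelling out the arithmetic. A common reading, assumed here, is that the budget is the share of events allowed to miss the objective, that is 100 % minus the SLO target. The function `error_budget` and the sample numbers are hypothetical.

```python
def error_budget(valid_events, bad_events, target_percent):
    """Return (allowed bad events, remaining budget, budget consumed in percent).

    Assumes the error budget is the fraction of events allowed to miss
    the objective, i.e. 100 % minus the SLO target.
    """
    allowed_bad = valid_events * (100.0 - target_percent) / 100.0
    remaining = allowed_bad - bad_events
    consumed_percent = 100.0 * bad_events / allowed_bad if allowed_bad else float("inf")
    return allowed_bad, remaining, consumed_percent


# Hypothetical month: 10,000 requests, 120 of them too slow, 95 % target.
allowed, remaining, consumed = error_budget(10_000, 120, 95.0)
print(allowed, remaining, consumed)  # 500.0 380.0 24.0
```

An objective of 100 % leaves no budget at all, which is one way to see why the targets have to be achievable.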

Slide 10

Slide 10 text

Remove toil. A typical day for an SRE: Operations 50%, Dev 50%

Slide 11

Slide 11 text

Remove toil. A typical day for an SRE: Operations 25%, Dev 50%, ??? 25%

Slide 12

Slide 12 text

Observability

Slide 13

Slide 13 text

The pillars of observability: logs, events, metrics, traces

Slide 14

Slide 14 text

The CNCF landscape https://landscape.cncf.io/

Slide 15

Slide 15 text

The reality… https://twitter.com/dastbe/status/1303858170155081728

Slide 16

Slide 16 text

Open Observability

Slide 17

Slide 17 text

Open Observability. Standards

Slide 18

Slide 18 text

Prometheus: metrics provider

Slide 19

Slide 19 text

The Prometheus architecture [Diagram labels: kube-state-metrics, Node exporter, cAdvisor, Alertmanager, scrape, Prometheus server, PromQL]

Slide 20

Slide 20 text

Prometheus is a standard
Databases: CouchDB, MySQL, Oracle, PostgreSQL, MongoDB, …
Hardware: Netgear, Windows, IBM Z, Nvidia, …etc
Messaging: MQ, Kafka, MQTT, RabbitMQ, …etc
Storage: Tivoli, Hadoop, NetApp, ScaleIO
Other: Jira, Jenkins, GitHub, Fluentd, Nagios, …etc

Slide 21

Slide 21 text

Automate: the key for your SREs

Slide 22

Slide 22 text

Modern DevOps systems ● CI/CD ● Production ● Staging

Slide 23

Slide 23 text

Operations at scale
• Complex configurations
• Repetition
• Configuration drift
Maintenance is costly
Automation spaghetti

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Keptn: data-driven delivery and operations for your cloud-native applications https://keptn.sh

Slide 27

Slide 27 text

Orchestration for your applications: Control Plane, CloudEvents

Slide 28

Slide 28 text

Keptn
● A CNCF project
● Control plane, admin frontend/CLI
● Observability, dashboards & alerting
● SLO-driven multistage delivery
● Operations & remediation

Slide 29

Slide 29 text

SLO Evaluation & Monitoring 4,000+ apps Metrics / SLI Providers Notifications Auto-remediation

Slide 30

Slide 30 text

Keptn: SLO-driven automation for DevOps & SREs
● You (Dev/Ops/SRE): add your configuration, pick a use case, connect your tools
● Use cases: SLO Quality Gates, Progressive Delivery, Auto-Remediation
● Configuration: shipyard, SLI/SLO, runbook, SRE Automation, workload
● Keptn automates configuration and offers self-service solutions for: Monitoring, Delivery, Reliability, Remediation
● Based on an event-driven process enabling: Declaration, GitOps, SLOs, Standards

Slide 31

Slide 31 text

Triggers an automation sequence. Orchestration: monitoring, deployment, test, SLO evaluation & remediation. Use your own tools without any configuration or integration effort

Slide 32

Slide 32 text

Keptn is extensible https://artifacthub.io/packages/search?ts_query_web=Keptn

Slide 33

Slide 33 text

Keptn integrates with other tools: CLI / REST API

Slide 34

Slide 34 text

Demo Time!

Slide 35

Slide 35 text

Keptn and Prometheus [Diagram labels: SLO Evaluation & Monitoring, Prometheus Integration Service, Your App, Auto-remediation loop]

Slide 36

Slide 36 text

Quickstart Guide: keptn.sh/docs/quickstart Prerequisites: ● Docker+K3D or K3s ● 12GB+ RAM

Slide 37

Slide 37 text

Is It Observable ● If you are looking for educational content on observability, check out: Is It Observable

Slide 38

Slide 38 text

It is observable. Thanks, Keptn Obvious! keptn.sh Observable SLOs with Prometheus and Keptn

Slide 39

Slide 39 text

The End

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

A lot of manual work
● ~90% of test reruns
● 9:1 ratio script maintenance vs creation
● only 10% projects performance tested
● ~ 80% time spent in manual ...
● 15-20 tests / year
● < 5 Apps
[Figure labels: Test Result Analysis, Monitoring Configuration, Scripts Creation, SLO Report Generation]
"We are limited in scaling SRE due to manual expert tasks!" Roman Ferstl, Managing Director

Slide 42

Slide 42 text

Adopters

Slide 43

Slide 43 text

Keptn is for engineers https://keptn.sh

Slide 44

Slide 44 text

DevOps & SREs: reliability and efficiency
● DevOps: Automate Speed of Delivery
● SRE: Automate Resiliency of Operations
● Deployment Frequency: how often an organization successfully releases to production
● Lead Time for Changes: the amount of time it takes a commit to get into production
● Change Failure Rate: the percentage of deployments causing a failure in production
● Time to Restore Service: how long it takes an organization to recover from a failure in production
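The four metrics on this slide can be computed from a plain deployment log. The sketch below is only an illustration under stated assumptions: the `Deployment` record, its field names, the helper functions, and the sample week are all hypothetical; the metric definitions follow the wording of the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def deployment_frequency(deployments, period_days):
    """Deployments per day over the observed period."""
    return len(deployments) / period_days


def lead_time_hours(deployments):
    """Mean time from commit to production, in hours."""
    return mean((d.deployed_at - d.committed_at).total_seconds() / 3600
                for d in deployments)


def change_failure_rate(deployments):
    """Percentage of deployments that caused a failure in production."""
    return 100.0 * sum(d.failed for d in deployments) / len(deployments)


def time_to_restore_hours(deployments):
    """Mean time from a failed deployment to restored service, in hours."""
    durations = [(d.restored_at - d.deployed_at).total_seconds() / 3600
                 for d in deployments if d.failed and d.restored_at is not None]
    return mean(durations) if durations else 0.0


# Hypothetical week with four deployments, one of which failed.
start = datetime(2022, 4, 11, 9, 0)
deployments = [
    Deployment(start, start + timedelta(hours=2)),
    Deployment(start + timedelta(days=1), start + timedelta(days=1, hours=4)),
    Deployment(start + timedelta(days=2), start + timedelta(days=2, hours=6),
               failed=True,
               restored_at=start + timedelta(days=2, hours=7, minutes=30)),
    Deployment(start + timedelta(days=3), start + timedelta(days=3, hours=8)),
]

print(deployment_frequency(deployments, period_days=7))
print(lead_time_hours(deployments))        # 5.0
print(change_failure_rate(deployments))    # 25.0
print(time_to_restore_hours(deployments))  # 1.5
```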

Slide 45

Slide 45 text

Is it observable?

Slide 46

Slide 46 text

Learning from Google's SRE Practices
● Service Level Indicators (SLIs)
○ Definition: Measurable Metrics as the base for evaluation
○ Example: Error Rate of Login Requests
● Service Level Objectives (SLOs)
○ Definition: Binding targets for Service Level Indicators
○ Example: Login Error Rate must be less than 2% over a 30 day period
● Service Level Agreements (SLAs)
○ Definition: Business Agreement between consumer and provider typically based on SLO
○ Example: Logins must be reliable & fast (Error Rate, Response Time, Throughput) 99% within a 30 day window
● Google Cloud YouTube Video
○ SLIs, SLOs, SLAs, oh my! (class SRE implements DevOps): https://www.youtube.com/watch?v=tEylFyxbDLE
SLIs drive SLOs which inform SLAs
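The login example adds a time window to the objective. A minimal sketch of that evaluation, assuming login attempts are available as (timestamp, succeeded) pairs, could look as follows; `error_rate_in_window` and the sample events are hypothetical, while the 2 % limit and the 30-day period come from the slide.

```python
from datetime import datetime, timedelta


def error_rate_in_window(events, now, window=timedelta(days=30)):
    """Return the error rate in percent over the trailing window.

    `events` is an iterable of (timestamp, succeeded) pairs; events
    outside the window are ignored.
    """
    recent = [succeeded for timestamp, succeeded in events
              if now - window <= timestamp <= now]
    if not recent:
        raise ValueError("no events inside the evaluation window")
    failures = sum(1 for succeeded in recent if not succeeded)
    return 100.0 * failures / len(recent)


# Hypothetical login log: 100 recent attempts with 1 failure,
# plus 10 older failures that fall outside the 30-day window.
now = datetime(2022, 4, 20)
events = (
    [(now - timedelta(days=1), True)] * 99
    + [(now - timedelta(days=2), False)]
    + [(now - timedelta(days=45), False)] * 10
)

rate = error_rate_in_window(events, now)
print(rate, rate < 2.0)  # 1.0 True
```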

Slide 47

Slide 47 text

Takeaways
● Keptn: data-driven delivery and operations for your cloud-native apps
● Keptn is not a CI/CD tool. It takes observability to the next level
● Keptn ❤ Prometheus
● Keptn handles the automation for your apps

Slide 48

Slide 48 text

References
● Website: keptn.sh
● Quickstart guide: keptn.sh/docs/quickstart
● Tutorials: tutorials.keptn.sh
● Keptn in French: https://www.youtube.com/playlist?list=PL6i801Rjt9DbMZMaRxkbXS7AC5nQy5ipz

Slide 49

Slide 49 text

More tutorials! ● Prometheus ● Dynatrace ● ArgoCD ● Jenkins ● Coming soon: Datadog. tutorials.keptn.sh

Slide 50

Slide 50 text

More about Keptn Techworld with Nana, March 2022 https://www.youtube.com/watch?v=3EEZmSwMXp8

Slide 51

Slide 51 text

Get it at keptn.sh!!! Keptn 100% OFF* * unlimited offer

Slide 52

Slide 52 text

Subscribe! isitobservable.io

Slide 53

Slide 53 text

keptn.sh/community/#slack

Slide 54

Slide 54 text

BACKUP

Slide 55

Slide 55 text

Our example
● Dynatrace is a Software Intelligence company
● We run services at large scale, SaaS included
● We adopted Site Reliability Engineering (SRE)
● Automation everywhere

Slide 56

Slide 56 text

Why did we start Keptn? Application monitoring (APM), automation, alerts, our systems

Slide 57

Slide 57 text

Our problems
● System scale
● Thousands of metrics (1000+)
● Complex Service Level Indicators (SLIs)
● Complex decision-making logic
● Complex integration testing

Slide 58

Slide 58 text

Why did we start Keptn? Application monitoring (APM), automation, alerts, our systems

Slide 59

Slide 59 text

Keptn architecture: Architecture | keptn | Cloud-native application life-cycle orchestration

Slide 60

Slide 60 text

Contributing to Keptn ● We are looking for contributors! ● keptn.sh/community/contributing ○ K8s, Golang, Javascript, Documentation, etc. ○ SRE and Operations ● We participate in Google Summer of Code ● Slack: keptn.sh/community/#slack

Slide 61

Slide 61 text

Join us online ● Zoom => CNCF Community Portal ● community.cncf.io/keptn-community ● Powered by Bevy ● Videos go to YouTube

Slide 62

Slide 62 text

… and at KubeCon! May 16-20, 2022 https://docs.google.com/presentation/d/1SzwJD_1f9ufy_hbHGJD8N6nSKtuj80ysI_Srmx-Or5M/edit?usp=sharing

Slide 66

Slide 66 text

Agenda ● Observability 101 ● SLIs / SLOs ● Introduction to Keptn ● Demo

Slide 67

Slide 67 text

No content

Slide 68

Slide 68 text

SRE: recipe for success
1. Observability
2. Metrics
3. Automation

Slide 71

Slide 71 text

The reality... https://twitter.com/dastbe/status/1303858170155081728

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

The Prometheus architecture [Diagram labels: kube-state-metrics, Node exporter, cAdvisor, Alertmanager, scrape, Prometheus server, PromQL]

Slide 77

Slide 77 text

Prometheus is becoming a standard
Databases: CouchDB, MySQL, Oracle, PostgreSQL, MongoDB, …
Hardware: Netgear, Windows, IBM Z, Nvidia, …etc
Messaging: MQ, Kafka, MQTT, RabbitMQ, …etc
Storage: Tivoli, Hadoop, NetApp, ScaleIO
Other: Jira, Jenkins, GitHub, Fluentd, Nagios, …etc

Slide 78

Slide 78 text

Our objectives: SRE / SLI / SLO

Slide 79

Slide 79 text

Why do we need SRE?
● Developers are generally focused on innovation and agility
● Ops focus on availability and stability
● SRE aims to increase the reliability of our systems and to avoid conflicts between Devs and Ops
[Figure labels: Developer, Operation]

Slide 80

Slide 80 text

SLI/SLO helps you define objectives
● Product owners define the objectives for each service at a very early stage
● SLI/SLO helps to validate:
○ Availability
○ Performance
○ …etc
● SLI/SLO helps you detect issues before your end users do
● Your objectives need to be achievable because your error budget is based on them
[Figure labels: Operation, Production, speed, availability]

Slide 85

Slide 85 text

We, DevOps & SREs, need to deliver faster and better!!
● DevOps: Automate Speed of Delivery
● SRE: Automate Resiliency of Operations
● Deployment Frequency: how often an organization successfully releases to production
● Lead Time for Changes: the amount of time it takes a commit to get into production
● Change Failure Rate: the percentage of deployments causing a failure in production
● Time to Restore Service: how long it takes an organization to recover from a failure in production

Slide 89

Slide 89 text

Operations at scale
Automation spaghetti
• Complex configurations
• Repetition
• Configuration drift
Maintenance is expensive

Slide 90

Slide 90 text

No content

Slide 91

Slide 91 text

No content

Slide 96

Slide 96 text

Keptn: SLO-Driven Automation for DevOps & SREs
● You (Dev/Ops/SRE): bring your configuration, pick your use case, connect your tools
● Use cases: SLO Quality Gates, Progressive Delivery, Auto-Remediation
● Configuration: shipyard, SLI/SLO, runbook, SRE Automation, workload
● Keptn automates configuration and provides self-service for: Monitoring, Delivery, Reliability, Remediation
● Through event-driven process orchestration based on: Declaration, GitOps, SLOs, Standards

Slide 97

Slide 97 text

Triggers an automation sequence. Orchestrates monitoring config, deployment, test execution, SLO evaluation & remediation. Leverage your existing tooling without writing the automation & integration

Slide 106

Slide 106 text

The End!
