Prometheus Operator, a tale about container monitoring at iFood (volume I)

Slide 1

Slide 1 text

Prometheus Operator A tale about container monitoring at iFood Volume I by: Daniel Requena

Slide 2

Slide 2 text

Index
Editor's Note
Foreword
Chapter I - Bare Metal Dungeons
Chapter II - Cloudy Mountains
Chapter III - The ignorance desert
Chapter IV - Container Harbor
Chapter V - The next adventure
References
About the author

Slide 3

Slide 3 text

Editor's Note
● This tale is based on facts.
● Characters and fictional elements have been added for the sake of better storytelling.
● The main character is a chimera of iFood's SRE team over the past 6 years or so.
● The only focus of this presentation is monitoring.

Slide 4

Slide 4 text

Foreword
The slides for this adventure are already available at https://speakerdeck.com/drequena/

Slide 5

Slide 5 text

Chapter I Bare Metal Dungeons

Slide 6

Slide 6 text

Chapter I - Bare Metal Dungeons

Slide 7

Slide 7 text

Chapter I - Bare Metal Dungeons

Slide 8

Slide 8 text

Chapter I - Bare Metal Dungeons
Class: Human
Race : Warrior
Level: 10
XP : 2 years
-------------
Int : #####
Str : ########
Dex : ###
STA : #
Tools:
. Linux (int)
. Networking (int)
. Bash (int)
. Python (bgn)
. Zabbix (bgn)

Slide 9

Slide 9 text

Chapter I - Bare Metal Dungeons
Scrum Master, the Bard: Hello, fellow Linus! I'm ScrumMaster, the Bard! Are you looking for a great monitoring adventure? I heard that beyond the dark forest there is a site called Bare Metal Dungeons, a place full of big challenges: racks, servers, VMs, switches, etc. A big reward in eXPerience is promised to the hero who answers the call. Are you interested?

Slide 10

Slide 10 text

Chapter I - Bare Metal Dungeons
Linus, the Sysadmin: Hello, my good friend! A fair adventure you are proposing to me. I shall accept this challenge right away! To the Bare Metal Dungeons, I SAY! Accept my LinkedIn profile as a form of gratitude. Thanks for the opportunity, I hope to see you soon. Goodbye!

Slide 11

Slide 11 text

Chapter I - Bare Metal Dungeons

Slide 12

Slide 12 text

Chapter I - Bare Metal Dungeons

Slide 13

Slide 13 text

Chapter I - Bare Metal Dungeons
The Bare Metal Dungeons landscape
● Physical servers
● Manually provisioned VMs
● Network devices
● Databases
● Web servers
● Monolithic app
● Few users

Slide 14

Slide 14 text

Chapter I - Bare Metal Dungeons The Bare Metal Dungeons landscape ● Weapon of choice: Zabbix ○ Basic templates ○ Custom Templates ○ Bash scripts ○ Python integrations ○ E-mail alerts ○ ...

Slide 15

Slide 15 text

Chapter I - Bare Metal Dungeons The Bare Metal Dungeons landscape ● At the end... ○ More dynamic workloads ■ LLD ○ A lot of Custom Items ○ Some Web Scenarios (urg!) ○ Starting to use Chef ○ ...

Slide 16

Slide 16 text

Chapter I - Bare Metal Dungeons
Class: Human
Race : Warrior
Level: 15
XP : 4 years
-------------
Int : ######
Str : ##########
Dex : ####
STA : #####
Tools:
. Linux (adv)
. Networking (int)
. Bash (adv)
. Python (int)
. Zabbix (int)
. Chef (bgn)
. AWS (bgn)

Slide 17

Slide 17 text

Chapter II Cloudy Mountains

Slide 18

Slide 18 text

Chapter II - Cloudy Mountains
Scrum Master, the Bard: Hello again, my dear friend Linus! Congrats on the success of your last mission! I believe you were looking forward to a vacation, am I right? However, a new quest awaits you! All applications and systems are now moving to the Cloudy Mountains. Your mission is to support the monitoring tasks after the brave DEV teams break down the Monolithic Dragon. For that, A LOT of servers will be created.

Slide 19

Slide 19 text

Chapter II - Cloudy Mountains
Linus, the Sysadmin: Greetings, my underoccupied friend! A new quest, you say! I must accept right away again. So, the Monolithic Dragon shall be broken down, right? And you said something about the creation of a lot of servers too. Tell me, my slacker friend, how many servers are you talking about? 70?

Slide 20

Slide 20 text

Chapter II - Cloudy Mountains
Scrum Master, the Bard: A little more.

Slide 21

Slide 21 text

Chapter II - Cloudy Mountains
Linus, the Sysadmin: 140 servers?

Slide 22

Slide 22 text

Chapter II - Cloudy Mountains
Scrum Master, the Bard: A little more.

Slide 23

Slide 23 text

Chapter II - Cloudy Mountains
Linus, the Sysadmin: Oh wow! More than 140? Are we talking about 400 servers?

Slide 24

Slide 24 text

Chapter II - Cloudy Mountains
Scrum Master, the Bard: There will be at least 1000 servers. Many will be created dynamically by ASGs and others will be created in the AWS console with no prior notice. Oh! And you shall monitor all kinds of AWS components too.

Slide 25

Slide 25 text

Chapter II - Cloudy Mountains
Linus, the Sysadmin: You know what? I'm starting to reconsider our friendship, dude.

Slide 26

Slide 26 text

Chapter II - Cloudy Mountains

Slide 27

Slide 27 text

Chapter II - Cloudy Mountains

Slide 28

Slide 28 text

Chapter II - Cloudy Mountains Cloudy Mountains landscape ● Migrate to Cloud: "Lift and shift". ○ Monolith at large scales ○ ASGs ○ HTTP routing ○ Big database ● Zabbix was still the only weapon of choice. ○ Dynamically registering hosts ○ Monitoring only infrastructure

Slide 29

Slide 29 text

Chapter II - Cloudy Mountains
Cloudy Mountains landscape
● The slow killing of the Monolithic Dragon.
○ More instances (ASG)
○ SQS/SNS
○ More databases - ro/rw (ASG)
○ Load balancers
○ Buckets
○ DynamoDB tables
○ Lambdas
○ ElastiCache systems

Slide 30

Slide 30 text

Chapter II - Cloudy Mountains
Cloudy Mountains landscape
● Zabbix became complex and was no longer the only weapon
○ Adding and removing hosts, LBs, SQS, etc...
○ API throttling
○ Ghost hosts and items
■ False alarms
○ Async processes (SQS/SNS/Lambda)
● New weapons
○ CloudWatch
○ Lambdas

Slide 31

Slide 31 text

Chapter II - Cloudy Mountains

Slide 32

Slide 32 text

Chapter II - Cloudy Mountains
Class: Human
Race : Warrior
Level: 19
XP : 6 years
-------------
Int : ########
Str : ############
Dex : ######
STA : #######
Tools:
. Linux (adv)
. Networking (int)
. Bash (adv)
. Python (int)
. Zabbix (int)
. Chef (adv)
. AWS (adv)
. Terraform (adv)

Slide 33

Slide 33 text

Chapter III The ignorance desert

Slide 34

Slide 34 text

Chapter III - The ignorance desert
Scrum Master, the Bard: Linus! My old friend! Long time no see… You know, I was wondering...

Slide 35

Slide 35 text

Chapter III - The ignorance desert
Linus, the Sysadmin: Oh, now WHAT?! Tell me WHAT THE F*CK you are putting me into now!

Slide 36

Slide 36 text

Chapter III - The ignorance desert
Scrum Master, the Bard: WOW! So aggressive. Calm down. Have you heard about the Kubernetes fever? Apparently it is the NEW silver bullet for all problems. All apps are now moving to Kubernetes and none of the previous monitoring solutions are good for it. Your quest is to monitor Kubernetes itself and also all the apps inside it. You must hurry to the container harbor before the ships start to depart.

Slide 37

Slide 37 text

Chapter III - The ignorance desert
Linus, the Sysadmin: IIRC, containers can be created and destroyed in seconds! No tool that I'm aware of can handle that kind of elastic workload! I must wander through the desert of ignorance in order to find a suitable tool for this situation. Wish me luck! And the next time we see each other, just don't talk to me anymore.

Slide 38

Slide 38 text

Chapter III - The ignorance desert

Slide 39

Slide 39 text

Chapter III - The ignorance desert

Slide 40

Slide 40 text

Chapter III - The ignorance desert
Prometheus
○ OpenSource
○ Lightweight
○ Simple Architecture
○ Pull based
○ "Agentless"
○ TSDB based (fast and small storage usage)
○ HTTP based
○ Powerful query language (PromQL)
○ YAML file config
○ Service Discovery (*)
Alert Manager
○ Routing
○ Grouping
○ Deduplication
○ Notifying
○ Integrations

Slide 41

Slide 41 text

Chapter III - The ignorance desert
Diagram 1: Prometheus (TSDB) scrapes the APP directly with GET /metrics.
Diagram 2: Prometheus (TSDB) scrapes an exporter with GET /metrics; the exporter gets the metrics from the app somehow and replies in the Prometheus metrics standard.
Sample metric:
kube_deployment_spec_replicas{deployment="coredns",endpoint="http",instance="100.108.141.184:8080",job="kube-state-metrics",namespace="kube-system",pod="prometheus-operator-kube-state-metrics-78fb6c979-nxrlc",service="prometheus-operator-kube-state-metrics"} 2

Slide 42

Slide 42 text

Chapter III - The ignorance desert
Diagram: Prometheus (TSDB) -> GET metrics from APP (metric = 11) -> Store -> Check rules -> Result sent to Alert Manager -> #Alerts ("Condition Error!", later "Solved!")
Rules:
  avg(metric) > 10
  label: critical
Alert Manager config:
route:
  receiver: slack-general
  group_by:
    - job
  routes:
    - receiver: slack-integration
      match:
        severity: critical
      continue: true
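The rule on this slide could be written as a Prometheus rule file roughly as follows. This is a minimal sketch: the group name, alert name, `for` duration and annotation text are assumptions made for illustration, while the expression and the `severity: critical` label (the one the Alert Manager route matches on) come from the slide.

```yaml
# Sketch of a Prometheus rule file; names marked "hypothetical" are not from the talk.
groups:
  - name: example.rules              # hypothetical group name
    rules:
      - alert: MetricAboveThreshold  # hypothetical alert name
        expr: avg(metric) > 10       # condition shown on the slide
        for: 5m                      # hypothetical: condition must hold this long before firing
        labels:
          severity: critical         # matched by the "slack-integration" route
        annotations:
          summary: "Average of metric is above 10"
```

With the route shown on the slide, an alert carrying `severity: critical` goes to `slack-integration` and, because of `continue: true`, keeps being evaluated against any sibling routes that follow.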

Slide 43

Slide 43 text

Chapter III - The ignorance desert
Prometheus SD sources: Consul, K8S, File, AWS, DNS
- job_name: monitoring/myapp/0
  scrape_interval: 30s
  metrics_path: /metrics
  kubernetes_sd_configs:
    - role: endpoints
      namespaces:
        names:
          - app
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_label_app]
      separator: ;
      regex: app
      replacement: $1
      action: keep
App Endpoints:
- 100.101.30.1
- 100.101.30.2
- 100.101.30.3
- 100.101.30.4
Diagram: Prometheus (TSDB) sends GET /metrics to each discovered pod: app-pod1, app-pod2, app-pod3, app-pod4.

Slide 44

Slide 44 text

Chapter III - The ignorance desert

Slide 45

Slide 45 text

Chapter III - The ignorance desert
The Prometheus DIY problem
● Prometheus
● AlertManager
● Grafana
● K8s monitoring
○ Node Exporter
○ Kube state metrics
○ Internal components
■ API server
■ Etcd
■ CoreDNS
■ Kubelet
■ Controllers
■ Schedulers
■ KubeProxy
■ CNI
Custom apps
Grafana Dashboards
Custom Alerts
Multiple Prometheus instances
Custom Rules
AlertManager Cluster
Defining all AS CODE
Validation Pipelines
Updates / Upgrades

Slide 46

Slide 46 text

Chapter III - The ignorance desert Operator Witch

Slide 47

Slide 47 text

Chapter III - The ignorance desert
Operator Witch: I can feel your despair, my young warrior. But fear not! 'Cause I bring good news to you. I will teach you an old but very powerful spell so you can summon a complete Prometheus stack. With that, your Kubernetes cluster shall be monitored in an extensible and flexible way. But beware, my dear sysadmin! "Easy come, easy go".

Slide 48

Slide 48 text

Prometheus Operator Chapter III - The ignorance desert

Slide 49

Slide 49 text

Chapter III - The ignorance desert
Prometheus Operator?
An operator is a pattern in which a piece of software, or even a whole platform, is configured, provisioned and managed using Kubernetes objects (usually CRDs). That pattern gives Kubernetes users and administrators flexibility and a single "language" to use. Operators are normally composed of custom controllers that handle the defined CRDs and act on them, converging the state of the managed software to the desired state, just like a standard k8s object.
○ Operator examples:
■ Jenkins, Mongo, MySQL, Cassandra, Spark and many more...

Slide 50

Slide 50 text

Chapter III - The ignorance desert
Prometheus Operator* (* Helm chart)
○ Prometheus Operator
○ Prometheus
○ AlertManager
○ Grafana
○ Node Exporter
○ Kube State Metrics
○ Prebaked Alerts
○ Cluster Dashboards
○ Kubernetes monitoring
■ API
■ Controllers
■ Schedulers
■ CoreDNS
■ CNI
■ Kubelet
■ KubeProxy

Slide 51

Slide 51 text

Chapter III - The ignorance desert
Prometheus Operator
○ CRDs
■ Prometheus
■ Alertmanager
■ PrometheusRule *
■ ServiceMonitor *
■ PodMonitor (iirc still in Beta)
Each CRD has its own API spec (see doc)

Slide 52

Slide 52 text

Chapter III - The ignorance desert
Prometheus Operator
PrometheusRule and ServiceMonitor CRDs
- They bring the flexibility to monitor and alert on any application (or exporter) that can expose metrics in the Prometheus standard, without the SD syntax hell and the operational burden of validating and merging configuration and reloading daemons.
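As an illustration of these two CRDs, here is a minimal sketch for a hypothetical app. It assumes a Service named `myapp` in namespace `app`, labelled `app: myapp`, with a port named `http`; all of these names, the alert itself and its `for` duration are made up for the example. Whether extra labels are needed on these objects depends on the selectors configured on the Prometheus object.

```yaml
# Sketch only: every name below is hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: app
spec:
  selector:
    matchLabels:
      app: myapp          # selects Services carrying this label
  endpoints:
    - port: http          # name of the Service port exposing metrics
      path: /metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: myapp-rules
  namespace: app
spec:
  groups:
    - name: myapp.rules
      rules:
        - alert: MyAppTargetDown
          expr: up{job="myapp"} == 0   # assumes the job label defaults to the Service name
          for: 5m
          labels:
            severity: critical
```

The operator turns objects like these into scrape and rule configuration and reloads Prometheus, so nobody edits the generated configuration by hand.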

Slide 53

Slide 53 text

Prometheus Operator Chapter III - The ignorance desert

Slide 54

Slide 54 text

Prometheus Operator Chapter III - The ignorance desert

Slide 55

Slide 55 text

Prometheus Operator Chapter III - The ignorance desert

Slide 56

Slide 56 text

Prometheus Operator ○ AlertManager and Prometheus ■ Resources ■ Storage ■ Replicas ■ Retention ■ Namespace and object selectors ■ LogLevel ■ Docker image/tag (*) ■ And many more... Chapter III - The ignorance desert

Slide 57

Slide 57 text

Chapter III - The ignorance desert
Prometheus Operator (Installing)
root@w42:~# helm install --namespace monitoring stable/prometheus-operator

Slide 58

Slide 58 text

Chapter III - The ignorance desert
Linus, the Sysadmin: Oh! So wonderful! An eternal debt to you I have, milady! I didn't quite understand your last verse, I must admit. But since the stack is all set, who cares? I must go now! Thank you very much!

Slide 59

Slide 59 text

Chapter II - Cloudy Mountains
Class: Human
Race : Warrior
Level: 19
XP : 6 years
-------------
Int : ###########
Str : ###############
Dex : #########
STA : #########
COURAGE BOOST (#)
Tools:
. Linux (adv)
. Networking (int)
. Bash (adv)
. Python (int)
. Zabbix (int)
. Chef (adv)
. AWS (adv)
. Terraform (adv)

Slide 60

Slide 60 text

Chapter IV Container Harbor

Slide 61

Slide 61 text

Chapter IV - Container Harbor

Slide 62

Slide 62 text

Chapter IV - Container Harbor
Container Harbor landscape
○ Dozens of Kubernetes clusters (kops)
○ Prometheus Operator (helm)
■ Prometheus and Alertmanager
● Exposed by Ingress (web interfaces)
● 45 days of retention
● 500m CPU / 4G RAM
■ Slack as AlertManager receiver
● Custom message templates
■ Grafana and dashboards with Prometheus as data source
○ Jenkins as CI/CD system for apps
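The retention and sizing figures in this landscape map onto the Prometheus object roughly as in the sketch below. It assumes the CPU and memory figures are resource requests (the slide does not say whether they are requests or limits) and that "4G" means 4Gi; the object name, namespace, log level and the empty selectors are hypothetical.

```yaml
# Sketch of a Prometheus custom resource; name, namespace and selectors are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 1
  retention: 45d              # "45 days of retention"
  logLevel: info
  resources:
    requests:                 # assumption: the slide's figures are requests
      cpu: 500m
      memory: 4Gi
  serviceMonitorSelector: {}  # empty selector: pick up every ServiceMonitor in the selected namespaces
  ruleSelector: {}            # same for PrometheusRule objects
```

When the stack is installed through the Helm chart, these fields are normally set through the chart values rather than by writing the object directly.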

Slide 63

Slide 63 text

Container Harbor - The happy Path ○ Install Prometheus Operator ○ Monitor entire cluster by default ○ Monitor internal Apps easily ○ Create custom Dashboards ○ Create custom alerts ○ "And Bob is your uncle!" Chapter IV - Container Harbor

Slide 64

Slide 64 text

Container Harbor - The happy Path Chapter IV - Container Harbor

Slide 65

Slide 65 text

Chapter IV - Container Harbor
Container Harbor - The REAL Path
○ Prometheus Targets:
■ etcd not monitored
■ Consequently not alarming.
The default ServiceMonitor from Prometheus Operator doesn't support etcd with TLS.
Solution: generate a signed key and certificate based on the K8s API, create a secret in the Prometheus Operator namespace, and reference the files in the chart values.

Slide 66

Slide 66 text

Chapter IV - Container Harbor
Container Harbor - The actual Path
Snippet from values.yaml:
kubeEtcd:
  serviceMonitor:
    caFile: /etc/prometheus/secrets/etcd-client/CA.crt
    certFile: /etc/prometheus/secrets/etcd-client/CERT.crt
    keyFile: /etc/prometheus/secrets/etcd-client/KEY.key
prometheus:
  prometheusSpec:
    secrets:
      -

Slide 67

Slide 67 text

Chapter IV - Container Harbor
Container Harbor - The actual Path
○ Prometheus Targets:
■ KubeProxy not monitored
■ Consequently not alarming.
By default, kops binds the KubeProxy metrics port to 127.0.0.1.
Solution: cluster.yaml (kops snippet)
...
kubeProxy:
  metricsBindAddress: 0.0.0.0
...

Slide 68

Slide 68 text

Chapter IV - Container Harbor
Container Harbor - The actual Path
○ Prometheus and AlertManager
■ Neither has an authentication method for the web UI
■ Especially important for AlertManager ⚠
Solution: use an auth solution at the Ingress layer.
- ldap-proxy (installed with Helm)
Grafana supports a .toml file for authenticating against LDAP ❤
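One way to put authentication in front of the AlertManager UI is an external-auth annotation on the Ingress. The sketch below assumes the ingress-nginx controller, which the talk does not confirm; the host name and the auth service URL are hypothetical, and the backend assumes the `alertmanager-operated` Service on port 9093.

```yaml
# Sketch only: assumes ingress-nginx; host and auth URL are hypothetical.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: alertmanager
  namespace: monitoring
  annotations:
    # Each request is first checked against this URL; a non-2xx answer denies access.
    nginx.ingress.kubernetes.io/auth-url: "http://ldap-proxy.monitoring.svc.cluster.local:5555"
spec:
  ingressClassName: nginx
  rules:
    - host: alertmanager.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: alertmanager-operated
                port:
                  number: 9093
```

The same pattern applies to the Prometheus UI by pointing a second Ingress at the Prometheus service.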

Slide 69

Slide 69 text

Chapter IV - Container Harbor
Container Harbor - The actual Path
○ Default alarms are uncalibrated.
○ Too many firings
■ ex: Pods CPU Throttling
○ "Wrong" classifications.
■ ex: Pods restarting too many times (Warning level)
Solution: a full alarm check-up. Rebuild the YAML file using jsonnet (good luck with that). Substitute the original alarms (service-monitors) with our calibrated ones.

Slide 70

Slide 70 text

Container Harbor. Chapter IV - Container Harbor

Slide 71

Slide 71 text

Chapter IV - Container Harbor
Container Harbor.
● ServiceMonitor and PrometheusRule files
○ Per app, in the repositories
○ Later, in our Helm chart
● Custom Grafana dashboards per app! ❤
○ configmap: labels -> grafana_dashboard: "1"
One "small" problem: some files were not appearing on our clusters.
Cause: syntax errors!
Fix: linters from the Prometheus Operator project (in pipelines).
Solved in a later Prometheus Operator version (admissionWebhook).
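A per-app dashboard shipped as a ConfigMap could look like the sketch below. The `grafana_dashboard: "1"` label is the one named on the slide; the ConfigMap name, namespace, file name and the tiny dashboard JSON are placeholders, and the sketch assumes Grafana is deployed with a dashboard sidecar watching for that label.

```yaml
# Sketch only: names and dashboard content are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: myapp-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"   # label the dashboard sidecar looks for
data:
  myapp-dashboard.json: |
    {
      "title": "My App",
      "panels": []
    }
```

Keeping this file next to the app's ServiceMonitor and PrometheusRule files means dashboards, scraping and alerts are versioned together with the app.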

Slide 72

Slide 72 text

Chapter IV - Container Harbor
Container Harbor.
Prometheus metrics became a thing!
● Lots of apps started to expose them.
● Why not expose them from EC2 apps too?
○ Where to store them?
○ Solution: EC2 SD based on instance tags:
- PrometheusScrape: true

Slide 73

Slide 73 text

Chapter IV - Container Harbor
Container Harbor.
Secret in Prometheus Operator's namespace:
name: prometheus-operator-prometheus-scrape-config
data:
  additional-scrape-configs.yaml:
    - ec2_sd_configs:
        - endpoint: ""
          filters:
            - name: tag:PrometheusScrape
              values:
                - "true"
      …
Reference on the Prometheus object:
spec:
  additionalScrapeConfigs:
    key: additional-scrape-configs.yaml
    name: prometheus-operator-prometheus-scrape-config

Slide 74

Slide 74 text

Chapter IV - Container Harbor
Container Harbor.
Custom app metrics: let's scale on them!
Solution: Prometheus Adapter (installed with Helm).
kubectl get APIService v1beta1.metrics.k8s.io ...
NAME                              SERVICE                          AVAILABLE
...
v1beta1.metrics.k8s.io            kube-system/metrics-server       True
...
v1beta1.custom.metrics.k8s.io     kube-system/prometheus-adapter   True
v1beta1.external.metrics.k8s.io   kube-system/prometheus-adapter   True

Slide 75

Slide 75 text

Chapter IV - Container Harbor
Container Harbor.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-to-scale
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metricName: Custom-metric-name
        targetValue: 1500
Diagram: Prometheus (TSDB) scrapes the APP with GET /metrics; Prometheus Adapter queries the metrics in Prometheus and serves them to the Kubernetes API through v1beta1.custom.metrics.k8s.io and v1beta1.external.metrics.k8s.io.

Slide 76

Slide 76 text

Chapter IV - Container Harbor
Container Harbor.
Scaling by external metrics, an example:
● Scaling by number of messages in SQS
● Tiamat
○ Collects queue stats and stores them in Prometheus.
○ Labels define queue IDs
○ Plans to support
■ Kafka topics
■ ...

Slide 77

Slide 77 text

Chapter IV - Container Harbor
Container Harbor.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-to-scale
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: External
      external:
        metricName: My_queue_size
        targetValue: 1500
Diagram: Tiamat exposes My_queue_size: 2300; Prometheus (TSDB) scrapes it with GET /metrics; Prometheus Adapter queries the metrics in Prometheus and serves them to the Kubernetes API through v1beta1.external.metrics.k8s.io.

Slide 78

Slide 78 text

Container Harbor. ● Many metrics (k8s and ec2 apps) ● Lots! lots! of Grafanas (datasource -> k8s prometheus) ● Gigantic queries! Chapter IV - Container Harbor

Slide 79

Slide 79 text

Chapter IV - Container Harbor
Container Harbor.
Solution: adjust resources: requests and limits (CPU and memory).
Use restrictive flags:
- --query.max-samples=30000000
- --query.max-concurrency=4
- --query.timeout=1m (use carefully)
Prometheus object:
query:
  maxConcurrency: 4
  maxSamples: 30000000
  timeout: 1m  # use carefully

Slide 80

Slide 80 text

Chapter IV - Container Harbor
Container Harbor.
Solution (plus)
● The last unfinished query is printed to the terminal (a clue, at least)
● Since 2.16, Prometheus has a query log option
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  query_log_file: /prometheus/query.log

Slide 81

Slide 81 text

Container Harbor. ● Scraped Metrics can also go wrong. ● Really wrong! ● Like knockout wrong Chapter IV - Container Harbor

Slide 82

Slide 82 text

Chapter IV - Container Harbor
Container Harbor.
Why? The cardinality hell.
Each unique set of labels in a metric is considered a new time series. Highly variable label values cause an explosion of resource consumption.
Example:
dns_query_count{deployment="myserver",endpoint="http",instance="100.108.141.184:8080",job="my_dns_app",namespace="dns",pod="dns-server-78fb6c979-nxrlc",query="assdfewr.onsite.com"} 234.03
Solution: query and look for a big spike!
rate(prometheus_tsdb_head_series_created_total[_PERIOD_])
Then fix the metrics.
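The spike query above can also be turned into an alert, so that a cardinality explosion is noticed before Prometheus falls over. This is a sketch under assumptions: the 5m window, the threshold of 1000 new series per second, the `for` duration and all names are hypothetical and would need calibration per cluster.

```yaml
# Sketch only: threshold, window and names are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cardinality-watch
  namespace: monitoring
spec:
  groups:
    - name: cardinality.rules
      rules:
        - alert: SeriesCreationSpike
          # Per-second rate of new head series over the last 5 minutes.
          expr: rate(prometheus_tsdb_head_series_created_total[5m]) > 1000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Prometheus is creating new time series unusually fast"
```

In the example metric, the `query` label takes a new value for every distinct DNS name, which is the kind of label that makes such an alert fire.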

Slide 83

Slide 83 text

Chapter IV - Container Harbor
Container Harbor.
Even with tuning, sometimes Prometheus explodes anyway
● Big queries for internal apps
● Cluster monitoring compromised
First Prometheus object:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
spec:
  ruleNamespaceSelector:
    matchExpressions:
      - key: tribe
        operator: NotIn
        values:
          - infra
  serviceMonitorNamespaceSelector:
    matchExpressions:
      - key: tribe
        operator: NotIn
        values:
          - infra
Second Prometheus object:
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
spec:
  ruleNamespaceSelector:
    matchExpressions:
      - key: tribe
        operator: NotIn
        values:
          - custom-apps
  serviceMonitorNamespaceSelector:
    matchExpressions:
      - key: tribe
        operator: NotIn
        values:
          - custom-apps

Slide 84

Slide 84 text

Chapter IV - Container Harbor
Container Harbor.
● "Solution" (more of a mitigation…)
Diagram: the infrastructure node groups hold Prometheus Infra, the K8s API, CoreDNS, ...; the other node groups hold Prometheus Apps, App1, App2, ...

Slide 85

Slide 85 text

Chapter IV - Container Harbor
Container Harbor.
AlertManager
● Responsible for routing alerts
● Single point of failure
● By default, just 1 replica.
● Capable of forming an HA cluster.
Solution (super easy): AlertManager object (CRD)
replicas: 3
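In full, the change amounts to one field on the Alertmanager custom resource. A minimal sketch, with a hypothetical object name and namespace:

```yaml
# Sketch only: name and namespace are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: main
  namespace: monitoring
spec:
  replicas: 3   # the operator runs three instances that form one cluster
```

The operator takes care of wiring the replicas together, which is why this fix is so much cheaper than the Prometheus HA problem on the next slides.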

Slide 86

Slide 86 text

Container Harbor. How about Prometheus HA? ● Prometheus does not create clusters Chapter IV - Container Harbor Prometheus tsdb APP-4 APP-3 APP-2 APP-1

Slide 87

Slide 87 text

Chapter IV - Container Harbor
Container Harbor.
How about Prometheus HA?
● More replicas.
○ Duplicate metrics
○ Doesn't solve 100% of the problem
● Super easy to implement
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  replicas: 2
  …
Diagram: two Prometheus instances, each with its own TSDB, both scrape APP-1 to APP-4.

Slide 88

Slide 88 text

Chapter IV - Container Harbor
Container Harbor.
● Solution:
○ Remote Write/Read
Diagram: each Prometheus scrapes APP1 to APP4 and reads from / writes to an external solution: an HA storage cluster backed by S3 that provides deduplication and long-term storage and fills gaps.
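On the Prometheus object, remote write and read are configured as lists of endpoints. The sketch below is illustrative only: both URLs are hypothetical placeholders, and the actual paths depend on the external solution chosen.

```yaml
# Sketch only: both URLs are hypothetical placeholders.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  remoteWrite:
    - url: http://remote-storage.example.internal/api/v1/push   # where samples are shipped
  remoteRead:
    - url: http://remote-storage.example.internal/api/v1/read   # where older data is queried from
```

With two replicas both writing, deduplication has to happen on the remote side, which is the role the diagram gives to the external solution.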

Slide 89

Slide 89 text

Chapter V The next adventure

Slide 90

Slide 90 text

Chapter V - The next adventure
Problems not solved yet...
● How to organize federated Prometheus
○ Number of Grafanas
● Flexible alerting with AlertManager
○ Today: fixed channels and routes
○ Hard to create new ones
● Remote Read / Remote Write (Cortex / Thanos)
○ HA
○ Metrics dedup
○ Long-term persistence

Slide 91

Slide 91 text

To be continued...

Slide 92

Slide 92 text

References
● https://prometheus.io/
● https://github.com/coreos/prometheus-operator
● https://github.com/helm/charts/tree/master/stable/prometheus-operator
● https://www.youtube.com/watch?v=pRmnh8lgjsU
● https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
● https://coreos.com/blog/introducing-operators.html
● https://operatorhub.io/
● https://github.com/tiagoapimenta/nginx-ldap-auth
● https://github.com/hugobcar/tiamat
● https://github.com/helm/charts/tree/master/stable/grafana
● https://github.com/DirectXMan12/k8s-prometheus-adapter
● https://cortexmetrics.io/
● https://www.youtube.com/watch?v=b_pEevMAC3I

Slide 93

Slide 93 text

About the author
Daniel Requena
@Daniel_Requena
github.com/drequena
@daniel_requena
speakerdeck.com/drequena
linkedin.com/in/danielrequena/