CNCF Bordeaux, FR - Scaling Prometheus

Adrien F
October 23, 2018

In this talk, we'll look at what Prometheus is, how it works, and what it can offer a company.
We'll also cover the various ways to scale Prometheus and the one we chose, Thanos, and walk through a deployment of Thanos at Cdiscount, the leading French e-commerce company.

Thanks!

Transcript

  1. Prometheus at Scale: a tale of observability in the new cloud native world

    With a usage example from the leading European e-commerce company: Cdiscount. Meetup CNCF Bordeaux #1 - Adrien Fillon, Cdiscount, October 2018
  2. What is "Observability"?

    Visualization + Monitoring (aggregatable), Tracing (request scoped), Logging (event based). Source: https://peter.bourgon.org/blog/2017/02/21/metrics-tracing-and-logging.html
  3. Why do I need monitoring?

    Is it working? Knowing instead of guessing. Drive technical and business decisions. As input for other systems (Big Data/BI, auto-remediation, anomaly detection).
  4. Main challenges

    Bad culture ("our code is perfect", "the client says it's running", etc.). Fear of everyone knowing where the flaws are. Monitoring tools are not perfect. Alerting and operational practices are hard to get right.
  5. Logging != Monitoring

    Logging: recording to diagnose a system.
        127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
    Monitoring: observation, checking and recording.
        http_requests_total{method="post",code="200"} 1027
  6. Once upon a time, there was Nagios

    And all was well, or was it?
        ~ # /usr/local/nagios/check_cloud_native.sh
        KO - Not everything is going well
    Does it still hold up with the development of cloud native technologies, with VMs popping up and down and container platforms juggling thousands of containers all the time?
  7. And then, Prometheus

    Built at SoundCloud around 2012 and inspired by Google's Borgmon system, Prometheus joined the CNCF as its second hosted project, just after Kubernetes.
  8. What's so special about Prometheus?

    Complete monitoring stack. Heavily inspired by Google's Borgmon. pull vs push paradigm. Alerting. Huge community. And of course, part of the CNCF!
  9. Service Discovery

    Azure, Consul, DNS, EC2, GCE, K8S, Marathon, and more... Automatic update of targets as they come and go. Powerful relabelling feature, allowing arbitrary labels to be rewritten or inserted before ingesting metrics. Can also watch a JSON file containing a list of targets.
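The JSON target file mentioned on slide 9 can be generated by any script. The following is a minimal sketch, assuming the target-group shape used by file-based service discovery (a list of objects with `targets` and `labels`). The host names, the port, and the `dc` label are hypothetical.

```python
import json


def build_target_groups(hosts_by_dc, port=9100):
    """Build target groups for file-based service discovery, one group per datacenter.

    hosts_by_dc maps a (hypothetical) datacenter name to a list of host names.
    """
    groups = []
    for dc, hosts in sorted(hosts_by_dc.items()):
        groups.append({
            "targets": [f"{host}:{port}" for host in sorted(hosts)],
            "labels": {"dc": dc},  # attached to every series scraped from the group
        })
    return groups


if __name__ == "__main__":
    groups = build_target_groups({"bordeaux": ["web1", "web2"], "paris": ["web3"]})
    # Writing this output to the watched file is what adds or removes targets.
    print(json.dumps(groups, indent=2))
```

Regenerating the file from an inventory source is one way to keep targets in sync when no native discovery mechanism fits.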
  10. Metrics Format

        # TYPE events_total counter
        # HELP events_total Logging events
        events_total{level="INFO"} 12
    Easy to generate (even from Bash scripts). Many libraries available (Go, Java, Python, .NET, JS, etc.). So simple, yet so powerful, with many metric types. In the process of getting standardized.
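To illustrate how little is needed to produce the format on slide 10, here is a minimal sketch that renders one counter family as text. The function name is hypothetical, and label values are not escaped, which a real client library would handle.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Label values are emitted as-is (no escaping), to keep the sketch short.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter("events_total", "Logging events", [({"level": "INFO"}, 12)]), end="")
```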
  11. Powerful Query Language

        http_requests_total
        http_requests_total{code="500"}
        increase(http_requests_total{code="500"}[5m])
    Predictions, quantiles, math, and so much more...
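The idea behind `increase(...[5m])` on slide 11 can be sketched in a few lines. This is a simplified approximation, not PromQL's implementation: it sums the deltas between consecutive samples in the window and treats any drop as a counter reset, but it does not extrapolate to the window boundaries the way PromQL does.

```python
def simple_increase(samples):
    """Approximate a counter's increase over a window of (timestamp, value) samples.

    Simplification: no extrapolation to the window edges. A decrease between
    two samples is interpreted as the counter having restarted from zero.
    """
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # Counter reset: everything counted since the restart is new.
            total += curr
    return total


if __name__ == "__main__":
    window = [(0, 10), (60, 14), (120, 3), (180, 5)]  # reset between t=60 and t=120
    print(simple_increase(window))
```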
  12. Example of metrics integration

    React server-side rendering component (JavaScript). In a few lines of code, monitoring added to our hot business paths:
        new prom.Histogram({
          name: "react_prerender_seconds",
          help: 'react prerender time',
          labelNames: ['module', 'component'],
          buckets: [100/1e3, 500/1e3, 1000/1e3],
        });
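The `buckets` option on slide 12 defines upper bounds in seconds (0.1, 0.5 and 1.0). As a rough illustration of what a histogram does with them, here is a small stand-in class; it is a sketch of the cumulative-bucket idea and not the API of any client library.

```python
import bisect


class Histogram:
    """Tiny stand-in for a Prometheus-style histogram with cumulative buckets."""

    def __init__(self, buckets):
        self.bounds = sorted(buckets)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, seconds):
        # bisect_left finds the first bound >= the observation ("le" semantics).
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds

    def cumulative(self):
        """Return [(upper_bound, cumulative_count)], ending with +Inf."""
        out, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            out.append((bound, running))
        return out


if __name__ == "__main__":
    h = Histogram([0.1, 0.5, 1.0])
    for duration in (0.05, 0.3, 0.5, 2.0):
        h.observe(duration)
    print(h.cumulative())
```

Because the counts are cumulative, quantiles can later be estimated from the bucket counters alone, without keeping individual observations.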
  13. Visualization

    Complete monitoring of this component. Lower MTTR. Better understanding of the system.
  14. System Integration

    100+ integrations: DBs, messaging systems; OS (Windows & Linux), hardware (IPMI); HTTP (Apache, nginx, HAProxy, Varnish); logging (fluentd, mtail); JMX, SNMP, StatsD; Minecraft, SMTP, Jenkins; and so many more...
  15. Alerting

        groups:
        - name: example
          rules:
          - alert: HighErrorRate
            expr: request_latency_seconds{job="www"} > 0.5
            for: "10m"
            labels:
              severity: page
            annotations:
              summary: High request latency
    Easy to set up. Easy to collaborate on. Hard to master.
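The `for: "10m"` clause on slide 15 means the expression has to stay true for ten minutes before the alert fires. A minimal sketch of that behaviour, assuming evaluations arrive as `(timestamp, condition_is_true)` pairs; the function name is hypothetical, while the state names mirror the inactive, pending and firing states Prometheus shows for alerts.

```python
def alert_states(condition_by_time, hold_seconds):
    """Yield (timestamp, state) for one alert given [(timestamp, condition_is_true)].

    "pending": the condition is true but has not held for hold_seconds yet.
    "firing": the condition has been continuously true for at least hold_seconds.
    """
    active_since = None
    for ts, is_true in condition_by_time:
        if not is_true:
            active_since = None  # any false evaluation restarts the clock
            yield ts, "inactive"
            continue
        if active_since is None:
            active_since = ts
        yield ts, "firing" if ts - active_since >= hold_seconds else "pending"


if __name__ == "__main__":
    evaluations = [(0, True), (300, True), (600, True), (900, False), (1200, True)]
    for ts, state in alert_states(evaluations, hold_seconds=600):
        print(ts, state)
```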
  16. I'm convinced, what are the drawbacks?

    Lack of High Availability by default. Scaling over time and targets can be difficult. No concept of tenants or security (no "enterprise" features).
  17. High Availability

    [Diagram: Load Balancer (RR), Targets] Sounds reasonable? These replicas are stateful: if one goes down for 10 minutes, you'll have gaps when it comes back up.
  18. Scaling over time and over targets

    [Chart: Time (7d, 3m, 6m, 1y, 2y) against Metrics (100K, 1M, 10M, 100M), showing Storage Cost] Some solutions exist: sharding, federating, a mix of both, or remote-writing data to an external system.
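The storage cost on slide 18 grows with both axes. A back-of-the-envelope sketch, using the rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample). The 15-second scrape interval and the 2 bytes per sample used here are assumptions, not figures from the talk.

```python
def estimate_disk_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample=2.0):
    """Rough disk estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    for series in (100_000, 1_000_000, 10_000_000):
        for days in (7, 365, 730):
            size_tb = estimate_disk_bytes(series, 15, days) / 1e12
            print(f"{series:>10} series, {days:>3} days: {size_tb:.2f} TB")
```

Under these assumptions, 1M series kept for one year comes to roughly 4.2 TB on a single instance, which gives a sense of why long retention pushes towards external or object storage.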
  19. Sharding Prometheus

    [Diagram: DC Bordeaux, DC Paris] Pros: works! Cons: no aggregated metrics, unless...
  20. Federation

    [Diagram: DC Bordeaux, DC Paris] Good balance between the number of Prometheus instances and scale.
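With federation (slide 20), a global Prometheus scrapes selected series from each per-datacenter instance through its `/federate` endpoint, passing one or more `match[]` selectors. In practice this is set in the scrape configuration; the sketch below only shows the request that results. The host name and the selector are hypothetical.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build the /federate URL that a global Prometheus would scrape."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical per-datacenter instance and series selector.
    print(federate_url("http://prom-bordeaux:9090", ['{job="www"}']))
```

Restricting the selectors to pre-aggregated series is what keeps the global instance small.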
  21. What about metric retention?

    [Diagram: Prometheus, SSD; Prometheus, Remote Write] Add more disk space, or send the metrics to an external system.
  22. Remote write to an external system

    Graphite, OpenTSDB; InfluxDB (also with a commercial offering); M3DB (Uber); and more...
        remote_write:
          - url: "http://localhost:9201/write"
  23. An alternative to remote write based systems

    KISS architecture, OSS. Flexible enough, meets our needs in scalability and reliability. Some of the following slides are from the author's talk.
  24. Thanos Sidecar

    [Diagram: Targets, Sidecar, Prometheus, SSD, gRPC (Store API)] Adds a defined gRPC API to access Prometheus data.
  25. Thanos Query

    [Diagram: Targets, Sidecar, Prometheus, SSD, gRPC (Store API), HTTP API, Querier] Same HTTP API as Prometheus, seamless for Grafana.
  26. Global View

    [Diagram: two groups of Targets, Sidecar, Prometheus, SSD; Querier; Merge]
  27. Global View + Availability

    [Diagram: Targets, Sidecar, Prometheus, SSD; Querier; Merge; replica: A, replica: B; Dedup]
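The Merge and Dedup steps on slides 26 and 27 can be pictured as follows. This is a deliberately simplified sketch and not Thanos's actual algorithm: it assumes replicas share identical timestamps, identifies a series by its labels minus a `replica` label, and lets other replicas fill the gaps of the first one seen.

```python
def merge_and_dedup(series_list, replica_label="replica"):
    """Merge series from several Prometheus instances and collapse HA replicas.

    Each series is (labels_dict, [(timestamp, value), ...]). Series whose labels
    are equal once replica_label is dropped count as the same series; for a
    given timestamp the first replica seen wins, later ones only fill gaps.
    """
    merged = {}
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        points = merged.setdefault(key, {})
        for ts, value in samples:
            points.setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


if __name__ == "__main__":
    replica_a = ({"job": "www", "replica": "A"}, [(0, 1.0), (45, 4.0)])  # was down in between
    replica_b = ({"job": "www", "replica": "B"}, [(0, 1.0), (15, 2.0), (30, 3.0), (45, 4.0)])
    print(merge_and_dedup([replica_a, replica_b]))
```

The gap left by replica A while it was down is covered by replica B, which is the availability gain the slide points at.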
  28. Retention

    [Diagram: Targets, Sidecar, Prometheus, SSD; Blocks written to ObjStorage] Stores blocks in object storage (AWS S3, Azure Blob Storage, Ceph, Minio, etc.).
  29. Store Gateway

    [Diagram: Querier, gRPC API, Store, Cache, read, ObjStorage Blocks] Potential for unlimited retention.
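Since each block in object storage covers a fixed time range, a query only needs the blocks that overlap it. A minimal sketch of that selection, assuming block metadata shaped like the TSDB `meta.json` fields `ulid`, `minTime` and `maxTime` (milliseconds, end exclusive). The block identifiers below are hypothetical placeholders, not real ULIDs.

```python
def blocks_for_range(blocks, query_start_ms, query_end_ms):
    """Return the IDs of blocks whose [minTime, maxTime) overlaps the query range."""
    return [
        b["ulid"]
        for b in sorted(blocks, key=lambda b: b["minTime"])
        if b["minTime"] <= query_end_ms and b["maxTime"] > query_start_ms
    ]


if __name__ == "__main__":
    two_hours = 2 * 3600 * 1000
    blocks = [
        {"ulid": "block-a", "minTime": 0, "maxTime": two_hours},
        {"ulid": "block-b", "minTime": two_hours, "maxTime": 2 * two_hours},
        {"ulid": "block-c", "minTime": 2 * two_hours, "maxTime": 3 * two_hours},
    ]
    print(blocks_for_range(blocks, 7_000_000, 8_000_000))
```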
  30. Let's scale it up now!

    ~3.4B euro GMV. ~8.6M customers, leading e-commerce company in Europe. ~3000 servers in 2 DCs, ~1000 containers. Kube on premise, Mesos/Marathon. SRE team of ~20 engineers.
  31. Summer of 2017, first POC

    Plenty of microservices, no suitable monitoring. Experiment limited to the Mobile team. Huge impact on Black Friday and Christmas days: the whole SRE team is convinced of its potential. Word is spreading among dev teams.
  32. Summer of 2018, new observability project

    How can we crawl the complete IT system? What are our users expecting? How long do we want to keep metrics? Many options around, which one to choose?
  33. 100% coverage of our IT systems

    5 years retention. Dashboards and alerts for everyone! #DevOps
  34. [Architecture diagram]

    Legend: H, μ = Targets; C = Crawlers; F = Federation; TSDB = Storage; G = Display. Sites: DC1, DC2, POPX, with a VIP and Target Managers. ISPs (Orange, Free, SFR, Bouygues).
  35. Zoom

    [Architecture diagram: Mesos/Marathon/K8s or VMs; dedicated Ceph; hosts thanos[1-2], thanoss[1-2], thanosq[1-2]; SI Crawlers (Prom/Thanos); Thanos-Store; Thanos-Query; Grafana; Target Manager (configures); VIPs; links over tcp/http (federate), tcp/https and tcp/tls; DC1 Crawlers, DCX/POPX]
  36. Objective: Black Friday

    VMs instead of Kubernetes (will change in the future). High Availability as a nice-to-have option. 2 engineers + 1 project manager. Instrumentation Day! Get the whole team in a room and instrument all the systems (Varnish, MSSQL, IIS, container clusters, ...).
  37. Last words

    Governance: why do I need it? Generic alerts/dashboards/reporting. Where is my metric cdiscount_revenue_euro generated and calculated from? To generate traffic maps. How will this tool be used? By whom? Adapt your product to your users. This is not a one-shot project; it is ongoing work that grows with your company.
  38. Thank you

    Any questions?