Monitoring Event Pipelines

Slide 1

Slide 1 text

Monitoring Event Pipelines
Why you need one, and why you should stop rolling your own.
By Sean Porter (@PorterTech)

Slide 2

Slide 2 text

Who am I?
● Sean Porter
● Creator of Sensu
● Co-founder & CTO
● @PorterTech

Slide 3

Slide 3 text

Overview
● Our shared reality
● My experience in building Sensu
● What is a monitoring pipeline?
● Attributes of an effective pipeline
● Demo time
● The future of pipelines

Slide 4

Slide 4 text

OUR SHARED REALITY

Slide 5

Slide 5 text

[Chart: COMPLEXITY over TIME]

Slide 6

Slide 6 text

[Chart: # OF THINGS over TIME, with series for Containers, Servers, VMs, Functions]

Slide 7

Slide 7 text

The paradigm shift
Ephemeral compute!
● Host-based -> role-based monitoring
● Polling -> publish-subscribe & push APIs
● Point-and-click -> Infrastructure as Code

Slide 8

Slide 8 text

“We gotta find a way to make [this] fit into the hole for (this), using nothing but {that}.”

Slide 9

Slide 9 text

[Chart: AMOUNT OF DATA over TIME]

Slide 10

Slide 10 text

Slide 11

Slide 11 text

THE TOOLS YOU KNOW

Slide 12

Slide 12 text

NEED TO BUILD, INTEGRATE, SCALE

Slide 13

Slide 13 text

Slide 14

Slide 14 text

NEED TO MAINTAIN

Slide 15

Slide 15 text

HOLD THAT THOUGHT

Slide 16

Slide 16 text

BUILDING SENSU

Slide 17

Slide 17 text

Sensu origin story
I joined Sonian in 2010 as an “Automation Engineer”
● Early adopters
● High rate of change
  ○ Growing team
  ○ Evolving software stack

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Sensu origin story
I started a project with some goals (July 2011)
● Handle ephemeral compute
● Leverage existing and familiar technologies
● Easy to drive with config management
● Easy to scale horizontally
● APIs!!!

Slide 20

Slide 20 text

Sensu origin story
● Agent-based system with auto-discovery
● Message bus for communication
● Simple key-value data store for state
● Central service check scheduler (pub-sub)
● JSON configuration
● REST APIs

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Sensu origin story
● Designed for the cloud
  ○ Proved to handle ephemeral compute
  ○ Operated securely on public net (AWS)
● Focused on composability & extensibility
  ○ Reusable components / building blocks

Slide 24

Slide 24 text

Sensu origin story

Slide 25

Slide 25 text

Sensu origin story
● Named it Sensu
● Sonian sponsored development!
  ○ Deployed to production after 2 months
  ○ Replaced a number of tools!
● Open sourced (MIT) on November 1st, 2011

Slide 26

Slide 26 text

Slide 27

Slide 27 text

THE MONITORING FRAMEWORK / ROUTER / BUS

Slide 28

Slide 28 text

THE MONITORING PIPELINE

Slide 29

Slide 29 text

THE MONITORING / OBSERVABILITY PIPELINE

Slide 30

Slide 30 text

WHAT IS A MONITORING PIPELINE?

Slide 31

Slide 31 text

What is a monitoring pipeline?
Unified data collection and processing for all types of monitoring events:
● Service checks
● Metrics
● Traces
● Logs
● Inventory

Slide 32

Slide 32 text

What is a monitoring pipeline?
There are two critical layers:
● Data plane
● Control plane

Slide 33

Slide 33 text

The data plane
● Data input
● Data transportation & routing
● Load balancing & failover
● The layer developers interact with (APIs)

Slide 34

Slide 34 text

The control plane
● Central management unit
  ○ Orchestrator
  ○ Configuration
  ○ Security (auth)
● APIs, agents, data processors, etc.
● The layer operators interact with

Slide 35

Slide 35 text

https://www.youtube.com/watch?v=CM2Y6B1yuDg

Slide 36

Slide 36 text

Slide 37

Slide 37 text

https://bravenewgeek.com/the-observability-pipeline/

Slide 38

Slide 38 text

WAIT… BUT WHY?

Slide 39

Slide 39 text

CHANGE YOUR MIND

Slide 40

Slide 40 text

CHANGE DATASTORES

Slide 41

Slide 41 text

CHANGE FORMATS

Slide 42

Slide 42 text

CHANGE VISUALIZATION

Slide 43

Slide 43 text

CHANGE SAMPLING

Slide 44

Slide 44 text

CHANGE PLATFORMS

Slide 45

Slide 45 text

MAKE CHANGE INEXPENSIVE

Slide 46

Slide 46 text

MAKE IT FUTURE PROOF

Slide 47

Slide 47 text

Wait… But why?
● Fewer agents
  ○ Fewer “edge” services to support
  ○ Resource utilization (i.e. fewer sidecars)
● Cost savings*

Slide 48

Slide 48 text

ATTRIBUTES OF AN EFFECTIVE PIPELINE

Slide 49

Slide 49 text

Event payload(s)
● Unified data format(s)
● Unique IDs
● Capture context at collection time
● Support additional metadata
● Support efficient debugging

Slide 50

Slide 50 text

Event payload(s)

Slide 51

Slide 51 text

Event payload(s)
{
  metadata: { annotations: { … } },
  entity: { name: "osmc.de" },
  timestamp: now()
}

Slide 52

Slide 52 text

Event payload(s)
{
  metadata: { annotations: { … } },
  check: { output: "the system is down" },
  timestamp: now()
}

Slide 53

Slide 53 text

Event payload(s)
{
  metadata: { annotations: { … } },
  metrics: { points: [ … ] },
  timestamp: now()
}
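The three payload slides share one envelope (metadata and timestamp) and differ only in the body (entity, check, or metrics). A minimal Python sketch of such an envelope might look like the following. The `new_event` helper and the `id` field are hypothetical, added only to illustrate the "Unique IDs" attribute; the slides do not specify a concrete schema.

```python
import time
import uuid
from typing import Any, Dict, Optional


def new_event(
    entity_name: str,
    annotations: Optional[Dict[str, str]] = None,
    check: Optional[Dict[str, Any]] = None,
    metrics: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Build one event with a shared envelope and an optional body."""
    event: Dict[str, Any] = {
        # Hypothetical field: a unique ID so an event can be traced end to end.
        "id": str(uuid.uuid4()),
        # Context captured at collection time travels with the event.
        "metadata": {"annotations": dict(annotations or {})},
        "entity": {"name": entity_name},
        "timestamp": int(time.time()),
    }
    if check is not None:
        event["check"] = check
    if metrics is not None:
        event["metrics"] = metrics
    return event


down = new_event(
    "osmc.de",
    annotations={"environment": "production"},
    check={"output": "the system is down"},
)
```

Keeping the envelope identical across checks, metrics, and inventory is what lets the later filter and transform stages treat every event the same way.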

Slide 54

Slide 54 text

Collection Agent
● Lightweight (think sidecars)
● Multi-platform support
● Initiates connections to backends
● Bi-directional communication

Slide 55

Slide 55 text

Collection Agent
● Auto-registration with the backend
● Auto-discovery (context)
  ○ Platform info
  ○ System details
  ○ Roles / responsibilities

Slide 56

Slide 56 text

Collection Agent
● Keepalive / heartbeat mechanism
● Service check execution support
  ○ Commonly overlooked / underappreciated
  ○ Leverage over a decade of investment
● Durable outbound data queue
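A keepalive mechanism can be sketched as a table of last-seen times on the backend: agents report in periodically, and any entity that stays silent for longer than a timeout is flagged. The `KeepaliveMonitor` class and the 120-second default below are assumptions for illustration, not Sensu's implementation.

```python
from typing import Dict, List


class KeepaliveMonitor:
    """Track the last heartbeat per entity and report the silent ones."""

    def __init__(self, timeout_seconds: float = 120.0) -> None:
        self.timeout = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, entity: str, now: float) -> None:
        # Called whenever an agent checks in; `now` is a Unix timestamp.
        self.last_seen[entity] = now

    def stale(self, now: float) -> List[str]:
        # Entities whose last heartbeat is older than the timeout.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = KeepaliveMonitor(timeout_seconds=120)
monitor.heartbeat("web-1", now=0)
monitor.heartbeat("web-2", now=100)
silent = monitor.stale(now=130)
```

Passing `now` in explicitly keeps the logic easy to test; a real agent would use the wall clock.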

Slide 57

Slide 57 text

Collection Agent
● Several data inputs (APIs)
  ○ Industry standards*
    ■ Metrics (StatsD, OpenTelemetry, Graphite)
    ■ Trace (OpenTracing)
    ■ Structured log (JSON)
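As one example of a standard input, a StatsD line has the shape `name:value|type`, optionally followed by `|@sample_rate`. The sketch below parses such a line into a plain dictionary. The `parse_statsd` name and the output keys are hypothetical, and it assumes numeric values only (set members and signed gauge deltas are not handled).

```python
from typing import Dict, Union


def parse_statsd(line: str) -> Dict[str, Union[str, float]]:
    """Parse one StatsD line such as 'page.views:1|c|@0.1'."""
    name, sep, rest = line.strip().partition(":")
    if not sep or not name:
        raise ValueError(f"missing metric name or ':' in {line!r}")
    parts = rest.split("|")
    if len(parts) < 2 or not parts[1]:
        raise ValueError(f"missing metric type in {line!r}")
    sample_rate = 1.0
    for extra in parts[2:]:
        # The only optional section handled here is the sample rate.
        if extra.startswith("@"):
            sample_rate = float(extra[1:])
    return {
        "name": name,
        "value": float(parts[0]),
        "type": parts[1],  # e.g. c (counter), g (gauge), ms (timer)
        "sample_rate": sample_rate,
    }


point = parse_statsd("page.views:1|c|@0.1")
```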

Slide 58

Slide 58 text

Data Transport
● Standard cryptography (TLS v1.2)
  ○ mTLS verification & authentication
● Standard protocol (HTTP)
● Agent-initiated connections
  ○ Traverse complex networks
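On the agent side, the transport requirements above map onto a client TLS context that enforces a minimum protocol version, verifies the backend, and optionally presents a client certificate for mTLS. This is a sketch using Python's standard `ssl` module; the `agent_tls_context` function and its parameter names are hypothetical.

```python
import ssl
from typing import Optional


def agent_tls_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Client-side context for an agent that dials out to the backend."""
    # SERVER_AUTH means "we are a client verifying a server":
    # certificate verification and hostname checking are enabled.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file is not None:
        # Presenting a client certificate is what makes the session mutual.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = agent_tls_context()
```

Because the agent initiates the connection, only outbound access is needed, which is what allows it to traverse NAT and firewalls.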

Slide 59

Slide 59 text

Data Processor
● Scale horizontally
● Little coordination with peers
● Concurrency & parallelism
  ○ “Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once.” (Rob Pike)
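One way to read "little coordination with peers" is that each event is handled by a stateless function, so any worker can take any event. The sketch below fans events out to a thread pool; `process` and `process_all` are hypothetical names, and in CPython threads mainly help when the handler is I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]


def process(event: Event) -> Event:
    # Stateless handler: the result depends only on the event itself,
    # so workers never need to talk to each other.
    return {**event, "processed": True}


def process_all(events: Iterable[Event], workers: int = 4) -> List[Event]:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though
        # the work itself may be interleaved across threads.
        return list(pool.map(process, events))


results = process_all({"entity": {"name": f"web-{i}"}} for i in range(10))
```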

Slide 60

Slide 60 text

Data Processor
● Easy to extend and integrate!
  ○ Simple APIs and clear specs
● Multi-tenancy to enable self-service
  ○ Namespaces
  ○ RBAC

Slide 61

Slide 61 text

Filtering
● The “secret sauce”
● Granular routing
  ○ Is it an incident?
  ○ Is it a resolution?
  ○ Metric data?
  ○ Production? Office hours?
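The routing questions above can be sketched as small predicates over an event. Everything about the event layout here is an assumption: `check.status` follows the exit-code convention where 0 means OK, `check.previous_status` is a hypothetical field, and the environment is read from a hypothetical `environment` annotation.

```python
from datetime import datetime, time, timezone
from typing import Any, Dict

Event = Dict[str, Any]


def is_incident(event: Event) -> bool:
    # Assumed convention: a non-zero check status means a problem.
    return event.get("check", {}).get("status", 0) != 0


def is_resolution(event: Event) -> bool:
    # OK now, but not OK on the previous run (hypothetical field).
    check = event.get("check", {})
    return check.get("status") == 0 and check.get("previous_status", 0) != 0


def has_metrics(event: Event) -> bool:
    return bool(event.get("metrics", {}).get("points"))


def is_production(event: Event) -> bool:
    annotations = event.get("metadata", {}).get("annotations", {})
    return annotations.get("environment") == "production"


def in_office_hours(event: Event, start: time = time(9), end: time = time(17)) -> bool:
    # Monday to Friday, 09:00 to 17:00, evaluated in UTC for simplicity.
    moment = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    return moment.weekday() < 5 and start <= moment.time() < end


def should_page(event: Event) -> bool:
    # Example composition: page only for production incidents.
    return is_incident(event) and is_production(event)
```

Keeping each question as its own predicate makes it cheap to combine them differently per route.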

Slide 62

Slide 62 text

Transformation

Slide 63

Slide 63 text

Actions
● Alert notifications
● Incident management
● Metric & event storage
● Inventory
● Auto-remediation

Slide 64

Slide 64 text

Monitoring Pipeline
[Diagram: events (service checks, metrics, traces, logs) pass through filter, transform, and action stages.]
● Filters: only incidents; only logs; only production; only metrics; only once every 30 min
● Transforms: redact sensitive; annotate; Nagios => InfluxDB
● Each path ends in an action
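The filter, transform, action flow in the diagram can be sketched as a small composable pipeline. The example wires one path, "Nagios => InfluxDB", by converting Nagios-style performance data (`label=value[UOM];warn;crit;min;max` after a `|`) into InfluxDB line protocol. The `Pipeline` class, the function names, and the `entity` tag are hypothetical, and the parser assumes unquoted labels and entity names without spaces or commas.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]

# Matches the start of one perfdata item, e.g. "rta=0.80ms;100;500;0".
PERFDATA = re.compile(r"^(?P<label>[^=]+)=(?P<value>-?\d+(?:\.\d+)?)")


@dataclass
class Pipeline:
    filters: List[Callable[[Event], bool]] = field(default_factory=list)
    transforms: List[Callable[[Any], Any]] = field(default_factory=list)
    actions: List[Callable[[Any], None]] = field(default_factory=list)

    def handle(self, event: Event) -> bool:
        # Every filter must pass, then transforms run in order,
        # and the final payload is handed to every action.
        if not all(check(event) for check in self.filters):
            return False
        payload: Any = event
        for transform in self.transforms:
            payload = transform(payload)
        for action in self.actions:
            action(payload)
        return True


def has_perfdata(event: Event) -> bool:
    return "|" in event.get("check", {}).get("output", "")


def nagios_to_influx(event: Event) -> List[str]:
    _, _, perfdata = event["check"]["output"].partition("|")
    entity = event["entity"]["name"]
    ts_ns = int(event["timestamp"]) * 1_000_000_000  # line protocol default is ns
    lines = []
    for item in perfdata.split():
        match = PERFDATA.match(item)
        if match:
            value = float(match["value"])
            lines.append(f"{match['label']},entity={entity} value={value} {ts_ns}")
    return lines


stored: List[str] = []  # stands in for a metric storage action
pipeline = Pipeline(
    filters=[has_perfdata],
    transforms=[nagios_to_influx],
    actions=[stored.extend],
)
pipeline.handle({
    "entity": {"name": "web-1"},
    "check": {"output": "PING OK | rta=0.80ms;100;500;0 pl=0%;20;60;;"},
    "timestamp": 1700000000,
})
```

Because each stage is just a callable, swapping the datastore or format means replacing one transform or action rather than rebuilding the path.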