Slide 1

Slide 1 text

@tyler_treat The Observability Pipeline Tyler Treat / deliver:Agile 2019 / April 29, 2019

Slide 2

Slide 2 text

@tyler_treat The way we build systems has fundamentally changed.

Slide 3

Slide 3 text

@tyler_treat Our systems are more complex than they’ve ever been.

Slide 4

Slide 4 text

@tyler_treat Don’t believe me?

Slide 5

Slide 5 text

@tyler_treat https://www.youtube.com/watch?v=xy3w2hGijhE

Slide 6

Slide 6 text

@tyler_treat Pets vs. Cattle

Slide 7

Slide 7 text

@tyler_treat This is our server. His name is Toby.

Slide 8

Slide 8 text

@tyler_treat We take good care of Toby.

Slide 9

Slide 9 text

@tyler_treat We release to him twice a year.
 (quarterly if we’re feeling dangerous)

Slide 10

Slide 10 text

@tyler_treat Toby is compatible with most
 versions of Internet Explorer.

Slide 11

Slide 11 text

@tyler_treat Toby likes to go on long walks,
 so sometimes we’ll take him 
 offline for a bit.
 (usually just nights and weekends)

Slide 12

Slide 12 text

@tyler_treat No one seems to mind.

Slide 13

Slide 13 text

@tyler_treat Sometimes Toby crashes,
 but we always make sure
 to restart him.

Slide 14

Slide 14 text

@tyler_treat We like Toby.

Slide 15

Slide 15 text

@tyler_treat This is 74db150601cd.

Slide 16

Slide 16 text

@tyler_treat It’s best not to get too
 attached because when he’s
 no longer needed, well…

Slide 17

Slide 17 text

@tyler_treat

Slide 18

Slide 18 text

@tyler_treat [Diagram: Transactional DB, App Server, Reporting DB]

Slide 19

Slide 19 text

@tyler_treat [Diagram: Transactional DB, App Server, Reporting DB]

Slide 20

Slide 20 text

@tyler_treat “We need to be highly available.”

Slide 21

Slide 21 text

@tyler_treat [Diagram: Transactional DB, App Server, Reporting DB]

Slide 22

Slide 22 text

@tyler_treat [Diagram: multiple App Servers, a Database Cluster (Nodes 1-5), Reporting DB]

Slide 23

Slide 23 text

@tyler_treat [Diagram: multiple App Servers, a Database Cluster (Nodes 1-5), Reporting DB]

Slide 24

Slide 24 text

@tyler_treat “We need to support every device.”

Slide 25

Slide 25 text

@tyler_treat [Diagram: multiple App Servers, a Database Cluster (Nodes 1-5), Reporting DB]

Slide 26

Slide 26 text

@tyler_treat [Diagram: multiple App Servers, a Database Cluster (Nodes 1-5), Reporting DB]

Slide 27

Slide 27 text

@tyler_treat “We need faster response times.”

Slide 28

Slide 28 text

@tyler_treat [Diagram: multiple App Servers, a Database Cluster (Nodes 1-5), Reporting DB]

Slide 29

Slide 29 text

@tyler_treat [Diagram: multiple App Servers, a Database Cluster (Nodes 1-5), a Cache Cluster (Nodes 1-5), Reporting DB]

Slide 30

Slide 30 text

@tyler_treat “We need real-time analytics, not batch.”

Slide 31

Slide 31 text

@tyler_treat [Diagram: multiple App Servers, a Database Cluster (Nodes 1-5), a Cache Cluster (Nodes 1-5), Reporting DB]

Slide 32

Slide 32 text

@tyler_treat [Diagram: multiple App Servers, a Database Cluster (Nodes 1-5), a Cache Cluster (Nodes 1-5), a Data Pipeline, a BI Data Cluster (Nodes 1-5), two BI Servers]

Slide 33

Slide 33 text

@tyler_treat “We need to release multiple times a day.”

Slide 34

Slide 34 text

@tyler_treat [Diagram: multiple App Servers, a Database Cluster (Nodes 1-5), a Cache Cluster (Nodes 1-5), a Data Pipeline, a BI Data Cluster (Nodes 1-5), two BI Servers]

Slide 35

Slide 35 text

@tyler_treat [Diagram: four Microservices, each with its own Database Cluster (1-5) and Cache Cluster (1-5), plus a Data Pipeline, a BI Data Cluster (Nodes 1-5), and two BI Servers]

Slide 36

Slide 36 text

@tyler_treat “We need to support multiple geos.”

Slide 37

Slide 37 text

@tyler_treat [Diagram: four Microservices, each with its own Database Cluster (1-5) and Cache Cluster (1-5), plus a Data Pipeline, a BI Data Cluster (Nodes 1-5), and two BI Servers]

Slide 38

Slide 38 text

@tyler_treat [Diagram: two regions, North America and Asia Pacific, each with two BI Servers and four Microservices]

Slide 39

Slide 39 text

@tyler_treat [Diagram: four regional deployments (each labeled “North America” in the extracted text), each with two BI Servers and four Microservices, plus a CDN]

Slide 40

Slide 40 text

@tyler_treat [Diagram: four regional deployments (each labeled “North America” in the extracted text), each with two BI Servers and four Microservices, plus a CDN. Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, . . .]

Slide 41

Slide 41 text

@tyler_treat [Diagram: four regional deployments (each labeled “North America” in the extracted text), each with two BI Servers and four Microservices, plus a CDN. CI/CD: Repos, Builders, Artifacts, Deployers. Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, . . .]

Slide 42

Slide 42 text

@tyler_treat “Oh, and one more thing…”

Slide 43

Slide 43 text

@tyler_treat “…we need to do DevOps.”

Slide 44

Slide 44 text

@tyler_treat [Diagram: four regional deployments (each labeled “North America” in the extracted text), each with two BI Servers and four Microservices, plus a CDN. CI/CD: Repos, Builders, Artifacts, Deployers. Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, . . .]

Slide 45

Slide 45 text

@tyler_treat [Diagram: four regional deployments (each labeled “North America” in the extracted text), each with two BI Servers and four Microservices, plus a CDN. CI/CD: Repos, Builders, Artifacts, Deployers. “DevOps”. Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, . . .]

Slide 46

Slide 46 text

@tyler_treat The way we build systems has fundamentally changed.

Slide 47

Slide 47 text

@tyler_treat Because our constraints and expectations have fundamentally changed.

Slide 48

Slide 48 text

@tyler_treat Cloud and containers have led to much more distributed and dynamic systems.

Slide 49

Slide 49 text

@tyler_treat [Diagram: Transactional DB, App Server, Reporting DB]

Slide 50

Slide 50 text

@tyler_treat [Diagram: four regional deployments (each labeled “North America” in the extracted text), each with two BI Servers and four Microservices, plus a CDN. CI/CD: Repos, Builders, Artifacts, Deployers. Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, . . . “DevOps”.]

Slide 51

Slide 51 text

@tyler_treat This shift has exposed deficiencies in our tools and practices…

Slide 52

Slide 52 text

@tyler_treat …and has led to new tools created to help us support our systems.

Slide 53

Slide 53 text

@tyler_treat How do we make sense of it all?

Slide 54

Slide 54 text

@tyler_treat In particular, how do we make this…

Slide 55

Slide 55 text

@tyler_treat [Diagram: four regional deployments (each labeled “North America” in the extracted text), each with two BI Servers and four Microservices, plus a CDN. CI/CD: Repos, Builders, Artifacts, Deployers. Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, . . . “DevOps”.]

Slide 56

Slide 56 text

@tyler_treat more like this…

Slide 57

Slide 57 text

@tyler_treat [Diagram: four regional deployments (each labeled “North America” in the extracted text), each with two BI Servers and four Microservices, plus a CDN. CI/CD: Repos, Builders, Artifacts, Deployers. Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, . . . “DevOps”.]

Slide 58

Slide 58 text

@tyler_treat “The Observability Pipeline”

Slide 59

Slide 59 text

@tyler_treat A Brave New World

Slide 60

Slide 60 text

@tyler_treat Operations for

Slide 61

Slide 61 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 62

Slide 62 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 63

Slide 63 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 64

Slide 64 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 65

Slide 65 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 66

Slide 66 text

@tyler_treat APM Debugger Profiler SSH System Behavior grep

Slide 67

Slide 67 text

@tyler_treat APM Debugger Profiler SSH System Behavior Actual Customer Impact grep

Slide 68

Slide 68 text

@tyler_treat Operations for

Slide 69

Slide 69 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 70

Slide 70 text

@tyler_treat APM Debugger Profiler SSH grep

Slide 71

Slide 71 text

@tyler_treat APM Debugger Profiler SSH Testing in Production at Scale, Amit Gud grep

Slide 72

Slide 72 text

@tyler_treat APM Debugger Profiler SSH System Behavior Actual Customer Impact ??? grep

Slide 73

Slide 73 text

@tyler_treat grep APM Debugger Profiler SSH System Behavior Actual Customer Impact ???

Slide 74

Slide 74 text

@tyler_treat Also, culture.

Slide 75

Slide 75 text

@tyler_treat Many companies rely on a separate operations team to monitor, triage, and even resolve issues.

Slide 76

Slide 76 text

@tyler_treat This model doesn’t map to the world of microservices and containers.

Slide 77

Slide 77 text

@tyler_treat And it leads to ineffective feedback loops.

Slide 78

Slide 78 text

@tyler_treat In order for developers to take on this responsibility, they need to be enabled.

Slide 79

Slide 79 text

@tyler_treat “DevOps” teams are really “Developer Enablement” teams.

Slide 80

Slide 80 text

@tyler_treat This shift in how we build systems has caused an explosion of new tools and terminology.

Slide 81

Slide 81 text

@tyler_treat “Observability”

Slide 82

Slide 82 text

@tyler_treat Post Hoc vs. Ad Hoc

Slide 83

Slide 83 text

@tyler_treat Data Available Understanding

Slide 84

Slide 84 text

@tyler_treat Axes: Data Available, Understanding
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”

Slide 85

Slide 85 text

@tyler_treat Axes: Data Available, Understanding
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

Slide 86

Slide 86 text

@tyler_treat Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

Slide 87

Slide 87 text

@tyler_treat Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”

Slide 88

Slide 88 text

@tyler_treat Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Quadrant labels: FACTS

Slide 89

Slide 89 text

@tyler_treat Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Quadrant labels: FACTS, HYPOTHESES

Slide 90

Slide 90 text

@tyler_treat Axes: Data Available, Understanding
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Unknown Unknowns
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Quadrant labels: ASSUMPTIONS, FACTS, HYPOTHESES

Slide 91

Slide 91 text

@tyler_treat Axes: Data Available, Understanding
Unknown Unknowns (DISCOVERIES)
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Unknown Knowns
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Known Knowns
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Quadrant labels: ASSUMPTIONS, FACTS, HYPOTHESES

Slide 92

Slide 92 text

@tyler_treat Axes: Data Available, Understanding
Unknown Unknowns (DISCOVERIES)
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns (HYPOTHESES)
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Labels: Monitoring, Observability

Slide 93

Slide 93 text

@tyler_treat Axes: Data Available, Understanding
Unknown Unknowns (DISCOVERIES)
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
Known Unknowns (HYPOTHESES)
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Labels: Testing, Exploring

Slide 94

Slide 94 text

@tyler_treat “The army is now fully prepared to fight the previous war.”

Slide 95

Slide 95 text

@tyler_treat Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events

Slide 96

Slide 96 text

@tyler_treat Observability Data: application logs, system logs, audit logs, application metrics, distributed traces, events
Some challenges…
- Locked up inside a single vendor’s solution
- Not readily available across the enterprise (or in some cases, too readily available)
- Many tools and products needed for different data and use cases
- Tool and data needs vary from team to team
- Ever-changing landscape of tools, products, and services
- Sheer volume of data can be overwhelming

Slide 97

Slide 97 text

@tyler_treat System

Slide 98

Slide 98 text

@tyler_treat System Splunk Universal Forwarder

Slide 99

Slide 99 text

@tyler_treat System Splunk Universal Forwarder Datadog Metrics Agent Datadog APM Agent

Slide 100

Slide 100 text

@tyler_treat System Splunk Universal Forwarder Datadog Metrics Agent Datadog APM Agent Universal Analytics Client

Slide 101

Slide 101 text

@tyler_treat System Splunk Universal Forwarder Datadog Metrics Agent Datadog APM Agent Universal Analytics Client Amazon Glacier S3 Client

Slide 102

Slide 102 text

@tyler_treat System Splunk Universal Forwarder Datadog APM Agent Universal Analytics Client Amazon Glacier S3 Client … Datadog Metrics Agent

Slide 103

Slide 103 text

[Diagram: a fleet of Systems. Most run a Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, S3 Client, …; some run only a Splunk Universal Forwarder and a Universal Analytics Client]

Slide 104

Slide 104 text

@tyler_treat “Oh, actually we want to change how we parse our logs.”

Slide 105

Slide 105 text

[Diagram: a fleet of Systems. Most run a Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, S3 Client, …; some run only a Splunk Universal Forwarder and a Universal Analytics Client]

Slide 106

Slide 106 text

@tyler_treat “Re-roll the agents.”

Slide 107

Slide 107 text

@tyler_treat “Oh, actually we want to use Sumo Logic for logging.”

Slide 108

Slide 108 text

[Diagram: a fleet of Systems. Most run a Splunk Universal Forwarder, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, S3 Client, …; some run only a Splunk Universal Forwarder and a Universal Analytics Client]

Slide 109

Slide 109 text

@tyler_treat “Re-roll the agents.”

Slide 110

Slide 110 text

[Diagram: the same fleet with the Splunk Universal Forwarder replaced by a Sumo Logic Collector. Most Systems run a Sumo Logic Collector, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, S3 Client, …; some run only a Sumo Logic Collector and a Universal Analytics Client]

Slide 111

Slide 111 text

@tyler_treat “Oh, actually we want to use New Relic for APM.”

Slide 112

Slide 112 text

[Diagram: a fleet of Systems. Most run a Sumo Logic Collector, Datadog Metrics Agent, Datadog APM Agent, Universal Analytics Client, S3 Client, …; some run only a Sumo Logic Collector and a Universal Analytics Client]

Slide 113

Slide 113 text

@tyler_treat “Re-roll the agents.”

Slide 114

Slide 114 text

[Diagram: the same fleet with the Datadog agents replaced by a New Relic APM Agent. Most Systems run a Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, S3 Client, …; some run only a Sumo Logic Collector and a Universal Analytics Client]

Slide 115

Slide 115 text

@tyler_treat “Oh, actually we want to evaluate Honeycomb for debugging.”

Slide 116

Slide 116 text

[Diagram: a fleet of Systems. Most run a Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, S3 Client, …; some run only a Sumo Logic Collector and a Universal Analytics Client]

Slide 117

Slide 117 text

@tyler_treat “Re-roll the agents.”

Slide 118

Slide 118 text

[Diagram: the same fleet with a Honeytail Agent added. Most Systems run a Sumo Logic Collector, New Relic APM Agent, Universal Analytics Client, S3 Client, …, and a Honeytail Agent; some run only a Sumo Logic Collector and a Universal Analytics Client]

Slide 119

Slide 119 text

@tyler_treat You get the idea.

Slide 120

Slide 120 text

@tyler_treat How big of a lift is it for your organization to change tools?

Slide 121

Slide 121 text

@tyler_treat How easy is it to experiment with new ones?

Slide 122

Slide 122 text

@tyler_treat What data to send? Where to send it? How to send it?
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …

Slide 123

Slide 123 text

@tyler_treat A decoupled approach

Slide 124

Slide 124 text

@tyler_treat What data to send? Where to send it? How to send it?
Data Sources
• VMs
• Containers
• Load balancers
• Service meshes
• Audit logs
• VPC flow logs
• Firewall logs
• …
Observability Pipeline (between the sources and the sinks)
Data Sinks
• Centralized logging
• SIEM
• Monitoring
• APM
• Alerting
• Cold storage
• BI
• …

Slide 125

Slide 125 text

@tyler_treat Anatomy of an Observability Pipeline

Slide 126

Slide 126 text

@tyler_treat 1. Data Specifications
Structure your damn data.

Slide 127

Slide 127 text

@tyler_treat log.error("User '{}' login failed".format(user))

Slide 128

Slide 128 text

@tyler_treat ERROR 2019-04-05 13:26.42 User 'tylertreat' login failed

Slide 129

Slide 129 text

@tyler_treat
log.error("User login failed",
          event=LOGIN_ERROR,
          user="tylertreat",
          email="[email protected]",
          error=error)

Slide 130

Slide 130 text

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "user": "tylertreat",
  "email": "[email protected]",
  "error": "Invalid username or password",
  "message": "User login failed"
}
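A structured entry like the one on this slide could be produced with Python's standard `logging` module and a JSON formatter. This is a minimal sketch, not the logger used in the talk: `JsonFormatter`, `log_event`, `build_logger`, and the `fields` key passed through `extra=` are all names made up for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Structured fields travel on the record under "fields"
        # (a convention of this sketch, not of the logging module).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def log_event(logger, level, message, **fields):
    """Log a message plus arbitrary key/value fields."""
    logger.log(level, message, extra={"fields": fields})


def build_logger(stream):
    """Create a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("sketch.structured")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    log_event(log, logging.ERROR, "User login failed",
              event="user_login_error", user="tylertreat",
              error="Invalid username or password")
```

Because every field is a key rather than text baked into the message, a downstream system can filter on `event` or `user` without parsing free-form strings.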

Slide 131

Slide 131 text

@tyler_treat JSON is fine.

Slide 132

Slide 132 text

@tyler_treat Pass a context object to everything.

Slide 133

Slide 133 text

@tyler_treat
def login(ctx, username, email, password):
    ctx.set(user=username, email=email)
    ...
    log.error("User login failed",
              event=LOGIN_ERROR,
              context=ctx,
              error=error)
    ...

Slide 134

Slide 134 text

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "email": "[email protected]"
  },
  "error": "Invalid username or password",
  "message": "User login failed"
}

Slide 135

Slide 135 text

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "ERROR",
  "event": "user_login_error",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "email": "[email protected]"
  },
  "error": "Invalid username or password",
  "message": "User login failed"
}
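The context object in these slides might look roughly like the following. This is a sketch under assumptions: the `Context` class, its `as_dict` method, and the `handle_request` helper are hypothetical; only the `ctx.set(...)` call and the field names come from the slides.

```python
import uuid


class Context:
    """Request-scoped bag of fields that every log entry can include."""

    def __init__(self, **fields):
        # A random 32-character hex id ties together all entries
        # emitted while handling one request.
        self._fields = {"id": uuid.uuid4().hex}
        self._fields.update(fields)

    def set(self, **fields):
        """Add or overwrite fields as more becomes known."""
        self._fields.update(fields)
        return self

    def as_dict(self):
        """Return a copy suitable for embedding in a log entry."""
        return dict(self._fields)


def handle_request(http_method, http_path):
    # Fields available "for free" at the edge of the system.
    return Context(http_method=http_method, http_path=http_path)


def login(ctx, username, email):
    # Fields that only this code path knows are added along the way.
    ctx.set(user=username, email=email)
    return {
        "event": "user_login_error",
        "context": ctx.as_dict(),
        "message": "User login failed",
    }
```

Creating the context once at the request boundary and passing it down means each function only adds what it knows, and every log entry carries the accumulated fields.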

Slide 136

Slide 136 text

@tyler_treat What goes on the context?

Slide 137

Slide 137 text

@tyler_treat What can you get for “free” and what do you need to pass along?

Slide 138

Slide 138 text

@tyler_treat Create standard specs for each data type collected (logs, metrics, traces).

Slide 139

Slide 139 text

@tyler_treat Specs can enforce required fields (e.g. user id, license, trace id) and data types.

Slide 140

Slide 140 text

@tyler_treat
{
  "timestamp": "2019-04-05 13:26.42",
  "level": "INFO",
  "event": "user_login",
  "context": {
    "id": "accfbb8315c44a52ad893ca6772e1caf",
    "http_method": "POST",
    "http_path": "/login",
    "user": "tylertreat",
    "user_id": "3bb12f6c63274abe87fd1ee4ee37f3d2",
    "license": "942e6543f0844be680e72003d5e060fd",
    "email": "[email protected]"
  }
}
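One way a spec could enforce required fields and data types is a small validation function. The sketch below is illustrative only: the particular required fields and the `validate_log_entry` helper are assumptions loosely based on the examples in these slides, not a spec from the talk.

```python
# Hypothetical spec: field name -> required Python type.
REQUIRED_TOP_LEVEL = {"timestamp": str, "level": str, "event": str}
REQUIRED_CONTEXT = {"id": str, "user_id": str, "license": str}


def _check(mapping, spec, prefix):
    """Compare one mapping against one spec, collecting violations."""
    problems = []
    for name, expected in spec.items():
        if name not in mapping:
            problems.append(f"{prefix}{name}: missing")
        elif not isinstance(mapping[name], expected):
            problems.append(f"{prefix}{name}: expected {expected.__name__}")
    return problems


def validate_log_entry(entry):
    """Return a list of spec violations; an empty list means the entry conforms."""
    problems = _check(entry, REQUIRED_TOP_LEVEL, "")
    context = entry.get("context")
    if not isinstance(context, dict):
        problems.append("context: missing")
    else:
        problems.extend(_check(context, REQUIRED_CONTEXT, "context."))
    return problems
```

Returning a list of violations rather than raising on the first one lets a library report everything wrong with an entry at once, for example in a test suite or a CI check.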

Slide 141

Slide 141 text

@tyler_treat Be mindful not to log sensitive data like passwords.
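A logging library can guard against this by redacting known-sensitive keys before an entry is written. The sketch below assumes a hypothetical key list and a hypothetical `redact` helper; a real deployment would need its own list and probably value-based checks as well.

```python
# Hypothetical set of key names treated as sensitive.
SENSITIVE_KEYS = {"password", "passwd", "secret", "token", "authorization"}


def redact(value):
    """Return a copy of `value` with sensitive fields masked, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value
```

Key-based redaction only catches fields whose names are on the list; a secret stored under an unexpected key or embedded in a message string would pass through.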

Slide 142

Slide 142 text

@tyler_treat 2. Specification Libraries
Specs alone aren’t enough!

Slide 143

Slide 143 text

@tyler_treat Empowering developers requires providing tools that align the “easy” path with the “right” path.

Slide 144

Slide 144 text

@tyler_treat We need libraries that implement the specs and make it easy for devs to instrument their systems.

Slide 145

Slide 145 text

@tyler_treat There are many existing libraries for structured logging.
• Java: log4j
• Go: logrus
• Python: structlog
• Ruby: ruby-cabin
• .NET: serilog
• JS: structured-log
• etc.

Slide 146

Slide 146 text

@tyler_treat For tracing and metrics, there are vendor-neutral APIs like OpenTracing and OpenCensus.

Slide 147

Slide 147 text

@tyler_treat 3. Data Collector
We need a lightweight agent that can collect data from hosts/containers.

Slide 148

Slide 148 text

@tyler_treat Collect data, perform transformations/filters, and write it to the data pipeline.
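The collect, transform/filter, and write steps can be pictured as a short loop. This is a conceptual sketch only, not a replacement for a real collector: the `collect` function, the `publish` callback, the level table, and the added `host` field are all assumptions made for illustration.

```python
import json

# Hypothetical severity ordering used for filtering.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def collect(lines, publish, host, min_level="INFO"):
    """Read raw lines, filter and enrich them, and hand them to `publish`.

    Returns the number of entries forwarded.
    """
    forwarded = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            entry = None
        if not isinstance(entry, dict):
            # Wrap unstructured output so nothing downstream has to guess.
            entry = {"level": "INFO", "message": line, "unstructured": True}
        # Filter: drop entries below the configured severity.
        severity = LEVELS.get(str(entry.get("level", "INFO")), LEVELS["INFO"])
        if severity < LEVELS[min_level]:
            continue
        # Transform: enrich with where the entry came from.
        entry["host"] = host
        publish(entry)
        forwarded += 1
    return forwarded
```

Here `publish` stands in for whatever client writes to the data pipeline, so the collector itself knows nothing about the backends that eventually receive the data.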

Slide 149

Slide 149 text

@tyler_treat Typically runs as an agent on the host (DaemonSet in Kubernetes).

Slide 150

Slide 150 text

@tyler_treat Data is written to stdout/stderr or a Unix domain socket.
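To make the collect, transform/filter, write loop concrete, here is a toy sketch. It is only an illustration of the steps, not a substitute for an off-the-shelf agent; the function name `collect`, the level names, and the `host` enrichment field are all assumptions.

```python
import json


def collect(lines, sink, min_level="INFO", host="unknown"):
    """Read JSON log lines, drop low-priority ones, enrich the rest, and pass them to sink."""
    levels = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}
    threshold = levels[min_level]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            # Wrap unstructured output so it still reaches the pipeline.
            record = {"level": "INFO", "event": "unstructured", "message": line}
        # Filter: records with an unrecognized level are kept rather than dropped.
        if levels.get(str(record.get("level")), threshold) < threshold:
            continue
        record["host"] = host  # Transform: add where the record came from.
        sink(record)
```

Here `lines` can be any iterable of strings, such as `sys.stdin` or a file tailing a container's stdout, and `sink` is whatever writes to the data pipeline.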

Slide 151

Slide 151 text

@tyler_treat Just use Fluentd or Logstash (+Beats).

Slide 152

Slide 152 text

@tyler_treat 4. Data Pipeline: We need a scalable, fault-tolerant data stream to handle the firehose of observability data generated.

Slide 153

Slide 153 text

@tyler_treat This also provides a buffer that decouples producers from consumers.
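The decoupling can be illustrated with an in-memory stand-in. The sketch below uses a bounded `queue.Queue` and a consumer thread; this is an assumption made purely for illustration, since a real pipeline would be a durable, distributed stream rather than a queue inside one process. The name `run_pipeline` is hypothetical.

```python
import queue
import threading


def run_pipeline(records, handle, capacity=100):
    """Push records through a bounded in-memory buffer to a single consumer thread."""
    buffer = queue.Queue(maxsize=capacity)
    done = object()  # Sentinel marking the end of the stream.

    def consume():
        while True:
            item = buffer.get()
            if item is done:
                break
            handle(item)

    worker = threading.Thread(target=consume)
    worker.start()
    for record in records:
        buffer.put(record)  # Blocks only when the buffer is full.
    buffer.put(done)
    worker.join()
```

The producer never calls `handle` directly: it only needs the buffer to have room, so a slow or briefly unavailable consumer does not stall it until the buffer fills.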

Slide 154

Slide 154 text

@tyler_treat System Splunk Universal Forwarder Datadog APM Agent Universal Analytics Client Amazon Glacier S3 Client … Datadog Metrics Agent

Slide 155

Slide 155 text

@tyler_treat System Splunk Universal Forwarder Datadog APM Agent Universal Analytics Client Amazon Glacier S3 Client … Datadog Metrics Agent

Slide 156

Slide 156 text

@tyler_treat Lots of options…

Slide 157

Slide 157 text

@tyler_treat

Slide 158

Slide 158 text

@tyler_treat 5. Data Router: We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.

Slide 159

Slide 159 text

@tyler_treat May perform transformations and processing of data, but heavy processing should be the responsibility of a backend system (e.g. alerting or aggregations).

Slide 160

Slide 160 text

@tyler_treat This is where the data spec comes into play.

Slide 161

Slide 161 text

@tyler_treat The data type determines how incoming data is routed.
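Routing by data type can be as simple as a lookup table from type to backends. In the sketch below, the `type` field on each record, the function name `make_router`, and the idea of a default backend are assumptions for illustration; real backends would be calls to vendor APIs rather than plain callables.

```python
def make_router(routes, default=None):
    """Build a function that sends each record to the backends registered for its type."""

    def route(record):
        backends = routes.get(record.get("type"), [])
        if not backends and default is not None:
            backends = [default]  # Unknown types fall back to the default backend.
        for backend in backends:
            backend(record)
        return len(backends)  # Number of backends the record was delivered to.

    return route
```

For example, `make_router({"log": [send_to_log_search, archive], "metric": [send_to_metrics, archive]}, default=archive)` would fan logs and metrics out to their own tools while archiving everything, with all four callables being placeholders you supply.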

Slide 162

Slide 162 text

@tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

Slide 163

Slide 163 text

@tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

Slide 164

Slide 164 text

@tyler_treat Data Pipeline Amazon Glacier Data Router logs traces metrics

Slide 165

Slide 165 text

@tyler_treat This is primarily a stateless component writing to APIs.

Slide 166

Slide 166 text

@tyler_treat Good fit for “serverless” solutions.
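As one possible shape for a serverless router, here is a sketch of an AWS Lambda-style handler fed by a Kinesis stream, where each record's payload arrives base64-encoded under `Records[].kinesis.data`. The choice of Lambda and Kinesis is an assumption (the talk does not name a platform), and `DELIVERED` is a hypothetical stand-in for real backend API calls.

```python
import base64
import json

# Hypothetical backends keyed by data type; real ones would call vendor APIs.
DELIVERED = {"log": [], "metric": [], "trace": []}


def handler(event, context):
    """Lambda-style entry point: decode each Kinesis record and route it by type."""
    routed = 0
    for item in event.get("Records", []):
        payload = base64.b64decode(item["kinesis"]["data"])
        record = json.loads(payload)
        backend = DELIVERED.get(record.get("type"))
        if backend is not None:
            backend.append(record)
            routed += 1
    return {"routed": routed}
```

Because the handler keeps no state between invocations beyond what it writes to the backends, the platform can scale it up and down with the volume of the stream.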

Slide 167

Slide 167 text

@tyler_treat Piecing It All Together

Slide 168

Slide 168 text

@tyler_treat

Slide 169

Slide 169 text

@tyler_treat You don’t need to build it out all in one go.

Slide 170

Slide 170 text

@tyler_treat There are quick wins along the way!

Slide 171

Slide 171 text

@tyler_treat Evolving to an Observability Pipeline
• Adopt structured logging
• Move log/data collection out of process
• Use a centralized logging system
• Introduce a streaming data solution
• Start adding data consumers

Slide 172

Slide 172 text

@tyler_treat Moving from host-centric to service-centric observability.

Slide 173

Slide 173 text

@tyler_treat This maps to VMs and containers as well as it does to “serverless” models.

Slide 174

Slide 174 text

@tyler_treat Ops, Systems, Production, Product Development, Product Management, Security & Compliance, Support/Helpdesk

Slide 175

Slide 175 text

@tyler_treat Dev/Ops/SRE Systems Production Audit Business Analytics Pricing Decisions Data-Driven Product Decisions Threat Detection Monitoring Debugging & Operational Insights ...

Slide 176

Slide 176 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 177

Slide 177 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 178

Slide 178 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 179

Slide 179 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 180

Slide 180 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 181

Slide 181 text

@tyler_treat Dev/Ops/SRE Systems Production

Slide 182

Slide 182 text

@tyler_treat Benefits
• The pattern can be adopted incrementally, with quick wins along the way
• Maps better to elastic and serverless architectures
• Empowers teams in siloed organizations and unlocks data for other parts of the business
• Enables teams to use the tools best suited to their needs
• Decoupling makes it easier to change tools or evaluate them side by side
• Minimizes impact on developers and the core system

Slide 183

Slide 183 text

@tyler_treat But it’s not a silver bullet.

Slide 184

Slide 184 text

@tyler_treat Downsides
• Moving away from an agent-based model means we have to handle data routing ourselves
• A lot of the Data Router components might need to be custom-made using various vendor SDKs or client libraries (assuming they have APIs)
• This also means we might lose some of the value-add features of certain agents
• Unclear how well this maps to pull-based models (e.g. Prometheus)

Slide 185

Slide 185 text

@tyler_treat CI/CD Pipeline + Observability Pipeline

Slide 186

Slide 186 text

@tyler_treat CI/CD: Pre-Production (theorizing about known unknowns)
Observability: Post-Production (learning from unknown unknowns)

Slide 187

Slide 187 text

@tyler_treat Thank You realkinetic.com
 bravenewgeek.com