Slide 1

Slide 1 text

realkinetic.com | @real_kinetic What is Happening? Attempting to Understand Our Systems

Slide 2

Slide 2 text

realkinetic.com | @real_kinetic About Me (obligatory sales pitch)

Slide 3

Slide 3 text

realkinetic.com | @real_kinetic Beau Lyddon Managing Partner at Real Kinetic

Slide 4

Slide 4 text

realkinetic.com | @real_kinetic Currently live in beautiful Boulder, CO

Slide 5

Slide 5 text

realkinetic.com | @real_kinetic Started a company: Real Kinetic mentors clients to enable their technical teams to grow and build high-quality software

Slide 6

Slide 6 text

realkinetic.com | @real_kinetic

Slide 7

Slide 7 text

realkinetic.com | @real_kinetic PSA: I use a ton of slides so those offline can follow my narrative using just the slides. So don’t worry about reading every word as I will be verbalizing them out loud.

Slide 8

Slide 8 text

realkinetic.com | @real_kinetic Also, I’m going to go real fast through the beginning. I’m cramming a lot in 30 min.

Slide 9

Slide 9 text

realkinetic.com | @real_kinetic Let’s get going

Slide 10

Slide 10 text

realkinetic.com | @real_kinetic Every company is becoming a technology company

Slide 11

Slide 11 text

realkinetic.com | @real_kinetic Technology (especially software) is becoming a critical piece of every business

Slide 12

Slide 12 text

realkinetic.com | @real_kinetic It’s more and more difficult to do jobs without understanding technology

Slide 13

Slide 13 text

realkinetic.com | @real_kinetic And it’s not really about architecture diagrams (They’re needed but only part of the story)

Slide 14

Slide 14 text

realkinetic.com | @real_kinetic It’s about the “Why” (From Andrew’s Presentation)

Slide 15

Slide 15 text

realkinetic.com | @real_kinetic But it’s not just us. Or even those in R&D.

Slide 16

Slide 16 text

realkinetic.com | @real_kinetic At this point it’s pretty much everyone

Slide 17

Slide 17 text

realkinetic.com | @real_kinetic Our own team, peer teams, support & operations, management, R&D leadership, marketing, sales, executives, board members, customer service, investors, auditors and CUSTOMERS

Slide 18

Slide 18 text

realkinetic.com | @real_kinetic And providing understanding at their perspective is critical

Slide 19

Slide 19 text

realkinetic.com | @real_kinetic All of these people have slightly different perspectives and needs

Slide 20

Slide 20 text

realkinetic.com | @real_kinetic No single “diagram” or even story will work

Slide 21

Slide 21 text

realkinetic.com | @real_kinetic Much of engineering leadership is becoming about explaining our systems to “the rest of the world” (no more God syndrome)

Slide 22

Slide 22 text

realkinetic.com | @real_kinetic And since we’ve generally sucked at this, the government and general population are starting to force our hands.

Slide 23

Slide 23 text

realkinetic.com | @real_kinetic They are finally realizing that “software is eating the world” and that they don’t really understand it.

Slide 24

Slide 24 text

realkinetic.com | @real_kinetic Which really freaks them out

Slide 25

Slide 25 text

realkinetic.com | @real_kinetic Justifiably

Slide 26

Slide 26 text

realkinetic.com | @real_kinetic We have not done a good job helping others understand our “stuff” (MOM: What do you do again? ME: Stuff)

Slide 27

Slide 27 text

realkinetic.com | @real_kinetic Right, Mark?

Slide 28

Slide 28 text

realkinetic.com | @real_kinetic So they’re pushing back

Slide 29

Slide 29 text

realkinetic.com | @real_kinetic

Slide 30

Slide 30 text

realkinetic.com | @real_kinetic FBI (encryption), Facebook (data privacy), GDPR (data privacy), Compliance

Slide 31

Slide 31 text

realkinetic.com | @real_kinetic When we watch the congressional hearings and go “what morons” we should really be saying “we failed”

Slide 32

Slide 32 text

realkinetic.com | @real_kinetic But now the cameras are officially on us

Slide 33

Slide 33 text

realkinetic.com | @real_kinetic

Slide 34

Slide 34 text

realkinetic.com | @real_kinetic

Slide 35

Slide 35 text

realkinetic.com | @real_kinetic

Slide 36

Slide 36 text

realkinetic.com | @real_kinetic But here’s the real kicker

Slide 37

Slide 37 text

realkinetic.com | @real_kinetic MANY OF US have NO CLUE what the hell OUR SYSTEMS ARE DOING

Slide 38

Slide 38 text

realkinetic.com | @real_kinetic So we need to start with ourselves

Slide 39

Slide 39 text

realkinetic.com | @real_kinetic We need to ensure that we can understand our systems and then work our way up

Slide 40

Slide 40 text

realkinetic.com | @real_kinetic And provide the tools that allow all to understand the system from everybody’s perspective

Slide 41

Slide 41 text

realkinetic.com | @real_kinetic This job is actually very difficult (I believe it’s more difficult to explain and fully understand than it is to actually build)

Slide 42

Slide 42 text

realkinetic.com | @real_kinetic Why?

Slide 43

Slide 43 text

realkinetic.com | @real_kinetic Our systems are more complex than they’ve ever been (And only growing increasingly complex)

Slide 44

Slide 44 text

realkinetic.com | @real_kinetic Historical

Slide 45

Slide 45 text

realkinetic.com | @real_kinetic The “simpler” times (It did not feel simpler at the time)

Slide 46

Slide 46 text

realkinetic.com | @real_kinetic We had mainframes, Windows apps, client server, etc

Slide 47

Slide 47 text

realkinetic.com | @real_kinetic These were all very controlled and constrained systems (Or at least it felt that way)

Slide 48

Slide 48 text

realkinetic.com | @real_kinetic 24/7 Uptime … pfft (We would take systems down for nights and weekends. 5 9s. Ha!)

Slide 49

Slide 49 text

realkinetic.com | @real_kinetic We hardly ever released (Release cycles measured in years, months if you were aggressive)

Slide 50

Slide 50 text

realkinetic.com | @real_kinetic Realtime!? … What does that even mean?

Slide 51

Slide 51 text

realkinetic.com | @real_kinetic You may not believe this but we would run nightly (or weekend) jobs to create reports. (On paper. PAPER!)

Slide 52

Slide 52 text

realkinetic.com | @real_kinetic Devices?

Slide 53

Slide 53 text

realkinetic.com | @real_kinetic We tell you what “device” you will use. (Mainframe terminal, windows, IE, Blackberry)

Slide 54

Slide 54 text

realkinetic.com | @real_kinetic Systems now?

Slide 55

Slide 55 text

realkinetic.com | @real_kinetic Let’s start with a client server architecture built under the old constraints

Slide 56

Slide 56 text

realkinetic.com | @real_kinetic And then evolve it as our constraints evolve

Slide 57

Slide 57 text

realkinetic.com | @real_kinetic

Slide 58

Slide 58 text

realkinetic.com | @real_kinetic Downtime is unacceptable (“x” 9s :/)

Slide 59

Slide 59 text

realkinetic.com | @real_kinetic

Slide 60

Slide 60 text

realkinetic.com | @real_kinetic

Slide 61

Slide 61 text

realkinetic.com | @real_kinetic Devices you say?

Slide 62

Slide 62 text

realkinetic.com | @real_kinetic Oh we’ve got devices. All the damn devices.

Slide 63

Slide 63 text

realkinetic.com | @real_kinetic

Slide 64

Slide 64 text

realkinetic.com | @real_kinetic Realtime?

Slide 65

Slide 65 text

realkinetic.com | @real_kinetic “Uh yeah! I’m not waiting even a second for what I want”

Slide 66

Slide 66 text

realkinetic.com | @real_kinetic No more “stale” reports

Slide 67

Slide 67 text

realkinetic.com | @real_kinetic I want answers (data) now. (Oh, and it better be visual and interactive)

Slide 68

Slide 68 text

realkinetic.com | @real_kinetic

Slide 69

Slide 69 text

realkinetic.com | @real_kinetic As-fast-as-possible releases (From years to months to days to multiple times an hour and maybe even faster at scale)

Slide 70

Slide 70 text

realkinetic.com | @real_kinetic And anybody can release. For any reason. (You must release to keep up with demand and to quickly fix issues)

Slide 71

Slide 71 text

realkinetic.com | @real_kinetic

Slide 72

Slide 72 text

realkinetic.com | @real_kinetic We expect access from anywhere at any time

Slide 73

Slide 73 text

realkinetic.com | @real_kinetic

Slide 74

Slide 74 text

realkinetic.com | @real_kinetic

Slide 75

Slide 75 text

realkinetic.com | @real_kinetic

Slide 76

Slide 76 text

realkinetic.com | @real_kinetic

Slide 77

Slide 77 text

realkinetic.com | @real_kinetic The Modern Technology Cluster #*@!

Slide 78

Slide 78 text

realkinetic.com | @real_kinetic The Modern Technology Cluster #*@! Stack

Slide 79

Slide 79 text

realkinetic.com | @real_kinetic The complexity has risen significantly

Slide 80

Slide 80 text

realkinetic.com | @real_kinetic But don’t worry OSS is here to save you (SPOILER: Only, kinda)

Slide 81

Slide 81 text

realkinetic.com | @real_kinetic

Slide 82

Slide 82 text

realkinetic.com | @real_kinetic Those are the tools that Tyler mentioned that you can use, but you need to wrap them with your “glue code” (your culture, your processes)

Slide 83

Slide 83 text

realkinetic.com | @real_kinetic Oh … and I’m not done

Slide 84

Slide 84 text

realkinetic.com | @real_kinetic There are significantly more nodes in the system

Slide 85

Slide 85 text

realkinetic.com | @real_kinetic And many connections between these nodes to handle scale

Slide 86

Slide 86 text

realkinetic.com | @real_kinetic These connections create dependency trees

Slide 87

Slide 87 text

realkinetic.com | @real_kinetic And what’s more, the nodes and connections are constantly changing

Slide 88

Slide 88 text

realkinetic.com | @real_kinetic All while we must maintain usage rates

Slide 89

Slide 89 text

realkinetic.com | @real_kinetic Thus we end up with different versions of the same type of node potentially within a single request

Slide 90

Slide 90 text

realkinetic.com | @real_kinetic All of this (and more) leads to our systems producing emergent behaviors that can’t be predicted.

Slide 91

Slide 91 text

realkinetic.com | @real_kinetic In other words our systems are becoming much more similar to “living” systems (Cities, governments, ecological, biological, etc)

Slide 92

Slide 92 text

realkinetic.com | @real_kinetic So this …

Slide 93

Slide 93 text

realkinetic.com | @real_kinetic

Slide 94

Slide 94 text

realkinetic.com | @real_kinetic is kind of … like … alive?

Slide 95

Slide 95 text

realkinetic.com | @real_kinetic

Slide 96

Slide 96 text

realkinetic.com | @real_kinetic We may have created a monster

Slide 97

Slide 97 text

realkinetic.com | @real_kinetic And it might kill us! F*$@!

Slide 98

Slide 98 text

realkinetic.com | @real_kinetic Let’s go back to the old way

Slide 99

Slide 99 text

realkinetic.com | @real_kinetic Except it’s too late.

Slide 100

Slide 100 text

realkinetic.com | @real_kinetic This actually works.

Slide 101

Slide 101 text

realkinetic.com | @real_kinetic Beyond the obvious successful companies (Google, Amazon, Facebook), the research backs up that these systems help all types of companies that embrace them across all industries.

Slide 102

Slide 102 text

realkinetic.com | @real_kinetic Dynamic systems that support rapid development and experimentation directly increase quality and velocity

Slide 103

Slide 103 text

realkinetic.com | @real_kinetic Thus IT becomes a differentiator and is no longer a cost center

Slide 104

Slide 104 text

realkinetic.com | @real_kinetic DevOps is a critical piece of this transformation

Slide 105

Slide 105 text

realkinetic.com | @real_kinetic If you don’t have a dynamic system that supports experimentation and rapid release, and you don’t embrace DevOps, you will be beaten by those that do

Slide 106

Slide 106 text

realkinetic.com | @real_kinetic Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations https://a16z.com/2018/03/28/devops-org-change-software-performance/ a16z Podcast: Feedback Loops — Company Culture, Change, and DevOps

Slide 107

Slide 107 text

realkinetic.com | @real_kinetic So if this isn’t your world, it likely will be in the future

Slide 108

Slide 108 text

realkinetic.com | @real_kinetic So what can we do to attempt to understand the chaos?

Slide 109

Slide 109 text

realkinetic.com | @real_kinetic An example from our past experience at Workiva

Slide 110

Slide 110 text

realkinetic.com | @real_kinetic “Calc”

Slide 111

Slide 111 text

realkinetic.com | @real_kinetic A system and method that efficiently, robustly, and flexibly permits large scale distributed asynchronous calculations in a networked environment, where the number of users entering data is large, the number of variables and equations are large and can comprise long and/or wide dependency chains, and data integrity is important

Slide 112

Slide 112 text

realkinetic.com | @real_kinetic Or … a distributed calculation engine

Slide 113

Slide 113 text

realkinetic.com | @real_kinetic Built on stateless runtimes with no SSH or live debugging (Serverless in 2011, yep it was a thing)

Slide 114

Slide 114 text

realkinetic.com | @real_kinetic Not that SSH or Debuggers would have mattered

Slide 115

Slide 115 text

realkinetic.com | @real_kinetic Massive Scale (Millions of nodes)

Slide 116

Slide 116 text

realkinetic.com | @real_kinetic A tease …

Slide 117

Slide 117 text

realkinetic.com | @real_kinetic

Slide 118

Slide 118 text

realkinetic.com | @real_kinetic

Slide 119

Slide 119 text

realkinetic.com | @real_kinetic Structure, and thus behavior, changed when the data changed (Very dynamic)

Slide 120

Slide 120 text

realkinetic.com | @real_kinetic What is the state of the system?
 Is it done? What is done? Is it broken? What is broken? What is fast/slow?

Slide 121

Slide 121 text

realkinetic.com | @real_kinetic A single actor in the system does not know the status of the overall system.

Slide 122

Slide 122 text

realkinetic.com | @real_kinetic There is no obvious way to track the status of the system unless the nodes within the system help us

Slide 123

Slide 123 text

realkinetic.com | @real_kinetic To have any chance of keeping up with the understanding of systems we need the systems to self describe

Slide 124

Slide 124 text

realkinetic.com | @real_kinetic And of course we need automation and self healing

Slide 125

Slide 125 text

realkinetic.com | @real_kinetic And to have self description, automation, and self healing we need data. We need the systems to give us data to provide necessary context.

Slide 126

Slide 126 text

realkinetic.com | @real_kinetic So what are the specifics?

Slide 127

Slide 127 text

realkinetic.com | @real_kinetic We’ll start by working our way up from the code

Slide 128

Slide 128 text

realkinetic.com | @real_kinetic 1. Pass a context object to basically everything

Slide 129

Slide 129 text

realkinetic.com | @real_kinetic
type Context =
  { user_id :: String
  , account_id :: String
  , trace_id :: String
  , request_id :: String
  , parent_id :: Maybe String
  }

Slide 130

Slide 130 text

realkinetic.com | @real_kinetic What goes on the context?

Slide 131

Slide 131 text

realkinetic.com | @real_kinetic Think about the data you wish you had when debugging an issue (This is why your devs should support their own systems)

Slide 132

Slide 132 text

realkinetic.com | @real_kinetic What is the data that would change the behavior of the system?

Slide 133

Slide 133 text

realkinetic.com | @real_kinetic The user (and/or company), time, machine stats (CPU, Memory, etc), software version, configuration data, the calling request, any dependent requests

Slide 134

Slide 134 text

realkinetic.com | @real_kinetic What of that can we get for “free” and what do we need to pass along? (Free == machine provided: memory, CPU, etc)

Slide 135

Slide 135 text

realkinetic.com | @real_kinetic The data we can’t get for “free” should go on the context (Data that is “request” specific: user, company, calling request ID)

Slide 136

Slide 136 text

realkinetic.com | @real_kinetic There are side-benefits as well

Slide 137

Slide 137 text

realkinetic.com | @real_kinetic If you’re a SaaS company you should probably pass licensing data as part of the context

Slide 138

Slide 138 text

realkinetic.com | @real_kinetic This will allow you to move processes around based on their license

Slide 139

Slide 139 text

realkinetic.com | @real_kinetic Imagine routing traffic to specific queues based on user, account, license and environment (usage, resources available) (The ability to isolate processes at runtime. Amazon is the king of this.)
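A minimal sketch of that routing idea, assuming the context carries a license field. The names `license_tier`, `NOISY_ACCOUNTS`, and the queue names are hypothetical and only illustrate picking a queue from data already on the context; they are not from any real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    user_id: str
    account_id: str
    license_tier: str  # hypothetical field: "free", "standard", "enterprise"


# Hypothetical set of accounts we want to isolate from everyone else.
NOISY_ACCOUNTS = {"a-heavy"}


def choose_queue(ctx: Context) -> str:
    """Pick a queue name using only data carried on the context."""
    if ctx.account_id in NOISY_ACCOUNTS:
        return "isolated"  # keep a heavy account away from shared workers
    if ctx.license_tier == "enterprise":
        return "priority"
    return "default"


print(choose_queue(Context("u1", "a1", "enterprise")))
```

Because the decision only reads the context, the same function can run at any hop in the system without extra lookups.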

Slide 140

Slide 140 text

realkinetic.com | @real_kinetic Also, think about GDPR and needing to track user actions, data and what they have approved the system to do

Slide 141

Slide 141 text

realkinetic.com | @real_kinetic Please, use some data structure to pass contextual data to all dependent functions

Slide 142

Slide 142 text

realkinetic.com | @real_kinetic This is the easiest thing you can start doing today

Slide 143

Slide 143 text

realkinetic.com | @real_kinetic Oh, and then make sure to log that context on every request
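A rough Python sketch of passing a context to everything and logging it on each request. The fields mirror the `Context` type shown earlier; the function names (`handle_request`, `load_report`, `log_with_context`) and the logger name are made up for illustration.

```python
import json
import logging
import sys
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Context:
    user_id: str
    account_id: str
    trace_id: str
    request_id: str
    parent_id: Optional[str] = None


logger = logging.getLogger("context_example")  # hypothetical logger name
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)


def log_with_context(ctx: Context, message: str) -> str:
    """Emit one JSON log line that always carries the full context."""
    line = json.dumps({"message": message, **asdict(ctx)}, sort_keys=True)
    logger.info(line)
    return line


def load_report(ctx: Context, report_id: str) -> dict:
    # Every dependent function receives the same context object.
    log_with_context(ctx, f"loading report {report_id}")
    return {"report_id": report_id, "owner": ctx.user_id}


def handle_request(ctx: Context, report_id: str) -> dict:
    log_with_context(ctx, "request received")
    return load_report(ctx, report_id)


handle_request(Context("u1", "a1", "t1", "r1"), "rep-42")
```

Making the context immutable (`frozen=True`) is a design choice in this sketch: a function that needs different values creates a new context rather than mutating a shared one.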

Slide 144

Slide 144 text

realkinetic.com | @real_kinetic And speaking of logging

Slide 145

Slide 145 text

realkinetic.com | @real_kinetic 2. Structure your logs (JSON is fine)

Slide 146

Slide 146 text

realkinetic.com | @real_kinetic I’m tired of writing regexes to scrape logs because we’re too lazy to add structure at the time it actually makes the most sense

Slide 147

Slide 147 text

realkinetic.com | @real_kinetic
[{
  "env": "Dev",
  "server_name": "AWS1",
  "app_name": "MyService",
  "app_loc": "/home/app",
  "user_id": "u1",
  "account_id": "a1",
  "logger": "mylogger",
  "platform": "py",
  "trace_id": "t1",
  "parent_id": "p1",
  "messages": [{
    "tag": "Incoming metrics data",
    "data": "{\"clientid\":54732}",
    "thread": "10",
    "time": 1485555302470,
    "level": "DEBUG",
    "id": "0c28701b-e4de-11e6-8936-8975598968a4"
  }]
}]

Slide 148

Slide 148 text

realkinetic.com | @real_kinetic You can take this as far as you’d like

Slide 149

Slide 149 text

realkinetic.com | @real_kinetic Very structured with a type system, code reviews, etc

Slide 150

Slide 150 text

realkinetic.com | @real_kinetic There are many existing libraries (Too many to list. Just Google “Structured logs” and your language of choice)

Slide 151

Slide 151 text

realkinetic.com | @real_kinetic But at minimum get your logs into a standard format with property tags
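As one way to get there with only the Python standard library, here is a sketch of a JSON formatter for the `logging` module. The field names loosely follow the example log above; passing request data under an `extra={"context": ...}` key is a convention invented for this sketch, not a `logging` feature.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render every record as a single JSON object with fixed property tags."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": int(record.created * 1000),  # epoch milliseconds
            "level": record.levelname,
            "logger": record.name,
            "tag": record.getMessage(),
        }
        # "context" is our own convention, supplied via logging's `extra`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "mylogger") -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


log = build_logger()
log.debug(
    "Incoming metrics data",
    extra={"context": {"user_id": "u1", "account_id": "a1", "trace_id": "t1"}},
)
```

Anything downstream can now filter on `user_id` or `trace_id` with a JSON parser instead of a regex.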

Slide 152

Slide 152 text

realkinetic.com | @real_kinetic 3. Create a data pipeline

Slide 153

Slide 153 text

realkinetic.com | @real_kinetic There is a ton of data that you want and need to collect

Slide 154

Slide 154 text

realkinetic.com | @real_kinetic Logs, metrics, analytics, audits, etc

Slide 155

Slide 155 text

realkinetic.com | @real_kinetic We want to make it as simple, yet robust as possible

Slide 156

Slide 156 text

realkinetic.com | @real_kinetic But most importantly we want some system that has all of the data

Slide 157

Slide 157 text

realkinetic.com | @real_kinetic What we often see at the beginning:

Slide 158

Slide 158 text

realkinetic.com | @real_kinetic

Slide 159

Slide 159 text

realkinetic.com | @real_kinetic

Slide 160

Slide 160 text

realkinetic.com | @real_kinetic

Slide 161

Slide 161 text

realkinetic.com | @real_kinetic

Slide 162

Slide 162 text

realkinetic.com | @real_kinetic And now your services are spending more time with non-critical-path dependencies than those on the critical path

Slide 163

Slide 163 text

realkinetic.com | @real_kinetic Standardize & simplify

Slide 164

Slide 164 text

realkinetic.com | @real_kinetic A single data pipeline (queue) (Or use a pull process. Just get your logs into a central location)

Slide 165

Slide 165 text

realkinetic.com | @real_kinetic

Slide 166

Slide 166 text

realkinetic.com | @real_kinetic Look into “sidecar” style collection

Slide 167

Slide 167 text

realkinetic.com | @real_kinetic

Slide 168

Slide 168 text

realkinetic.com | @real_kinetic This allows you to write to stdout and the sidecar will collect and push to your queue

Slide 169

Slide 169 text

realkinetic.com | @real_kinetic The data pipeline provides a layer of abstraction that allows you to get the data everywhere it needs to be without impacting developers and the “core” system
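To make the abstraction concrete, here is a toy in-process sketch: producers publish to one queue and never learn who consumes the data. `Pipeline`, `subscribe`, `publish`, and `drain` are hypothetical names; a real pipeline would sit on a durable queue or log service rather than `queue.Queue`.

```python
import json
import queue


class Pipeline:
    """Single entry point for all system data; consumers attach separately."""

    def __init__(self):
        self._queue = queue.Queue()
        self._consumers = []

    def subscribe(self, consumer):
        # Adding or removing a consumer never touches producer code.
        self._consumers.append(consumer)

    def publish(self, event: dict) -> None:
        # Producers only know about the queue.
        self._queue.put(json.dumps(event))

    def drain(self) -> int:
        """Deliver every queued event to every consumer; return the count."""
        delivered = 0
        while True:
            try:
                raw = self._queue.get_nowait()
            except queue.Empty:
                return delivered
            event = json.loads(raw)
            for consumer in self._consumers:
                consumer(event)
            delivered += 1


pipeline = Pipeline()
archive = []  # stand-in for cheap long-term storage
pipeline.subscribe(archive.append)
pipeline.subscribe(lambda e: print("to query store:", e["tag"]))
pipeline.publish({"level": "INFO", "tag": "start"})
pipeline.drain()
```

Trying out a second or third backend is then just another `subscribe` call.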

Slide 170

Slide 170 text

realkinetic.com | @real_kinetic

Slide 171

Slide 171 text

realkinetic.com | @real_kinetic Where should all of the data go?

Slide 172

Slide 172 text

realkinetic.com | @real_kinetic At minimum all data should go into a cheap, long term storage solution (AWS Glacier, etc)

Slide 173

Slide 173 text

realkinetic.com | @real_kinetic You’ll want this data for historical system behavior to help “machine learn” your system into automation

Slide 174

Slide 174 text

realkinetic.com | @real_kinetic Ideally, all data should go into a queryable, large scale data storage solution. (solid time based query capabilities a plus) (Google BigQuery, AWS Redshift)

Slide 175

Slide 175 text

realkinetic.com | @real_kinetic This is why we structure our logs

Slide 176

Slide 176 text

realkinetic.com | @real_kinetic There are more targeted or customized solutions starting to fill the space

Slide 177

Slide 177 text

realkinetic.com | @real_kinetic “Start solving high-cardinality problems in minutes” (honeycomb.io)

Slide 178

Slide 178 text

realkinetic.com | @real_kinetic From their marketing …

Slide 179

Slide 179 text

realkinetic.com | @real_kinetic High-cardinality refers to columns with values that are very uncommon or unique. High-cardinality column values are typically identification numbers, email addresses, or user names. An example of a data table column with high-cardinality would be a USERS table with a column named USER_ID.

Slide 180

Slide 180 text

realkinetic.com | @real_kinetic Query anything. Break down, filter, and pivot on high-cardinality fields like user_id.
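A small illustration of what cardinality and a "break down by" query mean on structured events. The sample events and the `duration_ms` field are invented for this sketch; a real store would run this at far larger scale.

```python
from collections import defaultdict


def cardinality(events, field):
    """Number of distinct values seen for a field."""
    return len({e[field] for e in events if field in e})


def break_down(events, field, value_field):
    """Average of value_field grouped by field (a tiny 'group by')."""
    groups = defaultdict(list)
    for e in events:
        groups[e[field]].append(e[value_field])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Invented sample data: structured log events with a timing field.
events = [
    {"user_id": "u1", "level": "INFO", "duration_ms": 10},
    {"user_id": "u2", "level": "ERROR", "duration_ms": 500},
    {"user_id": "u1", "level": "INFO", "duration_ms": 30},
    {"user_id": "u3", "level": "INFO", "duration_ms": 20},
]

print(cardinality(events, "level"), cardinality(events, "user_id"))
print(break_down(events, "user_id", "duration_ms"))
```

Here `level` is low-cardinality and `user_id` grows with the user base, which is exactly the kind of field where breaking down per value points at who was affected.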

Slide 181

Slide 181 text

realkinetic.com | @real_kinetic Once again, this is why we structure our logs

Slide 182

Slide 182 text

realkinetic.com | @real_kinetic See the raw data behind every result.

Slide 183

Slide 183 text

realkinetic.com | @real_kinetic See the exact events leading to an issue, who was affected, and how.

Slide 184

Slide 184 text

realkinetic.com | @real_kinetic Share queries, results, and history. Collaborate.

Slide 185

Slide 185 text

realkinetic.com | @real_kinetic Many other options … (Still a bit too dashboard based but trending in the right direction)

Slide 186

Slide 186 text

realkinetic.com | @real_kinetic

Slide 187

Slide 187 text

realkinetic.com | @real_kinetic The beauty of the data pipeline is you can use 1 or many. And test multiple in parallel if you’d like without interrupting development. (Just don’t forget to have Devs user test the solutions as well)

Slide 188

Slide 188 text

realkinetic.com | @real_kinetic You’re still going to end up with multiple consumers

Slide 189

Slide 189 text

realkinetic.com | @real_kinetic

Slide 190

Slide 190 text

realkinetic.com | @real_kinetic Back to the structured logs thing

Slide 191

Slide 191 text

realkinetic.com | @real_kinetic 4. Structure and standardize all data leaving a system

Slide 192

Slide 192 text

realkinetic.com | @real_kinetic Provide libraries to add structure to not just logs but also metrics, audits, etc

Slide 193

Slide 193 text

realkinetic.com | @real_kinetic We like having one standard across the board

Slide 194

Slide 194 text

realkinetic.com | @real_kinetic But you can also break them apart by “type” … Metrics, audits, tracing, etc

Slide 195

Slide 195 text

realkinetic.com | @real_kinetic As long as you get it standardized across systems

Slide 196

Slide 196 text

realkinetic.com | @real_kinetic But people are quickly realizing that this data is all related and the separation is arbitrary

Slide 197

Slide 197 text

realkinetic.com | @real_kinetic OpenCensus: A single distribution of libraries for metrics and distributed tracing with minimal overhead that allows you to export data to multiple backends. https://opencensus.io

Slide 198

Slide 198 text

realkinetic.com | @real_kinetic Vendor-neutral APIs and instrumentation for distributed tracing. https://opentracing.io

Slide 199

Slide 199 text

realkinetic.com | @real_kinetic Most of the “infrastructure data” players are converging on support for all styles of system data collection

Slide 200

Slide 200 text

realkinetic.com | @real_kinetic

Slide 201

Slide 201 text

realkinetic.com | @real_kinetic With a data pipeline you’ll be set up to handle whatever tool(s) come next (Leverage abstractions at the integration layers to allow easier adaptation to change)

Slide 202

Slide 202 text

realkinetic.com | @real_kinetic 5. Minimize, isolate and track dependencies

Slide 203

Slide 203 text

realkinetic.com | @real_kinetic Unmanaged dependencies are where throughput goes to die (And what creates and increases complexity faster than anything else)

Slide 204

Slide 204 text

realkinetic.com | @real_kinetic Golang got a few things correct.

Slide 205

Slide 205 text

realkinetic.com | @real_kinetic One of them is promoting code duplication over introducing unnecessary dependencies

Slide 206

Slide 206 text

realkinetic.com | @real_kinetic Personally, I promote the Golang + Haskell approach

Slide 207

Slide 207 text

realkinetic.com | @real_kinetic A dependency can be introduced when it is well formalized and worth the cost (In the Haskell world you’ll see laws for APIs. These are pretty stable APIs.)

Slide 208

Slide 208 text

realkinetic.com | @real_kinetic Quick Note:

Slide 209

Slide 209 text

realkinetic.com | @real_kinetic Avoiding dependencies does not mean “build everything”

Slide 210

Slide 210 text

realkinetic.com | @real_kinetic JavaScript Padding Library != AWS Dynamo

Slide 211

Slide 211 text

realkinetic.com | @real_kinetic Using Dynamo + client library is less code and likely no additional dependency vs building from scratch

Slide 212

Slide 212 text

realkinetic.com | @real_kinetic And way better than building your own database (Even though these days people seem to think building a database is easy and necessary)

Slide 213

Slide 213 text

realkinetic.com | @real_kinetic Back to regularly scheduled programming

Slide 214

Slide 214 text

realkinetic.com | @real_kinetic If you’re going to introduce dependencies then clearly track and pin them

Slide 215

Slide 215 text

realkinetic.com | @real_kinetic Ideally a single file in the project/repo. (Or in an aggregate repo)

Slide 216

Slide 216 text

realkinetic.com | @real_kinetic If possible standardize the spec for these files

Slide 217

Slide 217 text

realkinetic.com | @real_kinetic Then create a process that aggregates the dependencies into an overall mapping to give a picture of the system

Slide 218

Slide 218 text

realkinetic.com | @real_kinetic This goes for services as well as libraries

Slide 219

Slide 219 text

realkinetic.com | @real_kinetic Then you can generate diagrams (Free architecture diagrams!)
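A sketch of that aggregation step, assuming each project publishes a pinned dependency manifest. The manifest shape and the service names below are hypothetical; the output is Graphviz DOT text, which common tooling can render into a diagram.

```python
def aggregate(manifests):
    """Merge per-project manifests into one sorted list of edges."""
    edges = set()
    for project, deps in manifests.items():
        for name, version in deps.items():
            edges.add((project, name, version))
    return sorted(edges)


def to_dot(edges):
    """Render the edges as a Graphviz DOT digraph."""
    lines = ["digraph system {"]
    for src, dst, version in edges:
        lines.append(f'  "{src}" -> "{dst}" [label="{version}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical manifests: project -> {dependency: pinned version}.
manifests = {
    "billing": {"auth-service": "1.4.0", "requests": "2.31.0"},
    "reports": {"auth-service": "1.3.2", "billing": "2.0.1"},
}

print(to_dot(aggregate(manifests)))
```

Running this on every commit and storing the output gives the "over time" view: differences between snapshots show dependencies appearing, disappearing, or drifting in version (note the two `auth-service` pins above).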

Slide 220

Slide 220 text

realkinetic.com | @real_kinetic

Slide 221

Slide 221 text

realkinetic.com | @real_kinetic Which you can then visualize over time

Slide 222

Slide 222 text

realkinetic.com | @real_kinetic

Slide 223

Slide 223 text

realkinetic.com | @real_kinetic

Slide 224

Slide 224 text

realkinetic.com | @real_kinetic Netflix has some great examples and tools (Those #%*@$!# are always leading the charge) Out of necessity

Slide 225

Slide 225 text

realkinetic.com | @real_kinetic Spigo and Simianviz

Slide 226

Slide 226 text

realkinetic.com | @real_kinetic https://github.com/adrianco/spigo

Slide 227

Slide 227 text

realkinetic.com | @real_kinetic

Slide 228

Slide 228 text

realkinetic.com | @real_kinetic That said …

Slide 229

Slide 229 text

realkinetic.com | @real_kinetic Service/network dependencies are still a nightmare

Slide 230

Slide 230 text

realkinetic.com | @real_kinetic 6. Use network sidecars (service mesh, proxies) to better isolate and handle these dependencies

Slide 231

Slide 231 text

realkinetic.com | @real_kinetic Similar concept to the data pipeline except with even more “production” benefits

Slide 232

Slide 232 text

realkinetic.com | @real_kinetic

Slide 233

Slide 233 text

realkinetic.com | @real_kinetic A combination of many of the API Gateway, proxy, router, etc solutions that exist today

Slide 234

Slide 234 text

realkinetic.com | @real_kinetic Having a standard network proxy gives you: Load balancing, service discovery, health checking, circuit breakers, standard observability (+tracing)
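To show what one of those features does, here is a minimal circuit breaker sketch. A sidecar proxy would apply this kind of logic outside your code; the class, its thresholds, and the injectable clock are assumptions for illustration, not any proxy's actual implementation.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Fail fast instead of piling load on a struggling dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Otherwise fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(max_failures=2, reset_after=10.0)
print(breaker.call(lambda: "ok"))
```

Getting this from a standard proxy means every service gets the same behavior without each team importing and configuring a library.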

Slide 235

Slide 235 text

realkinetic.com | @real_kinetic Using the sidecar allows you to easily standardize without introducing new dependencies at the code and team level

Slide 236

Slide 236 text

realkinetic.com | @real_kinetic And of course the meta-data can be pumped to your same data pipeline

Slide 237

Slide 237 text

realkinetic.com | @real_kinetic

Slide 238

Slide 238 text

realkinetic.com | @real_kinetic Many new tools, especially around Kubernetes

Slide 239

Slide 239 text

realkinetic.com | @real_kinetic They vary from service-mesh focused to full-bore microservice framework

Slide 240

Slide 240 text

realkinetic.com | @real_kinetic

Slide 241

Slide 241 text

realkinetic.com | @real_kinetic All of these come with “free” monitoring tools

Slide 242

Slide 242 text

realkinetic.com | @real_kinetic And …

Slide 243

Slide 243 text

realkinetic.com | @real_kinetic 7. Distributed Tracing

Slide 244

Slide 244 text

realkinetic.com | @real_kinetic We need better ways to visualize our systems

Slide 245

Slide 245 text

realkinetic.com | @real_kinetic Charts and dashboards are nice for looking at system behaviors from a generic, data-driven perspective

Slide 246

Slide 246 text

realkinetic.com | @real_kinetic But that layer of abstraction (while helping isolate variables) removes a layer of intuition

Slide 247

Slide 247 text

realkinetic.com | @real_kinetic We need the ability to also visualize specific and aggregate behavior

Slide 248

Slide 248 text

realkinetic.com | @real_kinetic Tracing is one example

Slide 249

Slide 249 text

realkinetic.com | @real_kinetic
def my_func(*args, **kwargs):
    logging.info("start")
    analytics.store("my_func", "start")
    do_something()
    do_something_else()
    do_another_thing()
    logging.info("end")
    analytics.store("my_func", "stop")

Slide 250

Slide 250 text

realkinetic.com | @real_kinetic This is really slow and we don’t know why so we start doing naive timing crap

Slide 251

Slide 251 text

realkinetic.com | @real_kinetic
def my_func(*args, **kwargs):
    # Naive timing: log the wall-clock time by hand at both ends.
    logging.info("start {}".format(time.time()))
    analytics.store("my_func", "start")
    do_something()
    do_something_else()
    do_another_thing()
    logging.info("end {}".format(time.time()))
    analytics.store("my_func", "stop")

Slide 252

Slide 252 text

realkinetic.com | @real_kinetic Let’s take advantage of our context and structured logging to enable tracing

Slide 253

Slide 253 text

realkinetic.com | @real_kinetic
ctx = {"trace_id": "t1", "parent_id": None, "id": "newgenid"}  # plus more fields

@trace()
def my_func(ctx, *args, **kwargs):
    do_something(ctx)
    do_something_else(ctx)
    do_another_thing(ctx)

Slide 254

Slide 254 text

realkinetic.com | @real_kinetic
ctx = {"trace_id": "t1", "parent_id": "newgenid", "id": uuid.uuid4().hex}  # plus more fields

@trace()
def do_something(ctx, *args, **kwargs):
    ...  # some_other_crap …

Slide 255

Slide 255 text

realkinetic.com | @real_kinetic This will give us the ability to get a call graph
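One possible shape for that `@trace()` decorator, as a sketch. The slides do not show its implementation, so everything here is an assumption: each call gets a new `id`, inherits `trace_id`, and records the caller's `id` as `parent_id`. `SPANS` is a stand-in for writing structured records to the data pipeline.

```python
import functools
import time
import uuid

SPANS = []  # stand-in for the data pipeline


def trace():
    def decorator(func):
        @functools.wraps(func)
        def wrapper(ctx, *args, **kwargs):
            # Child context: same trace, new id, parent is the caller's id.
            child = dict(ctx, parent_id=ctx.get("id"), id=uuid.uuid4().hex)
            start = time.perf_counter()
            try:
                return func(child, *args, **kwargs)
            finally:
                SPANS.append({
                    "name": func.__name__,
                    "trace_id": child["trace_id"],
                    "id": child["id"],
                    "parent_id": child["parent_id"],
                    "duration_ms": (time.perf_counter() - start) * 1000,
                })
        return wrapper
    return decorator


@trace()
def do_something(ctx):
    return "done"


@trace()
def my_func(ctx):
    return do_something(ctx)


def call_graph(spans):
    """Turn span records into sorted (caller, callee) edges."""
    names = {s["id"]: s["name"] for s in spans}
    return sorted((names.get(s["parent_id"], "<root>"), s["name"]) for s in spans)


my_func({"trace_id": "t1"})
print(call_graph(SPANS))
```

The edges come purely from `id` and `parent_id` in the recorded data, so the same reconstruction works when the spans were emitted by different processes.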

Slide 256

Slide 256 text

realkinetic.com | @real_kinetic

Slide 257

Slide 257 text

realkinetic.com | @real_kinetic And since we’re collecting all of the metadata that we can we know the characteristics of these nodes

Slide 258

Slide 258 text

realkinetic.com | @real_kinetic

Slide 259

Slide 259 text

realkinetic.com | @real_kinetic Oh crap, those aren’t “pure” functions. They’re all doing IO. (Stupid ORMs and their poor abstractions. A good abstraction would make it clear there is IO happening)

Slide 260

Slide 260 text

realkinetic.com | @real_kinetic This visualization does a good job showing dependencies (And is very good at representing larger, distributed, asynchronous processes)

Slide 261

Slide 261 text

realkinetic.com | @real_kinetic

Slide 262

Slide 262 text

realkinetic.com | @real_kinetic But it’s not great for all needs

Slide 263

Slide 263 text

realkinetic.com | @real_kinetic In our example these are synchronous processes

Slide 264

Slide 264 text

realkinetic.com | @real_kinetic

Slide 265

Slide 265 text

realkinetic.com | @real_kinetic This style isn’t intuitive for the actual stack + performance within the process

Slide 266

Slide 266 text

realkinetic.com | @real_kinetic Standard Tracing View

Slide 267

Slide 267 text

realkinetic.com | @real_kinetic

Slide 268

Slide 268 text

realkinetic.com | @real_kinetic

Slide 269

Slide 269 text

realkinetic.com | @real_kinetic These views come with the ability to search and discover traces

Slide 270

Slide 270 text

realkinetic.com | @real_kinetic

Slide 271

Slide 271 text

realkinetic.com | @real_kinetic Tracing standards and systems are quite immature but growing (and hopefully stabilizing) quickly

Slide 272

Slide 272 text

realkinetic.com | @real_kinetic 2 Parts

Slide 273

Slide 273 text

realkinetic.com | @real_kinetic The spec

Slide 274

Slide 274 text

realkinetic.com | @real_kinetic OpenCensus

Slide 275

Slide 275 text

realkinetic.com | @real_kinetic Distributed Trace Context Community Group
https://www.w3.org/community/trace-context/
https://github.com/w3c/distributed-tracing
This specification defines formats to pass trace context information across systems. Our goal is to share this with the community so that various tracing and diagnostics products can operate together.
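
For a concrete sense of what "formats to pass trace context" means, here is a small sketch of the `traceparent` header from the W3C Trace Context specification that this group's work later became: `version-traceid-parentid-flags`, all lowercase hex. The helper names (`parse_traceparent`, `child_traceparent`) are made up for illustration, and the parser only handles the four-field version `00` layout.

```python
import re
import secrets

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the trace context carried by a traceparent header, or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def child_traceparent(ctx):
    """Build the header for an outgoing call: same trace, new span id."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx["sampled"] else "00"
    return "00-{}-{}-{}".format(ctx["trace_id"], span_id, flags)


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
print(ctx["trace_id"], outgoing)
```

This is the same trace_id / parent_id / id idea as the context dict earlier in the talk, just serialized so it can cross process and vendor boundaries.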

Slide 276

Slide 276 text

realkinetic.com | @real_kinetic Pick something. Use structured logging + data pipeline to pass off (and transform if necessary) to tracing aggregator

Slide 277

Slide 277 text

realkinetic.com | @real_kinetic The aggregators

Slide 278

Slide 278 text

realkinetic.com | @real_kinetic

Slide 279

Slide 279 text

realkinetic.com | @real_kinetic And as mentioned many of the collectors are including (or in the process of adding) tracing as part of their offerings

Slide 280

Slide 280 text

realkinetic.com | @real_kinetic

Slide 281

Slide 281 text

realkinetic.com | @real_kinetic But any system that lets you query and aggregate relationships will give you the base system necessary
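
As a rough illustration of "query and aggregate relationships": given span records that carry `id` and `parent_id`, a call tree falls out of a simple group-by. The record shape and the function names below are assumptions for this sketch, not the format of any specific aggregator.

```python
from collections import defaultdict


def build_call_graph(spans):
    """Group spans by parent so they can be walked as a tree."""
    by_id = {span["id"]: span for span in spans}
    children = defaultdict(list)
    roots = []
    for span in spans:
        parent = span.get("parent_id")
        if parent in by_id:
            children[parent].append(span["id"])
        else:
            roots.append(span["id"])  # no known parent: treat as a root
    return roots, dict(children), by_id


def render(ids, children, by_id, depth=0):
    """Return indented lines, one per span, children under their parent."""
    lines = []
    for span_id in ids:
        span = by_id[span_id]
        lines.append("{}{} ({} ms)".format("  " * depth, span["name"], span["duration_ms"]))
        lines.extend(render(children.get(span_id, []), children, by_id, depth + 1))
    return lines


spans = [
    {"id": "a", "parent_id": None, "name": "my_func", "duration_ms": 30},
    {"id": "b", "parent_id": "a", "name": "do_something", "duration_ms": 10},
    {"id": "c", "parent_id": "a", "name": "do_something_else", "duration_ms": 15},
]
roots, children, by_id = build_call_graph(spans)
tree = render(roots, children, by_id)
print("\n".join(tree))
```

The same grouping could be expressed as a query in whatever store the pipeline feeds, which is the point of the slide: the relationships are in the data, so the view is up to the user.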

Slide 282

Slide 282 text

realkinetic.com | @real_kinetic Give your users the ability to create the visualizations and “traces” that map to their use case

Slide 283

Slide 283 text

realkinetic.com | @real_kinetic Those Netflix folks again

Slide 284

Slide 284 text

realkinetic.com | @real_kinetic vizceral https://github.com/Netflix/vizceral

Slide 285

Slide 285 text

realkinetic.com | @real_kinetic

Slide 286

Slide 286 text

realkinetic.com | @real_kinetic

Slide 287

Slide 287 text

realkinetic.com | @real_kinetic

Slide 288

Slide 288 text

realkinetic.com | @real_kinetic

Slide 289

Slide 289 text

realkinetic.com | @real_kinetic 8. Provide the ability to “trace” through the system without impact

Slide 290

Slide 290 text

realkinetic.com | @real_kinetic Some folks call this the “Tracer Bullet”

Slide 291

Slide 291 text

realkinetic.com | @real_kinetic It is a way to simulate a request through the system that makes no “destructive” change

Slide 292

Slide 292 text

realkinetic.com | @real_kinetic In other words: Send a request that NoOps writes to storage and writes to 3rd party apps (Be careful not to impact 3rd party quotas, licenses.)

Slide 293

Slide 293 text

realkinetic.com | @real_kinetic FYI, this is how companies like Amazon test their AWS APIs

Slide 294

Slide 294 text

realkinetic.com | @real_kinetic Leverage the context

Slide 295

Slide 295 text

realkinetic.com | @real_kinetic
type Context =
  { user_id :: String
  , account_id :: String
  , trace_id :: String
  , request_id :: String
  , parent_id :: Maybe String
  , request_type :: (STANDARD, TRACE)
  }

Slide 296

Slide 296 text

realkinetic.com | @real_kinetic
def my_func(ctx, id, data):
    my_thing = db.get(id)
    my_thing.data = data
    if ctx.request_type != REQUEST_TYPE.TRACE:
        # Write to storage
        my_thing.put()
    # More ideally we wrap our storage layer to use the flag
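
The last comment on this slide suggests pushing the check down into the storage layer. A minimal sketch of that idea follows, assuming a dict-based context and a toy in-memory store; `RequestType`, `InMemoryDB` and `TraceAwareStore` are hypothetical names for illustration only.

```python
from enum import Enum


class RequestType(Enum):
    STANDARD = "standard"
    TRACE = "trace"


class InMemoryDB:
    """Hypothetical stand-in for the real storage client."""

    def __init__(self):
        self.rows = {}

    def get(self, key):
        return self.rows.get(key)

    def put(self, key, value):
        self.rows[key] = value


class TraceAwareStore:
    """Wraps a store so that writes become no-ops for TRACE requests."""

    def __init__(self, db):
        self._db = db
        self.skipped_writes = []  # kept so tracer-bullet runs can be inspected

    def get(self, ctx, key):
        return self._db.get(key)

    def put(self, ctx, key, value):
        if ctx["request_type"] is RequestType.TRACE:
            self.skipped_writes.append(key)
            return False
        self._db.put(key, value)
        return True


def my_func(ctx, store, id, data):
    # Business code no longer checks the flag; the storage layer does.
    return store.put(ctx, id, data)


db = InMemoryDB()
store = TraceAwareStore(db)
my_func({"request_type": RequestType.STANDARD}, store, "a", 1)
my_func({"request_type": RequestType.TRACE}, store, "b", 2)
print(db.rows)  # {'a': 1}
```

Keeping the check in one wrapper means a new code path cannot forget it, which matters more as the number of write sites grows.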

Slide 297

Slide 297 text

realkinetic.com | @real_kinetic This looks like a feature flag

Slide 298

Slide 298 text

realkinetic.com | @real_kinetic Yes!

Slide 299

Slide 299 text

realkinetic.com | @real_kinetic Use feature flags!

Slide 300

Slide 300 text

realkinetic.com | @real_kinetic And you can use them for more than just features

Slide 301

Slide 301 text

realkinetic.com | @real_kinetic Just make sure you log those flags as part of your context so your tools can properly tag the data

Slide 302

Slide 302 text

realkinetic.com | @real_kinetic Tracer bullets are how we generated our graphs

Slide 303

Slide 303 text

realkinetic.com | @real_kinetic

Slide 304

Slide 304 text

realkinetic.com | @real_kinetic And now I’m going to get “rant-y”

Slide 305

Slide 305 text

realkinetic.com | @real_kinetic 9. Provide the ability to experiment and test in production

Slide 306

Slide 306 text

realkinetic.com | @real_kinetic Tracer bullets, feature flags allow us to use our production system for gathering information

Slide 307

Slide 307 text

realkinetic.com | @real_kinetic We should also support “tester” accounts so you can fully mimic all user actions in a production system

Slide 308

Slide 308 text

realkinetic.com | @real_kinetic All of the work you need to do to support this is work that you should do anyway to fully support multi-tenant apps

Slide 309

Slide 309 text

realkinetic.com | @real_kinetic The ability to isolate services, accounts, actions on demand

Slide 310

Slide 310 text

realkinetic.com | @real_kinetic The ability to stop, interrupt, move bad acting services, users, etc

Slide 311

Slide 311 text

realkinetic.com | @real_kinetic Ideally, support chaos tools in production (Also, use chaos tooling! :))

Slide 312

Slide 312 text

realkinetic.com | @real_kinetic Allowing folks to experiment and learn within the production system helps them build an intuition for the system, its behavior, and their impact on that behavior

Slide 313

Slide 313 text

realkinetic.com | @real_kinetic 10. Use tools (custom if necessary) to simulate usage

Slide 314

Slide 314 text

realkinetic.com | @real_kinetic Load testing, chaos, general traffic simulation

Slide 315

Slide 315 text

realkinetic.com | @real_kinetic Using network proxies and a data pipeline will allow you to capture actual traffic …

Slide 316

Slide 316 text

realkinetic.com | @real_kinetic Which you can then replay to simulate certain traffic patterns, etc
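
A sketch of the replay half, assuming the proxy and pipeline have already produced records with a capture timestamp (`ts`, in seconds), a method, a path and an optional body. The record shape, the `send` callback and the `speedup` parameter are all assumptions for illustration; a real tool would also need to handle auth, idempotency and the NoOp flag discussed earlier.

```python
import time


def replay(records, send, speedup=1.0, sleep=time.sleep):
    """Replay captured requests in order, preserving their relative timing."""
    ordered = sorted(records, key=lambda record: record["ts"])
    previous_ts = None
    for record in ordered:
        if previous_ts is not None:
            # Wait for the original gap between requests, scaled by speedup.
            sleep((record["ts"] - previous_ts) / speedup)
        previous_ts = record["ts"]
        send(record["method"], record["path"], record.get("body"))


captured = [
    {"ts": 2.0, "method": "GET", "path": "/things/1"},
    {"ts": 0.0, "method": "POST", "path": "/things", "body": {"name": "x"}},
    {"ts": 0.5, "method": "GET", "path": "/things"},
]

sent = []
delays = []
replay(
    captured,
    send=lambda method, path, body: sent.append((method, path)),
    speedup=2.0,
    sleep=delays.append,  # record the delays instead of actually sleeping
)
print(sent, delays)
```

Passing `sleep` and `send` in as parameters keeps the sketch testable and makes it easy to swap in a real HTTP client or a higher speedup for load tests.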

Slide 317

Slide 317 text

realkinetic.com | @real_kinetic 11. Kill environments

Slide 318

Slide 318 text

realkinetic.com | @real_kinetic Fewer environments means … fewer environments

Slide 319

Slide 319 text

realkinetic.com | @real_kinetic Fewer things to maintain and understand means we can put more time into understanding our other systems

Slide 320

Slide 320 text

realkinetic.com | @real_kinetic “Production” (any environment to which customers have access) is the only environment that matters

Slide 321

Slide 321 text

realkinetic.com | @real_kinetic So why do we spend so much time not in production?

Slide 322

Slide 322 text

realkinetic.com | @real_kinetic We know replicas and models are not as good as the real thing

Slide 323

Slide 323 text

realkinetic.com | @real_kinetic Yet we continue to build that way.

Slide 324

Slide 324 text

realkinetic.com | @real_kinetic And worse we allow shortcuts in other environments that won’t work in production (SSH in Dev, No SSH in Prod)

Slide 325

Slide 325 text

realkinetic.com | @real_kinetic Wouldn’t you also want those tools and abilities in production?

Slide 326

Slide 326 text

realkinetic.com | @real_kinetic We don’t invest in building production capable tools for dev because … time?

Slide 327

Slide 327 text

realkinetic.com | @real_kinetic So instead you’re going to wait until you have a production issue?

Slide 328

Slide 328 text

realkinetic.com | @real_kinetic Scenario: Massive Outage
Boss: What are we doing to resolve the issue?
You: Well, not much. Normally I would do “x” but I can’t because that only works in dev environments. So I’m going to attempt to hack together some duct tape solution that I’ll never use again. And I’m going to run it now in production without going through the code review process.

Slide 329

Slide 329 text

realkinetic.com | @real_kinetic If you’ve done everything mentioned then why would you need other environments? (Quick answer: If you need to change/test core infrastructure that impacts all users at all times)

Slide 330

Slide 330 text

realkinetic.com | @real_kinetic Do your best to force as much development and testing in production as possible

Slide 331

Slide 331 text

realkinetic.com | @real_kinetic In closing …

Slide 332

Slide 332 text

realkinetic.com | @real_kinetic There’s so much more we can do that I didn’t get to

Slide 333

Slide 333 text

realkinetic.com | @real_kinetic And it all starts with empathy for our peers and users

Slide 334

Slide 334 text

realkinetic.com | @real_kinetic Please come talk to me. I would love to discuss further. @lyddonb

Slide 335

Slide 335 text

realkinetic.com | @real_kinetic Quick Recap:

Slide 336

Slide 336 text

realkinetic.com | @real_kinetic
• Pass a context
• Structure your logs
• Create a data pipeline
• Structure all system data and pass to pipeline
• Minimize, track and build visualizations for dependencies
• Leverage service meshes
• Distributed Tracing
• Support NoOp, experimentation, simulation in production
• Then kill as many non-production environments as possible

Slide 337

Slide 337 text

realkinetic.com | @real_kinetic And here are all those tools again:

Slide 338

Slide 338 text

realkinetic.com | @real_kinetic

Slide 339

Slide 339 text

realkinetic.com | @real_kinetic Thank You

Slide 340

Slide 340 text

realkinetic.com | @real_kinetic @lyddonb @real_kinetic Real Kinetic mentors clients to enable their technical teams to grow and build high-quality software

Slide 341

Slide 341 text

realkinetic.com | @real_kinetic Resources & References
• Cloud Native Landscape
• Incidents Are Unplanned Investments
• stella.report
• How to Keep Your Systems Running Day After Day - Allspaw
• Honeycomb
• More Environments Will Not Make Things Easier
• Silicon Valley’s Tech Gods Are Headed For A Reckoning
• On purpose and by necessity: compliance under the GDPR
• ACCELERATE: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations
• a16z Podcast: Feedback Loops — Company Culture, Change, and DevOps
• System and method for performing distributed asynchronous calculations in a networked environment
• You Could Have Invented Structured Logging
• What is structured logging and why developers need it
• How one developer just broke Node, Babel and thousands of projects in 11 lines of JavaScript
• W3C Distributed Trace Context Community Group
• Load Testing with Locust

Slide 342

Slide 342 text

realkinetic.com | @real_kinetic Products, Libs, Etc
• Splunk
• Datadog
• Nagios
• Apache Kafka
• Amazon Kinesis
• FluentD
• Prometheus
• Google Stackdriver
• VictorOps
• Amazon Glacier
• Google BigQuery
• Amazon Redshift
• OpenCensus
• OpenTracing
• Haskell
• Go
• AWS DynamoDB
• Spigo and Simianviz
• Envoy
• Kubernetes
• Istio
• Linkerd
• Kong
• Jaeger
• Zipkin
• AWS X-Ray
• Stackdriver Trace
• Vizceral