What Is Happening: Attempting To Understand Our Systems

Our systems are growing, not only in size but also in complexity. There are more and more relationships between systems, often via fragile network connections. We’re increasingly integrating with systems outside of our control. Not only that, but these systems are more dynamic. While we increase expectations of uptime, we’ve also continued to increase the communication entropy in the system. Many systems now change by the hour. And this only captures a portion of the complexity. A question keeps getting asked that we struggle to answer: what is happening?

What does our system actually look like right now? What did it look like an hour ago? What is it going to look like in another hour? And that is just the structure of the system. There are many more dimensions we’re interested in. Is our system healthy? What does healthy even mean? Does the status, state, or health mean the same thing to you, your boss, operations, or engineering? And most importantly, what does all of this mean to our users?

These questions have led to a tooling explosion. We will walk through some of these tools and how they can help. We’ll also call out the gaps that appear when these tools are put to practical use. We’ll discuss the perspectives and categories of tooling that we need. We’ll finish by focusing on the foundational actions we can take now to best position ourselves to adapt as our systems and tools change.

Beau Lyddon

April 14, 2018

Transcript

  1. realkinetic.com | @real_kinetic What is Happening? Attempting to Understand Our

    Systems
  2. realkinetic.com | @real_kinetic About Me (obligatory sales pitch)

  3. realkinetic.com | @real_kinetic Beau Lyddon Managing Partner at Real Kinetic

  4. realkinetic.com | @real_kinetic Currently live in beautiful Boulder, CO

  5. realkinetic.com | @real_kinetic Started a company: Real Kinetic mentors clients

    to enable their technical teams to grow and build high-quality software
  6. realkinetic.com | @real_kinetic

  7. realkinetic.com | @real_kinetic PSA: I use a ton of slides

    so those offline can follow my narrative using just the slides. So don’t worry about reading every word as I will be verbalizing them out loud.
  8. realkinetic.com | @real_kinetic Also, I’m going to go real fast

    through the beginning. I’m cramming a lot in 30 min.
  9. realkinetic.com | @real_kinetic Let’s get going

  10. realkinetic.com | @real_kinetic Every company is becoming a technology company

  11. realkinetic.com | @real_kinetic Technology (especially software) is becoming a critical

    piece of every business
  12. realkinetic.com | @real_kinetic It’s more and more difficult to do

    jobs without understanding technology
  13. realkinetic.com | @real_kinetic And it’s not really about architecture diagrams

    (They’re needed but only part of the story)
  14. realkinetic.com | @real_kinetic It’s about the “Why” (From Andrew’s Presentation)

  15. realkinetic.com | @real_kinetic But it’s not just us. Or even

    those in R&D.
  16. realkinetic.com | @real_kinetic At this point it’s pretty much everyone

  17. realkinetic.com | @real_kinetic Our own team, peer teams, support &

    operations, management, R&D leadership, marketing, sales, executives, board members, customer service, investors, auditors and CUSTOMERS
  18. realkinetic.com | @real_kinetic And providing understanding from their perspective is

    critical
  19. realkinetic.com | @real_kinetic All of these people have slightly different

    perspectives and needs
  20. realkinetic.com | @real_kinetic No single “diagram” or even story will

    work
  21. realkinetic.com | @real_kinetic Much of engineering leadership is becoming about

    explaining our systems to “the rest of the world” (no more God syndrome)
  22. realkinetic.com | @real_kinetic And since we’ve generally sucked at this,

    the government and general population are starting to force our hand.
  23. realkinetic.com | @real_kinetic They are finally realizing that "software is

    eating the world” and that they don’t really understand it.
  24. realkinetic.com | @real_kinetic Which really freaks them out

  25. realkinetic.com | @real_kinetic Justifiably

  26. realkinetic.com | @real_kinetic We have not done a good job

    helping others understand our “stuff” (MOM: What do you do again? ME: Stuff)
  27. realkinetic.com | @real_kinetic Right, Mark?

  28. realkinetic.com | @real_kinetic So they’re pushing back

  29. realkinetic.com | @real_kinetic

  30. realkinetic.com | @real_kinetic FBI (encryption), Facebook (data privacy), GDPR (data

    privacy), Compliance
  31. realkinetic.com | @real_kinetic When we watch the congressional hearings and

    go “what morons” we should really be saying “we failed”
  32. realkinetic.com | @real_kinetic But now the cameras are officially on

    us
  33. realkinetic.com | @real_kinetic

  34. realkinetic.com | @real_kinetic

  35. realkinetic.com | @real_kinetic

  36. realkinetic.com | @real_kinetic But here’s the real kicker

  37. realkinetic.com | @real_kinetic MANY OF US have NO CLUE what

    the hell OUR SYSTEMS ARE DOING
  38. realkinetic.com | @real_kinetic So we need to start with ourselves

  39. realkinetic.com | @real_kinetic We need to ensure that we can

    understand our systems and then work our way up
  40. realkinetic.com | @real_kinetic And provide the tools that allow all

    to understand the system from everybody’s perspective
  41. realkinetic.com | @real_kinetic This job is actually very difficult (I

    believe it’s more difficult to explain and fully understand than it is to actually build)
  42. realkinetic.com | @real_kinetic Why?

  43. realkinetic.com | @real_kinetic Our systems are more complex than they’ve

    ever been (And only growing increasingly complex)
  44. realkinetic.com | @real_kinetic Historical

  45. realkinetic.com | @real_kinetic The “simpler” times (It did not feel

    simpler at the time)
  46. realkinetic.com | @real_kinetic We had mainframes, Windows apps, client server,

    etc
  47. realkinetic.com | @real_kinetic These were all very controlled and constrained

    systems (Or at least it felt that way)
  48. realkinetic.com | @real_kinetic 24/7 Uptime … pfft (We would take

    systems down for nights and weekends. 5 9s. Ha!)
  49. realkinetic.com | @real_kinetic We hardly ever released (Release cycles measured

    in years, months if you were aggressive)
  50. realkinetic.com | @real_kinetic Realtime!? … What does that even mean?

  51. realkinetic.com | @real_kinetic You may not believe this but we

    would run nightly (or weekend) jobs to create reports. (On paper. PAPER!)
  52. realkinetic.com | @real_kinetic Devices?

  53. realkinetic.com | @real_kinetic We tell you what “device” you will

    use. (Mainframe terminal, windows, IE, Blackberry)
  54. realkinetic.com | @real_kinetic Systems now?

  55. realkinetic.com | @real_kinetic Let’s start with a client server architecture

    built under the old constraints
  56. realkinetic.com | @real_kinetic And then evolve it as our constraints

    evolve
  57. realkinetic.com | @real_kinetic

  58. realkinetic.com | @real_kinetic Downtime is unacceptable (“x” 9s :/)

  59. realkinetic.com | @real_kinetic

  60. realkinetic.com | @real_kinetic

  61. realkinetic.com | @real_kinetic Devices you say?

  62. realkinetic.com | @real_kinetic Oh we’ve got devices. All the damn

    devices.
  63. realkinetic.com | @real_kinetic

  64. realkinetic.com | @real_kinetic Realtime?

  65. realkinetic.com | @real_kinetic “Uh yeah! I’m not waiting even a

    second for what I want”
  66. realkinetic.com | @real_kinetic No more “stale” reports

  67. realkinetic.com | @real_kinetic I want answers (data) now. (Oh, and

    it better be visual and interactive)
  68. realkinetic.com | @real_kinetic

  69. realkinetic.com | @real_kinetic As fast as possible releases (From years

    to months to days to multiple an hour and maybe even faster at scale)
  70. realkinetic.com | @real_kinetic And anybody can release. For any reason.

    (You must release to keep up with demand and to quickly fix issues)
  71. realkinetic.com | @real_kinetic

  72. realkinetic.com | @real_kinetic We expect access from anywhere at anytime

  73. realkinetic.com | @real_kinetic

  74. realkinetic.com | @real_kinetic

  75. realkinetic.com | @real_kinetic

  76. realkinetic.com | @real_kinetic

  77. realkinetic.com | @real_kinetic The Modern Technology Cluster #*@!

  78. realkinetic.com | @real_kinetic The Modern Technology Cluster #*@! Stack

  79. realkinetic.com | @real_kinetic The complexity has risen significantly

  80. realkinetic.com | @real_kinetic But don’t worry OSS is here to

    save you (SPOILER: Only, kinda)
  81. realkinetic.com | @real_kinetic

  82. realkinetic.com | @real_kinetic Those are the tools that Tyler mentioned

    that you can use but you need to wrap with your "glue code” (your culture, your processes)
  83. realkinetic.com | @real_kinetic Oh … and I’m not done

  84. realkinetic.com | @real_kinetic There are significantly more nodes in the

    system
  85. realkinetic.com | @real_kinetic And many connections between these nodes to

    handle scale
  86. realkinetic.com | @real_kinetic These connections create dependency trees

  87. realkinetic.com | @real_kinetic And what’s more, the nodes and connections

    are constantly changing
  88. realkinetic.com | @real_kinetic All while we must maintain usage rates

  89. realkinetic.com | @real_kinetic Thus we end up with different versions

    of the same type of node potentially within a single request
  90. realkinetic.com | @real_kinetic All of this (and more) leads to

    our systems producing emergent behaviors that can’t be predicted.
  91. realkinetic.com | @real_kinetic In other words our systems are becoming

    much more similar to “living” systems (Cities, governments, ecological, biological, etc)
  92. realkinetic.com | @real_kinetic So this …

  93. realkinetic.com | @real_kinetic

  94. realkinetic.com | @real_kinetic is kind of … like … alive?

  95. realkinetic.com | @real_kinetic

  96. realkinetic.com | @real_kinetic We may have created a monster

  97. realkinetic.com | @real_kinetic And it might kill us! F*$@!

  98. realkinetic.com | @real_kinetic Let’s go back to the old way

  99. realkinetic.com | @real_kinetic Except it’s too late.

  100. realkinetic.com | @real_kinetic This actually works.

  101. realkinetic.com | @real_kinetic Beyond the obvious successful companies (Google, Amazon,

    Facebook), the research backs up that these systems help all types of companies that embrace them across all industries.
  102. realkinetic.com | @real_kinetic Dynamic systems that support rapid development and

    experimentation directly increase quality and velocity
  103. realkinetic.com | @real_kinetic Thus IT becomes a differentiator and is

    no longer a cost center
  104. realkinetic.com | @real_kinetic DevOps is a critical piece of this

    transformation
  105. realkinetic.com | @real_kinetic If you don’t have a dynamic system

    that supports experimentation, rapid releases, and DevOps, you will be beaten by those that do
  106. realkinetic.com | @real_kinetic Accelerate: The Science of Lean Software and

    DevOps: Building and Scaling High Performing Technology Organizations; a16z Podcast: Feedback Loops — Company Culture, Change, and DevOps https://a16z.com/2018/03/28/devops-org-change-software-performance/
  107. realkinetic.com | @real_kinetic So if this isn’t your world, it

    likely will be in the future
  108. realkinetic.com | @real_kinetic So what can we do to attempt

    to understand the chaos?
  109. realkinetic.com | @real_kinetic An example from our past experience at

    Workiva
  110. realkinetic.com | @real_kinetic “Calc”

  111. realkinetic.com | @real_kinetic A system and method that efficiently, robustly,

    and flexibly permits large scale distributed asynchronous calculations in a networked environment, where the number of users entering data is large, the number of variables and equations are large and can comprise long and/or wide dependency chains, and data integrity is important
  112. realkinetic.com | @real_kinetic Or … a distributed calculation engine

  113. realkinetic.com | @real_kinetic Built on stateless runtimes with no SSH

    or live debugging (Serverless in 2011, yep it was a thing)
  114. realkinetic.com | @real_kinetic Not that SSH or Debuggers would have

    mattered
  115. realkinetic.com | @real_kinetic Massive Scale (Millions of nodes)

  116. realkinetic.com | @real_kinetic A tease …

  117. realkinetic.com | @real_kinetic

  118. realkinetic.com | @real_kinetic

  119. realkinetic.com | @real_kinetic Structure, and thus behavior, changed when the

    data changed (Very dynamic)
  120. realkinetic.com | @real_kinetic What is the state of the system?


    Is it done? What is done? Is it broken? What is broken? What is fast/slow?
  121. realkinetic.com | @real_kinetic A single actor in the system does

    not know the status of the overall system.
  122. realkinetic.com | @real_kinetic There is no obvious way to track

    the status of the system unless the nodes within the system help us
  123. realkinetic.com | @real_kinetic To have any chance of keeping up

    with understanding our systems, we need the systems to self-describe
  124. realkinetic.com | @real_kinetic And of course we need automation and

    self healing
  125. realkinetic.com | @real_kinetic And to have self description, automation, and

    self healing we need data. We need the systems to give us data to provide necessary context.
  126. realkinetic.com | @real_kinetic So what are the specifics?

  127. realkinetic.com | @real_kinetic We’ll start by working our way up

    from the code
  128. realkinetic.com | @real_kinetic 1. Pass a context object to basically

    everything
  129. realkinetic.com | @real_kinetic

    type Context =
      { user_id    :: String
      , account_id :: String
      , trace_id   :: String
      , request_id :: String
      , parent_id  :: Maybe String
      }
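
    One hedged sketch of what this might look like in Python, assuming a dataclass and stdlib logging (the class, field, and function names here are illustrative, not from the talk):

        from dataclasses import dataclass, asdict
        from typing import Optional
        import logging

        @dataclass
        class Context:
            user_id: str
            account_id: str
            trace_id: str
            request_id: str
            parent_id: Optional[str] = None

        def save(ctx: Context, payload: dict) -> None:
            # Dependent functions take the same ctx, so anything they log can be correlated.
            logging.info("saving payload", extra=asdict(ctx))

        def handle_request(ctx: Context, payload: dict) -> None:
            # Build the context once at the edge, then pass it to basically everything.
            logging.info("handling request", extra=asdict(ctx))
            save(ctx, payload)
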
  130. realkinetic.com | @real_kinetic What goes on the context?

  131. realkinetic.com | @real_kinetic Think about the data you wish you

    had when debugging an issue (This is why your devs should support their own systems)
  132. realkinetic.com | @real_kinetic What is the data that would change

    the behavior of the system?
  133. realkinetic.com | @real_kinetic The user (and/or company), time, machine stats

    (CPU, Memory, etc), software version, configuration data, the calling request, any dependent requests
  134. realkinetic.com | @real_kinetic What of that can we get for

    “free” and what do we need to pass along (Free == Machine Provided Memory, CPU, etc)
  135. realkinetic.com | @real_kinetic The data we can’t get for “free”

    should go on the context (Data that is “request” specific User, Company, Calling Request Id)
  136. realkinetic.com | @real_kinetic There are side-benefits as well

  137. realkinetic.com | @real_kinetic If you’re a SaaS company you should

    probably pass licensing data as part of the context
  138. realkinetic.com | @real_kinetic This will allow you to move processes

    around based on their license
  139. realkinetic.com | @real_kinetic Imagine routing traffic to specific queues based

    on user, account, license, and environment (usage, resources available). (The ability to isolate processes at runtime; Amazon is the king of this)
  140. realkinetic.com | @real_kinetic Also, think about GDPR and needing to

    track user actions, data and what they have approved the system to do
  141. realkinetic.com | @real_kinetic Please, use some data structure to pass

    contextual data to all dependent functions
  142. realkinetic.com | @real_kinetic This is the easiest thing you can

    start doing today
  143. realkinetic.com | @real_kinetic Oh, and then make sure to log

    that context on every request
  144. realkinetic.com | @real_kinetic And speaking of logging

  145. realkinetic.com | @real_kinetic 2. Structure your logs (JSON is fine)

  146. realkinetic.com | @real_kinetic I’m tired of writing regexes to scrape

    logs because we’re too lazy to add structure at the time it actually makes the most sense
  147. realkinetic.com | @real_kinetic

    [{ "env": "Dev", "server_name": "AWS1", "app_name": "MyService", "app_loc": "/home/app",
       "user_id": "u1", "account_id": "a1", "logger": "mylogger", "platform": "py",
       "trace_id": "t1", "parent_id": "p1",
       "messages": [{ "tag": "Incoming metrics data", "data": "{\"clientid\":54732}",
                      "thread": "10", "time": 1485555302470, "level": "DEBUG",
                      "id": "0c28701b-e4de-11e6-8936-8975598968a4" }] }]
  148. realkinetic.com | @real_kinetic You can take this as far as

    you’d like
  149. realkinetic.com | @real_kinetic Very structured with a type system, code

    reviews, etc
  150. realkinetic.com | @real_kinetic There are many existing libraries (Too many

    to list. Just Google “Structured logs” and your language of choice)
  151. realkinetic.com | @real_kinetic But at minimum get your logs into

    a standard format with property tags
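
    A minimal, stdlib-only sketch of one way to get there in Python (not an endorsement of a particular library; the "ctx" field mirrors the context from earlier and is an assumption):

        import json
        import logging

        class JsonFormatter(logging.Formatter):
            # Emit one JSON object per log line; extra fields (e.g. the request context) ride along.
            def format(self, record):
                payload = {
                    "time": self.formatTime(record),
                    "level": record.levelname,
                    "logger": record.name,
                    "message": record.getMessage(),
                }
                payload.update(getattr(record, "ctx", {}))
                return json.dumps(payload)

        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logging.basicConfig(level=logging.INFO, handlers=[handler])

        logging.info("Incoming metrics data", extra={"ctx": {"user_id": "u1", "trace_id": "t1"}})
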
  152. realkinetic.com | @real_kinetic 3. Create a data pipeline

  153. realkinetic.com | @real_kinetic There is a ton of data that

    you want and need to collect
  154. realkinetic.com | @real_kinetic Logs, metrics, analytics, audits, etc

  155. realkinetic.com | @real_kinetic We want to make it as simple,

    yet robust as possible
  156. realkinetic.com | @real_kinetic But most importantly we want some system

    that has all of the data
  157. realkinetic.com | @real_kinetic What we often see at the beginning:

  158. realkinetic.com | @real_kinetic

  159. realkinetic.com | @real_kinetic

  160. realkinetic.com | @real_kinetic

  161. realkinetic.com | @real_kinetic

  162. realkinetic.com | @real_kinetic And now your services are spending more

    time with non-critical-path dependencies than with those on the critical path
  163. realkinetic.com | @real_kinetic Standardize & simplify

  164. realkinetic.com | @real_kinetic A single data pipeline (queue) (Or use

    a pull process. Just get your logs into a central location)
  165. realkinetic.com | @real_kinetic

  166. realkinetic.com | @real_kinetic Look into “sidecar” style collection

  167. realkinetic.com | @real_kinetic

  168. realkinetic.com | @real_kinetic This allows you to write to stdout

    and the sidecar will collect and push to your queue
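
    As a rough illustration of the sidecar idea (the service name, batch size, and ship() target are assumptions; in practice a collector like FluentD does this for you):

        import json
        import subprocess
        import sys

        def ship(batch):
            # Placeholder for the real publish call (Kafka, Kinesis, Pub/Sub, ...).
            sys.stderr.write("publishing {} events\n".format(len(batch)))

        # Tail the application's stdout and forward structured lines to the pipeline.
        proc = subprocess.Popen(["./my-service"], stdout=subprocess.PIPE, text=True)
        batch = []
        for line in proc.stdout:
            try:
                batch.append(json.loads(line))
            except ValueError:
                continue  # skip non-JSON noise
            if len(batch) >= 100:
                ship(batch)
                batch = []
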
  169. realkinetic.com | @real_kinetic The data pipeline provides a layer of

    abstraction that allows you to get the data everywhere it needs to be without impacting developers and the “core” system
  170. realkinetic.com | @real_kinetic

  171. realkinetic.com | @real_kinetic Where should all of the data go?

  172. realkinetic.com | @real_kinetic At minimum all data should go into

    a cheap, long term storage solution (AWS Glacier, etc)
  173. realkinetic.com | @real_kinetic You’ll want this data for historical system

    behavior to help “machine learn” your system into automation
  174. realkinetic.com | @real_kinetic Ideally, all data should go into a

    queryable, large scale data storage solution. (solid time based query capabilities a plus) (Google BigQuery, AWS Redshift)
  175. realkinetic.com | @real_kinetic This is why we structure our logs

  176. realkinetic.com | @real_kinetic There are more targeted or customized solutions

    starting to fill the space
  177. realkinetic.com | @real_kinetic “Start solving high-cardinality problems in minutes” (honeycomb.io)

  178. realkinetic.com | @real_kinetic From their marketing …

  179. realkinetic.com | @real_kinetic High-cardinality refers to columns with values that

    are very uncommon or unique. High-cardinality column values are typically identification numbers, email addresses, or user names. An example of a data table column with high-cardinality would be a USERS table with a column named USER_ID.
  180. realkinetic.com | @real_kinetic Query anything. Break down, filter, and pivot

    on high-cardinality fields like user_id.
  181. realkinetic.com | @real_kinetic Once again, this is why we structure

    our logs
  182. realkinetic.com | @real_kinetic See the raw data behind every result.

  183. realkinetic.com | @real_kinetic See the exact events leading to an

    issue, who was affected, and how.
  184. realkinetic.com | @real_kinetic Share queries, results, and history. Collaborate.

  185. realkinetic.com | @real_kinetic Many other options … (Still a bit

    too dashboard based but trending in the right direction)
  186. realkinetic.com | @real_kinetic

  187. realkinetic.com | @real_kinetic The beauty of the data pipeline is

    you can use 1 or many. And test multiple in parallel if you’d like without interrupting development. (Just don’t forget to have Devs user test the solutions as well)
  188. realkinetic.com | @real_kinetic You’re still going to end up with

    multiple consumers
  189. realkinetic.com | @real_kinetic

  190. realkinetic.com | @real_kinetic Back to the structured logs thing

  191. realkinetic.com | @real_kinetic 4. Structure and standardize all data leaving

    a system
  192. realkinetic.com | @real_kinetic Provide libraries to add structure to not

    just logs but also metrics, audits, etc
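
    A small sketch of what a shared "envelope" library might look like, assuming Python and a stdout-based pipeline (the emit() helper and its fields are illustrative, not from the talk):

        import json
        import sys
        import time
        import uuid

        def emit(kind, name, ctx, **fields):
            # One envelope for every kind of event (log, metric, audit, trace span);
            # pipeline consumers fan out by "kind".
            event = {
                "id": str(uuid.uuid4()),
                "time": time.time(),
                "kind": kind,   # "log" | "metric" | "audit" | ...
                "name": name,
                "ctx": ctx,     # the request context from earlier
            }
            event.update(fields)
            sys.stdout.write(json.dumps(event) + "\n")

        ctx = {"user_id": "u1", "account_id": "a1", "trace_id": "t1"}
        emit("metric", "db.query.duration_ms", ctx, value=42)
        emit("audit", "report.exported", ctx, target="account:a1")
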
  193. realkinetic.com | @real_kinetic We like having one standard across the

    board
  194. realkinetic.com | @real_kinetic But you can also break them apart

    by “type” … Metrics, audits, tracing, etc
  195. realkinetic.com | @real_kinetic As long as you get it standardized

    across systems
  196. realkinetic.com | @real_kinetic But people are quickly realizing that this

    data is all related and the separation is arbitrary
  197. realkinetic.com | @real_kinetic OpenCensus A single distribution of libraries for

    metrics and distributed tracing with minimal overhead that allows you to export data to multiple backends. https://opencensus.io
  198. realkinetic.com | @real_kinetic Vendor-neutral APIs and instrumentation for distributed tracing.

    https://opentracing.io
  199. realkinetic.com | @real_kinetic Most of the “infrastructure data” players are

    converging on support for all styles of system data collection
  200. realkinetic.com | @real_kinetic

  201. realkinetic.com | @real_kinetic With a data pipeline you’ll be set up

    to handle whatever tool(s) come next (Leverage abstractions at the integration layers to allow easier adaptation to change)
  202. realkinetic.com | @real_kinetic 5. Minimize, isolate and track dependencies

  203. realkinetic.com | @real_kinetic Unmanaged dependencies are where throughput goes to

    die (And what creates and increases complexity faster than anything else)
  204. realkinetic.com | @real_kinetic Golang got a few things correct.

  205. realkinetic.com | @real_kinetic One of them is promoting code duplication

    over introducing unnecessary dependencies
  206. realkinetic.com | @real_kinetic Personally, I promote the Golang + Haskell

    approach
  207. realkinetic.com | @real_kinetic A dependency can be introduced when it

    is well formalized and worth the cost (In the Haskell world you’ll see laws for APIs. These are pretty stable APIs.)
  208. realkinetic.com | @real_kinetic Quick Note:

  209. realkinetic.com | @real_kinetic Avoiding dependencies does not mean “build everything”

  210. realkinetic.com | @real_kinetic Javascript Padding Library != AWS Dynamo

  211. realkinetic.com | @real_kinetic Using Dynamo + client library is less

    code and likely no additional dependency vs building from scratch
  212. realkinetic.com | @real_kinetic And way better than building your own

    database (Even though these days people seem to think building a database is easy and necessary)
  213. realkinetic.com | @real_kinetic Back to regularly scheduled programming

  214. realkinetic.com | @real_kinetic If you’re going to introduce dependencies then

    clearly track and pin them
  215. realkinetic.com | @real_kinetic Ideally a single file in the project/repo.

    (Or in an aggregate repo)
  216. realkinetic.com | @real_kinetic If possible standardize the spec for these

    files
  217. realkinetic.com | @real_kinetic Then create a process that aggregates the

    dependencies into an overall mapping to give a picture of the system
  218. realkinetic.com | @real_kinetic This goes for services as well as

    libraries
  219. realkinetic.com | @real_kinetic Then you can generate diagrams (Free architecture

    diagrams!)
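
    One hedged sketch of that aggregation step, assuming each repo checks in a small deps.json (the file name and shape are assumptions for illustration):

        import glob
        import json

        # Each repo's deps.json might look like:
        #   {"name": "billing-service", "deps": ["auth-service", "postgres"]}
        edges = []
        for path in glob.glob("repos/*/deps.json"):
            with open(path) as f:
                spec = json.load(f)
            for dep in spec.get("deps", []):
                edges.append((spec["name"], dep))

        # Emit Graphviz DOT; render with `dot -Tpng system.dot -o system.png`.
        with open("system.dot", "w") as out:
            out.write("digraph system {\n")
            for src, dst in edges:
                out.write('  "{}" -> "{}";\n'.format(src, dst))
            out.write("}\n")
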
  220. realkinetic.com | @real_kinetic

  221. realkinetic.com | @real_kinetic Which you can then visualize over time

  222. realkinetic.com | @real_kinetic

  223. realkinetic.com | @real_kinetic

  224. realkinetic.com | @real_kinetic Netflix has some great examples and tools

    (Those #%*@$!# are always leading the charge) Out of necessity
  225. realkinetic.com | @real_kinetic Spigo and Simianviz

  226. realkinetic.com | @real_kinetic https://github.com/adrianco/spigo

  227. realkinetic.com | @real_kinetic

  228. realkinetic.com | @real_kinetic That said …

  229. realkinetic.com | @real_kinetic Service/network dependencies are still a nightmare

  230. realkinetic.com | @real_kinetic 6. Use network sidecars (service mesh, proxies)

    to better isolate and handle these dependencies
  231. realkinetic.com | @real_kinetic Similar concept to the data pipeline except

    with even more “production” benefits
  232. realkinetic.com | @real_kinetic

  233. realkinetic.com | @real_kinetic A combination of many of the API

    Gateway, proxy, router, etc solutions that exist today
  234. realkinetic.com | @real_kinetic Having a standard network proxy gives you:

    Load balancing, service discovery, health checking, circuit breakers, standard observability (+tracing)
  235. realkinetic.com | @real_kinetic Using the sidecar allows you to easily

    standardize without introducing new dependencies at the code and team level
  236. realkinetic.com | @real_kinetic And of course the meta-data can be

    pumped to your same data pipeline
  237. realkinetic.com | @real_kinetic

  238. realkinetic.com | @real_kinetic Many new tools, especially around Kubernetes

  239. realkinetic.com | @real_kinetic Vary from service mesh focused to full

    bore micro-service framework
  240. realkinetic.com | @real_kinetic

  241. realkinetic.com | @real_kinetic All of these come with “free” monitoring

    tools
  242. realkinetic.com | @real_kinetic And …

  243. realkinetic.com | @real_kinetic 7. Distributed Tracing

  244. realkinetic.com | @real_kinetic We need better ways to visualize our

    systems
  245. realkinetic.com | @real_kinetic Charts and dashboards are nice for looking at

    system behavior from a generic, data-driven perspective
  246. realkinetic.com | @real_kinetic But that layer of abstraction (while helping

    isolate variables) removes a layer of intuition
  247. realkinetic.com | @real_kinetic We need the ability to also visualize

    specific and aggregate behavior
  248. realkinetic.com | @real_kinetic Tracing is one example

  249. realkinetic.com | @real_kinetic

    def my_func(*args, **kwargs):
        logging.info("start")
        analytics.store("my_func", "start")
        do_something()
        do_something_else()
        do_another_thing()
        logging.info("end")
        analytics.store("my_func", "stop")
  250. realkinetic.com | @real_kinetic This is really slow and we don’t

    know why so we start doing naive timing crap
  251. realkinetic.com | @real_kinetic

    def my_func(*args, **kwargs):
        logging.info("start {}".format(time.time()))
        analytics.store("my_func", "start")
        do_something()
        do_something_else()
        do_another_thing()
        logging.info("end {}".format(time.time()))
        analytics.store("my_func", "stop")
  252. realkinetic.com | @real_kinetic Let’s take advantage of our context and

    structured logging to enable tracing
  253. realkinetic.com | @real_kinetic

    ctx = {"trace_id": "t1", "parent_id": None, "id": "newgenid"}  # ... plus more fields

    @trace()
    def my_func(ctx, *args, **kwargs):
        do_something(ctx)
        do_something_else(ctx)
        do_another_thing(ctx)
  254. realkinetic.com | @real_kinetic

    ctx = {"trace_id": "t1", "parent_id": "newgenid", "id": uuid.new}  # ... plus more fields

    @trace()
    def do_something(ctx, *args, **kwargs):
        some_other_crap …
  255. realkinetic.com | @real_kinetic This will give us the ability to

    get a call graph
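
    The talk doesn’t show the @trace() implementation; here is one hedged Python sketch of how it could work with the context, generating a child span id per call and logging a structured span:

        import functools
        import json
        import logging
        import time
        import uuid

        logging.basicConfig(level=logging.INFO)

        def trace():
            def decorator(fn):
                @functools.wraps(fn)
                def wrapper(ctx, *args, **kwargs):
                    # New span: this call's id becomes the parent of anything it calls.
                    span = dict(ctx, parent_id=ctx.get("id"), id=str(uuid.uuid4()))
                    start = time.time()
                    try:
                        return fn(span, *args, **kwargs)
                    finally:
                        span["name"] = fn.__name__
                        span["duration_ms"] = (time.time() - start) * 1000
                        logging.info(json.dumps(span))
                return wrapper
            return decorator

        @trace()
        def do_something(ctx):
            time.sleep(0.01)

        @trace()
        def my_func(ctx):
            do_something(ctx)

        my_func({"trace_id": "t1", "id": "root"})
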
  256. realkinetic.com | @real_kinetic

  257. realkinetic.com | @real_kinetic And since we’re collecting all of the

    metadata that we can, we know the characteristics of these nodes
  258. realkinetic.com | @real_kinetic

  259. realkinetic.com | @real_kinetic Oh crap, those aren’t “pure” functions. They’re

    all doing IO. (Stupid ORMs and their poor abstractions. A good abstraction would make it clear there is IO happening)
  260. realkinetic.com | @real_kinetic This visualization does a good job showing

    dependencies (And is very good at representing larger, distributed, asynchronous processes)
  261. realkinetic.com | @real_kinetic

  262. realkinetic.com | @real_kinetic But it’s not great for all needs

  263. realkinetic.com | @real_kinetic In our example these are synchronous processes

  264. realkinetic.com | @real_kinetic

  265. realkinetic.com | @real_kinetic This style isn’t intuitive for the actual

    stack + performance within the process
  266. realkinetic.com | @real_kinetic Standard Tracing View

  267. realkinetic.com | @real_kinetic

  268. realkinetic.com | @real_kinetic

  269. realkinetic.com | @real_kinetic These come with the ability to search and discover

    traces
  270. realkinetic.com | @real_kinetic

  271. realkinetic.com | @real_kinetic Tracing standards and systems are quite immature

    but growing (and hopefully stabilizing) quickly
  272. realkinetic.com | @real_kinetic 2 Parts

  273. realkinetic.com | @real_kinetic The spec

  274. realkinetic.com | @real_kinetic OpenCensus

  275. realkinetic.com | @real_kinetic Distributed Trace Context Community Group https://www.w3.org/community/trace-context/ https://github.com/w3c/distributed-tracing

    This specification defines formats to pass trace context information across systems. Our goal is to share this with the community so that various tracing and diagnostics products can operate together.
  276. realkinetic.com | @real_kinetic Pick something. Use structured logging + data

    pipeline to pass off (and transform if necessary) to tracing aggregator
  277. realkinetic.com | @real_kinetic The aggregators

  278. realkinetic.com | @real_kinetic

  279. realkinetic.com | @real_kinetic And as mentioned many of the collectors

    are including (or in the process of adding) tracing as part of their offerings
  280. realkinetic.com | @real_kinetic

  281. realkinetic.com | @real_kinetic But any system that lets you query

    and aggregate relationships will give you the base system necessary
  282. realkinetic.com | @real_kinetic Give your users the ability to create

    the visualizations and “traces” that map to their use case
  283. realkinetic.com | @real_kinetic Those Netflix folks again

  284. realkinetic.com | @real_kinetic vizceral https://github.com/Netflix/vizceral

  285. realkinetic.com | @real_kinetic

  286. realkinetic.com | @real_kinetic

  287. realkinetic.com | @real_kinetic

  288. realkinetic.com | @real_kinetic

  289. realkinetic.com | @real_kinetic 8. Provide the ability to “trace” through

    the system without impact
  290. realkinetic.com | @real_kinetic Some folks call this the “Tracer Bullet”

  291. realkinetic.com | @real_kinetic It is a way to simulate a

    request through the system that makes no “destructive” change
  292. realkinetic.com | @real_kinetic In other words: send a request that no-ops

    writes to storage and to 3rd-party apps (Be careful not to impact 3rd-party quotas, licenses.)
  293. realkinetic.com | @real_kinetic FYI, this is how companies like Amazon

    test their AWS APIs
  294. realkinetic.com | @real_kinetic Leverage the context

  295. realkinetic.com | @real_kinetic

    type Context =
      { user_id      :: String
      , account_id   :: String
      , trace_id     :: String
      , request_id   :: String
      , parent_id    :: Maybe String
      , request_type :: (STANDARD, TRACE)
      }
  296. realkinetic.com | @real_kinetic

    def my_func(ctx, id, data):
        my_thing = db.get(id)
        my_thing.data = data
        if ctx.request_type != REQUEST_TYPE.TRACE:
            # Write to storage
            my_thing.put()
        # More ideally we wrap our storage layer to use the flag
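
    A hedged sketch of that "wrap the storage layer" idea (class and method names are assumptions; the point is that call sites never branch on the flag themselves):

        import logging

        class TracerSafeStore:
            def __init__(self, real_store):
                self.real_store = real_store

            def get(self, ctx, key):
                # Reads are always safe to perform.
                return self.real_store.get(key)

            def put(self, ctx, entity):
                if ctx.get("request_type") == "TRACE":
                    # Tracer-bullet request: record the intent, skip the write.
                    logging.info("noop write", extra={"ctx": ctx})
                    return None
                return self.real_store.put(entity)
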
  297. realkinetic.com | @real_kinetic This looks like a feature flag

  298. realkinetic.com | @real_kinetic Yes!

  299. realkinetic.com | @real_kinetic Use feature flags!

  300. realkinetic.com | @real_kinetic And you can use them for more

    than just features
  301. realkinetic.com | @real_kinetic Just make sure you log those flags

    as part of your context so your tools can properly tag the data
  302. realkinetic.com | @real_kinetic Tracer bullets are how we generated our

    graphs
  303. realkinetic.com | @real_kinetic

  304. realkinetic.com | @real_kinetic And now I’m going to get “rant-y”

  305. realkinetic.com | @real_kinetic 9. Provide the ability to experiment and

    test in production
  306. realkinetic.com | @real_kinetic Tracer bullets, feature flags allow us to

    use our production system for gathering information
  307. realkinetic.com | @real_kinetic We should also support “tester” accounts so

    you can fully mimic all user actions in a production system
  308. realkinetic.com | @real_kinetic All of the work you need to

    do to support this is work that you should do anyway to fully support multi-tenant apps
  309. realkinetic.com | @real_kinetic The ability to isolate services, accounts, actions

    on demand
  310. realkinetic.com | @real_kinetic The ability to stop, interrupt, move bad

    acting services, users, etc
  311. realkinetic.com | @real_kinetic Ideally, support chaos tools in production (Also,

    use chaos tooling! :))
  312. realkinetic.com | @real_kinetic Allowing folks to experiment and learn within

    the production system helps them build an intuition for the system, its behavior, and their impact on that behavior
  313. realkinetic.com | @real_kinetic 10. Use tools (custom if necessary) to

    simulate usage
  314. realkinetic.com | @real_kinetic Load testing, chaos, general traffic simulation

  315. realkinetic.com | @real_kinetic Using network proxies and a data pipeline

    will allow you to capture actual traffic …
  316. realkinetic.com | @real_kinetic Of which you can then replay to

    simulate certain traffic patterns, etc
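
    One hedged sketch of replaying captured traffic, assuming the proxy/pipeline wrote one JSON object per request to a file and that replays are marked as tracer bullets (file name, field names, host, and header are assumptions; `requests` is the common third-party HTTP library):

        import json
        import requests

        with open("captured_requests.jsonl") as f:
            for line in f:
                req = json.loads(line)
                # Mark the replay so downstream services treat it as a no-op/trace request.
                requests.request(
                    req["method"],
                    "https://api.example.com" + req["path"],
                    headers={"X-Request-Type": "TRACE"},
                    json=req.get("body"),
                )
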
  317. realkinetic.com | @real_kinetic 11. Kill environments

  318. realkinetic.com | @real_kinetic Fewer environments means … fewer environments

  319. realkinetic.com | @real_kinetic Fewer things to maintain and understand means

    we can put more time into understanding our other systems
  320. realkinetic.com | @real_kinetic “Production” (any environment to which customers have

    access) is the only environment that matters
  321. realkinetic.com | @real_kinetic So why do we spend so much

    time not in production?
  322. realkinetic.com | @real_kinetic We know replicas and models are not

    as good as the real thing
  323. realkinetic.com | @real_kinetic Yet we continue to build that way.

  324. realkinetic.com | @real_kinetic And worse we allow shortcuts in other

    environments that won’t work in production (SSH in Dev, No SSH in Prod)
  325. realkinetic.com | @real_kinetic Wouldn’t you also want those tools and

    abilities in production?
  326. realkinetic.com | @real_kinetic We don’t invest in building production capable

    tools for dev because … time?
  327. realkinetic.com | @real_kinetic So instead you’re going to wait until

    you have a production issue?
  328. realkinetic.com | @real_kinetic Scenario: Massive Outage

    Boss: What are we doing to resolve the issue?
    You: Well, not much. Normally I would do “x” but I can’t because those only work in dev environments. So I’m going to attempt to hack together some duct tape solution that I’ll never use again. And I’m going to run it now in production without going through the code review process.
  329. realkinetic.com | @real_kinetic If you’ve done everything mentioned then why

    would you need other environments? (Quick answer: If you need to change/test core infrastructure that impacts all users at all times)
  330. realkinetic.com | @real_kinetic Do your best to force as much

    development and testing in production as possible (The exception, again: when you need to change/test core infrastructure that impacts all users at all times)
  331. realkinetic.com | @real_kinetic In closing …

  332. realkinetic.com | @real_kinetic There’s so much more we can do

    that I didn’t get to
  333. realkinetic.com | @real_kinetic And it all starts with empathy for

    our peers and users
  334. realkinetic.com | @real_kinetic Please come talk to me I would

    love to discuss further @lyddonb
  335. realkinetic.com | @real_kinetic Quick Recap:

  336. realkinetic.com | @real_kinetic

    • Pass a context
    • Structure your logs
    • Create a data pipeline
    • Structure all system data and pass to pipeline
    • Minimize, track and build visualizations for dependencies
    • Leverage service meshes
    • Distributed Tracing
    • Support NoOp, experimentation, simulation in production
    • Then kill as many non-production environments as possible
  337. realkinetic.com | @real_kinetic And here are all those tools again:

  338. realkinetic.com | @real_kinetic

  339. realkinetic.com | @real_kinetic Thank You

  340. realkinetic.com | @real_kinetic @lyddonb @real_kinetic Real Kinetic mentors clients to

    enable their technical teams to grow and build high-quality software
  341. realkinetic.com | @real_kinetic Resources & References

    • Cloud Native Landscape
    • Incidents Are Unplanned Investments
    • stella.report
    • How to Keep Your Systems Running Day After Day - Allspaw
    • Honeycomb
    • More Environments Will Not Make Things Easier
    • Silicon Valley’s Tech Gods Are Headed For A Reckoning
    • On purpose and by necessity: compliance under the GDPR
    • ACCELERATE: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations
    • a16z Podcast: Feedback Loops — Company Culture, Change, and DevOps
    • System and method for performing distributed asynchronous calculations in a networked environment
    • You Could Have Invented Structured Logging
    • What is structured logging and why developers need it
    • How one developer just broke Node, Babel and thousands of projects in 11 lines of JavaScript
    • W3C Distributed Trace Context Community Group
    • Load Testing with Locust
  342. realkinetic.com | @real_kinetic Products, Libs, Etc

    • Splunk • Datadog • Nagios • Apache Kafka • Amazon Kinesis • FluentD • Prometheus
    • Google Stackdriver • VictorOps • Amazon Glacier • Google BigQuery • Amazon Redshift
    • OpenCensus • OpenTracing • Haskell • Go • AWS DynamoDB • Spigo and Simianviz
    • Envoy • Kubernetes • Istio • Linkerd • Kong • Jaeger • Zipkin • AWS X-Ray
    • Stackdriver Trace • Vizceral