Programming in the Large: Architecture and Experimentation

Mark Hibberd
December 11, 2014

Building robust, quality systems is hard. We trade off organizational issues against technical decisions; the ability to deliver quickly against our ability to change; and the ability to build systems easily against the ability to run those systems in production. However, good architectural decisions can free us to choose the right tools and techniques, allowing us to manage these challenges and concentrate on solving real problems rather than our made-up ones.

In this talk, we will run through some stereotypical projects, come to terms with legacy systems, and look at the properties of robust architectures. In particular we are interested in how architectures lend themselves to experimentation and change in terms of both function and technology.

We will attempt to ground the discussion with examples from my past projects, looking at where things have worked well and, probably of more interest, where they really have not.

This was presented at the YOW! conference in Australia: Melbourne (04/12/2014), Brisbane (08/12/2014), and Sydney (11/12/2014) (http://2014.yowconference.com.au/).

Transcript

  1. @markhibberd programming in the large Architecture and Experimentation

  2. “Simplicity is prerequisite for reliability” Edsger W. Dijkstra, How do we tell truths that might hurt? (1975)

  3. Legacy Systems and Organisations

  4. How Did We Get Here

  5. The Hand-me Down Code Last Touched

  6. Code Last Touched You Started The Hand-me Down

  7. Code Last Touched You Started Everyone Else Started The Hand-me Down

  8. Code Last Touched You Started Everyone Else Started You’re The Expert The Hand-me Down

  9. The Rush Job Start Work

  10. Start Work A Working System The Rush Job

  11. Start Work System Delivered A Working System The Rush Job

  12. Start Work System Delivered The Rush Job

  13. The Rewrite Someone Else’s Code

  14. Someone Else’s Code System Delivered The Rewrite

  15. Someone Else’s Code System Delivered Bob Knows Better The Rewrite

  16. Someone Else’s Code System Delivered A New System Bob Knows Better The Rewrite

  17. Someone Else’s Code System Delivered A New, Not Quite Working System Bob Knows Better The Rewrite

  18. Someone Else’s Code System Delivered An Old, Not Quite Working System Bob Knows Better The Rewrite

  19. The Greenfield Enthusiasm

  20. Enthusiasm System Delivered The Greenfield

  21. Enthusiasm Realisation and Despair System Delivered The Greenfield

  22. An Idea Oh, Sorry, We Shipped That 30 Minutes Later The Prototype

  23. The Bandwagon

  24. How We Pick Our Technology The Bandwagon

  25. Perhaps we need a microservice to deploy Docker The Bandwagon

  26. So we can run a microservice The Bandwagon

  27. To display some text The Bandwagon

  28. legacy is the default

  29. The Ideal New Ideas

  30. The Ideal New Ideas Stable Ideas

  31. The Ideal New Ideas Stable Ideas We Now Know Better

  32. Taking Responsibility

  33. Too Important to Ignore, Too Important to Change an anecdote

  34. 100 million+ active users; 100 million+ transactions a day; millions of $$$; a couple of “simple” services

  35. server client

  36. /call server client on-demand

  37. /call server client /check on-demand periodically

  38. /call server client /check on-demand periodically

  39. /call server client /check on-demand periodically /check2 /check2z /v3check

  40. /call server client /check on-demand periodically /check2 /check2z /v3check

  41. /call server /check /check2 /check2z /v3check

  42. enter our protagonists…

  43. /call server /check /check2 /check2z /v3check we spent a lot of time “fire fighting”

  44. /call server /check /check2 /check2z /v3check we spent a lot of time “fire fighting”

  45. /call server /check /check2 /check2z /v3check we spent a lot of time improving “quality”

  46. /call server /check we spent a lot of time improving “quality”

  47. /call server /check we spent a lot of time improving “quality”

  48. Programmer Myth #1 It Is Someone Else’s Fault

  49. we completely failed to adapt the system for change

  50. we remained hostage to a fear of change

  51. Autonomous Systems and Rates of Change

  52. Systems

  53. Code Search an example

  54. code search

  55. web du jour db ui

  56. web du jour db ui indexer api

  57. web du jour db ui indexer api

  58. web du jour db ui indexer api

  59. db ui indexer api

  60. the thing about real systems is their autonomy

  61. ?

  62. rules not boxes

  63. architecture is the concepts on which we formulate our systems

  64. architecture is the rules for how these systems interact

  65. architecture is the rules for how these systems are implemented

  66. indexer search independent problem domains

  67. indexer search code ctags ctags application/html application/search.v1+json well defined interfaces

  68. indexer search code ctags ctags application/html application/search.v1+json well defined interfaces

  69. indexer independent technical decisions search shell scala

  70. indexer independent technical decisions search shell scala git hook embedded

  71. indexer independent technical decisions search shell scala git hook embedded os logging os logging

  72. indexer consistency helps avoid chaos search shell scala git hook embedded os logging os logging

  73. Autonomy

  74. #1 individually deployable

  75. indexer search

  76. indexer search v1 v1

  77. indexer search v2 v1

  78. indexer search v3 v1

  79. indexer search v3 v2

  80. #2 independent domain models

  81. indexer search

  82. different notions of “index”

  83. really don’t do this

  84. really don’t do this

  85. #3 standards for interchange formats

  86. indexer search

  87. indexer search

  88. indexer search standard rules for these help avoid chaos

  89. #4 no shared state

  90. None
  91. really don’t do this

  92. really don’t do this

  93. autonomy builds in reliability

  94. indexer search

  95. x search x /\/\/\/\/\

  96. x search x /\/\/\/\/\

  97. autonomy builds in the ability to change

  98. indexer search shell scala git hook embedded os logging os logging

  99. indexer search haskell scala git hook embedded os logging os logging

  100. How long does it take to get a 1 line change to production?

  101. None
  102. None
  103. None
  104. None
  105. None
  106. warning signs an anecdote

  107. multi database - multi data center replication; 100 million+ transactions a day

  108. None
  109. None
  110. None
  111. None
  112. None
  113. None
  114. None
  115. None
  116. x x

  117. x x /\/\/\/\/\/\/\

  118. the data-model was entirely shared between replication and otp system

  119. it was ALL shared state

  120. it was really only feasible to change if one team was working on both “systems”

  121. if one system failed, they often both failed

  122. as we patched failure modes, reliability never improved

  123. x x /\/\/\/\/\/\/\

  124. x /\/\/\/\/\/\/\

  125. autonomy is far more important for reliability than code improvements

  126. Programmer Myth #2 The Bad Code is to Blame

  127. System Evolution

  128. “... with proper design, the features come cheaply. This approach is arduous, but continues to succeed.” Dennis Ritchie

  129. thinking ahead is not about avoiding change

  130. indexer search shell scala git hook embedded os logging os logging

  131. indexer search haskell scala git hook embedded os logging os logging

  132. thinking ahead is about letting us change at different rates for different problems

  133. thinking ahead is about letting us make short term decisions that don’t have long term effects

  134. attempting change an anecdote

  135. small company; analytics product; very quality focused team; inherited a small piece of code; very bad code

  136. the product

  137. the jsp

  138. the rewrite heavy focus on quality

  139. the rewrite but… rebuilt same structure

  140. None
  141. the indivisible blob

  142. websphere the indivisible blob

  143. websphere the indivisible blob The Plan ui core split

  144. websphere the indivisible blob The Plan ui core tech upgrade

  145. websphere the indivisible blob The Plan ui core indexer websphere isolate

  146. The Reality ui core indexer websphere

  147. The Reality ui core indexer websphere data model + state

  148. The Reality ui core indexer websphere data model + state WEBSPHERE

  149. Programmer Myth #3 We Must Do Something Now

  150. Rewrites

  151. Programmer Myth #4 We Should Rewrite

  152. (not) Rewrites

  153. architecture is controlled by developers not architects

  154. #1 version everything

  155. indexer search

  156. indexer v1 search v1

  157. indexer v1 search v1 v1 v1 v1

  158. the internet is broken an aside

  159. MIME-Version: 1.0

  160. what should a client do if it sees something that isn’t version 1.0?

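
The question on slide 160 applies equally to the `application/search.v1+json` interchange format named on slides 67 and 68. A minimal sketch in Python of one possible answer: the client extracts the declared version and refuses anything it does not recognise instead of guessing. The function name, the exception, and the set of supported versions are assumptions for illustration, not something the talk specifies.

```python
import re

# Versions this hypothetical client knows how to decode.
SUPPORTED_VERSIONS = {1}

# Matches media types shaped like "application/search.v1+json".
MEDIA_TYPE = re.compile(r"^application/search\.v(\d+)\+json$")


class UnsupportedVersion(Exception):
    """Raised when a payload declares a version this client cannot read."""


def parse_version(content_type: str) -> int:
    """Return the declared version, refusing anything unrecognised."""
    match = MEDIA_TYPE.match(content_type.strip())
    if match is None:
        raise UnsupportedVersion(f"not a versioned search type: {content_type!r}")
    version = int(match.group(1))
    if version not in SUPPORTED_VERSIONS:
        # Fail loudly: silently treating v2 as v1 is how formats get stuck.
        raise UnsupportedVersion(f"unknown version {version}")
    return version
```

Calling `parse_version("application/search.v1+json")` returns `1`, while a `v2` payload raises `UnsupportedVersion`, so adding a new version becomes an explicit, visible change on both sides.
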
  161. #2 the wedge

  162. the status quo

  163. a wedge the status quo

  164. a wedge the status quo

  165. a wedge the status quo

  166. a wedge

  167. a wedge

  168. mega-code-search-tool

  169. mega-code-search-tool R

  170. mega-code-search-tool external indexer support R

  171. mega-code-search-tool R external indexer support

  172. mega-code-search-tool R external indexer support scala

  173. R scala haskell javascript search
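
The transcript does not say what "R" stands for on these slides. The Python sketch below reads the wedge as a single routing point placed in front of the existing tool: languages with an external indexer registered are sent there, and everything else still goes to the legacy code. `Wedge`, `legacy_index`, and `scala_index` are hypothetical names and the indexers are stand-ins.

```python
from typing import Callable, Dict, List

Indexer = Callable[[str], List[str]]


def legacy_index(source: str) -> List[str]:
    """Stand-in for the existing tool's built-in indexing."""
    return ["legacy:" + word for word in source.split()]


def scala_index(source: str) -> List[str]:
    """Stand-in for a new, external Scala indexer."""
    return ["scala:" + word for word in source.split()]


class Wedge:
    """Sends each request to a registered replacement, else to the legacy code."""

    def __init__(self, fallback: Indexer) -> None:
        self._fallback = fallback
        self._routes: Dict[str, Indexer] = {}

    def register(self, language: str, indexer: Indexer) -> None:
        self._routes[language] = indexer

    def unregister(self, language: str) -> None:
        # Rolling a move back is just removing its route.
        self._routes.pop(language, None)

    def index(self, language: str, source: str) -> List[str]:
        return self._routes.get(language, self._fallback)(source)


wedge = Wedge(legacy_index)
wedge.register("scala", scala_index)
```

With this in place, `wedge.index("scala", ...)` reaches the new indexer while `wedge.index("haskell", ...)` still reaches the legacy path until a Haskell indexer is registered.
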

  174. #3 embrace partial moves

  175. mega-code-search-tool

  176. mega-code-search-tool {incomplete}

  177. control in progress moves at a single point

  178. track and cap the number of moves in progress

  179. plan for rollback as much as rollforward
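
One way to picture the three points on slides 177 to 179 is a small registry that is the only place a partial move can be started, enforces a cap on moves in flight, and records rollbacks as a normal outcome. This is a sketch in Python under those assumptions; `MoveRegistry` and its method names are hypothetical.

```python
from typing import Dict, List

IN_PROGRESS, DONE, ROLLED_BACK = "in-progress", "done", "rolled-back"


class TooManyMoves(Exception):
    """Raised when starting a move would exceed the cap."""


class MoveRegistry:
    """Single point that tracks which partial moves are in flight."""

    def __init__(self, cap: int) -> None:
        self._cap = cap
        self._state: Dict[str, str] = {}

    def in_progress(self) -> List[str]:
        return sorted(n for n, s in self._state.items() if s == IN_PROGRESS)

    def start(self, name: str) -> None:
        if len(self.in_progress()) >= self._cap:
            raise TooManyMoves(f"cap of {self._cap} reached; finish or roll back first")
        self._state[name] = IN_PROGRESS

    def finish(self, name: str, rolled_back: bool = False) -> None:
        if self._state.get(name) != IN_PROGRESS:
            raise KeyError(name)
        # Rolling back is recorded as a first-class outcome, not a failure.
        self._state[name] = ROLLED_BACK if rolled_back else DONE

    def outcome(self, name: str) -> str:
        return self._state[name]
```
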

  180. #4 validate as you go

  181. mega-code-search-tool

  182. mega-code-search-tool external indexer support R

  183. R make sure you can run this straight away external indexer support mega-code-search-tool

  184. R make sure you can run this straight away external indexer support mega-code-search-tool

  185. mega-code-search-tool R external indexer support scala

  186. mega-code-search-tool R external indexer support scala

  187. mega-code-search-tool R external indexer support scala

  188. R scala haskell javascript search

  189. Experimentation and Measurement

  190. Change Without Fear

  191. we need confidence that things don’t break when we ship code

  192. confidence stems from knowing code works in production before it affects a customer

  193. #1 move production to development

  194. production quality data; automation of environments; lots of testing

  195. production quality data; automation of environments; lots of testing; Rather Old Hat

  196. #2 move development to production

  197. yes, really. i want to ship your worst, un-tried, experimental code to production

  198. Programmer Myth #5 We Can’t Ship That

  199. Safety First

  200. @ambiata we deal with ingesting and processing lots of data; 100s TB / per day / per customer; scientific experiment and measurement is key; experiments affect users directly; researchers / non-specialist engineers produce code

  201. ingest store the machine package publish

  202. ingest store the machine package publish

  203. ingest store package publish the machine

  204. #1 split environments

  205. ingest store package publish the machine

  206. ingest store package publish the machine production:live

  207. ingest store package publish the machine production:exp

  208. ingest store package publish the machine production:* package publish

  209. implemented through machine level acls experiment live control

  210. implemented through machine level acls experiment live control write read

  211. implemented through machine level acls experiment live control

  212. implemented through machine level acls experiment live control write read

  213. implemented through machine level acls experiment live control write read
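
The slides say the split between `production:live` and `production:exp` is enforced with machine-level ACLs but do not spell out the permission matrix. The Python sketch below assumes one plausible reading: experimental code may read live data but may only write to the experiment environment. It models the rule at application level purely for illustration; the table contents and the `allowed` helper are assumptions.

```python
from typing import Dict, Set, Tuple

# (environment the code runs in, environment the data lives in) -> permitted operations.
# This matrix is an assumption; the talk only states that ACLs exist.
ACL: Dict[Tuple[str, str], Set[str]] = {
    ("live", "live"): {"read", "write"},
    ("exp", "live"): {"read"},
    ("exp", "exp"): {"read", "write"},
}


def allowed(actor_env: str, data_env: str, operation: str) -> bool:
    """Return True only if the ACL table explicitly grants the operation."""
    return operation in ACL.get((actor_env, data_env), set())
```

Under this matrix an experiment can consume real production data without any path to corrupting it, and live code never depends on anything an experiment wrote.
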

  214. #2 checkpoints

  215. ingest store package publish the machine

  216. ingest store package publish the machine x x

  217. ingest store package publish the machine x x

  218. ingest store package publish the machine x x

  219. ingest store package publish the machine x x

  220. deep implementation, intra- and inter- process crosschecks

  221. #3 tandem deployments

  222. ingest store package publish the machine

  223. ingest store package publish the machine

  224. ingest store package publish the machine x x x x

  225. ingest store package publish the machine x x x x

  226. ingest store package publish the machine x x x x

  227. ingest store package publish the machine x x x x
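
A tandem deployment can be sketched as running the live and candidate versions of a stage on the same input, publishing only the live result, and recording where the candidate disagrees or fails. The Python below is a minimal illustration under that assumption; `run_in_tandem` is a hypothetical helper, not the mechanism used in the talk.

```python
from typing import Callable, List, Tuple, TypeVar

A = TypeVar("A")
B = TypeVar("B")


def run_in_tandem(
    live: Callable[[A], B],
    candidate: Callable[[A], B],
    inputs: List[A],
) -> Tuple[List[B], List[A]]:
    """Return the live results plus the inputs where the candidate diverged."""
    results: List[B] = []
    mismatches: List[A] = []
    for item in inputs:
        expected = live(item)
        results.append(expected)  # only the live output is ever published
        try:
            agrees = candidate(item) == expected
        except Exception:
            # A crash in experimental code is recorded, never propagated.
            agrees = False
        if not agrees:
            mismatches.append(item)
    return results, mismatches
```

An empty mismatch list over enough real traffic is the evidence that the candidate is ready to become the live version.
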

  228. #4 measure everything

  229. every result computed should have traceability back to the code & data

  230. package publish the machine

  231. package publish the machine publish-ab12f2e

  232. package publish the machine publish-ab12f2e

  233. package publish the machine publish-ab12f2e

  234. package publish the machine package-ab12f2e

  235. package publish the machine score-ab12f2e

  236. package publish the machine

  237. package publish the machine size: 192GB checksum: d32fe1a created: 2014-03-02T10:01 loaded: store-a122fe3

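
The metadata on slide 237 suggests that every artefact carries a record of its size, checksum, creation time, and the upstream artefact it was loaded from, with stage outputs tagged by code revision (`package-ab12f2e`). A minimal Python sketch of such a record follows. The `Provenance` type, the `describe` helper, the choice of SHA-256, and truncating the checksum to seven characters are all assumptions made to mirror the shape of the slide.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Provenance:
    produced_by: str  # stage name plus code revision, e.g. "package-ab12f2e"
    loaded: str       # upstream artefact consumed, e.g. "store-a122fe3"
    size: int         # size of the output in bytes
    checksum: str     # short content hash of the output
    created: str      # UTC timestamp, minute precision


def describe(data: bytes, stage: str, revision: str, loaded: str) -> Provenance:
    """Build the traceability record for one computed output."""
    return Provenance(
        produced_by=f"{stage}-{revision}",
        loaded=loaded,
        size=len(data),
        checksum=hashlib.sha256(data).hexdigest()[:7],
        created=datetime.now(timezone.utc).isoformat(timespec="minutes"),
    )
```

Because each record names both the code revision and the input artefact, any published result can be walked back stage by stage to the code and data that produced it.
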
  238. statistics work, measurements over time will find errors

  239. package publish the machine

  240. package publish the machine wall-time: 13411s cpu-time: 429130s records: 19 million histogram: a: 13 million b: 2 million c: 4 million

  241. package publish the machine wall-time: 13411s cpu-time: 429130s records: 19 million histogram: a: 13 million b: 2 million c: 4 million aggregate over time

  242. package publish the machine median: … averages: cpu-time: 420030s quantiles: … aggregate over time

  243. package publish the machine cross check everything wall-time: 13411s cpu-time: 429130s records: 19 million histogram: a: 13 million b: 2 million c: 4 million

  244. package publish the machine cross check everything wall-time: 13411s cpu-time: 429130s records: 19 million histogram: a: 13 million b: 2 million c: 4 million

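
Two checks fall out of the numbers on these slides: the histogram buckets should add up to the record count (13 + 2 + 4 = 19 million), and a run's measurements can be compared against the aggregate of earlier runs. The Python sketch below shows both. The helper names, the use of the median as the baseline, and the 50% tolerance are assumptions for illustration.

```python
from statistics import median
from typing import Dict, List


def histogram_matches(records: int, histogram: Dict[str, int]) -> bool:
    """Cross check: every record must land in exactly one bucket."""
    return sum(histogram.values()) == records


def is_outlier(value: float, history: List[float], tolerance: float = 0.5) -> bool:
    """Flag a measurement that strays from the historical median by more than
    the given fraction. With no history there is nothing to compare against."""
    if not history:
        return False
    typical = median(history)
    return abs(value - typical) > tolerance * typical


# Figures from the slides; the second and third history values are made up.
run_ok = histogram_matches(
    19_000_000, {"a": 13_000_000, "b": 2_000_000, "c": 4_000_000}
)
cpu_suspicious = is_outlier(429130, [420030, 418000, 425000])
```

Here `run_ok` is `True` and `cpu_suspicious` is `False`: the run is internally consistent and its CPU time sits close to the historical baseline.
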
  245. Programmer Myth #6 But We Can’t Do That In Our Situation

  246. these techniques adapt

  247. WebCloud (tm) live live

  248. WebCloud (tm) live proxy live

  249. WebCloud (tm) experiment live proxy experiment live

  250. WebCloud (tm) experiment live proxy experiment live

  251. Packaged Products live live live

  252. Packaged Products live measurement live live

  253. Packaged Products live measurement live live policy

  254. Packaged Products experiment live measurement live live policy

  255. change is the default

  256. architecture is every day

  257. experiment for reliability

  258. measure always

  259. end z ģ G Y