Programming in the Large: Architecture and Experimentation

Mark Hibberd
December 11, 2014

Building robust, quality systems is hard. We trade off organizational issues against technical decisions; the ability to deliver quickly against our ability to change; and the ability to build systems easily against the ability to run those systems in production. However, good architectural decisions can free us to choose the right tools and techniques, allowing us to manage these challenges and concentrate on solving real problems rather than made-up ones.

In this talk, we will run through some stereotypical projects, come to terms with legacy systems, and look at the properties of robust architectures. In particular we are interested in how architectures lend themselves to experimentation and change in terms of both function and technology.

We will attempt to ground the discussion with examples from my past projects, looking at where things have worked well and, probably of more interest, where they really have not.

This was presented at the YOW! conference in Australia: Melbourne (04/12/2014), Brisbane (08/12/2014) and Sydney (11/12/2014) (http://2014.yowconference.com.au/).

Transcript

  1. @markhibberd
    programming in the large
    Architecture and
    Experimentation

    View Slide

  2. “Simplicity is prerequisite for
    reliability”
    Edsger W. Dijkstra,
    How do we tell truths that might hurt? (1975)

    View Slide

  3. Legacy Systems
    and Organisations

    View Slide

  4. How Did We Get Here

    View Slide

  5. The Hand-me Down
    Code Last
    Touched

    View Slide

  6. Code Last
    Touched
    You
    Started
    The Hand-me Down

    View Slide

  7. Code Last
    Touched
    You
    Started
    Everyone Else
    Started
    The Hand-me Down

    View Slide

  8. Code Last
    Touched
    You
    Started
    Everyone Else
    Started
    You’re The
    Expert
    The Hand-me Down

    View Slide

  9. The Rush Job
    Start
    Work

    View Slide

  10. Start
    Work
    A Working
    System
    The Rush Job

    View Slide

  11. Start
    Work
    System
    Delivered
    A Working
    System
    The Rush Job

    View Slide

  12. Start
    Work
    System
    Delivered
    The Rush Job

    View Slide

  13. The Rewrite
    Someone
    Else’s Code

    View Slide

  14. Someone
    Else’s Code
    System
    Delivered
    The Rewrite

    View Slide

  15. Someone
    Else’s Code
    System
    Delivered
    Bob Knows
    Better
    The Rewrite

    View Slide

  16. Someone
    Else’s Code
    System
    Delivered
    A New
    System
    Bob Knows
    Better
    The Rewrite

    View Slide

  17. Someone
    Else’s Code
    System
    Delivered
    A New, Not Quite
    Working System
    Bob Knows
    Better
    The Rewrite

    View Slide

  18. Someone
    Else’s Code
    System
    Delivered
    An Old, Not Quite
    Working System
    Bob Knows
    Better
    The Rewrite

    View Slide

  19. The Greenfield
    Enthusiasm

    View Slide

  20. Enthusiasm System
    Delivered
    The Greenfield

    View Slide

  21. Enthusiasm Realisation
    and Despair
    System
    Delivered
    The Greenfield

    View Slide

  22. An Idea Oh, Sorry, We
    Shipped That
    30 Minutes
    Later
    The Prototype

    View Slide

  23. The Bandwagon

    View Slide

  24. How We
    Pick Our
    Technology
    The Bandwagon

    View Slide

  25. Perhaps we need a
    microservice to
    deploy Docker
    The Bandwagon

    View Slide

  26. So we can run
    a microservice
    The Bandwagon

    View Slide

  27. To display some
    text
    The Bandwagon

    View Slide

  28. legacy
    is the default

    View Slide

  29. The Ideal
    New Ideas

    View Slide

  30. The Ideal
    New Ideas
    Stable Ideas

    View Slide

  31. The Ideal
    New Ideas
    Stable Ideas
    We Now Know Better

    View Slide

  32. Taking Responsibility

    View Slide

  33. Too Important to Ignore,
    Too Important to Change
    an anecdote

    View Slide

  34. 100 million+ active users
    100 million+ transactions a day
    millions of $$$
    a couple of “simple” services

    View Slide

  35. server client

    View Slide

  36. /call
    server client
    on-demand

    View Slide

  37. /call
    server client
    /check
    on-demand
    periodically

    View Slide

  38. /call
    server client
    /check
    on-demand
    periodically

    View Slide

  39. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

    View Slide

  40. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

    View Slide

  41. /call
    server
    /check
    /check2
    /check2z
    /v3check

    View Slide

  42. enter our protagonists…

    View Slide

  43. /call
    server
    /check
    we spent a lot
    of time “fire
    fighting”
    /check2
    /check2z
    /v3check

    View Slide

  44. /call
    server
    /check
    we spent a lot
    of time “fire
    fighting”
    /check2
    /check2z
    /v3check

    View Slide

  45. /call
    server
    /check
    /check2
    /check2z
    /v3check
    we spent a
    lot of time
    improving
    “quality”

    View Slide

  46. /call
    server
    /check
    we spent a
    lot of time
    improving
    “quality”

    View Slide

  47. /call
    server
    /check
    we spent a
    lot of time
    improving
    “quality”

    View Slide

  48. Programmer Myth #1
    It Is Someone Else’s Fault

    View Slide

  49. we completely failed
    to adapt the system
    for change

    View Slide

  50. we remained hostage
    to a fear of change

    View Slide

  51. Autonomous Systems
    and Rates of Change

    View Slide

  52. Systems

    View Slide

  53. Code Search
    an example

    View Slide

  54. code search

    View Slide

  55. web du jour
    db
    ui

    View Slide

  56. web du jour
    db
    ui
    indexer
    api

    View Slide

  57. web du jour
    db
    ui
    indexer
    api

    View Slide

  58. web du jour
    db
    ui
    indexer
    api

    View Slide

  59. db
    ui
    indexer
    api

    View Slide

  60. the thing about real
    systems is their
    autonomy

    View Slide

  61. ?

    View Slide

  62. rules
    not boxes

    View Slide

  63. architecture is the
    concepts on which we
    formulate our systems

    View Slide

  64. architecture is the rules
    for how these systems
    interact

    View Slide

  65. architecture is the rules
    for how these systems are
    implemented

    View Slide

  66. indexer search
    independent problem domains

    View Slide

  67. indexer search
    code
    ctags
    ctags
    application/html
    application/search.v1+json
    well defined interfaces

    View Slide

  68. indexer search
    code
    ctags
    ctags
    application/html
    application/search.v1+json
    well defined interfaces

    View Slide
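
One way to read "well defined interfaces" is that a consumer checks the declared format and version before it touches the payload. The sketch below assumes a vendor-style media type like the `application/search.v1+json` shown on the slide; the regular expression, the `hits`/`path` payload shape and the function names are all hypothetical, chosen only for illustration.

```python
import json
import re

# Hypothetical pattern for media types such as "application/search.v1+json".
MEDIA_TYPE = re.compile(r"^application/(?P<name>[a-z]+)\.v(?P<version>\d+)\+json$")


def parse_media_type(value):
    """Return (name, version) or raise ValueError for anything unrecognised."""
    match = MEDIA_TYPE.match(value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported media type: {value!r}")
    return match.group("name"), int(match.group("version"))


def decode_search_v1(body):
    """Decode a v1 search result; the 'hits' and 'path' fields are assumed."""
    document = json.loads(body)
    return [hit["path"] for hit in document["hits"]]


# One decoder per (name, version); new versions are added alongside old ones.
DECODERS = {("search", 1): decode_search_v1}


def decode(content_type, body):
    """Pick a decoder from the declared type, refusing versions we do not know."""
    key = parse_media_type(content_type)
    if key not in DECODERS:
        raise ValueError(f"no decoder for {key[0]} v{key[1]}")
    return DECODERS[key](body)
```

Because the version travels with the data, a second decoder could be registered under `("search", 2)` while v1 producers are still running, which is what lets the two sides deploy independently.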

  69. indexer
    independent technical decisions
    search
    shell scala

    View Slide

  70. indexer
    independent technical decisions
    search
    shell scala
    git hook embedded

    View Slide

  71. indexer
    independent technical decisions
    search
    shell scala
    git hook embedded
    os logging os logging

    View Slide

  72. indexer
    consistency helps avoid chaos
    search
    shell scala
    git hook embedded
    os logging os logging

    View Slide

  73. Autonomy

    View Slide

  74. #1
    individually deployable

    View Slide

  75. indexer search

    View Slide

  76. indexer search
    v1
    v1

    View Slide

  77. indexer search
    v2 v1

    View Slide

  78. indexer search
    v3 v1

    View Slide

  79. indexer search
    v3 v2

    View Slide

  80. #2
    independent domain
    models

    View Slide

  81. indexer search

    View Slide

  82. different notions of “index”

    View Slide

  83. really don’t do this

    View Slide

  84. really don’t do this

    View Slide

  85. #3
    standards for
    interchange formats

    View Slide

  86. indexer search

    View Slide

  87. indexer search

    View Slide

  88. indexer search
    standard rules for these help avoid chaos

    View Slide

  89. #4
    no shared state

    View Slide

  91. really don’t do this

    View Slide

  92. really don’t do this

    View Slide

  93. autonomy builds in
    reliability

    View Slide

  94. indexer search

    View Slide

  95. x
    search
    x
    /\/\/\/\/\

    View Slide

  96. x
    search
    x
    /\/\/\/\/\

    View Slide

  97. autonomy builds in the
    ability to change

    View Slide

  98. indexer search
    shell scala
    git hook embedded
    os logging os logging

    View Slide

  99. indexer search
    haskell scala
    git hook embedded
    os logging os logging

    View Slide

  100. How long does it take to
    get a 1 line change to
    production?

    View Slide

  106. warning signs
    an anecdote

    View Slide

  107. multi database - multi data center replication
    100 million+ transactions a day

    View Slide

  116. x x

    View Slide

  117. x x
    /\/\/\/\/\/\/\

    View Slide

  118. the data-model was
    entirely shared
    between replication
    and otp system

    View Slide

  119. it was ALL shared state

    View Slide

  120. it was really only
    feasible to change if
    one team was working
    on both “systems”

    View Slide

  121. if one system failed,
    they often both failed

    View Slide

  122. as we patched failure
    modes, reliability never
    improved

    View Slide

  123. x x
    /\/\/\/\/\/\/\

    View Slide

  124. x
    /\/\/\/\/\/\/\

    View Slide

  125. autonomy is far more
    important for
    reliability than code
    improvements

    View Slide

  126. Programmer Myth #2
    The Bad Code is to Blame

    View Slide

  127. System Evolution

    View Slide

  128. “... with proper design, the
    features come cheaply. This
    approach is arduous, but
    continues to succeed.”
    Dennis Ritchie

    View Slide

  129. thinking ahead is not
    about avoiding change

    View Slide

  130. indexer search
    shell scala
    git hook embedded
    os logging os logging

    View Slide

  131. indexer search
    haskell scala
    git hook embedded
    os logging os logging

    View Slide

  132. thinking ahead is about
    letting us change at
    different rates for
    different problems

    View Slide

  133. thinking ahead is about
    letting us make short
    term decisions that don’t
    have long term effects

    View Slide

  134. attempting change
    an anecdote

    View Slide

  135. small company
    analytics product
    very quality focused team
    inherited a small piece of code
    very bad code

    View Slide

  136. the
    product

    View Slide

  137. the
    jsp

    View Slide

  138. the
    rewrite
    heavy focus on quality

    View Slide

  139. the
    rewrite
    but… rebuilt same structure

    View Slide

  141. the
    indivisible
    blob

    View Slide

  142. websphere
    the
    indivisible
    blob

    View Slide

  143. websphere
    the
    indivisible
    blob
    The Plan
    ui
    core
    split

    View Slide

  144. websphere
    the
    indivisible
    blob
    The Plan
    ui
    core
    tech upgrade

    View Slide

  145. websphere
    the
    indivisible
    blob
    The Plan
    ui
    core
    indexer
    websphere
    isolate

    View Slide

  146. The Reality
    ui
    core
    indexer
    websphere

    View Slide

  147. The Reality
    ui
    core
    indexer
    websphere
    data model + state

    View Slide

  148. The Reality
    ui
    core
    indexer
    websphere
    data model + state
    WEBSPHERE

    View Slide

  149. Programmer Myth #3
    We Must Do Something Now

    View Slide

  150. Rewrites

    View Slide

  151. Programmer Myth #4
    We Should Rewrite

    View Slide

  152. (not) Rewrites

    View Slide

  153. architecture is
    controlled by developers
    not architects

    View Slide

  154. #1
    version everything

    View Slide

  155. indexer search

    View Slide

  156. indexer
    v1
    search
    v1

    View Slide

  157. indexer
    v1
    search
    v1
    v1 v1
    v1

    View Slide

  158. the internet is broken
    an aside

    View Slide

  159. MIME-Version: 1.0

    View Slide

  160. what should a client
    do if it sees something
    that isn’t version 1.0?

    View Slide

  161. #2
    the wedge

    View Slide

  162. the status quo

    View Slide

  163. a wedge
    the status quo

    View Slide

  164. a wedge
    the status quo

    View Slide

  165. a wedge
    the status quo

    View Slide

  166. a wedge

    View Slide

  167. a wedge

    View Slide

  168. mega-code-search-tool

    View Slide

  169. mega-code-search-tool
    R

    View Slide

  170. mega-code-search-tool
    external indexer support
    R

    View Slide

  171. mega-code-search-tool R
    external indexer support

    View Slide

  172. mega-code-search-tool R
    external indexer support
    scala

    View Slide

  173. R
    scala
    haskell
    javascript
    search

    View Slide

  174. #3
    embrace partial moves

    View Slide

  175. mega-code-search-tool

    View Slide

  176. mega-code-search-tool
    {incomplete}

    View Slide

  177. control in-progress
    moves at a single point

    View Slide

  178. track and cap the
    number of moves
    in progress

    View Slide

  179. plan for rollback as much
    as rollforward

    View Slide
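
The three rules for partial moves (a single point of control, a cap on moves in progress, and rollback as a first-class operation) can be sketched as a small registry. This is an illustrative sketch only: `MoveRegistry`, its method names and the "old"/"new" labels are hypothetical and do not come from the talk.

```python
class MoveRegistry:
    """Single point of control for in-progress moves (hypothetical sketch)."""

    def __init__(self, max_in_progress):
        self.max_in_progress = max_in_progress
        self._routes = {}          # feature name -> "old" or "new"
        self._in_progress = set()  # features currently being moved

    def start(self, feature):
        """Begin moving a feature; refuse if the cap is already reached."""
        if feature in self._in_progress:
            return
        if len(self._in_progress) >= self.max_in_progress:
            raise RuntimeError("too many moves in progress")
        self._in_progress.add(feature)
        self._routes[feature] = "new"

    def rollback(self, feature):
        """Send the feature back to the old implementation."""
        self._in_progress.discard(feature)
        self._routes[feature] = "old"

    def complete(self, feature):
        """Mark the move as finished; the new implementation stays in charge."""
        self._in_progress.discard(feature)
        self._routes[feature] = "new"

    def route(self, feature):
        """Every caller asks here which implementation to use."""
        return self._routes.get(feature, "old")

    def in_progress(self):
        """List the moves that are still open, for tracking."""
        return sorted(self._in_progress)
```

Since every caller goes through `route`, rolling back is a single state change rather than a redeploy, and `in_progress` gives one place to see how many moves are still open.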

  180. #4
    validate as you go

    View Slide

  181. mega-code-search-tool

    View Slide

  182. mega-code-search-tool
    external indexer support
    R

    View Slide

  183. R
    make sure you can run
    this straight away
    external indexer support
    mega-code-search-tool

    View Slide

  184. R
    make sure you can run
    this straight away
    external indexer support
    mega-code-search-tool

    View Slide

  185. mega-code-search-tool R
    external indexer support
    scala

    View Slide

  186. mega-code-search-tool R
    external indexer support
    scala

    View Slide

  187. mega-code-search-tool R
    external indexer support
    scala

    View Slide

  188. R
    scala
    haskell
    javascript
    search

    View Slide

  189. Experimentation
    and Measurement

    View Slide

  190. Change Without Fear

    View Slide

  191. we need confidence that
    things don’t break when
    we ship code

    View Slide

  192. confidence stems from
    knowing code works in
    production before it affects
    a customer

    View Slide

  193. #1
    move production to
    development

    View Slide

  194. production quality data
    automation of environments
    lots of testing

    View Slide

  195. production quality data
    automation of environments
    lots of testing
    Rather Old Hat

    View Slide

  196. #2
    move development to
    production

    View Slide

  197. yes, really.
    i want to ship your worst,
    un-tried, experimental
    code to production

    View Slide

  198. Programmer Myth #5
    We Can’t Ship That

    View Slide

  199. Safety First

    View Slide

  200. @ambiata
    we deal with ingesting and processing lots of data
    100s of TB per day, per customer
    scientific experiment and measurement is key
    experiments affect users directly
    researchers / non-specialist engineers produce code

    View Slide

  201. ingest store
    the machine
    package
    publish

    View Slide

  202. ingest store
    the machine
    package
    publish

    View Slide

  203. ingest store
    package
    publish
    the machine

    View Slide

  204. #1
    split environments

    View Slide

  205. ingest store
    package
    publish
    the machine

    View Slide

  206. ingest store
    package
    publish
    the machine
    production:live

    View Slide

  207. ingest store
    package
    publish
    the machine
    production:exp

    View Slide

  208. ingest store
    package
    publish
    the machine
    production:*
    package
    publish

    View Slide

  209. implemented through machine level acls
    experiment
    live
    control

    View Slide

  210. implemented through machine level acls
    experiment
    live
    control
    write
    read

    View Slide

  211. implemented through machine level acls
    experiment
    live
    control

    View Slide

  212. implemented through machine level acls
    experiment
    live
    control
    write
    read

    View Slide

  213. implemented through machine level acls
    experiment
    live
    control
    write
    read

    View Slide

  214. #2
    checkpoints

    View Slide

  215. ingest store
    package
    publish
    the machine

    View Slide

  216. ingest store
    package
    publish
    the machine
    x x

    View Slide

  217. ingest store
    package
    publish
    the machine
    x x

    View Slide

  218. ingest store
    package
    publish
    the machine
    x x

    View Slide

  219. ingest store
    package
    publish
    the machine
    x x

    View Slide

  220. deep implementation,
    intra- and inter-process
    crosschecks

    View Slide

  221. #3
    tandem deployments

    View Slide

  222. ingest store
    package
    publish
    the machine

    View Slide

  223. ingest store
    package
    publish
    the machine

    View Slide

  224. ingest store
    package
    publish
    the machine
    x x
    x x

    View Slide

  225. ingest store
    package
    publish
    the machine
    x x
    x x

    View Slide

  226. ingest store
    package
    publish
    the machine
    x x
    x x

    View Slide

  227. ingest store
    package
    publish
    the machine
    x x
    x x

    View Slide
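
A tandem deployment can be sketched as running the candidate next to the live implementation on the same input, returning only the live result, and recording every disagreement. The function below is a minimal illustration under that assumption; `run_in_tandem` and its parameters are hypothetical names, not the implementation described in the talk.

```python
def run_in_tandem(live, candidate, record, mismatches):
    """Run both implementations; return the live result, log any difference."""
    expected = live(record)
    try:
        actual = candidate(record)
    except Exception as error:
        # The candidate must never break live traffic, so its failure is
        # recorded as a mismatch and the live result is still returned.
        mismatches.append((record, expected, repr(error)))
        return expected
    if actual != expected:
        mismatches.append((record, expected, actual))
    return expected
```

The list of mismatches is the measurement: an empty list over a long enough period is evidence that the candidate works in production before any customer depends on it.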

  228. #4
    measure everything

    View Slide

  229. every result computed
    should have traceability
    back to the code & data

    View Slide

  230. package
    publish
    the machine

    View Slide

  231. package
    publish
    the machine
    publish-ab12f2e

    View Slide

  232. package
    publish
    the machine
    publish-ab12f2e

    View Slide

  233. package
    publish
    the machine
    publish-ab12f2e

    View Slide

  234. package
    publish
    the machine
    package-ab12f2e

    View Slide

  235. package
    publish
    the machine
    score-ab12f2e

    View Slide

  236. package
    publish
    the machine

    View Slide

  237. package
    publish
    the machine
    size: 192GB
    checksum: d32fe1a
    created: 2014-03-02T10:01
    loaded: store-a122fe3

    View Slide
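
Traceability of this kind can be approximated by attaching a small provenance record to every output. The sketch below mirrors the fields on the slide (producer and commit, size, checksum, creation time, source store); the function name, the dictionary layout and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from datetime import datetime, timezone


def provenance(step, commit, payload, loaded_from, created=None):
    """Describe one output: which code produced it, from what, and what it is."""
    created = created or datetime.now(timezone.utc)
    return {
        "producer": f"{step}-{commit}",  # e.g. "publish-ab12f2e"
        "size": len(payload),            # size of the output in bytes
        "checksum": hashlib.sha256(payload).hexdigest(),
        "created": created.strftime("%Y-%m-%dT%H:%M"),
        "loaded": loaded_from,           # e.g. "store-a122fe3"
    }
```

Storing this record next to each result means a surprising number can be traced back to the exact code revision and input that produced it.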

  238. statistics work,
    measurements over time
    will find errors

    View Slide

  239. package
    publish
    the machine

    View Slide

  240. package
    publish
    the machine
    wall-time: 13411s
    cpu-time: 429130s
    records: 19 million
    histogram:
    a: 13 million
    b: 2 million
    c: 4 million

    View Slide

  241. package
    publish
    the machine
    wall-time: 13411s
    cpu-time: 429130s
    records: 19 million
    histogram:
    a: 13 million
    b: 2 million
    c: 4 million
    aggregate over time

    View Slide

  242. package
    publish
    the machine
    median:

    averages:
    cpu-time: 420030s
    quantiles:

    aggregate over time

    View Slide

  243. package
    publish
    the machine
    cross check
    everything
    wall-time: 13411s
    cpu-time: 429130s
    records: 19 million
    histogram:
    a: 13 million
    b: 2 million
    c: 4 million

    View Slide

  244. package
    publish
    the machine
    cross check
    everything
    wall-time: 13411s
    cpu-time: 429130s
    records: 19 million
    histogram:
    a: 13 million
    b: 2 million
    c: 4 million

    View Slide
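
The metrics on these slides already allow two simple cross checks: the histogram buckets (13 + 2 + 4 million) should add up to the record count (19 million), and a run's cpu-time can be compared against the aggregate of earlier runs. The sketch below assumes a dictionary of metrics and a 50% tolerance around the median; both the layout and the threshold are hypothetical.

```python
from statistics import median


def cross_check(run, history, tolerance=0.5):
    """Return a list of human-readable problems found in one run's metrics."""
    problems = []
    # Check 1: the histogram buckets must account for every record.
    total = sum(run["histogram"].values())
    if total != run["records"]:
        problems.append(f"histogram sums to {total}, expected {run['records']}")
    # Check 2: cpu-time should stay near the median of previous runs.
    if history:
        typical = median(history)
        if abs(run["cpu_time"] - typical) > tolerance * typical:
            problems.append(
                f"cpu_time {run['cpu_time']}s far from median {typical}s"
            )
    return problems
```

Neither check knows anything about what the job computes, which is the point: cheap, generic measurements collected on every run tend to surface errors that targeted tests miss.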

  245. Programmer Myth #6
    But We Can’t Do That In Our
    Situation

    View Slide

  246. these techniques adapt

    View Slide

  247. WebCloud (tm)
    live
    live

    View Slide

  248. WebCloud (tm)
    live
    proxy
    live

    View Slide

  249. WebCloud (tm)
    experiment
    live
    proxy
    experiment
    live

    View Slide

  250. WebCloud (tm)
    experiment
    live
    proxy
    experiment
    live

    View Slide

  251. Packaged Products
    live
    live
    live

    View Slide

  252. Packaged Products
    live
    measurement
    live
    live

    View Slide

  253. Packaged Products
    live
    measurement
    live
    live
    policy

    View Slide

  254. Packaged Products
    experiment
    live
    measurement
    live
    live
    policy

    View Slide

  255. change
    is the default

    View Slide

  256. architecture
    is every day

    View Slide

  257. experiment
    for reliability

    View Slide

  258. measure
    always

    View Slide

  259. end

    View Slide