Programming in the Large: Architecture and Experimentation

Mark Hibberd
December 11, 2014

Building robust, quality systems is hard. We trade off organizational issues against technical decisions; the ability to deliver quickly against our ability to change; and the ability to build systems easily against the ability to run those systems in production. However, good architectural decisions can free us to choose the right tools and techniques, allowing us to manage these challenges and concentrate on solving real problems rather than made-up ones.

In this talk, we will run through some stereotypical projects, come to terms with legacy systems, and look at the properties of robust architectures. In particular, we are interested in how architectures lend themselves to experimentation and change, in terms of both function and technology.

We will attempt to ground the discussion with examples from my past projects, looking at where things have worked well and, probably of more interest, where they really have not.

This was presented at the YOW! conference in Australia, Melbourne 04/12/2014, Brisbane 08/12/2014, Sydney 11/12/2014 (http://2014.yowconference.com.au/).

Transcript

  1. @markhibberd
    programming in the large
    Architecture and
    Experimentation

  2. “Simplicity is prerequisite for
    reliability”
    Edsger W. Dijkstra,
    How do we tell truths that might hurt? (1975)

  3. Legacy Systems
    and Organisations

  4. How Did We Get Here

  5. The Hand-me Down
    Code Last
    Touched

  6. Code Last
    Touched
    You
    Started
    The Hand-me Down

  7. Code Last
    Touched
    You
    Started
    Everyone Else
    Started
    The Hand-me Down

  8. Code Last
    Touched
    You
    Started
    Everyone Else
    Started
    You’re The
    Expert
    The Hand-me Down

  9. The Rush Job
    Start
    Work

  10. Start
    Work
    A Working
    System
    The Rush Job

  11. Start
    Work
    System
    Delivered
    A Working
    System
    The Rush Job

  12. Start
    Work
    System
    Delivered
    The Rush Job

  13. The Rewrite
    Someone
    Else’s Code

  14. Someone
    Else’s Code
    System
    Delivered
    The Rewrite

  15. Someone
    Else’s Code
    System
    Delivered
    Bob Knows
    Better
    The Rewrite

  16. Someone
    Else’s Code
    System
    Delivered
    A New
    System
    Bob Knows
    Better
    The Rewrite

  17. Someone
    Else’s Code
    System
    Delivered
    A New, Not Quite
    Working System
    Bob Knows
    Better
    The Rewrite

  18. Someone
    Else’s Code
    System
    Delivered
    An Old, Not Quite
    Working System
    Bob Knows
    Better
    The Rewrite

  19. The Greenfield
    Enthusiasm

  20. Enthusiasm System
    Delivered
    The Greenfield

  21. Enthusiasm Realisation
    and Despair
    System
    Delivered
    The Greenfield

  22. An Idea Oh, Sorry, We
    Shipped That
    30 Minutes
    Later
    The Prototype

  23. The Bandwagon

  24. How We
    Pick Our
    Technology
    The Bandwagon

  25. Perhaps we need a
    microservice to
    deploy Docker
    The Bandwagon

  26. So we can run
    a microservice
    The Bandwagon

  27. To display some
    text
    The Bandwagon

  28. legacy
    is the default

  29. The Ideal
    New Ideas

  30. The Ideal
    New Ideas
    Stable Ideas

  31. The Ideal
    New Ideas
    Stable Ideas
    We Now Know Better

  32. Taking Responsibility

  33. Too Important to Ignore,
    Too Important to Change
    an anecdote

  34. 100 million+ active users
    100 million+ transactions a day
    millions of $$$
    a couple of “simple” services

  35. server client

  36. /call
    server client
    on-demand

  37. /call
    server client
    /check
    on-demand
    periodically

  38. /call
    server client
    /check
    on-demand
    periodically

  39. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

  40. /call
    server client
    /check
    on-demand
    periodically
    /check2
    /check2z
    /v3check

  41. /call
    server
    /check
    /check2
    /check2z
    /v3check

  42. enter our protagonists…

  43. /call
    server
    /check
    we spent a lot
    of time “fire
    fighting”
    /check2
    /check2z
    /v3check

  44. /call
    server
    /check
    we spent a lot
    of time “fire
    fighting”
    /check2
    /check2z
    /v3check

  45. /call
    server
    /check
    /check2
    /check2z
    /v3check
    we spent a
    lot of time
    improving
    “quality”

  46. /call
    server
    /check
    we spent a
    lot of time
    improving
    “quality”

  47. /call
    server
    /check
    we spent a
    lot of time
    improving
    “quality”

  48. Programmer Myth #1
    It Is Someone Else’s Fault

  49. we completely failed
    to adapt the system
    for change

  50. we remained hostage
    to a fear of change

  51. Autonomous Systems
    and Rates of Change

  52. Code Search
    an example

  53. web du jour
    db
    ui

  54. web du jour
    db
    ui
    indexer
    api

  55. web du jour
    db
    ui
    indexer
    api

  56. web du jour
    db
    ui
    indexer
    api

  57. db
    ui
    indexer
    api

  58. the thing about real
    systems is their
    autonomy

  59. rules
    not boxes

  60. architecture is the
    concepts on which we
    formulate our systems

  61. architecture is the rules
    for how these systems
    interact

  62. architecture is the rules
    for how these systems are
    implemented

  63. indexer search
    independent problem domains

  64. indexer search
    code
    ctags
    ctags
    application/html
    application/search.v1+json
    well defined interfaces

  65. indexer search
    code
    ctags
    ctags
    application/html
    application/search.v1+json
    well defined interfaces
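
One way to read the "well defined interfaces" slides is that each boundary is named by a versioned media type, such as the `application/search.v1+json` shown above, so a consumer can pick a decoder by name and version. The sketch below is a minimal illustration under that assumption. The function names, the `DECODERS` table and the `{"results": [...]}` payload shape are hypothetical and do not come from the talk.

```python
import json
import re

# Hypothetical parser for vendor-style media types such as the slide's
# "application/search.v1+json". The payload shape used below is invented.
MEDIA_TYPE = re.compile(r"^application/(?P<name>[a-z]+)\.v(?P<version>\d+)\+json$")


def parse_media_type(content_type: str) -> tuple[str, int]:
    """Split a versioned media type into (name, version)."""
    match = MEDIA_TYPE.match(content_type)
    if match is None:
        raise ValueError(f"unsupported media type: {content_type}")
    return match.group("name"), int(match.group("version"))


def decode_search_v1(body: str) -> list[str]:
    """Decode the assumed v1 search payload: {"results": ["path", ...]}."""
    return list(json.loads(body)["results"])


# One decoder per (name, version). A new version adds an entry here and
# leaves the existing entries untouched.
DECODERS = {("search", 1): decode_search_v1}


def decode(content_type: str, body: str) -> list[str]:
    """Pick the decoder that matches the declared media type, or fail."""
    key = parse_media_type(content_type)
    if key not in DECODERS:
        raise ValueError(f"no decoder for {key[0]} v{key[1]}")
    return DECODERS[key](body)
```

Keying on both name and version means an unknown version fails at the boundary instead of being half-understood further in.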

  66. indexer
    independent technical decisions
    search
    shell scala

  67. indexer
    independent technical decisions
    search
    shell scala
    git hook embedded

  68. indexer
    independent technical decisions
    search
    shell scala
    git hook embedded
    os logging os logging

  69. indexer
    consistency helps avoid chaos
    search
    shell scala
    git hook embedded
    os logging os logging

  70. #1
    individually deployable

  71. indexer search

  72. indexer search
    v1
    v1

  73. indexer search
    v2 v1

  74. indexer search
    v3 v1

  75. indexer search
    v3 v2

  76. #2
    independent domain
    models

  77. indexer search

  78. different notions of “index”
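
A small sketch of what "independent domain models" with different notions of "index" could look like: the indexer keeps one record per tag it finds, the search side keeps a lookup from symbol to locations, and the only thing they share is the interchange text. Everything here is assumed for illustration. `IndexerTag`, `to_interchange` and `build_search_index` are hypothetical names, and the tab-separated rows are a simplified stand-in, not the real ctags file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexerTag:
    """The indexer's notion of an index entry: one tag found in one file."""
    symbol: str
    path: str
    line: int


def to_interchange(tags: list[IndexerTag]) -> str:
    """Serialise tags to a simplified, tab-separated stand-in for ctags output."""
    return "\n".join(f"{t.symbol}\t{t.path}\t{t.line}" for t in tags)


def build_search_index(interchange: str) -> dict[str, list[tuple[str, int]]]:
    """The search side's notion of an index: symbol -> list of locations.

    It is built only from the interchange text and never sees IndexerTag.
    """
    index: dict[str, list[tuple[str, int]]] = {}
    for row in interchange.splitlines():
        symbol, path, line = row.split("\t")
        index.setdefault(symbol, []).append((path, int(line)))
    return index
```

Because neither side imports the other's types, either model can change shape as long as the interchange rows stay the same.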

  79. really don’t do this

  80. really don’t do this

  81. #3
    standards for
    interchange formats

  82. indexer search

  83. indexer search

  84. indexer search
    standard rules for these help avoid chaos

  85. #4
    no shared state

  86. really don’t do this

  87. really don’t do this

  88. autonomy builds in
    reliability

  89. indexer search

  90. x
    search
    x
    /\/\/\/\/\

  91. x
    search
    x
    /\/\/\/\/\

  92. autonomy builds in the
    ability to change

  93. indexer search
    shell scala
    git hook embedded
    os logging os logging

  94. indexer search
    haskell scala
    git hook embedded
    os logging os logging

  95. How long does it take to
    get a 1 line change to
    production?

  96. warning signs
    an anecdote

  97. multi database - multi data center replication
    100 million+ transactions a day

  98. x x
    /\/\/\/\/\/\/\

  99. the data-model was
    entirely shared
    between replication
    and otp system

  100. it was ALL shared state

  101. it was really only
    feasible to change if
    one team was working
    on both “systems”

  102. if one system failed,
    they often both failed

  103. as we patched failure
    modes, reliability never
    improved

  104. x x
    /\/\/\/\/\/\/\

  105. x
    /\/\/\/\/\/\/\

  106. autonomy is far more
    important for
    reliability than code
    improvements

  107. Programmer Myth #2
    The Bad Code is to Blame

  108. System Evolution

  109. “... with proper design, the
    features come cheaply. This
    approach is arduous, but
    continues to succeed.”
    Dennis Ritchie

  110. thinking ahead is not
    about avoiding change

  111. indexer search
    shell scala
    git hook embedded
    os logging os logging

  112. indexer search
    haskell scala
    git hook embedded
    os logging os logging

  113. thinking ahead is about
    letting us change at
    different rates for
    different problems

  114. thinking ahead is about
    letting us make short
    term decisions that don’t
    have long term effects

  115. attempting change
    an anecdote

  116. small company
    analytics product
    very quality focused team
    inherited a small piece of code
    very bad code

  117. the
    rewrite
    heavy focus on quality

  118. the
    rewrite
    but… rebuilt same structure

  119. the
    indivisible
    blob

  120. websphere
    the
    indivisible
    blob

  121. websphere
    the
    indivisible
    blob
    The Plan
    ui
    core
    split

  122. websphere
    the
    indivisible
    blob
    The Plan
    ui
    core
    tech upgrade

  123. websphere
    the
    indivisible
    blob
    The Plan
    ui
    core
    indexer
    websphere
    isolate

  124. The Reality
    ui
    core
    indexer
    websphere

  125. The Reality
    ui
    core
    indexer
    websphere
    data model + state

  126. The Reality
    ui
    core
    indexer
    websphere
    data model + state
    WEBSPHERE

  127. Programmer Myth #3
    We Must Do Something Now

  128. Programmer Myth #4
    We Should Rewrite

  129. (not) Rewrites

  130. architecture is
    controlled by developers
    not architects

  131. #1
    version everything

  132. indexer search

  133. indexer
    v1
    search
    v1

  134. indexer
    v1
    search
    v1
    v1 v1
    v1

  135. the internet is broken
    an aside

  136. MIME-Version: 1.0

  137. what should a client
    do if it sees something
    that isn’t version 1.0?
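
The slide leaves this question open. One possible policy, sketched here as an assumption and not as the speaker's answer, is to decide up front which versions a client understands and to fail loudly on anything else. `SUPPORTED_VERSIONS`, `UnsupportedVersion` and `check_version` are hypothetical names. Only the `MIME-Version` header name and the value `1.0` come from the slides.

```python
SUPPORTED_VERSIONS = {"1.0"}  # assumption: this client only understands 1.0


class UnsupportedVersion(Exception):
    """Raised when a message declares a version this client does not know."""


def check_version(headers: dict[str, str]) -> str:
    """Return the declared version, or fail if it is missing or unknown."""
    version = headers.get("MIME-Version")
    if version is None:
        raise UnsupportedVersion("no version declared")
    version = version.strip()
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedVersion(f"cannot handle version {version}")
    return version
```

Whatever the policy is, writing it down as code means the behaviour for an unexpected version is chosen once, before that version ever appears.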

  138. the status quo

  139. a wedge
    the status quo

  140. a wedge
    the status quo

  141. a wedge
    the status quo

  142. mega-code-search-tool

  143. mega-code-search-tool
    R

  144. mega-code-search-tool
    external indexer support
    R

  145. mega-code-search-tool R
    external indexer support

  146. mega-code-search-tool R
    external indexer support
    scala

  147. R
    scala
    haskell
    javascript
    search

  148. #3
    embrace partial moves

  149. mega-code-search-tool

  150. mega-code-search-tool
    {incomplete}

  151. control in-progress
    moves at a single point

  152. track and cap the
    number of moves
    in progress

  153. plan for rollback as much
    as rollforward
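
The three rules for partial moves (control them at a single point, track and cap how many are in progress, plan for rollback as much as rollforward) can be sketched as one small registry. This is an illustration under assumed names: `MoveRegistry` and its methods are hypothetical, and a real system would persist this state somewhere shared.

```python
class MoveRegistry:
    """Single place that records which moves (migrations) are in progress."""

    def __init__(self, cap: int) -> None:
        self.cap = cap
        self.in_progress: dict[str, str] = {}  # move name -> rollback plan

    def start(self, name: str, rollback_plan: str) -> None:
        """Begin a move, but only with a rollback plan and only under the cap."""
        if not rollback_plan:
            raise ValueError("a move needs a rollback plan before it starts")
        if name in self.in_progress:
            raise ValueError(f"move {name!r} is already in progress")
        if len(self.in_progress) >= self.cap:
            raise RuntimeError(f"cap of {self.cap} moves in progress reached")
        self.in_progress[name] = rollback_plan

    def finish(self, name: str) -> None:
        """Roll forward: the move is complete and no longer counts."""
        del self.in_progress[name]

    def roll_back(self, name: str) -> str:
        """Abandon the move and return the rollback plan to follow."""
        return self.in_progress.pop(name)
```

Refusing to start a move without a rollback plan makes rollback a precondition of starting, not something worked out after a failure.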

  154. #4
    validate as you go

  155. mega-code-search-tool

  156. mega-code-search-tool
    external indexer support
    R

  157. R
    make sure you can run
    this straight away
    external indexer support
    mega-code-search-tool

  158. R
    make sure you can run
    this straight away
    external indexer support
    mega-code-search-tool

  159. mega-code-search-tool R
    external indexer support
    scala

  160. mega-code-search-tool R
    external indexer support
    scala

  161. mega-code-search-tool R
    external indexer support
    scala

  162. R
    scala
    haskell
    javascript
    search

  163. Experimentation
    and Measurement

  164. Change Without Fear

  165. we need confidence that
    things don’t break when
    we ship code

  166. confidence stems from
    knowing code works in
    production before it affects
    a customer

  167. #1
    move production to
    development

  168. production quality data
    automation of environments
    lots of testing

  169. production quality data
    automation of environments
    lots of testing
    Rather Old Hat

  170. #2
    move development to
    production

  171. yes, really.
    i want to ship your worst,
    un-tried, experimental
    code to production

  172. Programmer Myth #5
    We Can’t Ship That

  173. Safety First

  174. @ambiata
    we deal with ingesting and processing lots of data
    100s TB / per day / per customer
    scientific experiment and measurement is key
    experiments affect users directly
    researchers / non-specialist engineers produce code

  175. ingest store
    the machine
    package
    publish

  176. ingest store
    the machine
    package
    publish

  177. ingest store
    package
    publish
    the machine

  178. #1
    split environments

  179. ingest store
    package
    publish
    the machine

  180. ingest store
    package
    publish
    the machine
    production:live

  181. ingest store
    package
    publish
    the machine
    production:exp

  182. ingest store
    package
    publish
    the machine
    production:*
    package
    publish

  183. implemented through machine level acls
    experiment
    live
    control

  184. implemented through machine level acls
    experiment
    live
    control
    write
    read

  185. implemented through machine level acls
    experiment
    live
    control

  186. implemented through machine level acls
    experiment
    live
    control
    write
    read

  187. implemented through machine level acls
    experiment
    live
    control
    write
    read
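
The transcript does not preserve the arrows on these diagrams, so the permission matrix below is an assumption: experimental machines may read live data but write only to their own environment, and live machines never touch experimental data. The "control" box is left out because the slides do not say what it is allowed to do. `ACL` and `allowed` are hypothetical names for a minimal sketch of the idea.

```python
# Assumed permission matrix:
# (machine environment, data environment) -> actions that are allowed.
ACL = {
    ("live", "live"): {"read", "write"},
    ("experiment", "live"): {"read"},
    ("experiment", "experiment"): {"read", "write"},
}


def allowed(machine_env: str, data_env: str, action: str) -> bool:
    """Deny by default: anything not listed in the matrix is refused."""
    return action in ACL.get((machine_env, data_env), set())
```

With a deny-by-default table like this, untested code running in the experiment environment has no way to write over live data.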

  188. #2
    checkpoints

  189. ingest store
    package
    publish
    the machine

  190. ingest store
    package
    publish
    the machine
    x x

  191. ingest store
    package
    publish
    the machine
    x x

  192. ingest store
    package
    publish
    the machine
    x x

  193. ingest store
    package
    publish
    the machine
    x x

  194. deep implementation,
    intra- and inter- process
    crosschecks
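
As a rough illustration of a checkpoint between two pipeline stages, the sketch below assumes the upstream stage declares how many records it produced and a checksum over them, and the downstream stage refuses to continue if what it received does not match. The names `checksum`, `checkpoint` and `CheckpointFailed` are hypothetical, and the choice of SHA-256 over newline-terminated records is an assumption.

```python
import hashlib


class CheckpointFailed(Exception):
    """Raised when data arriving at a stage does not match what was declared."""


def checksum(records: list[str]) -> str:
    """Order-sensitive checksum over newline-terminated records."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def checkpoint(stage: str, records: list[str],
               expected_count: int, expected_checksum: str) -> list[str]:
    """Pass records through unchanged, or stop the pipeline at this stage."""
    if len(records) != expected_count:
        raise CheckpointFailed(
            f"{stage}: expected {expected_count} records, got {len(records)}")
    if checksum(records) != expected_checksum:
        raise CheckpointFailed(f"{stage}: checksum mismatch")
    return records
```

A failed checkpoint stops bad data where it is first seen, before it can reach anything that is published.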

  195. #3
    tandem deployments

  196. ingest store
    package
    publish
    the machine

  197. ingest store
    package
    publish
    the machine

  198. ingest store
    package
    publish
    the machine
    x x
    x x

  199. ingest store
    package
    publish
    the machine
    x x
    x x

  200. ingest store
    package
    publish
    the machine
    x x
    x x

  201. ingest store
    package
    publish
    the machine
    x x
    x x
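
The tandem deployment diagrams survive here only as placeholder marks, so the following is one reading of the idea and not a reconstruction of the speaker's system: the live and the candidate implementation both run on the same input, only the live result is published, and any difference or candidate failure is recorded for later inspection. `run_in_tandem` and its parameters are hypothetical.

```python
from typing import Callable


def run_in_tandem(live: Callable[[int], int],
                  candidate: Callable[[int], int],
                  inputs: list[int]) -> tuple[list[int], list[str]]:
    """Run both implementations; publish live results, log discrepancies."""
    published: list[int] = []
    discrepancies: list[str] = []
    for item in inputs:
        result = live(item)
        published.append(result)  # only the live result is ever published
        try:
            shadow = candidate(item)
        except Exception as error:
            # A failing candidate must never affect what is published.
            discrepancies.append(f"{item}: candidate raised {error!r}")
            continue
        if shadow != result:
            discrepancies.append(f"{item}: live={result} candidate={shadow}")
    return published, discrepancies
```

The discrepancy log shows how the new code behaves on production input while the live path stays unaffected.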

  202. #4
    measure everything

  203. every result computed
    should have traceability
    back to the code & data

  204. package
    publish
    the machine

  205. package
    publish
    the machine
    publish-ab12f2e

  206. package
    publish
    the machine
    publish-ab12f2e

  207. package
    publish
    the machine
    publish-ab12f2e

  208. package
    publish
    the machine
    package-ab12f2e

  209. package
    publish
    the machine
    score-ab12f2e

  210. package
    publish
    the machine

  211. package
    publish
    the machine
    size: 192GB
    checksum: d32fe1a
    created: 2014-03-02T10:01
    loaded: store-a122fe3
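
The fields on this slide suggest a small provenance record attached to every output. The sketch below assumes that a label such as `publish-ab12f2e` combines a pipeline step with the revision of the code that ran, and that `loaded` names the upstream output that was read. Both readings are interpretations of the slide, and `Provenance` and `lineage` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    step: str      # pipeline step, e.g. "publish"
    code_rev: str  # assumed: revision of the code that ran, e.g. "ab12f2e"
    size: str
    checksum: str
    created: str
    loaded: str    # assumed: label of the upstream output that was read

    @property
    def label(self) -> str:
        return f"{self.step}-{self.code_rev}"


def lineage(record: Provenance, by_label: dict[str, Provenance]) -> list[str]:
    """Follow 'loaded' links back as far as the known records go."""
    chain = [record.label]
    while record.loaded in by_label:
        record = by_label[record.loaded]
        chain.append(record.label)
    return chain
```

Given a published result, `lineage` answers which code and which inputs produced it, which is the traceability asked for a few slides earlier.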

  212. statistics work,
    measurements over time
    will find errors

  213. package
    publish
    the machine

  214. package
    publish
    the machine
    wall-time: 13411s
    cpu-time: 429130s
    records: 19 million
    histogram:
    a: 13million
    b: 2million
    c: 4million

  215. package
    publish
    the machine
    wall-time: 13411s
    cpu-time: 429130s
    records: 19 million
    histogram:
    a: 13million
    b: 2million
    c: 4million
    aggregate over time

  216. package
    publish
    the machine
    median:

    averages:
    cpu-time: 420030s
    quantiles:

    aggregate over time

  217. package
    publish
    the machine
    cross check
    everything
    wall-time: 13411s
    cpu-time: 429130s
    records: 19 million
    histogram:
    a: 13million
    b: 2million
    c: 4million

  218. package
    publish
    the machine
    cross check
    everything
    wall-time: 13411s
    cpu-time: 429130s
    records: 19 million
    histogram:
    a: 13million
    b: 2million
    c: 4million
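
Two simple checks fall out of the numbers on these slides. The histogram buckets should add up to the record count (13 + 2 + 4 = 19 million), and a run's metric can be compared with the aggregate of earlier runs. The sketch below is a minimal version of both. The function names and the 25% tolerance are assumptions, and the median is used as the baseline only for illustration.

```python
from statistics import median


def cross_check(records: int, histogram: dict[str, int]) -> bool:
    """The histogram buckets must account for every record."""
    return sum(histogram.values()) == records


def out_of_line(history: list[float], current: float,
                tolerance: float = 0.25) -> bool:
    """Flag a measurement that strays too far from the historical median."""
    baseline = median(history)
    return abs(current - baseline) > tolerance * baseline
```

For example, a cpu-time of 429130s against a history centred on 420030s stays within the assumed tolerance, while a run that took twice as long would be flagged.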

  219. Programmer Myth #6
    But We Can’t Do That In Our
    Situation

  220. these techniques adapt

  221. WebCloud (tm)
    live
    live

  222. WebCloud (tm)
    live
    proxy
    live

  223. WebCloud (tm)
    experiment
    live
    proxy
    experiment
    live

  224. WebCloud (tm)
    experiment
    live
    proxy
    experiment
    live
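
For the web case, the proxy in these diagrams has to decide which requests go to the experiment and which stay on live. A minimal sketch of one way to do that, assuming requests carry some stable identifier: hash the identifier into one of 100 buckets and send a configurable percentage to the experiment. `route`, `request_id` and `experiment_percent` are hypothetical names.

```python
import hashlib


def route(request_id: str, experiment_percent: int) -> str:
    """Deterministically send a share of traffic to the experiment."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # bucket in 0..99
    return "experiment" if bucket < experiment_percent else "live"
```

Hashing instead of picking at random keeps a given identifier on the same side across requests, and setting the percentage to 0 switches the experiment off without a deploy.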

  225. Packaged Products
    live
    live
    live

  226. Packaged Products
    live
    measurement
    live
    live

  227. Packaged Products
    live
    measurement
    live
    live
    policy

  228. Packaged Products
    experiment
    live
    measurement
    live
    live
    policy

  229. change
    is the default

  230. architecture
    is every day

  231. experiment
    for reliability

  232. measure
    always
