Programming in the Large: Architecture and Experimentation

@markhibberd programming in the large Architecture and Experimentation

“Simplicity is prerequisite for reliability” Edsger W. Dijkstra,
How do we tell truths that might hurt? (1975)

Legacy Systems and Organisations

How Did We Get Here

The Hand-me Down Code Last Touched

Code Last Touched You Started The Hand-me Down

Code Last Touched You Started Everyone Else Started The Hand-me
Down

Code Last Touched You Started Everyone Else Started You’re The
Expert The Hand-me Down

The Rush Job Start Work

Start Work A Working System The Rush Job

Start Work System Delivered A Working System The Rush Job

Start Work System Delivered The Rush Job

The Rewrite Someone Else’s Code

Someone Else’s Code System Delivered The Rewrite

Someone Else’s Code System Delivered Bob Knows Better The Rewrite

Someone Else’s Code System Delivered A New System Bob Knows
Better The Rewrite

Someone Else’s Code System Delivered A New, Not Quite Working
System Bob Knows Better The Rewrite

Someone Else’s Code System Delivered An Old, Not Quite Working
System Bob Knows Better The Rewrite

The Greenfield Enthusiasm

Enthusiasm System Delivered The Greenfield

Enthusiasm Realisation and Despair System Delivered The Greenfield

An Idea Oh, Sorry, We Shipped That 30 Minutes Later
The Prototype

The Bandwagon

How We Pick Our Technology The Bandwagon

Perhaps we need a microservice to deploy Docker The Bandwagon

So we can run a microservice The Bandwagon

To display some text The Bandwagon

legacy is the default

The Ideal New Ideas

The Ideal New Ideas Stable Ideas

The Ideal New Ideas Stable Ideas We Now Know Better

Taking Responsibility

Too Important to Ignore, Too Important to Change an anecdote

100 million+ active users, 100 million+ transactions a day, millions
of $$$, a couple of “simple” services

server client

/call server client on-demand

/call server client /check on-demand periodically

/call server client /check on-demand periodically /check2 /check2z /v3check

/call server /check /check2 /check2z /v3check

enter our protagonists…

/call server /check /check2 /check2z /v3check we spent a lot
of time “fire fighting”

/call server /check /check2 /check2z /v3check we spent a lot
of time improving “quality”

/call server /check we spent a lot of time improving
“quality”

Programmer Myth #1 It Is Someone Else’s Fault

we completely failed to adapt the system for change

we remained hostage to a fear of change

Autonomous Systems and Rates of Change

Systems

Code Search an example

code search

web du jour db ui

web du jour db ui indexer api

db ui indexer api

the thing about real systems is their autonomy

rules not boxes

architecture is the concepts on which we formulate our systems

architecture is the rules for how these systems interact

architecture is the rules for how these systems are implemented

indexer search independent problem domains

indexer search code ctags ctags application/html application/search.v1+json well defined interfaces

indexer independent technical decisions search shell scala

indexer independent technical decisions search shell scala git hook embedded

indexer independent technical decisions search shell scala git hook embedded
os logging os logging

indexer consistency helps avoid chaos search shell scala git hook
embedded os logging os logging

Autonomy

#1 individually deployable

indexer search

indexer search v1 v1

#2 independent domain models

indexer search

different notions of “index”

really don’t do this

#3 standards for interchange formats

indexer search

indexer search standard rules for these help avoid chaos
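
A minimal sketch of two independent domain models meeting only at an interchange format. The slides name ctags as the format between indexer and search; everything else here (IndexerSymbol, SearchHit, the function names, lower-casing the term) is a hypothetical illustration, and only the basic three-field ctags layout is handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexerSymbol:
    """The indexer's own notion of an index entry (hypothetical)."""
    name: str
    path: str
    line: int


@dataclass(frozen=True)
class SearchHit:
    """The search side's own notion: only what it needs to match and link (hypothetical)."""
    term: str
    location: str


def to_ctags_line(symbol: IndexerSymbol) -> str:
    # Basic ctags layout: tag name <TAB> file <TAB> address (here a line number).
    return f"{symbol.name}\t{symbol.path}\t{symbol.line}"


def from_ctags_line(line: str) -> SearchHit:
    # Search never imports IndexerSymbol; it only understands the interchange format.
    name, path, address = line.rstrip("\n").split("\t")[:3]
    return SearchHit(term=name.lower(), location=f"{path}:{address}")
```

Neither side imports the other's types, so either model can change without a coordinated release as long as the ctags line stays stable.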

#4 no shared state

really don’t do this

autonomy builds in reliability

indexer search

x search x /\/\/\/\/\

autonomy builds in the ability to change

indexer search shell scala git hook embedded os logging os
logging

indexer search haskell scala git hook embedded os logging os
logging

How long does it take to get a 1 line
change to production?

warning signs an anecdote

multi database - multi data center replication 100 million+ transactions
a day

x x /\/\/\/\/\/\/\

the data-model was entirely shared between replication and otp system

it was ALL shared state

it was really only feasible to change if one team
was working on both “systems”

if one system failed, they often both failed

as we patched failure modes, reliability never improved

x x /\/\/\/\/\/\/\

x /\/\/\/\/\/\/\

autonomy is far more important for reliability than code improvements

Programmer Myth #2 The Bad Code is to Blame

System Evolution

“... with proper design, the features come cheaply. This approach
is arduous, but continues to succeed.” Dennis Ritchie

thinking ahead is not about avoiding change

indexer search shell scala git hook embedded os logging os
logging

indexer search haskell scala git hook embedded os logging os
logging

thinking ahead is about letting us change at different rates
for different problems

thinking ahead is about letting us make short term decisions
that don’t have long term effects

attempting change an anecdote

small company, analytics product, very quality focused team, inherited a
small piece of code, very bad code

the product

the jsp

the rewrite heavy focus on quality

the rewrite but… rebuilt same structure

the indivisible blob

websphere the indivisible blob

websphere the indivisible blob The Plan ui core split

websphere the indivisible blob The Plan ui core tech upgrade

websphere the indivisible blob The Plan ui core indexer websphere
isolate

The Reality ui core indexer websphere

The Reality ui core indexer websphere data model + state

The Reality ui core indexer websphere data model + state
WEBSPHERE

Programmer Myth #3 We Must Do Something Now

Rewrites

Programmer Myth #4 We Should Rewrite

(not) Rewrites

architecture is controlled by developers not architects

#1 version everything

indexer search

indexer v1 search v1

indexer v1 search v1 v1 v1 v1

the internet is broken an aside

MIME-Version: 1.0

what should a client do if it sees something that
isn’t version 1.0?
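
The slide leaves the question open. One possible answer, sketched here as an assumption rather than the speaker's recommendation, is for a client to refuse anything it does not explicitly understand. The media type application/search.v1+json comes from the earlier code search slides; SUPPORTED_VERSIONS, UnsupportedVersion and parse_version are hypothetical names.

```python
import re

# Versions this client knows how to read (assumption: only v1 exists so far).
SUPPORTED_VERSIONS = {1}

_MEDIA_TYPE = re.compile(r"^application/search\.v(\d+)\+json$")


class UnsupportedVersion(Exception):
    """Raised when the payload is a search document in a version we cannot read."""


def parse_version(content_type: str) -> int:
    match = _MEDIA_TYPE.match(content_type.strip())
    if match is None:
        raise ValueError(f"not a search media type: {content_type!r}")
    version = int(match.group(1))
    if version not in SUPPORTED_VERSIONS:
        # Fail loudly instead of guessing at an unknown format.
        raise UnsupportedVersion(f"search format v{version} is not supported")
    return version
```

Putting the version in the media type means an old client can tell "new format" apart from "garbage" and react to each differently.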

#2 the wedge

the status quo

a wedge the status quo

a wedge

mega-code-search-tool

mega-code-search-tool R

mega-code-search-tool external indexer support R

mega-code-search-tool R external indexer support

mega-code-search-tool R external indexer support scala

R scala haskell javascript search
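
One way to read the wedge, offered as an interpretation of the diagrams rather than the speaker's design: a single small indirection point is added to the existing tool, and individual languages are then moved behind it one at a time. Wedge, legacy_index and the routing-by-file-extension rule are all hypothetical.

```python
from typing import Callable, Dict

Indexer = Callable[[str], str]


def legacy_index(path: str) -> str:
    # Stand-in for the existing mega-code-search-tool indexing path.
    return f"legacy:{path}"


class Wedge:
    """Everything goes to the status quo unless a route says otherwise."""

    def __init__(self, default: Indexer) -> None:
        self._default = default
        self._routes: Dict[str, Indexer] = {}

    def route(self, extension: str, indexer: Indexer) -> None:
        # Hand one file type over to an external indexer.
        self._routes[extension] = indexer

    def index(self, path: str) -> str:
        extension = path.rsplit(".", 1)[-1] if "." in path else ""
        return self._routes.get(extension, self._default)(path)
```

With no routes registered the wedge behaves exactly like the old tool, which is what makes it safe to ship before any external indexer exists.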

#3 embrace partial moves

mega-code-search-tool {incomplete}

control in progress moves at a single point

track and cap the number of moves in progress

plan for rollback as much as rollforward
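
The three rules above can be sketched as a small registry. This is an illustration under assumptions: MoveRegistry and its method names are hypothetical, and a real system would persist this state rather than hold it in memory.

```python
from typing import Dict, Set


class MoveRegistry:
    """Single point of control for in-progress moves."""

    def __init__(self, cap: int) -> None:
        self._cap = cap
        self._in_progress: Dict[str, str] = {}  # component -> "old" or "new"
        self._finished: Set[str] = set()

    def start(self, component: str) -> None:
        if component in self._in_progress or component in self._finished:
            raise ValueError(f"{component} has already been started")
        if len(self._in_progress) >= self._cap:
            raise RuntimeError("too many moves in progress; finish one first")
        self._in_progress[component] = "old"  # starting a move changes nothing yet

    def roll_forward(self, component: str) -> None:
        self._require_in_progress(component)
        self._in_progress[component] = "new"

    def roll_back(self, component: str) -> None:
        # Rollback is as cheap as rollforward: flip one value back.
        self._require_in_progress(component)
        self._in_progress[component] = "old"

    def finish(self, component: str) -> None:
        self._require_in_progress(component)
        if self._in_progress[component] != "new":
            raise RuntimeError(f"{component} is still served by the old system")
        del self._in_progress[component]
        self._finished.add(component)

    def serving(self, component: str) -> str:
        if component in self._finished:
            return "new"
        return self._in_progress.get(component, "old")

    def _require_in_progress(self, component: str) -> None:
        if component not in self._in_progress:
            raise KeyError(component)
```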

#4 validate as you go

mega-code-search-tool external indexer support R

R make sure you can run this straight away external
indexer support mega-code-search-tool

mega-code-search-tool R external indexer support scala

R scala haskell javascript search

Experimentation and Measurement

Change Without Fear

we need confidence that things don’t break when we ship
code

confidence stems from knowing code works in production before it
affects a customer

#1 move production to development

production quality data automation of environments lots of testing

production quality data automation of environments lots of testing Rather
Old Hat

#2 move development to production

yes, really. i want to ship your worst, un-tried, experimental
code to production

Programmer Myth #5 We Can’t Ship That

Safety First

@ambiata we deal with ingesting and processing lots of data
100s TB / per day / per customer scientific experiment and measurement is key experiments affect users directly researchers / non-specialist engineers produce code

ingest store the machine package publish

ingest store package publish the machine

#1 split environments

ingest store package publish the machine production:live

ingest store package publish the machine production:exp

ingest store package publish the machine production:* package publish

implemented through machine level acls experiment live control

implemented through machine level acls experiment live control write read

implemented through machine level acls experiment live control

implemented through machine level acls experiment live control write read
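
The slides say the split is enforced with machine level ACLs, not application code, so the following only illustrates the policy. It assumes one reading of the diagram: experiment code may read live data but may write only to its own environment. The role of the "control" box is not recoverable from the transcript and is left out. ACL and allowed are hypothetical names.

```python
from typing import Dict, FrozenSet, Tuple

# (actor environment, target environment) -> permitted actions.
ACL: Dict[Tuple[str, str], FrozenSet[str]] = {
    ("live", "live"): frozenset({"read", "write"}),
    ("experiment", "live"): frozenset({"read"}),
    ("experiment", "experiment"): frozenset({"read", "write"}),
}


def allowed(actor: str, target: str, action: str) -> bool:
    # Anything not listed is denied, so a new environment starts with no access.
    return action in ACL.get((actor, target), frozenset())
```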

#2 checkpoints

ingest store package publish the machine x x

deep implementation, intra- and inter- process crosschecks
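
A minimal sketch of one kind of crosscheck between stages, assuming the simplest possible invariant: every record that enters a stage either leaves it or is explicitly rejected. CheckpointFailure and checkpoint are hypothetical names; the slides do not say which invariants were actually checked.

```python
class CheckpointFailure(Exception):
    """Raised when a stage's inputs and outputs do not reconcile."""


def checkpoint(stage: str, records_in: int, records_out: int, rejected: int = 0) -> None:
    # Stop the pipeline at the boundary instead of letting a silent loss
    # flow downstream into published results.
    if records_in != records_out + rejected:
        raise CheckpointFailure(
            f"{stage}: {records_in} in, {records_out} out, {rejected} rejected"
        )
```

Called between ingest and store, or store and package, a check like this turns a quiet data loss into a loud failure at the stage that caused it.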

#3 tandem deployments

ingest store package publish the machine x x x x
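
A tandem deployment can be sketched as running the live and experimental implementations on the same input, serving only the live result, and recording where they disagree. run_in_tandem and the shape of the mismatch log are hypothetical, and a real deployment would isolate the experiment in its own process or machine.

```python
from typing import Any, Callable, List, Tuple


def run_in_tandem(
    live: Callable[[Any], Any],
    experiment: Callable[[Any], Any],
    payload: Any,
    mismatches: List[Tuple[Any, Any, Any]],
) -> Any:
    live_result = live(payload)
    try:
        experiment_result = experiment(payload)
    except Exception as error:
        # A crashing experiment is a finding, never a customer-facing failure.
        mismatches.append((payload, live_result, repr(error)))
        return live_result
    if experiment_result != live_result:
        mismatches.append((payload, live_result, experiment_result))
    return live_result
```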

#4 measure everything

every result computed should have traceability back to the code
& data

package publish the machine

package publish the machine publish-ab12f2e

package publish the machine package-ab12f2e

package publish the machine score-ab12f2e

package publish the machine size: 192GB checksum: d32fe1a created: 2014-03-02T10:01
loaded: store-a122fe3
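
A minimal sketch of attaching that kind of record to every output. It assumes that labels such as publish-ab12f2e combine a stage name with a code revision, which is a reading of the slide and not something it states. Manifest and describe are hypothetical names, and SHA-256 is an arbitrary choice of checksum.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Manifest:
    produced_by: str  # stage name plus code revision, e.g. "publish-ab12f2e"
    loaded_from: str  # identifier of the input this was computed from
    size_bytes: int
    checksum: str
    created: str


def describe(
    payload: bytes,
    stage: str,
    revision: str,
    loaded_from: str,
    now: Optional[datetime] = None,
) -> Manifest:
    now = now or datetime.now(timezone.utc)
    return Manifest(
        produced_by=f"{stage}-{revision}",
        loaded_from=loaded_from,
        size_bytes=len(payload),
        checksum=hashlib.sha256(payload).hexdigest(),
        created=now.strftime("%Y-%m-%dT%H:%M"),
    )
```

Because each manifest names its input, following loaded_from from stage to stage leads from a published result back to the code and data that produced it.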

statistics work, measurements over time will find errors

package publish the machine wall-time: 13411s cpu-time: 429130s records: 19
million histogram: a: 13million b: 2million c: 4million

package publish the machine wall-time: 13411s cpu-time: 429130s records: 19
million histogram: a: 13million b: 2million c: 4million aggregate over time

package publish the machine median: … averages: cpu-time: 420030s quantiles:
… aggregate over time

package publish the machine cross check everything wall-time: 13411s cpu-time:
429130s records: 19 million histogram: a: 13million b: 2million c: 4million
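
A minimal sketch of both ideas on these slides: comparing a new measurement against the aggregate of earlier runs, and cross-checking measurements against each other. The median plus median-absolute-deviation rule and the tolerance of 5 are assumptions chosen for illustration, not the method used in the talk; is_anomalous and histogram_consistent are hypothetical names.

```python
from statistics import median
from typing import Dict, Sequence


def is_anomalous(history: Sequence[float], value: float, tolerance: float = 5.0) -> bool:
    """Flag a measurement that sits far from the median of earlier runs."""
    centre = median(history)
    spread = median([abs(x - centre) for x in history])  # median absolute deviation
    if spread == 0:
        return value != centre
    return abs(value - centre) > tolerance * spread


def histogram_consistent(records: int, histogram: Dict[str, int]) -> bool:
    """Cross-check: the histogram buckets should account for every record."""
    return sum(histogram.values()) == records
```

With the figures from the slide, 13 million + 2 million + 4 million accounts for all 19 million records, so histogram_consistent returns True for that run.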

Programmer Myth #6 But We Can’t Do That In Our
Situation

these techniques adapt

WebCloud (tm) live live

WebCloud (tm) live proxy live

WebCloud (tm) experiment live proxy experiment live

Packaged Products live live live

Packaged Products live measurement live live

Packaged Products live measurement live live policy

Packaged Products experiment live measurement live live policy

change is the default

architecture is every day

experiment for reliability

measure always

end
