Distributed Systems Are a UX Problem

@tyler_treat Distributed Systems Are a UX Problem / Tyler Treat /
O’Reilly Software Architecture Conference / October 30, 2018

@tyler_treat Tyler Treat

@tyler_treat I like distributed systems.

@tyler_treat Disclaimer: I know approximately nothing about UX…

@tyler_treat …other than when I’m the user, I know when
my experience is good and when it’s bad.

@tyler_treat UX, Systems, Business (this talk)

@tyler_treat The Yin and Yang of UX and Architecture

@tyler_treat Monolith

@tyler_treat (diagram: many services)

@tyler_treat Implications

@tyler_treat Good old days (diagram: “book trip” → Trip Service → one transaction → Trip Database)

@tyler_treat Microservices (diagram: “book trip” → Trip Service → separate transactions against the Airline Service, Hotel Service, and Car Service, each one ACID on its own)

@tyler_treat UX Implications of Microservices
• Data consistency
• Race conditions
• Performance
• Partial failure

@tyler_treat So are microservices bad?

@tyler_treat Microservices are about people scale.

@tyler_treat Transparency

@tyler_treat “Any change in a computing system, such as a new feature or new component, is transparent if the system after change adheres to previous external interface as much as possible while changing its internal behavior.”
(Das, 2012, A Study of Transparency and Adaptability of Heterogeneous Computer Networks with TCP/IP and IPv6 Protocols)

@tyler_treat System

@tyler_treat Transparency spectrum: NFS (high transparency) to FTP (low transparency)

@tyler_treat Types of Transparencies
• Access transparency
• Location transparency
• Migration transparency
• Relocation transparency
• Replication transparency
• Concurrent transparency
• Failure transparency
• Persistence transparency
• Security transparency

@tyler_treat Transparency is about usability.

@tyler_treat Usability vs. Control

@tyler_treat Simplicity vs. Flexibility, Performance, Correctness: RPC (high transparency) to Erlang message passing (low transparency)

@tyler_treat Translating UX for developers: APIs

@tyler_treat Transparencies simplify the API of a system.

@tyler_treat UX is about deciding what knobs to expose.
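
As a rough illustration of "deciding what knobs to expose", the sketch below contrasts a high-transparency call with one that hands the trade-off to the caller. Everything here is an assumption made for illustration: `KeyValueClient`, `Consistency`, and the leader/replica split are hypothetical names, not an API from the talk.

```python
from dataclasses import dataclass, field
from enum import Enum


class Consistency(Enum):
    STRONG = "strong"      # read from the leader
    EVENTUAL = "eventual"  # read from a replica, possibly stale


@dataclass
class KeyValueClient:
    """Hypothetical client over one leader and one lagging replica."""
    leader: dict = field(default_factory=dict)
    replica: dict = field(default_factory=dict)

    def put(self, key, value):
        # Writes go to the leader; replication is a separate, later step.
        self.leader[key] = value

    def replicate(self):
        # Stand-in for asynchronous replication catching up.
        self.replica.update(self.leader)

    def get(self, key):
        # High transparency: no knobs, the client picks the safe default.
        return self.read(key, Consistency.STRONG)

    def read(self, key, consistency):
        # Lower transparency: the caller chooses the trade-off explicitly.
        source = self.leader if consistency is Consistency.STRONG else self.replica
        return source.get(key)
```

`get` is simpler to use, while `read` lets a caller accept staleness when that is good enough for the experience being built.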

@tyler_treat The Truth is Prohibitively Expensive: Balancing Consistency and UX

@tyler_treat (diagrams repeated with a “Transparency” label: the “good old days” single transaction, then the microservices transactions, each ACID on its own)

@tyler_treat Spreadsheet service, Document service, Presentation service, IAM service: consistent

@tyler_treat Consistency is about ordering of events in a distributed
system.

@tyler_treat Why is this hard?

@tyler_treat So what can we do?

@tyler_treat Coordinate

@tyler_treat Two-Phase Commit

@tyler_treat 2PC Prepare (diagram: “book trip” → Trip Service sends “propose” to the Airline Service, Hotel Service, and Car Service; each replies with a vote)

@tyler_treat 2PC Commit (diagram: Trip Service sends “commit/abort” to the Airline Service, Hotel Service, and Car Service; each replies “done”)
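
The two phases can be sketched in a few lines. This is a minimal in-process sketch, not a production protocol: `Participant` and `two_phase_commit` are hypothetical names, and there is no network, timeout, or coordinator-failure handling.

```python
class Participant:
    """Hypothetical participant, e.g. an airline, hotel, or car service."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote on the proposal and hold resources if voting yes.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator logic: commit only if every participant votes yes."""
    # Phase 1 (prepare): propose to everyone and collect all votes.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (commit): send the single decision to everyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

Even in this toy form, every participant waits on the coordinator between the two phases, which is where the blocking behavior comes from.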

@tyler_treat Problems with 2PC
• Chatty protocol: beholden to network latency
• Limited throughput
• Transaction coordinator: single point of failure
• Blocking protocol: susceptible to deadlock

@tyler_treat Add more phases!

@tyler_treat Three-Phase Commit

@tyler_treat Atomic clocks, NTP, GPS, TrueTime

@tyler_treat Good news: we solved physics.

@tyler_treat Bad news: it costs all the money.

@tyler_treat Not exactly…

@tyler_treat Spanner: Google’s Globally-Distributed Database (Corbett et al.)

@tyler_treat TrueTime forces clock uncertainty to the surface, and Spanner
provides a transparency over it.

@tyler_treat Spanner doesn’t avoid trade-offs; it just minimizes their probability.

@tyler_treat Spanner is expensive and proprietary.

@tyler_treat But it’s not the end of the story…

@tyler_treat Unless every service is backed by the same database,
you probably still have to deal with consistency problems.

@tyler_treat “The biggest barrier to providing stronger consistency guarantees…is that the consistency mechanism must integrate consistency across many stateful services.”
(Ajoux et al., 2015, Challenges to Adopting Stronger Consistency at Scale)

@tyler_treat Coordination is expensive because processes can’t make progress independently.

@tyler_treat Peter Bailis, 2015 https://speakerdeck.com/pbailis/silence-is-golden-coordination-avoiding-systems-design

@tyler_treat And what about partial failure?

@tyler_treat Memories, Guesses, and Apologies: Dealing with Partial Knowledge

@tyler_treat Knowing the “truth” can be prohibitively expensive.

@tyler_treat And partial failure means the “truth” is also fragile.

@tyler_treat Where does this leave us?

@tyler_treat We could go back to the monolith.

@tyler_treat We could build expensive data centers with fancy hardware…

@tyler_treat …or we could rethink our transparencies.

@tyler_treat Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

@tyler_treat Exception Handling in Asynchronous Systems
• Write-off
• Retry
• Compensating action
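
The three strategies can be put side by side in one small function. This is a sketch under assumptions: `handle_failure` and its parameters are hypothetical, and treating any exception as a failure is a simplification.

```python
def handle_failure(operation, compensate=None, retries=0):
    """Run `operation`; on failure apply retry, compensation, or write-off.

    Returns a string naming the outcome.
    """
    for _ in range(retries + 1):
        try:
            operation()
            return "succeeded"
        except Exception:
            continue  # Retry: try the same operation again.
    if compensate is not None:
        compensate()  # Compensating action: undo work already done.
        return "compensated"
    return "written off"  # Write-off: accept the loss and move on.
```

Which branch applies is a business choice: a retry only makes sense for operations that are safe to repeat, and a write-off only when the loss is cheaper than the coordination needed to prevent it.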

@tyler_treat Revisiting Two-Phase Commit

@tyler_treat Sagas

@tyler_treat Sagas
“A long-lived transaction is a saga if it can be written as a sequence of transactions that can be interleaved with other transactions…Either all the transactions in a saga are successfully completed or compensating transactions are run to amend a partial execution.”
(Garcia-Molina & Salem, 1987)

@tyler_treat Sagas split long-lived transactions into individual, interleaved sub-transactions:
$T = T_1, T_2, \ldots, T_n$

@tyler_treat And each sub-transaction has a compensating transaction:
$C_1, C_2, \ldots, C_n$

@tyler_treat Sagas guarantee one of two execution sequences:
$T_1, T_2, \ldots, T_n$
$T_1, T_2, \ldots, T_j, C_j, \ldots, C_2, C_1$

@tyler_treat (diagram: “book trip” → Trip Service → one transaction each against the Airline Service, Hotel Service, and Car Service)

@tyler_treat $T = T_1, T_2, \ldots, T_n$
• Book flight
• Book hotel
• Book car
• Charge money

@tyler_treat $C_1, C_2, \ldots, C_n$
• Cancel flight
• Cancel hotel
• Cancel car
• Refund money

@tyler_treat Compensating transactions must be idempotent.

@tyler_treat Sagas trade off isolation for availability.
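
A saga runner can be sketched as follows. It is a minimal illustration under assumptions: `run_saga` and `Bookings` are hypothetical, everything runs in one process, and a real implementation would also persist its progress so it can resume after a crash.

```python
class Bookings:
    """Hypothetical booking store with an idempotent cancel."""

    def __init__(self):
        self.active = set()

    def book(self, item):
        self.active.add(item)

    def cancel(self, item):
        # Idempotent: cancelling twice leaves the same state as cancelling once.
        self.active.discard(item)


def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If a step fails, run the compensations of the completed steps in
    reverse order. Returns (succeeded, log of what ran).
    """
    completed = []
    log = []
    for name, action, compensation in steps:
        try:
            action()
            log.append(f"T:{name}")
            completed.append((name, compensation))
        except Exception:
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"C:{done_name}")
            return False, log
    return True, log
```

Between a sub-transaction and its compensation, other work can observe the intermediate state (a booked flight that is later cancelled), which is the isolation being given up.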

@tyler_treat Event-Driven

@tyler_treat (diagram: “book trip” → Trip Service → one transaction each against the Airline Service, Hotel Service, and Car Service)

@tyler_treat (diagram: the Trip Service, Airline Service, Hotel Service, and Car Service exchange events instead)
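
One way to picture the event-driven variant is an in-process event bus. This is only a sketch: `EventBus`, the event names, and the order in which services react are assumptions, since the slide shows just that the services exchange events.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process stand-in for a message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.history.append(event_type)
        # .get avoids creating entries for events nobody listens to.
        for handler in self.handlers.get(event_type, []):
            handler(payload)


def wire_trip_services(bus):
    # Each service reacts to an event and emits its own; no service
    # calls another one directly.
    bus.subscribe("TripRequested", lambda trip: bus.publish("FlightBooked", trip))
    bus.subscribe("FlightBooked", lambda trip: bus.publish("HotelBooked", trip))
    bus.subscribe("HotelBooked", lambda trip: bus.publish("CarBooked", trip))
```

A real broker would deliver these events asynchronously, so each service makes progress on its own schedule rather than inside one coordinated transaction.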

@tyler_treat System Properties Business Rules

@tyler_treat “People don’t want distributed transactions, they
just want the guarantees that distributed transactions give them.” (Sean T. Allen)

@tyler_treat CAP theorem

@tyler_treat CAP Theorem
• Consistency, Availability, Partition Tolerance
• When a partition occurs, do we:
  • Choose availability and give up consistency?
  - or -
  • Choose consistency and give up availability? (or YOLO it)

@tyler_treat The CAP theorem is a UX question…

@tyler_treat When a partial failure occurs, how do you want
the application to behave?

@tyler_treat We can choose consistency and sacrifice availability…

@tyler_treat …or we can choose availability by making local decisions
with the knowledge at hand and designing the UX accordingly.
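
The two choices might look like this from the application's side. This is a sketch with hypothetical names (`Replica`, `Unavailable`); the partition is simulated with a flag rather than a real network failure.

```python
class Unavailable(Exception):
    """Raised when a consistent answer cannot be given during a partition."""


class Replica:
    """Hypothetical replica holding a possibly stale local copy."""

    def __init__(self, local_value, leader_value):
        self.local_value = local_value
        self.leader_value = leader_value
        self.partitioned = False

    def read_consistent(self):
        # Choose consistency: refuse to answer if the leader is unreachable.
        if self.partitioned:
            raise Unavailable("cannot reach leader")
        return self.leader_value

    def read_available(self):
        # Choose availability: answer from local knowledge and flag it as
        # possibly stale so the UX can say so.
        if self.partitioned:
            return self.local_value, True
        return self.leader_value, False
```

The staleness flag is the part that reaches the user: the interface can show "3 seats left (may be out of date)" instead of an error page.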

@tyler_treat Managing partial failure is a matter of dealing with
partial knowledge…

@tyler_treat …and managing risk.

@tyler_treat Our risk appetite can drive business rules.
Check value < $10,000?
• Yes: clear locally
• No: double check with all replicas before clearing
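
The rule on this slide translates almost directly into code. The threshold comes from the slide; `clear_check`, the balance parameters, and the idea that each replica reports a balance are assumptions made for illustration.

```python
RISK_THRESHOLD = 10_000  # dollars, taken from the slide


def clear_check(amount, local_balance, replica_balances):
    """Decide whether to clear a check of `amount` dollars."""
    if amount < RISK_THRESHOLD:
        # Low stakes: decide from local knowledge only.
        return local_balance >= amount
    # High stakes: double check with all replicas before clearing.
    balances = [local_balance, *replica_balances]
    return all(balance >= amount for balance in balances)
```

Small checks clear quickly and are occasionally wrong; large checks pay the coordination cost because a wrong guess is expensive.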

@tyler_treat Memories, guesses, and apologies

@tyler_treat Computers operate with partial knowledge.

@tyler_treat Either there’s a disconnect with the “real world”…

@tyler_treat …or there’s a disconnect between systems.

@tyler_treat Systems don’t make decisions, they make guesses.

@tyler_treat Systems have memory.

@tyler_treat Memories help systems make better guesses in the future.

@tyler_treat Forgetfulness is a business decision.

@tyler_treat Sometimes the system guesses wrong.

@tyler_treat Systems need the capacity to apologize.
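
A toy model of guessing, remembering, and apologizing is sketched below. `Store`, its stock counts, and the rule that the most recent orders are the ones apologized for are all hypothetical choices.

```python
class Store:
    """Hypothetical store that sells from a possibly stale local stock count."""

    def __init__(self, believed_stock):
        self.believed_stock = believed_stock
        self.memory = []     # accepted orders are remembered
        self.apologies = []  # orders that could not be honored after all

    def order(self, customer):
        # Guess: accept the order based on local knowledge only.
        accepted = self.believed_stock > 0
        if accepted:
            self.believed_stock -= 1
            self.memory.append(customer)
        return accepted

    def reconcile(self, actual_stock):
        # The truth arrives later: apologize for guesses that were wrong.
        while len(self.memory) > actual_stock:
            self.apologies.append(self.memory.pop())
        self.believed_stock = actual_stock - len(self.memory)
        return self.apologies
```

What the apology consists of (a refund, a voucher, a phone call) is a business decision rather than something the code can settle.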

@tyler_treat Customers judge you not by your failures, but by
how you handle your failures.

@tyler_treat Are you building systems that never fail or systems
that fail gracefully?

@tyler_treat Businesses need both code and people to manage apologies.

@tyler_treat It becomes less about trying to build the perfect
system and more about how we cope with an imperfect one.

@tyler_treat Wrapping Up: Summary and Observations

@tyler_treat System Properties: ACID, distributed transactions, exactly-once delivery, ordered delivery, serializable isolation, linearizability
Business Rules / Application Invariants: negative account balance, two users sharing same ID, room double-booked, balance reconciles

@tyler_treat We put ourselves at the mercy of our infrastructure
and hope it makes good on its promises.

@tyler_treat It often doesn’t. (Kyle Kingsbury, 2015, http://jepsen.io)

@tyler_treat When do we actually need consistency?

@tyler_treat We can use consistency when the stakes are high
and the cost is worth it.

@tyler_treat And design our transparencies accordingly.

@tyler_treat We could try to build perfect systems.

@tyler_treat Should we build perfect systems or pragmatic systems?

@tyler_treat Systems that can compensate.

@tyler_treat Systems that can recover.

@tyler_treat Systems that can apologize.

@tyler_treat UX Systems Business

@tyler_treat Data Consistency, Race Conditions, Performance, Partial Failure (diagram labels: Transparency, Informs)

@tyler_treat Thank You / bravenewgeek.com / realkinetic.com

@tyler_treat References
• https://gotocon.com/dl/goto-chicago-2015/slides/CaitieMcCaffrey_ApplyingTheSagaPattern.pdf
• http://ijcsits.org/papers/vol2no62012/42vol2no6.pdf
• http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf
• https://queue.acm.org/detail.cfm?id=2745385
• https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf
• http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf
• https://bravenewgeek.com/distributed-systems-are-a-ux-problem/
• http://www.cs.princeton.edu/~wlloyd/papers/challenges-hotos15.pdf
• https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf
• https://www.youtube.com/watch?v=lsKaNDj4TrE
• Starbucks photo - https://www.geekwire.com/2015/starbucks-mobile-ordering-now-blankets-the-u-s-with-coverage-in-san-francisco-new-york-and-more-coming-today/
• Friction image - https://byjus.com/physics/friction-in-automobiles/
• Carbon copy forms - http://www.rainiercopy.com/forms.html
• Rosetta Stone photo - https://en.wikipedia.org/wiki/Rosetta_Stone#/media/File:Rosetta_Stone.JPG
