Distributed Systems Are a UX Problem

Tyler Treat
October 30, 2018

Distributed systems are not strictly an engineering problem. It’s far too easy to assume they are a backend development concern, but the reality is that there are implications at every point in the stack. Often the trade-offs we make lower in the stack in order to buy responsiveness bubble up to the top, so much so that they almost always affect the application in some way.

Distributed systems affect the user. We need to shift the focus from system properties and guarantees to business rules and application behavior. We need to understand the limitations and trade-offs at each level in the stack and why they exist. We need to assume failure and plan for recovery. We need to start thinking of distributed systems as a UX problem.

Tyler Treat looks at distributed systems through the lens of user experience, observing how architecture, design patterns, and business problems all coalesce into UX. Tyler also shares system design anti-patterns and alternative patterns for building reliable and scalable systems with respect to business outcomes.

Topics include:

- The “truth” can be prohibitively expensive: When does strong consistency make sense, and when does it not? How do we reconcile this with application UX?
- Failure as an inevitability: If we can’t build perfect systems, what is “good enough”?
- Dealing with partial knowledge: Systems usually operate in the real world (e.g., an inventory application for a widget warehouse). How do we design for the “disconnect” between the real world and the system?

Transcript

  1. @tyler_treat
    Distributed Systems Are a UX Problem
    Tyler Treat / O’Reilly Software Architecture Conference / October 30, 2018

    View Slide

  2. @tyler_treat
    Tyler Treat

    [email protected]

    View Slide

  3. @tyler_treat
    I like distributed systems.

    View Slide

  4. @tyler_treat

    View Slide

  5. @tyler_treat

    View Slide

  6. @tyler_treat
    Disclaimer:

    I know approximately nothing about UX…

    View Slide

  7. @tyler_treat
    …other than when I’m the user, I know when
    my experience is good and when it’s bad.

    View Slide

  8. @tyler_treat

    View Slide

  9. @tyler_treat
    UX

    View Slide

  10. @tyler_treat
    UX Systems

    View Slide

  11. @tyler_treat
    UX Systems

    View Slide

  12. @tyler_treat
    UX Systems
    Business

    View Slide

  13. @tyler_treat
    UX Systems
    Business
    This Talk

    View Slide

  14. @tyler_treat
    The Yin and Yang of
    UX and Architecture

    View Slide

  15. @tyler_treat
    Monolith

    View Slide

  16. @tyler_treat
    Monolith

    View Slide

  17. @tyler_treat
    [Diagram: multiple services]

    View Slide

  18. @tyler_treat
    [Diagram: multiple services]

    View Slide

  19. @tyler_treat
    [Diagram: multiple services]

    View Slide

  20. @tyler_treat
    Implications

    View Slide

  21. @tyler_treat

    View Slide

  22. @tyler_treat
    book trip
    Trip
    Service
    Trip
    Database
    transaction
    Good old days

    View Slide

  23. @tyler_treat
    book trip
    Microservices
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    transaction
    transaction
    transaction

    View Slide

  24. @tyler_treat
    book trip
    Microservices
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    transaction
    transaction
    transaction
    ACID
    ACID
    ACID

    View Slide

  25. @tyler_treat
    UX Implications of Microservices
    • Data consistency

    View Slide

  26. @tyler_treat
    [Diagram: multiple services]

    View Slide

  27. @tyler_treat
    [Diagram: multiple services]

    View Slide

  28. @tyler_treat
    UX Implications of Microservices
    • Data consistency
    • Race conditions

    View Slide

  29. @tyler_treat

    View Slide

  30. @tyler_treat
    UX Implications of Microservices
    • Data consistency
    • Race conditions
    • Performance

    View Slide

  31. @tyler_treat
    book trip
    Microservices
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    transaction
    transaction
    transaction

    View Slide

  32. @tyler_treat
    book trip
    Microservices
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    transaction
    transaction
    transaction

    View Slide

  33. @tyler_treat
    UX Implications of Microservices
    • Data consistency
    • Race conditions
    • Performance
    • Partial failure

    View Slide

  34. @tyler_treat
    So are microservices bad?

    View Slide

  35. @tyler_treat
    Microservices are about
    people scale.

    View Slide

  36. @tyler_treat
    Transparency

    View Slide

  37. @tyler_treat
    A Study of Transparency and Adaptability of Heterogeneous
    Computer Networks with TCP/IP and IPv6 Protocols

    Das, 2012
    “Any change in a computing system, such as a new feature or new
    component, is transparent if the system after change adheres to
    previous external interface as much as possible while changing its
    internal behavior.”

    View Slide

  38. @tyler_treat
    System

    View Slide

  39. @tyler_treat
    System

    View Slide

  40. @tyler_treat
    High Transparency
    Low Transparency

    View Slide

  41. @tyler_treat
    NFS
    High Transparency
    Low Transparency

    View Slide

  42. @tyler_treat
    NFS
    FTP
    High Transparency
    Low Transparency

    View Slide

  43. @tyler_treat
    Types of Transparencies
    Access transparency
    Location transparency
    Migration transparency
    Relocation transparency
    Replication transparency
    Concurrent transparency
    Failure transparency
    Persistence transparency
    Security transparency

    View Slide

  44. @tyler_treat
    Transparency is about usability.

    View Slide

  45. @tyler_treat
    Usability Control

    View Slide

  46. @tyler_treat
    Usability Control

    View Slide

  47. @tyler_treat
    Usability Control

    View Slide

  48. @tyler_treat
    Simplicity
    Flexibility, Performance, Correctness
    RPC

    View Slide

  49. @tyler_treat
    Simplicity
    Flexibility, Performance, Correctness
    Erlang Message Passing

    View Slide

  50. @tyler_treat
    RPC
    Erlang Message Passing
    High Transparency
    Low Transparency

    View Slide

  51. @tyler_treat
    Translating UX for developers:
    APIs

    View Slide

  52. @tyler_treat
    Transparencies simplify the API
    of a system.

    View Slide

  53. @tyler_treat
    UX is about deciding what
    knobs to expose.

    View Slide

  54. @tyler_treat
    The Truth is Prohibitively Expensive
    Balancing Consistency and UX

    View Slide

  55. @tyler_treat
    book trip
    Trip
    Service
    Trip
    Database
    transaction
    Good old days

    View Slide

  56. @tyler_treat
    book trip
    Trip
    Service
    Trip
    Database
    transaction
    Good old days
    Transparency

    View Slide

  57. @tyler_treat
    book trip
    Microservices
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    transaction
    transaction
    transaction
    Transparency

    View Slide

  58. @tyler_treat
    book trip
    Microservices
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    transaction
    transaction
    transaction
    ACID
    ACID
    ACID
    Transparency

    View Slide

  59. @tyler_treat

    View Slide

  60. @tyler_treat

    View Slide

  61. @tyler_treat

    View Slide

  62. @tyler_treat
    Spreadsheet service

    View Slide

  63. @tyler_treat
    Spreadsheet service
    Document service

    View Slide

  64. @tyler_treat
    Spreadsheet service
    Document service
    Presentation service

    View Slide

  65. @tyler_treat
    Spreadsheet service
    Document service
    Presentation service
    IAM service

    View Slide

  66. @tyler_treat
    Spreadsheet service
    Document service
    Presentation service
    IAM service
    consistent

    View Slide

  67. @tyler_treat
    Consistency is about ordering of
    events in a distributed system.

    View Slide

  68. @tyler_treat
    Why is this hard?

    View Slide

  69. View Slide

  70. @tyler_treat
    So what can we do?

    View Slide

  71. @tyler_treat
    Coordinate

    View Slide

  72. @tyler_treat
    Two-Phase Commit

    View Slide

  73. @tyler_treat
    book trip
    2PC Prepare
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    propose
    propose
    propose

    View Slide

  74. @tyler_treat
    book trip
    2PC Prepare
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    vote
    vote
    vote

    View Slide

  75. @tyler_treat
    book trip
    2PC Commit
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    commit/abort
    commit/abort
    commit/abort

    View Slide

  76. @tyler_treat
    book trip
    2PC Commit
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    done
    done
    done

    View Slide

  77. @tyler_treat
    Problems with 2PC
    • Chatty protocol: beholden to network latency
    • Limited throughput
    • Transaction coordinator: single point of failure
    • Blocking protocol: susceptible to deadlock

    View Slide
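
As a rough illustration of the prepare and commit phases in the slides above, here is a minimal in-memory sketch in Python. The `Participant` class, its `prepare`/`commit`/`abort` methods, and the `two_phase_commit` function are hypothetical names made up for this example; they do not come from the talk or from any library. It deliberately leaves out durable logging, timeouts, and coordinator recovery, which is where the listed problems show up in practice.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Participant:
    """Hypothetical service taking part in a two-phase commit."""
    name: str
    can_commit: bool = True   # how this participant will vote
    state: str = "idle"

    def prepare(self) -> bool:
        # Phase 1: vote yes/no; a "yes" voter must wait for the outcome.
        self.state = "prepared" if self.can_commit else "refused"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Return True if every participant committed, False if all aborted."""
    # Phase 1 (prepare): the coordinator collects a vote from everyone.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2 (commit): the coordinator broadcasts one global decision.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

Every participant blocks between its vote and the coordinator's decision, so a slow or crashed coordinator stalls the whole booking.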

  78. @tyler_treat
    book trip
    2PC Prepare
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    propose
    propose
    propose

    View Slide

  79. @tyler_treat
    book trip
    2PC Prepare
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    propose
    propose
    propose

    View Slide

  80. @tyler_treat
    book trip
    2PC Prepare
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    propose
    propose
    propose

    View Slide

  81. @tyler_treat
    Add more phases!

    View Slide

  82. @tyler_treat
    Three-Phase Commit

    View Slide

  83. @tyler_treat

    View Slide

  84. @tyler_treat
    atomic clocks
    NTP
    GPS
    TrueTime

    View Slide

  85. @tyler_treat
    Good news:

    we solved physics.

    View Slide

  86. @tyler_treat
    Bad news:

    it costs all the money.

    View Slide

  87. @tyler_treat
    Not exactly…

    View Slide

  88. @tyler_treat
    Spanner: Google’s Globally-Distributed Database

    Corbett et al.

    View Slide

  89. @tyler_treat
    TrueTime forces that uncertainty to the
    surface, and Spanner provides a
    transparency over it.

    View Slide

  90. @tyler_treat
    Spanner doesn’t avoid trade-offs,
    it just minimizes their probability.

    View Slide

  91. @tyler_treat
    Spanner is expensive and
    proprietary.

    View Slide

  92. @tyler_treat
    But it’s not the end of the story…

    View Slide

  93. @tyler_treat
    Unless every service is backed by the
    same database, you probably still have
    to deal with consistency problems.

    View Slide

  94. @tyler_treat
    Challenges to Adopting Stronger Consistency at Scale

    Ajoux et al., 2015
    “The biggest barrier to providing stronger consistency guarantees…is
    that the consistency mechanism must integrate consistency across
    many stateful services.”

    View Slide

  95. @tyler_treat
    Coordination is expensive because
    processes can’t make progress
    independently.

    View Slide

  96. @tyler_treat

    View Slide

  97. @tyler_treat

    View Slide

  98. @tyler_treat
    Peter Bailis, 2015 https://speakerdeck.com/pbailis/silence-is-golden-coordination-avoiding-systems-design

    View Slide

  99. @tyler_treat
    And what about partial failure?

    View Slide

  100. @tyler_treat

    View Slide

  101. @tyler_treat

    View Slide

  102. @tyler_treat

    View Slide

  103. @tyler_treat

    View Slide

  104. @tyler_treat

    View Slide

  105. @tyler_treat
    Memories, Guesses, and Apologies
    Dealing with Partial Knowledge

    View Slide

  106. @tyler_treat
    The cost of knowing the “truth”
    can be prohibitively expensive.

    View Slide

  107. @tyler_treat
    And partial failure means the
    “truth” is also fragile.

    View Slide

  108. @tyler_treat
    Where does this leave us?

    View Slide

  109. @tyler_treat
    We could go
    back to the
    monolith.

    View Slide

  110. @tyler_treat
    We could build
    expensive data centers
    with fancy hardware…

    View Slide

  111. @tyler_treat
    …or we could
    rethink our
    transparencies.

    View Slide

  112. @tyler_treat

    View Slide

  113. View Slide

  114. @tyler_treat
    Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

    View Slide

  115. @tyler_treat
    Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

    View Slide

  116. @tyler_treat
    Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

    View Slide

  117. @tyler_treat
    Gregor Hohpe, 2005 https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf

    View Slide

  118. @tyler_treat
    Exception Handling in
    Asynchronous Systems

    View Slide

  119. @tyler_treat

    View Slide

  120. @tyler_treat
    Exception Handling in Asynchronous Systems
    • Write-off

    View Slide

  121. @tyler_treat

    View Slide

  122. @tyler_treat
    Exception Handling in Asynchronous Systems
    • Write-off
    • Retry

    View Slide

  123. @tyler_treat

    View Slide

  124. @tyler_treat
    Exception Handling in Asynchronous Systems
    • Write-off
    • Retry
    • Compensating action

    View Slide
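
The three strategies listed above (write-off, retry, compensating action) could be sketched roughly as follows. The function name `handle_failure`, its parameters, and the returned status strings are assumptions invented for this illustration, not something defined in the talk.

```python
from typing import Callable, Optional


def handle_failure(
    action: Callable[[], None],
    strategy: str,
    compensate: Optional[Callable[[], None]] = None,
    max_attempts: int = 3,
) -> str:
    """Run `action`; if it fails, apply one of the three strategies."""
    try:
        action()
        return "ok"
    except Exception:
        pass  # fall through to the chosen strategy

    if strategy == "write-off":
        # Accept the loss and move on.
        return "written off"
    if strategy == "retry":
        # Try again; only safe when `action` is idempotent.
        for _ in range(max_attempts - 1):
            try:
                action()
                return "ok after retry"
            except Exception:
                continue
        return "gave up"
    if strategy == "compensate":
        # Undo work that already succeeded elsewhere (e.g. a refund).
        if compensate is None:
            raise ValueError("compensate strategy needs a compensating action")
        compensate()
        return "compensated"
    raise ValueError(f"unknown strategy: {strategy}")
```

Which strategy applies is a business decision: writing off a cheap item can be preferable to paying the coordination cost needed to never lose one.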

  125. @tyler_treat
    Revisiting Two-Phase Commit

    View Slide

  126. @tyler_treat
    Sagas

    View Slide

  127. @tyler_treat
    Sagas

    Garcia-Molina & Salem, 1987
    “A long-lived transaction is a saga if it can be written as a sequence of
    transactions that can be interleaved with other transactions…Either all
    the transactions in a saga are successfully completed or
    compensating transactions are run to amend a partial execution.”

    View Slide

  128. @tyler_treat
    Sagas

    Garcia-Molina & Salem, 1987
    “A long-lived transaction is a saga if it can be written as a sequence of
    transactions that can be interleaved with other transactions…Either all
    the transactions in a saga are successfully completed or
    compensating transactions are run to amend a partial execution.”

    View Slide

  129. @tyler_treat
    Sagas split long-lived transactions into
    individual, interleaved sub-transactions:
    $T = T_1, T_2, \ldots, T_n$

    View Slide

  130. @tyler_treat
    And each sub-transaction has a
    compensating transaction:
    $C_1, C_2, \ldots, C_n$

    View Slide

  131. @tyler_treat
    Sagas guarantee one of two
    execution sequences:
    $T_1, T_2, \ldots, T_n$
    $T_1, T_2, \ldots, T_j, C_j, \ldots, C_2, C_1$

    View Slide

  132. @tyler_treat
    book trip
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    transaction
    transaction
    transaction

    View Slide

  133. @tyler_treat
    • Book flight
    • Book hotel
    • Book car
    • Charge money
    $T = T_1, T_2, \ldots, T_n$

    View Slide

  134. @tyler_treat
    • Cancel flight
    • Cancel hotel
    • Cancel car
    • Refund money
    $C_1, C_2, \ldots, C_n$

    View Slide
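
To make the two execution sequences concrete, here is a minimal saga runner sketch in Python. The names `run_saga` and `Step` and the log strings are hypothetical, chosen for this example only. Each step pairs a sub-transaction with its compensating transaction; if a sub-transaction fails, the compensations for the steps that already completed are run in reverse order.

```python
from typing import Callable, List, Tuple

# (name, sub-transaction T_i, compensating transaction C_i)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run each T_i in order; on failure run C_j, ..., C_1 in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, transaction, _ = step
        try:
            transaction()
        except Exception:
            log.append(f"failed: {name}")
            # Undo the already-completed steps, most recent first.
            for done_name, _, compensation in reversed(completed):
                compensation()
                log.append(f"compensated: {done_name}")
            return log
        completed.append(step)
        log.append(f"done: {name}")
    return log
```

For the trip example, a failed car booking would trigger "cancel hotel" and then "cancel flight". This sketch assumes compensations themselves succeed; a real system would have to retry them.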

  135. @tyler_treat
    Compensating transactions
    must be idempotent.

    View Slide
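
Compensations may be delivered or retried more than once, for example after a timeout, so applying one twice must have the same effect as applying it once. One common way to get there is to remember which requests have already been applied. The sketch below assumes a hypothetical `RefundService` keyed by a `booking_id`; both names are invented for illustration.

```python
class RefundService:
    """Hypothetical payment service with an idempotent refund."""

    def __init__(self) -> None:
        self.total_refunded = 0
        self._seen = set()  # booking IDs that were already refunded

    def refund(self, booking_id: str, amount: int) -> bool:
        """Refund once per booking; repeated calls are harmless no-ops."""
        if booking_id in self._seen:
            return False  # already applied, nothing to do
        self._seen.add(booking_id)
        self.total_refunded += amount
        return True
```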

  136. @tyler_treat
    Sagas trade off isolation for
    availability.

    View Slide

  137. @tyler_treat
    Event-Driven

    View Slide

  138. @tyler_treat
    book trip
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    transaction
    transaction
    transaction

    View Slide

  139. @tyler_treat
    event
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    event
    event
    event

    View Slide

  140. @tyler_treat
    event
    Airline
    Service
    Hotel
    Service
    Car
    Service
    Trip
    Service
    event
    event
    event

    View Slide

  141. @tyler_treat
    System Properties Business Rules

    View Slide

  142. @tyler_treat
    Sean T. Allen
    “People don’t want distributed transactions,
    they just want the guarantees that distributed
    transactions give them.”

    View Slide

  143. @tyler_treat
    CAP theorem

    View Slide

  144. @tyler_treat
    CAP Theorem
    • Consistency, Availability, Partition Tolerance
    • When a partition occurs, do we:
    • Choose availability and give up consistency?
    - or -
    • Choose consistency and give up availability?

    View Slide

  145. @tyler_treat
    CAP Theorem
    • Consistency, Availability, Partition Tolerance
    • When a partition occurs, do we:
    • Choose availability and give up consistency?
    - or -
    • Choose consistency and give up availability?
    (or YOLO it)

    View Slide

  146. @tyler_treat
    The CAP theorem is a UX
    question…

    View Slide

  147. @tyler_treat
    When a partial failure occurs, how do
    you want the application to behave?

    View Slide

  148. @tyler_treat

    View Slide

  149. @tyler_treat

    View Slide

  150. @tyler_treat
    We can choose consistency and
    sacrifice availability…

    View Slide

  151. @tyler_treat
    …or we can choose availability by making
    local decisions with the knowledge at
    hand and designing the UX accordingly.

    View Slide

  152. @tyler_treat
    Managing partial failure is a matter
    of dealing with partial knowledge…

    View Slide

  153. @tyler_treat
    …and managing risk.

    View Slide

  154. @tyler_treat
    Our risk appetite can drive business rules.
    Check value < $10,000?
    yes: Clear locally
    no: Double check with all replicas before clearing

    View Slide
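
The decision in this flowchart can be expressed as a small business rule. In the sketch below, the $10,000 threshold comes from the slide; the function name `clear_check`, the `confirm_with_replicas` callback, and the "rejected" outcome are assumptions added for illustration.

```python
from typing import Callable

LOCAL_CLEARING_LIMIT = 10_000  # dollars; threshold taken from the slide


def clear_check(amount: float, confirm_with_replicas: Callable[[float], bool]) -> str:
    """Clear small checks locally; coordinate only when the stakes are high."""
    if amount < LOCAL_CLEARING_LIMIT:
        # Low risk: guess locally and apologize later if the guess was wrong.
        return "cleared locally"
    # High risk: pay the coordination cost before committing.
    if confirm_with_replicas(amount):
        return "cleared after coordination"
    return "rejected"  # assumed outcome; the slide does not specify one
```

Raising or lowering the limit trades availability and latency against the cost of an occasional wrong guess, which is a question for the business rather than for the infrastructure.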

  155. @tyler_treat
    Memories, guesses, and
    apologies

    View Slide

  156. @tyler_treat
    Computers operate with partial
    knowledge.

    View Slide

  157. @tyler_treat
    Either there’s a
    disconnect with
    the “real world”…

    View Slide

  158. @tyler_treat
    …or there’s a
    disconnect
    between systems.

    View Slide

  159. @tyler_treat
    Systems don’t make decisions,
    they make guesses.

    View Slide

  160. @tyler_treat
    Systems have memory.

    View Slide

  161. @tyler_treat
    Memories help systems make
    better guesses in the future.

    View Slide

  162. @tyler_treat
    Forgetfulness is a business
    decision.

    View Slide

  163. @tyler_treat
    Sometimes the system guesses
    wrong.

    View Slide

  164. @tyler_treat
    Systems need the capacity to
    apologize.

    View Slide

  165. @tyler_treat
    Customers judge you not by your
    failures, but by how you handle your
    failures.

    View Slide

  166. @tyler_treat
    Are you building systems that never
    fail or systems that fail gracefully?

    View Slide

  167. @tyler_treat

    View Slide

  168. @tyler_treat
    Businesses need both code and
    people to manage apologies.

    View Slide

  169. @tyler_treat
    It becomes less about trying to build the
    perfect system and more about how we
    cope with an imperfect one.

    View Slide

  170. @tyler_treat
    Wrapping Up
    Summary and Observations

    View Slide

  171. @tyler_treat

    View Slide

  172. @tyler_treat

    View Slide

  173. @tyler_treat
    System Properties
    ACID
    distributed transactions
    exactly-once delivery
    ordered delivery
    serializable isolation
    linearizability

    View Slide

  174. @tyler_treat
    System Properties
    ACID
    distributed transactions
    exactly-once delivery
    ordered delivery
    serializable isolation
    linearizability
    Business Rules / Application Invariants
    negative account balance
    two users sharing same ID
    room double-booked
    balance reconciles

    View Slide

  175. @tyler_treat

    View Slide

  176. @tyler_treat
    We put ourselves at the mercy of our
    infrastructure and hope it makes good
    on its promises.

    View Slide

  177. @tyler_treat
    It often doesn’t.
    Kyle Kingsbury, 2015 http://jepsen.io

    View Slide

  178. @tyler_treat
    When do we actually need
    consistency?

    View Slide

  179. @tyler_treat

    View Slide

  180. @tyler_treat
    We can use consistency when the
    stakes are high and the cost is worth it.

    View Slide

  181. @tyler_treat
    And design our transparencies
    accordingly.

    View Slide

  182. @tyler_treat
    We could try to build perfect
    systems.

    View Slide

  183. @tyler_treat
    Should we build perfect
    systems or pragmatic systems?

    View Slide

  184. @tyler_treat
    Systems that can compensate.

    View Slide

  185. @tyler_treat
    Systems that can recover.

    View Slide

  186. @tyler_treat
    Systems that can apologize.

    View Slide

  187. @tyler_treat
    UX Systems
    Business

    View Slide

  188. @tyler_treat
    Data Consistency
    Race Conditions
    Performance
    Partial Failure

    View Slide

  189. @tyler_treat
    Data Consistency
    Race Conditions
    Performance
    Partial Failure
    Transparency
    Informs

    View Slide

  190. @tyler_treat
    Thank You
    bravenewgeek.com

    realkinetic.com

    View Slide

  191. @tyler_treat
    References
    • https://gotocon.com/dl/goto-chicago-2015/slides/CaitieMcCaffrey_ApplyingTheSagaPattern.pdf
    • http://ijcsits.org/papers/vol2no62012/42vol2no6.pdf
    • http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf
    • https://queue.acm.org/detail.cfm?id=2745385
    • https://www.enterpriseintegrationpatterns.com/docs/IEEE_Software_Design_2PC.pdf
    • http://www-db.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf
    • https://bravenewgeek.com/distributed-systems-are-a-ux-problem/
    • http://www.cs.princeton.edu/~wlloyd/papers/challenges-hotos15.pdf
    • https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf
    • https://www.youtube.com/watch?v=lsKaNDj4TrE
    • Starbucks photo - https://www.geekwire.com/2015/starbucks-mobile-ordering-now-blankets-the-u-s-with-coverage-in-san-francisco-new-york-and-more-coming-today/
    • Friction image - https://byjus.com/physics/friction-in-automobiles/
    • Carbon copy forms - http://www.rainiercopy.com/forms.html
    • Rosetta Stone photo - https://en.wikipedia.org/wiki/Rosetta_Stone#/media/File:Rosetta_Stone.JPG

    View Slide