
Building Scalable, Highly Concurrent & Fault-Tolerant Systems - Lessons Learned

Lessons learned through agony and pain, lots of pain.

Jonas Bonér

July 27, 2012



Transcript

  1. Building
    Scalable,
    Highly Concurrent &
    Fault-Tolerant
    Systems:
    Lessons Learned
    Jonas Bonér
    CTO Typesafe
    Twitter: @jboner


  2. I will never use distributed transactions again
     (the line repeats, blackboard-style, to fill the slide)

  3-5. Lessons
     Learned
     through...
     Agony
     and Pain
     lots of
     Pain
     (over the same repeated "I will never use distributed transactions again" background)

  6. Agenda
    • It’s All Trade-offs
    • Go Concurrent
    • Go Reactive
    • Go Fault-Tolerant
    • Go Distributed
    • Go Big


  7. (image-only slide)

  8. It’s all
    Trade-offs


  9. Performance
    vs
    Scalability


  10. Latency
    vs
    Throughput


  11. Availability
    vs
    Consistency


  12. Go Concurrent


  13-19. Shared mutable state
     Together with threads...
     ...leads to
     ...code that is totally INDETERMINISTIC
     ...and the root of all EVIL
     Please, avoid it at all costs
     Use IMMUTABLE
     state!!!

  20. The problem with locks
    • Locks do not compose
    • Locks break encapsulation
    • Taking too few locks
    • Taking too many locks
    • Taking the wrong locks
    • Taking locks in the wrong order
    • Error recovery is hard

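The "taking locks in the wrong order" bullet is the classic deadlock trap, and the classic fix is a global acquisition order. The deck shows no code, so here is an illustrative Python sketch (the `Account` class and `transfer` function are invented for the example):

```python
import threading

class Account:
    _next_id = 0
    def __init__(self, balance):
        self.id = Account._next_id          # global ordering key
        Account._next_id += 1
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always acquire locks in a fixed global order (by account id),
    # so two concurrent opposite transfers can never deadlock each other.
    first, second = sorted((src, dst), key=lambda acct: acct.id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(100), Account(100)
t1 = threading.Thread(target=transfer, args=(a, b, 30))
t2 = threading.Thread(target=transfer, args=(b, a, 10))
t1.start(); t2.start(); t1.join(); t2.join()
print(a.balance, b.balance)  # 80 120
```

Note that ordering only prevents deadlock; the slide's other complaints (no composition, broken encapsulation) still apply, which is why the next slide reaches for better tools.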

  21. You deserve better tools
    • Dataflow Concurrency
    • Actors
    • Software Transactional Memory (STM)
    • Agents


  22. Dataflow Concurrency
    • Deterministic
    • Declarative
    • Data-driven
    • Threads are suspended until data is available
    • Lazy & On-demand
    • No difference between:
    • Concurrent code
    • Sequential code
    • Examples: Akka & GPars

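The slide points at Akka and GPars; the key property of a dataflow (single-assignment) variable can be sketched in plain Python, where readers simply suspend until the value is bound (the `DataflowVar` class is made up for illustration):

```python
import threading

class DataflowVar:
    """Single-assignment variable: reads block until a value is bound."""
    def __init__(self):
        self._bound = threading.Event()
        self._value = None

    def bind(self, value):
        if self._bound.is_set():
            raise ValueError("a dataflow variable can only be bound once")
        self._value = value
        self._bound.set()

    def get(self):
        self._bound.wait()          # thread suspends until data is available
        return self._value

x, y, z = DataflowVar(), DataflowVar(), DataflowVar()

# z = x + y, written as if it were sequential code: it just waits
# until x and y are bound, in whatever order that happens.
threading.Thread(target=lambda: z.bind(x.get() + y.get())).start()
threading.Thread(target=lambda: x.bind(40)).start()
threading.Thread(target=lambda: y.bind(2)).start()

print(z.get())  # 42
```

This is what "deterministic" means on the slide: whatever the thread scheduling, z is always 42.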

  23. Actors
    •Share NOTHING
    •Isolated lightweight event-based processes
    •Each actor has a mailbox (message queue)
    •Communicates through asynchronous and
    non-blocking message passing
    •Location transparent (distributable)
    •Examples: Akka & Erlang

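A toy Python version of the model (Akka and Erlang actors are far richer; the `CounterActor` here is invented for illustration): private state, a mailbox, and one message processed at a time.

```python
import queue
import threading

class CounterActor:
    """Minimal actor: private state, a mailbox, one message at a time."""
    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0                      # never shared with other threads
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):                     # asynchronous, non-blocking
        self._mailbox.put(msg)

    def _run(self):
        while True:
            msg = self._mailbox.get()
            if msg == "inc":
                self._count += 1             # only this thread touches state
            elif isinstance(msg, tuple) and msg[0] == "get":
                msg[1].put(self._count)      # reply goes back via the message

counter = CounterActor()
for _ in range(1000):
    counter.send("inc")

reply = queue.Queue()
counter.send(("get", reply))
result = reply.get()
print(result)  # 1000 -- no locks, no races: the mailbox serializes all access
```

Because the mailbox is the only way in, there is nothing to lock, and the same send-a-message interface works whether the actor is local or remote (the "location transparent" bullet).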

  24. STM
     • See the memory as a transactional dataset
     • Similar to a DB: begin, commit, rollback (ACI)
     • Transactions are retried upon collision
     • Rolls back the memory on abort
     • Transactions can nest and compose
     • Use STM instead of abusing your database
     with temporary storage of “scratch” data
     • Examples: Haskell, Clojure & Scala

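The slide names Haskell, Clojure and Scala; as a rough illustration of the retry-on-collision idea only (not any of those implementations), here is a toy optimistic STM in Python with versioned refs:

```python
import threading

class Ref:
    """A transactional cell: a value plus a version number."""
    def __init__(self, value):
        self.value, self.version = value, 0

_commit_lock = threading.Lock()

def atomic(txn):
    """Run txn(read, write) optimistically; retry on collision."""
    while True:
        reads, writes = {}, {}
        def read(ref):
            reads.setdefault(ref, ref.version)   # remember version at first access
            return writes.get(ref, ref.value)
        def write(ref, value):
            reads.setdefault(ref, ref.version)
            writes[ref] = value
        result = txn(read, write)
        with _commit_lock:
            if all(ref.version == v for ref, v in reads.items()):
                for ref, value in writes.items():        # commit
                    ref.value, ref.version = value, ref.version + 1
                return result
        # collision: someone committed under us -- writes are discarded, retry

a, b = Ref(100), Ref(0)

def move_10(read, write):
    write(a, read(a) - 10)
    write(b, read(b) + 10)

threads = [threading.Thread(target=atomic, args=(move_10,)) for _ in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(a.value, b.value)  # 50 50 -- no transfer is ever half-applied
```

"Rollback" here is simply throwing the transaction's private write set away and re-running it, which is why transactional code must be side-effect free.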

  25. Agents
     • Reactive memory cells (STM Ref)
     • Send an update function to the Agent, which
     1. adds it to an (ordered) queue, to be
     2. applied to the Agent asynchronously
     • Reads are “free”, they just dereference the Ref
     • Cooperates with STM
     • Examples: Clojure & Akka

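The steps above can be sketched in Python (the `Agent` class is illustrative, loosely modeled on Clojure's agents): updates are functions, queued and applied asynchronously in order, while reads never block.

```python
import queue
import threading

class Agent:
    """Reactive memory cell: send it update functions, read it for free."""
    def __init__(self, value):
        self._value = value
        self._queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, fn):              # 1. enqueue an update function, don't wait
        self._queue.put(fn)

    def deref(self):                 # reads are "free": just dereference
        return self._value

    def await_(self):                # block until all queued updates are applied
        done = threading.Event()
        self._queue.put(lambda v: (done.set(), v)[1])
        done.wait()

    def _run(self):
        while True:
            fn = self._queue.get()
            self._value = fn(self._value)   # 2. applied async, one at a time

counter = Agent(0)
for _ in range(1000):
    counter.send(lambda v: v + 1)
counter.await_()
print(counter.deref())  # 1000
```

An agent is like an actor whose messages are all of one kind, "apply this function to your state", which is what makes it a cell rather than a general process.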

  26-35. If we could start all over...
     1. Start with a Deterministic, Declarative & Immutable core
     • Logic & Functional Programming
     • Dataflow
     2. Add Indeterminism selectively - only where needed
     • Actor/Agent-based Programming
     3. Add Mutability selectively - only where needed
     • Protected by Transactions (STM)
     4. Finally - only if really needed
     • Add Monitors (Locks) and explicit Threads

  36. Go Reactive


  37. Never block
    • ...unless you really have to
    • Blocking kills scalability (and performance)
    • Never sit on resources you don’t use
    • Use non-blocking IO
    • Be reactive
    • How?


  38. Go Async
    Design for reactive event-driven systems
    1. Use asynchronous message passing
    2. Use Iteratee-based IO
    3. Use push not pull (or poll)
    • Examples:
    • Akka or Erlang actors
    • Play’s reactive Iteratee IO
    • Node.js or JavaScript Promises
    • Server-Sent Events or WebSockets
    • Scala’s Futures library

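The slide's examples span several stacks; the same idea in Python's asyncio, as an illustrative sketch (`fetch_user` and `fetch_orders` are made-up stand-ins for non-blocking IO calls): compose asynchronous results without a thread ever sitting blocked.

```python
import asyncio

async def fetch_user(user_id):
    await asyncio.sleep(0.01)          # stands in for non-blocking IO
    return {"id": user_id, "name": f"user-{user_id}"}

async def fetch_orders(user_id):
    await asyncio.sleep(0.01)
    return [{"user": user_id, "total": 42}]

async def main():
    # Both requests run concurrently; while they are in flight the
    # event loop is free to serve other work -- nothing blocks.
    user, orders = await asyncio.gather(fetch_user(7), fetch_orders(7))
    return user["name"], orders[0]["total"]

print(asyncio.run(main()))  # ('user-7', 42)
```

Compare the blocking version: two sequential calls would hold a thread hostage for the sum of both latencies, which is exactly the "sitting on resources you don't use" the previous slide warns about.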

  39. Go Fault-Tolerant


  40-48. Failure Recovery in Java/C/C# etc.
     • You are given a SINGLE thread of control
     • If this thread blows up you are screwed
     • So you need to do all explicit error handling
     WITHIN this single thread
     • To make things worse - errors do not
     propagate between threads so there is NO
     WAY OF EVEN FINDING OUT that
     something has failed
     • This leads to DEFENSIVE programming with:
     • Error handling TANGLED with business logic
     • SCATTERED all over the code base
     We can do
     better!!!

  49. Just
    Let It Crash


  50. (image-only slide)

  51. The right way
    1. Isolated lightweight processes
    2. Supervised processes
    • Each running process has a supervising process
    • Errors are sent to the supervisor (asynchronously)
    • Supervisor manages the failure
    • Same semantics local as remote
    • For example the Actor Model solves it nicely

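The supervision scheme above can be sketched with plain threads in Python (a crude illustration; Erlang/Akka supervisors also offer restart strategies, escalation, etc. -- all names here are invented):

```python
import queue
import threading

threading.excepthook = lambda args: None    # keep crashes quiet; the supervisor handles them

def worker(jobs, results):
    while True:
        job = jobs.get()
        if job < 0:
            raise ValueError("boom")        # let it crash -- no defensive try/except here
        results.put(job * 2)

def supervisor(jobs, results):
    """Supervise the worker: whenever it dies, just restart it."""
    while True:
        t = threading.Thread(target=worker, args=(jobs, results), daemon=True)
        t.start()
        t.join()                            # join() returns only when the worker has died

jobs, results = queue.Queue(), queue.Queue()
threading.Thread(target=supervisor, args=(jobs, results), daemon=True).start()

jobs.put(1)
jobs.put(-1)                                # poison job: crashes the first worker
jobs.put(2)                                 # picked up by the restarted worker

first, second = results.get(), results.get()
print(first, second)  # 2 4
```

The worker contains zero error handling: failure is a message to the supervisor, not a code path tangled into the business logic.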

  52. Go Distributed


  53. Performance
    vs
    Scalability


  54-55. How do I know if I have a
     performance problem?
     If your system is
     slow for a single user

  56-57. How do I know if I have a
     scalability problem?
     If your system is
     fast for a single user
     but slow under heavy load

  58. (Three) Misconceptions about
    Reliable Distributed Computing
    - Werner Vogels
    1. Transparency is the ultimate goal
    2. Automatic object replication is desirable
    3. All replicas are equal and deterministic
Classic paper: A Note On Distributed Computing - Waldo et al.


  59. Fallacy 1
     Transparent Distributed Computing
     • Emulating Consistency and Shared
     Memory in a distributed environment
     • Distributed Objects
     • “Sucks like an inverted hurricane” - Martin Fowler
     • Distributed Transactions
     • ...don’t get me started...


  60. Fallacy 2
    RPC
    • Emulating synchronous blocking method
    dispatch - across the network
    • Ignores:
    • Latency
    • Partial failures
    • General scalability concerns, caching etc.
    • “Convenience over Correctness” - Steve Vinoski


  61. Instead


  62. Instead
     Embrace the Network
     and be done with it
     Use Asynchronous Message Passing


  63. Delivery Semantics
    • No guarantees
    • At most once
    • At least once
    • Once and only once
    Guaranteed Delivery


  64-66. It’s all lies.
     The network is inherently unreliable
     and there is no such thing as 100%
     guaranteed delivery

  67-74. Guaranteed Delivery
     The question is what to guarantee
     1. The message is - sent out on the network?
     2. The message is - received by the receiver host’s NIC?
     3. The message is - put on the receiver’s queue?
     4. The message is - applied to the receiver?
     5. The message is - starting to be processed by the receiver?
     6. The message - has completed processing by the receiver?

  75-77. Ok, then what to do?
     1. Start with 0 guarantees (0 additional cost)
     2. Add the guarantees you need - one by one
     Different USE-CASES
     Different GUARANTEES
     Different COSTS
     For each additional guarantee you add you will either:
     • decrease performance, throughput or scalability
     • increase latency

  78-80. Just
     Use ACKing
     and be done with it
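"Use ACKing" in practice means at-least-once delivery (resend until acknowledged) plus an idempotent receiver (deduplicate by message id) to get effectively-once processing. A self-contained Python sketch (all class and function names are invented; a real system would add timeouts and persistence):

```python
import uuid

class UnreliableChannel:
    """Drops some messages, like a real network."""
    def __init__(self, receiver, drop_every=3):
        self.receiver, self.drop_every, self.sent = receiver, drop_every, 0

    def send(self, msg):
        self.sent += 1
        if self.sent % self.drop_every == 0:
            return None                      # message lost -- no ACK comes back
        return self.receiver.deliver(msg)    # delivered; ACK comes back

class Receiver:
    def __init__(self):
        self.seen, self.log = set(), []

    def deliver(self, msg):
        if msg["id"] not in self.seen:       # dedup makes retries idempotent
            self.seen.add(msg["id"])
            self.log.append(msg["body"])
        return "ack"

def send_reliably(channel, body, max_retries=10):
    msg = {"id": str(uuid.uuid4()), "body": body}
    for _ in range(max_retries):
        if channel.send(msg) == "ack":       # resend until ACKed
            return True
    return False                             # give up: report, don't pretend

receiver = Receiver()
channel = UnreliableChannel(receiver)
for body in ["a", "b", "c"]:
    assert send_reliably(channel, body)
print(receiver.log)  # ['a', 'b', 'c'] -- each processed once, despite drops
```

Note where the guarantee actually lives: not in the network, but in the sender's retry loop and the receiver's dedup set, which is the slide's point about choosing (and paying for) guarantees yourself.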

  81. Latency
    vs
    Throughput


  82. You should strive for
    maximal throughput
    with
    acceptable latency


  83. Go Big


  84. Go Big
    Data


  85. Big Data
    Imperative OO programming doesn't cut it
    • Object-Mathematics Impedance Mismatch
    • We need functional processing, transformations etc.
    • Examples: Spark, Crunch/Scrunch, Cascading, Cascalog,
    Scalding, Scala Parallel Collections
• Hadoop has been called the:
    • “Assembly language of MapReduce programming”
    • “EJB of our time”

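The functional style the slide argues for can be shown in plain Python: a pipeline of pure transformations, the same map/reduce shape that frameworks like Spark, Scalding or Crunch distribute across a cluster (this toy runs on one machine, of course):

```python
from functools import reduce

lines = ["to be or not to be", "that is the question"]

# Word count as pure transformations: no shared mutable counters,
# so the same shape parallelizes and distributes naturally.
words = (word for line in lines for word in line.split())
counts = reduce(
    lambda acc, w: {**acc, w: acc.get(w, 0) + 1},
    words,
    {},
)
print(counts["to"], counts["be"], counts["question"])  # 2 2 1
```

The imperative equivalent (a loop mutating a shared dict) computes the same answer but hides the data flow the "Object-Mathematics Impedance Mismatch" bullet is complaining about.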

  86. Big Data
     Batch processing doesn't cut it
     • À la Hadoop
     • We need real-time data processing
     • Examples: Spark, Storm, S4 etc.
     • Watch “Why Big Data Needs To Be Functional”
     by Dean Wampler


  87. Go Big
    DB


  88. When is
    a RDBMS
    not
    good enough?


  89. Scaling reads
    to a RDBMS
    is hard


  90. Scaling writes
    to a RDBMS
    is impossible


  91-94. Do we
     really need
     a RDBMS?
     Sometimes...
     But many times we don’t

  95. Atomic
    Consistent
    Isolated
    Durable


  96. Availability
    vs
    Consistency


  97. Brewer’s
    CAP
    theorem


  98. You can only pick
    2
    Consistency
    Availability
    Partition tolerance
    At a given point in time


  99. Centralized system
     • In a centralized system (RDBMS etc.)
     we don’t have network partitions,
     i.e. no P in CAP
     • So you get both:
     Consistency
     Availability


  100. Distributed system
     • In a distributed (scalable) system
     we will have network partitions,
     i.e. the P in CAP
     • So you only get to pick one:
     Consistency
     Availability


  101. Basically Available
    Soft state
    Eventually consistent


  102. Think about your data
    • When do you need ACID?
    • When is Eventual Consistency a better fit?
• Different kinds of data have different needs
    • You need full consistency less than you think
    Then think again


  103-104. How fast is fast enough?
     • Never guess: Measure, measure and measure
     • Start by defining a baseline
     • Where are we now?
     • Define what is “good enough” - i.e. SLAs
     • Where do we want to go?
     • When are we done?
     • Beware of micro-benchmarks
     ...or, when can we go for a beer?

  105. To sum things up...
    1. Maximizing a specific metric impacts others
    • Every strategic decision involves a trade-off
    • There's no "silver bullet"
    2. Applying yesterday's best practices to the
    problems faced today will lead to:
    • Waste of resources
    • Performance and scalability bottlenecks
    • Unreliable systems


  106. SO


  107. GO


  108. ...now go home and build yourself
    Scalable,
    Highly Concurrent &
    Fault-Tolerant
    Systems


  109. Thank You
    Email: [email protected]
    Web: typesafe.com
    Twitter: @jboner
