
Designing and Evaluating a Distributed Computing Language Runtime


MesosCon EU 2016

Christopher Meiklejohn

September 01, 2016

Transcript

2. Synchronization
• To enforce an order: makes programming easier
• Eliminate accidental nondeterminism: prevent race conditions
• Techniques: locks, mutexes, semaphores, monitors, etc.
3. Difficult Cases
• "Internet of Things": low power, limited memory and connectivity
• Mobile gaming: offline operation with replicated, shared state
7. Weak Synchronization
• Can we achieve anything without synchronization? Not really.
• Strong Eventual Consistency (SEC): "Replicas that deliver the same updates have equivalent state"
• Primary requirement: eventual replica-to-replica communication
• Order insensitive! (Commutativity)
• Duplicate insensitive! (Idempotence)
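These two properties are exactly what a state-based CRDT's merge function provides. A minimal sketch, assuming nothing about Lasp's actual implementation, using a grow-only set whose merge is set union (commutative, associative, and idempotent):

```python
from functools import reduce

def merge(a, b):
    """Join for a grow-only set: union is commutative, associative, idempotent."""
    return a | b

updates = [{"x"}, {"y"}, {"z"}]

# Replica A delivers the updates in order; replica B delivers them in
# reverse order and additionally receives one update twice.
replica_a = reduce(merge, updates, set())
replica_b = reduce(merge, list(reversed(updates)) + [{"y"}], set())

assert replica_a == replica_b == {"x", "y", "z"}  # same updates, same state
```

Any reordering or re-delivery of the same updates produces the same merged state, which is the SEC guarantee stated above.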
10. Programming SEC
1. Eliminate accidental nondeterminism (e.g. deterministic, modeling non-monotonic operations monotonically)
2. Retain the properties of functional programming (e.g. confluence, referential transparency over composition)
3. Distributed, fault-tolerant runtime (e.g. replication, membership, dissemination)
12. Conflict-Free Replicated Data Types
• Many types exist with different properties: sets, counters, registers, flags, maps, graphs
• Strong Eventual Consistency: instances satisfy the SEC property per-object
14. Replicated set example, with replica state written as (element, add-ids, remove-ids):
• RA: add(1) → (1, {a}, {}), value {1}
• RC: add(1) → (1, {c}, {}), value {1}; then remove(1) → (1, {c}, {c}), value {}
• After every replica delivers every update, RA, RB, and RC all hold (1, {a, c}, {c}), value {1}: the remove covers only the add tagged c, so the concurrent add tagged a survives.
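The trace above can be reproduced with a small sketch of an observed-remove set (illustrative Python, not the deck's Erlang; the `ORSet` class and its method names are invented for this example): each add carries a unique tag, and a remove covers only the tags the removing replica has observed, so the concurrent add at RA survives the remove at RC:

```python
class ORSet:
    """Sketch of an observed-remove set: state is sets of (element, tag) pairs."""
    def __init__(self):
        self.adds = set()      # (element, unique tag) pairs
        self.removes = set()   # pairs observed at the time of a remove

    def add(self, element, tag):
        self.adds.add((element, tag))

    def remove(self, element):
        # Only the adds this replica has observed are covered by the remove.
        self.removes |= {(e, t) for (e, t) in self.adds if e == element}

    def merge(self, other):
        merged = ORSet()
        merged.adds = self.adds | other.adds
        merged.removes = self.removes | other.removes
        return merged

    def value(self):
        return {e for (e, t) in self.adds if (e, t) not in self.removes}

ra, rc = ORSet(), ORSet()
ra.add(1, "a")       # RA: add(1), tagged a
rc.add(1, "c")       # RC: add(1), tagged c
rc.remove(1)         # RC: remove(1) observes only tag c
assert ra.merge(rc).value() == {1}   # the add tagged a survives
assert rc.merge(ra).value() == {1}   # merge order does not matter
```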
15. Programming SEC
1. Eliminate accidental nondeterminism (e.g. deterministic, modeling non-monotonic operations monotonically)
2. Retain the properties of functional programming (e.g. confluence, referential transparency over composition)
3. Distributed, fault-tolerant runtime (e.g. replication, membership, dissemination)
17. Lattice Processing (Lasp)
• Distributed dataflow: declarative, functional programming model
• Convergent data structures: the primary data abstraction is the CRDT
• Enables composition: provides functional composition of CRDTs that preserves the SEC property
22. Lasp program example:

%% Create initial set.
S1 = declare(set),
%% Add elements to initial set and update.
update(S1, {add, [1,2,3]}),
%% Create second set.
S2 = declare(set),
%% Apply map operation between S1 and S2.
map(S1, fun(X) -> X * 2 end, S2).
24. Lattice Processing (Lasp)
• Functional and set-theoretic operations on sets: product, intersection, union, filter, map, fold
• Metadata computation: performs transformations on the internal metadata of CRDTs, allowing the creation of "composed" CRDTs
25. Programming SEC
1. Eliminate accidental nondeterminism (e.g. deterministic, modeling non-monotonic operations monotonically)
2. Retain the properties of functional programming (e.g. confluence, referential transparency over composition)
3. Distributed, fault-tolerant runtime (e.g. replication, membership, dissemination)
30. Selective Hearing
• Epidemic-broadcast-based runtime system: a runtime system that scales to large numbers of nodes, is resilient to failures, and provides efficient execution
• Well matched to Lattice Processing (Lasp)
• Epidemic broadcast mechanisms provide weak ordering but are resilient and efficient
• Lasp's programming model is tolerant to message reordering, disconnections, and node failures
• "Selective receive": nodes selectively receive and process messages based on interest
32. Layered Approach
• Membership: a configurable membership protocol which can operate in client-server or peer-to-peer mode
• Broadcast (via gossip, tree, etc.): efficient dissemination of both program state and application state via gossip, broadcast tree, or a hybrid mode
• Auto-discovery: integration with Mesos; auto-discovery of Lasp nodes for ease of configuration
33. Programming SEC
1. Eliminate accidental nondeterminism (e.g. deterministic, modeling non-monotonic operations monotonically)
2. Retain the properties of functional programming (e.g. confluence, referential transparency over composition)
3. Distributed, fault-tolerant runtime (e.g. replication, membership, dissemination)
34. Advertisement Counter
• Mobile game platform selling advertisement space: advertisements are paid according to a minimum number of impressions
• Clients will go offline: clients have limited connectivity, and the system still needs to make progress while clients are offline
35. [Slides 35 through 43 step through a dataflow diagram: user-maintained "Ads" and "Contracts" sets are combined with a product and a filter into an "Ads With Contracts" set; "Rovio Ads" and "Riot Ads" are combined with a union into the "Ads" set; each ad counter (e.g. "Rovio Ad Counter 1") is replicated with a single copy at each client, where it is incremented; a read at the 50,000-impression threshold triggers a remove of the ad. The legend distinguishes Lasp operations, user-maintained CRDTs, and Lasp-maintained CRDTs.]
45. Advertisement Counter
• Completely monotonic: disabling advertisements and contracts is modeled entirely through monotonic state growth
• Arbitrary distribution: the use of convergent data structures allows the computational graph to be arbitrarily distributed
• Divergence: divergence is a factor of the synchronization period
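A hedged sketch of why this is monotone, with invented names (`GCounter`, `disabled`, `IMPRESSIONS_SOLD`) rather than Lasp's actual API: impressions accumulate in a grow-only counter, and "ad disabled" is a monotone predicate of its value, so once it becomes true at any replica it remains true after every merge:

```python
class GCounter:
    """Grow-only counter: per-replica counts, merged with pointwise max."""
    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, replica, n=1):
        self.counts[replica] = self.counts.get(replica, 0) + n

    def merge(self, other):
        keys = set(self.counts) | set(other.counts)
        return GCounter({k: max(self.counts.get(k, 0),
                                other.counts.get(k, 0)) for k in keys})

    def value(self):
        return sum(self.counts.values())

IMPRESSIONS_SOLD = 50_000

def disabled(counter):
    # Monotone predicate: value() never decreases under merge, so once an
    # ad is disabled it stays disabled at every replica that merges this state.
    return counter.value() >= IMPRESSIONS_SOLD

# Two offline clients count impressions independently, then merge.
client_a, client_b = GCounter(), GCounter()
client_a.increment("a", 30_000)
client_b.increment("b", 25_000)
merged = client_a.merge(client_b)
assert merged.value() == 55_000
assert disabled(merged) and not disabled(client_a)
```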
48. Background: Distributed Erlang
• Transparent distribution: built in, provided by Erlang/BEAM; cross-node message passing
• Known scalability limitations: analyzed in various academic publications
• Single connection: head-of-line blocking
• Full membership: all-to-all failure detection with heartbeats and timeouts
50. Background: Erlang Port Mapper Daemon
• Operates on a known port: similar to the Solaris sunrpc-style portmap, a known port for mapping to dynamic port-based services
• Bridged networking: problematic for a cluster in bridged networking with dynamic port allocation
53. Experiment Design
• Single application: the advertisement counter example from Rovio Entertainment
• Runtime configuration: the application is controlled through runtime environment variables
• Membership: full membership with Distributed Erlang via EPMD
• Dissemination: state-based object dissemination through an anti-entropy protocol (fanout-based, PARC-style)
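The fanout-based, state-based anti-entropy dissemination can be sketched as follows (illustrative Python; `Node`, `anti_entropy_round`, and the set-union state are assumptions, not the actual Lasp protocol): each round, every node pushes its full state to a few random peers, which merge it with their own:

```python
import random

class Node:
    def __init__(self, state):
        self.state = set(state)   # join-semilattice state; here a set under union

def anti_entropy_round(nodes, fanout=2):
    """One round: every node pushes its full state to `fanout` random peers."""
    for node in nodes:
        peers = random.sample([n for n in nodes if n is not node],
                              min(fanout, len(nodes) - 1))
        for peer in peers:
            peer.state |= node.state   # merge = join (set union)

nodes = [Node({i}) for i in range(8)]
for _ in range(20):
    anti_entropy_round(nodes)
# With high probability every node now holds {0, ..., 7}.
```

Because merge is a join, redundant and reordered deliveries are harmless; the cost, as the later slides show, is that repeatedly shipping full state becomes expensive.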
56. Experiment Orchestration
• Docker: used for deployment of both EPMD and the Lasp application
• Single EPMD instance per slave: controlled through the use of host networking and HOSTNAME:UNIQUE constraints in Mesos
• Lasp: local execution using host networking; connects to the local EPMD
• Service discovery: facilitated through clustering EPMD instances with Sprinter
59. Experiment Orchestration
• Playa-Mesos: used for local experiment design and evaluation with Mesos
• Threads (processes): the low node count in the evaluation required simulating nodes with local threads to increase concurrency
• Transparent migration: transparent migration of experiments to an AWS-based Mesos deployment was desired, but proved problematic with IP/port assignment and clustering
• Adapted via environment: the deployment adapted based on detecting whether it was a cloud or a local deployment
60. Ideal Experiment
• Local deployment: high thread concurrency when operating with a lower node count
• Cloud deployment: low thread concurrency when operating with a higher node count
64. Initial Evaluation
• Moved to DC/OS exclusively: the environments were too different; too much work was needed to adapt things to work correctly
• Single orchestration task: dispatched events, controlled when to start and stop the evaluation, and performed log aggregation
• Bottleneck: events were dispatched immediately, which would require blocking for processing acknowledgment
• Unrealistic: events do not queue up all at once for processing by the client
67. Lasp Difficulties
• Too expensive: 2.0 CPUs and 2048 MiB of memory
• Weeks spent adding instrumentation: process-level, VM-level, and Erlang Observer instrumentation to identify CPU- and memory-heavy processes
• Dissemination too expensive: 1000 threads against a single dissemination process (one Mesos task) leads to backed-up message queues and memory leaks
• Unrealistic: two different dissemination mechanisms, thread-to-thread and node-to-node; one is synthetic
69. EPMD Difficulties
• Nodes become unregistered: nodes randomly unregistered with EPMD during execution
• Lost connections: EPMD loses connections with nodes for some arbitrary reason
• EPMD task restarted by Mesos: restarted for an unknown reason, which leads Lasp instances to restart in their own containers
71. Overhead Difficulties
• Too much state: a client would ship around 5 GiB of state within 90 seconds
• Delta dissemination: delta dissemination provides only around a 30% decrease in state transmission
• Unbounded queues: message buffers would lead to VMs crashing because of large memory consumption
75. Mesos Difficulties
• CPU "units": unclear what CPU units actually are, what they quantify, and how to properly provision tasks using them
• Random task failures: impossible to debug when instances are "Killed"; mostly the OOM killer, as we learned
• Log rolling: the UI doesn't handle log rolling when debugging; the CLI needs to be restarted in some cases for log rolling
• Docker containerizer: seemed very immature at the time; difficult to debug or get visibility into processes running in Docker
78. Ditch Distributed Erlang
• Pluggable membership service: build a pluggable membership service with an abstract interface, initially on EPMD, and migrate later once tested
• Adapt Lasp and the broadcast layer: integrate the pluggable membership service throughout the stack and liberate existing libraries from Distributed Erlang
• Build a service discovery mechanism: mechanize node discovery outside of EPMD based on the new membership service
85. Partisan (Membership Layer)
• Pluggable protocol membership layer: allows runtime configuration of the protocols used for cluster membership
• Several protocol implementations:
• Full membership via EPMD
• Full membership via TCP
• Client-server membership via TCP
• Peer-to-peer membership via TCP (with HyParView)
• Visualization: provides a force-directed, graph-based visualization engine for real-time cluster debugging
89. Partisan (Full via EPMD or TCP)
• Full membership: nodes have full visibility into the entire graph
• Failure detection: performed by peer-to-peer heartbeat messages with a timeout
• Limited scalability: the heartbeat interval increases as the node count increases, leading to false or delayed detection
• Testing: used to create the initial test suite for Partisan
93. Partisan (Client-Server Model)
• Client-server membership: the server has all peers in the system as peers; a client has only the server as a peer
• Failure detection: nodes heartbeat, with a timeout, all peers they are aware of
• Limited scalability: the server is a single point of failure, with limited visibility
• Testing: used for baseline evaluations as the "reference" architecture
96. Partisan (HyParView, default)
• Partial-view protocol: two views, active (fixed size) and passive (log n); the passive view is used to replace failed peers in the active view
• Failure detection: performed by monitoring active TCP connections to peers with keep-alive enabled
• Very scalable (10k+ nodes during academic evaluation): however, probabilistic; can potentially lead to isolated nodes during churn
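The failure-replacement behavior described above can be sketched like this (illustrative Python; `HyParViewNode` and its parameters are hypothetical, and the real protocol also involves join, neighbor, and shuffle messages that are omitted here):

```python
import random

class HyParViewNode:
    """Sketch: a small fixed-size active view backed by a larger passive view."""
    def __init__(self, active, passive, active_size=5):
        self.active = set(active)     # peers with open, monitored TCP connections
        self.passive = set(passive)   # backup peers, no open connections
        self.active_size = active_size

    def on_peer_failure(self, peer):
        # A monitored TCP connection dropped: evict the peer, then promote
        # random passive peers until the active view is full again.
        self.active.discard(peer)
        while len(self.active) < self.active_size and self.passive:
            replacement = random.choice(sorted(self.passive))
            self.passive.discard(replacement)
            self.active.add(replacement)

node = HyParViewNode(active={"a", "b", "c", "d", "e"}, passive={"x", "y"})
node.on_peer_failure("a")
assert "a" not in node.active and len(node.active) == 5
```

Because replacement draws from a randomized partial view rather than a global membership list, no node needs full visibility, which is what makes the protocol scale, at the cost of the probabilistic isolation risk noted above.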
  98. Plumtree (Broadcast Layer)
 • Optimization: Computes node-rooted spanning trees through the overlay network to reduce message redundancy.
 • Hybrid approach: Ideally tree-based dissemination; other nodes are flooded with metadata used to repair the tree during network partitions.
 • Configurable at runtime: Can be enabled or disabled at runtime without application changes: an optimization only.
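The eager/lazy split behind Plumtree can be illustrated like this: payloads travel along tree links, while the remaining links carry only "I have" metadata that a node can use to pull a missed message and repair the tree. A minimal Python sketch, with invented names and without the graft/prune repair logic of the real protocol:

```python
class PlumtreeNode:
    """Toy Plumtree-style node: messages are eagerly pushed along tree
    links ('eager' peers) and only announced as metadata ('IHAVE') on the
    remaining links."""

    def __init__(self, name):
        self.name = name
        self.eager = set()      # tree links: full payload sent
        self.lazy = set()       # non-tree links: metadata only
        self.received = {}      # msg_id -> payload
        self.announced = set()  # msg_ids known only via IHAVE

    def deliver(self, msg_id, payload, network):
        if msg_id in self.received:
            return  # duplicate delivery: idempotent
        self.received[msg_id] = payload
        self.announced.discard(msg_id)
        for peer in self.eager:
            network[peer].deliver(msg_id, payload, network)
        for peer in self.lazy:
            network[peer].ihave(msg_id)

    def ihave(self, msg_id):
        # Metadata only; in real Plumtree a timeout here triggers a graft
        # that pulls the payload and repairs the tree.
        if msg_id not in self.received:
            self.announced.add(msg_id)

network = {n: PlumtreeNode(n) for n in ["a", "b", "c"]}
network["a"].eager = {"b"}; network["a"].lazy = {"c"}
network["b"].lazy = {"c"}
network["a"].deliver("m1", "hello", network)
print(network["b"].received)   # {'m1': 'hello'}  (eager push)
print(network["c"].announced)  # {'m1'}           (lazy metadata only)
```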
  101. Lasp Integration
 • Membership layer: Configurable at runtime through the environment for any of the previously discussed models.
 • Dissemination layer: Also configurable: client/server, tree or no tree, causal-based or not, for efficiency.
 • No application modifications: These choices only affect runtime performance; they do not require rewriting the application, which is written to be safe in the weakest mode.
  104. Sprinter (Service Discovery)
 • Responsible for clustering tasks: Uses Partisan to cluster all nodes and ensure a connected overlay network; reads membership information from Marathon.
 • Node-local: Runs at each node and is responsible for taking actions to keep the graph connected: required for probabilistic protocols.
 • Membership-mode specific: Knows, based on the membership mode, how to properly cluster nodes, and enforces proper join behaviour.
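Sprinter's input is Marathon's task listing (`GET /v2/apps/<app>/tasks` in Marathon's REST API), from which it derives the peers to join. A hedged sketch of just the parsing step, in Python rather than Sprinter's Erlang, with an illustrative payload:

```python
def peers_from_marathon(tasks_payload):
    """Extract (host, port) peers from a Marathon tasks response.
    Tasks without an allocated port (e.g. still staging) are skipped."""
    peers = []
    for task in tasks_payload.get("tasks", []):
        host = task.get("host")
        ports = task.get("ports") or []
        if host and ports:
            peers.append((host, ports[0]))
    return sorted(peers)

# Shape follows Marathon's documented tasks response; values are made up.
payload = {
    "tasks": [
        {"id": "lasp.1", "host": "10.0.0.5", "ports": [31001]},
        {"id": "lasp.2", "host": "10.0.0.7", "ports": [31004]},
        {"id": "lasp.3", "host": "10.0.0.9", "ports": []},  # not yet bound
    ]
}
print(peers_from_marathon(payload))
# [('10.0.0.5', 31001), ('10.0.0.7', 31004)]
```

What Sprinter then does with this list depends on the membership mode: under full membership every peer is joined, while under HyParView only a seed subset is contacted.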
  108. Debugging Sprinter
 • S3 archival: Nodes periodically snapshot their membership view to S3 for analysis.
 • Elected node (or group) analyses: Periodically analyses the information in S3 for the following:
   • Isolated node detection: Identifies isolated nodes and takes corrective measures to repair the overlay.
   • Symmetric relationship verification: Ensures that if a node knows about another node, the relationship is symmetric: prevents “I know you, but you don’t know me.”
   • Periodic alerting: Alerts on disconnected graphs so external measures can be taken, if necessary.
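The isolation and symmetry checks reduce to simple graph predicates over the archived snapshots. A Python sketch under the assumption that each snapshot is a map from node to the set of peers it knows (function names are invented):

```python
def asymmetric_links(snapshots):
    """Return (a, b) pairs where a knows b but b does not know a:
    the 'I know you, but you don't know me' case."""
    bad = []
    for node, peers in snapshots.items():
        for peer in peers:
            if node not in snapshots.get(peer, set()):
                bad.append((node, peer))
    return sorted(bad)

def isolated_nodes(snapshots):
    """Nodes that know no peers and are known by no other node."""
    known_by_someone = set().union(*snapshots.values()) if snapshots else set()
    return sorted(n for n, peers in snapshots.items()
                  if not peers and n not in known_by_someone)

snaps = {
    "n1": {"n2"},
    "n2": set(),  # n1 knows n2, but not vice versa
    "n3": set(),  # fully isolated
}
print(asymmetric_links(snaps))  # [('n1', 'n2')]
print(isolated_nodes(snaps))    # ['n3']
```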
  114. Evaluation Strategy
 • Deployment and runtime configuration: Ability to deploy a cluster of nodes and configure simulations at runtime.
 • Each simulation:
   • Different application scenario: Each execution runs a different application scenario, selected at runtime via configuration.
   • Result aggregation: Results are aggregated and archived at the end of each execution.
   • Plot generation: Plots are generated automatically for each execution, and the results of multiple executions are aggregated.
 • Minimal coordination: Work must be performed with minimal coordination, as a single orchestrator is a scalability bottleneck for large applications.
  116. Node Clustering
 • Sprinter: Should automatically cluster the nodes into the desired topology based on information derived from Marathon (ports, IPs, etc.).
 • Ensure connectivity: Nodes may die and be restarted during execution, so we must ensure that the graph stays connected and that new nodes are added.
  119. Application Execution
 • Marathon: Tags tasks according to the desired behaviour of each task.
 • Environment: The environment is used to derive the behaviour of each task during execution.
 • Event generation: Nodes generate their own events: there is no central event executor; each node contains its own synthetic workload.
 • Instrumentation: Nodes instrument and log their own events for later aggregation of results.
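Driving per-task behaviour from the environment might look like the following. The variable names are purely illustrative (the actual deployment's variables are not given in the talk); the point is that each task derives its own workload without a central coordinator:

```python
import os

def workload_from_env(environ):
    """Derive a node's synthetic workload parameters from its environment,
    falling back to defaults when a variable is unset."""
    return {
        "mode": environ.get("SIM_MODE", "client"),
        "event_count": int(environ.get("SIM_EVENT_COUNT", "100")),
        "event_interval_ms": int(environ.get("SIM_EVENT_INTERVAL_MS", "1000")),
    }

# A real task would call workload_from_env(os.environ); here we pass a
# dict standing in for the environment Marathon injected.
print(workload_from_env({"SIM_MODE": "server", "SIM_EVENT_COUNT": "500"}))
# {'mode': 'server', 'event_count': 500, 'event_interval_ms': 1000}
```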
  124. Completion Detection
 • “Convergence structure”: An uninstrumented CRDT of grow-only sets containing counters that each node manipulates.
 • Simulates a workflow: Nodes use this structure to simulate a workflow for the experiment.
 • Event generation: When event generation finishes, a node toggles a boolean to signal completion.
 • Log aggregation: Completion triggers log aggregation.
 • Shutdown: Once log aggregation completes, nodes shut down.
 • External monitoring: When events complete execution, nodes automatically begin the next experiment.
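The convergence structure relies on the grow-only set's key property: merge is set union, so it converges regardless of delivery order or duplication, and "everyone is done" is just a subset test. A minimal Python sketch (the actual structure in Lasp is richer, nesting counters inside the sets):

```python
class GrowOnlySet:
    """Minimal grow-only set CRDT: merge is set union, so replicas that
    deliver the same updates converge regardless of order or duplicates."""

    def __init__(self):
        self.elements = set()

    def add(self, element):
        self.elements.add(element)

    def merge(self, other):
        merged = GrowOnlySet()
        merged.elements = self.elements | other.elements
        return merged

def experiment_done(completion_set, nodes):
    """True once every node's completion marker is in the set."""
    return set(nodes) <= completion_set.elements

nodes = ["n1", "n2"]
replica1, replica2 = GrowOnlySet(), GrowOnlySet()
replica1.add("n1")  # n1 finished generating its events
replica2.add("n2")  # n2 finished generating its events
converged = replica1.merge(replica2)
print(experiment_done(converged, nodes))  # True
```

Because the set only grows, a node observing completion can safely trigger log aggregation and shutdown: no later message can retract another node's marker.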
  127. Log Generation
 • Logs aggregated and generated: A central location performs log aggregation and generates plots from the log data.
 • Reproducible, for “verified artifacts”: Deterministic regeneration of log data into gnuplot artifacts, with permanent archival.
 • Repeated push/pull operation: Since it is Git-based, repeated push/pull/rebase is required to push logs: expensive, but it works for now.
  130. Learning Lasp
 • Single-node orchestration: bad: Not possible once you exceed a few nodes: message queues, memory, delays.
 • Partial views: Required: rely on transitive dissemination of information and partial network knowledge.
 • Results: Reduced Lasp's memory footprint to 75 MB; larger in practice for debugging.
  134. Learning Partisan
 • Fast churn isolates nodes: A repair mechanism is needed: random promotion of isolated nodes; mainly issues of symmetry.
 • FIFO across connections: FIFO holds per connection, but the protocol assumes it across all connections, leading to false disconnects.
 • Unrealistic system model: Per-message acknowledgements are needed for safety.
 • Pluggable protocols help debugging: Being able to switch to full membership or client-server helps separate protocol problems from application problems.
  137. Latest Results
 • Reproducibility at 500 nodes for full applications: Build and evaluate applications reproducibly at 500-node cluster sizes (possible at 1000 tasks over 140 nodes).
 • Limited financially and by Amazon: Larger evaluations are harder because we are limited financially (as a university) and by Amazon's instance limits.
 • Mean state reduction per client: Around a 100x improvement over our initial PaPoC 2016 evaluation results.
  141. Plat à emporter (Takeaways)
 • Visualizations are important! Graph performance, visualize your cluster: all of these lead to easier debugging.
 • Control changes: No Lasp PR is accepted without divergence, state-transmission, and overhead graphs.
 • Automation: Developers use graphs when they are easy to make: lower the barrier to generating them, and understand how changes alter system behaviour.
 • Make work easily testable: When you test locally and deploy globally, you need to make things easy to test, deploy, and evaluate (for good science, I say!)