Designing and Evaluating a Distributed Computing Language Runtime

Erlang User Conference 2016

Christopher Meiklejohn

September 09, 2016

Transcript

  1. Designing and Evaluating a Distributed Computing Language Runtime. Christopher Meiklejohn (@cmeik), Université catholique de Louvain, Belgium
  2. [Diagram: two replicas, RA and RB]

  3. [Diagram: set(1) arrives; both replicas store 1]

  4. [Diagram: after set(1), RA applies set(2) while RB concurrently applies set(3)]

  5. [Diagram: RA holds 2, RB holds 3; what value should each replica converge to?]

  6-8. Synchronization (bullets revealed incrementally) • To enforce an order: makes programming easier • To eliminate accidental nondeterminism: prevents race conditions • Techniques: locks, mutexes, semaphores, monitors, etc.
  9-10. Difficult Cases • "Internet of Things": low power, limited memory and connectivity • Mobile gaming: offline operation with replicated, shared state
  11-15. Weak Synchronization • Can we achieve anything without synchronization? Not really. • Strong Eventual Consistency (SEC): "Replicas that deliver the same updates have equivalent state" • Primary requirement: eventual replica-to-replica communication • Order insensitive! (Commutativity) • Duplicate insensitive! (Idempotence)
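A minimal sketch of why these two properties give convergence (my illustration, not from the talk): if merge is a join such as set union, replicas reach the same state regardless of the order or number of times the same updates are delivered.

      %% Union is commutative, associative, and idempotent, so replicas
      %% that deliver the same updates converge to the same state.
      -module(gset_sketch).
      -export([demo/0]).

      merge(A, B) -> ordsets:union(A, B).

      demo() ->
          U1 = ordsets:from_list([1]),
          U2 = ordsets:from_list([2]),
          %% RA delivers u1 then u2; RB delivers u2, u1, then u1 again.
          RA = merge(merge(ordsets:new(), U1), U2),
          RB = merge(merge(merge(ordsets:new(), U2), U1), U1),
          true = (RA =:= RB),   %% both converge to [1,2]
          RA.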
  16. [Diagram: two replicas, RA and RB]

  17. [Diagram: set(1) arrives; both replicas store 1]

  18. [Diagram: RA applies set(2) while RB concurrently applies set(3)]

  19. [Diagram: both replicas apply max(2,3) and converge to 3]
  20. How can we succeed with Strong Eventual Consistency?

  21-24. Programming SEC 1. Eliminate accidental nondeterminism (ex. deterministic, modeling non-monotonic operations monotonically) 2. Retain the properties of functional programming (ex. confluence, referential transparency over composition) 3. Distributed, fault-tolerant runtime (ex. replication, membership, dissemination)
  25. Convergent Objects: Conflict-Free Replicated Data Types (SSS 2011)

  26-27. Conflict-Free Replicated Data Types • Many types exist with different properties: sets, counters, registers, flags, maps, graphs • Strong Eventual Consistency: instances satisfy the SEC property per object
  28-32. [Diagram series: an observed-remove set across three replicas. add(1) at RA produces state (1, {a}, {}); a concurrent add(1) at RC produces (1, {c}, {}); remove(1) at RC then produces (1, {c}, {c}). After merging, all three replicas hold (1, {a, c}, {c}): the add tag a was never removed, so every replica converges to the value {1}]
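A compact reconstruction of the observed-remove set shown above (my sketch, not the talk's code): every add is tagged with a unique token, and a remove tombstones only the tokens it has observed, so the concurrent add at RA survives RC's remove.

      %% State per element: {AddTags, RemoveTags}. An element is present
      %% if some add tag has not been tombstoned.
      -module(orset_sketch).
      -export([demo/0]).

      add(Elem, Tag, State) ->
          {Adds, Rems} = maps:get(Elem, State, {[], []}),
          State#{Elem => {lists:usort([Tag | Adds]), Rems}}.

      remove(Elem, State) ->
          %% Tombstone only the locally observed add tags.
          {Adds, Rems} = maps:get(Elem, State, {[], []}),
          State#{Elem => {Adds, lists:usort(Adds ++ Rems)}}.

      merge(S1, S2) ->
          maps:merge_with(fun(_K, {A1, R1}, {A2, R2}) ->
              {lists:usort(A1 ++ A2), lists:usort(R1 ++ R2)}
          end, S1, S2).

      value(State) ->
          [E || {E, {Adds, Rems}} <- maps:to_list(State), Adds -- Rems =/= []].

      demo() ->
          RA = add(1, a, #{}),            %% add(1) at RA: (1, {a}, {})
          RC = remove(1, add(1, c, #{})), %% add(1), remove(1) at RC: (1, {c}, {c})
          Merged = merge(RA, RC),         %% (1, {a, c}, {c})
          [1] = value(Merged),            %% tag a survives, so 1 remains
          Merged.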
  33. Programming SEC (agenda repeated) 1. Eliminate accidental nondeterminism 2. Retain the properties of functional programming 3. Distributed, fault-tolerant runtime
  34. Convergent Programs: Lattice Processing (PPDP 2015)

  35-37. Lattice Processing (Lasp) • Distributed dataflow: declarative, functional programming model • Convergent data structures: the primary data abstraction is the CRDT • Enables composition: provides functional composition of CRDTs that preserves the SEC property
  38-42. (The same Lasp program, repeated across five slides as the speaker steps through it:)

      %% Create initial set.
      S1 = declare(set),
      %% Add elements to initial set and update.
      update(S1, {add, [1,2,3]}),
      %% Create second set.
      S2 = declare(set),
      %% Apply map operation between S1 and S2.
      map(S1, fun(X) -> X * 2 end, S2).
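For context: as I understand the released Lasp API, these calls carry a bit more ceremony than the slides show (updates also take an actor identifier, and values are read back with a query call); the slides elide those arguments. Once both updates propagate, reading S2 would yield {2, 4, 6}.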
  43. Programming SEC (agenda repeated) 1. Eliminate accidental nondeterminism 2. Retain the properties of functional programming 3. Distributed, fault-tolerant runtime
  44. Distributed Runtime: Selective Hearing (W-PSDS 2015)

  45-49. Selective Hearing • Epidemic-broadcast-based runtime system: a runtime system that scales to large numbers of nodes, is resilient to failures, and provides efficient execution • Well matched to Lattice Processing (Lasp) • Epidemic broadcast mechanisms provide weak ordering but are resilient and efficient • Lasp's programming model is tolerant to message re-ordering, disconnections, and node failures • "Selective Receive": nodes selectively receive and process messages based on interest
  50-53. Layered Approach • Membership: configurable membership protocol that can operate in client-server or peer-to-peer mode • Broadcast (via gossip, tree, etc.): efficient dissemination of both program state and application state via gossip, broadcast tree, or a hybrid mode • Auto-discovery: integration with Mesos; auto-discovery of Lasp nodes for ease of configuration
  54-59. [Diagram series: a membership overlay with a broadcast overlay layered on top; example participants include a mobile phone, a distributed hash table, and Lasp execution]
  60. Programming SEC (agenda repeated) 1. Eliminate accidental nondeterminism 2. Retain the properties of functional programming 3. Distributed, fault-tolerant runtime
  61. What can we build? Advertisement Counter

  62-63. Advertisement Counter • Mobile game platform selling advertisement space: advertisements are paid according to a minimum number of impressions • Clients will go offline: clients have limited connectivity, and the system still needs to make progress while clients are offline
  64-72. [Diagram series: the advertisement counter dataflow, highlighted step by step. Ads and Contracts are user-maintained CRDTs; Lasp operations (product, filter, union, read) derive an "ads with contracts" collection from the Rovio and Riot ad sets; each ad counter (e.g., Rovio Ad Counter 1) is replicated with a single copy at each client, where it is incremented; when a read shows a counter reaching the 50,000-impression threshold, the ad is removed from circulation]
  73. Evaluation: Initial Evaluation

  74-77. Background: Distributed Erlang • Transparent distribution: built in, provided by Erlang/BEAM; cross-node message passing • Known scalability limitations: analyzed in various academic publications • Single connection: head-of-line blocking • Full membership: all-to-all failure detection with heartbeats and timeouts
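To put the full-membership cost in perspective: with N nodes, all-to-all heartbeating sends on the order of N(N-1) messages per interval, so 100 nodes exchange roughly 10,000 heartbeats per round and 1,000 nodes exchange roughly a million.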
  78-79. Background: Erlang Port Mapper Daemon • Operates on a known port: similar to Solaris sunrpc-style portmap, a known port for mapping to dynamic port-based services • Bridged networking: problematic for clusters in bridged networking with dynamic port allocation
  80-83. Experiment Design • Single application: the advertisement counter example from Rovio Entertainment • Runtime configuration: application controlled through runtime environment variables • Membership: full membership with Distributed Erlang via EPMD • Dissemination: state-based object dissemination through an anti-entropy protocol (fanout-based, PARC-style)
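A sketch of the fanout-based anti-entropy loop this describes (my illustration, not the project's code): each node periodically picks a few random peers and ships its full object state, and receivers join the incoming state into their own replica.

      %% Push full state to Fanout random peers every second; merge
      %% (join) whatever arrives. Union stands in for the CRDT's join.
      -module(anti_entropy_sketch).
      -export([loop/3]).

      loop(State, Peers, Fanout) ->
          receive
              {state, Remote} ->
                  loop(merge(State, Remote), Peers, Fanout)
          after 1000 ->
              [Peer ! {state, State} || Peer <- pick(Fanout, Peers)],
              loop(State, Peers, Fanout)
          end.

      pick(N, Peers) ->
          Shuffled = [P || {_, P} <- lists:sort([{rand:uniform(), P} || P <- Peers])],
          lists:sublist(Shuffled, N).

      merge(A, B) -> ordsets:union(A, B).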
  84-87. Experiment Orchestration • Docker and Mesos with Marathon: used for deployment of both EPMD and the Lasp application • Single EPMD instance per slave: controlled through host networking and HOSTNAME:UNIQUE constraints in Mesos • Lasp: local execution using host networking; connects to the local EPMD • Service discovery: facilitated by clustering EPMD instances through Sprinter
  88-89. Ideal Experiment • Local deployment: high thread concurrency when operating with a lower node count • Cloud deployment: low thread concurrency when operating with a higher node count
  90. Results: Initial Evaluation

  91-94. Initial Evaluation • Moved to DC/OS exclusively: the environments were too different; too much work was needed to adapt things to work correctly • Single orchestration task: dispatched events, controlled when to start and stop the evaluation, and performed log aggregation • Bottleneck: events were dispatched immediately, which would require blocking for processing acknowledgment • Unrealistic: events do not queue up all at once for processing by the client
  95-98. Lasp Difficulties • Too expensive: 2.0 CPUs and 2048 MiB of memory • Weeks spent adding instrumentation: process-level, VM-level, and Erlang Observer instrumentation to identify CPU- and memory-heavy processes • Dissemination too expensive: 1000 threads to a single dissemination process (one Mesos task) leads to backed-up message queues and memory leaks • Unrealistic: two different dissemination mechanisms, thread-to-thread and node-to-node; one is synthetic
  99-101. EPMD Difficulties • Nodes become unregistered: nodes randomly unregistered with EPMD during execution • Lost connections: EPMD loses connections with nodes for some arbitrary reason • EPMD task restarted by Mesos: restarted for an unknown reason, which leads Lasp instances to restart in their own containers
  102-104. Overhead Difficulties • Too much state: a client would ship around 5 GiB of state within 90 seconds • Delta dissemination: delta dissemination only provides around a 30% decrease in state transmission • Unbounded queues: message buffers would lead to VMs crashing because of large memory consumption
  105. Evaluation: Rearchitecture

  106-108. Ditch Distributed Erlang • Pluggable membership service: build a pluggable membership service with an abstract interface, initially on EPMD, and migrate later once tested • Adapt Lasp and the broadcast layer: integrate the pluggable membership service throughout the stack and liberate existing libraries from Distributed Erlang • Build a service discovery mechanism: mechanize node discovery outside of EPMD based on the new membership service
  109-115. Partisan (Membership Layer) • Pluggable membership protocol layer: allows runtime configuration of the protocols used for cluster membership • Several protocol implementations: full membership via EPMD; full membership via TCP; client-server membership via TCP; peer-to-peer membership via TCP (with HyParView) • Visualization: a force-directed graph-based visualization engine for real-time cluster debugging
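Selecting a protocol is intended to be a configuration choice rather than a code change. A sketch of what that looks like (module names as I recall them from Partisan's documentation of the period; treat the exact spellings as an assumption):

      %% Illustrative: pick the HyParView-based manager at startup,
      %% then cluster by joining a known peer.
      application:set_env(partisan, partisan_peer_service_manager,
                          partisan_hyparview_peer_service_manager),
      partisan_peer_service:join(Node).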
  116-119. Partisan (Full via EPMD or TCP) • Full membership: nodes have full visibility into the entire graph • Failure detection: performed by peer-to-peer heartbeat messages with a timeout • Limited scalability: the heartbeat interval increases as the node count increases, leading to false or delayed detection • Testing: used to create the initial test suite for Partisan
  120-123. Partisan (Client-Server Model) • Client-server membership: the server has all nodes in the system as peers; a client has only the server as a peer • Failure detection: nodes heartbeat, with a timeout, all peers they are aware of • Limited scalability: the server is a single point of failure, with limited visibility • Testing: used for baseline evaluations as the "reference" architecture
  124-126. Partisan (HyParView, default) • Partial-view protocol: two views, active (fixed size) and passive (log n); the passive view is used to replace failed members of the active view • Failure detection: performed by monitoring active TCP connections to peers, with keep-alive enabled • Very scalable (10k+ nodes during academic evaluation): however, it is probabilistic and can lead to isolated nodes during churn
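The failure-replacement rule can be sketched as follows (a simplification of HyParView's neighbor promotion, not Partisan's actual code): when an active peer's connection drops, promote a random passive peer into the active view.

      %% On losing an active peer, replace it from the passive view.
      handle_peer_failure(Failed, Active0, Passive) ->
          Active = lists:delete(Failed, Active0),
          case Passive of
              [] ->
                  {Active, Passive};
              _  ->
                  Candidate = lists:nth(rand:uniform(length(Passive)), Passive),
                  %% Real HyParView sends NEIGHBOR requests with priorities
                  %% and retries on refusal; we promote optimistically here.
                  {[Candidate | Active], lists:delete(Candidate, Passive)}
          end.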
  127-129. Sprinter (Service Discovery) • Responsible for clustering tasks: uses Partisan to cluster all nodes and ensure a connected overlay network; reads information from Marathon • Node-local: operates at each node and is responsible for taking actions to keep the graph connected; required for probabilistic protocols • Membership-mode specific: knows, based on the membership mode, how to properly cluster nodes and enforces proper join behaviour
  130-134. Debugging Sprinter • S3 archival: nodes periodically snapshot their membership view for analysis • Elected node (or group) analysis: periodically analyses the information in S3 for the following • Isolated-node detection: identifies isolated nodes and takes corrective measures to repair the overlay • Symmetric-relationship verification: ensures that if a node knows about another node, the relationship is symmetric, preventing "I know you, but you don't know me" • Periodic alerting: alerts on disconnected graphs so external measures can be taken, if necessary
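The symmetry check reduces to one pass over the archived snapshots (an illustration, not Sprinter's code): for every edge A -> B in the collected views, the reverse edge B -> A must exist.

      %% Views is a map of Node => [KnownPeers]. Returns every asymmetric
      %% edge: A knows B, but B does not know A.
      asymmetric_edges(Views) ->
          [{A, B} || {A, Peers} <- maps:to_list(Views),
                     B <- Peers,
                     not lists:member(A, maps:get(B, Views, []))].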
  135. Evaluation: Next Evaluation

  136-141. Evaluation Strategy • Deployment and runtime configuration: the ability to deploy a cluster of nodes and configure simulations at runtime • Each simulation: • Different application scenario: uniquely execute a different application scenario based on runtime configuration • Result aggregation: aggregate results at the end of execution and archive them • Plot generation: automatically generate plots for the execution and aggregate the results of multiple executions • Minimal coordination: work must be performed with minimal coordination, as a single orchestrator is a scalability bottleneck for large applications
  142-147. Completion Detection • "Convergence structure": an uninstrumented CRDT of grow-only sets containing counters that each node manipulates • Simulates a workflow: nodes use this structure to simulate a lock-step workflow for the experiment • Event generation: completing event generation toggles a boolean for the node to signal completion • Log aggregation: completion triggers log aggregation • Shutdown: upon log aggregation completion, nodes shut down • External monitoring: when events complete execution, nodes automatically begin the next experiment
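The convergence structure can be pictured as a grow-only set of per-node completion markers (a sketch of the idea, not the experiment harness itself): each node adds its own marker when it finishes a phase, and everyone advances once the merged set covers all nodes.

      %% Grow-only set of nodes that have completed the current phase.
      mark_done(Node, Done) -> ordsets:add_element(Node, Done).

      merge(D1, D2) -> ordsets:union(D1, D2).

      all_done(AllNodes, Done) ->
          ordsets:is_subset(ordsets:from_list(AllNodes), Done).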
  148. Results: Next Evaluation

  149-151. Results: Lasp • Single-node orchestration: bad; not possible once you exceed a few nodes (message queues, memory, delays) • Partial views: required; rely on transitive dissemination of information and partial network knowledge • Results: reduced the Lasp memory footprint to 75 MB; larger in practice for debugging
  152-155. Results: Partisan • Fast churn isolates nodes: a repair mechanism is needed (random promotion of isolated nodes); mainly issues of symmetry • FIFO across connections: the protocol assumes FIFO across all connections, not per connection, leading to false disconnects • Unrealistic system model: per-message acknowledgements are needed for safety • A pluggable protocol helps debugging: switching to full membership or client-server helps separate protocol problems from application problems
  156-158. Latest Results • Reproducibility at 300 nodes for full applications: connectivity holds, but transient partitions and isolated nodes appear at 500-1000 nodes (across 140 instances) • Limited financially and by Amazon: larger evaluations are harder to run because we are limited financially (as a university) and by Amazon instance limits • Mean state reduction per client: around a 100x improvement over our PaPoC 2016 initial evaluation results
  159-162. Takeaways (Plat à emporter) • Visualizations are important! Graph performance and visualize your cluster: all of these things lead to easier debugging • Control changes: no Lasp PR is accepted without divergence, state-transmission, and overhead graphs • Automation: developers use graphs when they are easy to make; lower the difficulty of generation and understand how changes alter system behaviour • Make work easily testable: when you test locally and deploy globally, you need to make things easy to test, deploy, and evaluate (for good science, I say!)
  163. Christopher Meiklejohn (@cmeik) http://www.lasp-lang.org http://github.com/lasp-lang Thanks!