Supercharging Marketo's Campaign Engine

Alternate title: Supercharging Marketo's Multi-Tenant Platform By Going Reactive

A case study by Marketo's Campaign team on how they built their next generation Marketing Campaign Processing Engine following Reactive System design principles. The system leverages various Akka modules and features - including but not limited to Akka Cluster, Cluster Sharding and Persistence - to build a near real-time, multi-tenant and stateful distributed system that is resilient and scalable.

The case study covers the following topics: a unique multi-cluster architecture that can be deployed as a combination of dispatcher clusters and executor clusters for horizontal scalability at both the cluster and the tenant level, and dynamic control of throughput, parallelism and fairness per tenant with exactly-once execution semantics.

Without being highly CPU intensive, the cluster can process more than 500 million campaigns per day with as few as 11 VMs. Compared with the homegrown legacy engine that ran on physical cores, the new Campaign Engine achieves 20 times the throughput per tenant.

Apurva Pawar

October 23, 2018

Transcript

  1. © 2018 Marketo, Inc. Campaign Engine Overview
    • Event-driven framework
    • Trigger set of actions from event listeners
    • More than 50 different actions
    • Customizable routing logic via event listeners
    • Filters to narrow targets
    • Branching conditions to define complex logic
  2. Marketo Key Terminology
    • Lead: Instance of a “Person” object
    • Action: One of 50+ operations that may act on a Lead
    • (Trigger) Campaign: A set of 1-N customer-defined Actions that occur on event listeners
    • Task: A set of customer-defined Actions to run on a single Lead (as defined in the Campaign)
  3. Marketo - Scale
    • >5K Active Trigger Campaigns (1 customer)
    • >100M Unique Lead Objects (1 customer)
    • >20B Trigger Campaign Jobs Executed Every Month
  4. Legacy System Details
    • Per group of ~200 tenants:
      ◦ 6 scheduler processes
      ◦ 3 dispatcher processes
      ◦ 36 executor processes
    • Simple round-robin dispatching
    • Time quantum for fairness
    • Additional partitions available for large scale customers
  5. Problem with Legacy System
    • Not leveraging multi-threading efficiently
    • Resources are only shared across a subset of customers
    • Underutilized DB resources
    • Hard to scale up independently
  6. The Mission
    A mission to look for technology to help us scale our campaign execution platform:
    • Scale horizontally
    • Handle fault-tolerance w/ self-healing
    • Create a responsive system
    • Provide fairness & elasticity to tenants
    • Handle back pressure
    • Ensure exactly-once execution
    • Preserve order for tasks to same target
  7. Landed on Akka Technology
    • Akka cluster (horizontal scalability)
    • Cluster shard auto-rebalance/re-create (fault tolerance)
    • Actor’s supervisor strategy (fault tolerance)
    • Actor Persistence (self healing)
    • At-Least-Once delivery (fault tolerance)
  8. Actor Domain-Driven Design (DDD)
    [Diagram: Tenant containing Enqueuer, Dequeuer, and Executor 1 ... Executor n]
  9. Fault Tolerance by Actor
    [Diagram: Tenant containing Enqueuer, Dequeuer, and Executor 1 ... Executor n; label: Apply Supervisor Strategy]
  10. Fairness
    [Diagram: tenants T1, T2, ... Tx, each with En, De, and executors 1 ... n; label: Give same execution bandwidth to tenant]
  11. Back Pressure Handling
    [Diagram: Tenant with Enqueuer, Dequeuer, and executors 1 ... n; labels: Push mode, Pull mode, Pull mode]
  12. Partial Order Preservation
    [Diagram: Tenant with Enqueuer, Dequeuer, and executors 1, 2 ... n; tasks shown: Task #1 (Jane), Task #2 (George), Task #3 (Jane)]
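
The partial ordering on slide 12 (tasks for the same target stay in order, tasks for different targets may run in parallel) can be illustrated with a small sketch. It assumes hash-based routing of targets to executors, which the slides do not confirm; `Task`, `executorFor`, and `assign` are hypothetical names.

```scala
object PartialOrderSketch {
  // Hypothetical task shape: an id plus the lead (target) it acts on.
  final case class Task(id: Int, target: String)

  // Assumption: a target is pinned to one executor by hashing its name,
  // so the same target always maps to the same executor index.
  def executorFor(target: String, executorCount: Int): Int =
    java.lang.Math.floorMod(target.hashCode, executorCount)

  // Split tasks into per-executor queues; arrival order is kept inside each queue.
  def assign(tasks: Seq[Task], executorCount: Int): Map[Int, Seq[Task]] =
    tasks.groupBy(t => executorFor(t.target, executorCount))
}
```

With this rule, Task #1 and Task #3 (both for Jane) always share a queue and keep their relative order, while Task #2 (George) is free to land on another executor.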
  13. Self-Healing w/ Actor Persistence
    [Diagram: Tenant containing Enqueuer, Dequeuer, and Executor 1 ... Executor n; labels: Persistent Actor, Journal & Snapshot]
  14. Runtime Topology
    [Diagram: Source; Traffic Controller with tenants T1, T2, T3; Execution Engines EE #1, EE #2, ... EE #n]
  15. Design Guidelines
    • Anticipate repeated messages
    • Handle idempotency when needed
  16. Cluster for Traffic Controller
    [Diagram: Node #1, Node #2, ... Node #n; Shard #1 holding Tenant #1 with En, De, and executors 1 ... n]
  17. Cluster for Execution Engine
    [Diagram: Node #1, Node #2, ... Node #n; Shard #1 holding Engine #1 with En, De, and executors 1 ... n]
  18. Horizontal Scalability for a Tenant
    [Diagram: Source; Traffic Controller Tenant; EE #1 ... EE #n; labels: Acquire/Release, Acquire/Release]
  19. Exactly-Once Execution
    [Diagram: Traffic Controller (Tenant #1 with En, De, 1 ... n) and Execution Engine (Engine #1 with En, De, 1 ... n); labels: At Least Once Delivery, Idempotency support]
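
Slide 19 combines at-least-once delivery with idempotency support to get exactly-once execution. A minimal sketch of the receiving side follows, assuming each task carries a unique id and that an in-memory set of seen ids is enough for illustration; `IdempotentExecutor` and `deliver` are hypothetical names, not the deck's implementation.

```scala
object ExactlyOnceSketch {
  // Hypothetical receiver that remembers which task ids it has already executed.
  final class IdempotentExecutor(run: String => Unit) {
    private val seen = scala.collection.mutable.Set.empty[String]

    // Returns true if the task ran now, false if this was a duplicate delivery.
    // Either way the caller can acknowledge, so the sender stops resending.
    def deliver(taskId: String): Boolean =
      if (seen.add(taskId)) { run(taskId); true } else false
  }
}
```

The sender may resend until it gets an acknowledgement; the receiver turns those repeats into no-ops, so the side effect happens once.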
  20. Runaway Task Recovery
    [Diagram: Source; tenant T1 with En, De, 1 ... n; Engine #1 with En, De, 1 ... n; label: Low frequency resend]
  21. Default Shard Allocation Strategy
    [Diagram: Nodes 1-5; legend: Node, Shard, JVM / Shard Region]
    • Treats all shards equal
  22. Default Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-5; legend: Node, Shard for DB1, JVM / Shard Region]
    • Sharding logic introduces detail that cannot be read by default strategy
  23. Default Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-5; legend: Node, Shard for DB1, JVM / Shard Region; label: Inefficient]
    • Starving for DB connections under high load
  24. Default Shard Allocation Strategy
    [Diagram: Nodes 1-5; legend: Node, Shard, JVM / Shard Region, New Shard]
    • New shard being added
  25. Default Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-5; legend: Node, Shard for DB1, JVM / Shard Region, New Shard]
    • Default allocation does not consider potential inefficiency
  26. Default Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-5; legend: Node, Shard for DB1, JVM / Shard Region; label: inefficient]
    • Starving for DB connections again. Need to spread out shards.
  27. Marketo Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-5; legend: Node, Shard for DB1, JVM / Shard Region]
    • Ideal allocation for use case
  28. Marketo Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-5; legend: Node, Shard for DB1, JVM / Shard Region, New Shard]
    • New “green” shard will avoid #1, #4 and #5
  29. Marketo Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-5; legend: Node, Shard for DB1, JVM / Shard Region, New Shard]
    • New “red” shard can go #1 or #2
  30. Marketo Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-5; legend: Node, Shard for DB1, JVM / Shard Region]
    • Unbalanced but efficient
  31. Marketo Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-6; legend: Node, Shard for DB1, JVM / Shard Region]
    • Ideal allocation with 6 nodes (spread out as much as possible)
  32. Marketo Shard Allocation Strategy with Marketo Sharding algorithm
    [Diagram: Nodes 1-4; legend: Node, Shard for DB1, JVM / Shard Region]
    • Ideal allocation with 4 nodes (equally inefficient)
  33. Custom Shard Allocation Strategy to the rescue
    • Default strategy might ignore additional detail/complexity introduced by a sharding algorithm
    • Evaluate replacing it with a custom implementation
    • Review:
      ◦ Clusters are sharded by primary DB resource
      ◦ Marketo’s akka sharding algorithm is designed to guard rail this DB resource
        ▪ ‘N’ shards per DB instance; ‘M’ DB instances
      ◦ DB resource pools are allocated per JVM / node (again, to guard rail)
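
The idea behind slides 27-33 (spread shards that share a DB across nodes, even at the cost of an uneven shard count) can be sketched as a plain selection rule. This is only an illustration under assumptions: it is not Akka's shard allocation API and not Marketo's implementation, and `Shard`, `allocate`, and the tie-breaking order are hypothetical.

```scala
object ShardAllocationSketch {
  // Hypothetical model: each shard belongs to exactly one DB instance.
  final case class Shard(id: String, db: String)

  // Choose a node for a new shard: prefer the node hosting the fewest shards of
  // the same DB, then the fewest shards overall, then the smallest node name.
  // Assumes `current` contains at least one node.
  def allocate(newShard: Shard, current: Map[String, Seq[Shard]]): String =
    current.toSeq
      .map { case (node, shards) => (shards.count(_.db == newShard.db), shards.size, node) }
      .min
      ._3
}
```

Because the same-DB count is compared first, a node that already holds more shards in total can still win, which matches the "unbalanced but efficient" outcome on slide 30.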
  34. Deployment and Uptime Concerns:
    • 15-20 mins for a cluster restart
    • Slow state recovery on a rebalance or a restart
    • Overall journal size was ~10GB/data center for akka persistence
  35. Too Many Persistent Actors
    [Diagram: Traffic Controller (Tenant with En, De, 1 ... n) and Execution Engine (Engine with En, De, 1 ... n)]
  36. Revised Persistence Approach
    [Diagram: Traffic Controller (Tenant with En, De, 1 ... n) and Execution Engine (Engine with En, De, 1 ... n)]
  37. Improvements on Deployment and Uptime
    • Takes 5-10 secs to restart
      ◦ Engine does not have to recover any state on a restart/rebalance
    • No need to “remember entities” for Engine
      ◦ Controller resend/retry results in lazy state recovery
    • Execution engine is now stateless
      ◦ No persistent actors
    • Allows for 0% downtime
      ◦ For execution engine side patches or releases
      ◦ With rolling upgrades
  38. Campaign Backpressure
    • Uses a persistent queue for buffer
    • Flow control protocol involves 2 modes:
      ◦ Push Mode
      ◦ Pull Mode
    • Implemented in two phases
      ◦ Phase 1: Modes were simple but inefficient
      ◦ Phase 2 is the improvement
  39. Backpressure implementation (Phase 1 / Push Mode)
    [Diagram: Engine with Enqueuer, Dequeuer, and executors 1 ... n; labels: Push task to queue; Send msg to Deq; Deq pulls from queue till capacity; Deq sends back msg to change to pull mode]
  40. Backpressure implementation (Phase 1 / Pull Mode)
    [Diagram: Engine with Enqueuer, Dequeuer, and executors 1 ... n; labels: Deq pulls from queue till queue is empty; Push task to queue; Send back msg to change to push mode]
  41. Why is this inefficient?
    • Enqueuer always pushes to queue
      ◦ DB write is involved
      ◦ Dequeuer always has to read from queue
    • Saving to queue and sending message are two different steps
      ◦ Steps can be combined
      ◦ Can directly send if dequeuer has capacity
      ◦ Task can be sent to the dequeuer within the message
  42. Backpressure implementation (Phase 2 / Push Mode)
    [Diagram: Engine with Enqueuer, Dequeuer, and executors 1 ... n; labels: Send msg containing tasks; Deq accepts till capacity; Deq sends back intention - “Ready for pull mode”; Deq consumes queue till queue empty; Push task to queue; Enq confirms, returns PullModeAck; Deq receives Ack]
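
The Phase 2 idea from slides 41-42 (send tasks directly while the dequeuer has capacity, and fall back to the persistent queue only when it is full) can be sketched as a single-threaded state machine. This is a simplification under assumptions: it leaves out the actor messaging and the “Ready for pull mode” / PullModeAck handshake, and `FlowControl`, `capacity`, and `queueWrites` are hypothetical names.

```scala
import scala.collection.mutable

object BackpressureSketch {
  sealed trait Mode
  case object Push extends Mode
  case object Pull extends Mode

  // Single-threaded stand-in for the Enqueuer/Dequeuer pair; `capacity` is a
  // hypothetical limit on the tasks the dequeuer holds in flight.
  final class FlowControl(capacity: Int) {
    var mode: Mode = Push
    var queueWrites = 0                          // writes to the persistent queue
    val inFlight = mutable.Queue.empty[String]   // tasks held by the dequeuer
    val queue = mutable.Queue.empty[String]      // stand-in for the persistent queue

    // Enqueuer side: send directly while the dequeuer has room, else buffer.
    def enqueue(task: String): Unit =
      if (mode == Push && inFlight.size < capacity) inFlight.enqueue(task)
      else { mode = Pull; queue.enqueue(task); queueWrites += 1 }

    // Dequeuer side: finish one task, refill from the queue, and return to
    // push mode once the queue has been drained.
    def completeOne(): Unit = {
      if (inFlight.nonEmpty) inFlight.dequeue()
      if (mode == Pull) {
        while (queue.nonEmpty && inFlight.size < capacity) inFlight.enqueue(queue.dequeue())
        if (queue.isEmpty) mode = Push
      }
    }
  }
}
```

In this model a queue write happens only when the dequeuer is saturated, which is the effect the deck reports as a large drop in queue writes.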
  43. Improving the backpressure “inflection point”
    • Drastically reduced writes to queue (99.84% reduction)
    [Chart labels: ~65K writes / min, ~100 writes / min]
  44. Making Parallel Execution Simple
    • Trust (but verify) the larger system
    • Avoid synchronization
    • Be immutable (or at least look that way)
    • Hold resources for as short a time as possible
    • Hold resources as long as needed
    • Loan pattern adds safety and helps visualize
  45. Legacy System
    [Diagram: timelines for Thread 1, Thread 2, ... Thread n; legend: Holding resource, Waiting for resource, Working without resource]
  46. First Try
    [Diagram: timelines for Thread 1, Thread 2, ... Thread n; legend: Holding resource, Waiting for resource, Working without resource]
  47. Second Try
    [Diagram: timelines for Thread 1, Thread 2, ... Thread n; legend: Holding resource, Waiting for resource, Working without resource]
  48. Second Try
    [Diagram: timelines for Thread 1, Thread 2, ... Thread n; legend: Getting resource, Holding resource, Waiting for resource, Working without resource]
  49. Final Strategy
    [Diagram: timelines for Thread 1, Thread 2, ... Thread n; legend: Holding resource, Waiting for resource, Working without resource]
  50. Updating a Record
    • Legacy Operations
      1. Freeze record
      2. Check the state
      3. Perform update
  51. Updating a Record
    • Pessimistic
      1. Freeze record
      2. Check the state
      3. Perform update
  52. Updating a Record
    • Pessimistic
      1. Freeze record
      2. Check the state
      3. Perform update
    • Optimistic
      1. Check the state
      2. Perform update assuming state is unchanged
      3. If this assumption fails, try again
  53. Optimistic Updates
    • Pessimistic
        BEGIN;
        SELECT favorite_color FROM person WHERE id = 10 FOR UPDATE;
        # favorite_color: "Blue"
        UPDATE person SET favorite_color = "Purple" WHERE id = 10;
        COMMIT;
  54. Optimistic Updates
    • Pessimistic
        BEGIN;
        SELECT favorite_color FROM person WHERE id = 10 FOR UPDATE;
        # favorite_color: "Blue"
        UPDATE person SET favorite_color = "Purple" WHERE id = 10;
        COMMIT;
    • Optimistic
        SELECT favorite_color FROM person WHERE id = 10;
        # favorite_color: "Blue"
        UPDATE person SET favorite_color = "Purple" WHERE id = 10 AND favorite_color = "Blue";
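
The optimistic sequence (check the state, update assuming it is unchanged, try again if the assumption fails) can be sketched as a retry loop. To stay self-contained, the sketch below swaps the database for an in-memory `ConcurrentHashMap`, whose three-argument `replace` plays the role of the guarded `UPDATE ... AND favorite_color = "Blue"`; `updateColor` and the retry limit are hypothetical.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.annotation.tailrec

object OptimisticUpdateSketch {
  // In-memory stand-in for the person table: id -> favorite_color.
  val person = new ConcurrentHashMap[Int, String]()

  // Read the current value, compute the new one, and write only if the stored
  // value is still the one we read. On a conflict, re-read and retry a few times.
  @tailrec
  def updateColor(id: Int, change: String => String, retriesLeft: Int = 3): Boolean = {
    val seen = person.get(id)
    if (seen == null) false                                  // no such row
    else if (person.replace(id, seen, change(seen))) true    // guard held, write applied
    else if (retriesLeft > 0) updateColor(id, change, retriesLeft - 1)
    else false                                               // gave up after repeated conflicts
  }
}
```

No record is frozen while the new value is computed; the cost is that a concurrent writer can force a retry.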
  55. Procedural Style
        import scala.collection.mutable.ListBuffer

        // Acquire the connection and statement by hand
        val conn = connectionPool.getConnection()
        val stmt = conn.createStatement()
        val rs = stmt.executeQuery("select name from person limit 10")
        // Collect the first column of every row
        val namesBuf = ListBuffer.empty[String]
        while (rs.next()) {
          namesBuf += rs.getString(1)
        }
        val namesCsv: String = namesBuf.mkString(",")
        // The caller must remember to release both resources
        stmt.close()
        conn.close()
  56. Loan Style
        val namesCsv: String = ConnectionPool.connection { conn =>
          conn.statement { stmt =>
            stmt.query("select name from person limit 10") { rs =>
              rs.getString(1)
            }.mkString(",")
          }
        }
    [Callout: The Connection is closed at the end of this closure]
  57. Loan Style
        val namesCsv: String = ConnectionPool.connection { conn =>
          conn.statement { stmt =>
            stmt.query("select name from person limit 10") { rs =>
              rs.getString(1)
            }.mkString(",")
          }
        }
  58. Loan Style
        val namesCsv: String = ConnectionPool.connection { conn =>
          conn.statement { stmt =>
            stmt.query("select name from person limit 10") { rs =>
              rs.getString(1)
            }
          }
        }.mkString(",")
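
The slides do not show how `ConnectionPool.connection` is built. A minimal sketch of a generic loan helper, assuming the resource is any `AutoCloseable`, could look like the following; `loan` and `FakeConnection` are hypothetical names used only for illustration.

```scala
object LoanPatternSketch {
  // Lend a resource to `body`, then close it even if `body` throws.
  def loan[R <: AutoCloseable, A](open: => R)(body: R => A): A = {
    val resource = open
    try body(resource) finally resource.close()
  }

  // Hypothetical resource that only records whether close() was called.
  final class FakeConnection extends AutoCloseable {
    var closed = false
    def names(): Seq[String] = Seq("Jane", "George")
    override def close(): Unit = closed = true
  }
}
```

For example, `loan(new FakeConnection)(_.names().mkString(","))` yields the joined names and the connection is closed as soon as the closure returns. Moving `.mkString` outside the closure, as slide 58 does, releases the resource before the string is built.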
  59. Performance Statistics
    • Legacy System runs on
      ◦ 3 physical machines, each with
        ▪ 40 Cores
        ▪ 32GB of RAM
    • Reactive System runs on
      ◦ 8 virtual machines, each with
        ▪ 4 Cores
        ▪ 8GB of RAM
    • Using ~¼ CPU and ~⅔ RAM we see a 20x increase in throughput per tenant
  60. Our Takeaways
    • What We Like
      ◦ Overall fault-tolerance mechanism
      ◦ Self-healing capability
      ◦ Lightweight runtime memory footprint
      ◦ Keep “parallelism” at design time
      ◦ Actor Domain-Driven Design
    • Future Improvements
      ◦ Adaptive execution bandwidth control
      ◦ Akka Streams in Execution Engine