Supercharging Marketo's Campaign Engine

Apurva Pawar, Daniel Pugliese, Dennis Bronnikov, Pei-Chiang Ma
October 2018

Introduction

Campaign Engine Overview
• Event-driven framework
• Trigger set of actions from event listeners
• More than 50 different actions
• Customizable routing logic via event listeners
• Filters to narrow targets
• Branching conditions to define complex logic
© 2018 Marketo, Inc.

Marketo Key Terminology
• Lead: Instance of a “Person” object
• Action: One of 50+ operations that may act on a Lead
• (Trigger) Campaign: A set of 1-N customer-defined Actions that occur on event listeners
• Task: A set of customer-defined Actions to run on a single Lead (as defined in the Campaign)

Marketo - Scale
>5K Active Trigger Campaigns (1 customer)
>100M Unique Lead Objects (1 customer)
>20B Trigger Campaign Jobs Executed Every Month

Legacy System Details
• Per group of ~200 tenants:
  ◦ 6 scheduler processes
  ◦ 3 dispatcher processes
  ◦ 36 executor processes
• Simple round-robin dispatching
• Time quantum for fairness
• Additional partitions available for large scale customers

Problem with Legacy System
• Not leveraging multi-threading efficiently
• Resources are only shared across a subset of customers
• Underutilized DB resources
• Hard to scale up independently

The Journey Starts...

The Mission
A mission to look for technology to help us scale our campaign execution platform:
• Scale horizontally
• Handle fault-tolerance w/ self-healing
• Create a responsive system
• Provide fairness & elasticity to tenants
• Handle back pressure
• Ensure exactly-once execution
• Preserve order for tasks to the same target

Landed on Akka Technology
• Akka cluster (horizontal scalability)
• Cluster shard auto-rebalance/re-create (fault tolerance)
• Actor’s supervisor strategy (fault tolerance)
• Actor Persistence (self healing)
• At-Least-Once delivery (fault tolerance)

Actor Design

Actor Domain-Driven Design (DDD)
[Diagram: Tenant ... with Enqueuer, Dequeuer, and Executor 1 ... Executor n]

Fault Tolerance by Actor
[Diagram: Tenant ...; label: Apply Supervisor Strategy]

Fairness
Give same execution bandwidth to tenant
[Diagram: tenants T1, T2 ... Tx, each with Enqueuer (En), Dequeuer (De), and executors 1 ... n]
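The fairness idea on this slide, the same execution bandwidth for every tenant, can be illustrated with a small stand-alone model. This is only a sketch under assumptions: the deck does not show how bandwidth is enforced, and `FairDispatch`, `round`, and the per-round `quota` are hypothetical names introduced here.

```scala
import scala.collection.immutable.Queue

object FairDispatch {
  // Hypothetical model: each tenant owns a FIFO queue of task names.
  type TenantQueues = Map[String, Queue[String]]

  /** One scheduling round: take at most `quota` tasks from every tenant,
    * so a tenant with a long backlog cannot crowd out the others. */
  def round(queues: TenantQueues, quota: Int): (List[(String, String)], TenantQueues) = {
    // Visit tenants in a stable (name) order and pick up to `quota` tasks from each.
    val picked = queues.toList.sortBy(_._1).flatMap { case (tenant, q) =>
      q.take(quota).toList.map(task => tenant -> task)
    }
    // Whatever was not picked stays queued for the next round.
    val remaining = queues.map { case (tenant, q) => tenant -> q.drop(quota) }
    (picked, remaining)
  }
}
```

With a quota of 2, a tenant holding four queued tasks and a tenant holding one each get their turn in the same round; the larger backlog simply takes more rounds to drain.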

Elasticity
[Diagram: Tenant +/-]

Back Pressure Handling
[Diagram: Tenant ... with Enqueuer, Dequeuer, and executors 1 ... n; labels: Push mode, Pull mode, Pull mode]

Partial Order Preservation
[Diagram: Tenant ... with Enqueuer, Dequeuer, and executors 1 ... n; tasks: Task #1 (Jane), Task #2 (George), Task #3 (Jane)]
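Partial order here means that tasks for the same target (Jane's Task #1 and Task #3) must run in arrival order, while George's task may run independently. One common way to get this property is to route every task for a given lead to the same executor. The sketch below assumes that approach; the deck does not state how routing is actually done, and `PartialOrder`, `Task`, `executorFor`, and `lanes` are hypothetical names.

```scala
object PartialOrder {
  // Hypothetical task shape: an id plus the lead it targets.
  final case class Task(id: Int, lead: String)

  /** Executor index for a lead: the same lead always maps to the same executor. */
  def executorFor(lead: String, executors: Int): Int =
    Math.floorMod(lead.hashCode, executors)

  /** Splits an arrival-ordered task list into per-executor lanes.
    * groupBy keeps the original order inside each lane, so tasks for
    * one lead stay ordered; tasks for different leads may interleave. */
  def lanes(tasks: List[Task], executors: Int): Map[Int, List[Task]] =
    tasks.groupBy(t => executorFor(t.lead, executors))
}
```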

Self-Healing w/ Actor Persistence
[Diagram: Tenant ... as a Persistent Actor with Journal & Snapshot]

Cluster Design

Task Execution Data Flow
[Diagram: Source; Multitenant, Fairness, Elasticity, Partial Order Preservation, ...]

Runtime Topology
[Diagram: Source; Traffic Controller with T1, T2, T3; Execution Engine with EE #1, EE #2 ... EE #n]

Design Guidelines
• Anticipate repeated messages
• Handle idempotency when needed

Cluster for Traffic Controller
[Diagram: Node #1, Node #2 ... Node #n; Shard #1 with Tenant #1 ... (En, De, executors 1 ... n)]

Cluster for Execution Engine
[Diagram: Node #1, Node #2 ... Node #n; Shard #1 with Engine #1 ... (En, De, executors 1 ... n)]

Horizontal Scalability for a Tenant
[Diagram: Source; Traffic Controller with a Tenant; EE #1 ... EE #n; labels: Acquire/Release, Acquire/Release]

Exactly-Once Execution
[Diagram: Traffic Controller (Tenant #1 ... with En, De, 1 ... n) and Execution Engine (Engine #1 ... with En, De, 1 ... n); labels: At Least Once Delivery, Idempotency support]
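At-least-once delivery means the receiving side may see the same task more than once, so it takes idempotency on the receiver to make the net effect a single execution. The following is a minimal in-memory sketch of that combination, not the deck's implementation: `IdempotentExecutor`, `Engine`, and `deliver` are hypothetical names, and a real system would need a durable record of completed task ids rather than an in-process set.

```scala
object IdempotentExecutor {
  /** Receives possibly repeated deliveries and runs each task id at most once. */
  final class Engine(run: String => Unit) {
    // Assumption: completed ids fit in memory; this stands in for a durable store.
    private val done = scala.collection.mutable.Set.empty[String]

    /** Returns true if the task ran now, false if it was a duplicate delivery. */
    def deliver(taskId: String): Boolean = synchronized {
      if (done.contains(taskId)) false
      else {
        run(taskId)    // if this throws, the id is not recorded and a resend can retry
        done += taskId
        true
      }
    }
  }
}
```

Recording the id only after `run` succeeds keeps a failed attempt eligible for the sender's retry, which is what lets resends double as recovery.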

Runaway Task Recovery
[Diagram: Source; T1 ... and Engine #1 ..., each with En, De, 1 ... n; label: Low frequency resend]

Experiences with Akka

Shard Allocation Strategy

Default Shard Allocation Strategy
• Treats all shards equal
[Diagram: Node 1 to Node 5; legend: Node, Shard, JVM / Shard Region]

Default Shard Allocation Strategy with Marketo Sharding algorithm
• Sharding logic introduces detail that cannot be read by default strategy
[Diagram: Node 1 to Node 5; legend: Node, Shard for DB1, JVM / Shard Region]

• Starving for DB connections under high load
[Diagram: Node 1 to Node 5 with shards for DB1; label: Inefficient]

Default Shard Allocation Strategy
• New shard being added
[Diagram: nodes and shards; label: New Shard]

• Default allocation does not consider potential inefficiency
[Diagram: Node 1 to Node 5 with shards for DB1; label: New Shard]

• Starving for DB connections again. Need to spread out shards.
[Diagram: Node 1 to Node 5 with shards for DB1; label: inefficient]

Marketo Shard Allocation Strategy with Marketo Sharding algorithm
• Ideal allocation for use case
[Diagram: nodes with shards for DB1; legend: Node, Shard for DB1, JVM / Shard Region]

Marketo Shard Allocation Strategy with Marketo Sharding algorithm
• New “green” shard will avoid #1, #4 and #5
[Diagram: nodes with shards for DB1; label: New Shard]

Marketo Shard Allocation Strategy with Marketo Sharding algorithm
• New “red” shard can go to #1 or #2
[Diagram: nodes with shards for DB1; label: New Shard]

Marketo Shard Allocation Strategy with Marketo Sharding algorithm
• Unbalanced but efficient
[Diagram: nodes with shards for DB1]

Marketo Shard Allocation Strategy with Marketo Sharding algorithm
• Ideal allocation with 6 nodes (spread out as much as possible)
[Diagram: Node 1 to Node 6 with shards for DB1]

Marketo Shard Allocation Strategy with Marketo Sharding algorithm
• Ideal allocation with 4 nodes (equally inefficient)
[Diagram: Node 1 to Node 4 with shards for DB1]

Custom Shard Allocation Strategy to the rescue
• Default strategy might ignore additional detail/complexity introduced by a sharding algorithm
• Evaluate replacing it with a custom implementation
• Review:
  ◦ Clusters are sharded by primary DB resource
  ◦ Marketo’s Akka sharding algorithm is designed to guardrail this DB resource
    ▪ ‘N’ shards per DB instance; ‘M’ DB instances
  ◦ DB resource pools are allocated per JVM / node (again, as a guardrail)
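The placement rule suggested by these slides, spreading shards that guard the same DB across as many nodes as possible so that no single node's connection pool for that DB is starved, can be written as a pure function. This is a sketch under assumptions and not Marketo's strategy: `DbAwareAllocation`, `Shard`, and `allocate` are hypothetical, and nodes are plain strings. In Akka itself such logic would sit behind the cluster sharding `ShardAllocationStrategy` extension point, which is not used here so that the example stays dependency-free.

```scala
object DbAwareAllocation {
  /** Hypothetical shard id: the DB instance it guards plus an index within that DB. */
  final case class Shard(db: String, index: Int)

  /** Picks the node hosting the fewest shards of `shard.db`.
    * Ties go to the node with the fewest shards overall, then to the
    * lexicographically smallest node name so the choice is deterministic. */
  def allocate(shard: Shard, current: Map[String, Set[Shard]]): String = {
    require(current.nonEmpty, "at least one node is needed")
    current.toList
      .map { case (node, shards) => (shards.count(_.db == shard.db), shards.size, node) }
      .min
      ._3
  }
}
```

A strategy that only counts shards per node, as the slides describe the default behaviour, would treat all nodes with the same total as equivalent; ranking by same-DB count first is what allows an "unbalanced but efficient" layout.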

Cluster Uptime & Deployment

Deployment and Uptime Concerns:
• 15-20 mins for a cluster restart
• Slow state recovery on a rebalance or a restart
• Overall journal size was ~10GB/data center for Akka persistence

Too Many Persistent Actors
[Diagram: Traffic Controller (Tenant ... with En, De, 1 ... n) and Execution Engine (Engine ... with En, De, 1 ... n)]

Revised Persistence Approach
[Diagram: Traffic Controller (Tenant ... with En, De, 1 ... n) and Execution Engine (Engine ... with En, De, 1 ... n)]

Journal Size (Redis screenshots)
Before vs. After: 99.6% Reduction

Improvements on Deployment and Uptime
• Takes 5-10 secs to restart
  ◦ Engine does not have to recover any state on a restart/rebalance
• No need to “remember entities” for Engine
  ◦ Controller resend/retry results in lazy state recovery
• Execution engine is now stateless
  ◦ No persistent actors
• Allows for 0% downtime
  ◦ For execution engine side patches or releases
  ◦ With rolling upgrades

Back Pressure Handling

Campaign Backpressure
• Uses a persistent queue as a buffer
• Flow control protocol involves 2 modes:
  ◦ Push Mode
  ◦ Pull Mode
• Implemented in two phases
  ◦ Phase 1: Modes were simple but inefficient
  ◦ Phase 2 is the improvement

Backpressure implementation (Phase 1 / Push Mode)
[Diagram: Enqueuer, Dequeuer, executors 1 ... n, Engine ...; labels:]
• Push task to queue
• Send msg to Deq
• Deq pulls from queue till capacity
• Deq sends back msg to change to pull mode

Backpressure implementation (Phase 1 / Pull Mode)
[Diagram: Enqueuer, Dequeuer, executors 1 ... n, Engine ...; labels:]
• Deq pulls from queue till queue is empty
• Push task to queue
• Send back msg to change to push mode

Why is this inefficient?
• Enqueuer always pushes to queue
  ◦ DB write is involved
  ◦ Dequeuer always has to read from queue
• Saving to queue and sending message are two different steps
  ◦ Steps can be combined
  ◦ Can directly send if dequeuer has capacity
  ◦ Task can be sent to the dequeuer within the message

Backpressure implementation (Phase 2 / Push Mode)
[Diagram: Enqueuer, Dequeuer, executors 1 ... n, Engine ...; labels:]
• Send msg containing tasks
• Deq accepts till capacity
• Deq sends back intention - “Ready for pull mode”
• Deq consumes queue till queue empty
• Push task to queue
• Enq confirms; returns PullModeAck
• Deq receives Ack
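The Phase 2 protocol can be read as a two-state machine: while the Dequeuer has capacity, tasks travel inside the message and never touch the queue; once capacity is exhausted, the pair switches to pull mode and tasks are written to the queue until it drains. The sketch below is one single-threaded interpretation of the labels on this slide, not the actual actor code. `Backpressure`, `Pair`, `offer`, and `complete` are hypothetical names, and the "Ready for pull mode" / PullModeAck handshake is collapsed into a single mode change.

```scala
import scala.collection.mutable

object Backpressure {
  sealed trait Mode
  case object Push extends Mode
  case object Pull extends Mode

  /** Model of one Enqueuer/Dequeuer pair; `capacity` is the Dequeuer's in-flight limit. */
  final class Pair(capacity: Int) {
    var mode: Mode = Push
    var queueWrites = 0                              // number of writes to the queue
    val queue = mutable.Queue.empty[String]          // stands in for the persistent queue
    val inFlight = mutable.ListBuffer.empty[String]  // tasks the Dequeuer has accepted

    /** Enqueuer side: send directly in push mode, otherwise fall back to the queue. */
    def offer(task: String): Unit = mode match {
      case Push if inFlight.size < capacity =>
        inFlight += task                             // task carried in the message itself
      case _ =>
        mode = Pull                                  // handshake collapsed into one step
        queue.enqueue(task)
        queueWrites += 1
    }

    /** Dequeuer side: a task finished; refill from the queue, go back to push once it is empty. */
    def complete(task: String): Unit = {
      inFlight -= task
      if (queue.nonEmpty) inFlight += queue.dequeue()
      if (queue.isEmpty) mode = Push
    }
  }
}
```

In this model the queue is written only while the Dequeuer is saturated, which is the effect the slide attributes to Phase 2.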

Improving the backpressure “inflection point”
• Drastically reduced writes to queue (99.84% reduction)
[Chart: ~65K writes / min vs. ~100 writes / min]

Supporting Framework and Libraries

Making Parallel Execution Simple
• Trust (but verify) the larger system
• Avoid synchronization
• Be immutable (or at least look that way)
• Hold resources for as short a time as possible
• Hold resources as long as needed
• Loan pattern adds safety and helps visualize

Resource Pooling

Legacy System
[Diagram: timelines for Thread 1, Thread 2 ... Thread n; legend: Holding resource, Waiting for resource, Working without resource]

First Try
[Diagram: timelines for Thread 1, Thread 2 ... Thread n; legend: Holding resource, Waiting for resource, Working without resource]

Second Try
[Diagram: timelines for Thread 1, Thread 2 ... Thread n; legend: Holding resource, Waiting for resource, Working without resource]

Second Try
[Diagram: timelines for Thread 1, Thread 2 ... Thread n; legend: Getting resource, Holding resource, Waiting for resource, Working without resource]

Final Strategy
[Diagram: timelines for Thread 1, Thread 2 ... Thread n; legend: Holding resource, Waiting for resource, Working without resource]

Optimistic Updates

Updating a Record
• Pessimistic
  1. Freeze record
  2. Check the state
  3. Perform update
• Optimistic
  1. Check the state
  2. Perform update assuming state is unchanged
  3. If this assumption fails, try again

Optimistic Updates
• Pessimistic
    BEGIN;
    SELECT favorite_color FROM person WHERE id = 10 FOR UPDATE;
    # favorite_color: "Blue"
    UPDATE person SET favorite_color = "Purple" WHERE id = 10;
    COMMIT;

Optimistic Updates
• Optimistic
    SELECT favorite_color FROM person WHERE id = 10;
    # favorite_color: "Blue"
    UPDATE person SET favorite_color = "Purple" WHERE id = 10 AND favorite_color = "Blue";
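The optimistic recipe (check the state, update assuming it is unchanged, try again if the assumption fails) has a direct in-memory analogue in compare-and-set. The sketch below uses `java.util.concurrent.atomic.AtomicReference` in place of a database row; `OptimisticUpdate` and `update` are hypothetical names. In the SQL form, the equivalent of a failed `compareAndSet` is a guarded `UPDATE` that reports zero affected rows.

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.annotation.tailrec

object OptimisticUpdate {
  /** Reads the current value, computes the new one, and writes it only if
    * nobody changed the cell in between; otherwise re-reads and retries. */
  @tailrec
  def update[A](cell: AtomicReference[A])(f: A => A): A = {
    val seen = cell.get()                    // 1. check the state
    val next = f(seen)
    if (cell.compareAndSet(seen, next)) next // 2. update assuming state is unchanged
    else update(cell)(f)                     // 3. assumption failed, try again
  }
}
```

No lock is held between the read and the write, so a writer that loses the race pays with a retry instead of making every other writer wait.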

Loan Pattern

Procedural Style
    import scala.collection.mutable.ListBuffer

    val conn = connectionPool.getConnection()
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("select name from person limit 10")
    // Collect the first column of every row
    val namesBuf = ListBuffer.empty[String]
    while (rs.next()) {
      namesBuf += rs.getString(1)
    }
    val namesCsv: String = namesBuf.mkString(",")
    // Cleanup is manual: an exception above would skip both calls
    stmt.close()
    conn.close()

Loan Style
    val namesCsv: String = ConnectionPool.connection { conn =>
      conn.statement { stmt =>
        stmt.query("select name from person limit 10") { rs =>
          rs.getString(1)
        }.mkString(",")
      }
    }  // The Connection is closed at the end of this closure


Loan Style
    val namesCsv: String = ConnectionPool.connection { conn =>
      conn.statement { stmt =>
        stmt.query("select name from person limit 10") { rs =>
          rs.getString(1)
        }
      }
    }.mkString(",")
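The slides show only the call sites of the loan style, so the helper behind them is not visible. As a minimal sketch of how such a helper might be built (`Loan` and `loan` are hypothetical names, not Marketo's `ConnectionPool` API), the lender opens the resource, hands it to the caller's function, and closes it in a `finally` block:

```scala
object Loan {
  /** Lends `resource` to `body` and always closes it afterwards,
    * even when `body` throws. The caller never calls close() itself. */
  def loan[R <: AutoCloseable, A](resource: R)(body: R => A): A =
    try body(resource)
    finally resource.close()
}
```

The braces of the closure mark exactly how long the resource is held, which is why moving `.mkString(",")` outside the closure in the last slide shortens the time the connection is on loan. Scala 2.13 and later ship a comparable standard helper, `scala.util.Using.resource`.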

Reflection and the Road Ahead

Performance Statistics
• Legacy System runs on
  ◦ 3 physical machines, each with
    ▪ 40 Cores
    ▪ 32GB of RAM
• Reactive System runs on
  ◦ 8 virtual machines, each with
    ▪ 4 Cores
    ▪ 8GB of RAM
• Using ~¼ CPU and ~⅔ RAM we see a 20x increase in throughput per tenant
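The ~¼ CPU and ~⅔ RAM figures follow from the machine counts listed on this slide. This is a worked check on raw core and memory totals only; it ignores any per-core difference between physical and virtual machines.

```latex
\begin{align*}
\text{Legacy cores} &= 3 \times 40 = 120, &
\text{Reactive cores} &= 8 \times 4 = 32, &
\frac{32}{120} &\approx 0.27 \approx \tfrac{1}{4} \\
\text{Legacy RAM} &= 3 \times 32 = 96\ \text{GB}, &
\text{Reactive RAM} &= 8 \times 8 = 64\ \text{GB}, &
\frac{64}{96} &\approx 0.67 \approx \tfrac{2}{3}
\end{align*}
```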

Our Takeaways
• What We Like
  ◦ Overall fault-tolerance mechanism
  ◦ Self-healing capability
  ◦ Lightweight runtime memory footprint
  ◦ Keep “parallelism” at design time
  ◦ Actor Domain-Driven Design
• Future Improvements
  ◦ Adaptive execution bandwidth control
  ◦ Akka Streams in Execution Engine

We are hiring! Visit us at https://marketo.jobs
