Slide 1

Slide 1 text

A Fair Task Execution Framework
mourjo@helpshift.com

Slide 2

Slide 2 text

150 Million+ conversations on Helpshift in the last year

Slide 4

Slide 4 text

Running jobs on customer data
● Maintenance is unavoidable
  ○ Feature to release: segment conversations into buckets
  ○ Every conversation must be part of one bucket
  ○ Need to add a “default bucket” to all conversations before release
● Uneven data distribution

Slide 5

Slide 5 text

Goals
● Fair
  1. Runtime is independent of how data is distributed across customers
  2. Runtime depends only on the total amount of data
● Easy to understand and use

Slide 6

Slide 6 text

Strategy 1: Spawn a thread for each customer
Run each customer on a dedicated thread in a thread pool
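A minimal sketch of Strategy 1, assuming customers are maps with a `:cust-id` key and that `process-customer` is a hypothetical function that handles one customer's entire data set. It uses a fixed-size `java.util.concurrent` thread pool via Clojure interop, not any particular library from the talk.

```clojure
(import '(java.util.concurrent Executors ExecutorService Future))

(defn run-per-customer
  "Hypothetical sketch of Strategy 1: submit one task per customer to a
  fixed-size pool. Each task processes that customer's whole data set, so a
  very large customer keeps one thread busy for the entire run.
  Returns a map of :cust-id -> result."
  [thread-count customers process-customer]
  (let [^ExecutorService pool (Executors/newFixedThreadPool (int thread-count))]
    (try
      ;; submit every customer first so they run concurrently
      (let [futures (mapv (fn [c]
                            [(:cust-id c)
                             (.submit pool ^Callable (fn [] (process-customer c)))])
                          customers)]
        ;; then block on each customer's result
        (into {} (map (fn [[id ^Future f]] [id (.get f)])) futures))
      (finally
        (.shutdown pool)))))
```

For example, `(run-per-customer 2 [{:cust-id "xyz" :size 3} {:cust-id "abc" :size 1}] (fn [c] (* 10 (:size c))))` returns a map equal to `{"xyz" 30 "abc" 10}`. The unit of concurrency is the customer, so no thread can help with another customer's backlog.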

Slide 8

Slide 8 text

Strategy 1: Report Card
Goal     Result
Fair     No
Easy     Yes
Easy to use, but the total runtime is held up by the longest-running customer job
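The "Fair: No" verdict can be checked with a small simulation. This sketch assumes a job's runtime is proportional to its data size and that each job goes to whichever thread becomes free first (greedy list scheduling); `makespan` is a hypothetical helper for illustration, not part of the framework.

```clojure
(defn makespan
  "Simulates greedy list scheduling: each job, in order, goes to the worker
  that becomes free first. Returns the time at which the last worker finishes."
  [worker-count job-durations]
  (->> job-durations
       (reduce (fn [free-at d]
                 ;; free-at is a sorted vector of the times each worker becomes idle
                 (let [[earliest & others] free-at]
                   (vec (sort (conj (vec others) (+ earliest d))))))
               (vec (repeat worker-count 0)))
       (apply max)))
```

With two threads, `(makespan 2 [97 1 1 1])` is 97 while `(makespan 2 [25 25 25 25])` is 50, although both inputs hold 100 units of data. With one task per customer, the runtime can never drop below the size of the largest customer, so it depends on the data distribution and not only on the amount of data.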

Slide 10

Slide 10 text

Strategy 2: Chunk-and-work
Decouple the size of customers from the unit of concurrency. Each task is a “small-enough” segment of a customer:
[{:cust-id "xyz" :month "Jan"} {:cust-id "xyz" :month "Feb"} ... {:cust-id "abc" :month "Jan"} {:cust-id "abc" :month "Feb"} ...]
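The task list above can be generated mechanically. A minimal sketch, assuming one task per customer per calendar month of a single year; `months` and `chunk-tasks` are illustrative names, not taken from the talk.

```clojure
(def months ["Jan" "Feb" "Mar" "Apr" "May" "Jun"
             "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"])

(defn chunk-tasks
  "One task per (customer, month) pair, in the same order as the slide:
  all months of the first customer, then all months of the next."
  [cust-ids]
  (vec (for [cust-id cust-ids
             month   months]
         {:cust-id cust-id :month month})))
```

`(chunk-tasks ["xyz" "abc"])` yields 24 tasks starting with `{:cust-id "xyz" :month "Jan"}`. Every task spans the same length of time, but nothing bounds how many records fall inside it, so the work per task still depends on how busy that customer was in that month.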

Slide 11

Slide 11 text

Strategy 2: Chunk-and-work

Slide 13

Slide 13 text

Strategy 2: Chunk-and-work
● Some timeranges contain more data than others
● Some timeranges contain too little or no data

Slide 16

Slide 16 text

Strategy 2: Report Card
Goal     Result
Fair     Mostly
Easy     Yes
Robust   No
● Uneven data distribution across customers => Handled
● Uneven division of tasks => Unhandled

Slide 17

Slide 17 text

Strategy 2: How to achieve robustness?

Slide 19

Slide 19 text

How to achieve robustness?
Decouple data collection from execution

Slide 20

Slide 20 text

Strategy 3: Iterator and Worker

Slide 24

Slide 24 text

Strategy 3: Iterator and Worker
Dynamically chunk into uniformly sized tasks
[{:cust-id "xyz" :month "Jan"} {:cust-id "xyz" :month "Feb"} ... {:cust-id "abc" :month "Jan"} {:cust-id "abc" :month "Feb"} ...]
  => Give me the next 10K records where id > last-id in this timerange
  => On-the-fly chunked tasks: [data-id1, data-id2, ...]
  => Perform the task
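The iterator's query, "give me the next 10K records where id > last-id in this timerange", is keyset (cursor) pagination. The sketch below runs it against an in-memory vector standing in for the database; `next-batch`, `all-batches`, and the record shape `{:id :cust-id :month}` are assumptions for illustration only.

```clojure
(defn next-batch
  "Hypothetical stand-in for a keyset-paginated query: up to `limit` records
  for `cust-id` in `month` whose :id is greater than `last-id`, in :id order."
  [table {:keys [cust-id month last-id limit]}]
  (->> table
       (filter #(and (= cust-id (:cust-id %))
                     (= month (:month %))
                     (> (:id %) last-id)))
       (sort-by :id)
       (take limit)
       vec))

(defn all-batches
  "Walks one (customer, month) task batch by batch. A full batch means more
  rows may remain, so the last id seen becomes the next cursor; a short or
  empty batch ends the walk. Assumes :limit is positive and ids are > 0."
  [table query]
  {:pre [(pos? (:limit query))]}
  (loop [last-id 0
         batches []]
    (let [batch   (next-batch table (assoc query :last-id last-id))
          batches (if (seq batch) (conj batches batch) batches)]
      (if (= (count batch) (:limit query))
        (recur (:id (peek batch)) batches)
        batches))))
```

With 25 matching records and `:limit 10`, the batches have sizes 10, 10 and 5. Each batch is capped at `:limit` records, so the work handed to workers stays roughly uniform no matter how large the customer or the month is.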

Slide 27

Slide 27 text

Strategy 3: Iterator and Worker
● Launcher
● Iterator function
● Worker function
● Rendezvous point
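A minimal, runnable sketch of the launcher/iterator/worker split, assuming plain Clojure with no extra libraries. It swaps the talk's core.async channel for a bounded `java.util.concurrent.ArrayBlockingQueue` as the rendezvous point, takes `input-tasks` as a plain sequence, and `fair-exec-sketch` is a hypothetical name, not the `fair-exec-job` API from the talk.

```clojure
(import '(java.util.concurrent ArrayBlockingQueue BlockingQueue
                               Executors ExecutorService TimeUnit))

(defn fair-exec-sketch
  "Hypothetical launcher. Calls (iterator put! input-task) once per input task
  on `iterator-thread-count` threads; an iterator may emit any number of small
  tasks via put!. `worker-thread-count` threads take tasks from a bounded queue
  (the rendezvous point) and call (worker task). Blocks until all are done."
  [{:keys [input-tasks iterator iterator-thread-count
           worker worker-thread-count queue-size]
    :or   {queue-size 1000}}]
  (let [^BlockingQueue q       (ArrayBlockingQueue. (int queue-size))
        stop                   (Object.) ; sentinel: tells a worker to exit
        ^ExecutorService iters (Executors/newFixedThreadPool (int iterator-thread-count))
        ^ExecutorService works (Executors/newFixedThreadPool (int worker-thread-count))
        put!                   (fn [task] (.put q task))] ; blocks while the queue is full
    ;; Workers: take tasks until the sentinel arrives.
    (dotimes [_ worker-thread-count]
      (.submit works ^Runnable
               (fn []
                 (loop []
                   (let [task (.take q)]
                     (when-not (identical? task stop)
                       (try (worker task)
                            (catch Exception e
                              (binding [*out* *err*]
                                (println "task failed:" (.getMessage e)))))
                       (recur)))))))
    ;; Iterators: one submission per input task (e.g. one per customer).
    (doseq [t input-tasks]
      (.submit iters ^Runnable (fn [] (iterator put! t))))
    ;; Once every iterator has finished, release each worker with a sentinel.
    (.shutdown iters)
    (.awaitTermination iters Long/MAX_VALUE TimeUnit/MILLISECONDS)
    (dotimes [_ worker-thread-count] (.put q stop))
    (.shutdown works)
    (.awaitTermination works Long/MAX_VALUE TimeUnit/MILLISECONDS)))
```

Because the queue is bounded, iterators block when workers fall behind, which keeps memory use bounded, and throughput is tuned through the two thread counts independently. In this sketch an exception thrown by an iterator is captured in its Future and not reported.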

Slide 30

Slide 30 text

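Strategy 3: Iterator and Worker

(require '[clojure.core.async :as async])

;; Iterator: pages through one customer's data and emits small tasks.
;; query-next-batch is assumed to return at most query-limit records matching
;; `query`, ordered by :id; query-limit is assumed to be passed in params.
(defn iterate-customer
  [output-chan {:keys [customer query query-limit profiles-per-task] :as params}]
  (let [data    (query-next-batch query)
        n       (count data)
        last-id (:id (last data))]
    (doseq [chunk (partition-all profiles-per-task data)]
      (async/>!! output-chan {:customer customer :chunk chunk}))
    ;; A full batch means more rows may remain: advance the id cursor while
    ;; keeping the customer/timerange filter already in the query.
    (when (= n query-limit)
      (recur output-chan (assoc-in params [:query :id] {:$gt last-id})))))

;; Worker: performs the maintenance on one chunk.
(defn run-tasks [{:keys [customer chunk] :as task}]
  (run-maintenance-on customer chunk))

;; Launcher
(fair-exec-job
  (let [ch (async/chan 1000)]
    {:iterator              (partial iterate-customer ch)
     :input-tasks           customer-xs-or-ch
     :chan                  ch
     :iterator-thread-count 10
     :worker                run-tasks
     :worker-thread-count   20}))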

Slide 31

Slide 31 text

Strategy 3: Report Card
Goal     Result
Fair     Yes
Easy     Yes
Robust   Yes

Slide 34

Slide 34 text

Conclusion
● Concurrency => Parallelism
● A good concurrent design => Tunable throughput
● Clojure => Easy to harness the power of concurrency

Slide 36

Slide 36 text

Come say hi! We are hiring @helpshift
https://jobs.lever.co/helpshift
Thank you! Questions?