Slide 1

Multi-armed Bandit Optimisation in Clojure
Paul Ingles (@pingles), Principal Engineer, uSwitch.com

Slide 2

Product optimisation cycles are long, complex and inefficient.

Slide 3

No content

Slide 4

Source: http://danariely.com/the-books/excerpted-from-chapter-1-–-the-truth-about-relativity-2/

Slide 5

Source: http://danariely.com/the-books/excerpted-from-chapter-1-–-the-truth-about-relativity-2/

Slide 6

No content

Slide 7

[Diagram: Experiment 1, Experiment 2, Experiment 3, … run one after another over time.]

Slide 8

[Diagram: the same sequential experiments over time, annotated with “Participants Needed” and “Effect”.]

Slide 9

Bandit strategies can help

Slide 10

A product for procrastinators, by a procrastinator

Slide 11

No content

Slide 12

http://notflix.herokuapp.com/

Slide 13

Source: Action Movie Kid, https://www.youtube.com/watch?v=SU-ZM3xoYag

Slide 14

Source: Action Movie Kid, https://www.youtube.com/watch?v=SU-ZM3xoYag

Slide 15

http://notflix.herokuapp.com/

Slide 16

The Multi-armed Bandit Problem

Slide 17

No content

Slide 18

Exploitation vs. Exploration

Slide 19

Exploitation vs. Exploration (Source: printablecolouringpages.co.uk)

Slide 20

Exploitation vs. Exploration (Source: printablecolouringpages.co.uk)

Slide 21

Exploitation vs. Exploration (Source: printablecolouringpages.co.uk)

Slide 22

Bandit Model

Slide 23

Bandit Model
Arms: {1, 2, …, K}

Slide 24

Bandit Model
Arms: {1, 2, …, K}
Trials: 1, 2, …, T

Slide 25

Bandit Model
Arms: {1, 2, …, K}
Trials: 1, 2, …, T
Rewards: {0, 1}
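
Put together (a standard formalisation, not spelled out on the slides): at each trial t = 1, …, T the strategy picks an arm a_t ∈ {1, …, K} and observes a reward r_t ∈ {0, 1}, with r_t ~ Bernoulli(θ_{a_t}), where θk is arm k's hidden probability of reward.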

Slide 26

K-arms in practice:
• K-buttons: “SUBMIT” vs. “SUPPORT HAITI” (Source: “A/B Testing”, Siroker and Koomen, 2013)
• K-pages (Source: “A/B Testing”, Siroker and Koomen, 2013)
• K-headlines: “Pierce and Wilfork show off their Boston accents” vs. “Pierce, Wilfork show off their bad Boston accents” (Source: http://knightlab.northwestern.edu/2013/08/15/designing-from-data-how-news-organizations-use-ab-testing-to-increase-user-engagement/)

Slide 27

Bandit Strategy

Slide 28

Bandit Strategy

Arm selection:
(defn select-arm [arms] …)

Slide 29

Bandit Strategy

Arm selection:
(defn select-arm [arms] …)

Updating arm state with feedback:
(defn pulled [arm] …)
(defn reward [arm x] …)

Slide 30

Bandit Strategy

(defrecord Arm [name pulls value])

Arm selection:
(defn select-arm [arms] …)

Updating arm state with feedback:
(defn pulled [arm] …)
(defn reward [arm x] …)
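
To see how the three functions compose, here is a minimal run-loop sketch (my own, not from the deck); arms is assumed to be a map of arm name to Arm record, and user-clicked? is a hypothetical feedback source:

(defn run-trial [arms]
  (let [arm (select-arm (vals arms)) ; choose an arm to show
        arm (pulled arm)             ; record that it was shown
        arm (if (user-clicked? arm)  ; hypothetical feedback
              (reward arm 1)
              arm)]
    (assoc arms (:name arm) arm)))   ; store the updated arm state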

Slide 31

Clojure Bandit Algorithms

Slide 32

Clojure Bandit Algorithms
• Epsilon-Greedy

Slide 33

Clojure Bandit Algorithms
• Epsilon-Greedy
• Thompson Sampling

Slide 34

ε-greedy

Slide 35

ε-greedy
ε = epsilon = rate of exploration

Slide 36

ε-greedy

[Diagram: with ε = 0.1, exploit the best-performing arm 90% of the time and explore a randomly chosen arm (Arm 1, Arm 2, …, Arm K) 10% of the time.]

Source: “Bandit Algorithms for Website Optimization”, John Myles White, O’Reilly, 2012

Slide 37

(defrecord Arm [name pulls value])

Slide 38

(defrecord Arm [name pulls value])

(defn select-arm [epsilon arms]
  (if (> (rand) epsilon)
    (exploit :value arms)
    (rand-nth arms)))

Slide 39

(defrecord Arm [name pulls value])

(defn exploit [k arms]
  (first (sort-by k > arms)))

(defn select-arm [epsilon arms]
  (if (> (rand) epsilon)
    (exploit :value arms)
    (rand-nth arms)))

Slide 40

(defrecord Arm [name pulls value])

(defn exploit [k arms]
  (first (sort-by k > arms)))

(defn select-arm [epsilon arms]
  (if (> (rand) epsilon)
    (exploit :value arms)
    (rand-nth arms)))

user=> (def arms [(Arm. :arm1 2 0) (Arm. :arm2 2 1)])
user=> (select-arm 1.0 arms) ; always explore: a random arm
#user.Arm{:name :arm1, :pulls 2, :value 0}
user=> (select-arm 0.0 arms) ; always exploit: the highest-value arm
#user.Arm{:name :arm2, :pulls 2, :value 1}

Slide 41

user=> (def arms [(Arm. :arm1 0 0) (Arm. :arm2 0 0)])
user=> (def selected-arm (select-arm 0.1 arms))
#user.Arm{:name :arm1, :pulls 0, :value 0}
user=> (-> selected-arm
           (pulled)
           ;; time passes
           (reward 1))
#user.Arm{:name :arm1, :pulls 1, :value 1}

Slide 42

user=> (def arms [(Arm. :arm1 0 0) (Arm. :arm2 0 0)])
user=> (def selected-arm (select-arm 0.1 arms))
#user.Arm{:name :arm1, :pulls 0, :value 0}
user=> (-> selected-arm
           (pulled)    ; increment pulls counter
           ;; time passes
           (reward 1)) ; increment value counter
#user.Arm{:name :arm1, :pulls 1, :value 1}
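
The deck never shows pulled and reward for the ε-greedy arms; a minimal sketch that matches the counters above (my assumption, not necessarily the bandit library's code):

(defn pulled
  "Increment the arm's pulls counter."
  [arm]
  (update-in arm [:pulls] inc))

(defn reward
  "Add the observed reward x (0 or 1 here) to the arm's value."
  [arm x]
  (update-in arm [:value] + x))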

Slide 43

ε-greedy Behaviour

Slide 44

ε-greedy Behaviour

(bernoulli-bandit {:arm1 0.1
                   :arm2 0.1
                   :arm3 0.1
                   :arm4 0.1
                   :arm5 0.9})
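
Here bernoulli-bandit builds a simulated environment in which each arm pays out with the given hidden probability. A minimal sketch of such a simulator (assumed; the library's version may differ):

(defn bernoulli-bandit
  "Returns a fn that, given an arm name, draws reward 1 with that
   arm's hidden probability and 0 otherwise."
  [probs]
  (fn [arm-name]
    (if (< (rand) (get probs arm-name)) 1 0)))

;; e.g. ((bernoulli-bandit {:arm1 0.1 :arm5 0.9}) :arm5) ;=> usually 1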

Slide 45

ε-greedy Behaviour

[Plot: Pr(selected = optimal arm) against time (0–500 trials) for ε = 0.1 and ε = 0.2.]

Slide 46

Thompson Sampling aka Bayesian Bandit

Slide 47

Arm Model

θk ∈ [0, 1]: arm k's hidden true probability of reward

Slide 48

Arm Model

[Plot: density over θk with pulls = 0, rewards = 0 — a flat, uninformative prior.]

Slide 49

Arm Model

[Plot: density over θk with pulls = 1, rewards = 0 — mass shifts towards 0.]

Slide 50

Arm Model

[Plot: density over θk with pulls = 10, rewards = 4 — peaked around 0.4, still wide.]

Slide 51

Arm Model

[Plot: density over θk with pulls = 1000, rewards = 400 — sharply peaked at 0.4.]
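
These densities are Beta posteriors: starting from a uniform prior (the prior 1.0 in the code that follows), after pulls trials with rewards successes the posterior over θk is Beta(1 + rewards, 1 + pulls − rewards). Drawing from it with Incanter, for the pulls = 10, rewards = 4 arm:

(use 'incanter.distributions)
;; Beta(1 + 4, 1 + (10 - 4)) = Beta(5, 7); draws concentrate near 0.4
(draw (beta-distribution 5 7))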

Slide 52

Thompson Sampling

[Plot: two posteriors over θk — one arm with pulls = 100, rewards = 40 (narrow, centred near 0.4); another with pulls = 10, rewards = 6 (wide, centred near 0.6).]

Slide 53

Thompson Sampling

[Plot: the same two posteriors; a value of 0.41 is drawn from the pulls = 100, rewards = 40 arm.]

Slide 54

Thompson Sampling

[Plot: a value is drawn from each posterior — 0.41 for the pulls = 100, rewards = 40 arm, 0.57 for the pulls = 10, rewards = 6 arm — and the arm with the highest draw is selected.]

Slide 55

Thompson Sampling

(use 'incanter.distributions)

(defn estimate-value [{:keys [pulls value] :as arm}]
  (let [prior 1.0
        alpha (+ prior value)
        beta  (+ prior (- pulls value))]
    (assoc arm :theta (draw (beta-distribution alpha beta)))))

(defn select-arm [arms]
  (exploit :theta (map estimate-value arms)))

Slide 56

Thompson Sampling

(use 'incanter.distributions)

(defn estimate-value [{:keys [pulls value] :as arm}]
  (let [prior 1.0
        alpha (+ prior value)
        beta  (+ prior (- pulls value))]
    (assoc arm :theta (draw (beta-distribution alpha beta)))))

(defn select-arm [arms]
  (exploit :theta (map estimate-value arms)))

user=> (select-arm [(Arm. :blue 10 6) (Arm. :red 100 40)])
#user.Arm{:name :blue, :pulls 10, :value 6, :theta 0.4978…}

Slide 57

Thompson Sampling Behaviour

(bernoulli-bandit {:arm1 0.1
                   :arm2 0.1
                   :arm3 0.1
                   :arm4 0.1
                   :arm5 0.9})

Slide 58

Strategy Accuracy

[Plot: Pr(selected = optimal arm) against time (0–500 trials) for ε = 0.1, ε = 0.2 and Thompson sampling.]

Slide 59

Thompson Sampling
• Smoothly balances the explore/exploit tradeoff.
• Optimal convergence: logarithmic regret.
• We can use it to rank.

Slide 60

Ranking with Thompson Sampling

(defn rank [k arms]
  (sort-by k > arms))

user=> (def arms [(Arm. :blue 10 6) (Arm. :red 100 40)])
user=> (rank :theta (map estimate-value arms))
(#user.Arm{:name :red, :pulls 100, :value 40, :theta 0.3979…}
 #user.Arm{:name :blue, :pulls 10, :value 6, :theta 0.2789…})

Slide 61

Ranking with Thompson Sampling

(defn rank [k arms]
  (sort-by k > arms))

user=> (def arms [(Arm. :blue 10 6) (Arm. :red 100 40)])
user=> (rank :theta (map estimate-value arms))
(#user.Arm{:name :red, :pulls 100, :value 40, :theta 0.3979…}
 #user.Arm{:name :blue, :pulls 10, :value 6, :theta 0.2789…})

SWEEEEEEETTT

Slide 62

[bandit/bandit-core "0.2.1-SNAPSHOT"]
https://github.com/pingles/bandit

Slide 63

No content

Slide 64

Ranking

[Diagram: a ranked list with slots 1, 2, 3.]

Slide 65

Ranking

[Diagram: ranked slots 1, 2, 3 fed by a Video Rank Bandit.]

Slide 66

Ranking

[Diagram: ranked slots 1, 2, 3 fed by a Video Rank Bandit; the videos are the rank bandit's arms.]

Slide 67

Ranking

[Diagram: the Video Rank Bandit ranks videos into slots 1, 2, 3; each video also has its own Thumbnail Bandit.]

Slide 68

Ranking

[Diagram: the Video Rank Bandit ranks videos (its arms) into slots 1, 2, 3; each video, e.g. the Portal video, has a Thumbnail Bandit whose arms are its candidate thumbnails.]

Slide 69

Implementing Notflix: State

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

Slide 70

Implementing Notflix: State

(def videos {"chilli" {:url "http://youtube.com/?v=1"}
             "portal" {:url "http://youtube.com/?v=2"}})

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

Slide 71

Implementing Notflix: State

(def videos {"chilli" {:url "http://youtube.com/?v=1"}
             "portal" {:url "http://youtube.com/?v=2"}})

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

(defn initialise-bandits []
  {:video-rank (bandit.arms/bandit "chilli" "portal" "coconut")
   "chilli"    (bandit.arms/bandit "chilli1.png" "chilli2.png" …)
   "coconut"   (bandit.arms/bandit "coconut1.png" "coconut2.png" …)
   "portal"    (bandit.arms/bandit "portal1.png" "portal2.png" …)})

Slide 72

Implementing Notflix: State

(def videos {"chilli" {:url "http://youtube.com/?v=1"}
             "portal" {:url "http://youtube.com/?v=2"}})

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

(defn initialise-bandits []
  {:video-rank (bandit.arms/bandit "chilli" "portal" "coconut")
   "chilli"    (bandit.arms/bandit "chilli1.png" "chilli2.png" …)
   "coconut"   (bandit.arms/bandit "coconut1.png" "coconut2.png" …)
   "portal"    (bandit.arms/bandit "portal1.png" "portal2.png" …)})

(defn -main []
  (dosync
    (alter bandits merge (initialise-bandits))))

Slide 73

Implementing Notflix: Drawing Arms

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

Slide 74

Implementing Notflix: Drawing Arms

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

(defn rank-arms [bandit-id]
  (->> (get @bandits bandit-id)
       (vals) ; (#Arm{:name "video1" …} #Arm{:name "video2" …})
       (map bandit.algo.bayes/estimate-value)
       (bandit.arms/rank :theta)))

Slide 75

Implementing Notflix: Drawing Arms

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

(defn rank-arms [bandit-id]
  (->> (get @bandits bandit-id)
       (vals) ; (#Arm{:name "video1" …} #Arm{:name "video2" …})
       (map bandit.algo.bayes/estimate-value)
       (bandit.arms/rank :theta)))

(defn select-arm [bandit-id]
  (first (rank-arms bandit-id)))

Slide 76

Implementing Notflix: Arm Feedback

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

Slide 77

Implementing Notflix: Arm Feedback

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

(defn pulled-arm! [bandit-id arm-label]
  (alter bandits update-in [bandit-id arm-label] bandit.arms/pulled))

Slide 78

Implementing Notflix: Arm Feedback

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

(defn pulled-arm! [bandit-id arm-label]
  (alter bandits update-in [bandit-id arm-label] bandit.arms/pulled))

(defn pulled-all-arms! [bandit-id]
  (alter bandits update-in [bandit-id] (fmap bandit.arms/pulled)))

Slide 79

Implementing Notflix: Arm Feedback

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

(defn pulled-arm! [bandit-id arm-label]
  (alter bandits update-in [bandit-id arm-label] bandit.arms/pulled))

(defn pulled-all-arms! [bandit-id]
  (alter bandits update-in [bandit-id] (fmap bandit.arms/pulled)))

(defn reward-arm! [bandit-id arm-label]
  (alter bandits update-in [bandit-id arm-label]
         #(bandit.algo.bayes/reward % 1)))
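
alter may only be called inside an STM transaction, so callers wrap these helpers in dosync, as the route handlers on the next slide do. A hypothetical REPL sketch:

(dosync (pulled-arm! :video-rank "portal"))
;; …later, when the click comes back:
(dosync (reward-arm! :video-rank "portal"))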

Slide 80

Implementing Notflix: Web App

(def videos {"chilli" {:url "http://youtube.com/?v=1"}
             "portal" {:url "http://youtube.com/?v=2"}})

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

(defroutes notflix-routes
  (GET "/" []
    (dosync
      (let [ranked-labels (map :name (rank-arms :video-rank))]
        (pulled-all-arms! :video-rank)
        (videos-page-html ranked-labels))))
  (POST "/reward/:video-name/:thumb-name" [video-name thumb-name]
    (dosync
      (reward-arm! video-name thumb-name)
      (reward-arm! :video-rank video-name)
      (redirect-after-post "/"))))

Slide 81

Implementing Notflix: Web App

(def videos {"chilli" {:url "http://youtube.com/?v=1"}
             "portal" {:url "http://youtube.com/?v=2"}})

(def bandits (ref nil))
; {:video-rank {"chilli" #Arm{:pulls …}
;               "portal" #Arm{:pulls …}}
;  "chilli"    …}

(defroutes notflix-routes
  (GET "/" []
    (dosync
      (let [ranked-labels (map :name (rank-arms :video-rank))]
        (pulled-all-arms! :video-rank)
        (videos-page-html ranked-labels))))
  (POST "/reward/:video-name/:thumb-name" [video-name thumb-name]
    (dosync
      (reward-arm! video-name thumb-name)
      (reward-arm! :video-rank video-name)
      (redirect-after-post "/"))))

(defn videos-page-html [video-names]
  (for [video-name video-names]
    (html
      ...
      (dosync
        (let [video-url  (:url (get videos video-name))
              thumb-name (:name (select-arm video-name))]
          (pulled-arm! video-name thumb-name)
          [:a {:href video-url} [:img {:src thumb-name}]])))))

Slide 82

Algorithm Performance: Video Ranking Bandit

Slide 83

Algorithm Performance: Video Ranking Bandit

[Plot: posterior pdf over θk = Pr(click) for each video.]

Slide 84

Algorithm Performance: Video Ranking Bandit

[Plot: posterior pdf over θk = Pr(click); “hero of the coconut pain” highlighted.]

Slide 85

Algorithm Performance: Video Ranking Bandit

[Plot: posterior pdfs over θk = Pr(click) for “hero of the coconut pain” and “1000 danes eat 1000 chillis”.]

Slide 86

Algorithm Performance: Video Ranking Bandit

[Plot: posterior pdfs over θk = Pr(click) for “hero of the coconut pain”, “1000 danes eat 1000 chillis” and “3 year-old with a portal gun”.]

Slide 87

Algorithm Performance: Portal Thumbnail Bandit

Slide 88

Algorithm Performance: Portal Thumbnail Bandit

[Plot: posterior pdf over θk = Pr(click) for each thumbnail.]

Slide 89

Algorithm Performance: Portal Thumbnail Bandit

[Plot: posterior pdf over θk = Pr(click) for each thumbnail.]

Slide 90

Algorithm Performance: Portal Thumbnail Bandit

[Plot: posterior pdf over θk = Pr(click) for each thumbnail.]

Slide 91

Algorithm Performance: Portal Thumbnail Bandit

[Plot: posterior pdf over θk = Pr(click) for each thumbnail.]

Slide 92

http://notflix.herokuapp.com/

Slide 93

1. Ariely, D., 2010, “Predictably Irrational”, Harper Perennial.
2. Kahneman, D., 2011, “Thinking, Fast and Slow”, Farrar, Straus and Giroux.
3. Myles White, J., 2012, “Bandit Algorithms for Website Optimization”, O’Reilly.
4. Scott, S., 2010, “A modern Bayesian look at the multi-armed bandit”.
5. http://tdunning.blogspot.co.uk/2012/02/bayesian-bandits.html
6. http://www.chrisstucchio.com/blog/2013/bayesian_bandit.html
7. http://www.chrisstucchio.com/blog/2013/bayesian_analysis_conversion_rates.html
8. Siroker and Koomen, 2013, “A/B Testing”, Wiley.

@pingles
https://github.com/pingles/bandit