Multi-armed Bandit Optimisation in Clojure

Multi-armed Bandit strategies provide efficient and intuitive models that are often more easily applied to product optimisation than traditional methods.

Through building a sample application, Notflix, we use the Thompson Sampling algorithm to rank videos and select thumbnails to optimise click-throughs.

Paul Ingles

June 26, 2014

Transcript

  1. Multi-armed Bandit
    Optimisation in Clojure
    @pingles
    Paul Ingles,
    Principal Engineer, uSwitch.com

  2. Product optimisation cycles are long, complex and inefficient.

  3.

  4. Source: http://danariely.com/the-books/excerpted-from-chapter-1-–-the-truth-about-relativity-2/

  5. Source: http://danariely.com/the-books/excerpted-from-chapter-1-–-the-truth-about-relativity-2/

  6.

  7. Diagram: Experiment 1, Experiment 2 and Experiment 3 run one after
    another over time.

  8. Diagram: Experiments 1, 2 and 3 run sequentially over time, annotated
    with “Participants Needed” and “Effect”.

  9. Bandit strategies
    can help

  10. A product

    for procrastinators 

    by a procrastinator

  11.

  12. http://notflix.herokuapp.com/

  13. Source: Action Movie Kid, https://www.youtube.com/watch?v=SU-ZM3xoYag

  14. Source: Action Movie Kid, https://www.youtube.com/watch?v=SU-ZM3xoYag

  15. http://notflix.herokuapp.com/

  16. The Multi-armed
    Bandit Problem

  17.

  18. Exploitation
    Exploration

  19. Exploitation
    Exploration
    Source: printablecolouringpages.co.uk

  20. Exploitation
    Exploration
    Source: printablecolouringpages.co.uk

  21. Exploitation
    Exploration
    Source: printablecolouringpages.co.uk

  22. Bandit Model

  23. Bandit Model
    Arms: {1, 2, …, K}

  24. Bandit Model
    Trials: 1, 2, … T
    Arms: {1, 2, …, K}

  25. Bandit Model
    Rewards: {0, 1}
    Trials: 1, 2, … T
    Arms: {1, 2, …, K}

  26. K-arms
    K-buttons: “SUBMIT” vs. “SUPPORT HAITI” (Source: A/B Testing, Siroker and Koomen, 2013)
    K-pages (Source: A/B Testing, Siroker and Koomen, 2013)
    K-headlines: “Pierce and Wilfork show off their Boston accents” vs. “Pierce, Wilfork show off their bad Boston accents”
    (Source: http://knightlab.northwestern.edu/2013/08/15/designing-from-data-how-news-organizations-use-ab-testing-to-increase-user-engagement/)

  27. Bandit Strategy

  28. Bandit Strategy
    (defn select-arm [arms]
    …)
    Arm selection

  29. Bandit Strategy
    (defn select-arm [arms]
    …)
    Arm selection
    (defn reward [arm x]
    …)
    (defn pulled [arm]
    …)
    Updating arm
    state with
    feedback
    }

  30. Bandit Strategy
    (defrecord Arm [name pulls value])
    (defn select-arm [arms]
    …)
    Arm selection
    (defn reward [arm x]
    …)
    (defn pulled [arm]
    …)
    Updating arm
    state with
    feedback
    }
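
    A single trial ties these three functions together. A minimal sketch of
    the orchestration (illustrative only; run-trial and observe-reward are
    not part of the library):

    (defn run-trial [arms observe-reward]
      (let [arm (select-arm arms)
            x   (observe-reward arm)] ; x is 0 or 1
        (-> arm
            (pulled)
            (reward x))))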

  31. Clojure Bandit Algorithms

  32. Clojure Bandit Algorithms
    • Epsilon-Greedy

  33. Clojure Bandit Algorithms
    • Epsilon-Greedy
    • Thompson Sampling

  34. ε-greedy

  35. ε-greedy
    ε = epsilon = rate of exploration

  36. ε-greedy
    Source: “Bandit Algorithms for Website Optimization”, John Myles White, O’Reilly, 2012
    Diagram: with epsilon = 0.1 (the rate of exploration), 90% of the time we
    exploit the best arm so far; 10% of the time we explore by picking one of
    Arm 1 … Arm K at random.

  37. (defrecord Arm [name pulls value])

  38. (defrecord Arm [name pulls value])

    (defn select-arm [epsilon arms]
      (if (> (rand) epsilon)
        (exploit :value arms)
        (rand-nth arms)))

  39. (defrecord Arm [name pulls value])

    (defn select-arm [epsilon arms]
      (if (> (rand) epsilon)
        (exploit :value arms)
        (rand-nth arms)))

    (defn exploit [k arms]
      (first (sort-by k > arms)))

  40. (defrecord Arm [name pulls value])

    (defn select-arm [epsilon arms]
      (if (> (rand) epsilon)
        (exploit :value arms)
        (rand-nth arms)))

    (defn exploit [k arms]
      (first (sort-by k > arms)))

    user=> (def arms [(Arm. :arm1 2 0)
                      (Arm. :arm2 2 1)])
    user=> (select-arm 1.0 arms)
    #user.Arm{:name :arm1, :pulls 2, :value 0}
    user=> (select-arm 0.0 arms)
    #user.Arm{:name :arm2, :pulls 2, :value 1}

  41. user=> (def arms [(Arm. :arm1 0 0)
                        (Arm. :arm2 0 0)])
    user=> (def selected-arm (select-arm 0.1 arms))
    #user.Arm{:name :arm1, :pulls 0, :value 0}
    user=> (-> selected-arm
               (pulled)
               ;; time passes
               (reward 1))
    #user.Arm{:name :arm1, :pulls 1, :value 1}

  42. user=> (def arms [(Arm. :arm1 0 0)
                        (Arm. :arm2 0 0)])
    user=> (def selected-arm (select-arm 0.1 arms))
    #user.Arm{:name :arm1, :pulls 0, :value 0}
    user=> (-> selected-arm
               (pulled)
               ;; time passes
               (reward 1))
    #user.Arm{:name :arm1, :pulls 1, :value 1}
    pulled increments the :pulls counter; reward increments the :value counter.
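
    The slides don't show pulled and reward themselves; a minimal sketch
    consistent with the REPL output above (assuming a reward of 1 adds 1 to
    :value):

    (defn pulled [arm]
      (update-in arm [:pulls] inc))

    (defn reward [arm x]
      (update-in arm [:value] + x))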

  43. ε-greedy Behaviour

  44. ε-greedy Behaviour
    (bernoulli-bandit {:arm1 0.1
                       :arm2 0.1
                       :arm3 0.1
                       :arm4 0.1
                       :arm5 0.9})
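
    bernoulli-bandit sets up a simulated environment in which each arm pays
    out with the hidden probability given above. A hedged sketch of drawing
    one reward (draw-reward is illustrative, not the library's API):

    (defn draw-reward [arm-probabilities arm-name]
      ;; returns 1 with the arm's hidden probability, otherwise 0
      (if (< (rand) (get arm-probabilities arm-name)) 1 0))

    ;; (draw-reward {:arm1 0.1 :arm5 0.9} :arm5) ;=> 1 most of the time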

  45. ε-greedy Behaviour
    Plot: Pr(selected = optimal arm) against time (500 trials), for
    epsilon = 0.1 and epsilon = 0.2.

  46. Thompson
    Sampling
    aka Bayesian Bandit

  47. Arm Model
    Arm k’s hidden true probability of reward: θk ∈ [0, 1]

  48. Arm Model
    Plot: density over θk; pulls = 0, rewards = 0.

  49. Arm Model
    Plot: density over θk; pulls = 1, rewards = 0.

  50. Arm Model
    Plot: density over θk; pulls = 10, rewards = 4.

  51. Arm Model
    Plot: density over θk; pulls = 1000, rewards = 400.
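
    These densities are Beta distributions updated from the counts; with a
    uniform prior (prior = 1, as in the estimate-value code later):

    θk ~ Beta(prior + rewards, prior + (pulls − rewards))

    e.g. pulls = 1000, rewards = 400 gives Beta(401, 601), concentrated around 0.4.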

  52. Thompson Sampling
    Plot: posterior densities over θk for two arms: pulls = 100, rewards = 40
    and pulls = 10, rewards = 6.

  53. Thompson Sampling
    Plot: posterior densities over θk for two arms: pulls = 100, rewards = 40
    and pulls = 10, rewards = 6; one sampled value shown: 0.41.

  54. Thompson Sampling
    Plot: posterior densities over θk for two arms: pulls = 100, rewards = 40
    and pulls = 10, rewards = 6; sampled values shown: 0.41 and 0.57.

  55. Thompson Sampling
    (use 'incanter.distributions)

    (defn estimate-value
      [{:keys [pulls value] :as arm}]
      (let [prior 1.0
            alpha (+ prior value)
            beta  (+ prior (- pulls value))]
        (assoc arm
          :theta (draw (beta-distribution alpha beta)))))

    (defn select-arm [arms]
      (exploit :theta (map estimate-value arms)))

  56. Thompson Sampling
    (use 'incanter.distributions)

    (defn estimate-value
      [{:keys [pulls value] :as arm}]
      (let [prior 1.0
            alpha (+ prior value)
            beta  (+ prior (- pulls value))]
        (assoc arm
          :theta (draw (beta-distribution alpha beta)))))

    (defn select-arm [arms]
      (exploit :theta (map estimate-value arms)))

    user=> (select-arm [(Arm. :blue 10 6) (Arm. :red 100 40)])
    #user.Arm{:name :blue, :pulls 10, :value 6, :theta 0.4978…}
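
    Note that :theta is re-drawn from each arm's Beta distribution on every
    call, so repeated calls can select different arms; arms whose posterior
    puts more mass on high reward rates are simply chosen more often.

    user=> (:name (select-arm [(Arm. :blue 10 6) (Arm. :red 100 40)]))
    ;; sometimes :blue, sometimes :red; the draws differ per call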

  57. Thompson Sampling Behaviour
    (bernoulli-bandit {:arm1 0.1
                       :arm2 0.1
                       :arm3 0.1
                       :arm4 0.1
                       :arm5 0.9})

  58. Strategy Accuracy
    Plot: Pr(selected = optimal arm) against time (500 trials) for
    epsilon-greedy (epsilon = 0.1 and 0.2) and Thompson Sampling.

  59. Thompson Sampling
    • Smoothly balances the explore/exploit tradeoff.
    • Optimal convergence: logarithmic regret.
    • We can use it to rank.
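
    (Regret is the expected reward lost by not always pulling the best arm;
    logarithmic regret means that loss grows only logarithmically with the
    number of trials.)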

  60. Ranking with Thompson Sampling
    (defn rank [k arms]
      (sort-by k > arms))

    user=> (def arms [(Arm. :blue 10 6)
                      (Arm. :red 100 40)])
    user=> (rank :theta (map estimate-value arms))
    (#user.Arm{:name :red, :pulls 100, :value 40, :theta 0.3979…}
     #user.Arm{:name :blue, :pulls 10, :value 6, :theta 0.2789…})

  61. Ranking with Thompson Sampling
    (defn rank [k arms]
      (sort-by k > arms))

    user=> (def arms [(Arm. :blue 10 6)
                      (Arm. :red 100 40)])
    user=> (rank :theta (map estimate-value arms))
    (#user.Arm{:name :red, :pulls 100, :value 40, :theta 0.3979…}
     #user.Arm{:name :blue, :pulls 10, :value 6, :theta 0.2789…})
    SWEEEEEEETTT

  62. [bandit/bandit-core "0.2.1-SNAPSHOT"]
    https://github.com/pingles/bandit

  63.

  64. Ranking
    Diagram: a ranked list of videos (1, 2, 3).

  65. Ranking
    Diagram: the ranked list (1, 2, 3) is produced by a Video Rank Bandit.

  66. Ranking
    Diagram: the ranked videos (1, 2, 3) are the Video Rank Bandit’s arms.

  67. Ranking
    Diagram: the Video Rank Bandit’s arms are the ranked videos (1, 2, 3);
    each video also has its own Thumbnail Bandit.

  68. Ranking
    Diagram: as before, with the Portal video highlighted; its Thumbnail
    Bandit’s arms are that video’s candidate thumbnails.

  69. Implementing Notflix: State
    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  70. Implementing Notflix: State
    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  71. Implementing Notflix: State
    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (defn initialise-bandits []
      {:video-rank (bandit.arms/bandit "chilli" "portal" "coconut")
       "chilli"    (bandit.arms/bandit "chilli1.png" "chilli2.png" …)
       "coconut"   (bandit.arms/bandit "coconut1.png" "coconut2.png" …)
       "portal"    (bandit.arms/bandit "portal1.png" "portal2.png" …)})

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  72. Implementing Notflix: State
    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (defn initialise-bandits []
      {:video-rank (bandit.arms/bandit "chilli" "portal" "coconut")
       "chilli"    (bandit.arms/bandit "chilli1.png" "chilli2.png" …)
       "coconut"   (bandit.arms/bandit "coconut1.png" "coconut2.png" …)
       "portal"    (bandit.arms/bandit "portal1.png" "portal2.png" …)})

    (defn -main []
      (dosync
        (alter bandits merge (initialise-bandits))))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  73. Implementing Notflix: Drawing Arms
    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  74. Implementing Notflix: Drawing Arms
    (defn rank-arms [bandit-id]
      (->> (get @bandits bandit-id)
           (vals)  ; #Arm{:name "video1" …}
                   ; #Arm{:name "video2" …}
           (map bandit.algo.bayes/estimate-value)
           (bandit.arms/rank :theta)))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  75. Implementing Notflix: Drawing Arms
    (defn rank-arms [bandit-id]
      (->> (get @bandits bandit-id)
           (vals)  ; #Arm{:name "video1" …}
                   ; #Arm{:name "video2" …}
           (map bandit.algo.bayes/estimate-value)
           (bandit.arms/rank :theta)))

    (defn select-arm [bandit-id]
      (first (rank-arms bandit-id)))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}
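
    Drawing for a single bandit is then just taking the top-ranked arm; a
    usage sketch with the bandit ids defined earlier:

    (:name (select-arm :video-rank))  ; the video to place first
    (:name (select-arm "portal"))     ; e.g. "portal1.png" or "portal2.png"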

  76. Implementing Notflix: Arm Feedback
    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  77. Implementing Notflix: Arm Feedback
    (defn pulled-arm! [bandit-id arm-label]
      (alter bandits
             update-in [bandit-id arm-label] bandit.arms/pulled))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  78. Implementing Notflix: Arm Feedback
    (defn pulled-arm! [bandit-id arm-label]
      (alter bandits
             update-in [bandit-id arm-label] bandit.arms/pulled))

    (defn pulled-all-arms! [bandit-id]
      (alter bandits
             update-in [bandit-id] (fmap bandit.arms/pulled))) ; fmap maps pulled over every arm

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  79. Implementing Notflix: Arm Feedback
    (defn pulled-arm! [bandit-id arm-label]
      (alter bandits
             update-in [bandit-id arm-label] bandit.arms/pulled))

    (defn pulled-all-arms! [bandit-id]
      (alter bandits
             update-in [bandit-id] (fmap bandit.arms/pulled))) ; fmap maps pulled over every arm

    (defn reward-arm! [bandit-id arm-label]
      (alter bandits
             update-in [bandit-id arm-label] #(bandit.algo.bayes/reward % 1)))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}
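
    alter must run inside a transaction, so callers wrap these in dosync; a
    usage sketch mirroring the web handlers on the next slides:

    (dosync
      (pulled-arm! "portal" "portal1.png")
      ;; later, when that thumbnail is clicked:
      (reward-arm! "portal" "portal1.png")
      (reward-arm! :video-rank "portal"))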

  80. Implementing Notflix: Web App
    (defroutes notflix-routes
      (GET "/" []
        (dosync
          (let [ranked-labels (map :name (rank-arms :video-rank))]
            (pulled-all-arms! :video-rank)
            (videos-page-html ranked-labels))))
      (POST "/reward/:video-name/:thumb-name" [video-name thumb-name]
        (dosync
          (reward-arm! video-name thumb-name)
          (reward-arm! :video-rank video-name)
          (redirect-after-post "/"))))

    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  81. Implementing Notflix: Web App
    (defroutes notflix-routes
      (GET "/" []
        (dosync
          (let [ranked-labels (map :name (rank-arms :video-rank))]
            (pulled-all-arms! :video-rank)
            (videos-page-html ranked-labels))))
      (POST "/reward/:video-name/:thumb-name" [video-name thumb-name]
        (dosync
          (reward-arm! video-name thumb-name)
          (reward-arm! :video-rank video-name)
          (redirect-after-post "/"))))

    (defn videos-page-html [video-names]
      (for [video-name video-names]
        (html ...
          (dosync
            (let [video-url  (:url (get videos video-name))
                  thumb-name (:name (select-arm video-name))]
              (pulled-arm! video-name thumb-name)
              [:a {:href video-url} [:img {:src thumb-name}]])))))

    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  82. Algorithm Performance: Video Ranking Bandit

  83. Algorithm Performance: Video Ranking Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the video arms.

  84. Algorithm Performance: Video Ranking Bandit
    Plot: posterior densities over θk = Pr(click), with “hero of the coconut pain” labelled.

  85. Algorithm Performance: Video Ranking Bandit
    Plot: posterior densities over θk = Pr(click), with “hero of the coconut pain”
    and “1000 danes eat 1000 chillis” labelled.

  86. Algorithm Performance: Video Ranking Bandit
    Plot: posterior densities over θk = Pr(click), with “hero of the coconut pain”,
    “1000 danes eat 1000 chillis” and “3 year-old with a portal gun” labelled.

  87. Algorithm Performance: Portal Thumbnail Bandit

  88. Algorithm Performance: Portal Thumbnail Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the Portal thumbnail arms.

  89. Algorithm Performance: Portal Thumbnail Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the Portal thumbnail arms.

  90. Algorithm Performance: Portal Thumbnail Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the Portal thumbnail arms.

  91. Algorithm Performance: Portal Thumbnail Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the Portal thumbnail arms.

  92. http://notflix.herokuapp.com/

  93. References
    1. Ariely, D., 2010, “Predictably Irrational”, Harper Perennial
    2. Kahneman, D., 2011, “Thinking, Fast and Slow”, Farrar, Straus and Giroux
    3. Myles White, J., 2012, “Bandit Algorithms for Website Optimization”, O’Reilly
    4. Scott, S., 2010, “A modern Bayesian look at the multi-armed bandit”
    5. http://tdunning.blogspot.co.uk/2012/02/bayesian-bandits.html
    6. http://www.chrisstucchio.com/blog/2013/bayesian_bandit.html
    7. http://www.chrisstucchio.com/blog/2013/bayesian_analysis_conversion_rates.html
    8. Siroker and Koomen, 2013, “A/B Testing”, Wiley
    @pingles
    https://github.com/pingles/bandit
