Multi-armed Bandit Optimisation in Clojure

Multi-armed Bandit strategies provide efficient and intuitive models that are often more easily applied to product optimisation than traditional methods.

Through building a sample application, Notflix, we use the Thompson Sampling algorithm to rank videos and select thumbnails to optimise click-throughs.

Paul Ingles

June 26, 2014

Transcript

  1. Multi-armed Bandit
    Optimisation in Clojure
    @pingles
    Paul Ingles,
    Principal Engineer, uSwitch.com

  2. Product optimisation cycles are long, complex and inefficient.

  3.

  4. Source: http://danariely.com/the-books/excerpted-from-chapter-1-–-the-truth-about-relativity-2/

  5. Source: http://danariely.com/the-books/excerpted-from-chapter-1-–-the-truth-about-relativity-2/

  6.

  7. Diagram: Experiment 1, Experiment 2 and Experiment 3 run one after
    another over time.

  8. Diagram: Experiments 1, 2 and 3 run sequentially over time, annotated
    with “Participants Needed” and “Effect”.

  9. Bandit strategies
    can help

  10. A product

    for procrastinators 

    by a procrastinator

  11.

  12. http://notflix.herokuapp.com/

  13. Source: Action Movie Kid, https://www.youtube.com/watch?v=SU-ZM3xoYag

  14. Source: Action Movie Kid, https://www.youtube.com/watch?v=SU-ZM3xoYag

  15. http://notflix.herokuapp.com/

  16. The Multi-armed
    Bandit Problem

  17.

  18. Exploitation
    Exploration

  19. Exploitation
    Exploration
    Source: printablecolouringpages.co.uk

  20. Exploitation
    Exploration
    Source: printablecolouringpages.co.uk

  21. Exploitation
    Exploration
    Source: printablecolouringpages.co.uk

  22. Bandit Model

  23. Bandit Model
    Arms: {1, 2, …, K}

  24. Bandit Model
    Trials: 1, 2, … T
    Arms: {1, 2, …, K}

  25. Bandit Model
    Rewards: {0, 1}
    Trials: 1, 2, … T
    Arms: {1, 2, …, K}

  26. K-arms
    K-buttons: “SUBMIT” vs. “SUPPORT HAITI” (Source: A/B Testing, Siroker and Koomen, 2013)
    K-pages (Source: A/B Testing, Siroker and Koomen, 2013)
    K-headlines: “Pierce and Wilfork show off their Boston accents” vs. “Pierce, Wilfork show off their bad Boston accents”
    (Source: http://knightlab.northwestern.edu/2013/08/15/designing-from-data-how-news-organizations-use-ab-testing-to-increase-user-engagement/)

  27. Bandit Strategy

  28. Bandit Strategy
    (defn select-arm [arms]
    …)
    Arm selection

  29. Bandit Strategy
    (defn select-arm [arms]
    …)
    Arm selection
    (defn reward [arm x]
    …)
    (defn pulled [arm]
    …)
    Updating arm
    state with
    feedback
    }

  30. Bandit Strategy
    (defrecord Arm [name pulls value])
    (defn select-arm [arms]
    …)
    Arm selection
    (defn reward [arm x]
    …)
    (defn pulled [arm]
    …)
    Updating arm
    state with
    feedback
    }
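
    A single trial ties these three functions together. A minimal sketch of
    the orchestration (illustrative only; run-trial and observe-reward are
    not part of the library):

    (defn run-trial [arms observe-reward]
      (let [arm (select-arm arms)
            x   (observe-reward arm)] ; x is 0 or 1
        (-> arm
            (pulled)
            (reward x))))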

  31. Clojure Bandit Algorithms

  32. Clojure Bandit Algorithms
    • Epsilon-Greedy

  33. Clojure Bandit Algorithms
    • Epsilon-Greedy
    • Thompson Sampling

  34. ε-greedy

  35. ε-greedy
    ε = epsilon = rate of exploration

  36. ε-greedy
    Source: “Bandit Algorithms for Website Optimization”, John Myles White, O’Reilly, 2012
    Diagram: with epsilon = 0.1 (the rate of exploration), 90% of the time we
    exploit the best arm so far; 10% of the time we explore by picking one of
    Arm 1 … Arm K at random.

  37. (defrecord Arm [name pulls value])

  38. (defrecord Arm [name pulls value])

    (defn select-arm [epsilon arms]
      (if (> (rand) epsilon)
        (exploit :value arms)
        (rand-nth arms)))

  39. (defrecord Arm [name pulls value])

    (defn select-arm [epsilon arms]
      (if (> (rand) epsilon)
        (exploit :value arms)
        (rand-nth arms)))

    (defn exploit [k arms]
      (first (sort-by k > arms)))

  40. (defrecord Arm [name pulls value])

    (defn select-arm [epsilon arms]
      (if (> (rand) epsilon)
        (exploit :value arms)
        (rand-nth arms)))

    (defn exploit [k arms]
      (first (sort-by k > arms)))

    user=> (def arms [(Arm. :arm1 2 0)
                      (Arm. :arm2 2 1)])
    user=> (select-arm 1.0 arms)
    #user.Arm{:name :arm1, :pulls 2, :value 0}
    user=> (select-arm 0.0 arms)
    #user.Arm{:name :arm2, :pulls 2, :value 1}

  41. user=> (def arms [(Arm. :arm1 0 0)
                        (Arm. :arm2 0 0)])
    user=> (def selected-arm (select-arm 0.1 arms))
    #user.Arm{:name :arm1, :pulls 0, :value 0}
    user=> (-> selected-arm
               (pulled)
               ;; time passes
               (reward 1))
    #user.Arm{:name :arm1, :pulls 1, :value 1}

  42. user=> (def arms [(Arm. :arm1 0 0)
                        (Arm. :arm2 0 0)])
    user=> (def selected-arm (select-arm 0.1 arms))
    #user.Arm{:name :arm1, :pulls 0, :value 0}
    user=> (-> selected-arm
               (pulled)
               ;; time passes
               (reward 1))
    #user.Arm{:name :arm1, :pulls 1, :value 1}
    pulled increments the :pulls counter; reward increments the :value counter.
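
    The slides don't show pulled and reward themselves; a minimal sketch
    consistent with the REPL output above (assuming a reward of 1 adds 1 to
    :value):

    (defn pulled [arm]
      (update-in arm [:pulls] inc))

    (defn reward [arm x]
      (update-in arm [:value] + x))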

  43. ε-greedy Behaviour

  44. ε-greedy Behaviour
    (bernoulli-bandit {:arm1 0.1
                       :arm2 0.1
                       :arm3 0.1
                       :arm4 0.1
                       :arm5 0.9})
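
    bernoulli-bandit sets up a simulated environment in which each arm pays
    out with the hidden probability given above. A hedged sketch of drawing
    one reward (draw-reward is illustrative, not the library's API):

    (defn draw-reward [arm-probabilities arm-name]
      ;; returns 1 with the arm's hidden probability, otherwise 0
      (if (< (rand) (get arm-probabilities arm-name)) 1 0))

    ;; (draw-reward {:arm1 0.1 :arm5 0.9} :arm5) ;=> 1 most of the time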

  45. ε-greedy Behaviour
    Plot: Pr(selected = optimal arm) against time (500 trials), for
    epsilon = 0.1 and epsilon = 0.2.

  46. Thompson
    Sampling
    aka Bayesian Bandit

  47. Arm Model
    Arm k’s hidden true probability of reward: θk ∈ [0, 1]

  48. Arm Model
    Plot: density over θk; pulls = 0, rewards = 0.

  49. Arm Model
    Plot: density over θk; pulls = 1, rewards = 0.

  50. Arm Model
    Plot: density over θk; pulls = 10, rewards = 4.

  51. Arm Model
    Plot: density over θk; pulls = 1000, rewards = 400.
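
    These densities are Beta distributions updated from the counts; with a
    uniform prior (prior = 1, as in the estimate-value code later):

    θk ~ Beta(prior + rewards, prior + (pulls − rewards))

    e.g. pulls = 1000, rewards = 400 gives Beta(401, 601), concentrated around 0.4.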

  52. Thompson Sampling
    Plot: posterior densities over θk for two arms: pulls = 100, rewards = 40
    and pulls = 10, rewards = 6.

  53. Thompson Sampling
    Plot: posterior densities over θk for two arms: pulls = 100, rewards = 40
    and pulls = 10, rewards = 6; one sampled value shown: 0.41.

  54. Thompson Sampling
    Plot: posterior densities over θk for two arms: pulls = 100, rewards = 40
    and pulls = 10, rewards = 6; sampled values shown: 0.41 and 0.57.

  55. Thompson Sampling
    (use 'incanter.distributions)

    (defn estimate-value
      [{:keys [pulls value] :as arm}]
      (let [prior 1.0
            alpha (+ prior value)
            beta  (+ prior (- pulls value))]
        (assoc arm
          :theta (draw (beta-distribution alpha beta)))))

    (defn select-arm [arms]
      (exploit :theta (map estimate-value arms)))

  56. Thompson Sampling
    (use 'incanter.distributions)

    (defn estimate-value
      [{:keys [pulls value] :as arm}]
      (let [prior 1.0
            alpha (+ prior value)
            beta  (+ prior (- pulls value))]
        (assoc arm
          :theta (draw (beta-distribution alpha beta)))))

    (defn select-arm [arms]
      (exploit :theta (map estimate-value arms)))

    user=> (select-arm [(Arm. :blue 10 6) (Arm. :red 100 40)])
    #user.Arm{:name :blue, :pulls 10, :value 6, :theta 0.4978…}
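
    Note that :theta is re-drawn from each arm's Beta distribution on every
    call, so repeated calls can select different arms; arms whose posterior
    puts more mass on high reward rates are simply chosen more often.

    user=> (:name (select-arm [(Arm. :blue 10 6) (Arm. :red 100 40)]))
    ;; sometimes :blue, sometimes :red; the draws differ per call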

  57. Thompson Sampling Behaviour
    (bernoulli-bandit {:arm1 0.1
                       :arm2 0.1
                       :arm3 0.1
                       :arm4 0.1
                       :arm5 0.9})

  58. Strategy Accuracy
    Plot: Pr(selected = optimal arm) against time (500 trials) for
    epsilon-greedy (epsilon = 0.1 and 0.2) and Thompson Sampling.

  59. Thompson Sampling
    • Smoothly balances the explore/exploit tradeoff.
    • Optimal convergence: logarithmic regret.
    • We can use it to rank.
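
    (Regret is the expected reward lost by not always pulling the best arm;
    logarithmic regret means that loss grows only logarithmically with the
    number of trials.)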

  60. Ranking with Thompson Sampling
    (defn rank [k arms]
      (sort-by k > arms))

    user=> (def arms [(Arm. :blue 10 6)
                      (Arm. :red 100 40)])
    user=> (rank :theta (map estimate-value arms))
    (#user.Arm{:name :red, :pulls 100, :value 40, :theta 0.3979…}
     #user.Arm{:name :blue, :pulls 10, :value 6, :theta 0.2789…})

  61. Ranking with Thompson Sampling
    (defn rank [k arms]
      (sort-by k > arms))

    user=> (def arms [(Arm. :blue 10 6)
                      (Arm. :red 100 40)])
    user=> (rank :theta (map estimate-value arms))
    (#user.Arm{:name :red, :pulls 100, :value 40, :theta 0.3979…}
     #user.Arm{:name :blue, :pulls 10, :value 6, :theta 0.2789…})
    SWEEEEEEETTT

  62. [bandit/bandit-core "0.2.1-SNAPSHOT"]
    https://github.com/pingles/bandit

  63.

  64. Ranking
    Diagram: a ranked list of videos (1, 2, 3).

  65. Ranking
    Diagram: the ranked list (1, 2, 3) is produced by a Video Rank Bandit.

  66. Ranking
    Diagram: the ranked videos (1, 2, 3) are the Video Rank Bandit’s arms.

  67. Ranking
    Diagram: the Video Rank Bandit’s arms are the ranked videos (1, 2, 3);
    each video also has its own Thumbnail Bandit.

  68. Ranking
    Diagram: as before, with the Portal video highlighted; its Thumbnail
    Bandit’s arms are that video’s candidate thumbnails.

  69. Implementing Notflix: State
    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  70. Implementing Notflix: State
    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  71. Implementing Notflix: State
    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (defn initialise-bandits []
      {:video-rank (bandit.arms/bandit "chilli" "portal" "coconut")
       "chilli"    (bandit.arms/bandit "chilli1.png" "chilli2.png" …)
       "coconut"   (bandit.arms/bandit "coconut1.png" "coconut2.png" …)
       "portal"    (bandit.arms/bandit "portal1.png" "portal2.png" …)})

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  72. Implementing Notflix: State
    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (defn initialise-bandits []
      {:video-rank (bandit.arms/bandit "chilli" "portal" "coconut")
       "chilli"    (bandit.arms/bandit "chilli1.png" "chilli2.png" …)
       "coconut"   (bandit.arms/bandit "coconut1.png" "coconut2.png" …)
       "portal"    (bandit.arms/bandit "portal1.png" "portal2.png" …)})

    (defn -main []
      (dosync
        (alter bandits merge (initialise-bandits))))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  73. Implementing Notflix: Drawing Arms
    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  74. Implementing Notflix: Drawing Arms
    (defn rank-arms [bandit-id]
      (->> (get @bandits bandit-id)
           (vals)  ; #Arm{:name "video1" …}
                   ; #Arm{:name "video2" …}
           (map bandit.algo.bayes/estimate-value)
           (bandit.arms/rank :theta)))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  75. Implementing Notflix: Drawing Arms
    (defn rank-arms [bandit-id]
      (->> (get @bandits bandit-id)
           (vals)  ; #Arm{:name "video1" …}
                   ; #Arm{:name "video2" …}
           (map bandit.algo.bayes/estimate-value)
           (bandit.arms/rank :theta)))

    (defn select-arm [bandit-id]
      (first (rank-arms bandit-id)))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}
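
    Drawing for a single bandit is then just taking the top-ranked arm; a
    usage sketch with the bandit ids defined earlier:

    (:name (select-arm :video-rank))  ; the video to place first
    (:name (select-arm "portal"))     ; e.g. "portal1.png" or "portal2.png"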

  76. Implementing Notflix: Arm Feedback
    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  77. Implementing Notflix: Arm Feedback
    (defn pulled-arm! [bandit-id arm-label]
      (alter bandits
             update-in [bandit-id arm-label] bandit.arms/pulled))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  78. Implementing Notflix: Arm Feedback
    (defn pulled-arm! [bandit-id arm-label]
      (alter bandits
             update-in [bandit-id arm-label] bandit.arms/pulled))

    (defn pulled-all-arms! [bandit-id]
      (alter bandits
             update-in [bandit-id] (fmap bandit.arms/pulled))) ; fmap maps pulled over every arm

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  79. Implementing Notflix: Arm Feedback
    (defn pulled-arm! [bandit-id arm-label]
      (alter bandits
             update-in [bandit-id arm-label] bandit.arms/pulled))

    (defn pulled-all-arms! [bandit-id]
      (alter bandits
             update-in [bandit-id] (fmap bandit.arms/pulled))) ; fmap maps pulled over every arm

    (defn reward-arm! [bandit-id arm-label]
      (alter bandits
             update-in [bandit-id arm-label] #(bandit.algo.bayes/reward % 1)))

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}
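
    alter must run inside a transaction, so callers wrap these in dosync; a
    usage sketch mirroring the web handlers on the next slides:

    (dosync
      (pulled-arm! "portal" "portal1.png")
      ;; later, when that thumbnail is clicked:
      (reward-arm! "portal" "portal1.png")
      (reward-arm! :video-rank "portal"))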

  80. Implementing Notflix: Web App
    (defroutes notflix-routes
      (GET "/" []
        (dosync
          (let [ranked-labels (map :name (rank-arms :video-rank))]
            (pulled-all-arms! :video-rank)
            (videos-page-html ranked-labels))))
      (POST "/reward/:video-name/:thumb-name" [video-name thumb-name]
        (dosync
          (reward-arm! video-name thumb-name)
          (reward-arm! :video-rank video-name)
          (redirect-after-post "/"))))

    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  81. Implementing Notflix: Web App
    (defroutes notflix-routes
      (GET "/" []
        (dosync
          (let [ranked-labels (map :name (rank-arms :video-rank))]
            (pulled-all-arms! :video-rank)
            (videos-page-html ranked-labels))))
      (POST "/reward/:video-name/:thumb-name" [video-name thumb-name]
        (dosync
          (reward-arm! video-name thumb-name)
          (reward-arm! :video-rank video-name)
          (redirect-after-post "/"))))

    (defn videos-page-html [video-names]
      (for [video-name video-names]
        (html ...
          (dosync
            (let [video-url  (:url (get videos video-name))
                  thumb-name (:name (select-arm video-name))]
              (pulled-arm! video-name thumb-name)
              [:a {:href video-url} [:img {:src thumb-name}]])))))

    (def videos {"chilli" {:url "http://youtube.com/?v=1"}
                 "portal" {:url "http://youtube.com/?v=2"}})

    (def bandits (ref nil))
    ; {:video-rank {"chilli" #Arm{:pulls …
    ;               "portal" #Arm{:pulls …
    ;  "chilli" …}

  82. Algorithm Performance: Video Ranking Bandit

  83. Algorithm Performance: Video Ranking Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the video arms.

  84. Algorithm Performance: Video Ranking Bandit
    Plot: posterior densities over θk = Pr(click), with “hero of the coconut pain” labelled.

  85. Algorithm Performance: Video Ranking Bandit
    Plot: posterior densities over θk = Pr(click), with “hero of the coconut pain”
    and “1000 danes eat 1000 chillis” labelled.

  86. Algorithm Performance: Video Ranking Bandit
    Plot: posterior densities over θk = Pr(click), with “hero of the coconut pain”,
    “1000 danes eat 1000 chillis” and “3 year-old with a portal gun” labelled.

  87. Algorithm Performance: Portal Thumbnail Bandit

  88. Algorithm Performance: Portal Thumbnail Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the Portal thumbnail arms.

  89. Algorithm Performance: Portal Thumbnail Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the Portal thumbnail arms.

  90. Algorithm Performance: Portal Thumbnail Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the Portal thumbnail arms.

  91. Algorithm Performance: Portal Thumbnail Bandit
    Plot: posterior densities (pdf) over θk = Pr(click) for the Portal thumbnail arms.

  92. http://notflix.herokuapp.com/

  93. References
    1. Ariely, D., 2010, “Predictably Irrational”, Harper Perennial
    2. Kahneman, D., 2011, “Thinking, Fast and Slow”, Farrar, Straus and Giroux
    3. Myles White, J., 2012, “Bandit Algorithms for Website Optimization”, O’Reilly
    4. Scott, S., 2010, “A modern Bayesian look at the multi-armed bandit”
    5. http://tdunning.blogspot.co.uk/2012/02/bayesian-bandits.html
    6. http://www.chrisstucchio.com/blog/2013/bayesian_bandit.html
    7. http://www.chrisstucchio.com/blog/2013/bayesian_analysis_conversion_rates.html
    8. Siroker and Koomen, 2013, “A/B Testing”, Wiley
    @pingles
    https://github.com/pingles/bandit
