
Oct 2018 Meetup: Reinforcement Learning (RL): A Gentle Introduction with a Real-World Application by Christian Hidber

Reinforcement Learning (RL) learns complex processes autonomously. No big datasets with the "right" answers are needed: the algorithms learn by experimenting. In 2017, RL algorithms beat the reigning Go world champion. In this session, I demonstrate how and why RL works. As a practical example, we apply RL to a syphonic roof drainage system, where choosing the "right" dimensions out of trillions of possibilities is extremely difficult.

By the way, syphonic roof drainage systems prevent large buildings, such as airports or stadiums, from collapsing in heavy rain.
The talk shows how RL complements our existing machine-learning solution and how it allowed us to reduce the failure rate by an astonishing 70 percent.

Speaker: Christian Hidber
Christian is a consultant at bSquare with a focus on .NET development, machine learning and Azure, and an international conference speaker. He has a PhD in computer algebra from ETH Zurich and did a postdoc at UC Berkeley, where he researched online data mining algorithms. Currently he applies machine learning to industrial hydraulics simulations in the context of a product with 7,000 installations in 42 countries around the world.

You can find him at:
https://www.linkedin.com/in/christian-hidber/

Azure Zurich User Group

October 30, 2018

Transcript

  1. Slide 4. The game: setup (M3, October 2018, Reinforcement Learning)

    The learner sends actions to the game engine; the game engine returns the new game state and a step reward. Goal: maximize the sum of rewards. ("This photo by an unknown author is licensed under CC BY-NC.")
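The interaction loop on this slide can be sketched in a few lines of Python. The `GameEngine` dynamics, the action set, and the reward values below are invented purely for illustration:

```python
import random

class GameEngine:
    """Toy stand-in for the slide's game engine: it turns each action
    into a new game state and a step reward, ending after 5 steps."""
    def __init__(self):
        self.state = 0
        self.steps = 0

    def step(self, action):
        self.steps += 1
        self.state += action                  # new game state
        reward = 1 if action > 0 else -1      # step reward
        done = self.steps >= 5                # episode over?
        return self.state, reward, done

def choose_action(state):
    """The learner: here just a random policy over two actions."""
    return random.choice([-1, 1])

engine = GameEngine()
state, total_reward, done = engine.state, 0, False
while not done:                               # goal: maximize sum of rewards
    action = choose_action(state)
    state, reward, done = engine.step(action)
    total_reward += reward
```

Everything the learner ever sees is the state, the step reward, and the episode-over flag; maximizing `total_reward` is the whole objective.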
  2. Slide 5. The game: positive feedback

    The same loop: the learner sends actions to the game engine and receives the game state and a step reward, here a positive one.

  3. Slide 6. The game: negative feedback

    The same loop again, now with a negative step reward.

  4. Slide 7. Policy (rules learned, how to play the game)

    The learned "stuff" is the policy: it sits inside the learner box of the loop.
  5. Slide 8. Policy (rules learned, how to play the game)

    Policy improvement = learning.

  6. Slide 9. Policy (rules learned, how to play the game)

    Policy improvement = learning (repeated).

  7. Slide 10. Policy (rules learned, how to play the game)

    Reinforcement learning: the learner box becomes the RL algorithm. Key idea: continuously improve the policy to increase the total reward.
  8. Slide 11. Episode 1: play with the 1st policy (random)

    The step table records, for steps 1-7: Step #, State, Action from Policy, Reward, Next State. Reward for this step: -1.

  9. Slide 12. Episode 1: play with the 1st policy (random)

    Reward for this step: +100.

  10. Slide 13. Episode 1: play with the 1st policy (random)

    Reward for this step: -50. Episode over.
  11. Slide 14. Episode 1: improve the 1st policy for the state in step 3

    The episode ended here with reward -50.

  12. Slide 15. Episode 1: improve the 1st policy for the state in step 2

    Future Reward (the sum of all rewards from the current state until "game over"): 50 (= +100 - 50).

  13. Slide 16. Episode 1: improve the 1st policy for the state in step 1

    Future Reward: 49 (= -1 + 100 - 50).
  14. Slide 17. Episode 2: play with the 2nd policy

    Reward: -1. Already learned: going left is OK.

  15. Slide 18. Episode 2: play with the 2nd policy

    Reward: +100. Already learned: going left is OK.

  16. Slide 19. Episode 2: play with the 2nd policy

    Reward: +100. Already learned: don't go up.

  17. Slide 20. Episode 2: play with the 2nd policy

    Reward: -1.

  18. Slide 21. Episode 2: play with the 2nd policy

    Reward: -50. Episode over.
  19. Slide 22. Episode 2: improve the 2nd policy for the state in step 5

    Future Reward (the sum of all rewards from the current state until "game over"): -50 (= -50).

  20. Slide 23. Episode 2: improve the 2nd policy for the state in step 4

    Future Reward: -51 (= -1 - 50).

  21. Slide 24. Episode 2: improve the 2nd policy for the state in step 3

    Future Reward: 49 (= +100 - 1 - 50).

  22. Slide 25. Episode 2: improve the 2nd policy for the state in step 2

    Future Reward: 149 (= +100 + 100 - 1 - 50).

  23. Slide 26. Episode 2: improve the 2nd policy for the state in step 2

    The new value 149 is combined with the previously stored value as some running average of old and new.

  24. Slide 27. Episode 2: improve the 2nd policy for the state in step 1

    Future Reward: 148 (= -1 + 100 + 100 - 1 - 50).
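The future rewards on slides 22 to 27 follow one rule: sum all rewards from the current step until "game over". A short helper reproduces the slides' numbers from episode 2's rewards (-1, +100, +100, -1, -50):

```python
def future_rewards(rewards):
    """FutureReward_i = sum of all rewards from step i until 'game over'
    (no discounting, exactly as on the slides)."""
    out, running = [], 0
    for r in reversed(rewards):   # walk backwards, accumulating the sum
        running += r
        out.append(running)
    return out[::-1]

# Episode 2 rewards from the slides:
print(future_rewards([-1, 100, 100, -1, -50]))
# → [148, 149, 49, -51, -50]
```

Walking backwards makes the computation linear: each step's future reward is its own reward plus the already-computed future reward of the next step.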
  25. Slide 28. So far…

    A policy is a map from states to action probabilities.

  26. Slide 29. …updated by the reinforcement learning algorithm

    A policy is a map from states to action probabilities, and the RL algorithm updates that map.

  27. Slide 30. …updated by the reinforcement learning algorithm (repeated)
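Concretely, the map from states to action probabilities can start out as a plain lookup table, e.g. a dictionary. The state names and numbers below are made up for illustration:

```python
# A policy is literally a map from states to action probabilities.
# For a tiny grid game with four moves this can be a plain dict;
# "cell_A" / "cell_B" and the probabilities are invented examples.
policy = {
    "cell_A": {"up": 0.25, "down": 0.25, "left": 0.25, "right": 0.25},
    "cell_B": {"up": 0.10, "down": 0.10, "left": 0.70, "right": 0.10},
}
```

Each state's probabilities sum to 1; the RL algorithm's job is to shift this mass toward actions with high future reward.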
  28. Slide 31. After many, many episodes, for each state…
  29. Slide 32. Algorithm sketch

    Policy = the rules learned, i.e. how to play the game.
    Initialize a table with random action probabilities for each state.
    Repeat:
      Play an episode with the policy given by the table.
      Record (state_1, action_1, reward_1), (state_2, action_2, reward_2), … for the episode.
      For each step i, compute FutureReward_i = reward_i + reward_(i+1) + …
      Update table[state_i] such that:
      - action_i becomes more likely for state_i if FutureReward_i is "high"
      - action_i becomes less likely for state_i if FutureReward_i is "low"

  30. Slide 33. Algorithm sketch (the same sketch, repeated)

  31. Slide 34. Algorithm sketch (the same sketch, repeated)
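A minimal runnable version of this sketch, on an invented three-cell game: "right" walks toward a +100 goal, "left" immediately ends the episode at -50, and each ordinary step costs -1 (the reward values used earlier in the talk). The sign-of-the-future-reward update below is just one simple way to make actions "more/less likely":

```python
import random

ACTIONS = ("left", "right")

def play_episode(table):
    """Play one episode with the policy given by the table, recording
    (state, action, reward) triples. Toy game: cells 0..2; 'right'
    moves toward the goal (+100 past cell 2), 'left' falls off (-50),
    every ordinary step costs -1."""
    s, trace = 0, []
    while True:
        probs = table[s]
        a = random.choices(ACTIONS, weights=[probs[x] for x in ACTIONS])[0]
        if a == "left":
            trace.append((s, a, -50)); break      # episode over
        if s + 1 == 3:
            trace.append((s, a, 100)); break      # reached the goal
        trace.append((s, a, -1)); s += 1          # ordinary step
    return trace

def improve(table, trace, lr=0.1):
    """Update table[state_i] so that action_i becomes more likely when
    FutureReward_i is 'high' (here: positive), less likely when 'low'."""
    future = 0
    for s, a, r in reversed(trace):
        future += r                               # FutureReward_i
        table[s][a] += lr if future > 0 else -lr
        table[s][a] = max(table[s][a], 0.01)      # keep probabilities positive
        total = sum(table[s].values())
        for x in ACTIONS:                         # renormalize to sum to 1
            table[s][x] /= total

random.seed(0)
# Initialize the table with uniform action probabilities for each state.
table = {s: {"left": 0.5, "right": 0.5} for s in range(3)}
for _ in range(300):
    improve(table, play_episode(table))
```

After a few hundred episodes the table assigns "right" the higher probability in every cell: actions followed by high future reward were made more likely, the rest less likely.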
  32. Slide 36. The bad news: nice idea, but…

    ("Image" licensed under CC BY-SA.)

  33. Slide 37. The bad news: nice idea, but…

    Too many states, too many actions: too much memory needed, too much time.
  34. Slide 38. The solution

    Idea: replace the lookup table with a neural network that approximates the action probabilities contained in the table. Instead of Table[state] = action probabilities, do NeuralNet(state) ~ action probabilities. In the algorithm sketch, "play episode with policy given by table" changes to "play episode with policy given by NeuralNet", and the table update changes to "update weights of NeuralNet".

  35. Slide 39. Neural nets to the rescue

    Encode the state as a vector s and apply a neural network with "the right" weights: one output per action (out_up, out_down, out_left, out_right), with a softmax on top to obtain action probabilities.
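A sketch of "NeuralNet(state) ~ action probabilities" in its simplest form: a single weight matrix with a softmax on top. The layer size and the example state vector are invented:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: raw scores -> probabilities."""
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical single-layer "neural net": one output per action
# (up, down, left, right) with a softmax on top, i.e.
# NeuralNet(state) ~ action probabilities. Sizes are invented.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))           # 4 actions x 3 state features
state = np.array([1.0, 0.0, -1.0])    # the state encoded as a vector s

probs = softmax(W @ state)            # out_up, out_down, out_left, out_right
```

The softmax guarantees that the four outputs are positive and sum to 1, so they can be used directly as action probabilities, exactly like one row of the old lookup table.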
  36. Slide 40. Policy Gradient: algorithm sketch

    Initialize the neural net with random weights W.
    Repeat:
      Play episode(s) with the policy given by the weights W.
      Record (state_1, action_1, reward_1), (state_2, action_2, reward_2), … for the episode(s).
      For each step i, compute FutureReward_i = reward_i + reward_(i+1) + …
      Update the weights W: W = W + ???? (update rule still to be determined).

  37. Slide 41. Policy Gradient: algorithm sketch (the same sketch, repeated)

  38. Slide 42. Policy Gradient: algorithm sketch

    Annotation added: the weight update increases out_i, the output for the chosen action.

  39. Slide 43. Policy Gradient: algorithm sketch

    The missing update rule: W = W + alpha * FutureReward_i * Gradient_W(neuralNet_W(state_i, action_i)), with learning rate alpha; the gradient step increases out_i.
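The same toy chain game from before can be trained with this policy-gradient sketch. One caveat: the slide's shorthand writes the gradient of the network output, while the standard REINFORCE form implemented below follows the gradient of the log-probability of the chosen action, scaled by FutureReward_i and the learning rate alpha. The network size and hyperparameters are invented:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(s, n=3):
    """Encode state (cell 0..2) as a one-hot vector."""
    v = np.zeros(n); v[s] = 1.0
    return v

def play_episode(W, rng):
    """Play with the policy given by weights W. Toy game: cells 0..2,
    action 1 ('right') moves toward the goal (+100 past cell 2),
    action 0 ('left') falls off (-50), ordinary steps cost -1."""
    s, trace = 0, []
    while True:
        p = softmax(W @ one_hot(s))
        a = int(rng.choice(2, p=p))
        if a == 0:
            trace.append((s, a, -50)); break      # episode over
        if s + 1 == 3:
            trace.append((s, a, 100)); break      # reached the goal
        trace.append((s, a, -1)); s += 1          # ordinary step
    return trace

def update(W, trace, alpha=0.01):
    """W = W + alpha * FutureReward_i * grad_W log pi(action_i | state_i),
    the standard REINFORCE form of the slide's update rule."""
    future = 0.0
    for s, a, r in reversed(trace):
        future += r                               # FutureReward_i
        x = one_hot(s)
        p = softmax(W @ x)
        grad = -np.outer(p, x)                    # d log pi(a|s) / dW ...
        grad[a] += x                              # ... for a softmax policy
        W += alpha * future * grad
    return W

rng = np.random.default_rng(0)
W = np.zeros((2, 3))      # 2 actions x 3 one-hot states; zeros = uniform policy
for _ in range(500):
    W = update(W, play_episode(W, rng))
```

With one-hot states this is still a table in disguise, which is what makes it small enough to show here; the point of the neural network is that the same update works when states are too numerous to enumerate.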
  40. Slide 44. What for? The real world

    Problems with no feasible deterministic algorithm.

  41. Slide 45. What for?

    The pipeline so far: traditional heuristics and classic machine learning find an automatic solution in 93.4% of cases; reinforcement learning is added next.

  42. Slide 46. The challenges

    Manage the water level on the roof; control and steer the water flow; find the right dimensions; stay safe and reliable.
  43. Slide 52. Designing the action space (Snake game vs. roof drainage systems)

    What actions would a human expert like to have? Are these actions sufficient? Would more or other actions be helpful? Can we drop any actions?

  44. Slide 53. Designing the state space (Snake game vs. roof drainage systems)

    What does a human expert look at? Can you switch the experts between two steps? Full state vs. partial state. Designing features.
  45. Slide 54. Designing the reward function

    How would you rate the result of an expert? Keep it as simple as possible; give positive feedback during the game; beware of "surprising policies"; declare the game over if TotalReward drops too low.
    Snake game: Fruit +100, Death -50, Success +1000, Step -1.
    Roof drainage: Change Error Count +/- 1 per Error, Success +100, Step -0.01.
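The two reward tables on this slide, written as lookup functions. The event names and the reading of "Change Error Count +/- 1 per Error" (reward rises when the error count falls) are my interpretation of the slide:

```python
# Reward values exactly as shown on the slide.
SNAKE_REWARDS = {"fruit": 100, "death": -50, "success": 1000, "step": -1}
ROOF_REWARDS = {"success": 100, "step": -0.01}

def snake_reward(event):
    """Reward for one event in the Snake game."""
    return SNAKE_REWARDS[event]

def roof_reward(event, error_count_change=0):
    """Reward for one step of the roof drainage game: the base event
    reward, plus +/- 1 per error removed/added ('Change Error Count
    +/- 1 per Error' on the slide; interpretation, not verbatim)."""
    return ROOF_REWARDS.get(event, 0) - error_count_change
```

Note how both tables follow the slide's advice: a small step penalty keeps episodes short, while the big terminal rewards carry the actual objective.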
  46. Slide 55. Turning the problem into a game

    The same loop as before: the RL algorithm sends actions to the game engine and receives the game state and a step reward.

  47. Slide 57. Reinforcement learning in the hydraulics calculation pipeline

    Traditional heuristics and classic machine learning find an automatic solution in 93.4% of cases; reinforcement learning finds a solution in 70.7% of the remaining 6.6%, raising the overall automatic-solution rate to 98.1%.
  48. Slide 58. Summary

    Turning the problem into a game; continuous policy improvement; no training dataset; complements supervised learning.
  49. Slide 59. Thank you!

    [email protected], W +41 44 260 54 00, M +41 76 558 41 48, https://www.linkedin.com/in/christian-hidber/

    About Geberit: The globally operating Geberit Group is a European leader in the field of sanitary products. Geberit operates with a strong local presence in most European countries, providing unique added value when it comes to sanitary technology and bathroom ceramics. The production network encompasses 30 production facilities, of which 6 are located overseas. The Group is headquartered in Rapperswil-Jona, Switzerland. With around 12,000 employees in around 50 countries, Geberit generated net sales of CHF 2.9 billion in 2017. The Geberit shares are listed on the SIX Swiss Exchange and have been included in the SMI (Swiss Market Index) since 2012.
  50. Slide 60. Resources

    - Sutton & Barto: Reinforcement Learning: An Introduction, 2nd edition, 2018: https://drive.google.com/file/d/1opPSz5AZ_kVa1uWOdOiveNiBFiEOHjkG/view
    - http://rll.berkeley.edu/deeprlcourse/f17docs/lecture_4_policy_gradient.pdf
    - http://karpathy.github.io/2016/05/31/rl/
    - https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf
    - https://arxiv.org/pdf/1707.06347.pdf
    - https://openai.com/