Slide 1


Practical Experiences with SBB for Reinforcement Learning
Jéssica Pauli de C. Bonson, NIMS Lab
https://goo.gl/Whtp9n

Slide 2


Index
1. Quick SBB Overview
2. Case Studies
3. Practical Application of SBB to the Case Studies

Slide 3


SBB: General Idea
● teams of programs evolve and compete over time
● each team represents a player in the game
● at each generation they play many matches against various opponents
● the best ones reproduce and keep evolving; the others are discarded
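The generational loop above can be sketched as follows. This is a toy illustration, not the SBB implementation: `Team` stands in for a real SBB team, its "strategy" is a single number, and fitness is higher the closer it sits to an opponent's value.

```python
import random

class Team:
    """Toy stand-in for an SBB team: its 'strategy' is a single number."""
    def __init__(self, strategy):
        self.strategy = strategy
        self.fitness = 0.0

    def play(self, opponent):
        # score in (0, 1]: closer to the opponent's value scores higher
        return 1.0 / (1.0 + abs(self.strategy - opponent))

    def mutated(self):
        return Team(self.strategy + random.uniform(-1, 1))

def evolve(population, opponents, generations, keep_ratio=0.5):
    """Evaluate, keep the best, refill with mutated clones, repeat."""
    for _ in range(generations):
        for team in population:
            team.fitness = sum(team.play(o) for o in opponents) / len(opponents)
        population.sort(key=lambda t: t.fitness, reverse=True)
        survivors = population[: max(1, int(len(population) * keep_ratio))]
        children = [random.choice(survivors).mutated()
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children
    for team in population:
        team.fitness = sum(team.play(o) for o in opponents) / len(opponents)
    return sorted(population, key=lambda t: t.fitness, reverse=True)
```

Because survivors are always re-evaluated and kept, the best team's fitness never decreases across generations.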

Slide 4


Genetic Algorithm

Slide 5


SBB framework

Slide 6


How Do the Teams Work?
● each team is composed of 2 or more programs
● programs are composed of a set of instructions, registers, and an action
● before a team executes an action:
○ all of its programs run over the inputs for the current match state
○ the action from the program with the highest output is selected as the team’s action
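The bid-and-select step can be sketched like this; the `Program` class and its dot-product "instructions" are illustrative stand-ins, not the real SBB instruction set.

```python
class Program:
    """Hypothetical sketch: a program bids for its action on the current state."""
    def __init__(self, action, weights):
        self.action = action
        self.weights = weights              # stands in for the instruction list
        self.registers = [0.0] * len(weights)

    def bid(self, inputs):
        # 'run the instructions' over the inputs; here just a weighted product
        for i, (w, x) in enumerate(zip(self.weights, inputs)):
            self.registers[i] = w * x
        return sum(self.registers)

def team_action(programs, inputs):
    # every program runs on the same state; the highest output wins
    best = max(programs, key=lambda p: p.bid(inputs))
    return best.action
```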

Slide 7


Case Studies ● TicTacToe ● Poker

Slide 8


Example: TicTacToe
● inputs: 9
● actions: 9
● opponents: Random and Smart
● points: tic-tac-toe match states + opponents

Slide 9


Example: Poker
● inputs: 14 (hand strength, position, pot odds, opponent model...)
● actions: 3
● opponents:
○ dummy: AlwaysRaise, AlwaysCall, AlwaysFold, Random
○ static: LooseAggressive, LoosePassive, TightAggressive, TightPassive
○ dynamic: Bayesian
● points: poker match states + opponents

Slide 10


Code
● SBB for reinforcement learning:
○ https://github.com/jpbonson/SBBReinforcementLearner
● SBB for classification:
○ https://github.com/jpbonson/SBBClassifier
● Warning! The code is under development!

Slide 11


Practical Application of SBB to the Case Studies

Slide 12


Fitness Function
● For RL, it is usually:
○ Win: 1.0
○ Draw: 0.5
○ Lose: 0.0
● But depending on the game, other functions can give more information
○ e.g.: chips won/lost (poker)
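A minimal sketch of both options; the poker variant and its `max_pot` normalization are illustrative assumptions, not the deck's exact formula.

```python
def match_fitness(result):
    """Standard RL fitness: win = 1.0, draw = 0.5, loss = 0.0."""
    return {"win": 1.0, "draw": 0.5, "loss": 0.0}[result]

def poker_fitness(chips_delta, max_pot):
    """Richer signal for poker: map chips won/lost into [0, 1].

    chips_delta / max_pot is clamped to [-1, 1], then shifted so that
    breaking even scores 0.5, like a draw in the standard scheme.
    """
    return 0.5 + 0.5 * max(-1.0, min(1.0, chips_delta / max_pot))
```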

Slide 13


Inputs
● Very important: they define what the teams can see about the environment.
● Can be used to control for a weak or strong AI.
● Tweak: SBB deals better with inputs normalized between 0.0-10.0 instead of 0.0-1.0.
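The normalization tweak amounts to a scaled min-max mapping, e.g.:

```python
def normalize(value, lo, hi, scale=10.0):
    """Map a raw input from [lo, hi] into [0, scale].

    The slide suggests scale=10.0 works better for SBB than the usual 0-1.
    Out-of-range values are clamped first.
    """
    value = max(lo, min(hi, value))
    return (value - lo) / (hi - lo) * scale
```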

Slide 14


Opponents
● Opponents that are too easy:
○ teams learn to beat them quickly and stop evolving
● Opponents that are too strong:
○ teams aren’t able to learn to beat them, and SBB turns into a random walk
● It is important to strike a balance
● The Hall of Fame can be used to avoid evolutionary forgetting, but don’t overuse it.

Slide 15


Points
● How many?
○ the more, the better
○ but it is time consuming
● It is important to invest time in optimizing the matches
● Ensure all teams see the exact same points
● For some games the points can’t be exactly like the real-world task
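One simple way to guarantee every team sees the same points is to draw them with a fixed seed per generation. This is a hypothetical sketch of that idea, not the repository's sampling code:

```python
import random

def sample_points(all_states, opponents, n, seed):
    """Draw one shared set of (state, opponent) points for a generation.

    Using a fixed seed means every team in the generation is evaluated
    on exactly the same matches, so fitness values are comparable.
    """
    rng = random.Random(seed)           # deterministic for a given seed
    states = rng.sample(all_states, n)
    return [(s, rng.choice(opponents)) for s in states]
```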

Slide 16


Diversity
● Important to avoid overfitting, so teams can continue to evolve new behaviors
● Two methods:
○ Point Profiles
○ Pareto Dominance + Diversity metrics

Slide 17


Diversity: Point Profiles
● Profile: the results of a team run against a set of sample inputs representing states of the game
● New teams are mutated until their profile differs from their parent’s and/or all the other teams’
● The set of sample inputs can be built manually or during training
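The profile check can be sketched as follows; here a team is represented simply as a function from state to action, which is an illustrative simplification.

```python
def profile(team_action, sample_states):
    """A team's profile: the actions it takes on a fixed set of sample states."""
    return tuple(team_action(s) for s in sample_states)

def is_novel(candidate, others, sample_states):
    """True if the candidate's profile differs from every other team's profile.

    A mutated child that fails this check would be mutated again.
    """
    cand_profile = profile(candidate, sample_states)
    return all(profile(o, sample_states) != cand_profile for o in others)
```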

Slide 18


Diversity: Pareto + Diversity
● Pareto Dominance:
○ a method to choose teams so that the chosen ones are the best regarding both the fitness function and the diversity metric
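The standard dominance test over the two objectives (fitness, diversity) can be sketched as:

```python
def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one. Each point is a (fitness, diversity) tuple."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Teams whose (fitness, diversity) pair is dominated by no other team."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Selecting from the front keeps both high-fitness specialists and high-diversity outliers alive.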

Slide 19


Source: http://pt.slideshare.net/abedsayyad/sayyad-slides-raise13

Slide 20


Diversity: Diversity Metrics
● There are many diversity metrics available, both general and domain-specific
● Two types:
○ Genotype
○ Phenotype
■ distance measures (Euclidean, Hamming...)
■ entropy
■ NCD (normalized compression distance)
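Two of the phenotype metrics, computed over action profiles, can be sketched like this (the profile representation is an assumption carried over from the point-profile idea):

```python
import math
from collections import Counter

def hamming(profile_a, profile_b):
    """Phenotype distance: the fraction of sample states on which two
    teams choose different actions."""
    return sum(a != b for a, b in zip(profile_a, profile_b)) / len(profile_a)

def action_entropy(profile):
    """Shannon entropy of a team's action distribution, in bits.

    0.0 means the team always plays the same action; higher values
    mean more varied behavior.
    """
    counts = Counter(profile)
    total = len(profile)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```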

Slide 21


Second Layer
● Uses the teams trained in the first layer as actions for the teams in the second layer
● Goal: more complex behavior using specialized actions
○ e.g.: instead of call/raise/fold for poker, it could be passive/aggressive behavior
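The layering can be sketched as one level of indirection; teams are again simplified to plain functions for illustration.

```python
def second_layer_action(meta_team, base_teams, inputs):
    """The second-layer team's 'action' is which first-layer team to use;
    that specialized team then picks the concrete game action."""
    team_index = meta_team(inputs)           # meta action = index of a base team
    return base_teams[team_index](inputs)    # base team returns call/raise/fold etc.
```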

Slide 22


Mutation Rates
● Seven types of mutation
○ remove a program from a team, swap instructions in a program, etc.
● Both passive and aggressive mutation rates work
● It is an exploitation/exploration trade-off

Slide 23


Registers
● Varies according to the game
● For TicTacToe, using 5 registers instead of 2 improved the score by around 20%
● It is important to reset the registers between matches, but during a match they can be used as memory

Slide 24


Team Size + Program Size
● Trade-off between complexity and runtime
● Varies according to the game
○ team size: 2-9 for TicTacToe, 2-16 for poker, ~30 for soccer
○ program size: 2-20 for TicTacToe, 5-40 for poker
● Should be big enough to deal with the complexity of the game and leave space for introns

Slide 25


Operators
● Simple set: +, -, /, *
● Complex set: ln, exp, cos, sin
● If set: >=, <
● Trade-off between complexity and runtime
● Solution for overflow: roll back so the target register isn’t modified by the instruction
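The rollback idea can be sketched as a guarded instruction step: if the operator overflows or produces an invalid value, the target register simply keeps its old value. The instruction encoding here is illustrative.

```python
import math

def execute(op, target, a, b, registers):
    """Apply one instruction; on overflow or an invalid result, roll back
    so the target register is left unmodified."""
    try:
        result = op(a, b)
    except (OverflowError, ZeroDivisionError, ValueError):
        return registers                     # rollback: keep the old value
    if math.isnan(result) or math.isinf(result):
        return registers                     # rollback on non-finite results too
    registers[target] = result
    return registers
```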

Slide 26


Be Able to Reuse Teams
● It is important so you are able to:
○ run them against various test cases after training
○ integrate a trained team as an AI in a system
○ use them as actions in a second layer
● Solution: save teams as .json files
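A round-trip through `json` is all that is needed; the field names in the example team are illustrative, not the repository's actual schema.

```python
import json

def save_team(team_dict, path):
    """Serialize a trained team so it can be reloaded later for testing,
    deployment, or as a second-layer action."""
    with open(path, "w") as f:
        json.dump(team_dict, f, indent=2)

def load_team(path):
    with open(path) as f:
        return json.load(f)
```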

Slide 27


Sample .json of trained poker team

Slide 28


Sample of a program with and without introns

Slide 29


Metrics
● Most runs will take a long time to finish, so it is important to think about which metrics are necessary and code them beforehand
● Automated metrics save a lot of time
● Be careful with bugs

Slide 30


Tests + Inheritance
● pSBB for Reinforcement Learning has around 300k lines of code and around 40 classes
● Automated tests and inheritance are essential.

Slide 31


Questions?