Slide 1

Slide 1 text

Practical Experiences with SBB for Reinforcement Learning Jéssica Pauli de C. Bonson, NIMS Lab

Slide 2

Slide 2 text

Index 1. Quick SBB Overview 2. Case Studies 3. Practical Application of SBB to the Cases

Slide 3

Slide 3 text

SBB: General Idea ● teams of programs evolve and compete over time ● each team represents a player in the game ● at each generation they play many matches against various opponents ● the best ones reproduce and keep evolving, the others are discarded
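The loop described on this slide can be sketched roughly as follows. This is a toy illustration, not the SBB implementation: the names `evolve`, `fitness` and `mutate` are hypothetical, a "team" is reduced to a single number, and the fitness function stands in for playing many matches against various opponents.

```python
import random

def fitness(team: float) -> float:
    """Stand-in for playing matches: teams closer to 3.0 score higher (toy assumption)."""
    return -abs(team - 3.0)

def mutate(team: float, rng: random.Random) -> float:
    """Toy mutation: nudge the team by a small random amount."""
    return team + rng.uniform(-0.5, 0.5)

def evolve(population, evaluate, mutate_fn, generations, rng, keep_fraction=0.5):
    """Each generation: rank teams, keep the best, refill with mutated children."""
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        survivors = ranked[:n_keep]  # the others are discarded
        children = [mutate_fn(rng.choice(survivors), rng)
                    for _ in range(len(population) - n_keep)]
        population = survivors + children
    return population

rng = random.Random(0)
initial = [rng.uniform(-10.0, 10.0) for _ in range(20)]
final = evolve(initial, fitness, mutate, generations=30, rng=rng)
print(max(map(fitness, initial)), max(map(fitness, final)))
```

Because the survivors are carried over unchanged, the best fitness in this sketch never decreases from one generation to the next.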

Slide 4

Slide 4 text

Genetic Algorithm

Slide 5

Slide 5 text

SBB framework

Slide 6

Slide 6 text

How do the teams work? ● each team is composed of 2 or more programs ● programs are composed of a set of instructions, registers and an action ● before a team executes an action: ○ all of its programs run over the inputs for the current match state ○ the action from the program with the highest output is selected as the team’s action
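A minimal sketch of the action selection described on this slide. The `Program` class here is a hypothetical stand-in: a real SBB program is a sequence of register instructions, while this one just wraps a function that returns the program's output for the current inputs.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Program:
    # Hypothetical simplification: `bid` replaces the instruction list and registers.
    bid: Callable[[Sequence[float]], float]
    action: int

def team_action(programs: List[Program], inputs: Sequence[float]) -> int:
    """Run every program on the inputs and return the action of the highest output."""
    if len(programs) < 2:
        raise ValueError("a team needs at least 2 programs")
    winner = max(programs, key=lambda p: p.bid(inputs))
    return winner.action

team = [
    Program(bid=lambda x: x[0] + x[1], action=0),
    Program(bid=lambda x: x[0] * x[1], action=1),
]
print(team_action(team, [2.0, 3.0]))  # outputs are 5.0 and 6.0, so action 1 is chosen
```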

Slide 7

Slide 7 text

Case Studies ● TicTacToe ● Poker

Slide 8

Slide 8 text

Example: TicTacToe ● inputs: 9 ● actions: 9 ● opponents: Random and Smart ● points: TicTacToe match states + opponents

Slide 9

Slide 9 text

Example: Poker ● inputs: 14 (hand strength, position, pot odds, opponent model...) ● actions: 3 ● opponents: ○ dummy: AlwaysRaise, AlwaysCall, AlwaysFold, Random ○ static: LooseAggressive, LoosePassive, TightAggressive, TightPassive ○ dynamic: Bayesian ● points: poker match states + opponents

Slide 10

Slide 10 text

Code ● SBB for reinforcement learning: ○ arner ● SBB for classification: ○ ● Warning! The code is under development!

Slide 11

Slide 11 text

Practical Application of SBB to the Case Studies

Slide 12

Slide 12 text

Fitness Function ● For RL, it is usually: ○ Win: 1.0 ○ Draw: 0.5 ○ Loss: 0.0 ● But depending on the game, other functions can give more information ○ e.g., chips won/lost (poker)
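The two kinds of fitness function could look like the sketch below. The win/draw/loss values come from the slide; the chip-based version is an assumption about how chips won or lost might be mapped onto the same 0.0-1.0 range, and both function names are hypothetical.

```python
def outcome_fitness(result: str) -> float:
    """Usual RL fitness: only the outcome of the match counts."""
    return {"win": 1.0, "draw": 0.5, "loss": 0.0}[result]

def chips_fitness(chips_delta: float, max_chips: float) -> float:
    """Assumed mapping: chips in [-max_chips, +max_chips] scaled onto [0.0, 1.0]."""
    if max_chips <= 0:
        raise ValueError("max_chips must be positive")
    clipped = max(-max_chips, min(max_chips, chips_delta))
    return (clipped + max_chips) / (2.0 * max_chips)

print(outcome_fitness("draw"))      # 0.5
print(chips_fitness(50.0, 100.0))   # 0.75: won half of the maximum
```

The chip-based score distinguishes a narrow win from a large one, which the outcome-only score cannot do.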

Slide 13

Slide 13 text

Inputs ● Very important: they define what the teams can see of the environment. ● Can be used to control how weak or strong the AI is. ● Tweak: SBB deals better with inputs normalized to 0.0-10.0 instead of 0.0-1.0.
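One simple way to apply the 0.0-10.0 tweak is min-max scaling, sketched below. The function name and the clipping of out-of-range values are assumptions, not something the slide specifies.

```python
def normalize_input(value: float, low: float, high: float, scale: float = 10.0) -> float:
    """Min-max scale `value` from [low, high] onto [0.0, scale], clipping outliers."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    return scale * (clipped - low) / (high - low)

print(normalize_input(0.5, 0.0, 1.0))    # 5.0: a 0.0-1.0 hand strength rescaled
print(normalize_input(26.0, 0.0, 52.0))  # 5.0
```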

Slide 14

Slide 14 text

Opponents ● Opponents that are too easy: ○ Teams learn to beat them quickly and stop evolving ● Opponents that are too strong: ○ Teams aren’t able to learn to beat them, and SBB turns into a random walk ● It is important to balance the two ● A Hall of Fame can be used to avoid evolutionary forgetting, but don’t overuse it.

Slide 15

Slide 15 text

Points ● How many? ○ The more, the better ○ But more points are time-consuming ● It is important to invest time in optimizing the matches ● Ensure all teams see the exact same points ● For some games the points can’t be exactly like the real-world task

Slide 16

Slide 16 text

Diversity ● Important to avoid overfitting, so teams can continue to evolve new behaviors ● Two methods: ○ Point Profiles ○ Pareto Dominance + Diversity metrics

Slide 17

Slide 17 text

Diversity: Point Profiles ● Profile: the results of running a team against a set of sample inputs that represent states of the game ● The new teams are mutated until their profile differs from that of their parent and/or of all the other teams ● The set of sample inputs can be made manually or during the training
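The mutate-until-different idea can be sketched as follows. Everything here is a toy assumption: a team is reduced to a list of weights, `act` is a made-up policy, and `distinct_child` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, Tuple

Team = List[float]  # hypothetical toy genome: one weight per input

def act(team: Team, inputs: Sequence[float]) -> int:
    """Toy policy: the action is the index of the largest weighted input."""
    scores = [w * x for w, x in zip(team, inputs)]
    return scores.index(max(scores))

def profile(team: Team, samples: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """The profile is the tuple of actions the team takes on the sample inputs."""
    return tuple(act(team, s) for s in samples)

def mutate(team: Team, rng: random.Random) -> Team:
    child = list(team)
    child[rng.randrange(len(child))] += rng.uniform(-1.0, 1.0)
    return child

def distinct_child(parent, others, samples, rng, max_tries=1000):
    """Keep mutating until the child's profile differs from the parent's and the others'."""
    taken = {profile(parent, samples)} | {profile(t, samples) for t in others}
    child = mutate(parent, rng)
    for _ in range(max_tries):
        if profile(child, samples) not in taken:
            return child
        child = mutate(child, rng)
    raise RuntimeError("no distinct profile found within max_tries")

samples = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 3.0, 1.0]]
parent = [1.0, 1.0, 1.0]
others = [[3.0, 1.0, 1.0]]
child = distinct_child(parent, others, samples, random.Random(42))
print(profile(parent, samples), profile(child, samples))
```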

Slide 18

Slide 18 text

Diversity: Pareto + Diversity ● Pareto Dominance: ○ A method to choose teams so that the chosen ones are the best regarding both the fitness function and the diversity metric
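A minimal sketch of Pareto dominance over two objectives, assuming both the fitness and the diversity score are to be maximized. The function names and the example scores are made up for illustration.

```python
from typing import List, Sequence, Tuple

Score = Tuple[float, float]  # (fitness, diversity), both maximized (assumption)

def dominates(a: Score, b: Score) -> bool:
    """a dominates b if it is at least as good on both objectives and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(scores: Sequence[Score]) -> List[int]:
    """Return the indices of the teams that no other team dominates."""
    return [
        i for i, s in enumerate(scores)
        if not any(dominates(o, s) for j, o in enumerate(scores) if j != i)
    ]

scores = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9), (0.9, 0.05)]
print(pareto_front(scores))  # [0, 1, 3]
```

Team 2 is dominated by team 1 and team 4 by team 0, so only teams 0, 1 and 3 would be chosen in this example.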

Slide 19

Slide 19 text


Slide 20

Slide 20 text

Diversity: Diversity Metrics ● There are many diversity metrics available, both general and domain-specific ● Two types: ○ Genotype ○ Phenotype ■ distance measures (Euclidean, Hamming...) ■ entropy ■ NCD (normalized compression distance)
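The phenotype metrics listed here could be computed over the actions a team takes, as in the sketch below. The function names are hypothetical, and using `zlib` as the compressor for NCD is an assumption; any real compressor would do.

```python
import math
import zlib
from collections import Counter
from typing import Sequence

def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two equal-length action sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def entropy(actions: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the distribution of actions taken."""
    n = len(actions)
    return -sum((c / n) * math.log2(c / n) for c in Counter(actions).values())

def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance, approximated with zlib."""
    ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
    cab = len(zlib.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

print(hamming([0, 1, 2, 2], [0, 2, 2, 1]))   # 2
print(euclidean([0.0, 0.0], [3.0, 4.0]))     # 5.0
print(entropy([0, 0, 1, 1]))                 # 1.0
repetitive = b"call raise fold " * 50
varied = bytes(range(256)) * 3
print(ncd(repetitive, repetitive) < ncd(repetitive, varied))
```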

Slide 21

Slide 21 text

Second Layer ● Uses the teams trained in the first layer as actions for the teams in the second layer ● Goal: More complex behavior using specialized actions ○ E.g., instead of call/raise/fold for poker, the actions could be passive/aggressive behaviors
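A rough sketch of the layering idea: a second-layer team whose actions are first-layer teams. The `make_team` helper, the two-element input vector and the hand-written bid functions are all hypothetical; in SBB both layers would be evolved, not written by hand.

```python
from typing import Callable, List, Sequence, Tuple, Union

FOLD, CALL, RAISE = 0, 1, 2
Action = Union[int, Callable[[Sequence[float]], int]]
ProgramSpec = Tuple[Callable[[Sequence[float]], float], Action]

def make_team(programs: List[ProgramSpec]) -> Callable[[Sequence[float]], int]:
    """Build a team; an action may be a plain action or another (first-layer) team."""
    def act(inputs: Sequence[float]) -> int:
        _, action = max(programs, key=lambda p: p[0](inputs))
        # A callable action is a first-layer team: delegate the decision to it.
        return action(inputs) if callable(action) else action
    return act

# Assumed inputs: x[0] = hand strength, x[1] = some aggression cue, both in 0.0-1.0.
passive = make_team([(lambda x: 1.0 - x[0], FOLD), (lambda x: x[0], CALL)])
aggressive = make_team([(lambda x: x[0], RAISE), (lambda x: 0.3, CALL)])
second_layer = make_team([(lambda x: x[1], aggressive), (lambda x: 1.0 - x[1], passive)])

print(second_layer([0.9, 0.8]))  # 2: picks the aggressive team, which raises
print(second_layer([0.2, 0.1]))  # 0: picks the passive team, which folds
```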

Slide 22

Slide 22 text

Mutation Rates ● Seven types of mutation rate ○ Remove program from team, swap instructions in programs, etc… ● Both passive and aggressive mutations work ● It is an exploitation/exploration trade-off

Slide 23

Slide 23 text

Registers ● The number of registers varies according to the game ● For TicTacToe, using 5 instead of 2 improved the score by around 20%. ● It is important to reset the registers between matches, but during a match they can be used as memory
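The reset rule can be illustrated with the toy program below. The class, its single hard-coded instruction and `play_match` are hypothetical; the point is only that register values persist across the steps of one match and are cleared before the next.

```python
from typing import List, Sequence

class RegisterProgram:
    def __init__(self, n_registers: int = 5):
        self.registers = [0.0] * n_registers

    def reset(self) -> None:
        """Clear all registers; called once before each match."""
        self.registers = [0.0] * len(self.registers)

    def step(self, inputs: Sequence[float]) -> float:
        # Toy instruction: r[0] = r[0] + inputs[0], so r[0] remembers earlier states.
        self.registers[0] += inputs[0]
        return self.registers[0]

def play_match(program: RegisterProgram, states: Sequence[Sequence[float]]) -> List[float]:
    program.reset()  # no memory leaks from the previous match
    return [program.step(s) for s in states]

program = RegisterProgram()
first = play_match(program, [[1.0], [2.0]])
second = play_match(program, [[1.0], [2.0]])
print(first, second)  # [1.0, 3.0] [1.0, 3.0]
```

Without the `reset()` call, the second match would start from the registers left over by the first and return different outputs for the same states.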

Slide 24

Slide 24 text

Team Size + Program Size ● Trade-off between complexity and runtime ● Varies according to the game ○ Team size: 2-9 for TTT, 2-16 for poker, around 30 for soccer ○ Program size: 2-20 for TTT, 5-40 for poker ● Should be big enough to deal with the complexity of the game and have space for introns

Slide 25

Slide 25 text

Operators ● Simple set: +, -, /, * ● Complex set: ln, exp, cos, sin ● Ifs set: >=, < ● Trade-off between complexity and runtime ● Solution for overflow: Rollback so the target register isn’t modified by the instruction
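The rollback rule could be implemented along these lines. This is a sketch under assumptions: the `execute` function and the operator table are hypothetical, the unary operators are assumed to overwrite the target with a function of the source, and division by zero and domain errors are treated the same way as overflow.

```python
import math
from typing import Callable, Dict, List

# Each operator takes (target value, source value) and returns the new target value.
OPS: Dict[str, Callable[[float, float], float]] = {
    "+": lambda t, s: t + s,
    "-": lambda t, s: t - s,
    "*": lambda t, s: t * s,
    "/": lambda t, s: t / s,
    "ln": lambda t, s: math.log(s),
    "exp": lambda t, s: math.exp(s),
    "cos": lambda t, s: math.cos(s),
    "sin": lambda t, s: math.sin(s),
}

def execute(op: str, registers: List[float], target: int, source: float) -> bool:
    """Apply `op`; on overflow or an invalid result, leave the target register untouched."""
    try:
        result = OPS[op](registers[target], source)
    except (OverflowError, ZeroDivisionError, ValueError):
        return False  # rollback: nothing was written
    if not math.isfinite(result):
        return False  # inf or nan also counts as a failed instruction
    registers[target] = result
    return True

regs = [1.0, 2.0]
print(execute("exp", regs, 0, 1000.0), regs)  # False [1.0, 2.0]
print(execute("+", regs, 0, 2.5), regs)       # True [3.5, 2.0]
```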

Slide 26

Slide 26 text

Be able to Reuse Teams ● Reuse is important so that you can: ○ run the teams against various test cases after training ○ integrate a trained team as an AI in a system ○ use them as actions in a second layer ● Solution: Save teams as .json files
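Saving and reloading a team as JSON might look like this. The schema (`team_id`, `fitness`, `programs`, `instructions`) and every value in it are made up for illustration and are not the format used by the actual SBB code.

```python
import json
import os
import tempfile

def save_team(team: dict, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(team, f, indent=2, sort_keys=True)

def load_team(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical schema with placeholder values; lists are used so the round trip is exact.
example_team = {
    "team_id": "example-1",
    "fitness": 0.5,
    "programs": [
        {"action": 0, "instructions": [[1, "+", ["i", 0]], [0, "-", ["r", 1]]]},
        {"action": 2, "instructions": [[0, "*", ["i", 3]]]},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "team.json")
    save_team(example_team, path)
    restored = load_team(path)

print(restored == example_team)  # True
```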

Slide 27

Slide 27 text

Sample .json of trained poker team

Slide 28

Slide 28 text

Sample of a program with and without introns
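For readers without the image, the sketch below shows one common way to strip structural introns from a linear program: walk the instructions backwards and keep only those that can affect the output register. The instruction format `(target, op, (kind, index))`, meaning `r[target] = r[target] op source`, and the choice of register 0 as the output are assumptions, not the format used by the SBB code.

```python
from typing import List, Tuple

Instruction = Tuple[int, str, Tuple[str, int]]  # (target, op, ("r" or "i", index))

def remove_introns(program: List[Instruction], output_register: int = 0) -> List[Instruction]:
    """Drop instructions whose target register never influences the output register."""
    effective = {output_register}
    kept = []
    for target, op, (kind, index) in reversed(program):
        if target in effective:
            kept.append((target, op, (kind, index)))
            if kind == "r":
                effective.add(index)  # the source register now matters too
    kept.reverse()
    return kept

program = [
    (1, "+", ("i", 0)),  # r1 = r1 + in0
    (2, "*", ("i", 1)),  # r2 = r2 * in1   (intron: r2 never reaches r0)
    (0, "-", ("r", 1)),  # r0 = r0 - r1
    (3, "+", ("r", 0)),  # r3 = r3 + r0   (intron: r3 is never read)
]
print(remove_introns(program))
# [(1, '+', ('i', 0)), (0, '-', ('r', 1))]
```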

Slide 29

Slide 29 text

Metrics ● Most runs take a long time to finish, so it is important to decide which metrics are necessary and code them beforehand ● Automated metrics save a lot of time ● Be careful with bugs

Slide 30

Slide 30 text

Tests + Inheritance ● pSBB for Reinforcement Learning has around 300k lines of code and around 40 classes ● Automated tests and inheritance are essential.

Slide 31

Slide 31 text
