[Dalhousie University, 2016] Symbiotic Bid-based (SBB) GP

Jessica Pauli de C Bonson

March 15, 2016

Transcript

  1. SBB: General Idea
     • Teams of programs evolve and compete over time
     • Each team represents a player in the game
     • At each generation they play many matches against various opponents
     • The best teams reproduce and keep evolving; the others are discarded
  2. How do the teams work?
     • Each team is composed of 2 or more programs
     • Programs are composed of a set of instructions, registers, and an action
     • Before a team executes an action:
       ◦ all of its programs run over the inputs for the current match state
       ◦ the action from the program with the highest output is selected as the team's action (see the sketch below)
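A minimal sketch of this bidding step in Python, assuming simplified Program and Team classes invented here for illustration (the real pSBB classes differ):

    class Program:
        def __init__(self, action, instructions):
            self.action = action              # the action this program bids for
            self.instructions = instructions  # stand-in for a real instruction list

        def run(self, inputs):
            # Stand-in computation: a real program executes its instructions
            # over the inputs and registers and returns a numeric bid.
            return sum(w * x for w, x in zip(self.instructions, inputs))

    class Team:
        def __init__(self, programs):
            self.programs = programs  # 2 or more programs

        def act(self, inputs):
            # Every program bids on the current match state; the action of
            # the highest bidder becomes the team's action.
            best = max(self.programs, key=lambda p: p.run(inputs))
            return best.action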
  3. Example: TicTacToe
     • Inputs: 9
     • Actions: 9
     • Opponents: Random and Smart
     • Points: TicTacToe match states + opponents
  4. Example: Poker
     • Inputs: 14 (hand strength, position, pot odds, opponent model...)
     • Actions: 3
     • Opponents:
       ◦ dummy: AlwaysRaise, AlwaysCall, AlwaysFold, Random
       ◦ static: LooseAggressive, LoosePassive, TightAggressive, TightPassive
       ◦ dynamic: Bayesian
     • Points: poker match states + opponents
  5. Code
     • SBB for reinforcement learning: https://github.com/jpbonson/SBBReinforcementLearner
     • SBB for classification: https://github.com/jpbonson/SBBClassifier
     • Warning! The code is under development!
  6. Fitness Function
     • For RL it is usually:
       ◦ Win: 1.0
       ◦ Draw: 0.5
       ◦ Lose: 0.0
     • Depending on the game, other functions can give more information
       ◦ e.g. chips won/lost in poker (see the sketch below)
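As an illustration of the scoring above, a hypothetical helper that maps one match outcome to a fitness value; for poker-like games a normalized chips-won signal could replace the flat win/draw/lose scores:

    def match_fitness(result, chips_delta=None, max_chips=None):
        # Default scoring: win = 1.0, draw = 0.5, lose = 0.0.
        # If a chip count is available (poker), map chips won/lost into
        # [0.0, 1.0] instead, since it carries more information.
        if chips_delta is not None and max_chips:
            normalized = max(-1.0, min(1.0, chips_delta / max_chips))
            return 0.5 + 0.5 * normalized
        return {"win": 1.0, "draw": 0.5, "lose": 0.0}[result]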
  7. Inputs
     • Very important: they define what the teams can see about the environment
     • Can be tuned to make the AI weaker or stronger
     • Tweak: SBB deals better with inputs normalized to 0.0-10.0 instead of 0.0-1.0 (see the sketch below)
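A small sketch of that normalization tweak; the per-input bounds are assumptions that would come from the game, not from pSBB itself:

    def normalize(value, low, high, scale=10.0):
        # Map a raw observation from [low, high] into [0.0, scale];
        # SBB tends to behave better with 0.0-10.0 than with 0.0-1.0.
        value = min(max(value, low), high)  # clamp out-of-range observations
        return scale * (value - low) / (high - low)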
  8. Opponents
     • Opponents that are too easy:
       ◦ teams learn to beat them quickly and stop evolving
     • Opponents that are too strong:
       ◦ teams aren't able to learn to beat them, and SBB turns into a random walk
     • It is important to balance opponent difficulty
     • A Hall of Fame can be used to avoid evolutionary forgetting, but don't overuse it
  9. Points
     • How many?
       ◦ the more, the better
       ◦ but it is time-consuming
     • It is important to invest time in optimizing the matches
     • Ensure all teams see exactly the same points
     • For some games the points cannot reproduce the real-world task exactly
  10. Diversity
      • Important to avoid overfitting, so teams can continue to evolve new behaviors
      • Two methods:
        ◦ Point Profiles
        ◦ Pareto Dominance + diversity metrics
  11. Diversity: Point Profiles
      • Profile: the results of running a team against a set of sample inputs representing states of the game
      • New teams are mutated until their profile differs from their parent's and/or from all other teams'
      • The set of sample inputs can be defined manually or built during training (see the sketch below)
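A sketch of the profile idea, checking only against the parent for brevity and reusing the hypothetical Team class from the first sketch plus a caller-supplied mutate function:

    def profile(team, sample_states):
        # Behaviour signature: the action the team chooses on each sample state.
        return tuple(team.act(state) for state in sample_states)

    def mutate_until_novel(parent, sample_states, mutate, max_tries=50):
        # Keep mutating a copy of the parent until its profile differs;
        # comparing against all other teams works the same way.
        parent_profile = profile(parent, sample_states)
        child = mutate(parent)
        for _ in range(max_tries):
            if profile(child, sample_states) != parent_profile:
                break
            child = mutate(child)
        return child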
  12. Diversity: Pareto + Diversity
      • Pareto Dominance:
        ◦ a method for choosing teams so that the chosen ones are the best with respect to both the fitness function and the diversity metric (see the sketch below)
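A minimal dominance check over the two objectives named above, assuming each team object carries fitness and diversity scores as attributes (an assumption for illustration):

    def dominates(a, b):
        # Team a Pareto-dominates team b if it is at least as good on both
        # objectives (fitness and diversity) and strictly better on at least one.
        pairs = [(a.fitness, b.fitness), (a.diversity, b.diversity)]
        return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)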
  13. Diversity: Diversity Metrics
      • There are many diversity metrics available, both general and domain-specific
      • Two types:
        ◦ Genotype
        ◦ Phenotype
          ▪ distance measures (Euclidean, Hamming...)
          ▪ entropy
          ▪ NCD (normalized compression distance), sketched below
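As one concrete phenotype metric, a sketch of NCD between two behaviour traces, assuming the traces have been encoded as byte strings (e.g. action sequences):

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized compression distance: near 0.0 for identical behaviours,
        # values near 1.0 for unrelated ones.
        cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)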
  14. Second Layer
      • Uses the teams trained in the first layer as actions for the teams in the second layer
      • Goal: more complex behavior using specialized actions
        ◦ e.g. instead of call/raise/fold for poker, the actions could be passive/aggressive behaviors (see the sketch below)
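A sketch of how a second layer can reuse first-layer teams, building on the hypothetical Team/Program classes from the first sketch; here a second-layer program's action is itself a trained first-layer team:

    class MetaTeam(Team):
        def act(self, inputs):
            # The highest bidder selects a trained first-layer team...
            specialist = super().act(inputs)
            # ...and that specialist then picks the concrete game action.
            return specialist.act(inputs)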
  15. Mutation Rates
      • Seven types of mutation rate
        ◦ remove program from team, swap instructions in programs, etc.
      • Both passive and aggressive mutation rates work
      • It is an exploitation/exploration trade-off
  16. Registers
      • The number of registers varies according to the game
      • For TicTacToe, using 5 instead of 2 improved the score by around 20%
      • It is important to reset the registers between matches, but within a match they can be used as memory
  17. Team Size + Program Size
      • Trade-off between complexity and runtime
      • Varies according to the game
        ◦ team size: 2-9 for TicTacToe, 2-16 for poker, around 30 for soccer
        ◦ program size: 2-20 for TicTacToe, 5-40 for poker
      • Should be big enough to handle the complexity of the game and leave room for introns
  18. Operators
      • Simple set: +, -, /, *
      • Complex set: ln, exp, cos, sin
      • Ifs set: >=, <
      • Trade-off between complexity and runtime
      • Solution for overflow: roll back so the target register isn't modified by the instruction (see the sketch below)
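A sketch of the overflow rollback, using an illustrative subset of the operator sets listed above:

    import math

    def execute(op, target, source, registers):
        # Apply one instruction to the register file; if it overflows or
        # produces an invalid value, roll back so the target register keeps
        # its previous value.
        old = registers[target]
        try:
            if op == "+":
                registers[target] = registers[target] + registers[source]
            elif op == "*":
                registers[target] = registers[target] * registers[source]
            elif op == "/":
                registers[target] = registers[target] / registers[source]
            elif op == "ln":
                registers[target] = math.log(registers[source])
            elif op == "exp":
                registers[target] = math.exp(registers[source])
            if math.isinf(registers[target]) or math.isnan(registers[target]):
                raise OverflowError(op)
        except (OverflowError, ZeroDivisionError, ValueError):
            registers[target] = old  # rollback: target register stays unmodified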
  19. Be Able to Reuse Teams
      • It is important so you can:
        ◦ run teams against various test cases after training
        ◦ integrate a trained team as an AI in a system
        ◦ use teams as actions in a second layer
      • Solution: save teams as .json files (see the sketch below)
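A sketch of JSON persistence for the hypothetical Team/Program classes from the first sketch; the field names are illustrative, not pSBB's actual schema:

    import json

    def save_team(team, path):
        data = {"programs": [{"action": p.action, "instructions": p.instructions}
                             for p in team.programs]}
        with open(path, "w") as f:
            json.dump(data, f, indent=2)

    def load_team(path):
        # Rebuild the team so it can be tested, embedded in another system,
        # or used as an action in a second layer.
        with open(path) as f:
            data = json.load(f)
        return Team([Program(p["action"], p["instructions"])
                     for p in data["programs"]])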
  20. Metrics
      • Runs usually take a long time to finish, so think about which metrics are needed and code them beforehand
      • Automated metrics save a lot of time
      • Be careful with bugs
  21. Tests + Inheritance
      • pSBB for reinforcement learning has around 300k lines of code and around 40 classes
      • Automated tests and inheritance are essential