DeepStack: Expert-Level Artificial Intelligence in Heads-Up No-Limit Poker

PWL SF Talk

Tom Santero

March 30, 2017
Transcript

  1. Texas Hold’em: "truly a 'sawdust joint,' with…oiled sawdust covering the floors." Originated in Texas; one of many variations of the game of poker; the most widely played card game in the world.
  2. Texas Hold’em: originated in Texas; one of many variations of the game of poker; the most widely played card game in the world.
  3. [Diagram: ten seats numbered 1–10 around the table.] Gameplay moves clockwise.
  4. “The honor’s in the dollar, kid.” —Seth Davis, Boiler Room (2000)
     Keeping Score in Hold’em: Chip Stack ÷ Cost of Big Blind = # BBs; 1 mBB = 1/1000 BB.
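A quick arithmetic sketch of this scoring convention (the dollar figures below are invented for illustration): stacks are normalized by the cost of the big blind, and poker-AI results are conventionally reported in milli-big-blinds.

    # Keeping score in big blinds (BB): normalizing by the big blind makes
    # results comparable across stake levels. 1 mBB = 1/1000 of a big blind.

    def stack_in_bbs(chip_stack: float, big_blind: float) -> float:
        """Chip stack divided by the cost of the big blind gives # BBs."""
        return chip_stack / big_blind

    def winnings_in_mbb(chips_won: float, big_blind: float) -> float:
        """Winnings expressed in milli-big-blinds."""
        return 1000 * chips_won / big_blind

    # Example (hypothetical numbers): a $500 stack at a $5 big blind is
    # 100 BB; winning $2 at that stake is 400 mBB.
    print(stack_in_bbs(500, 5))    # 100.0
    print(winnings_in_mbb(2, 5))   # 400.0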
  5. tsantero’s hand: 6♠ 5♠. cowboy’s hand: face down. Preflop: I raise to $20, cowboy raises to $45, I call, all other players fold. Odds of flopping: trips 29 to 1, straight 75 to 1, flush 118 to 1.
  6. tsantero’s hand: 6♠ 5♠. Flop: 3♣ 2♥ 4♦ (a flopped straight). Pot size: $90. I bet $65, cowboy calls.
  7. tsantero’s hand: 6♠ 5♠. Board: 3♣ 2♥ 4♦ J. Pot size: $210. I shove all in ($420), cowboy calls.
  8. tsantero’s hand: 6♠ 5♠. cowboy’s hand: 3♠ 3♦ (a set of threes). Board: 3♣ 2♥ 4♦ J. Cowboy’s odds of improving: 4/38 ≈ 10%.
  9. tsantero’s hand: 6♠ 5♠. cowboy’s hand: 3♠ 3♦. Board: 3♣ 2♥ 4♦ J J. The river J fills cowboy’s full house, beating the straight.
  10. Imperfect information games require more complex reasoning than similarly sized perfect information games. The correct decision at a particular moment depends upon the probability distribution over private information that the opponent holds, which is revealed through their past actions.

  11. Heads-Up No-Limit Texas Hold’em: total number of decision points: 10^160 (vs. 10^170 in Go). Past approaches include pre-solving the entire game or abstraction via Counterfactual Regret Minimization (CFR). DeepStack attempts to solve for a Nash Equilibrium.
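CFR is driven by regret matching: each player repeatedly plays actions in proportion to how much they regret not having played them, and the time-averaged strategies converge to a Nash Equilibrium in two-player zero-sum games. A minimal sketch on rock-paper-scissors (an illustrative toy game, not from the talk):

    import random

    ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors
    PAYOFF = [[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]]  # PAYOFF[my_action][opponent_action], zero-sum

    def strategy_from_regrets(regrets):
        # Play each action in proportion to its positive cumulative regret;
        # fall back to uniform when no regret is positive.
        positive = [max(r, 0.0) for r in regrets]
        total = sum(positive)
        return [p / total for p in positive] if total > 0 else [1 / ACTIONS] * ACTIONS

    regrets = [[0.0] * ACTIONS for _ in range(2)]
    strategy_sums = [[0.0] * ACTIONS for _ in range(2)]

    for _ in range(100_000):
        strategies = [strategy_from_regrets(r) for r in regrets]
        actions = [random.choices(range(ACTIONS), weights=s)[0] for s in strategies]
        for p in range(2):
            opponent, played = actions[1 - p], actions[p]
            for a in range(ACTIONS):
                # Regret: how much better action a would have done than the
                # action we actually played against this opponent action.
                regrets[p][a] += PAYOFF[a][opponent] - PAYOFF[played][opponent]
                strategy_sums[p][a] += strategies[p][a]

    # The time-averaged strategy approaches the mixed NE (1/3, 1/3, 1/3).
    total = sum(strategy_sums[0])
    print([round(s / total, 3) for s in strategy_sums[0]])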
  12. Nash Equilibrium. Let (S, f) represent a game with n players, where Si is the strategy set for player i, S = S1 × S2 × S3 … × Sn is the set of strategy profiles, and f(x) = (f1(x), f2(x), … fn(x)) is the payoff function evaluated at x ∈ S.
  13. Nash Equilibrium contd. Let xi be the strategy of player i and x-i the strategies of all other players. If player i ∈ {1, … n} chooses strategy xi, their payoff is fi(x). A profile x* ∈ S is a Nash Equilibrium iff ∀i, xi ∈ Si : fi(xi*, x-i*) ≥ fi(xi, x-i*).
  14. Nash Equilibrium contd. ∀i, xi ∈ Si : fi(xi*, x-i*) ≥ fi(xi, x-i*): the strategy profile is a Nash Equilibrium so long as no unilateral deviation from the strategy by any single player improves that player’s payoff.
  15. Prisoner’s dilemma payoffs (Prisoner 1’s payoff listed first):

                               Prisoner 2: CONFESS    Prisoner 2: LIE
          Prisoner 1: CONFESS        -6, -6                0, -10
          Prisoner 1: LIE           -10,  0               -2,  -2
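A small sketch checking the equilibrium condition from slides 13–14 against this payoff matrix: a profile is a Nash Equilibrium iff neither prisoner can improve their own payoff by deviating unilaterally.

    from itertools import product

    ACTIONS = ["CONFESS", "LIE"]
    # payoffs[(a1, a2)] = (payoff to Prisoner 1, payoff to Prisoner 2)
    payoffs = {
        ("CONFESS", "CONFESS"): (-6, -6),
        ("CONFESS", "LIE"): (0, -10),
        ("LIE", "CONFESS"): (-10, 0),
        ("LIE", "LIE"): (-2, -2),
    }

    def is_nash(a1, a2):
        u1, u2 = payoffs[(a1, a2)]
        # No unilateral deviation by either player improves that player's payoff.
        no_better_1 = all(payoffs[(d, a2)][0] <= u1 for d in ACTIONS)
        no_better_2 = all(payoffs[(a1, d)][1] <= u2 for d in ACTIONS)
        return no_better_1 and no_better_2

    for profile in product(ACTIONS, repeat=2):
        if is_nash(*profile):
            print(profile)  # ('CONFESS', 'CONFESS') is the unique equilibrium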
  16. Oskar Morgenstern and John von Neumann: a mixed strategy Nash Equilibrium will exist in any zero-sum game involving a finite set of actions.
  17. A mixed strategy Nash Equilibrium will exist in any zero-sum game involving a finite set of actions. [Example hands shown as card images.] The decision to bet/call, raise, or fold will vary depending on the situation.
  18. A mixed strategy Nash Equilibrium will exist in any zero-sum game involving a finite set of actions. [A second set of example hands shown as card images.] The decision to bet/call, raise, or fold will vary depending on the situation.
  19. Exploitable vs. non-exploitable play. A player’s strategy: a probability distribution over a range of public actions. Determining a player’s range: a function of the cards they’ll play, and for what price. Expected utility: an estimation of the payoff function.
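A hypothetical sketch of these three objects (the hands, probabilities, and utilities below are invented for illustration): a strategy maps each private hand to a distribution over actions, a range is a distribution over hands, and expected utility averages the payoff over both.

    # Strategy: per-hand probability distribution over public actions.
    strategy = {
        "AA":  {"raise": 0.90, "call": 0.10, "fold": 0.00},
        "72o": {"raise": 0.05, "call": 0.00, "fold": 0.95},
    }

    # Range: probability distribution over the hands a player could hold.
    range_ = {"AA": 0.6, "72o": 0.4}

    # Hypothetical payoffs for each (hand, action) pair.
    utility = {("AA", "raise"): 5.0, ("AA", "call"): 2.0, ("AA", "fold"): 0.0,
               ("72o", "raise"): -1.0, ("72o", "call"): -2.0, ("72o", "fold"): 0.0}

    # Expected utility: payoff averaged over the range and the strategy.
    ev = sum(range_[h] * p * utility[(h, a)]
             for h, dist in strategy.items() for a, p in dist.items())
    print(round(ev, 3))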
  20. Exploitable vs. non-exploitable play. A Nash Equilibrium strategy seeks to maximize expected utility against a best-response opponent strategy. Exploitability: the difference between the expected utility of a best-response strategy and the expected utility under a NE.
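A minimal sketch of this measure on rock-paper-scissors (again an illustrative toy game, not from the talk): the equilibrium value of this symmetric zero-sum game is 0, so exploitability reduces to how much a best-responding opponent wins against a given strategy.

    PAYOFF = [[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]]  # row player's payoff, zero-sum

    def exploitability(strategy):
        # The best-responding column opponent picks the pure action that
        # maximizes their own payoff, i.e. -PAYOFF, against our mixture.
        best_response_value = max(
            sum(-PAYOFF[i][j] * strategy[i] for i in range(3)) for j in range(3)
        )
        return best_response_value - 0.0  # equilibrium value of the game is 0

    print(exploitability([1/3, 1/3, 1/3]))     # 0.0: the NE is unexploitable
    print(exploitability([0.5, 0.25, 0.25]))   # 0.25: a rock-heavy mix is exploited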
  21. DeepStack: limited exploitability via heuristic search. Local strategy: computation over the current public state. Depth-limited lookahead: a learned value function. Restricted set of lookahead actions: a sparse action set (see slide 25) keeps the re-solved game small.
  22. Continual re-solving. Initial state: the strategy is determined by our range of play. Subsequent actions: discard the previous strategy and determine the action using the CFR algorithm, taking our stored range and a vector of opponent counterfactual values as input, then update our range and the opponent counterfactual values.
  23. Rules for re-solving. On our own action: replace the opponent counterfactual values with those computed in the re-solved strategy for our chosen action, and update our own range using the computed strategy and Bayes’ rule. On a chance action: replace the opponent counterfactual values with those computed for this chance action from the last re-solve, and update our own range by zeroing out impossible hands given the new public state.
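A minimal, hypothetical sketch of the two range updates just described (hand names and probabilities are invented for illustration; DeepStack tracks all 1,326 two-card holdings):

    def update_range_after_our_action(our_range, strategy, action):
        """Bayes' rule: P(hand | action) is proportional to P(action | hand) * P(hand)."""
        posterior = {h: p * strategy[h].get(action, 0.0) for h, p in our_range.items()}
        total = sum(posterior.values())
        return {h: p / total for h, p in posterior.items()}

    def update_range_after_chance(our_range, public_cards):
        """Zero out hands that conflict with the newly revealed public cards."""
        posterior = {h: (0.0 if set(h) & set(public_cards) else p)
                     for h, p in our_range.items()}
        total = sum(posterior.values())
        return {h: p / total for h, p in posterior.items()}

    our_range = {("As", "Ks"): 0.5, ("7h", "2c"): 0.5}
    strategy = {("As", "Ks"): {"raise": 0.8, "fold": 0.2},
                ("7h", "2c"): {"raise": 0.1, "fold": 0.9}}

    print(update_range_after_our_action(our_range, strategy, "raise"))
    # Raising shifts the range toward strong hands: ('As','Ks') now ~0.889.
    print(update_range_after_chance(our_range, ["Ks", "Qd", "2d"]))
    # The Ks on board makes ('As','Ks') impossible: all weight moves to ('7h','2c').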
  24. Limited-depth lookahead. “Intuition” neural net: a learned counterfactual value function that approximates the resulting values if the public state were to be solved with the current iteration’s ranges. Input: the public state (players’ ranges, pot size, public cards). Output: a vector of counterfactual values for holding any hand, for both players.
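A hypothetical sketch of that input/output contract (the layer sizes, board encoding, and single dense layer are illustrative stand-ins, not DeepStack's actual architecture):

    import numpy as np

    N_HANDS = 1326  # number of distinct two-card private holdings

    def counterfactual_value_net(range_p1, range_p2, pot_size,
                                 public_card_features, weights):
        # Input: both players' ranges, the pot size, and public-card features.
        x = np.concatenate([range_p1, range_p2, [pot_size], public_card_features])
        hidden = np.maximum(0.0, weights["W1"] @ x + weights["b1"])  # ReLU layer
        out = weights["W2"] @ hidden + weights["b2"]
        # Output: a counterfactual-value vector per hand, for each player.
        return out[:N_HANDS], out[N_HANDS:]

    # Illustrative shapes: 52-dim board encoding, 500 hidden units.
    rng = np.random.default_rng(0)
    in_dim = 2 * N_HANDS + 1 + 52
    weights = {"W1": rng.normal(0, 0.01, (500, in_dim)), "b1": np.zeros(500),
               "W2": rng.normal(0, 0.01, (2 * N_HANDS, 500)),
               "b2": np.zeros(2 * N_HANDS)}
    cfv1, cfv2 = counterfactual_value_net(rng.dirichlet(np.ones(N_HANDS)),
                                          rng.dirichlet(np.ones(N_HANDS)),
                                          pot_size=0.1,
                                          public_card_features=np.zeros(52),
                                          weights=weights)
    print(cfv1.shape, cfv2.shape)  # (1326,) (1326,)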
  25. Sparse lookahead trees. Reduction of actions: generate a sparse lookahead tree limited to only 4 kinds of action: fold, call, 2 or 3 bet sizes, and all-in. Optimization for decision speed: allows DeepStack to play at conventional human speed. Decision points are reduced from 10^160 to 10^7, and the resulting game can be solved in < 5 seconds using a GeForce GTX 1080.
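A hedged sketch of that action restriction (the pot-fraction bet sizes are illustrative; the talk only specifies "2 or 3 bet sizes"):

    def sparse_actions(pot_size, stack, facing_bet, bet_fractions=(1.0, 2.0)):
        """Build the restricted action set at one lookahead node."""
        actions = []
        if facing_bet:
            actions.append("fold")
        actions.append("call")  # or check when there is no bet to match
        for frac in bet_fractions:
            bet = frac * pot_size
            if bet < stack:  # bets at or above the stack collapse into all-in
                actions.append(f"bet {bet:g}")
        actions.append("all-in")
        return actions

    print(sparse_actions(pot_size=100, stack=450, facing_bet=True))
    # ['fold', 'call', 'bet 100', 'bet 200', 'all-in']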