LLMs vs Chess

ianozsvald
January 02, 2026

How well do LLMs (GPT 5.2, Opus 4.5, GLM 4.7, DeepSeek Terminus 3.1) play chess against the mature Stockfish engine? Not so well, it turns out, but they're not rubbish either. Given at PyDataLondon 2026-01.


Transcript

  1. Can LLMs play chess? Recent paper made curious choices. A new benchmark? I’m learning chess, so I needed a distraction. By [ian]@ianozsvald[.com] Ian Ozsvald
  2. What’s in the other paper? Random bot & Komodo engine (Elo 250). No move history, no FEN (?), inverted board for Black, Unicode board, tools for legal moves.
  3. My approach: Stockfish level 0 (Elo 1350?) or configurable Elo (or... is it?). Models: GLM 4.7, Opus 4.5, GPT 5.2, DeepSeek Terminus 3.1. LLM plays Black, run 3 games, UCI moves (e.g. e2e4) only. Max 3 attempts at a legal move, ‘resign’ allowed.
  4. Prompt: the FEN includes the 50-move rule counter, castling rights and en passant, which are missing from the graphical board.
  5. GLM 4.7 board state prior to loss: good description, quick to resign!
  6. They all write bad (non-UCI) answers! DeepSeek Terminus 3.1: JSONDecodeError.
  7. Outcomes & next steps: Stockfish level 0 (1350 Elo?) vs 4 models. They all lose, or resign: Stockfish wins all 3 games against each of the 4 models. They all make illegal moves, sometimes repeatedly. Can’t make Stockfish ‘easy’ (maybe I’m missing something?). Next: ablation (ASCII board or no FEN?) and more games!
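
The move-handling rules on slides 3, 6 and 7 (UCI moves only, up to 3 attempts at a legal move, ‘resign’ allowed, replies that are not valid UCI or not valid JSON) could be sketched roughly as below. This is not the code from the talk. The names `extract_move`, `choose_move` and `ask_model`, and the `{"move": ...}` reply schema, are hypothetical and exist only for illustration. The legal moves are passed in by the caller, since generating them needs a chess library. The slides do not say what happens once all attempts fail, so the sketch returns `None` and leaves that choice to the caller.

```python
import json
import re
from typing import Callable, Iterable, Optional

# UCI long algebraic notation: from-square, to-square, optional promotion piece.
UCI_PATTERN = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")


def extract_move(reply: str) -> Optional[str]:
    """Pull a UCI move or 'resign' out of a model reply, else return None."""
    text = reply.strip()
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = text  # not JSON, so treat the raw text as the answer
    if isinstance(payload, dict):
        payload = payload.get("move")  # assumed reply schema, not from the talk
    if not isinstance(payload, str):
        return None
    candidate = payload.strip().lower()
    if candidate == "resign" or UCI_PATTERN.match(candidate):
        return candidate
    return None


def choose_move(
    ask_model: Callable[[int], str],
    legal_moves: Iterable[str],
    max_attempts: int = 3,
) -> Optional[str]:
    """Ask the model up to max_attempts times for a legal move or a resignation."""
    legal = set(legal_moves)
    for attempt in range(1, max_attempts + 1):
        move = extract_move(ask_model(attempt))
        if move == "resign" or move in legal:
            return move
    return None  # every attempt was unparseable or illegal; the caller decides
```

In this sketch a reply such as `Nf6` (standard algebraic notation rather than UCI) is rejected by the pattern check, and a syntactically valid but illegal move such as `e7e4` is rejected by the legal-move check; both use up one attempt.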
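
Slide 4 notes that the FEN carries state a drawn board cannot show. As a minimal illustration of where that state lives, the sketch below splits a FEN string into its six space-separated fields. The `FenFields` and `parse_fen` names are hypothetical, and the example position (after 1. e4) is not taken from the talk.

```python
from typing import NamedTuple, Optional


class FenFields(NamedTuple):
    placement: str             # piece placement, rank 8 down to rank 1
    side_to_move: str          # "w" or "b"
    castling: str              # e.g. "KQkq", or "-" if no castling rights remain
    en_passant: Optional[str]  # en passant target square, or None
    halfmove_clock: int        # halfmoves since last capture or pawn move (50-move rule)
    fullmove_number: int       # starts at 1, incremented after Black's move


def parse_fen(fen: str) -> FenFields:
    """Split a FEN string into its six fields without validating the position."""
    parts = fen.split()
    if len(parts) != 6:
        raise ValueError(f"expected 6 FEN fields, got {len(parts)}")
    placement, side, castling, ep, halfmove, fullmove = parts
    return FenFields(
        placement,
        side,
        castling,
        None if ep == "-" else ep,
        int(halfmove),
        int(fullmove),
    )


# Position after 1. e4: Black to move, all castling rights, en passant target e3.
fields = parse_fen("rnbqkbnr/pppppppp/8/8/4P3/8/PPPPPPPP/RNBQKBNR b KQkq e3 0 1")
```

Only the first field can be drawn as a board; the castling, en passant and halfmove-clock fields are what the graphical board on the slide leaves out.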