How well do LLMs (GPT 5.2, Opus 4.5, GLM 4.7, DeepSeek Terminus 3.1) play chess against the mature Stockfish engine? Not so well, it turns out, but they're not rubbish either. Given at PyDataLondon 2026-01.
My approach
GLM 4.7, Opus 4.5, GPT 5.2 and DeepSeek Terminus 3.1: the LLM plays Black, 3 games per model.
Moves in UCI notation only (e.g. e2e4).
Max 3 attempts at a legal move; 'resign' allowed.
By [ian]@ianozsvald[.com] Ian Ozsvald
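The move rules on the "My approach" slide (UCI moves only, up to 3 attempts at a legal move, 'resign' allowed) can be sketched in Python roughly as follows. This is a minimal sketch, not the talk's code: `ask_llm` is a hypothetical callable standing in for whichever model API is used, the set of legal moves is assumed to come from a chess library or engine, and returning `None` after 3 failed attempts is an assumption about how the caller then scores the game.

```python
import re
from typing import Callable, Iterable, Optional

# UCI long algebraic notation: from-square, to-square, optional promotion piece.
UCI_PATTERN = re.compile(r"[a-h][1-8][a-h][1-8][qrbn]?")

MAX_ATTEMPTS = 3  # from the slide: max 3 attempts at a legal move


def is_uci_syntax(text: str) -> bool:
    """True if text looks like a UCI move such as 'e2e4' or 'e7e8q' (syntax only)."""
    return UCI_PATTERN.fullmatch(text) is not None


def get_llm_move(
    ask_llm: Callable[[str], str],  # hypothetical: prompt in, raw model reply out
    prompt: str,
    legal_moves: Iterable[str],     # assumed to be supplied by a chess library/engine
    max_attempts: int = MAX_ATTEMPTS,
) -> Optional[str]:
    """Ask the model for a move, retrying up to max_attempts times.

    Returns a legal UCI move, the string 'resign', or None if every
    attempt was rejected (how that is scored is left to the caller).
    """
    legal = set(legal_moves)
    feedback = ""
    for _ in range(max_attempts):
        reply = ask_llm(prompt + feedback).strip().lower()
        if reply == "resign":
            return "resign"
        if is_uci_syntax(reply) and reply in legal:
            return reply
        # Tell the model why its previous answer was rejected before retrying.
        feedback = f"\n'{reply}' is not a legal UCI move here. Try again."
    return None


# Usage with a scripted stand-in for the model: SAN ('Nf6') and an
# off-board square ('e7e9') are rejected, then 'g8f6' is accepted.
replies = iter(["Nf6", "e7e9", "g8f6"])
move = get_llm_move(lambda p: next(replies), "Your move as Black:", ["g8f6", "e7e5"])
```

Feeding the rejection back into the next prompt is one design choice among several; the slide only fixes the number of attempts, not what the model is told between them.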
Outcomes & next steps
… lose, or resign.
Stockfish wins every game: 3 games × 4 models.
They all make illegal moves, sometimes repeatedly.
I can't make Stockfish 'easy' (maybe I'm missing something?).
Next steps: an ablation (ASCII board, or no FEN?) and more games!
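For the proposed ablation, one way to turn the FEN a model would otherwise see into an ASCII board is sketched below. These are hypothetical helpers, not code from the talk: the board layout (rank labels, '.' for empty squares) and the `variant` names are assumptions. Only the piece-placement field of the FEN is rendered, so side to move, castling rights and en passant information survive only in the arms that keep the FEN text.

```python
def fen_to_ascii(fen: str) -> str:
    """Render the piece-placement field of a FEN string as an 8x8 ASCII board.

    Uppercase = White, lowercase = Black, '.' = empty; rank 8 is printed first.
    """
    ranks = fen.split()[0].split("/")
    if len(ranks) != 8:
        raise ValueError(f"expected 8 ranks, got {len(ranks)}")
    rows = []
    for idx, rank in enumerate(ranks):
        cells = []
        for ch in rank:
            if ch.isdigit():
                cells.extend(["."] * int(ch))  # a digit is a run of empty squares
            else:
                cells.append(ch)               # a letter is a piece
        if len(cells) != 8:
            raise ValueError(f"rank {8 - idx} does not describe 8 squares: {rank!r}")
        rows.append(f"{8 - idx} " + " ".join(cells))
    rows.append("  a b c d e f g h")
    return "\n".join(rows)


def build_board_context(fen: str, variant: str = "fen") -> str:
    """Hypothetical prompt fragment for each arm of the ablation."""
    if variant == "fen":
        return f"FEN: {fen}"
    if variant == "ascii":  # the "no FEN" arm
        return fen_to_ascii(fen)
    if variant == "fen+ascii":
        return f"FEN: {fen}\n{fen_to_ascii(fen)}"
    raise ValueError(f"unknown variant: {variant!r}")


# Usage: the position after 1. e4, with Black (the LLM) to move.
AFTER_E4 = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1"
print(build_board_context(AFTER_E4, "fen+ascii"))
```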