Parallel Sampling & Rule-based Prompt Generation for Code Golf

@smly

December 06, 2025

Transcript

  1. Parallel Sampling & Rule-based Prompt Generation for Code Golf
     Kohei, Jokrasa and THUNDER THUNDER
     4th Place Solution for the Google Code Golf Championship (NeurIPS 2025)
  2. The Challenge: Google Code Golf Championship 2025
     Objective: Solve ARC-AGI tasks (Abstraction and Reasoning Corpus) in Python.
     Constraint: Minimize source code length (bytes).
     Metric:
     • Per task: Score = max(1, 2500 - Length) for correct solutions (sketch below).
     • Final ranking: determined by the cumulative score across 400 tasks.
     Key difficulty:
     • LLMs typically prioritize readability and explanation, not extreme brevity (code golf).
     • Solving tasks manually is allowed. Can LLM/AI agents beat professional golfers?
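
     A minimal sketch of the per-task scoring rule above. The function name is illustrative, and treating incorrect solutions as 0 points is an assumption, not something stated on the slide:

         def task_score(source: bytes, is_correct: bool) -> int:
             # Per-task rule from the slide: max(1, 2500 - length in bytes) when correct.
             # Scoring incorrect solutions as 0 is an assumption for illustration.
             return max(1, 2500 - len(source)) if is_correct else 0
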
  3. Generation of Initial Solutions
     Before we can shorten solutions, we need a first version that passes all tests. We used several (semi-)automated approaches for this (see the prompt-assembly sketch below):
     - Give the LLM access to the raw test cases and ask it to write code that passes them.
     - Additionally provide the code used to generate most of the test cases.
     - If necessary, also manually describe (part of) the pattern.
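
     A hypothetical sketch of how such an initial prompt could be assembled from the available material; the function name, field order, and wording are assumptions, not the authors' actual prompts:

         import json

         def initial_prompt(test_cases: list[dict], generator_src: str | None = None,
                            pattern_note: str | None = None) -> str:
             # Hypothetical prompt assembly: combine raw test cases with optional
             # generator code and a manual description of the pattern.
             parts = ["Write a Python program that passes all of these test cases:",
                      json.dumps(test_cases, indent=1)]
             if generator_src:
                 parts += ["The test cases were generated by this code:", generator_src]
             if pattern_note:
                 parts += ["Hint about the underlying pattern:", pattern_note]
             return "\n\n".join(parts)
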
  4. Why Repeated Sampling?
     • LLMs are stochastic; a single generation (greedy / temperature = 0) is rarely optimal for "edge" constraints like code golf.
     • Repeated Sampling (Best-of-N): generating N solutions and selecting the best one significantly improves performance on hard reasoning tasks [1] (sketch below).
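
     A minimal sketch of Best-of-N selection under the competition's length metric, assuming placeholder generate() and verify() callables (both hypothetical, not the authors' interfaces):

         def best_of_n(generate, verify, n: int = 64) -> str | None:
             # Best-of-N: sample n independent candidates, keep only verified ones,
             # and return the shortest in bytes; None if nothing passes.
             candidates = [generate() for _ in range(n)]
             valid = [c for c in candidates if verify(c)]
             return min(valid, key=lambda c: len(c.encode())) if valid else None
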
  5. Solution Architecture: The Golfing Loop
     Approach: Iterative refinement with parallel execution on Codex Cloud.
     Key components:
     1. Parallel Sampling: generate multiple candidates simultaneously.
     2. Rule-Based Prompting: guide the LLM with syntactic constraints.
     3. Verification: execute code in a sandbox to ensure correctness (sketch below).
     4. Selection: pick the shortest valid code (basically).
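
     An illustrative verification step, assuming for the sketch that a solution exposes a function p(grid) and that test cases are (input, expected_output) grid pairs; the real sandboxing, process isolation, and timeouts are not shown:

         def verify(source: str, cases: list[tuple[list, list]]) -> bool:
             # Execute the candidate and check every test case.
             # Assumes a solution defines p(grid) -> grid (an assumption, not
             # confirmed by the slides); in-process exec is not a real sandbox.
             ns: dict = {}
             try:
                 exec(source, ns)
                 return all(ns["p"](inp) == out for inp, out in cases)
             except Exception:
                 return False
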
  6. Parallel Sampling for Exploration
     • Setup: 10-30 "steps" per session.
     • Per step (sketch of one step below):
       ◦ Generate 4 parallel versions of the code.
       ◦ 10-120 min per step.
     • Selection criteria:
       ◦ Length-based.
       ◦ Sometimes choose a suboptimal option.
     • Stop criteria.
     • Diversity: encourage diverse implementation strategies through prompts and follow-up prompts.
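
     A minimal sketch of one such step, assuming a generate(prompt) callable that returns one candidate and a verify(candidate) predicate; the names and the simple shortest-wins selection are illustrative (the deliberate suboptimal picks mentioned above are not modeled):

         from concurrent.futures import ThreadPoolExecutor

         def golfing_step(current_best: str, prompts: list[str], generate, verify) -> str:
             # One step: launch 4 samplers in parallel (one per prompt variant),
             # keep the shortest verified candidate, fall back to the current best.
             with ThreadPoolExecutor(max_workers=4) as pool:
                 candidates = list(pool.map(generate, prompts[:4]))
             valid = [c for c in candidates if verify(c)] + [current_best]
             return min(valid, key=lambda c: len(c.encode()))
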
  7. Rule-Based Prompt Generation
     • LLMs struggle to "golf" without explicit instruction.
     • Method: analyze the current best code's AST (Abstract Syntax Tree) to generate specific improvement rules (sketch below).
     We believed there was significant room for improvement through regex, so we actively instructed the LLMs to use regex in the latter half.
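
     A small illustration of the idea, not the authors' actual rule set: walk the AST of the current best solution with Python's ast module and emit concrete, prompt-ready golfing rules.

         import ast

         def golfing_rules(src: str) -> list[str]:
             # Illustrative rule generator: inspect the current best code's AST and
             # turn obvious length sinks into explicit instructions for the next prompt.
             rules = set()
             for node in ast.walk(ast.parse(src)):
                 if isinstance(node, ast.Name) and len(node.id) > 1:
                     rules.add(f"Rename '{node.id}' to a single character.")
                 elif isinstance(node, ast.For):
                     rules.add("Replace explicit for-loops with comprehensions where possible.")
                 elif isinstance(node, ast.If) and node.orelse:
                     rules.add("Collapse if/else branches into conditional expressions.")
             return sorted(rules)
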
  8. Optimization Target: Compression-Friendly Code
     Instead of minimizing the raw source length, minimize the compressed size (e.g., using zopfli).
     Concept: Python can execute compressed code: exec(zlib.decompress(b'...')) (sketch below).
     Objective: instruct the LLM to write "compressible" code (repetitive patterns, consistent variable naming) rather than just "short" code.
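
     A minimal sketch of the packing trick using zlib from the standard library (zopfli, mentioned above, can emit a smaller zlib-compatible stream); since the exec stub adds overhead, both variants are compared and the shorter one is kept:

         import zlib

         def pack(src: str) -> bytes:
             # Wrap the source in an exec(zlib.decompress(b'...')) stub and keep
             # whichever variant is shorter in bytes. This stub is one simple
             # formulation, not necessarily the exact one used in the competition.
             raw = src.encode()
             payload = repr(zlib.compress(raw, 9)).encode()
             stub = b"import zlib\nexec(zlib.decompress(" + payload + b"))"
             return min(raw, stub, key=len)
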
  9. Escaping Local Optima: The "Tabula Rasa" Approach
     • The challenge: iterative refinement (modifying previous code) quickly hits a plateau (a local minimum); the model gets "stuck" in a specific algorithmic approach.
     • The solution: forced exploration via "from scratch" sampling (sketch below).
     • Mechanism:
       ◦ Run repeated sampling in a clean environment (using codex exec).
       ◦ Key prompt engineering: explicitly instruct the model to "Ignore the current code and write a solution from scratch."
     • Outcome:
       ◦ While most new sessions were worse, a small fraction discovered fundamentally different algorithms that were unreachable via incremental edits.
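
     A hypothetical wrapper around the clean-environment sampling described above; the exact codex exec invocation, prompt wording, and output handling here are assumptions, not the authors' setup:

         import subprocess

         FROM_SCRATCH = ("Ignore the current code and write a solution from scratch. "
                         "Produce the shortest Python program that passes all test cases.")

         def sample_from_scratch(task_dir: str) -> str:
             # Run Codex non-interactively in a fresh copy of the task directory,
             # with no current-best solution in the prompt or workspace.
             result = subprocess.run(["codex", "exec", FROM_SCRATCH],
                                     cwd=task_dir, capture_output=True, text=True)
             return result.stdout
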
  10. Insight: "Forgetting" as a Mechanism for Diversity
      • Analogy to dropout:
        ◦ In deep learning, dropout prevents overfitting by randomly omitting information.
        ◦ Here, intentionally "dropping" the context of the current best solution prevents "structural overfitting."
      • Exploration vs. exploitation:
        ◦ Exploitation: iteratively refining the best code (rule-based).
        ◦ Exploration: "blind" sampling without history.