Valuable Lessons Learned on Kaggle’s ARC AGI LLM Challenge (PyDataGlobal 2024)

PyDataGlobal 2024-12 talk, @IanOzsvald, ianozsvald.com

By [ian]@ianozsvald[.com] Ian Ozsvald
Strategist/Trainer/Speaker/Author, 20+ years
Slight LLM fanatic now
Interim Chief Data Scientist
3rd Edition!

What are some LLM limits?
6 months of evening work on ARC AGI
Do you have the right text representation?
Can we score (feedback) our way to the truth?
Might we have an interesting collaboration?
https://arcprize.org/play

Can LLMs reason? Abstract JSON “initial → target” 400 public
problems, 3-4 examples each + 1 test case 100 private problems on Kaggle (so no OpenAI etc) Abstraction & Reasoning Challenge ARC AGI on Kaggle By [ian]@ianozsvald[.com] Ian Ozsvald
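To make the "initial → target" JSON concrete, here is a minimal sketch of reading one task. It assumes the layout of the public ARC task files, where "train" and "test" are lists of pairs with "input" and "output" grids. The tiny task below is made up for illustration and is not one of the 400 public problems; `grid_shape` and `summarise_task` are hypothetical helper names.

```python
import json

# A made-up task in the layout of the public ARC JSON files:
# "train" and "test" are lists of {"input": grid, "output": grid} pairs,
# and each grid is a list of rows of small integers (colours).
EXAMPLE_TASK = """
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]], "output": [[0, 3], [3, 0]]}
  ]
}
"""


def grid_shape(grid):
    """Return (rows, columns) for a rectangular grid."""
    return len(grid), len(grid[0])


def summarise_task(task):
    """List the input and output shape of every example pair in a task."""
    lines = []
    for split in ("train", "test"):
        for i, pair in enumerate(task[split]):
            src = grid_shape(pair["input"])
            dst = grid_shape(pair["output"])
            lines.append(f"{split}[{i}]: {src} -> {dst}")
    return lines


task = json.loads(EXAMPLE_TASK)
for line in summarise_task(task):
    print(line)
```

For a real task file, `json.load(open(path))` would replace the embedded string.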

What rules do you need?
I'm working on these "easy" problems

Some are really complex!

Llama 3.0
Llama 3.0 8B (Q8): 10 GB, 20 sec/pr.
Llama 3.0 70B (Q4): 40 GB, 3 min/pr.
Can I solve simpler challenges on an RTX 3090 24GB?
Why not harder challenges?

Example prompt – setup (Alpaca)

Example - 1 of 3-4 example pairs

XXX Example – the “ask”
Chain-of-thought prompt lifts success rate

Whole Prompt
Setup -> Examples -> Prototype ->

Overall process – repeat many times
Generate prompt, add specific hints per problem
Call Llama.cpp, get response
Run through subprocess (clean env)
If it runs – did it get it right? Else capture exception
If valid, try separately on the test example (write all to db)

42% solutions pretty good!
It counts!
Comments!
Reasonable numpy!
Correct substitution!
They correctly solve train and test problems

Convincing weirdness

Do many independent runs
Repeat 50 times (70B, overnight) or 250 times (8B, hours)
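One reason to repeat runs is that a success rate estimated from a few dozen attempts is noisy. As a rough sketch, the binomial standard error sqrt(p(1-p)/n) gives a feel for that noise. The 27-of-50 count below is an illustrative figure chosen to give a 54% rate, not data from the talk, and `success_rate` is a hypothetical helper.

```python
import math


def success_rate(outcomes):
    """Return (rate, standard_error) for a list of True/False run outcomes."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one run")
    rate = sum(outcomes) / n
    # Binomial standard error of a proportion.
    standard_error = math.sqrt(rate * (1 - rate) / n)
    return rate, standard_error


# Illustrative only: 27 successes in 50 independent runs.
outcomes = [True] * 27 + [False] * 23
rate, se = success_rate(outcomes)
print(f"{rate:.0%} +/- {se:.1%}")
```

With n=50 the standard error is around 7 percentage points, so small differences between two configurations may need more runs to separate.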

What about representation?
Maybe a JSON list isn’t optimal?
What about a block of numbers?
Separated numbers?
Or CSV-like “ ‘ ‘ “ ?
Or Excel?
AUDIENCE - WHICH IS BEST?

What about representation?
JSON-only (J)        12% success
JSON+number block    42%
JSON+numbers         30%
JSON+Excel           14%
JSON+CSV-like        54%
My early "local optima"
Also combos, single quotes, double quotes, no commas etc - all worse
n=50, 70BQ4 model, 9565186b.json, prompt fixed except for representations
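The representations compared above could be generated from one grid along these lines. The exact text formats used in the talk are not shown in the transcript, so every layout below is a guess made for illustration (digits run together for the "number block", spaces for "numbers", quoted comma-separated cells for "CSV-like", cell references for "Excel"). All function names are hypothetical.

```python
import json


def as_json(grid):
    """The baseline: the grid as a JSON list of lists."""
    return json.dumps(grid)


def as_number_block(grid):
    """Rows of digits with no separators (assumed layout)."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)


def as_separated_numbers(grid):
    """Rows of space-separated numbers (assumed layout)."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def as_csv_like(grid, quote="'"):
    """Rows of quoted, comma-separated cells (assumed layout)."""
    return "\n".join(
        ",".join(f"{quote}{cell}{quote}" for cell in row) for row in grid
    )


def column_name(index):
    """Spreadsheet-style column name: 0 -> A, 25 -> Z, 26 -> AA."""
    name = ""
    index += 1
    while index:
        index, remainder = divmod(index - 1, 26)
        name = chr(ord("A") + remainder) + name
    return name


def as_excel(grid):
    """Cells written as spreadsheet references such as A1=0 (assumed layout)."""
    return "\n".join(
        " ".join(f"{column_name(c)}{r + 1}={cell}" for c, cell in enumerate(row))
        for r, row in enumerate(grid)
    )


grid = [[0, 1], [2, 3]]
for render in (as_json, as_number_block, as_separated_numbers, as_csv_like, as_excel):
    print(render.__name__)
    print(render(grid))
```

Keeping each representation as a small function makes it cheap to swap one into an otherwise fixed prompt and compare success rates.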

Representation
Just like in ML? Features matter!
Do you have a way to change your features?
YAML / JSON / XML / descriptions / markdown / ?

What if we try to add feedback?
What if we tell it “that didn’t work, try better”?
We know we can have a conversation
Does it actually improve if we show the mistakes?

Results for iteration
54% baseline (none or “1 iteration”)
Say “do better next time”, it improves to ~65%
n=50, 70BQ4 model, 9565186b.json, prompt fixed, except for the feedback section
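A minimal sketch of the iterate-with-feedback idea follows. It is not the code from the talk: `generate` stands in for the LLM call, `evaluate` for the scoring step, and the wording of the retry prompt is invented. `describe_mismatch` shows one possible way to "show the mistakes" by listing the cells that differ from the ground truth.

```python
def describe_mismatch(expected, actual):
    """Describe how an output grid differs from the expected grid."""
    expected_shape = (len(expected), len(expected[0]))
    actual_shape = (len(actual), len(actual[0]) if actual else 0)
    if expected_shape != actual_shape:
        return f"wrong shape: expected {expected_shape}, got {actual_shape}"
    wrong = [
        f"row {r} col {c}: expected {e}, got {a}"
        for r, (expected_row, actual_row) in enumerate(zip(expected, actual))
        for c, (e, a) in enumerate(zip(expected_row, actual_row))
        if e != a
    ]
    return "; ".join(wrong)


def iterate_with_feedback(generate, evaluate, base_prompt, max_iterations=3):
    """Ask for a solution, score it, and retry with feedback on failure.

    generate(prompt) -> candidate   (hypothetical stand-in for the LLM call)
    evaluate(candidate) -> (ok, feedback_text)
    Returns (candidate, iterations_used), or (None, max_iterations) if no
    attempt passed.
    """
    prompt = base_prompt
    for iteration in range(1, max_iterations + 1):
        candidate = generate(prompt)
        ok, feedback = evaluate(candidate)
        if ok:
            return candidate, iteration
        # Invented wording: show the failed attempt and what was wrong.
        prompt = (
            f"{base_prompt}\n\nYour previous attempt:\n{candidate}\n\n"
            f"Feedback: {feedback}\nPlease try again."
        )
    return None, max_iterations
```

The generic "do better next time" message and the more specific cell-by-cell message can both be plugged in through `evaluate`, which makes the two feedback styles easy to compare.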

XXX Better feedback example (2 of 4 shown)

Results for iteration + feedback
>75% success rate if we add guidance
Also faster than the previous method

Can you give feedback and iterate?
Jumped from 12% to 75% correctness by changing representation and by giving useful feedback
Code-based execution (rules/Python) + Oracle feels pretty powerful vs hallucinatory LLM

Next steps?
Solutions fixate – e.g. on a85 it fixates on mode!
Recording history->request “don’t do this again” avoids same mistake, but do we approach success?
Trying→summarised successes as “seed” ideas?

“Techniques” library?

Conclusion
Representations matter – like classical ML
Scoring + iteration with ground truth enables improvement – like classical ML
I’m open to (work) collaborations
See my NotANumber.email for updates

Next steps
Give it a history “all this reasoning hist. hasn’t worked…”
Llama 3.2 vision (vs “telling it what to focus on”)
Extract a library of “helper functions”?
More scoring+feedback? (e.g. mask->example?)
But how to get it to “see” what’s there?

“Techniques” library? (a85)

Show hints What happens if you have the wrong hint?
What’s the point if the human is solving the hard part of the problem? Hints seem to be critical By [ian]@ianozsvald[.com] Ian Ozsvald

No hints – no joy (iteration + feedback)
* CUDA_VISIBLE_DEVICES=0 nice -n 1 python system4_iterations.py -p 9565186b.json -m Meta-Llama-3-70B-Instruct-IQ4_XS.gguf -i 50 -l -t 0.4 --prompt_iterations 3
Given no hints, it'll come up with rules and keep modifying them, but it never "locks on" to a valid solution, even with iterations of self-critique
