Slide 1

Slide 1 text

Valuable Lessons Learned on Kaggle's ARC AGI LLM Challenge
PyData Global, December 2024
@IanOzsvald – ianozsvald.com

Slide 2

Slide 2 text

Strategist / Trainer / Speaker / Author – 20+ years
Slight LLM fanatic now
Interim Chief Data Scientist
3rd Edition!
By [ian]@ianozsvald[.com] – Ian Ozsvald

Slide 3

Slide 3 text

6 months of evening work on ARC AGI
Do you have the right text representation?
Can we score (feedback) our way to the truth?
Might we have an interesting collaboration?
What are some LLM limits?
https://arcprize.org/play

Slide 4

Slide 4 text

Abstraction & Reasoning Challenge – ARC AGI on Kaggle
Can LLMs reason?
Abstract JSON "initial → target" grids
400 public problems, 3–4 examples each + 1 test case
100 private problems on Kaggle (so no OpenAI etc.)
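The train/test layout of an ARC task file can be sketched as follows. This is a hypothetical toy task (the rule here is "swap 1s and 2s"), not one of the real 400, but the structure matches the Kaggle JSON files: each training example pairs an input grid with an output grid, while the test case supplies only the input.

```python
import json

# A minimal ARC-style task. Grid cells are integers 0-9 (colours).
# Toy transformation rule for this example: swap 1s and 2s.
task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[2, 1], [1, 2]]},
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[1, 1], [2, 2]]},  # solver must predict the output
    ],
}

# Serialise one grid exactly as the JSON files store it.
print(json.dumps(task["train"][0]["input"]))  # [[1, 2], [2, 1]]
```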

Slide 5

Slide 5 text

What rules do you need?
I'm working on these "easy" problems

Slide 6

Slide 6 text

Some are really complex!

Slide 7

Slide 7 text

Llama 3.0
Llama 3.0 8B (Q8): 10 GB, 20 sec/problem
Llama 3.0 70B (Q4): 40 GB, 3 min/problem
Can I solve simpler challenges on an RTX 3090 (24 GB)? Why not harder challenges?
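A back-of-envelope calculation shows why these model/quantisation combinations do or don't fit on a 24 GB card. The 20% overhead factor (KV cache, activations) and the ~4.25 bits/weight for Q4-family quantisation are my assumptions, not figures from the talk:

```python
def quantized_size_gb(n_params_billion, bits_per_weight, overhead=1.2):
    """Rough VRAM estimate: quantised weights plus ~20% for KV cache
    and activations. All figures are approximations."""
    return n_params_billion * bits_per_weight / 8 * overhead

# 8B at Q8 (~8 bits/weight) -> comfortably fits a 24 GB RTX 3090
print(round(quantized_size_gb(8, 8), 1))      # 9.6

# 70B at Q4 (~4.25 bits/weight) -> exceeds 24 GB, needs CPU offload
print(round(quantized_size_gb(70, 4.25), 1))  # 44.6
```

This matches the slide's rough sizes: the 8B model fits entirely on the GPU (hence 20 sec/problem), while the 70B model spills to CPU RAM and runs roughly an order of magnitude slower.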

Slide 8

Slide 8 text

Example prompt – setup (Alpaca)
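The transcript doesn't reproduce the actual prompt, but an Alpaca-style setup typically looks like the sketch below. The instruction wording and the `solve(grid)` contract are my illustrative guesses, not the talk's exact text:

```python
# The standard Alpaca instruction template (instruction + input + response).
ALPACA_TEMPLATE = """\
Below is an instruction that describes a task, paired with an input that
provides further context. Write a response that appropriately completes
the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""

# Hypothetical instruction asking for a Python rule, as the talk's
# pipeline executes generated code rather than raw grid answers.
instruction = (
    "Study the example input/output grid pairs, deduce the transformation "
    "rule, and write a Python function solve(grid) that applies it."
)
example_pairs = "input: [[1, 2], [2, 1]]\noutput: [[2, 1], [1, 2]]"

prompt = ALPACA_TEMPLATE.format(instruction=instruction, input=example_pairs)
print(prompt.splitlines()[0])  # first line of the rendered prompt
```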

Slide 9

Slide 9 text

Example – 1 of 3–4 example pairs

Slide 10

Slide 10 text

Example – the "ask"
A chain-of-thought prompt lifts the success rate

Slide 11

Slide 11 text

Whole Prompt
Setup → Examples → Prototype →

Slide 12

Slide 12 text

Overall process – repeat many times
Generate prompt, add specific hints per problem
Call llama.cpp, get response
Run through subprocess (clean environment)
If it runs – did it get it right? Else capture the exception
If valid, try separately on the test example (write everything to a db)
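The subprocess-isolation step above can be sketched roughly like this. `run_candidate` is a hypothetical helper, not the talk's actual code, but it shows the idea: execute the LLM-generated function in a clean interpreter so crashes, bad imports and timeouts can't poison the main process, and captured exception text can be fed back as a hint:

```python
import json
import subprocess
import sys
import tempfile

def run_candidate(code, grid, timeout=10):
    """Run LLM-generated code (expected to define solve(grid)) in a
    separate Python process; return (result, error)."""
    # Append a tiny harness that applies solve() to the grid argument.
    harness = code + (
        "\nimport json, sys\n"
        "print(json.dumps(solve(json.loads(sys.argv[1]))))\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(harness)
        path = f.name
    try:
        out = subprocess.run(
            [sys.executable, path, json.dumps(grid)],
            capture_output=True, text=True, timeout=timeout,
        )
        if out.returncode != 0:
            return None, out.stderr  # exception text, usable as feedback
        return json.loads(out.stdout), None
    except subprocess.TimeoutExpired:
        return None, "timeout"

# A well-behaved candidate: transpose the grid.
candidate = "def solve(grid):\n    return [list(r) for r in zip(*grid)]\n"
result, err = run_candidate(candidate, [[1, 2], [3, 4]])
print(result, err)  # [[1, 3], [2, 4]] None
```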

Slide 13

Slide 13 text

42% of solutions are pretty good!
It counts! Comments! Reasonable numpy! Correct substitution!
They correctly solve the train and test problems

Slide 14

Slide 14 text

Convincing weirdness

Slide 15

Slide 15 text

Do many independent runs
Repeat 50 times (70B, overnight) or 250 times (8B, a few hours)

Slide 16

Slide 16 text

What about representation?
Maybe a JSON list isn't optimal?
What about a block of numbers? Separated numbers? Or CSV-like quoted cells? Or Excel-style?
AUDIENCE – WHICH IS BEST?

Slide 17

Slide 17 text

What about representation?
JSON-only (J): 12% success
JSON + number block: 42%
JSON + numbers: 30%
JSON + Excel: 14%
JSON + CSV-like: 54%
My early "local optimum"
Also tried combos, single quotes, double quotes, no commas etc. – all worse
n=50, 70B Q4 model, 9565186b.json, prompt fixed except for the representations
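The talk doesn't show the exact serialisations tested, but plausible versions of the four non-Excel encodings might look like this sketch (the function names and precise formatting are my guesses):

```python
import json

grid = [[3, 0, 7], [7, 0, 3]]

def as_json(g):
    # JSON list-of-lists, as stored in the task files (12% alone)
    return json.dumps(g)

def as_number_block(g):
    # digits run together, one row per line (the 42% variant)
    return "\n".join("".join(str(c) for c in row) for row in g)

def as_spaced(g):
    # space-separated numbers, one row per line (the 30% variant)
    return "\n".join(" ".join(str(c) for c in row) for row in g)

def as_csv_like(g):
    # quoted, comma-separated cells - the best-performing variant (54%)
    return "\n".join(",".join(f"'{c}'" for c in row) for row in g)

for fn in (as_json, as_number_block, as_spaced, as_csv_like):
    print(fn.__name__, repr(fn(grid)))
```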

Slide 18

Slide 18 text

Representation
Just like in ML – features matter!
Do you have a way to change your features?
YAML / JSON / XML / descriptions / Markdown / ?

Slide 19

Slide 19 text

What if we try to add feedback?
What if we tell it "that didn't work, try better"?
We know we can have a conversation
Does it actually improve if we show it the mistakes?

Slide 20

Slide 20 text

Results for iteration
54% baseline (none, or "1 iteration")
Say "do better next time" and it improves to ~65%
n=50, 70B Q4 model, 9565186b.json, prompt fixed except for the feedback section
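A generic version of this score-and-retry loop might look like the following. Everything here is a hypothetical stand-in (`ask_llm`, `score`, the toy model and oracle); the point is the shape: each failed round appends the mistake plus a corrective hint to the history, so the model sees what went wrong:

```python
def iterate_with_feedback(ask_llm, score, max_rounds=3):
    """Score-and-retry loop. ask_llm(history) -> answer,
    score(answer) -> (ok, feedback). Returns (answer, rounds_used)."""
    history = []
    for round_no in range(1, max_rounds + 1):
        answer = ask_llm(history)
        ok, feedback = score(answer)
        if ok:
            return answer, round_no
        # Record the failure and a "do better" hint for the next round.
        history.append((answer, f"That was wrong: {feedback}. Do better."))
    return None, max_rounds

# Toy stand-ins: the "model" guesses upward, the oracle wants 3.
def fake_llm(history):
    return len(history) + 1  # guesses 1, then 2, then 3...

def oracle(answer):
    return (answer == 3, f"expected 3, got {answer}")

print(iterate_with_feedback(fake_llm, oracle))  # (3, 3)
```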

Slide 21

Slide 21 text

Better feedback example (2 of 4 shown)

Slide 22

Slide 22 text

Results for iteration + feedback
>75% success rate if we add guidance
Faster, too, than the previous method

Slide 23

Slide 23 text

Can you give feedback and iterate?
We jumped from 12% to 75% correctness by changing the representation and giving useful feedback
Code-based execution (rules/Python) plus an oracle feels pretty powerful versus a hallucinatory LLM

Slide 24

Slide 24 text

Next steps?
Solutions fixate – e.g. on a85 it fixates on mode!
Recording history and requesting "don't do this again" avoids the same mistake, but do we approach success?
Trying → summarised successes as "seed" ideas?

Slide 25

Slide 25 text

“Techniques” library?

Slide 26

Slide 26 text

Conclusion
Representations matter – like classical ML
Scoring + iteration with ground truth enables improvement – like classical ML
I'm open to (work) collaborations
See my NotANumber.email for updates

Slide 27

Slide 27 text

Next steps
Give it a history: "all this reasoning history hasn't worked…"
Llama 3.2 vision (vs "telling it what to focus on")
Extract a library of "helper functions"?
More scoring + feedback? (e.g. mask → example?)
But how to get it to "see" what's there?

Slide 28

Slide 28 text

“Techniques” library? (a85)

Slide 29

Slide 29 text

Hints seem to be critical
Show hints
What happens if you have the wrong hint?
What's the point if the human is solving the hard part of the problem?

Slide 30

Slide 30 text

No hints – no joy (iteration + feedback)
Given no hints, it'll come up with rules and keep modifying them, but it never "locks on" to a valid solution, even with iterations of self-critique
* CUDA_VISIBLE_DEVICES=0 nice -n 1 python system4_iterations.py -p 9565186b.json -m Meta-Llama-3-70B-Instruct-IQ4_XS.gguf -i 50 -l -t 0.4 --prompt_iterations 3