Slide 1

llama.cpp for fun and (maybe) profit
AI & DL for Enterprise 2024-04 talk
@IanOzsvald – ianozsvald.com

Slide 2

Ian Ozsvald
- Strategist/Trainer/Speaker/Author, 20+ years
- Trainer – new Fast Pandas course
- Interim Chief Data Scientist
- 2nd Edition!

Slide 3

When should we be using LLMs?
Strategic Team Advisor

Slide 4

llama.cpp
- No need for a GPU + VRAM
- llama.cpp runs on CPU + RAM
- Nothing sent off your machine

Slide 5

Why use local models?
- Experiment with models as they’re published
- Use client data/source code – no data sent off your machine

Slide 6

Prototype ideas!
- llama-2-7b-chat.Q5_K_M.gguf
- 5GB on disk and in RAM, near real time (sketch below)
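
A minimal sketch of such a prototype, assuming the llama-cpp-python bindings; the prompt is illustrative:

# pip install llama-cpp-python
from llama_cpp import Llama

# Load the quantised chat model; roughly 5GB of RAM, no GPU needed
llm = Llama(model_path="llama-2-7b-chat.Q5_K_M.gguf", n_ctx=2048, verbose=False)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain llama.cpp in one sentence."}],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])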

Slide 7

Why use local models?
- See the wackiness early on
- What’s your strategy to catch varied outputs?

Slide 8

MS Phi-2 can “reason” (IS IT RIGHT?)
- I had confident answers: 125.2 m/s (good Python); 17.2 m/s (partial Python with comments that had mistakes); 40 m/s and 31.3 m/s (as teacher)
- Which one to believe?
- My model is quantised (Q5) but random variation exists anyway…
- The MS post didn’t disclose the prompt they used
- https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

Slide 9

Quantisation
- Similar to JPG compression: shrink the trained model
- 32 → 16 → 8 → 7/6/5/4/3/2 bits
- Fewer bits → worse text completion (toy sketch below)
- “Q5 generally an acceptable level”
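
A toy NumPy illustration of why fewer bits means worse completions – this simulates naive symmetric rounding, not the K-quant scheme llama.cpp actually uses:

import numpy as np

def quantise(weights, bits):
    # Round each weight to the nearest of 2**(bits-1)-1 signed levels
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / levels
    return np.round(weights / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
for bits in (8, 5, 4, 3, 2):
    err = np.abs(w - quantise(w, bits)).mean()
    print(f"{bits}-bit mean absolute error: {err:.4f}")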

Slide 10

Quantisation
- (Chart: perplexity vs model size for the original fp16 models and their quantised variants; lower is better)
- Bigger models at higher quantisation still have lower perplexity than smaller, less quantised models
- Choose the biggest you can
- K-quants PR: https://github.com/ggerganov/llama.cpp/pull/1684

Slide 11

What about image queries?
- Experiment with multi-modal, e.g. OCR and checking a photo meets requirements

Slide 12

Extract facts from images?
- LLaVA multi-modal
- llava-v1.5-7b-Q4_K.gguf, 4GB on disk & RAM, 5s for the example
- llama.cpp provides ./server (Python sketch below)
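
To script the same kind of image query from Python, the llama-cpp-python bindings expose a LLaVA chat handler; the paths below are illustrative, and the mmproj (CLIP projector) file ships alongside the LLaVA weights:

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="llava-v1.5-7b-Q4_K.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,  # image embeddings consume context tokens
)

result = llm.create_chat_completion(messages=[
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "file:///path/to/photo.jpg"}},
        {"type": "text", "text": "What facts can you extract from this image?"},
    ]},
])
print(result["choices"][0]["message"]["content"])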

Slide 13

Can they help with coding?
- Trial code support
- Code review? “Is this test readable?”
- What do you do with code and LLMs?

Slide 14

Can you explain this function please?
- codellama-34b-python.Q5_K_M.gguf, 23GB on disk & RAM, 30s for the example
- Can we use this as a “code reviewer” for internal code? (sketch below)
- codellama’s answer: “The function test_uniform_distribution creates a list of 10 zeros, then increments the position in that list indicated by the murmurhash3_32() digest of i. It does this 100000 times and then checks if the means of those incremented values are uniformly distributed (i.e., if they're all roughly the same).” (surprisingly clear!)
- https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/tests/test_murmurhash.py
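
The underlying call is a plain completion; a sketch assuming llama-cpp-python, where the file read is the scikit-learn test linked above:

from llama_cpp import Llama

llm = Llama(model_path="codellama-34b-python.Q5_K_M.gguf", n_ctx=4096)

source = open("test_murmurhash.py").read()  # the scikit-learn test file
prompt = f"Can you explain this function please?\n\n{source}\n\nExplanation:"
out = llm(prompt, max_tokens=300)
print(out["choices"][0]["text"])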

Slide 15

My experiment for code assist
- Give test functions (e.g. Pandas) to codellama
- Ask it “Is this a good test function?”
- Try to get it to propose new test functions
- Check using pytest and coverage tools (harness sketch below)
- Shortcut human effort at project maintenance?
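
One plausible harness for that loop; the file names are placeholders and pytest-cov provides the --cov flag:

import subprocess
from llama_cpp import Llama

llm = Llama(model_path="codellama-34b-python.Q5_K_M.gguf", n_ctx=4096)

existing = open("tests/test_example.py").read()
out = llm(
    f"Here is a pytest file:\n\n{existing}\n\nPropose one new test function:\n",
    max_tokens=512,
)

# Write the proposal to disk and let pytest + coverage judge it;
# a failing or redundant test is cheap to discard
with open("tests/test_generated.py", "w") as f:
    f.write(out["choices"][0]["text"])
subprocess.run(["pytest", "--cov", "tests/test_generated.py"])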

Slide 16

API leads to understanding
- Using the Python API we can learn how it works
- Get embeddings (sketch below)
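
For example, embeddings are one constructor flag away in llama-cpp-python; the model path is illustrative:

from llama_cpp import Llama

# embedding=True switches the model into embedding mode
llm = Llama(model_path="llama-2-7b-chat.Q5_K_M.gguf", embedding=True)

emb = llm.create_embedding("llama.cpp runs on CPU+RAM")
vector = emb["data"][0]["embedding"]
print(len(vector))  # hidden-state dimensionality, e.g. 4096 for a 7B model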

Slide 17

API leads to understanding
- Q&A model trained on “Let, What, Suppose, Calculate, Solve” as very-likely first tokens
- log(p=1) == 0, log(p=0.5) ≈ -0.7 – a log-probability near 0 means a near-certain token (sketch below)
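
Token log-probabilities can be inspected the same way; a sketch assuming llama-cpp-python, where logprobs requires logits_all=True at load time and the model path is illustrative:

from llama_cpp import Llama

llm = Llama(model_path="phi-2.Q5_K_M.gguf", logits_all=True)

out = llm("Question: what is 2 + 2?\nAnswer:", max_tokens=1, logprobs=5)
# Top-5 candidate first tokens with their log-probabilities;
# 0.0 means certain (p=1), about -0.7 means p=0.5
print(out["choices"][0]["logprobs"]["top_logprobs"][0])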

Slide 18

Why try llama.cpp?
- Run quantised models on client data locally
- Experience the wackiness – what’s your mitigation?
- Use the Python API to see tokens, perplexity and more

Slide 19

Summary
- Do you want to talk about training or DS strategy?
- Discuss: How do we measure correctness?
- What’s the worst (!) that could go wrong with your projects?

Slide 20

Appendix – Attack via ASCII art

Slide 21

Appendix – Ask Mixtral to challenge my Monte Carlo estimation approach
- Mixtral gave 5 points and some items I should be careful about; ChatGPT 3.5 gave 7 points; both felt similar

Slide 22

WizardCoder is good (tuned Llama 2)
- wizardcoder-python-34b-v1.0.Q5_K_S.gguf, 22GB on disk & RAM, 15s for the example
- You can replace Copilot with this for completions