LLMs on Small devices [DutchAUG]

Slides from my presentation at the Dutch Android User Group on LLMs that run on small devices such as Android phones.

Hugo Visser

March 22, 2024
Transcript

1. Neural networks
   • Inputs, (hidden) layers, outputs → matrices of values
   • Weights: roughly, factors that influence how a value is transformed between layers (sketched below)
   • Training: fiddle with the weights until the output approximates the expected output
   • Result of training → a collection of weights (numbers)
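
   As a rough illustration of what those weights do, here is a minimal, hypothetical Kotlin sketch of a single fully connected layer; it is not from the talk, it just makes "matrices of values" concrete:

       // Hypothetical illustration: one fully connected layer.
       // `weights` is an outputSize x inputSize matrix learned during training.
       fun denseLayer(input: DoubleArray, weights: Array<DoubleArray>, bias: DoubleArray): DoubleArray =
           DoubleArray(weights.size) { row ->
               var sum = bias[row]
               for (col in input.indices) sum += weights[row][col] * input[col]
               // A non-linearity (here ReLU) lets stacked layers learn non-linear functions.
               maxOf(0.0, sum)
           }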
2. Transformer models
   • Model architecture
   • Predict the next word based on the previous words
   • Context window
   • Training: show text, mask the next word, calculate the accuracy of the prediction ♻
   • Inference: start with an input sentence, predict a word, add it to the input sentence ♻
3. Transformer models
   • Model architecture
   • Predict probabilities of the next token based on the previous tokens
   • Context window
   • Training: show tokens, mask the next token, calculate the accuracy of the prediction ♻
   • Inference: start with input tokens, predict the next token, add it to the input tokens ♻ (see the sketch below)
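
   The inference loop fits in a few lines of Kotlin. The `Model` interface below is hypothetical, standing in for any next-token predictor:

       // Hypothetical interface standing in for any next-token predictor.
       interface Model {
           // Returns a probability for every token id in the vocabulary.
           fun nextTokenProbabilities(tokens: List<Int>): DoubleArray
       }

       // Greedy autoregressive decoding: predict a token, append it, repeat.
       fun generate(model: Model, prompt: List<Int>, maxNewTokens: Int, eosToken: Int): List<Int> {
           val tokens = prompt.toMutableList()
           repeat(maxNewTokens) {
               val probs = model.nextTokenProbabilities(tokens)
               val next = probs.indices.maxBy { probs[it] }  // pick the most likely token
               if (next == eosToken) return tokens
               tokens += next
           }
           return tokens
       }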
4. Tokenizers
   • Mapping from input to numbers (sketched below)
   • Token: (part of) a word, a character, a byte, etc.
   • The dictionary of a model
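
   A toy sketch of the idea, assuming a hand-made vocabulary; real tokenizers (e.g. BPE) learn their vocabulary from data, and these ids are invented:

       // Toy illustration: a tokenizer is essentially a dictionary from text
       // pieces to numbers. These pieces and ids are made up.
       val vocabulary = mapOf("Hel" to 101, "lo" to 102, " wor" to 103, "ld" to 104)
       val reverse = vocabulary.entries.associate { (piece, id) -> id to piece }

       // Greedily match the longest known piece at each position.
       fun encode(text: String): List<Int> {
           val ids = mutableListOf<Int>()
           var rest = text
           while (rest.isNotEmpty()) {
               val piece = vocabulary.keys
                   .filter { rest.startsWith(it) }
                   .maxByOrNull { it.length } ?: error("no token for: $rest")
               ids += vocabulary.getValue(piece)
               rest = rest.removePrefix(piece)
           }
           return ids
       }

       fun decode(ids: List<Int>): String = ids.joinToString("") { reverse.getValue(it) }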
5. Types of models
   • Base model: trained on roughly the whole internet (trillions of tokens)
     ◦ Can predict the next token, but is not an assistant
     ◦ 💸💸💸💸 (1000s of GPUs, $5–100 million)
   • Fine-tuning: continue training to create an assistant model
     ◦ Training text is structured conversations of ideal examples (~10k–100k+ examples)
   • Human preference tuning
     ◦ Further fine-tuning by generating responses and selecting the best ones
     ◦ "Alignment", "Safety", etc.
6. LLMs
   • Parameters (~weights): bigger is better
     ◦ (Open source) Llama: 7B, 13B, 70B
     ◦ GPT-4: 220B? 8× 220B? (huge)
   • Each parameter is ~4 bytes → a 7B model is ~28 GB (arithmetic sketched below)
   • Need GPU (memory) for acceleration
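
   The arithmetic behind that size, as a quick Kotlin sketch; the bytes-per-parameter values are the only inputs (0.5 bytes corresponds to the 4-bit quantization on the next slide):

       // Back-of-the-envelope model memory: parameters x bytes per parameter.
       fun modelSizeGb(parameters: Double, bytesPerParameter: Double): Double =
           parameters * bytesPerParameter / 1e9

       fun main() {
           println(modelSizeGb(7e9, 4.0))  // float32: ~28 GB, far beyond phone RAM
           println(modelSizeGb(7e9, 0.5))  // 4-bit quantized: ~3.5 GB, borderline feasible
       }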
7. Shrinking LLMs
   • Quantize the model: store each parameter in fewer bytes (sketched below)
     ◦ Reduces accuracy, but is generally OK
     ◦ Other optimisations are possible
   • Train smaller models (fewer parameters)
   • Both
   • Or: train a specialized ultra-small model (llama2.c) (megabytes vs gigabytes)
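
   A minimal sketch of the simplest variant, symmetric 8-bit quantization; real schemes such as llama.cpp's block-wise formats keep a scale per block of weights rather than per tensor:

       // Minimal sketch: map float32 weights to int8 plus one shared scale.
       class Quantized(val scale: Float, val values: ByteArray)

       fun quantize(weights: FloatArray): Quantized {
           val maxAbs = weights.maxOf { kotlin.math.abs(it) }
           val scale = if (maxAbs == 0f) 1f else maxAbs / 127f
           val q = ByteArray(weights.size) { i ->
               (weights[i] / scale).toInt().coerceIn(-127, 127).toByte()
           }
           return Quantized(scale, q)  // 1 byte per weight instead of 4
       }

       fun dequantize(q: Quantized): FloatArray =
           FloatArray(q.values.size) { i -> q.values[i] * q.scale }  // approximate originals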
8. Small LLM: Phi-2
   • Trained by Microsoft on "high quality" data
   • Open source research model
   • 2.7B parameters with the performance of a 13B-parameter model
   • A quantized model can run on a phone
   • https://huggingface.co/microsoft/phi-2
9. Use cases
   • Private processing of (sensitive) data
   • Searching / querying local documents (RAG; sketched below)
   • Sentiment analysis, summarization, background jobs
   • No additional costs
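
   A rough sketch of the RAG idea, assuming hypothetical `embed` and `llm` functions standing in for your embedding model and on-device LLM:

       // Rank local documents by embedding similarity, then prepend the best
       // matches to the prompt. `embed` and `llm` are hypothetical stand-ins.
       fun cosine(a: FloatArray, b: FloatArray): Float {
           var dot = 0f; var na = 0f; var nb = 0f
           for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
           return dot / (kotlin.math.sqrt(na) * kotlin.math.sqrt(nb))
       }

       fun answer(question: String, documents: List<String>,
                  embed: (String) -> FloatArray, llm: (String) -> String): String {
           val queryVec = embed(question)
           val context = documents
               .sortedByDescending { cosine(embed(it), queryVec) }
               .take(3)  // top-3 most relevant documents
               .joinToString("\n")
           return llm("Answer using only this context:\n$context\n\nQuestion: $question")
       }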
10. Limitations
   • Smaller model → less "knowledge" → more specialised models
   • Open source base models often need tuning → (Dutch) data
11. A word about privacy
   • ChatGPT, Gemini web → data is used for training
     ◦ Except for enterprise editions
   • OpenAI API → data is not used for training
   • Google Gemini… it depends
     ◦ TL;DR: the Gemini API (not available in the EU) may train on your users' data
     ◦ The Google Cloud Platform Vertex API has different terms and does not use your data for training
12. Running LLMs on Android
   • llama.cpp → open source, CPU & GPU
     ◦ Supports many model architectures, such as Phi-2 and Gemma
     ◦ Moves very fast (multiple releases a day)
   • Candle → ML framework from 🤗, written in Rust 😅
   • gemma.cpp
   • MediaPipe (crashes on a real device)
   • Gemini Nano → only on select high-end devices, currently in early access (EAP)
13. Integrating llama.cpp
   1. Download the source code
   2. Compile with Android Studio / the NDK
   3. Create a JNI layer -or- find an existing library
   4. Profit!
14. NDK + cmake
   • Out-of-tree build vs. building as part of your source tree
     ◦ (Sometimes) easier for simple cases
     ◦ Can be convenient for testing, e.g. run command-line tools on the device, import the .so with cmake
     ◦ Hard to maintain
   https://developer.android.com/ndk/guides/cmake#command-line
15. NDK + cmake

       android {
           // ndkVersion is a property of the android block, not of ndk {}
           ndkVersion = libs.versions.ndk.get()
           defaultConfig {
               externalNativeBuild {
                   cmake {
                       arguments += listOf(
                           "-DCMAKE_BUILD_TYPE=Release",
                           "-DBUILD_SHARED_LIBS=ON",
                           "-DLLAMA_STANDALONE=ON",
                           "-DANDROID_STL=c++_shared"
                       )
                   }
               }
               ndk {
                   abiFilters.add("arm64-v8a")
               }
           }
       }
16. NDK and dependencies
   • It's complicated
     ◦ Not all projects use cmake
     ◦ Even if they do… 😬
   • Options: precompile dependencies or include all source
   • NDK prefab packages
     ◦ AAR containing libs + headers
     ◦ Mixed results producing and consuming with cmake
17. NDK and dependencies

       android {
           buildFeatures {
               prefab = true
               prefabPublishing = true
           }
           prefab {
               create("llama") {
                   headers = "llama.cpp"
                   libraryName = "libllama"
               }
               create("common") {
                   headers = "llama.cpp/common"
                   libraryName = "libcommon"
               }
           }
       }
18. JNI layer
   • java-llama.cpp
     ◦ Use as source for Android
     ◦ Add the (prefab) llamacpp-android as a dependency
   • Keeping up with llama.cpp is hard; llama.cpp API changes → 💥 (see the sketch below)
   https://github.com/kherud/java-llama.cpp
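
   For orientation, the Kotlin side of a hand-rolled JNI binding might look like the sketch below; the function names are illustrative and are not java-llama.cpp's actual API:

       // Hypothetical Kotlin side of a minimal JNI binding.
       object Llama {
           init {
               System.loadLibrary("llama")  // loads libllama.so packaged in the APK
           }

           // Implemented in C/C++; returns an opaque pointer to the loaded model.
           external fun loadModel(path: String): Long

           // Runs the full prompt -> completion loop natively, returns the text.
           external fun complete(modelHandle: Long, prompt: String, maxTokens: Int): String

           external fun freeModel(modelHandle: Long)
       }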
19. Conclusions
   • Very much a proof of concept, due to the model file size
   • Acceptable performance, with room for improvement
   • llama.cpp runs open source models, fine-tuned (by you), with no restrictions
   • For real apps, maybe prefer a JNI layer exposing only what you use
20. Resources
   • Andrej Karpathy's YouTube channel
   • Niels Rogge 🤗 on LLMs: https://www.youtube.com/watch?v=Ma4clS-IdhA
   • huggingface.co
   • Try quantized models on your own machine: Ollama, LM Studio