
Running Llama.cpp with Metal on an M1 Ultra

A 10-minute lightning talk I gave at the 1st AI Study Group in Ebetsu (2023/8/4), demoing how I built and tested Llama.cpp with Metal GPU acceleration on my Mac Studio with an M1 Ultra and 64 GB of unified memory.

https://connpass.com/event/291492/

(Note: This presentation is only in Japanese, sorry.)

Michael Tedder

August 04, 2023

Transcript

  1. $ whoami
     > Michael Tedder (born 1976 / moved from the US to Japan in 2000)
     > 25+ years in the game industry on real-time 3D and core game development (original PlayStation through the Switch / mobile / PC / Rift VR / AR)
     > 10+ years of AWS cloud application development (serverless / containers)
     > JAWS-UG GameTech + Sapporo organizer / Tokyo Demo Fest organizing committee / AWS Community Builder
     > What I can do: game engine development / developer tooling / DevOps (CI/CD), app + backend development / cloud architecture design
     > Favorite languages: C++17 / GLSL / ASM (x64/ARM/6502/MIPS) / PHP / TypeScript
     > AI knowledge level: complete beginner
     > Recently playing: Satisfactory / Final Fantasy XI
  2. Development / test environment
     • Mac Studio 2022
       ◦ Apple M1 Ultra, 64 GB
       ◦ macOS Ventura 13.4.1
     • Apple Silicon uses unified memory
       ◦ The CPU and GPU can access the same memory
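
Not on the slides, but a quick way to confirm the machine exposes Metal before building; system_profiler is a stock macOS tool, and the exact wording of its output varies by macOS version:

    $ system_profiler SPDisplaysDataType | grep -i metal
    # expected to report a line like "Metal Support: Metal 3" on Ventura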
  3. Build from source
     Michaels-Mac-Studio:coldfusionjp $ git clone https://github.com/ggerganov/llama.cpp
     Cloning into 'llama.cpp'...
     remote: Enumerating objects: 6064, done.
     remote: Counting objects: 100% (2198/2198), done.
     remote: Compressing objects: 100% (171/171), done.
     remote: Total 6064 (delta 2113), reused 2050 (delta 2026), pack-reused 3866
     Receiving objects: 100% (6064/6064), 4.70 MiB | 24.19 MiB/s, done.
     Resolving deltas: 100% (4178/4178), done.
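
The configure step on the next slide runs from a build directory. The deck skips the step in between, which would presumably be the usual out-of-source CMake setup:

    $ cd llama.cpp
    $ mkdir build
    $ cd build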
  4. Build from source (Metal enabled)
     Michaels-Mac-Studio:build $ cmake -DLLAMA_METAL=ON ..
     -- The C compiler identification is AppleClang 14.0.3.14030022
     -- The CXX compiler identification is AppleClang 14.0.3.14030022
     -- Detecting C compiler ABI info
     -- Detecting C compiler ABI info - done
     -- Check for working C compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/cc - skipped
     -- Detecting C compile features
     -- Detecting C compile features - done
     -- Detecting CXX compiler ABI info
     -- Detecting CXX compiler ABI info - done
     -- Check for working CXX compiler: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++ - skipped
     -- Detecting CXX compile features
     -- Detecting CXX compile features - done
     -- Found Git: /usr/bin/git (found version "2.39.2 (Apple Git-143)")
     -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
     -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
     -- Found Threads: TRUE
     -- Accelerate framework found
     -- CMAKE_SYSTEM_PROCESSOR: arm64
     -- ARM detected
     -- Configuring done (3.4s)
     -- Generating done (0.1s)
     -- Build files have been written to: /Users/falken/src/coldfusionjp/llama.cpp/build
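
As an alternative to the CMake flow shown above, the llama.cpp tree of this era (August 2023) also documented a one-line Makefile build with Metal enabled; a minimal sketch, assuming that revision:

    $ LLAMA_METAL=1 make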
  5. Build from source (Release build)
     Michaels-Mac-Studio:build $ cmake --build . --config Release
     [ 1%] Built target BUILD_INFO
     [ 3%] Building C object CMakeFiles/ggml.dir/ggml.c.o
     [ 5%] Building C object CMakeFiles/ggml.dir/ggml-alloc.c.o
     [ 7%] Building C object CMakeFiles/ggml.dir/ggml-metal.m.o
     [ 9%] Building C object CMakeFiles/ggml.dir/k_quants.c.o
     [ 9%] Built target ggml
     …
     [ 84%] Built target embd-input-test
     [ 86%] Building CXX object examples/metal/CMakeFiles/metal.dir/metal.cpp.o
     [ 88%] Linking CXX executable ../../bin/metal
     [ 88%] Built target metal
     [ 90%] Building CXX object examples/server/CMakeFiles/server.dir/server.cpp.o
     [ 92%] Linking CXX executable ../../bin/server
     [ 92%] Built target server
     [ 94%] Building CXX object pocs/vdot/CMakeFiles/vdot.dir/vdot.cpp.o
     [ 96%] Linking CXX executable ../../bin/vdot
     [ 96%] Built target vdot
     [ 98%] Building CXX object pocs/vdot/CMakeFiles/q8dot.dir/q8dot.cpp.o
     [100%] Linking CXX executable ../../bin/q8dot
     [100%] Built target q8dot
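
The link lines above show the CMake build dropping its executables into build/bin, while slide 7 runs ./main from the repository root. Presumably the binary was copied (or symlinked) up one level first, along these lines; note that at this revision the Metal backend also loaded the ggml-metal.metal shader source at runtime, which lives in the repository root, so running from there is convenient (an assumption worth checking against the exact commit):

    $ cp bin/main ..    # hypothetical step; the deck does not show it
    $ cd ..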
  6. Copy the model into place
     Michaels-Mac-Studio:models $ mv ~/Downloads/llama-2-7b-chat.ggmlv3.q4_0.bin .
     Michaels-Mac-Studio:models $ cd ..
     Michaels-Mac-Studio:llama.cpp $ ls -l models
     total 7423256
     -rw-r--r--  1 falken staff     432610 Aug 3 23:51 ggml-vocab.bin
     -rw-r--r--@ 1 falken staff 3791725184 Aug 3 23:55 llama-2-7b-chat.ggmlv3.q4_0.bin
     The Llama 2 7B chat (q4_0) model is about 3.7 GB.
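
The deck does not say where the model file came from. One plausible source for ggmlv3 quantizations at the time was TheBloke's Hugging Face repository; the URL below is an assumption and worth verifying before use:

    $ curl -LO https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin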
  7. Let's run it!
     Michaels-Mac-Studio:llama.cpp $ ./main -m ./models/llama-2-7b-chat.ggmlv3.q4_0.bin --temp 0.1 -p "### Instruction:
     What is the height of Mount Fuji? ### Response:" -b 512 -ngl 32
     main: build = 944 (8183159)
     main: seed = 1691074574
     llama.cpp: loading model from ./models/llama-2-7b-chat.ggmlv3.q4_0.bin
     llama_model_load_internal: format = ggjt v3 (latest)
     llama_model_load_internal: n_vocab = 32000
     llama_model_load_internal: n_ctx = 512
     …
     ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
     ggml_metal_init: hasUnifiedMemory = true
     ggml_metal_init: maxTransferRate = built-in GPU
     llama_new_context_with_model: max tensor size = 70.31 MB
     …
     system_info: n_threads = 16 / 20 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
     sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.100000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
     generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
     ### Instruction: What is the height of Mount Fuji? ### Response: The height of Mount Fuji is 3,776 meters (12,421 feet) above sea level. [end of text]
     llama_print_timings: load time = 4943.32 ms
     llama_print_timings: sample time = 20.34 ms / 28 runs ( 0.73 ms per token, 1376.33 tokens per second)
     llama_print_timings: prompt eval time = 503.32 ms / 19 tokens ( 26.49 ms per token, 37.75 tokens per second)
     llama_print_timings: eval time = 372.99 ms / 27 runs ( 13.81 ms per token, 72.39 tokens per second)
     llama_print_timings: total time = 899.05 ms
     ggml_metal_free: deallocating
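
For reference, the flags in that command: -m selects the model file, --temp the sampling temperature, -p the prompt, -b the batch size, and -ngl the number of layers to offload to the GPU. The CPU-only numbers on the next slide presumably come from the same command with offloading disabled:

    $ ./main -m ./models/llama-2-7b-chat.ggmlv3.q4_0.bin --temp 0.1 \
        -p "### Instruction: What is the height of Mount Fuji? ### Response:" \
        -b 512 -ngl 0    # -ngl 0: no layers offloaded, CPU only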
  8. CPU vs. GPU performance comparison (GPU enabled vs. CPU only)
     • 7B
       ◦ -ngl 0    1140.58ms / 1174.58ms
       ◦ -ngl 16    914.17ms
       ◦ -ngl 32    915.01ms /  914.94ms
       ◦ -ngl 48    902.47ms /  901.16ms
       ◦ -ngl 56    901.16ms
       ◦ -ngl 64    902.96ms
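
A minimal sketch of how a table like this could be collected, assuming the same model and prompt as slide 7 and reading each figure off the llama_print_timings summary:

    $ for ngl in 0 16 32 48 56 64; do
          echo "== -ngl $ngl =="
          ./main -m ./models/llama-2-7b-chat.ggmlv3.q4_0.bin --temp 0.1 \
              -p "### Instruction: What is the height of Mount Fuji? ### Response:" \
              -b 512 -ngl $ngl 2>&1 | grep 'total time'
      done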
  9. CPU vs. GPU performance comparison
     • 13B
       ◦ -ngl 0    2636.29ms / 2736.49ms
       ◦ -ngl 32   1762.48ms / 1778.47ms
       ◦ -ngl 48   1875.82ms / 1898.44ms