Slide 10
Michaels-Mac-Studio:llama.cpp $ ./main -m ./models/llama-2-7b-chat.ggmlv3.q4_0.bin --temp 0.1 -p "### Instruction: What is
the height of Mount Fuji?
### Response:" -b 512 -ngl 32
main: build = 944 (8183159)
main: seed = 1691074574
llama.cpp: loading model from ./models/llama-2-7b-chat.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
…
ggml_metal_init: recommendedMaxWorkingSetSize = 49152.00 MB
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: maxTransferRate = built-in GPU
llama_new_context_with_model: max tensor size = 70.31 MB
…
system_info: n_threads = 16 / 20 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 |
ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k =
40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.100000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent
= 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0
### Instruction: What is the height of Mount Fuji?
### Response: The height of Mount Fuji is 3,776 meters (12,421 feet) above sea level. [end of text]
llama_print_timings: load time = 4943.32 ms
llama_print_timings: sample time = 20.34 ms / 28 runs ( 0.73 ms per token, 1376.33 tokens per second)
llama_print_timings: prompt eval time = 503.32 ms / 19 tokens ( 26.49 ms per token, 37.75 tokens per second)
llama_print_timings: eval time = 372.99 ms / 27 runs ( 13.81 ms per token, 72.39 tokens per second)
llama_print_timings: total time = 899.05 ms
ggml_metal_free: deallocating
Let's run it!
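As a sanity check on the log above, the throughput figures that llama_print_timings reports can be reproduced from the raw numbers: tokens per second is simply the token (or run) count divided by the elapsed time in seconds. A minimal Python sketch (the regex and variable names here are illustrative, not part of llama.cpp):

```python
import re

# Two timing lines copied from the run above.
log = """llama_print_timings: prompt eval time = 503.32 ms / 19 tokens
llama_print_timings: eval time = 372.99 ms / 27 runs"""

# Extract the phase name, elapsed milliseconds, and token/run count.
pattern = re.compile(
    r"(?P<name>prompt eval|eval) time\s*=\s*(?P<ms>[\d.]+) ms /\s*(?P<n>\d+)"
)

results = {}
for line in log.splitlines():
    m = pattern.search(line)
    if m:
        total_ms = float(m.group("ms"))
        n = int(m.group("n"))
        # tokens per second = count / (elapsed ms / 1000)
        results[m.group("name")] = round(n / (total_ms / 1000.0), 2)

print(results)
```

Running this recovers the same figures the log prints: about 37.75 tokens/sec for prompt evaluation and 72.39 tokens/sec for generation, confirming how the two numbers in each timing line relate.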