Cloud_Run_GPU___Gemma_2_を使った_LLM_アプリケーション開発のススメ.pdf

LLM application development with Cloud Run GPU + Gemma 2
酒とゲームとインフラとGCP #24
Sato, Sreake Division, 3-shake, Inc.
Copyright © 3-shake, Inc. All Rights Reserved.

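About me: 佐藤慧太 (@SatohJohn)
• 2012/4 Joined FuRyu Corporation
  ◦ About 10 years of consumer-facing (ToC) application development
  ◦ As lead engineer, designed, developed, and operated services from scratch
• 2023/1 Joined 3-shake, Inc.
  ◦ Works as an SRE on reducing toil
  ◦ Google Cloud Partner Top Engineer '24
  ◦ Also works on generative AI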

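Cloud Run GPU
1. A new Preview feature released in late August
   a. Available after registering on the Cloud Run sign-up page, g.co/cloudrun/gpu
2. Use cases
   a. Running GPU-based applications (LLM inference, image edge extraction, region extraction, and so on) in a scalable way
      i. This is a service, not a job, so it is limited to short-running tasks such as inference
      ii. GKE Autopilot is another option, but there are many cases where you do not want to create a GKE cluster (for management reasons)
https://cloud.google.com/run/docs/configuring/services/gpu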

Best practices (model location)
https://cloud.google.com/run/docs/configuring/services/gpu-best-practices

Container image
• Deploy time: Slow. An image containing a large model will take longer to import into Cloud Run.
• Development experience: Changes to the container image will require redeployment, which may be slow for large images.
• Container startup time: Depends on the size of the model. For very large models, use Cloud Storage for more predictable but slower performance.
• Storage cost: Potentially multiple copies in Artifact Registry.

Cloud Storage, loaded using Cloud Storage FUSE volume mount
• Deploy time: Fast. Model downloaded during container startup.
• Development experience: Not difficult to set up, does not require changes to the docker image.
• Container startup time: Fast with network optimizations. Does not parallelize the download.
• Storage cost: One copy in Cloud Storage.

Cloud Storage, downloaded concurrently using the Google Cloud CLI command gcloud storage cp or the Cloud Storage API as shown in the transfer manager concurrent download code sample
• Deploy time: Fast. Model downloaded during container startup.
• Development experience: Slightly more difficult to set up, because you'll need to either install the Google Cloud CLI on the image or update your code to use the Cloud Storage API.
• Container startup time: Fast with network optimizations. The Google Cloud CLI downloads the model file in parallel, making it faster than FUSE mount.
• Storage cost: One copy in Cloud Storage.

Best practices (model location)
https://cloud.google.com/run/docs/configuring/services/gpu-best-practices

Internet
• Deploy time: Fast. Model downloaded during container startup.
• Development experience: Typically simpler (many frameworks download models from central repositories).
• Container startup time: Typically poor and unpredictable:
  ◦ Frameworks may apply model transformations during initialization. (You should do this at build time.)
  ◦ Model host and libraries for downloading the model may not be efficient.
  ◦ There is reliability risk associated with downloading from the internet. Your service could fail to start if the download target is down, and the underlying model downloaded could change, which decreases quality. We recommend hosting in your own Cloud Storage bucket.
• Storage cost: Depends on the model hosting provider.

Best practices (model location)
https://cloud.google.com/run/docs/configuring/services/gpu-best-practices

Model location    Deploy time    Development experience    Container startup time    Storage cost
Container image   ☓              △                         ◯                         △
GCS Fuse          ◯              ◯                         ◯                         ◯
GCS Download      ◯              ☓                         ◯                         ◯
Internet          ◯              ◯                         ☓                         ？

(◯ = good, △ = fair, ☓ = poor, ？ = unknown)

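Best practices (requests)
https://cloud.google.com/run/docs/configuring/services/gpu-best-practices
1. For LLMs, the maximum number of concurrent requests per instance should be set to the maximum number of requests the application can process concurrently
   a. The Ollama framework has the OLLAMA_NUM_PARALLEL setting
      i. This depends on the model's performance, so it needs tuning
      ii. Google Cloud's rule of thumb: (Number of model instances * parallel queries per model) + (number of model instances * ideal batch size)
      iii. You need to consider how many queries the model can handle and how many instances of that model are running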
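
The rule of thumb above can be written as a small helper. This is a minimal sketch: the function name and the example numbers are illustrative assumptions, and the result is only a starting point for tuning.

```python
def max_concurrent_requests(model_instances: int,
                            parallel_queries_per_model: int,
                            ideal_batch_size: int) -> int:
    """Rule-of-thumb value for Cloud Run's per-instance concurrency setting."""
    if min(model_instances, parallel_queries_per_model, ideal_batch_size) < 0:
        raise ValueError("all inputs must be non-negative")
    return (model_instances * parallel_queries_per_model
            + model_instances * ideal_batch_size)


if __name__ == "__main__":
    # Hypothetical example: one model per container, 4 parallel queries
    # (for instance OLLAMA_NUM_PARALLEL=4), and no extra batching headroom.
    print(max_concurrent_requests(1, 4, 0))
```

With one model instance, four parallel queries, and a batch size of zero, the helper returns 4, so the concurrency setting and OLLAMA_NUM_PARALLEL would be set to the same value.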

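Caveats
• A minimum of 4 CPUs and 16 GiB of memory is required
• Currently available only in us-central1 and asia-southeast1
• "CPU always allocated" must be configured
• GPU usage is billed as an additional charge
  ◦ In a load test with 7 GPUs, the bill came to roughly 2,000 yen for about one hour
• Not available for Cloud Run Jobs
• The only GPU available is the L4; more types are expected later
• The default limit is 7 instances; scaling up to 100 requires a quota request, which does not look easy to get approved
https://cloud.google.com/run/docs/configuring/cpu-allocation?hl=ja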
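
For rough budgeting, the load-test figure above can be turned into a per-instance estimate. This is a sketch under stated assumptions: it treats the reported 2,000 yen as covering exactly 7 instances for exactly one hour and extrapolates linearly, and the talk does not say how the charge breaks down. The function names are illustrative.

```python
# Observation reported in the talk: a load test on 7 GPU instances
# cost roughly 2,000 yen for about one hour.
OBSERVED_COST_YEN = 2000.0
OBSERVED_INSTANCES = 7
OBSERVED_HOURS = 1.0


def yen_per_instance_hour() -> float:
    """Approximate cost of one GPU instance running for one hour."""
    return OBSERVED_COST_YEN / (OBSERVED_INSTANCES * OBSERVED_HOURS)


def estimate_cost_yen(instances: int, hours: float) -> float:
    """Linear extrapolation (an assumption) from the observed figure."""
    return yen_per_instance_hour() * instances * hours


if __name__ == "__main__":
    print(f"{yen_per_instance_hour():.0f} yen per instance-hour")
    print(f"{estimate_cost_yen(1, 24):.0f} yen for one instance for a day")
```

Actual pricing should be checked against the Cloud Run pricing page, since this estimate rests on a single observation.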

deploy

    FROM ollama/ollama:0.3.6

    # Listen on all interfaces, port 8080
    ENV OLLAMA_HOST 0.0.0.0:8080

    # Store model weight files in /models
    ENV OLLAMA_MODELS /models

    # Reduce logging verbosity
    ENV OLLAMA_DEBUG false

    # Never unload model weights from the GPU
    ENV OLLAMA_KEEP_ALIVE -1

    # Store the model weights in the container image
    ENV MODEL gemma2:2b
    RUN ollama serve & sleep 5 && ollama pull $MODEL

    # Start Ollama
    ENTRYPOINT ["ollama", "serve"]
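
Once a container built from this image is running, it can be queried through Ollama's /api/generate endpoint. The following is a minimal sketch using only the Python standard library. The helper names and the base URL passed to them are assumptions for illustration, while the model name gemma2:2b and port 8080 come from the Dockerfile above.

```python
import json
import urllib.request


def build_generate_request(base_url: str, prompt: str,
                           model: str = "gemma2:2b",
                           stream: bool = False) -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url: str, prompt: str) -> str:
    """Send a non-streaming request and return the generated text."""
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read())["response"]
```

For example, if the container is published locally on port 8080, `generate("http://localhost:8080", "Why is the sky blue?")` would return the model's answer as a string.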

deploy

    gcloud beta run deploy $SERVICE_NAME --image $REPOSITORY/$APP_NAME \
      --concurrency 4 --cpu 8 --set-env-vars OLLAMA_NUM_PARALLEL=4 \
      --gpu 1 --gpu-type nvidia-l4 \
      --max-instances 7 --memory 32Gi --no-allow-unauthenticated \
      --no-cpu-throttling --timeout=600 --region=us-central1 \
      --project=$GOOGLE_PROJECT

Note: the gpu flags were added in gcloud version 488.0.0, so check your gcloud version.
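
Because the service is deployed with --no-allow-unauthenticated, callers need to send an identity token. One way to do that from a developer machine is sketched below, assuming the gcloud CLI is installed and logged in with an account that is allowed to invoke the service. The helper names are illustrative.

```python
import subprocess
import urllib.request


def fetch_identity_token() -> str:
    """Ask the local gcloud CLI for an identity token (requires gcloud login)."""
    result = subprocess.run(
        ["gcloud", "auth", "print-identity-token"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def with_bearer_token(request: urllib.request.Request,
                      token: str) -> urllib.request.Request:
    """Attach the token as an Authorization header and return the request."""
    request.add_header("Authorization", f"Bearer {token}")
    return request
```

A request to the deployed service URL can then be wrapped as `with_bearer_token(request, fetch_identity_token())` before it is sent with `urllib.request.urlopen`.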

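Summary
It is still hard to find the right use for Cloud Run GPU, but it looks likely to become mainstream
• If the goal is simply an LLM application, Gemini 1.5 Flash may cost less
  ◦ For example, by using context caching
  ◦ Cloud Run GPU looks like a good fit when security requirements or an in-house LLM are involved
• Judging from performance testing, a 2b model is usable if streaming responses are acceptable, but anything larger looks difficult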
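
Streaming here means reading the answer as it is generated rather than waiting for the full response. When "stream" is true, Ollama's /api/generate returns newline-delimited JSON objects, each carrying a "response" fragment, with "done" set to true on the last one. A minimal sketch of consuming that stream follows. The function name is an assumption for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from Ollama's newline-delimited JSON stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final object marks the end of the stream
```

The HTTP response object returned by `urllib.request.urlopen` can be passed directly as `lines`, since it iterates over the body line by line. Printing each fragment as it arrives lets users see output before generation finishes.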

