Recommendations for LLM Application Development with Cloud Run GPU and Gemma 2

Slide 1

Slide 2

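Self-introduction
佐藤慧太 (@SatohJohn)
● 2012/4: Joined FuRyu Corporation (フリュー株式会社)
  ○ About 10 years of consumer-facing (B2C) application development
  ○ As lead engineer, designed, built, and operated services from scratch
● 2023/1: Joined 3-shake Inc. (株式会社スリーシェイク)
  ○ Works as an SRE on reducing toil
● Google Cloud Partner Top Engineer '24
● Also works on generative AI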

Slide 3

Slide 4

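Cloud Run GPU
https://cloud.google.com/run/docs/configuring/services/gpu
1. A new Preview feature released in late August
   a. Available after signing up on the Cloud Run registration page, g.co/cloudrun/gpu
2. Use cases
   a. Running GPU applications scalably, such as LLM inference, image edge detection, and region extraction
      i. It is a service, not a job, so it is limited to short-running tasks (inference)
      ii. GKE Autopilot is the other option, but in many cases teams do not want to create a GKE cluster (for management reasons)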

Slide 5

Best practices (model location)
https://cloud.google.com/run/docs/configuring/services/gpu-best-practices

Model location: Container image
  Deploy time: Slow. An image containing a large model will take longer to import into Cloud Run.
  Development experience: Changes to the container image will require redeployment, which may be slow for large images.
  Container startup time: Depends on the size of the model. For very large models, use Cloud Storage for more predictable but slower performance.
  Storage cost: Potentially multiple copies in Artifact Registry.

Model location: Cloud Storage, loaded using Cloud Storage FUSE volume mount
  Deploy time: Fast. Model downloaded during container startup.
  Development experience: Not difficult to set up, does not require changes to the docker image.
  Container startup time: Fast when network optimizations are applied. Does not parallelize the download.
  Storage cost: One copy in Cloud Storage.

Model location: Cloud Storage, downloaded concurrently using the Google Cloud CLI command gcloud storage cp or the Cloud Storage API as shown in the transfer manager concurrent download code sample
  Deploy time: Fast. Model downloaded during container startup.
  Development experience: Slightly more difficult to set up, because you'll need to either install the Google Cloud CLI on the image or update your code to use the Cloud Storage API.
  Container startup time: Fast when network optimizations are applied. The Google Cloud CLI downloads the model file in parallel, making it faster than FUSE mount.
  Storage cost: One copy in Cloud Storage.

Slide 6

Best practices (model location)
https://cloud.google.com/run/docs/configuring/services/gpu-best-practices

Model location: Internet
  Deploy time: Fast. Model downloaded during container startup.
  Development experience: Typically simpler (many frameworks download models from central repositories).
  Container startup time: Typically poor and unpredictable:
    ● Frameworks may apply model transformations during initialization. (You should do this at build time.)
    ● Model host and libraries for downloading the model may not be efficient.
    ● There is reliability risk associated with downloading from the internet. Your service could fail to start if the download target is down, and the underlying model downloaded could change, which decreases quality. We recommend hosting in your own Cloud Storage bucket.
  Storage cost: Depends on the model hosting provider.

Slide 7

Slide 8

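Best practices (requests)
https://cloud.google.com/run/docs/configuring/services/gpu-best-practices
1. For LLMs, set the maximum concurrent requests per instance to the maximum number of requests the application can actually process concurrently
   a. The Ollama framework has the OLLAMA_NUM_PARALLEL setting
      i. The right value depends on model performance, so it needs tuning
      ii. Google Cloud's rule of thumb: (number of model instances * parallel queries per model) + (number of model instances * ideal batch size)
      iii. You need to consider how many queries the model can handle and how many instances of that model are running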
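The rule of thumb above is simple arithmetic, so a minimal sketch may help when tuning. The function name `max_concurrency` and the example values are hypothetical and chosen only for illustration; the right inputs have to come from load testing your own model.

```python
def max_concurrency(model_instances: int,
                    parallel_queries_per_model: int,
                    ideal_batch_size: int) -> int:
    """Rule of thumb quoted on the slide:
    (model instances * parallel queries per model)
      + (model instances * ideal batch size)
    """
    values = (model_instances, parallel_queries_per_model, ideal_batch_size)
    if any(v < 0 for v in values):
        raise ValueError("all inputs must be non-negative")
    return (model_instances * parallel_queries_per_model
            + model_instances * ideal_batch_size)


# Hypothetical example: one model per container instance, 4 parallel queries
# (as with OLLAMA_NUM_PARALLEL=4), and no extra batching assumed.
print(max_concurrency(1, 4, 0))
```

With these assumed inputs the result equals the OLLAMA_NUM_PARALLEL value, which is one way to read the advice that the Cloud Run concurrency setting should match what the model server can process at once.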

Slide 9

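Notes
https://cloud.google.com/run/docs/configuring/cpu-allocation?hl=ja
● You must use at least 4 CPUs and 16 GiB of memory
● Currently available only in us-central1 and asia-southeast1
● CPU always allocated must be set
● GPUs are billed as an additional charge
  ○ In a load test with 7 GPUs, the cost was about 2,000 yen for roughly 1 hour
● Not available for Cloud Run jobs
● The only available GPU is the L4; more types are expected later
● The default limit is 7 instances; scaling up to 100 requires a quota increase request, which seems unlikely to be approved easily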
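The constraints above can be turned into a small pre-deploy check. This is only a sketch: `GpuServiceConfig` and `check_gpu_config` are hypothetical names, not part of any Google Cloud library, and the limits are the ones listed on this slide for the Preview, so they may have changed since.

```python
from dataclasses import dataclass
from typing import List

# Values taken from the slide (Preview-era limits; assumptions, may be outdated).
SUPPORTED_REGIONS = {"us-central1", "asia-southeast1"}
MIN_CPU = 4
MIN_MEMORY_GIB = 16
SUPPORTED_GPU_TYPES = {"nvidia-l4"}
DEFAULT_MAX_INSTANCES = 7


@dataclass
class GpuServiceConfig:
    """Hypothetical container for the settings passed to a deploy."""
    cpu: int
    memory_gib: int
    region: str
    cpu_always_allocated: bool
    gpu_type: str = "nvidia-l4"
    max_instances: int = DEFAULT_MAX_INSTANCES


def check_gpu_config(cfg: GpuServiceConfig) -> List[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if cfg.cpu < MIN_CPU:
        problems.append(f"cpu must be at least {MIN_CPU}")
    if cfg.memory_gib < MIN_MEMORY_GIB:
        problems.append(f"memory must be at least {MIN_MEMORY_GIB} GiB")
    if cfg.region not in SUPPORTED_REGIONS:
        problems.append(f"region {cfg.region} is not in {sorted(SUPPORTED_REGIONS)}")
    if not cfg.cpu_always_allocated:
        problems.append("CPU always allocated must be enabled")
    if cfg.gpu_type not in SUPPORTED_GPU_TYPES:
        problems.append(f"gpu type {cfg.gpu_type} is not supported")
    if cfg.max_instances > DEFAULT_MAX_INSTANCES:
        problems.append("max instances above 7 needs a quota increase")
    return problems


ok = GpuServiceConfig(cpu=8, memory_gib=32, region="us-central1",
                      cpu_always_allocated=True)
print(check_gpu_config(ok))
```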

Slide 10

Slide 11

deploy

FROM ollama/ollama:0.3.6

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST 0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS /models

# Reduce logging verbosity
ENV OLLAMA_DEBUG false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1

# Store the model weights in the container image
ENV MODEL gemma2:2b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]

Slide 12

deploy

gcloud beta run deploy $SERVICE_NAME --image $REPOSITORY/$APP_NAME \
  --concurrency 4 --cpu 8 --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 --gpu-type nvidia-l4 \
  --max-instances 7 --memory 32Gi --no-allow-unauthenticated \
  --no-cpu-throttling --timeout=600 --region=us-central1 \
  --project=$GOOGLE_PROJECT

Note: the gpu flags were added in gcloud version 488.0.0, so check your installed version.
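Once the service is deployed, a client has to send an identity token because the service was deployed with --no-allow-unauthenticated. The following is a minimal sketch of calling Ollama's /api/generate endpoint using only the Python standard library. The helper names `build_generate_request` and `generate`, the service URL, and the token value are hypothetical placeholders, not something shown in the talk.

```python
import json
import urllib.request


def build_generate_request(service_url: str, id_token: str, prompt: str,
                           model: str = "gemma2:2b") -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({
        "model": model,      # same model tag that the Dockerfile pulls
        "prompt": prompt,
        "stream": False,     # ask for one JSON object instead of a stream
    }).encode("utf-8")
    return urllib.request.Request(
        url=service_url.rstrip("/") + "/api/generate",
        data=body,
        headers={
            # Cloud Run checks this identity token for authenticated services.
            "Authorization": f"Bearer {id_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate(service_url: str, id_token: str, prompt: str) -> str:
    """Send the request and return the generated text."""
    req = build_generate_request(service_url, id_token, prompt)
    # 600 seconds mirrors the --timeout value used in the deploy command.
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["response"]
```

For a quick manual test, the token can be obtained with `gcloud auth print-identity-token` and passed as `id_token`, with `service_url` set to the URL that the deploy command prints.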

Slide 13

DEMO

Slide 14

Slide 15

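Summary
Cloud Run GPU is still hard to find the right use for, but it looks likely to become mainstream.
● For LLM applications, Gemini 1.5 Flash may turn out cheaper
  ○ For example, by using context caching
  ○ Cloud Run GPU looks like a good fit when security requirements or an in-house LLM are involved
● Judging from my impressions of the performance test, a 2b model is usable if streaming responses are acceptable, but anything larger looks difficult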