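Run sign-up page (g.co/cloudrun/gpu): register there to get access
2. Use cases
   a. Scalably run GPU-backed applications such as LLM models, image contour extraction, and region extraction
      i. This is not a job, so it is limited to short-running tasks (inference workloads)
      ii. GKE Autopilot is another option, but in many cases teams do not want to create a GKE cluster (for management reasons)
https://cloud.google.com/run/docs/configuring/services/gpu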
time Storage cost

Container image
  Deploy time: Slow. An image containing a large model will take longer to import into Cloud Run.
  Development experience: Changes to the container image will require redeployment, which may be slow for large images.
  Container startup time: Depends on the size of the model. For very large models, use Cloud Storage for more predictable but slower performance.
  Storage cost: Potentially multiple copies in Artifact Registry.

Cloud Storage, loaded using Cloud Storage FUSE volume mount
  Deploy time: Fast. Model downloaded during container startup.
  Development experience: Not difficult to set up, does not require changes to the docker image.
  Container startup time: Fast with network optimizations. Does not parallelize the download.
  Storage cost: One copy in Cloud Storage.

Cloud Storage, downloaded concurrently using the Google Cloud CLI command gcloud storage cp or the Cloud Storage API as shown in the transfer manager concurrent download code sample
  Deploy time: Fast. Model downloaded during container startup.
  Development experience: Slightly more difficult to set up, because you'll need to either install the Google Cloud CLI on the image or update your code to use the Cloud Storage API.
  Container startup time: Fast with network optimizations. The Google Cloud CLI downloads the model file in parallel, making it faster than FUSE mount.
  Storage cost: One copy in Cloud Storage.

https://cloud.google.com/run/docs/configuring/services/gpu-best-practices
startup time Storage cost

Internet
  Deploy time: Fast. Model downloaded during container startup.
  Development experience: Typically simpler (many frameworks download models from central repositories).
  Container startup time: Typically poor and unpredictable:
    • Frameworks may apply model transformations during initialization. (You should do this at build time.)
    • Model host and libraries for downloading the model may not be efficient.
    • There is reliability risk associated with downloading from the internet. Your service could fail to start if the download target is down, and the underlying model downloaded could change, which decreases quality. We recommend hosting in your own Cloud Storage bucket.
  Storage cost: Depends on the model hosting provider.
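The concurrent-download row of the table can be sketched in Python roughly as follows. This is a minimal sketch, not the official sample: it assumes the google-cloud-storage package is installed and that credentials are available in the container. The function name download_model and all bucket, object, and path names are hypothetical.

```python
import os


def download_model(bucket_name: str, blob_name: str, destination: str,
                   workers: int = 8) -> str:
    """Download one large Cloud Storage object in parallel chunks."""
    # Imported lazily so this module loads even where the dependency
    # (google-cloud-storage) is not installed.
    from google.cloud.storage import Client, transfer_manager

    blob = Client().bucket(bucket_name).blob(blob_name)
    os.makedirs(os.path.dirname(destination) or ".", exist_ok=True)
    # Splits the object into byte ranges and fetches them with a worker pool.
    transfer_manager.download_chunks_concurrently(
        blob,
        destination,
        chunk_size=32 * 1024 * 1024,  # 32 MiB per chunk (an assumed value)
        max_workers=workers,
    )
    return destination


# Hypothetical usage at container startup, before the model server starts:
# download_model("my-model-bucket", "models/model.bin", "/models/model.bin")
```

Calling this once at startup keeps a single copy of the model in Cloud Storage while avoiding the serial download of a FUSE mount. The chunk size and worker count here are guesses and would need tuning for a real model.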
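setting is available
   i. The right value depends on the model's performance, so it needs tuning
   ii. Google Cloud's rule of thumb: (Number of model instances * parallel queries per model) + (number of model instances * ideal batch size)
   iii. You need to consider how many queries the model can handle, the number of instances of that model, and so on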
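The rule of thumb above can be worked through with a small Python helper. This is only an illustrative sketch: the function name and the sample numbers are hypothetical, and it assumes the setting in question is the per-instance request concurrency. The result is a starting point for tuning, not a final value.

```python
def rule_of_thumb_concurrency(model_instances: int,
                              parallel_queries_per_model: int,
                              ideal_batch_size: int) -> int:
    """Apply the rule of thumb quoted on the slide.

    (model instances * parallel queries per model)
      + (model instances * ideal batch size)
    """
    in_flight = model_instances * parallel_queries_per_model
    queued_for_next_batch = model_instances * ideal_batch_size
    return in_flight + queued_for_next_batch


# Hypothetical example: one model instance per container, handling
# 4 queries in parallel, with an ideal batch size of 4.
print(rule_of_thumb_concurrency(1, 4, 4))  # prints 8
```

The first term covers queries being processed at that moment. The second leaves room for one more batch per model instance to wait in the queue, so the GPU does not sit idle between batches.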
ENV OLLAMA_HOST 0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS /models

# Reduce logging verbosity
ENV OLLAMA_DEBUG false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1

# Store the model weights in the container image
ENV MODEL gemma2:2b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]