

KubeCon NA 2024 Recap: Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster / Kubernetes Meetup Tokyo #68

Presentation slides for Kubernetes Meetup Tokyo #68.
A recap of PFN's session "Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster" at KubeCon + CloudNativeCon North America 2024, introducing use cases of our in-house distributed cache system and the load-balancing techniques behind it.

Preferred Networks

December 25, 2024



Transcript

1. KubeCon + CloudNativeCon North America 2024 Recap: Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster. Yuichiro Ueno, Toru Komatsu (Preferred Networks, Inc.), 2024-12-12, Kubernetes Meetup Tokyo #68
2. Self-introduction: Yuichiro Ueno (@y1r96)
   - Joined Preferred Networks as a new graduate in April 2021
   - Member of the Cluster Services team: building machine learning infrastructure on Kubernetes, working on accelerators and networking
   - Interests: performance optimization, high-performance computing (HPC), infrastructure in general
   - First time attending KubeCon
3. Toru Komatsu (@utam0k), Preferred Networks, Inc.
   - Develops and operates our on-premise ML infrastructure
   - OSS activities: maintainer of opencontainers/runtime-spec and youki-dev/youki; reviewer of containerd/runwasi; member of the kubernetes org (sig-scheduling/node)
   - Community: CNA, CNCJ
4. Today's recap session
   - A recap of PFN's first session presented at KubeCon: "Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster"
   - We introduced this system at Kubernetes Meetup Tokyo #60, so some of you may already know it; we will go through a quick review and then pick out several updates since #60.
5. PFN's business: vertically integrating the AI technology value chain.
   PFN vertically integrates the value chain of AI technology, from AI chips and computing infrastructure through generative AI foundation models to solutions and products, tightly fusing software and hardware to develop highly competitive technology and apply it to industry.
   - AI chips: MN-Core™, MN-Core™ 2, third-generation MN-Core, and an inference chip for LLMs (planned for 2026)
   - Computing infrastructure: GPU clusters, MN-3 (an MN-Core™ cluster), and a cloud service using MN-Core™ 2 as its compute resource
   - Generative AI foundation models: PLaMo Prime (an LLM scheduled for release this fall), PLaMo Lite (an SLM for edge devices), and PFP (a model for computing the energy of materials)
   - Solutions and products for various industries and consumers
6. Training of Machine Learning Models: data samples are read from the dataset and fed to the deep neural networks running on each compute node.
7. On-premise storage options for dataset loading
   - Network File System (NFS) with hostPath: fast but not scalable
   - Object storage: scalable but not fast (we use HDDs as the backend of our object storage)
   - Node-local storage (NVMe): very fast, but the storage is neither globally available nor scalable; if a workload is preempted and rescheduled onto a different compute node, the cached data on the original node becomes unreachable.
8. What is the best hierarchical storage for AI/ML workloads? Combine capacity-optimized object storage with a performance-optimized "cloud of node-local storage" in front of the compute nodes.
9. We developed that hierarchical storage, the "cloud of node-local storage", with:
   ✓ Topology-Aware Routing
   ✓ Informer for Pod Discovery
   ✓ Token Review API
   ✓ Consistent Hashing
   ✓ xDS API
10. Overview of Simple Cache Service (SCS)
   - Simple: a plain HTTP REST API (GET & PUT); it just returns local files
   - Cloud native: SCS runs on Kubernetes
   - Shared-nothing architecture: scalable
   - Position as a cache: it is just a "cache", not "persistent storage"
11. How to use SCS

   # Upload `apple.jpg` and save it as the `apple` object in the `prj-foobar` bucket.
   $ curl -H "Authorization: Bearer $(cat /token)" \
       -X PUT \
       http://cache.cache-service.svc/v1/objects/prj-foobar/apple \
       --data-binary @apple.jpg

   # Download the `apple` object from the `prj-foobar` bucket.
   $ curl -H "Authorization: Bearer $(cat /token)" \
       -X GET \
       http://cache.cache-service.svc/v1/objects/prj-foobar/apple
12. How to use SCS (same example): the URL path encodes the bucket (`prj-foobar`) and the object (`apple`).
13. How to use SCS (same example): the token passed in the Authorization header is a Bound Service Account Token.
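
   The same API is just as easy to drive from Python. Below is a minimal sketch using the requests library, assuming the bound service-account token is projected at /token exactly as in the curl examples above; the helper names are ours, not part of SCS.

   import requests

   TOKEN_PATH = "/token"  # projected bound service-account token, as in the curl examples
   BASE_URL = "http://cache.cache-service.svc/v1/objects"

   def _auth_headers():
       # Re-read the token on every call: bound tokens are rotated by the kubelet.
       with open(TOKEN_PATH) as f:
           return {"Authorization": f"Bearer {f.read().strip()}"}

   def scs_put(bucket, key, data):
       # PUT /v1/objects/<bucket>/<key> uploads the raw bytes as an object.
       resp = requests.put(f"{BASE_URL}/{bucket}/{key}", data=data, headers=_auth_headers())
       resp.raise_for_status()

   def scs_get(bucket, key):
       # GET /v1/objects/<bucket>/<key> returns the object body.
       resp = requests.get(f"{BASE_URL}/{bucket}/{key}", headers=_auth_headers())
       resp.raise_for_status()
       return resp.content

   with open("apple.jpg", "rb") as f:
       scs_put("prj-foobar", "apple", f.read())
   image = scs_get("prj-foobar", "apple")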
14. Overall Architecture (1/2). User Pods reach the cache through two layers of load balancing: Layer 4, a Service with Topology Aware Hints; Layer 7, Envoy Proxy with consistent hashing.
15. Shared-nothing architecture. Requests from network zones A and B pass through the L4 load balancing (Service with Topology Aware Hints) and the L7 load balancing (Envoy Proxy with consistent hashing); GET /objects/A always lands on the node holding cache A, and GET /objects/B on the node holding cache B, regardless of which zone the request originated from.
16. Authorization (1/2)
   1. The user Pod mounts the Bound Service Account Token.
   2. The Pod makes the request with the token in the Authorization header.
   3. SCS verifies the token with the TokenReview API: is the audience as expected? Until when is it valid? Is the Pod still alive?
   SCS then resolves the namespace of the caller from the service-account username, so namespace-level authorization can be implemented.
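
   As an illustration of step 3, a verification along these lines can be written with the official kubernetes Python client. This is only a sketch of the idea, not the actual SCS implementation; the audience string "cache-service" is an assumption.

   import re

   from kubernetes import client, config

   config.load_incluster_config()  # running inside the cluster, as SCS does

   def resolve_caller_namespace(token):
       # Ask the API server to verify the token via the TokenReview API.
       # A bound token also stops validating once its Pod is gone, which
       # covers the "Pod is still alive?" check.
       review = client.AuthenticationV1Api().create_token_review(
           client.V1TokenReview(
               spec=client.V1TokenReviewSpec(
                   token=token,
                   audiences=["cache-service"],  # assumed audience; must match the projected token
               )
           )
       )
       if not review.status.authenticated:
           return None
       # Service-account usernames look like system:serviceaccount:<namespace>:<name>,
       # so the caller's namespace can be recovered from the username.
       m = re.match(r"system:serviceaccount:([^:]+):", review.status.user.username)
       return m.group(1) if m else None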
17. Authorization (2/2): buckets are either public, or private based on a namespace selector:

   "Bucket": [
     {
       "Name": "public",
       "Public": true,
       "BucketQuota": "100Gi"
     },
     {
       "Name": "kubecon",
       "Public": false,
       "BucketQuota": "500Gi",
       "AllowNamespaces": [
         "prj-kubernetes",
         "user-utam0k"
       ]
     }
   ]
18. Authorization (2/2), continued. See also our blog post on achieving fairness control in the distributed cache system (分散キャッシュシステムにおける公平制御の実現).
19. Read a file in object storage with PFIO. PFIO is an I/O abstraction library developed by us; it can read, write, and list local filesystems, S3-compatible object storage, HDFS, and more. Reading 000.jpg below issues GET /000.jpg to object storage, which answers 200 OK.

   train_with_scs.py:
   import pfio
   import torch

   # zip_url points at the dataset (e.g. an S3 URL)
   fs = pfio.v2.from_url(zip_url)  # fs is a local-filesystem-like object, actually S3
   file_url = "000.jpg"
   with fs.open(file_url) as fp:
       image = fp.read()
       # image = torch.Tensor(image)...
       # loss = model(image)
20. Transparent object storage cache with PFIO. PFIO supports a transparent cache mechanism: it automatically checks SCS for the data first, then tries the origin if the data does not exist there. On the first access the desired data is not yet stored in SCS (GET /000.jpg returns 404 Not Found), so PFIO fetches it from object storage (GET /000.jpg, 200 OK) and puts it into SCS (PUT /000.jpg, 201 Created). The only change to the training script is passing the cache URL:

   train_with_scs.py:
   import pfio
   import torch

   fs = pfio.v2.from_url(zip_url, http_cache=scs_url)  # fs is a local-filesystem-like object, actually S3
   file_url = "000.jpg"
   with fs.open(file_url) as fp:
       image = fp.read()
       # image = torch.Tensor(image)...
       # loss = model(image)
21. Transparent object storage cache with PFIO, continued. On later accesses the desired data is already stored in SCS (GET /000.jpg returns 200 OK), so PFIO skips accessing object storage entirely. The training script is the same as on the previous slide.
22. Implementing yet another cache using SCS (e.g. a container image layer cache). On a miss the cache service gets 404 Not Found from SCS, fetches the object from the origin service (GET /000.jpg, 200 OK), and writes it back (PUT /000.jpg, 201 Created); on a hit SCS answers directly.
   Features to implement:
   - URL mappings from the origin key to the SCS bucket/key
   - AuthN/AuthZ if needed
   - Other necessary features
   Features not to implement (SCS already provides them):
   ✓ Storage management: cache eviction and capacity control
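
   The miss path (GET 404, fetch from origin, PUT back) and the hit path are easy to express. A minimal sketch against the SCS REST API from the earlier slides; the origin URL and the bucket name here are made up for illustration.

   import requests

   SCS_BASE = "http://cache.cache-service.svc/v1/objects/layer-cache"  # hypothetical bucket
   ORIGIN_BASE = "https://origin.example.com"                          # hypothetical origin service

   def cached_get(key, auth_headers):
       # Try SCS first.
       resp = requests.get(f"{SCS_BASE}/{key}", headers=auth_headers)
       if resp.status_code == 200:
           return resp.content  # cache hit: the origin is skipped entirely
       # Cache miss: fetch from the origin service (URL mapping happens here).
       data = requests.get(f"{ORIGIN_BASE}/{key}").content
       # Write through to SCS; eviction and capacity control are SCS's job.
       requests.put(f"{SCS_BASE}/{key}", data=data, headers=auth_headers)
       return data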
23. Deploy considerations
   Q1: How can we optimize the network traffic between user Pods, Envoy, and SCS?
   Q2: How can we configure Envoy to route the traffic?
24. Deploy considerations: first, Q1. How can we optimize the network traffic?
25. Background: our computing infrastructure
   Company: Preferred Networks
   - Provides ML models such as LLMs, and solutions for industries
   - Uses its own on-premise infrastructure to provide those solutions
   Infrastructure:
   - 3+ Kubernetes clusters, 400+ Kubernetes nodes
   - 30,000+ CPU cores, 320+ TiB of memory, 2,000+ GPUs
   - Our AI accelerator MN-Core™: hardware (RTL, board/server design) and software (driver, device plugin, compiler)
26. Background: data center network. The network topology of our data center is a CLOS network: nodes attach to leaf switches, leaf switches attach to spine switches, and the spine layer uplinks to the external / super-spine layer. In-zone networking is direct, while inter-zone networking across Network Zones A through D is oversubscribed.
27. Where to deploy Envoy?
   Assumptions:
   - SCS is deployed to all nodes to use their local NVMe drives.
   - User Pods are also scheduled to all nodes to use all accelerators.
   Decision: we deploy Envoy to all nodes to reduce inter-zone traffic between Pods and Envoy. Inter-zone traffic between Envoy and SCS is unavoidable in that case.
28. Reducing inter-zone traffic with Kubernetes Service routing. Two options:
   - Internal Traffic Policy:
     - Pod/Envoy traffic: perfect, no network traffic (Envoy is always node-local).
     - Envoy load balance: bad, no distribution of traffic; when some node uses SCS heavily, that node's Envoy CPU load becomes high.
   - Topology Aware Routing:
     - Pod/Envoy traffic: moderate, in-zone network traffic only.
     - Envoy load balance: moderate, traffic is distributed within a zone; when some node uses SCS heavily, the Envoy CPU load is spread across the zone.
   We use Topology Aware Routing to improve Envoy's CPU load balance (in recent Kubernetes it is enabled with the service.kubernetes.io/topology-mode: Auto annotation).
29. Deploy considerations: next, Q2. How can we configure Envoy to route the traffic?
30. Load balancing of keys (bucket and object)
   - We want to route the traffic from Envoy to SCS consistently: when we put an object to the N-th SCS instance, we want to get it back from the N-th SCS instance.
   - The most straightforward way is to manage a mapping from bucket/object to a backend id. But that mapping would itself have to be sharded, which means introducing a distributed metadata server (MDS): too complicated a solution for us.
   - Instead, we manage no explicit mapping: we use hash(bucket + "/" + object) to choose a backend server.
31. Consistent hashing
   - Should we use (hash % number-of-backends) as the backend id? When the number of backends changes, almost every key is remapped; typical examples are node failure and node installation.
   - A more sophisticated way is consistent hashing, which bounds the number of remapped keys to roughly keys/backends.
   - Envoy's lb_policy offers two consistent hashing algorithms: RING_HASH (backends own arcs on a hash ring) and MAGLEV (a precomputed lookup table maps hashes to backends).
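
   That remapping bound is easy to check empirically. A small standard-library-only sketch comparing naive modulo placement with a hash ring when one of four backends fails:

   import bisect
   import hashlib

   def h(s):
       return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

   keys = [f"prj-foobar/obj-{i}" for i in range(50_000)]

   def modulo_place(key, backends):
       # Naive placement: backend id = hash % number-of-backends.
       return backends[h(key) % len(backends)]

   def build_ring(backends, vnodes=100):
       # Consistent hashing: each backend owns many points on a ring.
       ring = sorted((h(f"{b}#{v}"), b) for b in backends for v in range(vnodes))
       return [p for p, _ in ring], [b for _, b in ring]

   def ring_place(key, points, owners):
       # A key belongs to the first backend point at or after its hash (wrapping).
       i = bisect.bisect_left(points, h(key)) % len(points)
       return owners[i]

   before, after = ["b1", "b2", "b3", "b4"], ["b1", "b2", "b3"]  # b4 fails
   ring_before, ring_after = build_ring(before), build_ring(after)

   moved_mod = sum(modulo_place(k, before) != modulo_place(k, after) for k in keys)
   moved_ring = sum(ring_place(k, *ring_before) != ring_place(k, *ring_after) for k in keys)
   print(f"modulo: {moved_mod / len(keys):.0%} of keys remapped")   # ~75%: almost every key moves
   print(f"ring:   {moved_ring / len(keys):.0%} of keys remapped")  # ~25%, i.e. keys/backends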
32. Key distribution matters! The load balance of keys is also very important. With RING_HASH, the length of each arc on the ring corresponds to that backend's share of responsibility; in the example, Backend 3's share is 1.5x that of Backend 4, and this affects performance:
   - B3's CPU usage is 1.5x of B4's, because B3 is 1.5x busier, which may result in longer latency.
   - The lifetime of B3's data is 1.5x shorter than B4's, because the cache capacity is the same on both nodes, so B3's objects are more likely to be evicted.
   We want to see consistent resource usage and data lifetime across backends.
33. RING_HASH vs MAGLEV: we use MAGLEV. Measuring the object count per node, RING_HASH shows load imbalance of up to 1.5x, while MAGLEV shows no load imbalance.
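
   MAGLEV achieves that balance by construction: every backend fills an (almost) equal share of a fixed-size lookup table. A condensed sketch of the table-building algorithm from the Maglev paper; the table size here is a small prime for readability, whereas real deployments use much larger primes (Envoy defaults to 65537).

   import hashlib
   from collections import Counter

   def _h(s, salt):
       return int.from_bytes(hashlib.sha256(f"{salt}:{s}".encode()).digest()[:8], "big")

   def maglev_table(backends, m=251):
       # m must be prime so each backend's (offset, skip) walk visits every slot.
       offsets = {b: _h(b, "offset") % m for b in backends}
       skips = {b: _h(b, "skip") % (m - 1) + 1 for b in backends}
       nexts = dict.fromkeys(backends, 0)
       table = [None] * m
       filled = 0
       while filled < m:
           for b in backends:  # round-robin: backends claim slots at an equal rate
               while True:
                   slot = (offsets[b] + nexts[b] * skips[b]) % m
                   nexts[b] += 1
                   if table[slot] is None:
                       table[slot] = b
                       filled += 1
                       break
               if filled == m:
                   break
       return table

   def maglev_place(key, table):
       return table[_h(key, "key") % len(table)]

   table = maglev_table(["b1", "b2", "b3", "b4"])
   print(Counter(table))  # each backend owns ~m/4 slots: near-perfect key balance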
34. Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster (summary)
   - We accelerate machine learning workloads with a distributed cache, built on top of Kubernetes and Envoy, with extras such as PFIO that make it easy to use from Python.
   - The design is focused on load balancing: at L4, traffic control aware of the physical network topology; at L7, MAGLEV for a perfectly balanced key distribution over the shared-nothing architecture.
   - PFN will keep building interesting machine learning infrastructure on Kubernetes.
35. We're hiring! The infrastructure technology group at Preferred Networks is recruiting:
   - Machine learning platform engineer: development and operation of Kubernetes, our internal machine learning platform, and our commercial cloud service. Keywords: K8s, on-premise, GPU, Observability, MLOps, HPC, schedulers, AWS, front/backend, container networking, data center networking, RDMA, RoCE v2
   - Storage engineer: planning, design, management, and operation of storage
   - Large-scale computing infrastructure engineer / network and infrastructure operations engineer: physical cluster design, design of advanced systems including MN-Core™, and more
   Feel free to apply for a casual interview.