

KubeCon NA 2024 Recap: Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster / Kubernetes Meetup Tokyo #68

Presentation slides for Kubernetes Meetup Tokyo #68.
A recap of PFN's session "Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster" at KubeCon + CloudNativeCon North America 2024, introducing use cases of our in-house distributed cache system and the load-balancing techniques behind it.

Preferred Networks

December 25, 2024



Transcript

1. KubeCon + CloudNativeCon North America 2024 Recap: Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster. Yuichiro Ueno, Toru Komatsu (Preferred Networks, Inc.), 2024-12-12, Kubernetes Meetup Tokyo #68
2. Self-introduction: Yuichiro Ueno (@y1r96)
   - Joined Preferred Networks as a new graduate in April 2021
   - Member of the Cluster Services team: building machine learning infrastructure on Kubernetes, working on accelerators and networking
   - Interests: performance optimization, high-performance computing (HPC), infrastructure in general
   - First time attending KubeCon
3. Toru Komatsu (@utam0k), Preferred Networks, Inc.
   - Develops and operates our on-premise ML infrastructure
   - OSS activities: maintainer of opencontainers/runtime-spec and youki-dev/youki; reviewer of containerd/runwasi; member of the kubernetes org (sig-scheduling/node)
   - Community: CNA, CNCJ
4. Today's recap session
   - A recap of PFN's first session presented at KubeCon: "Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster"
   - We introduced this system at Kubernetes Meetup Tokyo #60, so some of you may already know it; we will go through a quick review and then pick out several updates since #60.
5. PFN's business: vertically integrating the AI technology value chain.
   PFN vertically integrates the value chain of AI technology, from AI chips and computing infrastructure through generative AI foundation models to solutions and products, tightly fusing software and hardware to develop highly competitive technology and apply it to industry.
   - AI chips: MN-Core™, MN-Core™ 2, third-generation MN-Core, and an inference chip for LLMs (planned for 2026)
   - Computing infrastructure: GPU clusters, MN-3 (an MN-Core™ cluster), and a cloud service using MN-Core™ 2 as its compute resource
   - Generative AI foundation models: PLaMo Prime (an LLM scheduled for release this fall), PLaMo Lite (an SLM for edge devices), and PFP (a model for computing the energy of materials)
   - Solutions and products for various industries and consumers
6. Training of Machine Learning Models: data samples are read from the dataset and fed to the deep neural networks running on each compute node.
7. On-premise storage options for dataset loading
   - Network File System (NFS) with hostPath: fast but not scalable
   - Object storage: scalable but not fast (we use HDDs as the backend of our object storage)
   - Node-local storage (NVMe): very fast, but the storage is neither globally available nor scalable; if a workload is preempted and rescheduled onto a different compute node, the cached data on the original node becomes unreachable.
8. What is the best hierarchical storage for AI/ML workloads? Combine capacity-optimized object storage with a performance-optimized "cloud of node-local storage" in front of the compute nodes.
9. We developed that hierarchical storage, the "cloud of node-local storage", with:
   ✓ Topology-Aware Routing
   ✓ Informer for Pod Discovery
   ✓ Token Review API
   ✓ Consistent Hashing
   ✓ xDS API
10. Overview of Simple Cache Service (SCS)
   - Simple: a plain HTTP REST API (GET & PUT); it just returns local files
   - Cloud native: SCS runs on Kubernetes
   - Shared-nothing architecture: scalable
   - Position as a cache: it is just a "cache", not "persistent storage"
11. How to use SCS

   # Upload `apple.jpg` and save it as the `apple` object in the `prj-foobar` bucket.
   $ curl -H "Authorization: Bearer $(cat /token)" \
       -X PUT \
       http://cache.cache-service.svc/v1/objects/prj-foobar/apple \
       --data-binary @apple.jpg

   # Download the `apple` object from the `prj-foobar` bucket.
   $ curl -H "Authorization: Bearer $(cat /token)" \
       -X GET \
       http://cache.cache-service.svc/v1/objects/prj-foobar/apple
12. How to use SCS (same example): the URL path encodes the bucket (`prj-foobar`) and the object (`apple`).
13. How to use SCS (same example): the token passed in the Authorization header is a Bound Service Account Token.
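
   The same API is just as easy to drive from Python. Below is a minimal sketch using the requests library, assuming the bound service-account token is projected at /token exactly as in the curl examples above; the helper names are ours, not part of SCS.

   import requests

   TOKEN_PATH = "/token"  # projected bound service-account token, as in the curl examples
   BASE_URL = "http://cache.cache-service.svc/v1/objects"

   def _auth_headers():
       # Re-read the token on every call: bound tokens are rotated by the kubelet.
       with open(TOKEN_PATH) as f:
           return {"Authorization": f"Bearer {f.read().strip()}"}

   def scs_put(bucket, key, data):
       # PUT /v1/objects/<bucket>/<key> uploads the raw bytes as an object.
       resp = requests.put(f"{BASE_URL}/{bucket}/{key}", data=data, headers=_auth_headers())
       resp.raise_for_status()

   def scs_get(bucket, key):
       # GET /v1/objects/<bucket>/<key> returns the object body.
       resp = requests.get(f"{BASE_URL}/{bucket}/{key}", headers=_auth_headers())
       resp.raise_for_status()
       return resp.content

   with open("apple.jpg", "rb") as f:
       scs_put("prj-foobar", "apple", f.read())
   image = scs_get("prj-foobar", "apple")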
14. Overall Architecture (1/2). User Pods reach the cache through two layers of load balancing: Layer 4, a Service with Topology Aware Hints; Layer 7, Envoy Proxy with consistent hashing.
15. Shared-nothing architecture. Requests from network zones A and B pass through the L4 load balancing (Service with Topology Aware Hints) and the L7 load balancing (Envoy Proxy with consistent hashing); GET /objects/A always lands on the node holding cache A, and GET /objects/B on the node holding cache B, regardless of which zone the request originated from.
16. Authorization (1/2)
   1. The user Pod mounts the Bound Service Account Token.
   2. The Pod makes the request with the token in the Authorization header.
   3. SCS verifies the token with the TokenReview API: is the audience as expected? Until when is it valid? Is the Pod still alive?
   SCS then resolves the namespace of the caller from the service-account username, so namespace-level authorization can be implemented.
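
   As an illustration of step 3, a verification along these lines can be written with the official kubernetes Python client. This is only a sketch of the idea, not the actual SCS implementation; the audience string "cache-service" is an assumption.

   import re

   from kubernetes import client, config

   config.load_incluster_config()  # running inside the cluster, as SCS does

   def resolve_caller_namespace(token):
       # Ask the API server to verify the token via the TokenReview API.
       # A bound token also stops validating once its Pod is gone, which
       # covers the "Pod is still alive?" check.
       review = client.AuthenticationV1Api().create_token_review(
           client.V1TokenReview(
               spec=client.V1TokenReviewSpec(
                   token=token,
                   audiences=["cache-service"],  # assumed audience; must match the projected token
               )
           )
       )
       if not review.status.authenticated:
           return None
       # Service-account usernames look like system:serviceaccount:<namespace>:<name>,
       # so the caller's namespace can be recovered from the username.
       m = re.match(r"system:serviceaccount:([^:]+):", review.status.user.username)
       return m.group(1) if m else None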
17. Authorization (2/2): buckets are either public, or private based on a namespace selector:

   "Bucket": [
     {
       "Name": "public",
       "Public": true,
       "BucketQuota": "100Gi"
     },
     {
       "Name": "kubecon",
       "Public": false,
       "BucketQuota": "500Gi",
       "AllowNamespaces": [
         "prj-kubernetes",
         "user-utam0k"
       ]
     }
   ]
18. Authorization (2/2), continued. See also our blog post on achieving fairness control in the distributed cache system (分散キャッシュシステムにおける公平制御の実現).
19. Read a file in object storage with PFIO. PFIO is an I/O abstraction library developed by us; it can read, write, and list local filesystems, S3-compatible object storage, HDFS, and more. Reading 000.jpg below issues GET /000.jpg to object storage, which answers 200 OK.

   train_with_scs.py:
   import pfio
   import torch

   # zip_url points at the dataset (e.g. an S3 URL)
   fs = pfio.v2.from_url(zip_url)  # fs is a local-filesystem-like object, actually S3
   file_url = "000.jpg"
   with fs.open(file_url) as fp:
       image = fp.read()
       # image = torch.Tensor(image)...
       # loss = model(image)
20. Transparent object storage cache with PFIO. PFIO supports a transparent cache mechanism: it automatically checks SCS for the data first, then tries the origin if the data does not exist there. On the first access the desired data is not yet stored in SCS (GET /000.jpg returns 404 Not Found), so PFIO fetches it from object storage (GET /000.jpg, 200 OK) and puts it into SCS (PUT /000.jpg, 201 Created). The only change to the training script is passing the cache URL:

   train_with_scs.py:
   import pfio
   import torch

   fs = pfio.v2.from_url(zip_url, http_cache=scs_url)  # fs is a local-filesystem-like object, actually S3
   file_url = "000.jpg"
   with fs.open(file_url) as fp:
       image = fp.read()
       # image = torch.Tensor(image)...
       # loss = model(image)
21. Transparent object storage cache with PFIO, continued. On later accesses the desired data is already stored in SCS (GET /000.jpg returns 200 OK), so PFIO skips accessing object storage entirely. The training script is the same as on the previous slide.
22. Implementing yet another cache using SCS (e.g. a container image layer cache). On a miss the cache service gets 404 Not Found from SCS, fetches the object from the origin service (GET /000.jpg, 200 OK), and writes it back (PUT /000.jpg, 201 Created); on a hit SCS answers directly.
   Features to implement:
   - URL mappings from the origin key to the SCS bucket/key
   - AuthN/AuthZ if needed
   - Other necessary features
   Features not to implement (SCS already provides them):
   ✓ Storage management: cache eviction and capacity control
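
   The miss path (GET 404, fetch from origin, PUT back) and the hit path are easy to express. A minimal sketch against the SCS REST API from the earlier slides; the origin URL and the bucket name here are made up for illustration.

   import requests

   SCS_BASE = "http://cache.cache-service.svc/v1/objects/layer-cache"  # hypothetical bucket
   ORIGIN_BASE = "https://origin.example.com"                          # hypothetical origin service

   def cached_get(key, auth_headers):
       # Try SCS first.
       resp = requests.get(f"{SCS_BASE}/{key}", headers=auth_headers)
       if resp.status_code == 200:
           return resp.content  # cache hit: the origin is skipped entirely
       # Cache miss: fetch from the origin service (URL mapping happens here).
       data = requests.get(f"{ORIGIN_BASE}/{key}").content
       # Write through to SCS; eviction and capacity control are SCS's job.
       requests.put(f"{SCS_BASE}/{key}", data=data, headers=auth_headers)
       return data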
23. Deploy considerations
   Q1: How can we optimize the network traffic between user Pods, Envoy, and SCS?
   Q2: How can we configure Envoy to route the traffic?
24. Deploy considerations: first, Q1. How can we optimize the network traffic?
25. Background: our computing infrastructure
   Company: Preferred Networks
   - Provides ML models such as LLMs, and solutions for industries
   - Uses its own on-premise infrastructure to provide those solutions
   Infrastructure:
   - 3+ Kubernetes clusters, 400+ Kubernetes nodes
   - 30,000+ CPU cores, 320+ TiB of memory, 2,000+ GPUs
   - Our AI accelerator MN-Core™: hardware (RTL, board/server design) and software (driver, device plugin, compiler)
26. Background: data center network. The network topology of our data center is a CLOS network: nodes attach to leaf switches, leaf switches attach to spine switches, and the spine layer uplinks to the external / super-spine layer. In-zone networking is direct, while inter-zone networking across Network Zones A through D is oversubscribed.
27. Where to deploy Envoy?
   Assumptions:
   - SCS is deployed to all nodes to use their local NVMe drives.
   - User Pods are also scheduled to all nodes to use all accelerators.
   Decision: we deploy Envoy to all nodes to reduce inter-zone traffic between Pods and Envoy. Inter-zone traffic between Envoy and SCS is unavoidable in that case.
28. Reducing inter-zone traffic with Kubernetes Service routing. Two options:
   - Internal Traffic Policy:
     - Pod/Envoy traffic: perfect, no network traffic (Envoy is always node-local).
     - Envoy load balance: bad, no distribution of traffic; when some node uses SCS heavily, that node's Envoy CPU load becomes high.
   - Topology Aware Routing:
     - Pod/Envoy traffic: moderate, in-zone network traffic only.
     - Envoy load balance: moderate, traffic is distributed within a zone; when some node uses SCS heavily, the Envoy CPU load is spread across the zone.
   We use Topology Aware Routing to improve Envoy's CPU load balance (in recent Kubernetes it is enabled with the service.kubernetes.io/topology-mode: Auto annotation).
29. Deploy considerations: next, Q2. How can we configure Envoy to route the traffic?
30. Load balancing of keys (bucket and object)
   - We want to route the traffic from Envoy to SCS consistently: when we put an object to the N-th SCS instance, we want to get it back from the N-th SCS instance.
   - The most straightforward way is to manage a mapping from bucket/object to a backend id. But that mapping would itself have to be sharded, which means introducing a distributed metadata server (MDS): too complicated a solution for us.
   - Instead, we manage no explicit mapping: we use hash(bucket + "/" + object) to choose a backend server.
31. Consistent hashing
   - Should we use (hash % number-of-backends) as the backend id? When the number of backends changes, almost every key is remapped; typical examples are node failure and node installation.
   - A more sophisticated way is consistent hashing, which bounds the number of remapped keys to roughly keys/backends.
   - Envoy's lb_policy offers two consistent hashing algorithms: RING_HASH (backends own arcs on a hash ring) and MAGLEV (a precomputed lookup table maps hashes to backends).
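
   That remapping bound is easy to check empirically. A small standard-library-only sketch comparing naive modulo placement with a hash ring when one of four backends fails:

   import bisect
   import hashlib

   def h(s):
       return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

   keys = [f"prj-foobar/obj-{i}" for i in range(50_000)]

   def modulo_place(key, backends):
       # Naive placement: backend id = hash % number-of-backends.
       return backends[h(key) % len(backends)]

   def build_ring(backends, vnodes=100):
       # Consistent hashing: each backend owns many points on a ring.
       ring = sorted((h(f"{b}#{v}"), b) for b in backends for v in range(vnodes))
       return [p for p, _ in ring], [b for _, b in ring]

   def ring_place(key, points, owners):
       # A key belongs to the first backend point at or after its hash (wrapping).
       i = bisect.bisect_left(points, h(key)) % len(points)
       return owners[i]

   before, after = ["b1", "b2", "b3", "b4"], ["b1", "b2", "b3"]  # b4 fails
   ring_before, ring_after = build_ring(before), build_ring(after)

   moved_mod = sum(modulo_place(k, before) != modulo_place(k, after) for k in keys)
   moved_ring = sum(ring_place(k, *ring_before) != ring_place(k, *ring_after) for k in keys)
   print(f"modulo: {moved_mod / len(keys):.0%} of keys remapped")   # ~75%: almost every key moves
   print(f"ring:   {moved_ring / len(keys):.0%} of keys remapped")  # ~25%, i.e. keys/backends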
32. Key distribution matters! The load balance of keys is also very important. With RING_HASH, the length of each arc on the ring corresponds to that backend's share of responsibility; in the example, Backend 3's share is 1.5x that of Backend 4, and this affects performance:
   - B3's CPU usage is 1.5x of B4's, because B3 is 1.5x busier, which may result in longer latency.
   - The lifetime of B3's data is 1.5x shorter than B4's, because the cache capacity is the same on both nodes, so B3's objects are more likely to be evicted.
   We want to see consistent resource usage and data lifetime across backends.
33. RING_HASH vs MAGLEV: we use MAGLEV. Measuring the object count per node, RING_HASH shows load imbalance of up to 1.5x, while MAGLEV shows no load imbalance.
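
   MAGLEV achieves that balance by construction: every backend fills an (almost) equal share of a fixed-size lookup table. A condensed sketch of the table-building algorithm from the Maglev paper; the table size here is a small prime for readability, whereas real deployments use much larger primes (Envoy defaults to 65537).

   import hashlib
   from collections import Counter

   def _h(s, salt):
       return int.from_bytes(hashlib.sha256(f"{salt}:{s}".encode()).digest()[:8], "big")

   def maglev_table(backends, m=251):
       # m must be prime so each backend's (offset, skip) walk visits every slot.
       offsets = {b: _h(b, "offset") % m for b in backends}
       skips = {b: _h(b, "skip") % (m - 1) + 1 for b in backends}
       nexts = dict.fromkeys(backends, 0)
       table = [None] * m
       filled = 0
       while filled < m:
           for b in backends:  # round-robin: backends claim slots at an equal rate
               while True:
                   slot = (offsets[b] + nexts[b] * skips[b]) % m
                   nexts[b] += 1
                   if table[slot] is None:
                       table[slot] = b
                       filled += 1
                       break
               if filled == m:
                   break
       return table

   def maglev_place(key, table):
       return table[_h(key, "key") % len(table)]

   table = maglev_table(["b1", "b2", "b3", "b4"])
   print(Counter(table))  # each backend owns ~m/4 slots: near-perfect key balance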
34. Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster (summary)
   - We accelerate machine learning workloads with a distributed cache, built on top of Kubernetes and Envoy, with extras such as PFIO that make it easy to use from Python.
   - The design is focused on load balancing: at L4, traffic control aware of the physical network topology; at L7, MAGLEV for a perfectly balanced key distribution over the shared-nothing architecture.
   - PFN will keep building interesting machine learning infrastructure on Kubernetes.
35. We're hiring! The infrastructure technology group at Preferred Networks is recruiting:
   - Machine learning platform engineer: development and operation of Kubernetes, our internal machine learning platform, and our commercial cloud service. Keywords: K8s, on-premise, GPU, Observability, MLOps, HPC, schedulers, AWS, front/backend, container networking, data center networking, RDMA, RoCE v2
   - Storage engineer: planning, design, management, and operation of storage
   - Large-scale computing infrastructure engineer / network and infrastructure operations engineer: physical cluster design, design of advanced systems including MN-Core™, and more
   Feel free to apply for a casual interview.