Slide 1

KubeCon + CloudNativeCon North America 2024 Recap
Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster
Yuichiro Ueno, Toru Komatsu (Preferred Networks, Inc.)
2024-12-12 Kubernetes Meetup Tokyo #68

Slide 2

About Me: Yuichiro Ueno (@y1r96)
- Joined Preferred Networks as a new graduate in April 2021
- Member of the Cluster Services team: building a machine learning platform on Kubernetes, working on accelerators and networking
- Interests: performance optimization, high-performance computing (HPC), and infrastructure in general
- First time attending KubeCon

Slide 3

Toru Komatsu (@utam0k), Preferred Networks, Inc.: development and operation of the on-premise ML platform
- OSS activities: Maintainer of opencontainers/runtime-spec and youki-dev/youki; Reviewer of containerd/runwasi; Member of the kubernetes org (sig-scheduling/node)
- Community: CNA, CNCJ

Slide 4

Today's Recap Session
- Let me recap PFN's first talk at KubeCon: Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster
- We have presented this topic at Kubernetes Meetup Tokyo #60 before. Some of you may already know it, so this will be a quick review plus a selection of updates since #60!

Slide 5

PFN's Business: Vertical Integration of the AI Value Chain
PFN vertically integrates the AI technology value chain, from chips and computing infrastructure to generative AI foundation models, solutions, and products, tightly fusing software and hardware to develop highly competitive technology and apply it to industry.
- AI chips: MN-Core™, MN-Core™ 2, the third-generation MN-Core, and an inference chip for LLMs (planned for 2026)
- Computing infrastructure: GPU clusters, MN-3 (the MN-Core™ cluster), and a cloud service using MN-Core™ 2 as its computing resource
- Generative AI foundation models: PLaMo Prime (an LLM to be offered this fall), PLaMo Lite (an SLM for edge devices), and PFP (a model for computing the energy of matter)
- Solutions and products for various industries and consumers

Slide 6

Background

Slide 7

Training of Machine Learning Models
[Diagram: data samples flow from the dataset to deep neural networks running on compute nodes]
Icon pack by Icons8 - https://icons8.com

Slide 8

On-premise Storages for the dataset loading
- Network File System (NFS) with hostPath: fast but not scalable
- Object Storage: scalable but not fast (we are using HDDs as the backend of our object storage)
- Node Local Storage (NVMe): very fast, but the storage is not globally available and not scalable. If the workload is moved to a different compute node, the data is unreachable.
[Diagram: a Deep Neural Network workload on Compute Node A is preempted and re-scheduled onto Compute Node B; the cache left on Node A's local storage becomes unreachable]

Slide 9

Best hierarchical storage for AI/ML workload?
[Diagram: a capacity-optimized Object Storage layer below a performance-optimized "cloud of node local storage", serving deep neural networks on the compute nodes]

Slide 10

Best hierarchical storage ("cloud of node local storage") development with:
✓ Topology-Aware Routing
✓ Informer for Pod Discovery
✓ Token Review API
✓ Consistent Hashing
✓ xDS API

Slide 11

Architecture and Usage

Slide 12

Overview of Simple Cache Service
- Simple: a simple HTTP REST API (GET & PUT); it just returns local files
- Cloud Native: SCS runs on Kubernetes
- Shared-nothing architecture: scalable
- Position as a Cache: it is just a "cache", not "persistent storage"

Slide 13

How to use SCS

# Upload `apple.jpg` and save it as the `apple` object in the `prj-foobar` bucket.
$ curl -H "Authorization: Bearer $(cat /token)" \
    -X PUT \
    http://cache.cache-service.svc/v1/objects/prj-foobar/apple \
    --data-binary @apple.jpg

# Download the `apple` object from the `prj-foobar` bucket.
$ curl -H "Authorization: Bearer $(cat /token)" \
    -X GET \
    http://cache.cache-service.svc/v1/objects/prj-foobar/apple
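The path layout used by these commands can be captured in a tiny helper; this is an illustrative sketch, not an official SCS client:

```python
def object_url(base: str, bucket: str, obj: str) -> str:
    """Build an SCS object URL of the form <base>/v1/objects/<bucket>/<object>."""
    return f"{base}/v1/objects/{bucket}/{obj}"

url = object_url("http://cache.cache-service.svc", "prj-foobar", "apple")
# url == "http://cache.cache-service.svc/v1/objects/prj-foobar/apple"
```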

Slide 14

How to use SCS: the URL path encodes the bucket and the object, as /v1/objects/<bucket>/<object> (e.g. /v1/objects/prj-foobar/apple).

Slide 15

How to use SCS: the Authorization header carries a Bound Service Account Token (mounted at /token in the examples above).

Slide 16

Overall Architecture (1/2)
[Diagram: User Pods send requests through Layer 4 load balancing (a Service with Topology Aware Hints) and then Layer 7 load balancing (Envoy Proxy with Consistent Hashing) to the Cache]

Slide 17

Overall Architecture (2/2)
[Diagram: on each node, the application keeps the cache data as local files and the metadata in a .sqlite database]

Slide 18

Shared-nothing Architecture
[Diagram: a user's GET /objects/A is routed to Cache A in Network Zone A, and GET /objects/B to Cache B in Network Zone B, via L4 load balancing (Service with Topology Aware Hints) and L7 load balancing (Envoy Proxy with Consistent Hashing)]

Slide 19

Authorization (1/2)
1. Mount the Bound SA Token into the user Pod.
2. Make the request with the token in the Authorization header.
3. Verify the token with the TokenReview API: is the audience as expected? Is the token still valid? Is the Pod still alive?
✓ Resolve the namespace of the source from the ServiceAccount username, so namespace-level authorization can be implemented.
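The namespace resolution in step 3 relies on the fixed form of ServiceAccount usernames reported by the TokenReview API, `system:serviceaccount:<namespace>:<name>`. A minimal sketch of the extraction; the helper name is ours, not SCS's actual code:

```python
def namespace_from_sa_username(username: str) -> str:
    """Extract the source namespace from a TokenReview username,
    e.g. 'system:serviceaccount:prj-foobar:default' -> 'prj-foobar'."""
    parts = username.split(":")
    if len(parts) != 4 or parts[:2] != ["system", "serviceaccount"]:
        raise ValueError(f"not a ServiceAccount username: {username!r}")
    return parts[2]

# The resolved namespace can then be checked against a bucket's allow list.
```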

Slide 20

Authorization (2/2): public buckets, and private buckets based on a namespace selector

"Bucket": [
  {
    "Name": "public",
    "Public": true,
    "BucketQuota": "100Gi"
  },
  {
    "Name": "kubecon",
    "Public": false,
    "BucketQuota": "500Gi",
    "AllowNamespaces": [
      "prj-kubernetes",
      "user-utam0k"
    ]
  }
]

Slide 21

Authorization (2/2), continued: the same bucket configuration. See also the blog post 分散キャッシュシステムにおける公平制御の実現 (Achieving fairness control in a distributed cache system).

Slide 22

Read File in Object Storage with PFIO
PFIO is an I/O abstraction library developed by us. It can read / write / list the local filesystem, S3-compatible object storage, HDFS, and more.

train_with_scs.py:

import pfio
import torch

fs = pfio.v2.from_url(zip_url)
# fs is a local-filesystem-like object, actually S3
file_url = "000.jpg"
with fs.open(file_url) as fp:
    image = fp.read()
# image = torch.Tensor(image)...
# loss = model(image)

[Diagram: the training process issues GET /000.jpg to Object Storage and receives 200 OK]

Slide 23

Transparent Object Storage Cache by PFIO
PFIO supports a transparent cache mechanism:
- It automatically checks for the data in SCS first, then falls back to the origin if the data does not exist there.
- At first, the desired data is not stored in SCS, so PFIO puts it into SCS.

train_with_scs.py:

import pfio
import torch

fs = pfio.v2.from_url(zip_url, http_cache=scs_url)
# fs is a local-filesystem-like object, actually S3
file_url = "000.jpg"
with fs.open(file_url) as fp:
    image = fp.read()
# image = torch.Tensor(image)...
# loss = model(image)

[Diagram: GET /000.jpg to SCS returns 404 Not Found; PFIO fetches the file from Object Storage (200 OK) and PUTs it to SCS (201 Created)]

Slide 24

Transparent Object Storage Cache by PFIO
PFIO supports a transparent cache mechanism:
- It automatically checks for the data in SCS first, then falls back to the origin if the data does not exist there.
- If the desired data is already stored in SCS, we can skip accessing Object Storage.
(The training script is the same as on the previous slide.)

[Diagram: GET /000.jpg to SCS returns 200 OK, and Object Storage is skipped]
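The miss path and the hit path on the last two slides together form a classic read-through cache. A minimal sketch with hypothetical callables standing in for SCS and the object storage origin; this is not PFIO's actual code:

```python
def read_through(key, cache_get, cache_put, origin_get):
    """Check the cache first; on a miss, fetch from the origin and
    populate the cache so later reads can skip the origin entirely."""
    data = cache_get(key)      # SCS: GET, may miss (404)
    if data is not None:
        return data            # hit: object storage is skipped
    data = origin_get(key)     # object storage: GET
    cache_put(key, data)       # SCS: PUT (201 Created)
    return data

# usage with dict-backed fakes
cache, origin = {}, {"000.jpg": b"jpeg bytes"}
first = read_through("000.jpg", cache.get, cache.__setitem__, origin.__getitem__)
origin.clear()  # even with the origin gone, the second read is served from cache
second = read_through("000.jpg", cache.get, cache.__setitem__, origin.__getitem__)
```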

Slide 25

Implementing Yet Another Cache using SCS (e.g. a container image layer cache)
Features to implement:
- URL mappings (from origin key to SCS bucket/key)
- AuthN/AuthZ if needed
- Other necessary features
Features not to implement (SCS takes care of storage management):
- Cache eviction
- Capacity control
[Diagram: on a miss, the cache service gets 404 Not Found from SCS, fetches the object from the Origin Service (200 OK), PUTs it into SCS (201 Created), and returns 200 OK to the client]

Slide 26

Deploying to Production

Slide 27

Deploy Considerations
- Q1: How can we optimize the network traffic?
- Q2: How can we configure Envoy to route the traffic?

Slide 28

Deploy Considerations, focusing on Q1: How can we optimize the network traffic?

Slide 29

Background: Our computing infrastructure
Company: Preferred Networks
- Provides ML models like LLMs, and solutions for industries
- Uses its own on-premise infrastructure to provide solutions
Infrastructure:
- 3+ Kubernetes clusters, 400+ Kubernetes nodes
- 30,000+ CPU cores, 320+ TiB memory, 2,000+ GPUs
- Our AI accelerator MN-Core™ (HW: RTL, board/server design; SW: driver, device plugin, compiler)

Slide 30

Background: Data Center Network
The network topology of our data center is a CLOS network.
[Diagram: Network Zones A through D; in each zone, leaf switches provide in-zone networking among the nodes, and spine switches provide inter-zone networking (oversubscribed) up to the external / super spine layer]

Slide 31

Where to deploy Envoy?
- Assumptions: SCS is deployed to all nodes to use the local NVMe drives, and user Pods are also scheduled to all nodes to use all accelerators.
- Decision: we deploy Envoy to all nodes to reduce the inter-zone Pod-to-Envoy traffic. Inter-zone Envoy-to-SCS traffic is unavoidable in that case.

Slide 32

Reducing inter-zone traffic by K8s Service
Internal Traffic Policy:
- Pod/Envoy traffic: perfect, no network traffic at all
- Envoy load balance: bad, no distribution of traffic. When some node uses SCS heavily, that node's Envoy CPU load becomes high.
Topology Aware Routing:
- Pod/Envoy traffic: moderate, in-zone network traffic only
- Envoy load balance: moderate, traffic is distributed within a zone. When some node uses SCS heavily, the Envoy CPU load is distributed across the zone.
We use Topology Aware Routing to improve Envoy's CPU load balance.
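For reference, Topology Aware Routing is enabled per Service through an annotation. A hedged sketch of what such a manifest could look like; the names and ports are illustrative, not our actual manifests:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: envoy
  namespace: cache-service
  annotations:
    # Prefer endpoints in the caller's topology zone (Kubernetes >= 1.27;
    # older releases used service.kubernetes.io/topology-aware-hints instead).
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: envoy
  ports:
    - port: 80
      targetPort: 8080
```

The rejected alternative would instead set `spec.internalTrafficPolicy: Local`, which keeps traffic on-node but concentrates load on a single Envoy.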

Slide 33

Deploy Considerations, focusing on Q2: How can we configure Envoy to route the traffic?

Slide 34

Load Balancing of Keys (Bucket and Object)
- We want to route the traffic from Envoy to SCS consistently: when we put an object to the N-th SCS, we want to get it from the N-th SCS.
- The easiest way to achieve that: manage a mapping from bucket/object to a backend id. The mapping would have to be sharded, which means introducing a distributed metadata service (MDS): too complicated a solution for us.
- So we do not manage a mapping explicitly: we use hash(bucket + "/" + object) to choose a backend server.
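The hash-based choice can be sketched as follows; illustrative only (the backend names are made up), and it uses SHA-256 because Python's builtin `hash()` is randomized per process, while every Envoy must compute the same answer:

```python
import hashlib

def pick_backend(bucket: str, obj: str, backends: list[str]) -> str:
    """Map hash(bucket + "/" + object) onto one of the backends."""
    key = f"{bucket}/{obj}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest(), "big")
    return backends[digest % len(backends)]

backends = ["scs-0", "scs-1", "scs-2", "scs-3"]
# PUT and GET for the same key always land on the same backend:
assert pick_backend("prj-foobar", "apple", backends) == \
       pick_backend("prj-foobar", "apple", backends)
```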

Slide 35

Consistent Hashing
- Use (hash % number-of-backends) as the backend id? When the number of backends changes, almost every key is remapped. Typical examples: node failure, node installation.
- A more sophisticated way is Consistent Hashing: the number of remapped keys is bounded by about keys/backends.
- Envoy's lb_policy offers two consistent hashing algorithms: RING_HASH and MAGLEV.
[Diagram: a RING_HASH ring with Backends 1 through 4 where Hash(10) falls onto an arc, next to MAGLEV lookup tables, e.g. Backend 1 owns slots {1, 6, 10}, Backend 2 {2, 5, 11}, Backend 3 {7, 8, 9}, Backend 4 {3, 4, 12}]
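A RING_HASH-style sketch with virtual nodes shows the remapping bound; this is illustrative, not Envoy's implementation:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class Ring:
    """Consistent hashing ring: each backend owns many virtual points;
    a key goes to the owner of the next point clockwise on the ring."""
    def __init__(self, backends, vnodes=100):
        self._points = sorted((_h(f"{b}#{i}"), b)
                              for b in backends for i in range(vnodes))
        self._hashes = [h for h, _ in self._points]

    def pick(self, key: str) -> str:
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]

full = Ring(["scs-0", "scs-1", "scs-2", "scs-3"])
less = Ring(["scs-0", "scs-1", "scs-2"])  # scs-3 failed or was removed
keys = [f"bucket/obj-{i}" for i in range(1000)]
moved = sum(full.pick(k) != less.pick(k) for k in keys)
# Only keys owned by the removed backend move: roughly 1000/4, not all 1000.
```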

Slide 36

Key distribution matters!
- The load balance of keys is also very important: the length of an arc in the ring corresponds to a backend's share of responsibility. In the RING_HASH diagram, Backend 3 has 1.5x the responsibility of Backend 4.
- This affects performance: B3's CPU usage is 1.5x B4's because B3 is 1.5x busier, which may result in longer latency.
- It also affects data lifetime: since the cache capacity is the same on every backend, the lifetime of B3's data is 1.5x shorter than B4's, meaning a higher possibility of deletion.
We want to see consistent resource usage and lifetime.

Slide 37

RING_HASH vs MAGLEV: we use MAGLEV
[Chart: objects count per node. With RING_HASH, load imbalance up to 1.5x; with MAGLEV, no load imbalance]
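A simplified sketch of the table build from the Maglev paper ("Maglev: A Fast and Reliable Software Network Load Balancer"); this is not Envoy's implementation. Each backend claims slots of a prime-sized lookup table in its own permutation order, one slot per round, so slot counts differ by at most one:

```python
import hashlib
from collections import Counter

def _h(s: str, seed: int) -> int:
    return int.from_bytes(hashlib.sha256(f"{seed}:{s}".encode()).digest()[:8], "big")

def maglev_table(backends, size=251):
    """Build the Maglev lookup table. `size` must be prime so that every
    backend's (offset, skip) sequence visits all slots."""
    n = len(backends)
    offset = [_h(b, 0) % size for b in backends]
    skip = [_h(b, 1) % (size - 1) + 1 for b in backends]
    nxt = [0] * n
    table = [-1] * size
    filled = 0
    while True:
        for i in range(n):
            # advance backend i to its next still-empty preferred slot
            while True:
                pos = (offset[i] + nxt[i] * skip[i]) % size
                nxt[i] += 1
                if table[pos] == -1:
                    table[pos] = i
                    filled += 1
                    break
            if filled == size:
                return table

table = maglev_table(["scs-0", "scs-1", "scs-2", "scs-3"])
counts = Counter(table)
# Each backend owns 62 or 63 of the 251 slots: an imbalance of at most one slot.
# A key is then routed with: backends[table[hash(key) % 251]]
```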

Slide 38

API calls / sec
[Graph: peak of 37k requests / sec]

Slide 39

Numbers of SCS: Aggregated Traffic
[Graph: peak of 75.1 GiB/s aggregated throughput]

Slide 40

Summary

Slide 41

Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster
- We accelerate machine learning workloads with a distributed cache, built on top of Kubernetes and Envoy, with additional work such as PFIO to make it easy to use from Python.
- The design focuses on load balancing: at L4, traffic control aware of the actual network topology; at L7, MAGLEV for a perfectly even key distribution, on a shared-nothing architecture.
- PFN will keep building interesting machine learning platforms on Kubernetes.

Slide 42

We're hiring! The infrastructure technology group at Preferred Networks is recruiting:
- Machine Learning Platform Engineer: development and operation of Kubernetes, the internal machine learning platform, and the externally offered cloud service. Keywords: K8s, on-premise, GPU, Observability, MLOps, HPC, schedulers, AWS, front/backend, container networking, data center networking, RDMA, RoCE v2
- Storage Engineer: planning, design, management, and operation of storage
- Large-scale Computing Infrastructure Engineer / Network and Infrastructure Operations Engineer: physical cluster design, design of advanced systems including MN-Core™, and more
Feel free to apply for a casual interview!

Slide 43


Making the real world computable