Slide 1

Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster
Yuichiro Ueno, Toru Komatsu (Preferred Networks, Inc.)
KubeCon North America 2024

Slide 2

Speakers
Toru Komatsu (@utam0k), Preferred Networks, Inc.
Yuichiro Ueno (@y1r96), Preferred Networks, Inc.

Slide 3

Today's topic
1. Background: AI / ML Workloads
✓ Storage Requirements and Kubernetes Usage
2. Our system: Simple Cache Service
3. Use case
4. Deploy Considerations
✓ How to optimize network traffic and key distribution to achieve higher performance
✓ The number of SCS in production
5. Summary

Slide 4

Distributed Cache Empowers AI/ML Workloads on Kubernetes Cluster
Yuichiro Ueno, Toru Komatsu (Preferred Networks, Inc.)
KubeCon North America 2024

Slide 5

Training of Machine Learning Models
(Diagram: compute nodes, each training a deep neural network, read data samples from a shared dataset.)

Slide 6

On-premise storage options for dataset loading
● Network File System (NFS) with hostPath
○ Fast but not scalable
● Object Storage
○ Scalable but not fast
■ We're using HDDs as the backend of our object storage
● Node Local Storage (NVMe)
○ Very fast, but the storage is not globally available and not scalable
■ If the workload is moved to a different compute node, the data is unreachable.
(Diagram: a workload preempted on compute node A and rescheduled to compute node B can no longer reach the cache left on node A's local storage.)

Slide 7

Best hierarchical storage for AI/ML workloads?
(Diagram: compute nodes are served by a performance-optimized "cloud of node local storage", which is in turn backed by capacity-optimized object storage.)

Slide 8

Best hierarchical storage ("cloud of node local storage") development with:
✓ Topology-Aware Routing
✓ Informer for Pod Discovery
✓ Token Review API
✓ Consistent Hashing
✓ xDS API

Slide 9

Overview of Simple Cache Service

Slide 10

Overview of Simple Cache Service
Simple
✓ Simple HTTP REST API (GET & PUT)
✓ It just returns local files
Cloud Native
✓ SCS runs on Kubernetes
Shared-nothing architecture
✓ Scalable
Position as a Cache
✓ It's just a "Cache", not "Persistent Storage"

Slide 11

How to use SCS

# Upload `apple.jpg` and save it as the `apple` object in the `prj-foobar` bucket.
$ curl -H "Authorization: Bearer $(cat /token)" \
    -X PUT \
    http://cache.cache-service.svc/v1/objects/prj-foobar/apple \
    --data-binary @apple.jpg

# Download the `apple` object from the `prj-foobar` bucket.
$ curl -H "Authorization: Bearer $(cat /token)" \
    -X GET \
    http://cache.cache-service.svc/v1/objects/prj-foobar/apple

Slide 12

How to use SCS
The same requests, highlighting the URL structure /v1/objects/{bucket}/{object}: here `prj-foobar` is the bucket and `apple` is the object.

Slide 13

How to use SCS
The same requests again: the Authorization header carries a Bound Service Account Token.

Slide 14

Overall Architecture (1/2)
● Load balancing at Layer 4: Service with Topology Aware Hints
● Load balancing at Layer 7: Envoy Proxy with Consistent Hashing
(Diagram: user Pods reach the cache Pods through the L4 Service and then the Envoy L7 proxy.)

Slide 15

Overall Architecture (2/2)
(Diagram: on each node, the application serves cache data from local storage and keeps metadata in a .sqlite file.)

Slide 16

Shared-nothing Architecture
● Load balancing at Layer 4: Service with Topology Aware Hints
● Load balancing at Layer 7: Envoy Proxy with Consistent Hashing
(Diagram: GET /objects/A from users in Network Zone A or Zone B is always routed to Cache A, while GET /objects/B is routed to Cache B.)

Slide 17

Slide 17 text

1. Mount the Bound SA Token 2. Make the request w/ the token in Auth Header Authorization (1/2) User Pods 3. Verify the token by TokenReview API ✓ “Aud as expected?” “Valid until?” “Pod is still alive?” ✓ Resolve the NS of the source from the SA username ➡ Namespace-level authorization can be implemented 23
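As a rough illustration of step 3 (not SCS's actual code), here is a minimal sketch using the official Kubernetes Python client; the audience string "cache-service" is an assumed example, not SCS's real value.

verify_token.py (sketch)
from kubernetes import client, config

def source_namespace(token: str) -> str | None:
    # Assumes in-cluster credentials with RBAC permission to create TokenReviews.
    config.load_incluster_config()
    review = client.V1TokenReview(
        spec=client.V1TokenReviewSpec(
            token=token,
            audiences=["cache-service"],  # assumed audience for illustration
        )
    )
    status = client.AuthenticationV1Api().create_token_review(review).status
    if not status.authenticated:
        return None  # expired, wrong audience, or the Pod no longer exists
    # Username format is "system:serviceaccount:<namespace>:<name>",
    # so the source namespace is the third component.
    return status.user.username.split(":")[2]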

Slide 18

Authorization (2/2)
"Bucket": [
  {
    "Name": "public",
    "Public": true,
    "BucketQuota": "100Gi"
  },
  {
    "Name": "kubecon",
    "Public": false,
    "BucketQuota": "500Gi",
    "AllowNamespaces": [
      "prj-kubernetes",
      "user-utam0k"
    ]
  }
]
● "public" is a public bucket; "kubecon" is a private bucket gated by a namespace selector.
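The namespace check this config implies fits in one small function; a sketch, with field names taken from the JSON above and everything else assumed:

bucket_authz.py (sketch)
def is_allowed(bucket: dict, source_namespace: str) -> bool:
    # Public buckets accept any namespace; private buckets only
    # the namespaces listed in AllowNamespaces.
    if bucket.get("Public", False):
        return True
    return source_namespace in bucket.get("AllowNamespaces", [])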

Slide 19

Unfortunately, storage is a limited resource… 😭
● LRU (Least Recently Used) eviction strategy, under a total capacity limit
● When a bucket reaches its capacity limit, objects are deleted in LRU order (see the sketch below)
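A minimal sketch of per-bucket LRU eviction by total bytes, for illustration only (not SCS's implementation):

lru_bucket.py (sketch)
from collections import OrderedDict

class LRUBucket:
    """Per-bucket LRU eviction by total bytes (illustration, not SCS's code)."""

    def __init__(self, quota_bytes: int):
        self.quota = quota_bytes
        self.used = 0
        self.sizes = OrderedDict()  # object key -> size, oldest first

    def touch(self, key: str) -> bool:
        # A GET marks the object as recently used.
        if key not in self.sizes:
            return False
        self.sizes.move_to_end(key)
        return True

    def put(self, key: str, size: int) -> None:
        if key in self.sizes:
            self.used -= self.sizes.pop(key)
        self.sizes[key] = size
        self.used += size
        # Evict least-recently-used objects until the bucket fits its quota.
        while self.used > self.quota and self.sizes:
            _, evicted_size = self.sizes.popitem(last=False)
            self.used -= evicted_size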

Slide 20

Use case

Slide 21

Use case of SCS in AI / ML Workloads
Case 1: SCS as a Cache for Slower Storage
✓ Makes AI/ML workloads faster!
Case 2: SCS as a Backend for Yet Another Cache
✓ Makes startup of AI/ML workloads faster!

Slide 22

Use case of SCS in AI / ML Workloads
→ First, Case 1: SCS as a Cache for Slower Storage.

Slide 23

Read File in Object Storage with PFIO
PFIO is an I/O abstraction library developed by us.
● It can read / write / list the local filesystem, S3-compatible object storage, HDFS, …
(Diagram: the training script sends GET /000.jpg to object storage and receives 200 OK.)

train_with_scs.py
import pfio
import torch

fs = pfio.v2.from_url(zip_url)  # fs is a local-filesystem-like object, actually backed by S3
file_url = "000.jpg"
with fs.open(file_url) as fp:
    image = fp.read()
# image = torch.Tensor(image)...
# loss = model(image)

Slide 24

Transparent Object Storage Cache by PFIO
PFIO supports a transparent cache mechanism.
● It automatically checks SCS for the data first, then tries the origin if the data does not exist.
● At first the desired data is not stored in SCS, so PFIO puts it into SCS.
(Diagram: GET /000.jpg to SCS returns 404 Not Found; GET /000.jpg to object storage returns 200 OK; PFIO then issues PUT /000.jpg to SCS, which returns 201 Created.)

train_with_scs.py
import pfio
import torch

fs = pfio.v2.from_url(zip_url, http_cache=scs_url)  # fs is a local-filesystem-like object, actually backed by S3
file_url = "000.jpg"
with fs.open(file_url) as fp:
    image = fp.read()
# image = torch.Tensor(image)...
# loss = model(image)

Slide 25

Transparent Object Storage Cache by PFIO
PFIO supports a transparent cache mechanism.
● It automatically checks SCS for the data first, then tries the origin if the data does not exist.
● If the desired data is stored in SCS, we can skip accessing Object Storage.
(Diagram: GET /000.jpg to SCS now returns 200 OK directly, and Object Storage is skipped. Same train_with_scs.py as the previous slide.)

Slide 26

Use case of SCS in AI / ML Workloads
→ Next, Case 2: SCS as a Backend for Yet Another Cache.

Slide 27

Other large files in AI / ML Workloads
Type 1: Container Images
● They include a lot of dependencies: compilers, CUDA (runtime and libraries), MPI, and PyTorch.
● As a result, our all-in-one container image is 30+ GB.
● The weekly cache hit rate against SCS is 94.3% in our cluster.
Type 2: Models
● Large Language Models are getting larger and larger! (GB ~ TB in size)
● Our researchers want to evaluate the performance of public LLMs.
Characteristics: Ephemeral, Large, and Hot
● Many users access the same file, so the cache mechanism works well.

Slide 28

Implementing Yet Another Cache using SCS
Features to implement:
○ URL mappings from origin key to SCS bucket/key
○ AuthN/AuthZ if needed
○ Other necessary features
Features not to implement (SCS already provides them):
✓ Storage management: cache eviction and capacity control
(Diagram, e.g. for container image layers: ① the client sends GET /000.jpg to the cache; ② the cache sends GET /000.jpg to SCS, which returns 404 Not Found; ③ the cache fetches GET /000.jpg from the origin service, 200 OK; ④ the cache issues PUT /000.jpg to SCS, 201 Created; ⑤ the cache returns 200 OK to the client. A sketch follows.)
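A minimal read-through sketch of the flow above, assuming the /v1/objects REST API shown earlier; the `yet-another` bucket name, the origin URL, and the 1:1 key mapping are hypothetical, not the real implementation:

yet_another_cache.py (sketch)
import requests

SCS = "http://cache.cache-service.svc/v1/objects/yet-another"  # assumed bucket
ORIGIN = "https://origin.example.com"                          # hypothetical origin

def fetch(key: str, token: str) -> bytes:
    headers = {"Authorization": f"Bearer {token}"}
    cached = requests.get(f"{SCS}/{key}", headers=headers)
    if cached.status_code == 200:               # ② hit: SCS serves the bytes
        return cached.content
    origin = requests.get(f"{ORIGIN}/{key}")    # ③ miss: go to the origin
    origin.raise_for_status()
    # ④ Populate SCS so the next reader hits the cache;
    #    SCS itself handles eviction and capacity control.
    requests.put(f"{SCS}/{key}", headers=headers, data=origin.content)
    return origin.content                       # ⑤ respond to the client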

Slide 29

Deploying SCS

Slide 30

Deploy Considerations
Q1: How can we optimize the network traffic?
Q2: How can we configure Envoy to route the traffic?

Slide 31

Deploy Considerations
→ First, Q1: How can we optimize the network traffic?

Slide 32

Background: Our computing infrastructure
Company: Preferred Networks
● Provides ML models like LLMs, and solutions for industries
● Uses its own on-premise infrastructure to provide those solutions
Infrastructure
● 3+ Kubernetes clusters
● 400+ Kubernetes nodes
● 30,000+ CPU cores
● 320+ TiB of memory
● 2,000+ GPUs
● Our AI accelerator: MN-Core™
○ HW: RTL, board/server design
○ SW: driver, device plugin, compiler

Slide 33

Background: Data Center Network
Network topology of our data center: CLOS network
(Diagram: Network Zones A through D, each with leaf switches connecting nodes and spine switches above them, joined by an external / super-spine layer. In-zone networking is full bandwidth; inter-zone networking is oversubscribed.)

Slide 34

Where to deploy Envoy?
● Assumptions
○ SCS is deployed to all nodes to use their local NVMe drives.
○ User Pods are also scheduled to all nodes to use all accelerators.
● Where to deploy Envoy?
○ We deploy Envoy to all nodes to reduce inter-zone Pod→Envoy traffic.
○ Inter-zone Envoy→SCS traffic is unavoidable in that case.

Slide 35

Reducing inter-zone traffic with Kubernetes Services
Internal Traffic Policy
● Pod→Envoy traffic: perfect, no network traffic (requests stay on the node)
● Envoy load balance: bad, no distribution of traffic
○ When one node uses SCS heavily, that node's Envoy CPU load becomes high.
Topology Aware Routing
● Pod→Envoy traffic: moderate, in-zone network traffic only
● Envoy load balance: moderate, traffic is distributed within a zone
○ When one node uses SCS heavily, the Envoy CPU load is distributed across the zone.
→ We use Topology Aware Routing to improve Envoy's CPU load balance.
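For reference: Internal Traffic Policy is the `internalTrafficPolicy: Local` field in the Service spec, while Topology Aware Routing is enabled per Service with the `service.kubernetes.io/topology-mode: Auto` annotation (named `service.kubernetes.io/topology-aware-hints` before Kubernetes v1.27).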

Slide 36

Deploy Considerations
→ Next, Q2: How can we configure Envoy to route the traffic?

Slide 37

Load Balancing of Keys (Bucket and Object)
● We want to route traffic from Envoy to SCS consistently.
○ When we put an object to the N-th SCS, we want to get it from the N-th SCS.
● The easiest way to achieve that:
○ Manage a mapping from bucket/object to a backend id.
○ But the mapping would have to be sharded…
■ That means introducing a distributed metadata server: too complicated a solution for us.
● So we don't manage a mapping explicitly:
○ Use hash(bucket + "/" + object) to choose a backend server (see the sketch below).
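A sketch of the hash-based choice, with SHA-256 standing in for whatever hash SCS actually uses. Note that this is the naive modulo mapping; the next slide shows why it breaks down when the backend count changes.

choose_backend.py (sketch)
import hashlib

def choose_backend(bucket: str, obj: str, backends: list) -> str:
    # hash(bucket + "/" + object) decides the backend, so no explicit
    # bucket/object -> backend mapping has to be stored anywhere.
    digest = hashlib.sha256(f"{bucket}/{obj}".encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]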

Slide 38

Consistent Hashing
● Use (hash % number-of-backends) as the backend id?
○ When the number of backends changes, almost every key is remapped.
■ Typical example: node failure / installation.
○ A more sophisticated way → Consistent Hashing.
■ The bound on remapped keys is keys/backends.
● Envoy's lb_policy offers two consistent hashing algorithms: RING_HASH and MAGLEV (a ring-hash sketch follows).
(Diagram: RING_HASH places Backends 1-4 on a hash ring, and Hash(10) walks clockwise to its owner; MAGLEV uses a lookup table, e.g. Backend 1 owns slots {1, 6, 10}, Backend 2 {2, 5, 11}, Backend 3 {7, 8, 9}, Backend 4 {3, 4, 12}, and Hash(10) indexes into it.)
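A minimal ring-hash sketch, assuming SHA-256 and 100 virtual nodes per backend; Envoy's RING_HASH implementation differs in details.

ring_hash.py (sketch)
import bisect
import hashlib

def _h(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

class Ring:
    """RING_HASH sketch: each backend owns several points on a hash ring."""

    def __init__(self, backends: list, vnodes: int = 100):
        self.points = sorted(
            (_h(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first point at or after hash(key).
        i = bisect.bisect(self.points, (_h(key), ""))
        return self.points[i % len(self.points)][1]

If a backend is removed, only the keys on its arcs move to the next backend clockwise; but arc lengths are uneven, which is exactly the imbalance discussed on the next slide.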

Slide 39

Key distribution matters!
● The load balance of keys is also very important.
○ The length of an arc on the RING corresponds to the ratio of responsibility.
■ e.g. Backend 3 has 1.5x the responsibility of Backend 4.
○ It affects performance!
■ B3's CPU usage is 1.5x B4's, because B3 is 1.5x busier; this may result in longer latency.
■ The lifetime of B3's data is 1.5x shorter than B4's, because the cache capacity is the same, so deletion is more likely.
● We want to see consistent resource usage and lifetime.

Slide 40

RING_HASH vs MAGLEV → We use MAGLEV
● RING_HASH: load imbalance of up to 1.5x.
● MAGLEV: no load imbalance.
(Chart: object count per node is uneven under RING_HASH and nearly flat under MAGLEV. A table-population sketch follows.)
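A rough sketch of Maglev table population (Eisenbud et al., NSDI 2016); Envoy's MAGLEV implementation differs in details, and the hash functions and table size here are assumptions (the size must be prime).

maglev.py (sketch)
import hashlib

def _h(s: str, seed: int) -> int:
    return int.from_bytes(hashlib.sha256(f"{seed}:{s}".encode()).digest()[:8], "big")

def build_table(backends: list, size: int = 65537) -> list:
    # Each backend gets a permutation of table slots from (offset, skip).
    offsets = [_h(b, 0) % size for b in backends]
    skips = [_h(b, 1) % (size - 1) + 1 for b in backends]
    table = [None] * size
    next_j = [0] * len(backends)
    filled = 0
    while True:
        for i, b in enumerate(backends):
            # Round-robin: each backend claims its next unclaimed preferred
            # slot, so all backends own almost exactly size/len(backends) slots.
            while True:
                slot = (offsets[i] + next_j[i] * skips[i]) % size
                next_j[i] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == size:
                return table

def lookup(table: list, key: str) -> str:
    return table[_h(key, 2) % len(table)]

Because every backend claims slots at the same rate, the per-node object counts come out flat, unlike the uneven arcs of RING_HASH.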

Slide 41

The number of SCS in production

Slide 42

API calls / sec
(Chart: API request rate over time.) Peak: 37k requests / sec

Slide 43

Numbers of SCS: Aggregated Traffic
(Chart: aggregated throughput over time.) Peak: 75.1 GiB/s

Slide 44

Numbers of SCS
● Peak performance over the last 30 days:
○ 37k requests / sec
○ 75.1 GiB/s throughput
● We achieved this performance in the production environment with 55 backend servers and 82.5 TB of NVMe storage in total.
● Usage:
○ 268M objects
○ Response code statistics:
■ 200 OK (GET): 96.2%
■ 404 Not Found (GET): 0.9%
■ 201 Created (PUT): 2.9%

Slide 45

Summary

Slide 46

SCS Summary
Features
● Feature 1. Shared-nothing: Consistent Hashing with Envoy
● Feature 2. AuthN / AuthZ: Bound SA Token with the TokenReview API
● Feature 3. Transparent Cache: PFIO
Use cases in the Real World
● Case 1. AI/ML Dataset Loading
● Case 2. Large Model Deployment and Container Images
Optimization Techniques
● Tech 1. CLOS network optimization: Internal Traffic Policy / Topology-Aware Routing
● Tech 2. Consistent Hashing Algorithm: RING_HASH / MAGLEV
Supported by Cloud Native technologies: Kubernetes and Envoy
Internship members: @naoki9911, @ciffelia, @takonomura
