Slide 1

Slide 1 text

Edge IoT system with NVIDIA Jetson managed by Rancher — YANO Tetsuro, Stylez, 02/11/2024

Slide 2

Slide 2 text

Today's materials URL: https://bit.ly/

Slide 3

Slide 3 text

self-introduction 2 Career: 10 years in system operation, 6 years in networking, 8 years in SI,8 years in charge of OSS products (Nextcloud/Rancher) Other: Rubyist with no progress at all, recently programming with generative AI. My first PC was an OKI if-800.... YANO Tetsuro tetsurow.yano Stylez Inc.

Slide 4

Slide 4 text

Geeko — these puppets have been in my home for 15 years.

Slide 5

Slide 5 text

Today I'm talking about... ◆ Topics
• Considerations for running Kubernetes on an NVIDIA Jetson Orin
• The best Kubernetes configuration for the edge
• Container image management techniques
• CI/CD pipelines for containers and deployment
• Data storage recommendations
• Things to consider when splitting applications into microservices

Slide 6

Slide 6 text

What are the needs of factories in Japan?

Slide 7

Slide 7 text

There are a lot of factories in Japan.
• It is said that Japanese factories are being replaced by Chinese ones, but China has not yet overtaken Japan in the production of precision machinery and of machines that run for long periods without breaking down.
• However, simple tasks such as inspecting products, making rounds, and replenishing parts, as well as handling minor malfunctions, are reducing work efficiency.

Slide 8

Slide 8 text

How does AI reduce the workload?
• Anomaly detection and scratch detection
• Detection of missing parts
• Foreign object detection
• Data collection on the environment and equipment

Slide 9

Slide 9 text

More devices to manage makes things difficult.
• The number of devices running AI keeps increasing. A few image-AI diagnosis units are fine, but 10+ units are unmanageable by hand: more machines means an increased operational workload and fewer operating hours.

Slide 10

Slide 10 text

Moving from standalone PC servers to orchestration
• Change so that many machines can be centrally managed through orchestration.
• To do this, a management system is required: Kubernetes.

Slide 11

Slide 11 text

Platform architecture

Slide 12

Slide 12 text

Technology Stack
• Technology stack of the proposed architecture (diagram): Intel servers running Linux (Ubuntu) with Rancher Kubernetes Engine hosting GitLab, GitLab Runner, Rancher, Harbor, and the Rancher UI; NVIDIA Jetson devices running Linux (Ubuntu L4T) with K3S, FLEET, Harbor, Prometheus, Docker, and the apps. Continuous integration runs through the GitLab CI/CD pipeline; continuous delivery runs through Rancher (Fleet).

Slide 13

Slide 13 text

Orchestrate and centralise
• Centrally manage multiple machines: many AI image recognition units are brought under centralised management using an orchestration tool.

Slide 14

Slide 14 text

Why Edge Computing?

Slide 15

Slide 15 text

Edge Computing
• Edge devices are installed in the field to collect and process data in real time. The benefits of edge computing include reduced latency, improved data protection, and cost savings. This enables faster decision-making and more efficient operations in the field.
• (Diagram: cloud computing sends all data to the cloud for processing; edge computing processes data at the edge and sends only the necessary data to the cloud.)

Slide 16

Slide 16 text

But...

Slide 17

Slide 17 text

Edge Device Issues
• There are various hurdles to overcome when introducing edge devices: How do I update them? The model's accuracy has gone down — what's going on? The price is high — how do I reduce cost? I want to replace a broken unit. How do I connect them to the network?

Slide 18

Slide 18 text

Select the least expensive hardware device.
• It is important to choose inexpensive equipment that is easy to replace: shift from high-price to low-price devices.

Slide 19

Slide 19 text

What does the NVIDIA Jetson do?
• It has an integrated GPU, so it's perfect for AI workloads: robot automation, image recognition, autonomous vehicles, and voice-response bots.

Slide 20

Slide 20 text

Running Kubernetes with K3S on the Jetson
• Our recommendation for Kubernetes is K3S, which is optimised for edge devices: ideal for the edge, simple and secure, and optimised for ARM.

Slide 21

Slide 21 text

Why use K3S on the Jetson?
• Memory usage when running MNIST on K3S: 614 MB of virtual memory, 418 MB of resident memory.
• A research paper from ABB Corporate Research (Ladenburg, Germany) shows that K3S has the lowest memory usage: https://programming-group.com/assets/pdf/papers/2023_Lightweight-Kubernetes-Distributions.pdf

Slide 22

Slide 22 text

K3S (Kubernetes), Jetson and GPUs

Slide 23

Slide 23 text

How to use a GPU with Kubernetes
Three technologies are required to make Kubernetes aware of GPUs:
• Container Runtime
• Device Plugin
• GPU Operator
(Diagram: a pod running a GPU workload sits on Kubernetes (K3S) with the device plugin and container runtime, on top of the GPU driver kernel module in the Linux OS.)

Slide 24

Slide 24 text

In a container, hardware does not appear in /dev/
• The container is abstracted as much as possible to be hardware independent.
• This gives you the "freedom" to run it anywhere.
• Running `ls -la /dev/` inside a container shows almost no hardware under /dev/, even though this machine has a GPU.

Slide 25

Slide 25 text

Enable the GPU in the container runtime.
• NVIDIA provides a container runtime that can use GPUs: shift from the default container runtime to the NVIDIA Container Runtime (both sit between the container and the GPU driver kernel module in the Linux OS).
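On K3S, if the NVIDIA container runtime is installed before K3S starts, K3S detects it and registers an `nvidia` runtime handler in containerd; a RuntimeClass can then select it per pod. A minimal sketch (the handler name `nvidia` matches K3S's auto-detected containerd configuration):

```yaml
# RuntimeClass that lets a pod opt in to the NVIDIA container runtime.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # containerd runtime handler registered by K3S
```

A pod then sets `runtimeClassName: nvidia` in its spec to be started with GPU support.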

Slide 26

Slide 26 text

GPU-enabled containers in /dev/
• Start a container with the following command: docker run -d --rm --gpus all ubuntu:latest
• Now `ls -la /dev/` inside the container shows the nvidia-* devices.

Slide 27

Slide 27 text

GPUs in Kubernetes
• However, Kubernetes (K3S) does not currently use Docker, so Kubernetes (K3S) will not recognize the GPU. The node's resources show no GPU:
Capacity: cpu: 16, ephemeral-storage: 479079112Ki, hugepages-1Gi: 0, hugepages-2Mi: 0, memory: 65626228Ki, pods: 110
Allocatable: cpu: 16, ephemeral-storage: 466048159789, hugepages-1Gi: 0, hugepages-2Mi: 0, memory: 65626228Ki, pods: 110
→ no GPU found

Slide 28

Slide 28 text

Install the Kubernetes Device Plugin
Three technologies are required to make Kubernetes aware of GPUs:
• Container Runtime
• Device Plugin
• GPU Operator

Slide 29

Slide 29 text

Show the GPU with the Kubernetes Device Plugin
• After installing the Kubernetes device plugin, the node's resources now include the GPU:
Capacity: cpu: 16, ephemeral-storage: 479079112Ki, hugepages-1Gi: 0, hugepages-2Mi: 0, memory: 65626228Ki, pods: 110
Allocatable: cpu: 16, ephemeral-storage: 466048159789, hugepages-1Gi: 0, hugepages-2Mi: 0, memory: 65626228Ki, pods: 110
→ GPU found
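Once the device plugin advertises the GPU resource, a pod requests it through the normal resource mechanism. A minimal sketch (the pod name and image tag are illustrative placeholders, not from the slides):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test               # illustrative name
spec:
  runtimeClassName: nvidia     # select the NVIDIA container runtime
  containers:
    - name: cuda
      image: nvcr.io/nvidia/l4t-base:r35.1.0   # placeholder L4T image tag
      resources:
        limits:
          nvidia.com/gpu: 1    # request one GPU exposed by the device plugin
```

The scheduler places this pod only on nodes where the device plugin has registered an available `nvidia.com/gpu` resource.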

Slide 30

Slide 30 text

The GPU has several settings
• The device plugin only exposes the GPU; GPU functions are made available by the GPU Operator.
• Container Runtime
• Device Plugin
• GPU Operator

Slide 31

Slide 31 text

Kubernetes GPU Operator
• Features enabled by the GPU Operator:
• Automatic node labeling — automatically labels nodes with GPUs to simplify scheduling.
• GPU device plugin deployment — automatically deploys the necessary device plugins to nodes to make GPU resources available.
• GPU driver management — manages the installation and updating of NVIDIA GPU drivers to enable GPU usage on nodes.
• Monitoring and alerting — monitors GPU usage and provides alerts if any issues arise.
• GPU metrics collection — collects GPU usage and performance data for integration with monitoring tools.
• Multi-version support — supports different GPU setup versions (drivers and CUDA) to provide flexibility.
• Automatic updates by the operator — automatically updates components when new versions become available.

Slide 32

Slide 32 text

About things other than the NVIDIA Jetson

Slide 33

Slide 33 text

Getting used to Ubuntu
• Ubuntu is the default OS on Jetson devices, so familiarize yourself with Ubuntu.
• If low latency is needed, such as for diagnostic imaging, consider a real-time kernel.
• (SLE Micro 6.0? Hmm.)

Slide 34

Slide 34 text

Build a stable network
• Here is a customer's story: "The factory doesn't have a network. The first step is to pull one in." Shift from no network in the factory to a stable network.

Slide 36

Slide 36 text

[Solved] Leave some slack in the wiring of the LAN cable.
• It is not good if the LAN cable bends 90 degrees at the connector. It is recommended to give the LAN cable extra length in front of the connector: make a loop by turning the cable one full turn before the connector, about the size of a ping-pong ball.
• Reference: "Cat6A basics you can no longer ask about" (GIGA School feature), Hirano Tsushin Kizai Co., Ltd. — https://www.hiranotsushin.jp/news/gigaschoolnavi/2020/000752.html

Slide 37

Slide 37 text

Building Container Images

Slide 38

Slide 38 text

Build container images for x86_64 and ARM
• When developing for both x86_64 and ARM, there are two options:
1. A multi-arch build with QEMU emulation — advantages: developers can build from a Dockerfile without worrying about architecture, and both container images are created in one build; disadvantage: build time is longer because two images are produced in one build.
2. Multiple builds using two types of runners (x86_64 and ARM) — advantage: each image is built natively on its own CPU; disadvantage: you need to run and manage both an x86 and an ARM runner.
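Option 1 can be sketched as a GitLab CI job that builds both architectures in one pass with Docker Buildx — a sketch assuming a Docker-in-Docker executor; the job name and image tag are illustrative, while `CI_REGISTRY_*` are GitLab's predefined CI variables:

```yaml
build-multiarch:
  image: docker:24
  services:
    - docker:24-dind
  script:
    # Register QEMU binfmt handlers so ARM images can be built on x86_64
    - docker run --privileged --rm tonistiigi/binfmt --install arm64
    # Create and use a buildx builder instance
    - docker buildx create --use
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    # One build produces and pushes both amd64 and arm64 images
    - docker buildx build --platform linux/amd64,linux/arm64 -t "$CI_REGISTRY_IMAGE:latest" --push .
```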

Slide 39

Slide 39 text

Build on x86_64 and ARM with two runners
• Building with two runners is recommended: the pipeline runs the x86_64 build on an x86_64 runner and the ARM build on an ARM64 runner (a dedicated runner running on a Jetson). The two workflow files differ only in `runs-on` (`ubuntu-latest` for x86_64, `self-hosted` for the ARM64 runner) and are otherwise identical:

jobs:
  main:
    runs-on: ubuntu-latest   # or: self-hosted (the ARM64 runner on the Jetson)
    steps:
      <<<snip>>>
      # 5. setup buildx
      - name: Setup Docker Buildx
        uses: docker/setup-buildx-action@v1
      # 6. Build and Push
      - name: Build and push
        id: docker_build
        uses: docker/build-push-action@v2
        with:
          context: .
          file: ./Dockerfile

Slide 40

Slide 40 text

Container Image Management Techniques

Slide 41

Slide 41 text

Place the container image near the cluster.
• The network communication bandwidth in the factory is narrow, so place the container image close to the cluster.
• GitLab, the GitLab Runner, and the GitLab Container Registry sit on the wide-band side; a daily batch job mirrors images to a Harbor registry on the factory's narrow-band side, from which the K3S clusters pull and run containers.
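On the K3S side, image pulls can be redirected to the in-factory Harbor mirror through K3S's registry configuration — a sketch with illustrative hostnames:

```yaml
# /etc/rancher/k3s/registries.yaml (hostnames are illustrative)
mirrors:
  "gitlab.example.com":               # upstream GitLab container registry
    endpoint:
      - "https://harbor.factory.local" # Harbor mirror inside the factory
```

With this in place, pods can keep referencing images by their upstream name while K3S fetches them over the local, wide-band link to Harbor.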

Slide 42

Slide 42 text

Best Kubernetes configuration for the edge.

Slide 43

Slide 43 text

What unit should I use to manage it?
• Kubernetes is a system for managing multiple servers as a group.
• What is the best architecture for grouping servers: one cluster per production line, or one cluster per device?

Slide 44

Slide 44 text

What unit should I use to manage it?
• Of the two options — one cluster per production line, or one cluster per device — one cluster per device is the better way.

Slide 45

Slide 45 text

Application deployment

Slide 46

Slide 46 text

Deploying and updating applications
• Manually updating applications on dozens of machines is not easy for administrators; manual distribution increases the operational load.
• FLEET provides an automated distribution mechanism: shift from manual to automatic distribution.

Slide 47

Slide 47 text

What makes it easier?
• Fleet enables operation of large-scale edge environments: manage multiple clusters from the Rancher GUI and distribute containers across multiple clusters (e.g. product lines A, B, and C).

Slide 48

Slide 48 text

How to set up deployment with Fleet
1. Prepare the manifest files in a GitLab repository.
2. Register the GitLab repository from Rancher.
3. Register the clusters with a cluster group in Rancher (e.g. a group for production line A).
4. Distribution starts automatically.
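Steps 1–3 above correspond to a Fleet `GitRepo` resource — a minimal sketch; the repository URL, path, and cluster group name are illustrative:

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: line-a-apps                 # illustrative name
  namespace: fleet-default          # Fleet's namespace for downstream clusters
spec:
  repo: https://gitlab.example.com/factory/manifests  # step 1: manifest repo
  branch: main
  paths:
    - line-a                        # directory containing the manifests
  targets:
    - clusterGroup: production-line-a   # step 3: cluster group in Rancher
```

Once applied, Fleet watches the repository and rolls out changes to every cluster in the group automatically (step 4).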

Slide 49

Slide 49 text

Update & Upgrade

Slide 50

Slide 50 text

Consider upgrading Kubernetes itself
• Updating Kubernetes itself is often overlooked.
• To use Rancher, you need a suitable (supported) version of Kubernetes; if you do not upgrade Kubernetes, you will not be able to use Rancher.
• (Slide: a compatibility matrix of Rancher releases 2.6.0–2.8.5 against Kubernetes 1.20–1.29, with release, end-of-maintenance, and end-of-life dates; deprecated versions marked in red, and notes that RKE and RKE2/K3s support can differ. The support lifecycles of RKE2 and K3s follow upstream Kubernetes.)

Slide 51

Slide 51 text

Rolling updates of K3S using Rancher
• Rancher can perform rolling updates of the Kubernetes clusters under its control.
• You can update by selecting "Kubernetes Version" from the Cluster Manager.

Slide 52

Slide 52 text

Tips for building edge applications

Slide 53

Slide 53 text

Connect external devices and containers via IP.
• Use IP-connected cameras and audio devices whenever possible; avoid USB and VGA connections.
• Use WebSocket or RTSP for images and video.

Slide 54

Slide 54 text

Split the application by view, processing, and control
• Split view, processing, and control into separate containers, each loosely coupled with a queue (NATS in the figure below).
• (Diagram: the camera feeds the processing container; view and control containers communicate over a queue wrapper using a pub/sub subject model; the control container talks to the PLC via a C++ library.)
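The loose coupling described above can be illustrated with a minimal in-process pub/sub sketch. In production this role is played by a broker such as NATS; the `Bus` class here is purely illustrative, not a real library API:

```python
import queue

class Bus:
    """Minimal in-process stand-in for a pub/sub broker such as NATS."""

    def __init__(self):
        self.subscribers = {}  # subject -> list of subscriber queues

    def subscribe(self, subject):
        """Register interest in a subject; returns a queue of messages."""
        q = queue.Queue()
        self.subscribers.setdefault(subject, []).append(q)
        return q

    def publish(self, subject, msg):
        """Deliver the message to every subscriber of this subject."""
        for q in self.subscribers.get(subject, []):
            q.put(msg)

bus = Bus()
frames = bus.subscribe("camera.frames")        # "processing" container side
bus.publish("camera.frames", {"frame_id": 1})  # capture/"view" side
print(frames.get())  # → {'frame_id': 1}
```

Because the publisher only knows the subject name, the view, processing, and control pieces can be deployed, updated, and restarted independently — the property the slide is after.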

Slide 55

Slide 55 text

Tips on data storage

Slide 56

Slide 56 text

On-premises S3-compatible storage
• S3-compatible object storage on-premises is useful:
• as a place to store data (e.g. images) from the edge devices, and
• as a place to put model files (deep learning models) for machine learning.
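One common way to provide S3-compatible storage on-premises is MinIO; a minimal single-node Deployment sketch (the name, credentials, and use of `emptyDir` are illustrative — a real deployment needs a PersistentVolumeClaim and proper secrets):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio                     # illustrative name
spec:
  replicas: 1
  selector:
    matchLabels: { app: minio }
  template:
    metadata:
      labels: { app: minio }
    spec:
      containers:
        - name: minio
          image: minio/minio
          args: ["server", "/data", "--console-address", ":9001"]
          env:
            - name: MINIO_ROOT_USER
              value: "admin"       # placeholder credentials
            - name: MINIO_ROOT_PASSWORD
              value: "change-me"   # placeholder credentials
          ports:
            - containerPort: 9000  # S3 API
            - containerPort: 9001  # web console
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          emptyDir: {}             # replace with a PVC in production
```

Edge devices and training jobs can then use any S3 client against port 9000.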

Slide 57

Slide 57 text

Summary

Slide 58

Slide 58 text

Summary of the key presentation points
• The NVIDIA Jetson is a good device for the edge.
• GPUs with Kubernetes require a container runtime and a device plugin.
• Use Fleet to deploy your apps.
• Run Kubernetes upgrades regularly.
• Separate applications appropriately (view, processing, control).
• S3-compatible storage is a convenient way to store data.

Slide 59

Slide 59 text

Contact Stylez for AI/IoT solutions
• Interested in Kubernetes? Manage Kubernetes with Rancher — please contact us.
• Stylez Inc. is an authorized SUSE Rancher partner in Japan: http://stylez.co.jp