
Edge IoT system with NVIDIA Jetson managed by Rancher

Considerations for running Kubernetes on an NVIDIA Jetson Orin
Best Kubernetes configuration for edges
Container Image Management Techniques
CI/CD Pipeline for Container and Deployment
Data storage recommendations
Things to consider when splitting applications into microservices

yanoteturo

November 01, 2024

Transcript

  1. Self-introduction. Career: 10 years in system operation, 6 years in networking, 8 years in SI, 8 years in charge of OSS products (Nextcloud/Rancher). Other: a Rubyist making no progress at all, recently programming with generative AI. My first PC was an OKI if-800. YANO Tetsuro (tetsurow.yano), Stylez Inc.
  2. Today I'm talking about:
     • Considerations for running Kubernetes on an NVIDIA Jetson Orin
     • Best Kubernetes configuration for edges
     • Container Image Management Techniques
     • CI/CD Pipeline for Container and Deployment
     • Data storage recommendations
     • Things to consider when splitting applications into microservices
  3. There are a lot of factories in Japan. It is said that Japanese factories are being replaced by Chinese ones, but China has not yet overtaken Japan in the production of precision machinery and of machinery that does not break down over long periods. However, simple tasks such as inspecting products, making rounds, and replenishing parts, as well as handling minor malfunctions, reduce work efficiency.
  4. How does AI reduce the workload?
     • Anomaly detection and scratch detection
     • Detection of missing parts
     • Foreign object detection
     • Data collection on the environment and equipment
  5. More to manage makes it difficult. The number of devices running image AI diagnosis keeps increasing. A few units are fine, but 10+ units become unmanageable: more machines mean an increased operational workload and fewer operating hours.
  6. Moving from standalone PC servers to orchestration. Change the setup so that many machines can be centrally managed through orchestration; this requires a management system such as Kubernetes.
  7. Technology Stack. The proposed architecture: Intel servers running Linux (Ubuntu) host Rancher on the Rancher Kubernetes Engine (operation) alongside GitLab, GitLab Runner, Harbor, and Docker (development), forming the GitLab CI/CD pipeline for continuous integration; NVIDIA Jetson devices running Linux (Ubuntu L4T) run K3S with Harbor and Prometheus; applications are delivered to the edge with Rancher Continuous Delivery (Fleet) and managed through the Rancher UI.
  8. Orchestrate and centralise. Centrally manage multiple machines: many AI image-recognition units are brought under centralised management with an orchestration tool.
  9. Edge Computing. To collect and process data in real time, edge devices are installed in the field. The benefits of edge computing include reduced latency, improved data protection, and cost savings, which enables faster decision making and more efficient operations in the field. Whereas cloud computing sends all data to the cloud for processing, edge computing processes data at the edge and sends only the necessary data to the cloud.
  10. Edge Device Issues. There are various hurdles to overcome in order to introduce edge devices: How do I update them? The model's accuracy has gone down; what is going on? The price is high and costs need to be reduced. I want to replace a broken unit. How do they connect to the network?
  11. Select the least expensive hardware device. It is important to choose inexpensive equipment that is easy to replace: shift from high-priced to low-priced devices.
  12. What does NVIDIA Jetson do? It has an integrated GPU, so it is well suited for AI workloads such as robot automation, image recognition, autonomous vehicles, and voice response bots.
  13. Running Kubernetes with K3S on Jetson. Our recommendation for Kubernetes is K3S, which is optimised for edge devices: ideal for the edge, simple and secure, and optimised for ARM.
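     As a minimal sketch (not taken from the slides), K3S can typically be installed on the Jetson's Ubuntu L4T with the upstream install script; the exact options depend on your environment:

        # Install K3S as a single-node server; get.k3s.io is the standard installer
        curl -sfL https://get.k3s.io | sh -
        # Verify the node comes up Ready
        sudo k3s kubectl get nodes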
  14. Why use K3S on Jetson? Memory usage when running mnist on K3S: 614 MB of virtual memory and 418 MB of actual (resident) memory. A research paper from ABB Corporate Research Ladenburg, Germany, shows that K3S has the lowest memory usage among the lightweight Kubernetes distributions compared. https://programming-group.com/assets/pdf/papers/2023_Lightweight-Kubernetes-Distributions.pdf
  15. How to use a GPU with Kubernetes. Three technologies are required to make Kubernetes aware of GPUs: a container runtime, a device plugin, and the GPU Operator. The stack, from bottom to top: Linux OS, GPU driver (kernel module), container runtime, Kubernetes (K3S), device plugin, GPU Operator, and finally the pod running the GPU workload.
  16. In the container, hardware does not appear under /dev/. The container is abstracted as much as possible to be hardware independent, which gives you the "freedom" to run it anywhere. Running ls -la /dev/ inside a container shows almost no hardware devices, even though the machine itself has a GPU.
  17. Enable the GPU in the container runtime. NVIDIA provides a container runtime that can use GPUs: the stack shifts from the default container runtime on top of the Linux OS and GPU kernel driver to the NVIDIA Container Runtime, which passes the GPU driver through to containers.
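     A minimal sketch of wiring this up for K3S, assuming the NVIDIA container toolkit packages are available for your L4T release (the package name and the RuntimeClass approach are the common pattern, not taken from the slides):

        # Install the NVIDIA container toolkit so containerd can launch GPU containers
        sudo apt-get install -y nvidia-container-toolkit
        # K3S detects the NVIDIA runtime in its bundled containerd; expose it to pods
        # with a RuntimeClass (runtimeclass-nvidia.yaml):
        apiVersion: node.k8s.io/v1
        kind: RuntimeClass
        metadata:
          name: nvidia
        handler: nvidia
        # Apply it:  sudo k3s kubectl apply -f runtimeclass-nvidia.yaml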
  18. GPU-enabled containers in /dev/. Start a container with the following command:
     docker run -d --rm --gpus all ubuntu:latest
     Now ls -la /dev/ inside the container shows the nvidia-* devices.
  19. GPU on Kubernetes. However, Kubernetes (K3S) does not currently use Docker, so K3S on its own will not recognize the GPU. The node's resources as reported by Kubernetes show no GPU:
     Capacity:
       cpu:                16
       ephemeral-storage:  479079112Ki
       hugepages-1Gi:      0
       hugepages-2Mi:      0
       memory:             65626228Ki
       pods:               110
     Allocatable:
       cpu:                16
       ephemeral-storage:  466048159789
       hugepages-1Gi:      0
       hugepages-2Mi:      0
       memory:             65626228Ki
       pods:               110
     (no GPU found)
  20. Install the Kubernetes device plugin. Of the three technologies required to make Kubernetes aware of GPUs (container runtime, device plugin, GPU Operator), the next layer is the device plugin, which sits between Kubernetes (K3S) and the GPU driver in the stack.
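     A minimal sketch of installing the NVIDIA device plugin with Helm (the chart repository and name below are the upstream defaults; the release name and namespace are illustrative, not from the slides):

        # Add the NVIDIA device plugin Helm repository and install the plugin DaemonSet
        helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
        helm repo update
        helm install nvidia-device-plugin nvdp/nvidia-device-plugin --namespace kube-system
        # The plugin advertises nvidia.com/gpu as a schedulable resource on each GPU node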
  21. Show the GPU with the Kubernetes device plugin. Install the Kubernetes device plugin; the node's Capacity and Allocatable (cpu 16, ephemeral-storage, hugepages, memory 65626228Ki, pods 110, as before) now also report the GPU resource: GPU found.
  22. The GPU has several settings. The device plugin only exposes the GPU to Kubernetes; the GPU's functions are made available by the GPU Operator, which completes the stack alongside the container runtime and the device plugin.
  23. Kubernetes GPU Operator. Features enabled by the GPU Operator:
     • Automatic Node Labeling: automatically labels nodes with GPUs to simplify scheduling.
     • GPU Device Plugin Deployment: automatically deploys the necessary device plugins to nodes to make GPU resources available.
     • GPU Driver Management: manages the installation and updating of NVIDIA GPU drivers to enable GPU usage on nodes.
     • Monitoring and Alerting: monitors GPU usage and provides alerts if any issues arise.
     • GPU Metrics Collection: collects GPU usage and performance data for integration with monitoring tools.
     • Multi-version Support: supports different GPU setup versions (drivers and CUDA) to provide flexibility.
     • Automatic Updates by Operator: automatically updates components when new versions become available.
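     A minimal install sketch with Helm (the repository URL and chart name are the upstream defaults; on Jetson-class devices with an integrated GPU the Operator's options may need adjusting, so treat this as illustrative rather than as the deck's exact procedure):

        # Install the NVIDIA GPU Operator from NVIDIA's Helm repository
        helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
        helm repo update
        helm install gpu-operator nvidia/gpu-operator \
          --namespace gpu-operator --create-namespace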
  24. Getting used to the Ubuntu OS. Ubuntu is the default OS on Jetson devices, so familiarize yourself with Ubuntu. If low latency is needed, for example for diagnostic imaging, consider a real-time kernel. (A shift to SLE Micro 6.0 was considered, but Ubuntu remains the practical choice.)
  25. Build a stable network. A customer's story: "The factory doesn't have a network. The first step is to pull one in." Shift from no network, or a narrow one, in the factory to a stable network.
  26. A problem that actually occurred: a camera image acquisition timeout on the factory line. Setup: a Python application on an NVIDIA® Jetson Xavier NX (CONTEC DX-U1200) acquires camera images over a PoE connection through a non-intelligent switch (CONTEC SH-9008AT-POE) using a handmade LAN cable. Symptoms: image retrieval times out, the camera status and LAN link status are unknown, the problem persists when the camera is replaced, and other cameras have no problem. Possible causes considered: 1. a protocol error between the application and the camera; 2. a network failure between the camera and the switch; 3. a problem with the camera itself; 4. a connection error with the Jetson; 5. the camera stopping due to a LAN PoE failure.
  27. [Solved] Leave some slack in the wiring of the LAN cable. It is not good if the LAN cable bends 90 degrees at the connector. Give the cable extra length in front of the connector: make a loop, roughly the size of a ping-pong ball, by turning the cable one full turn before the connector. Reference: "Cat6A, the basics you can no longer ask about" (GIGA School feature), Hirano Tsushin Kizai Co., Ltd., https://www.hiranotsushin.jp/news/gigaschoolnavi/2020/000752.html
  28. Build container images for x86_64 and ARM. When developing for both x86_64 and ARM, there are two options (a build sketch follows below):
     • Option 1: multi-architecture build with QEMU emulation. Advantages: developers can build from the Dockerfile without worrying about architecture, and both container images are created in one build. Disadvantage: the build takes longer because two images are produced in one build.
     • Option 2: multiple builds using two kinds of runners, x86_64 and ARM. Advantage: each image is built natively on its own CPU. Disadvantages: you need both an x86 and an ARM runner, and you have to manage them.
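     A minimal sketch of option 1 with docker buildx and QEMU (the registry and image name are placeholders, not from the slides):

        # One-time: create a buildx builder that can target multiple platforms
        # (QEMU binfmt handlers must be registered on the build host)
        docker buildx create --name multiarch --use
        # Build for x86_64 and ARM64 in one go and push both images under the same tag
        docker buildx build --platform linux/amd64,linux/arm64 \
          -t registry.example.com/factory/image-ai:latest --push .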
  29. Build on x86_64 and ARM with two runners. Building with two runners is recommended: the pipeline runs an x86_64 build on an x86_64 runner and an ARM build on an ARM64 runner (a dedicated runner running on a Jetson). Workflow for the x86_64 runner:
       jobs:
         main:
           runs-on: ubuntu-latest
           steps:
             # ...(omitted)...
             # 5. setup buildx
             - name: Setup Docker Buildx
               uses: docker/setup-buildx-action@v1
             # 6. Build and Push
             - name: Build and push
               id: docker_build
               uses: docker/build-push-action@v2
               with:
                 context: .
                 file: ./Dockerfile
     The ARM64 workflow is the same except that it runs on the self-hosted runner on the Jetson:
       jobs:
         main:
           runs-on: self-hosted
           steps:
             # ...(omitted, same build steps as above)...
  30. Place a container image near the cluster. The network communication bandwidth into the factory is narrow, so place the container images close to the cluster: images built by the GitLab Runner are pushed to the GitLab Container Registry, a daily batch mirrors them to a Harbor registry inside the factory, and the K3S clusters pull from Harbor over the local wide-band network instead of pulling over the narrow link.
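     One way to make K3S pull through the in-factory registry is a containerd mirror entry in /etc/rancher/k3s/registries.yaml; the hostnames below are hypothetical:

        # /etc/rancher/k3s/registries.yaml on each Jetson node
        mirrors:
          "registry.gitlab.example.com":
            endpoint:
              - "https://harbor.factory.local"
        # Restart K3S after editing:  sudo systemctl restart k3s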
  31. What unit should I use to manage it? Kubernetes is a system for managing multiple servers as a group. What is the best architecture for grouping the servers: one cluster per production line, or one cluster per device?
  32. What unit should I use to manage it? Between one cluster per production line and one cluster per device, one cluster per device is the better way.
  33. Deploying and updating applications. Manually updating applications on dozens of machines would not be easy for administrators, and the operational load would keep increasing. Fleet is an automated distribution mechanism: shift from manual distribution to automatic distribution of the image-recognition AI across all units.
  34. What makes it easier? It enables operation of large-scale edge environments: manage multiple clusters from the Rancher GUI and distribute containers across multiple clusters grouped by product line (Line A, Line B, Line C).
  35. How to set up deployment with Fleet. 1. Prepare the manifest files in a GitLab repository. 2. Register the GitLab repository from Rancher. 3. Register the clusters with a cluster group in Rancher (for example, a group per product line). 4. Distribution then starts automatically.
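     A minimal sketch of the Fleet GitRepo resource that registering the repository in Rancher roughly corresponds to (the repository URL, path, and group name are hypothetical):

        apiVersion: fleet.cattle.io/v1alpha1
        kind: GitRepo
        metadata:
          name: line-a-apps
          namespace: fleet-default
        spec:
          repo: https://gitlab.example.com/factory/line-a-manifests.git
          branch: main
          paths:
            - manifests
          targets:
            - clusterGroup: line-a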
  36. Consider upgrading Kubernetes itself. Updating Kubernetes is often overlooked, but Rancher requires a suitable Kubernetes version: if you do not upgrade, you will not be able to use Rancher. (The slide shows a compatibility matrix of Rancher releases 2.6.0 through 2.8.5, with their release dates, end of maintenance, and end of life, against Kubernetes releases 1.20 through 1.29 and their active/maintenance support windows; the legend marks supported, unofficially supported, and "differs between RKE and RKE2/K3s" combinations, with deprecated versions in red. The support lifecycle of RKE2 and K3s follows upstream Kubernetes.)
  37. Rolling update of K3S using Rancher. Rancher can perform rolling updates of the Kubernetes clusters under its control: select "Kubernetes Version" in the Cluster Manager to start the update.
  38. Connect external devices and containers via IP. Use IP-connected cameras and audio devices whenever possible, and use WebSocket or RTSP for images and video: shift from USB and VGA connections to TCP/IP streaming.
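     As an illustrative sketch only (the camera address and stream path are hypothetical, and the codec is assumed to be H.264), an RTSP feed can be consumed with GStreamer:

        # Pull an H.264 RTSP stream from an IP camera and display it
        gst-launch-1.0 rtspsrc location=rtsp://192.168.10.21:554/stream latency=0 ! \
          rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! autovideosink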
  39. Split the application by view, processing, and control. Split View, Processing, and Control into separate containers, each loosely coupled through a queue (NATS in the slide's figure) using the pub/sub and subject model behind a queue wrapper: the camera feeds the Process container, the Control container talks to the PLC through a C++ library, and View, Process, and Control communicate only through the queue.
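     A minimal sketch of that loose coupling with the NATS CLI (the subject names and payload are hypothetical):

        # The Process container publishes an inspection result on a subject
        nats pub factory.line-a.inspection '{"camera":"cam-01","result":"NG"}'
        # The View and Control containers subscribe to the subjects they care about
        nats sub 'factory.line-a.>'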
  40. On-premises S3-compatible storage. S3-compatible object storage on-premises is useful: it is a place to store data (such as images) from the edge devices, and also a place to put model files for machine learning (deep learning models).
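     A minimal sketch of using such storage with the standard AWS CLI against an S3-compatible endpoint (the endpoint URL and bucket names are hypothetical):

        # Upload a trained model and a captured image to the on-premises S3-compatible store
        aws --endpoint-url https://s3.factory.local:9000 s3 cp model.onnx s3://models/line-a/model.onnx
        aws --endpoint-url https://s3.factory.local:9000 s3 cp capture-001.jpg s3://images/line-a/capture-001.jpg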
  41. Summary of the key presentation points:
     • NVIDIA Jetson devices for the edge
     • GPUs with Kubernetes require a container runtime and device plugins
     • Use Fleet to deploy your apps
     • Upgrade Kubernetes regularly
     • Choose the best way to separate your applications
     • S3-compatible storage is a convenient way to store data
  42. Contact Stylez for AI/IoT solutions. Interested in Kubernetes? Manage Kubernetes with Rancher, and please contact us: Stylez is an authorized partner of SUSE Rancher in Japan. Stylez Inc. http://stylez.co.jp