Edge IoT system with NVIDIA Jetson managed by Rancher

Considerations for running Kubernetes on an NVIDIA Jetson Orin
Best Kubernetes configuration for the edge
Container image management techniques
CI/CD pipeline for containers and deployment
Data storage recommendations
Things to consider when splitting applications into microservices

yanoteturo

November 01, 2024
Transcript

  1. Self-introduction. Career: 10 years in system operation, 6 years in networking,
     8 years in SI, 8 years in charge of OSS products (Nextcloud/Rancher).
     Other: a Rubyist making no progress at all, recently programming with generative AI.
     My first PC was an OKI if-800. YANO Tetsuro (tetsurow.yano), Stylez Inc.
  2. Today I'm talking about:
     • Considerations for running Kubernetes on an NVIDIA Jetson Orin
     • Best Kubernetes configuration for the edge
     • Container image management techniques
     • CI/CD pipeline for containers and deployment
     • Data storage recommendations
     • Things to consider when splitting applications into microservices
  3. There are a lot of factories in Japan. It is said that Japanese factories are being
     replaced by Chinese ones, but China has not yet overtaken Japan in the production of
     precision machinery and of machines that do not break down over long periods.
     However, simple tasks such as inspecting products, making rounds, and replenishing
     parts, as well as handling minor malfunctions, reduce work efficiency.
  4. How does AI reduce the workload? Anomaly detection and scratch detection, detection
     of missing parts, foreign object detection, and data collection on the environment
     and equipment.
  5. More to manage makes it difficult. As the number of devices running image AI
     diagnosis increases, a handful of units is fine, but 10+ units become unmanageable:
     more machines mean an increased operational workload and fewer operating hours.
  6. Moving from standalone PC servers to orchestration. Change things so that many
     machines can be centrally managed through orchestration; to do this, a management
     system such as Kubernetes is required.
  7. Technology Stack. Technology stack of the proposed architecture: on the
     development/operation side, Intel servers running Linux (Ubuntu) host Rancher on the
     Rancher Kubernetes Engine together with GitLab, GitLab Runner, and Harbor; on the
     edge side, NVIDIA Jetson devices running Linux (Ubuntu L4T) run K3S, Harbor,
     Prometheus, and Docker. GitLab provides the CI/CD pipeline (continuous integration),
     and Rancher with Fleet provides continuous delivery to the edge clusters.
  8. Orchestrate and centralise. Centrally manage multiple machines: many AI image
     recognition units are managed as one group using an orchestration tool.
  9. Edge Computing. To collect and process data in real time, edge devices are installed
     in the field. The benefits of edge computing include reduced latency, improved data
     protection, and cost savings, which enables faster decision making and more
     efficient operations in the field. Cloud computing sends all data to the cloud for
     processing; edge computing processes data at the edge and sends only the necessary
     data to the cloud.
  10. Edge Device Issues. There are various hurdles to overcome when introducing edge
      devices: How do I update them? The model's accuracy has gone down, what's going on?
      The price is high and costs must be reduced. I want to replace a broken unit.
      What about the network, and how do the devices connect?
  11. Select the least expensive hardware device. It is important to choose inexpensive
      equipment that is easy to replace: shift from high-priced to low-priced devices.
  12. What does NVIDIA Jetson do? It has an integrated GPU, so it is perfect for AI
      workloads: robot automation, image recognition, autonomous vehicles, and voice
      response bots.
  13. Running Kubernetes with K3S on Jetson. Our recommendation for Kubernetes is K3S,
      which is optimised for edge devices: ideal for the edge, simple and secure, and
      optimised for ARM.
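      As a minimal sketch (assuming the standard upstream install script and a
      single-node setup; pin the channel/version to match the Rancher compatibility
      matrix shown later), installing K3S on a Jetson looks roughly like this:

        # Install K3S as a single-node server (requires internet access from the Jetson)
        curl -sfL https://get.k3s.io | sh -

        # Verify that the node is Ready
        sudo k3s kubectl get nodes -o wide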
  14. Why use K3S on Jetson? Memory usage when running MNIST on K3S: 614 MB of virtual
      memory and 418 MB of actual memory. A research paper from ABB Corporate Research
      (Ladenburg, Germany) shows that K3S has the lowest memory usage among the
      lightweight Kubernetes distributions compared.
      https://programming-group.com/assets/pdf/papers/2023_Lightweight-Kubernetes-Distributions.pdf
  15. How to use a GPU with Kubernetes. Three technologies are required to make
      Kubernetes aware of GPUs: the container runtime, the device plugin, and the
      GPU Operator. The stack runs from the Linux OS and the GPU driver (kernel module),
      through the container runtime and device plugin, up to Kubernetes (K3S) and the
      pod running the GPU workload.
  16. In the container, hardware does not appear in /dev/. The container is abstracted
      as much as possible to be hardware independent, which gives you the "freedom" to
      run it anywhere. As a result there is almost no hardware under /dev/ inside a
      container (ls -la /dev/ in the container), even though this machine has a GPU.
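      A quick way to see this yourself (a sketch; the Ubuntu image tag is only an
      example, and the GPU-related device-node names vary by platform):

        # Inside a plain container, /dev/ contains almost no hardware device nodes
        docker run --rm ubuntu:22.04 ls -la /dev/

        # On the Jetson host itself, the GPU-related device nodes are visible
        ls -la /dev/ | grep -i nv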
  17. Enable the GPU in the container runtime. NVIDIA provides a container runtime that
      can use GPUs: shift from the default container runtime on top of the Linux OS and
      GPU driver (kernel module) to the NVIDIA Container Runtime.
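      For Docker this typically means registering the NVIDIA runtime in
      /etc/docker/daemon.json; the sketch below assumes the NVIDIA Container Toolkit is
      already installed and that the runtime binary is on the default path:

        {
          "default-runtime": "nvidia",
          "runtimes": {
            "nvidia": {
              "path": "nvidia-container-runtime",
              "runtimeArgs": []
            }
          }
        }

      After editing the file, restart Docker (sudo systemctl restart docker) so the new
      default runtime takes effect.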
  18. GPU-enabled containers in /dev/. Start a container with the following command:
      docker run -d --rm --gpus all ubuntu:latest
      Now the nvidia-* entries are present (ls -la /dev/ in the container shows nvidia-*).
  19. GPU on Kubernetes. However, Kubernetes (K3S) does not currently use Docker, so
      Kubernetes (K3S) will not recognize the GPU: the node's Capacity and Allocatable
      sections list cpu, ephemeral-storage, hugepages, memory, and pods, but no GPU is
      found.
  20. Install the Kubernetes Device Plugin. Of the three technologies required to make
      Kubernetes aware of GPUs (container runtime, device plugin, GPU Operator), the next
      step is the device plugin, which sits between the container runtime and
      Kubernetes (K3S) on top of the Linux OS and GPU driver (kernel module).
  21. Show the GPU with the Kubernetes Device Plugin. After installing the Kubernetes
      Device Plugin, the node's Capacity and Allocatable sections include the GPU in
      addition to cpu, ephemeral-storage, hugepages, memory, and pods: the GPU is found.
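      Once the device plugin advertises the GPU, a workload can request it with the
      standard extended-resource syntax. A minimal sketch (the image tag is only an
      example; nvidia.com/gpu is the resource name advertised by the NVIDIA device
      plugin):

        apiVersion: v1
        kind: Pod
        metadata:
          name: gpu-test
        spec:
          restartPolicy: Never
          containers:
            - name: gpu-test
              image: ubuntu:22.04                      # example image
              command: ["sh", "-c", "ls -la /dev/"]    # check that GPU device nodes are injected
              resources:
                limits:
                  nvidia.com/gpu: 1                    # request one GPU from the device plugin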
  22. The GPU has several settings. The device plugin only exposes the GPU; the GPU's
      functions are made available by the GPU Operator, the third of the three
      technologies (container runtime, device plugin, GPU Operator).
  23. Kubernetes GPU Operator. Features enabled by the GPU Operator:
      Automatic Node Labeling: automatically labels nodes with GPUs to simplify scheduling.
      GPU Device Plugin Deployment: automatically deploys the necessary device plugins to
      nodes to make GPU resources available.
      GPU Driver Management: manages the installation and updating of NVIDIA GPU drivers
      to enable GPU usage on nodes.
      Monitoring and Alerting: monitors GPU usage and provides alerts if any issues arise.
      GPU Metrics Collection: collects GPU usage and performance data for integration
      with monitoring tools.
      Multi-version Support: supports different GPU setup versions (drivers and CUDA) to
      provide flexibility.
      Automatic Updates by Operator: automatically updates components when new versions
      become available.
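      The GPU Operator is normally installed with Helm; a minimal sketch using the public
      NVIDIA chart repository (the namespace and flags are typical defaults, adjust to
      your environment):

        # Add the NVIDIA Helm repository and install the GPU Operator
        helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
        helm repo update
        helm install gpu-operator nvidia/gpu-operator \
          --namespace gpu-operator --create-namespace --wait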
  24. Getting used to the Ubuntu OS. Ubuntu is the default OS on Jetson devices, so
      familiarize yourself with Ubuntu (a shift to SLE Micro 6.0 is a possible future
      option). If low latency is needed, for example for diagnostic imaging, consider a
      real-time kernel depending on the requirements.
  25. Build a stable network. Here is a customer's story: "The factory doesn't have a
      network. The first step is to pull one in." Shift from a factory with no (or a
      narrow) network to a stable network.
  26. A problem that actually occurred. Camera image acquisition timed out on the factory
      line: a Python application on an NVIDIA Jetson Xavier NX could not retrieve images
      from a CONTEC DX-U1200 camera connected over PoE through a CONTEC SH-9008AT-POE
      non-intelligent switch with a handmade LAN cable. Possible causes considered:
      1. a protocol error between the application and the camera; 2. a network failure
      between the camera and the switch; 3. a problem with the camera itself; 4. a
      connection error with the Jetson; 5. the camera stopping due to a PoE failure on
      the LAN. The camera status and LAN link status were unknown, the problem remained
      when the camera was replaced, and the other cameras had no problem.
  27. [Solved] Leave some slack in the wiring of the LAN cable. It is not good if the LAN
      cable bends 90 degrees at the connector. It is recommended to leave extra length in
      front of the connector: make a loop by turning the LAN cable one full turn in front
      of the connector, about the size of a ping-pong ball.
      Reference: "Cat6A, in case you were afraid to ask" (GIGA School feature), Hirano
      Tsushin Kizai Co., Ltd. https://www.hiranotsushin.jp/news/gigaschoolnavi/2020/000752.html
  28. Build container images for x86_64 and ARM. When developing for both x86_64 and ARM,
      there are two options: 1. a multi-arch build with QEMU emulation, or 2. separate
      builds on x86_64 and ARM runners (a sketch of option 1 follows below).
      Multi-architecture build with QEMU emulation. Advantages: developers can build from
      a Dockerfile without worrying about architecture, and two container images are
      created in one build. Disadvantages: build time is longer because two images are
      made in one build.
      Multiple builds using two types of runners (x86_64 and ARM). Advantages: each image
      is built on its native CPU. Disadvantages: you need to run and manage both an x86
      and an ARM runner.
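      Option 1 (QEMU emulation) looks roughly like this with docker buildx; the registry
      and image name are placeholders:

        # One-time setup: register QEMU binfmt handlers and create a buildx builder
        docker run --privileged --rm tonistiigi/binfmt --install all
        docker buildx create --use --name multiarch

        # Build and push a multi-architecture image (amd64 + arm64) in one step
        docker buildx build \
          --platform linux/amd64,linux/arm64 \
          -t registry.example.com/myapp:latest \
          --push .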
  29. Build on x86_64 and ARM with two runners. Building with two runners is recommended:
      the pipeline runs an x86_64 build job and an ARM build job, the latter on a
      dedicated runner hosted on a Jetson. The workflow for the x86_64 runner:

        jobs:
          main:
            runs-on: ubuntu-latest
            steps:
              # ...omitted...
              # 5. Set up buildx
              - name: Setup Docker Buildx
                uses: docker/setup-buildx-action@v1
              # 6. Build and push
              - name: Build and push
                id: docker_build
                uses: docker/build-push-action@v2
                with:
                  context: .
                  file: ./Dockerfile

      The ARM64 runner uses the same workflow, except that the job runs on the
      self-hosted Jetson runner (runs-on: self-hosted).
  30. Place a container image near the cluster. The network bandwidth into the factory is
      narrow, so place the container images close to the cluster: the GitLab Runner
      builds and pushes containers to the GitLab Container Registry over the wide-band
      network, a daily batch mirrors the images to a Harbor registry inside the factory,
      and the K3S clusters pull from that Harbor over the narrow band.
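      On the K3S side, image pulls can be redirected to the in-factory Harbor through the
      registries configuration; a sketch assuming hypothetical registry hostnames:

        # /etc/rancher/k3s/registries.yaml (restart K3S after editing)
        mirrors:
          "registry.gitlab.example.com":
            endpoint:
              - "https://harbor.factory.local"   # in-factory Harbor mirror
        configs:
          "harbor.factory.local":
            tls:
              insecure_skip_verify: true         # only if Harbor uses a self-signed certificate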
  31. What unit should I use to manage it? Kubernetes is a system for managing multiple
      servers as a group. What is the best architecture for grouping servers into
      clusters: one production line per cluster, or one device per cluster?
  32. What unit should I use to manage it? Of the two options, one production line per
      cluster or one device per cluster, one device per cluster is the better way.
  33. Deploying and updating applications. Manually updating applications on dozens of
      machines would not be easy for administrators; Fleet provides an automated
      distribution mechanism. Shift from manual distribution, which increases the
      operational load, to automatic distribution.
  34. What makes it easier? Fleet enables the operation of large-scale edge environments:
      manage multiple clusters from the Rancher GUI and distribute containers across
      multiple clusters (product lines A, B, and C).
  35. How to set up a deployment with Fleet. 1. Prepare the manifest files in a GitLab
      repository. 2. Register the GitLab repository from Rancher. 3. Register the
      clusters with a cluster group in Rancher (for example, the product line A cluster
      group). 4. The distribution starts automatically.
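      In Rancher, step 2 corresponds to creating a Fleet GitRepo resource (the UI creates
      it for you); a sketch with placeholder repository URL, path, and cluster-group name:

        apiVersion: fleet.cattle.io/v1alpha1
        kind: GitRepo
        metadata:
          name: edge-apps
          namespace: fleet-default          # Fleet workspace for downstream clusters
        spec:
          repo: https://gitlab.example.com/factory/edge-apps.git
          branch: main
          paths:
            - manifests                     # directory containing the Kubernetes manifests
          targets:
            - clusterGroup: product-line-a  # cluster group registered in Rancher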
  36. Consider upgrading Kubernetes itself. Updating Kubernetes is often overlooked. To
      use Rancher, you need a suitable version of Kubernetes; if you do not upgrade, you
      will not be able to use Rancher. (Compatibility matrix: Rancher releases 2.6.0
      through 2.8.5 with their release, end-of-maintenance, and end-of-life dates, mapped
      against the Kubernetes releases 1.20 through 1.29 that each supports; the support
      lifecycle of RKE2 and K3S follows upstream Kubernetes.)
  37. Rolling update of K3S using Rancher. Rancher can be used to perform rolling updates
      of the Kubernetes clusters under its control: you can update by selecting
      "Kubernetes Version" from the Cluster Manager.
  38. Connect external devices and containers via IP. Use IP-connected cameras and audio
      devices whenever possible, and use WebSocket or RTSP for images and video. Shift
      from USB and VGA connections to TCP/IP streaming: no USB connection, use an IP
      connection.
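      For example, an RTSP camera stream can be consumed inside a container with a
      standard GStreamer pipeline (the camera URL is a placeholder, and the decoder
      element depends on the platform):

        # Pull an H.264 RTSP stream, decode it, and discard the frames; replace fakesink
        # with your processing element
        gst-launch-1.0 rtspsrc location=rtsp://192.168.10.21/stream latency=200 ! \
          rtph264depay ! h264parse ! avdec_h264 ! videoconvert ! fakesink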
  39. Split the application by view, processing, and control. Split view, processing, and
      control into separate containers, each loosely coupled through a queue (NATS in the
      figure) using a pub/sub model with subjects and a queue wrapper; in the figure, the
      process container handles the camera images and the control container communicates
      with the PLC through a C++ library.
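      A minimal sketch of the loose coupling, assuming a hypothetical in-cluster NATS
      service and placeholder image names; the view and control deployments would follow
      the same pattern as the process deployment:

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: nats
        spec:
          replicas: 1
          selector: { matchLabels: { app: nats } }
          template:
            metadata: { labels: { app: nats } }
            spec:
              containers:
                - name: nats
                  image: nats:2.10                   # example NATS server image tag
                  ports: [ { containerPort: 4222 } ]
        ---
        apiVersion: v1
        kind: Service
        metadata:
          name: nats
        spec:
          selector: { app: nats }
          ports: [ { port: 4222, targetPort: 4222 } ]
        ---
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: process
        spec:
          replicas: 1
          selector: { matchLabels: { app: process } }
          template:
            metadata: { labels: { app: process } }
            spec:
              containers:
                - name: process
                  image: registry.example.com/process:latest  # placeholder image
                  env:
                    - name: NATS_URL
                      value: nats://nats:4222       # subscribe to frames, publish results here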
  40. On-premises S3-compatible storage. S3-compatible object storage on-premises is
      useful: it is a place to store data (such as images) from the edge devices and also
      a place to put model files (deep learning models) for machine learning.
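      Any S3-compatible store (MinIO, for example) can be driven with the standard AWS
      CLI by pointing it at the on-premises endpoint; the endpoint, bucket, and file
      names below are placeholders:

        # Upload a captured image and download a model file from the on-prem S3 storage
        aws --endpoint-url https://s3.factory.local s3 cp ./capture.jpg s3://edge-data/line-a/capture.jpg
        aws --endpoint-url https://s3.factory.local s3 cp s3://models/defect-detector/model.onnx ./model.onnx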
  41. Summary of the key presentation points. Use an NVIDIA Jetson device for the edge.
      GPUs with Kubernetes require a container runtime and device plugin. Use Fleet to
      deploy your apps. Run Kubernetes upgrades regularly. Find the best separation of
      applications. S3-compatible storage is a convenient way to store data.
  42. Contact Stylez for AI/IoT solutions. Interested in Kubernetes? Manage Kubernetes
      with Rancher. Please contact us: Stylez is an authorized partner of SUSE Rancher in
      Japan. Stylez Inc. http://stylez.co.jp