
Recursion - Meetup - Presentation - June 19th.pdf


cncf-canada-meetups

June 19, 2025

Transcript

  1. Presenter: Jamon Camisso Email: [email protected] Location: Recursion Toronto Date: 2025-06-19

    19:15 UTC-4 CNCF Toronto June Meetup: Sandboxing pods with runc
  2. Overview – Privileged pods are easy – VMs provide great

    isolation ◦ Kata Containers ◦ Firecracker ◦ KubeVirt – But?
  3. Overview – Privileged pods are easy – Secure, privileged, container

    pods can be too (with work) – runc + OCI spec allows multiple implementations, e.g. ◦ crun (podman, cri-o) ◦ gVisor (GKE, App Engine, Cloud Run) ◦ sysbox (Docker Enhanced Container Isolation) ◦ youki (experimental)
  4. gVisor – gVisor (architecture) – Effectively a userspace kernel written

    in Go ◦ Implements most syscalls like a PV hypervisor would – 2 components ◦ Sentry (main runsc+containerd interface) ◦ Gofer (runs more privileged system calls)
  5. sysbox – Uses all Linux namespaces ◦ cgroups ▪ resource

    allocation, needed for containers ◦ userns ▪ Remaps root in container to non-root on host – Plus all others like mount, IPC etc.
  6. But? – Anjali, Tyler Caraza-Harter, and Michael M. Swift. 2020.

    Blending Containers and Virtual Machines: A Study of Firecracker and gVisor. In 16th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE ’20), March 17, 2020, Lausanne, Switzerland. ACM, New York, NY, USA, 13 pages. – https://doi.org/10.1145/3381052.3381315 – Tl;dr in 2020 both ran more kernel code than plain runc
  7. Demo – Let’s try gVisor ◦ Run a privileged pod

    ◦ Exec in, look around, do some things ◦ Check ps from the host
  8. Demo – Let’s focus on sysbox because: ◦ I haven’t

    looked into gVisor escapes ◦ gVisor takes some setup, or using GKE ▪ Default GKE node pool doesn’t support it ◦ Sysbox is a quick `kubectl apply` on most clouds ▪ Nodes must have a sysbox-install=yes label ▪ Should look familiar on host nodes, eg
  9. Demo – Let’s try sysbox ◦ K3s in K8s the

    bad way (privileges) ◦ K3s in K8s the good way (sysbox runtimeClassName) ◦ Both work, who cares?
  10. Demo – Let’s try sysbox ◦ Bad pod ▪ chroot

    /host/proc/1/root /bin/sh ◦ Good pod ▪ chroot /host/proc/1/root /bin/sh
  11. Demo – Let’s try sysbox ◦ Things to check ▪

    Pod: ls -alh /proc/self/ns ▪ Host: ls -alh /proc/1/ns/ ◦ Sysbox will have different cgroup and user inodes ◦ Privileged runc will share host’s
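The namespace check above can also be scripted. This is a minimal sketch, assuming a Linux /proc layout; the function names `read_namespaces` and `differing_namespaces` are illustrative and not part of sysbox or Kubernetes.

```python
import os


def read_namespaces(pid="self", proc_root="/proc"):
    """Return {namespace name: link target} for one process.

    Each entry under /proc/<pid>/ns is a symlink whose target looks like
    'user:[<inode>]'. Reading PID 1 on the host usually needs root.
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }


def differing_namespaces(pod_ns, host_ns):
    """List namespaces present in both views whose inodes differ."""
    return sorted(
        name
        for name in pod_ns
        if name in host_ns and pod_ns[name] != host_ns[name]
    )
```

One way to use it: call `read_namespaces()` inside the pod and `read_namespaces(1)` on the host node, then pass both dictionaries to `differing_namespaces`. Per the slide, a sysbox pod should report at least `cgroup` and `user` as different, while a privileged runc pod shares the host's.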
  12. Demo – Let’s try sysbox ◦ nginx and network debugging

    ▪ tcpdump in pod ▪ curl to service IP • Bridge and veth traffic show up - privileged!
  13. K8s 1.33: The theme for Kubernetes v1.33 is Octarine: The Color of Magic. This release highlights the open source magic that Kubernetes enables across the ecosystem. “It’s still magic even if you know how it’s done.” (Sir Terry Pratchett) 64 enhancements • Alpha: 24 • Beta: 20 • Stable: 18 • 2 deprecated
  14. Status Alpha: 1.27 Beta: 1.33 KEP KEP-1287 In-Place Pod Resize

    (IPPR) for VPA - Increase resource utilization and minimize costs by using non-disruptive Vertical Pod Autoscaling (VPA) for automatic workload rightsizing - You can control whether a container should be restarted when resizing by setting resizePolicy in the container specification. This allows fine-grained control based on resource type (CPU or memory). - NotRequired: (Default) Apply the resource change to the running container without restarting it. - RestartContainer: Restart the container to apply the new resource values.
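The resizePolicy setting described above might look like the following when a container spec is built programmatically. This is a sketch only: the helper names are hypothetical, and the choice of NotRequired for CPU and RestartContainer for memory is an example, not a recommendation from the talk.

```python
def container_with_resize_policy(name, image,
                                 cpu_policy="NotRequired",
                                 memory_policy="RestartContainer"):
    """Build a container spec dict with a per-resource resizePolicy."""
    return {
        "name": name,
        "image": image,
        "resizePolicy": [
            {"resourceName": "cpu", "restartPolicy": cpu_policy},
            {"resourceName": "memory", "restartPolicy": memory_policy},
        ],
    }


def restarts_on_resize(container, resource_name):
    """True if resizing this resource restarts the container."""
    for rule in container.get("resizePolicy", []):
        if rule["resourceName"] == resource_name:
            return rule["restartPolicy"] == "RestartContainer"
    # No explicit rule: NotRequired is the default, so no restart.
    return False
```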
  15. In-Place Pod Resizing - Before Pod Container 1 Container 2

    Service Pod Container 1 Container 2 Service Pod Container 1 Service Container 2 1 2 3
  16. In-Place Pod Resizing (Beta) Pod Container 1 Container 2 Service

    1 Pod Container 1 Service Container 2 2 In Place Resizing
  17. How to: In-Place Pod Resize
      1. Update kubectl to 1.33 or newer
      2. Create a Pod
      3. Resize the pod using 'patch':
         kubectl patch pod <pod-name> --subresource resize --patch \
           '{"spec":{"containers":[{"name":"<container-name>", "resources":{"requests":{"cpu":"100"}, "limits":{"cpu":"100"}}}]}}'
      * Public Preview; limitations apply
      Official documentation: https://kubernetes.io/docs/tasks/configure-pod-container/resize-container-resources/
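The patch body from the slide can be generated rather than hand-written, which avoids quoting mistakes. The sketch below only builds the argument list; the function names are hypothetical. The CPU value is passed through as a string: something like "800m" means 0.8 of a CPU, whereas a bare "100" would be read by Kubernetes as 100 whole CPUs.

```python
import json


def resize_patch(container_name, cpu):
    """Patch body that sets CPU request and limit for one container."""
    return {
        "spec": {
            "containers": [
                {
                    "name": container_name,
                    "resources": {
                        "requests": {"cpu": cpu},
                        "limits": {"cpu": cpu},
                    },
                }
            ]
        }
    }


def kubectl_resize_command(pod_name, container_name, cpu):
    """Argument list for the kubectl invocation shown on the slide."""
    patch = json.dumps(resize_patch(container_name, cpu), separators=(",", ":"))
    return ["kubectl", "patch", "pod", pod_name,
            "--subresource", "resize", "--patch", patch]
```

The returned list can be handed to `subprocess.run` as-is, so no shell quoting is involved.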
  18. Status Alpha: 1.28 Beta: 1.29 Stable: 1.33 KEP KEP-753 Sidecar containers. Sidecar containers are now Stable! - Sidecar containers enhance or extend the functionality of the primary app container by providing additional services such as logging, monitoring, security, or data synchronization, without directly altering the primary application code. - Set the “restartPolicy” field to “Always” on an init container to run it as a sidecar container. - Omitting the “restartPolicy” field creates a pure init container.
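To make the restartPolicy distinction concrete, here is a small sketch that splits a pod spec's init containers into sidecars and pure init containers. The function name and the example container names are illustrative assumptions.

```python
def classify_init_containers(pod_spec):
    """Return (sidecar names, pure init container names) for a pod spec dict.

    An init container with restartPolicy "Always" runs as a sidecar;
    one without the field is a regular init container.
    """
    sidecars, pure_init = [], []
    for container in pod_spec.get("initContainers", []):
        if container.get("restartPolicy") == "Always":
            sidecars.append(container["name"])
        else:
            pure_init.append(container["name"])
    return sidecars, pure_init


# Hypothetical pod spec: a log shipper sidecar plus a one-shot migration step.
example_spec = {
    "initContainers": [
        {"name": "log-shipper", "image": "example/log-shipper", "restartPolicy": "Always"},
        {"name": "db-migrate", "image": "example/migrate"},
    ],
    "containers": [{"name": "app", "image": "example/app"}],
}
```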
  19. CNCF Slack Changes 6.20.2025 CNCF Slack Downgrades to Free Plan

    - 90 Days of History - Disabling of Workflows - Private channels go away - Use slackdump to back up any DMs or private channel history (QR code below) We might move to Discord!!!
  20. New Certification - Cloud Native Platform Engineering Associate (CNPA) NEW

    CERT LET’S GOOOO!!! - Platform Engineering Core Fundamentals - Platform Observability, Security, Conformance - Continuous Delivery & Platform Engineering - Platform APIs and Provisioning Infrastructure - IDPs and Developer Experience - Measuring your Platform

  22. • Image Building • Concourse: oci-build, a wrapper around BuildKit • Equivalent to the docker build command • Image Pushing/Pulling • Concourse: registry-image, uploads and downloads OCI images • Equivalent to docker [push/pull]
  23. > docker pull concourse/dev@sha256:eecb3e6c87b4... ... Status: Downloaded newer image for concourse/dev@sha256:eecb3e6c87b4ba201c0a7c49034b2eb837a62a75970086848bbebeb93b9f904f
  24. > docker manifest inspect golang:latest { "schemaVersion": 2, "mediaType": "application/vnd.oci.image.index.v1+json", "manifests": [ { "mediaType": "application/vnd.oci.image.manifest.v1+json", "size": 2322, "digest": "sha256:6e867e7a9b18808f61e7f1e8815535199f526bb227be340be6547f239a94228b", "platform": { "architecture": "amd64", "os": "linux" } }, { "mediaType": "application/vnd.oci.image.manifest.v1+json", "size": 2324, "digest": "sha256:a33b16b24e602f34d335236c3ab39a8aa583b9fa1b6a44392f8d2dc1af6aff4e", "platform": { "architecture": "arm64", "os": "linux", "variant": "v8" } }, ... }
  25. Image Index Image Manifest (x86) Image Manifest (ARM) layer/f7f344792fec layer/84137b232743 layer/44a85b317b94 layer/409d7214a039 layer/8ff2fcf0febf layer/de030034854c layer/9cb31e2e37ea layer/bdb05694a875
  26. > docker manifest inspect golang:latest { "schemaVersion": 2, "mediaType": "application/vnd.oci.image.index.v1+json", "manifests": [ { "mediaType": "application/vnd.oci.image.manifest.v1+json", "size": 2322, "digest": "sha256:6e867e7a9b18808f61e7f1e8815535199f526bb227be340be6547f239a94228b", "platform": { "architecture": "amd64", "os": "linux" } }, { "mediaType": "application/vnd.oci.image.manifest.v1+json", "size": 2324, "digest": "sha256:a33b16b24e602f34d335236c3ab39a8aa583b9fa1b6a44392f8d2dc1af6aff4e", "platform": { "architecture": "arm64", "os": "linux", "variant": "v8" } }, ... }
  27. > curl -H "Authorization: Bearer ..." \ -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \ https://registry-1.docker.io/v2/concourse/dev/manifests/sha256:eecb3e6c87b4 | jq '.' { "schemaVersion": 2, "mediaType": "application/vnd.oci.image.index.v1+json", "manifests": [ { "mediaType": "application/vnd.oci.image.index.v1+json", "digest": "sha256:8cda417769d72bfb43b813da10dd90ef31f01b536022f68759fef7cf8a3651b9", "size": 647, "annotations": { "org.opencontainers.image.created": "2025-05-24T04:40:30Z" } } ] }
  28. > curl -H "Authorization: Bearer ..." \ -H "Accept: application/vnd.docker.distribution.manifest.v2+json" \ https://registry-1.docker.io/v2/concourse/dev/manifests/sha256:8cda417769 | jq '.' { "schemaVersion": 2, "mediaType": "application/vnd.oci.image.index.v1+json", "manifests": [ { "mediaType": "application/vnd.oci.image.manifest.v1+json", "digest": "sha256:a84b856b76f0fa439d3d9414fbbd3654a738225b6dcce285f4ce26b35cd43d99", "size": 5460, "platform": { "architecture": "arm64", "os": "linux" } }, { "mediaType": "application/vnd.oci.image.manifest.v1+json", "digest": "sha256:65907920e61e856e0edd3503a651d4c0789eb0137790d5f9f32ebd45c9ac77fd", "size": 5460, "platform": { "architecture": "amd64", "os": "linux" } } ] }
  29. Image Index Image Manifest (x86) Image Manifest (ARM) layer/f7f344792fec layer/84137b232743 layer/44a85b317b94 layer/409d7214a039 layer/8ff2fcf0febf layer/de030034854c layer/9cb31e2e37ea layer/bdb05694a875
  30. Image Index Image Manifest (x86) Image Manifest (ARM) layer/f7f344792fec layer/84137b232743 layer/44a85b317b94 layer/409d7214a039 layer/8ff2fcf0febf layer/de030034854c layer/9cb31e2e37ea layer/bdb05694a875 Image Index
  31. • Docker Hub doesn't "unwrap" the nested image index • The docker CLI does "unwrap" the nested image index
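What "unwrapping" a nested image index means can be sketched as a recursive walk. This is an illustration of the idea, not the docker CLI's actual implementation; `fetch` is a hypothetical callable that returns the parsed JSON document for a digest, and `max_depth` is an assumed safety limit.

```python
# Media types that denote an index (a list of manifests) rather than a
# single-platform image manifest.
INDEX_MEDIA_TYPES = {
    "application/vnd.oci.image.index.v1+json",
    "application/vnd.docker.distribution.manifest.list.v2+json",
}


def resolve_platform_manifest(index, fetch, architecture, os_name="linux", max_depth=4):
    """Return the descriptor of the image manifest for a platform, or None.

    Descriptors that point at another index are followed recursively, which
    is the "unwrap" step; plain manifests are matched on their platform.
    """
    if max_depth < 0:
        raise ValueError("image index nesting is deeper than expected")
    for descriptor in index.get("manifests", []):
        if descriptor.get("mediaType") in INDEX_MEDIA_TYPES:
            nested = fetch(descriptor["digest"])
            found = resolve_platform_manifest(
                nested, fetch, architecture, os_name, max_depth - 1
            )
            if found is not None:
                return found
            continue
        platform = descriptor.get("platform") or {}
        if platform.get("architecture") == architecture and platform.get("os") == os_name:
            return descriptor
    return None
```

A client that only inspects the top-level "manifests" list for platform entries finds none when the index is nested, which matches the behaviour attributed to Docker Hub on the slide.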
  32. • oci-build - wrapper around BuildKit • registry-image - pushes OCI images • Is registry-image wrapping the image index in another image index before pushing the image?
  33. • registry-image calls layout.ImageIndexFromPath() from go-containerregistry • Reads image from disk and represents the image as Go structs • No manipulation of the image
  34. • local • tar • oci • docker • image Reference: https://docs.docker.com/reference/cli/docker/buildx/build/#output
  35. Image Layout Spec: https://github.com/opencontainers/image-spec/blob/main/image-layout.md
      ./
      ├── blobs/
      │   └── sha256/
      │       └── ...
      ├── index.json
      └── oci-layout
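Reading such a layout from disk takes only a few lines. The sketch below assumes the directory structure from the Image Layout Spec shown on the slide; the function names are illustrative.

```python
import json
import os


def read_layout_index(layout_dir):
    """Return (imageLayoutVersion, parsed index.json) for an OCI layout dir."""
    with open(os.path.join(layout_dir, "oci-layout")) as f:
        version = json.load(f)["imageLayoutVersion"]
    with open(os.path.join(layout_dir, "index.json")) as f:
        index = json.load(f)
    return version, index


def blob_path(layout_dir, digest):
    """Map a digest such as 'sha256:<hex>' to its file under blobs/."""
    algorithm, encoded = digest.split(":", 1)
    return os.path.join(layout_dir, "blobs", algorithm, encoded)
```

Following a descriptor's digest through `blob_path` and parsing that file as JSON is how one can check whether index.json points at image manifests directly or at another image index.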
  36. { "schemaVersion": 2, "mediaType": "application/vnd.oci.image.index.v1+json", "manifests": [ { "mediaType": "application/vnd.oci.image.index.v1+json", "digest": "sha256:e1960d2b8ced17f2aced1244a438b57ecadb3ee23089eaf1e1a583e2816fa6fb", "size": 1607, "annotations": { "io.containerd.image.name": "docker.io/concourse/multi-arch:latest", "org.opencontainers.image.created": "2025-06-18T23:59:02Z", "org.opencontainers.image.ref.name": "latest" } } ] }
  37. { "schemaVersion": 2, "mediaType": "application/vnd.oci.image.index.v1+json", "manifests": [ { "mediaType": "application/vnd.oci.image.manifest.v1+json", "digest": "sha256:5a143fad4f8af50feab33d5e7652c4f385b1b0710dafe3e7291615530f6ad114", "size": 476, "platform": { "architecture": "amd64", "os": "linux" } }, { "mediaType": "application/vnd.oci.image.manifest.v1+json", "digest": "sha256:2defa492ae9cacd82dd172e6571ae2ff315d6d142b8ed2a6536620a31d9c6e03", "size": 476, "platform": { "architecture": "arm64", "os": "linux" } } ] }
  38. { "schemaVersion": 2, "mediaType": "application/vnd.oci.image.index.v1+json", "manifests": [ { "mediaType": "application/vnd.oci.image.manifest.v1+json", "digest": "sha256:5a143fad4f8af50feab33d5e7652c4f385b1b0710dafe3e7291615530f6ad114", "size": 476, "annotations": { "io.containerd.image.name": "docker.io/concourse/multi-arch:latest", "org.opencontainers.image.created": "2025-06-19T00:17:17Z", "org.opencontainers.image.ref.name": "latest" }, "platform": { "architecture": "amd64", "os": "linux" } } ] }
  42. © 2025 Intuit Inc. All rights reserved. Agenda 1. Platform @ Intuit 2. Event Driven Architecture 3. Challenges 4. Numaflow 5. Demo
  43. Technology @ Intuit 100M customers 107B consumer tax refunds per

    year $2T+ invoices managed on our platform per year 18M total US workers paid via QB payroll Intuit is leading the way in building an AI-native development platform using cloud native open source technology. We’re committed to building tools that scale and giving back to the open source community.
  44. We believe in open source and open collaboration bit.ly/intuit-oss Created,

    open-sourced, used, and maintained by Intuit Recipient of the End User Award in 2019 & 2022 End user of cloud native and mobile open source tech
  45. Event Processing - Real World Examples. Event-Driven Systems: The Backbone of Modern Technology Everywhere
      • E-commerce Order Processing: When a customer places an order, an event triggers inventory, payment, and shipping processes
      • IoT Data Processing: IoT sensors trigger events when vibrations or temperatures exceed safe limits, initiating maintenance and safety protocols
      • Real Time Analytics: Events trigger continuous analysis and insights, like monitoring website traffic, orders, etc.
      • Fraud Detection: When a transaction is initiated, it triggers processes to identify and prevent fraudulent activities
  46. Event Driven Architecture: Enables you to detect, process, and respond to events as they happen • Asynchronous & Responsive • Scalable & Flexible • Reliable & Resilient
  47. Challenges for Event Driven Applications
      • Boilerplate Code: Learning Kafka etc. is time consuming, requires writing a lot of boilerplate code to integrate, and can be inefficient if parallel consumption isn’t implemented appropriately
      • Scaling Complexities: Scaling event-driven applications while maintaining reliability is challenging and costly. Managing infrastructure efficiently without overspending adds to the complexity
      • Observability: Observability in event-driven applications is challenging. Identifying where latencies are introduced, whether in event consumption or in processing, requires deep visibility, which makes issues difficult to diagnose
  48. Serverless for Event Processing
      • Abstract Infrastructure: “Enables developers to build applications faster by eliminating the need for them to manage infrastructure”
      • Decouple Source/Sinks: “Decouple the business logic from the source and the destination”
      • Scalable, Reliable, and Secure: “Cost efficient, Elastic, No downtime upgrades, and Security patches”
  49. Advanced Event Processing (Pipeline). Diagram: STREAM, SOURCE, UDF, SINK and STORE vertices, each labelled "N Pods, X/Sec", with conditional edges K > P and K <= P
  50. Be part of the Community!
      • Browse Open Issues and Contribute: Explore GitHub for issues tagged with ‘good first issue’ or ‘help wanted’. Pick one that excites you!
      • Bring Numaflow to Work: Event-driven use cases are everywhere. Try out Numaflow in your projects and share your experience!
      • Join the Community: Connect on Slack and be part of the conversation
      Let’s shoot for the Stars! https://github.com/numaproj/numaflow
  51. K8s native, serverless platform for running scalable and reliable event-driven applications
      • Scalable and Cost efficient: Automatically scales from 0 to X, handling backpressure, while being lightweight and cost-efficient. Capable of running on edge with a low resource footprint
      • K8s native event processing: K8s native lightweight event processing with fully featured streaming semantics. Versatile and can seamlessly operate on the edge, on-prem or in the cloud
      • Language agnostic framework: SDKs in Java, Python, Golang, Rust. In-built source/sink connectors. Easy to write sources, functions and sinks
  52. From Fintech to Aerospace!!
      • Event Processing: Processing financial transactions and propagating them to other capabilities
      • Processing IoT Data: Processing on both high and low-volume sensor data streams received from IoT systems
      • Accelerator Chaining: AI/ML pipeline with dynamic resource allocation (GPUs, FPGAs etc.)
      • Digital Signal Processing: Detection, decoding, and demodulation of RF signals across edge devices and on cloud
      • Fraud Detection: Analyze crypto and blockchain-related transactions for fraud
  53. Defining SLOs and SLIs for ArgoCD June 19, 2025 Toronto,

    ON Serhiy Martynenko, The New York Times
  54. ArgoCD at NYT Create / Onboard Create Develop Build, Test,

    Deploy Run Route CI AWS / EKS Ingress Source Control NYT Engineer End User Monitor Observability
  55. SLI/SLO
      • SLI (service level indicator): Measurable indicator of performance/reliability. Examples: % of successful Sync; Sync duration; number of bph (barks per hour)
      • SLO (service level objective): Target/goal, set for an SLI over a specific window. Examples: success for 95% of Sync ops / 30 days; 95% of Sync completed <20s / 30 days; >=5 barks / hour
  56. Traditional Monitoring. Benefits: • Well-suited for debugging and diagnostics • Informative about system health (CPU/memory) • Immediate alerts about critical production incidents • Historical data and patterns 🤷 YES, BUT…: • Reactive Response • Data Overload • Alert Fatigue • Are our customers happy?
  57. SLI/SLO Approach • Customer-focused metrics • Proactive reliability • Data-driven

    prioritization (reliability vs features) • Error budgets
  58. Error Budgets • If SLO = 99.9%, then 0.1% is

    your "error budget" • For each 9 we add to our SLO, we significantly increase the engineering effort required to maintain it. • Conscious Decisions: how much reliability tax we're willing to pay • Freedom to innovate when under budget: ◦ Budget depleted → focus on reliability ◦ Budget healthy → ship new features
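The error budget arithmetic can be worked through in a few lines. This is a minimal sketch; the function names and the 30-day default window are assumptions chosen to match the SLO examples earlier in the deck.

```python
def error_budget(slo_target):
    """Fraction of events (or time) allowed to fail; 0.999 gives about 0.001."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target, window_days=30):
    """Error budget expressed as minutes of full outage within the window."""
    return error_budget(slo_target) * window_days * 24 * 60


def budget_remaining(slo_target, total_events, failed_events):
    """Share of the budget still unspent; negative once the budget is blown."""
    allowed = error_budget(slo_target) * total_events
    if allowed == 0:
        return 1.0 if failed_events == 0 else 0.0
    return 1.0 - failed_events / allowed
```

For a 99.9% SLO over 30 days the budget is about 43.2 minutes, and each extra 9 divides it by ten (99.99% leaves roughly 4.3 minutes), which is one way to see why every added 9 costs so much more engineering effort.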
  59. Metrics into Meaningful SLIs • Identify key user journeys •

    Map those journeys to measurable events • Map those events to existing ArgoCD metrics • Transform raw metrics into pass/fail indicators
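Turning raw measurements into pass/fail indicators might look like the sketch below. `SyncEvent` is a hypothetical record, not an ArgoCD type; in practice the inputs would come from whatever sync metrics ArgoCD exports. The 20-second threshold and 95% target are the example values from the SLI/SLO slide.

```python
from dataclasses import dataclass


@dataclass
class SyncEvent:
    succeeded: bool
    duration_seconds: float


def success_sli(events):
    """Fraction of sync operations that succeeded."""
    if not events:
        raise ValueError("no events in window")
    return sum(e.succeeded for e in events) / len(events)


def latency_sli(events, threshold_seconds=20.0):
    """Fraction of syncs that succeeded within the threshold (pass/fail per event)."""
    if not events:
        raise ValueError("no events in window")
    good = sum(e.succeeded and e.duration_seconds < threshold_seconds for e in events)
    return good / len(events)


def meets_slo(sli, target=0.95):
    """Compare an SLI measured over the window against its SLO target."""
    return sli >= target
```

Counting a failed sync as a latency failure is a design choice made here so that a fast failure never improves the latency indicator.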
  60. Setting up a good SLO • Analyze historical performance • Consider user expectations and business requirements • Start conservative (lower targets) and adjust • Different components may need different targets • Adjust based on feedback
  61. Flow of Deployment 1. Trigger Detection 2. Source Fetching 3.

    Manifest Generation 4. Reconciliation & Diff Calculation 5. Resource Updates 6. Health Assessment 7. Sync Completion 8. Notification
  62. Common bottlenecks • Git operations (large repos or network issues) • Manifest generation (especially with complex Helm charts) • Resource updates & reconciliation (slow K8s API server responses) • UI unavailable (problems with ArgoCD availability)
  63. ArgoCD user journey. Two perspectives: • "Platform Team View": Commit changes → Git repository updated → ArgoCD detects changes → Applications synchronized → Deployment successful • "Developer View": "code committed → deployed", with black boxes in between. Important developer touchpoints: • "Did my code deploy successfully?" • "How long until my changes are live?" • "Can I check the status of the deployment?"
  64. Next Steps • Continuous improvements: ◦ Revisit SLOs ◦ Update

    dashboard • Use CI metrics to create SLIs, related to GitOps: ◦ Manifest generation performance • Advanced metrics: ◦ Leverage webhook notifications to calculate additional metrics
  65. Conclusion • Choose metrics/queries that represent potential problems • Set up alerts on error budget - early warning about issues • Revisit and refine SLI/SLO - it’s the foundation for continuous improvements • IaC approach • Simple SLOs, but more info on dashboard
  66. More about New York Times • Scaling Argo Security and

    Multi-Tenancy in AWS EKS at the New York Times (David Grizzanti & Luke Philips) ◦ https://youtu.be/rro686bRIQU?si=1xOw1kSt3swdfU3V • The ArgoCD AppProject - What Is a Project and How to Power Your Multi-Tenant Security (Luke Philips & Serhiy Martynenko) ◦ https://youtu.be/x2WfwLSufCI?si=qMPoLpr75Of836rB • What We Learned Designing & Securing a Multi-Tenant Developer Platform at The New York Times (Ahmed Bebars & David Grizzanti) ◦ https://youtu.be/EniokAz-Plg?si=pnxqlC1Xd3XtCck8 • Automating Configuration and Permissions Testing for GitOps with OPA Conftest (Eve Ben Ezra & Michael Hume) ◦ https://youtu.be/VCX4UALQjeg?si=BdKkEE_3BeLEcVYg