Scaling Nubank to 1 Deploy per Minute

Julio Turolla
March 19, 2024

Nubank is the leading technology company providing financial services in Latin America, with over 70 million customers across 3 countries. With 2000 engineers building and pushing code to production every day, delivering changes becomes a challenge: processes get in the way, queues get huge, and out-of-the-box solutions break. We will tell the story of building a platform that enables one successful change every minute, quickly and safely, through contract-based testing, sharding, a data-driven engineering culture, and Tekton (a Kubernetes-native, open-source CI/CD framework that is part of the Continuous Delivery Foundation).

Transcript

  1. How Nubank Deploys One Change to Production Every Minute *
    * business minute, ~2400 deploys per week
    Júlio Turolla, Felipe Goh
  2. Nubank
    • Financial services in Latam
    • Office in Berlin
    • 50% of the adult population in Brazil
    • 8k employees (2000 engineers)
    • Brazil + Mexico + Colombia
  3. Impact of Continuous Integration and Delivery
    • 7 people on the team
    • Impacts 2,000 engineers shipping software
    • Impacts 80 million people using Nubank to:
      ◦ Buy groceries
      ◦ Support loved ones
      ◦ Pay their bar tabs
  4. Usual platforms vs. Unusual Scale
    • Niche open source CICD software created by someone on the internet
      ◦ Beautiful UI
      ◦ You have to deploy it
      ◦ Tested up to 10,000 jobs per month
    Simply explodes
  5. Usual platforms vs. Unusual Scale
    • Awesome SaaS vendor with a great product
      ◦ Fully managed
      ◦ Machines in ☁
      ◦ Never explodes at sale time; they will work very hard so it doesn't explode at runtime
      ◦ After migration, we're kind of locked in!
    Yearly Contract Renegotiations
  6. Usual platforms vs. Unusual Scale
    • Big A to Z Cloud Vendor's CICD offering
      ◦ Best customer support on earth
      ◦ Never ever explodes
      ◦ Basically only .com uses it, never heard of other cases
      ◦ Off-the-shelf pricing, you know how much you pay
    You have to integrate with 4 other A to Z products
    Hidden Hard Limits
  7. Reaching scale
    • A scale at which most OSS has not been tested
    • A scale at which most SaaS vendors don't offer an off-the-shelf price
    • And even if they do, it usually has limits
  8. CICD at Nubank
    • 10 years ago
    • GoCD, by ThoughtWorks
    • Value Stream Map
    • Centralized control server, distributed agents
    • Homogeneous stack
      ◦ Clojure services, Datomic, DynamoDB, Kafka, HTTP
    • Before sharding
    • Happily coding pipelines in XML
    100
  9. Fast forward 6 years… (2019)
    • ~600 services
    • ~1.5k engineers
    • Multiple products and AWS accounts
    • Stronger access controls
      ◦ GoCD needed strict IAM policies, but was not capable of them
      ◦ Result: one agent per IAM role
    • 500 agents reaching 1 centralized, stateful, non-horizontally-scalable server
    • Self-inflicted DDoS
    30,000
  10. New CICD Requirements
    • 400 deploys and 500k tasks/month in d0
    • 2x growth every six months
    • Strong authorization and authentication
    • Efficient
    • Centralized pipeline declaration
    • Easy to use
  11. Getting to Tekton
    • Tekton was in its early days
    • Built on top of Kubernetes
    • It's a building block for CICD
    • Not super friendly
      ◦ Manual pipeline setup
      ◦ Lacked log visualization
    • No Value Stream Map
  12. A minimal Tekton Task and the status of its run:

      apiVersion: tekton.dev/v1beta1
      kind: Task
      metadata:
        name: hello
      spec:
        steps:
          - name: echo
            image: alpine
            script: |
              #!/bin/sh
              echo "Hello World"

      NAME             SUCCEEDED   REASON
      hello-task-run   True        Succeeded
  13. CICD Vision
    • Similar UX to great SaaS
    • Create a manifest in the git repo
      ◦ Creates the pipeline
      ◦ Sets up webhooks
      ◦ Sets up SSH
    • Runs on arbitrary GitHub events
    • Log visualization tool
    • Dashboard
    Not as simple as writing another Clojure service.
  14. How to code for the Kubernetes API?
    • A novelty at the company
    • A discipline long desired
    • Opens doors to a new type of meta-system, oriented toward managing our infrastructure
    • No clear choice of company-sponsored programming language
  15. "It was clear that we had to extend Tekton, and creating Kubernetes native applications was very valuable in this scenario. As Kubernetes and Tekton are written in Golang, it's the most natural choice to work alongside those projects. We could take advantage of the existing ecosystem, instead of recreating structures in other languages. Even though the most obvious choice would be Java, with interop to Clojure, it would be a tremendous effort to maintain and evolve."
    Alan Ghelardi, Staff SWE at Nubank
  16. Nu Workflows
    • Observed the manual steps needed to make Tekton work
    • Devised Nu Workflows
    • First relevant system in Golang at the company
    • Covers basic features
      ◦ Workflow lifecycle
      ◦ Workflow triggering
      ◦ Garbage collection
      ◦ Rerun
  17. This proof-of-concept infrastructure lived longer than expected…
    • A well-intentioned engineer…
      ◦ … deleted 2700 branches to clean up a repository
      ◦ … each running 10 workflows
      ◦ … each running tens of tasks
      ◦ … with no queueing
      ◦ … in a VPC with 2000 IPv4 addresses
    • Flooded the Kubernetes API
    • EKS control plane couldn't scale up
    • Crashed gracefully
      ◦ Offline for 1 day
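The numbers on this slide can be multiplied out to see why the cluster was overwhelmed. The sketch below is a back-of-the-envelope estimate, not a reconstruction of the incident: the slide only says "tens of tasks", so 10 tasks per workflow is an assumed lower bound, and one IPv4 address per task pod is also an assumption.

```python
# Back-of-the-envelope load estimate for the branch-cleanup incident.
# Values marked "slide" come from the talk; the rest are assumptions.

DELETED_BRANCHES = 2700      # slide
WORKFLOWS_PER_BRANCH = 10    # slide
TASKS_PER_WORKFLOW = 10      # assumption: the slide only says "tens of tasks"
VPC_IPV4_ADDRESSES = 2000    # slide


def requested_tasks(branches: int, workflows: int, tasks: int) -> int:
    """Number of tasks requested at once when nothing is queued."""
    return branches * workflows * tasks


def oversubscription(tasks: int, addresses: int) -> float:
    """How many times demand exceeds the address pool (assumes one IP per task pod)."""
    return tasks / addresses


total = requested_tasks(DELETED_BRANCHES, WORKFLOWS_PER_BRANCH, TASKS_PER_WORKFLOW)
ratio = oversubscription(total, VPC_IPV4_ADDRESSES)
print(f"{total} tasks requested, {ratio:.0f}x the available IPv4 addresses")
```

Even with the conservative task count, demand exceeds the address pool by two orders of magnitude, which is consistent with the flooded API described on the slide.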
  18. Limits of the platform?
    • Constantly overlooked; should be evaluated at design time
    • Open internet: how can users abuse it?
    • All sorted out for us when writing a Clojure service
    • New technology: we have to come up with these details ourselves
    • Internal system, safe sandbox
    • Trustworthy users can be hazardous, too
    • Iteration 0
      ◦ Lacked flow control mechanisms
      ◦ Lacked a dedicated infrastructure
  19. Iteration 1, Reliability and Reproducibility
    • Two workstreams: Application and Infrastructure
    • Application features to support growth
      ◦ Prioritized development of queueing
      ◦ Concurrency controls
      ◦ Observability
    • Infrastructure to support growth
      ◦ Correctly sized and scalable networking
      ◦ Reproducible deployment of components
      ◦ Upgrades and feedback cycle
      ◦ Discover bottlenecks again
      ◦ Theoretical limit of 32k concurrent runs
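Queueing and concurrency controls are the flow control that Iteration 0 lacked. The talk does not show how Nu Workflows implements them, so the following is only a minimal sketch of the idea: a FIFO queue that admits a bounded number of runs and holds the rest. The class and method names (`AdmissionQueue`, `submit`, `complete`) are hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Set


class AdmissionQueue:
    """FIFO queue that lets at most `max_running` runs execute at once (hypothetical)."""

    def __init__(self, max_running: int) -> None:
        if max_running < 1:
            raise ValueError("max_running must be at least 1")
        self.max_running = max_running
        self.pending: Deque[str] = deque()
        self.running: Set[str] = set()

    def submit(self, run_id: str) -> bool:
        """Start the run if there is capacity, otherwise queue it. True means started."""
        if len(self.running) < self.max_running:
            self.running.add(run_id)
            return True
        self.pending.append(run_id)
        return False

    def complete(self, run_id: str) -> Optional[str]:
        """Mark a run as finished and promote the oldest pending run, if any."""
        self.running.discard(run_id)
        if self.pending and len(self.running) < self.max_running:
            promoted = self.pending.popleft()
            self.running.add(promoted)
            return promoted
        return None


queue = AdmissionQueue(max_running=2)
started = [queue.submit(run) for run in ("run-a", "run-b", "run-c")]
promoted = queue.complete("run-a")  # frees a slot, so the queued run starts
print(started, promoted)
```

With a cap in place, a burst such as a mass branch deletion turns into a longer queue rather than an unbounded number of simultaneous pods.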
  20. Scaled to 4000 deploys and 1 million tasks per month, across 1200 services
    1,000,000
    (~0.4 deploys per business minute)
  21. Vendor Limits - AWS EKS
    • EKS limits
      ◦ etcd
      ◦ API latency
    • 10,000 concurrent runs
    • Slowness in the Kubernetes API (up to minutes for a response)
    • Instability
  22. Vendor Limits - GitHub
    • One of the biggest users of GitHub APIs
    • Vendors have limits, as great systems do
    • Optimization of requests, caching
      ◦ Cache hit rate is really relevant at this scale
    • GitHub helped us a lot
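The slide stresses caching and cache hits as the way to stay within API limits. The talk does not describe the caching layer, so this is a generic sketch of a time-based cache that also tracks its hit rate. `CachedFetcher`, the injected `fetch` function, and the example key are all hypothetical, and no real GitHub API is called.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedFetcher:
    """Wraps a fetch function with a TTL cache and hit/miss counters (hypothetical)."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        """Return a fresh cached value, or call the upstream fetch and store the result."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value

    def hit_rate(self) -> float:
        """Fraction of lookups served from the cache."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


upstream_calls: List[str] = []


def fake_upstream(key: str) -> str:
    # Stand-in for a rate-limited remote API call.
    upstream_calls.append(key)
    return f"payload for {key}"


fetcher = CachedFetcher(fake_upstream, ttl_seconds=60.0)
for _ in range(4):
    fetcher.get("repos/example/service")
print(len(upstream_calls), fetcher.hit_rate())
```

Tracking the hit rate alongside the cache makes it possible to see how many upstream requests the cache is actually saving.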
  23. Iteration 2 - How to keep growing?
    • Reached the limits of a single EKS cluster
    • Not willing to leave EKS
      ◦ No longer running any self-hosted k8s at the company
      ◦ Reliance on IRSA for authn and authz
    • Already worked on having a reproducible infrastructure
    • Sharding
      ◦ Making the platform linearly scalable
      ◦ 1 worker cluster for every 7000 concurrent jobs
      ◦ 1 control plane
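The linear scaling rule on this slide (one worker cluster per 7000 concurrent jobs) reduces to a ceiling division. The helper below is a small sketch of that arithmetic; the function name and the choice to always keep at least one cluster are assumptions.

```python
JOBS_PER_WORKER_CLUSTER = 7000  # slide: 1 worker cluster for every 7000 concurrent jobs


def worker_clusters_needed(concurrent_jobs: int,
                           per_cluster: int = JOBS_PER_WORKER_CLUSTER) -> int:
    """Smallest number of worker clusters that can hold `concurrent_jobs`."""
    if concurrent_jobs < 0 or per_cluster < 1:
        raise ValueError("concurrent_jobs must be >= 0 and per_cluster >= 1")
    # Integer ceiling division; assumes at least one cluster is always kept running.
    return max(1, -(-concurrent_jobs // per_cluster))


for jobs in (7000, 7001, 32000, 70000):
    print(jobs, worker_clusters_needed(jobs))
```

Under this rule, the 32k theoretical limit of Iteration 1 would map to 5 worker clusters.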
  24. Iteration 2 - Sharding
    • Enabled use cases
      ◦ 2 repos account for 60% of the CICD load
      ◦ Dedicated clusters to protect against noisy neighbors
      ◦ Fast-track clusters, with zero queueing for urgent deploys
      ◦ Low-priority clusters for scheduled automations
      ◦ Dedicated clusters when a country has a regulatory requirement
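The use cases above amount to a routing decision: which cluster should a given workflow run on? The talk does not describe the routing logic, so the sketch below is purely illustrative. The `Workflow` fields, the cluster names, and the rule order (regulation first, then urgency, then dedicated repos, then scheduled work) are all assumptions.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Workflow:
    repo: str
    urgent: bool = False
    scheduled: bool = False
    country: Optional[str] = None


@dataclass
class ShardRouter:
    general: List[str]                   # shared worker clusters
    dedicated_repos: Dict[str, str]      # heavy repo -> its own cluster
    regulated_countries: Dict[str, str]  # country -> cluster required by regulation
    fast_track: str = "fast-track"
    low_priority: str = "low-priority"

    def route(self, wf: Workflow) -> str:
        """Pick a cluster name for the workflow (hypothetical rule order)."""
        if wf.country in self.regulated_countries:
            return self.regulated_countries[wf.country]
        if wf.urgent:
            return self.fast_track
        if wf.repo in self.dedicated_repos:
            return self.dedicated_repos[wf.repo]
        if wf.scheduled:
            return self.low_priority
        # Stable hash so the same repo always lands on the same shared cluster.
        digest = hashlib.sha256(wf.repo.encode("utf-8")).digest()
        return self.general[int.from_bytes(digest[:8], "big") % len(self.general)]


router = ShardRouter(
    general=["worker-1", "worker-2", "worker-3"],
    dedicated_repos={"big-monorepo": "worker-monorepo"},
    regulated_countries={"mx": "worker-mx"},
)
print(router.route(Workflow(repo="big-monorepo")))
print(router.route(Workflow(repo="payments", urgent=True)))
print(router.route(Workflow(repo="reports", scheduled=True)))
```

Isolating the heaviest repositories on their own clusters keeps their load from queueing behind, or in front of, everyone else's workflows.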
  25. Reached just short of 4 million tasks per month, 2400 deploys per week
    4,000,000
    (~1 deploy per business minute)
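The "deploys per business minute" figures can be checked with simple arithmetic. The sketch below assumes a business week of 40 hours (5 days of 8 hours), which the talk does not state explicitly, and treats the earlier figure of 4000 deploys as a monthly total.

```python
BUSINESS_HOURS_PER_WEEK = 40   # assumption: 5 days x 8 hours
WEEKS_PER_MONTH = 52 / 12      # average number of weeks in a month


def deploys_per_business_minute(deploys_per_week: float,
                                hours_per_week: int = BUSINESS_HOURS_PER_WEEK) -> float:
    """Average deploys per minute, counting only business hours."""
    return deploys_per_week / (hours_per_week * 60)


# Iteration 2: 2400 deploys per week (from the slides).
after_sharding = deploys_per_business_minute(2400)

# Iteration 1: 4000 deploys, assumed to be per month, converted to a weekly rate.
before_sharding = deploys_per_business_minute(4000 / WEEKS_PER_MONTH)

print(f"{after_sharding:.1f} and {before_sharding:.1f} deploys per business minute")
```

Under these assumptions, 2400 deploys per week is exactly one per business minute, and 4000 per month rounds to the ~0.4 quoted earlier.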
  26. Concurrency Cap per Iteration

      Iteration                        Theoretical Concurrency   Actual Concurrency Limit
      Iteration 0: Proof-of-Concept    Not evaluated             2,000
      Iteration 1: Reliability         32,000                    7,000 (best performance)
      Iteration 2: Sharding            7,000 * N                 Unknown (>70k)
  27. Closing Up
    • When reaching a certain scale, options become complex
    • Neither off-the-shelf nor open source suffices on its own; it depends
    • Growth is not stopping:
      ◦ Reached 1M customers in 1 month for the newly released Mexican bank account
      ◦ Expect to reach 10M TaskRuns/month in the next 18 months
    • Deploying new shards on demand
    • Working to improve the maximum concurrency per shard
    We're ready for growth until the next bottleneck is discovered.
  28. End

  29. Usual platforms vs. Unusual Scale
    • Open source software sponsored by Big Tech
      ◦ Runs on top of the other OSS sponsored by Big Tech
      ◦ Nobody offers it, run it yourself
      ◦ BTW, participate in this committee to change it
      ◦ Good luck!
    Lucky to have a UI
    Simplicity? Build it yourself
    No Graphical User Interface