Kubernetes Cron Jobs: Alpha to Production

Stripe runs many critical production scheduled tasks using Kubernetes Cron Jobs. These do everything from monitoring to moving money across the globe, with stringent requirements for reliability and timeliness.

In this presentation, we’ll discuss our methodology for evaluating Kubernetes as well as things we learned during the course of setting up clusters, shaking out bugs, and ultimately migrating our production workloads.

Franklin Hu

May 15, 2018

Transcript

  1. 5.

    Why cron jobs? To move money: • Create a file containing many transaction records • Write the file to a bank’s FTP server • The bank sweeps files from the directory roughly daily • The bank may write a response file to another FTP server
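The file-then-FTP flow above can be sketched roughly like this. A minimal sketch only: the host, credentials, directory, filename, and record format are all hypothetical, not Stripe’s actual setup.

```python
import io
from ftplib import FTP_TLS  # stdlib; plain ftplib.FTP also exists for unencrypted FTP

def serialize_records(records):
    """Join transaction records into one newline-delimited file body (format is hypothetical)."""
    return ("\n".join(records) + "\n").encode("utf-8")

def upload_transaction_file(records, host, user, password, remote_dir, filename):
    """Upload one file of transaction records to the bank's FTP server."""
    ftp = FTP_TLS(host)
    ftp.login(user=user, passwd=password)
    ftp.prot_p()                              # encrypt the data channel
    ftp.cwd(remote_dir)                       # the bank sweeps this directory ~daily
    ftp.storbinary("STOR " + filename, io.BytesIO(serialize_records(records)))
    ftp.quit()
```

A scheduled job of this shape is exactly what the talk's cron jobs run: build the file, push it, and let the bank's daily sweep pick it up.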
  2. 7.

    Cron Job Assumptions => Move money as soon as possible => Get transactions into files as soon as possible => Run jobs close to the deadline, but not so close that we risk missing it (cost/benefit)
  3. 8.

    Cron Job Platform Implications • Jobs should be scheduled predictably ◦ Not running a job (or delaying it past a deadline) means our users don’t get paid • Terminating a running job may be high cost
  4. 11.

    Status Quo • Started running Chronos + Mesos in ~2015 • Lots of issues with Chronos (2015-2017) ◦ Super buggy, frequently causing issues (lost jobs) ◦ Not actively developed (abandoned?) ◦ Built a lot of monitoring scaffolding around Chronos to know when it lost jobs ◦ Lost jobs paged humans, but didn’t always manifest as incidents • Many other teams started using Chronos for cron jobs
  5. 15.

    Why did we pick Kubernetes? • Open source! • An active community that takes contributions • Large upside: we can run other workloads ◦ Services, one-time Jobs, etc. • We use Go widely for infrastructure; ICs were able to pick it up quickly Risks • Kubernetes has lots of pieces => operating it well is challenging • The Cron Job API was in alpha :\
  6. 18.

    Building Confidence: Talk to other companies • Lean on others’ experiences! Trust, but verify! • Introduced via mutual contacts and/or via the Internet • Talked to other companies that have used Kubernetes ◦ All used it in different ways, in different environments (GKE, bare metal) • Sussed out the common points
  7. 19.

    Building Confidence: Talk to other companies Some Common Advice • Prioritize etcd reliability • Some features are more stable than others; some companies wait for the release after a feature is marked stable, to let bugs get fixed • Consider using a hosted solution (GKE, AKS, EKS), because setting up an HA cluster is a lot of work
  8. 20.

    Building Confidence: Read the code • The CronJob API was in alpha, which made us nervous • Question: can we understand it? • The entire controller is < 1000 LOC! Notably, ◦ The controller is a stateless service ◦ Every ten seconds, it runs a syncAll function... ◦ That fetches all CronJobs from the Kubernetes API, iterates through them, and figures out which ones to run • Gave us confidence that if there were issues, we could fix them
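The control loop described above is simple enough to paraphrase. This is a hedged Python paraphrase of that stateless sync loop; the real controller is written in Go in the Kubernetes repo, and the method names on `api` and `cron_job` here are illustrative, not the actual API.

```python
import time

def sync_all(api, now):
    """One pass: fetch every CronJob and start the ones that are due (illustrative)."""
    started = []
    for cron_job in api.list_cron_jobs():        # fetch all CronJobs from the API server
        due = cron_job.next_scheduled_time() <= now
        if due and not cron_job.forbidden_by_concurrency_policy():
            api.create_job(cron_job)             # materialize a Job object for this run
            started.append(cron_job.name)
    return started

def run_controller(api):
    """The controller is stateless: it just runs sync_all every ten seconds."""
    while True:
        sync_all(api, time.time())
        time.sleep(10)
```

Because there is no state between passes, a crashed controller simply picks up where the next ten-second tick lands, which is part of what made the small codebase easy to trust.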
  9. 21.

    Building Confidence: Load testing • Question: does it work? Can it scale? • Requirement: schedule ~50 jobs/minute • Goal: find the CronJob controller’s limit • Test ◦ A 3-node cluster; 1000 cron jobs, each running every minute ◦ Each of them: bash -c 'echo hello world'
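Each of the thousand load-test CronJobs looked roughly like the manifest below. This is a sketch: the `batch/v2alpha1` API group matches the alpha-era API discussed in this talk (later releases moved to `batch/v1beta1` and then `batch/v1`), and the object name and image are illustrative.

```yaml
apiVersion: batch/v2alpha1    # alpha-era CronJob API group (assumption based on the era)
kind: CronJob
metadata:
  name: load-test-job-0001    # one of ~1000 such objects
spec:
  schedule: "* * * * *"       # every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: hello
            image: ubuntu
            command: ["bash", "-c", "echo hello world"]
```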
  10. 22.

    Building Confidence: Load testing Failure! • Kubernetes maxes out at 1 pod per second per node ◦ 180 jobs/min on a 3-node cluster • Good enough at the time ◦ Escape hatch: scale out workers ◦ Patchable long term
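The ceiling above is simple arithmetic: the observed per-node scheduling limit times the node count. A worked check of the numbers in the slide, not a benchmark:

```python
# Observed limit from the load test: about 1 pod started per second per node.
pods_per_second_per_node = 1
nodes = 3

# Cluster-wide throughput per minute.
cluster_jobs_per_minute = pods_per_second_per_node * nodes * 60   # = 180

# The stated requirement was ~50 jobs/minute, so 180 left headroom,
# and scaling out workers raises the ceiling linearly.
required_jobs_per_minute = 50
assert cluster_jobs_per_minute >= required_jobs_per_minute
print(cluster_jobs_per_minute)
```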
  11. 24.

    Building Confidence: Operating etcd • Set up etcd replication (tolerate node failures!) ◦ We run with 5 replicas in production • Make sure you have enough I/O bandwidth ◦ We ran into an issue where a single node with slow fsync caused continuous leader elections; important if you’re using network storage • Built out tooling + runbooks for managing node lifecycles • Upstreamed some fixes to make etcd play nicely with Consul DNS ◦ Done by someone with little Go experience! etcd is a big project, but they were able to get their feet wet with a config change
  12. 25.

    Building Confidence: Operating etcd Testing • Recovering from backup • Rebuilding the entire cluster without downtime • Load testing
  13. 26.

    Building Confidence: Metrics and Monitoring Aside: Veneur • A sink for various observability primitives, with lots of outputs • Supports statsd or SSF input • https://github.com/stripe/veneur/
  14. 27.

    Building Confidence: Metrics and Monitoring • Use the kube-state-metrics package for cluster-level metrics • Asked Observability to write a Prometheus plugin, and they open-sourced it! • veneur-prometheus scrapes metrics out of kube-state-metrics and emits them into our metrics pipeline • Created Datadog alerts for various things, like the number of pending pods
  15. 29.

    Designing a Migration: Requirements • No security regressions • No user-facing incidents • Finish before the holiday season • SLOs? • Do not require changes to jobs in order to migrate • Ratchet up requirements after migrating ◦ E.g. job runtime
  16. 30.

    Designing a Migration: Cut scope • Cutting scope reduces migration risk! • Kubernetes has a lot of fancy features that you can use ◦ We avoided pod-to-pod networking and full containerization • Modified interfaces where necessary ◦ Took away human SSH access, but provided stop-gap coverage • Punted non-essentials until after the migration was finished
  17. 33.

    Designing a Migration: Migrate incrementally • Expose a single interface! • Use feature flags! ◦ Built tooling that let us flag jobs between the old and new clusters • A flip took < 5 minutes, so if something went wrong we could easily switch back
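The flag-based cutover can be sketched as a tiny router. Everything here — the flag store, the cluster names, the job names — is hypothetical illustration of the idea, not Stripe’s actual tooling:

```python
# Hypothetical feature-flag store: each job name maps to the cluster that owns it.
FLAGS = {}  # job name -> "chronos" | "kubernetes"

def flip(job_name, cluster):
    """Move one job between clusters; rolling back is the same one-line change."""
    assert cluster in ("chronos", "kubernetes")
    FLAGS[job_name] = cluster

def owning_cluster(job_name):
    # Default to the old scheduler until a job is explicitly migrated.
    return FLAGS.get(job_name, "chronos")
```

The point of the design is the default: an unflagged job keeps running where it always did, so migration proceeds one job at a time and a bad flip reverses in minutes.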
  18. 34.

    Designing a Migration: Migrate incrementally Major migration goal: don’t cause any outages • We had a variety of jobs => breaking a low-impact one is okay • Used this to discover where our gaps were, one edge case at a time
  19. 36.

    Designing a Migration: Investigate bugs Rule: if Kubernetes does something unexpected, investigate, find the root cause, and come up with a remediation • Found a bunch of bugs during testing ◦ CronJobs with names longer than 52 characters silently fail to schedule jobs ◦ Pods would sometimes get stuck in the Pending state forever ◦ The scheduler would crash every 3 hours ◦ Flannel’s host-gw backend didn’t replace outdated route table entries
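The 52-character bug follows from Kubernetes’s 63-character limit on label values: the CronJob controller appends an 11-character suffix when naming each Job, so CronJob names longer than 63 − 11 = 52 characters produce invalid objects. A pre-flight check in that spirit (a hedged sketch; the constants reflect the limit as described in this talk):

```python
LABEL_MAX = 63           # Kubernetes caps label values at 63 characters
SUFFIX_LEN = 11          # the CronJob controller appends an 11-char scheduled-time suffix
MAX_CRONJOB_NAME = LABEL_MAX - SUFFIX_LEN   # = 52

def check_cronjob_name(name):
    """Reject CronJob names that would silently fail to schedule Jobs."""
    if len(name) > MAX_CRONJOB_NAME:
        raise ValueError(
            f"CronJob name {name!r} is {len(name)} chars; max is "
            f"{MAX_CRONJOB_NAME} (63-char label limit minus 11-char suffix)"
        )
    return name
```

Validating names at job-definition time turns a silent scheduling failure into a loud error long before anything reaches production.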
  20. 37.

    Designing a Migration: Investigate bugs Give back to the community • Upstreamed all our fixes to be good citizens ◦ Kubernetes, etcd, and others • Discovered Kubernetes’s SIGs (Special Interest Groups)
  21. 39.

    Designing a Migration: Game Days • Come up with a failure scenario ◦ E.g. a single Kubernetes API server failure • Cause the scenario in production! • Make sure the system behaves as expected
  22. 41.

    Designing a Migration: Game Days Things we tested • Terminating one Kubernetes API server • Terminating all the Kubernetes API servers and bringing them back up • Terminating an etcd node • A network partition between all workers and the API servers
  23. 42.

    So how did it go? • ~6 months in production • No major incidents since the production rollout \o/ • Most minor issues have been with Kubernetes worker failures ◦ Pods stuck in the Pending state on a worker for various reasons ◦ Fork-bombed ourselves and ran into thread/process limits ◦ DNS resolv.conf behavior
  24. 43.

    The Future • Need to invest more in Kubernetes cluster rebuilds ◦ Still running the 1.7.x branch we launched with • Will look more at EKS (AWS’s managed solution) once it’s available • Leaning a lot on open-source tools related to Kubernetes ◦ Envoy, kube2iam, Confidant
  25. 44.

    Summing it up • Define a clear business reason for your Kubernetes project (and for all infrastructure projects!) • Talk to your users! • Kubernetes is not right for everyone (it’s hard to run!) • … but if you decide to go for it, invest time in learning how to properly operate a cluster