Kubernetes Cron Jobs: Alpha to Production

Stripe runs many critical production scheduled tasks using Kubernetes Cron Jobs. These do everything from monitoring to moving money across the globe, with stringent requirements for reliability and timeliness.

In this presentation, we’ll discuss our methodology for evaluating Kubernetes as well as things we learned during the course of setting up clusters, shaking out bugs, and ultimately migrating our production workloads.

Franklin Hu

May 15, 2018

Transcript

  1. 5.

    Why cron jobs? To move money: • Create a file containing many transaction records • Write the file to a bank’s FTP server • The bank sweeps files from the directory roughly daily • The bank may write a response file to another FTP server
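The file-then-FTP flow above can be sketched roughly like this. A minimal sketch only: the host, credentials, directory, filename, and record format are all hypothetical, not Stripe’s actual setup.

```python
import io
from ftplib import FTP_TLS  # stdlib; plain ftplib.FTP also exists for unencrypted FTP

def serialize_records(records):
    """Join transaction records into one newline-delimited file body (format is hypothetical)."""
    return ("\n".join(records) + "\n").encode("utf-8")

def upload_transaction_file(records, host, user, password, remote_dir, filename):
    """Upload one file of transaction records to the bank's FTP server."""
    ftp = FTP_TLS(host)
    ftp.login(user=user, passwd=password)
    ftp.prot_p()                              # encrypt the data channel
    ftp.cwd(remote_dir)                       # the bank sweeps this directory ~daily
    ftp.storbinary("STOR " + filename, io.BytesIO(serialize_records(records)))
    ftp.quit()
```

A scheduled job of this shape is exactly what the talk's cron jobs run: build the file, push it, and let the bank's daily sweep pick it up.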
  2. 7.

    Cron Job Assumptions => Move money as soon as possible => Get transactions into files as soon as possible => Run jobs close to the deadline, but not so close that we risk missing it (cost/benefit)
  3. 8.

    Cron Job Platform Implications • Jobs should be scheduled predictably ◦ Not running a job (or delaying it past a deadline) means our users don’t get paid • Terminating a running job may be high cost
  4. 11.

    Status Quo • Started running Chronos + Mesos in ~2015 • Lots of issues with Chronos (2015-2017) ◦ Super buggy, frequently causing issues (lost jobs) ◦ Not actively developed (abandoned?) ◦ Built a lot of monitoring scaffolding around Chronos to know when it lost jobs ◦ Lost jobs paged humans, but didn’t always manifest as incidents • Many other teams started using Chronos for cron jobs
  5. 15.

    Why did we pick Kubernetes? • Open source! • An active community that takes contributions • Large upside: we can run other workloads ◦ Services, one-time Jobs, etc. • We use Go widely for infrastructure; ICs were able to pick it up quickly Risks • Kubernetes has lots of pieces => operating it well is challenging • The Cron Job API was in alpha :\
  6. 18.

    Building Confidence: Talk to other companies • Lean on others’ experiences! Trust, but verify! • Introduced via mutual contacts and/or via the Internet • Talked to other companies that have used Kubernetes ◦ All used it in different ways, in different environments (GKE, bare metal) • Sussed out the common points
  7. 19.

    Building Confidence: Talk to other companies Some Common Advice • Prioritize etcd reliability • Some features are more stable than others; some companies wait for the release after a feature is marked stable, to let bugs get fixed • Consider using a hosted solution (GKE, AKS, EKS), because setting up an HA cluster is a lot of work
  8. 20.

    Building Confidence: Read the code • The CronJob API was in alpha, which made us nervous • Question: can we understand it? • The entire controller is < 1000 LOC! Notably, ◦ The controller is a stateless service ◦ Every ten seconds, it runs a syncAll function... ◦ That fetches all CronJobs from the Kubernetes API, iterates through them, and figures out which ones to run • Gave us confidence that if there were issues, we could fix them
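The control loop described above is simple enough to paraphrase. This is a hedged Python paraphrase of that stateless sync loop; the real controller is written in Go in the Kubernetes repo, and the method names on `api` and `cron_job` here are illustrative, not the actual API.

```python
import time

def sync_all(api, now):
    """One pass: fetch every CronJob and start the ones that are due (illustrative)."""
    started = []
    for cron_job in api.list_cron_jobs():        # fetch all CronJobs from the API server
        due = cron_job.next_scheduled_time() <= now
        if due and not cron_job.forbidden_by_concurrency_policy():
            api.create_job(cron_job)             # materialize a Job object for this run
            started.append(cron_job.name)
    return started

def run_controller(api):
    """The controller is stateless: it just runs sync_all every ten seconds."""
    while True:
        sync_all(api, time.time())
        time.sleep(10)
```

Because there is no state between passes, a crashed controller simply picks up where the next ten-second tick lands, which is part of what made the small codebase easy to trust.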
  9. 21.

    Building Confidence: Load testing • Question: does it work? Can it scale? • Requirement: schedule ~50 jobs/minute • Goal: find the CronJob controller’s limit • Test ◦ A 3-node cluster; 1000 cron jobs, each running every minute ◦ Each of them: bash -c 'echo hello world'
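Each of the thousand load-test CronJobs looked roughly like the manifest below. This is a sketch: the `batch/v2alpha1` API group matches the alpha-era API discussed in this talk (later releases moved to `batch/v1beta1` and then `batch/v1`), and the object name and image are illustrative.

```yaml
apiVersion: batch/v2alpha1    # alpha-era CronJob API group (assumption based on the era)
kind: CronJob
metadata:
  name: load-test-job-0001    # one of ~1000 such objects
spec:
  schedule: "* * * * *"       # every minute
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: hello
            image: ubuntu
            command: ["bash", "-c", "echo hello world"]
```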
  10. 22.

    Building Confidence: Load testing Failure! • Kubernetes maxes out at 1 pod per second per node ◦ 180 jobs/min on a 3-node cluster • Good enough at the time ◦ Escape hatch: scale out workers ◦ Patchable long term
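The ceiling above is simple arithmetic: the observed per-node scheduling limit times the node count. A worked check of the numbers in the slide, not a benchmark:

```python
# Observed limit from the load test: about 1 pod started per second per node.
pods_per_second_per_node = 1
nodes = 3

# Cluster-wide throughput per minute.
cluster_jobs_per_minute = pods_per_second_per_node * nodes * 60   # = 180

# The stated requirement was ~50 jobs/minute, so 180 left headroom,
# and scaling out workers raises the ceiling linearly.
required_jobs_per_minute = 50
assert cluster_jobs_per_minute >= required_jobs_per_minute
print(cluster_jobs_per_minute)
```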
  11. 24.

    Building Confidence: Operating etcd • Set up etcd replication (tolerate node failures!) ◦ We run with 5 replicas in production • Make sure you have enough I/O bandwidth ◦ We ran into an issue where a single node with slow fsync caused continuous leader elections; important if you’re using network storage • Built out tooling + runbooks for managing node lifecycles • Upstreamed some fixes to make etcd play nicely with Consul DNS ◦ Done by someone with little Go experience! etcd is a big project, but they were able to get their feet wet with a config change
  12. 25.

    Building Confidence: Operating etcd Testing • Recovering from backup • Rebuilding the entire cluster without downtime • Load testing
  13. 26.

    Building Confidence: Metrics and Monitoring Aside: Veneur • A sink for various observability primitives, with lots of outputs • Supports statsd or SSF input • https://github.com/stripe/veneur/
  14. 27.

    Building Confidence: Metrics and Monitoring • Use the kube-state-metrics package for cluster-level metrics • Asked Observability to write a Prometheus plugin, and they open-sourced it! • veneur-prometheus scrapes metrics out of kube-state-metrics and emits them into our metrics pipeline • Created Datadog alerts for various things, like the number of pending pods
  15. 29.

    Designing a Migration: Requirements • No security regressions • No user-facing incidents • Finish before the holiday season • SLOs? • Do not require changes to jobs in order to migrate • Ratchet up requirements after migrating ◦ E.g. job runtime
  16. 30.

    Designing a Migration: Cut scope • Cutting scope reduces migration risk! • Kubernetes has a lot of fancy features that you can use ◦ We avoided pod-to-pod networking and full containerization • Modified interfaces where necessary ◦ Took away human SSH access, but provided stop-gap coverage • Punted non-essentials until after the migration was finished
  17. 33.

    Designing a Migration: Migrate incrementally • Expose a single interface! • Use feature flags! ◦ Built tooling that let us flag jobs between the old and new clusters • A flip took < 5 minutes, so if something went wrong we could easily switch back
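The flag-based cutover can be sketched as a tiny router. Everything here — the flag store, the cluster names, the job names — is hypothetical illustration of the idea, not Stripe’s actual tooling:

```python
# Hypothetical feature-flag store: each job name maps to the cluster that owns it.
FLAGS = {}  # job name -> "chronos" | "kubernetes"

def flip(job_name, cluster):
    """Move one job between clusters; rolling back is the same one-line change."""
    assert cluster in ("chronos", "kubernetes")
    FLAGS[job_name] = cluster

def owning_cluster(job_name):
    # Default to the old scheduler until a job is explicitly migrated.
    return FLAGS.get(job_name, "chronos")
```

The point of the design is the default: an unflagged job keeps running where it always did, so migration proceeds one job at a time and a bad flip reverses in minutes.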
  18. 34.

    Designing a Migration: Migrate incrementally Major migration goal: don’t cause any outages • We had a variety of jobs => breaking a low-impact one is okay • Used this to discover where our gaps were, one edge case at a time
  19. 36.

    Designing a Migration: Investigate bugs Rule: if Kubernetes does something unexpected, investigate, find the root cause, and come up with a remediation • Found a bunch of bugs during testing ◦ CronJobs with names longer than 52 characters silently fail to schedule jobs ◦ Pods would sometimes get stuck in the Pending state forever ◦ The scheduler would crash every 3 hours ◦ Flannel’s host-gw backend didn’t replace outdated route table entries
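The 52-character bug follows from Kubernetes’s 63-character limit on label values: the CronJob controller appends an 11-character suffix when naming each Job, so CronJob names longer than 63 − 11 = 52 characters produce invalid objects. A pre-flight check in that spirit (a hedged sketch; the constants reflect the limit as described in this talk):

```python
LABEL_MAX = 63           # Kubernetes caps label values at 63 characters
SUFFIX_LEN = 11          # the CronJob controller appends an 11-char scheduled-time suffix
MAX_CRONJOB_NAME = LABEL_MAX - SUFFIX_LEN   # = 52

def check_cronjob_name(name):
    """Reject CronJob names that would silently fail to schedule Jobs."""
    if len(name) > MAX_CRONJOB_NAME:
        raise ValueError(
            f"CronJob name {name!r} is {len(name)} chars; max is "
            f"{MAX_CRONJOB_NAME} (63-char label limit minus 11-char suffix)"
        )
    return name
```

Validating names at job-definition time turns a silent scheduling failure into a loud error long before anything reaches production.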
  20. 37.

    Designing a Migration: Investigate bugs Give back to the community • Upstreamed all our fixes to be good citizens ◦ Kubernetes, etcd, and others • Discovered Kubernetes’s SIGs (Special Interest Groups)
  21. 39.

    Designing a Migration: Game Days • Come up with a failure scenario ◦ E.g. a single Kubernetes API server failure • Cause the scenario in production! • Make sure the system behaves as expected
  22. 41.

    Designing a Migration: Game Days Things we tested • Terminating one Kubernetes API server • Terminating all the Kubernetes API servers and bringing them back up • Terminating an etcd node • A network partition between all workers and the API servers
  23. 42.

    So how did it go? • ~6 months in production • No major incidents since the production rollout \o/ • Most minor issues have been with Kubernetes worker failures ◦ Pods stuck in the Pending state on a worker for various reasons ◦ Fork-bombed ourselves and ran into thread/process limits ◦ DNS resolv.conf behavior
  24. 43.

    The Future • Need to invest more in Kubernetes cluster rebuilds ◦ Still running the 1.7.x branch we launched with • Will look more at EKS (AWS’s managed solution) once it’s available • Leaning a lot on open-source tools related to Kubernetes ◦ Envoy, kube2iam, Confidant
  25. 44.

    Summing it up • Define a clear business reason for your Kubernetes project (and for all infrastructure projects!) • Talk to your users! • Kubernetes is not right for everyone (it’s hard to run!) • … but if you decide to go for it, invest time in learning how to properly operate a cluster