Kubernetes Cron Jobs: Alpha to Production

Stripe runs many critical production scheduled tasks using Kubernetes Cron Jobs. These do everything from monitoring to moving money across the globe, with stringent requirements for reliability and timeliness.

In this presentation, we’ll discuss our methodology for evaluating Kubernetes as well as things we learned during the course of setting up clusters, shaking out bugs, and ultimately migrating our production workloads.

Franklin Hu

May 15, 2018
Transcript

  1. Kubernetes Cron Jobs
    Going from Alpha to Production
    @thisisfranklin

  2. What is Stripe?

  3. Global Payments
    mv $alice $bob $amt

  4. Why cron jobs?

  5. To move money,
    ● Create a file containing many transaction records
    ● Write the file to the bank’s FTP server
    ● The bank sweeps files from the directory roughly daily
    ● The bank may write a response file to some other FTP server
    Why cron jobs?
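The steps above can be sketched with Python's stdlib `ftplib`. The pipe-delimited record format, the host, and the credentials are hypothetical stand-ins; real bank files follow bank-specific specs (e.g. fixed-width NACHA records):

```python
import ftplib
import io


def build_transaction_file(records):
    """Render transaction records into a newline-delimited file body.

    The pipe-delimited layout here is purely illustrative.
    """
    lines = ["|".join(str(field) for field in record) for record in records]
    return ("\n".join(lines) + "\n").encode("ascii")


def upload_to_bank(body, host, user, password, remote_name):
    """Write the file to the bank's FTP server (hypothetical credentials)."""
    with ftplib.FTP(host, user, password) as ftp:
        ftp.storbinary(f"STOR {remote_name}", io.BytesIO(body))
```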

  6. Developers understand cron jobs
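Part of why developers understand them: the scheduling model fits in your head. A minimal sketch of that mental model for a fixed daily entry like `30 14 * * *` (real cron parsing handles far more than this one case):

```python
from datetime import datetime, timedelta


def next_daily_run(now, hour, minute):
    """Next time a daily cron entry like '30 14 * * *' would fire."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # Today's slot has passed; the next run is tomorrow.
        candidate += timedelta(days=1)
    return candidate
```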

  7. Cron Job Assumptions
    => Move money as soon as possible
    => Get transactions in files as soon as possible
    => Run jobs close to the deadline, but not so close that we risk missing
    the deadline (cost/benefit)

  8. Cron Job Platform Implications
    ● Jobs should be scheduled predictably
    ○ Not running a job (or delaying it past its deadline) means our users don’t get paid
    ● Terminating running job may be high cost

  9. Overview
    Why did we pick Kubernetes?
    Building Confidence
    Designing a Migration

  10. Why did we pick Kubernetes?

  11. Status Quo
    ● Started running Chronos + Mesos in ~2015
    ● Lots of issues with Chronos (2015-2017)
    ○ Super buggy, frequently causing issues (lost jobs)
    ○ Not actively developed (abandoned?)
    ○ Built a lot of monitoring scaffolding around Chronos to know when it lost jobs
    ○ Lost jobs paged humans, but didn’t always manifest as incidents
    ● Many other teams started using Chronos for cron jobs

  12. Service Level Objectives (SLOs)
    Measurable system characteristics
    (availability, performance, throughput, etc.)

  13. 99.99% of jobs start running within 20
    min of their scheduled times
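Measured, this SLO is just a ratio over (scheduled, actual) start-time pairs; a minimal sketch:

```python
from datetime import datetime, timedelta


def start_slo_compliance(scheduled_and_actual, threshold=timedelta(minutes=20)):
    """Fraction of jobs whose actual start is within `threshold` of
    the scheduled time (the start-latency SLO above targets 99.99%)."""
    on_time = sum(
        1 for scheduled, actual in scheduled_and_actual
        if actual - scheduled <= threshold
    )
    return on_time / len(scheduled_and_actual)
```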

  14. 99.99% of jobs run to completion

  15. Why did we pick Kubernetes?
    ● Open source!
    ● Active community that takes contributions
    ● Large upside: we could run other workloads on it
    ○ Services, one-time Jobs, etc.
    ● We use Go widely for infrastructure; ICs (individual contributors) were able to pick it up quickly
    Risks
    ● Kubernetes has lots of pieces => Operating it well is challenging
    ● Cron Job API was in alpha :\

  16. Building Confidence

  17. What is Kubernetes?

  18. Building Confidence: Talk to other companies
    ● Lean on others’ experiences! Trust but verify!
    ● Introduced via mutual contacts and/or the Internet
    ● Talk to other companies that have used Kubernetes
    ○ All used it in different ways, in different environments (GKE, bare metal)
    ● Suss out the common points

  19. Building Confidence: Talk to other companies
    Some Common Advice
    ● Prioritize etcd reliability
    ● Some features are more stable than others; some companies wait for the
    release after a feature is marked stable, so early bugs get fixed
    ● Consider using a hosted solution (GKE, AKS, EKS), because setting up an
    HA cluster is a lot of work

  20. Building Confidence: Read the code
    ● The CronJob API was in alpha, which made us nervous
    ● Question: Can we understand it?
    ● The entire controller is < 1000 LOC! Notably,
    ○ The controller is a stateless service
    ○ Every ten seconds, it runs a syncAll function...
    ○ That fetches all CronJobs from the Kubernetes API, iterates through, and figures out
    which ones to run
    ● Gave us confidence that if there were issues, we could fix them.
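The loop described above can be sketched in Python (the real controller is Go, in the Kubernetes repo); `api`, `list_cron_jobs`, `start_job`, and `is_due` are hypothetical stand-ins for the Kubernetes API client:

```python
import time
from datetime import datetime, timezone


def sync_all(api, now):
    """One pass of the control loop: fetch every CronJob, start the due ones."""
    started = []
    for cron_job in api.list_cron_jobs():
        if cron_job.is_due(now):
            api.start_job(cron_job)
            started.append(cron_job.name)
    return started


def run_controller(api):
    # Stateless: all state comes back from the API server on each pass.
    while True:
        sync_all(api, datetime.now(timezone.utc))
        time.sleep(10)
```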

  21. Building Confidence: Load testing
    ● Question: Does it work? Can it scale?
    ● Requirement: schedule ~50 jobs/minute
    ● Goal: find the CronJob controller’s limit
    ● Test
    ○ 3-node cluster; 1000 cron jobs that each ran every minute.
    ○ Each of them: bash -c 'echo hello world'

  22. Building Confidence: Load testing
    Failure!
    ● Kubernetes maxes out at 1 pod per second per node
    ○ 180 jobs/min on a 3-node cluster
    ● Was good enough at the time
    ○ Escape hatch: scale out workers
    ○ Patchable long term

  23. Building Confidence: Operating etcd

  24. Building Confidence: Operating etcd
    ● Set up etcd replication (tolerate node failures!)
    ○ We run with 5 replicas in production
    ● Make sure you have enough I/O bandwidth
    ○ Ran into an issue where a single node with slow fsync caused continuous leader
    elections. Important if you’re using network storage
    ● Built out tooling + runbooks for managing node lifecycles
    ● Upstreamed some fixes to make etcd play nice with Consul DNS
    ○ Done by someone with little Go experience! etcd is a big project, but they were
    able to get their feet wet with a config change
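A crude way to sanity-check I/O before pointing etcd at a disk is to time a single write plus fsync, roughly what etcd's WAL does on each commit. This probe is a sketch; etcd's own WAL fsync duration histogram is the real signal for the slow-fsync leader-election problem above:

```python
import os
import tempfile
import time


def fsync_latency_ms(directory=None, payload=b"x" * 4096):
    """Time one write+fsync on the given directory's filesystem."""
    fd, name = tempfile.mkstemp(dir=directory or tempfile.gettempdir())
    try:
        start = time.monotonic()
        os.write(fd, payload)
        os.fsync(fd)  # the step that is slow on bad/network storage
        return (time.monotonic() - start) * 1000.0
    finally:
        os.close(fd)
        os.unlink(name)
```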

  25. Building Confidence: Operating etcd
    Testing
    ● Recovering from backup
    ● Rebuilding the entire cluster without downtime
    ● Load testing

  26. Building Confidence: Metrics and Monitoring
    Aside: Veneur
    ● Sink for various observability primitives with lots of outputs
    ● Supports statsd or SSF input
    ● https://github.com/stripe/veneur/

  27. Building Confidence: Metrics and Monitoring
    ● Use the kube-state-metrics package for cluster-level metrics
    ● Asked Observability to write a Prometheus plugin! And they open
    sourced it!
    ● veneur-prometheus to scrape metrics out of kube-state-metrics and
    emit them into our metrics pipeline
    ● Create Datadog alerts for various things, like # pending pods
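`kube_pod_status_phase` is one of the kube-state-metrics series behind a pending-pods alert. The minimal text-format parse below is a sketch for illustration (in our pipeline, veneur-prometheus does the real scraping):

```python
def count_pending_pods(metrics_text):
    """Sum kube_pod_status_phase samples where phase is Pending,
    given Prometheus text-format output from kube-state-metrics."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("kube_pod_status_phase") and 'phase="Pending"' in line:
            total += float(line.rsplit(None, 1)[-1])  # last token is the value
    return int(total)
```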

  28. Designing a Migration

  29. Designing a Migration: Requirements
    ● No security regressions
    ● No user-facing incidents
    ● Finish before holiday season
    SLOs?
    ● Do not require changes to jobs to migrate
    ● Ratchet up requirements after migrating
    ○ E.g. Job runtime

  30. ● Reduces migration risk!
    ● Kubernetes has a lot of fancy features that you don’t have to use
    ○ We avoided pod-to-pod networking and full containerization
    ● Modified interfaces where necessary
    ○ Took away human SSH access, but provided stop gap coverage
    ● Punted non-essentials until after the migration was finished
    Designing a Migration: Cut scope

  31. Doing something “the right way”
    != Operating well

  32. Migrate Incrementally

  33. ● Expose a single interface!
    ● Use feature flags!
    ○ Built tooling that let us flag jobs
    between old and new clusters
    ● Took < 5 mins to flip, so if
    something went wrong we
    could easily switch it back
    Designing a Migration: Migrate incrementally
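The flagging tool itself isn't public; a minimal sketch of the routing decision, with `migrated_jobs` as a hypothetical stand-in for the flag store (flipping a flag took under 5 minutes in either direction):

```python
def cluster_for_job(job_name, migrated_jobs):
    """Route a job to the new cluster iff its migration flag is set;
    otherwise it stays on the old Chronos cluster."""
    return "kubernetes" if job_name in migrated_jobs else "chronos"
```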

  34. Major migration goal: don’t cause
    any outages
    ● Had a variety of jobs =>
    Breaking a low-impact one is
    okay
    ● Used this to discover where our
    gaps were one edge-case at a
    time
    Designing a Migration: Migrate incrementally

  35. Rule: No ghosts

  36. Designing a Migration: Investigate bugs
    Rule: If Kubernetes does something unexpected, investigate, find root
    cause, and come up with remediation
    ● Found a bunch of bugs during testing
    ○ CronJobs with names longer than 52 characters silently fail to schedule jobs
    ○ Pods would sometimes get stuck in the Pending state forever [0] [1]
    ○ The scheduler would crash every 3 hours
    ○ Flannel’s hostgw backend didn’t replace outdated route table entries
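The 52-character limit can be caught before deploy. A sketch of such a check, assuming (our reading of the bug) that the cause is the 63-character DNS-label cap on object names minus the suffix the controller appends to generated Job names:

```python
def validate_cron_job_name(name, limit=52):
    """Reject CronJob names long enough to hit the silent-failure bug."""
    if len(name) > limit:
        raise ValueError(
            f"CronJob name {name!r} is {len(name)} chars; "
            f"names over {limit} silently fail to schedule jobs"
        )
    return name
```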

  37. Designing a Migration: Investigate bugs
    Give back to the community
    ● Upstreamed all our fixes to be good citizens
    ○ Kubernetes, etcd, and others
    ● Discovered Kubernetes’s SIGs (special interest groups)

  38. If you find bugs, you’ll need to fix
    them yourself

  39. Designing a Migration: Game Days
    ● Come up with a failure scenario
    ○ Single Kubernetes API server failure
    ● Cause the scenario in production!
    ● Make sure the system behaves as expected

  40. Game Days test tech, docs, and humans

  41. Designing a Migration: Game Days
    Things we tested
    ● Terminate one Kubernetes API server
    ● Terminate all the Kubernetes API servers and bring them back up
    ● Terminate an etcd node
    ● Network partition between all workers and API servers

  42. So how did it go?
    ● ~6 months in production
    ● No major incidents since production roll out \o/
    ● Most minor issues have been with Kubernetes worker failures
    ○ Pods stuck in the Pending state on a worker for various reasons
    ○ Fork-bombed ourselves and ran into thread/process limits
    ○ DNS resolv.conf behavior

  43. The Future
    ● Need to invest more in Kubernetes cluster rebuilds
    ○ Still running the 1.7.x branch we launched with
    ● Will look more at EKS (AWS’s managed solution) once it’s available
    ● Leaning a lot on open source tools related to Kubernetes
    ○ Envoy, kube2iam, Confidant

  44. Summing it up
    ● Define a clear business reason for your Kubernetes projects (and all
    infrastructure projects!).
    ● Talk to your users!
    ● Kubernetes is not right for everyone (it’s hard to run!)
    ● … but if you decide to go for it, invest time in learning how to properly
    operate a cluster

  45. Resources
    ● https://stripe.com/blog/operating-kubernetes
    ● https://stripe.com/blog/game-day-exercises-at-stripe
    ● https://github.com/stripe/veneur

  46. Thanks!
    Thoughts or feedback?
    @thisisfranklin
    [email protected]
