Forklifting Django: Migrating A Complex Django App To Kubernetes

Kubernetes has rapidly become an omnipresent tool in the infrastructure world, running everything from the biggest monoliths to the smallest microservices. But with power comes complexity, and Kubernetes has never been accused of being too simple or having too few moving pieces. Over about 9 months, we’ve moved Ridecell’s primary Django application from AWS and Ansible to Kubernetes, encountering many pitfalls along the way. This talk will show the approaches we tried and the worst of the issues we encountered, and lay out a path for migrating other Python web applications more effectively!


Noah Kantrowitz

August 02, 2019


  1. 1.

    Forklifting Django: Migrating A Complex Django App To Kubernetes

    Noah Kantrowitz – @kantrn – DjangoCon US 2019
  2. 2.

    Quick intro: who I am and what I do.

    Who Am I?
    » Noah Kantrowitz
    » @kantrn, coderanger
    » Ops and Python
    » Ridecell
  3. 3.

    Deploying a web application is hard. There are a lot of individual pieces that all must happen in the right order, over and over, and the whole process has to be fast and easy to use. In the beginning, deployment was entirely manual and thus slow, error-prone, and time-intensive. In time, new tools like Fabric, Chef, and Ansible helped to automate things, and Docker helped move more steps from deploy-time to build-time. But with the rise of new container tools come new deployment challenges.

    Deployment Is Hard
  4. 4.

    So let's set the stage. My main application is a pretty standard Django monolith. We're still on Python 2, which means Django 1, but for deployment that doesn't really change much. We've got Celery for background tasks and Channels for websockets. We use Postgres as the SQL database, Redis as a cache, and RabbitMQ for Celery. Additionally, my main application is single-tenant: we deploy a separate copy for each customer. This allows independent versioning for each customer and reduces the chances of one customer seeing data for another. For deployment, this mostly means we need to put out a lot more environments than usual, but the same problems apply to smaller-scale systems.

    My Application
    Summon Platform
    Python 2 and Django 1.11
    Celery and Channels
    Postgres, Redis, RabbitMQ
    ⚡ Single Tenant
  5. 5.

    That leads into what the deployment system looked like when we started on this adventure. This main application is entirely AWS-based, though we do have a few things using GCP as an experiment. The main two tools for deployment were Terraform for initial provisioning (along with some Python scripts for the bits that there wasn't a provider for) and a bunch of Ansible roles for configuring the systems and deploying the actual app code. Secrets lived in Ansible Vault and access control was managed by SSH key distribution. It was a solid setup, but launching a new instance took a few hours of work and required someone from my team to do most of it. Similarly, scaling up and down was complicated, and the system was very much not built for autoscaling.

    Current Deployment
    AWS – Infrastructure and storage
    Terraform – Shared modules for provisioning infrastructure
    Ansible – System configuration and deployment
  6. 6.

    We also have a bunch of Django-based microservices, but let's just focus on the big application for now.

    Microservices?
  7. 7.

    We did have a few constraints for our new system. We knew we wanted to use Kubernetes as our container management system. We already had it in use for our microservices and it's my weapon of choice for container wrangling overall. Additionally, we wanted to keep using RDS and CloudAMQP for databases. I am probably more in favor of databases on Kubernetes than most, but with a small team I'll pretty much always trade money for not having to manage them myself. This would also ease migration as we wouldn't have to make changes to the production databases.

    The More Things Change
    The more they stay the same
    Kubernetes – Already in use for microservices
    RDS – Production Postgres
    CloudAMQP – Production RabbitMQ
    Cloudflare – CDN Shield
  8. 8.

    What do I mean by a forklift migration? While I could make some changes to the application, the overall goal was to pick it up from AWS and drop it as directly as possible into Kubernetes.

    ⚠ Forklift Mode
  9. 9.

    Before we start talking specifics, let's define some jargon for people who haven't heard it before.

    Aside: K8s 101
  10. 10.

    Two generic terms: a container is a cool way to run a process, and an image is the larval form of a container.

    Container – A way to run a Linux (usually) process so it can't see any other container and is generally somewhat locked down.
    Image – All the files needed to create a container.
  11. 11.

    Kubernetes itself is many things to many people, but the most important thing for this talk is that it is an API to make containers jump through hoops until they are useful. Pod is a Kubernetes-specific term that mostly means the same thing as container, but with some side benefits.

    Kubernetes – An API-driven system for coordinating multiple containers and related infrastructure to do useful work.
    Pod – A collection of one or more containers running on a single host. Generally has a main application container and a bunch of helpers, called sidecars.
  12. 12.

    So with all that laid out, I was ready to roll up my sleeves and jump in.

    Round One
  13. 13.

    The first step in containerizing any application is to make a container image. Multi-stage builds are especially useful for this kind of application as the build requirements are often much bigger than what is needed at runtime. One thing of note here is the fact that I'm not using a virtualenv. I'm sure some will disagree with me, but I find it not actually very helpful inside a container which will only ever be used for one thing.

        FROM ubuntu:16.04 AS build
        ADD /
        RUN apt-get update && apt-get install -y python-dev … && \
            python /
        COPY . /src
        WORKDIR /src
        RUN python -m pip install --extra-index=… -r requirements.txt && \
            find . -name '*.pyc' -delete && \
            ( python -m compileall . || true ) && \
            env DOCKER_BUILD=true python collectstatic
  14. 14.

    After the build phase, we set up the smaller runtime image and copy over just the needed files. This also sets up a non-root user to run the application as.

        FROM ubuntu:16.04
        RUN apt-get update && \
            apt-get install -y --no-install-recommends python … && \
            locale-gen en_US.UTF-8 && \
            useradd -M --shell /bin/false --home /src summon
        COPY --from=build /src /src/
        COPY --from=build /usr/local/lib/python2.7/dist-packages …
        COPY --from=build /var/www /var/www
        USER summon
        WORKDIR /src
  15. 15.

    My first attempt to get things set up in Kubernetes was purely static manifests with no automation beyond kubectl apply. Fitting a whole Kubernetes YAML file on screen is hard, so this is just the main container definition, but it gives you the idea. Starting simple: just mount up a config, run some migrations, and launch gunicorn.

    Static Manifests

        image:
        command:
        - sh
        - -c
        - python migrate && \
          exec python -m -b summon_platform.wsgi
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: 512M
            cpu: 1000m
        volumeMounts:
        - name: config-volume
          mountPath: /etc/config
  16. 16.

    A quick improvement was to move the migrations to an init container and, for bonus points, make the Postgres password dynamic at runtime. This helped a bit but still had a lot of issues with ordering. With more than one replica, every pod would try to run the migrations, and the Celery and Channels pods would upgrade before the migrations had finished. It was a good start and proved our application would run, but clearly we needed a more organized deployment system underneath it.

        initContainers:
        - image:
          command:
          - sh
          - -c
          - sed "s/__PGPASSWORD__/$(cat /postgres/password)/" \
            </etc/config-orig/app.yml >/etc/config/app.yml && \
            python migrate
          volumeMounts:
          - name: config-volume
            mountPath: /etc/config
          - name: config-orig
            mountPath: /etc/config-orig
          - name: postgres-credentials
            mountPath: /postgres
  17. 17.

    A few things I glossed over here, just so we are keeping track. The other three main daemons work exactly like the gunicorn pods but with a different command. All of the pods use the same image, since building all of the daemons into the image is relatively little overhead compared to the size of the shared bits. I've also ignored Celerybeat so far, which we'll get to later. My application is mostly an API used by mobile apps, with a web interface mostly used by administrative staff, so to keep things simple for now I decided to build all the static files into the image too, along with a chopped-down version of the Caddy web server to serve them. All of the HTTP traffic for Django, Daphne, and static files was routed through the Kubernetes Ingress system, which handles HTTP routing and management. Additionally, in this initial phase I started playing with the Zalando postgres-operator to set up local databases during testing, as provisioning RDS can be quite slow.

    Other Bits
    Celeryd, Daphne, Channelworkers
    Celerybeat – Not yet
    Static files
    Service and Ingress
    postgres-operator
  18. 18.

    So by this point I had a working proof-of-concept: I could deploy my whole stack onto Kubernetes and use it like normal. What I needed next was a way to do that repeatedly and reliably.

    Part Two
  19. 19.

    The next tool in the standard Kubernetes quiver is Helm, which calls itself the Kubernetes package manager. This is overall very similar to what we had before, but we can put the manifests in a Helm chart and install it as many times as needed.

    Helm
    And other hat-based techniques
  20. 20.

    I won't go over the chart itself, since it's basically the same thing we saw before with more curly braces, but overall Helm did improve some things. It made deploying multiple environments much smoother, as we could stamp out installs of the same chart trivially and store just the values that need to be different between them in their own YAML files. It is also widely supported by the Kubernetes community and ecosystem tools, and we already had operational experience with it on other projects.

    The Good Bits
    Repeatable – helm install myapp
    Templating – values-{qa,uat,prod}.yml
    Simple – Community standard solution
    Integrations – Already in use for our other services
  21. 21.

    But Helm brought a lot of problems too. It does little to fix the ordering and sequencing issues we had before. Helm does have a hooks system that we tried to use, but error handling was not great and overall it was complex to work with. There are also some major gaps in the Helm ecosystem that were frustrating. Unit and integration testing in isolation is very difficult, and secrets management is almost entirely left to plugins.

    The Less Good Bits
    Ordering – Still hard to do migrations first
    Stitching together different tools
    Minimal testing tools
    Secrets management
  22. 22.

    And then the big problem with Helm: Tiller. It's the server-side component used to interact with the Kubernetes API and it's a security nightmare. There are workarounds, but all of them come with frustrating downsides. The next major version of Helm will be removing Tiller entirely, but that is still a ways off, unfortunately. The short version of the problem is that Tiller needs very broad permissions to be able to do its job, and it has no internal authorization system, so anyone who can talk to Tiller can use those permissions.

    The Name Of My Sadness Is Tiller
  23. 23.

    Helm worked, but it wasn't the smooth experience we wanted, and the permissions issues with Tiller made a self-service option very tricky to set up.

    Section Three
  24. 24.

    And so with that, our current, and hopefully final, approach: a Kubernetes operator. Kubernetes operators tie together three main concepts: CRDs, watches, and controllers.

    CRDs, Watches, Controllers
    Operators
  25. 25.

    A Custom Resource Definition, or CRD, is a way to add new object types into Kubernetes. Just like Pods and Services are types, we can make our own SummonPlatform type. This lets us use our new object type to hold all the configuration required for an instance of the app.

        apiVersion:
        kind: CustomResourceDefinition
        metadata:
          name:
        spec:
          group:
          names:
            kind: SummonPlatform
            plural: summonplatforms
          scope: Namespaced
          subresources:
            status: {}
          version: v1beta1
  26. 26.

    Actually using my custom type works the same as any other Kubernetes object. Here I'm defining the version of my app to deploy as well as setting a Django configuration value. This is a bit more verbose than a Helm values file, but allows for things like schema validation and idempotent kubectl apply.

        apiVersion:
        kind: SummonPlatform
        metadata:
          name: car1-qa
          namespace: summon-qa
        spec:
          version: "108912-6e2f8b7-release-2019-8"
          config:
            DEBUG: True
  27. 27.

    So the CRD gives us a place to put our configuration; next we need to actually do something with it. The driver for that is something called an API watch. Basically it's like push notifications for Kubernetes data. Whenever an instance of my custom object changes, I want to get notified so I can do something with that change.

    Watches
  28. 28.

    The heart of any operator is its controllers. Each controller sets up some API watches, waits for a change, and then does its best to make sure the state of the world reflects the new config. Repeat forever.

    Watch
    Reconcile
    Repeat
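To make the watch, reconcile, repeat cycle concrete, here is a minimal in-memory sketch. It is not the operator's code: the queue stands in for an API watch, the `world` dict stands in for the cluster, and the names `reconcile` and `run_controller` are hypothetical.

```python
import queue


def reconcile(name, desired, world):
    """Make the observed state for `name` match the desired config."""
    if world.get(name) != desired:
        world[name] = dict(desired)
        return "updated"
    return "unchanged"


def run_controller(events, world):
    """Drain watch notifications, reconciling once per notification."""
    results = []
    while True:
        try:
            name, desired = events.get_nowait()
        except queue.Empty:
            break  # a real controller would block here and repeat forever
        results.append((name, reconcile(name, desired, world)))
    return results


events = queue.Queue()
events.put(("car1-qa", {"version": "1"}))
events.put(("car1-qa", {"version": "1"}))  # duplicate notification
events.put(("car1-qa", {"version": "2"}))
world = {}
results = run_controller(events, world)
print(results)
```

A duplicate notification is harmless here because the reconcile step compares against the current state before acting.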
  29. 29.

    Before we dive in to talk about more specifics of our Django controller, let's pull back again and talk more generally about building convergent systems and what that means.

    Aside: Convergent Systems
  30. 30.

    And a step back even from that is to define the opposite. A procedural system is built by giving the system each step you want it to take, in order.

    Procedural Design
    Listing out the steps you want the system to take, in order and with minimal feedback from one step to the next.
  31. 31.

    As opposed to a convergent system, where we only bother writing down the end state we want rather than all the steps, and let the system work out the details.

    Convergent Design
    Defining the desired end state and letting the system decide how best to achieve it.
  32. 32.

    And very briefly, two more important concepts. A system is idempotent if it only takes action when needed to correct the state. Promise Theory is a mathematical framework for describing and designing convergent systems by breaking them down into smaller subsystems that adhere to a specific contract.

    Idempotence
    The property of a system (or subsystem) where it only takes action if it needs to. If you're already at the goal, don't do the thing again.

    Promise Theory
    A way to build a convergent system by breaking it into little chunks, Actors. Each takes the desired state of some specific thing and continually tries to match the goal.
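As a small illustration of idempotence in the sense used here, the following sketch only writes to a file when the desired line is missing. The helper name `ensure_line` and the sample line are made up for the example.

```python
import os
import tempfile


def ensure_line(path, line):
    """Append `line` to the file only if it is not already present."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read().splitlines()
    if line in existing:
        return False  # already at the goal, so do nothing
    with open(path, "a") as f:
        f.write(line + "\n")
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hosts")
    first = ensure_line(path, "10.0.0.5 db")   # takes action
    second = ensure_line(path, "10.0.0.5 db")  # no action needed
    with open(path) as f:
        contents = f.read()
```

Running it any number of times leaves the file with exactly one copy of the line, which is the property the definition describes.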
  33. 33.

    Promise Theory systems can work at several levels. We start with small Actors and compose those into bigger Actors. In practical terms this means taking the thing you want to do and breaking it down into multiple convergent bits which can each be built and tested independently. This helps reduce the surface area of the larger Actors and keeps the code more manageable.

    Breaking Things Down
    » MyApp
    » Deployment
    » PostgresDatabase
    » RDSInstance
    » RabbitMQVhost
    » S3Bucket
  34. 34.

    So why does all this design and theory actually matter? Because part of being successful with Kubernetes is to rotate our thinking from procedural to convergent. You might have noticed that my brief description of a Promise Theory actor lines up pretty well with a Kubernetes controller, and that is no accident.

    Why?
  35. 35.

    But why does Kubernetes use these patterns in the first place? Because the past several decades of tooling for managing big, complex, distributed systems have shown that convergent systems are far more stable and easier to work with. The deep problem is dealing with state drift. Failures happen and disrupt the existing state, which will easily break things if you didn't account for every possible starting point in your procedural scripts. By using a convergent approach, your system will adapt to (hopefully) any weird current state and still be able to get you back to the goal.

    We Have To Go Deeper
  36. 36.

    Controllers can be used for all kinds of things, but most follow this pattern. On any change, read the root object, use that to generate a bunch of Kubernetes objects that would implement that config, and then apply those to the cluster like kubectl apply. This loop ensures that every change rebuilds everything from scratch in terms of configuration, so that we don't get drift over time. Each custom type has one controller; package up a whole bunch of those and you've got an operator.

    Simple Controller
    » kubectl apply
    » Watch trigger
    » Reconcile
    » Make changes
    » GOTO 10
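The "generate everything from the root object, then apply" pattern can be sketched as below. This is an illustrative stand-in, not the operator's logic: the cluster is a plain dict, and the object keys, the image name, and the functions `generate_objects` and `apply` are all hypothetical.

```python
def generate_objects(root):
    """Build every child object from scratch using only the root's spec."""
    name, version = root["name"], root["spec"]["version"]
    image = f"example/app:{version}"  # hypothetical image name
    return {
        f"deployment/{name}-web": {"image": image, "command": "gunicorn"},
        f"deployment/{name}-celeryd": {"image": image, "command": "celery worker"},
    }


def apply(cluster, desired):
    """Create or update objects so the cluster matches `desired`."""
    changes = []
    for key, obj in sorted(desired.items()):
        if key not in cluster:
            changes.append(("create", key))
        elif cluster[key] != obj:
            changes.append(("update", key))
        else:
            continue  # already correct, nothing to do
        cluster[key] = obj
    return changes


cluster = {}
root = {"name": "car1-qa", "spec": {"version": "1"}}
first = apply(cluster, generate_objects(root))   # creates both objects
second = apply(cluster, generate_objects(root))  # nothing changes
root["spec"]["version"] = "2"
third = apply(cluster, generate_objects(root))   # updates both objects
```

Because the desired objects are regenerated from the root every time, any hand edits in the cluster dict would be overwritten on the next pass, which is how this style avoids drift.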
  37. 37.

    But okay, enough about the ideas behind this system, where's the code? Unfortunately, writing complex operators in Python is doable but currently a bit tricky.
  38. 38.

    There are two projects trying to make it easier to write operators in Python, kopf and metacontroller, but both are aimed at very simple use cases and are fairly early in development. There is also the low-level Python client library, but building a controller loop from scratch is a lot of work. So in the end we decided to stick with the more community-standard Kubebuilder framework and the controller-runtime library, even though they are in Go.

    zalando-incubator/kopf
    GoogleCloudPlatform/metacontroller
    kubernetes-sigs/kubebuilder
  39. 39.

    So the first problem we tackled from the previous Helm iteration was sequencing. We added a lightweight state machine to track how far along we are in the process. The first phase covers the setup of databases and other underlying infrastructure. Then come running migrations, deploying the new app images, and finally being ready.

    Building A State Machine
    Initializing
    Migrating
    Deploying
    Ready
    Error
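A state machine with these five phases might look like the sketch below. The phase names come from the slide; the transition table, the `advance` helper, and its three-valued `step_ok` argument are assumptions made for illustration.

```python
from enum import Enum


class Phase(Enum):
    INITIALIZING = "Initializing"
    MIGRATING = "Migrating"
    DEPLOYING = "Deploying"
    READY = "Ready"
    ERROR = "Error"


# Assumed linear ordering of the happy path.
NEXT = {
    Phase.INITIALIZING: Phase.MIGRATING,
    Phase.MIGRATING: Phase.DEPLOYING,
    Phase.DEPLOYING: Phase.READY,
}


def advance(phase, step_ok):
    """step_ok: True = step finished, False = step failed, None = in progress."""
    if step_ok is False:
        return Phase.ERROR
    if step_ok is None:
        return phase  # stay put until the current step completes
    return NEXT.get(phase, phase)  # Ready and Error have no successor here


phase = Phase.INITIALIZING
phase = advance(phase, True)   # databases are set up
waiting = advance(phase, None)  # migrations still running
```

The point of gating each transition on completion is that the Deploying phase cannot begin while migrations are still in progress.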
  40. 40.

    Since the original is in Go, this is a Python-y example of how we handle migrations. Because the controller is real code rather than a declarative system like Helm, we can design the logic more carefully to be convergent, and with the state machine from before we know that the migrations will have completed before we try to move on to deploying the new code.

    Migrations

        if root.Spec.Version == root.Status.MigrateVersion:
            return 'already migrated'
        try:
            migrations = Get('job', root.Name)
        except NotFound:
            return start_migrations_for(root.Spec.Version)
        if migrations.Status == 'Success':
            root.Status.MigrateVersion = root.Spec.Version
            return 'done'
        return 'migrating'
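For readers who want to step through that logic, here is a runnable rendering of the same pseudocode against an in-memory fake. `FakeCluster`, its methods, and the dict layout of `root` are hypothetical stand-ins for the Kubernetes API and the custom object.

```python
class NotFound(Exception):
    pass


class FakeCluster:
    """In-memory stand-in for the Kubernetes API (hypothetical)."""

    def __init__(self):
        self.jobs = {}

    def get_job(self, name):
        try:
            return self.jobs[name]
        except KeyError:
            raise NotFound(name)

    def create_job(self, name, version):
        self.jobs[name] = {"version": version, "status": "Running"}


def reconcile_migrations(cluster, root):
    spec_version = root["spec"]["version"]
    if root["status"].get("migrate_version") == spec_version:
        return "already migrated"
    try:
        job = cluster.get_job(root["name"])
    except NotFound:
        cluster.create_job(root["name"], spec_version)
        return "migrating"
    if job["status"] == "Success":
        root["status"]["migrate_version"] = spec_version
        return "done"
    return "migrating"


cluster = FakeCluster()
root = {"name": "car1-qa", "spec": {"version": "v2"}, "status": {}}
r1 = reconcile_migrations(cluster, root)  # starts the job
r2 = reconcile_migrations(cluster, root)  # job still running
cluster.jobs["car1-qa"]["status"] = "Success"
r3 = reconcile_migrations(cluster, root)  # records the migrated version
r4 = reconcile_migrations(cluster, root)  # nothing left to do
```

Like the slide's pseudocode, this sketch does not show cleanup of a finished job left over from an earlier version.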
  41. 41.

    I put aside Celerybeat earlier, but we do have to deal with it eventually. There are two problems with beat. First, if you run two copies, your scheduled jobs all get run twice. Even if you design your tasks carefully to be idempotent (which you should), that's still a big load increase. Second, it's stateful. By default beat stores the last run time of each scheduled task in a local file. There is django-celery-beat, which moves the state storage into the main database, but depending on how frequently you run scheduled tasks, that might be a lot of write load on your SQL database. Celery-beatx is a project which helps with both of these: it handles locking between multiple instances of beat so only one is active at a time, but you can still run multiple for redundancy, and it allows using Redis or Memcached for state storage, which works better for the use case. But there is a catch: beatx is Python 3 only. So failing that, we want to use a StatefulSet.

    Celerybeat
    StatefulSet vs. BeatX
  42. 42.

    Again with pseudo-code so you don't have to try and read my terrible Go. Now that we have our databases and database users created automatically, we have to put those in the configuration dynamically. This is another bit of code in the controller, watching for changes in the secrets which hold the randomized passwords and generating our config YAML, which we can then read and parse.

    App Configuration

        dbSecret = Get('secret', f"{root.Name}.db-pass")
        rmqSecret = Get('secret', f"{root.Name}.rmq-pass")
        config = root.Spec.Config.copy()
        config["DATABASE_URL"] = f"pq://...:{dbSecret}@..."
        configYaml = yaml.safe_dump(config)
        appConfig = Secret(root.Name, {'summon-platform.yml': configYaml})
        Update(f"{root.Name}.appconfig", appConfig)
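A standard-library-only sketch of the config-generation step follows. Several things here are assumptions rather than the operator's behavior: the function name `build_app_config`, the `postgres://` URL layout, the sample host and user, and serializing with `json` (chosen because JSON output is also valid YAML and avoids a third-party dependency).

```python
import json
from urllib.parse import quote


def build_app_config(spec_config, db_user, db_password, db_host, db_name):
    """Merge the user-supplied config with generated connection settings."""
    config = dict(spec_config)  # copy so the root object's spec is untouched
    # Percent-encode the password so characters like "@" or "/" stay unambiguous.
    config["DATABASE_URL"] = (
        f"postgres://{db_user}:{quote(db_password, safe='')}@{db_host}/{db_name}"
    )
    return json.dumps(config, indent=2, sort_keys=True)


spec_config = {"DEBUG": True}
rendered = build_app_config(
    spec_config,
    db_user="car1_qa",            # hypothetical values
    db_password="p@ss/word",
    db_host="db.example.internal",
    db_name="car1_qa",
)
print(rendered)
```

Regenerating the whole document from the spec plus the current secret values means a rotated password flows into the config on the next reconcile.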
  43. 43.

    So that covers the underlying tech, what about workflow? We've switched from a more traditional script setup to GitOps.

    GitOps
  44. 44.

    Our starting point was, more or less, a simple bash script wrapping Ansible. This definitely works, and is repeatable, but has some issues as far as workflow. A source of truth ends up being the running system, and things can drift over time because convergence is both partial and manual. Let's look at both in detail.

        ansible-playbook \
          deploy.yml \
          --limit "$INSTANCE" \
          --tags deploy \
          --extra-vars="REV=$REV"
  45. 45.

    The idea of defining your sources of truth is to map out where exactly is authoritative for each piece of information in your system. So for example, with our Ansible code, the playbook and role files in git are the source of truth for how to configure an instance. A problem with this system is that which version of the app is deployed where does not live anywhere in git; the source of truth is whichever version is checked out on each machine. This also commonly comes up when using Jenkins parameterized jobs for deployment, where the record of what version is where lives in Jenkins build records.

    Sources Of Truth
    » Ansible
    » The system
    » Jenkins
    » ???
    » Git
  46. 46.

    A related issue is that because the workflow involves manual action, it's possible for things to be missed over time and slowly drift out of the expected state. Usually this is an old test server that sits in a corner of our AWS account, forgotten and unpatched for months or years.

    Drift
  47. 47.

    So, GitOps! The central idea of GitOps is that git is your only source of truth. All configuration data goes in files of some kind which go in some repositories. You then have some kind of automation that watches the repositories and applies any changes when they are merged.

    GitOps!
  48. 48.

    This approach gets you a lot of things. Because the full state of your configuration is reflected in git, the automation can continually clean up drift. This also means that in case of an emergency, you can always restore your entire configuration set by doing one deploy. It also helps bring operational changes to work like code changes, with review and commit messages and all those lovely things. And you get a simple audit log to show who made changes and when.

    Benefits
    » Continuous de-drift
    » Disaster recovery
    » Change review and logging
    » Unified process w/ code
  49. 49.

    But it's not perfect. The biggest friction we've had with GitOps is frustration at having to get minor changes code reviewed. We are using GitHub's internal branch protection rules system, which is very inflexible when it comes to review policies. Our plan is to write our own replacement as a GitHub App, but that will take some time. Also, this has put git and GitHub in the workflow of a lot of engineering-adjacent folks who needed help to get accounts configured and learn to use the tools. And of course if people write bad commit messages or review inconsistently, you don't get the benefits around that data.

    Downsides
    » Review overhead
    » Non-engineers fear git
    » git commit -m "update config"
  50. 50.

    Put it all together and we get a new end-to-end workflow based on GitOps. ArgoCD is the tool we use to do the git sync part of the workflow.

    New Deployment Workflow
    1. GitHub repo with YAML files for each app instance
    2. Make a PR changing the version field in the relevant YAML file
    3. Get approval and merge
    4. ArgoCD receives merge webhook and syncs YAML file into Kubernetes
    5. Ridecell-operator reconciles instance
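Step 2 of the workflow is a one-line edit to a YAML file, which is easy to script. The sketch below is only an illustration of that edit: the `bump_version` helper, the regular expression, and the new version string are made up, and it assumes the version is a double-quoted value on its own line as in the earlier SummonPlatform example.

```python
import re

# Matches a line such as:   version: "108912-6e2f8b7-release-2019-8"
VERSION_LINE = re.compile(r'^([ \t]*version:[ \t]*)"[^"]*"[ \t]*$', re.MULTILINE)


def bump_version(manifest_text, new_version):
    """Return the manifest with its single version field replaced."""
    updated, count = VERSION_LINE.subn(
        lambda m: f'{m.group(1)}"{new_version}"', manifest_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one version field, found {count}")
    return updated


manifest = """kind: SummonPlatform
metadata:
  name: car1-qa
  namespace: summon-qa
spec:
  version: "108912-6e2f8b7-release-2019-8"
  config:
    DEBUG: True
"""
updated = bump_version(manifest, "108950-0a1b2c3-release-2019-9")  # hypothetical version
print(updated)
```

The resulting diff is a single changed line, which keeps the pull request in step 3 trivial to review.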
  51. 51.

    As a second piece of the puzzle, we also built a custom CLI tool to handle some of the more common things our engineering team would have to do. The first two are wrappers around kubectl exec to get you a bash or Django shell respectively, and the last is a wrapper for psql to connect you to the right database. We've been slowly adding more debugging assistance tools to it over time as well.

    Ridectl

        ridectl shell qa
        ridectl pyshell qa
        ridectl dbshell qa
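To show what "a wrapper around kubectl exec" can amount to, here is a hedged sketch that only builds the command line. It is not how ridectl is implemented: the `exec_command` function, the pod and namespace names, and the `python manage.py shell` invocation are assumptions for illustration, and how an instance name like `qa` maps to a pod is left out.

```python
def exec_command(namespace, pod, kind):
    """Build the kubectl argv for an interactive shell in a pod."""
    commands = {
        "shell": ["bash"],
        "pyshell": ["python", "manage.py", "shell"],  # assumed Django entry point
    }
    if kind not in commands:
        raise ValueError(f"unknown shell kind: {kind}")
    return ["kubectl", "exec", "-it", "-n", namespace, pod, "--"] + commands[kind]


# Hypothetical namespace and pod names.
argv = exec_command("summon-qa", "car1-qa-web-0", "pyshell")
print(" ".join(argv))
```

A real tool would pass the list to something like `subprocess.run(argv)`; returning the argv instead keeps the sketch easy to test without a cluster.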
  52. 52.

    Taking a cue from Homebrew, we also made a command to both debug and help set up environment issues. It analyzes the basic environment setup and, for cases where we can automatically suggest a fix, it will.

    Doctor

        $ ridectl doctor
        ✅ Found $EDITOR Environment Variable
        ✅ Found Homebrew
        ✅ Found Homebrew Caskroom
        ✅ Found Postgresql CLI
        ✅ Found Google Cloud CLI
        ✅ Found Google Cloud CLI credentials
        ✅ Found Kubectl CLI
        ✅ Found Kubernetes Test
        ❌ Did not find Kubernetes config
        ✅ Found AWS Credentials
        ✅ Found S3 Flavors Access
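A doctor-style command is mostly a list of small checks plus consistent reporting. The sketch below imitates the output format from the slide; the helper names and the two example checks are hypothetical, and it does not attempt the fix suggestions.

```python
import os
import shutil


def check_env_var(name):
    """True if the environment variable is set."""
    return name in os.environ


def check_binary(name):
    """True if an executable with this name is on PATH."""
    return shutil.which(name) is not None


def run_doctor(checks):
    """Run each (label, check) pair and format one report line per check."""
    lines = []
    for label, check in checks:
        prefix = "✅ Found " if check() else "❌ Did not find "
        lines.append(prefix + label)
    return lines


report = run_doctor([
    ("$EDITOR Environment Variable", lambda: check_env_var("EDITOR")),
    ("Kubectl CLI", lambda: check_binary("kubectl")),
])
print("\n".join(report))
```

Keeping each check as a plain callable makes it straightforward to add new ones as new environment problems turn up.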
  53. 53.

    And a quick bonus-round suggestion: while many of the more operations-focused folks were happy with kubectl describe, we recently added a simple web view to show application status information. This is still in testing, but our hope is that it will improve the accessibility of this new platform for the less command-line-oriented members of our teams.
  54. 54.

    While our Django deployment logic is definitely tailored to our specific needs, the code is all open if you would like to mine it for ideas or anything else.

    Ridecell/ridecell-operator
    Ridecell/ridectl
  55. 55.

    And that just about wraps it up. Thank you so much for listening.

    Thank You
  56. 57.

    » Intro
    » Starting Point
    » Basic app architecture (Django, Celery, Channels)
    » Databases (Postgres, Redis, RabbitMQ)
    » Ansible, Terraform, AWS
    » Things not changing
    » Production DBs