Abstracting Nomad - Patterns for deploying to Nomad at scale

Slide 1

Slide 1 text

Abstracting Nomad - Patterns for deploying to Nomad at Scale

Hi everyone, my name is Jose Diaz-Gonzalez, and this talk is about Abstracting Nomad, or “Patterns for deploying to Nomad at Scale”.

Slide 2

Slide 2 text

Preamble - Who we are and what drove our initial Nomad usage

Before we jump into what SeatGeek has done and how we have solved our problems, I want to give a bit of background as to what our environment looks like and what drives us.

Slide 3

Slide 3 text

What is SeatGeek?

With our combination of technological prowess, user-first attitude and dashing good looks, we, SeatGeek, are simplifying and modernizing the ticketing industry. By simultaneously catering to both the consumer and enterprise markets, we’re powering a new, open entertainment industry where fans have effortless access to experiences, and teams, venues and shows have seamless access to their audiences. We think it’s time that everyone can expect more from ticketing.

SeatGeek is a high-growth live-event ticketing platform. SeatGeek is proud to partner with some of the most recognized names in sports and live entertainment across the globe including the Dallas Cowboys, Brooklyn Nets and Liverpool F.C., as well as Major League Soccer (MLS), National Football League (NFL), half of the English Premier League (EPL) and multiple theaters across NYC’s Broadway and London’s West End.

Slide 4

Slide 4 text

The Ticketing Problem

What is the problem space for SeatGeek, and who are the folks we consider our customers? Who do we work with to ensure our customers have a great experience, and what is that experience?

Slide 5

Slide 5 text

Who are our customers anyway?

Who are our customers? On the consumer side, that includes folks buying tickets to concerts, sporting events, theater, etc. If you attend a live event, we cover the process of finding the best seats available, ensuring you have all the information you need prior to attending the event, and offering the best in-venue experience possible.

Slide 6

Slide 6 text

Who do we work with?

We’ve also been massively fortunate to establish partnerships with some of the leading properties in the live events world. If you are a football fan, you might know a few of them, such as the Dallas Cowboys or Manchester City. We partner with basketball teams such as the Cleveland Cavaliers and Brooklyn Nets. We also power ticketing operations for organizations such as Brooklyn Sports & Entertainment Global and Jujamcyn. All these partnerships include all manner of back-office operations for ensuring those partners provide the best live events experience to attendees.

Slide 7

Slide 7 text

Some Background

With that context, let's talk a bit more about the technical side of the problem. Aside from the underlying languages and frameworks in use, services at SeatGeek are generally homogeneous in how they run.

All apps load configuration in roughly the same way. On the consumer side of the business - seatgeek.com - we load environment variables into a config object and read that everywhere. Our partner-facing services - powering back-office ticket operations for our partners - typically have a lot more to configure and therefore load configuration via files on disk.

Slide 8

Slide 8 text

Some Background

All applications also communicate in the same way. Application developers have the option to send messages over HTTP across our internal service mesh to other services. In rare cases, we also communicate via TCP, with engineers deciding which to communicate with by interpreting service dependency configuration.

Slide 9

Slide 9 text

Some Background

Additionally, longer running tasks - such as payment processing or email sending - occur in background processes. An engineer can choose to send a message to one of our message brokers - largely RabbitMQ - and then process that work asynchronously, either in their own service or within a separate one. Messages are sent in a standard format, and our background worker frameworks process those messages in roughly the same way, regardless of language.

Slide 10

Slide 10 text

Some Background

Our resilience team has also gone to great lengths to standardize how we ship metrics and logs for monitoring purposes. If your application makes an HTTP request to another service, we’ve enabled APM to track those requests across the stack. Application logs are shipped in a standard format where possible, and we enrich them with platform-level data so we can isolate issues to regions, instances, or any number of other dimensions. Metadata about where messages and metrics are generated is also injected for investigation during and after an incident.

This is all to say that the solution that we’re building towards, and that works for SeatGeek, makes quite a few assumptions about what engineers can and should be exposed to, how applications are built, and what constitutes a “good” application. Our approach may not work as well for organizations with different requirements across services, teams, and departments.

Slide 11

Slide 11 text

Platform Needs

With that context, I’ll talk about our Nomad adoption. In the early days of Nomad at SeatGeek, we mainly experimented with Docker-based deployments. The earliest test of Nomad was just seeing how an engineer might perform different actions on the platform. Our platform requirements were - and still are - fairly simple:

- When an engineer completes a bug fix or feature, how do they get those changes from their local machine onto our fleet of servers?
- If we want to route from one service to another, how should that be configured?
- If an engineer wants to specify an application secret - for use by payment processing, for example - how do we ensure the correct values are set on all app processes?

Slide 12

Slide 12 text

Just a CLI Away

Given that we began using Nomad years prior to the 1.0 release, many of the above workflows could be executed via the `nomad` CLI and a `sed` call.

- Deploying a particular commit sha required replacing the `image` property and submitting the job.
- Scaling an application was equally easy with a simple `sed` call.
- App configuration could be provided via `template` stanzas.
- Using Consul-backed DNS, our internal Envoy-based service mesh could route requests on behalf of services.

The pictured example roughly shows how this workflow was initially implemented, though times have changed a bit since our initial Nomad investigation.
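As a rough illustration of the `sed`-style workflow described above, here is a minimal Python sketch that rewrites the `image` and `count` properties of a job file before it is submitted. The job contents, registry URL, and function names are hypothetical and are not SeatGeek's actual tooling.

```python
import re

# Hypothetical job file; the names and registry URL are illustrative only.
JOB = """\
job "paintshop" {
  group "web" {
    count = 2

    task "server" {
      driver = "docker"

      config {
        image = "registry.example.com/paintshop:abc123"
      }
    }
  }
}
"""


def set_image(job_hcl: str, image: str) -> str:
    """Point every `image = "..."` property at a new image reference."""
    return re.sub(r'(\bimage\s*=\s*)"[^"]*"',
                  lambda m: f'{m.group(1)}"{image}"', job_hcl)


def set_count(job_hcl: str, count: int) -> str:
    """Set every `count = N` property to the requested number of instances."""
    return re.sub(r"(\bcount\s*=\s*)\d+",
                  lambda m: f"{m.group(1)}{count}", job_hcl)


# Deploy a new commit and scale to five instances in one pass.
updated = set_count(set_image(JOB, "registry.example.com/paintshop:def456"), 5)
# The result would then be written to disk and submitted with `nomad job run`.
```

Text substitution like this is blunt: it touches every matching property in the file, which is fine for a single-task job and increasingly risky as jobs grow.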

Slide 13

Slide 13 text

Just a CLI Away

Today, by configuring the correct HCL job file, these tasks are much simpler to execute for smaller Nomad installations. For many folks just dipping their toes into Nomad, you can get very far with the `nomad` CLI and HCL2’s templating support. We recommend using the Nomad CLI directly for the following use cases:

- Proofs of concept where your organization has never used Nomad before: there isn’t a great need to stand up more than a local cluster and submit jobs to see how Nomad works. While writing your own tooling might seem like a great idea, in all honesty it is a lot of work and should only be undertaken after much consideration.
- Testing usage of new Nomad functionality in an existing environment. While it is useful to prove that your functionality works end-to-end within any deployment tooling, experimenting with the CLI can be a much quicker experience and allows you to gain confidence in the functionality prior to spending a ton of time implementing some new feature-set.
- Lastly, if you are using Nomad in a home lab, you may not have a ton of time to develop custom deployment tooling for Nomad. Using the CLI directly can save a ton of time and allow you to get back to other tasks.

Slide 14

Slide 14 text

Nomad CLI v2?

While these two weren’t options for us, you may also wish to look into `levant` and `nomad-pack`.

- Levant can be considered a slightly improved method of interacting with Nomad on the CLI. It allows for a few enhancements to templating that overcome limitations in HCL2. Levant also has a few interesting ways of interacting with Nomad itself, and I think it is a good option for those with smaller clusters or home labs.
- Nomad Pack is a newer entry into the Nomad space, and can be considered something equivalent to Helm. It’s incredibly useful if you are providing a library of applications to install.

To be quite honest, neither of these is enticing for SeatGeek as they do not provide a way of enforcing rules or constraints on the deployed job artifacts. While we trust all of our engineers to do the right thing, we feel allowing blanket access to all features in Nomad is not the right approach from a security and correctness standpoint.

Slide 15

Slide 15 text

Planning for the Future

All that aside, once we had our feet wet with job specs and our most common workflows, we started looking at expanding SeatGeek’s footprint. We asked ourselves a few questions:

- What were the pain points around deployments?
- What did we feel comfortable supporting on the platform side?
- What is the best user experience we could provide?

Slide 16

Slide 16 text

Deli - Our v1 Deployment experience

One decision we made early on was to allow engineers full access to what was deployed and how it was deployed. If you were an engineer, we expected you to be able to understand the following concepts:

- How to write a Dockerfile
- The various knobs in a Nomad Job specification
- How Consul templating worked

This was a complete 180° from our previous deployment methods. Previously, engineers had limited access to the host operating system, installed OS packages, how runtime environments were built, where they were placed, and how processes were scaled. This caused friction in the deployment process, and in many cases pushed back some delivery dates due to conflicting team needs.

Our primary goal with the new platform was to make deployments as self-service as possible, removing the need for an external team to get involved in order to unblock a deployment, and to get folks shipping product faster.

Slide 17

Slide 17 text

The Interface

In practical terms, this meant an engineer's interface to deployments was an `_infrastructure` directory. This directory contained:

- All the files needed for an app to be built - one or more Dockerfiles
- A set of Nomad job files to deploy for the service
- Configuration - file or environment variable based - used across all processes within an application
- Files for configuring how applications are monitored

Slide 18

Slide 18 text

User’s Journey

The user journey for an engineer deploying to the platform was as follows:

- Engineer goes to the deployments UI we named Deli (short for Delivery)
- Engineer finds their app/environment combination
- Engineer selects a commit sha they wish to deploy
- Engineer clicks "deploy"

Under the hood, this would invoke some code that would:

1. Read the aforementioned `_infrastructure` directory
2. Inject configuration into the Nomad jobs
3. Submit each Nomad job to the Nomad cluster
4. Tail the resulting Nomad deployments for progress until completion

This sort of system works fairly well. We recommend a similar setup for Nomad usage when:

- Your users are technical and are comfortable picking up new languages, frameworks, and tools
- The underlying platform does not have many constraints and you are comfortable with users utilizing the system to its fullest extent
- Your users _want_ to configure the underlying system
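The four under-the-hood steps above can be sketched roughly as follows. This is an assumption-laden outline, not Deli's code: it assumes jobs are stored as JSON files named `*.nomad.json`, and the `submit` and `wait` callables are hypothetical stand-ins for whatever talks to the Nomad cluster.

```python
from __future__ import annotations

import copy
import json
import tempfile
from pathlib import Path
from typing import Callable


def read_jobs(infra_dir: Path) -> list[dict]:
    """Step 1: load every job definition found in the directory."""
    return [json.loads(p.read_text()) for p in sorted(infra_dir.glob("*.nomad.json"))]


def inject_config(job: dict, image: str, env: dict[str, str]) -> dict:
    """Step 2: return a copy of the job with image and environment set on each task."""
    job = copy.deepcopy(job)
    for group in job.get("TaskGroups", []):
        for task in group.get("Tasks", []):
            task.setdefault("Config", {})["image"] = image
            task.setdefault("Env", {}).update(env)
    return job


def deploy(infra_dir: Path, image: str, env: dict[str, str],
           submit: Callable[[dict], str], wait: Callable[[str], str]) -> dict[str, str]:
    """Steps 3 and 4: submit each job, then block until its deployment finishes."""
    results = {}
    for job in read_jobs(infra_dir):
        deployment_id = submit(inject_config(job, image, env))
        results[job["ID"]] = wait(deployment_id)
    return results


def demo() -> tuple[dict[str, str], list[dict]]:
    """Run the pipeline against a throwaway directory with fake cluster calls."""
    job = {"ID": "paintshop-web", "TaskGroups": [{"Tasks": [{"Name": "server"}]}]}
    submitted: list[dict] = []

    def fake_submit(prepared: dict) -> str:
        submitted.append(prepared)
        return f"deployment-{len(submitted)}"

    with tempfile.TemporaryDirectory() as tmp:
        infra = Path(tmp) / "_infrastructure"
        infra.mkdir()
        (infra / "web.nomad.json").write_text(json.dumps(job))
        results = deploy(infra, "paintshop:def456", {"APP_ENV": "production"},
                         fake_submit, lambda _deployment_id: "successful")
    return results, submitted
```

Passing `submit` and `wait` in as callables keeps the sketch testable without a cluster; a real implementation would replace them with calls to the Nomad API.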

Slide 19

Slide 19 text

Job Processing

Initially, the "Inject configuration into the Nomad jobs" step involved the following:

- Injecting environment variables
- Configuring monitoring sidecars
- Vendoring specified template files

This was a lovely, readable, 20-line for loop. There were a few functions to cover the above functionality, but overall it was easy to contribute to, debug, and maintain.

Slide 20

Slide 20 text

Job Processing, but more of it

This lovely piece of code eventually grew to a large number of modules spanning thousands of lines of code, much of which was now in a 1k+ line for-loop (with several nested inner loops). The logic that modified jobs covered everything from:

- autoscaling rules
- monitoring configuration
- constraining specified options to ones the platform supported, etc.
- validating that properties followed naming conventions

It became something platform engineers loathed working with, making it difficult or impossible to support new functionality.

Slide 21

Slide 21 text

Middleware

The next iteration of our deployment middleware took inspiration from the Kubernetes world in the form of Admission Controllers. In our system, we have a service we call NAC (which is short for Nomad Admission Controller) that acts as a semi-transparent proxy to Nomad. Controllers within NAC handle specific logic such as:

- A controller that disables `raw_exec` jobs
- A controller that injects valid datacenters
- A controller that enforces versioned app secret templates
- A controller that forces app-specific Vault policies

The nice thing about NAC is that each controller is isolated code-wise, making it easier for platform engineers to contribute to specific parts of the codebase relevant to them. It also allows us to enable or disable controllers as necessary, making it easy for us to phase in functionality by department or phase out functionality by job type.

We recommend taking this path if:

- You have a highly technical team maintaining your deployment experience
- For Reasons™, jobs must follow a set of rules
- Rules that you apply to Nomad jobs change on a per-Nomad-cluster basis
- Your engineers do not mind that the jobs they write are not quite the jobs that get submitted

For folks that have a standard set of rules that never change, we recommend using a lint-based system instead that validates jobs in CI, and gates jobs from submission during deployment if the submitted jobs do not pass lint rules.
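To make the controller idea concrete, here is a minimal Python sketch of a chain of isolated controllers, each of which either mutates a job or rejects it. This is only an illustration of the pattern under my own assumptions; the function names, the `Rejected` exception, and the `dc1` datacenter are hypothetical and do not describe NAC's real implementation.

```python
from typing import Callable, Iterable

Job = dict
Controller = Callable[[Job], Job]


class Rejected(Exception):
    """Raised by a controller to refuse a job outright."""


def deny_raw_exec(job: Job) -> Job:
    """Validating controller: refuse any job containing a raw_exec task."""
    for group in job.get("TaskGroups", []):
        for task in group.get("Tasks", []):
            if task.get("Driver") == "raw_exec":
                raise Rejected(f"task {task.get('Name')!r} uses the raw_exec driver")
    return job


def inject_datacenters(datacenters: Iterable[str]) -> Controller:
    """Mutating controller factory: overwrite the job's datacenters."""
    allowed = list(datacenters)

    def controller(job: Job) -> Job:
        return {**job, "Datacenters": list(allowed)}

    return controller


def admit(job: Job, controllers: Iterable[Controller]) -> Job:
    """Run a job through each enabled controller in order."""
    for controller in controllers:
        job = controller(job)
    return job


# Enabling or disabling a controller is just a matter of editing this list.
CONTROLLERS = [deny_raw_exec, inject_datacenters(["dc1"])]
```

Because each controller is a plain function with the same signature, adding a rule means adding one function rather than another branch inside a shared loop.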

Slide 22

Slide 22 text

Rollout - Our v2 Deployment experience

The above changes still describe our initial Deli-based deployment system. In 2021, we took a closer look at how engineers used our platform to deploy.

Slide 23

Slide 23 text

Complaints

They had the following to say:

- I don't understand all the options in the Nomad job specification, so I usually copy-paste another job.
- The HCL I configure and the job that gets deployed aren't ever the same, and I don't know how to tell where that config comes from.
- I hate needing to copy-paste the same config to every job, and frequently forget one or more jobs.

Slide 24

Slide 24 text

Refreshed Approach

At this point, we took a step back and looked at various ways of unifying the config we exposed to engineers. This culminated in our newer Rollout deployment system, where we emphasize convention over configuration and prefer API-based interaction to the current HCL system.

What this means is that instead of having an 80+ line HCL file that defines how to run an HTTP process, an engineer can write 5 lines to specify that same process.

This method of abstracting scheduler configuration into a simpler format is fairly common across schedulers, whether they be ECS, Kubernetes or Nomad. The unfortunate fact about this pattern is that because it is tightly tied to how applications are run and scaled in a given organization, it isn’t straightforward to share a single configuration format publicly. While there is a ton of prior art in what _can_ be configured, I would recommend starting small and slowly expanding format scope as you find you need features.
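As a sketch of what "convention over configuration" can look like in practice, the following Python expands a few lines of per-process settings into a fuller Nomad-style job structure by filling in platform defaults. The short format, the default values, and the `expand` function are all hypothetical; the real Rollout format is not shown in this talk.

```python
from __future__ import annotations

# Hypothetical platform conventions applied when a process does not override them.
DEFAULTS = {"count": 1, "cpu": 100, "memory": 256}


def expand(app: str, image: str, processes: dict[str, dict]) -> list[dict]:
    """Turn a short per-process spec into one job per process.

    Each process must provide a "command"; everything else falls back to DEFAULTS.
    """
    jobs = []
    for name, overrides in sorted(processes.items()):
        spec = {**DEFAULTS, **overrides}
        jobs.append({
            "ID": f"{app}-{name}",
            "Type": "service",
            "TaskGroups": [{
                "Name": name,
                "Count": spec["count"],
                "Tasks": [{
                    "Name": name,
                    "Driver": "docker",
                    "Config": {"image": image, "command": spec["command"]},
                    "Resources": {"CPU": spec["cpu"], "MemoryMB": spec["memory"]},
                }],
            }],
        })
    return jobs


# The engineer-facing input: a handful of lines instead of a full job file.
jobs = expand("paintshop", "paintshop:def456",
              {"web": {"command": "bin/server", "count": 3}})
```

Starting with a tiny set of overridable keys, as here, fits the advice above to expand the format only as features are needed.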

Slide 25

Slide 25 text

Exposed as APIs

We've also standardized and simplified how external systems integrate with deployments, resulting in many fewer commits to get something deployed.

Previously, we mentioned that we embrace API interactions over HCL config, but now it seems we've introduced a new configuration format! The previous YAML was actually the YAML version of our JSON API. When an engineer deploys, the config is consumed by the deployment service, which then processes it and keeps track of deployment artifacts, responses from APIs, mutations of resources, etc.

We only recommend this last path when all of the following are true:

- Nomad jobs are extremely similar, resulting in a fairly homogeneous set of deployed artifacts
- Your team has the capacity to work on internal deployment tooling
- The benefits of exposing HCL to engineering teams do not outweigh the disadvantages
- Your team is okay with yet another file format

Slide 26

Slide 26 text

Secret Management - Evolution of how Secrets Management works

With the previous context, I’ll give more specific examples of how we’ve evolved approaches to various processes. One of the more interesting topics is secret management. All applications require access to secrets, but how exactly should they be exposed, and what happens when you prefer one method over another?

Slide 27

Slide 27 text

Secrets: Prototype

Secret management at SeatGeek was interesting. With our initial Nomad setup, we essentially had `template` stanzas that did something like the following.

While this might work well for a playground application, it has a few sharp edges:

- Requires specifying every key for an app
- If there are multiple jobs, the `template` stanza needs to be updated everywhere
- If any one of these keys is missing, the deployment will fail as the job cannot start
- If specified as a template source file, that file must not be accidentally omitted from any job

Slide 28

Slide 28 text

Secrets: Deli

Our first iteration for using secrets in a production environment resulted in a file we call `env.hcl`. The `env.hcl` file is located in our `_infrastructure` folder, and was meant to be editable by engineers. This format allowed us access to the following:

- Vault KV v1 secrets
- Datastore Credentials via Vault Mounts
- Consul KV access

Slide 29

Slide 29 text

Secrets: Deli

To add a secret `LOLLIPOP` to the `paintshop` app, an engineer would perform the following:

- Add a secret to Vault at the path `secrets/paintshop/LOLLIPOP`
- Add the specified stanza to their app's `env.hcl`
- Deploy the app

Under the hood, Deli deploy tooling would:

- Read/parse the `env.hcl` into a consul template-friendly format
- Inject the previous `template` stanza into each job, iterating over all keys as appropriate

This worked fairly well and we've only recently migrated away from it for a few reasons:

- KV v1 is not conducive to recovery if a secret is changed to a bad value
- Changing a secret would result in allocations all restarting at roughly the same time
- In our pre-production environment, different values for a secret cannot be used across different review apps
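The "parse, then inject a `template` into each job" step can be sketched roughly as below. This is an illustration under assumptions: the list of keys stands in for an already-parsed `env.hcl`, the `.Data.value` field name and the `secrets/app.env` destination are guesses made for the example, and none of this is Deli's actual code.

```python
from __future__ import annotations

import copy


def render_env_template(app: str, keys: list[str]) -> str:
    """Build a consul-template body that emits one KEY=value line per secret.

    Assumes each secret lives at secrets/<app>/<KEY> with its payload stored
    under a field named "value" (a hypothetical convention).
    """
    lines = []
    for key in sorted(keys):
        path = f"secrets/{app}/{key}"
        lines.append('{{ with secret "' + path + '" }}' + key
                     + "={{ .Data.value }}{{ end }}")
    return "\n".join(lines) + "\n"


def inject_template(job: dict, template: str) -> dict:
    """Attach the rendered template to every task so it is exported as env vars."""
    job = copy.deepcopy(job)
    for group in job.get("TaskGroups", []):
        for task in group.get("Tasks", []):
            task.setdefault("Templates", []).append({
                "EmbeddedTmpl": template,
                "DestPath": "secrets/app.env",  # hypothetical destination
                "Envvars": True,
            })
    return job


template = render_env_template("paintshop", ["LOLLIPOP"])
job = inject_template({"ID": "paintshop-web",
                       "TaskGroups": [{"Tasks": [{"Name": "server"}]}]}, template)
```

Generating the template centrally is what removes the "update every job by hand" sharp edge from the prototype.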

Slide 30

Slide 30 text

Secrets: Rollout

With our Rollout system, an engineer adding a secret `LOLLIPOP` to the `paintshop` app would do the following:

- Create a new version of the `json` dictionary stored at `app/paintshop/env` with the key `LOLLIPOP` and their desired value.
- Trigger a deploy.

Under the hood, _all_ app secrets are now stored in that JSON dictionary, allowing an engineer to quickly scan what is available. Secrets are now versioned, and deploying a secret version is an explicit action an engineer needs to take. This is what that template looks like.

A benefit of this system is that we no longer need to parse HCL, which removes a ton of code from our deployment system.

One nice side-effect of this is that our review app environments can now use env-specific app secrets. If an engineer deploys a review app to the `candycane` namespace, the deployment tooling will automatically create `app/paintshop/env.candycane` based on the existing secrets stored in `app/paintshop/env`. We're even exploring process-specific secrets, for cases where we might want to expose something like a payments token to one process within a given app and not all of them.
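The path convention and the review-app seeding described above can be modelled in a few lines of Python. An in-memory dictionary stands in for the secret store here, and the function names are hypothetical; this is a sketch of the idea, not the Rollout implementation.

```python
from __future__ import annotations

# Stand-in for the secret store: path -> list of versions (each a dict of secrets).
Store = dict


def secret_path(app: str, namespace: str | None = None) -> str:
    """app/<app>/env for the base environment, app/<app>/env.<namespace> for review apps."""
    base = f"app/{app}/env"
    return f"{base}.{namespace}" if namespace else base


def put_secret(store: Store, path: str, key: str, value: str) -> int:
    """Write a new version of the dictionary at `path`; return its 1-based version number."""
    versions = store.setdefault(path, [])
    latest = versions[-1] if versions else {}
    versions.append({**latest, key: value})
    return len(versions)


def ensure_review_env(store: Store, app: str, namespace: str) -> str:
    """Seed a review app's secrets from the latest base version if none exist yet."""
    path = secret_path(app, namespace)
    if path not in store:
        store[path] = [dict(store[secret_path(app)][-1])]
    return path
```

Because every write appends a version rather than overwriting, rolling back a bad value is a matter of deploying an earlier version number.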

Slide 31

Slide 31 text

Observability - Evolution of our integrations with external tools

Our application monitoring stack has gone through a few iterations. Initially, we had folks copy-paste the following into their apps:

- A Filebeat sidecar for shipping logs
- A Telegraf sidecar for shipping statsd metrics

This worked... well, it didn't work for very long. The application monitoring space has progressed fairly quickly, both in SeatGeek and external to us. Requiring that an engineer add or update a config to _each_ job in an app - with some apps having a hundred or more jobs - gets tiring really quickly, especially as there isn't a way to programmatically modify HCL and write it out.

Slide 32

Slide 32 text

Observability: v1

Our next generation of monitoring configuration was template based. An engineer might add any of the files above to their repository. Deli would inject sidecar tasks into Nomad Jobs that were configured by the respective configuration files. Deli would even use those files as consul template files, allowing folks access to Consul/Vault for fetching config.

Early on in our Deli-based deployment system, we would detect the existence of these files and automatically inject the correct sidecars. This worked fairly well, except that:

- An engineer might forget one of these files and now they are missing logs
- We still wanted uniformity in configuration over time

We recommend shipping configuration with the repository like this when:

- You use sidecars
- Your deployment tooling injects sidecars _or_ engineers write them in manually
- Your monitoring configuration is highly bespoke for every application

Slide 33

Slide 33 text

Observability: v2

Rather than letting this be open-ended, our deployment tooling now injects all of this information. Deli switched to a method where engineers could specify Dockerfile labels to control configuration app-wide, and override settings via Nomad Job Docker labels when they needed some job-specific knob.

For example, while we support JSON logging by default, an engineer can add the following to their Dockerfile to specify a format other than JSON. Downstream, the DataDog agent is configured to consume this label and ensure the logs shipped by this service are properly parsed by DataDog.

Slide 34

Slide 34 text

Observability: v2

Similarly, if we want to override the config for a single job in Deli, we can specify the respective label in that job. The nice thing about this sort of setup is that if a given process within an application uses a config slightly different than the defaults but shares the same Docker image, we can still “do the right thing” for that process.

There are a few downsides to label-based config:

- Labels may need to be duplicated if an app has multiple Dockerfiles with similar config
- Adding support for new configuration properties means expanding the set of labels in use

While the former is helped by better aligning application code with platform conventions, the latter is harder to do in a flexible manner without a custom label parsing mechanism. If anyone is interested in such a thing, I recommend looking into the `caddy-docker-proxy` project by Lucas Lorentz on GitHub.

https://github.com/lucaslorentz/caddy-docker-proxy
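The layering of platform defaults, Dockerfile labels, and per-job labels can be sketched as a simple merge where later sources win. The label prefix `com.example.platform.` and the `logs.format` key are hypothetical names chosen for this example, not the labels SeatGeek uses.

```python
from __future__ import annotations

PREFIX = "com.example.platform."       # hypothetical label namespace
DEFAULTS = {"logs.format": "json"}     # platform default: JSON logging


def effective_config(image_labels: dict[str, str],
                     job_labels: dict[str, str]) -> dict[str, str]:
    """Merge defaults, app-wide image labels, then job-specific labels."""
    config = dict(DEFAULTS)
    for labels in (image_labels, job_labels):  # later sources override earlier ones
        for name, value in labels.items():
            if name.startswith(PREFIX):
                config[name[len(PREFIX):]] = value
    return config


# App-wide setting from the Dockerfile, overridden for one job.
image_labels = {"com.example.platform.logs.format": "logfmt", "maintainer": "platform"}
job_labels = {"com.example.platform.logs.format": "nginx"}
```

Ignoring labels outside the prefix keeps unrelated image metadata, such as `maintainer` above, from leaking into platform configuration.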

Slide 35

Slide 35 text

Convention - Observability: Rollout

In the Rollout world, we make a lot of assumptions about "the best" monitoring setup, and as such have removed these knobs completely. If you deploy an app and specify a log format in the `Dockerfile`, that applies to all processes using that `Dockerfile`; otherwise you get JSON. Similarly, we make assumptions about how metrics are collected, and inject the correct config to get APM properly shipped.

One nice property of this is that we can have a unified answer as to why a system works a certain way, or fix bugs wholesale across the platform. This has cut down on support costs significantly, albeit with some pain as we migrate systems or configurations.

We really only recommend something as formalized as Rollout when most/all of your applications act the same. If your developers use the same logging and metrics libraries configured in the same way, using a convention can be quite powerful. However, if you’re in a heterogeneous environment with many languages or many frameworks, this approach will be difficult to apply. At SeatGeek, this has been a bit of a struggle over the years, as we’ve run as many languages as there were developers early on, and many older applications are harder to modify for various reasons.

The other case where this might not be recommended is if you deploy third-party applications where some of these knobs are not configurable. In our case, we run a few applications where we had to build custom metric proxies to expose metrics in a consumable format. Similarly, many third-party systems might log in JSON but decide on “interesting” ways to set log levels in their output, necessitating remapping to get logs to show up the right way.

Slide 36

Slide 36 text

Networking - Evolution of cross-service communication

The last subtopic I have is Networking. SeatGeek embraced the concept of “microservices” very early on - they were largely an attempt to isolate older frameworks/languages from newer codebases and development patterns. This happened prior to our adoption of containerization and Nomad, by which point we had around 40 applications with 100-200 “services” running. We’re now quite a bit larger on both fronts, and as such have had to change our networking model.

Slide 37

Slide 37 text

Networking: Origins

Services at SeatGeek largely communicate over HTTP - with a few services over TCP - so for a long time, an engineer would connect to localhost on a "well-defined" port for a given service. These ports were mapped by a load balancer which then sent out requests to the upstream service. Adding a new service required that the platform engineer find an unused port range, assign it at the load balancer level, and document the port for developers to use in their code.

We’ve used environment variables for pointing to services since the day we deployed our second service in 2011, which is one thing that hasn’t changed across the years. This was useful for local development - a developer could run the service locally for testing or point at some pre-production version if necessary. It was also great for testing changes to our networking layer, as it was easy to point the environment variable at the new backend and roll the change back when necessary.

The port-mapping system worked well when we had a dozen services, but less so once we started scaling up on Nomad.

Slide 38

Slide 38 text

Networking: Deli

In our first iteration of service communication on Nomad, we piggybacked off of our `env.hcl` file format. To set up a connection, an engineer would specify a `service` stanza and connect to the correct service. Under the hood, we would inject a consul template that would point to the registered Consul service. We still relied on an Envoy-based load balancing system for connecting to services.

This setup works well for folks who have a load balancer on every server, proxying requests upstream. You'll want to be a bit more careful about overloading Consul at higher loads. In a few cases, we've easily taken Consul down by populating the load balancer entries straight from the Consul cluster nodes. If using this method, we recommend doing so either through something like Envoy with a custom discovery backend, or through a system that uses the local Consul agent service catalog instead of the Consul cluster nodes.

Slide 39

Slide 39 text

Networking: Rollout

Our current iteration is fairly similar. Developers can define services in a `rollout.yml` configuration, mirroring their previous workflow in Deli. The difference now is that instead of just injecting a consul template file that iterated over every service, we now also inject a sidecar service for each upstream defined. This allows us to more securely communicate across services.

We recommend looking down this route if you are comfortable with Nomad's current Consul Connect usage limitations. Specifically, many options cannot be set on the sidecar service within the Nomad jobs themselves, and thus require tooling to create and push said config via API.

For smaller installations, it is possible to hardcode each of the sidecar services within your Nomad jobs. For larger Nomad installations with hundreds/thousands of jobs, it may be desirable to come up with a similar config to `env.hcl`/`rollout.yml` for specifying services, injecting the requisite config prior to deployments.
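As one possible shape for "injecting the requisite config prior to deployments", the sketch below turns a list of declared upstream services into Consul Connect style upstream entries plus the environment variables an application would read. The port numbering, the `_URL` naming convention, and the function name are assumptions made for the example and do not reflect the actual `rollout.yml` handling.

```python
from __future__ import annotations


def connect_upstreams(services: list[str],
                      first_port: int = 9000) -> tuple[list[dict], dict[str, str]]:
    """Assign each declared upstream a local port and a matching env variable.

    Returns the upstream entries for a sidecar proxy and the env vars that
    point application code at the local listener for each service.
    """
    upstreams: list[dict] = []
    env: dict[str, str] = {}
    for offset, name in enumerate(sorted(services)):
        port = first_port + offset
        upstreams.append({"DestinationName": name, "LocalBindPort": port})
        env[name.upper().replace("-", "_") + "_URL"] = f"http://127.0.0.1:{port}"
    return upstreams, env


upstreams, env = connect_upstreams(["tickets", "auth-api"])
```

Keeping the environment-variable indirection means application code still only knows "read a URL from the environment", just as it did in the port-mapping days.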

Slide 40

Slide 40 text

Thank You!

Thanks to everyone for your patience and time. I hope everyone has a better understanding of how the SeatGeek platform utilizes the HashiCorp stack to provide a great experience for our engineering team in service of a great experience for live events.

Questions? If you want to know more about any of this, please reach out:

● Jose Diaz-Gonzalez @savant /josediazgonzalez

The End

And that's all I got. For anyone still here, my Twitter handle is savant. I hope I’ve taught at least one person something, but if not, that's okay too. Thanks everyone!