
Dagster & Geomagical (with notes)

Noah Kantrowitz

February 09, 2021

Transcript

  1. Thank you Sashank. I'm Noah, a principal site reliability engineer for Geomagical Labs, and today I'm going to cover our approach to Dagster.
     Noah Kantrowitz > @kantrn - coderanger.net > Principal Ops @ Geomagical > Part of the IKEA family > Augmented reality with furniture
  2. Our use case isn't quite the normal one for Dagster, so here's a quick overview of our main product. A user takes a bunch of images of their room and uploads them to our cloud systems, we run a bunch of Fancy Math to build a 3D view of the room, and then the user can view that in a browser and drag in virtual furniture to see how it would look before buying. The key points here: this is directly customer facing, so it needs to run quickly and there is little margin for errors or retries; every second counts when a customer is waiting on us. The Fancy Math is also very fancy, with processing steps that vary from tiny Python scripts to C++ tools that use tens of gigabytes of RAM, and many steps require access to a GPU.
     Our Product
  3. Before we dive into our Dagster setup, let's look at what we had before. For our main product we use Celery with RabbitMQ. Each solid, to use the Dagster term, runs as its own Deployment in Kubernetes, and we have a custom tool to compile from a simple DAG representation to the Canvas orchestration system that comes with Celery.
     Starting Point > Celery & RabbitMQ > Each operation as its own daemon > celery.canvas > Custom DAG compiler
  4. We knew we wanted to keep most of the structure of our existing solids in place, both because updating them would be costly and because that part of the system was going well in terms of workflow. Our DAG compiler was very limited and Canvas didn't offer much room for improvement, so that had to go. Our workloads are very bursty, so anything that allows shutting down capacity when not in use helps keep costs down. And if we could get better run tracking with more detailed information, that's less manual instrumentation we would have to write.
     Design Goals > Keeping most of the solid structure > Improved DAG expressiveness > Low fixed overhead, compatible with autoscaling > More detailed tracking and metrics
  5. Obviously we found and liked Dagster or you wouldn't be hearing from me today, but in short it ticked all the boxes I was looking for, and the remaining issues were ones we could help improve. The biggest of those was getting a scalable, stable execution layer in place: because we needed high potential concurrency and resiliency to failure, the default launcher and executor combos weren't going to cut it.
     Dagster > Met all our requirements for structural simplicity > DAG compiler was a bit limited but growing fast > Highly responsive team > No execution setup that met our needs
  6. We did look at dagster_celery, since we knew we wanted Celery under the hood no matter what. Unfortunately, its workflow more or less requires that the code for all solids live alongside the pipelines, or at least be runnable together. This wouldn't work for us because of our widely varying hardware requirements for different solids.
     But dagster_celery? > Solid and pipeline code commingled > Single runtime environment > Hard to build a workflow around at scale
  7. There is also the dagster_k8s helper, which launches Kubernetes Jobs on the fly. While this can be okay for infrequently executed jobs where a failure or delay does not have customer-facing consequences, I really, really suggest people keep kube-apiserver out of the hot path of their products. It's not built for that kind of thing.
     But dagster_k8s? > Fine for infrequent or non-customer-facing tasks > Do not put kube-apiserver in your hot path > No really, I mean it
  8. I'll save you all some back and forth and cut to what we came up with. Dagit runs as a Kubernetes Deployment, with a custom launcher plugin to serialize new run requests to JSON and send them to a Celery queue. The workspace-level queue worker receives those and executes the run. Most of the solids are proxies using a special decorator to run a remote Celery task on a specific queue and wait for a response.
     Overall Architecture
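     To make that hand-off concrete, here is a minimal sketch of what the launcher side could look like. The enqueue_run helper, the dagster_worker.execute_run task name, the broker URL, and the queue naming scheme are all illustrative assumptions, not our actual launcher code.

     import json
     from celery import Celery

     celery_app = Celery(broker='amqp://rabbitmq:5672//')

     def enqueue_run(run_id: str, pipeline_name: str, run_config: dict) -> None:
         # Serialize the run request to JSON and push it onto a Celery queue,
         # where the workspace-level queue worker picks it up and runs it.
         payload = json.dumps({
             'run_id': run_id,
             'pipeline_name': pipeline_name,
             'run_config': run_config,
         })
         celery_app.send_task(
             'dagster_worker.execute_run',      # hypothetical task name
             args=[payload],
             queue=f'dagster-{pipeline_name}',  # assumed queue naming
         )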
  9. One of our big goals was being able to scale the system down, and Dagster delivers that nicely. The minimum system state is just Dagit and one gRPC daemon for each workspace, all of which use very few resources. Because everything is decoupled via RabbitMQ, we use KEDA to watch queue depth and drive autoscaling; KEDA 2.1 includes several new features that make this use case simpler. Two important things on the Celery side: we need to ensure that active runs and solids stay in the queues so they will be visible to KEDA, so we use acks_late and reduce or disable prefetching.
     Autoscaling > KEDA watching RabbitMQ > Zero-scale: only Dagit and gRPC daemons > task_acks_late = True > worker_prefetch_multiplier = 1
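     For reference, a minimal sketch of those two Celery settings; the app name and broker URL are placeholders:

     from celery import Celery

     app = Celery('pipelines', broker='amqp://rabbitmq:5672//')
     # Ack only after a task finishes, so in-flight work stays in the queue
     # and remains visible to KEDA's queue-depth scaler.
     app.conf.task_acks_late = True
     # Don't prefetch extra messages; each worker holds at most one unacked task.
     app.conf.worker_prefetch_multiplier = 1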
  10. Remote Solids are probably our biggest deviation from "normal" Dagster usage. While most users keep all their Solids in the same codebase as the pipelines, we run ours as separate daemons. This means we can tailor the execution environment for each as needed: CPU and GPU hardware, which Docker image to use, sometimes even which operating system to run on. This also means the teams developing each Solid can be somewhat isolated from their usage in pipelines. Not completely, since use cases do matter, but they can release on their own schedule and let the pipelines pick which versions to use.
      Remote Solids > Independent release cycles for each Solid > Can run multiple versions in parallel > Testing in isolation
  11. This is a simple case of a remote solid. We have some helper code to streamline things, but underneath it's still a normal Celery task that can be called like any other. Each of these lives in its own project, but multiple tasks can be exposed from one remote Solid if the situation calls for it.
      Writing A Remote Solid

      # SolidCelery is our helper wrapper around a regular Celery app
      app = SolidCelery('repo-something')

      @app.task(bind=True)
      def something(self, foo: str) -> str:
          return f'Hello {foo}'
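      Since it really is just a Celery task, any Celery client can invoke it by name. A hedged sketch of that, not our actual helper code; the registered task name, broker, and result backend are assumptions:

      from celery import Celery

      caller = Celery(broker='amqp://rabbitmq:5672//', backend='rpc://')

      # Send the task to its dedicated queue and wait for the reply. This is
      # roughly what the proxy-solid helper does on our behalf.
      result = caller.send_task(
          'something',             # name as registered by the remote worker
          kwargs={'foo': 'world'},
          queue='repo-something',
      )
      print(result.get(timeout=60))   # -> 'Hello world'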
  12. We do need to expose these remote Solids to Dagster's view of the world, so we have a decorator that extends @solid to create "proxy Solids". This is a short example of one. We tell it which queue to use, yield whatever arguments we want to send to the task, receive its output, and then usually do something with it and yield any needed Dagster events, just like in normal solids. These can then be used in a Dagster pipeline like any other solid.
      Proxy Solids

      @celery_solid(queue='repo-something')
      def something(context, item):
          # Arguments yielded here are sent to the remote task; its return
          # value comes back as `output`.
          output = yield {
              'foo': item['bar'],
          }
          item['something'] = output
          yield Output(item)
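      To illustrate that last point, a proxy solid composes like any local solid. A sketch using the pre-1.0 @solid/@pipeline API that was current at the time, with a hypothetical upstream solid:

      from dagster import pipeline, solid

      @solid
      def load_item(context):
          # Hypothetical upstream solid producing the work item dict.
          return {'bar': 'value'}

      @pipeline
      def example_pipeline():
          # The proxy solid from above slots in like any other solid.
          something(load_item())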
  13. Our workflow is fairly light after that. Each overall project has a git repository for its pipelines, with one being the "default Pipeline", which is used when a specific pipeline is not requested. New pipelines are created as needed, and when one is ready to be promoted to default, we can rename it into place. We use the Dagit GraphQL API to integrate both pipeline discovery and execution with our backend web applications. As we roll this workflow out to more internal projects, the goal is for the teams who currently own our JSON-based DAG definitions to take over writing Dagster pipelines instead.
      Workflow > One git repo per Dagster repo > main.py which holds the "default" Pipeline > solids.py which defines proxy Solids > Misc other pipelines for testing and development
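      As an example of that GraphQL integration, a hedged sketch of pipeline discovery from a backend service; the endpoint address is a placeholder and the exact fields depend on your Dagit version's schema:

      import requests

      DAGIT_URL = 'http://dagit:3000/graphql'  # placeholder in-cluster address

      LIST_PIPELINES = '''
      query PipelinesForRepo($name: String!, $location: String!) {
        repositoryOrError(repositorySelector: {
            repositoryName: $name, repositoryLocationName: $location}) {
          ... on Repository {
            pipelines { name }
          }
        }
      }
      '''

      def list_pipelines(repo: str, location: str) -> list:
          # Ask Dagit which pipelines a repository currently exposes.
          resp = requests.post(DAGIT_URL, json={
              'query': LIST_PIPELINES,
              'variables': {'name': repo, 'location': location},
          })
          resp.raise_for_status()
          repository = resp.json()['data']['repositoryOrError']
          return [p['name'] for p in repository.get('pipelines', [])]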
  14. And just quickly, we have all of this wrapped in a continuous deployment system via Buildkite. Any commit merged to the main branch will get a container image built, edited into the Kustomize configuration, pushed up using a machine user, and then deployed by ArgoCD. Remote Solids have a little more complexity via a custom Kubernetes operator, but the pipeline repositories are just that.
      CI/CD (briefly, since this is its own rabbit hole) > Buildkite > kustomize edit set image > ArgoCD
  15. This whole setup isn't without issues. The biggest downside is the time to cold start a pipeline that hasn't been used lately; for one pipeline in particular it can take ten to twenty minutes for things to actually get underway. Because most of the work happens in the remote Solids, we don't yet have a way to show progress or status information during a long step. This can make debugging harder, as we quite often have to correlate between the Dagster logs and the low-level Kubernetes logs. And finally, as with any off-the-beaten-path adventure, we have hit weird edge cases, mostly from Celery not liking being connected as both a consumer and a producer at the same time.
      Downsides > Slow cold start > No feedback during long tasks > New and exciting bugs
  16. We haven't moved our main product over yet, as we've been using smaller projects to develop and battle-test this execution system and workflow, but so far things are looking very positive. We've been able to throw hundreds of simultaneous runs into the system and have them come out the other side. The main bug blocking us from moving forward on the rollout is a set of unexplained task failures, likely related to RabbitMQ timeouts, but more debugging output is getting us back on track.
      How It's Going > Happy with overall progress > Still dropping some tasks at load > Plan to move forward looks good
  17. There's still a lot of room to improve this setup. The biggest item would be async execution support for DAGs. Right now the execution process is synchronous, so we run 15 threads per DAG worker to reduce memory overhead; however, this has revealed a few thread-safety issues over time, since nothing in the default Dagster setup uses threads. A fully async executor would allow better timesharing, since the vast majority of the time our pipelines are just blocked waiting on a remote Solid to return. We would also like to fill those incremental feedback gaps, and add webhooks to notify our backend systems in case of a failure so we can act on it.
      Future Plans > Async execution support > Events from solid workers > Pipeline-level webhooks > Predictive auto-scaling? K8s Operator?
  18. Hopefully by now I've convinced some of you that this is a good way to run Dagster. While our solids and pipelines themselves are not public, the core helper library is. But it's very undocumented and I'm still frequently making major changes, so think of it as a place to borrow code and ideas from rather than a library to use directly.
      Can I Use This? > Kinda sorta > geomagical/dagster_geomagical