Andrew Godwin / @[email protected]

Hi, I’m Andrew Godwin
• Principal Engineer at Astronomer.io (Airflow)
• Django Migrations, Channels, Async, others
• Building distributed systems since 2008

There are only two good* ways to build distributed systems**:
● Event-Driven Via Queues: Totally separate stores of truth that only talk via queues
● Reconciliation Loops: Stateless components that talk to one store of truth

● Due to size: What companies often claim is the reason
● Due to team structure: What the actual reason often is
● Due to federation: It's what the internet was built on!

● At-most-once: In case of failure, a message will not get delivered
● At-least-once: In case of failure, a message will get delivered twice or more

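The difference between the two guarantees comes down to when the consumer acknowledges a message relative to doing the work. The sketch below is a toy model, not any real broker's API: `FlakyHandler`, `consume_at_most_once`, `consume_at_least_once`, and `dedupe` are hypothetical names, and an in-memory `deque` stands in for the queue.

```python
from collections import deque


class FlakyHandler:
    """Hypothetical handler that crashes exactly once, on a chosen message."""

    def __init__(self, fail_on, crash_after_work):
        self.fail_on = fail_on
        self.crash_after_work = crash_after_work
        self.crashed = False
        self.delivered = []  # the side effects that actually happened

    def __call__(self, msg):
        crash_now = msg == self.fail_on and not self.crashed
        if crash_now and not self.crash_after_work:
            self.crashed = True
            raise RuntimeError("crashed before doing the work")
        self.delivered.append(msg)
        if crash_now:
            self.crashed = True
            raise RuntimeError("crashed after the work, before the ack")


def consume_at_most_once(messages, handler):
    """Ack (remove) first, then process: a crash loses the message."""
    queue = deque(messages)
    while queue:
        msg = queue.popleft()  # ack before processing
        try:
            handler(msg)
        except RuntimeError:
            pass  # already acked, so there is nothing left to retry
    return handler.delivered


def consume_at_least_once(messages, handler):
    """Process first, then ack: a crash causes redelivery."""
    queue = deque(messages)
    while queue:
        msg = queue[0]  # peek without acking
        try:
            handler(msg)
        except RuntimeError:
            continue  # still at the head of the queue, so it is redelivered
        queue.popleft()  # ack only after the handler succeeded
    return handler.delivered


def dedupe(delivered):
    """Drop repeats while keeping first-seen order (an idempotent receiver)."""
    return list(dict.fromkeys(delivered))
```

With messages `["a", "b", "c"]` and a crash on `"b"`, the at-most-once consumer ends up with `["a", "c"]`, while the at-least-once consumer ends up with `["a", "b", "b", "c"]`. Passing the second result through `dedupe` is one way a receiver might cope with the duplicates.
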
● Backlogs: Need more queue consumers, can be asymmetric
● Replays: I hope you can cope with lots of duplicates!
● Overflow: Do you grind to a halt? Or drop?

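The overflow question can be made concrete with a bounded queue from Python's standard library. This is a minimal sketch of the "drop" choice; `offer` and `fill` are hypothetical helper names, and the sizes are arbitrary.

```python
import queue


def offer(q, item, dropped):
    """Try to enqueue without waiting; on overflow, drop the item and record it."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        dropped.append(item)  # the "drop" policy: the producer keeps going
        return False


def fill(maxsize, items):
    """Push items into a bounded queue and report what fit and what was dropped."""
    q = queue.Queue(maxsize=maxsize)
    dropped = []
    for item in items:
        offer(q, item, dropped)
    return q.qsize(), dropped
```

Calling `fill(3, range(5))` leaves three items queued and drops `3` and `4`. Replacing `put_nowait` with a plain blocking `q.put(item)` is the other choice: the producer stalls until a consumer makes room, which is the "grind to a halt" behavior.
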
Kubernetes Example
● User creates a new Deployment
● Deployment controller notices that the deployment has no pods
  ○ To reconcile this, it makes one
● Pod controller notices that the pod is not assigned
  ○ To reconcile this, it assigns it to a node
● Kubelet notices it has an assigned pod that is not running
  ○ To reconcile this, it creates it locally and starts it

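The chain of controllers above can be sketched as three independent functions that each read the shared store and fix one kind of gap. This is a toy model that follows the slide's simplified steps, not the real Kubernetes API: the `store` dict layout, the function names, and the "first node wins" scheduling rule are all assumptions made for illustration.

```python
def deployment_controller(store):
    """Create a pod for every deployment that has none."""
    changed = False
    for name in store["deployments"]:
        if not any(pod["deployment"] == name for pod in store["pods"]):
            store["pods"].append({"deployment": name, "node": None, "running": False})
            changed = True
    return changed


def pod_controller(store):
    """Assign every unassigned pod to a node (here: always the first node)."""
    changed = False
    for pod in store["pods"]:
        if pod["node"] is None:
            pod["node"] = store["nodes"][0]
            changed = True
    return changed


def kubelet(store, node):
    """Start every pod assigned to this node that is not yet running."""
    changed = False
    for pod in store["pods"]:
        if pod["node"] == node and not pod["running"]:
            pod["running"] = True
            changed = True
    return changed


def run_until_settled(store, max_rounds=10):
    """Run all controllers in rounds until a full round changes nothing."""
    for rounds in range(1, max_rounds + 1):
        changed = [
            deployment_controller(store),
            pod_controller(store),
            kubelet(store, store["nodes"][0]),
        ]
        if not any(changed):
            return rounds
    raise RuntimeError("did not converge")
```

Starting from `{"deployments": ["web"], "pods": [], "nodes": ["node-1"]}`, the controllers never call each other; each one only reacts to what it finds in the store, and running them again after the store has settled changes nothing.
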
Making a single Fediverse post
[Diagram: Sending Client → Origin API → Fanout System → Destination Inbox → Timeline Builder → Receiving Client, with two of the hops labeled QUEUE]

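A rough sketch of that pipeline, with each stage talking to the next only through a queue. Where exactly the two queues sit is an assumption here (one after the Origin API, one in front of the Timeline Builder), and the function names, the `followers` mapping, and the message shapes are hypothetical.

```python
from collections import deque


def origin_api(post, outbound):
    """Accept a post from the sending client and enqueue it; nothing else."""
    outbound.append(post)


def fanout(outbound, followers, inbound):
    """Drain the outbound queue into one delivery per follower."""
    while outbound:
        post = outbound.popleft()
        for follower in followers.get(post["author"], []):
            inbound.append({"to": follower, "post": post})


def timeline_builder(inbound, timelines):
    """Drain the inbound queue into per-user timelines for receiving clients."""
    while inbound:
        delivery = inbound.popleft()
        timelines.setdefault(delivery["to"], []).append(delivery["post"]["text"])


def publish(post, followers):
    """Run one post through every stage and return the resulting timelines."""
    outbound, inbound, timelines = deque(), deque(), {}
    origin_api(post, outbound)
    fanout(outbound, followers, inbound)
    timeline_builder(inbound, timelines)
    return timelines
```

In a real deployment each stage would run as its own process and the queues would be durable, so any stage could fall behind or restart without the others noticing.
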
Airflow Example
● A Task Instance is scheduled to run, and the scheduler passes it over
● The Executor makes a new Workload for the Task Instance
● Allocator notices that the Workload has no Runner
  ○ To reconcile this, it assigns it to a runner
● Runner notices it has an assigned Workload that is not running
  ○ To reconcile this, it creates it locally and starts it

● All state in the central store: You can maybe get away with caching elsewhere
● Zero service-to-service comms: Communication paths are failure modes
● Controllers should be simple loops: Query database, reconcile, repeat

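A "query database, reconcile, repeat" controller might look like the sketch below. It uses an in-memory SQLite database as a stand-in for the central store; the `workloads` table, the round-robin assignment rule, and all function names are assumptions for illustration and are not taken from Airflow.

```python
import sqlite3
import time


def make_store():
    """Create a throwaway central store with one table of workloads."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE workloads (id INTEGER PRIMARY KEY, runner TEXT)")
    return conn


def add_workloads(conn, count):
    """Insert workloads that have no runner yet."""
    conn.executemany("INSERT INTO workloads (runner) VALUES (?)", [(None,)] * count)
    conn.commit()


def reconcile_once(conn, runners):
    """One pass: query the store, fix what is wrong, write the result back."""
    rows = conn.execute(
        "SELECT id FROM workloads WHERE runner IS NULL ORDER BY id"
    ).fetchall()
    for (workload_id,) in rows:
        runner = runners[workload_id % len(runners)]
        # The extra "runner IS NULL" guard keeps the update safe to repeat.
        conn.execute(
            "UPDATE workloads SET runner = ? WHERE id = ? AND runner IS NULL",
            (runner, workload_id),
        )
    conn.commit()
    return len(rows)


def control_loop(conn, runners, passes, interval=0.0):
    """Repeat the reconcile pass; the loop keeps no state between passes."""
    fixed = 0
    for _ in range(passes):
        fixed += reconcile_once(conn, runners)
        time.sleep(interval)
    return fixed
```

Because the loop holds nothing in memory between passes, it can be killed and restarted at any point and will pick up from whatever the store says.
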
Remember, there's no universal solution
Just please, please, avoid building a mess of microservices that do RPC calls