Reconciling Everything

RECONCILING EVERYTHING
ANDREW GODWIN // @[email protected]

Andrew Godwin / @[email protected]
Hi, I’m Andrew Godwin
• Principal Engineer at Astronomer.io (Airflow)
• Django Migrations, Channels, Async, others
• Building distributed systems since 2008

Let's start with a bold statement
I hear it drives engagement. Maybe also heckling.

There are only two good* ways to build distributed systems**:
• Event-Driven Via Queues: Totally separate stores of truth that only talk via queues
• Reconciliation Loops: Stateless components that talk to one store of truth

What is "good"?
Low maintenance and high reliability, for me at least.

What kinds of distributed systems?
Remember, friends don't let friends write microservices

• Due to size: What companies often claim is the reason
• Due to team structure: What the actual reason often is
• Due to federation: It's what the internet was built on!

RPC Microservices
[Diagram: Email Service, Accounting Service, Billing Service, Reservation Service]

A system is defined by its failure modes
Nobody gets paged when it's up and running happily

Event-Driven (Message Passing)
[Diagram: Task Runners, Billing System, Log Storage, Analytics Aggregator; three boxes labeled QUEUE]

What kind of queue is it?
Hint: It will not deliver messages exactly once

• At-most-once: In case of failure, a message will not get delivered
• At-least-once: In case of failure, a message will get delivered twice or more
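One common way to live with at-least-once delivery is to make the consumer idempotent. The sketch below is illustrative only: the message shape (`id`, `amount`), the `IdempotentConsumer` class, and the in-memory set of processed IDs are all assumptions, and a real system would keep the processed IDs in a durable store.

```python
# Hypothetical sketch of an idempotent consumer for an at-least-once queue.
# Message IDs and the in-memory "processed_ids" set are illustrative
# assumptions, not part of any particular queue library.


class IdempotentConsumer:
    def __init__(self):
        self.processed_ids = set()
        self.balance = 0

    def handle(self, message):
        """Apply a message once; ignore redelivered duplicates."""
        if message["id"] in self.processed_ids:
            return False  # duplicate: already applied
        self.balance += message["amount"]
        self.processed_ids.add(message["id"])
        return True


def deliver_at_least_once(messages, consumer):
    """Simulate a queue that delivers every message twice."""
    for message in messages + messages:
        consumer.handle(message)


consumer = IdempotentConsumer()
deliver_at_least_once(
    [{"id": "m1", "amount": 10}, {"id": "m2", "amount": 5}], consumer
)
print(consumer.balance)
```

Each message is delivered twice here, but the balance only reflects each one once because the second delivery is recognized by its ID.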

• Backlogs: Need more queue consumers, can be asymmetric
• Replays: I hope you can cope with lots of duplicates!
• Overflow: Do you grind to a halt? Or drop?
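The overflow question can be made concrete with a bounded in-process queue. This is a minimal sketch using Python's standard `queue` module as a stand-in for a real message broker; the `offer` helper is a hypothetical name, and it shows the "drop" choice. Calling `q.put(item)` instead would block the producer, which is the "grind to a halt" choice.

```python
import queue


def offer(q, item):
    """Try to enqueue without blocking; return False if the item was dropped."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False  # overflow policy here: drop the message


q = queue.Queue(maxsize=2)  # a deliberately tiny bound
results = [offer(q, n) for n in range(4)]
print(results)
```

With nothing consuming from the queue, the first two items are accepted and the rest are dropped.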

Neither is "perfect"
(But I recommend at-least-once unless you truly do not care about the data)

Reconciliation (Control) Loop
[Diagram: Kubelet, Deployment Controller, Pod Controller, Kubectl (User), DATABASE]

Reconciliation (Control) Loop
[Diagram: Kubelet, Deployment Controller, Pod Controller, Kubectl (User), apiserver, etcd]

Kubernetes Example
• User creates a new Deployment
• Deployment controller notices that the deployment has no pods
  ◦ To reconcile this, it makes one
• Pod controller notices that the pod is not assigned
  ◦ To reconcile this, it assigns it to a node
• Kubelet notices it has an assigned pod that is not running
  ◦ To reconcile this, it creates it locally and starts it
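The chain of steps above can be sketched as a toy model. Everything here is hypothetical: the `store` dictionary stands in for the central store, the three functions follow the slide's simplified controller names, and none of it is the real Kubernetes API. Note that the controllers never call each other; each one only reads and writes the store.

```python
# Toy model of the flow above. The store layout and controller functions
# are hypothetical; none of this is the real Kubernetes API.

store = {"deployments": {}, "pods": {}}


def deployment_controller(store):
    """Create a pod for any deployment that has none."""
    for name in store["deployments"]:
        if not any(p["deployment"] == name for p in store["pods"].values()):
            store["pods"][f"{name}-pod"] = {
                "deployment": name,
                "node": None,
                "running": False,
            }


def pod_controller(store, nodes):
    """Assign any unassigned pod to a node (always the first one, for brevity)."""
    for pod in store["pods"].values():
        if pod["node"] is None:
            pod["node"] = nodes[0]


def kubelet(store, node):
    """Start any pod assigned to this node that is not yet running."""
    for pod in store["pods"].values():
        if pod["node"] == node and not pod["running"]:
            pod["running"] = True


def reconcile_once(store):
    # In a real system these run as separate processes on their own loops.
    deployment_controller(store)
    pod_controller(store, ["node-a"])
    kubelet(store, "node-a")


store["deployments"]["web"] = {}  # the user creates a new Deployment
reconcile_once(store)
reconcile_once(store)  # a second pass finds nothing left to fix
print(store["pods"])
```

Running the pass a second time changes nothing, which is the property that makes restarts cheap: a controller that comes back up just looks at the store and carries on.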

One "single point of failure" is great
Keeping one service up is much simpler

So how do reconciliation loops fail?
They slow down and halt, but with easy restarts and no data loss

Each controller must be stateless
This lets you scale them up, down, restart, upgrade, and recover easily

Practical example: Takahē
Of course I wrote an ActivityPub/Fediverse server!

The Fediverse is an event-driven system
Well, mostly. It's also imperfect and still evolving.

Making a single Fediverse post
[Diagram: Sending Client, Origin API, Fanout System, Destination Inbox, Timeline Builder, Receiving Client; two boxes labeled QUEUE]

A server needs background workers
Both to fan out sending of posts, and to process incoming posts

Stator
Takahē's state-machine reconciliation system

See more inside
github.com/jointakahe/takahe

You can also do this with Airflow
We have an Astronomer executor which uses this for running tasks

Airflow Example
• A Task Instance is scheduled to run, and the scheduler passes it over
• The Executor makes a new Workload for the Task Instance
• Allocator notices that the Workload has no Runner
  ◦ To reconcile this, it assigns it to a runner
• Runner notices it has an assigned Workload that is not running
  ◦ To reconcile this, it creates it locally and starts it

So, how should you build them?
Carefully! Ha, ha.

• All state in the central store: You can maybe get away with caching elsewhere
• Zero service-to-service comms: Communication paths are failure modes
• Controllers should be simple loops: Query database, reconcile, repeat
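"Query database, reconcile, repeat" can be sketched with nothing but the standard library. This is a minimal illustration, not how any particular system does it: the `workloads` table, its `desired` and `actual` columns, and the function names are all assumptions, and an in-memory SQLite database stands in for the central store.

```python
import sqlite3
import time


def reconcile(db):
    """One pass: find workloads that should be running but are not, and fix them."""
    rows = db.execute(
        "SELECT id FROM workloads WHERE desired = 'running' AND actual != 'running'"
    ).fetchall()
    for (workload_id,) in rows:
        # A real controller would start the workload here; this sketch only
        # records the new state back in the central store.
        db.execute(
            "UPDATE workloads SET actual = 'running' WHERE id = ?", (workload_id,)
        )
    db.commit()
    return len(rows)


def run_loop(db, passes, interval=0.0):
    """Query database, reconcile, repeat. Returns the number of fixes per pass."""
    fixes = []
    for _ in range(passes):
        fixes.append(reconcile(db))
        time.sleep(interval)
    return fixes


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE workloads (id INTEGER PRIMARY KEY, desired TEXT, actual TEXT)"
)
db.executemany(
    "INSERT INTO workloads (desired, actual) VALUES (?, ?)",
    [("running", "pending"), ("running", "running"), ("stopped", "stopped")],
)
fixes = run_loop(db, passes=3)
print(fixes)
```

The loop keeps no state of its own between passes, so it can be killed and restarted at any point and will pick up from whatever the table says.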

They are easy to test and refactor
Mocking your datastore is all that's needed; everything else should flow
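As a rough illustration of that testing claim, the sketch below swaps the datastore for an in-memory fake and tests a controller against it. The `FakeStore` class, its `unassigned` and `assign` methods, and the `allocate` function are hypothetical names chosen for this example; they are not taken from Takahē, Airflow, or Astronomer code.

```python
class FakeStore:
    """In-memory stand-in for the central datastore (workload id -> runner)."""

    def __init__(self, workloads):
        self.workloads = dict(workloads)

    def unassigned(self):
        return [wid for wid, runner in self.workloads.items() if runner is None]

    def assign(self, workload_id, runner):
        self.workloads[workload_id] = runner


def allocate(store, runners):
    """Controller under test: give every unassigned workload a runner, round-robin."""
    for index, workload_id in enumerate(sorted(store.unassigned())):
        store.assign(workload_id, runners[index % len(runners)])


def test_allocate_assigns_everything():
    store = FakeStore({"w1": None, "w2": "r1", "w3": None})
    allocate(store, ["r1", "r2"])
    assert store.unassigned() == []
    assert store.workloads == {"w1": "r1", "w2": "r1", "w3": "r2"}


test_allocate_assigns_everything()
```

Because the controller's only dependency is the store, the test needs no network, queue, or other service to be faked.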

They self-heal!
Provided your datastore stays up, everything else is flexible

There are downsides - mostly max scale
But I generally design systems with upper bounds on their throughput

The datastore will be your bottleneck
But modern DBs are very capable - and you can do replicas!

Queues are still fine!
If you need the complexity and scale, then great. But they are harder!

Remember, there's no universal solution
Just please, please, avoid building a mess of microservices that do RPC calls

Thanks!
Andrew Godwin [email protected]
Want to talk more about Takahē and ActivityPub protocols? Come to the Open Space at 3pm in Room 251AB!
