A talk I gave at PyCon Ukraine 2017.
Hi, I'm Andrew Godwin
Django core developer
Senior Software Engineer at [company logo]
Used to complain about migrations a lot
Distributed Systems
c = 299,792,458 m/s
Early CPUs: 5 MHz clock → c = 60 m propagation distance per cycle (chip ~2 cm across)
Modern CPUs: 3 GHz clock → c = 10 cm propagation distance per cycle
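Those distances are just c divided by the clock frequency: how far light can travel in one clock cycle, and so an upper bound on how far a signal can get per tick. A quick sanity check (plain arithmetic, nothing assumed):

    C = 299_792_458  # speed of light, m/s

    def propagation_distance(clock_hz):
        """How far light travels during one clock cycle, in metres."""
        return C / clock_hz

    print(propagation_distance(5e6))  # ~60 m  (early 5 MHz CPU)
    print(propagation_distance(3e9))  # ~0.1 m (modern 3 GHz CPU)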
Distributed systems are made ofindependent components
They are slower and harder to writethan synchronous systems
But they can be scaled upmuch, much further
Trade-offs
There is never aperfect solution.
Fast / Good / Cheap
[Diagram: a load balancer in front of three WSGI workers]
[Diagram: a load balancer, three WSGI workers, and a shared cache]
[Diagram: a load balancer, three WSGI workers, and a cache per worker]
[Diagram: a load balancer, three WSGI workers, and a database]
CAP Theorem
Consistent / Available / Partition Tolerant
PostgreSQL: CP
- Consistent everywhere
- Handles network latency/drops
- Can't write if the main server is down
Cassandra: AP
- Can read/write to any node
- Handles network latency/drops
- Data can be inconsistent
It's hard to design a productthat might be inconsistent
But if you take the tradeoff,scaling is easy
Otherwise, you must findother solutions
Read Replicas (often called master/slave)
[Diagram: a load balancer, three WSGI workers, one main database, two replicas]
Replicas scale reads forever... but writes must go to one place
If a request writes to a table, it must be pinned to the main database, so later reads do not get old data
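A minimal sketch of this in Django, using a database router. The aliases "main", "replica1" and "replica2" are hypothetical names from settings.DATABASES, and real write-pinning needs per-request state (tracking what the current request has written) on top of this:

    import random

    class ReplicaRouter:
        """Route reads to replicas and writes to the main database."""

        def db_for_read(self, model, **hints):
            # A real router would return "main" here if the current
            # request had already written, to avoid stale reads.
            return random.choice(["replica1", "replica2"])

        def db_for_write(self, model, **hints):
            return "main"

        def allow_relation(self, obj1, obj2, **hints):
            # Main and replicas hold the same data.
            return True

    # settings.py:  DATABASE_ROUTERS = ["myapp.routers.ReplicaRouter"]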
When your write load is toohigh, you must then shard
Vertical Sharding
Users | Tickets | Events | Payments
Horizontal Sharding
Users 0-2 | Users 3-5 | Users 6-8 | Users 9-A
Both
Users 0-2 | Users 3-5 | Users 6-8 | Users 9-A
Events 0-2 | Events 3-5 | Events 6-8 | Events 9-A
Tickets 0-2 | Tickets 3-5 | Tickets 6-8 | Tickets 9-A
Both plus caching
Users 0-2 | Users 3-5 | Users 6-8 | Users 9-A → User Cache
Events 0-2 | Events 3-5 | Events 6-8 | Events 9-A → Event Cache
Tickets 0-2 | Tickets 3-5 | Tickets 6-8 | Tickets 9-A → Ticket Cache
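A sketch of how a request might find its horizontal shard under the 0-2 / 3-5 / 6-8 / 9-A split above, assuming hex-style user IDs; the shard names are made up:

    def shard_for(user_id):
        """Map a user ID to a shard by the ID's first hex digit."""
        digit = int(user_id[0], 16)
        if digit <= 0x2:
            return "users_0_2"
        if digit <= 0x5:
            return "users_3_5"
        if digit <= 0x8:
            return "users_6_8"
        return "users_9_a"

    shard_for("a81f2c")  # -> "users_9_a"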
Teams have to scale too; nobody should have to understand everything in a big system.
Services allow complexity to be reduced - the trade-off is speed
[Diagram: User, Event and Ticket Services, each in front of its own cache and its own shards]
[Diagram: a WSGI server calling the User, Event and Ticket Services]
Each service is its own smaller project, managed and scaled separately.
But how do you communicatebetween them?
Direct Communication
[Diagram: three services, each connected directly to the others]
[Diagram: five services; the direct links multiply]
[Diagram: eight services; a full mesh of connections]
[Diagram: the same services, each connected only to a shared Message Bus]
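The trade is in the connection count: a full mesh needs a link per pair of services, a bus needs one link per service. A quick check:

    def mesh_links(n):
        # Direct communication: every pair of services holds a link.
        return n * (n - 1) // 2

    def bus_links(n):
        # Message bus: every service holds one link, to the bus.
        return n

    print(mesh_links(8), bus_links(8))  # 28 vs 8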
A single point of failure is notalways bad - if the alternativeis multiple, fragile ones
Channels and ASGI providea standard message busbuilt with certain tradeoffs
A Django Channels project:
Django → Channels library → ASGI (channel layer) → backing store (e.g. Redis, RabbitMQ)
Pure Python:
your code → ASGI (channel layer) → backing store (e.g. Redis, RabbitMQ)
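A minimal sketch of sending a message through the channel layer (this is the Channels 2-style API; it assumes a Redis channel layer is configured in settings, and the channel name and message type here are made up):

    from asgiref.sync import async_to_sync
    from channels.layers import get_channel_layer

    channel_layer = get_channel_layer()

    # Put a message dict onto a named channel; a worker listening on
    # that channel receives it via the backing store.
    async_to_sync(channel_layer.send)(
        "my-channel", {"type": "my.message", "text": "hi"}
    )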
Failure Mode
- At most once: messages either do not arrive, or arrive once.
- At least once: messages arrive once, or arrive multiple times.
Guarantees vs. Latency
- Low latency: messages arrive very quickly but go missing more often.
- Low loss rate: messages are almost never lost but arrive more slowly.
Queuing Type
- First In First Out: consistent performance for all users.
- First In Last Out: hides backlogs but makes them worse.
Queue Sizing
- Finite queues: sending can fail.
- Infinite queues: make problems even worse.
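A finite queue makes that failure visible to the sender. A sketch of handling it, assuming the same channel layer setup as above (ChannelFull is Channels' capacity-exceeded exception; the channel name, message, and retry policy are made up):

    from asgiref.sync import async_to_sync
    from channels.exceptions import ChannelFull
    from channels.layers import get_channel_layer

    channel_layer = get_channel_layer()

    try:
        async_to_sync(channel_layer.send)(
            "thumbnails", {"type": "resize", "id": 42}
        )
    except ChannelFull:
        # The queue is at capacity: shed load, retry later, or surface
        # an error - anything but silently growing an infinite backlog.
        pass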
You must understand whatyou are making(This is surprisingly uncommon)
Design as much as possiblearound shared-nothing
Per-machine caches
On-demand thumbnailing
Signed cookie sessions
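The last of those is built into Django: with the signed-cookie session backend, session data travels inside the cookie itself, so any worker can serve any request with no shared session store.

    # settings.py
    SESSION_ENGINE = "django.contrib.sessions.backends.signed_cookies"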
Has to be shared? Try to split it.
Has to be shared? Try sharding it.
Django's job is to beslowly replaced by your code
Just make sure you match theAPI contract of what you'rereplacing!
Don't try to scale too early;you'll pick the wrong tradeoffs.
Thanks.
Andrew Godwin
@andrewgodwin
channels.readthedocs.io