
Scaling Django with Distributed Systems

A talk I gave at PyCon Ukraine 2017.

Andrew Godwin

April 07, 2017

Transcript

  1. (title slide)

  2. Andrew Godwin
    Hi, I'm
    Django core developer
    Senior Software Engineer at
    Used to complain about migrations a lot


  3. Distributed Systems


  4. c = 299,792,458 m/s


  5. Early CPUs
    5 MHz clock, die ~2 cm
    c = 60 m propagation distance per clock cycle

  6. Modern CPUs
    3 GHz clock
    c = 10 cm propagation distance per clock cycle
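The propagation distances on these two slides are just the speed of light divided by the clock frequency — light can only travel so far within one cycle:

```python
c = 299_792_458  # speed of light, m/s

# Distance light travels during one clock cycle: d = c / f
early = c / 5e6   # 5 MHz CPU
modern = c / 3e9  # 3 GHz CPU

print(round(early), "m")      # → 60 m per cycle
print(round(modern, 2), "m")  # → 0.1 m (10 cm) per cycle
```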

  7. Distributed systems are made of
    independent components


  8. They are slower and harder to write
    than synchronous systems


  9. But they can be scaled up
    much, much further


  10. Trade-offs


  11. There is never a
    perfect solution.


  12. Fast
    Good
    Cheap


  13. (image-only slide)

  14. (diagram) Load Balancer → three WSGI Workers

  15. (diagram) Load Balancer → three WSGI Workers → shared Cache

  16. (diagram) Load Balancer → three WSGI Workers → three Caches
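The cache tier in these diagrams is typically used cache-aside: check the cache first, fall back to the database, then populate the cache for the next request. A minimal sketch, with plain dicts standing in for memcached/Redis and the real database:

```python
CACHE = {}                          # stand-in for memcached/Redis
DATABASE = {1: "alice", 2: "bob"}   # stand-in for the real database

def get_user(user_id):
    """Cache-aside read: try the cache, fall back to the DB,
    then fill the cache so the next read is a hit."""
    user = CACHE.get(user_id)
    if user is None:
        user = DATABASE[user_id]  # slow path
        CACHE[user_id] = user
    return user

get_user(1)         # first call: miss, hits the database
print(get_user(1))  # second call: served from the cache → alice
```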

  17. (diagram) Load Balancer → three WSGI Workers → Database

  18. CAP Theorem


  19. Partition Tolerant
    Consistent
    Available


  20. PostgreSQL: CP
    Consistent everywhere
    Handles network latency/drops
    Can't write if main server is down


  21. Cassandra: AP
    Can read/write to any node
    Handles network latency/drops
    Data can be inconsistent


  22. It's hard to design a product
    that might be inconsistent


  23. But if you take the tradeoff,
    scaling is easy


  24. Otherwise, you must find
    other solutions


  25. Read Replicas
    (often called master/slave)
    (diagram) Load Balancer → three WSGI Workers → Main + two Replicas

  26. Replicas scale reads forever...
    But writes must go to one place


  27. If a request writes to a table
    it must be pinned there, so
    later reads do not get old data

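In Django, this read/write split plus pinning can be sketched as a database router (an entry for the `DATABASE_ROUTERS` setting). The replica alias names here are assumptions, and a real setup would also expire the pin after replication catches up:

```python
import random
import threading

_state = threading.local()  # per-thread (roughly per-request) pin flag

class ReplicaRouter:
    """Reads go to a random replica; writes go to the primary
    ('default'); and once a thread has written, its later reads are
    pinned to the primary so it never sees stale replica data."""

    replicas = ["replica1", "replica2"]  # assumed DATABASES aliases

    def db_for_read(self, model, **hints):
        if getattr(_state, "pinned", False):
            return "default"  # pinned: read your own writes
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        _state.pinned = True  # pin this thread's subsequent reads
        return "default"
```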

  28. When your write load is too
    high, you must then shard


  29. Vertical Sharding
    Users
    Tickets
    Events
    Payments


  30. Horizontal Sharding
    Users 0-2 | Users 3-5 | Users 6-8 | Users 9-A
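Picking the shard is usually a stable hash of the key, so the same user always lands on the same database. A sketch (the shard alias names are made up):

```python
import hashlib

# Hypothetical aliases for the shard databases
USER_SHARDS = ["users_0_2", "users_3_5", "users_6_8", "users_9_a"]

def shard_for(user_id: str) -> str:
    """Stable hash → shard index. Uses md5 rather than hash() so the
    mapping survives restarts and is identical on every server."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return USER_SHARDS[int(digest, 16) % len(USER_SHARDS)]

print(shard_for("user-42") == shard_for("user-42"))  # → True: always the same shard
```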

  31. Both
    Users 0-2 | Users 3-5 | Users 6-8 | Users 9-A
    Events 0-2 | Events 3-5 | Events 6-8 | Events 9-A
    Tickets 0-2 | Tickets 3-5 | Tickets 6-8 | Tickets 9-A

  32. Both plus caching
    User Cache → Users 0-2 | Users 3-5 | Users 6-8 | Users 9-A
    Event Cache → Events 0-2 | Events 3-5 | Events 6-8 | Events 9-A
    Ticket Cache → Tickets 0-2 | Tickets 3-5 | Tickets 6-8 | Tickets 9-A

  33. Teams have to scale too;
    nobody should have to understand
    everything in a big system.


  34. Services allow complexity to
    be reduced, in exchange
    for speed.


  35. (diagram)
    User Service → User Cache → Users 0-2 | 3-5 | 6-8 | 9-A
    Event Service → Event Cache → Events 0-2 | 3-5 | 6-8 | 9-A
    Ticket Service → Ticket Cache → Tickets 0-2 | 3-5 | 6-8 | 9-A

  36. (diagram) WSGI Server → User Service | Event Service | Ticket Service

  37. Each service is its own,
    smaller project, managed and
    scaled separately.


  38. But how do you communicate
    between them?


  39. Direct Communication
    (diagram) Service 1 ↔ Service 2 ↔ Service 3

  40. (diagram) Services 1-5, directly interconnected

  41. (diagram) Services 1-8, directly interconnected

  42. (diagram) Services 1-3 directly interconnected vs. Services 1-3 via a Message Bus

  43. A single point of failure is not
    always bad - if the alternative
    is multiple, fragile ones


  44. Channels and ASGI provide
    a standard message bus
    built with certain tradeoffs


  45. Django Channels Project
    (diagram) Django → Channels Library → ASGI (Channel Layer) → Backing Store (e.g. Redis, RabbitMQ)

  46. (diagram) Pure Python → ASGI (Channel Layer) → Backing Store (e.g. Redis, RabbitMQ)
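The channel-layer idea — named FIFO channels that any process can send to or receive from — can be mimicked in a few lines of pure Python. This is a toy in-process stand-in, not the real Redis- or RabbitMQ-backed layers:

```python
import queue
from collections import defaultdict

class ToyChannelLayer:
    """In-process sketch of a channel layer: named FIFO channels
    with finite capacity, so sends can fail -- one of the tradeoffs
    the next slides enumerate."""

    def __init__(self, capacity=100):
        self._channels = defaultdict(lambda: queue.Queue(maxsize=capacity))

    def send(self, channel, message):
        # Raises queue.Full when the channel is over capacity
        self._channels[channel].put_nowait(message)

    def receive(self, channel):
        # Raises queue.Empty when there is nothing to consume
        return self._channels[channel].get_nowait()

layer = ToyChannelLayer()
layer.send("thumbnail", {"image_id": 7})
print(layer.receive("thumbnail"))  # → {'image_id': 7}
```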

  47. Failure Mode
    At most once:
    messages either do not arrive, or arrive once.
    At least once:
    messages arrive once, or arrive multiple times.

  48. Guarantees vs. Latency
    Low latency:
    messages arrive very quickly, but go missing more often.
    Low loss rate:
    messages are almost never lost, but arrive more slowly.

  49. Queuing Type
    First In First Out:
    consistent performance for all users.
    First In Last Out:
    hides backlogs but makes them worse.

  50. Queue Sizing
    Finite queues:
    sending can fail.
    Infinite queues:
    make problems even worse.
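The finite-queue tradeoff is easy to see with Python's bounded `queue.Queue`: once the queue is full, a non-blocking send raises, and the sender must decide whether to retry, drop, or apply backpressure:

```python
import queue

q = queue.Queue(maxsize=2)  # finite queue
q.put_nowait("job-1")
q.put_nowait("job-2")

try:
    q.put_nowait("job-3")   # queue is full
except queue.Full:
    print("send failed -- retry, drop, or apply backpressure")
```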

  51. You must understand what
    you are making
    (This is surprisingly uncommon)


  52. Design as much as possible
    around shared-nothing


  53. Per-machine caches
    On-demand thumbnailing
    Signed cookie sessions

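Signed cookie sessions are the shared-nothing version of session storage: the session data lives in the client's cookie, and only an HMAC signature with a server-side secret keeps it trustworthy, so no server needs a shared session store. A bare-bones sketch of the signing idea (Django's real implementation is the `signed_cookies` session backend, built on `SECRET_KEY`):

```python
import base64
import hashlib
import hmac

SECRET = b"assumed-secret-key"  # in Django, SECRET_KEY plays this role

def sign(payload: bytes) -> str:
    """Attach an HMAC so the server can detect client tampering."""
    mac = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + mac

def unsign(token: str) -> bytes:
    """Verify the HMAC before trusting client-supplied data."""
    data, mac = token.rsplit(".", 1)
    payload = base64.urlsafe_b64decode(data)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("cookie was tampered with")
    return payload

token = sign(b'{"user_id": 42}')
print(unsign(token))  # → b'{"user_id": 42}'
```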

  54. Has to be shared?
    Try to split it


  55. Has to be shared?
    Try sharding it.


  56. Django's job is to be
    slowly replaced by your code


  57. Just make sure you match the
    API contract of what you're
    replacing!

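Matching the API contract means your replacement is duck-type compatible: same method names, same semantics. For example, a hypothetical hand-rolled cache only has to honour the get/set shape the calling code already relies on:

```python
class TinyCache:
    """Hypothetical drop-in replacement: as long as it keeps the
    get/set contract of the component it replaces, none of the
    calling code has to change."""

    def __init__(self):
        self._data = {}

    def set(self, key, value, timeout=None):
        # timeout is accepted to match the contract, ignored here
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

cache = TinyCache()
cache.set("greeting", "hello", timeout=30)
print(cache.get("greeting"))             # → hello
print(cache.get("missing", "fallback"))  # → fallback
```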

  58. Don't try to scale too early;
    you'll pick the wrong tradeoffs.


  59. Thanks.
    Andrew Godwin
    @andrewgodwin
    channels.readthedocs.io
