Messaging at Scale at Instagram by Rick Branson

PyCon 2013
March 17, 2013

Transcript

  1. Messaging at Scale
    at Instagram
    Rick Branson, Infrastructure Engineer

  2. Messaging at Scale
    at Instagram
    Rick Branson, Infrastructure Engineer
    ASYNC TASKS
    AT INSTAGRAM

  3. Instagram Feed

  4. I see photos posted by the
    accounts I follow.

  5. Photos are time-ordered
    from newest to oldest.

  6. SELECT * FROM photos
    WHERE author_id IN
    (SELECT target_id FROM following
    WHERE source_id = %(user_id)d)
    ORDER BY creation_time DESC
    LIMIT 10;
    Naive Approach

  7. O(∞)
    •Fetch All Accounts You Follow
    •Fetch All Photos By Those Accounts
    •Sort Photos By Creation Time
    •Return First 10

  8. Per-Account Bounded List of Media IDs
    [Diagram: a row of account IDs (382, 487, 1287, 880, 27, 3201, 441, 6690, 12), each with its own bounded list of media IDs.]

  9. [Diagram: the same row of account IDs. A new photo, media ID 943058139, is posted; the author's followers are looked up:]
    SELECT follower_id FROM followers
    WHERE user_id = 9023;
    => {487, 3201, 441}

  10. [Diagram: media ID 943058139 is pushed onto the bounded list of each follower in {487, 3201, 441}.]

  11. Fanout-On-Write
    •O(1) read cost
    •O(N) write cost (N = followers)
    •Reads outnumber writes 100:1 or more
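
    A minimal sketch of this fanout-on-write pattern, assuming a redis-py client holds the per-account bounded lists; the key scheme, the 100-entry bound, and the get_follower_ids helper are illustrative assumptions rather than details from the talk:

        import redis

        FEED_KEY = "feed:%d"   # hypothetical per-account key scheme
        FEED_LENGTH = 100      # assumed bound on each list

        r = redis.StrictRedis()

        def fan_out_media(media_id, author_id):
            # O(N) write: push the new media ID onto every follower's bounded list.
            for follower_id in get_follower_ids(author_id):  # hypothetical helper
                key = FEED_KEY % follower_id
                r.lpush(key, media_id)              # newest first
                r.ltrim(key, 0, FEED_LENGTH - 1)    # keep the list bounded

        def read_feed(user_id):
            # O(1) read: one bounded-list fetch, already newest-to-oldest.
            return r.lrange(FEED_KEY % user_id, 0, 9)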

  12. Reliability Problems
    •Database Servers Fail
    •Web Request is a Scary Place
    •Justin Bieber (Millions of Followers)

  13. [Diagram: the web tier publishes tasks 46-51 to the broker, which queues them and hands them out to a pool of workers; tasks 46 and 47 are being processed.]

  14. [Diagram: the same pipeline, but the worker holding task 46 has died (marked with an X).]

  15. [Diagram: the broker redistributes task 46 from the dead worker to a surviving worker.]

  16. Chained Tasks
    deliver(photo_id=1234,
    following_id=5678,
    cursor=None)

  17. Chained Tasks
    deliver(photo_id=1234,
    following_id=5678,
    cursor=None)
    deliver(photo_id=1234,
    following_id=5678,
    cursor=3493)

  18. Chained Tasks
    •Batch of 10,000 Followers Per Task
    •Tasks Yield Successive Tasks
    •Much Finer-Grained Load Balancing
    •Failure/Reload Penalty Low
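
    A hedged sketch of how a chained delivery task with this shape might look, following the deliver(...) signature from the previous slides; get_follower_batch and push_to_feed are hypothetical helpers, and the old-style celery.task decorator import is assumed:

        from celery.task import task  # old-style decorator, as used on later slides

        FOLLOWER_BATCH_SIZE = 10000   # batch size quoted on the slide

        @task(routing_key="feed_delivery")
        def deliver(photo_id, following_id, cursor=None):
            # Fetch one batch of followers starting at the cursor (hypothetical helper).
            follower_ids, next_cursor = get_follower_batch(
                following_id, cursor, FOLLOWER_BATCH_SIZE)
            for follower_id in follower_ids:
                push_to_feed(follower_id, photo_id)   # hypothetical helper
            # Yield the successive task instead of looping here: a crash or requeue
            # costs at most one batch, and the broker can rebalance between batches.
            if next_cursor is not None:
                deliver.delay(photo_id, following_id, cursor=next_cursor)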

  19. What else?

  20. Other Async Tasks
    •Cross-Posting to Other Networks
    •Search Indexing
    •Spam Analysis
    •Account Deletion
    •API Hook

  21. In the beginning...

  22. Gearman & Python
    •Simple, Purpose-Built Task Queue
    •Weak Framework Support
    •We just built ad hoc worker scripts
    •A mess to add new job types & capacity

  23. Gearman in Production
    •Persistence horrifically slow, complex
    •So we ran out of memory and crashed, no recovery
    •Single core, didn’t scale well: 60ms mean submission time for us
    •Probably should have just used Redis

  24. We needed a
    fresh start.

  25. WARNING
    System had to be in production before the heat
    death of the universe. We are probably doing
    something stupid!

  26. Celery
    • Distributed Task Framework
    • Highly Extensible, Pluggable
    • Mature, Feature Rich
    • Great Tooling
    • Excellent Django Support
    • celeryd

  27. Which broker?

  28. Redis
    •We Already Use It
    •Very Fast, Efficient
    •Polling For Task Distribution
    •Messy Non-Synchronous Replication
    •Memory Limits Task Capacity

  29. Beanstalk
    • Purpose-Built Task Queue
    • Very Fast, Efficient
    • Pushes to Consumers
    • Spills to Disk
    • No Replication
    • Useless For Anything Else

  30. RabbitMQ
    • Reasonably Fast, Efficient
    • Spill-To-Disk
    • Low-Maintenance Synchronous Replication
    • Excellent Celery Compatibility
    • Supports Other Use Cases
    • We don’t know Erlang

  31. Our RabbitMQ Setup
    •RabbitMQ 3.0
    •Clusters of Two Broker Nodes, Mirrored
    •Scale Out By Adding Broker Clusters
    •EC2 c1.xlarge, RAID instance storage
    •Way Overprovisioned

  32. Alerting
    •We use Sensu
    •Monitors & alerts on queue length threshold
    •Uses rabbitmqctl list_queues
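
    Not Instagram's actual Sensu check, but a rough sketch of the kind of queue-length check described here; the threshold, output parsing, and exit codes are assumptions:

        import subprocess
        import sys

        THRESHOLD = 10000  # assumed queue-length threshold

        def check_queue_lengths():
            # `rabbitmqctl list_queues` prints one "<name> <messages>" row per queue.
            output = subprocess.check_output(["rabbitmqctl", "list_queues"])
            backed_up = []
            for line in output.splitlines():
                parts = line.split()
                if len(parts) != 2 or not parts[1].isdigit():
                    continue  # skip banner/footer lines
                name, depth = parts[0], int(parts[1])
                if depth > THRESHOLD:
                    backed_up.append("%s=%d" % (name, depth))
            if backed_up:
                print "CRITICAL: backed-up queues: %s" % ", ".join(backed_up)
                sys.exit(2)  # Sensu/Nagios-style critical exit status
            print "OK: all queues under threshold"

        if __name__ == "__main__":
            check_queue_lengths()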

  33. Graphing
    •We use graphite & statsd
    •Per-task sent/fail/success/retry graphs
    •Using celery's hooks to make them possible
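
    The hook code isn't shown in the deck, but wiring per-task counters to statsd with Celery's signals might look roughly like this (the handler names, client setup, and metric names are assumptions):

        from celery import signals
        import statsd

        stats = statsd.StatsClient("localhost", 8125)  # assumed statsd address

        @signals.task_sent.connect
        def on_sent(task=None, **kwargs):
            stats.incr("celery.%s.sent" % task)

        @signals.task_success.connect
        def on_success(sender=None, **kwargs):
            stats.incr("celery.%s.success" % sender.name)

        @signals.task_failure.connect
        def on_failure(sender=None, **kwargs):
            stats.incr("celery.%s.fail" % sender.name)

        # Retries can be counted the same way from the task's retry path.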

  34. [Diagram: broker clusters mirrored across availability zones us-east-1a and us-east-1e (pairs 0A/0E, 1A/1E, 2A/2E), shared by the web and worker tiers.]

  35. Mean vs P90 Publish Times (ms)

  36. Tasks per second

  37. Aggregate CPU% (all RabbitMQs)

  38. Wait, ~4000 tasks/sec...
    I thought you said scale?

  39. ~25,000 app threads
    publishing tasks

  40. Spans Datacenters

  41. Scale Out

  42. Celery IRL
    •Easy to understand, new engineers come up to speed in 15 minutes.
    •New job types deployed without fuss.
    •We hack the config a bit to get what we want.

  43. @task(routing_key="task_queue")
      def task_function(task_arg, another_task_arg):
          do_things()
      Related tasks run on the same queue

  44. task_function.delay("foo", "bar")
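
    Roughly speaking, .delay() serializes the arguments, publishes a message onto the queue that the task's routing key maps to, and returns immediately; a worker consuming that queue picks the task up and runs it.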

  45. Scaling Out
    •Celery only supported 1 broker host last year when we started.
    •Created kombu-multibroker "shim"
    •Multiple brokers used in a round-robin fashion.
    •Breaks some Celery management tools :(

  46. Concurrency Models
    •multiprocessing (pre-fork)
    •eventlet
    •gevent
    •threads

  47. gevent is cool and all, but
    only some of our tasks
    will run right under it.

  48. celeryd_multi
    Run multiple workers with different parameters
    (such as concurrency settings)

  49. from kombu import Queue

      CELERY_QUEUE_CONFIG = {
          "default": (
              "normal_task",
          ),
          "gevent": (
              "evented_task",
          ),
      }

      CELERY_QUEUE_GROUP = "default"

      CELERY_QUEUES = [Queue("celery.%s" % key, routing_key=key)
                       for key in CELERY_QUEUE_CONFIG[CELERY_QUEUE_GROUP]]
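
    Presumably each worker tier overrides CELERY_QUEUE_GROUP (e.g. setting it to "gevent" on the evented boxes) so that a worker only declares and consumes the queues for its group, while the routing_key on each @task decides which group a given task lands on.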

  50. gevent = Network Bound
    •Facebook API
    •Tumblr API
    •Various Background S3 Tasks
    •Checking URLs for Spam

  51. Problem:
    Network-Bound Tasks Sometimes
    Need To Take Some Action

  52. # Ran on "gevent" worker (network-bound)
      @task(routing_key="task_remote_access")
      def check_url(object_id, url):
          is_bad = run_url_check(url)
          if is_bad:
              take_some_action.delay(object_id, url)

      # Ran on "processes" worker
      @task(routing_key="task_action")
      def take_some_action(object_id, url):
          do_some_database_thing()

  53. Problem:
    Slow Tasks Monopolize Workers

  54. [Diagram: the main worker process fetches a batch of tasks (0-5) from the broker and hands them to child workers 0 and 1; it waits until the whole batch finishes before grabbing another one.]

  55. •Run higher concurrency?
    Inefficient :(
    •Lower batch (prefetch) size?
    Min is concurrency count, inefficient :(
    •Separate slow & fast tasks :)

  56. CELERY_QUEUE_CONFIG = {
          "default": (
              "slow_task",
          ),
          "gevent": (
              "evented_task",
          ),
          "fast": (
              "fast_task",
          ),
          "feed": (
              "feed_delivery",
          ),
      }

  57. Our Concurrency Levels
    fast (14)
    default (6)
    feed (12)

  58. Problem:
    Tasks Fail Sometimes

  59. @task(routing_key="media_activation")
      def deactivate_media_content(media_id):
          try:
              media = get_media_store_object(media_id)
              media.deactivate()
          except MediaContentRemoteOperationError, e:
              raise deactivate_media_content.retry(countdown=60)
      Wait 60 seconds before retrying.

  60. Problem:
    Worker Crashes Still Lose Tasks

  61. Normal Flow
    1. Get Tasks
    2. Worker Starts Task
    3. Ack Sent to Broker
    4. Worker Finishes Task

  62. ACKS_LATE Flow
    1. Get Tasks
    2. Worker Starts Task
    3. Worker Finishes Task
    4. Ack Sent to Broker

  63. @task(routing_key="feed_delivery", acks_late=True)
      def deliver_media_to_follower_feeds(media_id,
                                          following_user_id,
                                          resume_at=None):
          ...

  64. Why not do this
    everywhere?
    •Tasks must be idempotent!
    •That probably is the case anyway :(
    •Mirroring can cause duplicate tasks
    •FLP Impossibility
    FFFFFFFFFUUUUUUUUU!!!!
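
    As a hedged illustration of the idempotency requirement (not code from the talk): with acks_late a task may be delivered and run more than once, so it has to be safe to repeat, e.g. by checking state before acting; is_active() is a hypothetical check:

        from celery.task import task

        @task(routing_key="media_activation", acks_late=True)
        def deactivate_media_content(media_id):
            media = get_media_store_object(media_id)
            # Safe to run twice after a crash or duplicate delivery: an
            # already-deactivated media object is simply left alone.
            if media.is_active():   # hypothetical state check
                media.deactivate()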

  65. There is no such thing as
    running tasks exactly-once.

  66. "... it is impossible for one
    process to tell whether another
    has died (stopped entirely) or is
    just running very slowly."
    Impossibility of Distributed Consensus with One Faulty Process
    Fischer, Lynch, Paterson (1985)

  67. FLP Proof Gives Us Choices:
    To retry or not to retry

  68. Problem:
    Early on, we noticed overloaded
    brokers were dropping tasks...

  69. Publisher Confirms
    •AMQP default is that we don't know if things were published or not. :(
    •Publisher Confirms makes broker send acknowledgements back on publishes.
    •kombu-multibroker forces this.
    •Can cause duplicate tasks. (FLP again!)

  70. Other Rules of
    Thumb

  71. Avoid using async tasks as a
    "backup" mechanism only during
    failures. It'll probably break.

  72. @task(routing_key="media_activation")
      def deactivate_media_content(media_id):
          try:
              media = get_media_store_object(media_id)
              media.deactivate()
          except MediaContentRemoteOperationError, e:
              raise deactivate_media_content.retry(countdown=60)
      Only pass self-contained, non-opaque data (strings, numbers, arrays, lists, and dicts) as arguments to tasks.

  73. Tasks should usually execute
    within a few seconds. They
    gum up the works otherwise.

  74. CELERYD_TASK_SOFT_TIME_LIMIT = 20
    CELERYD_TASK_TIME_LIMIT = 30
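
    For context, the soft limit raises an exception inside the task while the hard limit terminates the worker process; a task can catch the soft one to clean up. A generic sketch, not code from the talk (do_things and clean_up are hypothetical):

        from celery.task import task
        from celery.exceptions import SoftTimeLimitExceeded

        @task(routing_key="task_queue")
        def long_task(arg):
            try:
                do_things(arg)                # hypothetical slow work
            except SoftTimeLimitExceeded:
                # Raised at 20s (CELERYD_TASK_SOFT_TIME_LIMIT); tidy up before the
                # 30s hard limit (CELERYD_TASK_TIME_LIMIT) kills the worker process.
                clean_up(arg)                 # hypothetical cleanup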

  75. FUTURE
    •Better Grip on RabbitMQ Performance
    •Utilize Result Storage
    •Single Cluster for Control Queues
    •Eliminate kombu-multibroker

  76. We're hiring!
    [email protected]
