Slide 1

Messaging at Scale at Instagram Rick Branson, Infrastructure Engineer

Slide 2

ASYNC TASKS AT INSTAGRAM

Slide 3

Instagram Feed

Slide 4

I see photos posted by the accounts I follow.

Slide 5

Photos are time-ordered from newest to oldest.

Slide 6

Naive Approach

SELECT * FROM photos
WHERE author_id IN (
    SELECT target_id FROM following WHERE source_id = %(user_id)d
)
ORDER BY creation_time DESC
LIMIT 10;

Slide 7

O(∞)
•Fetch All Accounts You Follow
•Fetch All Photos By Those Accounts
•Sort Photos By Creation Time
•Return First 10
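
The steps above can be sketched in a few lines of Python; `following` and `photos_by_author` are hypothetical in-memory stand-ins for the tables:

```python
from itertools import chain

def naive_feed(user_id, following, photos_by_author, limit=10):
    """Naive read path: touch every followed account's entire photo
    history, sort it all, keep the newest `limit` items. Work grows
    with total photos posted, hence the O(∞) on the slide."""
    followed = following.get(user_id, set())
    candidates = chain.from_iterable(
        photos_by_author.get(author, []) for author in followed)
    # Each candidate is a (creation_time, media_id) pair, newest first.
    return sorted(candidates, key=lambda p: p[0], reverse=True)[:limit]
```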

Slide 8

[Diagram] Per-Account Bounded List of Media IDs

Slide 9

[Diagram] A new photo (media ID 943058139) is posted. The followers of the posting account are fetched:
SELECT follower_id FROM followers WHERE user_id = 9023;
returns {487, 3201, 441}

Slide 10

[Diagram] Media ID 943058139 is pushed onto the bounded media-ID list of each follower in {487, 3201, 441}.

Slide 11

Fanout-On-Write
•O(1) read cost
•O(N) write cost (N = followers)
•Reads outnumber writes 100:1 or more
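
The write path can be sketched with plain dicts standing in for the per-account stores; the `FEED_LIMIT` value and function names are assumptions for illustration, not Instagram's actual implementation:

```python
FEED_LIMIT = 500  # assumed bound; the real list size isn't given in the talk

def fanout_on_write(media_id, author_id, followers, feeds, limit=FEED_LIMIT):
    """O(N) write: push media_id onto each follower's bounded feed list."""
    for follower_id in followers.get(author_id, ()):
        feed = feeds.setdefault(follower_id, [])
        feed.insert(0, media_id)  # newest first
        del feed[limit:]          # keep the list bounded

def read_feed(user_id, feeds, count=10):
    """O(1) read: the feed is precomputed and already time-ordered."""
    return feeds.get(user_id, [])[:count]
```

Paying O(N) on the (rare) write to make the (common) read a single bounded list fetch is the whole trade.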

Slide 12

Reliability Problems
•Database Servers Fail
•Web Request is a Scary Place
•Justin Bieber (Millions of Followers)

Slide 13

[Diagram] Web publishes tasks 46–51 to the Broker; workers pull and process them.

Slide 14

[Diagram] Same flow, but the worker holding task 46 fails (X).

Slide 15

[Diagram] After the failure, task 46 is redistributed to a surviving worker.

Slide 16

Chained Tasks

deliver(photo_id=1234, following_id=5678, cursor=None)

Slide 17

Chained Tasks

deliver(photo_id=1234, following_id=5678, cursor=None)
deliver(photo_id=1234, following_id=5678, cursor=3493)

Slide 18

Chained Tasks
•Batch of 10,000 Followers Per Task
•Tasks Yield Successive Tasks
•Much Finer-Grained Load Balancing
•Failure/Reload Penalty Low
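
The chaining pattern above can be sketched in plain Python; `get_follower_batch`, `enqueue`, and `push` are hypothetical stand-ins for the follower store, the broker, and the feed write (the real version is a Celery task):

```python
def deliver(photo_id, following_id, cursor, get_follower_batch, enqueue, push,
            batch=10_000):
    """Deliver one batch of followers, then chain a successor task."""
    followers, next_cursor = get_follower_batch(following_id, cursor, batch)
    for follower_id in followers:
        push(follower_id, photo_id)
    if next_cursor is not None:
        # Enqueue the next batch as a fresh task instead of looping here:
        # a crash only loses one batch, and the broker can rebalance the
        # remaining work across workers between batches.
        enqueue(photo_id, following_id, next_cursor)
```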

Slide 19

What else?

Slide 20

Other Async Tasks
•Cross-Posting to Other Networks
•Search Indexing
•Spam Analysis
•Account Deletion
•API Hook

Slide 21

In the beginning...

Slide 22

Gearman & Python
•Simple, Purpose-Built Task Queue
•Weak Framework Support
•We just built ad hoc worker scripts
•A mess to add new job types & capacity

Slide 23

Gearman in Production
•Persistence horrifically slow, complex
•So we ran out of memory and crashed, no recovery
•Single core, didn’t scale well: 60ms mean submission time for us
•Probably should have just used Redis

Slide 24

We needed a fresh start.

Slide 25

WARNING
System had to be in production before the heat death of the universe. We are probably doing something stupid!

Slide 26

Celery
•Distributed Task Framework
•Highly Extensible, Pluggable
•Mature, Feature Rich
•Great Tooling
•Excellent Django Support
•celeryd

Slide 27

Which broker?

Slide 28

Redis
•We Already Use It
•Very Fast, Efficient
•Polling For Task Distribution
•Messy Non-Synchronous Replication
•Memory Limits Task Capacity

Slide 29

Beanstalk
•Purpose-Built Task Queue
•Very Fast, Efficient
•Pushes to Consumers
•Spills to Disk
•No Replication
•Useless For Anything Else

Slide 30

RabbitMQ
•Reasonably Fast, Efficient
•Spill-To-Disk
•Low-Maintenance Synchronous Replication
•Excellent Celery Compatibility
•Supports Other Use Cases
•We don’t know Erlang

Slide 31

Our RabbitMQ Setup
•RabbitMQ 3.0
•Clusters of Two Broker Nodes, Mirrored
•Scale Out By Adding Broker Clusters
•EC2 c1.xlarge, RAID instance storage
•Way Overprovisioned

Slide 32

Alerting
•We use Sensu
•Monitors & alerts on queue length threshold
•Uses rabbitmqctl list_queues
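
A sketch of the kind of threshold check such a monitor might run over the output of `rabbitmqctl list_queues name messages`; the function name and threshold are assumptions, not Instagram's actual Sensu check:

```python
def queues_over_threshold(rabbitmqctl_output, threshold):
    """Parse `rabbitmqctl list_queues name messages` output and return
    (queue, depth) pairs whose message count exceeds the threshold."""
    over = []
    for line in rabbitmqctl_output.splitlines():
        parts = line.split()
        # Skip banner lines like "Listing queues ..."; keep "name count" rows.
        if len(parts) == 2 and parts[1].isdigit():
            name, depth = parts[0], int(parts[1])
            if depth > threshold:
                over.append((name, depth))
    return over
```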

Slide 33

Graphing
•We use graphite & statsd
•Per-task sent/fail/success/retry graphs
•Using celery's hooks to make them possible
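
A minimal in-memory stand-in for those per-task counters; the real version would call a statsd client from handlers attached to Celery's task signals (e.g. `task_success`, `task_failure`, `task_retry`), which is not shown here:

```python
from collections import Counter

class TaskMetrics:
    """Toy counter keyed the way per-task statsd metrics might be named.
    A signal handler would call incr() once per sent/success/fail/retry."""
    def __init__(self):
        self.counts = Counter()

    def incr(self, task_name, event):
        self.counts["tasks.%s.%s" % (task_name, event)] += 1
```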

Slide 34

[Diagram] Broker clusters 0–2, each with one node in us-east-1a (0A, 1A, 2A) and one in us-east-1e (0E, 1E, 2E), serving the web workers.

Slide 35

[Chart] Mean vs P90 Publish Times (ms)

Slide 36

[Chart] Tasks per second

Slide 37

[Chart] Aggregate CPU% (all RabbitMQs)

Slide 38

Wait, ~4000 tasks/sec... I thought you said scale?

Slide 39

~25,000 app threads publishing tasks

Slide 40

Spans Datacenters

Slide 41

Scale Out

Slide 42

Celery IRL
•Easy to understand; new engineers come up to speed in 15 minutes.
•New job types deployed without fuss.
•We hack the config a bit to get what we want.

Slide 43

@task(routing_key="task_queue")
def task_function(task_arg, another_task_arg):
    do_things()

Related tasks run on the same queue

Slide 44

task_function.delay("foo", "bar")

Slide 45

Scaling Out
•Celery only supported 1 broker host last year when we started.
•Created kombu-multibroker "shim"
•Multiple brokers used in a round-robin fashion.
•Breaks some Celery management tools :(

Slide 46

Concurrency Models
•multiprocessing (pre-fork)
•eventlet
•gevent
•threads

Slide 47

gevent is cool and all, but only some of our tasks will run right under it.

Slide 48

celeryd_multi
Run multiple workers with different parameters (such as concurrency settings)

Slide 49

CELERY_QUEUE_CONFIG = {
    "default": ("normal_task",),
    "gevent": ("evented_task",),
}

CELERY_QUEUE_GROUP = "default"

CELERY_QUEUES = [Queue("celery.%s" % key, routing_key=key)
                 for key in CELERY_QUEUE_CONFIG[CELERY_QUEUE_GROUP]]

Slide 50

gevent = Network Bound
•Facebook API
•Tumblr API
•Various Background S3 Tasks
•Checking URLs for Spam

Slide 51

Problem: Network-Bound Tasks Sometimes Need To Take Some Action

Slide 52

Ran on "gevent" worker:

@task(routing_key="task_remote_access")
def check_url(object_id, url):
    is_bad = run_url_check(url)
    if is_bad:
        take_some_action.delay(object_id, url)

Ran on "processes" worker:

@task(routing_key="task_action")
def take_some_action(object_id, url):
    do_some_database_thing()

Slide 53

Problem: Slow Tasks Monopolize Workers

Slide 54

[Diagram] The main worker fetches a batch of tasks (0–5) from the broker, hands them to Worker 0 and Worker 1, and waits until the whole batch finishes before grabbing another one.

Slide 55

•Run higher concurrency? Inefficient :(
•Lower batch (prefetch) size? Min is concurrency count, inefficient :(
•Separate slow & fast tasks :)

Slide 56

CELERY_QUEUE_CONFIG = {
    "default": ("slow_task",),
    "gevent": ("evented_task",),
    "fast": ("fast_task",),
    "feed": ("feed_delivery",),
}

Slide 57

Our Concurrency Levels
•fast (14)
•default (6)
•feed (12)

Slide 58

Problem: Tasks Fail Sometimes

Slide 59

@task(routing_key="media_activation")
def deactivate_media_content(media_id):
    try:
        media = get_media_store_object(media_id)
        media.deactivate()
    except MediaContentRemoteOperationError:
        raise deactivate_media_content.retry(countdown=60)

Wait 60 seconds before retrying.

Slide 60

Problem: Worker Crashes Still Lose Tasks

Slide 61

Normal Flow
1. Get Tasks
2. Worker Starts Task
3. Ack Sent to Broker
4. Worker Finishes Task

Slide 62

ACKS_LATE Flow
1. Get Tasks
2. Worker Starts Task
3. Worker Finishes Task
4. Ack Sent to Broker
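
The difference between the two flows can be simulated in a few lines (a toy broker; `crash_on` marks the task during which the worker dies):

```python
def run_worker(tasks, acks_late, crash_on=None):
    """Simulate ack timing. Returns (completed, tasks left on the broker)."""
    broker = list(tasks)
    completed = []
    while broker:
        task = broker[0]
        if not acks_late:
            broker.pop(0)             # normal flow: ack before the work
        if task == crash_on:
            return completed, broker  # worker dies mid-task
        completed.append(task)
        if acks_late:
            broker.pop(0)             # ACKS_LATE: ack after the work
    return completed, broker
```

With early acks, a crash mid-task loses the task for good; with ACKS_LATE, it stays on the broker and gets redelivered.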

Slide 63

@task(routing_key="feed_delivery", acks_late=True) def deliver_media_to_follower_feeds(media_id, following_user_id, resume_at=None): ...

Slide 64

Why not do this everywhere?
•Tasks must be idempotent!
•That probably is the case anyway :(
•Mirroring can cause duplicate tasks
•FLP Impossibility FFFFFFFFFUUUUUUUUU!!!!
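
For the feed-delivery case, idempotency can be as simple as making a duplicate push a no-op; a sketch (the function name and bound are assumptions), with the bounded media-ID list modeled as a plain Python list:

```python
def deliver_once(feed, media_id, limit=500):
    """Prepend media_id unless it's already present, so a redelivered
    task (at-least-once semantics) leaves the feed unchanged."""
    if media_id in feed:
        return feed
    return ([media_id] + feed)[:limit]
```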

Slide 65

There is no such thing as running tasks exactly-once.

Slide 66

"... it is impossible for one process to tell whether another has died (stopped entirely) or is just running very slowly."

Impossibility of Distributed Consensus with One Faulty Process, Fischer, Lynch, Paterson (1985)

Slide 67

FLP Proof Gives Us Choices: To retry or not to retry

Slide 68

Problem: Early on, we noticed overloaded brokers were dropping tasks...

Slide 69

Publisher Confirms
•AMQP default is that we don't know if things were published or not. :(
•Publisher Confirms makes broker send acknowledgements back on publishes.
•kombu-multibroker forces this.
•Can cause duplicate tasks. (FLP again!)
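
With stock Celery and kombu (no kombu-multibroker), publisher confirms can be requested via a transport option; whether it's honored depends on the kombu version and the AMQP transport in use, so treat this as a sketch rather than a guaranteed recipe:

```python
# Celery settings module (Celery 3.x-era names, matching this talk).
# The pyamqp transport's confirm_publish option asks the broker to
# acknowledge each publish back to the producer.
BROKER_TRANSPORT_OPTIONS = {"confirm_publish": True}
```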

Slide 70

Other Rules of Thumb

Slide 71

Avoid using async tasks as a "backup" mechanism only during failures. It'll probably break.

Slide 72

@task(routing_key="media_activation")
def deactivate_media_content(media_id):
    try:
        media = get_media_store_object(media_id)
        media.deactivate()
    except MediaContentRemoteOperationError:
        raise deactivate_media_content.retry(countdown=60)

Only pass self-contained, non-opaque data (strings, numbers, arrays, lists, and dicts) as arguments to tasks.

Slide 73

Tasks should usually execute within a few seconds. They gum up the works otherwise.

Slide 74

CELERYD_TASK_SOFT_TIME_LIMIT = 20
CELERYD_TASK_TIME_LIMIT = 30

Slide 75

FUTURE
•Better Grip on RabbitMQ Performance
•Utilize Result Storage
•Single Cluster for Control Queues
•Eliminate kombu-multibroker

Slide 76

We're hiring! [email protected]