Messaging at Scale at Instagram by Rick Branson

PyCon 2013
March 17, 2013

Transcript

  1. Messaging at Scale
    at Instagram
    Rick Branson, Infrastructure Engineer

  2. Messaging at Scale
    at Instagram
    Rick Branson, Infrastructure Engineer
    ASYNC TASKS
    AT INSTAGRAM

  3. Instagram Feed

  4. I see photos posted by the
    accounts I follow.

  5. Photos are time-ordered
    from newest to oldest.

  6. SELECT * FROM photos
    WHERE author_id IN
    (SELECT target_id FROM following
    WHERE source_id = %(user_id)d)
    ORDER BY creation_time DESC
    LIMIT 10;
    Naive Approach

  7. O(∞)
    •Fetch All Accounts You Follow
    •Fetch All Photos By Those Accounts
    •Sort Photos By Creation Time
    •Return First 10

  8. Per-Account Bounded List of Media IDs
    [Diagram: a row of account IDs (382, 487, 1287, 880, 27, 3201, 441, 6690, 12), each with its own bounded list of media IDs.]

  9. [Diagram: the same row of account IDs. A new photo, media ID 943058139, is posted; the author's followers are looked up:]
    SELECT follower_id FROM followers
    WHERE user_id = 9023;
    => {487, 3201, 441}

  10. [Diagram: media ID 943058139 is pushed onto the bounded list of each follower in {487, 3201, 441}.]

  11. Fanout-On-Write
    •O(1) read cost
    •O(N) write cost (N = followers)
    •Reads outnumber writes 100:1 or more
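
    A minimal sketch of this fanout-on-write pattern, assuming a redis-py client holds the per-account bounded lists; the key scheme, the 100-entry bound, and the get_follower_ids helper are illustrative assumptions rather than details from the talk:

        import redis

        FEED_KEY = "feed:%d"   # hypothetical per-account key scheme
        FEED_LENGTH = 100      # assumed bound on each list

        r = redis.StrictRedis()

        def fan_out_media(media_id, author_id):
            # O(N) write: push the new media ID onto every follower's bounded list.
            for follower_id in get_follower_ids(author_id):  # hypothetical helper
                key = FEED_KEY % follower_id
                r.lpush(key, media_id)              # newest first
                r.ltrim(key, 0, FEED_LENGTH - 1)    # keep the list bounded

        def read_feed(user_id):
            # O(1) read: one bounded-list fetch, already newest-to-oldest.
            return r.lrange(FEED_KEY % user_id, 0, 9)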

  12. Reliability Problems
    •Database Servers Fail
    •Web Request is a Scary Place
    •Justin Bieber (Millions of Followers)

  13. [Diagram: the web tier publishes tasks 46-51 to the broker, which queues them and hands them out to a pool of workers; tasks 46 and 47 are being processed.]

  14. [Diagram: the same pipeline, but the worker holding task 46 has died (marked with an X).]

  15. [Diagram: the broker redistributes task 46 from the dead worker to a surviving worker.]

  16. Chained Tasks
    deliver(photo_id=1234,
    following_id=5678,
    cursor=None)

  17. Chained Tasks
    deliver(photo_id=1234,
    following_id=5678,
    cursor=None)
    deliver(photo_id=1234,
    following_id=5678,
    cursor=3493)

  18. Chained Tasks
    •Batch of 10,000 Followers Per Task
    •Tasks Yield Successive Tasks
    •Much Finer-Grained Load Balancing
    •Failure/Reload Penalty Low
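
    A hedged sketch of how a chained delivery task with this shape might look, following the deliver(...) signature from the previous slides; get_follower_batch and push_to_feed are hypothetical helpers, and the old-style celery.task decorator import is assumed:

        from celery.task import task  # old-style decorator, as used on later slides

        FOLLOWER_BATCH_SIZE = 10000   # batch size quoted on the slide

        @task(routing_key="feed_delivery")
        def deliver(photo_id, following_id, cursor=None):
            # Fetch one batch of followers starting at the cursor (hypothetical helper).
            follower_ids, next_cursor = get_follower_batch(
                following_id, cursor, FOLLOWER_BATCH_SIZE)
            for follower_id in follower_ids:
                push_to_feed(follower_id, photo_id)   # hypothetical helper
            # Yield the successive task instead of looping here: a crash or requeue
            # costs at most one batch, and the broker can rebalance between batches.
            if next_cursor is not None:
                deliver.delay(photo_id, following_id, cursor=next_cursor)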

  19. What else?

  20. Other Async Tasks
    •Cross-Posting to Other Networks
    •Search Indexing
    •Spam Analysis
    •Account Deletion
    •API Hook

  21. In the beginning...

  22. Gearman & Python
    •Simple, Purpose-Built Task Queue
    •Weak Framework Support
    •We just built ad hoc worker scripts
    •A mess to add new job types & capacity

  23. Gearman in Production
    •Persistence horrifically slow, complex
    •So we ran out of memory and crashed, no recovery
    •Single core, didn’t scale well: 60ms mean submission time for us
    •Probably should have just used Redis

  24. We needed a
    fresh start.

  25. WARNING
    System had to be in production before the heat
    death of the universe. We are probably doing
    something stupid!

  26. Celery
    • Distributed Task Framework
    • Highly Extensible, Pluggable
    • Mature, Feature Rich
    • Great Tooling
    • Excellent Django Support
    • celeryd

  27. Which broker?

  28. Redis
    •We Already Use It
    •Very Fast, Efficient
    •Polling For Task Distribution
    •Messy Non-Synchronous Replication
    •Memory Limits Task Capacity

  29. Beanstalk
    • Purpose-Built Task Queue
    • Very Fast, Efficient
    • Pushes to Consumers
    • Spills to Disk
    • No Replication
    • Useless For Anything Else

  30. RabbitMQ
    • Reasonably Fast, Efficient
    • Spill-To-Disk
    • Low-Maintenance Synchronous Replication
    • Excellent Celery Compatibility
    • Supports Other Use Cases
    • We don’t know Erlang

  31. Our RabbitMQ Setup
    •RabbitMQ 3.0
    •Clusters of Two Broker Nodes, Mirrored
    •Scale Out By Adding Broker Clusters
    •EC2 c1.xlarge, RAID instance storage
    •Way Overprovisioned

  32. Alerting
    •We use Sensu
    •Monitors & alerts on queue length threshold
    •Uses rabbitmqctl list_queues
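
    Not Instagram's actual Sensu check, but a rough sketch of the kind of queue-length check described here; the threshold, output parsing, and exit codes are assumptions:

        import subprocess
        import sys

        THRESHOLD = 10000  # assumed queue-length threshold

        def check_queue_lengths():
            # `rabbitmqctl list_queues` prints one "<name> <messages>" row per queue.
            output = subprocess.check_output(["rabbitmqctl", "list_queues"])
            backed_up = []
            for line in output.splitlines():
                parts = line.split()
                if len(parts) != 2 or not parts[1].isdigit():
                    continue  # skip banner/footer lines
                name, depth = parts[0], int(parts[1])
                if depth > THRESHOLD:
                    backed_up.append("%s=%d" % (name, depth))
            if backed_up:
                print "CRITICAL: backed-up queues: %s" % ", ".join(backed_up)
                sys.exit(2)  # Sensu/Nagios-style critical exit status
            print "OK: all queues under threshold"

        if __name__ == "__main__":
            check_queue_lengths()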

  33. Graphing
    •We use graphite & statsd
    •Per-task sent/fail/success/retry graphs
    •Using celery's hooks to make them possible
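
    The hook code isn't shown in the deck, but wiring per-task counters to statsd with Celery's signals might look roughly like this (the handler names, client setup, and metric names are assumptions):

        from celery import signals
        import statsd

        stats = statsd.StatsClient("localhost", 8125)  # assumed statsd address

        @signals.task_sent.connect
        def on_sent(task=None, **kwargs):
            stats.incr("celery.%s.sent" % task)

        @signals.task_success.connect
        def on_success(sender=None, **kwargs):
            stats.incr("celery.%s.success" % sender.name)

        @signals.task_failure.connect
        def on_failure(sender=None, **kwargs):
            stats.incr("celery.%s.fail" % sender.name)

        # Retries can be counted the same way from the task's retry path.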

  34. [Diagram: broker clusters mirrored across availability zones us-east-1a and us-east-1e (pairs 0A/0E, 1A/1E, 2A/2E), shared by the web and worker tiers.]

  35. Mean vs P90 Publish Times (ms)

  36. Tasks per second

  37. Aggregate CPU% (all RabbitMQs)

  38. Wait, ~4000 tasks/sec...
    I thought you said scale?

  39. ~25,000 app threads
    publishing tasks

  40. Spans Datacenters

  41. Scale Out

  42. Celery IRL
    •Easy to understand, new engineers come up to speed in 15 minutes.
    •New job types deployed without fuss.
    •We hack the config a bit to get what we want.

  43. @task(routing_key="task_queue")
      def task_function(task_arg, another_task_arg):
          do_things()
      Related tasks run on the same queue

  44. task_function.delay("foo", "bar")
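
    Roughly speaking, .delay() serializes the arguments, publishes a message onto the queue that the task's routing key maps to, and returns immediately; a worker consuming that queue picks the task up and runs it.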

  45. Scaling Out
    •Celery only supported 1 broker host last year when we started.
    •Created kombu-multibroker "shim"
    •Multiple brokers used in a round-robin fashion.
    •Breaks some Celery management tools :(

  46. Concurrency Models
    •multiprocessing (pre-fork)
    •eventlet
    •gevent
    •threads

  47. gevent is cool and all, but
    only some of our tasks
    will run right under it.

  48. celeryd_multi
    Run multiple workers with different parameters
    (such as concurrency settings)

  49. from kombu import Queue

      CELERY_QUEUE_CONFIG = {
          "default": (
              "normal_task",
          ),
          "gevent": (
              "evented_task",
          ),
      }

      CELERY_QUEUE_GROUP = "default"

      CELERY_QUEUES = [Queue("celery.%s" % key, routing_key=key)
                       for key in CELERY_QUEUE_CONFIG[CELERY_QUEUE_GROUP]]
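
    Presumably each worker tier overrides CELERY_QUEUE_GROUP (e.g. setting it to "gevent" on the evented boxes) so that a worker only declares and consumes the queues for its group, while the routing_key on each @task decides which group a given task lands on.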

  50. gevent = Network Bound
    •Facebook API
    •Tumblr API
    •Various Background S3 Tasks
    •Checking URLs for Spam

  51. Problem:
    Network-Bound Tasks Sometimes
    Need To Take Some Action

  52. # Ran on "gevent" worker (network-bound)
      @task(routing_key="task_remote_access")
      def check_url(object_id, url):
          is_bad = run_url_check(url)
          if is_bad:
              take_some_action.delay(object_id, url)

      # Ran on "processes" worker
      @task(routing_key="task_action")
      def take_some_action(object_id, url):
          do_some_database_thing()

  53. Problem:
    Slow Tasks Monopolize Workers

  54. [Diagram: the main worker process fetches a batch of tasks (0-5) from the broker and hands them to child workers 0 and 1; it waits until the whole batch finishes before grabbing another one.]

  55. •Run higher concurrency?
    Inefficient :(
    •Lower batch (prefetch) size?
    Min is concurrency count, inefficient :(
    •Separate slow & fast tasks :)

  56. CELERY_QUEUE_CONFIG = {
          "default": (
              "slow_task",
          ),
          "gevent": (
              "evented_task",
          ),
          "fast": (
              "fast_task",
          ),
          "feed": (
              "feed_delivery",
          ),
      }

  57. Our Concurrency Levels
    fast (14)
    default (6)
    feed (12)

  58. Problem:
    Tasks Fail Sometimes

  59. @task(routing_key="media_activation")
      def deactivate_media_content(media_id):
          try:
              media = get_media_store_object(media_id)
              media.deactivate()
          except MediaContentRemoteOperationError, e:
              raise deactivate_media_content.retry(countdown=60)
      Wait 60 seconds before retrying.

  60. Problem:
    Worker Crashes Still Lose Tasks

  61. Normal Flow
    1. Get Tasks
    2. Worker Starts Task
    3. Ack Sent to Broker
    4. Worker Finishes Task

  62. ACKS_LATE Flow
    1. Get Tasks
    2. Worker Starts Task
    3. Worker Finishes Task
    4. Ack Sent to Broker

  63. @task(routing_key="feed_delivery", acks_late=True)
      def deliver_media_to_follower_feeds(media_id,
                                          following_user_id,
                                          resume_at=None):
          ...

  64. Why not do this
    everywhere?
    •Tasks must be idempotent!
    •That probably is the case anyway :(
    •Mirroring can cause duplicate tasks
    •FLP Impossibility
    FFFFFFFFFUUUUUUUUU!!!!
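
    As a hedged illustration of the idempotency requirement (not code from the talk): with acks_late a task may be delivered and run more than once, so it has to be safe to repeat, e.g. by checking state before acting; is_active() is a hypothetical check:

        from celery.task import task

        @task(routing_key="media_activation", acks_late=True)
        def deactivate_media_content(media_id):
            media = get_media_store_object(media_id)
            # Safe to run twice after a crash or duplicate delivery: an
            # already-deactivated media object is simply left alone.
            if media.is_active():   # hypothetical state check
                media.deactivate()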

  65. There is no such thing as
    running tasks exactly-once.

  66. "... it is impossible for one
    process to tell whether another
    has died (stopped entirely) or is
    just running very slowly."
    Impossibility of Distributed Consensus with One Faulty Process
    Fischer, Lynch, Paterson (1985)

  67. FLP Proof Gives Us Choices:
    To retry or not to retry

  68. Problem:
    Early on, we noticed overloaded
    brokers were dropping tasks...

  69. Publisher Confirms
    •AMQP default is that we don't know if things were published or not. :(
    •Publisher Confirms makes broker send acknowledgements back on publishes.
    •kombu-multibroker forces this.
    •Can cause duplicate tasks. (FLP again!)

  70. Other Rules of
    Thumb

  71. Avoid using async tasks as a
    "backup" mechanism only during
    failures. It'll probably break.

  72. @task(routing_key="media_activation")
      def deactivate_media_content(media_id):
          try:
              media = get_media_store_object(media_id)
              media.deactivate()
          except MediaContentRemoteOperationError, e:
              raise deactivate_media_content.retry(countdown=60)
      Only pass self-contained, non-opaque data (strings, numbers, arrays, lists, and dicts) as arguments to tasks.

  73. Tasks should usually execute
    within a few seconds. They
    gum up the works otherwise.

  74. CELERYD_TASK_SOFT_TIME_LIMIT = 20
    CELERYD_TASK_TIME_LIMIT = 30
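
    For context, the soft limit raises an exception inside the task while the hard limit terminates the worker process; a task can catch the soft one to clean up. A generic sketch, not code from the talk (do_things and clean_up are hypothetical):

        from celery.task import task
        from celery.exceptions import SoftTimeLimitExceeded

        @task(routing_key="task_queue")
        def long_task(arg):
            try:
                do_things(arg)                # hypothetical slow work
            except SoftTimeLimitExceeded:
                # Raised at 20s (CELERYD_TASK_SOFT_TIME_LIMIT); tidy up before the
                # 30s hard limit (CELERYD_TASK_TIME_LIMIT) kills the worker process.
                clean_up(arg)                 # hypothetical cleanup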

  75. FUTURE
    •Better Grip on RabbitMQ Performance
    •Utilize Result Storage
    •Single Cluster for Control Queues
    •Eliminate kombu-multibroker

  76. We're hiring!
    [email protected]
