Messaging at Scale
at Instagram
Rick Branson, Infrastructure Engineer
Slide 2
ASYNC TASKS
AT INSTAGRAM
Slide 3
Instagram Feed
Slide 4
I see photos posted by the
accounts I follow.
Slide 5
Photos are time-ordered
from newest to oldest.
Slide 6
SELECT * FROM photos
WHERE author_id IN
(SELECT target_id FROM following
WHERE source_id = %(user_id)d)
ORDER BY creation_time DESC
LIMIT 10;
Naive Approach
Slide 7
O(∞)
•Fetch All Accounts You Follow
•Fetch All Photos By Those Accounts
•Sort Photos By Creation Time
•Return First 10
Slide 8
Per-Account Bounded List of Media IDs
e.g. 382, 487, 1287, 880, 27, 3201, 441, 6690, 12
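The deck doesn't show the storage code behind these bounded lists, so here is a minimal sketch of the idea, assuming Redis lists keyed per account; the key format, length cap, and helper name are invented:

import redis

r = redis.StrictRedis(host="localhost", port=6379)

FEED_LENGTH = 500  # hypothetical cap on media IDs kept per account

def push_media_to_feed(follower_id, media_id):
    # Prepend the newest media ID, then trim so the list stays bounded.
    key = "feed:%d" % follower_id
    pipe = r.pipeline()
    pipe.lpush(key, media_id)
    pipe.ltrim(key, 0, FEED_LENGTH - 1)
    pipe.execute()

LPUSH plus LTRIM keeps each write cheap and the per-account memory bounded, which is what makes the denormalized feed affordable.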
Chained Tasks
•Batch of 10,000 Followers Per Task
•Tasks Yield Successive Tasks
•Much Finer-Grained Load Balancing
•Failure/Reload Penalty Low
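A minimal sketch of that chaining pattern, reusing push_media_to_feed from the sketch above; get_follower_ids is a hypothetical helper and none of this is Instagram's actual task code:

from celery import task

BATCH_SIZE = 10000

@task(routing_key="feed_fanout")
def fanout_media(media_id, author_id, cursor=0):
    # Deliver one batch of followers, then yield a successor task for the rest.
    followers = get_follower_ids(author_id, offset=cursor, limit=BATCH_SIZE)
    for follower_id in followers:
        push_media_to_feed(follower_id, media_id)
    if len(followers) == BATCH_SIZE:
        fanout_media.delay(media_id, author_id, cursor + BATCH_SIZE)

Because each task handles only one batch before yielding its successor, a crashed or reloaded worker has to redo at most one batch.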
Slide 19
What else?
Slide 20
Other Async Tasks
•Cross-Posting to Other Networks
•Search Indexing
•Spam Analysis
•Account Deletion
•API Hook
Slide 21
In the beginning...
Slide 22
Gearman & Python
•Simple, Purpose-Built Task Queue
•Weak Framework Support
•We just built ad hoc worker scripts
•A mess to add new job types &
capacity
Slide 23
Gearman in Production
•Persistence horrifically slow, complex
•So we ran out of memory and crashed,
no recovery
•Single core, didn’t scale well:
60ms mean submission time for us
•Probably should have just used Redis
Slide 24
We needed a
fresh start.
Slide 25
WARNING
System had to be in production before the heat
death of the universe. We are probably doing
something stupid!
Redis
•We Already Use It
•Very Fast, Efficient
•Polling For Task Distribution
•Messy Non-Synchronous Replication
•Memory Limits Task Capacity
Slide 29
Beanstalk
• Purpose-Built Task Queue
• Very Fast, Efficient
• Pushes to Consumers
• Spills to Disk
• No Replication
• Useless For Anything Else
Slide 30
RabbitMQ
• Reasonably Fast, Efficient
• Spill-To-Disk
• Low-Maintenance Synchronous Replication
• Excellent Celery Compatibility
• Supports Other Use Cases
• We don’t know Erlang
Slide 31
Our RabbitMQ Setup
•RabbitMQ 3.0
•Clusters of Two Broker Nodes, Mirrored
•Scale Out By Adding Broker Clusters
•EC2 c1.xlarge, RAID instance storage
•Way Overprovisioned
Slide 32
Alerting
•We use Sensu
•Monitors & alerts on queue length
threshold
•Uses rabbitmqctl list_queues
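The deck only names the pieces, so here is a rough sketch of what such a Sensu check might look like; Sensu simply runs the script and reads its exit code (0 OK, 2 critical), and the threshold below is invented:

import subprocess
import sys

THRESHOLD = 50000  # hypothetical queue-length alert threshold

def main():
    # "rabbitmqctl list_queues name messages" prints one "<name> <count>" line per queue.
    output = subprocess.check_output(
        ["rabbitmqctl", "list_queues", "name", "messages"]).decode()
    over = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip the "Listing queues ..." header and similar noise
        name, depth = parts[0], int(parts[1])
        if depth > THRESHOLD:
            over.append("%s=%d" % (name, depth))
    if over:
        print("CRITICAL: queue length over threshold: %s" % ", ".join(over))
        sys.exit(2)
    print("OK: all queues under %d messages" % THRESHOLD)
    sys.exit(0)

if __name__ == "__main__":
    main()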
Slide 33
Graphing
•We use graphite & statsd
•Per-task sent/fail/success/retry graphs
•Using celery's hooks to make them
possible
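A minimal sketch of those hooks, assuming Celery's task signals of that era and the Python statsd client; the metric names are made up:

import statsd
from celery import signals

stats = statsd.StatsClient("localhost", 8125)

@signals.task_sent.connect
def on_task_sent(sender=None, **kwargs):
    stats.incr("celery.%s.sent" % sender)  # for this signal, sender is the task name

@signals.task_success.connect
def on_task_success(sender=None, **kwargs):
    stats.incr("celery.%s.success" % sender.name)

@signals.task_failure.connect
def on_task_failure(sender=None, **kwargs):
    stats.incr("celery.%s.fail" % sender.name)

@signals.task_retry.connect
def on_task_retry(sender=None, **kwargs):
    stats.incr("celery.%s.retry" % sender.name)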
Slide 34
[Diagram: three two-node broker clusters (nodes 0A/0E, 1A/1E, 2A/2E) mirrored across the us-east-1a and us-east-1e availability zones, with the web and worker tiers connected to all of them]
Slide 35
[Graph: Mean vs P90 Publish Times (ms)]
Slide 36
[Graph: Tasks per second]
Slide 37
[Graph: Aggregate CPU% (all RabbitMQs)]
Slide 38
Wait, ~4000 tasks/sec...
I thought you said scale?
Slide 39
~25,000 app threads
publishing tasks
Slide 40
Spans Datacenters
Slide 41
Scale Out
Slide 42
Celery IRL
•Easy to understand, new engineers
come up to speed in 15 minutes.
•New job types deployed without fuss.
•We hack the config a bit to get what
we want.
Slide 43
@task(routing_key="task_queue")
def task_function(task_arg, another_task_arg):
    do_things()

Related tasks run on the same queue
Slide 44
task_function.delay("foo", "bar")
Slide 45
Scaling Out
•Celery only supported 1 broker host last
year when we started.
•Created kombu-multibroker "shim"
•Multiple brokers used in a round-robin
fashion.
•Breaks some Celery management tools :(
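kombu-multibroker's internals aren't shown here, so purely to illustrate the round-robin idea (this is not the shim's actual API), publishes could be rotated across a list of kombu connections:

from itertools import cycle
from kombu import Connection

# Hypothetical broker URLs, one per cluster.
BROKER_URLS = [
    "amqp://guest:guest@broker-cluster-0//",
    "amqp://guest:guest@broker-cluster-1//",
]

_connections = cycle([Connection(url) for url in BROKER_URLS])

def publish_round_robin(body, routing_key):
    # Each publish goes to the next broker cluster in the rotation.
    conn = next(_connections)
    producer = conn.Producer(serializer="json")
    producer.publish(body, routing_key=routing_key)

Real code would also need connection pooling, retries, and failover, which is what the shim exists to handle.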
gevent = Network Bound
•Facebook API
•Tumblr API
•Various Background S3 Tasks
•Checking URLs for Spam
Slide 51
Problem:
Network-Bound Tasks Sometimes
Need To Take Some Action
Slide 52
# Ran on "gevent" worker
@task(routing_key="task_remote_access")
def check_url(object_id, url):
    is_bad = run_url_check(url)
    if is_bad:
        take_some_action.delay(object_id, url)

# Ran on "processes" worker
@task(routing_key="task_action")
def take_some_action(object_id, url):
    do_some_database_thing()
Slide 53
Problem:
Slow Tasks Monopolize Workers
Slide 54
[Diagram: the main worker process fetches a batch of tasks (5 4 3 2 1 0) from the broker and distributes them to worker processes 0 and 1, waiting until the whole batch finishes before grabbing another one]
Slide 55
Slide 55 text
•Run higher concurrency?
Inefficient :(
•Lower batch (prefetch) size?
Min is concurrency count, inefficient :(
•Separate slow & fast tasks :)
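One way to express the "separate slow & fast tasks" option in stock Celery, using the old-style setting names of that era; the task and queue names echo the earlier example and the values are illustrative:

# Route slow, network-bound tasks to their own queue so they never sit in the
# same prefetched batch as fast tasks; each queue gets its own worker pool
# (started with celery worker -Q <queue>).
CELERY_ROUTES = {
    "tasks.check_url": {"queue": "task_remote_access"},   # slow, network-bound
    "tasks.take_some_action": {"queue": "task_action"},   # fast
}

# Keep the per-process batch as small as Celery allows.
CELERYD_PREFETCH_MULTIPLIER = 1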
Why not do this
everywhere?
•Tasks must be idempotent!
•That probably is the case anyway :(
•Mirroring can cause duplicate tasks
•FLP Impossibility
FFFFFFFFFUUUUUUUUU!!!!
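Since duplicate deliveries can't be ruled out, every task has to tolerate running more than once. A contrived pair of tasks showing the difference; db.execute and the tables are hypothetical:

from celery import task

@task(routing_key="counts")
def increment_like_count(media_id):
    # NOT idempotent: a duplicate delivery double-counts.
    db.execute("UPDATE media_counters SET likes = likes + 1"
               " WHERE media_id = %s", [media_id])

@task(routing_key="media_activation")
def mark_media_inactive(media_id):
    # Idempotent: a duplicate delivery just writes the same value again.
    db.execute("UPDATE media SET active = false WHERE media_id = %s", [media_id])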
Slide 65
There is no such thing as
running tasks exactly-once.
Slide 66
"... it is impossible for one
process to tell whether another
has died (stopped entirely) or is
just running very slowly."
Impossibility of Distributed Consensus with One Faulty Process
Fischer, Lynch, Paterson (1985)
Slide 67
FLP Proof Gives Us Choices:
To retry or not to retry
Slide 68
Problem:
Early on, we noticed overloaded
brokers were dropping tasks...
Slide 69
Publisher Confirms
•AMQP default is that we don't know if
things were published or not. :(
•Publisher Confirms makes broker send
acknowledgements back on publishes.
•kombu-multibroker forces this.
•Can cause duplicate tasks. (FLP again!)
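In stock Celery (without kombu-multibroker), the equivalent knob is, as far as I know, a broker transport option on the py-amqp transport; treat this as an assumption to verify against your kombu version:

# Ask the broker to confirm each publish before .delay()/.apply_async() returns.
BROKER_TRANSPORT_OPTIONS = {"confirm_publish": True}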
Slide 70
Other Rules of
Thumb
Slide 71
Avoid using async tasks as a
"backup" mechanism only during
failures. It'll probably break.
Slide 72
@task(routing_key="media_activation")
def deactivate_media_content(media_id):
    try:
        media = get_media_store_object(media_id)
        media.deactivate()
    except MediaContentRemoteOperationError as e:
        raise deactivate_media_content.retry(exc=e, countdown=60)
Only pass self-contained, non-opaque
data (strings, numbers, arrays, lists, and dicts)
as arguments to tasks.
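A tiny illustration of that rule, reusing the task from the previous slide and assuming media is some model instance in the calling code:

# Good: a plain integer travels through the broker; the worker re-fetches
# current state itself.
deactivate_media_content.delay(media.id)

# Bad: serializing the whole model object bloats the message and ships a
# snapshot that may be stale or unloadable by the time a worker picks it up.
# deactivate_media_content.delay(media)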
Slide 73
Tasks should usually execute
within a few seconds. They
gum up the works otherwise.
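One way to enforce that guideline is Celery's task time limits; the numbers are invented and give_up_cleanly is a hypothetical cleanup helper:

from celery import task
from celery.exceptions import SoftTimeLimitExceeded

@task(routing_key="task_queue", soft_time_limit=10, time_limit=20)
def bounded_task(task_arg):
    try:
        do_things()
    except SoftTimeLimitExceeded:
        # The soft limit fired inside the task; clean up before the hard
        # limit kills the worker process outright.
        give_up_cleanly()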