Using Celery with Social Networks

David Gouldin
September 05, 2012

Many web applications need to interface with social networks, and Celery, a Python distributed task queue library, is a great tool for the job. However, achieving speed and stability can be difficult. This talk will cover task organization/distribution, rate limiting, failover, and other practices to aid in working with social networks at scale.

Transcript

  1. @dgouldin 3rd party interfaces are hard • Much slower to access than local data • Users expect results. NOW! SPEED
  2. @dgouldin 3rd party interfaces are hard • Different rules for every service • Reactive vs Proactive (limits aren’t always published) RATE LIMITS
  3. @dgouldin 3rd party interfaces are hard • Outages (yes, Facebook does go down) • Random failures INSTABILITY
  4. @dgouldin Now you have 2 problems • Organization • Distribution • Rate limiting • Failover • Queue Partitioning • Debugging
  5. @dgouldin Task Organization • Workers are ephemeral. • Smaller tasks mean better distribution. • Preferably one 3rd party call per task • Tasks aren’t free: small does NOT mean trivial! SMALL. ATOMIC.
  6. @dgouldin Task Organization • Task arguments should be primitives • NO model instances! (Use PKs.) • Defer data access to the task itself • Prevents serialization sync issues and increases performance MINIMAL STATE
  7. @dgouldin Task Organization MINIMAL STATE

      @task()
      def bad(model1):
          # ... do stuff with model1 here

          # If the first 2 succeed but the 3rd fails, we have to
          # do all 3 over again!
          first_api_call(model1)
          second_api_call(model1)
          third_api_call(model1)

          # This could fail since model1 is a potentially stale
          # class instance deserialized by celery!
          model1.save()
  8. @dgouldin Task Organization MINIMAL STATE

      @task()
      def good(model1_pk):
          try:
              model1 = Model1.objects.get(pk=model1_pk)
          except Model1.DoesNotExist as e:
              # Guard against race conditions and retry in a bit.
              good.retry(exc=e, countdown=1)

          # ... do stuff with model1 here

          # The current task is used as a dispatcher for
          # parallelized API calls.
          first_api_call_task.delay(model1.access_token)
          second_api_call_task.delay(model1.access_token)
          third_api_call_task.delay(model1.access_token)

          model1.save()
  9. @dgouldin Task Organization • Don’t forget: tasks are just classes • Create abstract parent task classes for common patterns MAKE IT CLASSY
  10. @dgouldin Task Organization MAKE IT CLASSY

      @task()
      def import_twitter_followers(user_id):
          try:
              user = User.objects.get(pk=user_id)
          except User.DoesNotExist as e:
              import_twitter_followers.retry(exc=e, countdown=1)
          access_token = user.twitter_account.access_token
          client = TwitterClient(access_token)
          followers = client.followers()
          # ... do stuff with followers here
  11. @dgouldin Task Organization MAKE IT CLASSY

      class TwitterAPITask(task.Task):
          abstract = True

          def api_call(self, user, method, args=None, kwargs=None):
              try:
                  access_token = user.twitter_account.access_token
              except TwitterAccount.DoesNotExist as e:
                  self.retry(exc=e, countdown=1)
              args = args or []
              kwargs = kwargs or {}
              client = TwitterClient(access_token)
              return getattr(client, method)(*args, **kwargs)
  12. @dgouldin Task Organization MAKE IT CLASSY

      class TwitterImportFollowers(TwitterAPITask):
          def run(self, user_id):
              try:
                  user = User.objects.get(pk=user_id)
              except User.DoesNotExist as e:
                  self.retry(exc=e, countdown=1)
              followers = self.api_call(user, 'followers')
              # ... do stuff with followers here
  13. @dgouldin Task Organization • Not always possible • Tasks fail. It happens. Rerun is a simple fix. IDEMPOTENT
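      A minimal sketch of an idempotent import step (the Follower model, its fields, and the import path are hypothetical, not from the deck): keying each row on its remote ID means a rerun updates instead of duplicating.

          from celery.task import task

          from myapp.models import Follower  # hypothetical

          @task()
          def import_follower(user_id, follower_data):
              # Safe to rerun: get_or_create keys on the remote ID, so a
              # retry finds the existing row instead of inserting a duplicate.
              Follower.objects.get_or_create(
                  user_id=user_id,
                  twitter_id=follower_data['id'],
                  defaults={'name': follower_data.get('name', '')},
              )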
  14. @dgouldin Task Distribution • Pages are logical places to break up tasks • Pagination strategies differ: limit/offset vs cursor • # of pages isn’t always known PAGINATION
  15. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size known) [Diagram: a dispatcher task launches every page at once: pg 1 [:100], pg 2 [100:200], ..., pg n [-100:]]
  16. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size known)

      @task()
      def limit_offset_task_dispatcher(user_id):
          # ...
          limit = 100
          for offset in range(0, user.num_friends, limit):
              # launch all pages immediately
              LimitOffsetImportPage.delay(user_id, offset, limit)
  17. @dgouldin Task Distribution PAGINATION (limit/offset not supported; set size irrelevant)

      class CursorImportPage(CursorBasedTask):
          def run(self, user_id, cursor=0):
              # ...
              page = self.call(user, 'friends', kwargs={'cursor': cursor})
              if page.next_cursor != -1:
                  # this page launches the next BEFORE processing
                  CursorImportPage.delay(user_id, cursor=page.next_cursor)
              # ...
  18. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size unknown) [Diagram: the dispatcher launches pages 1–3 immediately; each non-empty page launches the page three slots ahead (pg 1 [:100] → pg 4 [300:400] → pg 7 [600:700], ...) until an empty result (∅) ends its chain]
  19. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size unknown)

      @task()
      def limit_offset_task_dispatcher(user_id):
          # ...
          limit = 100
          concurrent = 3
          for page_num in range(concurrent):
              # launch a set number of concurrent pages immediately
              LimitOffsetImportPage.delay(user_id, page_num, limit, concurrent)
  20. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size unknown)

      class LimitOffsetImportPage(LimitOffsetBasedTask):
          def run(self, user_id, page_num, limit, concurrent):
              # ...
              offset = page_num * limit
              page = self.call(user, 'friends',
                               kwargs={'offset': offset, 'limit': limit})
              if page.friends:
                  # this page is not empty, launch another!
                  next_page_num = page_num + concurrent
                  LimitOffsetImportPage.delay(user_id, next_page_num,
                                              limit, concurrent)
              # ...
  21. @dgouldin Task Distribution • Setting page size is an art, not a science • Minimize total API calls when possible • Avoid long-running tasks: set a timeout (see the settings sketch below) • Remember: minimize state in task definitions (don’t pass API data between tasks) PAGINATION
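      A sketch of capping task runtime via Celery’s worker settings (values illustrative, not from the deck):

          # celeryconfig.py: don't let one slow API call wedge a worker
          CELERYD_TASK_SOFT_TIME_LIMIT = 30  # raises SoftTimeLimitExceeded in the task
          CELERYD_TASK_TIME_LIMIT = 60       # hard kill if the soft limit is ignored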
  22. @dgouldin Task Distribution • “Done?” is hard for distributed systems • Celery 3 has dependency built in! (YAY) • Requires ignore_result=False • DO NOT USE RABBITMQ AS YOUR RESULT BACKEND!!1! DEPENDENCIES
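      A minimal sketch of Celery 3’s built-in dependency support via chord (task names hypothetical): the callback runs only after every header task has finished, which is exactly why results can’t be ignored.

          from celery import chord

          # one task per page in parallel, then a callback once all
          # pages have completed
          chord(
              import_page.s(user_id, page_num)
              for page_num in range(num_pages)
          )(import_finished.s(user_id))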
  23. @dgouldin Rate Limiting • Celery’s rate_limit doesn’t do what you think it does. • 3rd party rate limits depend on many factors. PROBLEMS
  24. @dgouldin Rate Limiting • Doesn’t work with multiple worker daemons. • Fails on worker daemon restart. • Luckily, our rate limits are externally enforced. • Use an external store (e.g. redis), NOT Celery’s built-in support! RATE_LIMIT
  25. @dgouldin Rate Limiting • Who’s asking • What feature • Requesting public or private info • Unknowns MANY FACTORS
  26. @dgouldin Rate Limiting MANY FACTORS

      https://api.twitter.com/1/account/settings.json
          'x-ratelimit-class': 'api_identified',
          'x-ratelimit-limit': '350',
          'x-ratelimit-remaining': '257',
          'x-ratelimit-reset': '1345696749',

      https://api.twitter.com/1/users/search.json?q=djangocon
          'x-featureratelimit-class': 'usersearch',
          'x-featureratelimit-limit': '180',
          'x-featureratelimit-remaining': '179',
          'x-featureratelimit-reset': '1345700189',
  27. @dgouldin Rate Limiting • Simple: store “limited until” timestamp • Harder: store counters and incr per call KNOWN LIMITS (fixed time window)
  28. @dgouldin Rate Limiting KNOWN LIMITS (fixed time window)

      def call(self, *args, **kwargs):
          key = self.generate_key(*args, **kwargs)
          if redis.exists(key):
              until = int(redis.get(key))
              countdown = until - int(time.time())
              if countdown > 0:
                  self.retry(countdown=countdown)
              else:
                  redis.delete(key)
          try:
              return self._call(*args, **kwargs)
          except RateLimitException as e:
              redis.set(key, e.until)
              countdown = e.until - int(time.time())
              self.retry(countdown=countdown)
  29. @dgouldin Rate Limiting • Store a redis sorted set of timestamps • Remove any stale items from the set • If len() of the new set > limit, wait long enough for the oldest to drop off KNOWN LIMITS (rolling time window)
  30. @dgouldin Rate Limiting KNOWN LIMITS (rolling time window)

      def call(self, *args, **kwargs):
          key = self.generate_key(*args, **kwargs)
          window = 3600 * 24 * 2  # 2 day window
          expires = int(time.time()) - window
          # drop timestamps that have fallen outside the window
          redis.zremrangebyscore(key, '-inf', expires)
          if redis.zcard(key) < 25:  # 25 call limit
              now = int(time.time())
              redis.zadd(key, now, now)
              return self._call(*args, **kwargs)
          else:
              first = int(redis.zrange(key, 0, 0)[0])
              countdown = (first + window) - int(time.time())
              self.retry(countdown=countdown)
  31. @dgouldin Rate Limiting • Store a counter, incr on rate limit, decr on no rate limit. • When counter > 0, exponentially back off BEFORE making calls. UNKNOWN LIMITS
  32. @dgouldin Rate Limiting UNKNOWN LIMITS

      def call(self, *args, **kwargs):
          key = self.generate_key(*args, **kwargs)
          backoff_exponent = int(redis.get(key) or 0)
          if backoff_exponent and not self.request.retries:
              # new tasks must wait backoff before calling
              self.retry(countdown=2 ** backoff_exponent)
          try:
              result = self._call(*args, **kwargs)
          except FacebookClient.RateLimitException:
              backoff_exponent = redis.incr(key)
              self.retry(countdown=2 ** backoff_exponent)
          else:
              if backoff_exponent:
                  redis.decr(key)
              return result
  33. @dgouldin Failover • Celery’s countdown doesn’t do what you think it does. • 3rd parties can fail in lots of “interesting” ways. PROBLEMS
  34. @dgouldin Failover • Tasks are immediately dispatched to a worker daemon with ETA in the serialized message. • Celery’s hard work & RabbitMQ’s “ack” feature prevent lost work. • This is still a highly suboptimal solution! COUNTDOWN
  35. @dgouldin Failover COUNTDOWN [Diagram: the countdown message is published to a dedicated queue (e.g. "535a6f47") with TTL = 60 seconds and dead letter exchange = "celery"; when the TTL expires, RabbitMQ dead-letters the message to the "celery" exchange, whose binding (routing_key: "celery") delivers it to the "celery" queue for the workers]
  36. @dgouldin Failover COUNTDOWN

      # add the queue to the app so celery knows about it and use
      # that queue for this task
      task_id = task_id or gen_unique_id()
      app.amqp.queues.add(task_id,
          exchange=settings.COUNTDOWN_EXCHANGE.name,
          exchange_type=settings.COUNTDOWN_EXCHANGE.type,
          routing_key=options['routing_key'],
          queue_arguments={
              'x-message-ttl': countdown * 1000,
              'x-dead-letter-exchange': options['queue'],
              'x-expires': (countdown + 1) * 1000,
          })
      options.update({
          'queue': task_id,
          'exchange': settings.COUNTDOWN_EXCHANGE,
      })
  37. @dgouldin Failover • Create an abstract task base class for each third party. • Handle all error conditions within a single call() function on that base class (see the sketch below). THIRD PARTIES
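      A minimal sketch of that pattern (the client, its exceptions, and the get_client helper are hypothetical): every error condition the API can produce is translated into a retry, or re-raised, in one place.

          import socket
          import time

          from celery import task

          class FacebookTask(task.Task):
              abstract = True
              default_retry_delay = 60

              def call(self, method, *args, **kwargs):
                  # single choke point for all Facebook error handling
                  try:
                      client = self.get_client()  # hypothetical helper
                      return getattr(client, method)(*args, **kwargs)
                  except FacebookClient.RateLimitException as e:
                      self.retry(exc=e, countdown=e.until - int(time.time()))
                  except (FacebookClient.ServerError, socket.timeout) as e:
                      # transient outage / random failure: back off and retry
                      self.retry(exc=e)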
  38. @dgouldin Multiple Queues • Better control over task prioritization & resource distribution. • Queue segmentation allows for spikes. • Background work needs its own “trickle” queue. WHY?
  39. @dgouldin Multiple Queues HOW?

      CELERY_QUEUES = {
          "interactive": {"binding_key": "interactive"},
          "build_network": {"binding_key": "build_network"},
          "post_message": {"binding_key": "post_message"},
          "trickle": {"binding_key": "trickle"},
      }
  40. @dgouldin Multiple Queues • Maintenance tasks (such as keeping avatars up to date) are low priority. • They can still overwhelm a queue and crowd out other similar-priority tasks. • Rather than dumping batches of tasks, “trickle” a few at a time using a cron and a persistent cursor (see the sketch below). TRICKLE QUEUE
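      A minimal sketch of a trickle dispatcher (model, task, and key names hypothetical): each run reads a persistent cursor from redis, enqueues a small batch onto the trickle queue, and advances the cursor.

          @task()
          def trickle_avatar_updates(batch_size=50):
              # the cursor survives between runs, so each invocation
              # picks up where the last one left off
              cursor = int(redis.get('avatar_update_cursor') or 0)
              users = list(User.objects.filter(pk__gt=cursor)
                                       .order_by('pk')[:batch_size])
              for user in users:
                  update_avatar.apply_async(args=[user.pk], queue='trickle')
              if users:
                  redis.set('avatar_update_cursor', users[-1].pk)
              else:
                  redis.delete('avatar_update_cursor')  # wrap around, start over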
  41. @dgouldin celerybeat • Periodic task persistence gets out of sync with code. • Just 1 more process to manage. • Cron: it’s just. Not. That. Hard. WHY NOT?
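      A minimal sketch of the cron alternative (command and task names hypothetical): a Django management command that only queues work, invoked from crontab.

          # crontab: */10 * * * * python manage.py trickle_avatar_updates
          from django.core.management.base import BaseCommand

          from myapp.tasks import trickle_avatar_updates  # hypothetical

          class Command(BaseCommand):
              def handle(self, *args, **options):
                  # all the real work happens in the task; cron just ticks
                  trickle_avatar_updates.delay()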
  42. @dgouldin Extras • Don’t use always_eager. • Logging, logging, logging • Unit tests are good, but integration tests save lives. DEBUGGING
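      For the logging point, a minimal sketch using Celery’s per-task logger (Celery 3 style):

          from celery.utils.log import get_task_logger

          logger = get_task_logger(__name__)

          @task()
          def import_twitter_followers(user_id):
              logger.info('importing followers for user %s', user_id)
              # ...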
  43. @dgouldin Extras • C-level blocking prevents soft timeout (so set a timeout on that socket call!) • Soft timeout doesn’t automatically retry. • The default task result is “PENDING” even when Celery has no record of the task or the stored result has expired. GOTCHAS
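      For the first gotcha, a minimal sketch: the soft timeout is delivered by a signal, which can’t interrupt a blocking C-level read, so cap the read at the socket level too.

          import socket
          import urllib2

          # a global default for any socket the process opens ...
          socket.setdefaulttimeout(30)

          # ... or per call (Python 2.6+)
          response = urllib2.urlopen(
              'https://api.twitter.com/1/account/settings.json', timeout=10)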