Using Celery with Social Networks

David Gouldin
September 05, 2012

Many web applications need to interface with social networks, and Celery, a Python distributed task queue library, is a great tool for the job. However, achieving speed and stability can be difficult. This talk will cover task organization/distribution, rate limiting, failover, and other practices to aid in working with social networks at scale.

  1. @dgouldin 3rd party interfaces are hard: SPEED
    • Much slower to access than local data
    • Users expect results. NOW!
  2. 3rd party interfaces are hard: RATE LIMITS
    • Different rules for every service
    • Reactive vs. proactive (limits aren’t always published)
  3. 3rd party interfaces are hard: INSTABILITY
    • Outages (yes, Facebook does go down)
    • Random failures
  4. Now you have 2 problems
    • Organization
    • Distribution
    • Rate limiting
    • Failover
    • Queue Partitioning
    • Debugging
  5. Task Organization: SMALL. ATOMIC.
    • Workers are ephemeral.
    • Smaller tasks mean better distribution.
    • Preferably one 3rd party call per task
    • Tasks aren’t free: small does NOT mean trivial!
  6. Task Organization: MINIMAL STATE
    • Task arguments should be primitives
    • NO model instances! (Use PKs.)
    • Defer data access to the task itself
    • Prevents serialization sync issues and increases performance
  7. Task Organization: MINIMAL STATE

        @task()
        def bad(model1):
            # ... do stuff with model1 here

            # If the first 2 succeed but the 3rd fails, we have to
            # do all 3 over again!
            first_api_call(model1)
            second_api_call(model1)
            third_api_call(model1)

            # This could fail since model1 is a potentially stale
            # class instance deserialized by celery!
            model1.save()

  8. Task Organization: MINIMAL STATE

        @task()
        def good(model1_pk):
            try:
                model1 = Model1.objects.get(pk=model1_pk)
            except Model1.DoesNotExist as e:
                # Guard against race conditions and retry in a bit.
                # retry() expects the exception as the exc keyword.
                good.retry(exc=e, countdown=1)

            # ... do stuff with model1 here

            # The current task is used as a dispatcher for
            # parallelized API calls.
            first_api_call_task.delay(model1.access_token)
            second_api_call_task.delay(model1.access_token)
            third_api_call_task.delay(model1.access_token)
            model1.save()
  9. Task Organization: MAKE IT CLASSY
    • Don’t forget: tasks are just classes
    • Create abstract parent task classes for common patterns
  10. Task Organization: MAKE IT CLASSY

        @task()
        def import_twitter_followers(user_id):
            try:
                user = User.objects.get(pk=user_id)
            except User.DoesNotExist as e:
                # retry() expects the exception as the exc keyword.
                import_twitter_followers.retry(exc=e, countdown=1)
            access_token = user.twitter_account.access_token
            client = TwitterClient(access_token)
            followers = client.followers()
            # ... do stuff with followers here

  11. Task Organization: MAKE IT CLASSY

        class TwitterAPITask(task.Task):
            abstract = True

            def api_call(self, user, method, args=None, kwargs=None):
                try:
                    access_token = user.twitter_account.access_token
                except TwitterAccount.DoesNotExist as e:
                    # Retry the concrete task that is running.
                    self.retry(exc=e, countdown=1)
                args = args or []
                kwargs = kwargs or {}
                client = TwitterClient(access_token)
                return getattr(client, method)(*args, **kwargs)

  12. Task Organization: MAKE IT CLASSY

        class TwitterImportFollowers(TwitterAPITask):
            def run(self, user_id):
                try:
                    user = User.objects.get(pk=user_id)
                except User.DoesNotExist as e:
                    self.retry(exc=e, countdown=1)
                followers = self.api_call(user, 'followers')
                # ... do stuff with followers here
  13. Task Organization: IDEMPOTENT
    • Not always possible
    • Tasks fail. It happens. Rerun is a simple fix.
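
To make the rerun point concrete, here is a minimal sketch of an idempotent import step. It is not code from the talk: `import_followers` is a hypothetical name and the `store` dict stands in for a database table. Because rows are written by key rather than appended, running the step a second time leaves the same state.

```python
def import_followers(store, user_id, followers):
    """Upsert followers keyed by (user_id, follower id).

    Writing by key instead of appending means a rerun after a partial
    failure overwrites existing rows rather than duplicating them.
    """
    for follower in followers:
        store[(user_id, follower["id"])] = follower["name"]
    return len(store)


# Hypothetical data: `store` is an in-memory stand-in for a table.
store = {}
page = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
first = import_followers(store, 42, page)
second = import_followers(store, 42, page)  # simulated rerun of the task
```

Both calls leave two rows in `store`, so retrying the task after a failure is safe.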
  14. Task Distribution: PAGINATION
    • Pages are logical places to break up tasks
    • Pagination strategies differ:
        ◦ limit/offset vs cursor
        ◦ # of pages isn’t always known
  15. Task Distribution: PAGINATION (limit/offset supported & set size known)
    [Diagram: a dispatcher fans out to pg 1 [:100], pg 2 [100:200], ..., pg n [-100:]]
  16. Task Distribution: PAGINATION (limit/offset supported & set size known)

        @task()
        def limit_offset_task_dispatcher(user_id):
            # ...
            limit = 100
            for offset in range(0, user.num_friends, limit):
                # launch all pages immediately
                LimitOffsetImportPage.delay(user_id, offset, limit)

  17. Task Distribution: PAGINATION (limit/offset not supported, set size irrelevant)

        class CursorImportPage(CursorBasedTask):
            def run(self, user_id, cursor=0):
                # ...
                page = self.call(user, 'friends', kwargs={'cursor': cursor})
                if page.next_cursor != -1:
                    # this page launches the next BEFORE processing
                    CursorImportPage.delay(user_id, cursor=page.next_cursor)
                # ...
  18. Task Distribution: PAGINATION (limit/offset supported & set size unknown)
    [Diagram: the dispatcher launches three concurrent chains of pages, pg 1 through pg 8; each chain ends at an empty page (∅). Slice labels shown: [:100], [300:400], [600:700]]
  19. Task Distribution: PAGINATION (limit/offset supported & set size unknown)

        @task()
        def limit_offset_task_dispatcher(user_id):
            # ...
            limit = 100
            concurrent = 3
            for page_num in range(concurrent):
                # launch a set number of concurrent pages immediately
                LimitOffsetImportPage.delay(user_id, page_num, limit, concurrent)

  20. Task Distribution: PAGINATION (limit/offset supported & set size unknown)

        class LimitOffsetImportPage(LimitOffsetBasedTask):
            def run(self, user_id, page_num, limit, concurrent):
                # ...
                offset = page_num * limit
                page = self.call(user, 'friends',
                                 kwargs={'offset': offset, 'limit': limit})
                if page.friends:
                    # this page is not empty, launch another!
                    next_page_num = page_num + concurrent
                    LimitOffsetImportPage.delay(user_id, next_page_num,
                                                limit, concurrent)
                # ...
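
The unknown-set-size scheme is easier to follow with a small simulation. The sketch below is an illustration, not code from the talk: `simulate_import` is a hypothetical name, the API is replaced by simple arithmetic on `total_items`, and a plain list stands in for the task queue. It records which page numbers get fetched when each non-empty page launches the page `concurrent` steps ahead.

```python
def simulate_import(total_items, limit=100, concurrent=3):
    """Return the page numbers fetched, in dispatch order."""

    def page_size(page_num):
        # Hypothetical stand-in for the API: how many items the
        # page starting at this offset would contain.
        offset = page_num * limit
        return max(0, min(limit, total_items - offset))

    fetched = []
    # The dispatcher launches `concurrent` pages up front.
    queue = list(range(concurrent))
    while queue:
        page_num = queue.pop(0)
        fetched.append(page_num)
        if page_size(page_num) > 0:
            # Non-empty page: this chain strides ahead by `concurrent`.
            queue.append(page_num + concurrent)
    return fetched


pages = simulate_import(750)
```

With 750 items and a page size of 100, eight pages contain data and each of the three chains makes one final call that returns an empty page, so the cost of not knowing the set size is `concurrent` extra API calls.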
  21. Task Distribution: PAGINATION
    • Setting page size is an art, not a science
    • Minimize total API calls when possible
    • Avoid long-running tasks: set a timeout
    • Remember: minimize state in task def’ns (don’t pass API data between tasks)
  22. Task Distribution: DEPENDENCIES
    • “Done?” is hard for distributed systems
    • Celery 3 has dependency built in! (YAY)
    • Requires ignore_result=False
    • DO NOT USE RABBITMQ AS YOUR RESULT BACKEND!!1!
  23. Rate Limiting: PROBLEMS
    • Celery’s rate_limit doesn’t do what you think it does.
    • 3rd party rate limits depend on many factors.
  24. Rate Limiting: RATE_LIMIT
    • Doesn’t work with multiple worker daemons.
    • Fails on worker daemon restart.
    • Luckily, our rate limits are externally enforced.
    • Use an external store (e.g. redis), NOT Celery’s built-in support!
  25. Rate Limiting: MANY FACTORS
    • Who’s asking
    • What feature
    • Requesting public or private info
    • Unknowns
  26. Rate Limiting: MANY FACTORS

        'x-ratelimit-class': 'api_identified',
        'x-ratelimit-limit': '350',
        'x-ratelimit-remaining': '257',
        'x-ratelimit-reset': '1345696749',
        https://api.twitter.com/1/account/settings.json

        'x-featureratelimit-class': 'usersearch',
        'x-featureratelimit-limit': '180',
        'x-featureratelimit-remaining': '179',
        'x-featureratelimit-reset': '1345700189',
        https://api.twitter.com/1/users/search.json?q=djangocon
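
Headers like these can be turned into a retry countdown. The following is a minimal sketch under two assumptions that the slides do not state: the `-reset` values are Unix timestamps in seconds, and a feature-specific limit, when present, takes precedence over the general one. `seconds_until_allowed` is a hypothetical helper name.

```python
def seconds_until_allowed(headers, now):
    """Return how long to wait before the next call (0 if calls remain).

    Assumes the `-reset` header is a Unix timestamp in seconds.
    """
    # Assumption: prefer the feature-specific headers when present.
    if "x-featureratelimit-remaining" in headers:
        prefix = "x-featureratelimit"
    else:
        prefix = "x-ratelimit"
    remaining = int(headers[prefix + "-remaining"])
    reset = int(headers[prefix + "-reset"])
    if remaining > 0:
        return 0
    return max(0, reset - now)


# Header values copied from the slide, plus an exhausted variant.
ok_headers = {"x-ratelimit-remaining": "257", "x-ratelimit-reset": "1345696749"}
spent_headers = {"x-ratelimit-remaining": "0", "x-ratelimit-reset": "1345696749"}
wait_ok = seconds_until_allowed(ok_headers, now=1345696700)
wait_spent = seconds_until_allowed(spent_headers, now=1345696700)
```

The returned value could be passed as the `countdown` of a retry, which makes this the proactive counterpart to waiting for a rate limit error.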
  27. Rate Limiting: KNOWN LIMITS (fixed time window)
    • Simple: store “limited until” timestamp
    • Harder: store counters and incr per call
  28. Rate Limiting: KNOWN LIMITS (fixed time window)

        def call(self, *args, **kwargs):
            key = self.generate_key(*args, **kwargs)
            if redis.exists(key):
                until = int(redis.get(key))
                countdown = until - int(time.time())
                if countdown > 0:
                    self.retry(countdown=countdown)
                else:
                    redis.delete(key)
            try:
                return self._call(*args, **kwargs)
            except RateLimitException as e:
                redis.set(key, e.until)
                countdown = e.until - int(time.time())
                self.retry(countdown=countdown)
  29. Rate Limiting: KNOWN LIMITS (rolling time window)
    • Store a redis sorted set of timestamps
    • Remove any stale items from the set
    • If len() of the new set >= limit, wait long enough for the oldest to drop off
  30. Rate Limiting: KNOWN LIMITS (rolling time window)

        def call(self, *args, **kwargs):
            key = self.generate_key(*args, **kwargs)
            window = 3600 * 24 * 2  # 2 day window
            expires = int(time.time()) - window
            # remove stale timestamps: every score from -inf up to expires
            redis.zremrangebyscore(key, '-inf', expires)
            if redis.zcard(key) < 25:  # 25 call limit
                now = int(time.time())
                redis.zadd(key, now, now)
                return self._call(*args, **kwargs)
            else:
                first = int(redis.zrange(key, 0, 0)[0])
                countdown = (first + window) - int(time.time())
                self.retry(countdown=countdown)
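
The rolling window logic can be tried without redis. This is a sketch only: `RollingWindowLimiter` is a hypothetical class, a sorted Python list stands in for the redis sorted set, and the caller passes `now` explicitly (assumed to be non-decreasing) so the behaviour is deterministic.

```python
import bisect


class RollingWindowLimiter:
    def __init__(self, limit, window):
        self.limit = limit      # max calls allowed per window
        self.window = window    # window length in seconds
        self.calls = []         # sorted timestamps (stand-in for the sorted set)

    def acquire(self, now):
        """Return 0 if a call may proceed now, else the seconds to wait."""
        # Drop timestamps at or before the cutoff (the stale entries).
        cutoff = now - self.window
        self.calls = self.calls[bisect.bisect_right(self.calls, cutoff):]
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return 0
        # Wait until the oldest recorded call leaves the window.
        return (self.calls[0] + self.window) - now


limiter = RollingWindowLimiter(limit=2, window=10)
waits = [limiter.acquire(t) for t in (0, 1, 2, 10, 10)]
```

Here the third call has to wait 8 seconds for the call at t=0 to age out; at t=10 that call has expired, so one more call is admitted and the next is told to wait 1 second.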
  31. Rate Limiting: UNKNOWN LIMITS
    • Store a counter, incr on rate limit, decr on no rate limit.
    • When counter > 0, exponentially back off BEFORE making calls.
  32. Rate Limiting: UNKNOWN LIMITS

        def call(self, *args, **kwargs):
            key = self.generate_key(*args, **kwargs)
            # redis returns a string (or None), so convert before using it
            backoff_exponent = int(redis.get(key) or 0)
            if backoff_exponent > 0 and not self.request.retries:
                # new tasks must wait backoff before calling
                self.retry(countdown=2 ** backoff_exponent)
            try:
                result = self._call(*args, **kwargs)
            except FacebookClient.RateLimitException as e:
                # incr returns the new counter value
                backoff_exponent = redis.incr(key)
                self.retry(exc=e, countdown=2 ** backoff_exponent)
            else:
                # an else clause never runs after a return inside try,
                # so decr here and return the saved result
                if backoff_exponent > 0:
                    redis.decr(key)
                return result
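
The counter behaviour can be checked in isolation. In this sketch, `BackoffCounter` is a hypothetical in-memory stand-in for the redis counter; it is not part of the talk, and flooring the counter at zero is an assumption rather than something the slides specify.

```python
class BackoffCounter:
    def __init__(self):
        self.exponent = 0  # stand-in for the shared redis counter

    def delay_before_call(self):
        """Seconds a new task should wait before calling the API."""
        return 2 ** self.exponent if self.exponent > 0 else 0

    def record(self, rate_limited):
        """Update the counter after a call and return the new delay."""
        if rate_limited:
            self.exponent += 1
        elif self.exponent > 0:
            # Assumption: never let the counter go below zero.
            self.exponent -= 1
        return self.delay_before_call()


counter = BackoffCounter()
# Two rate-limited calls followed by three successful ones.
delays = [counter.record(hit) for hit in (True, True, False, False, False)]
```

Each rate limit doubles the delay imposed on new tasks, and each success halves it again until calls go out immediately.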
  33. Failover: PROBLEMS
    • Celery’s countdown doesn’t do what you think it does.
    • 3rd parties can fail in lots of “interesting” ways.
  34. Failover: COUNTDOWN
    • Tasks are immediately dispatched to a worker daemon with ETA in the serialized message.
    • Celery’s hard work & RabbitMQ’s “ack” feature prevent lost work.
    • This is still a highly suboptimal solution!
  35. Failover: COUNTDOWN
    [Diagram, labels only: Exchanges: celery, countdown. Bindings: routing_key: "celery" (twice). Queues: celery, 535a6f47 (TTL = 60 seconds, dead letter exchange = "celery"). Workers.]
  36. Failover: COUNTDOWN

        # add the queue to the app so celery knows about it and use
        # that queue for this task
        task_id = task_id or gen_unique_id()
        app.amqp.queues.add(task_id,
            exchange=settings.COUNTDOWN_EXCHANGE.name,
            exchange_type=settings.COUNTDOWN_EXCHANGE.type,
            routing_key=options['routing_key'],
            queue_arguments={
                'x-message-ttl': countdown * 1000,
                'x-dead-letter-exchange': options['queue'],
                'x-expires': (countdown + 1) * 1000,
            })
        options.update({
            'queue': task_id,
            'exchange': settings.COUNTDOWN_EXCHANGE,
        })
  37. Failover: THIRD PARTIES
    • Create an abstract task base class for each third party.
    • Handle all error conditions within a single call() function on that base class.
  38. Multiple Queues: WHY?
    • Better control over task prioritization & resource distribution.
    • Queue segmentation allows for spikes.
    • Background work needs its own “trickle” queue.
  39. Multiple Queues: HOW?

        CELERY_QUEUES = {
            "interactive": {
                "binding_key": "interactive"
            },
            "build_network": {
                "binding_key": "build_network"
            },
            "post_message": {
                "binding_key": "post_message",
            },
            "trickle": {
                "binding_key": "trickle"
            },
        }

  40. Multiple Queues: TRICKLE QUEUE
    • Maintenance tasks (such as keeping avatars up to date) are low priority.
    • They can still overwhelm a queue and crowd out other similar priority tasks.
    • Rather than dumping batches of tasks, “trickle” a few at a time using a cron and persistent cursor.
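
One possible shape for the cron-plus-cursor idea is sketched below. Everything here is illustrative: `trickle` is a hypothetical function, the `state` dict stands in for whatever persistent store holds the cursor, and `dispatch` stands in for queueing a task on the trickle queue.

```python
def trickle(ids, state, batch_size, dispatch):
    """Dispatch the next small batch and advance the persisted cursor.

    Intended to be invoked once per cron tick.
    """
    start = state.get("cursor", 0)
    batch = ids[start:start + batch_size]
    for item_id in batch:
        dispatch(item_id)
    # Wrap around once the end is reached so the sweep starts over.
    next_cursor = start + len(batch)
    state["cursor"] = 0 if next_cursor >= len(ids) else next_cursor
    return batch


# Hypothetical data: five user ids, two dispatched per cron tick.
user_ids = [1, 2, 3, 4, 5]
state = {}
sent = []
ticks = [trickle(user_ids, state, 2, sent.append) for _ in range(3)]
```

Three ticks cover all five ids in batches of at most two, and the cursor returns to the start for the next sweep, so the queue never receives more than `batch_size` maintenance tasks at once.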
  41. celerybeat: WHY NOT?
    • Periodic task persistence gets out of sync with code.
    • Just 1 more process to manage.
    • Cron: it’s just. Not. That. Hard.
  42. Extras: DEBUGGING
    • Don’t use always_eager.
    • Logging, logging, logging
    • Unit tests are good, but integration tests save lives.
  43. Extras: GOTCHAS
    • C-level blocking prevents soft timeout (so set a timeout on that socket call!)
    • Soft timeout doesn’t automatically retry.
    • Default task result is “PENDING” even if Celery has no idea or result cache has expired.