Using Celery with Social Networks

David Gouldin
September 05, 2012

Many web applications need to interface with social networks, and Celery, a Python distributed task queue library, is a great tool for the job. However, achieving speed and stability can be difficult. This talk will cover task organization/distribution, rate limiting, failover, and other practices to aid in working with social networks at scale.

Transcript

  1. @dgouldin 3rd party interfaces are hard • Much slower to access than local data • Users expect results. NOW! SPEED
  2. @dgouldin 3rd party interfaces are hard • Different rules for every service • Reactive vs Proactive (limits aren’t always published) RATE LIMITS
  3. @dgouldin 3rd party interfaces are hard • Outages (yes, Facebook does go down) • Random failures INSTABILITY
  4. @dgouldin Now you have 2 problems • Organization • Distribution • Rate limiting • Failover • Queue Partitioning • Debugging
  5. @dgouldin Task Organization • Workers are ephemeral. • Smaller tasks mean better distribution. • Preferably one 3rd party call per task • Tasks aren’t free: small does NOT mean trivial! SMALL. ATOMIC.
  6. @dgouldin Task Organization • Task arguments should be primitives • NO model instances! (Use PKs.) • Defer data access to the task itself • Prevents serialization sync issues and increases performance MINIMAL STATE
  7. @dgouldin Task Organization MINIMAL STATE

      @task()
      def bad(model1):
          # ... do stuff with model1 here

          # If the first 2 succeed but the 3rd fails, we have to
          # do all 3 over again!
          first_api_call(model1)
          second_api_call(model1)
          third_api_call(model1)

          # This could fail since model1 is a potentially stale
          # class instance deserialized by celery!
          model1.save()
  8. @dgouldin Task Organization MINIMAL STATE

      @task()
      def good(model1_pk):
          try:
              model1 = Model1.objects.get(pk=model1_pk)
          except Model1.DoesNotExist as e:
              # Guard against race conditions and retry in a bit.
              good.retry(exc=e, countdown=1)

          # ... do stuff with model1 here

          # The current task is used as a dispatcher for
          # parallelized API calls.
          first_api_call_task.delay(model1.access_token)
          second_api_call_task.delay(model1.access_token)
          third_api_call_task.delay(model1.access_token)

          model1.save()
  9. @dgouldin Task Organization • Don’t forget: tasks are just classes • Create abstract parent task classes for common patterns MAKE IT CLASSY
  10. @dgouldin Task Organization MAKE IT CLASSY

      @task()
      def import_twitter_followers(user_id):
          try:
              user = User.objects.get(pk=user_id)
          except User.DoesNotExist as e:
              import_twitter_followers.retry(exc=e, countdown=1)
          access_token = user.twitter_account.access_token
          client = TwitterClient(access_token)
          followers = client.followers()
          # ... do stuff with followers here
  11. @dgouldin Task Organization MAKE IT CLASSY

      class TwitterAPITask(task.Task):
          abstract = True

          def api_call(self, user, method, args=None, kwargs=None):
              try:
                  access_token = user.twitter_account.access_token
              except TwitterAccount.DoesNotExist as e:
                  self.retry(exc=e, countdown=1)
              args = args or []
              kwargs = kwargs or {}
              client = TwitterClient(access_token)
              return getattr(client, method)(*args, **kwargs)
  12. @dgouldin Task Organization MAKE IT CLASSY

      class TwitterImportFollowers(TwitterAPITask):
          def run(self, user_id):
              try:
                  user = User.objects.get(pk=user_id)
              except User.DoesNotExist as e:
                  self.retry(exc=e, countdown=1)
              followers = self.api_call(user, 'followers')
              # ... do stuff with followers here
  13. @dgouldin Task Organization • Not always possible • Tasks fail. It happens. Rerun is a simple fix. IDEMPOTENT
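      A minimal sketch of an idempotent import step (the Follower model, its fields, and the import path are hypothetical, not from the deck): keying each row on its remote ID means a rerun updates instead of duplicating.

          from celery.task import task

          from myapp.models import Follower  # hypothetical

          @task()
          def import_follower(user_id, follower_data):
              # Safe to rerun: get_or_create keys on the remote ID, so a
              # retry finds the existing row instead of inserting a duplicate.
              Follower.objects.get_or_create(
                  user_id=user_id,
                  twitter_id=follower_data['id'],
                  defaults={'name': follower_data.get('name', '')},
              )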
  14. @dgouldin Task Distribution • Pages are logical places to break up tasks • Pagination strategies differ: limit/offset vs cursor • # of pages isn’t always known PAGINATION
  15. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size known) [Diagram: a dispatcher task launches every page at once: pg 1 [:100], pg 2 [100:200], ..., pg n [-100:]]
  16. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size known)

      @task()
      def limit_offset_task_dispatcher(user_id):
          # ...
          limit = 100
          for offset in range(0, user.num_friends, limit):
              # launch all pages immediately
              LimitOffsetImportPage.delay(user_id, offset, limit)
  17. @dgouldin Task Distribution PAGINATION (limit/offset not supported; set size irrelevant)

      class CursorImportPage(CursorBasedTask):
          def run(self, user_id, cursor=0):
              # ...
              page = self.call(user, 'friends', kwargs={'cursor': cursor})
              if page.next_cursor != -1:
                  # this page launches the next BEFORE processing
                  CursorImportPage.delay(user_id, cursor=page.next_cursor)
              # ...
  18. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size unknown) [Diagram: the dispatcher launches pages 1–3 immediately; each non-empty page launches the page three slots ahead (pg 1 [:100] → pg 4 [300:400] → pg 7 [600:700], ...) until an empty result (∅) ends its chain]
  19. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size unknown)

      @task()
      def limit_offset_task_dispatcher(user_id):
          # ...
          limit = 100
          concurrent = 3
          for page_num in range(concurrent):
              # launch a set number of concurrent pages immediately
              LimitOffsetImportPage.delay(user_id, page_num, limit, concurrent)
  20. @dgouldin Task Distribution PAGINATION (limit/offset supported & set size unknown)

      class LimitOffsetImportPage(LimitOffsetBasedTask):
          def run(self, user_id, page_num, limit, concurrent):
              # ...
              offset = page_num * limit
              page = self.call(user, 'friends',
                               kwargs={'offset': offset, 'limit': limit})
              if page.friends:
                  # this page is not empty, launch another!
                  next_page_num = page_num + concurrent
                  LimitOffsetImportPage.delay(user_id, next_page_num,
                                              limit, concurrent)
              # ...
  21. @dgouldin Task Distribution • Setting page size is an art, not a science • Minimize total API calls when possible • Avoid long-running tasks: set a timeout (see the settings sketch below) • Remember: minimize state in task definitions (don’t pass API data between tasks) PAGINATION
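      A sketch of capping task runtime via Celery’s worker settings (values illustrative, not from the deck):

          # celeryconfig.py: don't let one slow API call wedge a worker
          CELERYD_TASK_SOFT_TIME_LIMIT = 30  # raises SoftTimeLimitExceeded in the task
          CELERYD_TASK_TIME_LIMIT = 60       # hard kill if the soft limit is ignored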
  22. @dgouldin Task Distribution • “Done?” is hard for distributed systems • Celery 3 has dependency built in! (YAY) • Requires ignore_result=False • DO NOT USE RABBITMQ AS YOUR RESULT BACKEND!!1! DEPENDENCIES
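      A minimal sketch of Celery 3’s built-in dependency support via chord (task names hypothetical): the callback runs only after every header task has finished, which is exactly why results can’t be ignored.

          from celery import chord

          # one task per page in parallel, then a callback once all
          # pages have completed
          chord(
              import_page.s(user_id, page_num)
              for page_num in range(num_pages)
          )(import_finished.s(user_id))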
  23. @dgouldin Rate Limiting • Celery’s rate_limit doesn’t do what you think it does. • 3rd party rate limits depend on many factors. PROBLEMS
  24. @dgouldin Rate Limiting • Doesn’t work with multiple worker daemons. • Fails on worker daemon restart. • Luckily, our rate limits are externally enforced. • Use an external store (e.g. redis), NOT Celery’s built-in support! RATE_LIMIT
  25. @dgouldin Rate Limiting • Who’s asking • What feature • Requesting public or private info • Unknowns MANY FACTORS
  26. @dgouldin Rate Limiting MANY FACTORS

      https://api.twitter.com/1/account/settings.json
          'x-ratelimit-class': 'api_identified',
          'x-ratelimit-limit': '350',
          'x-ratelimit-remaining': '257',
          'x-ratelimit-reset': '1345696749',

      https://api.twitter.com/1/users/search.json?q=djangocon
          'x-featureratelimit-class': 'usersearch',
          'x-featureratelimit-limit': '180',
          'x-featureratelimit-remaining': '179',
          'x-featureratelimit-reset': '1345700189',
  27. @dgouldin Rate Limiting • Simple: store “limited until” timestamp • Harder: store counters and incr per call KNOWN LIMITS (fixed time window)
  28. @dgouldin Rate Limiting KNOWN LIMITS (fixed time window)

      def call(self, *args, **kwargs):
          key = self.generate_key(*args, **kwargs)
          if redis.exists(key):
              until = int(redis.get(key))
              countdown = until - int(time.time())
              if countdown > 0:
                  self.retry(countdown=countdown)
              else:
                  redis.delete(key)
          try:
              return self._call(*args, **kwargs)
          except RateLimitException as e:
              redis.set(key, e.until)
              countdown = e.until - int(time.time())
              self.retry(countdown=countdown)
  29. @dgouldin Rate Limiting • Store a redis sorted set of timestamps • Remove any stale items from the set • If len() of the new set > limit, wait long enough for the oldest to drop off KNOWN LIMITS (rolling time window)
  30. @dgouldin Rate Limiting KNOWN LIMITS (rolling time window)

      def call(self, *args, **kwargs):
          key = self.generate_key(*args, **kwargs)
          window = 3600 * 24 * 2  # 2 day window
          expires = int(time.time()) - window
          # drop timestamps that have fallen outside the window
          redis.zremrangebyscore(key, '-inf', expires)
          if redis.zcard(key) < 25:  # 25 call limit
              now = int(time.time())
              redis.zadd(key, now, now)
              return self._call(*args, **kwargs)
          else:
              first = int(redis.zrange(key, 0, 0)[0])
              countdown = (first + window) - int(time.time())
              self.retry(countdown=countdown)
  31. @dgouldin Rate Limiting • Store a counter, incr on rate limit, decr on no rate limit. • When counter > 0, exponentially back off BEFORE making calls. UNKNOWN LIMITS
  32. @dgouldin Rate Limiting UNKNOWN LIMITS

      def call(self, *args, **kwargs):
          key = self.generate_key(*args, **kwargs)
          backoff_exponent = int(redis.get(key) or 0)
          if backoff_exponent and not self.request.retries:
              # new tasks must wait backoff before calling
              self.retry(countdown=2 ** backoff_exponent)
          try:
              result = self._call(*args, **kwargs)
          except FacebookClient.RateLimitException:
              backoff_exponent = redis.incr(key)
              self.retry(countdown=2 ** backoff_exponent)
          else:
              if backoff_exponent:
                  redis.decr(key)
              return result
  33. @dgouldin Failover • Celery’s countdown doesn’t do what you think it does. • 3rd parties can fail in lots of “interesting” ways. PROBLEMS
  34. @dgouldin Failover • Tasks are immediately dispatched to a worker daemon with ETA in the serialized message. • Celery’s hard work & RabbitMQ’s “ack” feature prevent lost work. • This is still a highly suboptimal solution! COUNTDOWN
  35. @dgouldin Failover COUNTDOWN [Diagram: the countdown message is published to a dedicated queue (e.g. "535a6f47") with TTL = 60 seconds and dead letter exchange = "celery"; when the TTL expires, RabbitMQ dead-letters the message to the "celery" exchange, whose binding (routing_key: "celery") delivers it to the "celery" queue for the workers]
  36. @dgouldin Failover COUNTDOWN

      # add the queue to the app so celery knows about it and use
      # that queue for this task
      task_id = task_id or gen_unique_id()
      app.amqp.queues.add(task_id,
          exchange=settings.COUNTDOWN_EXCHANGE.name,
          exchange_type=settings.COUNTDOWN_EXCHANGE.type,
          routing_key=options['routing_key'],
          queue_arguments={
              'x-message-ttl': countdown * 1000,
              'x-dead-letter-exchange': options['queue'],
              'x-expires': (countdown + 1) * 1000,
          })
      options.update({
          'queue': task_id,
          'exchange': settings.COUNTDOWN_EXCHANGE,
      })
  37. @dgouldin Failover • Create an abstract task base class for each third party. • Handle all error conditions within a single call() function on that base class (see the sketch below). THIRD PARTIES
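      A minimal sketch of that pattern (the client, its exceptions, and the get_client helper are hypothetical): every error condition the API can produce is translated into a retry, or re-raised, in one place.

          import socket
          import time

          from celery import task

          class FacebookTask(task.Task):
              abstract = True
              default_retry_delay = 60

              def call(self, method, *args, **kwargs):
                  # single choke point for all Facebook error handling
                  try:
                      client = self.get_client()  # hypothetical helper
                      return getattr(client, method)(*args, **kwargs)
                  except FacebookClient.RateLimitException as e:
                      self.retry(exc=e, countdown=e.until - int(time.time()))
                  except (FacebookClient.ServerError, socket.timeout) as e:
                      # transient outage / random failure: back off and retry
                      self.retry(exc=e)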
  38. @dgouldin Multiple Queues • Better control over task prioritization & resource distribution. • Queue segmentation allows for spikes. • Background work needs its own “trickle” queue. WHY?
  39. @dgouldin Multiple Queues HOW?

      CELERY_QUEUES = {
          "interactive": {"binding_key": "interactive"},
          "build_network": {"binding_key": "build_network"},
          "post_message": {"binding_key": "post_message"},
          "trickle": {"binding_key": "trickle"},
      }
  40. @dgouldin Multiple Queues • Maintenance tasks (such as keeping avatars up to date) are low priority. • They can still overwhelm a queue and crowd out other similar-priority tasks. • Rather than dumping batches of tasks, “trickle” a few at a time using a cron and a persistent cursor (see the sketch below). TRICKLE QUEUE
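      A minimal sketch of a trickle dispatcher (model, task, and key names hypothetical): each run reads a persistent cursor from redis, enqueues a small batch onto the trickle queue, and advances the cursor.

          @task()
          def trickle_avatar_updates(batch_size=50):
              # the cursor survives between runs, so each invocation
              # picks up where the last one left off
              cursor = int(redis.get('avatar_update_cursor') or 0)
              users = list(User.objects.filter(pk__gt=cursor)
                                       .order_by('pk')[:batch_size])
              for user in users:
                  update_avatar.apply_async(args=[user.pk], queue='trickle')
              if users:
                  redis.set('avatar_update_cursor', users[-1].pk)
              else:
                  redis.delete('avatar_update_cursor')  # wrap around, start over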
  41. @dgouldin celerybeat • Periodic task persistence gets out of sync with code. • Just 1 more process to manage. • Cron: it’s just. Not. That. Hard. WHY NOT?
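      A minimal sketch of the cron alternative (command and task names hypothetical): a Django management command that only queues work, invoked from crontab.

          # crontab: */10 * * * * python manage.py trickle_avatar_updates
          from django.core.management.base import BaseCommand

          from myapp.tasks import trickle_avatar_updates  # hypothetical

          class Command(BaseCommand):
              def handle(self, *args, **options):
                  # all the real work happens in the task; cron just ticks
                  trickle_avatar_updates.delay()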
  42. @dgouldin Extras • Don’t use always_eager. • Logging, logging, logging • Unit tests are good, but integration tests save lives. DEBUGGING
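      For the logging point, a minimal sketch using Celery’s per-task logger (Celery 3 style):

          from celery.utils.log import get_task_logger

          logger = get_task_logger(__name__)

          @task()
          def import_twitter_followers(user_id):
              logger.info('importing followers for user %s', user_id)
              # ...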
  43. @dgouldin Extras • C-level blocking prevents soft timeout (so set a timeout on that socket call!) • Soft timeout doesn’t automatically retry. • The default task result is “PENDING” even when Celery has no record of the task or the stored result has expired. GOTCHAS
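      For the first gotcha, a minimal sketch: the soft timeout is delivered by a signal, which can’t interrupt a blocking C-level read, so cap the read at the socket level too.

          import socket
          import urllib2

          # a global default for any socket the process opens ...
          socket.setdefaulttimeout(30)

          # ... or per call (Python 2.6+)
          response = urllib2.urlopen(
              'https://api.twitter.com/1/account/settings.json', timeout=10)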