Slide 1

Using Celery with Social Networks
David Gouldin (@dgouldin)

Slide 2

So you want to interface with ______
Twitter, Facebook, LinkedIn

Slide 3

3rd party interfaces are hard

Slide 4

3rd party interfaces are hard: SPEED
• Much slower to access than local data
• Users expect results. NOW!

Slide 5

3rd party interfaces are hard: RATE LIMITS
• Different rules for every service
• Reactive vs. proactive (limits aren't always published)

Slide 6

3rd party interfaces are hard: INSTABILITY
• Outages (yes, Facebook does go down)
• Random failures

Slide 7

Celery to the rescue!

Slide 8

Why Celery?
• Asynchronous
• Distributed
• Fault tolerant

Slide 9

Now you have 2 problems
• Organization
• Distribution
• Rate limiting
• Failover
• Queue partitioning
• Debugging

Slide 10

Before we dive in...

Slide 11

ALWAYS use RabbitMQ as your broker.
NEVER use RabbitMQ as your result store.
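
One way this advice might look in Django settings, as a minimal sketch; the URLs are placeholders, and redis is just one reasonable choice of result store:

    # RabbitMQ as the broker; redis (NOT RabbitMQ) as the result store
    BROKER_URL = 'amqp://guest:guest@localhost:5672//'
    CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'

    # skip storing results by default; flip this off for tasks whose
    # results you actually need (e.g. for the dependency slides later)
    CELERY_IGNORE_RESULT = True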

Slide 12

Task Organization

Slide 13

Task Organization: SMALL. ATOMIC.
• Workers are ephemeral.
• Smaller tasks mean better distribution.
• Preferably one 3rd party call per task.
• Tasks aren't free: small does NOT mean trivial!

Slide 14

Task Organization: MINIMAL STATE
• Task arguments should be primitives.
• NO model instances! (Use PKs.)
• Defer data access to the task itself.
• Prevents serialization sync issues and increases performance.

Slide 15

Task Organization: MINIMAL STATE

    @task()
    def bad(model1):
        # ... do stuff with model1 here

        # If the first 2 succeed but the 3rd fails, we have to
        # do all 3 over again!
        first_api_call(model1)
        second_api_call(model1)
        third_api_call(model1)

        # This could fail since model1 is a potentially stale
        # class instance deserialized by celery!
        model1.save()

Slide 16

Task Organization: MINIMAL STATE

    @task()
    def good(model1_pk):
        try:
            model1 = Model1.objects.get(pk=model1_pk)
        except Model1.DoesNotExist as e:
            # Guard against race conditions and retry in a bit.
            good.retry(exc=e, countdown=1)

        # ... do stuff with model1 here

        # The current task is used as a dispatcher for
        # parallelized API calls.
        first_api_call_task.delay(model1.access_token)
        second_api_call_task.delay(model1.access_token)
        third_api_call_task.delay(model1.access_token)

        model1.save()

Slide 17

Task Organization: MAKE IT CLASSY
• Don't forget: tasks are just classes.
• Create abstract parent task classes for common patterns.

Slide 18

Task Organization: MAKE IT CLASSY

    @task()
    def import_twitter_followers(user_id):
        try:
            user = User.objects.get(pk=user_id)
        except User.DoesNotExist as e:
            import_twitter_followers.retry(exc=e, countdown=1)

        access_token = user.twitter_account.access_token
        client = TwitterClient(access_token)
        followers = client.followers()
        # ... do stuff with followers here

Slide 19

Task Organization: MAKE IT CLASSY

    class TwitterAPITask(task.Task):
        abstract = True

        def api_call(self, user, method, args=None, kwargs=None):
            try:
                access_token = user.twitter_account.access_token
            except TwitterAccount.DoesNotExist as e:
                self.retry(exc=e, countdown=1)

            args = args or []
            kwargs = kwargs or {}
            client = TwitterClient(access_token)
            return getattr(client, method)(*args, **kwargs)

Slide 20

Task Organization: MAKE IT CLASSY

    class TwitterImportFollowers(TwitterAPITask):
        def run(self, user_id):
            try:
                user = User.objects.get(pk=user_id)
            except User.DoesNotExist as e:
                self.retry(exc=e, countdown=1)

            followers = self.api_call(user, 'followers')
            # ... do stuff with followers here

Slide 21

Task Organization: IDEMPOTENT
• Not always possible.
• Tasks fail. It happens. Rerun is a simple fix (see the sketch below).
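
What an idempotent task might look like, as a minimal sketch in the deck's style; the Follower model and its fields are hypothetical, not from the talk:

    @task()
    def import_follower(user_id, follower_id):
        # get_or_create makes the write idempotent: rerunning the task
        # after a partial failure finds the existing row instead of
        # creating a duplicate.
        Follower.objects.get_or_create(user_id=user_id,
                                       twitter_id=follower_id)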

Slide 22

Task Distribution

Slide 23

Task Distribution: PAGINATION
• Pages are logical places to break up tasks.
• Pagination strategies differ: limit/offset vs. cursor.
• # of pages isn't always known.

Slide 24

Task Distribution: PAGINATION
limit/offset supported & set size known
[Diagram: a dispatcher task launches all page tasks at once: pg 1 [:100], pg 2 [100:200], ..., pg n [-100:]]

Slide 25

Task Distribution: PAGINATION (limit/offset supported & set size known)

    @task()
    def limit_offset_task_dispatcher(user_id):
        # ...
        limit = 100
        for offset in range(0, user.num_friends, limit):
            # launch all pages immediately
            LimitOffsetImportPage.delay(user_id, offset, limit)

Slide 26

Task Distribution: PAGINATION
limit/offset not supported (set size irrelevant)
[Diagram: page tasks form a chain: pg 1 [:m] launches pg 2 [m:2m], and so on through pg n [-m:]]

Slide 27

Task Distribution: PAGINATION (limit/offset not supported, set size irrelevant)

    class CursorImportPage(CursorBasedTask):
        def run(self, user_id, cursor=0):
            # ...
            page = self.call(user, 'friends', kwargs={'cursor': cursor})
            if page.next_cursor != -1:
                # this page launches the next BEFORE processing
                CursorImportPage.delay(user_id, cursor=page.next_cursor)
            # ...

Slide 28

Task Distribution: PAGINATION
limit/offset supported & set size unknown
[Diagram: a dispatcher launches 3 concurrent page tasks (pg 1 [:100], pg 2, pg 3); each non-empty page launches the page 3 ahead of it (pg 4 [300:400], pg 7 [600:700], ...) until pages come back empty (∅)]

Slide 29

Task Distribution: PAGINATION (limit/offset supported & set size unknown)

    @task()
    def limit_offset_task_dispatcher(user_id):
        # ...
        limit = 100
        concurrent = 3
        for page_num in range(concurrent):
            # launch a set number of concurrent pages immediately
            LimitOffsetImportPage.delay(user_id, page_num, limit, concurrent)

Slide 30

Task Distribution: PAGINATION (limit/offset supported & set size unknown)

    class LimitOffsetImportPage(LimitOffsetBasedTask):
        def run(self, user_id, page_num, limit, concurrent):
            # ...
            offset = page_num * limit
            page = self.call(user, 'friends',
                             kwargs={'offset': offset, 'limit': limit})
            if page.friends:
                # this page is not empty, launch another!
                next_page_num = page_num + concurrent
                LimitOffsetImportPage.delay(user_id, next_page_num,
                                            limit, concurrent)
            # ...

Slide 31

Task Distribution: PAGINATION
• Setting page size is an art, not a science.
• Minimize total API calls when possible.
• Avoid long-running tasks: set a timeout (see the sketch below).
• Remember: minimize state in task definitions (don't pass API data between tasks).
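
One way to bound a page task, sketched with Celery's per-task time limit options (assuming a Celery version that supports them; the task name and limit values are illustrative):

    @task(soft_time_limit=30, time_limit=60)
    def import_page(user_id, offset, limit):
        # past 30s, SoftTimeLimitExceeded is raised inside the task;
        # past 60s, the worker process running it is killed outright.
        ...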

Slide 32

Task Distribution: DEPENDENCIES
• "Done?" is hard for distributed systems.
• Celery 3 has dependency support built in! (YAY)
• Requires ignore_result=False.
• DO NOT USE RABBITMQ AS YOUR RESULT BACKEND!!1!
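
Celery 3's canvas primitives are that built-in support; a minimal sketch using a chord, assuming hypothetical import_page and finalize_import tasks plus user_id and num_pages from context:

    from celery import chord

    # Run one task per page in parallel; finalize_import fires exactly
    # once, after every page task has completed. This needs a real
    # result backend (e.g. redis) -- not RabbitMQ.
    pages = [import_page.s(user_id, n) for n in range(num_pages)]
    chord(pages)(finalize_import.s(user_id))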

Slide 33

Rate Limiting

Slide 34

Rate Limiting: PROBLEMS
• Celery's rate_limit doesn't do what you think it does.
• 3rd party rate limits depend on many factors.

Slide 35

Rate Limiting: RATE_LIMIT

    @task(rate_limit='1/h')
    def save_grandma():
        log.warn("Time to take your insulin!")

and now ... a live demo (maybe)

Slide 36

Rate Limiting: RATE_LIMIT
• Doesn't work with multiple worker daemons.
• Fails on worker daemon restart.
• Luckily, our rate limits are externally enforced.
• Use an external store (e.g. redis), NOT Celery's built-in support!

Slide 37

Rate Limiting: MANY FACTORS
• Who's asking
• What feature
• Requesting public or private info
• Unknowns

Slide 38

Rate Limiting: MANY FACTORS

https://api.twitter.com/1/account/settings.json
    'x-ratelimit-class': 'api_identified',
    'x-ratelimit-limit': '350',
    'x-ratelimit-remaining': '257',
    'x-ratelimit-reset': '1345696749',

https://api.twitter.com/1/users/search.json?q=djangocon
    'x-featureratelimit-class': 'usersearch',
    'x-featureratelimit-limit': '180',
    'x-featureratelimit-remaining': '179',
    'x-featureratelimit-reset': '1345700189',

Slide 39

Rate Limiting: KNOWN LIMITS (fixed time window)
• Simple: store a "limited until" timestamp.
• Harder: store counters and incr per call.

Slide 40

Rate Limiting: KNOWN LIMITS (fixed time window)

    def call(self, *args, **kwargs):
        key = self.generate_key(*args, **kwargs)
        if redis.exists(key):
            until = int(redis.get(key))
            countdown = until - int(time.time())
            if countdown > 0:
                self.retry(countdown=countdown)
            else:
                redis.delete(key)
        try:
            return self._call(*args, **kwargs)
        except RateLimitException as e:
            redis.set(key, e.until)
            countdown = e.until - int(time.time())
            self.retry(exc=e, countdown=countdown)

Slide 41

Rate Limiting: KNOWN LIMITS (rolling time window)
• Store a redis sorted set of timestamps.
• Remove any stale items from the set.
• If len() of the new set > limit, wait long enough for the oldest to drop off.

Slide 42

Rate Limiting: KNOWN LIMITS (rolling time window)

    def call(self, *args, **kwargs):
        key = self.generate_key(*args, **kwargs)
        window = 3600 * 24 * 2  # 2 day window
        expires = int(time.time()) - window
        redis.zremrangebyscore(key, '-inf', expires)
        if redis.zcard(key) < 25:  # 25 call limit
            now = int(time.time())
            redis.zadd(key, now, now)
            return self._call(*args, **kwargs)
        else:
            first = int(redis.zrange(key, 0, 0)[0])
            countdown = (first + window) - int(time.time())
            self.retry(countdown=countdown)

Slide 43

Rate Limiting: UNKNOWN LIMITS
• Store a counter; incr on rate limit, decr on no rate limit.
• When counter > 0, exponentially back off BEFORE making calls.

Slide 44

Rate Limiting: UNKNOWN LIMITS

    def call(self, *args, **kwargs):
        key = self.generate_key(*args, **kwargs)
        backoff_exponent = int(redis.get(key) or 0)
        if backoff_exponent and not self.request.retries:
            # new tasks must wait backoff before calling
            self.retry(countdown=2 ** backoff_exponent)
        try:
            result = self._call(*args, **kwargs)
        except FacebookClient.RateLimitException as e:
            # raise the backoff and retry after the new delay
            backoff_exponent = redis.incr(key)
            self.retry(exc=e, countdown=2 ** backoff_exponent)
        else:
            # success: ease the backoff for future calls
            redis.decr(key)
            return result

Slide 45

Failover

Slide 46

Failover: PROBLEMS
• Celery's countdown doesn't do what you think it does.
• 3rd parties can fail in lots of "interesting" ways.

Slide 47

Failover: COUNTDOWN
• Tasks are immediately dispatched to a worker daemon, with the ETA in the serialized message.
• Celery's hard work & RabbitMQ's "ack" feature prevent lost work.
• This is still a highly suboptimal solution!

Slide 48

Failover: COUNTDOWN
• Solution: the dead letter exchange AMQP extension.

Slide 49

Failover: COUNTDOWN
[Diagram: exchanges → bindings → queues → workers. The "celery" exchange binds to the "celery" queue with routing_key "celery". A "countdown" exchange routes to a throwaway queue (e.g. "535a6f47") declared with TTL = 60 seconds and dead letter exchange = "celery", so expired messages are re-routed to the main queue for workers to consume.]

Slide 50

Failover: COUNTDOWN

    # add the queue to the app so celery knows about it and use
    # that queue for this task
    task_id = task_id or gen_unique_id()
    app.amqp.queues.add(task_id,
                        exchange=settings.COUNTDOWN_EXCHANGE.name,
                        exchange_type=settings.COUNTDOWN_EXCHANGE.type,
                        routing_key=options['routing_key'],
                        queue_arguments={
                            'x-message-ttl': countdown * 1000,
                            'x-dead-letter-exchange': options['queue'],
                            'x-expires': (countdown + 1) * 1000,
                        })
    options.update({
        'queue': task_id,
        'exchange': settings.COUNTDOWN_EXCHANGE,
    })

Slide 51

Failover: THIRD PARTIES
• Create an abstract task base class for each third party.
• Handle all error conditions within a single call() function on that base class (sketched below).
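
A minimal sketch of that base class for Facebook, mirroring the deck's task.Task style; the client and its exception classes are placeholders for whatever API library you use:

    class FacebookAPITask(task.Task):
        abstract = True

        def call(self, client, method, *args, **kwargs):
            try:
                return getattr(client, method)(*args, **kwargs)
            except FacebookClient.RateLimitException as e:
                # rate limited: back off hard
                self.retry(exc=e, countdown=3600)
            except FacebookClient.TransientError as e:
                # random failure: retry with exponential backoff
                self.retry(exc=e, countdown=2 ** self.request.retries)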

Slide 52

Multiple Queues

Slide 53

Multiple Queues: WHY?
• Better control over task prioritization & resource distribution.
• Queue segmentation allows for spikes.
• Background work needs its own "trickle" queue.

Slide 54

Multiple Queues: HOW?

    CELERY_QUEUES = {
        "interactive": {"binding_key": "interactive"},
        "build_network": {"binding_key": "build_network"},
        "post_message": {"binding_key": "post_message"},
        "trickle": {"binding_key": "trickle"},
    }

Slide 55

Multiple Queues: HOW?

    $ ./manage.py celery worker --queues=interactive,trickle
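
Tasks land in those queues via routing; a minimal sketch using Celery's CELERY_ROUTES setting, with placeholder task paths:

    CELERY_ROUTES = {
        # network-building work gets its own queue
        'tasks.import_twitter_followers': {'queue': 'build_network'},
        # low-priority maintenance goes to the trickle queue
        'tasks.refresh_avatar': {'queue': 'trickle'},
    }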

Slide 56

Multiple Queues: TRICKLE QUEUE
• Maintenance tasks (such as keeping avatars up to date) are low priority.
• They can still overwhelm a queue and crowd out other similar-priority tasks.
• Rather than dumping batches of tasks, "trickle" a few at a time using a cron job and a persistent cursor (see the sketch below).
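
A minimal sketch of a trickle dispatcher, assuming the pending ids live in a redis list acting as the persistent cursor; the key, function, and task names are illustrative:

    def trickle_avatar_updates(batch_size=10):
        # called from cron: enqueue at most batch_size tasks per run
        for _ in range(batch_size):
            user_id = redis.lpop('avatar_update_pending')
            if user_id is None:
                break
            refresh_avatar.delay(int(user_id))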

Slide 57

celerybeat
Don't use it.

Slide 58

celerybeat: WHY NOT?
• Periodic task persistence gets out of sync with code.
• Just one more process to manage.
• Cron: it's just. Not. That. Hard.
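
The cron alternative can be as small as a management command that enqueues the work; a sketch using the Django of this era's NoArgsCommand, reusing the hypothetical trickle dispatcher from the multiple-queues section:

    from django.core.management.base import NoArgsCommand

    from myapp.tasks import trickle_avatar_updates  # hypothetical path

    class Command(NoArgsCommand):
        # run from crontab, e.g.: */10 * * * * ./manage.py trickle_avatars
        def handle_noargs(self, **options):
            trickle_avatar_updates(batch_size=10)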

Slide 59

Extras
Wow, there's still time left?

Slide 60

Extras: DEBUGGING
• Don't use always_eager.
• Logging, logging, logging.
• Unit tests are good, but integration tests save lives.
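
For the logging point, a minimal sketch using the per-task logger the Celery of this era exposes, so worker logs carry the task's name and id; the task itself is illustrative:

    @task()
    def import_page(user_id, offset):
        logger = import_page.get_logger()
        logger.info("importing offset=%s for user=%s", offset, user_id)
        # ... do the work here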

Slide 61

Extras: GOTCHAS
• C-level blocking prevents the soft timeout (so set a timeout on that socket call!).
• Soft timeout doesn't automatically retry.
• The default task result is "PENDING", even if Celery has no idea about the task or the result cache has expired.
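
Both timeout gotchas in one hedged sketch: bound the socket yourself, and turn the soft timeout into an explicit retry. SoftTimeLimitExceeded is Celery's real exception; the task body and client call are illustrative:

    import socket

    from celery.exceptions import SoftTimeLimitExceeded

    @task()
    def fetch_profile(user_id):
        # C-level socket reads ignore the soft limit, so cap them here
        socket.setdefaulttimeout(30)
        try:
            profile = client.profile(user_id)  # placeholder API call
        except SoftTimeLimitExceeded as e:
            # the soft timeout won't retry for you -- do it explicitly
            fetch_profile.retry(exc=e, countdown=60)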

Slide 62

Questions?