Celery @Pycon'2012

Ask Solem

March 19, 2012
Transcript

  1. Me

    Work at VMware on the RabbitMQ team. Father (son now 2 years old). Norwegian, living in London, UK. Celery Lead Developer. Monday, 19 March 12
  2. “Celery aims to be a flexible and reliable best-of-breed solution to process vast amounts of messages in a distributed fashion, while providing operations with the tools to maintain such a system.”
  3. Chunking

    Chunking very granular tasks is good: it reuses connections, keeps CPU/disk caches warm, and reduces messaging overhead. Chunks can use threads/greenlets to further parallelize the computation.
  4. Feeds at Opera

    200k+ feeds to refresh every hour. First take: SELECT * FROM, and send a task for each row.
  5. First take

        from celery.task import task
        from feeds.models import Feed

        @task
        def refresh_feeds():
            # One message per feed: refresh_feed is the per-feed task.
            for feed in Feed.objects.all():
                refresh_feed.delay(feed.url)
  6. Chunking

        from __future__ import division  # true division, so ceil() sees the fraction
        from math import ceil

        @task(ignore_result=True)
        def refresh_feeds(iterations=1000, window=3600, buffer=0.80):
            feeds = Feed.objects.all()
            total = feeds.count()
            # size of each slice
            size = int(ceil(total / iterations))
            # the distance (in time) between each slice
            # when using 80% of the time window available.
            distance = int(ceil((window / iterations) * buffer))
            for page in xrange(iterations):
                start = page * size
                stop = min((page + 1) * size, total)
                # refresh_slice takes (start, stop); countdown delays each slice.
                refresh_slice.apply_async((start, stop),
                                          countdown=distance * page)
  7. Chunking

        from celery.task import group, task

        @task(ignore_result=True)
        def refresh_slice(start, stop):
            feeds = Feed.objects.all()
            group(refresh_feed.subtask((feed.url, ))
                  for feed in feeds[start:stop]).apply_async()
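The slice arithmetic from the chunking slides can be checked without Celery or a database. The following is a minimal sketch in Python 3 (the slides use Python 2); `plan_slices` is a hypothetical helper name, and skipping empty trailing slices is an addition made here, not something the slides do.

```python
from math import ceil

def plan_slices(total, iterations=1000, window=3600, buffer=0.80):
    """Return (start, stop, countdown) for each non-empty slice.

    Mirrors the arithmetic on the slide; skipping empty trailing
    slices is an addition made for this sketch.
    """
    size = ceil(total / iterations)                  # rows per slice
    distance = ceil((window / iterations) * buffer)  # seconds between slices
    plan = []
    for page in range(iterations):
        start = page * size
        if start >= total:
            break  # nothing left to schedule
        stop = min(start + size, total)
        plan.append((start, stop, distance * page))
    return plan

plan = plan_slices(total=200000)
print(len(plan), plan[0], plan[-1])
```

With 200,000 feeds and the default arguments this yields 1000 slices of 200 feeds each, spaced 3 seconds apart, so the last slice is scheduled 2997 seconds into the 3600-second window.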
  8. Chords

    A synchronization primitive, also known as a barrier. A chord consists of headers and a body: the header is a taskset (group), and the body is applied with the results of the headers. Pseudocode:

        def chord(headers, body):
            body([h() for h in headers])
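To make the barrier behaviour concrete, here is a small in-process sketch that runs the headers in parallel threads and calls the body once all of them have returned. It uses Python 3's `concurrent.futures` rather than Celery, and the `max_workers` parameter is an assumption of this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def chord(headers, body, max_workers=4):
    """Run every header callable in parallel, then call body once with all results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(h) for h in headers]
        # result() blocks until each header is done: this is the barrier.
        results = [f.result() for f in futures]
    return body(results)

total = chord([lambda n=n: n * n for n in range(5)], sum)
print(total)  # 0 + 1 + 4 + 9 + 16 = 30
```

The results are collected in header order, so the body sees them in the same order the headers were given, regardless of which thread finished first.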
  9. Chords

    “Native” support for Redis and memcached. Others use a fallback implementation that polls the result: ~1s latency (configurable), not ideal but often good enough.
  10. Implementation

        # - native chord using atomic counters.
        def after_task_returns(task):
            if task.request.chord:
                group_id = task.request.taskset
                body = task.request.chord
                headers = TaskSetResult.restore(group_id)
                key = "chord-%s" % (group_id, )
                # INCR is atomic: only the last header to finish reaches the total.
                if redis.incr(key) >= len(headers):
                    subtask(body).delay(headers.join())
                    # clean up the stored group and the counter
                    headers.delete()
                    redis.delete(key)
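The counter idea can be tried out in a single process. In this sketch a `threading.Lock` around an integer stands in for Redis' atomic `INCR`; `CounterChord` and `run` are hypothetical names introduced for illustration and are not part of Celery.

```python
import threading

class CounterChord:
    """In-process stand-in for the Redis counter: the last header to finish runs the body."""

    def __init__(self, size, body):
        self.size = size
        self.body = body
        self.results = [None] * size
        self.count = 0
        self.lock = threading.Lock()  # plays the role of Redis' atomic INCR
        self.body_result = None

    def after_task_returns(self, index, value):
        # Store the result first, then bump the counter atomically.
        self.results[index] = value
        with self.lock:
            self.count += 1
            is_last = self.count >= self.size
        if is_last:
            # Exactly one caller observes the full count, so the body runs once.
            self.body_result = self.body(list(self.results))

def run(chord, funcs):
    """Run each header in its own thread and return the body's result."""
    threads = [
        threading.Thread(target=lambda i=i, f=f: chord.after_task_returns(i, f()))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return chord.body_result

print(run(CounterChord(3, sum), [lambda: 1, lambda: 2, lambda: 3]))  # 6
```

No thread waits on any other here: each header only increments the counter when it is done, which is why the native implementation avoids the polling latency of the fallback.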
  11. Implementation

        # - fallback resorts to polling
        @task(max_retries=None, ignore_result=True)
        def unlock_chord(group_id, body, headers):
            headers = TaskSetResult(group_id, map(AsyncResult, headers))
            if headers.ready():
                subtask(body).delay(headers.join())
            else:
                # not all headers are done yet: check again in one second
                unlock_chord.retry(countdown=1)
  12. Parallel Sum

        from __future__ import division
        from math import ceil
        from celery.task import chord, task

        @task
        def _sum(L):
            return sum(L)

        @task
        def psum(L, grains=4):
            # number of items per chunk
            chunks = int(ceil(len(L) / grains))
            return chord(_sum.subtask((L[chunks * i:chunks * (i + 1)], ))
                         for i in xrange(grains))(_sum.subtask())
  13. MapReduce

        from collections import defaultdict
        from celery.task import chord, group, subtask, task

        @task
        def reduce(results, reducer):
            reducer = subtask(reducer)
            # group the mapped values by key
            d = defaultdict(list)
            for items in results:
                for key, value in items:
                    d[key].append(value)
            return group(reducer.clone((key, values))
                         for key, values in d.iteritems()).apply_async()
  14. MapReduce

        @task
        def mapreduce(items, mapper, reducer):
            mapper = subtask(mapper)
            reducer = subtask(reducer)
            # header: one mapper per item; body: reduce over all mapper results
            return chord(mapper.clone((item, )) for item in items)(
                         reduce.subtask((reducer, )))
  15. Counting words

        import requests

        @task
        def mapper(document_url):
            response = requests.get(document_url)
            if response.ok:
                return [(word, 1)
                        for word in words(response.content.split())
                        if word.isalpha()]
            response.raise_for_status()

        def words(it, punctuation='(),./:;?'):
            return (w.replace("-", "").strip(punctuation).lower()
                    for w in it)
  16. Counting Words

        import requests
        from celery.task import task

        def count_words(document_urls):
            return mapreduce.delay(document_urls,
                                   mapper.subtask(), reducer.subtask())

        @task
        def reducer(word, counts):
            return word, sum(counts)
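The same map, group-by-key, and reduce steps can be followed in a single process. This is a sketch only: it takes in-memory strings instead of fetching URLs with `requests`, runs sequentially rather than through a chord, and is written for Python 3. The sample documents are made up for illustration.

```python
from collections import defaultdict

def words(it, punctuation="(),./:;?"):
    return (w.replace("-", "").strip(punctuation).lower() for w in it)

def mapper(text):
    # Emit a (word, 1) pair for every alphabetic word in the document.
    return [(word, 1) for word in words(text.split()) if word.isalpha()]

def reducer(word, counts):
    return word, sum(counts)

def mapreduce(items, mapper, reducer):
    grouped = defaultdict(list)
    for pairs in map(mapper, items):   # map phase (the chord header)
        for key, value in pairs:       # group values by key
            grouped[key].append(value)
    # reduce phase: one reducer call per distinct key
    return dict(reducer(k, v) for k, v in grouped.items())

docs = ["The quick brown fox.", "The lazy dog; the end?"]
counts = mapreduce(docs, mapper, reducer)
print(counts["the"])  # 3
```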
  17. Blocking

    Blocking is bad, and not just when using gevent or eventlet. Use timeouts and retry if possible: socket.settimeout(), socket.setdefaulttimeout(). And be smart about routing...
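A timeout-and-retry loop around a socket read might look like the sketch below, using only the standard library. `recv_with_retry` is a hypothetical helper, and the retry count and timeout values are arbitrary choices; inside a Celery task one would more likely retry the task itself than loop in place.

```python
import socket

def recv_with_retry(address, timeout=0.5, retries=3, nbytes=1024):
    """Read up to nbytes from address, retrying when an attempt times out.

    Returns the received bytes, or None if every attempt timed out.
    """
    for _ in range(retries):
        try:
            # The timeout covers both connect() and the recv() that follows.
            with socket.create_connection(address, timeout=timeout) as sock:
                return sock.recv(nbytes)
        except socket.timeout:
            continue  # peer was too slow: try again instead of blocking forever
    return None

if __name__ == "__main__":
    # A local listener that queues connections but never replies.
    silent = socket.socket()
    silent.bind(("127.0.0.1", 0))
    silent.listen(5)
    print(recv_with_retry(silent.getsockname(), timeout=0.1, retries=2))
    silent.close()
```

Without the timeout, the `recv()` call against the silent listener would block indefinitely and tie up the worker process.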
  18. Smart Routing

    Route long-running tasks to dedicated workers. This makes way for higher-priority tasks. Needs hacking: reroute tasks to workers with free CPU.
  19. Cyme is

    cyme |sīm| noun, Botany: a flower cluster with a central stem bearing a single terminal flower that develops first, the other flowers in the cluster developing as terminal buds of lateral stems.
  20. Cyme is

    “Cyme is a distributed service where each node manages the Celery instances on that machine.”
  21. Distributed

    A cyme node is called a branch. No master (decentralized). Branches know their neighbors. Every branch has an HTTP API.
  22. HTTP API

    Create and manage: Applications (defaults), Worker Instances, Queues. Configure individual workers: autoscale settings (concurrency), queues consumed from.
  23. Getting started

    Install:

        $ pip install cyme

    Start a branch:

        $ cyme-branch -D instances/
  24. [Figure: output from cyme-branch]
  25. Clients

    Command-line: cyme. Python: cyme.Client. Ruby: Cyme::Client. curl / wget. Soon: Admin UI? Javascript? Django Admin?
  26. My neighbors

        >>> from cyme import Client
        >>> cyme = Client('http://:8000')
        >>> cyme.branches
        [u'9bcdb936-e80d-4d1f-900f-be2679cdcfcf']

    GET http://localhost:8000/branches/
  27. Our app

        >>> app = cyme.add('pycon',
        ...                broker='amqp://',
        ...                arguments='--workdir=/opt/play/proj')
        >>> app
        <App: 'http://:8000/pycon'>

    PUT http://localhost:8000/pycon
  28. Our first instance

        >>> instance = app.instances.add()
        >>> instance
        <Instance: '2d56dba0-4c95-4945-8c96-3e7d43d244b3'>

    The name is generated automatically, or you can provide your own.

    POST http://localhost:8000/pycon/instances
  29. Our first instance

    In the branch logs we see it start:

        > 2d56dba0-4c95-4945-8c96-3e7d43d244b3: DOWN
        > Restarting node 2d56dba0...244b3: OK

    Support files created in cyme dir:

        $ ls -l instances/48918f6c...356df54161cb/
        rw-r--r-- worker.log
        rw-r--r-- worker.pid
        rw-r--r-- worker.statedb
  30. Our first instance

        >>> app.instances.all_names()
        ['2d56dba0-4c95-4945-8c96-3e7d43d244b3']
        >>> app.instances.get(_[0]).stats()
        {'consumer': {'prefetch_count': 80,
                      'broker': {…}
         'pool': {'timeouts': [None, None],
                  'processes': [7701]},
         'autoscaler': {'current': 1, 'max': 1, 'min': 1}}
  31. Autoscale

    min: keep at least this many. max: increase/decrease based on demand, but never more than this.

        >>> instance.autoscale(min=4, max=10)
        {u'max': 10, u'min': 4}
        >>> instance.stats()["pool"]["processes"]
        [7701, 7754, 7755, 7756]

    PUT http://localhost:8000/pycon/instances/2d56…/autoscale?min=max=
  32. Queues

        >>> app.queues.add("celery")
        # is the same as
        >>> app.queues.add("celery",
        ...                exchange="celery",
        ...                routing_key="celery",
        ...                exchange_type="direct", **opts)

    PUT http://localhost:8000/pycon/queues/celery
  33. [Figure: output from tail -f instances/2d56…/worker.log]