Distributed Computing Is Hard, Let's Go Shopping
by Lewis Franklin

PyCon 2014
April 11, 2014

Transcript

  1. Who Am I (And Why Should You Listen)
     - Developing in Python for 10 years
     - Using Celery for 2 years
     - Contributed Celery helper projects
       - Celery Mutex
       - zkcelery
     - I have screwed up with Celery. A lot.
  7. Why Am I Here
     - Distributed Computing Is Awesome
     - Not Enough About It At PyCon 2013
     - Wanted to focus on Celery outside of the web
     - I want to help others avoid some pitfalls
  10. Where Are We Going?
     - This talk is about distributed computing
     - I focus on Celery, because it's what I know
     - But I know it's not the only game in town
     - This talk does assume, at times, that you know Celery
     - If you don't, stick around. You may still learn something
     - I'd love to talk with you about Celery
  15. "Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well." —celeryproject.org
  16. Unpacking the Description
     - Celery allows you to use Python with a message broker (e.g., RabbitMQ, Redis)
     - You build tasks that can then be run locally or across a group of computers that can communicate with the message broker
     - Tasks can be reactive or scheduled
  19. Fallacies of Distributed Computing (1994, L. Peter Deutsch)
     - The network is reliable
     - Latency is zero
     - Bandwidth is infinite
     - The network is secure
     - Topology doesn't change
     - There is one administrator
     - Transport cost is zero
  24. Memory Management
     - Dealing with large (1 GB+) files
     - Understand the consequences of each call
     - Utilize generators / iterators when possible
       - etree.iterparse() vs. etree.parse()
       - r.iter_content() vs. r.content
  27. Memory Management

     <?xml version="1.0"?>
     <VehicleSales>
       <VehicleSale>
         <CRMSaleType>1</CRMSaleType>
         <BuyRateAPR>0.2081</BuyRateAPR>
         <APR>22.81</APR>
         <BackGross>1136.24</BackGross>
         <BodyStyle>HB</BodyStyle>
         <Branch>540</Branch>
         <CashDown>2000.00</CashDown>
         <CashPrice>14991.00</CashPrice>
         <CostPrice>11087.25</CostPrice>
       </VehicleSale>
     </VehicleSales>
  28. Memory Management

     import xml.etree.cElementTree as etree

     events = ('start', 'end')
     context = etree.iterparse(xml_path, events=events)
     level = -1
     item_data = {}
     for event, elem in context:
         name = elem.tag
         if event == 'start':
             level += 1
         if level == 2 and event == 'end' and elem.text:
             item_data[name] = elem.text
         if level == 1 and event == 'end':
             self._write_to_db(data_type, item_data)
             item_data = {}
         if event == 'end':
             level -= 1
             elem.clear()  # release the element so memory stays flat
  36. Data Locality
     - It's easy to forget where your data is
     - 1 file, 2 tasks
     - Closer is better
       - Memory > file system > NAS > carrier pigeon
     - Find the balance between speed and accessibility
  39. Segregation of Duties
     - Use multiple queues to "Keep 'em Separated"
     - An Idle Queue is the Devil's Playground
     - Find the balance
       - Fine-grained queues
       - Resource utilization
       - Celery Autoscaling
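One way to sketch the multiple-queue idea (setting names are from the Celery 3.x era of this talk, and the task names are hypothetical, not from the slides):

```python
# Route each kind of work to its own queue so a flood of imports
# cannot starve report generation (task names are made up for illustration).
CELERY_ROUTES = {
    'tasks.import_data': {'queue': 'imports'},
    'tasks.run_report':  {'queue': 'reports'},
}

# A worker then subscribes to one queue and autoscales its pool, e.g.:
#   celery worker -Q imports --autoscale=10,3
```

The `--autoscale=max,min` flag is what the "Celery Autoscaling" bullet refers to: the worker grows toward `max` processes under load and shrinks back to `min` when the queue drains.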
  45. Simplifying Similar Tasks
     - Utilize Celery Abstract Tasks
       - Serve as a base class for new task types
       - Can Add Custom Handlers
     - Useful for shared "boilerplate" code
       - Database connection
       - Celery Mutex
  49. Custom Abstract Class

     class DBTask(celery.Task):
         abstract = True
         _db = None

         @property
         def db(self):
             if self._db is None:
                 self._db = Database.connect()
             return self._db

         def on_failure(self, *args, **kwargs):
             send_email('The task failed!')


     @app.task(base=DBTask)
     def get_data(table_name):
         return get_data.db.table(table_name).all()
  54. Data, Data, Everywhere '"Sam the dog eats a pound of

    chocolate and poos all over the house" - a children's guide to proactive systems monitoring' --@jessenoller
  55. Keeping Track of Tasks
     - Install pre-baked monitoring tools
       - Celery Flower
       - RabbitMQ Management Plugin
     - Understand Celery Events / Snapshots
     - Hook Into Monitoring Tools
       - Nagios
       - Zabbix (rabbitmq-zabbix)
  60. Logging - Save Your Fingers!
     - Find a system you trust
       - Logstash
       - Heka
       - syslog
       - NFS mount
  62. Error Tracking
     - Use a central error logger
       - I use Sentry
       - Just. Use. Something.
     - Leverage LoggerAdapter to capture extra info
       - Hostname
       - Worker name
  66. Error Tracking

     def get_logger(logger_name, **kwargs):
         logger = logging.getLogger(logger_name)
         return logging.LoggerAdapter(logger, kwargs)


     extra = {'DMS Specific': {'company': company,
                               'enterprise': enterprise,
                               'start_uri': start_uri,
                               'process_path': process_path}}
     tags = {'enterprise': enterprise, 'company': company}
     logger = get_logger(self.name, extra=extra, tags=tags)  # logger names must be strings
  70. Testing, testing, testing, testing 'SthDae om g : a children's

    guide to asynchronous programming.' --@AnthonyBriggs
  71. Testing Strategy
     - Test outside of Celery
     - Test with a single worker/job
     - Test with one worker, multiple concurrent jobs
     - Test with multiple servers
     - Ramp up as much as possible
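For the "test outside of Celery" step, one commonly used knob is eager mode (a sketch with Celery 3.x-era setting names, not the talk's own setup):

```python
# With these settings, .delay() / .apply_async() execute the task body
# synchronously in the calling process — no broker or worker needed —
# so plain unit tests can exercise task logic directly.
CELERY_ALWAYS_EAGER = True
CELERY_EAGER_PROPAGATES_EXCEPTIONS = True  # re-raise task errors in the test
```

Eager mode only covers the first rung of the ladder above; serialization, routing, and concurrency bugs still require the real-worker stages.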
  74. Race Conditions

     @app.task
     def run_producers():
         for file_name in os.listdir(START_PATH):
             magic.import_data.delay(file_name, clean=True)


     def build(worker):
         db.callproc('create_report', params)
         report_path = cursor.fetchone()
         shutil.copy(report_path, 'data.xml')  # every worker writes the same path
         worker.apply_stylesheet('style.xsl', 'data.xml')
  76. Serenity Prayer

     '"Sam the dog is half missing but he was whole at the pet store" - a parent's guide to explaining eventual consistency to children.' --@jessenoller
  77. Handling "Abusive" Tasks
     - Not All Tasks Are Created Equal
       - Sometimes Calling Outside Your Environment
     - Watch Their Memory Usage
       - Incrementally read output
     - Segregate to its own queue
     - Utilize Soft / Hard Timeouts (Celery)
       - Soft Timeout lets you "clean up" afterwards
       - Hard Timeout kills without remorse
  84. Handling "Abusive" Tasks

     @app.task(soft_time_limit=3600)
     def run_job(job_id):
         try:
             job = AbusiveJob(job_id)
             job.build()
             job.run()
         except celery.exceptions.SoftTimeLimitExceeded:
             raise celery.task.current.retry()
         except celery.exceptions.MaxRetriesExceededError:
             send_email('AbusiveJob failed')
  88. Single Points of Failure
     - Identify Your Single Points of Failure
       - Database
       - Broker
     - Eliminate the Ones You Can
       - RabbitMQ Cluster
       - Database Slaves
     - Mitigate Those You Can't
       - Add pre-run check
       - Utilize retries
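A minimal sketch of what a pre-run check might look like (plain stdlib; the helper name and the default host/port are assumptions, not from the talk):

```python
import socket

def broker_reachable(host='localhost', port=5672, timeout=2.0):
    """Cheap pre-run check: can we open a TCP connection to the broker?"""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

A dispatcher could call this before queueing work and fall back to a delayed retry when it returns False, rather than letting every task discover the outage on its own.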
  94. Clock Synchronization
     - Remember clocks may differ
     - Use NTP
     - Still don't assume they are in sync
  95. Clock Synchronization

     @app.task
     def find_reports(time=None):
         if not time:
             time = '{:%H}0000'.format(datetime.datetime.now())
         select_stmt = 'SELECT id FROM reports WHERE time = ?'
         for report_id in db.execute(select_stmt, (time,)):
             run_report.delay(report_id)


     CELERYBEAT_SCHEDULE = {
         'report_finder': {
             'task': 'scheduled_report_finder',
             'schedule': celery.schedules.crontab(minute=0),
         },
     }
  99. Calling In the Cavalry

     '"Sam the dog sam the dog dog the Sam dog" - a parent's guide to explaining leader election in distributed systems for children' --@jessenoller
  100. Limiting Jobs
     - Client calls; you are hammering their server
     - What do you do?
     - Distributed semaphore
       - Set number of leases
       - Can be tuned for specific needs
  103. Limiting Jobs

     @contextlib.contextmanager
     def semaphore(self):
         semaphore = None
         if self.dms_code and not self.called_directly:
             semaphore = client.Semaphore(self.dms_code, max_leases=3)
             if not semaphore.acquire(blocking=False):
                 raise celery.task.current.retry()
         try:
             yield
         finally:
             if semaphore:
                 semaphore.release()
  107. Thundering Herd
     - "occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time."
     - Add jitter to retries
       - random.randint(30, 60)
       - Simple, but effective
     - ZooKeeper Locks
       - Adds complexity, but arguably "more correct"
       - Locks are held in a list
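The jitter bullet needs nothing but the stdlib; here is a sketch (the helper name is my own, and inside a Celery task you would pass the value as `self.retry(countdown=...)`):

```python
import random

def retry_countdown(base=30, spread=30):
    """A randomized delay so retrying workers don't all wake at the same instant."""
    return base + random.randint(0, spread)

# e.g. inside a task:  raise self.retry(countdown=retry_countdown())
```

Spreading retries across a 30-60 second window means a burst of tasks that failed together trickles back instead of stampeding the recovered resource all at once.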
  114. Distributed Mutex
     - "A mutex is a way to ensure that no two concurrent processes are running at the same time"
     - Only start a task if one isn't currently running
     - Can be limited by input types
  117.

     def _get_node(self, args, kwargs):
         mutex_keys = getattr(self, 'mutex_keys', ())
         lock_node = '/mutex/celery/{}'.format(self.name)
         items = inspect.getcallargs(self.run, *args, **kwargs)
         for value in (items[x] for x in mutex_keys if items.get(x)):
             lock_node += value
         return lock_node


     @contextlib.contextmanager
     def mutex(self, args, kwargs):
         client = None  # elided on the slide: a connected ZooKeeper (kazoo) client
         success = False
         lock_node = self._get_node(args, kwargs)
         if not client.exists(lock_node):
             success = True
         if success:
             client.create(lock_node, makepath=True)
             yield True
         else:
             yield False
  120.

     class MutexTask(celery.Task):
         abstract = True

         @contextlib.contextmanager
         def mutex(self, args, kwargs, delete=False):
             pass  # body shown on the previous slide

         def apply_async(self, args=None, kwargs=None, **options):
             with self.mutex(args, kwargs) as mutex_acquired:
                 if mutex_acquired:
                     return super(MutexTask, self).apply_async(args, kwargs,
                                                               **options)

         def after_return(self, status, retval, task_id, args, kwargs, einfo):
             lock_node = self._get_node(args, kwargs)
             if client.exists(lock_node):
                 client.delete(lock_node)
  124.

     @app.task(base=MutexTask)
     def run_producers():
         for file_name in os.listdir(START_PATH):
             magic.import_data.delay(file_name, clean=True)


     @app.task(base=MutexTask, mutex_keys=('schedule_id',))  # trailing comma: a tuple, not a string
     def build_exports(schedule_id):
         magic.build_exports(schedule_id)
  126. Yes

  127. Fin

  128. Fin
     - Learn From My Mistakes
     - Be Diligent
       - I got careless and introduced a deadlock Friday
     - Ask Questions
       - Now
       - In the Hallway
       - At lunch
       - Over drinks (My time can be bought with Cream Sodas)