
Distributed Computing Is Hard, Let's Go Shopping by Lewis Franklin

PyCon 2014
April 11, 2014

Transcript

  1. Distributed Computing Is Hard, Let's Go Shopping: Understanding real-world challenges with distributed computing, focusing on Celery
  2-9. Who Am I (And Why Should You Listen)

    - Developing in Python for 10 years
    - Using Celery for 2 years
    - Contributed Celery helper projects: Celery Mutex, zkcelery
    - I have screwed up with Celery. A lot.
  10-14. Why Am I Here

    - Distributed Computing Is Awesome
    - Not Enough About It At PyCon 2013
    - Wanted to focus on Celery outside of the web
    - I want to help others avoid some pitfalls
  15-20. Where Are We Going?

    - This talk is about distributed computing
    - I focus on Celery, because it's what I know
    - But I know it's not the only game in town
    - This talk does assume, at times, that you know Celery
    - If you don't, stick around; you may still learn something
    - I'd love to talk with you about Celery
  21. Celery: An Introduction

  22. "Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well." —celeryproject.org
  23-26. Unpacking the Description

    - Celery allows you to use Python with a message broker (e.g., RabbitMQ, Redis)
    - You build tasks that can then be run locally or across a group of computers that can communicate with the message broker
    - Tasks can be reactive or scheduled
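The broker/worker shape described above can be sketched with a toy in-process analogue: a stdlib queue.Queue standing in for the broker and a thread standing in for a worker. This is not Celery's API, just the idea of dispatching work as messages instead of direct calls.

```python
import queue
import threading

# Stdlib stand-in for the broker/worker pattern Celery provides.
# The names (broker, worker) are illustrative, not Celery APIs.
broker = queue.Queue()
results = {}

def worker():
    while True:
        task_id, func, args = broker.get()
        if func is None:          # sentinel: shut the worker down
            break
        results[task_id] = func(*args)

t = threading.Thread(target=worker)
t.start()

# "Delay" a task: put a message on the queue instead of calling directly.
broker.put(('job-1', lambda x, y: x + y, (2, 3)))
broker.put((None, None, None))
t.join()
print(results['job-1'])  # 5
```

In real Celery the queue lives in RabbitMQ or Redis, so the worker can be on another machine entirely.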
  27. Why Are YOU Here? Overcoming Obstacles: Problems Encountered and Personal Solutions
  28-35. Fallacies of Distributed Computing (Peter Deutsch, 1994)

    - The network is reliable
    - Latency is zero
    - Bandwidth is infinite
    - The network is secure
    - Topology doesn't change
    - There is one administrator
    - Transport cost is zero
  36. KISS 'Sam! The dog: the children's guide to queuing' --@hhoover
  37-42. Memory Management

    - Dealing with large (1GB+) files
    - Understand the consequences of each call
    - Utilize generators / iterators when possible
    - etree.iterparse() vs. etree.parse()
    - r.iter_content() vs. r.content
  43. Memory Management

    <?xml version="1.0"?>
    <VehicleSales>
      <VehicleSale>
        <CRMSaleType>1</CRMSaleType>
        <BuyRateAPR>0.2081</BuyRateAPR>
        <APR>22.81</APR>
        <BackGross>1136.24</BackGross>
        <BodyStyle>HB</BodyStyle>
        <Branch>540</Branch>
        <CashDown>2000.00</CashDown>
        <CashPrice>14991.00</CashPrice>
        <CostPrice>11087.25</CostPrice>
      </VehicleSale>
    </VehicleSales>
  44-51. Memory Management

    events = ('start', 'end')
    context = cElementTree.iterparse(xml_path, events=events)
    level = -1
    item_data = {}
    for event, elem in context:
        name = elem.tag
        if event == 'start':
            level += 1
        if level == 2 and event == 'end' and elem.text:
            item_data[name] = elem.text
        if level == 1 and event == 'end':
            self._write_to_db(data_type, item_data)
            item_data = {}
        if event == 'end':
            level -= 1
            elem.clear()
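A self-contained, runnable variant of the same pattern, applied to a trimmed copy of the VehicleSales sample from slide 43. Tracking tags instead of levels is a simplification of the slide's approach; the key point (clearing elements as you go so memory stays flat) is unchanged.

```python
import io
import xml.etree.ElementTree as ET

XML = b"""<?xml version="1.0"?>
<VehicleSales>
  <VehicleSale>
    <CRMSaleType>1</CRMSaleType>
    <APR>22.81</APR>
  </VehicleSale>
</VehicleSales>"""

def iter_sales(stream):
    """Yield one dict per <VehicleSale>, clearing elements as we go
    so memory use stays bounded no matter how large the file is."""
    item = {}
    for event, elem in ET.iterparse(stream, events=('start', 'end')):
        if event == 'end':
            if elem.tag == 'VehicleSale':
                yield item
                item = {}
            elif elem.text and elem.text.strip():
                item[elem.tag] = elem.text.strip()
            elem.clear()

sales = list(iter_sales(io.BytesIO(XML)))
print(sales)  # [{'CRMSaleType': '1', 'APR': '22.81'}]
```

With a 1GB+ file on disk you would pass the path (or an open file) to iterparse instead of a BytesIO.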
  52-57. Data Locality

    - It's easy to forget where your data is
    - 1 file, 2 tasks
    - Closer is better
    - Memory > file system > NAS > carrier pigeon
    - Find the balance between speed and accessibility
  58-65. Segregation of Duties

    - Use multiple queues to "Keep 'em Separated"
    - An Idle Queue is the Devil's Playground
    - Find the balance: fine-grained queues vs. resource utilization
    - Celery Autoscaling
  66-72. Simplifying Similar Tasks

    - Utilize Celery abstract tasks
    - They serve as a base class for new task types
    - Can add custom handlers
    - Useful for shared "boilerplate" code: database connections, Celery Mutex
  73-77. Custom Abstract Class

    class DBTask(celery.Task):
        abstract = True
        _db = None

        @property
        def db(self):
            if self._db is None:
                self._db = Database.connect()
            return self._db

        def on_failure(self, *args, **kwargs):
            send_email('The task failed!')


    @app.task(base=DBTask)
    def get_data(table_name):
        return get_data.db.table(table_name).all()
  78. Data, Data, Everywhere '"Sam the dog eats a pound of chocolate and poos all over the house" - a children's guide to proactive systems monitoring' --@jessenoller
  79-86. Keeping Track of Tasks

    - Install pre-baked monitoring tools: Celery Flower, the RabbitMQ Management Plugin
    - Understand Celery Events / Snapshots
    - Hook into monitoring tools: Nagios, Zabbix (rabbitmq-zabbix)
  87-93. Logging

    - Save your fingers!
    - Find a system you trust: LogStash, Heka, SysLog, an NFS mount
  94-100. Error Tracking

    - Use a central error logger
    - I use Sentry. Just. Use. Something.
    - Leverage LoggerAdapter to capture extra info: hostname, worker name
  101-104. Error Tracking

    def get_logger(logger_name, **kwargs):
        logger = logging.getLogger(logger_name)
        return logging.LoggerAdapter(logger, kwargs)


    extra = {'DMS Specific': {'company': company,
                              'enterprise': enterprise,
                              'start_uri': start_uri,
                              'process_path': process_path}}
    tags = {'enterprise': enterprise, 'company': company}
    logger = get_logger(self.name, extra=extra, tags=tags)
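A self-contained stdlib demonstration of the LoggerAdapter idea above: the adapter's context dictionary is merged into every log record, so a formatter can pull it out. The 'hostname' field is illustrative of the kind of context the slides suggest capturing.

```python
import io
import logging

# Capture log output in memory so the effect is visible.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(hostname)s %(levelname)s %(message)s'))

logger = logging.getLogger('demo.tasks')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# LoggerAdapter injects its dict into each record as extra attributes.
adapter = logging.LoggerAdapter(logger, {'hostname': 'worker-01'})
adapter.info('import finished')

print(stream.getvalue().strip())  # worker-01 INFO import finished
```

In a Celery worker the same trick attaches the worker's hostname and task context to every message without threading those values through each call site.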
  105-106. Error Tracking (screenshots)
  107. Testing, testing, testing, testing 'SthDae om g : a children's guide to asynchronous programming.' --@AnthonyBriggs
  108-113. Testing Strategy

    - Test outside of Celery
    - Test with a single worker/job
    - Test with one worker, multiple concurrent jobs
    - Test with multiple servers
    - Ramp up as much as possible
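"Test outside of Celery" usually means keeping the business logic in a plain function so it can be unit-tested with no broker or worker running; the task is a thin wrapper. The function and field names below are hypothetical, chosen to echo the VehicleSales data from earlier slides.

```python
# Business logic as a plain function: testable with no broker or worker.
def gross_margin(sale):
    """Hypothetical example: compute gross margin for one sale record."""
    return round(float(sale['CashPrice']) - float(sale['CostPrice']), 2)

# In production this would be wrapped as a Celery task that just delegates:
#   @app.task
#   def gross_margin_task(sale):
#       return gross_margin(sale)

print(gross_margin({'CashPrice': '14991.00', 'CostPrice': '11087.25'}))  # 3903.75
```

Once the logic passes in isolation, the remaining steps on the slide (one worker, concurrent jobs, multiple servers) only need to exercise the distribution machinery, not the logic.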
  114-117. Race Conditions

    @app.task
    def run_producers():
        for file_name in os.listdir(START_PATH):
            magic.import_data.delay(file_name, clean=True)


    def build(worker):
        db.callproc('create_report', params)
        report_path = cursor.fetchone()
        shutil.copy(report_path, 'data.xml')
        worker.apply_stylesheet('style.xsl', 'data.xml')
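The race in build() above: every concurrent job copies its report to the same shared name, 'data.xml', so two jobs running at once clobber each other. One hedged fix (my sketch, not the talk's) is to give each job a unique scratch file.

```python
import os
import tempfile

def build_report(report_bytes):
    """Write each job's output to a unique path instead of a shared
    'data.xml', so concurrent jobs cannot clobber one another."""
    fd, path = tempfile.mkstemp(suffix='.xml')
    with os.fdopen(fd, 'wb') as fh:
        fh.write(report_bytes)
    return path

first = build_report(b'<report n="1"/>')
second = build_report(b'<report n="2"/>')
print(first != second)  # True: two concurrent jobs get two distinct files
```

mkstemp is atomic at the filesystem level, so this also avoids the window where two processes "check then create" the same name.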
  118. Serenity Prayer '"Sam the dog is half missing but he was whole at the pet store" - a parent's guide to explaining eventual consistency to children.' --@jessenoller
  119-127. Handling "Abusive" Tasks

    - Not all tasks are created equal
    - Some call outside your environment
    - Watch their memory usage; incrementally read output
    - Segregate them to their own queue
    - Utilize soft / hard timeouts (Celery)
    - A soft timeout lets you "clean up" afterwards
    - A hard timeout kills without remorse
  128-131. Handling "Abusive" Tasks

    @app.task(soft_time_limit=3600)
    def run_job(job_id):
        try:
            job = AbusiveJob(job_id)
            job.build()
            job.run()
        except celery.exceptions.SoftTimeLimitExceeded:
            raise celery.task.current.retry()
        except celery.exceptions.MaxRetriesExceededError:
            send_email('AbusiveJob failed')
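The "incrementally read output" bullet can be shown with stdlib subprocess: iterate over the child's stdout as it arrives instead of buffering everything with communicate(), so memory stays bounded for chatty external programs. The child here is a trivial stand-in for whatever the task shells out to.

```python
import subprocess
import sys

# Read a child process's output line by line instead of all at once.
child = subprocess.Popen(
    [sys.executable, '-c', 'print("line 1"); print("line 2")'],
    stdout=subprocess.PIPE, text=True)

lines = []
for line in child.stdout:      # iterates as output arrives
    lines.append(line.rstrip())
child.wait()

print(lines)  # ['line 1', 'line 2']
```

For a genuinely abusive child you would combine this with the soft timeout above, so the loop can be interrupted and cleaned up.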
  132-141. Single Points of Failure

    - Identify your single points of failure: database, broker
    - Eliminate the ones you can: RabbitMQ cluster, database slaves
    - Mitigate those you can't: add a pre-run check, utilize retries
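A minimal sketch of the "pre-run check" idea: before doing real work, verify a dependency (broker, database) is even reachable, and retry the task later if it is not. The function name is mine, not the talk's; a bare TCP connect is the simplest possible probe.

```python
import socket

def is_reachable(host, port, timeout=1.0):
    """Hypothetical pre-run check: can we open a TCP connection to a
    dependency before starting real work?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A task might call this first and raise retry() if the dependency is
# down, rather than failing halfway through its work.
print(is_reachable('127.0.0.1', 1))  # False on a typical host: nothing listens on port 1
```

A real check would probe the actual service (e.g., a trivial SELECT 1) rather than just the socket, but the shape is the same: fail fast, retry later.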
  142-145. Clock Synchronization

    - Remember that clocks may differ
    - Use NTP
    - Still don't assume they are in sync
  146-149. Clock Synchronization

    @app.task
    def find_reports(time=None):
        if not time:
            time = '{:%H}0000'.format(datetime.datetime.now())
        select_stmt = 'SELECT id FROM reports WHERE time = ?'
        for report_id in db.execute(select_stmt, (time,)):
            run_report.delay(report_id)


    CELERYBEAT_SCHEDULE = {
        'report_finder': {
            'task': 'scheduled_report_finder',
            'schedule': celery.schedules.crontab(minute=0),
        },
    }
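The key move above is that find_reports accepts an explicit time instead of requiring every worker to call now() itself, so a worker with a skewed clock still processes the intended hour. The bucketing itself is deterministic once the time is passed in:

```python
import datetime

def hour_key(when):
    """Bucket a timestamp to the top of its hour, matching the
    '{:%H}0000' key the slide stores in the reports table."""
    return '{:%H}0000'.format(when)

fixed = datetime.datetime(2014, 4, 11, 14, 59, 59)
print(hour_key(fixed))  # 140000
```

Passing fixed timestamps through the task signature also makes the scheduling logic trivially unit-testable, tying back to the earlier testing slides.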
  150. Calling In the Cavalry '"Sam the dog sam the dog dog the Sam dog" - a parent's guide to explaining leader election in distributed systems to children' --@jessenoller
  151-156. Limiting Jobs

    - The client calls; you are hammering their server
    - What do you do?
    - A distributed semaphore
    - Set the number of leases
    - Can be tuned for specific needs
  157-160. Limiting Jobs

    @contextlib.contextmanager
    def semaphore(self):
        semaphore = None
        if self.dms_code and not self.called_directly:
            semaphore = client.Semaphore(self.dms_code, max_leases=3)
            if not semaphore.acquire(blocking=False):
                raise celery.task.current.retry()
        try:
            yield
        finally:
            if semaphore:
                semaphore.release()
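The client.Semaphore above is a ZooKeeper recipe (kazoo provides one), coordinating leases across machines. The local shape of the idea, at most N leases with a non-blocking acquire so callers retry instead of waiting, can be sketched with stdlib threading:

```python
import threading

# Local stand-in for a distributed semaphore: at most 3 leases at once,
# non-blocking acquire so a caller can retry later instead of blocking.
leases = threading.BoundedSemaphore(value=3)

acquired = [leases.acquire(blocking=False) for _ in range(4)]
print(acquired)  # [True, True, True, False]: the fourth caller must retry

for ok in acquired:
    if ok:
        leases.release()
```

The distributed version behaves the same way, except the lease count is enforced in ZooKeeper so it holds across every worker on every machine.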
  161-168. Thundering Herd

    - "Occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time."
    - Add jitter to retries: random.randint(30, 60) is simple, but effective
    - ZooKeeper locks add complexity, but are arguably "more correct"; locks are held in a list
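The jitter bullet above in runnable form: spread retry delays over a window so a crowd of failed tasks does not all wake up in the same instant. The function name is mine; the randint range is the slide's.

```python
import random

def retry_delay(base=30, spread=30):
    """Jittered retry delay, per the slide: random.randint(30, 60)
    scatters a herd of retries across a 30-second window."""
    return random.randint(base, base + spread)

delays = [retry_delay() for _ in range(5)]
print(all(30 <= d <= 60 for d in delays))  # True
```

In a Celery task this would be passed as the countdown argument to retry(), e.g. retry(countdown=retry_delay()).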
  169-172. Distributed Mutex

    - "A mutex is a way to ensure that no two concurrent processes are running at the same time"
    - Only start a task if one isn't currently running
    - Can be limited by input types
  173-175.

    def _get_node(self, args, kwargs):
        mutex_keys = getattr(self, 'mutex_keys', ())
        lock_node = '/mutex/celery/{}'.format(self.name)
        items = inspect.getcallargs(self.run, *args, **kwargs)
        for value in (items[x] for x in mutex_keys if items.get(x)):
            lock_node += value
        return lock_node


    @contextlib.contextmanager
    def mutex(self, args, kwargs):
        # client: a connected ZooKeeper client (its setup is elided on the slide)
        success = False
        lock_node = self._get_node(args, kwargs)
        if not client.exists(lock_node):
            success = True
        if success:
            client.create(lock_node, makepath=True)
            yield True
        else:
            yield False
  176. class MutexTask(celery.Task):
           abstract = True

           @contextlib.contextmanager
           def mutex(self, args, kwargs, delete=False):
               pass

           def apply_async(self, args=None, kwargs=None, **options):
               with self.mutex(args, kwargs) as mutex_acquired:
                   if mutex_acquired:
                       return super(MutexTask,
                                    self).apply_async(args, kwargs,
                                                      **options)

           def after_return(self, *args, **kwargs):
               # client: the same ZooKeeper client as above; setup elided
               lock_node = self._get_node(args, kwargs)
               if client.exists(lock_node):
                   client.delete(lock_node)
  180. @app.task(base=MutexTask)
       def run_producers():
           for file_name in os.listdir(START_PATH):
               magic.import_data.delay(file_name, clean=True)

       @app.task(base=MutexTask, mutex_keys=('schedule_id',))  # trailing comma: must be a tuple
       def build_exports(schedule_id):
           magic.build_exports(schedule_id)
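To make the `mutex_keys` behavior concrete, here is a standalone sketch of how the lock-node path is derived from a task's arguments (the helper name and sample task are my own; the path logic mirrors `_get_node` from the slides):

```python
import inspect

def build_lock_node(task_name, func, mutex_keys, args, kwargs):
    # The lock path starts with the task name; the values of any
    # arguments named in mutex_keys are appended, so tasks called
    # with different inputs hold different locks instead of one
    # global lock per task.
    lock_node = '/mutex/celery/{}'.format(task_name)
    items = inspect.getcallargs(func, *args, **kwargs)
    for value in (items[x] for x in mutex_keys if items.get(x)):
        lock_node += value
    return lock_node

def build_exports(schedule_id):
    pass

print(build_lock_node('build_exports', build_exports,
                      ('schedule_id',), ('42',), {}))
# → /mutex/celery/build_exports42
```

Two `build_exports` tasks with the same `schedule_id` contend for one node, while a task with a different `schedule_id` proceeds unblocked.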
  182. Are We There Yet?

  183. Yes

  184. Fin

  185. Fin - Learn From My Mistakes

  186. Fin - Learn From My Mistakes - Be Diligent

  187. Fin - Learn From My Mistakes - Be Diligent -

    I got careless and introduced a deadlock Friday
  188. Fin - Learn From My Mistakes - Be Diligent -

    I got careless and introduced a deadlock Friday - Ask Questions
  189. Fin - Learn From My Mistakes - Be Diligent -

    I got careless and introduced a deadlock Friday - Ask Questions - Now
  190. Fin - Learn From My Mistakes - Be Diligent -

    I got careless and introduced a deadlock Friday - Ask Questions - Now - In the Hallway
  191. Fin - Learn From My Mistakes - Be Diligent -

    I got careless and introduced a deadlock Friday - Ask Questions - Now - In the Hallway - At lunch
  192. Fin - Learn From My Mistakes - Be Diligent -

    I got careless and introduced a deadlock Friday - Ask Questions - Now - In the Hallway - At lunch - Over drinks (My time can be bought with Cream Sodas)
  193. Fin - Get In Touch! - @brolewis - lewis.franklin@gmail.com -

    brolewis.com - Questions?