Distributed Computing Is Hard, Lets Go Shopping by Lewis Franklin

Transcript

Distributed Computing Is Hard, Lets Go Shopping

Understanding real-world challenges with distributed computing, focusing on Celery

Who Am I (And Why Should You Listen)

- Developing in Python for 10 years
- Using Celery for 2 years
- Contributed Celery helper projects
- Celery Mutex
- zkcelery
- I have screwed up with Celery.
- A lot.

Why Am I Here

- Distributed Computing Is Awesome
- Not Enough About It At PyCon 2013
- Wanted to focus on Celery outside of web
- I want to help others avoid some pitfalls

Where Are We Going?

- This talk is about distributed computing
- I focus on Celery, because it's what I know
- But I know it's not the only game in town.
- This talk does assume, at times, you know Celery
- If you don't, stick around. You still may learn something
- I'd love to talk with you about Celery

Celery: An Introduction

"Celery is an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well."
--celeryproject.com

Unpacking the Description

- Celery allows you to use Python with a message broker (e.g., RabbitMQ, Redis).
- You build tasks that can then be run locally or across a group of computers that can communicate with the message broker.
- Tasks can be reactive or scheduled
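The broker-and-worker idea can be sketched without Celery at all. The example below is only an in-process stand-in: a `queue.Queue` plays the broker and a thread plays the worker. The names `add`, `worker`, and `run_demo` are made up for illustration and are not part of Celery.

```python
import queue
import threading


def add(x, y):
    """A 'task': a plain function whose call is described by a message."""
    return x + y


def worker(broker, results):
    """Pull messages off the queue and run them until a None sentinel arrives."""
    while True:
        message = broker.get()
        if message is None:
            break
        func, args = message
        results.append(func(*args))


def run_demo():
    broker = queue.Queue()  # stand-in for RabbitMQ or Redis
    results = []
    thread = threading.Thread(target=worker, args=(broker, results))
    thread.start()
    broker.put((add, (2, 3)))    # the producer only enqueues a message
    broker.put((add, (10, 5)))
    broker.put(None)             # tell the worker to stop
    thread.join()
    return results
```

With a real broker the producer and the worker would usually be separate processes, often on separate machines, which is where the rest of the talk's problems come from.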

Why Are YOU Here?

Overcoming Obstacles

Problems Encountered and Personal Solutions

Fallacies of Distributed Computing

1994, Peter Deutsch

- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn't change
- There is one administrator
- Transport cost is zero

KISS

'Sam ! The dog: the children's guide to queuing' --@hhoover

Memory Management

- Dealing with large (1GB+) files
- Understand the consequences for each call
- Utilize generators / iterators when possible
- etree.iterparse() vs. etree.parse()
- r.iter_content() vs. r.content

    <?xml version="1.0"?>
    <VehicleSales>
      <VehicleSale>
        <CRMSaleType>1</CRMSaleType>
        <BuyRateAPR>0.2081</BuyRateAPR>
        <APR>22.81</APR>
        <BackGross>1136.24</BackGross>

    from xml.etree import ElementTree

    events = ('start', 'end')
    # iterparse yields elements as they are read instead of loading the whole file.
    context = ElementTree.iterparse(xml_path, events=events)
    level = -1
    item_data = {}
    for event, elem in context:
        name = elem.tag
        if event == 'start':
            level += 1
        if level == 2 and event == 'end' and elem.text:
            item_data[name] = elem.text
        if level == 1 and event == 'end':
            # A record (e.g. one VehicleSale) is complete: store it and start over.
            self._write_to_db(data_type, item_data)
            item_data = {}
        if event == 'end':
            level -= 1
            elem.clear()  # release the element's children and text
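The same "iterate, don't load" idea applies to plain files. The sketch below uses only the standard library; `read_in_chunks`, `sha256_of_file`, and `demo` are hypothetical helpers, and the 64 KiB chunk size is an arbitrary choice. It plays the role that `r.iter_content()` plays for an HTTP response.

```python
import hashlib
import os
import tempfile


def read_in_chunks(path, chunk_size=64 * 1024):
    """Yield successive byte chunks so only one chunk is in memory at a time."""
    with open(path, 'rb') as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def sha256_of_file(path):
    """Hash a file of any size without reading it into memory in one call."""
    digest = hashlib.sha256()
    for chunk in read_in_chunks(path):
        digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Write a small temporary file and process it chunk by chunk."""
    data = b'x' * 200000
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        sizes = [len(chunk) for chunk in read_in_chunks(tmp.name)]
        matches = sha256_of_file(tmp.name) == hashlib.sha256(data).hexdigest()
        return sizes, matches
    finally:
        os.remove(tmp.name)
```

Peak memory stays near one chunk regardless of file size, which matters once several workers process 1GB+ files on the same host.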

Data Locality

- It's easy to forget where your data is
- 1 file, 2 tasks
- Closer is better
- Memory > file system > NAS > carrier pigeon
- Find the balance between speed and accessibility

Segregation of Duties

- Use multiple queues to "Keep 'em Separated"
- An Idle Queue is the Devil's Playground
- Find the balance
- Fine grained queues
- Resource utilization
- Celery Autoscaling
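As a rough illustration of splitting work across queues, the mapping below follows the shape of Celery's task routing setting (`task_routes` in Celery 4+, `CELERY_ROUTES` in the 3.x series). The task and queue names are hypothetical, and `queue_for` is only a helper for this sketch, not a Celery function.

```python
# Hypothetical task names and queue names, for illustration only.
task_routes = {
    'magic.import_data': {'queue': 'imports'},
    'magic.build_exports': {'queue': 'exports'},
    'reports.run_report': {'queue': 'reports'},
}

DEFAULT_QUEUE = 'celery'  # the queue Celery uses when no route matches


def queue_for(task_name):
    """Return the queue a task would be sent to under the routes above."""
    return task_routes.get(task_name, {}).get('queue', DEFAULT_QUEUE)
```

A worker could then be bound to a single queue with something like `celery -A proj worker -Q imports --autoscale=10,3`, where `proj` is a placeholder application name and `--autoscale` takes the maximum and minimum number of pool processes.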

Simplifying Similar Tasks

- Utilize Celery Abstract Tasks
- Serve as a base class for new task types
- Can Add Custom Handlers
- Useful for shared "boilerplate" code
- Database connection
- Celery Mutex

Custom Abstract Class

    import celery

    class DBTask(celery.Task):
        abstract = True
        _db = None

        @property
        def db(self):
            # Connect lazily and reuse the connection on later calls.
            if self._db is None:
                self._db = Database.connect()
            return self._db

        def on_failure(self, exc, task_id, args, kwargs, einfo):
            # on_failure is the handler Celery calls when a task fails.
            send_email('The task failed!')


    @app.task(base=DBTask)
    def get_data(table_name):
        return get_data.db.table(table_name).all()

Data, Data, Everywhere

'"Sam the dog eats a pound of chocolate and poos all over the house" - a children's guide to proactive systems monitoring' --@jessenoller

Keeping Track of Tasks

- Install pre-baked monitoring tools
- Celery Flower
- RabbitMQ Management Plugin
- Understand Celery Events / Snapshots
- Hook Into Monitoring Tools
- Nagios
- Zabbix (rabbitmq-zabbix)

Logging

- Save Your Fingers!
- Find a system you trust
- LogStash
- Heka
- SysLog
- NFS Mount

Error Tracking

- Use a central error logger
- I use Sentry
- Just. Use. Something.
- Leverage LoggerAdapter to capture extra info
- Hostname
- Worker name

    import logging

    def get_logger(logger_name, **kwargs):
        # Wrap the named logger so the keyword arguments travel with every record.
        logger = logging.getLogger(logger_name)
        return logging.LoggerAdapter(logger, kwargs)

    extra = {'DMS Specific': {'company': company,
                              'enterprise': enterprise,
                              'start_uri': start_uri,
                              'process_path': process_path}}
    tags = {'enterprise': enterprise, 'company': company}
    logger = get_logger(self, extra=extra, tags=tags)
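To show what the adapter buys you, here is a self-contained sketch that attaches a hostname and a worker name to every log line. It uses only the standard library; the logger name `demo.tasks`, the worker name `worker-1`, and the `demo` helper are assumptions made for this example.

```python
import io
import logging
import socket


def get_logger(logger_name, **extra):
    """Wrap a logger so every record carries the given extra fields."""
    logger = logging.getLogger(logger_name)
    return logging.LoggerAdapter(logger, extra)


def demo():
    """Log one line to an in-memory stream and return what was written."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    # hostname and worker are record attributes supplied by the adapter.
    handler.setFormatter(
        logging.Formatter('%(hostname)s %(worker)s %(message)s'))

    base = logging.getLogger('demo.tasks')
    base.addHandler(handler)
    base.setLevel(logging.INFO)
    base.propagate = False

    # 'worker-1' is a hypothetical worker name.
    log = get_logger('demo.tasks',
                     hostname=socket.gethostname(),
                     worker='worker-1')
    log.info('import finished')
    base.removeHandler(handler)
    return stream.getvalue()
```

When the same task runs on many machines, having the host and worker on each line is often the quickest way to tell which box misbehaved.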

Testing, testing, testing, testing

'SthDae om g : a children's guide to asynchronous programming.' --@AnthonyBriggs

Testing Strategy

- Test outside of Celery
- Test with a single worker/job
- Test with one worker, multiple concurrent jobs
- Test with multiple servers
- Ramp up as much as possible

Race Conditions

    @app.task
    def run_producers():
        # Queue one import job for every file currently in START_PATH.
        for file_name in os.listdir(START_PATH):
            magic.import_data.delay(file_name, clean=True)

    def build(worker):
        db.callproc('create_report', params)
        report_path = cursor.fetchone()
        # Every job copies its report to the same relative file name.
        shutil.copy(report_path, 'data.xml')
        worker.apply_stylesheet('style.xsl', 'data.xml')
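One plausible reading of the second snippet is that concurrent jobs on the same host all write to the fixed name 'data.xml' and can overwrite each other. Assuming that is the race, a possible mitigation is to give each job its own file. `copy_to_private_path` and `demo` below are hypothetical helpers, not code from the talk.

```python
import os
import shutil
import tempfile


def copy_to_private_path(report_path, work_dir=None):
    """Copy a report to a uniquely named file and return the new path.

    mkstemp creates the file atomically under a name no other job holds.
    """
    fd, private_path = tempfile.mkstemp(
        prefix='data-', suffix='.xml', dir=work_dir)
    os.close(fd)
    shutil.copy(report_path, private_path)
    return private_path


def demo():
    """Copy the same report twice and check the copies do not collide."""
    with tempfile.TemporaryDirectory() as work_dir:
        source = os.path.join(work_dir, 'report.xml')
        with open(source, 'w') as handle:
            handle.write('<report/>')
        first = copy_to_private_path(source, work_dir)
        second = copy_to_private_path(source, work_dir)
        with open(first) as handle:
            content = handle.read()
        return first != second, content
```

The caller then passes the returned path on to the stylesheet step and removes the file when the job is done.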

Serenity Prayer

'"Sam the dog is half missing but he was whole at the pet store" - a parents guide to explaining eventually consistently to children.' --@jessenoller

Handling "Abusive" Tasks

- Not All Tasks Are Created Equal
- Sometimes Calling Outside Your Environment
- Watch Their Memory Usage
- Incrementally read output
- Segregate to its own queue
- Utilize Soft / Hard Timeouts (Celery)
- Soft Timeout lets you "clean up" afterwards
- Hard Timeout kills without remorse

    @app.task(soft_time_limit=3600)
    def run_job(job_id):
        try:
            job = AbusiveJob(job_id)
            job.build()
            job.run()
        except celery.exceptions.SoftTimeLimitExceeded:
            try:
                raise celery.task.current.retry()
            except celery.exceptions.MaxRetriesExceededError:
                # Nested because an except clause cannot catch an exception
                # raised inside a sibling except clause.
                send_email('AbusiveJob failed')

Single Points of Failure

- Identify Your Single Points of Failure
- Database
- Broker
- Eliminate the Ones You Can
- RabbitMQ Cluster
- Database Slaves
- Mitigate Those You Can't
- Add pre-run check
- Utilize retries
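A pre-run check can be as simple as confirming that a dependency accepts TCP connections before the real work starts. The sketch below is one possible form, using only the standard library. `service_reachable` and `demo` are hypothetical names, and a throwaway local socket stands in for the database or broker.

```python
import socket


def service_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def demo():
    """Check a local listener while it is up and again after it is closed."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(('127.0.0.1', 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    up = service_reachable('127.0.0.1', port)
    listener.close()
    down = service_reachable('127.0.0.1', port)
    return up, down
```

A task that finds its dependency down could then retry later instead of failing halfway through. A successful connect only shows the port is open, so it complements retries rather than replacing them.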

Clock Synchronization

- Remember clocks may differ
- Use NTP
- Still don't assume they are in sync

    import datetime

    import celery.schedules

    @app.task
    def find_reports(time=None):
        if not time:
            # Default to the current hour on this worker's clock, as HH0000.
            time = '{:%H}0000'.format(datetime.datetime.now())
        select_stmt = 'SELECT id FROM reports WHERE time = ?'
        for report_id in db.execute(select_stmt, (time,)):
            run_report.delay(report_id)


    CELERYBEAT_SCHEDULE = {
        'report_finder': {
            'task': 'scheduled_report_finder',
            # Run at minute 0 of every hour.
            'schedule': celery.schedules.crontab(minute=0)
        }
    }
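If the scheduler fires at 10:00:00 but the worker's clock reads 09:59:58, the task above would look up hour 09 instead of 10. One way to tolerate a small skew, offered here as an assumption rather than the speaker's fix, is to round to the nearest hour before building the key. `nearest_hour_key` is a hypothetical helper.

```python
import datetime


def nearest_hour_key(now):
    """Round a datetime to the nearest hour and format it as HH0000."""
    rounded = (now + datetime.timedelta(minutes=30)).replace(
        minute=0, second=0, microsecond=0)
    return '{:%H}0000'.format(rounded)
```

Another option is to have the scheduler pass the intended time as the task argument, so the worker never consults its own clock.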

Calling In the Cavalry

'"Sam the dog sam the dog dog the Sam dog" - a parents guide to explaining leader election in distributed systems for children' --@jessenoller

Limiting Jobs

- Client calls; you are hammering their server
- What do you do?
- Distributed semaphore
- Set number of leases
- Can be tuned for specific needs

    @contextlib.contextmanager
    def semaphore(self):
        semaphore = None
        if self.dms_code and not self.called_directly:
            # At most three concurrent jobs per dms_code.
            semaphore = client.Semaphore(self.dms_code, max_leases=3)
            if not semaphore.acquire(blocking=False):
                # No lease free: retry later instead of waiting.
                raise celery.task.current.retry()
        try:
            yield
        finally:
            if semaphore:
                semaphore.release()
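The lease pattern can be tried locally without ZooKeeper. In this sketch a `threading.BoundedSemaphore` stands in for the distributed semaphore and a `Busy` exception stands in for the task retry. `Busy`, `make_limiter`, and `demo` are all made up for illustration.

```python
import contextlib
import threading


class Busy(Exception):
    """Raised when no lease is free; a task would retry later instead."""


def make_limiter(max_leases):
    """Return a context manager factory that allows max_leases holders."""
    leases = threading.BoundedSemaphore(max_leases)

    @contextlib.contextmanager
    def lease():
        # Non-blocking acquire: give up at once rather than queue up.
        if not leases.acquire(blocking=False):
            raise Busy('all leases are taken')
        try:
            yield
        finally:
            leases.release()

    return lease


def demo():
    """Hold three leases, try a fourth, then try again after release."""
    lease = make_limiter(max_leases=3)
    with lease(), lease(), lease():
        try:
            with lease():
                fourth = 'acquired'
        except Busy:
            fourth = 'busy'
    with lease():
        after_release = 'acquired'
    return fourth, after_release
```

A local semaphore only limits one process. The point of the distributed version is that the count is shared by every worker on every host.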

Thundering Herd

- "occurs when a large number of processes waiting for an event are awoken when that event occurs, but only one process is able to proceed at a time."
- Add jitter to retries
- random.randint(30, 60)
- Simple, but effective
- ZooKeeper Locks
- Adds complexity, but arguably "more correct"
- Locks are held in a list
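The jitter idea fits in a few lines. `retry_countdown` and `demo` are hypothetical helpers, and the 30 to 60 second range is the one from the slide.

```python
import random


def retry_countdown(low=30, high=60):
    """Pick a random delay in seconds so retries spread out over time."""
    return random.randint(low, high)


def demo(workers=1000, seed=2014):
    """Simulate many workers choosing a retry delay at the same moment."""
    random.seed(seed)  # seeded only to make the demo repeatable
    delays = [retry_countdown() for _ in range(workers)]
    return min(delays), max(delays), len(set(delays))
```

In a Celery task the value could be passed along as `self.retry(countdown=retry_countdown())`, so that tasks which failed together do not all come back at the same second.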

Distributed Mutex

- "A mutex is a way to ensure that no two concurrent processes are running at the same time"
- Only start a task if one currently isn't running
- Can be limited by input types

    def _get_node(self, args, kwargs):
        # Build the lock path for this task, optionally narrowed by the
        # values of the arguments named in mutex_keys.
        mutex_keys = getattr(self, 'mutex_keys', ())
        lock_node = '/mutex/celery/{}'.format(self.name)
        items = inspect.getcallargs(self.run, *args, **kwargs)
        for value in (items[x] for x in mutex_keys if items.get(x)):
            lock_node += str(value)
        return lock_node

    @contextlib.contextmanager
    def mutex(self, args, kwargs):
        # The slide leaves client = None; a connected ZooKeeper client
        # is assumed here.
        success = False
        lock_node = self._get_node(args, kwargs)
        if not client.exists(lock_node):
            success = True
        if success:
            client.create(lock_node, makepath=True)
            yield True
        else:
            yield False

    class MutexTask(celery.Task):
        abstract = True

        @contextlib.contextmanager
        def mutex(self, args, kwargs, delete=False):
            pass  # body elided on this slide

        def apply_async(self, args=None, kwargs=None, **options):
            # Only queue the task if the mutex could be taken.
            with self.mutex(args, kwargs) as mutex_acquired:
                if mutex_acquired:
                    return super(MutexTask, self).apply_async(
                        args, kwargs, **options)

        def after_return(self, status, retval, task_id, args, kwargs, einfo):
            # Release the lock once the task has finished, whatever the outcome.
            lock_node = self._get_node(args, kwargs)
            if client.exists(lock_node):
                client.delete(lock_node)

    @app.task(base=MutexTask)
    def run_producers():
        for file_name in os.listdir(START_PATH):
            magic.import_data.delay(file_name, clean=True)

    # mutex_keys must be a tuple: ('schedule_id') without the trailing
    # comma is a plain string.
    @app.task(base=MutexTask, mutex_keys=('schedule_id',))
    def build_exports(schedule_id):
        magic.build_exports(schedule_id)
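Checking `exists` and then calling `create` leaves a small window in which two callers can both see "no lock". The sketch below shows the atomic alternative with a local lock file standing in for the ZooKeeper node. `file_mutex` and `demo` are hypothetical, and a file lock only protects processes that share a file system.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def file_mutex(lock_path):
    """Yield True if this caller created the lock file, else False.

    O_CREAT | O_EXCL makes "check and create" a single atomic step.
    """
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    os.close(fd)
    try:
        yield True
    finally:
        os.remove(lock_path)  # release the lock when the block ends


def demo():
    """Take the lock, try to take it again, then retake it after release."""
    with tempfile.TemporaryDirectory() as lock_dir:
        lock_path = os.path.join(lock_dir, 'run_producers.lock')
        with file_mutex(lock_path) as first:
            with file_mutex(lock_path) as second:
                pass
        with file_mutex(lock_path) as third:
            pass
    return first, second, third
```

With ZooKeeper the analogous move would be to attempt the create directly and treat a node-exists error as "not acquired", rather than asking first.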


More Decks by PyCon 2014
