
Exploring Concurrency in Python & AWS

alcy
September 01, 2016

Talk given at AWS Hamburg User Group Meetup

Transcript

  1. BACKGROUND PROBLEM: INTRA-S3 BACKUPS

    ▸ Daily backup of 250 objects from one bucket to another, each object ~ 600-800 MB.
    ▸ Initial bulk backup of 6 months: 45000 objects, ~ 25 TB.
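  2. FROM A FOR LOOP TO THREADS: BEFORE THE THREADS, PASS #1

        # src_objects: the objects listed from the source bucket.
        # dest_obj: the destination s3.Object (its setup is not shown on the slide).
        for obj in src_objects:
            dest_obj.copy_from(
                CopySource={
                    'Bucket': obj.bucket_name,
                    'Key': obj.key})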
  3. FROM A FOR LOOP TO THREADS: BEFORE THE THREADS, PASS #1

    EXECUTION TIME: 1 hr 45 mins!
  4. CONCURRENCY IN PYTHON

    ▸ asyncio
      ▸ Event loops, asynchronous IO
    ▸ concurrent.futures
      ▸ high level abstractions: ThreadPoolExecutor and ProcessPoolExecutor
    ▸ threading
      ▸ low level constructs: build your own solution based on threads, semaphores and locks
    ▸ multiprocessing
      ▸ similar to threading, but for processes
  5. CONCURRENCY IN PYTHON: CONCURRENT.FUTURES - USES AND LIMITATIONS

    ▸ ProcessPoolExecutor
      ▸ Multiple Python processes across CPUs
      ▸ Good for CPU intensive tasks
    ▸ ThreadPoolExecutor
      ▸ Threads run inside a single Python interpreter.
      ▸ Only one thread can run Python code at a time, because of the GIL (Global Interpreter Lock)
      ▸ Good for I/O. When a thread is blocked on I/O, it releases the GIL, which is then acquired by another thread.
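The point about threads and blocking I/O can be illustrated with a small sketch. It is not code from the talk: `fake_io` is a hypothetical stand-in for a network call, using `time.sleep`, which releases the GIL just as a blocking socket read would. Five 0.2 second "requests" on five threads should finish in roughly 0.2 seconds rather than 1 second.

```python
import time
from concurrent import futures


def fake_io(i, delay=0.2):
    """Hypothetical stand-in for a blocking network call.

    time.sleep releases the GIL, so other threads can run meanwhile.
    """
    time.sleep(delay)
    return i


def run_threaded(n=5, delay=0.2):
    """Run n fake I/O calls on n threads; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with futures.ThreadPoolExecutor(max_workers=n) as executor:
        # executor.map preserves input order in its results.
        results = list(executor.map(fake_io, range(n), [delay] * n))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_threaded()
    print(results, "in %.2f s" % elapsed)
```

For CPU-bound work the same pattern would not speed anything up under the GIL; that is the case the slide assigns to ProcessPoolExecutor.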
  6. CONCURRENCY IN PYTHON: PASS #2, CONCURRENT.FUTURES

        from concurrent import futures

        def copy(obj, bucket, key):
            # Server-side copy; returns the HTTP status code of the S3 response.
            result = obj.copy_from(CopySource={'Bucket': bucket, 'Key': key})
            return result['ResponseMetadata']['HTTPStatusCode']

        with futures.ThreadPoolExecutor(max_workers=100) as executor:
            todo = []
            for src_object in src_objects:
                # dest_key is not defined on the slide; it is the key to write
                # in the destination bucket for this source object.
                dest_obj = s3.Object(dest_bucket, dest_key)
                future = executor.submit(copy, dest_obj,
                                         src_object.bucket_name, src_object.key)
                todo.append(future)

            results = []
            for future in futures.as_completed(todo):
                res = future.result()
                results.append(res)

        print(len(results))
  7. CONCURRENCY IN PYTHON: PASS #2, CONCURRENT.FUTURES

        def copy(obj, bucket, key):
            result = obj.copy_from(CopySource={'Bucket': bucket, 'Key': key})
            return result['ResponseMetadata']['HTTPStatusCode']

        def task(prefix, src_bucket, src_objects, dest_bucket):
            # Inner pool: copies the objects of one day (one prefix).
            with futures.ThreadPoolExecutor(max_workers=100) as executor:
                for src_object in src_objects:
                    dest_obj = s3.Object(dest_bucket, dest_key)
                    future = executor.submit(copy, dest_obj,
                                             src_object.bucket_name, src_object.key)

        # Outer pool: one task per day. The initial value of date is not
        # shown on the slide.
        tasks = []
        with futures.ThreadPoolExecutor(max_workers=31) as task_executor:
            # June has 30 days; datetime(2016, 6, 31) raises ValueError.
            while date != datetime(2016, 6, 30):
                date = date + relativedelta(days=1)
                prefix = date.strftime("%Y-%m-%d")
                future = task_executor.submit(task, prefix, src_bucket,
                                              src_objects, dest_bucket)
                tasks.append(future)
  8. CONCURRENCY ON AWS LAMBDA: BEFORE THE MEETUP

    ▸ Single function with multithreaded code.
    ▸ Timeouts and higher resource consumption.
    ▸ Moved the code out of Lambda to an EC2 instance.
  9. CONCURRENCY ON AWS LAMBDA: AFTER THE MEETUP

    ▸ Suggestion from the meetup: invoke many Lambdas.
    ▸ Sounded costly.
  10. CONCURRENCY ON AWS LAMBDA: AFTER THE MEETUP

    ▸ …but isn’t!
  11. CONCURRENCY ON AWS LAMBDA: AFTER THE MEETUP

    ▸ (can be if you are not careful)
  12. CONCURRENCY ON AWS LAMBDA AND SNS

    ▸ A single Lambda function publishes messages to SNS. The payload of each message contains an S3 object’s attributes. Another Lambda function, subscribed to the SNS topic, executes the S3 copy API call for each object.
  13. CONCURRENCY ON AWS LAMBDA: PUBLISH TO SNS …WITH THREADS!

        def publish_sns(key):
            # Keep the response so it can be returned to the caller.
            response = client.publish(
                TopicArn='arn:aws:sns:us-east-1:123456789:s3copy',
                Message=key,
                Subject=key
            )
            return response

        def lambda_handler(*args):
            # Create the pool once, outside the loop, so the publishes overlap.
            with futures.ThreadPoolExecutor(max_workers=10) as executor:
                for obj in src_objects:
                    future = executor.submit(publish_sns, obj.key)
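  14. CONCURRENCY ON AWS LAMBDA: CONSUME FROM SNS, S3 COPY

        def lambda_handler(event, context):
            # obj: the destination s3.Object (its setup is not shown on the slide).
            # The SNS message body carries the key of the object to copy.
            obj.copy_from(
                CopySource={
                    'Bucket': 'my-bucket',
                    'Key': event['Records'][0]['Sns']['Message']})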
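To make the event lookup in the subscriber concrete, here is a minimal sketch of pulling the key out of an SNS-triggered Lambda event. The helper name `key_from_sns_event` and the sample key are hypothetical, and the sample event is hand-written and trimmed to the fields used here; a real event carries more fields per record.

```python
def key_from_sns_event(event):
    """Return the S3 key carried in the first SNS record of a Lambda event."""
    return event['Records'][0]['Sns']['Message']


# Hand-written sample event (assumption: only the fields read above).
sample_event = {
    'Records': [
        {
            'EventSource': 'aws:sns',
            'Sns': {
                'Subject': 'backups/2016-06-01/part-0001',  # hypothetical key
                'Message': 'backups/2016-06-01/part-0001',
            },
        }
    ]
}

if __name__ == "__main__":
    print(key_from_sns_event(sample_event))
```

Keeping the lookup in a small function like this makes the handler testable locally without calling S3.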
  15. CONCURRENCY ON AWS LAMBDA: CONCURRENCY

    ▸ An invocation of the Lambda function is the unit of concurrency.
    ▸ For event sources that are not stream-based: concurrency = events per second * function duration
      ▸ concurrency = 20 * 30 = 600
    ▸ By default, 100 concurrent executions is the safety limit; invocations beyond that are throttled. The limit can be increased on request.
    ▸ Retries on errors
  16. CONCURRENCY ON AWS LAMBDA: CONCURRENCY

    LITTLE’S LAW
  17. CONCURRENCY ON AWS LAMBDA: CONCURRENCY

    IF YOUR DATA DON’T FIT LL, CHANGE YOUR DATA! - NEIL GUNTHER
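The concurrency estimate from these slides is Little's law applied to Lambda: concurrency = events per second * function duration. A small sketch of the arithmetic, using the slide's figures (20 events per second, 30 seconds per invocation, limit of 100); the function names are illustrative only.

```python
def required_concurrency(events_per_second, duration_seconds):
    """Little's law: items in the system = arrival rate * time in system."""
    return events_per_second * duration_seconds


def max_event_rate(concurrency_limit, duration_seconds):
    """Highest arrival rate that stays within the concurrency limit."""
    return concurrency_limit / duration_seconds


if __name__ == "__main__":
    needed = required_concurrency(20, 30)
    print(needed)                   # 600
    print(needed > 100)             # True: beyond the limit, so throttling
    print(max_event_rate(100, 30))  # about 3.33 events per second
```

Read the other way round, the formula says that with a limit of 100 and 30 second invocations, publishing faster than about 3.3 events per second leads to throttling.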
  18. CONCURRENCY ON AWS LAMBDA: CONCURRENCY, CALCULATING EVENTS RATE

        # pc is assumed to be time.perf_counter; the import is not on the slide.
        from time import perf_counter as pc

        def publish_sns(client, dest_key):
            global reqs, ptime
            t0 = pc()
            response = client.publish(
                TopicArn='arn:aws:sns:us-east-1:562810932035:s3copy',
                Message=dest_key,
                Subject=dest_key
            )
            t1 = pc() - t0
            ptime += t1   # cumulative time spent in publish calls
            reqs += 1     # cumulative number of publish calls
            return response

        def metrics():
            # Sample the cumulative counters every 2 seconds.
            global reqs
            if (reqs != 0):
                reqs_per_sec.append(reqs)
                times.append(ptime)
            if reqs == 251:
                sys.exit()
            threading.Timer(2, metrics).start()

        metrics()

        # throughput = (reqs_per_sec[n] - reqs_per_sec[n-1]) / 2
        # [5, 20, 60, 100] -> (60 - 20) / 2 = 20 reqs/sec
        # Confirm with LL with response times using N = XR
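The throughput calculation in the closing comments can be sketched as a pair of small functions over the sampled counters. This is an illustration under assumptions, not the talk's code: `samples` are cumulative request counts taken every `interval` seconds, `times` are the matching cumulative publish times, and the `times` values in the example are made up.

```python
def throughput(samples, interval):
    """Requests per second between consecutive cumulative samples (X)."""
    return [(b - a) / interval for a, b in zip(samples, samples[1:])]


def mean_response_times(samples, times):
    """Mean time per request between consecutive samples (R)."""
    return [(t1 - t0) / (s1 - s0)
            for s0, s1, t0, t1 in zip(samples, samples[1:], times, times[1:])]


def in_flight(samples, times, interval):
    """Little's law per interval: N = X * R."""
    xs = throughput(samples, interval)
    rs = mean_response_times(samples, times)
    return [x * r for x, r in zip(xs, rs)]


if __name__ == "__main__":
    samples = [5, 20, 60, 100]      # cumulative requests, from the slide
    times = [0.5, 2.0, 6.0, 10.0]   # cumulative seconds, made-up values
    print(throughput(samples, 2))
    print(in_flight(samples, times, 2))
```

With the slide's samples, the middle interval gives (60 - 20) / 2 = 20 requests per second, matching the comment on the slide.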
  19. CONCURRENCY ON AWS LAMBDA: USEFUL METRICS

    ▸ Invocations
      ▸ Alert if invocations < number of S3 objects to copy
    ▸ Throttles
      ▸ Not a problem for simple jobs without a time constraint
    ▸ Duration
    ▸ Errors
      ▸ Also useful for monitoring
  20. CONCURRENCY ON AWS LAMBDA: RESULT

    ▸ Execution time = max(Last Modified Time) - min(Last Modified Time)
    ▸ 2 minutes 40 seconds
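The execution-time formula can be sketched as a function over the Last Modified timestamps of the copied objects. The timestamps below are made-up values chosen to reproduce a 2 minute 40 second span; with boto3 they could be collected from the `last_modified` attribute of the objects listed in the destination bucket.

```python
from datetime import datetime, timedelta


def execution_time(last_modified_times):
    """Span between the earliest and latest Last Modified timestamps."""
    times = list(last_modified_times)
    return max(times) - min(times)


if __name__ == "__main__":
    # Hypothetical timestamps, not data from the talk.
    stamps = [
        datetime(2016, 8, 30, 10, 0, 0),
        datetime(2016, 8, 30, 10, 1, 15),
        datetime(2016, 8, 30, 10, 2, 40),
    ]
    print(execution_time(stamps))  # 0:02:40
```

This measures the window in which copies completed, so it leaves out the time before the first copy finished.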