Jason Myers - Leveraging Serverless Architecture for Powerful Data Pipelines

Serverless architectures that let us run Python functions in the cloud in an event-driven, parallel fashion can be used to create extremely dynamic and powerful data pipelines for ETL and data science. Join me for an exploration of how to build data pipelines on AWS Lambda with Python. We'll cover a brief introduction to event-driven programming. Then we'll walk through building an example pipeline while discussing some of the frameworks and tools that can make building your pipeline easier. Finally, we'll discuss how to maintain observability on your pipeline so you have the performance and troubleshooting information you need.

https://us.pycon.org/2017/schedule/presentation/566/

PyCon 2017

May 21, 2017

Transcript

  1. Serverless Architecture for Powerful Data Pipelines Jason A Myers

  2. Credit: Horst Felske and Fritz Schiemann ex-convex.org

  3. Issues — Scheduled — Complex — Scaling — Recovery

  4. Credit: Anonymous

  5. Python Toolkits and Services — Luigi — Airflow — AWS Data Pipeline

  6. SERVERLESS

  7. None
  8. Cloud Functions
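
      For readers new to the model, a minimal sketch of mine (not from the deck): a cloud function is just a handler the platform calls with the triggering event and a context object. On AWS Lambda in Python it looks like this:

      def handle(event, context):
          # 'event' is the trigger payload (e.g., an S3 upload notification);
          # 'context' exposes runtime metadata such as remaining execution time.
          print('received: {}'.format(event))
          return 'ok'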

  9. None
  10. None
  11. Simple Pipeline overview

  12. Simple Pipeline overview

  13. Simple Pipeline overview

  14. Simple Pipeline overview

  15. Simple Pipeline overview

  16. Serverless Python Tools — Zappa — Apex — Chalice — Serverless

  17. Apex — Multiple Environment Support — Function Deployment — Infrastructure as code via Terraform

  18. Project Structure

      ├── functions
      │   └── listener
      │       └── main.py
      ├── infrastructure
      │   ├── dev
      │   │   ├── main.tf
      │   │   ├── outputs.tf
      │   │   └── variables.tf
      │   └── prod
      │       ├── main.tf
      │       ├── outputs.tf
      │       └── variables.tf
      ├── project.json
      └── project.prod.json

  19. Project Structure

  20. Project Structure

  21. Apex package.json

      {
          "name": "listener",
          "description": "S3 File Listener",
          "runtime": "python3.6",
          "memory": 128,
          "timeout": 5,
          "role": "arn:aws:iam::ACCOUNTNUM:role/listen_lambda_function",
          "environment": {},
          "defaultEnvironment": "dev"
      }

  22. S3 Event Handler

      import json                      # used in the continuation on slide 25
      import logging
      from datetime import datetime    # used in the continuation on slide 25

      import boto3

      log = logging.getLogger()
      log.setLevel(logging.DEBUG)

      def get_bucket_key(event):
          bucket = event['Records'][0]['s3']['bucket']['name']
          key = event['Records'][0]['s3']['object']['key']
          return bucket, key

      def handle(event, context):
          log.info('{}-{}'.format(event, context))
          bucket_name, key_name = get_bucket_key(event)

  23. S3 Event Handler

  24. S3 Event Handler

  25. S3 Event Handler (cont.)

      values = {
          'bucket_name': bucket_name,
          'key_name': key_name,
          'timestamp': datetime.utcnow().isoformat()
      }
      client = boto3.client('sns')  # publish() with a TopicArn is the SNS API
      client.publish(
          TopicArn=topic_arn,  # topic_arn is configured elsewhere (elided on the slide)
          Message=json.dumps(values)
      )
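
      As a concrete illustration (my sketch, not from the deck), this is the abridged shape of the S3 notification event that get_bucket_key() walks; the bucket and key names are invented:

      sample_event = {
          'Records': [{
              's3': {
                  'bucket': {'name': 'example-bucket'},   # hypothetical bucket
                  'object': {'key': 'incoming/data.csv'}  # hypothetical key
              }
          }]
      }

      assert get_bucket_key(sample_event) == ('example-bucket', 'incoming/data.csv')
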
  26. AWS Logging Permissions

      data "aws_iam_policy_document" "listener_logging" {
        statement {
          sid       = "AllowRoleToOutputCloudWatchLogs"
          effect    = "Allow"
          actions   = ["logs:*"]
          resources = ["*"]
        }
      }

      resource "aws_iam_policy" "listener_logs" {
        name        = "listener_logs"
        description = "Allow listener to log operations"
        policy      = "${data.aws_iam_policy_document.listener_logging.json}"
      }

  27. AWS IAM Role Assumption

      data "aws_iam_policy_document" "listener_lambda_assume_role" {
        statement {
          sid     = "AllowRoleToBeUsedbyLambda"
          effect  = "Allow"
          actions = ["sts:AssumeRole"]
          principals {
            type        = "Service"
            identifiers = ["lambda.amazonaws.com"]
          }
        }
      }

      resource "aws_iam_role" "listener_lambda_function" {
        name               = "listener_lambda_function"
        assume_role_policy = "${data.aws_iam_policy_document.listener_lambda_assume_role.json}"
      }

  28. AWS Policy Attachment

      resource "aws_iam_policy_attachment" "listener_logs_attach" {
        name       = "listener_logs_attach"
        roles      = ["${aws_iam_role.listener_lambda_function.name}"]
        policy_arn = "${aws_iam_policy.listener_logs.arn}"
      }

  29. Deploy Function and Infrastructure

      # apex deploy
      # apex infra plan
      # apex infra apply
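
      One hedged usage note of mine, not from the deck: if I recall the Apex CLI correctly, you can also smoke-test a deployed function by piping a JSON event to it, e.g. # apex invoke listener < event.json, where event.json is a hypothetical file holding a sample payload; treat the exact command form as an assumption.
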
  30. What about those other functions...

  31. Lambda Packages — Zappa Project/Gun.io

  32. Lambda Packages List — bcrypt, cffi, PyNaCl, datrie, lxml, misaka, MySQL-Python, numpy, OpenCV, Pillow (PIL), psycopg2, PyCrypto, cryptography, pyproj, python-ldap, python-Levenshtein, regex

  33. Dependency Handling in Apex

      "hooks": {
          "build": "pip install -r requirements.txt -t ."
      }
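
      A note from me for context: pip's -t/--target flag installs the requirements into the given directory (here ., the function directory), so the dependencies get bundled into the deployment package alongside main.py.
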
  34. Function Considerations — atomic — idempotent
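
      To make these two properties concrete, here is a minimal sketch of mine (not from the deck), assuming an S3-triggered function: it does one job (atomic) and writes its output to a key derived only from the input, so a redelivered event rewrites the same bytes (idempotent). The output bucket, key prefix, and transform() helper are hypothetical:

      import boto3

      s3 = boto3.client('s3')

      def transform(bucket, key):
          # hypothetical pure transform: the same input always yields the same bytes
          return 'processed {}/{}'.format(bucket, key).encode()

      def handle(event, context):
          bucket = event['Records'][0]['s3']['bucket']['name']
          key = event['Records'][0]['s3']['object']['key']
          # The output key is derived solely from the input key, so if Lambda
          # retries or re-delivers the event, we overwrite the same object
          # with identical content instead of duplicating work downstream.
          s3.put_object(Bucket='results-bucket',   # hypothetical output bucket
                        Key='processed/' + key,
                        Body=transform(bucket, key))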

  35. Dis is the Remix — Longer Jobs — Legacy Pipelines

  36. Hybrid Pipeline overview

  37. Hybrid Pipeline overview

  38. Hybrid Pipeline overview

  39. Hybrid Pipeline overview

  40. Hybrid Pipeline overview

  41. Closing thoughts — Workload and Phases are important — 14x improvement — 0.73x improvement

  42. Questions?