Mercedes Coyle - Build Serverless Realtime Data Pipelines with Python and AWS Lambda

At Scripps Networks Living, we operate a network of video players that generates around 100 million events per day. To process, store, and analyze this data, we run batch and realtime data pipelines based on Lambda Architecture principles. After outgrowing our original events system, we rebuilt it from the ground up on AWS services, applying the lessons learned from that original system.

https://us.pycon.org/2016/schedule/presentation/2237/

PyCon 2016

May 29, 2016

Transcript

  1. ZERO INFRASTRUCTURE*
    Building Realtime* Data Pipelines
    with Python and AWS Lambda
    *well, almost.
    Mercedes Coyle
    @benzobot
    Cloud Operations Engineer

  2. WHAT I’LL COVER
    ‣ Realtime/Streaming Systems Architecture
    ‣ AWS Serverless Components:
    ‣ Kinesis
    ‣ Lambda
    ‣ API Gateway
    ‣ EMR

  3. USE CASE
    ‣ Online video syndication platform
    ‣ Connects content providers, video publishers, and advertisers
    ‣ 2-3 million video streams per day

  4. INTRO: REALTIME SYSTEMS
    ‣ What does “Realtime” mean?
    ‣ Event based
    ‣ Near realtime - up to several seconds between data origin and destination

  5. Architecture
    LEGACY DATA SYSTEM

  6. What we learned
    LEGACY DATA SYSTEM
    ‣ Need for faster data analysis
    ‣ Avoid logging to disk as a method of data collection
    ‣ Scheduled jobs are not intelligent
    ‣ Mangled data

  7. System Requirements
    GOING SERVERLESS
    ‣ Allow for streaming analytics
    ‣ Reduce system complexity
    ‣ Data source and storage agnostic
    ‣ Flexibility

  8. Architecture
    SERVERLESS DATA SYSTEM

  9. AWS Services
    GOING SERVERLESS
    ‣ API Gateway
    ‣ Kinesis Streams
    ‣ Lambda
    ‣ S3
    ‣ EMR

  10. API GATEWAY
    ‣ Quick and easy to set up
    ‣ Public HTTP interface, or restrict access with API keys
    ‣ Can trigger Lambda or go directly to a Kinesis stream
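
As a rough illustration of the "trigger Lambda" path, the sketch below shows a handler that receives an event passed through API Gateway and forwards it to a Kinesis stream. The stream name `video-events`, the `player_id` field, and the optional `client` argument (added so the function can be exercised without AWS access) are assumptions made for this example, not details from the talk.

```python
import json

STREAM_NAME = "video-events"  # hypothetical stream name


def handler(event, context, client=None):
    """Forward one API Gateway request payload to a Kinesis stream."""
    if client is None:
        # boto3 ships with the Lambda Python runtime.
        import boto3
        client = boto3.client("kinesis")

    # Records that share a partition key are routed to the same shard.
    partition_key = str(event.get("player_id", "unknown"))  # assumed field
    response = client.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=partition_key,
    )
    return {"sequence_number": response["SequenceNumber"]}
```

Passing the client in as an argument is only a convenience for local testing; inside Lambda the default branch creates a real Kinesis client.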

  11. Queueing Service
    KINESIS STREAMS
    ‣ HTTP PUT single or batched records
    ‣ 7 day data retention (extended from the 24 hour default)
    ‣ Multiple subscribers
    ‣ Horizontally scalable
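
To make the "single or batched records" point concrete, here is a minimal sketch of a batching producer. It assumes a boto3-style Kinesis client is passed in and that events are dicts carrying a `player_id` field (a hypothetical name). The only hard fact it relies on is that a single `put_records` call accepts at most 500 records.

```python
import json

MAX_BATCH = 500  # PutRecords accepts at most 500 records per request


def to_entries(events, key_field="player_id"):
    """Convert event dicts into PutRecords entries."""
    return [
        {
            "Data": json.dumps(e).encode("utf-8"),
            "PartitionKey": str(e.get(key_field, "unknown")),
        }
        for e in events
    ]


def put_batched(client, stream_name, events):
    """Send events in chunks and return the number of failed records."""
    entries = to_entries(events)
    failed = 0
    for start in range(0, len(entries), MAX_BATCH):
        chunk = entries[start:start + MAX_BATCH]
        response = client.put_records(StreamName=stream_name, Records=chunk)
        failed += response.get("FailedRecordCount", 0)
    return failed
```

A batched put is not all-or-nothing, so a production producer would likely inspect the per-record results and retry the failed entries rather than only counting them.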

  12. Simple Storage Service
    S3
    ‣ Simple Storage Service
    ‣ Stores file objects, not a traditional file system
    ‣ Categorize file objects by buckets
    ‣ ///filename.bz2

  13. Elastic Map Reduce
    EMR
    ‣ Managed Hadoop cluster
    ‣ Spin up, process, destroy

  14. Features
    AWS LAMBDA
    ‣ Event driven push/pull
    ‣ Scales up/down automatically
    ‣ Supports Python, Node.js, and Java
    ‣ Stateless and Asynchronous

  15. The basics
    LAMBDA: ANATOMY

  16. Event and Context Object
    LAMBDA: ANATOMY
    ‣ Event Data
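
The shape of the event data depends on what triggers the function. As one possible illustration, assuming the function is subscribed to a Kinesis stream and producers write JSON payloads, each entry in `event["Records"]` carries its payload base64-encoded under `record["kinesis"]["data"]`. The `decode_records` helper name is made up for this sketch.

```python
import base64
import json


def decode_records(event):
    """Return the JSON payloads carried by a Kinesis-triggered event."""
    payloads = []
    for record in event.get("Records", []):
        # Kinesis delivers each payload as a base64-encoded string.
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw.decode("utf-8")))
    return payloads


def handler(event, context):
    payloads = decode_records(event)
    return {"processed": len(payloads)}
```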

  17. Event and Context Object
    LAMBDA: ANATOMY
    ‣ The context object is metadata about the running function
    ‣ context.get_remaining_time_in_millis()
    ‣ context.aws_request_id
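
A short sketch of how the context object might be used inside a handler: `get_remaining_time_in_millis()` is called as a method, while `aws_request_id` is read as a plain attribute. The `items` field on the event and the one second safety margin are assumptions chosen for the example.

```python
def handler(event, context):
    """Process items until the invocation is close to timing out."""
    processed = 0
    for _item in event.get("items", []):  # "items" is an assumed field
        # Stop early if less than one second of runtime remains.
        if context.get_remaining_time_in_millis() < 1000:
            break
        processed += 1
    return {"request_id": context.aws_request_id, "processed": processed}
```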

  18. Runtime
    LAMBDA: ANATOMY
    ‣ Key design feature is statelessness
    ‣ Lambda functions don’t know anything about previous events
    ‣ Automatic retry on failure

  19. Runtime
    LAMBDA: ANATOMY

  20. Logging and Monitoring
    LAMBDA: ANATOMY
    ‣ Any print or logging statement is logged to CloudWatch
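
For instance, a handler along these lines would produce two log entries per invocation, one from `print` and one from the standard `logging` module. The message text and the `Records` field are illustrative only.

```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def handler(event, context):
    count = len(event.get("Records", []))
    print("received event")                 # plain stdout output
    logger.info("record count: %d", count)  # standard logging call
    return count
```

Run locally, the `logger.info` line prints nothing unless a logging handler is configured; inside Lambda the runtime takes care of that.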

  21. Logging and Monitoring
    LAMBDA: ANATOMY
    ‣ Metrics dashboard displays high level performance data

  22. LAMBDA: TESTING
    ‣ Can test Lambda code like any other Python code with your preferred testing framework
    ‣ Invoke Lambda functions manually from the AWS CLI
    ‣ aws lambda invoke --invocation-type DryRun --function-name put-events-kinesis --payload '{"test":"data"}' outfile
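
As a sketch of the first point, the test below exercises a handler with the standard library's `unittest` and a mock in place of the Kinesis client, so no AWS access is needed. The handler shown here is a hypothetical stand-in, not the talk's `put-events-kinesis` function, and the stream name is invented.

```python
import json
import unittest
from unittest import mock


def handler(event, context, client):
    """Hypothetical function under test: writes the event to Kinesis."""
    client.put_record(
        StreamName="video-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey="test",
    )
    return "ok"


class HandlerTest(unittest.TestCase):
    def test_writes_one_record(self):
        client = mock.Mock()
        result = handler({"test": "data"}, None, client)
        self.assertEqual(result, "ok")
        client.put_record.assert_called_once()
        kwargs = client.put_record.call_args[1]
        sent = json.loads(kwargs["Data"].decode("utf-8"))
        self.assertEqual(sent, {"test": "data"})
```

Saved in a test module, this can be run with `python -m unittest`; the same structure works with other test runners.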

  23. LAMBDA: PACKAGING
    ‣ Need to create a zip file of function code and any dependencies
    ‣ Can pip install -t /project-dir/ and zip contents of that directory
    ‣ Or you can include the contents of /lib/python2.7/site-packages/ in the zip
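
The zipping step can also be scripted. The sketch below, using only the standard library, archives the contents of a directory so that files sit at the root of the zip rather than under a parent folder. The function name `build_package` and its arguments are made up for this example, and the zip file is assumed to be written outside the source directory so it does not end up inside itself.

```python
import os
import zipfile


def build_package(source_dir, zip_path):
    """Zip the contents of source_dir so files sit at the archive root."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                # Store paths relative to source_dir, not the directory itself.
                archive.write(full_path, os.path.relpath(full_path, source_dir))
    return zip_path
```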

  24. LAMBDA: DEPLOYMENT
    ‣ AWS CLI from Travis CI job
    ‣ CloudFormation template
    ‣ Upload to S3 and deploy via Lambda
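
One way the "upload to S3 and deploy" option might look from Python, assuming the zip has already been uploaded and a boto3 Lambda client is passed in. The helper name `deploy_from_s3` and every argument value in the usage note are placeholders, not names from the talk.

```python
def deploy_from_s3(client, function_name, bucket, key):
    """Point an existing Lambda function at a package uploaded to S3."""
    response = client.update_function_code(
        FunctionName=function_name,
        S3Bucket=bucket,
        S3Key=key,
    )
    # The response includes a hash of the deployed package.
    return response["CodeSha256"]
```

A call might look like `deploy_from_s3(boto3.client("lambda"), "put-events-kinesis", "my-bucket", "function.zip")`, where the bucket and key are hypothetical.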

  25. Takeaways and Lessons Learned
    SUMMARY
    ‣ Python 2.7 only
    ‣ Faster development cycles and data insights
    ‣ Code more for business goals, less for infrastructure
    ‣ Factor in maintenance and operational costs when pricing out

  26. SERVERLESS*
    *Not all servers

  27. RESOURCES
    ‣ https://aws.amazon.com/blogs/compute/microservices-without-the-servers/
    ‣ http://redmonk.com/fryan/2016/04/28/serverless-volume-compute-for-a-new-generation/
    ‣ http://blogs.aws.amazon.com/bigdata/post/Tx2Z24D4T99AN35/Snakes-in-the-Stream-Feeding-and-Eating-Amazon-Kinesis-Streams-with-Python
    ‣ https://docs.aws.amazon.com/lambda/latest/dg/intro-core-components.html
    ‣ https://docs.aws.amazon.com/lambda/latest/dg/limits.html
    ‣ https://github.com/spulec/moto

  28. THANK YOU!
    Mercedes Coyle
    @benzobot
