Analyzing data at any scale with AWS Lambda

AWS re:Invent 2020

AWS Lambda functions provide a powerful compute environment that can be used to process and gain insights from data stored in databases, Amazon Aurora, object storage, and file systems. This session reviews options and techniques to optimize your data analytics platform without managing a server, and it focuses on unstructured (Amazon S3 and Amazon EFS) and structured (Amazon DynamoDB and Amazon Aurora) data, including integrations with Amazon Athena, an interactive query service.

Danilo Poccia

February 08, 2021

Transcript

  1. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    Analyzing data at any scale
    with AWS Lambda
    Danilo Poccia
    Chief Evangelist (EMEA)
    AWS
    SVS301

  2. Agenda
    • Event-driven metadata extraction and enrichment
    • Monte Carlo method, ensemble forecasting, ensemble learning
    • Using Amazon Elastic File System (Amazon EFS) to host Lambda function dependencies
    • Machine learning inference examples
    • Advanced use cases and demos

  3. Event-driven metadata extraction + enrichment

  4. Monte Carlo methods
    Repeated random sampling and statistical analysis
    • Physical sciences and engineering: computational physics, physical chemistry, aerodynamics, fluid dynamics, wireless networks
    • Finance and business: evaluate the risk and uncertainty that would affect the outcome of different decision options
    • Mathematics: for example, execute 10 / 100 / 1,000 Lambda functions, each performing one or more random samplings
    What’s the area of the “cloud” shape? How many hits and misses? The number of hits divided by the total number of runs tends to the same ratio as the two areas.
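    The hit/miss idea on this slide can be sketched in a few lines of Python. As an illustration (the handler shape and the `samples` event field are assumptions, not from the deck), the “shape” is a unit circle inside a 2×2 square, so hits divided by runs tends to the ratio of the two areas, π/4:

```python
import random

def estimate_circle_area(num_samples, seed=0):
    """Estimate the area of a unit circle by random sampling.

    Points are drawn uniformly in the enclosing 2x2 square; the
    fraction of hits inside the circle approaches the ratio of the
    two areas, so 4 * hits / runs tends to pi.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:  # hit: the point falls inside the circle
            hits += 1
    return 4.0 * hits / num_samples  # the enclosing square has area 4

def lambda_handler(event, context):
    # Each invocation performs one batch of random sampling; many
    # parallel invocations can be averaged by the caller.
    samples = int(event.get("samples", 100_000))
    return {"area_estimate": estimate_circle_area(samples)}
```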

  5. Ensemble forecasting
    Instead of making a single forecast of the most likely scenario, a set (or ensemble) of forecasts is produced
    • This set of forecasts aims to give a better indication of the range of possible future states
    • Useful when there is uncertainty/error in the input parameters, or a system is highly sensitive to initial conditions, such as chaotic dynamical systems like weather forecasting
    • For example, execute 10 / 100 / 1,000 Lambda functions, each with slightly different input parameters, and then compare the results
    • If 56 out of 100 weather forecast simulations expect rain, then, “There is a 56% chance of rain”
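    The fan-out pattern described above can be sketched locally. In a real deployment each `simulate` call would be an asynchronous Lambda invocation; the names here (`run_ensemble`, `perturb`, `toy_simulate`) are illustrative, not from the deck:

```python
import random

def perturb(base_params, scale, rng):
    """Return a copy of the input parameters with small random noise added."""
    return {k: v + rng.gauss(0, scale) for k, v in base_params.items()}

def run_ensemble(simulate, base_params, runs=100, scale=0.01, seed=0):
    """Run the same model many times with slightly different inputs.

    In a Lambda-based setup each call to `simulate` would be an
    asynchronous Invoke of a forecasting function; here it is a local
    callable so the aggregation logic can be shown end to end.
    """
    rng = random.Random(seed)
    results = [simulate(perturb(base_params, scale, rng)) for _ in range(runs)]
    rain = sum(1 for r in results if r["rain"])
    return {"chance_of_rain": 100.0 * rain / runs}

def toy_simulate(params):
    # Stand-in for a real forecast model: rain whenever the perturbed
    # humidity crosses a threshold.
    return {"rain": params["humidity"] > 0.5}
```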

  6. Ensemble learning
    Using multiple learning algorithms to obtain better predictive performance than what could be obtained from any of the constituent learning algorithms alone
    • For example, run 10 / 100 / 1,000 Lambda functions training different (and relatively simple) machine learning algorithms
    • To run inference, you can combine all or a subset of the results
    (Diagram: dataset → training → models → predictions → ensemble’s prediction)
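    Combining the results can be as simple as a majority vote across the per-model predictions. This helper is an illustrative sketch, not code from the deck:

```python
from collections import Counter

def ensemble_predict(model_predictions):
    """Combine per-model predictions by majority vote.

    `model_predictions` is a list of prediction lists, one per model
    (for example, one per Lambda function that trained a simple model).
    Each inner list holds the predicted class for every sample.
    """
    combined = []
    for votes in zip(*model_predictions):  # one tuple of votes per sample
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```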

  7. For example, for Python 3.8
    • NumPy 1.19.0
    • SciPy 1.5.1
    AWS Layer for Python: NumPy and SciPy

  8. • Extends the reach of Lambda to new use cases
    • Machine learning training
    § CPU-based, 15-minute time limit
    § Improving results with ensemble forecasts or ensemble learning
    • Machine learning inference
    § Using CPUs
    § Evaluate your latency requirements (sync vs async invocations)
    § Use Provisioned Concurrency to avoid cold starts
    • You can use an Amazon Elastic Compute Cloud (Amazon EC2)
    instance to install dependencies and copy them to an Amazon EFS
    file system
    § Some tools created by the AWS community are great! For example
    – https://github.com/lambci/cmda
    Using Amazon EFS for hosting dependencies

  9. • Using PyTorch
    Machine learning inference (with Amazon EFS)
    {"bird_class": "106.Horned_Puffin"} {"bird_class": "053.Western_Grebe"}
    https://aws.amazon.com/blogs/aws/new-a-shared-file-system-for-your-lambda-functions/

  10. https://aws.amazon.com/blogs/compute/building-deep-learning-inference-with-aws-lambda-and-amazon-efs/
    • Using TensorFlow
    Machine learning inference (with Amazon EFS)

  11. https://aws.amazon.com/blogs/compute/pay-as-you-go-machine-learning-inference-with-aws-lambda/
    • Using XGBoost
    Machine learning inference (with Amazon EFS)

  12. • Add the following dependencies to your requirements.txt
    numpy
    scipy
    scikit-learn
    pandas
    matplotlib
    ipython
    jupyter
    jupyterlab
    papermill
    Using data science dependencies in Python

  13. • Mount Amazon EFS at instance launch; by default it is mounted on
    /mnt/efs/fs1/
    • Then, you can use these commands (on Amazon Linux 2):
    § sudo yum update -y # reboot if the kernel is updated
    § sudo mkdir -p /mnt/efs/fs1/DataScience/lib
    § sudo chown -R ec2-user:ec2-user /mnt/efs/fs1/DataScience
    § sudo amazon-linux-extras install python3.8
    § pip3.8 install --user wheel
    § pip3.8 install -t /mnt/efs/fs1/DataScience/lib -r requirements.txt
    Using data science dependencies in Python

  14. • Create Amazon EFS access point for /DataScience
    § On Amazon Linux, you can use ec2-user’s UID and GID
    • Connect Lambda function to the Amazon Virtual Private Cloud
    (Amazon VPC)
    § To connect to other AWS services, such as Amazon S3 or Amazon DynamoDB, use Amazon
    VPC Endpoints
    § To reach the public internet, use private subnets + NAT Gateway
    • Add file system to function using Amazon EFS access point
    § For example, mount under /mnt/DataScience
    • Configure function environment to use dependencies
    § For example, in Python, set PYTHONPATH=/mnt/DataScience/lib
    Using data science dependencies in Python
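    If setting PYTHONPATH in the function configuration is not an option, the same effect can be obtained at import time by appending the EFS path. This is a sketch assuming the /mnt/DataScience mount point used on this slide:

```python
import os
import sys

# Assumed mount point from the slide: the EFS access point is attached
# to the function under /mnt/DataScience, with dependencies in lib/.
EFS_LIB = os.environ.get("PYTHONPATH", "/mnt/DataScience/lib")

# Appending at import time mirrors what PYTHONPATH would do: packages
# installed on the EFS file system become importable by the handler.
if EFS_LIB not in sys.path:
    sys.path.append(EFS_LIB)
```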

  16. Random Forest Classifier
    Using scikit-learn for classification

    from sklearn.ensemble import RandomForestClassifier
    import json

    def lambda_handler(event, context):
        # Fitting the classifier
        clf = RandomForestClassifier(random_state=0)
        X = [[1, 2, 3], [11, 12, 13]]  # 2 samples, 3 features
        y = [0, 1]                     # classes of each sample
        clf.fit(X, y)
        # Getting predictions
        A = [[4, 5, 6], [14, 15, 16], [3, 2, 1], [17, 15, 13]]
        result = {
            'type': 'RandomForestClassifier',
            'predict({})'.format(X): '{}'.format(clf.predict(X)),
            'predict({})'.format(A): '{}'.format(clf.predict(A))
        }
        # Returned by Amazon API Gateway
        return {
            'statusCode': 200,
            'body': json.dumps(result)
        }

  17. Logistic Regression
    Using scikit-learn for regression using a pipeline

    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    import json

    # Pipeline
    pipe = make_pipeline(  # create a pipeline object
        StandardScaler(),
        LogisticRegression(random_state=0)
    )

    # Training
    X, y = load_iris(return_X_y=True)  # load the iris dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)  # split train/test sets
    pipe.fit(X_train, y_train)  # fit the whole pipeline

    def lambda_handler(event, context):
        # Accuracy returned by Amazon API Gateway
        result = {
            'accuracy_score': accuracy_score(pipe.predict(X_test), y_test)
        }
        return {
            'statusCode': 200,
            'body': json.dumps(result)
        }

  18. Matplotlib Image on S3
    Using Matplotlib with Amazon S3 and Amazon API Gateway

    import io
    import boto3
    import matplotlib
    import matplotlib.pyplot as plt
    . . .
    plt.figure(figsize=(len(anomaly_algorithms) * 2 + 3, 12.5))
    plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05, hspace=.01)
    . . .
    img_data = io.BytesIO()  # You can’t just do plt.show()
    plt.savefig(img_data, format='png')
    img_data.seek(0)

    # Uploading to an Amazon S3 bucket
    s3 = boto3.resource('s3')
    bucket = s3.Bucket(OUTPUT_BUCKET)
    bucket.put_object(Body=img_data, ContentType='image/png', Key=OUTPUT_KEY, ACL='public-read')
    image_url = 'https://{}.s3.amazonaws.com/{}'.format(OUTPUT_BUCKET, OUTPUT_KEY)

    def lambda_handler(event, context):
        # HTTP redirect to the Amazon S3 object
        return {
            'statusCode': 302,  # 301 would be permanent and cached
            'headers': {
                'Location': image_url
            }
        }

  20. Running Jupyter notebooks using Papermill
    A tool for parameterizing and executing Jupyter notebooks
    • Use cases
    § A periodic report to execute with different parameters depending on time/date
    § Run a notebook and then automate actions based on the result
    § Run a notebook as part of a workflow
    • For more info
    § https://papermill.readthedocs.io/
    § https://github.com/nteract/papermill

  21. Running Jupyter notebooks using Papermill
    Event-driven Jupyter notebooks on Amazon S3
    • Upload Jupyter notebook to Amazon S3
    • Use Amazon S3 user-defined metadata for parameters
    • S3 user-defined metadata is limited to 2 KB in size
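    One way to turn the S3 user-defined metadata into Papermill parameters is a small helper like this sketch (the prefix handling follows S3’s `x-amz-meta-` header convention; the function name is illustrative):

```python
def parameters_from_metadata(metadata, prefix="x-amz-meta-"):
    """Extract notebook parameters from S3 user-defined metadata.

    User-defined metadata keys carry an 'x-amz-meta-' prefix on the
    wire (boto3's head_object already strips it); values are strings
    and the whole set is limited to 2 KB, so only small parameters fit.
    """
    params = {}
    for key, value in metadata.items():
        name = key[len(prefix):] if key.startswith(prefix) else key
        params[name] = value
    return params
```

The resulting dictionary can be passed straight to `papermill.execute_notebook(..., parameters=params)` as on the next slide.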

  22. Using Papermill with Amazon S3
    Amazon S3 URL syntax is supported by Papermill to read and write notebooks

    import papermill as pm
    . . .
    # To run Python inside Jupyter
    sys.path.append("/opt/bin")
    sys.path.append("/opt/python")
    os.environ["IPYTHONDIR"] = '/tmp/ipythondir'
    . . .
    # Using S3 URLs for input and output
    input_notebook = 's3://{}/{}'.format(bucket, key)
    output_notebook = 's3://{}/{}'.format(OUTPUT_BUCKET, key)
    . . .
    # Executing the Jupyter notebook
    pm.execute_notebook(
        input_notebook,
        output_notebook,
        parameters=parameters
    )

  23. Takeaways

  24. • Use event-driven architectures for metadata extraction, enrichment,
    and indexing
    • For more advanced use cases, manage dependencies using an Amazon
    EFS file system to use tools like Scikit-learn and Matplotlib
    • Simplify machine learning inference with frameworks such as PyTorch
    and TensorFlow running in Lambda functions
    • Automate Jupyter notebook execution using Papermill
    Takeaways

  25. Thank you!
    https://github.com/danilop/analyzing-data-aws-lambda
    @danilop
