
Analyzing data at any scale with AWS Lambda

AWS re:Invent 2020

AWS Lambda functions provide a powerful compute environment that can be used to process and gain insights from data stored in databases, Amazon Aurora, object storage, and file systems. This session reviews options and techniques to optimize your data analytics platform without managing a server, and it focuses on unstructured (Amazon S3 and Amazon EFS) and structured (Amazon DynamoDB and Amazon Aurora) data, including integrations with Amazon Athena, an interactive query service.

Danilo Poccia

February 08, 2021

  1. © 2020, Amazon Web Services, Inc. or its affiliates. All rights reserved.

     Analyzing data at any scale with AWS Lambda
     Danilo Poccia, Chief Evangelist (EMEA), AWS
     SVS301
  2. Agenda

     • Event-driven metadata extraction and enrichment
     • Monte Carlo method, ensemble forecasting, ensemble learning
     • Using Amazon Elastic File System (Amazon EFS) to host Lambda function dependencies
     • Machine learning inference examples
     • Advanced use cases and demos
  3. Monte Carlo methods
     Repeated random sampling and statistical analysis

     • Physical sciences and engineering: computational physics, physical chemistry,
       aerodynamics, fluid dynamics, wireless networks
     • Finance and business: evaluate the risk and uncertainty that would affect the
       outcome of different decision options
     • Mathematics

     For example, execute 10 / 100 / 1,000 Lambda functions, each performing one or
     more random samplings.

     What’s the area of the “cloud” shape? How many hits and misses? The number of
     hits divided by the total number of runs tends to the same ratio as the two areas.
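The hit/miss area estimate above can be sketched as a single Lambda handler. This is a minimal illustration, not the talk's actual code: a quarter circle stands in for the "cloud" shape, and the `samples` payload field is a hypothetical convention for how each invocation learns its batch size.

```python
import json
import random


def lambda_handler(event, context):
    # Each invocation performs one independent batch of random samples;
    # fan out 10 / 100 / 1,000 invocations and aggregate the hit counts.
    samples = event.get("samples", 100_000)
    hits = 0
    for _ in range(samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            hits += 1
    # hits / samples tends to the ratio of the two areas (pi/4 here)
    return {
        "statusCode": 200,
        "body": json.dumps({
            "hits": hits,
            "samples": samples,
            "pi_estimate": 4 * hits / samples,
        }),
    }
```

Summing `hits` and `samples` across all invocations gives a single, more accurate estimate.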
  4. Ensemble forecasting
     Instead of making a single forecast of the most likely scenario, a set (or
     ensemble) of forecasts is produced

     • This set of forecasts aims to give a better indication of the range of
       possible future states
     • Useful when there is uncertainty/error in the input parameters, or when a
       system is highly sensitive to initial conditions, such as chaotic dynamical
       systems like weather forecasting
     • For example, execute 10 / 100 / 1,000 Lambda functions, each with slightly
       different input parameters, and then compare the results
     • If 56 out of 100 weather forecast simulations expect rain, then:
       “There is a 56% chance of rain”
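A local sketch of that fan-out idea: in a real deployment each run would be a separate Lambda invocation (e.g. via boto3 or an event-driven fan-out), and `forecast` here is a hypothetical toy model standing in for a real simulation.

```python
import random


def forecast(initial_temp, perturbation):
    # Hypothetical toy model: "predicts rain" when the perturbed input
    # crosses a threshold; a real simulation would go here.
    return (initial_temp + perturbation) < 20.0


def ensemble_forecast(initial_temp, runs=100):
    # Each run uses slightly different input parameters, just as each
    # Lambda function would in the fan-out described on this slide.
    rainy = sum(forecast(initial_temp, random.gauss(0, 1.5))
                for _ in range(runs))
    return rainy / runs  # e.g. 0.56 -> "There is a 56% chance of rain"
```

The fraction of ensemble members predicting an outcome becomes the probability you report.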
  5. Ensemble learning
     Using multiple learning algorithms to obtain better predictive performance
     than could be obtained from any of the constituent learning algorithms alone

     • For example, run 10 / 100 / 1,000 Lambda functions training different (and
       relatively simple) machine learning algorithms
     • To run inference, you can combine all or a subset of the results

     [Diagram: dataset → training → models → per-model predictions → ensemble’s prediction]
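Combining the constituent models' results can be as simple as a majority vote. A minimal sketch, assuming each model is a callable returning a class label; the aggregation would run after collecting the outputs of all (or a subset of) the Lambda functions:

```python
from collections import Counter


def majority_vote(predictions):
    # predictions: one class label per model for a single sample,
    # e.g. [0, 1, 0, 0] collected from four independently trained models
    return Counter(predictions).most_common(1)[0][0]


def ensemble_predict(models, sample):
    # Combine all (or a subset of) the constituent models' results
    return majority_vote([model(sample) for model in models])
```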
  6. AWS Layer for Python: NumPy and SciPy

     For example, for Python 3.8:
     • NumPy 1.19.0
     • SciPy 1.5.1
  7. Using Amazon EFS for hosting dependencies

     • Extends the reach of Lambda to new use cases
     • Machine learning training
       § CPU-based, 15-minute time limit
       § Improving results with ensemble forecasts or ensemble learning
     • Machine learning inference
       § Using CPUs
       § Evaluate your latency requirements (sync vs. async invocations)
       § Use Provisioned Concurrency to avoid cold starts
     • You can use an Amazon Elastic Compute Cloud (Amazon EC2) instance to install
       dependencies and copy them to an Amazon EFS file system
       § Some tools created by the AWS community are great! For example:
         https://github.com/lambci/cmda
  8. Machine learning inference (with Amazon EFS)

     • Using PyTorch
     • Example outputs: {"bird_class": "106.Horned_Puffin"},
       {"bird_class": "053.Western_Grebe"}

     https://aws.amazon.com/blogs/aws/new-a-shared-file-system-for-your-lambda-functions/
  9. Using data science dependencies in Python

     • Add the following dependencies to your requirements.txt:
       numpy
       scipy
       scikit-learn
       pandas
       matplotlib
       ipython
       jupyter
       jupyterlab
       papermill
 10. Using data science dependencies in Python

     • Mount Amazon EFS at instance launch; by default it will be on /mnt/efs/fs1/
     • Then, you can use these commands (on Amazon Linux 2):
       § sudo yum update -y   # reboot if the kernel is updated
       § sudo mkdir -p /mnt/efs/fs1/DataScience/lib
       § sudo chown -R ec2-user:ec2-user /mnt/efs/fs1/DataScience
       § sudo amazon-linux-extras install python3.8
       § pip3.8 install --user wheel
       § pip3.8 install -t /mnt/efs/fs1/DataScience/lib -r requirements.txt
 11. Using data science dependencies in Python

     • Create an Amazon EFS access point for /DataScience
       § On Amazon Linux, you can use ec2-user’s UID and GID
     • Connect the Lambda function to the Amazon Virtual Private Cloud (Amazon VPC)
       § To connect to other AWS services, such as Amazon S3 or Amazon DynamoDB,
         use Amazon VPC endpoints
       § To reach the public internet, use private subnets + a NAT gateway
     • Add the file system to the function using the Amazon EFS access point
       § For example, mount it under /mnt/DataScience
     • Configure the function environment to use the dependencies
       § For example, in Python set PYTHONPATH = /mnt/DataScience/lib
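The steps above can be expressed declaratively; a hypothetical AWS SAM fragment wiring a function to the access point (resource names, security group, and subnet IDs are placeholders):

```yaml
DataScienceFunction:
  Type: AWS::Serverless::Function
  Properties:
    Runtime: python3.8
    Handler: app.lambda_handler
    VpcConfig:
      SecurityGroupIds: [sg-EXAMPLE]
      SubnetIds: [subnet-EXAMPLE]          # private subnets + NAT gateway
    FileSystemConfigs:
      - Arn: !GetAtt DataScienceAccessPoint.Arn   # the /DataScience access point
        LocalMountPath: /mnt/DataScience
    Environment:
      Variables:
        PYTHONPATH: /mnt/DataScience/lib   # pick up the installed dependencies
```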
 12. Random Forest Classifier
     Using scikit-learn for classification

     from sklearn.ensemble import RandomForestClassifier
     import json

     def lambda_handler(event, context):
         clf = RandomForestClassifier(random_state=0)
         X = [[ 1,  2,  3],
              [11, 12, 13]]  # 2 samples, 3 features
         y = [0, 1]          # classes of each sample
         clf.fit(X, y)       # fitting the classifier
         A = [[4, 5, 6], [14, 15, 16], [3, 2, 1], [17, 15, 13]]
         result = {          # getting predictions
             'type': 'RandomForestClassifier',
             'predict({})'.format(X): '{}'.format(clf.predict(X)),
             'predict({})'.format(A): '{}'.format(clf.predict(A))
         }
         return {            # returned by Amazon API Gateway
             'statusCode': 200,
             'body': json.dumps(result)
         }
 13. Logistic Regression
     Using scikit-learn for regression, using a pipeline

     from sklearn.preprocessing import StandardScaler
     from sklearn.linear_model import LogisticRegression
     . . .
     import json

     pipe = make_pipeline(  # create a pipeline object
         StandardScaler(),
         LogisticRegression(random_state=0)
     )
     X, y = load_iris(return_X_y=True)  # load the iris dataset
     X_train, X_test, y_train, y_test = train_test_split(
         X, y, random_state=0)          # split train / test sets
     pipe.fit(X_train, y_train)         # fit the whole pipeline (training)

     def lambda_handler(event, context):
         result = {  # accuracy returned by Amazon API Gateway
             'accuracy_score': accuracy_score(pipe.predict(X_test), y_test)
         }
         return {
             'statusCode': 200,
             'body': json.dumps(result)
         }
 14. Matplotlib Image on S3
     Using Matplotlib with Amazon S3 and Amazon API Gateway

     import matplotlib
     import matplotlib.pyplot as plt
     . . .
     plt.figure(figsize=(len(anomaly_algorithms) * 2 + 3, 12.5))
     plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96,
                         wspace=.05, hspace=.01)
     . . .
     img_data = io.BytesIO()  # you can’t just do plt.show()
     plt.savefig(img_data, format='png')
     img_data.seek(0)

     # Uploading to an Amazon S3 bucket
     s3 = boto3.resource('s3')
     bucket = s3.Bucket(OUTPUT_BUCKET)
     bucket.put_object(Body=img_data, ContentType='image/png',
                       Key=OUTPUT_KEY, ACL='public-read')
     image_url = 'https://{}.s3.amazonaws.com/{}'.format(OUTPUT_BUCKET, OUTPUT_KEY)

     def lambda_handler(event, context):
         # HTTP redirect to the Amazon S3 object
         return {
             'statusCode': 302,  # 301 would be permanent and cached
             'headers': { 'Location': image_url }
         }
 15. Running Jupyter notebooks using Papermill
     A tool for parameterizing and executing Jupyter notebooks

     • Use cases
       § A periodic report to execute with different parameters depending on time/date
       § Run a notebook and then automate actions based on the result
       § Run a notebook as part of a workflow
     • For more info
       § https://papermill.readthedocs.io/
       § https://github.com/nteract/papermill
 16. Running Jupyter notebooks using Papermill
     Event-driven Jupyter notebooks on Amazon S3

     • Upload the Jupyter notebook to Amazon S3
     • Use Amazon S3 user-defined metadata for the parameters
       § S3 user-defined metadata is limited to 2 KB in size
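One way to turn S3 user-defined metadata into Papermill parameters, as a sketch: the single `parameters` JSON key is a hypothetical convention (not from the talk), and the boto3 call in the comment shows where it would plug into the handler.

```python
import json


def parameters_from_metadata(metadata):
    # boto3 returns user-defined metadata with lowercased keys and the
    # "x-amz-meta-" prefix already stripped; values are strings, and the
    # whole set is limited to 2 KB, so keep parameters small.
    # Hypothetical convention: one "parameters" key holding a JSON object.
    return json.loads(metadata.get("parameters", "{}"))


# In the Lambda handler, the metadata would be fetched with boto3, e.g.:
#   head = boto3.client("s3").head_object(Bucket=bucket, Key=key)
#   parameters = parameters_from_metadata(head["Metadata"])
#   pm.execute_notebook(input_notebook, output_notebook,
#                       parameters=parameters)
```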
 17. Using Papermill with Amazon S3
     Amazon S3 URL syntax is supported by Papermill to read and write notebooks

     import papermill as pm
     . . .
     sys.path.append("/opt/bin")
     sys.path.append("/opt/python")
     os.environ["IPYTHONDIR"] = '/tmp/ipythondir'  # to run Python inside Jupyter
     . . .
     # Using S3 URLs for input and output
     input_notebook = 's3://{}/{}'.format(bucket, key)
     output_notebook = 's3://{}/{}'.format(OUTPUT_BUCKET, key)
     . . .
     # Executing the Jupyter notebook
     pm.execute_notebook(
         input_notebook,
         output_notebook,
         parameters = parameters
     )
 18. Takeaways

     • Use event-driven architectures for metadata extraction, enrichment, and indexing
     • For more advanced use cases, manage dependencies using an Amazon EFS file
       system to use tools like scikit-learn and Matplotlib
     • Simplify machine learning inference with frameworks such as PyTorch and
       TensorFlow running in Lambda functions
     • Automate Jupyter notebook execution using Papermill
 19. Thank you!

     https://github.com/danilop/analyzing-data-aws-lambda
     @danilop