Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Using Machine Learning From Your Databases

Using Machine Learning From Your Databases

AWS Cambridge User Group, November 10th, 2020

Machine Learning allows you to get better insights from your data. But where is most of the structured data stored? In databases! Amazon Aurora is now natively integrated with Amazon SageMaker, a service providing you with the ability to build, train, and deploy custom machine learning models quickly, and Amazon Comprehend, a natural language processing (NLP) service that uses machine learning to find insights in text. Let’s see how this integration works!

Danilo Poccia

November 10, 2020
Tweet

More Decks by Danilo Poccia

Other Decks in Programming

Transcript

  1. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Danilo Poccia, Chief Evangelist (EMEA) @danilop Using Machine Learning From Your Database
  2. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Where to store data? Unstructured Structured Semi-Structured Object Storage Database LOB (Large OBject) NoSQL Database Relational Database Relational Database with JSON Document Extensions NoSQL Database Transient Memory Data Time Series Database
  3. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases (2017) Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, Xiaofeng Bao Amazon Web Services ABSTRACT Amazon Aurora is a relational database service for OLTP workloads offered as part of Amazon Web Services (AWS). In this paper, we describe the architecture of Aurora and the design considerations leading to that architecture. We believe the central constraint in high throughput data processing has moved from compute and storage to the network. Aurora brings a novel architecture to the relational database to address this constraint, most notably by pushing redo processing to a multi-tenant scale- out storage service, purpose-built for Aurora. We describe how doing so not only reduces network traffic, but also allows for fast crash recovery, failovers to replicas without loss of data, and fault-tolerant, self-healing storage. We then describe how Aurora achieves consensus on durable state across numerous storage nodes using an efficient asynchronous scheme, avoiding expensive and chatty recovery protocols. Finally, having operated Aurora as a production service for over 18 months, we share lessons we have learned from our customers on what modern cloud applications expect from their database tier. Keywords Databases; Distributed Systems; Log Processing; Quorum Models; Replication; Recovery; Performance; OLTP 1. INTRODUCTION IT workloads are increasingly moving to public cloud providers. Significant reasons for this industry-wide transition include the ability to provision capacity on a flexible on-demand basis and to pay for this capacity using an operational expense as opposed to capital expense model. Many IT workloads require a relational OLTP database; providing equivalent or superior capabilities to on-premise databases is critical to support this secular transition. In modern distributed cloud services, resilience and scalability are increasingly achieved by decoupling compute from storage [10][24][36][38][39] and by replicating storage across multiple nodes. Doing so lets us handle operations such as replacing misbehaving or unreachable hosts, adding replicas, failing over from a writer to a replica, scaling the size of a database instance up or down, etc. The I/O bottleneck faced by traditional database systems changes in this environment. Since I/Os can be spread across many nodes and many disks in a multi-tenant fleet, the individual disks and nodes are no longer hot. Instead, the bottleneck moves to the network between the database tier requesting I/Os and the storage tier that performs these I/Os. Beyond the basic bottlenecks of packets per second (PPS) and bandwidth, there is amplification of traffic since a performant database will issue writes out to the storage fleet in parallel. The performance of the outlier storage node, disk or network path can dominate response time. Although most operations in a database can overlap with each other, there are several situations that require synchronous operations. These result in stalls and context switches. One such situation is a disk read due to a miss in the database buffer cache. A reading thread cannot continue until its read completes. A cache miss may also incur the extra penalty of evicting and flushing a dirty cache page to accommodate the new page. Background processing such as checkpointing and dirty page writing can reduce the occurrence of this penalty, but can also cause stalls, context switches and resource contention. Transaction commits are another source of interference; a stall in committing one transaction can inhibit others from progressing. Handling commits with multi-phase synchronization protocols such as 2-phase commit (2PC) [3][4][5] is challenging in a cloud- scale distributed system. These protocols are intolerant of failure and high-scale distributed systems have a continual “background noise” of hard and soft failures. They are also high latency, as high scale systems are distributed across multiple data centers. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, require prior specific permission and/or a fee. Request permissions from [email protected] SIGMOD’17, May 14 – 19, 2017, Chicago, IL, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-4197-4/17/05…$15.00 DOI: http://dx.doi.org/10.1145/3035918.3056101 Control Plane Data Plane Amazon DynamoDB Amazon SWF Logging + Storage SQL Transactions Caching Amazon S3 Figure 1: Move logging and storage off the database engine 1041 https://www.amazon.science/publications/amazon-aurora-design-considerations-for-high-throughput-cloud-native-relational-databases
  4. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Move logging and storage off the database engine `
  5. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Network IO in mirrored MySQL (not Amazon Aurora)
  6. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Network IO in Amazon Aurora `
  7. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Amazon Aurora • MySQL 5.6 and 5.7 compatible • PostgreSQL 9.6, 10, 11 compatible • Serverless • MySQL 5.6 and 5.7, and PostgreSQL 10.7 compatible • Built-in synchronous Data API with an • HTTP endpoint • Integration with AWS SDKs
  8. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Amazon Aurora • Integrated with Machine Learning services • Amazon Comprehend • Amazon SageMaker • Aurora MySQL 5.7 – version 2.07.0 and higher • Aurora PostgreSQL 10.11 • Aurora PostgreSQL 11.6 and higher • Global Database • Sub-Second Data Access in Any Region • Read Replica Write Forwarding • Cross-Region Disaster Recovery
  9. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Using Machine Learning directly from your databases ` https://aws.amazon.com/blogs/aws/new-for-amazon-aurora-use-machine-learning-directly-from-your-databases/
  10. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Using Machine Learning directly from your databases https://aws.amazon.com/blogs/aws/new-for-amazon-aurora-use-machine-learning-directly-from-your-databases/
  11. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Using Amazon Aurora + Amazon Comprehend
  12. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Sentiment Analysis – Managing Comments CREATE TABLE IF NOT EXISTS comments ( comment_id INT AUTO_INCREMENT PRIMARY KEY, comment_text VARCHAR(255) NOT NULL ); INSERT INTO comments (comment_text) VALUES ("This is very useful, thank you for writing it!"); INSERT INTO comments (comment_text) VALUES ("Awesome, I was waiting for this feature."); INSERT INTO comments (comment_text) VALUES ("An interesting write up, please add more details."); INSERT INTO comments (comment_text) VALUES ("I don’t like how this was implemented.");
  13. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Checking Sentiment & Confidence of Comments SELECT comment_text, aws_comprehend_detect_sentiment(comment_text, 'en') AS sentiment, aws_comprehend_detect_sentiment_confidence(comment_text, 'en') AS confidence FROM comments; +---------------------------------------------------+-----------+--------------------+ | comment_text | sentiment | confidence | +---------------------------------------------------+-----------+--------------------+ | This is very useful, thank you for writing it! | POSITIVE | 0.9996347427368164 | | Awesome, I was waiting for this feature. | POSITIVE | 0.9977971315383911 | | An interesting write up, please add more details. | NEUTRAL | 0.5156506896018982 | | I don’t like how this was implemented. | NEGATIVE | 0.9982835054397583 | +---------------------------------------------------+-----------+--------------------+
  14. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Estimating Overall Sentiment from Comments SELECT AVG(CASE aws_comprehend_detect_sentiment(comment_text, 'en') WHEN 'POSITIVE' THEN 1.0 WHEN 'NEGATIVE' THEN -1.0 ELSE 0.0 END) AS avg_sentiment, COUNT(*) AS total FROM comments WHERE aws_comprehend_detect_sentiment_confidence(comment_text, 'en') >= 0.80; +---------------+-------+ | avg_sentiment | total | +---------------+-------+ | 0.33333 | 3 | +---------------+-------+
  15. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Using Amazon Aurora + Amazon SageMaker
  16. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Customer Churn Prediction with Amazon SageMaker Autopilot https://github.com/awslabs/amazon-sagemaker-examples/blob/master/autopilot/autopilot_customer_churn.ipynb
  17. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Managing Customer Churn CREATE FUNCTION will_churn ( state varchar(2048), acc_length bigint(20), area_code bigint(20), phone varchar(2048), int_plan varchar(2048), vmail_plan varchar(2048), vmail_msg bigint(20), day_mins double, day_calls bigint(20), day_charge double, eve_mins double, eve_calls bigint(20), eve_charge double, night_mins double, night_calls bigint(20), night_charge double, int_mins double, int_calls bigint(20), int_charge double, cust_service_calls bigint(20)) RETURNS varchar(2048) CHARSET utf8mb4 alias aws_sagemaker_invoke_endpoint endpoint name 'tuning-job-123’;
  18. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Estimating (and Caching) Customer Churn CREATE TABLE IF NOT EXISTS customers_churn AS SELECT *, will_churn(state, acc_length, area_code, phone, int_plan, vmail_plan, vmail_msg, day_mins, day_calls, day_charge, eve_mins, eve_calls, eve_charge, night_mins, night_calls, night_charge, int_mins, int_calls, int_charge, cust_service_calls) will_churn FROM customers; SELECT * FROM customers_churn LIMIT 7; SELECT will_churn, COUNT(*) FROM customers_churn GROUP BY will_churn;
  19. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Customer Churn – Some Results SELECT * FROM customers_churn LIMIT 7; +-------+------------+-----------+----------+----------+------------+-----------+----------+-----------+------------+----------+-----------+------------+------------+-------------+--------------+----------+-----------+------------+--------------------+------------+ | state | acc_length | area_code | phone | int_plan | vmail_plan | vmail_msg | day_mins | day_calls | day_charge | eve_mins | eve_calls | eve_charge | night_mins | night_calls | night_charge | int_mins | int_calls | int_charge | cust_service_calls | will_churn | +-------+------------+-----------+----------+----------+------------+-----------+----------+-----------+------------+----------+-----------+------------+------------+-------------+--------------+----------+-----------+------------+--------------------+------------+ | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10 | 3 | 2.7 | 1 | False. | | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False. | | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False. | | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False. | | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False. | | AL | 118 | 510 | 391-8027 | yes | no | 0 | 223.4 | 98 | 37.98 | 220.6 | 101 | 18.75 | 203.9 | 118 | 9.18 | 6.3 | 6 | 1.7 | 0 | False. | | MA | 121 | 510 | 355-9993 | no | yes | 24 | 218.2 | 88 | 37.09 | 348.5 | 108 | 29.62 | 212.6 | 118 | 9.57 | 7.5 | 7 | 2.03 | 3 | False. | +-------+------------+-----------+----------+----------+------------+-----------+----------+-----------+------------+----------+-----------+------------+------------+-------------+--------------+----------+-----------+------------+--------------------+------------+
  20. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Customer Churn – Overall Stats SELECT will_churn, COUNT(*) FROM customers_churn GROUP BY will_churn; +------------+----------+ | will_churn | COUNT(*) | +------------+----------+ | False. | 2898 | | True. | 435 | +------------+----------+
  21. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Customer Churn – Diving Deeper SELECT will_churn, AVG(acc_length), AVG(cust_service_calls), COUNT(*) FROM customers_churn GROUP BY will_churn;
  22. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Amazon Aurora Labs for MySQL https://awsauroralabsmy.com/ml/overview/
  23. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Takeaways Relational databases contain highly structured data Using machine learning from your databases unleashes new use cases for new users You don’t have to change traditional SQL-based apps to leverage sentiment analysis and machine learning inference Machine learning adoption is easier and broader in scope
  24. © 2020, Amazon Web Services, Inc. or its Affiliates. All

    rights reserved. Thank you! @danilop Please give me your feedback J