Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Using Machine Learning From Your Databases

Using Machine Learning From Your Databases

AWS User Group Netherlands, June 30th, 2020

Machine Learning allows you to get better insights from your data. But where is most of the structured data stored? In databases! Amazon Aurora is now natively integrated with Amazon SageMaker, a service providing you with the ability to build, train, and deploy custom machine learning models quickly, and Amazon Comprehend, a natural language processing (NLP) service that uses machine learning to find insights in text. Let’s see how this integration works!

Danilo Poccia

June 30, 2020
Tweet

More Decks by Danilo Poccia

Other Decks in Programming

Transcript

  1. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Danilo Poccia, Chief Evangelist (EMEA)
    @danilop
    Using Machine Learning
    From Your Database

    View Slide

  2. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Amazon Aurora: Design Considerations for High
    Throughput Cloud-Native Relational Databases (2017)
    Amazon Aurora: Design Considerations for High
    Throughput Cloud-Native Relational Databases
    Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta,
    Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, Xiaofeng Bao
    Amazon Web Services
    ABSTRACT
    Amazon Aurora is a relational database service for OLTP
    workloads offered as part of Amazon Web Services (AWS). In
    this paper, we describe the architecture of Aurora and the design
    considerations leading to that architecture. We believe the central
    constraint in high throughput data processing has moved from
    compute and storage to the network. Aurora brings a novel
    architecture to the relational database to address this constraint,
    most notably by pushing redo processing to a multi-tenant scale-
    out storage service, purpose-built for Aurora. We describe how
    doing so not only reduces network traffic, but also allows for fast
    crash recovery, failovers to replicas without loss of data, and
    fault-tolerant, self-healing storage. We then describe how Aurora
    achieves consensus on durable state across numerous storage
    nodes using an efficient asynchronous scheme, avoiding
    expensive and chatty recovery protocols. Finally, having operated
    Aurora as a production service for over 18 months, we share
    lessons we have learned from our customers on what modern
    cloud applications expect from their database tier.
    Keywords
    Databases; Distributed Systems; Log Processing; Quorum
    Models; Replication; Recovery; Performance; OLTP
    1. INTRODUCTION
    IT workloads are increasingly moving to public cloud providers.
    Significant reasons for this industry-wide transition include the
    ability to provision capacity on a flexible on-demand basis and to
    pay for this capacity using an operational expense as opposed to
    capital expense model. Many IT workloads require a relational
    OLTP database; providing equivalent or superior capabilities to
    on-premise databases is critical to support this secular transition.
    In modern distributed cloud services, resilience and scalability are
    increasingly achieved by decoupling compute from storage
    [10][24][36][38][39] and by replicating storage across multiple
    nodes. Doing so lets us handle operations such as replacing
    misbehaving or unreachable hosts, adding replicas, failing over
    from a writer to a replica, scaling the size of a database instance
    up or down, etc.
    The I/O bottleneck faced by traditional database systems changes
    in this environment. Since I/Os can be spread across many nodes
    and many disks in a multi-tenant fleet, the individual disks and
    nodes are no longer hot. Instead, the bottleneck moves to the
    network between the database tier requesting I/Os and the storage
    tier that performs these I/Os. Beyond the basic bottlenecks of
    packets per second (PPS) and bandwidth, there is amplification of
    traffic since a performant database will issue writes out to the
    storage fleet in parallel. The performance of the outlier storage
    node, disk or network path can dominate response time.
    Although most operations in a database can overlap with each
    other, there are several situations that require synchronous
    operations. These result in stalls and context switches. One such
    situation is a disk read due to a miss in the database buffer cache.
    A reading thread cannot continue until its read completes. A cache
    miss may also incur the extra penalty of evicting and flushing a
    dirty cache page to accommodate the new page. Background
    processing such as checkpointing and dirty page writing can
    reduce the occurrence of this penalty, but can also cause stalls,
    context switches and resource contention.
    Transaction commits are another source of interference; a stall in
    committing one transaction can inhibit others from progressing.
    Handling commits with multi-phase synchronization protocols
    such as 2-phase commit (2PC) [3][4][5] is challenging in a cloud-
    scale distributed system. These protocols are intolerant of failure
    and high-scale distributed systems have a continual “background
    noise” of hard and soft failures. They are also high latency, as
    high scale systems are distributed across multiple data centers.
    Permission to make digital or hard copies of all or part of this work for personal or
    classroom use is granted without fee provided that copies are not made or
    distributed for profit or commercial advantage and that copies bear this notice and
    the full citation on the first page. Copyrights for components of this work owned
    by others than the author(s) must be honored. Abstracting with credit is permitted.
    To copy otherwise, or republish, to post on servers or to redistribute to lists, require
    prior specific permission and/or a fee. Request permissions from
    [email protected]
    SIGMOD’17, May 14 – 19, 2017, Chicago, IL, USA.
    Copyright is held by the owner/author(s). Publication rights licensed to ACM.
    ACM 978-1-4503-4197-4/17/05…$15.00
    DOI: http://dx.doi.org/10.1145/3035918.3056101
    Control Plane
    Data Plane
    Amazon
    DynamoDB
    Amazon SWF
    Logging + Storage
    SQL
    Transactions
    Caching
    Amazon S3
    Figure 1: Move logging and storage off the database engine
    1041
    https://www.allthingsdistributed.com/files/p1041-verbitski.pdf

    View Slide

  3. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Move logging and storage off the database engine
    `

    View Slide

  4. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Network IO in mirrored MySQL (not Amazon Aurora)

    View Slide

  5. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Network IO in Amazon Aurora
    `

    View Slide

  6. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Amazon Aurora
    • MySQL 5.6 and 5.7 compatible
    • PostgreSQL 9.6, 10, 11 compatible
    • Serverless
    • MySQL 5.6 and 5.7, and PostgreSQL 10.7 compatible
    • Built-in synchronous Data API with an
    • HTTP endpoint
    • Integration with AWS SDKs

    View Slide

  7. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Amazon Aurora
    • Integrated with Machine Learning services
    • Amazon Comprehend
    • Amazon SageMaker
    • Aurora MySQL 5.7 – version 2.07.0 and higher
    • Aurora PostgreSQL 10.11 – version 2.4
    • Aurora PostgreSQL 11.6 and higher – version 3.1 and higher
    • Global Database
    • Sub-Second Data Access in Any Region
    • Read Replica Write Forwarding
    • Cross-Region Disaster Recovery

    View Slide

  8. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Using Machine Learning directly from your databases
    `
    https://aws.amazon.com/blogs/aws/new-for-amazon-aurora-use-machine-learning-directly-from-your-databases/

    View Slide

  9. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Using Machine Learning directly from your databases
    https://aws.amazon.com/blogs/aws/new-for-amazon-aurora-use-machine-learning-directly-from-your-databases/

    View Slide

  10. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Using Amazon Aurora
    + Amazon Comprehend

    View Slide

  11. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Sentiment Analysis – Managing Comments
    CREATE TABLE IF NOT EXISTS comments (
    comment_id INT AUTO_INCREMENT PRIMARY KEY,
    comment_text VARCHAR(255) NOT NULL
    );
    INSERT INTO comments (comment_text)
    VALUES ("This is very useful, thank you for writing it!");
    INSERT INTO comments (comment_text)
    VALUES ("Awesome, I was waiting for this feature.");
    INSERT INTO comments (comment_text)
    VALUES ("An interesting write up, please add more details.");
    INSERT INTO comments (comment_text)
    VALUES ("I don’t like how this was implemented.");

    View Slide

  12. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Checking Sentiment & Confidence of Comments
    SELECT comment_text,
    aws_comprehend_detect_sentiment(comment_text, 'en') AS sentiment,
    aws_comprehend_detect_sentiment_confidence(comment_text, 'en') AS confidence
    FROM comments;
    +---------------------------------------------------+-----------+--------------------+
    | comment_text | sentiment | confidence |
    +---------------------------------------------------+-----------+--------------------+
    | This is very useful, thank you for writing it! | POSITIVE | 0.9996347427368164 |
    | Awesome, I was waiting for this feature. | POSITIVE | 0.9977971315383911 |
    | An interesting write up, please add more details. | NEUTRAL | 0.5156506896018982 |
    | I don’t like how this was implemented. | NEGATIVE | 0.9982835054397583 |
    +---------------------------------------------------+-----------+--------------------+

    View Slide

  13. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Estimating Overall Sentiment from Comments
    SELECT AVG(CASE aws_comprehend_detect_sentiment(comment_text, 'en')
    WHEN 'POSITIVE' THEN 1.0
    WHEN 'NEGATIVE' THEN -1.0
    ELSE 0.0 END) AS avg_sentiment,
    COUNT(*) AS total
    FROM comments
    WHERE aws_comprehend_detect_sentiment_confidence(comment_text, 'en') >= 0.80;
    +---------------+-------+
    | avg_sentiment | total |
    +---------------+-------+
    | 0.33333 | 3 |
    +---------------+-------+

    View Slide

  14. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Using Amazon Aurora
    + Amazon SageMaker

    View Slide

  15. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Customer Churn Prediction with Amazon SageMaker Autopilot
    https://github.com/awslabs/amazon-sagemaker-examples/blob/master/autopilot/autopilot_customer_churn.ipynb

    View Slide

  16. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Managing Customer Churn
    CREATE FUNCTION will_churn (
    state varchar(2048),
    acc_length bigint(20),
    area_code bigint(20), phone varchar(2048),
    int_plan varchar(2048),
    vmail_plan varchar(2048), vmail_msg bigint(20),
    day_mins double, day_calls bigint(20), day_charge double,
    eve_mins double, eve_calls bigint(20), eve_charge double,
    night_mins double, night_calls bigint(20), night_charge double,
    int_mins double, int_calls bigint(20), int_charge double,
    cust_service_calls bigint(20))
    RETURNS varchar(2048) CHARSET utf8mb4
    alias aws_sagemaker_invoke_endpoint
    endpoint name 'tuning-job-123’;

    View Slide

  17. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Estimating (and Caching) Customer Churn
    CREATE TABLE IF NOT EXISTS customers_churn AS
    SELECT *, will_churn(state, acc_length, area_code, phone, int_plan,
    vmail_plan, vmail_msg, day_mins, day_calls, day_charge,
    eve_mins, eve_calls, eve_charge, night_mins, night_calls, night_charge,
    int_mins, int_calls, int_charge, cust_service_calls) will_churn
    FROM customers;
    SELECT * FROM customers_churn LIMIT 7;
    SELECT will_churn, COUNT(*) FROM customers_churn GROUP BY will_churn;

    View Slide

  18. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Customer Churn – Some Results
    SELECT * FROM customers_churn LIMIT 7;
    +-------+------------+-----------+----------+----------+------------+-----------+----------+-----------+------------+----------+-----------+------------+------------+-------------+--------------+----------+-----------+------------+--------------------+------------+
    | state | acc_length | area_code | phone | int_plan | vmail_plan | vmail_msg | day_mins | day_calls | day_charge | eve_mins | eve_calls | eve_charge | night_mins | night_calls | night_charge | int_mins | int_calls | int_charge | cust_service_calls | will_churn |
    +-------+------------+-----------+----------+----------+------------+-----------+----------+-----------+------------+----------+-----------+------------+------------+-------------+--------------+----------+-----------+------------+--------------------+------------+
    | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10 | 3 | 2.7 | 1 | False. |
    | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.7 | 1 | False. |
    | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.3 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False. |
    | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.9 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False. |
    | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False. |
    | AL | 118 | 510 | 391-8027 | yes | no | 0 | 223.4 | 98 | 37.98 | 220.6 | 101 | 18.75 | 203.9 | 118 | 9.18 | 6.3 | 6 | 1.7 | 0 | False. |
    | MA | 121 | 510 | 355-9993 | no | yes | 24 | 218.2 | 88 | 37.09 | 348.5 | 108 | 29.62 | 212.6 | 118 | 9.57 | 7.5 | 7 | 2.03 | 3 | False. |
    +-------+------------+-----------+----------+----------+------------+-----------+----------+-----------+------------+----------+-----------+------------+------------+-------------+--------------+----------+-----------+------------+--------------------+------------+

    View Slide

  19. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Customer Churn – Overall Stats
    SELECT will_churn, COUNT(*)
    FROM customers_churn
    GROUP BY will_churn;
    +------------+----------+
    | will_churn | COUNT(*) |
    +------------+----------+
    | False. | 2898 |
    | True. | 435 |
    +------------+----------+

    View Slide

  20. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Customer Churn – Diving Deeper
    SELECT will_churn,
    AVG(acc_length),
    AVG(cust_service_calls),
    COUNT(*)
    FROM customers_churn
    GROUP BY will_churn;

    View Slide

  21. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Amazon Aurora Labs for MySQL
    https://awsauroralabsmy.com/ml/overview/

    View Slide

  22. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Takeaways
    Relational databases contain highly structured data
    Using machine learning from your databases unleashes
    new use cases for new users
    You don’t have to change traditional SQL-based apps to leverage
    sentiment analysis and machine learning inference
    Machine learning adoption is easier and broader in scope

    View Slide

  23. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
    Thank you!
    @danilop Please give me your feedback J

    View Slide