Slide 1

AWS data services for machine learning
Alex Casalboni, Technical Evangelist, Amazon Web Services (@alex_casalboni)
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Slide 2

Agenda
- The evolution of data challenges
- What's a data lake?
- Data ingestion & real-time data
- Query engines & ETL
- Demo

Slide 3

The evolution of data challenges, era by era (problem, then solution):

- Before 2009, the DBA years. Problem: "My reports make my database server very slow." Solution: overnight DB dump, read-only replica.
- 2009–2011, the Hadoop epiphany. Problem: "My data doesn't fit in one machine, and it's not only transactional." Solution: Hadoop; Map/Reduce all the things.
- 2012–2014, the Message Broker and NoSQL age. Problems: "My data is very fast; Map/Reduce is hard to use." Solutions: Kafka/RabbitMQ, Cassandra/HBase/Storm, basic ETL, Hive.
- 2015–2017, the Spark kingdom and the spreadsheet wars. Problems: duplicating batch/stream is inefficient; I need to cleanse my source data; the Hadoop ecosystem is hard to manage; my data scientists don't like Java; I am not sure which data we are already processing. Solutions: Kafka/Spark, complex ETL, new departments for data governance, spreadsheet all the things.
- 2017–2018, the myth of DataOps. Problems: streaming is hard; my schemas have evolved; I cannot query old and new data together; my cluster is running old versions and upgrading is hard; I want to use ML. Solutions: Kafka/Flink (Java or Scala required), complex ETL with a pinch of ML, Apache Atlas, commercial distributions.

Slide 4

Data variety and data volumes are increasing rapidly, and multiple consumers and applications need to ingest, discover, catalog, understand, and curate data to find insights. Ingestion sources include Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and on-premises databases.

Slide 5

Some problems during all periods:
- More time spent maintaining the cluster than adding functionality
- Security and monitoring are hard
- The cluster sits idle most of the time
- No time left to experiment
- Frustration, because data preparation, cleansing, and basic transformations take 80% of our time

Slide 6

The downfall of the data engineer
“Watching paint dry is exciting in comparison to writing and maintaining Extract Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors or issues tend to happen at runtime or are post-runtime assertions. Since the development time to execution time ratio is typically low, being productive means juggling with multiple pipelines at once and inherently doing a lot of context switching. By the time one of your 5 running “big data jobs” has finished, you have to get back in the mind space you were in many hours ago and craft your next iteration. Depending on how caffeinated you are, how long it’s been since the last iteration, and how systematic you are, you may fail at restoring the full context in your short-term memory. This leads to systemic, stupid errors that waste hours.”
Maxime Beauchemin, data engineer @ Lyft; creator of Apache Airflow and Apache Superset; ex-Facebook, ex-Yahoo!, ex-Airbnb
medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b

Slide 7

Purpose-built engines: the right tool for the job

Slide 8

Purpose-built analytics tools:
- Collect: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, AWS Direct Connect, AWS Snowball, AWS Database Migration Service
- Store: Amazon S3, Amazon S3 Glacier, Amazon RDS, Amazon Aurora, Amazon DynamoDB
- Analyze: Amazon Kinesis Data Analytics, Amazon CloudSearch, Amazon Elasticsearch Service, Amazon EMR, Amazon Redshift, Amazon QuickSight, AWS Glue, Amazon Athena, Amazon SageMaker

Slide 9

What’s a data lake?

Slide 10

What’s a data lake?
- Collect, store, process, consume, and analyze organizational data
- Structured, semi-structured, and unstructured data
- Decoupled compute and (low-cost) storage
- Fast, automated ingestion
- Schema-on-read
- Allows self-service and easy plug-and-play
- Complementary to data warehouses
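Schema-on-read means raw data lands in the lake untouched, and each consumer applies its own schema at query time. A minimal sketch in Python, assuming made-up event records and field names for illustration:

```python
import io
import json

# Hypothetical raw events, landed in the lake exactly as produced.
# Note the records don't even share the same fields.
raw = io.StringIO(
    '{"user": "anna", "action": "login", "ts": "2019-05-01T10:00:00Z"}\n'
    '{"user": "ben", "action": "purchase", "amount": 19.99}\n'
)

def read_with_schema(lines, fields):
    """Apply a schema at read time: project only the requested fields,
    filling missing ones with None instead of rejecting the record."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two consumers, two schemas, one copy of the raw data.
audit_view = list(read_with_schema(raw, ["user", "action"]))
raw.seek(0)
billing_view = list(read_with_schema(raw, ["user", "amount"]))
```

The point of the sketch is the design choice: no schema is enforced on write, so new consumers with new schemas can plug into the same raw data later.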

Slide 11

A possible open-source solution:
- Hadoop cluster (static/multi-tenant)
- Apache NiFi for ingestion workflows
- Sqoop to ingest data from RDBMS
- HDFS to store the data (tied to the Hadoop cluster)
- Hive/HCatalog for the data catalog
- Apache Atlas for a more human data catalog and governance
- Apache Spark for complex ETL, with Apache Livy for REST
- Hive for batch workloads with SQL
- Presto for interactive queries with SQL
- Kafka for streaming ingest
- Apache Spark/Apache Flink for streaming analytics
- Apache HBase (or maybe Cassandra) to store streaming data
- Apache Phoenix to run SQL queries on top of HBase
- Prometheus (or fluentd/collectd/Ganglia/Nagios…) for logs and monitoring
- Airflow/Oozie to schedule workflows
- Superset for business dashboards
- Jupyter/JupyterHub/Zeppelin for data science
- Security: Apache Sentry for roles, Ranger for configuration, Knox as a firewall
- YARN to coordinate resources
- Ambari for cluster administration
- Terraform/Chef/Puppet for provisioning

Slide 12

Or a cloud-native solution on AWS: Amazon DynamoDB, Amazon Elasticsearch Service, AWS AppSync, Amazon API Gateway, Amazon Cognito, AWS KMS, AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS Snowball, AWS Storage Gateway, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service, Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon QuickSight, Amazon Kinesis, Amazon Neptune, Amazon RDS

Slide 13

Data lakes & analytics on AWS

Slide 14

Fortnite | 125+ million players
Challenge: create a constant feedback loop for designers and gain an up-to-the-minute understanding of gamer satisfaction, to keep gamers engaged and make Fortnite the most popular game in the world.

Slide 15

Data ingestion and real-time data

Slide 16

Amazon Kinesis: Real-time analytics. Easily collect, process, and analyze video and data streams in real time.
- Kinesis Video Streams: capture, process, and store video streams for analytics
- Kinesis Data Streams: build custom applications that analyze data streams
- Kinesis Data Firehose: load data streams into AWS data stores
- Kinesis Data Analytics: analyze data streams with SQL

Slide 17

Kinesis Data Firehose delivery stream: record producers (for example, the Kinesis Agent) put records into the delivery stream; AWS Lambda can transform and enrich the raw records in flight; transformed records are delivered to Amazon S3 (buffered files), Amazon Redshift (table loads), or Amazon Elasticsearch Service (domain loads), with optional source-record backup to S3.
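The transformation step in the middle of this pipeline is an ordinary Lambda function: Firehose hands it a batch of base64-encoded records and expects them back with a per-record status. A minimal sketch following the standard Firehose transformation contract (the `enriched` field is a made-up example of an enrichment):

```python
import base64
import json

def lambda_handler(event, context):
    """Kinesis Data Firehose transformation Lambda: decode each buffered
    record, enrich it, and return it for delivery to the destination."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["enriched"] = True  # illustrative enrichment step
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            # Re-encode; the trailing newline keeps S3 objects line-delimited.
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode()
            ).decode(),
        })
    return {"records": output}
```

Because the handler is pure Python over a plain dict, it can be exercised locally with a hand-built sample event before wiring it to the delivery stream.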

Slide 18

Query engines & ETL

Slide 19

The overhead of data preparation
Chart of where data scientists’ time goes: building training sets, cleaning and organizing data, collecting datasets, mining data for patterns, refining algorithms, and other. Data preparation tasks account for roughly 80%.

Slide 20

AWS Glue: Cleanse, prep, and catalog
AWS Glue Data Catalog, a single view across your data lake:
- Automatically discovers data and stores the schema
- Makes data searchable and available for ETL, with table definitions and custom metadata
AWS Glue ETL jobs, to clean, transform, and store processed data:
- Serverless Apache Spark environment
- AWS Glue ETL libraries, or bring your own code
- Write jobs in Python or Scala
Crawlers populate the Data Catalog from Amazon S3 (raw, staging, and processed data).
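To make the crawler idea concrete, here is a toy sketch of the kind of inference a crawler automates: scanning Hive-style partitioned object keys and deriving tables and their partitions. The bucket layout and key names are invented, and the real service goes further by sampling file contents to infer column types:

```python
from collections import defaultdict

# Hypothetical object keys in a data-lake bucket, using the
# Hive-style "column=value" partition convention.
keys = [
    "logs/year=2019/month=04/part-0000.json",
    "logs/year=2019/month=05/part-0000.json",
    "clicks/year=2019/month=05/part-0000.json",
]

def infer_catalog(object_keys):
    """Map each top-level prefix to a table and collect its partitions,
    mimicking (very roughly) what a crawler records in the Data Catalog."""
    catalog = defaultdict(set)
    for key in object_keys:
        parts = key.split("/")
        table = parts[0]
        partition = tuple(p for p in parts[1:-1] if "=" in p)
        catalog[table].add(partition)
    return catalog

catalog = infer_catalog(keys)
# catalog now maps "logs" to two partitions and "clicks" to one.
```

This partition metadata is what later lets query engines prune whole directories instead of scanning every object.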

Slide 21

Amazon Athena
- Query S3 using standard SQL (Presto as the distributed engine)
- Serverless: no infrastructure to set up or manage
- Multiple data format support: define schema on demand
- Query instantly, pay per query; open and easy
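Submitting an Athena query is a single API call. A hedged sketch with boto3, where the SQL, database name, and results bucket are placeholders; the actual call is left commented out because it needs AWS credentials and an existing Glue database:

```python
def build_query_request(sql, database, output_location):
    """Assemble the parameters for Athena's start_query_execution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }

request = build_query_request(
    "SELECT action, COUNT(*) AS n FROM events GROUP BY action",
    database="my_data_lake",                    # hypothetical Glue database
    output_location="s3://my-athena-results/",  # hypothetical results bucket
)

# With credentials configured, the query would be submitted like this:
#   import boto3
#   athena = boto3.client("athena")
#   execution = athena.start_query_execution(**request)
#   print(execution["QueryExecutionId"])
```

Athena runs the query asynchronously: you poll the returned execution ID for completion and then read the results from the S3 output location.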

Slide 22

No content

Slide 23

Amazon EMR: Big data processing
- Low cost: flexible per-second billing, EC2 Spot and Reserved Instances, and auto scaling reduce costs by 50–80%
- Easy: launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
- Latest versions: updated with the latest open-source frameworks within 30 days of release
- Use S3 storage: process data directly in the S3 data lake, securely and with high performance, using the EMRFS connector

Slide 24

No content

Slide 25

Amazon Redshift: Data warehousing
- Fast at scale: columnar storage technology to improve I/O efficiency and scale query performance
- Secure: audit everything; encrypt data end to end; extensive certification and compliance
- Open file formats: analyze optimized data formats on the latest SSDs, and all open data formats in Amazon S3
- Inexpensive: as low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour

Slide 26

Amazon Redshift Spectrum: Extend the data warehouse to exabytes of data in the S3 data lake, with the Spectrum query engine spanning Amazon Redshift data and S3:
- Exabyte-scale Amazon Redshift SQL queries against S3
- Join data across Amazon Redshift and S3
- Scale compute and storage separately
- Stable query performance and unlimited concurrency
- CSV, ORC, Avro & Parquet data formats
- Pay only for the amount of data scanned

Slide 27

Let’s play a game
Werner Vogels, Amazon CTO, AWS Summit San Francisco, 2017
youtu.be/RpPf38L0HHU?t=3963

Slide 30

Amazon QuickSight: easy and serverless; empower everyone with seamless connectivity and fast analysis. Now with ML superpowers!

Slide 31

Demo

Slide 32

Thank you!
Alex Casalboni, Technical Evangelist, Amazon Web Services (@alex_casalboni)