Slide 1

Slide 1 text

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. KYIV 06.11.19 Building a Modern Data Platform in the Cloud Alex Casalboni Sr. Technical Evangelist Amazon Web Services @alex_casalboni

Slide 2

Slide 2 text

About me • Software Engineer & Web Developer • Worked in a startup for 4.5 years • ServerlessDays Organizer • AWS Customer since 2013

Slide 3

Slide 3 text

SUMMIT bit.ly/AWSDataLakeDemo

Slide 4

Slide 4 text

To Become a Leader, Data is Your Differentiator. Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations that implemented a data lake outperforming similar companies by 9% in organic revenue growth.* Organic revenue growth: Leaders 24%, Followers 15%. *Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence

Slide 5

Slide 5 text

Data variety and data volumes are increasing rapidly Multiple Consumers and Applications Ingest Discover Catalog Understand Curate Find insights

Slide 6

Slide 6 text

Purpose-built engines Right tool for the job

Slide 7

Slide 7 text

Collect Store Analyze Amazon Kinesis Firehose AWS Direct Connect AWS Snowball Amazon Kinesis Analytics Amazon Kinesis Streams Amazon S3 Amazon Glacier Amazon CloudSearch Amazon RDS, Amazon Aurora Amazon DynamoDB Amazon Elasticsearch Service Amazon EMR Amazon Redshift Amazon QuickSight AWS Database Migration Service AWS Glue Amazon Athena Amazon SageMaker

Slide 8

Slide 8 text

Traditionally, Analytics Used to Look Like This OLTP ERP CRM LOB Data Warehouse Business Intelligence • Relational data • TBs–PBs scale • Schema defined prior to data load • Operational reporting and ad hoc • Large initial CAPEX + $10K–$50K/TB/Year

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

“A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale”

Slide 11

Slide 11 text

Collect analyze semi-structured unstructured Decoupled ingestion on-read warehouses

Slide 12

Slide 12 text

exabyte scale once many tools Open formats

Slide 13

Slide 13 text

S3 Elasticsearch Glue DynamoDB Catalog & search Cognito API Gateway API/UI Athena QuickSight Redshift Spectrum Analytics & processing Lambda Kinesis Streams Kinesis Firehose Direct Connect Ingest AWS IoT KMS CloudTrail IAM Macie Security & auditing

Slide 14

Slide 14 text

CHALLENGE: Need to create a constant feedback loop for designers. Gain an up-to-the-minute understanding of gamer satisfaction to keep gamers engaged, resulting in the most popular game played in the world. Fortnite | 125+ million players

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

time • Kinesis Video Streams: Capture, process, and store video streams for analytics • Kinesis Data Streams: Build custom applications that analyze data streams • Kinesis Data Firehose: Load data streams into AWS data stores • Kinesis Data Analytics: Analyze data streams with SQL

Slide 17

Slide 17 text

Amazon S3: Buffered files Kinesis Agent Record producers Amazon Redshift: Table loads Amazon Elasticsearch Service: Domain loads Amazon S3: Source record backup Transformed records Put Records Kinesis Firehose: Delivery stream

Slide 18

Slide 18 text

Amazon S3: Buffered files Kinesis Agent Record producers Amazon Redshift: Table loads Amazon Elasticsearch Service: Domain loads Amazon S3: Source record backup Transformed records Put Records Kinesis Firehose: Delivery stream AWS Lambda: Transformations & enrichment Raw Transformed
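
The "AWS Lambda: Transformations & enrichment" step can be sketched as a small handler. This is a minimal sketch, not the demo's actual function: it assumes the incoming records carry the r, g, b fields of the JSON payload shown later in the deck, and the added "hex" field is a hypothetical enrichment chosen only for illustration. The record shape (recordId, result, base64-encoded data) follows the Firehose data transformation contract.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and hand it back to the delivery stream."""
    output = []
    for record in event["records"]:
        try:
            # Firehose delivers each payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: derive a hex color from the r/g/b fields.
            payload["hex"] = "#{:02x}{:02x}{:02x}".format(
                payload["r"], payload["g"], payload["b"]
            )
            # Newline-delimit the JSON so records stay separable once buffered into files.
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Return the record unchanged and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every input recordId has to come back exactly once; records flagged "ProcessingFailed" are returned with their original data so nothing is silently lost.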

Slide 19

Slide 19 text

Open-source standards (Apache) Parquet, ORC, etc. Optimize Performance Optimize Costs Analytical queries

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Storing is Not Enough, Data Needs to Be Discoverable “Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” Gartner IT Glossary, 2018 https://www.gartner.com/it-glossary/dark-data CRM ERP Data warehouse Mainframe data Web Social Log files Machine data Semi-structured Unstructured

Slide 23

Slide 23 text

Building training sets Cleaning and organizing data Collecting data sets Mining data for patterns Refining algorithms Other 80%

Slide 24

Slide 24 text

& Data Catalog ETL Job authoring Discover data and extract schema Auto-generates customizable ETL code in Python and Spark Data & schema automatic discovery Generates customizable code for ETL Schedule and run ETL jobs periodically Serverless model

Slide 25

Slide 25 text

AWS Glue Crawlers: Automatically catalog your data. Crawlers automatically build your data catalog and keep it in sync • Automatically discover new data & extract schema definitions • Detect schema changes and version tables • Detect Hive-style partitions on Amazon S3 • Built-in classifiers for popular types; custom classifiers using Grok expressions • Run ad hoc or on a schedule; serverless, so you only pay when a crawler runs

Slide 26

Slide 26 text

AWS Lake Formation (join the preview): Build, secure, and manage a data lake in days • Build a data lake in days, not months: Build and deploy a fully managed data lake with a few clicks. • Enforce security policies across multiple services: Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications. • Combine different analytics approaches: Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog.

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

User-Defined Functions • Bring your own functions & code • Execute without provisioning servers Processing and Querying In Place Fully Managed Process & Query AWS Glue Amazon Athena Amazon Redshift Amazon SageMaker AWS Lambda

Slide 29

Slide 29 text

• Query S3 using standard SQL (Presto as distributed engine) • Serverless: no infrastructure to set up or manage • Multiple data format support: define schema on demand $ Query Instantly Pay per query Open Easy

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

Data scanned: 169.53 GB (of 2.2 TB)
Query duration: 44.66 seconds
Cost: $0.85 ($5/TB or $0.005/GB)

SELECT gram, year, sum(count)
FROM ngram
WHERE gram = 'just say no'
GROUP BY gram, year
ORDER BY year ASC;

registry.opendata.aws/google-ngrams
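
The cost figure on this slide follows directly from the amount of data scanned. The sketch below reproduces that arithmetic using only the numbers on the slide; it assumes decimal units (1 TB = 1000 GB), which is what the slide's "$5/TB or $0.005/GB" implies, and the function name is made up for illustration.

```python
PRICE_PER_GB_USD = 0.005  # from the slide: $5/TB, i.e. $0.005/GB


def athena_query_cost(gb_scanned, price_per_gb=PRICE_PER_GB_USD):
    """Return the pay-per-query cost in USD for a given amount of data scanned."""
    return gb_scanned * price_per_gb


gb_scanned = 169.53      # data scanned by the query
dataset_gb = 2.2 * 1000  # full dataset size, assuming 1 TB = 1000 GB

cost = athena_query_cost(gb_scanned)
fraction_scanned = gb_scanned / dataset_gb

print(f"Cost: ${cost:.2f}")
print(f"Share of dataset scanned: {fraction_scanned:.1%}")
```

Because billing is per byte scanned, anything that reduces the scanned share (columnar formats, partitioning) lowers the cost of the same query.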

Slide 32

Slide 32 text

year 2018 month 11 day 25
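
The year, month, and day values on this slide read like partition keys. As a sketch under that assumption, the helper below builds a Hive-style partition prefix (key=value path segments) of the kind Glue crawlers detect on Amazon S3. The "events/" base prefix and the function name are hypothetical.

```python
from datetime import date


def partition_prefix(day, base="events/"):
    """Build a Hive-style S3 key prefix (key=value segments) for one day of data."""
    return f"{base}year={day.year}/month={day.month}/day={day.day}/"


prefix = partition_prefix(date(2018, 11, 25))
print(prefix)
```

A query that filters on year, month, and day then only needs to read objects under the matching prefix instead of the whole dataset.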

Slide 33

Slide 33 text

Amazon QuickSight easy Empower everyone Seamless connectivity Fast analysis Serverless

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

SUMMIT bit.ly/AWSDataLakeDemo

Slide 36

Slide 36 text

JSON Payload Example for each event

{
  "r": 255,
  "g": 0,
  "b": 0,
  "c": "Red",
  "device": {
    "id": "4992157",
    "browser": "Chrome",
    "browserVersion": "72.0.3626.109",
    "os": "Mac OS",
    "isMobile": false,
    "isMobileIOS": false,
    "isMobileAndroid": false
  },
  "dt": {
    "year": 2019,
    "month": 4,
    "day": 17,
    "hour": 16,
    "minutes": 30,
    "seconds": 47,
    "millis": 725
  },
  "id": 1551116627725,
  "region": "Europe",
  "awsExperience": "1-3 Years",
  "awsServiceArea": "Management Tools"
}
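
One way such an event could be delivered to Kinesis Data Firehose is sketched below. This is an illustration, not the demo's code (the demo architecture shows a web client using the AWS SDK). The function name and the stream name are hypothetical; put_record with DeliveryStreamName and Record={"Data": ...} is the boto3 Firehose client call. The client is passed in so the sketch runs locally with a stand-in object and no AWS credentials.

```python
import json


def send_event(firehose_client, stream_name, event):
    """Serialize one event as newline-delimited JSON and put it on a delivery stream."""
    data = (json.dumps(event) + "\n").encode("utf-8")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": data},
    )


class FakeFirehose:
    """Local stand-in that records calls; a real client would be boto3.client("firehose")."""

    def __init__(self):
        self.calls = []

    def put_record(self, DeliveryStreamName, Record):
        self.calls.append((DeliveryStreamName, Record))
        return {"RecordId": "fake"}


fake = FakeFirehose()
send_event(fake, "demo-stream", {"r": 255, "g": 0, "b": 0, "c": "Red"})
```

The trailing newline keeps events separable after Firehose concatenates them into buffered files on S3.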

Slide 37

Slide 37 text

Demo Architecture Amazon CloudFront Amazon Cognito Amazon S3 Web App Users Amazon Kinesis Data Firehose Amazon Athena AWS Glue Amazon QuickSight Client Mobile client AWS SDK S3 Bucket AWS Cloud Region

Slide 38

Slide 38 text

Thank you! Alex Casalboni @alex_casalboni