Building a Modern Data Platform in the Cloud [AWS Dev Day @ Kyiv]

Building a Modern Data Platform in the Cloud
KYIV, 06.11.19
Alex Casalboni, Sr. Technical Evangelist, Amazon Web Services
@alex_casalboni
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.

About me
• Software Engineer & Web Developer
• Worked in a startup for 4.5 years
• ServerlessDays Organizer
• AWS Customer since 2013

bit.ly/AWSDataLakeDemo

To Become a Leader, Data Is Your Differentiator
Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence

Data variety and data volumes are increasing rapidly
Multiple consumers and applications
Ingest, Discover, Catalog, Understand, Curate, Find insights

Purpose-built engines Right tool for the job

Collect, Store, Analyze
Amazon Kinesis Firehose, AWS Direct Connect, AWS Snowball, Amazon Kinesis Analytics, Amazon Kinesis Streams, Amazon S3, Amazon Glacier, Amazon CloudSearch, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon Elasticsearch, Amazon EMR, Amazon Redshift, Amazon QuickSight, AWS Database Migration Service, AWS Glue, Amazon Athena, Amazon SageMaker

Traditionally, Analytics Used to Look Like This
OLTP, ERP, CRM, LOB → Data Warehouse → Business Intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/Year

“A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale”

Collect analyze semi-structured unstructured Decoupled ingestion on-read warehouses

exabyte scale once many tools Open formats

S3 Elasticsearch Glue DynamoDB Catalog & search Cognito API Gateway
API/UI Athena QuickSight Redshift Spectrum Analytics & processing Lambda Kinesis Streams Kinesis Firehose Direct Connect Ingest AWS IoT KMS CloudTrail IAM Macie Security & auditing

Fortnite | 125+ million players
CHALLENGE: Need to create a constant feedback loop for designers. Gain an up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, resulting in the most popular game played in the world.

Kinesis Video Streams: Capture, process, and store video streams for analytics
Kinesis Data Streams: Build custom applications that analyze data streams
Kinesis Data Firehose: Load data streams into AWS data stores
Kinesis Data Analytics: Analyze data streams with SQL

Kinesis Firehose: Delivery stream
Record producers: Kinesis Agent, Put Records
AWS Lambda: Transformations & enrichment (Raw → Transformed records)
Destinations:
• Amazon S3: Buffered files
• Amazon Redshift: Table loads
• Amazon Elasticsearch Service: Domain loads
Amazon S3: Source record backup
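The Lambda transformation step in the delivery stream can be sketched as follows. This is a minimal example, not the code used in the talk: it assumes the standard Firehose data-transformation contract (base64-encoded `data`, a `recordId`, and a `result` of `Ok`, `Dropped`, or `ProcessingFailed`), and the `hex` enrichment field is a hypothetical addition chosen only for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Enrich each Firehose record and hand it back to the delivery stream."""
    output = []
    for record in event["records"]:
        try:
            # Firehose delivers each record's payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: derive a hex color from the r/g/b fields.
            payload["hex"] = "#{:02x}{:02x}{:02x}".format(
                payload["r"], payload["g"], payload["b"]
            )
            # A trailing newline keeps one JSON document per line in S3.
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Return the untouched record so Firehose can route it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming `recordId` must appear exactly once in the response, which is why failed records are echoed back rather than skipped.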

Open-source standards (Apache): Parquet, ORC, etc.
Optimize performance, optimize costs
Analytical queries

Storing Is Not Enough, Data Needs to Be Discoverable
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018, https://www.gartner.com/it-glossary/dark-data
CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data, Semi-structured, Unstructured

[Chart] Building training sets; Cleaning and organizing data; Collecting data sets; Mining data for patterns; Refining algorithms; Other. Highlighted share: 80%

& Data Catalog: Discover data and extract schema
ETL Job authoring: Auto-generates customizable ETL code in Python and Spark
• Data & schema automatic discovery
• Generates customizable code for ETL
• Schedule and run ETL jobs periodically
• Serverless model

AWS Glue Crawlers
Automatically catalog your data
Crawlers automatically build your data catalog and keep it in sync
• Automatically discover new data & extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless, only pay when the crawler runs
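To make "Hive-style partitions" concrete: they are `name=value` path segments in an S3 object key, which a crawler can surface as partition columns. The sketch below only illustrates the naming convention; it is not how Glue is implemented, and the function name and the example key are hypothetical.

```python
def parse_hive_partitions(key):
    """Return the name=value path segments of an S3 object key as a dict."""
    partitions = {}
    # The last segment is the object (file) name, so it is skipped.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key layout, partitioned by date.
example_key = "events/year=2018/month=11/day=25/part-0000.json"
print(parse_hive_partitions(example_key))
```

Partition values come back as strings here; the catalog's schema decides how they are typed when queried.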

AWS Lake Formation (join the preview)
Build, secure, and manage a data lake in days
• Build a data lake in days, not months: Build and deploy a fully managed data lake with a few clicks
• Enforce security policies across multiple services: Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
• Combine different analytics approaches: Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog

Processing and Querying In Place
Fully Managed Process & Query
User-Defined Functions
• Bring your own functions & code
• Execute without provisioning servers
AWS Glue, Amazon Athena, Amazon Redshift, Amazon SageMaker, AWS Lambda

Query S3 using standard SQL (Presto as distributed engine)
Serverless: no infrastructure to set up or manage
Multiple data format support: define schema on demand
Query instantly, Pay per query ($), Open, Easy

Data scanned: 169.53 GB (of 2.2 TB)
Query duration: 44.66 seconds
Cost: $0.85 ($5/TB or $0.005/GB)
SELECT gram, year, sum(count)
FROM ngram
WHERE gram = 'just say no'
GROUP BY gram, year
ORDER BY year ASC;
registry.opendata.aws/google-ngrams
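The cost figure follows directly from the per-TB price. A small sketch of the arithmetic, assuming 1 TB = 1000 GB (which is what "$5/TB or $0.005/GB" implies) and ignoring any per-query minimum; `query_cost` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_TB = Decimal("5")   # USD per TB scanned, as quoted on the slide
GB_PER_TB = Decimal("1000")   # assumption implied by $5/TB = $0.005/GB


def query_cost(gb_scanned):
    """Return the pay-per-query cost in USD, rounded to cents."""
    cost = Decimal(str(gb_scanned)) * PRICE_PER_TB / GB_PER_TB
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(query_cost(169.53))  # 0.85, the query shown on the slide
print(query_cost(2200))    # 11.00, a scan of the full 2.2 TB
```

Scanning only the needed data costs $0.85, against roughly $11 for the full 2.2 TB dataset.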

year 2018 month 11 day 25

Amazon QuickSight easy Empower everyone Seamless connectivity Fast analysis Serverless

bit.ly/AWSDataLakeDemo

JSON Payload Example for each event
{
  "r": 255,
  "g": 0,
  "b": 0,
  "c": "Red",
  "device": {
    "id": "4992157",
    "browser": "Chrome",
    "browserVersion": "72.0.3626.109",
    "os": "Mac OS",
    "isMobile": false,
    "isMobileIOS": false,
    "isMobileAndroid": false
  },
  "dt": {
    "year": 2019,
    "month": 4,
    "day": 17,
    "hour": 16,
    "minutes": 30,
    "seconds": 47,
    "millis": 725
  },
  "id": 1551116627725,
  "region": "Europe",
  "awsExperience": "1-3 Years",
  "awsServiceArea": "Management Tools"
}
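Nested payloads like this one are often flattened before analysis. The sketch below shows one possible way to turn an event into a flat row with a single timestamp; the `device_` prefix, the `timestamp` field, and the function name are assumptions made for illustration, not part of the demo.

```python
import json
from datetime import datetime


def flatten_event(raw):
    """Flatten one JSON event into a single-level dict with an ISO timestamp."""
    event = json.loads(raw)
    dt = event["dt"]
    # datetime expects microseconds, so milliseconds are scaled by 1000.
    timestamp = datetime(
        dt["year"], dt["month"], dt["day"],
        dt["hour"], dt["minutes"], dt["seconds"], dt["millis"] * 1000,
    )
    # Keep the top-level scalar fields as they are.
    row = {k: v for k, v in event.items() if k not in ("device", "dt")}
    # Prefix device attributes to avoid clashing with the event "id".
    row.update({"device_" + k: v for k, v in event["device"].items()})
    row["timestamp"] = timestamp.isoformat(timespec="milliseconds")
    return row
```

For the payload above, the `dt` object collapses into the single value `2019-04-17T16:30:47.725`.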

Demo Architecture
Users, Client, Mobile client, AWS SDK
AWS Cloud, Region: Amazon CloudFront, Amazon Cognito, Amazon S3 (Web App), Amazon Kinesis Data Firehose, S3 Bucket, AWS Glue, Amazon Athena, Amazon QuickSight

Thank you!