rights reserved.
KYIV 06.11.19
Building a Modern Data Platform in the Cloud
Alex Casalboni
Sr. Technical Evangelist, Amazon Web Services
@alex_casalboni
To Become a Leader, Data is Your Differentiator
outperform their peers.
An Aberdeen survey saw organizations who implemented a Data Lake outperforming similar companies by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
LOB, Data Warehouse, Business Intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/Year
up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world
Fortnite | 125+ million players
• Kinesis Video Streams
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Analytics: Analyze data streams with SQL
Table loads
Amazon Elasticsearch Service: Domain loads
Amazon S3: Source record backup
Transformed records
Put Records
Kinesis Firehose: Delivery stream
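The "Put Records" step feeding the delivery stream can be sketched as a small batching helper. This is a minimal illustration, not the presenter's code: the event fields (`player_id`, `score`) are hypothetical, newline-delimited JSON is an assumed record format, and the 500-records-per-call limit is an assumption about the Firehose `PutRecordBatch` API. The helper only prepares batches and never calls AWS.

```python
import json

# Assumed per-call record limit for Firehose PutRecordBatch.
MAX_RECORDS_PER_BATCH = 500


def to_firehose_batches(events, max_records=MAX_RECORDS_PER_BATCH):
    """Yield lists of {"Data": bytes} records, each list at most max_records long."""
    batch = []
    for event in events:
        # One JSON document per line, so downstream stores can split records.
        batch.append({"Data": (json.dumps(event) + "\n").encode("utf-8")})
        if len(batch) == max_records:
            yield batch
            batch = []
    if batch:
        # Emit the final, partially filled batch.
        yield batch


# Hypothetical events, for illustration only.
events = [{"player_id": i, "score": i * 10} for i in range(1200)]
batches = list(to_firehose_batches(events))
print([len(b) for b in batches])  # [500, 500, 200]
```

Each batch could then be passed as the `Records` argument of boto3's `put_record_batch` on a Firehose client, together with a `DeliveryStreamName` of your own choosing.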
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018, https://www.gartner.com/it-glossary/dark-data
CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data, Semi-structured, Unstructured
schema
• Auto-generates customizable ETL code in Python and Spark
• Data & schema automatic discovery
• Generates customizable code for ETL
• Schedule and run ETL jobs periodically
• Serverless model
AWS Glue Crawlers: Automatically catalog your data
Crawlers automatically build your data catalog and keep it in sync
• Automatically discover new data & extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
• Built-in classifiers for popular types; custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless – only pay when crawler runs
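Hive-style partitioning encodes partition columns in the object key as `name=value` path segments. The sketch below illustrates that naming convention only; it is not the Glue crawler's detection logic, and the function name and example key are hypothetical.

```python
def parse_hive_partitions(key):
    """Return partition columns encoded as name=value segments of an object key."""
    partitions = {}
    # The last segment is the object (file) name, so only the prefix is inspected.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical object key, for illustration only.
key = "logs/year=2019/month=11/day=06/part-0000.json"
print(parse_hive_partitions(key))  # {'year': '2019', 'month': '11', 'day': '06'}
```

A key without `name=value` segments, such as `logs/2019/11/part-0000.json`, yields an empty dictionary, which is why the `name=value` layout is the one a catalog can map to partition columns without extra configuration.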
a data lake in days
• Build a data lake in days, not months: Build and deploy a fully managed data lake with a few clicks
• Enforce security policies across multiple services: Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
• Combine different analytics approaches: Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
Execute without provisioning servers
Processing and Querying In Place
Fully Managed Process & Query
AWS Glue, Amazon Athena, Amazon Redshift, Amazon SageMaker, AWS Lambda
$0.85 ($5/TB or $0.005/GB)
SELECT gram, year, sum(count)
FROM ngram
WHERE gram = 'just say no'
GROUP BY gram, year
ORDER BY year ASC;
registry.opendata.aws/google-ngrams
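The pricing figures on this slide can be sanity-checked with a short calculation. The sketch assumes that $0.85 is the cost of running the query shown and that billing is purely per byte scanned at the quoted $5/TB rate, with 1 TB = 1000 GB (consistent with the $0.005/GB figure). The slide text is truncated, so the meaning of the $0.85 figure is an assumption, and the helper names are hypothetical.

```python
from decimal import Decimal

# Rate quoted on the slide: $5 per TB scanned, i.e. $0.005 per GB.
PRICE_PER_TB = Decimal("5")
PRICE_PER_GB = PRICE_PER_TB / Decimal("1000")


def query_cost(gb_scanned):
    """Dollar cost of scanning gb_scanned GB at the quoted rate."""
    return Decimal(str(gb_scanned)) * PRICE_PER_GB


def gb_scanned_for(cost):
    """GB of data that a given dollar cost corresponds to at the quoted rate."""
    return Decimal(str(cost)) / PRICE_PER_GB


# Under the stated assumptions, $0.85 corresponds to 170 GB scanned.
implied_gb = gb_scanned_for("0.85")
print(int(implied_gb))  # 170
```

Because the charge scales with data scanned rather than with rows returned, layouts that reduce scanned bytes (partitioning, columnar formats, compression) lower the cost of the same query.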