
Building a Modern Data Platform in the Cloud [AWS Dev Day @ Kyiv]

Alex Casalboni

June 11, 2019

Transcript

  1. Building a Modern Data Platform in the Cloud
    Alex Casalboni, Sr. Technical Evangelist, Amazon Web Services, @alex_casalboni
    KYIV, 06.11.19
    © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. About me
    • Software Engineer & Web Developer
    • Worked in a startup for 4.5 years
    • ServerlessDays Organizer
    • AWS Customer since 2013
  3. To Become a Leader, Data is Your Differentiator
    Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations that implemented a data lake outperforming similar companies by 9% in organic revenue growth.*
    [Chart: organic revenue growth, Leaders 24% vs. Followers 15%]
    *Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
  4. Data variety and data volumes are increasing rapidly
    Multiple Consumers and Applications
    Ingest, Discover, Catalog, Understand, Curate, Find insights
  5. Collect, Store, Analyze
    Amazon Kinesis Firehose, AWS Direct Connect, AWS Snowball, Amazon Kinesis Analytics, Amazon Kinesis Streams, Amazon S3, Amazon Glacier, Amazon CloudSearch, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon Elasticsearch, Amazon EMR, Amazon Redshift, Amazon QuickSight, AWS Database Migration Service, AWS Glue, Amazon Athena, Amazon SageMaker
  6. Traditionally, Analytics Used to Look Like This
    [Diagram labels: OLTP, ERP, CRM, LOB; Data Warehouse; Business Intelligence]
    • Relational data
    • TBs–PBs scale
    • Schema defined prior to data load
    • Operational reporting and ad hoc
    • Large initial CAPEX + $10K–$50K/TB/Year
  7. “A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale”
  8. [Architecture diagram labels] S3 Elasticsearch Glue DynamoDB Catalog & search Cognito API Gateway API/UI Athena QuickSight Redshift Spectrum Analytics & processing Lambda Kinesis Streams Kinesis Firehose Direct Connect Ingest AWS IoT KMS CloudTrail IAM Macie Security & auditing
  9. CHALLENGE
    Need to create a constant feedback loop for designers.
    Gain an up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world.
    Fortnite | 125+ million players
  10. Kinesis Video Streams: capture, process, and store video streams for analytics
    Kinesis Data Streams: build custom applications that analyze data streams
    Kinesis Data Firehose: load data streams into AWS data stores
    Kinesis Data Analytics: analyze data streams with SQL
  11. Kinesis Firehose: Delivery stream
    [Diagram labels: Record producers; Kinesis Agent; Put Records; Transformed records; Amazon S3: Buffered files; Amazon Redshift: Table loads; Amazon Elasticsearch Service: Domain loads; Amazon S3: Source record backup]
  12. Kinesis Firehose: Delivery stream, with AWS Lambda: Transformations & enrichment
    [Diagram labels: Record producers; Kinesis Agent; Put Records; Raw; Transformed; Transformed records; Amazon S3: Buffered files; Amazon Redshift: Table loads; Amazon Elasticsearch Service: Domain loads; Amazon S3: Source record backup]
  13. Storing is Not Enough, Data Needs to Be Discoverable
    “Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
    Gartner IT Glossary, 2018, https://www.gartner.com/it-glossary/dark-data
    [Diagram labels: CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data, Semi-structured, Unstructured]
  14. [Chart labels: Building training sets; Cleaning and organizing data; Collecting data sets; Mining data for patterns; Refining algorithms; Other; 80%]
  15. Data Catalog, ETL Job authoring
    • Discover data and extract schema
    • Auto-generates customizable ETL code in Python and Spark
    • Data & schema automatic discovery
    • Generates customizable code for ETL
    • Schedule and run ETL jobs periodically
    • Serverless model
  16. AWS Glue Crawlers: Crawlers automatically catalog your data
    Crawlers automatically build your data catalog and keep it in sync
    • Automatically discover new data & extract schema definitions
    • Detect schema changes and version tables
    • Detect Hive style partitions on Amazon S3
    • Built-in classifiers for popular types; custom classifiers using Grok expression
    • Run ad hoc or on a schedule; serverless – only pay when crawler runs
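To make "Hive style partitions" concrete: the convention encodes partition columns as `name=value` path segments in the object key. The sketch below only illustrates that naming convention; it is not how Glue crawlers are implemented, and the example key is hypothetical.

```python
def parse_hive_partitions(key: str) -> dict:
    """Extract name=value partition segments from an S3 object key."""
    partitions = {}
    # The last path segment is the object name, so it is skipped.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


# Hypothetical key, used only for illustration.
example_key = "events/year=2019/month=04/day=17/part-0000.json"
print(parse_hive_partitions(example_key))
# {'year': '2019', 'month': '04', 'day': '17'}
```

A catalog that recognizes this layout can expose `year`, `month`, and `day` as table columns, so queries that filter on them read only the matching prefixes.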
  17. AWS Lake Formation (join the preview)
    Build, secure, and manage a data lake in days
    • Build a data lake in days, not months: Build and deploy a fully managed data lake with a few clicks
    • Enforce security policies across multiple services: Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
    • Combine different analytics approaches: Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
  18. Processing and Querying In Place
    Fully Managed Process & Query
    User-Defined Functions
    • Bring your own functions & code
    • Execute without provisioning servers
    [Services shown: AWS Glue, Amazon Athena, Amazon Redshift, Amazon SageMaker, AWS Lambda]
  19. Query S3 using standard SQL (Presto as distributed engine)
    • Serverless – No infrastructure to set up or manage
    • Multiple data format support – Define Schema on Demand
    [Icon labels: Query Instantly; Pay per query; Open; Easy]
  20. Data scanned: 169.53GB (of 2.2TB)
    Query duration: 44.66 seconds
    Cost: $0.85 ($5/TB or $0.005/GB)
    SELECT gram, year, sum(count) FROM ngram WHERE gram = 'just say no' GROUP BY gram, year ORDER BY year ASC;
    registry.opendata.aws/google-ngrams
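The cost figure on this slide follows directly from the amount of data scanned. A minimal sketch of the arithmetic, assuming the slide's own conversion of 1 TB = 1000 GB and its $5/TB rate (function and constant names are illustrative, and any billing minimums or rounding rules are ignored):

```python
PRICE_PER_TB_USD = 5.0  # rate quoted on the slide
GB_PER_TB = 1000        # matches the slide's "$5/TB or $0.005/GB"


def athena_query_cost(scanned_gb: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimate the pay-per-query cost from the amount of data scanned."""
    return scanned_gb / GB_PER_TB * price_per_tb


scanned_gb = 169.53
dataset_gb = 2.2 * GB_PER_TB

print(f"cost: ${athena_query_cost(scanned_gb):.2f}")           # cost: $0.85
print(f"share of data scanned: {scanned_gb / dataset_gb:.1%}")  # share of data scanned: 7.7%
```

Because the charge tracks bytes scanned rather than table size, the query touches under 8% of the 2.2TB dataset and is billed accordingly.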
  21. JSON Payload Example for each event
    { "r": 255, "g": 0, "b": 0, "c": "Red", "device": { "id": "4992157", "browser": "Chrome", "browserVersion": "72.0.3626.109", "os": "Mac OS", "isMobile": false, "isMobileIOS": false, "isMobileAndroid": false }, "dt": { "year": 2019, "month": 4, "day": 17, "hour": 16, "minutes": 30, "seconds": 47, "millis": 725 }, "id": 1551116627725, "region": "Europe", "awsExperience": "1-3 Years", "awsServiceArea": "Management Tools" }
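As a rough illustration of how such an event could be consumed, the sketch below rebuilds a timestamp from the `dt` object and derives a Hive-style partition prefix from it. The helper names and the `year=/month=/day=/hour=` layout are assumptions made for this example; the deck does not state how the demo lays out its objects, and the payload carries no time zone, so the timestamp is left naive.

```python
import json
from datetime import datetime

# Trimmed copy of the slide's payload; only the fields used below are kept.
EVENT_JSON = """
{ "c": "Red",
  "dt": { "year": 2019, "month": 4, "day": 17, "hour": 16,
          "minutes": 30, "seconds": 47, "millis": 725 },
  "region": "Europe" }
"""


def event_timestamp(event: dict) -> datetime:
    """Rebuild a naive datetime from the event's dt fields."""
    dt = event["dt"]
    return datetime(dt["year"], dt["month"], dt["day"], dt["hour"],
                    dt["minutes"], dt["seconds"], dt["millis"] * 1000)


def partition_prefix(event: dict) -> str:
    """Build a Hive-style prefix (hypothetical layout) from the dt fields."""
    dt = event["dt"]
    return (f"year={dt['year']}/month={dt['month']:02d}/"
            f"day={dt['day']:02d}/hour={dt['hour']:02d}")


event = json.loads(EVENT_JSON)
print(event_timestamp(event).isoformat())  # 2019-04-17T16:30:47.725000
print(partition_prefix(event))             # year=2019/month=04/day=17/hour=16
```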
  22. Demo Architecture
    [Diagram labels: Users; Client; Mobile client; Web App; AWS SDK; Amazon CloudFront; Amazon Cognito; Amazon S3; S3 Bucket; Amazon Kinesis Data Firehose; AWS Glue; Amazon Athena; Amazon QuickSight; AWS Cloud; Region]
  23. Thank you!
    Alex Casalboni, @alex_casalboni