Building a Modern Data Platform in the Cloud [AWS Dev Day @ Kyiv]

Alex Casalboni

June 11, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    KYIV
    06.11.19
    Building a Modern Data
    Platform in the Cloud
    Alex Casalboni
    Sr. Technical Evangelist
    Amazon Web Services
    @alex_casalboni

  2. About me
    • Software Engineer & Web Developer
    • Worked in a startup for 4.5 years
    • ServerlessDays Organizer
    • AWS Customer since 2013

  3. SUMMIT
    bit.ly/AWSDataLakeDemo

  4. To Become a Leader, Data is Your Differentiator
    Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations that implemented a Data Lake outperforming similar companies by 9% in organic revenue growth.*
    Organic revenue growth: Leaders 24%, Followers 15%
    *Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence

  5. Data variety and data volumes are increasing rapidly
    Multiple Consumers and Applications
    Ingest
    Discover
    Catalog
    Understand
    Curate
    Find insights

  6. Purpose-built engines
    Right tool for the job

  7. Collect Store Analyze
    Amazon Kinesis Firehose
    AWS Direct Connect
    AWS Snowball
    Amazon Kinesis Analytics
    Amazon Kinesis Streams
    Amazon S3
    Amazon Glacier
    Amazon CloudSearch
    Amazon RDS, Amazon Aurora
    Amazon DynamoDB
    Amazon Elasticsearch
    Amazon EMR
    Amazon Redshift
    Amazon QuickSight
    AWS Database Migration Service
    AWS Glue
    Amazon Athena
    Amazon SageMaker

  8. Traditionally, Analytics Used to Look Like This
    OLTP ERP CRM LOB
    Data Warehouse
    Business Intelligence
    • Relational data
    • TBs–PBs scale
    • Schema defined prior to data load
    • Operational reporting and ad hoc
    • Large initial CAPEX + $10K–$50K/TB/Year

  10. “A data lake is a centralized repository that
    allows you to store all your structured and
    unstructured data at any scale”

  11. Collect analyze
    semi-structured unstructured
    Decoupled
    ingestion
    on-read
    warehouses

  12. exabyte scale
    once
    many tools
    Open formats

  13. S3
    Elasticsearch
    Glue
    DynamoDB
    Catalog & search
    Cognito
    API Gateway
    API/UI
    Athena
    QuickSight
    Redshift Spectrum
    Analytics & processing
    Lambda
    Kinesis Streams
    Kinesis Firehose
    Direct Connect
    Ingest
    AWS IoT
    KMS
    CloudTrail
    IAM
    Macie
    Security & auditing

  14. CHALLENGE
    Need to create constant feedback loop
    for designers
    Gain up-to-the-minute understanding
    of gamer satisfaction to guarantee
    gamers are engaged, thus resulting in
    the most popular game played in the
    world
    Fortnite | 125+ million players

  16. time
    Kinesis Video Streams: Capture, process, and store video streams for analytics
    Kinesis Data Streams: Build custom applications that analyze data streams
    Kinesis Data Firehose: Load data streams into AWS data stores
    Kinesis Data Analytics: Analyze data streams with SQL

  17. Kinesis Firehose: Delivery stream
    Kinesis Agent
    Record producers
    Put Records
    Amazon S3: Buffered files
    Amazon Redshift: Table loads
    Amazon Elasticsearch Service: Domain loads
    Amazon S3: Source record backup
    Transformed records

  18. Kinesis Firehose: Delivery stream
    Kinesis Agent
    Record producers
    Put Records
    AWS Lambda: Transformations & enrichment
    Raw, Transformed
    Amazon S3: Buffered files
    Amazon Redshift: Table loads
    Amazon Elasticsearch Service: Domain loads
    Amazon S3: Source record backup
    Transformed records
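
The Lambda transformation step on this slide can be sketched roughly as follows. This is a minimal illustration, not the code used in the talk: it assumes each incoming record is a JSON object with `r`, `g`, `b` fields (as in the demo payload later in the deck), and the added `hex` field is a hypothetical enrichment. The event and response shapes (`records`, `recordId`, `result`, base64-encoded `data`) follow the Firehose data-transformation contract.

```python
import base64
import json


def to_hex(r, g, b):
    """Format an RGB triple as a hex colour string."""
    return "#{:02x}{:02x}{:02x}".format(r, g, b)


def handler(event, context):
    """Firehose transformation handler: enrich each JSON record in the batch."""
    output = []
    for record in event["records"]:
        try:
            # Firehose delivers each record's payload base64-encoded.
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical enrichment: derive a "hex" field from r, g, b.
            payload["hex"] = to_hex(payload["r"], payload["g"], payload["b"])
            # Newline-terminate so records stay separable once buffered into files.
            line = json.dumps(payload) + "\n"
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, KeyError, TypeError):
            # Malformed records are returned unchanged and flagged as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every `recordId` received must appear in the response; records marked `ProcessingFailed` are treated by Firehose as unsuccessfully processed rather than delivered as transformed data.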

  19. Open-source standards (Apache)
    Parquet, ORC, etc.
    Optimize Performance
    Optimize Costs
    Analytical queries

  22. Storing is Not Enough, Data Needs to Be Discoverable
    Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).
    Gartner IT Glossary, 2018
    https://www.gartner.com/it-glossary/dark-data
    CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data
    Semi-structured, Unstructured

  23. Building training sets
    Cleaning and organizing data
    Collecting data sets
    Mining data for patterns
    Refining algorithms
    Other
    80%

  24. &
    Data Catalog: Discover data and extract schema
    ETL Job authoring: Auto-generates customizable ETL code in Python and Spark
    Data & schema automatic discovery
    Generates customizable code for ETL
    Schedule and run ETL jobs periodically
    Serverless model

  25. AWS Glue Crawlers
    Automatically catalog your data
    Crawlers automatically build your data catalog and keep it in sync
    • Automatically discover new data & extract schema definitions
    • Detect schema changes and version tables
    • Detect Hive style partitions on Amazon S3
    • Built-in classifiers for popular types; custom classifiers using Grok expressions
    • Run ad hoc or on a schedule; serverless – only pay when crawler runs

  26. AWS Lake Formation (join the preview)
    Build, secure, and manage a data lake in days
    Build a data lake in days, not months
    Build and deploy a fully managed data lake with a few clicks
    Enforce security policies across multiple services
    Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
    Combine different analytics approaches
    Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog

  28. User-Defined Functions
    • Bring your own functions & code
    • Execute without provisioning servers
    Processing and Querying In Place
    Fully Managed Process & Query
    AWS Glue
    Amazon Athena
    Amazon Redshift
    Amazon SageMaker
    AWS Lambda

  29. Query S3 using standard SQL (Presto as distributed engine)
    Serverless - No infrastructure to set up or manage
    Multiple data format support – Define Schema on Demand
    $
    Query Instantly Pay per query Open Easy

  31. Data scanned: 169.53GB (of 2.2TB)
    Query duration: 44.66 seconds
    Cost: $0.85
    ($5/TB or $0.005/GB)
    SELECT gram, year, sum(count)
    FROM ngram
    WHERE gram = 'just say no'
    GROUP BY gram, year
    ORDER BY year ASC;
    registry.opendata.aws/google-ngrams
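
The cost figure on this slide follows directly from the amount of data scanned. A quick check of the arithmetic, using only the numbers shown above and assuming 1 TB = 1000 GB (which is what "$5/TB or $0.005/GB" implies):

```python
PRICE_PER_TB_USD = 5.0  # per-TB price quoted on the slide
GB_PER_TB = 1000        # assumption implied by "$5/TB or $0.005/GB"


def query_cost_usd(scanned_gb):
    """Estimated query cost in USD for a given amount of data scanned."""
    return scanned_gb * PRICE_PER_TB_USD / GB_PER_TB


cost = query_cost_usd(169.53)       # data actually scanned by the query
full_scan = query_cost_usd(2200.0)  # the whole 2.2TB dataset

print(f"query cost: ${cost:.2f}")
print(f"full scan:  ${full_scan:.2f}")
```

Scanning less data is what keeps the bill down: the same query over the full 2.2TB would cost about $11 at this rate.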

  32. year 2018 month 11 day 25
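
The year/month/day values on this slide look like partition keys. A minimal sketch of how a Hive-style partition prefix on S3 could be built from a date, assuming a hypothetical `events` base prefix and zero-padded month and day (the deck does not show the exact layout used in the demo):

```python
from datetime import date


def partition_prefix(day, base="events"):
    """Build a Hive-style key prefix (key=value path segments) for one day."""
    # "base" is a hypothetical top-level prefix, not taken from the talk.
    return f"{base}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


print(partition_prefix(date(2018, 11, 25)))
```

With data laid out this way, a query that filters on year, month, and day only needs to read the objects under the matching prefix.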

  33. Amazon QuickSight
    easy
    Empower
    everyone
    Seamless
    connectivity
    Fast analysis Serverless

  35. SUMMIT
    bit.ly/AWSDataLakeDemo

  36. JSON Payload Example for each event
    {
      "r": 255,
      "g": 0,
      "b": 0,
      "c": "Red",
      "device": {
        "id": "4992157",
        "browser": "Chrome",
        "browserVersion": "72.0.3626.109",
        "os": "Mac OS",
        "isMobile": false,
        "isMobileIOS": false,
        "isMobileAndroid": false
      },
      "dt": {
        "year": 2019,
        "month": 4,
        "day": 17,
        "hour": 16,
        "minutes": 30,
        "seconds": 47,
        "millis": 725
      },
      "id": 1551116627725,
      "region": "Europe",
      "awsExperience": "1-3 Years",
      "awsServiceArea": "Management Tools"
    }
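
An event like this has to be serialized before a producer can send it to a delivery stream. The sketch below is one possible way to do that, not the demo's actual client code: `to_firehose_record` is a hypothetical helper that encodes the event as a single newline-terminated JSON line. The trailing newline matters because Firehose concatenates records as-is when buffering them into files.

```python
import json


def to_firehose_record(event):
    """Serialize an event dict as one newline-terminated JSON line."""
    line = json.dumps(event, separators=(",", ":")) + "\n"
    # The {"Data": bytes} shape matches the Record argument of the
    # Firehose PutRecord API (for example boto3's put_record); no AWS
    # call is made here.
    return {"Data": line.encode("utf-8")}


event = {"r": 255, "g": 0, "b": 0, "c": "Red", "region": "Europe"}
record = to_firehose_record(event)
print(record["Data"])
```

The resulting files contain one JSON object per line, a layout that line-oriented JSON readers such as the one Athena uses can parse record by record.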

  37. Demo Architecture
    Amazon CloudFront
    Amazon Cognito
    Amazon S3
    Web App
    Users
    Amazon Kinesis Data Firehose
    Amazon Athena
    AWS Glue
    Amazon QuickSight
    Client
    Mobile client
    AWS SDK
    S3 Bucket
    AWS Cloud
    Region

  38. Thank you!
    Alex Casalboni
    @alex_casalboni
