AWS data services for machine learning - AWS Innovate Online

Track: "I'm in DevOps working with data scientists & researchers"

Alex Casalboni

October 15, 2019

Transcript

  1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
    AWS data services for machine learning
    Alex Casalboni
    Technical Evangelist
    Amazon Web Services
    @alex_casalboni

  2. Agenda
    The evolution of data challenges
    What’s a data lake?
    Data ingestion & real-time data
    Query engines & ETL
    Demo

  3. Before 2009: The DBA years
      Problem: My reports make my database server very slow
      Solution: Overnight DB dump; read-only replica
    2009-2011: The Hadoop epiphany
      Problems: My data doesn’t fit in one machine; and it’s not only transactional
      Solution: Hadoop; Map/Reduce all the things
    2012-2014: The Message Broker and NoSQL Age
      Problems: My data is very fast; Map/Reduce is hard to use
      Solution: Kafka/RabbitMQ; Cassandra/HBase/Storm; basic ETL; Hive
    2015-2017: The Spark kingdom and the spreadsheet wars
      Problems: Duplicating batch/stream is inefficient; I need to cleanse my source data; Hadoop ecosystem is hard to manage; my data scientists don’t like Java; I am not sure which data we are already processing
      Solution: Kafka/Spark; complex ETL; create new departments for data governance; spreadsheet all the things
    2017-2018: The myth of DataOps
      Problems: Streaming is hard; my schemas have evolved; I cannot query old and new data together; my cluster is running old versions and upgrading is hard; I want to use ML
      Solution: Kafka/Flink (Java or Scala required); complex ETL with a pinch of ML; Apache Atlas; commercial distributions

  4. Data variety and data volumes are increasing rapidly
    Multiple consumers and applications
    Ingest
    Discover
    Catalog
    Understand
    Curate
    Find insights
    Amazon Kinesis Data Streams
    Amazon Kinesis Data Firehose
    On-premises databases

  5. Some problems during all periods
    More time spent maintaining the cluster than adding functionality
    Security and monitoring are hard
    Cluster is sitting idle most of the time
    No time left to experiment
    Frustration because data preparation, cleansing, and basic transformations
    take 80% of our time

  6. The downfall of the data engineer
    Watching paint dry is exciting in comparison to writing and maintaining Extract
    Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors
    or issues tend to happen at runtime or are post-runtime assertions. Since the
    development time to execution time ratio is typically low, being productive means
    juggling with multiple pipelines at once and inherently doing a lot of context
    switching. By the time one of your 5 running “big data jobs” has finished, you have to
    get back in the mind space you were in many hours ago and craft your next iteration.
    Depending on how caffeinated you are, how long it’s been since the last iteration,
    and how systematic you are, you may fail at restoring the full context in your short-
    term memory. This leads to systemic, stupid errors that waste hours.


    Maxime Beauchemin
    Data engineer @ Lyft
    Also, creator of Apache Airflow and Apache Superset. Ex-Facebook, Ex-Yahoo!, Ex-Airbnb
    medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b

  7. Purpose-built engines
    Right tool for the job

  8. Purpose-built analytics tools
    Collect Store Analyze
    Amazon Kinesis Data Firehose
    AWS Direct Connect
    AWS Snowball
    Amazon Kinesis Data Analytics
    Amazon Kinesis Data Streams
    Amazon S3
    Amazon S3 Glacier
    Amazon CloudSearch
    Amazon RDS, Amazon Aurora
    Amazon DynamoDB
    Amazon Elasticsearch Service
    Amazon EMR
    Amazon Redshift
    Amazon QuickSight
    AWS Database Migration Service
    AWS Glue
    Amazon Athena
    Amazon SageMaker

  9. What’s a data lake?

  10. What’s a data lake?
    Collect, store, process, consume, and analyze organizational data
    Structured, semi-structured, and unstructured data
    Decoupled compute and (low-cost) storage
    Fast automated ingestion
    Schema-on-read
    Allows self-service and easy plug and play
    Complementary to data warehouses
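
A minimal sketch of what schema-on-read means in practice, assuming raw events are stored untouched as JSON lines and a schema is applied only at query time. The event fields (`player`, `score`) and the helper `read_with_schema` are hypothetical and are not taken from the talk.

```python
import json

# Raw data is stored exactly as it arrived; nothing is validated on write.
RAW_EVENTS = [
    '{"player": "p1", "score": "42", "ts": "2019-10-15T10:00:00"}',
    '{"player": "p2", "score": 17}',
    "not json at all",
]

# The schema lives with the reader, not with the storage layer.
SCHEMA = {"player": str, "score": int}


def read_with_schema(lines, schema):
    """Yield only the records that can be projected onto the schema."""
    for line in lines:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # unparseable lines stay in storage but are skipped here
        if not isinstance(record, dict):
            continue
        try:
            # Cast each requested field; extra fields are simply ignored.
            yield {field: cast(record[field]) for field, cast in schema.items()}
        except (KeyError, TypeError, ValueError):
            continue  # record does not fit this particular schema


rows = list(read_with_schema(RAW_EVENTS, SCHEMA))
```

A different consumer could read the same raw lines with a different schema, which is the property that lets storage and compute stay decoupled.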

  11. A possible open-source solution
    Hadoop Cluster (static/multi tenant)
    Apache NiFi for ingestion workflows
    Sqoop to ingest data from RDBMS
    HDFS to store the data (tied to the Hadoop cluster)
    Hive/HCatalog for data catalog
    Apache Atlas for a more human data catalog and governance
    Apache Spark for complex ETL—with Apache Livy for REST
    Hive for batch workloads with SQL
    Presto for interactive queries with SQL
    Kafka for streaming ingest
    Apache Spark/Apache Flink for streaming analytics
    Apache HBase (or maybe Cassandra) to store streaming data
    Apache Phoenix to run SQL queries on top of HBase
    Prometheus (or fluentd/collectd/Ganglia/Nagios…) for logs and monitoring
    Airflow/Oozie to schedule workflows
    Superset for business dashboards
    Jupyter/JupyterHub/Zeppelin for data science
    Security (Apache Sentry for Roles, Ranger for configuration, Knox as a firewall)
    YARN to coordinate resources
    Ambari for cluster administration
    Terraform/Chef/Puppet for provisioning

  12. Or a cloud-native solution on AWS
    Amazon DynamoDB
    Amazon Elasticsearch Service
    AWS AppSync
    Amazon API Gateway
    Amazon Cognito
    AWS KMS
    AWS CloudTrail
    AWS IAM
    Amazon CloudWatch
    AWS Snowball
    AWS Storage Gateway
    Amazon Kinesis Data Firehose
    AWS Direct Connect
    AWS Database Migration Service
    Amazon Athena
    Amazon EMR
    AWS Glue
    Amazon Redshift
    Amazon DynamoDB
    Amazon QuickSight
    Amazon Kinesis
    Amazon Elasticsearch Service
    Amazon Neptune
    Amazon RDS
    AWS Glue

  13. Data lakes & analytics on AWS

  14. CHALLENGE
    Need to create constant feedback loop for designers
    Gain up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, thus resulting in the most popular game played in the world
    Fortnite | 125+ million players

  15. Data ingestion and real-time data

  16. Amazon Kinesis: Real-time analytics
    Easily collect, process, and analyze video and data streams in real time
    Kinesis Video Streams: Capture, process, and store video streams for analytics
    Kinesis Data Streams: Build custom applications that analyze data streams
    Kinesis Data Firehose: Load data streams into AWS data stores
    Kinesis Data Analytics: Analyze data streams with SQL

  17. Kinesis Data Firehose: Delivery stream
    Record producers: Kinesis Agent, Put Records
    AWS Lambda: Transformations & enrichment (raw records in, transformed records out)
    Amazon S3: Source record backup
    Destinations for transformed records:
      Amazon S3: Buffered files
      Amazon Redshift: Table loads
      Amazon Elasticsearch Service: Domain loads
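
The Lambda transformation step in this flow can be sketched as follows. This is a minimal example, assuming each incoming record carries a base64-encoded JSON object; it follows the Firehose data-transformation contract (each output record echoes the `recordId`, sets a `result`, and returns base64 `data`). The `processed` flag is a hypothetical enrichment added purely for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform a batch of Firehose records and report a result for each."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Hand the original bytes back and flag the record as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue

        payload["processed"] = True  # hypothetical enrichment
        # A trailing newline keeps the delivered objects line-delimited.
        encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": encoded.decode("utf-8"),
        })
    return {"records": output}
```

Every record in the batch is returned with its original `recordId`, so a failure in one record does not drop the others.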

  18. Query engines & ETL

  19. The overhead of data preparation
    Building training sets
    Cleaning and organizing data
    Collecting datasets
    Mining data for patterns
    Refining algorithms
    Other
    80%

  20. AWS Glue: Cleanse, prep, and catalog
    AWS Glue Data Catalog: A single view across your data lake
      Automatically discovers data and stores schema
      Makes data searchable and available for ETL with table definitions and custom metadata
    AWS Glue ETL jobs: Clean, transform, and store processed data
      Serverless Apache Spark environment
      AWS Glue ETL libraries or bring your own code
      Write jobs in Python or Scala
    Diagram: Amazon S3 (Raw data), Amazon S3 (Staging data), Amazon S3 (Processed data); crawlers on each feed the AWS Glue Data Catalog

  21. Amazon Athena
    Query S3 using standard SQL (Presto as distributed engine)
    Serverless: No infrastructure to set up or manage
    Multiple data format support: Define schema on demand
    Query instantly, Pay per query, Open, Easy
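
Querying S3 through Athena from code might look like the sketch below. It assumes the caller passes in a boto3 Athena client (for example `boto3.client("athena")`); the database name, result bucket, and SQL text are placeholders, and `build_query_request` and `run_query` are hypothetical helpers, not part of any AWS SDK.

```python
import time

# States after which Athena will not change the query status again.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def build_query_request(sql, database, output_location):
    """Build the keyword arguments for start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query and poll until it reaches a terminal state."""
    request = build_query_request(sql, database, output_location)
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)
```

With a real client, the rows of a succeeded query can then be fetched with `get_query_results(QueryExecutionId=query_id)`. There is no cluster to start or stop at any point, which is the "serverless" part of the slide.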

  23. Amazon EMR: Big data processing
    Low cost: Flexible billing with per-second billing, EC2 spot and reserved instances, and auto-scaling to reduce costs 50–80%
    Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning
    Latest versions: Updated with the latest open-source frameworks within 30 days of release
    Use S3 storage: Process data directly in the S3 data lake securely with high performance using the EMRFS connector

  25. Amazon Redshift: Data warehousing
    Fast at scale: Columnar storage technology to improve I/O efficiency and scale query performance
    Secure: Audit everything; encrypt data end to end; extensive certification and compliance
    Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
    Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
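
To put the two price points on this slide side by side, here is a small worked calculation. It uses only the figures quoted above ($0.25 per hour, $1,000 per terabyte per year, 1/10th of traditional cost) and assumes a node that runs continuously for a 365-day year; actual pricing varies by node type, region, and commitment, so treat the output as illustrative arithmetic rather than a quote.

```python
HOURS_PER_YEAR = 24 * 365  # assumes the node is never paused

ENTRY_PRICE_PER_HOUR = 0.25          # "start at $0.25 per hour"
PRICE_PER_TB_YEAR = 1000.0           # "as low as $1,000 per terabyte per year"
TRADITIONAL_COST_MULTIPLIER = 10     # "1/10th the cost of traditional ..."


def yearly_cost_of_entry_node(price_per_hour=ENTRY_PRICE_PER_HOUR):
    """Cost of keeping the cheapest node running for a full year."""
    return price_per_hour * HOURS_PER_YEAR


def yearly_cost_for_storage(terabytes, price_per_tb_year=PRICE_PER_TB_YEAR):
    """Best-case yearly cost for a given amount of warehouse data."""
    return terabytes * price_per_tb_year


entry_node_per_year = yearly_cost_of_entry_node()   # 0.25 * 8760
ten_tb_per_year = yearly_cost_for_storage(10)
# The per-terabyte figure the slide implies for a traditional warehouse.
implied_traditional_per_tb = PRICE_PER_TB_YEAR * TRADITIONAL_COST_MULTIPLIER
```

Under these assumptions the entry node comes to $2,190 per year, and 10 TB at the best-case rate comes to $10,000 per year.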

  26. Amazon Redshift Spectrum
    Extend the data warehouse to exabytes of data in S3 data lake
    Diagram: S3 data lake, Amazon Redshift data, Amazon Redshift Spectrum query engine
    • Exabyte Amazon Redshift SQL queries against S3
    • Join data across Amazon Redshift and S3
    • Scale compute and storage separately
    • Stable query performance and unlimited concurrency
    • CSV, ORC, Avro & Parquet data formats
    • Pay only for the amount of data scanned

  27. Let’s play a game
    Werner Vogels
    Amazon’s CTO, AWS Summit San Francisco, 2017
    youtu.be/RpPf38L0HHU?t=3963

  30. Amazon QuickSight
    easy
    Empower everyone
    Seamless connectivity
    Fast analysis
    Serverless
    Now with ML superpowers!

  31. Demo

  32. Thank you!
    Alex Casalboni
    Technical Evangelist
    Amazon Web Services
    @alex_casalboni
