
Building a Modern Data Platform in the Cloud

Modern data is massive, quickly evolving, unstructured, and increasingly hard to catalog and understand across multiple consumers and applications. This session will guide you through the best practices for designing a robust data architecture, highlighting the benefits and typical challenges of data lakes and data warehouses. We will build a scalable solution based on managed services such as Amazon Athena, AWS Glue, and AWS Lake Formation.


Transcript

1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Building a modern data platform on AWS. Sébastien Stormacq, AWS Tech Evangelist, @sebsto
2. A brief opinionated history of data analytics

Before 2009, the DBA years
• Problem: my reports make my database server very slow
• Solution: overnight DB dump; read-only replica

2009-2011, the Hadoop epiphany
• Problem: my data doesn’t fit in one machine, and it’s not only transactional
• Solution: Hadoop; Map/Reduce all the things

2012-2014, the Message Broker and NoSQL age
• Problem: my data is very fast; Map/Reduce is hard to use
• Solution: Kafka/RabbitMQ; Cassandra/HBase/Storm; basic ETL; Hive

2015-2017, the Spark kingdom and the spreadsheet wars
• Problem: duplicating batch/stream is inefficient; I need to cleanse my source data; the Hadoop ecosystem is hard to manage; my data scientists don’t like Java; I am not sure which data we are already processing
• Solution: Kafka/Spark; complex ETL; create new departments for data governance; spreadsheet all the things

2017-2018, the myth of DataOps
• Problem: streaming is hard; my schemas have evolved; I cannot query old and new data together; my cluster is running old versions and upgrading is hard; I want to use ML
• Solution: Kafka/Flink (Java or Scala required); complex ETL with a pinch of ML; Apache Atlas; commercial distributions
3. Some problems during all periods
• My team spends more time maintaining the cluster than adding functionality
• Security and monitoring are hard
• Most of the time my cluster is sitting idle; then it’s a bottleneck
• I don’t have the time to experiment
• Data preparation, cleansing, and basic transformations take a disproportionately high amount of my time. And it’s so frustrating
4. Some simple things that scare me (and eat my productivity)
• Text encodings
• Empty strings; literal “NULL” strings
• Uppercase and lowercase
• Date and time formats: which date would you say 1/4/19 is? And 1553589297?
• CSV, especially if uploaded by end users
• JSON files with a single array and 200,000 records inside
• The same JSON file when row 176,543 has a column never seen before
• The same JSON file when all the numbers are strings
• XML
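The date pitfalls on this slide are easy to demonstrate in a few lines of Python: the same string parses to two different dates depending on the locale convention you assume, and the bare integer only makes sense as Unix epoch seconds.

```python
from datetime import datetime, timezone

# "1/4/19" is ambiguous: US convention reads month/day, most of Europe day/month
us = datetime.strptime("1/4/19", "%m/%d/%y")
eu = datetime.strptime("1/4/19", "%d/%m/%y")
print(us.date())  # 2019-01-04
print(eu.date())  # 2019-04-01

# 1553589297 is a Unix timestamp: seconds since 1970-01-01 UTC
ts = datetime.fromtimestamp(1553589297, tz=timezone.utc)
print(ts.isoformat())  # 2019-03-26T08:34:57+00:00
```

A pipeline that guesses wrong silently shifts every record by months, which is exactly why these "simple things" eat so much time.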
5. The downfall of the data engineer
“Watching paint dry is exciting in comparison to writing and maintaining Extract, Transform, and Load (ETL) logic. Most ETL jobs take a long time to execute and errors or issues tend to happen at runtime or are post-runtime assertions. Since the development time to execution time ratio is typically low, being productive means juggling with multiple pipelines at once and inherently doing a lot of context switching. By the time one of your 5 running ‘big data jobs’ has finished, you have to get back in the mind space you were in many hours ago and craft your next iteration. Depending on how caffeinated you are, how long it’s been since the last iteration, and how systematic you are, you may fail at restoring the full context in your short term memory. This leads to systemic, stupid errors that waste hours.”
Maxime Beauchemin, data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset. Ex-Facebook, ex-Yahoo!, ex-Airbnb.
https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b
6. Modern data analytics 101: a data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
7. A good data lake allows self-service and can easily plug in new analytical engines.
8. It’s all about democratizing access to your data across the organization.
9. The concept of a data lake
• All data in one place, a single source of truth
• Handles structured/semi-structured/unstructured/raw data
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
10. A possible open-source solution
• Hadoop cluster (static/multi-tenant)
• Apache NiFi for ingestion workflows
• Sqoop to ingest data from RDBMS
• HDFS to store the data (tied to the Hadoop cluster)
• Hive/HCatalog for the data catalog
• Apache Atlas for a more human data catalog and governance
• Apache Spark for complex ETL, with Apache Livy for REST
• Hive for batch workloads with SQL
• Presto for interactive queries with SQL
• Kafka for streaming ingest
• Apache Spark/Apache Flink for streaming analytics
• Apache HBase (or maybe Cassandra) to store streaming data
• Apache Phoenix to run SQL queries on top of HBase
• Prometheus (or fluentd/collectd/ganglia/Nagios…) for logs and monitoring, maybe with Elasticsearch/Kibana
• Airflow/Oozie to schedule workflows
• Superset for business dashboards
• Jupyter/JupyterHub/Zeppelin for data science
• Security (Apache Sentry for roles, Ranger for configuration, Knox as a firewall)
• YARN to coordinate resources
• Ambari for cluster administration
• Terraform/Chef/Puppet for provisioning
11. Or a cloud-native solution on AWS: Amazon DynamoDB, Amazon Elasticsearch Service, AWS AppSync, Amazon API Gateway, Amazon Cognito, AWS KMS, AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS Snowball, AWS Storage Gateway, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service, Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon QuickSight, Amazon Kinesis, Amazon Neptune, Amazon RDS
12. More data lakes & analytics on AWS than anywhere else
13. Data Lakes, Analytics, and ML Portfolio from AWS: broadest, deepest set of analytic services
• Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
• Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
• On-premises data movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service, AWS Storage Gateway
• Real-time data movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
• Data lake on AWS: storage | archival storage | data catalog
14. Data movement from on-premises datacenters
• AWS Snowball, Snowball Edge, and Snowmobile: petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud
• AWS Direct Connect: establish a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
• AWS Storage Gateway: lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism, bandwidth management, and a local cache
• AWS Database Migration Service: migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications
15. Data movement from real-time sources
• Amazon Kinesis Video Streams: securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
• Amazon Kinesis Data Firehose: capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
• Amazon Kinesis Data Streams: build custom, real-time applications that process data streams using popular stream processing frameworks
• AWS IoT Core: supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
• Amazon Managed Streaming for Kafka: fully managed open-source platform for building real-time streaming data pipelines and applications
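Ingesting into Kinesis from application code is a few lines with boto3 (the AWS SDK for Python, which a later slide mentions). A minimal sketch, assuming a hypothetical stream name `game-telemetry` and a `player_id` field in each event:

```python
import json

def encode_record(event: dict) -> bytes:
    """Serialize one event as compact JSON with a trailing newline, so
    records stay separable if they are later batched into S3 objects."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")

def put_event(event: dict, stream: str = "game-telemetry") -> dict:
    """Push one record into a Kinesis data stream (stream name is made up)."""
    import boto3  # imported lazily so encode_record stays usable offline
    kinesis = boto3.client("kinesis")
    return kinesis.put_record(
        StreamName=stream,
        Data=encode_record(event),
        # records with the same partition key are routed to the same shard
        PartitionKey=str(event.get("player_id", "unknown")),
    )

# usage (requires AWS credentials and an existing stream):
# put_event({"player_id": 42, "action": "login"})
```

The partition key choice matters: keying by player keeps each player's events ordered within a shard while spreading load across shards.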
16. Amazon S3: object storage
• Durability, availability & scalability: built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region
• Security and compliance: three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
• Flexible management: classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
• Query in place: run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by 400%
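"Query in place" with S3 Select looks like this in boto3: you push a SQL expression to S3 and stream back only the matching rows, instead of downloading the whole object. A sketch, with hypothetical bucket and key names:

```python
def collect_records(events) -> bytes:
    """Reassemble the payload chunks from an S3 Select event stream."""
    return b"".join(ev["Records"]["Payload"] for ev in events if "Records" in ev)

def select_csv(bucket: str, key: str, expression: str) -> str:
    """Run an S3 Select query against one CSV object; only the matching
    subset of rows crosses the network, not the whole object."""
    import boto3  # imported lazily so collect_records stays testable offline
    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    return collect_records(resp["Payload"]).decode("utf-8")

# usage (bucket/key/columns are hypothetical):
# select_csv("my-data-lake", "raw/sales.csv",
#            "SELECT s.region, s.amount FROM s3object s WHERE s.region = 'EU'")
```

The response arrives as an event stream, which is why the payload chunks need to be concatenated before decoding.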
17. Amazon Glacier: backup and archive
• Durability, availability & scalability: built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region
• Secure: log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
• Retrieves data in minutes: three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
• Inexpensive: lowest-cost AWS object storage class, allowing you to archive large amounts of data at a very low cost
18. Data preparation accounts for ~80% of the work (chart categories: building training sets; cleaning and organizing data; collecting data sets; mining data for patterns; refining algorithms; other)
19. Use AWS Glue to cleanse, prep, and catalog
AWS Glue Data Catalog, a single view across your data lake:
• Automatically discovers data and stores schema
• Makes data searchable and available for ETL
• Contains table definitions and custom metadata
Use AWS Glue ETL jobs to cleanse, transform, and store processed data:
• Serverless Apache Spark environment
• Use Glue ETL libraries or bring your own code
• Write code in Python or Scala
• Call any AWS API using the AWS boto3 SDK
(Pipeline: Amazon S3 raw data → staging data → processed data, with Glue crawlers feeding the Data Catalog at each stage)
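The cleansing a Glue ETL job does is ordinary per-record logic. Since the `awsglue` library only exists inside the Glue runtime, here is the cleansing step as a plain Python function of the kind you would hand to a Glue transform; the field values are the same troublemakers an earlier slide listed (empty strings, literal "NULL" strings, stray whitespace):

```python
def cleanse(record: dict) -> dict:
    """Normalize the usual suspects before writing processed data:
    trim whitespace, turn empty strings and literal "NULL" into real nulls."""
    out = {}
    for key, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "" or value.upper() == "NULL":
                value = None
        out[key] = value
    return out

# Inside a Glue job, a function like this can be applied record by record,
# e.g. (assumed usage): cleansed = raw_dynamic_frame.map(f=cleanse)
```

Keeping the transform a pure function also makes it trivially unit-testable outside the cluster, which addresses the slow ETL feedback loop the Beauchemin quote complains about.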
20. Data Lakes, Analytics, and ML Portfolio from AWS: broadest, deepest set of analytic services (same portfolio overview as slide 13: machine learning, analytics, on-premises and real-time data movement, and the data lake on AWS)
21. Amazon EMR: big data processing
• Low cost: flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto-scaling to reduce costs 50-80%
• Easy: launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
• Latest versions: updated with the latest open-source frameworks within 30 days of release
• Use S3 storage: process data directly in the S3 data lake securely with high performance using the EMRFS connector
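"Launch in minutes" means a cluster is just an API request. A minimal sketch of a transient Spark cluster via boto3; the release label, instance types, role names, and log bucket are illustrative defaults, not values from the talk:

```python
def emr_cluster_config(name: str, core_nodes: int) -> dict:
    """Request body for a small transient Spark cluster reading from S3."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.23.0",  # assumed: a 2019-era release
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
                 "InstanceCount": core_nodes},
            ],
            # transient cluster: terminate once there are no more steps
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "LogUri": "s3://my-emr-logs/",  # hypothetical bucket
    }

def launch_cluster(name: str = "nightly-etl", core_nodes: int = 2) -> str:
    import boto3  # imported lazily so the config builder stays testable offline
    resp = boto3.client("emr").run_job_flow(**emr_cluster_config(name, core_nodes))
    return resp["JobFlowId"]
```

Because the data lives in S3 (via EMRFS), not in HDFS, the cluster can be terminated after each run and resized per job, which is where the 50-80% cost reduction comes from.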
22. Amazon EMR: more than just managed Hadoop
23. Amazon Redshift: data warehousing
• Fast at scale: columnar storage technology to improve I/O efficiency and scale query performance
• Secure: audit everything; encrypt data end-to-end; extensive certification and compliance
• Open file formats: analyze optimized data formats on the latest SSDs, and all open data formats in Amazon S3
• Inexpensive: as low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
24. Amazon Redshift Spectrum: extend the data warehouse to exabytes of data in the S3 data lake
• Exabyte-scale Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Avro, & Parquet data formats
• Pay only for the amount of data scanned
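In practice, extending the warehouse over S3 means registering an external schema and table, then querying them like any other table. A sketch of the DDL a SQL client would send to Redshift, held here as Python strings; the schema, table, role ARN, bucket, and column names are all made up for illustration:

```python
# Map a Glue Data Catalog database into Redshift as an external schema.
CREATE_EXTERNAL_SCHEMA = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG DATABASE 'salesdb'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# An external table whose data stays in S3 as Parquet files.
CREATE_EXTERNAL_TABLE = """
CREATE EXTERNAL TABLE spectrum.sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(10,2),
    sale_date   DATE
)
STORED AS PARQUET
LOCATION 's3://my-data-lake/sales/';
"""

# One query can then join warehouse-resident and S3-resident data.
JOIN_QUERY = """
SELECT c.region, SUM(s.amount) AS revenue
FROM spectrum.sales s
JOIN customers c ON c.id = s.customer_id
GROUP BY c.region;
"""
```

Since billing is per byte scanned, storing the external data as columnar Parquet (rather than CSV) directly reduces both query latency and cost.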
25. Let’s play a game. Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017. https://youtu.be/RpPf38L0HHU?t=3963
26. Numbers are fun. Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017. https://youtu.be/RpPf38L0HHU?t=3963
27. Numbers are fun. Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017. https://youtu.be/RpPf38L0HHU?t=3963
28. Amazon Athena: interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL. No infrastructure to set up or manage, and no data to load. Ability to run SQL queries on data archived in Amazon Glacier (coming soon).
• Query instantly: zero setup cost; just point to S3 and start querying
• SQL: open ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
• Easy: serverless, zero infrastructure, zero administration; integrated with QuickSight
• Pay per query: pay only for queries run; save 30-90% on per-query costs through compression
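Submitting an Athena query programmatically is a single boto3 call: you hand Athena a SQL string and an S3 location for the results. A sketch, where the table name, timestamp column, and results bucket are placeholders:

```python
def daily_counts_sql(table: str, ts_column: str) -> str:
    """Plain ANSI SQL that Athena runs directly over files in S3;
    table and column names here are placeholders."""
    return (
        f"SELECT date_trunc('day', {ts_column}) AS day, count(*) AS events "
        f"FROM {table} GROUP BY 1 ORDER BY 1"
    )

def run_athena_query(sql: str, output: str = "s3://my-athena-results/") -> str:
    """Submit the query and return its execution id; Athena writes the
    result set to the (hypothetical) S3 output location."""
    import boto3  # imported lazily so the SQL builder stays testable offline
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output},
    )
    return resp["QueryExecutionId"]

# usage (requires AWS credentials and a table in the Glue Data Catalog):
# run_athena_query(daily_counts_sql("telemetry.events", "event_time"))
```

The call is asynchronous: the execution id is used to poll for completion and fetch results, which fits the serverless, pay-per-query model the slide describes.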
29. Amazon QuickSight: easy. Empower everyone; seamless connectivity; fast analysis; serverless. Now with ML superpowers!
30. AWS provides the highest levels of security. Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake.
• Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
• Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC
• Encryption: AWS Certificate Manager, AWS Key Management Service; encryption at rest, encryption in transit; bring your own keys, HSM support
• Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
31. Compliance: virtually every regulatory agency
• Global: CSA (Cloud Security Alliance controls), ISO 9001 (global quality standard), ISO 27001 (security management controls), ISO 27017 (cloud-specific controls), ISO 27018 (personal data protection), PCI DSS Level 1 (payment card standards), SOC 1 (audit controls report), SOC 2 (security, availability, & confidentiality report), SOC 3 (general controls report)
• United States: CJIS (Criminal Justice Information Services), DoD SRG (DoD data processing), FedRAMP (government data standards), FERPA (educational privacy act), FIPS (government security standards), FISMA (federal information security management), GxP (quality guidelines and regulations), FFIEC (financial institutions regulation), HIPAA (protected health information), ITAR (international arms regulations), MPAA (protected media content), NIST (National Institute of Standards and Technology), SEC Rule 17a-4(f) (financial data standards), VPAT/Section 508 (accountability standards)
• Asia Pacific: FISC [Japan] (financial industry information systems), IRAP [Australia] (Australian security standards), K-ISMS [Korea] (Korean information security), MTCS Tier 3 [Singapore] (multi-tier cloud security standard), My Number Act [Japan] (personal information protection)
• Europe: C5 [Germany] (operational security attestation), Cyber Essentials Plus [UK] (cyber threat protection), G-Cloud [UK] (UK government standards), IT-Grundschutz [Germany] (baseline protection methodology)
32. AWS databases and analytics: broad and deep portfolio, built for builders
• Databases: RDS (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server), Aurora (MySQL, PostgreSQL), DynamoDB (key-value, document), ElastiCache (Redis, Memcached), Neptune (graph), Timestream (time series), QLDB (ledger database), RDS on VMware
• Analytics: Amazon Redshift (data warehousing), Amazon EMR (Hadoop + Spark), Athena (interactive analytics), Kinesis Analytics (real-time), Amazon Elasticsearch Service (operational analytics)
• Data lake: S3/Amazon Glacier, AWS Glue (ETL & data catalog), Lake Formation (data lakes)
• Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
• Business intelligence & machine learning: Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Rekognition, Amazon Lex, Amazon Transcribe, AWS DeepLens
• Blockchain: Managed Blockchain, Blockchain Templates
• AWS Marketplace: 250+ solutions, including 730+ database solutions, 600+ analytics solutions, 25+ blockchain solutions, 20+ data lake solutions, 30+ solutions
33. Fortnite | 125+ million players. CHALLENGE: create a constant feedback loop for designers and gain an up-to-the-minute understanding of gamer satisfaction, to guarantee gamers are engaged, resulting in the most popular game played in the world.
34. Epic Games uses data lakes and analytics
• Entire analytics platform running on AWS
• S3 leveraged as a data lake
• All telemetry data is collected with Kinesis
• Real-time analytics done through Spark on EMR, with DynamoDB to create scoreboards and real-time queries
• Amazon EMR used for large batch data processing
• Game designers use data to inform their decisions
(Architecture diagram: game clients, game servers, launcher, and game services feed Kinesis APIs; near-real-time pipelines drive Grafana, scoreboard APIs, limited raw-data ad-hoc SQL, and user ETL via Spark on EMR and DynamoDB; batch pipelines run ETL on EMR over the S3 data lake, with Tableau/BI and ad-hoc SQL on top; databases, S3, and other sources feed in as well)
35. SUMMIT
36. Demo overview: https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/
37. Thank you! Sébastien Stormacq, AWS Tech Evangelist, @sebsto
38. Select AWS Glue customers