AWS Analytics Immersion Day - Build BI System from Scratch

Sungmin Kim

April 22, 2022

Transcript

  1. Data

    “The world’s most valuable resource is no longer oil, but data.”*
    *Copyright: The Economist, 2017, David Parkins
  2. Data → Collect → Store → Process/Analyze → Consume → Answers & Insights

    *Copyright: Fractionating column, 2009 science-resources.co.uk
  3. Data Temperature Spectrum

    Axes: data structure (low to high) against data temperature (hot data, warm data, cold data)
    From hot to cold: request rate high → low; cost/GB high → low; latency low → high; data volume low → high
    Storage types plotted: In-Memory, SQL, NoSQL, Search, Graph, Object Storage, Archive Storage
  4. Simplify Big Data Processing

    Data → Collect → Store → Process/Analyze → Consume → Answers & Insights
    Time to answer (Latency) • Throughput • Cost
  5. Amazon QuickSight: Fast BI Service with Pay-per-Session Pricing and ML Insights for everyone

    Data shown: Relational Databases, Flat Files, and many others (Retail Data, Ops Data, Marketing Data)
  6. AWS Well-Architected Framework

    Set of questions you can use to evaluate how well an architecture is aligned to AWS best practices
    Pillars: Operational excellence, Security, Reliability, Performance efficiency, Cost optimization
  7. Is it Well-Architected?

    Diagram: QuickSight, Amazon RDS
    Operational excellence
    ✓ Scale-up: CPU, Memory, Disk
    ✓ Connection pooling
    ✓ Schema changes: alter table
    ✓ Schema validation
    Reliability
    ✓ High availability: Master-Slave fail-over
    ✓ Data loss
    Performance efficiency
    ✓ Slow SQL queries caused by poor database indexing
    ✓ Number of writes >>> Number of reads
    Cost optimization
    ✓ Management of RDBMS
    ✓ Compute (Instance), Storage cost
    Security
    ✓ Data encryption at rest and in transit
  8. Key considerations for storing data

    • Storage Volume
    • Data Model
    • Data Schema
    • Access Patterns
      ✓ SQL Supported
      ✓ Put/Get (key, value)
      ✓ Range Query
      ✓ Join Query
    • Serverless or Managed Service
    • Current skill set
  9. Business Intelligence System

    Diagram: QuickSight on Amazon RDS. Storage options shown: S3 (Object Storage), Redshift (Data Warehouse), and NoSQL: Elasticsearch Service, DocumentDB (with MongoDB compatibility), DynamoDB
  10. Business Intelligence System

    Diagram: QuickSight on DynamoDB. Storage options shown: S3 (Object Storage), Redshift (Data Warehouse), and NoSQL: Elasticsearch Service, DocumentDB (with MongoDB compatibility), DynamoDB
  11. Business Intelligence System

    Diagram: QuickSight on DocumentDB (with MongoDB compatibility). Storage options shown: S3 (Object Storage), Redshift (Data Warehouse), and NoSQL: Elasticsearch Service, DocumentDB (with MongoDB compatibility), DynamoDB
  12. Business Intelligence System

    Diagram: QuickSight on Elasticsearch Service. Storage options shown: S3 (Object Storage), Redshift (Data Warehouse), and NoSQL: Elasticsearch Service, DocumentDB (with MongoDB compatibility), DynamoDB
  13. Business Intelligence System

    Diagram: QuickSight on Redshift. Storage options shown: S3 (Object Storage), Redshift (Data Warehouse), and NoSQL: Elasticsearch Service, DocumentDB (with MongoDB compatibility), DynamoDB
  14. Business Intelligence System

    Diagram: QuickSight on S3. Storage options shown: S3 (Object Storage), Redshift (Data Warehouse), and NoSQL: Elasticsearch Service, DocumentDB (with MongoDB compatibility), DynamoDB
  15. Storage Engine Comparison

    |                      | S3                     | Redshift                | Elasticsearch           | DocumentDB              | DynamoDB       |
    | Storage volume       | Unlimited              | Limited (~ tens of TB)  | Limited (~ tens of TB)  | Limited (~ tens of TB)  | Unlimited      |
    | Data model           | Object                 | Relational              | Document                | Document                | Key-value      |
    | Data schema          | Schema-free            | Fixed schema            | Schema-free (JSON)      | Schema-free (JSON)      | (Key, value)   |
    | SQL supported        | Query engine dependent | Yes                     | Yes                     | No                      | No             |
    | Put/Get (key, value) | Yes                    | Yes                     | Yes                     | Yes                     | Yes            |
    | Range query          | Query engine dependent | Yes                     | Yes                     | Yes                     | Difficult      |
    | Join query           | Query engine dependent | Yes                     | No                      | Difficult               | Very Difficult |
    | Serverless           | Yes                    | No                      | No                      | Yes                     | Yes            |

    https://db-engines.com/en/system/Amazon+DocumentDB%3BAmazon+DynamoDB%3BAmazon+Redshift%3BElasticsearch
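The comparison on slide 15 can be read as a lookup for shortlisting a store against required access patterns. The sketch below is only an illustration: `ENGINES` and `shortlist` are hypothetical names, the ratings are copied from the slide, and anything other than a plain "Yes" (including "Query engine dependent" and "Difficult") is treated as not natively supported, which is an assumption rather than something the slide states.

```python
# Ratings transcribed from the slide-15 comparison. The dictionary layout and
# helper name are illustrative, not part of any AWS API.
ENGINES = {
    "S3": {"sql": "Query engine dependent", "range_query": "Query engine dependent",
           "join_query": "Query engine dependent", "serverless": "Yes"},
    "Redshift": {"sql": "Yes", "range_query": "Yes",
                 "join_query": "Yes", "serverless": "No"},
    "Elasticsearch": {"sql": "Yes", "range_query": "Yes",
                      "join_query": "No", "serverless": "No"},
    "DocumentDB": {"sql": "No", "range_query": "Yes",
                   "join_query": "Difficult", "serverless": "Yes"},
    "DynamoDB": {"sql": "No", "range_query": "Difficult",
                 "join_query": "Very Difficult", "serverless": "Yes"},
}


def shortlist(**required):
    """Return engines rated exactly 'Yes' for every feature passed as True."""
    wanted = [feature for feature, needed in required.items() if needed]
    return [
        name
        for name, ratings in ENGINES.items()
        if all(ratings[feature] == "Yes" for feature in wanted)
    ]


if __name__ == "__main__":
    print(shortlist(sql=True, join_query=True))
    print(shortlist(range_query=True, serverless=True))
```

Treating "Query engine dependent" as a miss is deliberately strict; pairing S3 with a query engine such as Athena changes the answer, which is the point of the later slides.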
  16. Use Amazon S3 as Your Persistent Data Store

    • Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
    • Decouple storage and compute
      - No need to run compute clusters for storage (unlike HDFS)
      - Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
      - Multiple & heterogeneous analysis clusters and services can use the same data
    • Designed for 99.999999999% durability
    • No need to pay for data replication within a region
    • Secure: SSL, client/server-side encryption at rest
    • Low cost
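To put the 99.999999999% durability figure in perspective, here is a back-of-the-envelope sketch. It assumes the figure can be read as an independent per-object survival probability over one year, and the 10,000,000 object count is a hypothetical workload, not a number from the deck.

```python
from fractions import Fraction

# 99.999999999% (eleven nines) written as an exact fraction to avoid float error.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)
ANNUAL_LOSS_PROBABILITY = 1 - DURABILITY  # 1 in 10^11 per object per year


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the independence assumption."""
    return object_count * ANNUAL_LOSS_PROBABILITY


objects = 10_000_000  # hypothetical number of stored objects
losses = expected_annual_losses(objects)
years_per_lost_object = 1 / losses

if __name__ == "__main__":
    print(f"expected losses per year: {float(losses)}")
    print(f"average years per single lost object: {int(years_per_lost_object)}")
```

Under these assumptions the expected loss rate is one object roughly every 10,000 years for ten million stored objects.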
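  17. Comparison of SQL Processing engines

    Engines named on the slide: AWS Glue, Amazon Athena, Amazon Redshift (the transcript carries four value columns but only three engine names, so the columns are left unlabeled)

    |                | Column 1                      | Column 2                    | Column 3                      | Column 4                    |
    | Data Structure | Semi                          | Semi                        | Semi                          | Full                        |
    | Languages      | API/SQL                       | SQL                         | SQL                           | SQL                         |
    | Data Store     | S3 (Glue), S3/HDFS (Spark)    | S3/HDFS                     | S3                            | Local                       |
    | Use case       | Transformation                | SQL Queries for S3/HDFS     | Serverless SQL Queries for S3 | Fully Featured SQL Database |

    Axis label: Performance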
  18. Data Processing: ETL vs ELT

    ETL (E → T → L): Source1, Source2, Source3 → Extract → Transform → Load → Target
    ELT (E → L → T): Source1, Source2, Source3 → Extract & Load → Staging tables → Transform → Final tables, all inside the Target (MPP database)
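The difference between the two orderings on slide 18 can be sketched with an in-memory SQLite database standing in for the target. This is a minimal illustration, not the deck's implementation: the table names, the sample rows, and the cleaning rules are all hypothetical. In `etl` the transform runs in application code before loading; in `elt` raw rows land in a staging table and the target's own SQL engine produces the final table.

```python
import sqlite3

# Hypothetical raw records: (day, department, amount as text).
RAW_ORDERS = [
    ("2022-04-01", " retail ", "19.50"),
    ("2022-04-01", "RETAIL", "5.50"),
    ("2022-04-02", "ops", "7.00"),
]


def etl(rows):
    """E -> T -> L: clean rows in Python, then load only the final shape."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (day TEXT, dept TEXT, amount REAL)")
    cleaned = [(day, dept.strip().lower(), float(amount)) for day, dept, amount in rows]
    db.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
    return db


def elt(rows):
    """E -> L -> T: load raw rows into staging, then transform with SQL in the target."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE staging_orders (day TEXT, dept TEXT, amount TEXT)")
    db.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)", rows)
    db.execute(
        """
        CREATE TABLE orders AS
        SELECT day, lower(trim(dept)) AS dept, CAST(amount AS REAL) AS amount
        FROM staging_orders
        """
    )
    return db


def totals(db):
    """Revenue per department from the final table."""
    query = "SELECT dept, SUM(amount) FROM orders GROUP BY dept ORDER BY dept"
    return db.execute(query).fetchall()


if __name__ == "__main__":
    print(totals(etl(RAW_ORDERS)))
    print(totals(elt(RAW_ORDERS)))
```

Both paths end with the same final table; what changes is where the transform runs and whether the raw rows remain available in staging for reprocessing.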
  19. How about Real-time Analytics?

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Inputs: Messages, CDC, Event Streams
    Real-time Analysis Layer
    ✓ Streaming data processing
    ✓ Latency: milliseconds ~ seconds
  20. Real-time Analytics

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Candidates shown for message storage: SQS, Kinesis Data Streams, MSK
    Real-time Analysis Layer
    ✓ Streaming data processing
    ✓ Latency: milliseconds ~ seconds
  21. Why Stream Storage?

    • Decouple producers & consumers
    • Persistent buffer
    • Collect multiple streams
    • Preserve client ordering
    • Parallel consumption
    • Streaming MapReduce
    Services: Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka
    Diagram: Producer 1 … Producer n send records 1 to 4 with Key = Red, Green, Blue, or Violet to shard 1 / partition 1 and shard 2 / partition 2. Consumer 1: Count of red = 4, Count of violet = 4. Consumer 2: Count of blue = 4, Count of green = 4.
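The keyed routing in the slide-21 diagram can be sketched in a few lines. This is a simplified model under stated assumptions: Kinesis Data Streams maps the MD5 hash of the partition key onto per-shard hash key ranges, whereas the hypothetical `shard_for` helper below simply takes the hash modulo the shard count, and the shards are plain Python lists rather than a real stream.

```python
import hashlib
from collections import Counter, defaultdict


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard from the key's MD5 hash (modulo is a simplification)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


def route(records, shard_count=2):
    """Append each (key, sequence) record to the shard chosen by its key."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, shard_count)].append((key, seq))
    return shards


KEYS = ("red", "green", "blue", "violet")

# Four producers, one key each, emitting records 1..4 interleaved in time.
records = [(key, seq) for seq in range(1, 5) for key in KEYS]
shards = route(records)

# One consumer per shard counts the records it sees for each key.
counts = {shard: Counter(key for key, _ in recs) for shard, recs in shards.items()}

if __name__ == "__main__":
    for shard, counter in sorted(counts.items()):
        print(shard, dict(counter))
```

Because every record with a given key lands on the same shard, each consumer sees that key's records in the order they were produced, and consumers on different shards can work in parallel.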
  22. What about Amazon SQS?

    • Decouple producers & consumers
    • Persistent buffer
    • Collect multiple streams
    • No client ordering (standard)
      - FIFO queue preserves client ordering
    • No streaming MapReduce
    • No parallel consumption
      - Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
    Diagram: Producers → Amazon SQS Queue (Standard or FIFO) → Consumers; Publisher → Amazon SNS Topic → Subscribers (AWS Lambda function, Amazon SQS queue)
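The SNS fan-out mentioned on slide 22 can be modeled with in-memory stand-ins. The `Topic` class, the queue names, and the callback below are hypothetical and do not use the AWS SDK; the sketch only shows that one published message is delivered to every subscriber, whether that subscriber is a queue or a function.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a pub/sub topic with queue and function subscribers."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, target):
        """Register a deque (acting as a queue) or a callable (acting as a function)."""
        self.subscribers.append(target)

    def publish(self, message):
        """Deliver one message to every subscriber."""
        for target in self.subscribers:
            if isinstance(target, deque):
                target.append(message)
            else:
                target(message)


orders_queue = deque()   # hypothetical queue subscriber
audit_queue = deque()    # second, independent queue subscriber
handled = []             # messages seen by the function subscriber

topic = Topic()
topic.subscribe(orders_queue)
topic.subscribe(audit_queue)
topic.subscribe(handled.append)

for message in ("m1", "m2", "m3"):
    topic.publish(message)

if __name__ == "__main__":
    print(list(orders_queue), list(audit_queue), handled)
```

Each queue then has its own copy of the messages and its own consumer, which is how fan-out works around a single queue delivering each message to only one consumer.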
  23. Real-time Analytics

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Message storage: Kinesis Data Streams
    Real-time Analysis Layer
    ✓ Streaming data processing
    ✓ Latency: milliseconds ~ seconds
  24. Real-time Analytics

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Message storage: Kinesis Data Streams
    Real-time Analysis Layer options: Kinesis Data Analytics, Spark or Flink on EMR, Elasticsearch Service
  25. Amazon EMR

    Layers: Applications, Framework, Process Layer, Data Layer (EMRFS, Amazon S3), Infrastructure (Instances, Spot Instances)
    Dataflow diagrams: RDD → Operator → RDD → … → RDD; Data Source → Operator → … → Operator → Data Sink
  26. Real-time Analytics with Spark or Flink on EMR, RDS

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Real-time path (diagram labels): Kinesis Data Streams, EMR, Amazon RDS, QuickSight
  27. Real-time Analytics with Spark or Flink on EMR, DynamoDB

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Real-time path (diagram labels): Kinesis Data Streams, EMR, DynamoDB, real-time dashboard
  28. Real-time Analytics with Spark or Flink on EMR, ElastiCache

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Real-time path (diagram labels): Kinesis Data Streams, EMR, ElastiCache, real-time dashboard
  29. Real-time Analytics with Kinesis Data Analytics, RDS

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Real-time path (diagram labels): Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Streams, Lambda function, Amazon RDS, QuickSight
  30. Real-time Analytics with Kinesis Data Analytics, DynamoDB

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Real-time path (diagram labels): Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Streams, Lambda function, DynamoDB, real-time dashboard
  31. Real-time Analytics with Kinesis Data Analytics, ElastiCache

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Real-time path (diagram labels): Kinesis Data Streams, Kinesis Data Analytics, Kinesis Data Streams, Lambda function, ElastiCache, real-time dashboard
  32. Real-time Analytics with Elasticsearch Service

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Real-time path (diagram labels): Kinesis Data Streams, Lambda function, Elasticsearch Service, Kibana
  33. 3 ways to build Real-time Analytics

    Starting from Kinesis Data Streams, the three options shown are:
    • Spark or Flink on EMR
    • Kinesis Data Analytics with a Lambda function
    • Lambda function with Elasticsearch Service and Kibana
    Serving stores and dashboards shown: Amazon RDS, DynamoDB, ElastiCache, QuickSight, real-time dashboard
  34. Business Intelligence System

    Batch path: Kinesis Data Firehose → S3 → Athena → QuickSight
    Real-time path (diagram labels): Kinesis Data Streams, Lambda function, Elasticsearch Service, Kibana
  35. Business Intelligence System: Lambda Architecture

    Diagram labels: Kinesis Data Firehose, S3, Athena, QuickSight, Kinesis Data Streams, Lambda function, Elasticsearch Service, Kibana
    Layers: Batch Layer, Speed Layer, Serving Layer
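Slide 35 labels the pipeline with batch, speed, and serving layers. The idea behind that split can be sketched as follows; the `PageViewService` class and its method names are hypothetical, and the sketch assumes each batch recomputation covers every event the speed layer has seen so far, which a real system has to guarantee with timestamps or offsets.

```python
from collections import Counter


class PageViewService:
    """Serving layer that merges a batch view with a speed-layer view."""

    def __init__(self):
        self.batch_view = Counter()   # recomputed periodically from the full dataset
        self.speed_view = Counter()   # incremented per event since the last batch run

    def on_event(self, page: str) -> None:
        """Speed layer: low-latency incremental update for a single event."""
        self.speed_view[page] += 1

    def load_batch(self, master_events) -> None:
        """Batch layer: recompute from all events, then discard the speed view.

        Assumes master_events already contains everything counted in speed_view.
        """
        self.batch_view = Counter(master_events)
        self.speed_view.clear()

    def page_views(self, page: str) -> int:
        """Serving layer query: batch result plus not-yet-batched increments."""
        return self.batch_view[page] + self.speed_view[page]


service = PageViewService()
master_dataset = ["home", "home", "cart"]
service.load_batch(master_dataset)

# A new event arrives between batch runs: visible at once via the speed layer.
master_dataset.append("home")
service.on_event("home")
views_before_rebatch = service.page_views("home")

# The next batch run absorbs the event, so the total does not change.
service.load_batch(master_dataset)
views_after_rebatch = service.page_views("home")
```

Queries stay correct both before and after the batch job catches up, which is what lets the batch path be slow and cheap while the speed path covers recent data.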
  36. Beyond Data Analytics: Expand Business Intelligence

    • Network (Graph) Analysis
      - Social networking
      - Recommendation engines
      - Fraud detection
      - Knowledge Graphs
    • Recommendation
      - Personalized recommendations
      - Personalized search
      - Personalized notifications
    • Machine Learning
      - Forecast (Predictive Analytics)
      - Classification
      - Sentiment Analysis
    • and more …
  37. Beyond Data Analytics: Network Analysis (with Neptune)

    Existing system (diagram labels): Kinesis Data Firehose, S3, Athena, QuickSight, Kinesis Data Streams, Lambda function, Elasticsearch Service, Kibana
    Added (diagram labels): Neptune, Lambda function, API Gateway, Lambda function
  38. Amazon Personalize

    Real-time personalization and recommendation, based on the same technology used at Amazon.com
    • Deliver high-quality recommendations
    • Deliver personalization in days, not months
    • Real-time
    • Works with any product or content
  39. Beyond Data Analytics: Recommendation (with Personalize)

    Existing system (diagram labels): Kinesis Data Firehose, S3, Athena, QuickSight, Kinesis Data Streams, Lambda function, Elasticsearch Service, Kibana
    Added (diagram labels): S3, EMR, Personalize, Event (time-based), Lambda function, ElastiCache, Lambda function, API Gateway
  40. Amazon SageMaker

    Build, Train, and Deploy Machine Learning Models Quickly & Easily, at Scale
    • Build: Notebook instances; built-in, high-performance algorithms
    • Train: One-click training; automatic model tuning
    • Deploy: One-click deployment; fully managed hosting with auto scaling
  41. Beyond Data Analytics: Machine Learning (with SageMaker)

    Existing system (diagram labels): Kinesis Data Firehose, S3, Athena, QuickSight, Kinesis Data Streams, Lambda function, Elasticsearch Service, Kibana
    Added (diagram labels): SageMaker Notebook, Train, Model, Models bucket, SageMaker Endpoint
  42. Data → Collect → Store → Process/Analyze → Consume → Answers & Insights

    Services shown: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Elasticsearch Service, Amazon Machine Learning, AWS Lambda, Amazon QuickSight
  43. Lessons Learned: Architectural Principles

    • Build decoupled systems
      - Data → Store → Process → Store → Analyze → Answers
    • Use the right tool for the job
      - Data structure, latency, throughput, access patterns
    • Leverage managed and serverless services
      - Scalable/elastic, available, reliable, secure, no/low admin
    • Use log-centric design patterns
      - Immutable logs (data lake), materialized views
    • Be cost-conscious
      - Big data ≠ Big cost
    • Working backwards
      - Design from consume to collect
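The "log-centric design" principle (immutable logs plus materialized views) can be illustrated with a small sketch. All names here (`Event`, `EventLog`, `materialize_balances`) and the account example are hypothetical; the point is only that the log is append-only and every view is derived from it and can be rebuilt at any time.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """An immutable fact; frozen=True prevents modification after creation."""
    account: str
    delta: int


class EventLog:
    """Append-only log: events can be added and replayed, never edited."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay(self) -> Tuple[Event, ...]:
        """Return a read-only snapshot of all events in arrival order."""
        return tuple(self._events)


def materialize_balances(log: EventLog) -> Dict[str, int]:
    """Derive a view (balance per account) by folding over the whole log."""
    view: Dict[str, int] = {}
    for event in log.replay():
        view[event.account] = view.get(event.account, 0) + event.delta
    return view


log = EventLog()
log.append(Event("a", 100))
log.append(Event("a", -30))
log.append(Event("b", 5))

balances = materialize_balances(log)
rebuilt = materialize_balances(log)  # views are disposable and reproducible
```

Because the view is a pure function of the log, a new kind of view can be added later by replaying the same events without touching the stored data.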
  44. Learn more

    - Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AWS re:Invent 2018
      https://www.slideshare.net/AmazonWebServices/big-data-analytics-architectural-patterns-and-best-practices-ant201r1-aws-reinvent-2018
    - Everything You Need to Know About Big Data: From Architectural Principles to Best Practices
      https://www.slideshare.net/AmazonWebServices/everything-you-need-to-know-about-big-data-from-architectural-principles-to-best-practices
    - Big Data Analytics Options on AWS
      https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf
    - AWS Big Data Blog
      https://aws.amazon.com/ko/blogs/big-data
    - AWS Well-Architected Labs
      https://wellarchitectedlabs.com