
Analytics 101 - Build BI System from Scratch

Agenda

- Key considerations when designing a data analytics system from scratch
- Choosing the right AWS Analytics Services for your needs
- Scaling a data analysis system from batch data analysis to real-time data analysis
- Introduction to Lambda architecture and Kappa architecture
- Reference Architectures

Sungmin Kim

April 12, 2023

Transcript

  1. Itinerary to Analytics on QuickSight Kinesis Data Firehose S3 Athena

    Kinesis Data Firehose Kinesis Data Streams Lambda function OpenSearch Service Kibana S3 Athena QuickSight
  2. Data

    “The world’s most valuable resource is no longer oil, but data.”* (*Copyright: The Economist, 2017, David Parkins)
  3. Data → Collect → Store → Process/Analyze → Consume → Answers & Insights

    (*Copyright: Fractionating column, 2009 science-resources.co.uk)
  4. Data Temperature Spectrum Structure Hot data Warm data Cold data

    Low High High Request rate Low High Cost / GB Low High Latency Low High Data Volume Low In-Memory SQL NoSQL Search Object Storage Archive Storage Graph
  5. Simplify Big Data Processing

    Data → Collect → Store → Process/Analyze → Consume → Answers & Insights. Time to answer (latency), throughput, cost.
  6. Amazon QuickSight

    Fast BI service with pay-per-session pricing and ML insights for everyone. Data sources: relational databases, flat files, and many others (retail data, ops data, marketing data).
  7. AWS Well-Architected Framework

    A set of questions you can use to evaluate how well an architecture is aligned to AWS best practices. Pillars: operational excellence, security, reliability, performance efficiency, cost optimization.
  8. Is it Well-Architected? (QuickSight, Amazon RDS)

    • Operational excellence
      ✓ Scale-up: CPU, memory, disk
      ✓ Connection pooling
      ✓ Schema changes: alter table
      ✓ Schema validation
    • Reliability
      ✓ High availability: primary-replica fail-over
      ✓ Data loss
    • Performance efficiency
      ✓ Slow SQL queries caused by poor database indexing
      ✓ Number of writes >>> number of reads
    • Cost optimization
      ✓ Management of RDBMS
      ✓ Compute (instance) and storage cost
    • Security
      ✓ Encryption of data at rest and in transit
  9. Key considerations for storing data

    • Storage Volume
    • Data Model
    • Data Schema
    • Access Patterns
      ✓ SQL Supported
      ✓ Put/Get(key, value)
      ✓ Range Query
      ✓ Join Query
    • Serverless or Managed Service
    • Current skill set
  10. Business Intelligence System QuickSight Redshift S3 Amazon RDS DocumentDB (with

    MongoDB compatibility) DynamoDB Object Storage Data Warehouse NoSQL OpenSearch Service
  11. Structural Differences RDBMS (Replica) RDBMS (Primary) Query Engine (1) Storage

    Query Engine (2) Query Engine (3) Storage Interface Scale-Out Scale-Out Primary-Replica Cluster Data Node (1) Data Node (2) Data Node (3) Leader RDBMS (Primary) Scale-Up Primary RDBMS (Replica) Scale-Out Replica
  12. Storage Engine Comparison

    |                      | S3                     | Redshift               | OpenSearch Service     | DocumentDB             | DynamoDB       |
    |----------------------|------------------------|------------------------|------------------------|------------------------|----------------|
    | Storage volume       | Unlimited              | Limited (~ tens of TB) | Limited (~ tens of TB) | Limited (~ tens of TB) | Unlimited      |
    | Data model           | Object                 | Relational             | Document               | Document               | Key-value      |
    | Data schema          | Schema-free            | Fixed schema           | Schema-free (JSON)     | Schema-free (JSON)     | (Key, value)   |
    | SQL supported        | Query engine dependent | Yes                    | Yes                    | No                     | No             |
    | Put/Get (key, value) | Yes                    | Yes                    | Yes                    | Yes                    | Yes            |
    | Range query          | Query engine dependent | Yes                    | Yes                    | Yes                    | Difficult      |
    | Join query           | Query engine dependent | Yes                    | No                     | Difficult              | Very Difficult |
    | Serverless           | Yes                    | Yes/No (1)             | Yes/No (2)             | Yes                    | Yes            |

    Source: https://db-engines.com/en/system/Amazon+DocumentDB%3BAmazon+DynamoDB%3BAmazon+Redshift%3BElasticsearch
    (1) Redshift vs. Redshift Serverless
    (2) OpenSearch Service vs. OpenSearch Serverless
  13. Use Amazon S3 as Your Persistent Data Store

    • Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
    • Decouple storage and compute
    • No need to run compute clusters for storage (unlike HDFS)
    • Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
    • Multiple & heterogeneous analysis clusters and services can use the same data
    • Designed for 99.999999999% durability
    • No need to pay for data replication within a region
    • Secure: SSL, client/server-side encryption at rest
    • Low cost
  14. Amazon Kinesis Data Firehose

    Prepare and reliably load real-time data streams into data lakes, data warehouses, and analytics tools
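  15. Comparison of SQL Processing Engines

    |                | AWS Glue                   | (unlabeled)             | Amazon Athena                 | Amazon Redshift             |
    |----------------|----------------------------|-------------------------|-------------------------------|-----------------------------|
    | Data structure | Semi                       | Semi                    | Semi                          | Full                        |
    | Languages      | API/SQL                    | SQL                     | SQL                           | SQL                         |
    | Data store     | S3 (Glue), S3/HDFS (Spark) | S3/HDFS                 | S3                            | Local                       |
    | Use case       | Transformation             | SQL queries for S3/HDFS | Serverless SQL queries for S3 | Fully featured SQL database |
    | Performance    |                            |                         |                               |                             |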
  16. Data Processing: ETL vs ELT

    ETL (E → T → L): Source1, Source2, Source3 → Extract → Transform → Load → Target
    ELT (E → L → T): Source1, Source2, Source3 → Extract & Load → Staging tables → Transform → Final tables, within the Target (MPP database)
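The difference between the two orderings can be sketched in a few lines. This is a minimal illustration, not anything shown in the deck: it assumes an in-memory SQLite database as a stand-in for the target (a real ELT target would be an MPP database such as Redshift), and the table names, sample rows, and cleaning rules are hypothetical.

```python
import sqlite3

# Hypothetical raw rows from the sources: untrimmed names, amounts as text.
RAW_ROWS = [("  Alice ", "10"), ("BOB", "25"), ("carol", "7")]


def etl(conn, rows):
    """E -> T -> L: transform outside the target, then load only the clean rows."""
    cleaned = [(name.strip().lower(), int(amount)) for name, amount in rows]
    conn.execute("CREATE TABLE final_etl (name TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO final_etl VALUES (?, ?)", cleaned)


def elt(conn, rows):
    """E -> L -> T: load raw rows into a staging table, then transform with SQL."""
    conn.execute("CREATE TABLE staging (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE final_elt AS "
        "SELECT lower(trim(name)) AS name, CAST(amount AS INTEGER) AS amount "
        "FROM staging"
    )


def run():
    """Run both pipelines against one in-memory database and return both results."""
    conn = sqlite3.connect(":memory:")
    etl(conn, RAW_ROWS)
    elt(conn, RAW_ROWS)
    etl_rows = conn.execute("SELECT name, amount FROM final_etl ORDER BY name").fetchall()
    elt_rows = conn.execute("SELECT name, amount FROM final_elt ORDER BY name").fetchall()
    conn.close()
    return etl_rows, elt_rows
```

Both paths end with the same final rows; what changes is where the transform runs. In ELT the raw data stays available in the staging table, so the transform can be rerun inside the target.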
  17. How about Real-time Analytics?

    Diagram labels: Kinesis Data Firehose, S3, Athena, QuickSight, Real-time Analysis Layer, Messages, CDC, Event Streams
    ✓ Streaming data processing
    ✓ Latency: milliseconds ~ seconds
  18. Real-time Analytics

    Diagram labels: Kinesis Data Firehose, S3, Athena, QuickSight, Real-time Analysis Layer, SQS, Kinesis Data Streams, MSK
    ✓ Streaming data processing
    ✓ Latency: milliseconds ~ seconds
  19. Why Stream Storage? (Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, …)

    • Decouple producers & consumers
    • Persistent buffer
    • Collect multiple streams
    • Preserve client ordering
    • Parallel consumption
    • Streaming MapReduce
    Diagram: Producer 1, Producer 2, Producer 3, … Producer n send records with Key = Red, Green, Blue, or Violet to shard 1 / partition 1 and shard 2 / partition 2. Consumer 1: count of red = 4, count of violet = 4. Consumer 2: count of blue = 4, count of green = 4.
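The keyed routing in the slide above can be sketched as follows. Kinesis Data Streams hashes each partition key with MD5 into a 128-bit integer and assigns the record to the shard whose hash key range contains it. The even split of the hash space, the function names, and the sample records are assumptions made for illustration, so which color lands on which shard will not necessarily match the slide.

```python
import hashlib
from collections import Counter, defaultdict


def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index via its 128-bit MD5 hash.

    Assumption: the hash key space is split evenly across the shards.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = 2**128 // num_shards
    return min(hash_value // range_size, num_shards - 1)


def route(records, num_shards):
    """Append each (key, sequence) record to its shard, preserving arrival order."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for_key(key, num_shards)].append((key, seq))
    return shards


def count_per_shard(shards):
    """One consumer per shard: count records by key, independently of other shards."""
    return {
        shard_id: Counter(key for key, _ in recs)
        for shard_id, recs in shards.items()
    }
```

Because a given key always hashes to the same shard, records for that key stay in order, and each consumer can aggregate its own shard in parallel without coordinating with the others.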
  20. What about Amazon SQS?

    • Decouple producers & consumers
    • Persistent buffer
    • Collect multiple streams
    • No client ordering (standard)
    • FIFO queue preserves client ordering
    • No streaming MapReduce
    • No parallel consumption
    • Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
    Diagram: Producers, Amazon SQS queue (Standard, FIFO), Consumers; Publisher, Amazon SNS topic, Subscribers (AWS Lambda function, Amazon SQS queue)
  21. Real-time Analytics

    Diagram labels: Kinesis Data Firehose, S3, Athena, QuickSight, Kinesis Data Streams, Real-time Analysis Layer
    ✓ Streaming data processing
    ✓ Latency: milliseconds ~ seconds
  22. Real-time Analytics Kinesis Data Firehose S3 Athena QuickSight Kinesis Data

    Streams Kinesis Data Analytics on EMR OpenSearch Service Real-time Analysis Layer on EMR
  23. Real-time Analytics with Spark or Flink on EMR, RDS Kinesis

    Data Firehose S3 Athena QuickSight Kinesis Data Streams QuickSight EMR Amazon RDS
  24. Real-time Analytics with Spark or Flink on EMR, DynamoDB Kinesis

    Data Firehose S3 Athena QuickSight Kinesis Data Streams EMR DynamoDB real-time dashboard
  25. Real-time Analytics with Spark or Flink on EMR, ElastiCache Kinesis

    Data Firehose S3 Athena QuickSight Kinesis Data Streams EMR real-time dashboard ElastiCache
  26. Amazon Kinesis Data Analytics The easiest way to transform and

    analyze streaming data in real-time using Apache Flink
  27. Real-time Analytics with Kinesis Data Analytics, RDS Kinesis Data Firehose

    S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics Lambda function QuickSight Amazon RDS Kinesis Data Streams
  28. Real-time Analytics with Kinesis Data Analytics, DynamoDB Kinesis Data Firehose

    S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics Lambda function Kinesis Data Streams real-time dashboard DynamoDB
  29. Real-time Analytics with Kinesis Data Analytics, ElastiCache Kinesis Data Firehose

    S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics Lambda function Kinesis Data Streams real-time dashboard ElastiCache
  30. Real-time Analytics with OpenSearch Service Kinesis Data Firehose S3 Athena

    QuickSight Kinesis Data Streams Lambda function OpenSearch Service Kibana
  31. Kinesis Data Firehose Real-time Analytics with OpenSearch Service Kinesis Data

    Firehose S3 Athena QuickSight Kinesis Data Streams OpenSearch Service Kibana
  32. 3 ways to build Real-time Analytics Kinesis Data Streams Lambda

    function OpenSearch Service Kibana EMR real-time dashboard ElastiCache Kinesis Data Analytics Lambda function QuickSight Amazon RDS Kinesis Data Streams DynamoDB 1 2 3 Kinesis Data Firehose
  33. Business Intelligence System Kinesis Data Firehose S3 Athena QuickSight Kinesis

    Data Streams Lambda function OpenSearch Service Kibana
  34. Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda

    function OpenSearch Service Kibana Batch Layer Serving Layer Speed Layer Business Intelligence System Lambda Architecture
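A minimal sketch of how the three layers of a Lambda architecture relate, assuming a page-view counting workload. The event shape, the cutoff timestamp, and all function names are hypothetical; the slide itself only names the layers and the AWS services.

```python
from collections import Counter

# Hypothetical page-view events; "ts" is an integer timestamp.
EVENTS = [
    {"ts": 1, "page": "/home"},
    {"ts": 2, "page": "/cart"},
    {"ts": 3, "page": "/home"},
    {"ts": 4, "page": "/home"},
]


def batch_view(events, cutoff_ts):
    """Batch layer: recompute counts over all events before the last batch cutoff."""
    return Counter(e["page"] for e in events if e["ts"] < cutoff_ts)


def speed_view(events, cutoff_ts):
    """Speed layer: count only the recent events the batch run has not covered."""
    return Counter(e["page"] for e in events if e["ts"] >= cutoff_ts)


def serve(batch, speed):
    """Serving layer: answer a query by merging the batch and real-time views."""
    return batch + speed
```

The batch view is complete but stale, the speed view is fresh but partial, and a query reads both. Once the next batch run moves the cutoff forward, the speed layer's older results can be discarded.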
  35. Kinesis Data Streams Business Intelligence System Kappa Architecture Amazon Redshift

    / Redshift Serverless Permanent Tables Real-time Materialized View Streaming Table … … QuickSight MSK
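In a Kappa architecture there is no separate batch layer: the stream is the single source of truth, and every view is derived by reading it. The sketch below is an in-memory analogy, assuming a Python list as the append-only log and a running total per key as the materialized view. It is not Redshift's streaming materialized view API, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MaterializedView:
    """A view derived purely from the log; it tracks how far it has read."""

    totals: dict = field(default_factory=dict)
    offset: int = 0

    def refresh(self, log):
        """Apply only the events appended since the last refresh."""
        for event in log[self.offset:]:
            key = event["key"]
            self.totals[key] = self.totals.get(key, 0) + event["amount"]
        self.offset = len(log)
        return self.totals


def rebuild(log):
    """Reprocessing means replaying the whole log into a fresh view."""
    view = MaterializedView()
    view.refresh(log)
    return view
```

Incremental refresh and a full replay give the same result, which is why changing the processing logic only requires replaying the stream rather than maintaining a second batch code path.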
  36. Amazon Athena Performance Tips

    Diagram labels: Data Source, Kinesis Data Streams, Kinesis Data Firehose (buffering limits: 1~15 min OR 1~128 MiB), S3 (tier-0), Athena, S3 (tier-1), Dashboard, time-based event (e.g., 1 hr) trigger

    Amazon Athena pricing = (a) run SQL-based queries + (b) the number of bytes scanned
    ✓ Data compression reduces the number of bytes scanned
    ✓ Columnar data formats (e.g., Parquet, ORC) allow you to scan only a certain set of columns
    ✓ Data partitioning filters the records to be scanned

    Run a CTAS query every hour to merge many small files into a few large files:
    • Many small files: s3://bucket/csv/year=?/month=?/day=?/hour=?/ with 1.csv (10 MiB), 2.csv (9.5 MiB), … 100.csv (11 MiB)
    • A few large files: s3://bucket/parquet/year=?/month=?/day=?/hour=?/ with 1.parquet (100 MiB), 2.parquet (90.5 MiB), … 5.parquet (110 MiB)

    [CTAS Query]
    CREATE TABLE new_table
    WITH (
      external_location='{location}',
      format = 'PARQUET',
      parquet_compression = 'SNAPPY')
    AS SELECT *
    FROM old_table
    WHERE year={year} AND month={month} AND day={day} AND hour={hour}
    WITH DATA
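The hourly job described above needs the CTAS template filled in for one specific hour. The helper below is a sketch under stated assumptions: the function names, the zero-padded partition path, the per-hour table-name suffix, and integer-typed partition columns are all choices made here, not details from the slide. It only builds the SQL string; submitting it (for example through Athena's StartQueryExecution API from a scheduled job) is left out.

```python
from datetime import datetime

# Template taken from the slide's CTAS query; the placeholders are filled per hour.
CTAS_TEMPLATE = """CREATE TABLE {new_table}
WITH (
  external_location = '{location}',
  format = 'PARQUET',
  parquet_compression = 'SNAPPY')
AS SELECT *
FROM {old_table}
WHERE year={year} AND month={month} AND day={day} AND hour={hour}
WITH DATA"""


def partition_path(prefix: str, ts: datetime) -> str:
    """Build a Hive-style partition path (zero-padding is an assumption)."""
    return (
        f"{prefix}/year={ts.year}/month={ts.month:02d}"
        f"/day={ts.day:02d}/hour={ts.hour:02d}/"
    )


def build_ctas_query(old_table: str, new_table_prefix: str,
                     location_prefix: str, ts: datetime) -> str:
    """Fill the template for one hour.

    Assumptions: partition columns are integers, and each run writes to its own
    table name and S3 location so that repeated runs do not collide.
    """
    return CTAS_TEMPLATE.format(
        new_table=f"{new_table_prefix}_{ts:%Y%m%d%H}",
        location=partition_path(location_prefix, ts),
        old_table=old_table,
        year=ts.year,
        month=ts.month,
        day=ts.day,
        hour=ts.hour,
    )
```

For example, `build_ctas_query("old_table", "new_table", "s3://bucket/parquet", datetime(2023, 4, 12, 7))` produces a query that reads only the `year=2023/month=4/day=12/hour=7` partition and writes Snappy-compressed Parquet under the matching tier-1 prefix.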
  37. Web Log Analytics with Amazon Kinesis Data Streams Proxy using Amazon API Gateway

    https://github.com/aws-samples/web-analytics-on-aws
  38. Itinerary to Analytics on Kinesis Data Firehose S3 Athena QuickSight

    Kinesis Data Streams Lambda function OpenSearch Service Kibana Kinesis Data Firehose S3 Athena QuickSight
  39. Data → Collect → Store → Process/Analyze → Consume → Answers & Insights

    Services: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon OpenSearch Service, Amazon Machine Learning, AWS Lambda
  40. Lessons Learned: Architectural Principles

    • Build decoupled systems - Data → Store → Process → Store → Analyze → Answers
    • Use the right tool for the job - Data structure, latency, throughput, access patterns
    • Leverage managed and serverless services - Scalable/elastic, available, reliable, secure, no/low admin
    • Use log-centric design patterns - Immutable logs (data lake), materialized views
    • Be cost-conscious - Big data ≠ Big cost
    • Working backwards - Design from consume to collect
  41. Call to Action

    (1) Building Serverless Business Intelligence System from Scratch
        • https://serverless-bi-system-from-scratch.workshop.aws/
    (2) Dive into Amazon OpenSearch Service
        • https://catalog.us-east-1.prod.workshops.aws/workshops/f0213896-4dd9-494a-89c5-f7886b45ed4a/en-US
    (3) Getting started with Amazon OpenSearch Serverless
        • https://catalog.us-east-1.prod.workshops.aws/workshops/f8d2c175-634d-4c5d-94cb-d83bbc656c6a/en-US
    (4) Amazon Kinesis Data Firehose Immersion Day
        • https://catalog.us-east-1.prod.workshops.aws/workshops/32e6bc9a-5c03-416d-be7c-4d29f40e55c4/en-US
    (5) Real Time Streaming with Amazon Kinesis
        • https://catalog.workshops.aws/real-time-streaming-with-kinesis/en-US
    (6) MSK Immersion Day
        • https://catalog.us-east-1.prod.workshops.aws/workshops/99d483f9-6383-4223-a855-f97e37c607c1/en-US