Analytics 101 - Build BI System from Scratch

Slide 1

Slide 1 text

Analytics 101: Build BI System from Scratch on AWS
Sungmin Kim, Sr. Solutions Architect, AWS

Slide 2

Slide 2 text

Itinerary to Analytics on QuickSight Kinesis Data Firehose S3 Athena Kinesis Data Firehose Kinesis Data Streams Lambda function OpenSearch Service Kibana S3 Athena QuickSight

Slide 3

Slide 3 text

Data: “The world’s most valuable resource is no longer oil, but data.”* (*Copyright: The Economist, 2017, David Parkins)

Slide 4

Slide 4 text

Data → Collect → Store → Process/Analyze → Consume → Answers & Insights (*Copyright: Fractionating column, 2009, science-resources.co.uk)

Slide 5

Slide 5 text

3+1 Vs of Big Data

Slide 6

Slide 6 text

Structured, Unstructured, and Semi-Structured

Slide 7

Slide 7 text

Data Temperature Spectrum: hot, warm, and cold data, each rated from Low to High on structure, request rate, cost/GB, latency, and data volume. Storage types shown: In-Memory, SQL, NoSQL, Search, Graph, Object Storage, Archive Storage.

Slide 8

Slide 8 text

Simplify Big Data Processing: Data → Collect → Store → Process/Analyze → Consume → Answers & Insights. Trade-offs: time to answer (latency), throughput, cost.

Slide 9

Slide 9 text

Data Analytics System

Slide 10

Slide 10 text

Let’s build a Business Intelligence System

Slide 11

Slide 11 text

Business Intelligence System CRM IoT WEB Messages CDC* Event Streams * CDC: Change Data Capture

Slide 12

Slide 12 text

Business Intelligence System QuickSight Amazon RDS

Slide 13

Slide 13 text

Relational Databases Flat Files And Many Others! Retail Data Ops Data Marketing Data Amazon QuickSight Fast BI Service with Pay-per-Session Pricing and ML Insights for everyone

Slide 14

Slide 14 text

Is it Well-Architected? QuickSight Amazon RDS

Slide 15

Slide 15 text

Is it Well-Architected? QuickSight Amazon RDS

Slide 16

Slide 16 text

AWS Well-Architected Framework
• Performance efficiency
• Cost optimization
• Security
• Reliability
• Operational excellence
A set of questions you can use to evaluate how well an architecture is aligned to AWS best practices

Slide 17

Slide 17 text

Is it Well-Architected? QuickSight Amazon RDS
• Operational excellence
- Scale-up: CPU, memory, disk
- Connection pooling
- Schema changes: ALTER TABLE
- Schema validation
• Reliability
- High availability: primary-replica failover
- Data loss
• Performance efficiency
- Slow SQL queries caused by poor database indexing
- Number of writes >>> number of reads
• Cost optimization
- Management of RDBMS
- Compute (instance) and storage cost
• Security
- Encryption of data at rest and in transit

Slide 18

Slide 18 text

Key considerations for storing data
• Storage Volume
• Data Model
• Data Schema
• Access Patterns
- SQL supported
- Put/Get (key, value)
- Range query
- Join query
• Serverless or Managed Service
• Current skill set

Slide 19

Slide 19 text

Business Intelligence System QuickSight Redshift S3 Amazon RDS DocumentDB (with MongoDB compatibility) DynamoDB Object Storage Data Warehouse NoSQL OpenSearch Service

Slide 20

Slide 20 text

Structural Differences RDBMS (Replica) RDBMS (Primary) Query Engine (1) Storage Query Engine (2) Query Engine (3) Storage Interface Scale-Out Scale-Out Primary-Replica Cluster Data Node (1) Data Node (2) Data Node (3) Leader RDBMS (Primary) Scale-Up Primary RDBMS (Replica) Scale-Out Replica

Slide 21

Slide 21 text

Storage Engine Comparison

                     | S3                     | Redshift               | OpenSearch Service     | DocumentDB             | DynamoDB
Storage volume       | Unlimited              | Limited (~ tens of TB) | Limited (~ tens of TB) | Limited (~ tens of TB) | Unlimited
Data model           | Object                 | Relational             | Document               | Document               | Key-value
Data schema          | Schema-free            | Fixed schema           | Schema-free (JSON)     | Schema-free (JSON)     | (Key, value)
SQL supported        | Query engine dependent | Yes                    | Yes                    | No                     | No
Put/Get (key, value) | Yes                    | Yes                    | Yes                    | Yes                    | Yes
Range query          | Query engine dependent | Yes                    | Yes                    | Yes                    | Difficult
Join query           | Query engine dependent | Yes                    | No                     | Difficult              | Very difficult
Serverless           | Yes                    | Yes/No (1)             | Yes/No (2)             | Yes                    | Yes

(1) Redshift vs. Redshift Serverless
(2) OpenSearch Service vs. OpenSearch Serverless
Source: https://db-engines.com/en/system/Amazon+DocumentDB%3BAmazon+DynamoDB%3BAmazon+Redshift%3BElasticsearch

Slide 22

Slide 22 text

Use Amazon S3 as Your Persistent Data Store
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
- No need to run compute clusters for storage (unlike HDFS)
- Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
- Multiple & heterogeneous analysis clusters and services can use the same data
• Designed for 99.999999999% durability
• No need to pay for data replication within a region
• Secure: SSL, client/server-side encryption at rest
• Low cost

Slide 23

Slide 23 text

Business Intelligence System S3 QuickSight Ingestion Query engine

Slide 24

Slide 24 text

Business Intelligence System Kinesis Data Firehose S3 QuickSight

Slide 25

Slide 25 text

Amazon Kinesis Data Firehose: Prepare and reliably load real-time data streams into data lakes, data warehouses, and analytics tools
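A later slide in this deck lists the buffering limits as "1~15min OR 1~128MiB", meaning a batch is delivered when either the size or the time limit is reached first. The sketch below illustrates that size-or-interval rule in plain Python. It is a toy model, not the Firehose implementation: the class name `BufferedDelivery`, the `deliver` callback, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from typing import Callable, List, Optional


class BufferedDelivery:
    """Toy size-or-interval buffer (hypothetical; not the Firehose implementation)."""

    def __init__(self, max_bytes: int, max_seconds: float,
                 deliver: Callable[[List[bytes]], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.deliver = deliver
        self.clock = clock
        self._records: List[bytes] = []
        self._size = 0
        self._started: Optional[float] = None

    def put(self, record: bytes) -> None:
        # The interval is measured from the first record of the current batch.
        if self._started is None:
            self._started = self.clock()
        self._records.append(record)
        self._size += len(record)
        self._flush_if_due()

    def tick(self) -> None:
        """Call periodically so the time limit fires even without new records."""
        self._flush_if_due()

    def _flush_if_due(self) -> None:
        if not self._records:
            return
        too_big = self._size >= self.max_bytes
        too_old = self.clock() - self._started >= self.max_seconds
        if too_big or too_old:
            self.deliver(self._records)
            # Start a fresh batch; the delivered list is left untouched.
            self._records, self._size, self._started = [], 0, None
```

Larger limits produce fewer, larger objects in S3 at the price of higher delivery latency, which is the trade-off behind the "many small files" problem discussed later in the deck.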

Slide 26

Slide 26 text

Business Intelligence System Kinesis Data Firehose S3 QuickSight Glue Athena EMR Batch Interactive

Slide 27

Slide 27 text

Comparison of SQL Processing engines Data Structure Semi Semi Semi Full Languages API/SQL SQL SQL SQL Data Store S3 (Glue), S3/HDFS (Spark) S3/HDFS S3 Local Use case Transformation SQL Queries for S3/HDFS Serverless SQL Queries for S3 Fully Featured SQL Database Performance AWS Glue Amazon Athena Amazon Redshift

Slide 28

Slide 28 text

Business Intelligence System Kinesis Data Firehose S3 Athena QuickSight

Slide 29

Slide 29 text

Data Processing: ETL vs ELT Source1 Source2 Target Source1 Source2 Staging tables Final tables Target (MPP database) Extract Transform Load E → T → L Extract & Load Transform E → L → T Source3 Source3
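The difference between the two orderings can be sketched with an in-memory SQLite database standing in for the target. This is a minimal illustration under stated assumptions: the table names (`sales_etl`, `staging_sales`, `sales_elt`), the sample rows, and the cleaning rules are hypothetical, and SQLite is only a stand-in for an MPP database.

```python
import sqlite3

# Hypothetical raw source rows: (day, city, amount as text).
RAW_ROWS = [
    ("2023-01-01", "  Seoul ", "12.50"),
    ("2023-01-01", "busan", "7.25"),
    ("2023-01-02", "SEOUL", "3.00"),
]


def etl(conn, rows):
    """E -> T -> L: transform in the pipeline, then load only the final table."""
    cleaned = [(day, city.strip().lower(), float(amount))
               for day, city, amount in rows]
    conn.execute("CREATE TABLE sales_etl (day TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?, ?)", cleaned)


def elt(conn, rows):
    """E -> L -> T: load raw rows into a staging table, then transform with SQL."""
    conn.execute("CREATE TABLE staging_sales (day TEXT, city TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT day, lower(trim(city)) AS city, CAST(amount AS REAL) AS amount "
        "FROM staging_sales"
    )


def totals(conn, table):
    """Total amount per city, ordered by city name."""
    query = f"SELECT city, SUM(amount) FROM {table} GROUP BY city ORDER BY city"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
etl(conn, RAW_ROWS)
elt(conn, RAW_ROWS)
```

Both paths end with the same final table; ELT additionally keeps the raw rows in the staging table, so the transformation can be rerun or changed inside the target without re-extracting from the sources.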

Slide 30

Slide 30 text

Real-Time Analytics System

Slide 31

Slide 31 text

How about Real-time Analytics? Kinesis Data Firehose S3 Athena QuickSight Messages CDC Event Streams
Real-time Analysis Layer
- Streaming data processing
- Latency: milliseconds to seconds

Slide 32

Slide 32 text

Real-time Analytics Kinesis Data Firehose S3 Athena QuickSight
Real-time Analysis Layer: SQS, Kinesis Data Streams, MSK
- Streaming data processing
- Latency: milliseconds to seconds

Slide 33

Slide 33 text

Why Stream Storage? (Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka)
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce
[Diagram: Producers 1 to n send records keyed Red, Green, Blue, and Violet (sequence 1 to 4 per key) into shard 1 / partition 1 and shard 2 / partition 2. Consumer 1: count of red = 4, count of violet = 4. Consumer 2: count of blue = 4, count of green = 4.]
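The ordering and parallel-consumption points above follow from routing by key: every record with the same key lands in the same shard, so per-key order is preserved while shards are consumed independently. The sketch below mimics this with an MD5 hash of the partition key, which is the hash Kinesis Data Streams applies to partition keys. Splitting the 128-bit hash space into equal ranges, and the helper names `shard_for`, `route`, and `count_by_key`, are simplifying assumptions for illustration.

```python
import hashlib
from collections import Counter, defaultdict

HASH_SPACE = 2 ** 128  # an MD5 digest read as an integer is below 2**128


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a key to a shard index by splitting the hash space into equal ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def route(records, shard_count):
    """Append each (key, sequence) record to the shard its key hashes to."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, shard_count)].append((key, seq))
    return shards


def count_by_key(shard_records):
    """What one consumer computes for its shard: a count per key."""
    return Counter(key for key, _ in shard_records)


# Four producers' worth of records: sequence 1..4 for each colour key.
records = [(color, n) for n in range(1, 5)
           for color in ("red", "green", "blue", "violet")]
shards = route(records, shard_count=2)
counts = {shard: count_by_key(recs) for shard, recs in shards.items()}
```

Which colours share a shard depends on their hashes, so the split will not necessarily match the diagram, but each key's count of 4 is always produced by exactly one consumer.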

Slide 34

Slide 34 text

What about Amazon SQS?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (standard)
- FIFO queue preserves client ordering
• No streaming MapReduce
• No parallel consumption
- Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
[Diagram: Producers → Amazon SQS Queue (Standard: messages may arrive out of order; FIFO: order preserved) → Consumers. Publisher → Amazon SNS Topic → Subscribers (AWS Lambda function, Amazon SQS queue).]

Slide 35

Slide 35 text

Real-time Analytics Kinesis Data Firehose S3 Athena QuickSight
Real-time Analysis Layer: Kinesis Data Streams
- Streaming data processing
- Latency: milliseconds to seconds

Slide 36

Slide 36 text

Real-time Analytics Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics on EMR OpenSearch Service Real-time Analysis Layer on EMR

Slide 37

Slide 37 text

Open-source Big Data Analytics Tools

Slide 38

Slide 38 text

Amazon EMR Applications Framework Process Layer Data Layer Infrastructure S3 EMRFS Amazon S3 Instances Spot Instances

Slide 39

Slide 39 text

Kinesis Data Analytics Glue EMR Serverless Serverless Fully Managed

Slide 40

Slide 40 text

Real-time Analytics with Spark or Flink on EMR, RDS Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams QuickSight EMR Amazon RDS

Slide 41

Slide 41 text

Real-time Analytics with Spark or Flink on EMR, DynamoDB Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams EMR DynamoDB real-time dashboard

Slide 42

Slide 42 text

Real-time Analytics with Spark or Flink on EMR, ElastiCache Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams EMR real-time dashboard ElastiCache

Slide 43

Slide 43 text

Amazon Kinesis Data Analytics The easiest way to transform and analyze streaming data in real-time using Apache Flink

Slide 44

Slide 44 text

Real-time Analytics with Kinesis Data Analytics, RDS Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics Lambda function QuickSight Amazon RDS Kinesis Data Streams

Slide 45

Slide 45 text

Real-time Analytics with Kinesis Data Analytics, DynamoDB Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics Lambda function Kinesis Data Streams real-time dashboard DynamoDB

Slide 46

Slide 46 text

Real-time Analytics with Kinesis Data Analytics, ElastiCache Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics Lambda function Kinesis Data Streams real-time dashboard ElastiCache

Slide 47

Slide 47 text

Amazon OpenSearch Service Securely unlock real-time search, monitoring, and analysis of business and operational data

Slide 48

Slide 48 text

Real-time Analytics with OpenSearch Service Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda function OpenSearch Service Kibana
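In this path the Lambda function sits between the stream and the search index: it receives a batch of Kinesis records, decodes them, and turns them into an OpenSearch `_bulk` request body. The sketch below shows only that transformation. The index name `events`, the use of the Kinesis sequence number as the document ID, and the assumption that each record carries a JSON object are illustrative choices, and the signed HTTP call to the domain is deliberately left out.

```python
import base64
import json

INDEX_NAME = "events"  # hypothetical index name


def build_bulk_body(event: dict, index: str = INDEX_NAME) -> str:
    """Turn a Kinesis-triggered Lambda event into newline-delimited bulk JSON."""
    lines = []
    for record in event.get("Records", []):
        # Kinesis record payloads arrive base64-encoded in the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        doc = json.loads(payload)
        # Reusing the sequence number as _id makes retries overwrite, not duplicate.
        action = {"index": {"_index": index,
                            "_id": record["kinesis"]["sequenceNumber"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


def handler(event, context):
    body = build_bulk_body(event)
    # Sending `body` to the domain's _bulk endpoint is omitted in this sketch.
    return {"documents": body.count("\n") // 2}
```

Each document contributes two lines (an action line and a source line), which is why the handler reports half the number of newlines.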

Slide 49

Slide 49 text

Kinesis Data Firehose Real-time Analytics with OpenSearch Service Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams OpenSearch Service Kibana

Slide 50

Slide 50 text

3 ways to build Real-time Analytics Kinesis Data Streams Lambda function OpenSearch Service Kibana EMR real-time dashboard ElastiCache Kinesis Data Analytics Lambda function QuickSight Amazon RDS Kinesis Data Streams DynamoDB 1 2 3 Kinesis Data Firehose

Slide 51

Slide 51 text

Business Intelligence System Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda function OpenSearch Service Kibana

Slide 52

Slide 52 text

Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda function OpenSearch Service Kibana Batch Layer Serving Layer Speed Layer Business Intelligence System Lambda Architecture
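The three layers can be illustrated with a toy page-view counter: the batch layer recomputes a view from history up to its last run, the speed layer covers only what arrived since, and the serving layer merges the two to answer queries. This is a conceptual sketch only; the event shape (`ts`, `page`) and the function names are hypothetical and no AWS service is involved.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: recompute counts from the full history up to the cutoff."""
    return Counter(e["page"] for e in events if e["ts"] <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: count only events newer than the last batch run."""
    return Counter(e["page"] for e in events if e["ts"] > cutoff)


def serve(batch, speed):
    """Serving layer: answer queries by merging the batch and real-time views."""
    return batch + speed


# Hypothetical events; the last batch run covered everything up to ts=2.
events = [
    {"ts": 1, "page": "/home"},
    {"ts": 2, "page": "/cart"},
    {"ts": 3, "page": "/home"},
    {"ts": 4, "page": "/home"},
]
merged = serve(batch_view(events, cutoff=2), speed_view(events, cutoff=2))
```

Because the two views partition the events at the cutoff, the merged result equals a count over all events, while only the small recent slice has to be processed with low latency.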

Slide 53

Slide 53 text

Lambda vs Kappa Architecture Lambda Kappa

Slide 54

Slide 54 text

Kinesis Data Streams Business Intelligence System Kappa Architecture Amazon Redshift / Redshift Serverless Permanent Tables Real-time Materialized View Streaming Table … … QuickSight MSK

Slide 55

Slide 55 text

Amazon Athena Performance Tips

Data source buffering limits: 1~15 min OR 1~128 MiB

Amazon Athena pricing = (a) running SQL-based queries + (b) the number of bytes scanned
- Data compression reduces the number of bytes scanned
- Columnar data formats (e.g., Parquet, ORC) let you scan only the columns you need
- Data partitioning filters the records to be scanned

Many small files:
s3://bucket/csv/year=?/month=?/day=?/hour=?/
  1.csv, 10MiB
  2.csv, 9.5MiB
  …
  100.csv, 11MiB

A few large files:
s3://bucket/parquet/year=?/month=?/day=?/hour=?/
  1.parquet, 100MiB
  2.parquet, 90.5MiB
  …
  5.parquet, 110MiB

[CTAS Query]
CREATE TABLE new_table
WITH (
  external_location='{location}',
  format = 'PARQUET',
  parquet_compression = 'SNAPPY')
AS SELECT *
FROM old_table
WHERE year={year} AND month={month} AND day={day} AND hour={hour}
WITH DATA

Run the CTAS query every hour to merge many small files into a few large files.
Components: Kinesis Data Streams, Kinesis Data Firehose, S3 (tier-0), Athena (triggered by a time-based event, e.g., 1hr), S3 (tier-1), Dashboard
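The hourly compaction job needs the CTAS template filled in for one specific hour. A minimal sketch of that step is below; it only renders the query text and does not submit it to Athena. The function name `hourly_ctas`, the per-hour output table name, the unpadded partition values in the S3 path, and the assumption that the partition columns are integers are all illustrative choices, not part of the original slide.

```python
from datetime import datetime

# Same shape as the CTAS query on the slide, with named placeholders.
CTAS_TEMPLATE = """CREATE TABLE {new_table}
WITH (
  external_location = '{location}',
  format = 'PARQUET',
  parquet_compression = 'SNAPPY')
AS SELECT *
FROM {old_table}
WHERE year = {year} AND month = {month} AND day = {day} AND hour = {hour}
WITH DATA"""


def hourly_ctas(ts: datetime, old_table: str, bucket: str) -> str:
    """Render the CTAS query that compacts one hour of data into Parquet."""
    # Hypothetical naming scheme: one output table and one S3 prefix per hour.
    suffix = ts.strftime("%Y%m%d_%H")
    location = (f"s3://{bucket}/parquet/year={ts.year}/month={ts.month}/"
                f"day={ts.day}/hour={ts.hour}/")
    return CTAS_TEMPLATE.format(
        new_table=f"{old_table}_{suffix}",
        location=location,
        old_table=old_table,
        year=ts.year, month=ts.month, day=ts.day, hour=ts.hour,
    )


query = hourly_ctas(datetime(2023, 5, 17, 9), "old_table", "bucket")
```

A scheduler firing once per hour could render the query for the previous hour and hand it to Athena. Using a distinct table name and location per run matters because a CTAS statement creates a new table rather than appending to an existing one.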

Slide 56

Slide 56 text

Reference Architectures

Slide 57

Slide 57 text

Streaming Data Pipeline with Amazon DMS https://github.com/aws-samples/aws-dms-cdc-data-pipeline

Slide 58

Slide 58 text

Web Log Analytics with Amazon Kinesis Data Streams Proxy using Amazon API Gateway https://github.com/aws-samples/web-analytics-on-aws

Slide 59

Slide 59 text

SaaS Metering system on AWS https://github.com/aws-samples/saas-metering-system-on-aws

Slide 60

Slide 60 text

Summary

Slide 61

Slide 61 text

Itinerary to Analytics on Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda function OpenSearch Service Kibana Kinesis Data Firehose S3 Athena QuickSight

Slide 62

Slide 62 text

Data → Collect → Store → Process/Analyze → Consume → Answers & Insights
Services: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Apache Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon OpenSearch Service, Amazon Machine Learning, AWS Lambda

Slide 63

Slide 63 text

Lessons Learned: Architectural Principles
• Build decoupled systems
- Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job
- Data structure, latency, throughput, access patterns
• Leverage managed and serverless services
- Scalable/elastic, available, reliable, secure, no/low admin
• Use log-centric design patterns
- Immutable logs (data lake), materialized views
• Be cost-conscious
- Big data ≠ Big cost
• Work backwards
- Design from consume to collect

Slide 64

Slide 64 text

Call to Action
(1) Building Serverless Business Intelligence System from Scratch
• https://serverless-bi-system-from-scratch.workshop.aws/
(2) Dive into Amazon OpenSearch Service
• https://catalog.us-east-1.prod.workshops.aws/workshops/f0213896-4dd9-494a-89c5-f7886b45ed4a/en-US
(3) Getting started with Amazon OpenSearch Serverless
• https://catalog.us-east-1.prod.workshops.aws/workshops/f8d2c175-634d-4c5d-94cb-d83bbc656c6a/en-US
(4) Amazon Kinesis Data Firehose Immersion Day
• https://catalog.us-east-1.prod.workshops.aws/workshops/32e6bc9a-5c03-416d-be7c-4d29f40e55c4/en-US
(5) Real Time Streaming with Amazon Kinesis
• https://catalog.workshops.aws/real-time-streaming-with-kinesis/en-US
(6) MSK Immersion Day
• https://catalog.us-east-1.prod.workshops.aws/workshops/99d483f9-6383-4223-a855-f97e37c607c1/en-US