Slide 1

Slide 1 text

Sungmin Kim, Solutions Architect, AWS. Analytics Immersion Day: Build BI System from Scratch

Slide 2

Slide 2 text

Itinerary to Analytics on

Slide 3

Slide 3 text

What is Architecting? Art of Trade-off

Slide 4

Slide 4 text

Data: “The world’s most valuable resource is no longer oil, but data.”* *Copyright: The Economist, 2017, David Parkins

Slide 5

Slide 5 text

Data → Collect → Store → Process/Analyze → Consume → Answers & Insights
*Copyright: Fractionating column, 2009 science-resources.co.uk

Slide 6

Slide 6 text

3+1 Vs of Big Data

Slide 7

Slide 7 text

Structured, Unstructured, and Semi-Structured

Slide 8

Slide 8 text

Data Temperature Spectrum (Hot data, Warm data, Cold data)
Axes: Structure, Request rate, Cost / GB, Latency, Data Volume (each ranging between Low and High)
Storage types: In-Memory, SQL, NoSQL, Search, Object Storage, Archive Storage, Graph

Slide 9

Slide 9 text

Simplify Big Data Processing
Data → Collect → Store → Process/Analyze → Consume → Answers & Insights
Time to answer (Latency), Throughput, Cost

Slide 10

Slide 10 text

Let’s build a Business Intelligence System

Slide 11

Slide 11 text

Business Intelligence System CRM IoT WEB Messages CDC* Event Streams * CDC: Change Data Capture

Slide 12

Slide 12 text

Business Intelligence System QuickSight Amazon RDS

Slide 13

Slide 13 text

Relational Databases Flat Files And Many Others! Retail Data Ops Data Marketing Data Amazon QuickSight Fast BI Service with Pay-per-Session Pricing and ML Insights for everyone

Slide 14

Slide 14 text

Is it Well-Architected? QuickSight Amazon RDS

Slide 15

Slide 15 text

Is it Well-Architected? QuickSight Amazon RDS

Slide 16

Slide 16 text

AWS Well-Architected Framework
A set of questions you can use to evaluate how well an architecture is aligned to AWS best practices
Pillars: Operational excellence, Security, Reliability, Performance efficiency, Cost optimization

Slide 17

Slide 17 text

Is it Well-Architected? QuickSight Amazon RDS
Operational excellence
✓ Scale-up: CPU, Memory, Disk
✓ Connection Pooling
✓ Schema changes: alter table
✓ Schema validation
Reliability
✓ High availability: Master-Slave fail-over
✓ Data loss
Performance efficiency
✓ Slow SQL queries caused by poor database indexing
✓ Number of writes >>> Number of reads
Cost optimization
✓ Management of RDBMS
✓ Compute (Instance), Storage cost
Security
✓ Encryption of data at rest and in transit
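
The "slow SQL queries caused by poor database indexing" item can be illustrated with a minimal sketch. It uses SQLite from the Python standard library as a stand-in for Amazon RDS, and the table and index names (`orders`, `idx_orders_customer`) are hypothetical, chosen only for this example. It compares the query plan before and after an index is added.

```python
import sqlite3

# SQLite stands in for a relational database here; names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(connection, sql, params=()):
    """Return the human-readable steps of SQLite's plan for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the plan description.
    return " | ".join(row[-1] for row in rows)


sql = "SELECT id, amount FROM orders WHERE customer_id = ?"

plan_without_index = query_plan(conn, sql, (42,))
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_with_index = query_plan(conn, sql, (42,))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

Without the index the plan is a full table scan; with it, the engine searches the index instead. Indexes also add work to every write, which matters for the "number of writes >>> number of reads" case listed on the slide.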

Slide 18

Slide 18 text

Key considerations for storing data
• Storage Volume
• Data Model
• Data Schema
• Access Patterns
  ✓ SQL Supported
  ✓ Put/Get(key, value)
  ✓ Range Query
  ✓ Join Query
• Serverless or Managed Service
• Current skill set
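
The access patterns listed above differ in what they demand from a storage engine. The following is a rough sketch using only in-memory Python structures and SQLite; the keys, tables, and values are made up for illustration and do not correspond to any AWS API.

```python
import bisect
import sqlite3

# 1) Put/Get(key, value): a dict stands in for a key-value store.
kv = {}
kv["user#1"] = {"name": "Ann"}          # put
got = kv["user#1"]                      # get

# 2) Range query: keys must be kept in sorted order to slice a range cheaply.
timestamps = [100, 105, 110, 120, 130]  # already sorted
lo = bisect.bisect_left(timestamps, 105)
hi = bisect.bisect_right(timestamps, 120)
in_range = timestamps[lo:hi]            # all keys between 105 and 120 inclusive

# 3) Join query: needs an engine that can combine two data sets.
db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 20.0), (12, 2, 5.0);
    """
)
totals = db.execute(
    """
    SELECT u.name, SUM(o.amount)
    FROM users u JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
    ORDER BY u.name
    """
).fetchall()

print(got, in_range, totals)
```

A store that only offers Put/Get needs extra application code for the second and third patterns, which is one reason access patterns appear among the key considerations.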

Slide 19

Slide 19 text

Business Intelligence System QuickSight Redshift S3 Amazon RDS DocumentDB (with MongoDB compatibility) DynamoDB Object Storage Data Warehouse NoSQL Elasticsearch Service

Slide 20

Slide 20 text

Business Intelligence System QuickSight Redshift S3 Elasticsearch Service DocumentDB (with MongoDB compatibility) DynamoDB Object Storage Data Warehouse NoSQL DynamoDB

Slide 21

Slide 21 text

Business Intelligence System QuickSight DocumentDB (with MongoDB compatibility) Redshift S3 Elasticsearch Service DocumentDB (with MongoDB compatibility) DynamoDB Object Storage Data Warehouse NoSQL

Slide 22

Slide 22 text

Business Intelligence System QuickSight Elasticsearch Service Redshift S3 Elasticsearch Service DocumentDB (with MongoDB compatibility) DynamoDB Object Storage Data Warehouse NoSQL

Slide 23

Slide 23 text

Business Intelligence System QuickSight Redshift Redshift S3 Elasticsearch Service DocumentDB (with MongoDB compatibility) DynamoDB Object Storage Data Warehouse NoSQL

Slide 24

Slide 24 text

Business Intelligence System QuickSight S3 Redshift S3 Elasticsearch Service DocumentDB (with MongoDB compatibility) DynamoDB Object Storage Data Warehouse NoSQL

Slide 25

Slide 25 text

Storage Engine Comparison
                     | S3                     | Redshift               | Elasticsearch          | DocumentDB             | DynamoDB
Storage volume       | Unlimited              | Limited (~ tens of TB) | Limited (~ tens of TB) | Limited (~ tens of TB) | Unlimited
Data model           | Object                 | Relational             | Document               | Document               | Key-value
Data schema          | Schema-free            | Fixed schema           | Schema-free (JSON)     | Schema-free (JSON)     | (Key, value)
SQL supported        | Query engine dependent | Yes                    | Yes                    | No                     | No
Put/Get (key, value) | Yes                    | Yes                    | Yes                    | Yes                    | Yes
Range query          | Query engine dependent | Yes                    | Yes                    | Yes                    | Difficult
Join query           | Query engine dependent | Yes                    | No                     | Difficult              | Very Difficult
Serverless           | Yes                    | No                     | No                     | Yes                    | Yes
• https://db-engines.com/en/system/Amazon+DocumentDB%3BAmazon+DynamoDB%3BAmazon+Redshift%3BElasticsearch

Slide 26

Slide 26 text

Use Amazon S3 as Your Persistent Data Store
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
  • No need to run compute clusters for storage (unlike HDFS)
  • Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
  • Multiple & heterogeneous analysis clusters and services can use the same data
• Designed for 99.999999999% durability
• No need to pay for data replication within a region
• Secure: SSL, client/server-side encryption at rest
• Low cost
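
To put the 99.999999999% durability figure in perspective, here is a small worked calculation. It assumes the figure is read as a per-object, per-year design target, and the object count is a hypothetical example value.

```python
from fractions import Fraction

# 99.999999999% ("eleven nines"), written as an exact fraction to avoid float error.
durability = Fraction(99_999_999_999, 100_000_000_000)
annual_loss_probability = 1 - durability          # chance of losing one object in a year

stored_objects = 10_000_000                       # hypothetical number of objects
expected_losses_per_year = stored_objects * annual_loss_probability
years_per_expected_loss = 1 / expected_losses_per_year

print(f"annual loss probability per object: {float(annual_loss_probability):.0e}")
print(f"expected objects lost per year:     {float(expected_losses_per_year)}")
print(f"years per expected single loss:     {int(years_per_expected_loss)}")
```

Under these assumptions, storing ten million objects corresponds to an expected loss of one object roughly every 10,000 years.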

Slide 27

Slide 27 text

Business Intelligence System S3 QuickSight Ingestion Query engine

Slide 28

Slide 28 text

Business Intelligence System Kinesis Data Firehose S3 QuickSight

Slide 29

Slide 29 text

Amazon Kinesis Data Firehose Prepare and load real-time data streams into data stores and analytics tools
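
A delivery stream of this kind typically buffers incoming records and writes them to the destination in batches once a size or time threshold is reached. The sketch below is a toy in-memory model of that idea, not the Firehose API: the class name `BufferedDelivery`, its parameters, and the thresholds are all hypothetical, and the time check runs only when a record arrives rather than on a background timer.

```python
import time


class BufferedDelivery:
    """Toy delivery stream: buffer records, hand them to a sink in batches."""

    def __init__(self, sink, max_bytes=1024, max_seconds=60.0, clock=time.monotonic):
        self.sink = sink                # callable that receives a list of byte strings
        self.max_bytes = max_bytes      # flush when the buffered size reaches this
        self.max_seconds = max_seconds  # or when the oldest record is this old
        self.clock = clock
        self._buffer = []
        self._size = 0
        self._opened_at = None

    def put_record(self, data: bytes) -> None:
        if self._opened_at is None:
            self._opened_at = self.clock()
        self._buffer.append(data)
        self._size += len(data)
        too_big = self._size >= self.max_bytes
        too_old = self.clock() - self._opened_at >= self.max_seconds
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Deliver whatever is buffered and start a new, empty batch."""
        if self._buffer:
            self.sink(self._buffer)
        self._buffer, self._size, self._opened_at = [], 0, None


batches = []                            # stands in for objects written to S3
stream = BufferedDelivery(sink=batches.append, max_bytes=10)
for payload in [b"aaaa", b"bbbb", b"cccc", b"dddd"]:
    stream.put_record(payload)
stream.flush()                          # deliver the final partial batch
print(batches)
```

Batching trades a little latency for fewer, larger objects in the store, which tends to suit query engines that read from S3.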

Slide 30

Slide 30 text

Business Intelligence System Kinesis Data Firehose S3 QuickSight Glue Athena EMR Batch Interactive

Slide 31

Slide 31 text

Comparison of SQL Processing engines
               | AWS Glue                   | (unlabeled)             | Amazon Athena                 | Amazon Redshift
Data Structure | Semi                       | Semi                    | Semi                          | Full
Languages      | API/SQL                    | SQL                     | SQL                           | SQL
Data Store     | S3 (Glue), S3/HDFS (Spark) | S3/HDFS                 | S3                            | Local
Use case       | Transformation             | SQL Queries for S3/HDFS | Serverless SQL Queries for S3 | Fully Featured SQL Database
Axis label: Performance

Slide 32

Slide 32 text

Business Intelligence System Kinesis Data Firehose S3 Athena QuickSight

Slide 33

Slide 33 text

Data Processing: ETL vs ELT
ETL (E → T → L): Source1, Source2, Source3 → Extract → Transform → Load → Target
ELT (E → L → T): Source1, Source2, Source3 → Extract & Load → Staging tables → Transform → Final tables, inside the Target (MPP database)
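
The difference between the two flows is where the transform step runs. The sketch below shows both on the same made-up rows, using SQLite as a stand-in for the target database (it is not an MPP database); the table names and the cleaning rules are assumptions for illustration only.

```python
import sqlite3

# Hypothetical raw source rows: (day, city, amount as text).
raw_rows = [("2024-01-01", " Seoul ", "12.50"), ("2024-01-01", "busan", "7.00")]


def clean(row):
    """Transform step used by the ETL path: trim, capitalize, cast."""
    day, city, amount = row
    return day, city.strip().title(), float(amount)


# ETL: transform in the pipeline, then load only the finished rows.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE sales (day TEXT, city TEXT, amount REAL)")
etl_db.executemany("INSERT INTO sales VALUES (?, ?, ?)", [clean(r) for r in raw_rows])

# ELT: load raw rows into a staging table, then transform inside the target with SQL.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE staging_sales (day TEXT, city TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO staging_sales VALUES (?, ?, ?)", raw_rows)
elt_db.execute(
    """
    CREATE TABLE sales AS
    SELECT day,
           UPPER(SUBSTR(TRIM(city), 1, 1)) || LOWER(SUBSTR(TRIM(city), 2)) AS city,
           CAST(amount AS REAL) AS amount
    FROM staging_sales
    """
)

query = "SELECT day, city, amount FROM sales ORDER BY city"
etl_result = etl_db.execute(query).fetchall()
elt_result = elt_db.execute(query).fetchall()
print(etl_result)
print(elt_result)
```

Both paths end with the same final table. ELT keeps the raw rows in staging, so a transform can be rerun or changed later without going back to the sources.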

Slide 34

Slide 34 text

How about Real-time Analytics? Kinesis Data Firehose S3 Athena QuickSight Real-time Analysis Layer ✓ Streaming data processing ✓ Latency: milliseconds ~ seconds Messages CDC Event Streams

Slide 35

Slide 35 text

Real-time Analytics Kinesis Data Firehose S3 Athena QuickSight Real-time Analysis Layer SQS Kinesis Data Streams MSK ✓ Streaming data processing ✓ Latency: milliseconds ~ seconds

Slide 36

Slide 36 text

Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce
Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka
[Figure: Producers 1, 2, 3, … n write records keyed Red, Green, Blue, and Violet (sequence 1 to 4 for each key) into shard 1 / partition 1 and shard 2 / partition 2. Consumer 1: count of red = 4, count of violet = 4. Consumer 2: count of blue = 4, count of green = 4.]
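
The ordering and parallel-consumption points follow from routing each record by its key. Below is a minimal sketch of that mechanism, assuming a stable hash of the partition key taken modulo the number of shards; real services map keys to shards or partitions in their own way, and the function `shard_for` is hypothetical.

```python
import hashlib
from collections import Counter, defaultdict

NUM_SHARDS = 2  # two shards / partitions, as in the figure


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard with a stable hash (simplified)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards


# Records arrive interleaved: sequence 1 for every key, then sequence 2, and so on.
records = [(key, seq) for seq in range(1, 5) for key in ("red", "green", "blue", "violet")]

shards = defaultdict(list)
for key, seq in records:
    shards[shard_for(key)].append((key, seq))   # same key always lands on the same shard

# One consumer per shard can count its keys independently (the "map" side).
counts_per_shard = {shard: Counter(key for key, _ in recs) for shard, recs in shards.items()}

# Combining the per-shard results gives the global counts (the "reduce" side).
total_counts = sum(counts_per_shard.values(), Counter())
print(counts_per_shard)
print(total_counts)
```

Because a key never spans shards, each consumer sees that key's records in the order they were written, and consumers can work in parallel without coordinating.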

Slide 37

Slide 37 text

What about Amazon SQS?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (standard)
  • FIFO queue preserves client ordering
• No streaming MapReduce
• No parallel consumption
  • Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
[Figure: Producers → Amazon SQS queue (Standard or FIFO) → Consumers. Publisher → Amazon SNS topic → Subscribers (AWS Lambda function, Amazon SQS queue).]
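
The last bullet describes fan-out: one published message is delivered to several subscriber queues, each of which is then consumed independently. A toy in-memory sketch of that pattern follows; the `Topic` class and queue names are hypothetical and this is not the SNS or SQS API.

```python
from collections import deque


class Topic:
    """Toy publish/subscribe topic: every subscribed queue receives each message."""

    def __init__(self):
        self._queues = []

    def subscribe(self, queue: deque) -> None:
        self._queues.append(queue)

    def publish(self, message) -> None:
        for queue in self._queues:
            queue.append(message)


analytics_queue, audit_queue = deque(), deque()
topic = Topic()
topic.subscribe(analytics_queue)
topic.subscribe(audit_queue)

for n in (1, 2, 3):
    topic.publish({"id": n})

# Consuming from one queue removes the message from that queue only.
first = analytics_queue.popleft()
print(first, len(analytics_queue), len(audit_queue))
```

Within a single queue, a message taken by one consumer is gone for the others, which is why fan-out to multiple queues is needed when several applications must see the same data.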

Slide 38

Slide 38 text

Real-time Analytics Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Real-time Analysis Layer ✓ Streaming data processing ✓ Latency: milliseconds ~ seconds

Slide 39

Slide 39 text

Real-time Analytics Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics on EMR Elasticsearch Service Real-time Analysis Layer on EMR

Slide 40

Slide 40 text

RDD Operator RDD RDD … Operator Operator Data Sink Operator Operator Data Source Amazon EMR Applications Framework Process Layer Data Layer Infrastructure S3 EMRFS Amazon S3 Instances Spot Instances

Slide 41

Slide 41 text

Real-time Analytics with Spark or Flink on EMR, RDS Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams QuickSight EMR Amazon RDS

Slide 42

Slide 42 text

Real-time Analytics with Spark or Flink on EMR, DynamoDB Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams EMR DynamoDB real-time dashboard

Slide 43

Slide 43 text

Real-time Analytics with Spark or Flink on EMR, ElastiCache Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams EMR real-time dashboard ElastiCache

Slide 44

Slide 44 text

Amazon Kinesis Data Analytics Easily process and analyze streaming data with standard SQL
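
A common streaming query groups events into fixed time windows and aggregates each window. The sketch below shows the idea of a tumbling-window count in plain Python; it is an illustration of the concept rather than Kinesis Data Analytics SQL, and the event tuples and the function name are made up.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) for fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical (timestamp in seconds, event type) pairs.
events = [(0, "view"), (3, "click"), (9, "view"), (10, "view"), (14, "click"), (21, "view")]
window_counts = tumbling_window_counts(events, window_seconds=10)

for (window_start, key), count in sorted(window_counts.items()):
    print(window_start, key, count)
```

Each event belongs to exactly one window, so the result behaves like a GROUP BY over the window start time and the event key.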

Slide 45

Slide 45 text

Real-time Analytics with Kinesis Data Analytics, RDS Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics Lambda function QuickSight Amazon RDS Kinesis Data Streams

Slide 46

Slide 46 text

Real-time Analytics with Kinesis Data Analytics, DynamoDB Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics Lambda function Kinesis Data Streams real-time dashboard DynamoDB

Slide 47

Slide 47 text

Real-time Analytics with Kinesis Data Analytics, ElastiCache Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Kinesis Data Analytics Lambda function Kinesis Data Streams real-time dashboard ElastiCache

Slide 48

Slide 48 text

Amazon Elasticsearch Service Fully managed, scalable, and secure Elasticsearch service

Slide 49

Slide 49 text

Real-time Analytics with Elasticsearch Service Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda function Elasticsearch Service Kibana
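
In this pipeline the Lambda function sits between the stream and the search cluster. The sketch below shows one way such a function could decode a batch of Kinesis records and build an Elasticsearch `_bulk` request body. It assumes each record payload is a JSON document, uses a hypothetical index name (`events`), and leaves out the HTTP call and request signing entirely.

```python
import base64
import json

INDEX_NAME = "events"  # hypothetical index name


def to_bulk_body(event, index_name=INDEX_NAME):
    """Turn a Kinesis-triggered Lambda event into an NDJSON _bulk body."""
    lines = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Kinesis delivers each payload base64-encoded; JSON content is assumed here.
        doc = json.loads(base64.b64decode(kinesis["data"]))
        # Using the sequence number as the document id makes retries idempotent.
        action = {"index": {"_index": index_name, "_id": kinesis["sequenceNumber"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format is newline-delimited and every line ends with a newline.
    return "".join(line + "\n" for line in lines)


def handler(event, context):
    body = to_bulk_body(event)
    # Sending `body` to the domain's /_bulk endpoint is intentionally omitted.
    return {"documents": len(event["Records"]), "bulk_body": body}


def fake_record(sequence_number, doc):
    """Build a minimal test record shaped like a Kinesis event record."""
    data = base64.b64encode(json.dumps(doc).encode("utf-8")).decode("ascii")
    return {"kinesis": {"sequenceNumber": sequence_number, "data": data}}


sample_event = {
    "Records": [fake_record("1", {"page": "/home"}), fake_record("2", {"page": "/cart"})]
}
result = handler(sample_event, context=None)
print(result["bulk_body"], end="")
```

Indexing a whole batch in one bulk request keeps the number of calls to the cluster low when the stream is busy.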

Slide 50

Slide 50 text

3 ways to build Real-time Analytics Kinesis Data Streams Lambda function Elasticsearch Service Kibana EMR real-time dashboard ElastiCache Kinesis Data Analytics Lambda function QuickSight Amazon RDS Kinesis Data Streams DynamoDB 1 2 3

Slide 51

Slide 51 text

Business Intelligence System Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda function Elasticsearch Service Kibana

Slide 52

Slide 52 text

Business Intelligence System: Lambda Architecture
Layers: Batch Layer, Serving Layer, Speed Layer
Services: Kinesis Data Firehose, S3, Athena, QuickSight, Kinesis Data Streams, Lambda function, Elasticsearch Service, Kibana
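
In a Lambda architecture, the batch layer periodically recomputes views from the full data set, the speed layer covers events that arrived since the last batch run, and the serving layer answers queries from both. A minimal sketch of that merge, with made-up page-view events and hypothetical function names:

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from the full, immutable data set."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: counts for events not yet covered by the last batch run."""
    return Counter(event["page"] for event in recent_events)


def serve(batch, speed):
    """Serving layer: answer queries by merging the two views."""
    return dict(batch + speed)


master = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
recent = [{"page": "/home"}, {"page": "/checkout"}]

merged = serve(batch_view(master), speed_view(recent))
print(merged)
```

Once a batch run has absorbed the recent events, the speed view for that period can be discarded, which keeps the real-time state small.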

Slide 53

Slide 53 text

Lambda vs Kappa Architecture Lambda Kappa

Slide 54

Slide 54 text

Beyond Data Analytics: Expand Business Intelligence
• Network (Graph) Analysis
  • Social networking
  • Recommendation engines
  • Fraud detection
  • Knowledge Graphs
• Recommendation
  • Personalized recommendations
  • Personalized search
  • Personalized notifications
• Machine Learning
  • Forecast (Predictive Analytics)
  • Classification
  • Sentiment Analysis
• and more …

Slide 55

Slide 55 text

Amazon Neptune Fully managed graph database Social Networking Recommendation Engines Fraud Detection Knowledge Graphs

Slide 56

Slide 56 text

Beyond Data Analytics – Network Analysis(with Neptune) Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda function Elasticsearch Service Kibana Neptune Lambda function API Gateway Lambda function

Slide 57

Slide 57 text

Deliver high-quality recommendations Deliver personalization in days, not months Real-time Works with any product or content Amazon Personalize Real-time personalization and recommendation, based on the same technology used at Amazon.com

Slide 58

Slide 58 text

Beyond Data Analytics: Recommendation(with Personalize) Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda function Elasticsearch Service Kibana S3 ElastiCache API Gateway Event (time-based) Lambda function Personalize Lambda function EMR

Slide 59

Slide 59 text

Fully managed hosting with auto scaling One-click deployment Notebook instances Built-in, high-performance algorithms Automatic model tuning One-click training Build Train Deploy Amazon SageMaker Build, Train, and Deploy Machine Learning Models Quickly & Easily, at Scale

Slide 60

Slide 60 text

Beyond Data Analytics: Machine Learning(with SageMaker) Kinesis Data Firehose S3 Athena QuickSight Kinesis Data Streams Lambda function Elasticsearch Service Kibana Train Notebook Model Models bucket SageMaker Endpoint SageMaker

Slide 61

Slide 61 text

Business Intelligence System

Slide 62

Slide 62 text

Journey to Analytics on

Slide 63

Slide 63 text

Data → Collect → Store → Process/Analyze → Consume → Answers & Insights
Services: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service, Amazon Machine Learning, AWS Lambda

Slide 64

Slide 64 text

Lessons Learned: Architectural Principles
• Build decoupled systems
  - Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job
  - Data structure, latency, throughput, access patterns
• Leverage managed and serverless services
  - Scalable/elastic, available, reliable, secure, no/low admin
• Use log-centric design patterns
  - Immutable logs (data lake), materialized views
• Be cost-conscious
  - Big data ≠ Big cost
• Work backwards
  - Design from consume to collect
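
The log-centric principle can be sketched in a few lines: events are only ever appended to a log, and any queryable state is a view derived from that log. The class and function names and the account events below are hypothetical, chosen only to illustrate the pattern.

```python
class EventLog:
    """Append-only log: events are added, never updated or deleted."""

    def __init__(self):
        self._events = []

    def append(self, event: dict) -> int:
        self._events.append(dict(event))   # store a copy so callers cannot mutate it
        return len(self._events) - 1       # offset of the new event

    def read(self, from_offset: int = 0):
        return [dict(e) for e in self._events[from_offset:]]


def materialize_balances(events):
    """Materialized view: current balance per account, derived from the log."""
    balances = {}
    for event in events:
        balances[event["account"]] = balances.get(event["account"], 0) + event["delta"]
    return balances


log = EventLog()
log.append({"account": "a", "delta": 100})
log.append({"account": "b", "delta": 50})
log.append({"account": "a", "delta": -30})

view = materialize_balances(log.read())
rebuilt = materialize_balances(log.read())  # a view can always be rebuilt from the log
print(view)
```

Because the log is the source of truth, a new view with different logic can be added later by replaying the same events.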

Slide 65

Slide 65 text

Learn more
- Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AWS re:Invent 2018
  https://www.slideshare.net/AmazonWebServices/big-data-analytics-architectural-patterns-and-best-practices-ant201r1-aws-reinvent-2018
- Everything You Need to Know About Big Data: From Architectural Principles to Best Practices
  https://www.slideshare.net/AmazonWebServices/everything-you-need-to-know-about-big-data-from-architectural-principles-to-best-practices
- Big Data Analytics Options on AWS
  https://d1.awsstatic.com/whitepapers/Big_Data_Analytics_Options_on_AWS.pdf
- AWS Big Data Blog
  https://aws.amazon.com/ko/blogs/big-data
- AWS Well-Architected Labs
  https://wellarchitectedlabs.com

Slide 66

Slide 66 text

Q & A