Real-time Analytics on AWS

Slide 1

Slide 1 text

Real-time Analytics on AWS
Sungmin, Kim
Solutions Architect, AWS

Slide 2

Slide 2 text

Agenda
• Why Real-time Data Streaming and Analytics?
• How to Build?
  - Where to Store Streaming Data?
  - How to Ingest Streaming Data?
  - How to Process Streaming Data?
  - Deliver Streaming Data
• Dive into Stream Process Framework
• Transform, Aggregate, Join Streaming Data
• Case Studies
• Key Takeaways

Slide 3

Slide 3 text

Why Real-time Data streaming and Analytics?

Slide 4

Slide 4 text

Data
“The world’s most valuable resource is no longer oil, but data.”*
*Copyright: David Parkins, The Economist, 2017

Slide 5

Slide 5 text

Data Loses Value Over Time
[Figure: value of data to decision-making plotted against time (real time, seconds, minutes, hours, days, months). Labels along the curve: preventive/predictive, actionable, reactive, historical. Time-critical decisions are marked at the fast end; traditional “batch” business intelligence is marked at the slow end.]
* Source: Mike Gualtieri, Forrester, Perishable insights

Slide 6

Slide 6 text

To create Value, derive insights in Real-time * image source: https://androidby.com/wp-content/uploads/2020/04/Need-for-Speed-No-Limits-4.4.6-.APK-MOD-Unlimited-money.png

Slide 7

Slide 7 text

Batch vs Real-time
• Continuity: Batch: arbitrarily or periodically. Real-time: constant.
• Method of analysis: Batch: Store → Process (Hadoop MapReduce, Hive, Pig, Spark). Real-time: Process → Store (Spark Streaming, Flink, Apache Storm).
• Data size per unit: Batch: Small - Huge (KB~TB). Real-time: Small (B~KB).
• Query Latency: Batch: Low - High (minutes to hours). Real-time: Low (milliseconds to minutes).
• Request Rate: Batch: Low - High (hourly/daily/monthly). Real-time: Very High - High (in seconds, minutes).
• Durability: Batch: High - Very high. Real-time: Low - High.
• Cost/GB: Batch: ¢~$ (Amazon S3, Glacier). Real-time: $$~$ (Redis, Memcached).

Slide 8

Slide 8 text

From Batch to Real-time: Lambda Architecture
[Diagram: Data Source → Stream Ingestion → Stream Storage (streaming data). Speed Layer: Stream Process → Real-time View. Batch Layer: Stream Delivery → Raw Data Storage → Batch Process → Batch View. Service Layer: the Consumer queries both views and merges the results.]

Slide 9

Slide 9 text

Lambda Architecture
[Diagram: Streaming Data feeds two layers. Batch Layer: Raw Data → Batch Process → Batch View. Speed Layer: Stream Process → Real-time View. Serving Layer: queries read the Batch View and the Real-time View.]

Slide 10

Slide 10 text

Key Components of Real-time Analytics
• Data Source: Devices and/or applications that produce real-time data at high velocity
• Stream Ingestion: Data from tens of thousands of data sources can be written to a single stream
• Stream Storage: Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time
• Stream Process: Records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Data Sink: Data lake (most common), Database (least common)

Slide 11

Slide 11 text

Where to Store Streaming Data?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]

Slide 12

Slide 12 text

Stream Storage
• Amazon Kinesis Data Streams
• Amazon Managed Streaming for Kafka
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]

Slide 13

Slide 13 text

[Diagram: Producers send records with partition keys (PK) through a Hash Function to shard/partition-1, shard/partition-2 and shard/partition-3 in Amazon Kinesis Data Streams or Amazon Managed Streaming for Kafka. Each shard/partition keeps its records in order, numbered from 0 (oldest data) upward (newest data). The Consumers in a Consumer Group each track their next consumer offset.]
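The hash-function step in this diagram can be sketched in a few lines of Python. Kinesis Data Streams maps a partition key to a 128-bit integer with MD5 and places the record in the shard whose hash-key range contains that integer. The sketch below assumes the shards split the hash space into equal, contiguous ranges, which is not guaranteed after resharding; the function names are illustrative and not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def shard_for(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash-key range holds the key.

    Assumption: the hash space is divided into equal, contiguous ranges.
    """
    range_size = HASH_SPACE // shard_count
    # The last shard absorbs the remainder left by integer division.
    return min(hash_key(partition_key) // range_size, shard_count - 1)


if __name__ == "__main__":
    for key in ("device-1", "device-2", "device-3"):
        print(key, "->", "shard", shard_for(key, 3))
```

Because the mapping depends only on the key, all records with the same partition key land in the same shard, which is what preserves per-key ordering.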

Slide 14

Slide 14 text

Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce

Slide 15

Slide 15 text

What about SQS?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (standard)
• FIFO queue preserves client ordering
• No streaming MapReduce
• No parallel consumption
• Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
[Diagram: Producers → Amazon SQS Queue → Consumers. A Standard queue may deliver messages 1 to 4 out of order; a FIFO queue delivers them in order. A Publisher sends to an Amazon SNS Topic, whose Subscribers are an Amazon SQS queue and an AWS Lambda function.]

Slide 16

Slide 16 text

Topic Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka

Slide 17

Slide 17 text

Amazon Managed Streaming for Kafka
• Operational Considerations
  - Number of clusters?
  - Number of brokers per cluster?
  - Number of topics per broker?
  - Number of partitions per topic?
  - Can only increase the number of partitions; can’t decrease
• Integration with a few AWS services, such as Kinesis Data Analytics for Apache Flink

Amazon Kinesis Data Streams
• Operational Considerations
  - Number of Kinesis Data Streams?
  - Number of shards per stream?
  - Can increase or decrease the number of shards
• Full integration with AWS services such as Lambda functions, Kinesis Data Analytics, etc.

Slide 18

Slide 18 text

Metrics to Monitor: MSK (Kafka)
• RequestQueue: Length, WaitTime
• ResponseQueue: Length, WaitTime
• Network: Packet Drop?
• Produce/Consume Rate Unbalance
• Who is Leader?
• Disk Full?
• Too many topics?

Slide 19

Slide 19 text

Metrics to Monitor: MSK (Kafka)
Metric (Level): Description
• ActiveControllerCount (DEFAULT): Only one controller per cluster should be active at any given time.
• OfflinePartitionsCount (DEFAULT): Total number of partitions that are offline in the cluster.
• GlobalPartitionCount (DEFAULT): Total number of partitions across all brokers in the cluster.
• GlobalTopicCount (DEFAULT): Total number of topics across all brokers in the cluster.
• KafkaAppLogsDiskUsed (DEFAULT): The percentage of disk space used for application logs.
• KafkaDataLogsDiskUsed (DEFAULT): The percentage of disk space used for data logs.
• RootDiskUsed (DEFAULT): The percentage of the root disk used by the broker.
• PartitionCount (PER_BROKER): The number of partitions for the broker.
• LeaderCount (PER_BROKER): The number of leader replicas.
• UnderMinIsrPartitionCount (PER_BROKER): The number of under minIsr partitions for the broker.
• UnderReplicatedPartitions (PER_BROKER): The number of under-replicated partitions for the broker.
• FetchConsumerTotalTimeMsMean (PER_BROKER): The mean total time in milliseconds that consumers spend on fetching data from the broker.
• ProduceTotalTimeMsMean (PER_BROKER): The mean produce time in milliseconds.

Slide 20

Slide 20 text

How about monitoring Kinesis Data Streams?
[Diagram: a Consumer Application calls GetRecords() and receives Data]
How long does a record stay in a shard?

Slide 21

Slide 21 text

Metrics to Monitor: Kinesis Data Streams
Metric: Description
• GetRecords.IteratorAgeMilliseconds: Age of the last record in all GetRecords
• ReadProvisionedThroughputExceeded: Number of GetRecords calls throttled
• WriteProvisionedThroughputExceeded: Number of PutRecord(s) calls throttled
• PutRecord.Success, PutRecords.Success: Number of successful PutRecord(s) operations
• GetRecords.Success: Number of successful GetRecords operations
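To make the iterator-age metric concrete, here is a minimal sketch of how the age of the last record in a GetRecords batch can be derived and compared with the retention period. It assumes each record carries an `ApproximateArrivalTimestamp`, as GetRecords responses do, and that the stream uses the default 24-hour retention. The 50% alert threshold and the function names are illustrative choices, not AWS recommendations.

```python
from datetime import datetime, timedelta, timezone


def iterator_age_ms(records: list, now: datetime) -> int:
    """Age of the last record in a batch: now minus its arrival time."""
    if not records:
        return 0
    last_arrival = records[-1]["ApproximateArrivalTimestamp"]
    return int((now - last_arrival).total_seconds() * 1000)


def is_falling_behind(age_ms: int, retention_hours: int = 24,
                      threshold: float = 0.5) -> bool:
    """Flag a consumer whose lag exceeds a fraction of the retention period.

    Records older than the retention period are removed from the stream,
    so a steadily growing iterator age eventually means data loss.
    """
    retention_ms = retention_hours * 3600 * 1000
    return age_ms > retention_ms * threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    batch = [{"ApproximateArrivalTimestamp": now - timedelta(hours=13)}]
    age = iterator_age_ms(batch, now)
    print(age, is_falling_behind(age))
```

In practice the metric is read from CloudWatch rather than computed by hand; the sketch only shows what the number means.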

Slide 22

Slide 22 text

Choosing Good Metrics Too much information can be just as useless as too little

Slide 23

Slide 23 text

How to Ingest Streaming Data?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]

Slide 24

Slide 24 text

Stream Ingestion
• AWS SDKs
  - Publish directly from application code via APIs
  - AWS Mobile SDK
• Kinesis Agent
  - Monitors log files and forwards lines as messages to Kinesis Data Streams
• Kinesis Producer Library (KPL)
  - Background process aggregates and batches messages
• 3rd party and open source
  - Kafka Connect (kinesis-kafka-connector)
  - fluentd (aws-fluent-plugin-kinesis)
  - Log4J Appender (kinesis-log4j-appender)
  - and more …
[Pipeline: Data Source → Stream Ingestion → Stream Storage (Amazon Kinesis Data Streams) → Stream Process → Data Sink]
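When publishing directly through the SDK, batching is the producer's job. The sketch below groups records into batches that respect the PutRecords request limits (at most 500 records and 5 MB per call). It shows batching only; the KPL additionally aggregates several user records into one Kinesis record, which is not attempted here. The `batch_records` helper and the `(partition_key, data)` tuple layout are assumptions made for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

MAX_RECORDS_PER_CALL = 500            # PutRecords: records per request
MAX_BYTES_PER_CALL = 5 * 1024 * 1024  # PutRecords: total request payload

Record = Tuple[str, bytes]  # (partition_key, data); hypothetical layout


def record_size(record: Record) -> int:
    """Size counted against the request limit: partition key plus data."""
    key, data = record
    return len(key.encode("utf-8")) + len(data)


def batch_records(records: Iterable[Record],
                  max_records: int = MAX_RECORDS_PER_CALL,
                  max_bytes: int = MAX_BYTES_PER_CALL) -> Iterator[List[Record]]:
    """Yield batches that stay within the record-count and byte limits."""
    batch: List[Record] = []
    batch_bytes = 0
    for record in records:
        size = record_size(record)
        # Close the current batch if adding this record would break a limit.
        if batch and (len(batch) >= max_records or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch


if __name__ == "__main__":
    events = [(f"device-{i % 10}", b'{"temp": 21.5}') for i in range(1201)]
    print([len(b) for b in batch_records(events)])
```

Each batch could then be passed to a PutRecords call, retrying only the entries that the response reports as failed.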

Slide 25

Slide 25 text

How to Process Streaming Data?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]

Slide 26

Slide 26 text

Stream Delivery: Kinesis Data Firehose
Sources:
• Kinesis Agent
• CloudWatch Logs
• CloudWatch Events
• AWS IoT
• Direct PUT using APIs
• Kinesis Data Streams
• MSK (Kafka) using Kafka Connect
[Diagram: the sources above → Kinesis Data Firehose → S3, Redshift, Elasticsearch; Kinesis Data Analytics is also connected to Kinesis Data Firehose. Stream Delivery sits in the pipeline Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink.]

Slide 27

Slide 27 text

Kinesis Firehose: Filter, Enrich, Convert
[Diagram: Data Source (apache log) → Kinesis Data Firehose → Lambda function (apache log to json, geo-ip) → Kinesis Data Firehose → Data Sink (json)]
Input (apache log):
[Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
[Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]
Lambda function output:
{ "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } },
{ "recordId": "2", "result": "Dropped" }
Delivered to the Data Sink (json):
{ "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }
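A Lambda function for this slide's example might look like the sketch below. It follows the Firehose data-transformation contract: each incoming record has a `recordId` and base64-encoded `data`, and each returned record carries the same `recordId`, a `result` of `Ok`, `Dropped` or `ProcessingFailed`, and base64-encoded `data`. The log pattern, the rule of keeping only `error` lines, and the in-memory `GEO_IP` table are assumptions made to reproduce the slide; a real function would call a geo-IP service or database.

```python
import base64
import json
import re
from datetime import datetime

# Assumed layout of the slide's log lines: [date] [status] [client ip]
LOG_PATTERN = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>[^\]]+)\] \[client (?P<source>[^\]]+)\]"
)

# Hypothetical stand-in for a geo-IP lookup.
GEO_IP = {"192.34.86.178": {"city": "Boston", "state": "MA"}}


def transform(line: str):
    """Filter, convert and enrich one log line; return None to drop it."""
    match = LOG_PATTERN.match(line.strip())
    if match is None or match.group("status") != "error":
        return None  # filter: keep only error lines (assumption)
    timestamp = datetime.strptime(match.group("date"), "%a %b %d %H:%M:%S %Y")
    doc = {
        "date": timestamp.strftime("%Y/%m/%d %H:%M:%S"),  # convert
        "status": match.group("status"),
        "source": match.group("source"),
    }
    doc.update(GEO_IP.get(match.group("source"), {}))  # enrich
    return doc


def lambda_handler(event, context):
    """Entry point invoked by Kinesis Data Firehose with a batch of records."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        doc = transform(line)
        if doc is None:
            # Dropped records echo the original payload unchanged.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
        else:
            payload = (json.dumps(doc) + "\n").encode("utf-8")
            output.append({"recordId": record["recordId"],
                           "result": "Ok",
                           "data": base64.b64encode(payload).decode("utf-8")})
    return {"records": output}
```

The trailing newline on each payload keeps the delivered objects readable as one JSON document per line.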

Slide 28

Slide 28 text

Pre-built Data Transformation Blueprints
Blueprint: Description
• General Processing: For custom transformation logic
• Apache Log to JSON: Parses and converts Apache log lines to JSON objects using predefined JSON field names
• Apache Log to CSV: Parses and converts Apache log lines to CSV format
• Syslog to JSON: Parses and converts Syslog lines to JSON objects using predefined JSON field names
• Syslog to CSV: Parses and converts Syslog lines to CSV format

Slide 29

Slide 29 text

Pre-built Data Conversion
• Convert the format of your input data from JSON to a columnar data format (Apache Parquet or Apache ORC) before storing the data in Amazon S3
• Works in conjunction with the transform features to convert other formats to JSON before the data conversion
[Diagram: Data Source (JSON) → Kinesis Data Firehose, which converts to columnar format using the data schema in the AWS Glue Data Catalog → Amazon S3, with a /failed path shown in the diagram]

Slide 30

Slide 30 text

Failure and Error Handling
• S3 Destination
  - Pause and retry for up to 24 hours (maximum data retention period)
  - If data delivery fails for more than 24 hours, your data is lost.
• Redshift Destination
  - Configurable retry duration (0-2 hours)
  - After retry, skip and load error manifest files to S3’s errors/ folder
• Elasticsearch Destination
  - Configurable retry duration (0-2 hours)
  - After retry, skip and load failed records to S3’s elasticsearch_failed/ folder

Slide 31

Slide 31 text

Stream Process
• Transform
  - Filter, Enrich, Convert
• Aggregation
  - Windowed Queries
  - Top-K Contributor
• Join
  - Stream-Stream Join
  - Stream-(External) Table Join
Services: AWS Lambda, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]

Slide 32

Slide 32 text

Dive into Stream Process Services

Slide 33

Slide 33 text

AWS Lambda
• Serverless functions
• Event-based, stateless processing
• Continuous and simple scaling mechanism
[Diagram: event (1), event (2) and event (3) are handled by Lambda (1), Lambda (2) and Lambda (3)]

Slide 34

Slide 34 text

• Amazon Kinesis Data Analytics: Serverless
• AWS Glue: Serverless
• Amazon EMR: Fully Managed

Slide 35

Slide 35 text

Architecture: Master-Worker Master Worker (1) Worker (2) Worker (3) part-01 part-02 part-03 part-01 part-02 part-03

Slide 36

Slide 36 text

Master Workers Architecture

Slide 37

Slide 37 text

Architecture Workers Master

Slide 38

Slide 38 text

Streaming Programming Guide

Slide 39

Slide 39 text

Treat Streams as Unbounded Tables

Slide 40

Slide 40 text

“It's raining cats and dogs!”
→ ["It's", "raining", "cats", "and", "dogs!"]
→ [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
→ It’s 1, raining 1, cats 1, and 1, dogs! 1
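The three steps on this slide (split the sentence, map each word to a `(word, 1)` pair, then sum the pairs per word) can be written as a plain-Python sketch. This is not the code used in the talk, only an illustration of the same word-count logic; the function names are made up for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(line: str) -> List[str]:
    """Split a sentence into words on whitespace."""
    return line.split()


def to_pairs(words: List[str]) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in words]


def count(pairs: List[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the counts for each distinct word."""
    totals: Counter = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)


if __name__ == "__main__":
    sentence = "It's raining cats and dogs!"
    print(count(to_pairs(tokenize(sentence))))
```

In a streaming engine the same pipeline runs continuously: the reduce step keeps its totals as state and updates them as new sentences arrive.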

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

“It's raining cats and dogs!”
→ ["It's", "raining", "cats", "and", "dogs!"]
→ [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
→ It’s 1, raining 1, cats 1, and 1, dogs! 1

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Setup session → Read stream → Apply Streaming ETL → Start running

Slide 45

Slide 45 text

What about (Stream) SQL?
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream SQL Process → Data Sink]
“It's raining cats and dogs!”
→ [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
→ It’s 1, raining 1, cats 1, and 1, dogs! 1

Slide 46

Slide 46 text

Kinesis Data Analytics (SQL)
• STREAM (in-application): a continuously updated entity that you can SELECT from and INSERT into like a TABLE
• PUMP: an entity used to continuously 'SELECT ... FROM' a source STREAM, and INSERT SQL results into an output STREAM
• Create an output stream, which can be used to send results to a destination
[Diagram: Source → SOURCE STREAM → INSERT & SELECT (PUMP) → DESTIN. STREAM → Destination. Example records: [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]]

Slide 47

Slide 47 text

Kinesis Data Analytics SQL vs Java

Slide 48

Slide 48 text

DEMO

Slide 49

Slide 49 text

DEMO: Word Count
[Diagram: two paths, labeled 1 and 2, built from Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics (Java), Amazon Kinesis Data Analytics (SQL), Amazon Kinesis Data Firehose and Amazon S3]
“It's raining cats and dogs!”
→ [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
→ It’s 1, raining 1, cats 1, and 1, dogs! 1
https://aws.amazon.com/ko/blogs/aws/new-amazon-kinesis-data-analytics-for-java/

Slide 50

Slide 50 text

Filter, Enrich, Convert Streaming Data
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]

Slide 51

Slide 51 text

Revisit Example: Filter, Enrich, Convert
[Diagram: Data Source (apache log) → Kinesis Data Firehose → Lambda function (apache log to json, geo-ip) → Data Sink (json). The sample log lines and JSON records are the same as on Slide 27.]

Slide 52

Slide 52 text

Stream Process: Filter, Enrich, Convert
[Diagram: Data Source (apache log) → Amazon Kinesis Data Streams → Lambda function, Amazon Kinesis Data Analytics, AWS Glue or Amazon EMR (apache log to json, geo-ip) → Data Sink. The sample log lines and JSON records are the same as on Slide 27.]

Slide 53

Slide 53 text

Stream Process: Filter, Enrich, Convert
[Diagram: Data Source (apache log) → Amazon MSK → Amazon Kinesis Data Analytics (Java), AWS Glue or Amazon EMR (apache log to json, geo-ip) → Data Sink. The sample log lines and JSON records are the same as on Slide 27.]

Slide 54

Slide 54 text

Kinesis Data Analytics (SQL): Preprocessing Data https://aws.amazon.com/ko/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/

Slide 55

Slide 55 text

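Integration of Stream Process and Stream Storage
(O = supported, X = not supported)
• Kinesis Data Firehose: AWS Lambda O, Kinesis Data Analytics (SQL) O, Kinesis Data Analytics (Flink) X, Glue X, EMR X
• Kinesis Data Streams: AWS Lambda O, Kinesis Data Analytics (SQL) O, Kinesis Data Analytics (Flink) O, Glue O, EMR O
• Managed Streaming for Kafka (MSK): AWS Lambda O, Kinesis Data Analytics (SQL) X, Kinesis Data Analytics (Flink) O, Glue O, EMR O
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]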

Slide 56

Slide 56 text

Aggregate Streaming Data
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]

Slide 57

Slide 57 text

Stream Process: Aggregation
• Aggregations (count, sum, min, ...) take granular real-time data and turn it into insights
• Data is processed continuously, so you need to tell the application when you want results
• Windowed Queries
  a. Sliding Windows (with Overlap)
  b. Tumbling Windows (No Overlap)
  c. Custom Windows
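The difference between the first two window types can be shown with a small sketch that counts events per window. It assumes integer event timestamps and half-open windows `[start, start + size)`. The sliding variant here is the fixed-step overlapping window found in engines such as Apache Flink; it is an illustration, not the exact semantics of any one service.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[int, str]  # (timestamp, payload); hypothetical layout


def tumbling_counts(events: Iterable[Event], size: int) -> Dict[int, int]:
    """Count events per non-overlapping window, keyed by window start."""
    counts: Dict[int, int] = defaultdict(int)
    for ts, _ in events:
        counts[ts - ts % size] += 1  # each event falls in exactly one window
    return dict(counts)


def sliding_counts(events: Iterable[Event], size: int, slide: int) -> Dict[int, int]:
    """Count events per overlapping window that starts every `slide` units."""
    counts: Dict[int, int] = defaultdict(int)
    for ts, _ in events:
        start = ts - ts % slide  # latest window start not after the event
        # Walk back through every earlier window that still covers the event.
        # Starts can be negative for events near time zero.
        while start > ts - size:
            counts[start] += 1
            start -= slide
    return dict(counts)


if __name__ == "__main__":
    events = [(1, "a"), (4, "b"), (6, "c"), (11, "d")]
    print(tumbling_counts(events, size=10))
    print(sliding_counts(events, size=10, slide=5))
```

With a window of 10 and a slide of 5, every event is counted in two windows, which is the overlap the slide refers to.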

Slide 58

Slide 58 text

Join Streaming Data
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]

Slide 59

Slide 59 text

Imagine It! How to build?

Slide 60

Slide 60 text

Stream Process: Join
(a) Stream-Stream Join: two Data Sources, each with its own Stream Storage, feed one Stream Process
(b) Stream-Join by Partition Key: two Data Sources write to one Stream Storage, read by one Stream Process
(c) Stream-Join by Hash Table: Data Source → Stream Storage → Stream Process, backed by a Key-Value Storage

Slide 61

Slide 61 text

Why is Stream-Stream Join so difficult?
• Timing
• Skewed data
[Diagram: two Data Source → Stream Storage pairs feed one Stream Process → Data Sink; records arrive at t0, t1, t2, ..., tN, with a gap ∆t between matching records]

Slide 62

Slide 62 text

How about Stream-Join by Partition Key?
[Diagram: instead of two Data Source → Stream Storage pairs feeding one Stream Process, both Data Sources write to a single Stream Storage (shard-1, shard-2, shard-3) that one Stream Process reads; records are marked t1, t2, t3, t5]
Each shard will be filled with records coming from fast data producers

Slide 63

Slide 63 text

Lastly, how about Stream-Join by Hash Table?
[Diagram: each Data Source has its own Stream Storage and Stream Process; the Stream Processes share a Key-Value Storage]
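A minimal sketch of the hash-table approach: one stream processor writes the latest record per key into a key-value store, and the other looks up that store for every record it handles. A plain dictionary stands in for the external key-value storage, and the field names (`user_id`, `name`, `page`) are hypothetical.

```python
from typing import Dict, Optional

# Stand-in for an external key-value store shared by both processors.
KeyValueStore = Dict[str, dict]


def upsert_profile(store: KeyValueStore, record: dict) -> None:
    """Processor for stream A: keep the latest profile per user_id."""
    store[record["user_id"]] = {"name": record["name"]}


def enrich_click(store: KeyValueStore, click: dict) -> Optional[dict]:
    """Processor for stream B: join a click with the stored profile.

    Returns None when no profile has arrived yet; a real pipeline would
    decide whether to drop, retry, or emit such records without enrichment.
    """
    profile = store.get(click["user_id"])
    if profile is None:
        return None
    return {**click, **profile}


if __name__ == "__main__":
    store: KeyValueStore = {}
    upsert_profile(store, {"user_id": "u1", "name": "Alice"})
    print(enrich_click(store, {"user_id": "u1", "page": "/home"}))
    print(enrich_click(store, {"user_id": "u2", "page": "/cart"}))
```

The two processors never wait for each other, so the timing problem of a stream-stream join is traded for the question of what to do when a lookup misses.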

Slide 64

Slide 64 text

DEMO

Slide 65

Slide 65 text

DEMO: Filter, Aggregate, Join
[Diagram: Amazon Kinesis Data Streams → Amazon Kinesis Data Analytics (SQL), which applies a continuous filter, an aggregate function and data enrichment (join) with reference data from a bucket with objects]
Input records:
{"TICKER_SYMBOL": "CVB", "SECTOR": "TECHNOLOGY", "CHANGE": 0.81, "PRICE": 53.63}
{"TICKER_SYMBOL": "ABC", "SECTOR": "RETAIL", "CHANGE": -1.14, "PRICE": 23.64}
{"TICKER_SYMBOL": "JKL", "SECTOR": "TECHNOLOGY", "CHANGE": 0.22, "PRICE": 15.32}
Reference data:
Ticker,Company
AMZN,Amazon
ASD,SomeCompanyA
BAC,SomeCompanyB
CRM,SomeCompanyC
https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-add-reference-data.html

Slide 66

Slide 66 text

Comparing Stream Process Services

Slide 67

Slide 67 text

DevOps! Master-Worker Framework
[Diagram: Master; Worker (1), Worker (2), Worker (3); partitions part-01, part-02, part-03]
• Is the Master alive?
• Does each Worker have enough resources (CPU, Memory, Disk)?
• Checkpoint?
• Right Instance Type? C-Family, or R-Family?
• Learning curve?
  - SQL
  - Python
  - Scala
  - Java

Slide 68

Slide 68 text

EMR vs Glue vs Kinesis Data Analytics
[Chart: Operational Excellence against Degree of Freedom ≈ Complexity, positioning Kinesis Data Analytics (SQL), Kinesis Data Analytics (Java), Glue and EMR]

Slide 69

Slide 69 text

Comparing stream processing services

AWS Lambda: Simple programming interface and scaling
• Serverless functions
• Six languages (Java, Python, Golang, Node.js, Ruby, C#)
• Event-based, stateless processing
• Continuous and simple scaling mechanism

Amazon Kinesis Data Analytics: Easy and powerful stream processing
• Serverless applications
• Supports SQL and Java (Apache Flink)
• Stateful processing with automatic backups
• Stream operators make building apps easy

AWS Glue: Simple, flexible, and cost-effective ETL & Data Catalog
• Serverless applications
• Can use the transforms native to Apache Spark Structured Streaming
• Automatically discovers new data and extracts schema definitions
• Automatically generates the ETL code

Amazon EMR: Flexibility and choice for your needs
• Choose your instances
• Use your favorite open-source framework
• Fine-grained control over cluster, debugging tools, and more
• Deep open-source tool integrations with AWS

Slide 70

Slide 70 text

Case Studies

Slide 71

Slide 71 text

Example Usage Pattern 1: Web Analytics and Leaderboards
Steps: Ingest web app data → Compute top 10 users → Persist to feed live apps
Components shown: Amazon Cognito with lightweight JS client code OR Web server on Amazon EC2, Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics, Lambda function, Amazon DynamoDB
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/

Slide 72

Slide 72 text

Example Usage Pattern 2: Monitoring IoT Devices
• Ingest sensor data
• Convert json to parquet
• Store all data points in an S3 data lake
https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/

Slide 73

Slide 73 text

Example Usage Pattern 3: Analyzing AWS CloudTrail Event Logs
Steps: Ingest raw log data → Compute operational metrics → Deliver to real time dashboards and archival
Components shown: AWS CloudTrail, CloudWatch Events trigger, Lambda function, Kinesis Data Firehose, Kinesis Data Analytics, S3 bucket for raw data, DynamoDB table, Chart.JS dashboard
https://aws.amazon.com/solutions/implementations/real-time-insights-account-activity/

Slide 74

Slide 74 text

Takeaways

Slide 75

Slide 75 text

Slide 76

Slide 76 text

Key Components of Real-time Analytics
• Stream Ingestion: AWS SDKs, Kinesis Agent, KPL, Kafka Connect
• Stream Storage: Kinesis Data Streams, Managed Streaming for Kafka
• Stream Delivery: Kinesis Data Firehose
• Stream Process: AWS Lambda, Kinesis Data Analytics, Glue, EMR
• Real-Time Applications: Aggregation, Top-K Contributor, Anomaly Detection
• Streaming ETL: Filter, Enrich, Convert; Join
[Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]

Slide 77

Slide 77 text

Key Takeaways
• Build decoupled systems
  - Data → Store → Process → Store → Analyze → Answers
  - Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
• Follow the principle of “extract data once and reuse multiple times” to power new customer experiences
• Use the right tool for the job
  - Know the soft and hard limits of AWS services
• Leverage managed and serverless services (DevOps!)
  - Scalable/elastic, available, reliable, secure, no/low admin

Slide 78

Slide 78 text

Where To Go Next?
• AWS Analytics Immersion Day - Build BI System from Scratch
  - Workshop - https://serverless-bi-system-from-scratch.workshop.aws/
  - Slides - https://tinyurl.com/serverless-bi-on-aws
  - Video - https://youtu.be/FX5iWHFn1v0
• Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1, 2
  - Part1 - https://tinyurl.com/y8vo8q7o
  - Part2 - https://tinyurl.com/ycbv7wel
• Streaming Analytics Workshop – Kinesis Data Analytics for Java (Flink) https://streaming-analytics.labgui.de/
• Amazon MSK Labs https://amazonmsk-labs.workshop.aws/
• Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming https://tinyurl.com/y7hklyff
• AWS Glue Streaming ETL - Scala Script Example https://tinyurl.com/y79x6jda