
Real-time Analytics on AWS


Agenda
• Why Real-time Data streaming and Analytics?
• How to Build?
• Where to Store streaming data?
• How to Ingest streaming data?
• How to Process streaming data?
• Deliver Streaming Data
• Dive into Stream Process Framework
• Transform, Aggregate, Join Streaming Data
• Case Studies
• Key Takeaways


Sungmin Kim

June 23, 2022



Transcript

  1. Real-time Analytics on AWS. Sungmin Kim, Solutions Architect, AWS

  2. Agenda • Why Real-time Data streaming and Analytics? • How

    to Build? • Where to Store streaming data? • How to Ingest streaming data? • How to Process streaming data? • Deliver Streaming Data • Dive into Stream Process Framework • Transform, Aggregate, Join Streaming Data • Case Studies • Key Takeaways
  3. Why Real-time Data streaming and Analytics?

  4. Data “The world’s most valuable resource is no longer oil,

    but data.” *Copyright: David Parkins, The Economist, 2017
  5. Data Loses Value Over Time *Source: Mike Gualtieri, Forrester,

    “Perishable Insights” (Chart: value of data to decision-making declines over time. Real time/seconds: preventive/predictive, time-critical decisions; minutes/hours: actionable; days: reactive; months: historical, traditional “batch” business intelligence.)
  6. To create Value, derive insights in Real-time * image source:

    https://androidby.com/wp-content/uploads/2020/04/Need-for-Speed-No-Limits-4.4.6-.APK-MOD-Unlimited-money.png
  7. Batch vs Real-time

    | Batch | Difference | Real-time |
    | --- | --- | --- |
    | Arbitrary or periodic | Continuity | Constant |
    | Store → Process (Hadoop MapReduce, Hive, Pig, Spark) | Method of analysis | Process → Store (Spark Streaming, Flink, Apache Storm) |
    | Small to huge (KB~TB) | Data size per unit | Small (B~KB) |
    | Low to high (minutes to hours) | Query latency | Low (milliseconds to minutes) |
    | Low to high (hourly/daily/monthly) | Request rate | High to very high (in seconds, minutes) |
    | High to very high | Durability | Low to high |
    | ¢~$ (Amazon S3, Glacier) | Cost/GB | $$~$ (Redis, Memcached) |
  8. From Batch to Real-time: Lambda Architecture Data Source Stream Storage

    Speed Layer Batch Layer Batch Process Batch View Real- time View Consumer Query & Merge Results Service Layer Stream Ingestion Raw Data Storage Streaming Data Stream Delivery Stream Process
  9. Lambda Architecture Streaming Data Batch View Stream Process Real-time View

    Query Query Batch View Real-time View Raw Data Batch Process Batch Layer Serving Layer Speed Layer
  10. Key Components of Real-time Analytics Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink. Data Source: devices and/or applications that produce real-time data at high velocity. Stream Ingestion: data from tens of thousands of data sources can be written to a single stream. Stream Storage: data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time. Stream Process: records are read in the order they are produced, enabling real-time analytics or streaming ETL. Data Sink: data lake (most common), database (least common).
  11. Where to Store Streaming Data? Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink
  12. Stream Storage Data Source Stream Storage Stream Process Stream Ingestion

    Data Sink Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka
  13. Hash Function (diagram): Amazon Kinesis Data Streams / Amazon

    Managed Streaming for Kafka. Producers write records to shards/partitions selected by a hash function over the partition key (PK); within each shard/partition, records are ordered from oldest data to newest data, and each consumer in a consumer group tracks its own next offset.
  14. Why Stream Storage? • Decouple producers & consumers •

    Persistent buffer • Collect multiple streams • Preserve client ordering • Parallel consumption • Streaming MapReduce
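The ordering and parallel-consumption properties above come from partition-key hashing. Below is a minimal plain-Python sketch of that routing, assuming an even split of the 128-bit MD5 hash space across shards (which is how Kinesis assigns hash-key ranges when a stream is created with evenly split shards); the function name and sample keys are illustrative.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key onto a shard the way Kinesis does:
    MD5 the key into a 128-bit integer, then find which shard's
    hash-key range it falls into (ranges split the space evenly)."""
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    shard_size = (2 ** 128) // num_shards
    return min(key_hash // shard_size, num_shards - 1)

# All records with the same partition key land on the same shard,
# so per-key (per-client) ordering is preserved.
events = [("device-42", "t1"), ("device-7", "t1"), ("device-42", "t2")]
routed = [(key, shard_for_key(key, 3)) for key, _ in events]
```

Because two events from `device-42` hash to the same shard, a consumer reading that shard sees them in production order, while other shards are consumed in parallel.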
  15. What about SQS? • Decouple producers & consumers • Persistent

    buffer • Collect multiple streams • No client ordering with a Standard queue; a FIFO queue preserves client ordering • No streaming MapReduce • No parallel consumption • Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions). (Diagram: producers write to an Amazon SQS queue, Standard or FIFO, read by consumers; a publisher sends to an Amazon SNS topic with an AWS Lambda function and an Amazon SQS queue as subscribers.)
  16. Topic Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka

  17. Amazon Kinesis Data Streams vs Amazon Managed Streaming for Kafka •

    MSK operational considerations: number of clusters? number of brokers per cluster? number of topics per broker? number of partitions per topic? You can only increase the number of partitions; you can’t decrease it. Integrates with only a few AWS services, such as Kinesis Data Analytics for Apache Flink. • Kinesis Data Streams operational considerations: number of streams? number of shards per stream? You can increase or decrease the number of shards. Fully integrated with AWS services such as Lambda, Kinesis Data Analytics, etc.
  18. Metrics to Monitor: MSK (Kafka). Questions to answer: RequestQueue

    (length, wait time)? ResponseQueue (length, wait time)? Network packet drops? Produce/consume rate imbalance? Who is the leader? Disk full? Too many topics?
  19. Metrics to Monitor: MSK (Kafka)

    | Metric | Level | Description |
    | --- | --- | --- |
    | ActiveControllerCount | DEFAULT | Only one controller per cluster should be active at any given time. |
    | OfflinePartitionsCount | DEFAULT | Total number of partitions that are offline in the cluster. |
    | GlobalPartitionCount | DEFAULT | Total number of partitions across all brokers in the cluster. |
    | GlobalTopicCount | DEFAULT | Total number of topics across all brokers in the cluster. |
    | KafkaAppLogsDiskUsed | DEFAULT | The percentage of disk space used for application logs. |
    | KafkaDataLogsDiskUsed | DEFAULT | The percentage of disk space used for data logs. |
    | RootDiskUsed | DEFAULT | The percentage of the root disk used by the broker. |
    | PartitionCount | PER_BROKER | The number of partitions for the broker. |
    | LeaderCount | PER_BROKER | The number of leader replicas. |
    | UnderMinIsrPartitionCount | PER_BROKER | The number of under-minIsr partitions for the broker. |
    | UnderReplicatedPartitions | PER_BROKER | The number of under-replicated partitions for the broker. |
    | FetchConsumerTotalTimeMsMean | PER_BROKER | The mean total time in milliseconds that consumers spend fetching data from the broker. |
    | ProduceTotalTimeMsMean | PER_BROKER | The mean produce time in milliseconds. |
  20. How about monitoring Kinesis Data Streams? Consumer Application GetRecords() Data

    How long does a record stay in a shard?
  21. Metrics to Monitor: Kinesis Data Streams

    | Metric | Description |
    | --- | --- |
    | GetRecords.IteratorAgeMilliseconds | Age of the last record in all GetRecords calls |
    | ReadProvisionedThroughputExceeded | Number of GetRecords calls throttled |
    | WriteProvisionedThroughputExceeded | Number of PutRecord(s) calls throttled |
    | PutRecord.Success, PutRecords.Success | Number of successful PutRecord(s) operations |
    | GetRecords.Success | Number of successful GetRecords operations |
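GetRecords.IteratorAgeMilliseconds is the metric that answers "how long does a record sit in a shard unread". As a sketch, these are the parameters one might pass to CloudWatch via boto3's `put_metric_alarm` to page on consumer lag; the stream name, threshold, and SNS topic ARN are illustrative placeholders.

```python
# With boto3 you would call: cloudwatch.put_metric_alarm(**iterator_age_alarm)
# Stream name, threshold, and SNS topic ARN below are placeholders.
iterator_age_alarm = {
    "AlarmName": "my-stream-consumer-lag",
    "Namespace": "AWS/Kinesis",
    "MetricName": "GetRecords.IteratorAgeMilliseconds",
    "Dimensions": [{"Name": "StreamName", "Value": "my-stream"}],
    "Statistic": "Maximum",
    "Period": 60,                 # evaluate once per minute
    "EvaluationPeriods": 3,       # three breaching minutes in a row
    "Threshold": 60_000.0,        # oldest unread record older than 1 minute
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],
}
```

A sustained rise in iterator age means consumers are falling behind producers; once it approaches the stream's retention period, unread data is at risk of being lost.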
  22. Choosing Good Metrics Too much information can be just as

    useless as too little
  23. How to Ingest Streaming Data? Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink
  24. Stream Ingestion • AWS SDKs • Publish directly from application

    code via APIs • AWS Mobile SDK • Kinesis Agent • Monitors log files and forwards lines as messages to Kinesis Data Streams • Kinesis Producer Library (KPL) • Background process aggregates and batches messages • 3rd party and open source • Kafka Connect (kinesis-kafka-connector) • fluentd (aws-fluent-plugin-kinesis) • Log4J Appender (kinesis-log4j-appender) • and more … Data Source Stream Storage Stream Process Stream Ingestion Data Sink Amazon Kinesis Data Streams
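The KPL's background aggregation and batching can be approximated in plain Python. This sketch groups records into PutRecords-sized batches under the API's documented limits (500 records and 5 MB per call); the helper name is ours, not part of any SDK.

```python
MAX_RECORDS_PER_CALL = 500            # PutRecords API limit
MAX_BYTES_PER_CALL = 5 * 1024 * 1024  # 5 MB per PutRecords call

def batch_records(records):
    """Group (partition_key, data_bytes) pairs into PutRecords-sized
    batches, roughly the way the KPL/agent batch in the background."""
    batches, current, current_bytes = [], [], 0
    for key, data in records:
        size = len(data) + len(key.encode("utf-8"))
        if current and (len(current) == MAX_RECORDS_PER_CALL
                        or current_bytes + size > MAX_BYTES_PER_CALL):
            batches.append(current)
            current, current_bytes = [], 0
        current.append({"PartitionKey": key, "Data": data})
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch could then be sent with a single `kinesis.put_records(StreamName=..., Records=batch)` call, amortizing per-request overhead.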
  25. How to Process Streaming Data? Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink
  26. Stream Delivery Data Source Stream Storage Stream Process Stream Ingestion

    Data Sink. Kinesis Data Firehose sources: • Kinesis Agent • CloudWatch Logs • CloudWatch Events • AWS IoT • Direct PUT using APIs • Kinesis Data Streams • MSK (Kafka) using Kafka Connect. Destinations: S3, Elasticsearch, Redshift, Kinesis Data Analytics.
  27. Kinesis Firehose: Filter, Enrich, Convert Data Source apache log apache

    log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } json Lambda function Kinesis Data Firehose
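The Lambda function in this diagram follows Firehose's data-transformation contract: each input record carries a base64-encoded payload and must come back with a `recordId`, a `result` (`Ok`, `Dropped`, or `ProcessingFailed`), and, when kept, new base64 `data`. A minimal sketch that parses the Apache-style log lines above and drops non-errors; the geo-ip enrichment (city/state) shown on the slide is omitted here.

```python
import base64
import json
import re

LOG_PATTERN = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<source>[^\]]+)\]"
)

def handler(event, context):
    """Firehose data-transformation Lambda: parse Apache-style log
    lines into JSON, keep errors, drop everything else."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        m = LOG_PATTERN.search(line)
        if m and m.group("status") == "error":
            doc = json.dumps(m.groupdict())
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(doc.encode("utf-8")).decode("utf-8"),
            })
        else:
            output.append({"recordId": record["recordId"], "result": "Dropped"})
    return {"records": output}
```

Firehose buffers the transformed records and writes only the `Ok` ones to the destination; `Dropped` records are acknowledged and discarded.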
  28. Pre-built Data Transformation Blueprints

    | Blueprint | Description |
    | --- | --- |
    | General Processing | For custom transformation logic |
    | Apache Log to JSON | Parses and converts Apache log lines to JSON objects using predefined JSON field names |
    | Apache Log to CSV | Parses and converts Apache log lines to CSV format |
    | Syslog to JSON | Parses and converts Syslog lines to JSON objects using predefined JSON field names |
    | Syslog to CSV | Parses and converts Syslog lines to CSV format |
  29. Pre-built Data Conversion Data Source Kinesis Data Firehose JSON Data

    schema AWS Glue Data Catalog Amazon S3 • Convert the format of your input data from JSON to a columnar data format, Apache Parquet or Apache ORC, before storing the data in Amazon S3 • Works in conjunction with the transform feature to convert other formats to JSON before the data conversion (records that fail conversion are delivered to a /failed prefix)
  30. Failure and Error Handling • S3 Destination • Pause and

    retry for up to 24 hours (maximum data retention period) • If data delivery fails for more than 24 hours, your data is lost. • Redshift Destination • Configurable retry duration (0-2 hours) • After retry, skip and load error manifest files to S3’s errors/ folder • Elasticsearch Destination • Configurable retry duration (0-2 hours) • After retry, skip and load failed records to S3’s elasticsearch_failed/ folder
  31. Stream Process • Transform • Filter, Enrich, Convert • Aggregation

    • Windowed Queries • Top-K Contributors • Join • Stream-Stream Join • Stream-(External) Table Join Data Source Stream Storage Stream Process Stream Ingestion Data Sink AWS Lambda Amazon Kinesis Data Analytics AWS Glue Amazon EMR
  32. Dive into Stream Process Services

  33. AWS Lambda • Serverless functions • Event-based, stateless processing •

    Continuous and simple scaling mechanism event (3) event (2) event (1) Lambda (1) Lambda (2) Lambda (3)
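When Lambda is the stream processor, an event source mapping invokes the function with a batch of records whose payloads arrive base64-encoded under `Records[*].kinesis.data`. A minimal stateless handler in that shape; the `SECTOR` field is borrowed from the demo payloads later in the deck.

```python
import base64
import json

def handler(event, context):
    """Lambda triggered by a Kinesis event source mapping: count
    records per sector in the invoked batch (stateless per batch)."""
    totals = {}
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        sector = payload.get("SECTOR", "UNKNOWN")
        totals[sector] = totals.get(sector, 0) + 1
    print(json.dumps(totals))  # visible in CloudWatch Logs
    return totals
```

Because each invocation sees only its own batch, any cross-batch state (running totals, windows) has to live in an external store such as DynamoDB.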
  34. Amazon Kinesis Data Analytics AWS Glue Amazon EMR Serverless Serverless

    Fully Managed
  35. Architecture: Master-Worker Master Worker (1) Worker (2) Worker (3) part-01

    part-02 part-03 part-01 part-02 part-03
  36. Master Workers Architecture

  37. Architecture Workers Master

  38. Streaming Programming Guide

  39. Treat Streams as Unbounded Tables

  40. “It's raining cats and dogs!” ["It's", "raining", "cats", "and", "dogs!"]

    [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1
  41. None
  42. “It's raining cats and dogs!” ["It's", "raining", "cats", "and", "dogs!"]

    [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1
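The pipeline on this slide (tokenize, pair each word with 1, reduce by key) can be sketched in plain Python; a Spark job would express the same three steps as flatMap, map, and reduceByKey over an unbounded table.

```python
def word_count(lines):
    """The word-count flow from the slides as plain Python:
    split into words, pair each with 1, then reduce by key."""
    tokens = [w for line in lines for w in line.split()]  # flatMap
    pairs = [(w, 1) for w in tokens]                      # map
    counts = {}
    for word, one in pairs:                               # reduceByKey
        counts[word] = counts.get(word, 0) + one
    return counts

counts = word_count(["It's raining cats and dogs!"])
# → {"It's": 1, "raining": 1, "cats": 1, "and": 1, "dogs!": 1}
```

In the streaming case the input list is unbounded, which is why the engine, not the program, decides when to emit the running counts.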
  43. None
  44. Setup session Read stream Start running Apply Streaming ETL

  45. What about (Stream) SQL? Data Source Stream Storage Stream SQL

    Process Stream Ingestion Data Sink [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] “It's raining cats and dogs!” It’s 1 raining 1 cats 1 and 1 dogs! 1
  46. Kinesis Data Analytics (SQL) • STREAM (in-application): a continuously updated

    entity that you can SELECT from and INSERT into like a TABLE • PUMP: an entity used to continuously 'SELECT ... FROM' a source STREAM, and INSERT SQL results into an output STREAM • Create output stream, which can be used to send to a destination SOURCE STREAM INSERT & SELECT (PUMP) DESTIN. STREAM Destination Source [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
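As a sketch of the STREAM-plus-PUMP pattern in Kinesis Data Analytics SQL: the stream and column names below are illustrative, `SOURCE_SQL_STREAM_001` is the default name KDA gives the first in-application input stream, and `STEP` defines a tumbling window.

```sql
-- In-application output stream (like a TABLE you can INSERT into)
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    word VARCHAR(64),
    cnt  INTEGER
);

-- A PUMP continuously SELECTs from the source stream and
-- INSERTs the results into the destination stream
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "word", COUNT(*) AS cnt
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY "word",
             STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
```

The destination stream can then be mapped to an application output, which is how results reach a Kinesis stream or Firehose delivery stream.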
  47. Kinesis Data Analytics SQL vs Java

  48. DEMO

  49. https://aws.amazon.com/ko/blogs/aws/new-amazon-kinesis-data-analytics-for-java/ Amazon Kinesis Data Streams Amazon Kinesis Data Firehose Amazon

    S3 Amazon Kinesis Data Analytics (Java) Amazon Kinesis Data Streams Amazon Kinesis Data Streams Amazon Kinesis Data Analytics (SQL) DEMO: Word Count “It's raining cats and dogs!” [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1 [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] 1 2
  50. Filter, Enrich, Convert Streaming Data Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink
  51. Revisit Example: Filter, Enrich, Convert Data Source Kinesis Data Firehose

    apache log apache log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } json Lambda function
  52. Stream Process: Filter, Enrich, Convert Data Source apache log apache

    log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } Amazon Kinesis Data Streams Lambda function Amazon EMR AWS Glue Amazon Kinesis Data Analytics
  53. Stream Process: Filter, Enrich, Convert Data Source apache log apache

    log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } Amazon EMR AWS Glue Amazon MSK Amazon Kinesis Data Analytics (Java)
  54. Kinesis Data Analytics (SQL): Preprocessing Data https://aws.amazon.com/ko/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/

  55. Integration of Stream Process and Stream Storage (O = supported, X = not supported)

    | Stream Storage | Lambda | Kinesis Data Analytics (SQL) | Kinesis Data Analytics (Flink) | Glue | EMR |
    | --- | --- | --- | --- | --- | --- |
    | Kinesis Data Firehose | O | O | X | X | X |
    | Kinesis Data Streams | O | O | O | O | O |
    | Managed Streaming for Kafka (MSK) | O | X | O | O | O |

    Data Source Stream Storage Stream Process Stream Ingestion Data Sink
  56. Aggregate Streaming Data Data Source Stream Storage Stream Process Stream

    Ingestion Data Sink
  57. Stream Process: Aggregation • Aggregations (count, sum, min,...) take granular

    real time data and turn it into insights • Data is continuously processed so you need to tell the application when you want results • Windowed Queries a. Sliding Windows (with Overlap) b. Tumbling Windows (No Overlap) c. Custom Windows
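A tumbling window can be sketched in a few lines: each event's timestamp is truncated to the start of its window, so windows never overlap and every event belongs to exactly one of them. This is illustrative plain Python, not any particular engine's API.

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Tumbling (non-overlapping) window aggregation: each event is
    assigned to the window identified by its start timestamp."""
    windows = defaultdict(int)
    for ts, _payload in events:
        window_start = ts - (ts % window_seconds)
        windows[window_start] += 1
    return dict(windows)

# Events at t=0s, 5s, 65s with a 60-second tumbling window
counts = tumbling_counts([(0, "a"), (5, "b"), (65, "c")], 60)
# → {0: 2, 60: 1}
```

A sliding window would instead assign each event to every window that overlaps its timestamp, which is why sliding results overlap while tumbling results partition the stream.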
  58. Join Streaming Data Data Source Stream Storage Stream Process Stream

    Ingestion Data Sink
  59. Imagine It! How to build?

  60. Stream Process: Join Data Source Stream Storage Data Source Stream

    Storage Stream Process Data Source Stream Storage Data Source Stream Process (a) Stream-Stream Join (b) Stream-Join by Partition Key (c) Stream-Join by Hash Table Data Source Stream Storage Stream Process Key-Value Storage
  61. Why is Stream-Stream Join so difficult? Data Source Stream Storage

    Data Source Stream Storage Stream Process Data Sink. Matching events from the two streams must arrive within the same time window (∆t over t0, t1, t2, …, tN) to be joined. • Timing • Skewed data
  62. How about Stream-Join by Partition Key? Data Source Stream Storage

    Data Source Stream Storage Stream Process Data Source Stream Storage Data Source Stream Process t1 t2 t3 t5 t1 t2 t3 t5 t1 t1 t2 t3 Each shard will be filled with records coming from fast data producers shard-1 shard-2 shard-3
  63. Lastly, how about Stream-Join by Hash Table? Data Source Stream

    Storage Stream Process Key-Value Storage Data Source Stream Storage Data Source Stream Storage Stream Process
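Option (c), the hash-table join, can be sketched in plain Python: index one stream by key, then join events whose timestamps fall within ∆t of each other, which is exactly the timing constraint slide 61 highlights. This toy version is illustrative; a real engine must also bound and expire the keyed state, since both streams are unbounded.

```python
from collections import defaultdict

def windowed_join(left, right, delta_t):
    """Join two streams of (ts, key, value) events when matching
    keys arrive within delta_t of each other."""
    by_key = defaultdict(list)          # hash table over the right stream
    for ts, key, value in right:
        by_key[key].append((ts, value))
    joined = []
    for ts, key, value in left:
        for rts, rvalue in by_key[key]:
            if abs(ts - rts) <= delta_t:
                joined.append((key, value, rvalue))
    return joined

pairs = windowed_join(
    left=[(10, "AMZN", 101.0), (200, "AMZN", 99.0)],
    right=[(12, "AMZN", "buy")],
    delta_t=30,
)
# Only the t=10/t=12 pair joins; t=200 is outside the 30-second window.
```

In production this keyed state typically lives in an external key-value store (as in the slide's diagram) or in the engine's managed state, with a TTL playing the role of ∆t.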
  64. DEMO

  65. {"TICKER_SYMBOL": "CVB", "SECTOR": "TECHNOLOGY", "CHANGE": 0.81, "PRICE": 53.63} {"TICKER_SYMBOL": "ABC",

    "SECTOR": "RETAIL", "CHANGE": -1.14, "PRICE": 23.64} {"TICKER_SYMBOL": "JKL", "SECTOR": "TECHNOLOGY", "CHANGE": 0.22, "PRICE": 15.32} Amazon Kinesis Data Streams Amazon Kinesis Data Analytics (SQL) DEMO: Filter, Aggregate, Join Continuous filter Aggregate function Data enrichment (join) Bucket with objects Ticker,Company AMZN,Amazon ASD,SomeCompanyA BAC,SomeCompanyB CRM,SomeCompanyC https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-add-reference-data.html
  66. Comparing Stream Process Services

  67. DevOps! Master-Worker Framework Master Worker (1) Worker (2) Worker (3)

    part-01 part-02 part-03 part-01 part-02 part-03. Is the master alive? Do workers have enough resources (CPU, memory, disk)? Checkpointing? Right instance type: C-family or R-family? Learning curve: SQL, Python, Scala, or Java?
  68. EMR vs Glue vs Kinesis Data Analytics Operational Excellence Kinesis

    Data Analytics (SQL) EMR Glue Kinesis Data Analytics (Java) Degree of Freedom ≈ Complexity
  69. AWS Glue Comparing stream processing services AWS Lambda Amazon Kinesis

    Data Analytics Amazon EMR Simple programming interface and scaling • Serverless functions • Six languages (Java, Python, Golang, Node.js, Ruby, C#) • Event-based, stateless processing • Continuous and simple scaling mechanism Easy and powerful stream processing Simple, flexible, and cost-effective ETL & Data Catalog Flexibility and choice for your needs • Serverless applications • Supports SQL and Java (Apache Flink) • Stateful processing with automatic backups • Stream operators make building app easy • Serverless applications • Can use the transforms native to Apache Spark Structured Streaming • Automatically discover new data, extracts schema definitions • Automatically generates the ETL code • Choose your instances • Use your favorite open-source framework • Fine-grained control over cluster, debugging tools, and more • Deep open-source tool integrations with AWS
  70. Case Studies

  71. Example Usage Pattern 1: Web Analytics and Leaderboards Amazon DynamoDB

    Amazon Kinesis Data Analytics Amazon Kinesis Data Streams Amazon Cognito Lightweight JS client code Web server on Amazon EC2 OR Compute top 10 users Ingest web app data Persist to feed live apps Lambda function https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
  72. Example Usage Pattern 2: Monitoring IoT Devices Ingest sensor

    data Convert json to parquet Store all data points in an S3 data lake https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/
  73. Example Usage Pattern 3: Analyzing AWS CloudTrail Event Logs AWS

    CloudTrail CloudWatch Events trigger Kinesis Data Analytics Lambda function S3 bucket for raw data DynamoDB table Chart.JS dashboard Compute operational metrics Ingest raw log data Deliver to real time dashboards and archival Kinesis Data Firehose https://aws.amazon.com/solutions/implementations/real-time-insights-account-activity/
  74. Takeaways

  75. From Batch to Real-time: Lambda Architecture Data Source Stream Storage

    Speed Layer Batch Layer Batch Process Batch View Real- time View Consumer Query & Merge Results Service Layer Stream Ingestion Raw Data Storage Streaming Data Stream Delivery Stream Process
  76. Key Components of Real-time Analytics Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink AWS Lambda Kinesis Data Analytics Glue EMR Kinesis Data Firehose Kinesis Data Streams Managed Streaming for Kafka Real-Time Applications - Aggregation - Top-K Contributor - Anomaly Detection Streaming ETL - Filter, Enrich, Convert - Join Kafka Connect KPL Kinesis Agent AWS SDKs
  77. Key Takeaways • Build decoupled systems • Data → Store

    → Process → Store → Analyze → Answers • Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink • Follow the principle of “extract data once and reuse multiple times” to power new customer experiences • Use the right tool for the job • Know the AWS services’ soft and hard limits • Leverage managed and serverless services (DevOps!) • Scalable/elastic, available, reliable, secure, no/low admin
  78. Where To Go Next? • AWS Analytics Immersion Day -

    Build BI System from Scratch • Workshop - https://serverless-bi-system-from-scratch.workshop.aws/ • Slides - https://tinyurl.com/serverless-bi-on-aws • Video - https://youtu.be/FX5iWHFn1v0 • Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1, 2 • Part1 - https://tinyurl.com/y8vo8q7o • Part2 - https://tinyurl.com/ycbv7wel • Streaming Analytics Workshop – Kinesis Data Analytics for Java (Flink) https://streaming-analytics.labgui.de/ • Amazon MSK Labs https://amazonmsk-labs.workshop.aws/ • Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming https://tinyurl.com/y7hklyff • AWS Glue Streaming ETL - Scala Script Example https://tinyurl.com/y79x6jda