
Real-time Analytics on AWS


Agenda
• Why Real-time Data streaming and Analytics?
• How to Build?
• Where to Store streaming data?
• How to Ingest streaming data?
• How to Process streaming data?
• Deliver Streaming Data
• Dive into Stream Process Framework
• Transform, Aggregate, Join Streaming Data
• Case Studies
• Key Takeaways


Sungmin Kim

June 23, 2022



Transcript

  1. Real-time Analytics on AWS. Sungmin Kim, Solutions Architect, AWS

  2. Agenda • Why Real-time Data streaming and Analytics? • How

    to Build? • Where to Store streaming data? • How to Ingest streaming data? • How to Process streaming data? • Deliver Streaming Data • Dive into Stream Process Framework • Transform, Aggregate, Join Streaming Data • Case Studies • Key Takeaways
  3. Why Real-time Data streaming and Analytics?

  4. Data “The world’s most valuable resource is no longer oil,

    but data.” *Copyright: David Parkins, The Economist, 2017
  5. Data Loses Value Over Time *Source: Mike Gualtieri, Forrester,

    “Perishable Insights” (Chart: value of data to decision-making declines over time. Real time/seconds: preventive/predictive, time-critical decisions; minutes/hours: actionable; days: reactive; months: historical, traditional “batch” business intelligence.)
  6. To create Value, derive insights in Real-time * image source:

    https://androidby.com/wp-content/uploads/2020/04/Need-for-Speed-No-Limits-4.4.6-.APK-MOD-Unlimited-money.png
  7. Batch vs Real-time

    | Batch | Difference | Real-time |
    | --- | --- | --- |
    | Arbitrary or periodic | Continuity | Constant |
    | Store → Process (Hadoop MapReduce, Hive, Pig, Spark) | Method of analysis | Process → Store (Spark Streaming, Flink, Apache Storm) |
    | Small to huge (KB~TB) | Data size per unit | Small (B~KB) |
    | Low to high (minutes to hours) | Query latency | Low (milliseconds to minutes) |
    | Low to high (hourly/daily/monthly) | Request rate | High to very high (in seconds, minutes) |
    | High to very high | Durability | Low to high |
    | ¢~$ (Amazon S3, Glacier) | Cost/GB | $$~$ (Redis, Memcached) |
  8. From Batch to Real-time: Lambda Architecture Data Source Stream Storage

    Speed Layer Batch Layer Batch Process Batch View Real- time View Consumer Query & Merge Results Service Layer Stream Ingestion Raw Data Storage Streaming Data Stream Delivery Stream Process
  9. Lambda Architecture Streaming Data Batch View Stream Process Real-time View

    Query Query Batch View Real-time View Raw Data Batch Process Batch Layer Serving Layer Speed Layer
  10. Key Components of Real-time Analytics Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink. Data Source: devices and/or applications that produce real-time data at high velocity. Stream Ingestion: data from tens of thousands of data sources can be written to a single stream. Stream Storage: data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time. Stream Process: records are read in the order they are produced, enabling real-time analytics or streaming ETL. Data Sink: data lake (most common), database (least common).
  11. Where to Store Streaming Data? Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink
  12. Stream Storage Data Source Stream Storage Stream Process Stream Ingestion

    Data Sink Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka
  13. Hash Function (diagram): Amazon Kinesis Data Streams / Amazon

    Managed Streaming for Kafka. Producers write records to shards/partitions selected by a hash function over the partition key (PK); within each shard/partition, records are ordered from oldest data to newest data, and each consumer in a consumer group tracks its own next offset.
  14. Why Stream Storage? • Decouple producers & consumers •

    Persistent buffer • Collect multiple streams • Preserve client ordering • Parallel consumption • Streaming MapReduce
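The ordering and parallel-consumption properties above come from partition-key hashing. Below is a minimal plain-Python sketch of that routing, assuming an even split of the 128-bit MD5 hash space across shards (which is how Kinesis assigns hash-key ranges when a stream is created with evenly split shards); the function name and sample keys are illustrative.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key onto a shard the way Kinesis does:
    MD5 the key into a 128-bit integer, then find which shard's
    hash-key range it falls into (ranges split the space evenly)."""
    key_hash = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    shard_size = (2 ** 128) // num_shards
    return min(key_hash // shard_size, num_shards - 1)

# All records with the same partition key land on the same shard,
# so per-key (per-client) ordering is preserved.
events = [("device-42", "t1"), ("device-7", "t1"), ("device-42", "t2")]
routed = [(key, shard_for_key(key, 3)) for key, _ in events]
```

Because two events from `device-42` hash to the same shard, a consumer reading that shard sees them in production order, while other shards are consumed in parallel.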
  15. What about SQS? • Decouple producers & consumers • Persistent

    buffer • Collect multiple streams • No client ordering with a Standard queue; a FIFO queue preserves client ordering • No streaming MapReduce • No parallel consumption • Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions). (Diagram: producers write to an Amazon SQS queue, Standard or FIFO, read by consumers; a publisher sends to an Amazon SNS topic with an AWS Lambda function and an Amazon SQS queue as subscribers.)
  16. Topic Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka

  17. Amazon Kinesis Data Streams vs Amazon Managed Streaming for Kafka •

    MSK operational considerations: number of clusters? number of brokers per cluster? number of topics per broker? number of partitions per topic? You can only increase the number of partitions; you can’t decrease it. Integrates with only a few AWS services, such as Kinesis Data Analytics for Apache Flink. • Kinesis Data Streams operational considerations: number of streams? number of shards per stream? You can increase or decrease the number of shards. Fully integrated with AWS services such as Lambda, Kinesis Data Analytics, etc.
  18. Metrics to Monitor: MSK (Kafka). Questions to answer: RequestQueue

    (length, wait time)? ResponseQueue (length, wait time)? Network packet drops? Produce/consume rate imbalance? Who is the leader? Disk full? Too many topics?
  19. Metrics to Monitor: MSK (Kafka)

    | Metric | Level | Description |
    | --- | --- | --- |
    | ActiveControllerCount | DEFAULT | Only one controller per cluster should be active at any given time. |
    | OfflinePartitionsCount | DEFAULT | Total number of partitions that are offline in the cluster. |
    | GlobalPartitionCount | DEFAULT | Total number of partitions across all brokers in the cluster. |
    | GlobalTopicCount | DEFAULT | Total number of topics across all brokers in the cluster. |
    | KafkaAppLogsDiskUsed | DEFAULT | The percentage of disk space used for application logs. |
    | KafkaDataLogsDiskUsed | DEFAULT | The percentage of disk space used for data logs. |
    | RootDiskUsed | DEFAULT | The percentage of the root disk used by the broker. |
    | PartitionCount | PER_BROKER | The number of partitions for the broker. |
    | LeaderCount | PER_BROKER | The number of leader replicas. |
    | UnderMinIsrPartitionCount | PER_BROKER | The number of under-minIsr partitions for the broker. |
    | UnderReplicatedPartitions | PER_BROKER | The number of under-replicated partitions for the broker. |
    | FetchConsumerTotalTimeMsMean | PER_BROKER | The mean total time in milliseconds that consumers spend fetching data from the broker. |
    | ProduceTotalTimeMsMean | PER_BROKER | The mean produce time in milliseconds. |
  20. How about monitoring Kinesis Data Streams? Consumer Application GetRecords() Data

    How long does a record stay in a shard?
  21. Metrics to Monitor: Kinesis Data Streams

    | Metric | Description |
    | --- | --- |
    | GetRecords.IteratorAgeMilliseconds | Age of the last record in all GetRecords calls |
    | ReadProvisionedThroughputExceeded | Number of GetRecords calls throttled |
    | WriteProvisionedThroughputExceeded | Number of PutRecord(s) calls throttled |
    | PutRecord.Success, PutRecords.Success | Number of successful PutRecord(s) operations |
    | GetRecords.Success | Number of successful GetRecords operations |
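GetRecords.IteratorAgeMilliseconds is the metric that answers "how long does a record sit in a shard unread". As a sketch, these are the parameters one might pass to CloudWatch via boto3's `put_metric_alarm` to page on consumer lag; the stream name, threshold, and SNS topic ARN are illustrative placeholders.

```python
# With boto3 you would call: cloudwatch.put_metric_alarm(**iterator_age_alarm)
# Stream name, threshold, and SNS topic ARN below are placeholders.
iterator_age_alarm = {
    "AlarmName": "my-stream-consumer-lag",
    "Namespace": "AWS/Kinesis",
    "MetricName": "GetRecords.IteratorAgeMilliseconds",
    "Dimensions": [{"Name": "StreamName", "Value": "my-stream"}],
    "Statistic": "Maximum",
    "Period": 60,                 # evaluate once per minute
    "EvaluationPeriods": 3,       # three breaching minutes in a row
    "Threshold": 60_000.0,        # oldest unread record older than 1 minute
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:oncall"],
}
```

A sustained rise in iterator age means consumers are falling behind producers; once it approaches the stream's retention period, unread data is at risk of being lost.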
  22. Choosing Good Metrics Too much information can be just as

    useless as too little
  23. How to Ingest Streaming Data? Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink
  24. Stream Ingestion • AWS SDKs • Publish directly from application

    code via APIs • AWS Mobile SDK • Kinesis Agent • Monitors log files and forwards lines as messages to Kinesis Data Streams • Kinesis Producer Library (KPL) • Background process aggregates and batches messages • 3rd party and open source • Kafka Connect (kinesis-kafka-connector) • fluentd (aws-fluent-plugin-kinesis) • Log4J Appender (kinesis-log4j-appender) • and more … Data Source Stream Storage Stream Process Stream Ingestion Data Sink Amazon Kinesis Data Streams
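The KPL's background aggregation and batching can be approximated in plain Python. This sketch groups records into PutRecords-sized batches under the API's documented limits (500 records and 5 MB per call); the helper name is ours, not part of any SDK.

```python
MAX_RECORDS_PER_CALL = 500            # PutRecords API limit
MAX_BYTES_PER_CALL = 5 * 1024 * 1024  # 5 MB per PutRecords call

def batch_records(records):
    """Group (partition_key, data_bytes) pairs into PutRecords-sized
    batches, roughly the way the KPL/agent batch in the background."""
    batches, current, current_bytes = [], [], 0
    for key, data in records:
        size = len(data) + len(key.encode("utf-8"))
        if current and (len(current) == MAX_RECORDS_PER_CALL
                        or current_bytes + size > MAX_BYTES_PER_CALL):
            batches.append(current)
            current, current_bytes = [], 0
        current.append({"PartitionKey": key, "Data": data})
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch could then be sent with a single `kinesis.put_records(StreamName=..., Records=batch)` call, amortizing per-request overhead.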
  25. How to Process Streaming Data? Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink
  26. Stream Delivery Data Source Stream Storage Stream Process Stream Ingestion

    Data Sink. Kinesis Data Firehose sources: • Kinesis Agent • CloudWatch Logs • CloudWatch Events • AWS IoT • Direct PUT using APIs • Kinesis Data Streams • MSK (Kafka) using Kafka Connect. Destinations: S3, Elasticsearch, Redshift, Kinesis Data Analytics.
  27. Kinesis Firehose: Filter, Enrich, Convert Data Source apache log apache

    log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } json Lambda function Kinesis Data Firehose
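The Lambda function in this diagram follows Firehose's data-transformation contract: each input record carries a base64-encoded payload and must come back with a `recordId`, a `result` (`Ok`, `Dropped`, or `ProcessingFailed`), and, when kept, new base64 `data`. A minimal sketch that parses the Apache-style log lines above and drops non-errors; the geo-ip enrichment (city/state) shown on the slide is omitted here.

```python
import base64
import json
import re

LOG_PATTERN = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<source>[^\]]+)\]"
)

def handler(event, context):
    """Firehose data-transformation Lambda: parse Apache-style log
    lines into JSON, keep errors, drop everything else."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        m = LOG_PATTERN.search(line)
        if m and m.group("status") == "error":
            doc = json.dumps(m.groupdict())
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(doc.encode("utf-8")).decode("utf-8"),
            })
        else:
            output.append({"recordId": record["recordId"], "result": "Dropped"})
    return {"records": output}
```

Firehose buffers the transformed records and writes only the `Ok` ones to the destination; `Dropped` records are acknowledged and discarded.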
  28. Pre-built Data Transformation Blueprints

    | Blueprint | Description |
    | --- | --- |
    | General Processing | For custom transformation logic |
    | Apache Log to JSON | Parses and converts Apache log lines to JSON objects using predefined JSON field names |
    | Apache Log to CSV | Parses and converts Apache log lines to CSV format |
    | Syslog to JSON | Parses and converts Syslog lines to JSON objects using predefined JSON field names |
    | Syslog to CSV | Parses and converts Syslog lines to CSV format |
  29. Pre-built Data Conversion Data Source Kinesis Data Firehose JSON Data

    schema AWS Glue Data Catalog Amazon S3 • Convert the format of your input data from JSON to a columnar data format, Apache Parquet or Apache ORC, before storing the data in Amazon S3 • Works in conjunction with the transform feature to convert other formats to JSON before the data conversion (records that fail conversion are delivered to a /failed prefix)
  30. Failure and Error Handling • S3 Destination • Pause and

    retry for up to 24 hours (maximum data retention period) • If data delivery fails for more than 24 hours, your data is lost. • Redshift Destination • Configurable retry duration (0-2 hours) • After retry, skip and load error manifest files to S3’s errors/ folder • Elasticsearch Destination • Configurable retry duration (0-2 hours) • After retry, skip and load failed records to S3’s elasticsearch_failed/ folder
  31. Stream Process • Transform • Filter, Enrich, Convert • Aggregation

    • Windowed Queries • Top-K Contributors • Join • Stream-Stream Join • Stream-(External) Table Join Data Source Stream Storage Stream Process Stream Ingestion Data Sink AWS Lambda Amazon Kinesis Data Analytics AWS Glue Amazon EMR
  32. Dive into Stream Process Services

  33. AWS Lambda • Serverless functions • Event-based, stateless processing •

    Continuous and simple scaling mechanism event (3) event (2) event (1) Lambda (1) Lambda (2) Lambda (3)
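When Lambda is the stream processor, an event source mapping invokes the function with a batch of records whose payloads arrive base64-encoded under `Records[*].kinesis.data`. A minimal stateless handler in that shape; the `SECTOR` field is borrowed from the demo payloads later in the deck.

```python
import base64
import json

def handler(event, context):
    """Lambda triggered by a Kinesis event source mapping: count
    records per sector in the invoked batch (stateless per batch)."""
    totals = {}
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        sector = payload.get("SECTOR", "UNKNOWN")
        totals[sector] = totals.get(sector, 0) + 1
    print(json.dumps(totals))  # visible in CloudWatch Logs
    return totals
```

Because each invocation sees only its own batch, any cross-batch state (running totals, windows) has to live in an external store such as DynamoDB.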
  34. Amazon Kinesis Data Analytics AWS Glue Amazon EMR Serverless Serverless

    Fully Managed
  35. Architecture: Master-Worker Master Worker (1) Worker (2) Worker (3) part-01

    part-02 part-03 part-01 part-02 part-03
  36. Master Workers Architecture

  37. Architecture Workers Master

  38. Streaming Programming Guide

  39. Treat Streams as Unbounded Tables

  40. “It's raining cats and dogs!” ["It's", "raining", "cats", "and", "dogs!"]

    [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1
  41. None
  42. “It's raining cats and dogs!” ["It's", "raining", "cats", "and", "dogs!"]

    [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1
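The pipeline on this slide (tokenize, pair each word with 1, reduce by key) can be sketched in plain Python; a Spark job would express the same three steps as flatMap, map, and reduceByKey over an unbounded table.

```python
def word_count(lines):
    """The word-count flow from the slides as plain Python:
    split into words, pair each with 1, then reduce by key."""
    tokens = [w for line in lines for w in line.split()]  # flatMap
    pairs = [(w, 1) for w in tokens]                      # map
    counts = {}
    for word, one in pairs:                               # reduceByKey
        counts[word] = counts.get(word, 0) + one
    return counts

counts = word_count(["It's raining cats and dogs!"])
# → {"It's": 1, "raining": 1, "cats": 1, "and": 1, "dogs!": 1}
```

In the streaming case the input list is unbounded, which is why the engine, not the program, decides when to emit the running counts.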
  43. None
  44. Setup session Read stream Start running Apply Streaming ETL

  45. What about (Stream) SQL? Data Source Stream Storage Stream SQL

    Process Stream Ingestion Data Sink [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] “It's raining cats and dogs!” It’s 1 raining 1 cats 1 and 1 dogs! 1
  46. Kinesis Data Analytics (SQL) • STREAM (in-application): a continuously updated

    entity that you can SELECT from and INSERT into like a TABLE • PUMP: an entity used to continuously 'SELECT ... FROM' a source STREAM, and INSERT SQL results into an output STREAM • Create output stream, which can be used to send to a destination SOURCE STREAM INSERT & SELECT (PUMP) DESTIN. STREAM Destination Source [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
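As a sketch of the STREAM-plus-PUMP pattern in Kinesis Data Analytics SQL: the stream and column names below are illustrative, `SOURCE_SQL_STREAM_001` is the default name KDA gives the first in-application input stream, and `STEP` defines a tumbling window.

```sql
-- In-application output stream (like a TABLE you can INSERT into)
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    word VARCHAR(64),
    cnt  INTEGER
);

-- A PUMP continuously SELECTs from the source stream and
-- INSERTs the results into the destination stream
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "word", COUNT(*) AS cnt
    FROM "SOURCE_SQL_STREAM_001"
    GROUP BY "word",
             STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
```

The destination stream can then be mapped to an application output, which is how results reach a Kinesis stream or Firehose delivery stream.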
  47. Kinesis Data Analytics SQL vs Java

  48. DEMO

  49. https://aws.amazon.com/ko/blogs/aws/new-amazon-kinesis-data-analytics-for-java/ Amazon Kinesis Data Streams Amazon Kinesis Data Firehose Amazon

    S3 Amazon Kinesis Data Analytics (Java) Amazon Kinesis Data Streams Amazon Kinesis Data Streams Amazon Kinesis Data Analytics (SQL) DEMO: Word Count “It's raining cats and dogs!” [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] It’s 1 raining 1 cats 1 and 1 dogs! 1 [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)] 1 2
  50. Filter, Enrich, Convert Streaming Data Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink
  51. Revisit Example: Filter, Enrich, Convert Data Source Kinesis Data Firehose

    apache log apache log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } json Lambda function
  52. Stream Process: Filter, Enrich, Convert Data Source apache log apache

    log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } Amazon Kinesis Data Streams Lambda function Amazon EMR AWS Glue Amazon Kinesis Data Analytics
  53. Stream Process: Filter, Enrich, Convert Data Source apache log apache

    log json Data Sink [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178] [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1] { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } geo-ip { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }, }, { "recordId": "2", "result": "Dropped" } Amazon EMR AWS Glue Amazon MSK Amazon Kinesis Data Analytics (Java)
  54. Kinesis Data Analytics (SQL): Preprocessing Data https://aws.amazon.com/ko/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/

  55. Integration of Stream Process and Stream Storage (O = supported, X = not supported)

    | Stream Storage | Lambda | Kinesis Data Analytics (SQL) | Kinesis Data Analytics (Flink) | Glue | EMR |
    | --- | --- | --- | --- | --- | --- |
    | Kinesis Data Firehose | O | O | X | X | X |
    | Kinesis Data Streams | O | O | O | O | O |
    | Managed Streaming for Kafka (MSK) | O | X | O | O | O |

    Data Source Stream Storage Stream Process Stream Ingestion Data Sink
  56. Aggregate Streaming Data Data Source Stream Storage Stream Process Stream

    Ingestion Data Sink
  57. Stream Process: Aggregation • Aggregations (count, sum, min,...) take granular

    real time data and turn it into insights • Data is continuously processed so you need to tell the application when you want results • Windowed Queries a. Sliding Windows (with Overlap) b. Tumbling Windows (No Overlap) c. Custom Windows
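A tumbling window can be sketched in a few lines: each event's timestamp is truncated to the start of its window, so windows never overlap and every event belongs to exactly one of them. This is illustrative plain Python, not any particular engine's API.

```python
from collections import defaultdict

def tumbling_counts(events, window_seconds):
    """Tumbling (non-overlapping) window aggregation: each event is
    assigned to the window identified by its start timestamp."""
    windows = defaultdict(int)
    for ts, _payload in events:
        window_start = ts - (ts % window_seconds)
        windows[window_start] += 1
    return dict(windows)

# Events at t=0s, 5s, 65s with a 60-second tumbling window
counts = tumbling_counts([(0, "a"), (5, "b"), (65, "c")], 60)
# → {0: 2, 60: 1}
```

A sliding window would instead assign each event to every window that overlaps its timestamp, which is why sliding results overlap while tumbling results partition the stream.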
  58. Join Streaming Data Data Source Stream Storage Stream Process Stream

    Ingestion Data Sink
  59. Imagine It! How to build?

  60. Stream Process: Join Data Source Stream Storage Data Source Stream

    Storage Stream Process Data Source Stream Storage Data Source Stream Process (a) Stream-Stream Join (b) Stream-Join by Partition Key (c) Stream-Join by Hash Table Data Source Stream Storage Stream Process Key-Value Storage
  61. Why is Stream-Stream Join so difficult? Data Source Stream Storage

    Data Source Stream Storage Stream Process Data Sink. Matching events from the two streams must arrive within the same time window (∆t over t0, t1, t2, …, tN) to be joined. • Timing • Skewed data
  62. How about Stream-Join by Partition Key? Data Source Stream Storage

    Data Source Stream Storage Stream Process Data Source Stream Storage Data Source Stream Process t1 t2 t3 t5 t1 t2 t3 t5 t1 t1 t2 t3 Each shard will be filled with records coming from fast data producers shard-1 shard-2 shard-3
  63. Lastly, how about Stream-Join by Hash Table? Data Source Stream

    Storage Stream Process Key-Value Storage Data Source Stream Storage Data Source Stream Storage Stream Process
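Option (c), the hash-table join, can be sketched in plain Python: index one stream by key, then join events whose timestamps fall within ∆t of each other, which is exactly the timing constraint slide 61 highlights. This toy version is illustrative; a real engine must also bound and expire the keyed state, since both streams are unbounded.

```python
from collections import defaultdict

def windowed_join(left, right, delta_t):
    """Join two streams of (ts, key, value) events when matching
    keys arrive within delta_t of each other."""
    by_key = defaultdict(list)          # hash table over the right stream
    for ts, key, value in right:
        by_key[key].append((ts, value))
    joined = []
    for ts, key, value in left:
        for rts, rvalue in by_key[key]:
            if abs(ts - rts) <= delta_t:
                joined.append((key, value, rvalue))
    return joined

pairs = windowed_join(
    left=[(10, "AMZN", 101.0), (200, "AMZN", 99.0)],
    right=[(12, "AMZN", "buy")],
    delta_t=30,
)
# Only the t=10/t=12 pair joins; t=200 is outside the 30-second window.
```

In production this keyed state typically lives in an external key-value store (as in the slide's diagram) or in the engine's managed state, with a TTL playing the role of ∆t.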
  64. DEMO

  65. {"TICKER_SYMBOL": "CVB", "SECTOR": "TECHNOLOGY", "CHANGE": 0.81, "PRICE": 53.63} {"TICKER_SYMBOL": "ABC",

    "SECTOR": "RETAIL", "CHANGE": -1.14, "PRICE": 23.64} {"TICKER_SYMBOL": "JKL", "SECTOR": "TECHNOLOGY", "CHANGE": 0.22, "PRICE": 15.32} Amazon Kinesis Data Streams Amazon Kinesis Data Analytics (SQL) DEMO: Filter, Aggregate, Join Continuous filter Aggregate function Data enrichment (join) Bucket with objects Ticker,Company AMZN,Amazon ASD,SomeCompanyA BAC,SomeCompanyB CRM,SomeCompanyC https://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-add-reference-data.html
  66. Comparing Stream Process Services

  67. DevOps! Master-Worker Framework Master Worker (1) Worker (2) Worker (3)

    part-01 part-02 part-03 part-01 part-02 part-03. Is the master alive? Do workers have enough resources (CPU, memory, disk)? Checkpointing? Right instance type: C-family or R-family? Learning curve: SQL, Python, Scala, or Java?
  68. EMR vs Glue vs Kinesis Data Analytics Operational Excellence Kinesis

    Data Analytics (SQL) EMR Glue Kinesis Data Analytics (Java) Degree of Freedom ≈ Complexity
  69. AWS Glue Comparing stream processing services AWS Lambda Amazon Kinesis

    Data Analytics Amazon EMR Simple programming interface and scaling • Serverless functions • Six languages (Java, Python, Golang, Node.js, Ruby, C#) • Event-based, stateless processing • Continuous and simple scaling mechanism Easy and powerful stream processing Simple, flexible, and cost-effective ETL & Data Catalog Flexibility and choice for your needs • Serverless applications • Supports SQL and Java (Apache Flink) • Stateful processing with automatic backups • Stream operators make building app easy • Serverless applications • Can use the transforms native to Apache Spark Structured Streaming • Automatically discover new data, extracts schema definitions • Automatically generates the ETL code • Choose your instances • Use your favorite open-source framework • Fine-grained control over cluster, debugging tools, and more • Deep open-source tool integrations with AWS
  70. Case Studies

  71. Example Usage Pattern 1: Web Analytics and Leaderboards Amazon DynamoDB

    Amazon Kinesis Data Analytics Amazon Kinesis Data Streams Amazon Cognito Lightweight JS client code Web server on Amazon EC2 OR Compute top 10 users Ingest web app data Persist to feed live apps Lambda function https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
  72. Example Usage Pattern 2: Monitoring IoT Devices Ingest sensor

    data Convert json to parquet Store all data points in an S3 data lake https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/
  73. Example Usage Pattern 3: Analyzing AWS CloudTrail Event Logs AWS

    CloudTrail CloudWatch Events trigger Kinesis Data Analytics Lambda function S3 bucket for raw data DynamoDB table Chart.JS dashboard Compute operational metrics Ingest raw log data Deliver to real time dashboards and archival Kinesis Data Firehose https://aws.amazon.com/solutions/implementations/real-time-insights-account-activity/
  74. Takeaways

  75. From Batch to Real-time: Lambda Architecture Data Source Stream Storage

    Speed Layer Batch Layer Batch Process Batch View Real- time View Consumer Query & Merge Results Service Layer Stream Ingestion Raw Data Storage Streaming Data Stream Delivery Stream Process
  76. Key Components of Real-time Analytics Data Source Stream Storage Stream

    Process Stream Ingestion Data Sink AWS Lambda Kinesis Data Analytics Glue EMR Kinesis Data Firehose Kinesis Data Streams Managed Streaming for Kafka Real-Time Applications - Aggregation - Top-K Contributor - Anomaly Detection Streaming ETL - Filter, Enrich, Convert - Join Kafka Connect KPL Kinesis Agent AWS SDKs
  77. Key Takeaways • Build decoupled systems • Data → Store

    → Process → Store → Analyze → Answers • Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink • Follow the principle of “extract data once and reuse multiple times” to power new customer experiences • Use the right tool for the job • Know the AWS services’ soft and hard limits • Leverage managed and serverless services (DevOps!) • Scalable/elastic, available, reliable, secure, no/low admin
  78. Where To Go Next? • AWS Analytics Immersion Day -

    Build BI System from Scratch • Workshop - https://serverless-bi-system-from-scratch.workshop.aws/ • Slides - https://tinyurl.com/serverless-bi-on-aws • Video - https://youtu.be/FX5iWHFn1v0 • Writing SQL on Streaming Data with Amazon Kinesis Analytics – Part 1, 2 • Part1 - https://tinyurl.com/y8vo8q7o • Part2 - https://tinyurl.com/ycbv7wel • Streaming Analytics Workshop – Kinesis Data Analytics for Java (Flink) https://streaming-analytics.labgui.de/ • Amazon MSK Labs https://amazonmsk-labs.workshop.aws/ • Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming https://tinyurl.com/y7hklyff • AWS Glue Streaming ETL - Scala Script Example https://tinyurl.com/y79x6jda