
Data Stream Processing with AWS Kinesis


Alexey Novakov

May 03, 2021
Transcript

  1. AWS: Data Streams
    processing with
    Kinesis
    Alexey Novakov
    Solution Architect


  2. Agenda
    - Kinesis Services
    - Stream (1)
    - Analytics Application (2)
    - Firehose (3)
    - Summary


  3. Kinesis Services
    [Diagram of the Kinesis services]
    - Data Stream (acts as buffer): written to by producers, read by consumers
    - Analytics App: reads from a Data Stream, writes results back
    - Firehose (acts as buffer): reads from a Data Stream and writes to S3;
      - can deliver to a destination, unlike Stream
      - can transform data as well
    - Video Stream: read by consumers


  4. Data Streams


  5. Data Streams
    - collect gigabytes of data per second
    - make it available for processing and analysing in real time
    - serverless
    - SDKs for Java, Scala, Python, Go, Rust, etc.
    - data retention 1-365 days
    - AWS Glue Schema Registry
    - up to 1 MB payload (Array[Byte])

    Comparison with Kafka:
    Concept        | Kinesis | Kafka
    Message holder | stream  | topic
    Throughput     | shard   | partition
    Server         | N/A     | broker


  6. Shards
    def putEntry(key: String, data: String) =
      PutRecordsEntry(
        partitionKey = key,
        data = data.getBytes("UTF-8"),
        explicitHashKey = None
      )
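The partitionKey above determines which shard a record lands on: Kinesis takes the MD5 hash of the key and matches the resulting 128-bit integer against each shard's hash-key range (explicitHashKey can override this). A minimal sketch of that routing for evenly split shards — `ShardRouter` and its method names are my own, not an AWS API:

```scala
import java.security.MessageDigest

object ShardRouter {
  // The 128-bit hash-key space that Kinesis divides among shards.
  val HashSpace: BigInt = BigInt(2).pow(128)

  // MD5(key) interpreted as an unsigned 128-bit integer.
  def hashKey(partitionKey: String): BigInt = {
    val md5 = MessageDigest.getInstance("MD5").digest(partitionKey.getBytes("UTF-8"))
    BigInt(1, md5) // signum 1 => always non-negative
  }

  /** For `shardCount` evenly split shards, the shard index owning the key. */
  def shardFor(partitionKey: String, shardCount: Int): Int =
    (hashKey(partitionKey) * shardCount / HashSpace).toInt
}
```

Records with the same partition key always route to the same shard, which is why a skewed key distribution can create hot shards.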


  7. Shard Capacity
    - Read throughput: shared or shard-dedicated
    https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html


  8. Data Stream Cost
    Property             | Spec
    Records / second     | 100
    Avg. record size, KB | 100
    Consumer count       | 10

    Total monthly cost*:
    Frankfurt, EU: 662.26 USD | Ohio, US: 551.27 USD
    *as of April 2021

    Number of shards = max(9.77 shards needed for ingress, 48.85 shards needed
    for egress, 0.100 shards needed for records) = 48.85
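The shard-count line above can be reproduced from the published per-shard limits: 1 MB/s or 1,000 records/s of ingress per shard, and 2 MB/s of shared egress per shard (so total egress scales with the number of consumers reading the full stream). A sketch of the calculation, with names of my own:

```scala
object ShardEstimate {
  /** Max of the ingress, egress and record-count limits, as in the AWS calculator. */
  def shardsNeeded(recordsPerSec: Int, avgRecordKB: Int, consumers: Int): Double = {
    val ingressMBs = recordsPerSec * avgRecordKB / 1024.0 // MB written per second
    val forIngress = ingressMBs / 1.0                     // 1 MB/s write limit per shard
    val forEgress  = ingressMBs * consumers / 2.0         // 2 MB/s shared read limit per shard
    val forRecords = recordsPerSec / 1000.0               // 1000 records/s limit per shard
    List(forIngress, forEgress, forRecords).max
  }
}
```

With the spec above (100 records/s, 100 KB, 10 consumers), egress dominates and yields the ~48.85 shards from the slide; with a single consumer, ingress would dominate instead.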


  9. Analytics Applications


  10. Analytics Applications
    Some use-cases:
    - Generate time-series analytics
    - Feed real-time dashboards
    - Create real-time metrics

    Option 1: ANSI 2008 SQL standard with extensions
    Option 2: Scala/Java Flink application (jar on S3)

    Reminiscent of the Kafka tools:
    - Kafka Streams lib
    - KSQL
    - any client Kafka app

    SELECT STREAM "number", AVG("temperature") AS avg_temperature
    FROM "sensor-temperature_001"
    -- Uses a 10-second tumbling time window
    GROUP BY "number",
      FLOOR(("sensor-temperature_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);
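The FLOOR(... SECOND / 10 TO SECOND) expression above is what turns a continuous ROWTIME into 10-second tumbling buckets: every row whose time falls in the same 10-second interval gets the same group key. The same bucketing in plain Scala (a sketch of the idea, not the Kinesis SQL runtime):

```scala
object Tumbling {
  // Rows sharing a bucket start are aggregated together, just like
  // rows sharing the GROUP BY FLOOR(...) key in the SQL above.
  def windowStart(epochSeconds: Long, windowSeconds: Long = 10): Long =
    epochSeconds / windowSeconds * windowSeconds
}
```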


  11. Flink Option: consumer
    val input = createConsumer(env, consumerProps)
    input
      .flatMap { json =>
        Option(json)
          .filter(_.trim.nonEmpty)
          .map(j => Json.readValue(j, classOf[Event]))
      }
      .keyBy(_.sensor.number) // logically partition the stream per sensor id
      .timeWindow(Time.seconds(10), Time.seconds(5)) // sliding window definition
      .apply(new TemperatureAverager)
      .name("TemperatureAverager")
      .map(Json.writeAsString(_))
      .addSink(createProducer(producerProps))
      .name("Kinesis Stream")
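timeWindow(Time.seconds(10), Time.seconds(5)) is a sliding window, so each event lands in size/slide = 2 overlapping windows. A sketch of the assignment rule Flink applies for offset-0 sliding windows (window starts in epoch millis; the object and function names are mine):

```scala
object SlidingWindows {
  /** All window start timestamps whose [start, start + size) range contains ts. */
  def assign(ts: Long, sizeMs: Long = 10000, slideMs: Long = 5000): Seq[Long] = {
    val lastStart = ts - (ts % slideMs) // latest window that has already started
    // Walk back one slide at a time while the window still covers ts.
    (lastStart until (ts - sizeMs) by -slideMs).reverse
  }
}
```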


  12. Flink Option: TemperatureAverager
    /** apply() is invoked once for each window */
    override def apply(
        sensorId: Int,
        window: TimeWindow,
        events: Iterable[Event],
        out: Collector[Event]
    ): Unit = {
      val (count, sum) = events.foldLeft((0, 0.0)) {
        case ((count, temperature), e) =>
          (count + 1, temperature + e.temperature)
      }
      val avgTemp = if (count == 0) 0 else sum / count
      // emit an Event with the average temperature
      out.collect(Event(window.getEnd, avgTemp, events.head.sensor))
    }
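In isolation, the fold above is just a single-pass count-and-sum. A self-contained check of that logic, with the Event and Sensor shapes assumed from the slides:

```scala
case class Sensor(number: Int)
case class Event(timestamp: Long, temperature: Double, sensor: Sensor)

object Averaging {
  // Same fold as in apply(): accumulate (count, sum) in one pass, then divide.
  def average(events: Iterable[Event]): Double = {
    val (count, sum) = events.foldLeft((0, 0.0)) {
      case ((c, t), e) => (c + 1, t + e.temperature)
    }
    if (count == 0) 0.0 else sum / count
  }
}
```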


  13. Analytics Application Cost
    Unit conversions (SQL):
    SQL KPUs: 5 per day x (730 hours in a month / 24 hours in a day) = 152.08 per month

    Pricing calculation:
    10 applications x 152.08 KPUs x 0.127 USD = 193.14 USD per month

    Kinesis Data Analytics for SQL applications cost (monthly): 193.14 USD*
    *as of April 2021


  14. Firehose


  15. Firehose: data flow
    Sources:
    1. Data Streams
    2. Direct PUT

    Processing features:
    1. Convert data to Parquet/ORC
    2. Transform data with AWS Lambda

    Destinations:
    1. S3
    2. Redshift
    3. Elasticsearch
    4. Splunk
    5. HTTP endpoint


  16. Firehose Data Delivery
    Frequency:
    - depends on destination: S3, Redshift, etc.
    - Firehose buffers data, so your flow is not true real-time streaming
    - S3:
      - buffer size: 1-128 MB
      - buffer interval: 60-900 seconds
    - …
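The buffer size and buffer interval act as an either/or trigger: Firehose flushes to the destination as soon as one of the two thresholds is reached. A sketch of that rule (the 5 MB / 300 s defaults here match the S3 destination defaults; the object name is mine):

```scala
object BufferPolicy {
  /** Flush when either the size or the interval threshold is hit, whichever first. */
  def shouldFlush(bufferedMB: Double, secondsSinceFlush: Long,
                  sizeMB: Int = 5, intervalSec: Int = 300): Boolean =
    bufferedMB >= sizeMB || secondsSinceFlush >= intervalSec
}
```

This is why a low-volume stream still sees regular deliveries: the interval fires even when the size threshold is never reached.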


  17. Monitoring


  18. Kinesis Monitoring
    Streams, Analytics Applications, Firehose:
    - CloudWatch Logs, Metrics
    - Custom Metrics
    - CloudTrail


  19. Kinesis Use Cases


  20. [Image slide]


  21. Common example:
    [Image slide]


  22. Thank you! Questions?
    Twitter: @alexey_novakov
    Blog: https://novakov-alexey.github.io/

    Example project to create a stream, an analytics app, a firehose, and run a producer:
    https://github.com/novakov-alexey/kinesis-ingest
