Slide 1

Slide 1 text

AWS: Data Streams Processing with Kinesis. Alexey Novakov, Solution Architect

Slide 2

Slide 2 text

Agenda:
- Kinesis Services
- Stream (1)
- Analytics Application (2)
- Firehose (3)
- Summary

Slide 3

Slide 3 text

Kinesis Services (architecture diagram):
- Data Stream: acts as a buffer; producers write to it, consumers read from it
- Analytics App: reads from a Data Stream
- Firehose: acts as a buffer, but unlike a Stream it can deliver to a destination (e.g. S3) and can also transform data
- Video Stream: separate service for video ingestion

Slide 4

Slide 4 text

Data Streams

Slide 5

Slide 5 text

Data Streams
- collect gigabytes of data per second
- make it available for processing and analysing in real time
- serverless
- SDKs for Java, Scala, Python, Go, Rust, etc.
- data retention 1-365 days
- AWS Glue Schema Registry integration
- up to 1 MB payload per record
- payload is an Array[Byte]

Comparison with Kafka concepts:

Concept          Kinesis   Kafka
Message holder   stream    topic
Throughput unit  shard     partition
Server           N/A       broker
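The 1 MB payload limit and the byte-array payload can be made concrete with a small producer sketch. This is a Python illustration (the deck's own code is Scala); `encode_record` and `put_event` are hypothetical helper names, and the `put_record` call follows the boto3 Kinesis client API:

```python
import json

MAX_PAYLOAD_BYTES = 1024 * 1024  # Kinesis limit: 1 MB per record


def encode_record(event: dict) -> bytes:
    """Serialize an event to the byte payload Kinesis expects,
    enforcing the 1 MB per-record limit."""
    data = json.dumps(event).encode("utf-8")
    if len(data) > MAX_PAYLOAD_BYTES:
        raise ValueError(f"record too large: {len(data)} bytes")
    return data


def put_event(kinesis_client, stream: str, key: str, event: dict):
    # boto3: client("kinesis").put_record(StreamName=..., Data=..., PartitionKey=...)
    return kinesis_client.put_record(
        StreamName=stream,
        Data=encode_record(event),
        PartitionKey=key,  # hashed by Kinesis to pick the target shard
    )
```

`put_event` needs a real boto3 client and stream to run; `encode_record` works standalone.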

Slide 6

Slide 6 text

Shards

def putEntry(key: String, data: String) =
  PutRecordsEntry(
    partitionKey = key,
    data = data.getBytes("UTF-8"),
    explicitHashKey = None
  )
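The `partitionKey` / `explicitHashKey` pair above determines shard placement: Kinesis takes the MD5 hash of the partition key as a 128-bit integer and routes the record to the shard whose hash-key range contains it (unless an explicit hash key overrides this). A minimal Python sketch of that routing, assuming shards evenly split the 128-bit key space (`shard_for_key` is a hypothetical name, not an AWS API):

```python
import hashlib


def shard_for_key(key: str, num_shards: int) -> int:
    """Return the index of the shard a partition key maps to,
    assuming shards evenly divide the 128-bit MD5 hash space."""
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(h // range_size, num_shards - 1)
```

The same key always lands on the same shard, which is what preserves per-key ordering.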

Slide 7

Slide 7 text

Shard Capacity
- Read throughput: shared (2 MB/s per shard across all consumers) or dedicated per consumer (enhanced fan-out)
https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html

Slide 8

Slide 8 text

Data Stream Cost

Property              Spec
Records / second      100
Avg. record size, KB  100
Consumer count        10

Number of shards = max(9.77 shards needed for ingress, 48.85 shards needed for egress, 0.100 shards needed for records) = 48.85

Total monthly cost*: Frankfurt, EU: 662.26 USD; Ohio, US: 551.27 USD
*as of April 2021
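The shard count on this slide follows from the per-shard limits: 1 MB/s and 1,000 records/s for ingress, 2 MB/s of shared egress. A Python sketch of that arithmetic (`shards_needed` is a hypothetical helper, reproducing the slide's inputs of 100 records/s at 100 KB with 10 consumers):

```python
import math

# Per-shard limits for Kinesis Data Streams:
#   ingress: 1 MB/s and 1,000 records/s
#   egress (shared read throughput): 2 MB/s, consumed by every consumer


def shards_needed(records_per_sec: int, avg_record_kb: float, consumers: int) -> int:
    ingress_mb = records_per_sec * avg_record_kb / 1024  # MB/s written
    by_ingress = ingress_mb / 1.0
    by_egress = ingress_mb * consumers / 2.0  # every consumer reads the full stream
    by_records = records_per_sec / 1000.0
    return math.ceil(max(by_ingress, by_egress, by_records))
```

With the slide's inputs this gives max(9.77, 48.8, 0.1), so egress with 10 consumers dominates, and the stream needs 49 shards.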

Slide 9

Slide 9 text

Analytics Applications

Slide 10

Slide 10 text

Analytics Applications

Some use-cases:
- generate time-series analytics
- feed real-time dashboards
- create real-time metrics

Option 1: ANSI 2008 SQL standard with extensions
Option 2: Scala/Java Flink application (jar on S3)

Comparable Kafka tooling: Kafka Streams library, KSQL, any client Kafka app

SELECT STREAM
  "number",
  AVG("temperature") AS avg_temperature
FROM "sensor-temperature_001"
-- Uses a 10-second tumbling time window
GROUP BY
  "number",
  FLOOR(("sensor-temperature_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);
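The `FLOOR(... / 10)` expression in the SQL assigns each row to a 10-second tumbling window by truncating its timestamp to the window start. The same grouping logic in a small Python sketch (`tumbling_avg` is a hypothetical helper, not part of any AWS API):

```python
from collections import defaultdict


def tumbling_avg(events, window_s=10):
    """Average temperature per (sensor number, 10-second window).

    events: iterable of (number, epoch_seconds, temperature).
    """
    acc = defaultdict(lambda: [0, 0.0])  # (number, window_start) -> [count, sum]
    for number, ts, temp in events:
        window_start = ts - ts % window_s  # same truncation as FLOOR(ts / 10)
        cell = acc[(number, window_start)]
        cell[0] += 1
        cell[1] += temp
    return {key: total / count for key, (count, total) in acc.items()}
```

Events at t=0 and t=5 fall into window 0; an event at t=12 starts window 10, exactly as the SQL's `GROUP BY` buckets rows.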

Slide 11

Slide 11 text

Flink Option: consumer

val input = createConsumer(env, consumerProps)
input
  .flatMap { json =>
    Option(json)
      .filter(_.trim.nonEmpty)
      .map(j => Json.readValue(j, classOf[Event]))
  }
  .keyBy(_.sensor.number) // logically partition the stream per sensor id
  .timeWindow(Time.seconds(10), Time.seconds(5)) // sliding window definition
  .apply(new TemperatureAverager)
  .name("TemperatureAverager")
  .map(Json.writeAsString(_))
  .addSink(createProducer(producerProps))
  .name("Kinesis Stream")

Slide 12

Slide 12 text

Flink Option: TemperatureAverager

/** apply() is invoked once for each window */
override def apply(
    sensorId: Int,
    window: TimeWindow,
    events: Iterable[Event],
    out: Collector[Event]
): Unit = {
  val (count, sum) = events.foldLeft((0, 0.0)) {
    case ((count, temperature), e) => (count + 1, temperature + e.temperature)
  }
  val avgTemp = if (count == 0) 0 else sum / count
  // emit an Event with the average temperature
  out.collect(Event(window.getEnd, avgTemp, events.head.sensor))
}

Slide 13

Slide 13 text

Analytics Application Cost

Unit conversions (SQL):
- KPUs: 5 per day × (730 hours in a month / 24 hours in a day) = 152.08 per month

Pricing calculation:
- 10 applications × 152.08 KPUs × 0.127 USD = 193.14 USD per month for SQL applications

Kinesis Data Analytics for SQL applications cost (monthly): 193.14 USD*
*as of April 2021
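The slide's arithmetic can be replayed in a few lines of Python. `sql_app_monthly_cost` is a hypothetical helper reproducing the slide's April-2021 numbers (5 KPUs per day, 0.127 USD per KPU, intermediate value rounded to 152.08 as on the slide):

```python
def sql_app_monthly_cost(apps: int, kpus_per_day: float,
                         kpu_price_usd: float = 0.127) -> float:
    """Monthly cost of SQL analytics applications, mirroring the slide:
    KPUs/day scaled to a 730-hour month, then priced per application."""
    kpus_per_month = round(kpus_per_day * (730 / 24), 2)  # 152.08 for 5 KPUs/day
    return round(apps * kpus_per_month * kpu_price_usd, 2)
```

For 10 applications at 5 KPUs per day this reproduces the slide's 193.14 USD per month.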

Slide 14

Slide 14 text

Firehose

Slide 15

Slide 15 text

Firehose: data flow

Sources:
1. Data Streams
2. Direct PUT

Processing features:
1. Convert data to Parquet/ORC
2. Transform data with AWS Lambda

Destinations:
1. S3
2. Redshift
3. Elasticsearch
4. Splunk
5. HTTP endpoint

Slide 16

Slide 16 text

Firehose Data Delivery

Frequency depends on the destination (S3, Redshift, etc.). Firehose buffers data, so delivery is near-real-time rather than true streaming.
- S3:
  - buffer size: 1-128 MB
  - buffer interval: 60-900 seconds
- ….
- ….
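Because Firehose flushes its buffer on whichever condition is hit first (size or interval), the worst-case delay for a record arriving into an empty buffer is easy to estimate. A small Python sketch (`flush_delay_s` is a hypothetical helper, assuming a steady ingest rate):

```python
def flush_delay_s(buffer_mb: float, interval_s: float,
                  ingest_mb_per_s: float) -> float:
    """Worst-case seconds before a buffered record is delivered:
    the buffer flushes when it fills OR when the interval elapses,
    whichever comes first."""
    time_to_fill = buffer_mb / ingest_mb_per_s
    return min(time_to_fill, interval_s)
```

At 1 MB/s, a 128 MB buffer with a 900 s interval flushes after 128 s (size wins); with a 60 s interval it flushes after 60 s (interval wins). This is why a flow through Firehose is buffered rather than truly streaming.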

Slide 17

Slide 17 text

Monitoring

Slide 18

Slide 18 text

Kinesis Monitoring

Streams, Analytics Applications, Firehose:
- CloudWatch Logs, Metrics
- Custom Metrics
- CloudTrail

Slide 19

Slide 19 text

Kinesis Use Cases

Slide 20

Slide 20 text


Slide 21

Slide 21 text

Common example:

Slide 22

Slide 22 text

Thank you! Questions?

Twitter: @alexey_novakov
Blog: https://novakov-alexey.github.io/

Example project to create a stream, an analytics app, a firehose, and run a producer:
https://github.com/novakov-alexey/kinesis-ingest