
Data Stream Processing with AWS Kinesis



Alexey Novakov

May 03, 2021

Transcript

  1. AWS: Data Streams processing with Kinesis Alexey Novakov Solution Architect

  2. Agenda 2 - Kinesis Services - Stream (1) - Analytics Application (2) - Firehose (3) - Summary
  3. Kinesis Services 3 - Data Stream (acts as buffer): apps write to and read from it - Analytics App: reads from streams - Firehose (acts as buffer): can deliver to a destination (e.g. writes to S3), unlike Stream, and can transform data as well - Video Stream
  4. Data Streams

  5. Data Streams 5 - collect gigabytes of data per second - make it available for processing and analysing in real time - serverless - SDKs for Java, Scala, Python, Go, Rust, etc. - data retention 1-365 days - AWS Glue Schema Registry - up to 1 MB payload - Array[Byte] Comparison with Kafka concepts (Kinesis / Kafka): message holder: stream / topic; throughput unit: shard / partition; server: N/A / broker
  6. Shards 6

    def putEntry(key: String, data: String) =
      PutRecordsEntry(
        partitionKey = key,
        data = data.getBytes("UTF-8"),
        explicitHashKey = None
      )
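Kinesis routes each record to a shard by taking the 128-bit MD5 hash of its partition key; each shard owns a contiguous range of that hash space (an explicit hash key, as in the `explicitHashKey` field above, overrides this). A minimal Python sketch of the key-to-hash mapping, using a hypothetical helper name:

```python
import hashlib

def hash_key(partition_key: str) -> int:
    """Kinesis maps a partition key to a shard via the 128-bit MD5
    hash of the key's UTF-8 bytes."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16)

# Records with the same partition key always hash to the same value,
# so they land on the same shard (and keep their relative order).
assert hash_key("sensor-001") == hash_key("sensor-001")
# The result fits in the 128-bit hash key space Kinesis partitions.
assert 0 <= hash_key("sensor-001") < 2**128
```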
  7. Shard Capacity 7 - Read throughput: shared or shard-dedicated https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html

  8. Data Stream Cost 8 Property / Spec: records per second: 100; avg. record size: 100 KB; consumer count: 10. Number of shards = max(9.77 shards needed for ingress, 48.85 shards needed for egress, 0.10 shards needed for records) = 48.85. Total monthly cost*: 662.26 USD (Frankfurt, EU), 551.27 USD (Ohio, US). *as of April 2021
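The shard count follows from the per-shard limits: 1 MB/s ingress, 2 MB/s egress (shared across consumers), and 1,000 records/s. A sketch reproducing the slide's arithmetic, rounding intermediate values to two decimals as the AWS calculator output does:

```python
import math

records_per_s = 100
avg_record_kb = 100
consumers = 10

# Ingress: 1 MB/s per shard
ingress_mb_s = records_per_s * avg_record_kb / 1024
ingress_shards = round(ingress_mb_s, 2)                    # 9.77

# Egress: 2 MB/s per shard, shared by all consumers
egress_shards = round(ingress_shards * consumers / 2, 2)   # 48.85

# Record rate: 1,000 records/s per shard
record_shards = records_per_s / 1000                       # 0.10

# Provisioned shards must cover the largest requirement
shards_needed = math.ceil(max(ingress_shards, egress_shards, record_shards))
```

With these inputs the egress requirement dominates, so the stream needs 49 provisioned shards.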
  9. Analytics Applications

  10. Analytics Applications 10 Some use-cases: - generate time-series analytics - feed real-time dashboards - create real-time metrics Option 1: ANSI 2008 SQL standard with extensions Option 2: Scala/Java Flink application (jar on S3) Comparable Kafka tools: - Kafka Streams lib - KSQL - any client Kafka app SELECT STREAM "number", AVG("temperature") AS avg_temperature FROM "sensor-temperature_001" -- Uses a 10-second tumbling time window GROUP BY "number", FLOOR(("sensor-temperature_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);
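The GROUP BY expression in the SQL above buckets ROWTIME into non-overlapping 10-second tumbling windows by flooring the timestamp to the window width. The same bucketing can be sketched as:

```python
def tumbling_window_start(epoch_seconds: float, width_s: int = 10) -> int:
    """Floor a timestamp to the start of its tumbling window,
    mirroring FLOOR(ROWTIME ... SECOND / 10 TO SECOND) above."""
    return int(epoch_seconds // width_s) * width_s

# All events with timestamps in [1610, 1620) aggregate into one window.
assert tumbling_window_start(1619.9) == 1610
assert tumbling_window_start(1620.0) == 1620
```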
  11. Flink Option: consumer 11

    val input = createConsumer(env, consumerProps)
    input
      .flatMap { json =>
        Option(json)
          .filter(_.trim.nonEmpty)
          .map(j => Json.readValue(j, classOf[Event]))
      }
      .keyBy(_.sensor.number) // logically partition the stream per sensor id
      .timeWindow(Time.seconds(10), Time.seconds(5)) // sliding window definition
      .apply(new TemperatureAverager)
      .name("TemperatureAverager")
      .map(Json.writeAsString(_))
      .addSink(createProducer(producerProps))
      .name("Kinesis Stream")
  12. Flink Option: TemperatureAverager 12

    /** apply() is invoked once for each window */
    override def apply(
        sensorId: Int,
        window: TimeWindow,
        events: Iterable[Event],
        out: Collector[Event]
    ): Unit = {
      // fold over the window's events, accumulating count and temperature sum
      val (count, sum) = events.foldLeft((0, 0.0)) {
        case ((count, temperature), e) => (count + 1, temperature + e.temperature)
      }
      val avgTemp = if (count == 0) 0.0 else sum / count
      // emit an Event with the average temperature
      out.collect(Event(window.getEnd, avgTemp, events.head.sensor))
    }
  13. Analytics Application Cost 13 Unit conversions (SQL): SQL KPUs: 5 per day * (730 hours in a month / 24 hours in a day) = 152.08 per month. Pricing calculation: 10 applications x 152.08 KPUs x 0.127 USD = 193.14 USD per month for SQL applications. Kinesis Data Analytics for SQL applications cost (monthly): 193.14 USD* *as of April 2021
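A sketch reproducing the slide's cost arithmetic as-is (prices and units follow the slide's April 2021 AWS calculator output):

```python
kpus_per_app = 5            # SQL KPUs per application per day (from the slide)
hours_per_month = 730
apps = 10
usd_per_kpu = 0.127         # USD, as of April 2021

# Unit conversion: KPUs per day -> KPUs per month
kpu_months = round(kpus_per_app * hours_per_month / 24, 2)   # 152.08

# Total monthly cost across all SQL applications
monthly_cost = round(apps * kpu_months * usd_per_kpu, 2)     # 193.14 USD
```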
  14. Firehose

  15. Firehose: data flow 15 Sources: 1. Data Streams 2. Direct PUT Processing features: 1. Convert data to Parquet/ORC 2. Transform data with AWS Lambda Destinations: 1. S3 2. Redshift 3. Elasticsearch 4. Splunk 5. HTTP endpoint
  16. Firehose Data Delivery 16 Frequency depends on destination: S3, Redshift, etc. Firehose buffers data, so the flow is not real streaming. For S3: - Buffer size: 1-128 MB - Buffer interval: 60-900 seconds - …. - ….
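Firehose flushes its buffer when either hint is reached first, the size hint or the interval hint, which is why delivery to S3 is batched rather than per-record. A toy model of that behavior (a hypothetical sketch, not the Firehose API):

```python
import time

class BufferingSink:
    """Toy model of Firehose buffering: accumulate records and flush
    when the size hint or the interval hint is hit, whichever first."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer: list[bytes] = []
        self.size = 0
        self.started = time.monotonic()

    def put(self, record: bytes):
        self.buffer.append(record)
        self.size += len(record)
        hit_size = self.size >= self.max_bytes
        hit_time = time.monotonic() - self.started >= self.max_seconds
        if hit_size or hit_time:
            flushed, self.buffer, self.size = self.buffer, [], 0
            self.started = time.monotonic()
            return flushed   # batch delivered to the destination
        return None          # still buffering

sink = BufferingSink(max_bytes=10, max_seconds=60.0)
assert sink.put(b"12345") is None                   # below both hints
assert sink.put(b"67890") == [b"12345", b"67890"]   # size hint hit: flush
```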
  17. Monitoring

  18. Kinesis Monitoring 18 Streams, Analytics Applications, Firehose: - CloudWatch Logs, Metrics - Custom Metrics - CloudTrail
  19. 19 Kinesis Use Cases

  20. 20

  21. 21 Common example:

  22. Thank you! Questions? 22 Twitter: @alexey_novakov Blog: https://novakov-alexey.github.io/ Example project to create: - stream - analytics app - firehose - and run producer https://github.com/novakov-alexey/kinesis-ingest