
Data Stream Processing with AWS Kinesis


Alexey Novakov

May 03, 2021
Transcript

  1. AWS: Data Streams
    processing with
    Kinesis
    Alexey Novakov
    Solution Architect


  2. Agenda
    - Kinesis Services
    - Stream (1)
    - Analytics Application (2)
    - Firehose (3)
    - Summary


  3. Kinesis Services
    [Diagram of the Kinesis services]
    - Data Stream (acts as buffer): written to by producers, read by consumers
    - Analytics App: reads from a Data Stream, writes results back
    - Firehose (acts as buffer): reads from a Data Stream and writes to S3;
      - can deliver to a destination, unlike Stream
      - can transform data as well
    - Video Stream: read by consumers


  4. Data Streams


  5. Data Streams
    - collect gigabytes of data per second
    - make it available for processing and analysing in real time
    - serverless
    - SDKs for Java, Scala, Python, Go, Rust, etc.
    - data retention 1-365 days
    - AWS Glue Schema Registry
    - up to 1 MB payload (Array[Byte])

    Comparison with Kafka:
    Concept        | Kinesis | Kafka
    Message holder | stream  | topic
    Throughput     | shard   | partition
    Server         | N/A     | broker


  6. Shards
    def putEntry(key: String, data: String) =
      PutRecordsEntry(
        partitionKey = key,
        data = data.getBytes("UTF-8"),
        explicitHashKey = None
      )
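The partitionKey above determines which shard a record lands on: Kinesis takes the MD5 hash of the key and matches the resulting 128-bit integer against each shard's hash-key range (explicitHashKey can override this). A minimal sketch of that routing for evenly split shards — `ShardRouter` and its method names are my own, not an AWS API:

```scala
import java.security.MessageDigest

object ShardRouter {
  // The 128-bit hash-key space that Kinesis divides among shards.
  val HashSpace: BigInt = BigInt(2).pow(128)

  // MD5(key) interpreted as an unsigned 128-bit integer.
  def hashKey(partitionKey: String): BigInt = {
    val md5 = MessageDigest.getInstance("MD5").digest(partitionKey.getBytes("UTF-8"))
    BigInt(1, md5) // signum 1 => always non-negative
  }

  /** For `shardCount` evenly split shards, the shard index owning the key. */
  def shardFor(partitionKey: String, shardCount: Int): Int =
    (hashKey(partitionKey) * shardCount / HashSpace).toInt
}
```

Records with the same partition key always route to the same shard, which is why a skewed key distribution can create hot shards.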


  7. Shard Capacity
    - Read throughput: shared or shard-dedicated
    https://docs.aws.amazon.com/streams/latest/dev/building-consumers.html


  8. Data Stream Cost
    Property             | Spec
    Records / second     | 100
    Avg. record size, KB | 100
    Consumer count       | 10

    Total monthly cost*:
    Frankfurt, EU: 662.26 USD | Ohio, US: 551.27 USD
    *as of April 2021

    Number of shards = max(9.77 shards needed for ingress, 48.85 shards needed
    for egress, 0.100 shards needed for records) = 48.85
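The shard-count line above can be reproduced from the published per-shard limits: 1 MB/s or 1,000 records/s of ingress per shard, and 2 MB/s of shared egress per shard (so total egress scales with the number of consumers reading the full stream). A sketch of the calculation, with names of my own:

```scala
object ShardEstimate {
  /** Max of the ingress, egress and record-count limits, as in the AWS calculator. */
  def shardsNeeded(recordsPerSec: Int, avgRecordKB: Int, consumers: Int): Double = {
    val ingressMBs = recordsPerSec * avgRecordKB / 1024.0 // MB written per second
    val forIngress = ingressMBs / 1.0                     // 1 MB/s write limit per shard
    val forEgress  = ingressMBs * consumers / 2.0         // 2 MB/s shared read limit per shard
    val forRecords = recordsPerSec / 1000.0               // 1000 records/s limit per shard
    List(forIngress, forEgress, forRecords).max
  }
}
```

With the spec above (100 records/s, 100 KB, 10 consumers), egress dominates and yields the ~48.85 shards from the slide; with a single consumer, ingress would dominate instead.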


  9. Analytics Applications


  10. Analytics Applications
    Some use-cases:
    - Generate time-series analytics
    - Feed real-time dashboards
    - Create real-time metrics

    Option 1: ANSI 2008 SQL standard with extensions
    Option 2: Scala/Java Flink application (jar on S3)

    Reminiscent of the Kafka tools:
    - Kafka Streams lib
    - KSQL
    - any client Kafka app

    SELECT STREAM "number", AVG("temperature") AS avg_temperature
    FROM "sensor-temperature_001"
    -- Uses a 10-second tumbling time window
    GROUP BY "number",
      FLOOR(("sensor-temperature_001".ROWTIME - TIMESTAMP '1970-01-01 00:00:00') SECOND / 10 TO SECOND);
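The FLOOR(... SECOND / 10 TO SECOND) expression above is what turns a continuous ROWTIME into 10-second tumbling buckets: every row whose time falls in the same 10-second interval gets the same group key. The same bucketing in plain Scala (a sketch of the idea, not the Kinesis SQL runtime):

```scala
object Tumbling {
  // Rows sharing a bucket start are aggregated together, just like
  // rows sharing the GROUP BY FLOOR(...) key in the SQL above.
  def windowStart(epochSeconds: Long, windowSeconds: Long = 10): Long =
    epochSeconds / windowSeconds * windowSeconds
}
```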


  11. Flink Option: consumer
    val input = createConsumer(env, consumerProps)
    input
      .flatMap { json =>
        Option(json)
          .filter(_.trim.nonEmpty)
          .map(j => Json.readValue(j, classOf[Event]))
      }
      .keyBy(_.sensor.number) // logically partition the stream per sensor id
      .timeWindow(Time.seconds(10), Time.seconds(5)) // sliding window definition
      .apply(new TemperatureAverager)
      .name("TemperatureAverager")
      .map(Json.writeAsString(_))
      .addSink(createProducer(producerProps))
      .name("Kinesis Stream")
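timeWindow(Time.seconds(10), Time.seconds(5)) is a sliding window, so each event lands in size/slide = 2 overlapping windows. A sketch of the assignment rule Flink applies for offset-0 sliding windows (window starts in epoch millis; the object and function names are mine):

```scala
object SlidingWindows {
  /** All window start timestamps whose [start, start + size) range contains ts. */
  def assign(ts: Long, sizeMs: Long = 10000, slideMs: Long = 5000): Seq[Long] = {
    val lastStart = ts - (ts % slideMs) // latest window that has already started
    // Walk back one slide at a time while the window still covers ts.
    (lastStart until (ts - sizeMs) by -slideMs).reverse
  }
}
```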


  12. Flink Option: TemperatureAverager
    /** apply() is invoked once for each window */
    override def apply(
        sensorId: Int,
        window: TimeWindow,
        events: Iterable[Event],
        out: Collector[Event]
    ): Unit = {
      val (count, sum) = events.foldLeft((0, 0.0)) {
        case ((count, temperature), e) =>
          (count + 1, temperature + e.temperature)
      }
      val avgTemp = if (count == 0) 0 else sum / count
      // emit an Event with the average temperature
      out.collect(Event(window.getEnd, avgTemp, events.head.sensor))
    }
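In isolation, the fold above is just a single-pass count-and-sum. A self-contained check of that logic, with the Event and Sensor shapes assumed from the slides:

```scala
case class Sensor(number: Int)
case class Event(timestamp: Long, temperature: Double, sensor: Sensor)

object Averaging {
  // Same fold as in apply(): accumulate (count, sum) in one pass, then divide.
  def average(events: Iterable[Event]): Double = {
    val (count, sum) = events.foldLeft((0, 0.0)) {
      case ((c, t), e) => (c + 1, t + e.temperature)
    }
    if (count == 0) 0.0 else sum / count
  }
}
```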


  13. Analytics Application Cost
    Unit conversions (SQL):
    SQL KPUs: 5 per day x (730 hours in a month / 24 hours in a day) = 152.08 per month

    Pricing calculation:
    10 applications x 152.08 KPUs x 0.127 USD = 193.14 USD per month

    Kinesis Data Analytics for SQL applications cost (monthly): 193.14 USD*
    *as of April 2021


  14. Firehose


  15. Firehose: data flow
    Sources:
    1. Data Streams
    2. Direct PUT

    Processing features:
    1. Convert data to Parquet/ORC
    2. Transform data with AWS Lambda

    Destinations:
    1. S3
    2. Redshift
    3. Elasticsearch
    4. Splunk
    5. HTTP endpoint


  16. Firehose Data Delivery
    Frequency:
    - depends on destination: S3, Redshift, etc.
    - Firehose buffers data, so your flow is not true real-time streaming
    - S3:
      - buffer size: 1-128 MB
      - buffer interval: 60-900 seconds
    - …
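The buffer size and buffer interval act as an either/or trigger: Firehose flushes to the destination as soon as one of the two thresholds is reached. A sketch of that rule (the 5 MB / 300 s defaults here match the S3 destination defaults; the object name is mine):

```scala
object BufferPolicy {
  /** Flush when either the size or the interval threshold is hit, whichever first. */
  def shouldFlush(bufferedMB: Double, secondsSinceFlush: Long,
                  sizeMB: Int = 5, intervalSec: Int = 300): Boolean =
    bufferedMB >= sizeMB || secondsSinceFlush >= intervalSec
}
```

This is why a low-volume stream still sees regular deliveries: the interval fires even when the size threshold is never reached.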


  17. Monitoring


  18. Kinesis Monitoring
    Streams, Analytics Applications, Firehose:
    - CloudWatch Logs, Metrics
    - Custom Metrics
    - CloudTrail


  19. Kinesis Use Cases


  20. [Image slide]


  21. Common example:
    [Image slide]


  22. Thank you! Questions?
    Twitter: @alexey_novakov
    Blog: https://novakov-alexey.github.io/

    Example project to create a stream, an analytics app, a firehose, and run a producer:
    https://github.com/novakov-alexey/kinesis-ingest
