Slide 1

Slide 1 text

@jbfletch_ Anna McDonald | Twitter: @jbfletch_ (fully committed to the underscore) | GitHub: jbfletch

Slide 2

Slide 2 text

@jbfletch_ Agenda 1. Elementary analytics 2. Kafka Streams for Analytics 3. Real example

Slide 3

Slide 3 text

@jbfletch_ Elementary, my dear attendees… ● Mean ● Median ● Mode ● Skewness ● Kurtosis
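As a warm-up, the first three of these are a few lines of plain Java; skewness and kurtosis come later in the talk via Commons Math. A minimal sketch (class and method names are mine, not from the talk):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ElementaryStats {
    // Arithmetic mean of the observations.
    static double mean(int[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    // Middle value (average of the two middle values for an even count).
    static double median(int[] xs) {
        int[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // Most frequent value (one of them, if tied).
    static int mode(int[] xs) {
        Map<Integer, Long> counts = new HashMap<>();
        for (int x : xs) counts.merge(x, 1L, Long::sum);
        return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}
```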

Slide 4

Slide 4 text

@jbfletch_ Why should I not do advanced Math Stat?

Slide 5

Slide 5 text

@jbfletch_ Why should I not do advanced Math Stat? Java

Slide 6

Slide 6 text

@jbfletch_ Why should I not do advanced Math Stat? Java ● No elegant support for 2D dynamic arrays ● Lack of support for more advanced DoE (Design of Experiments) tests, etc.

Slide 7

Slide 7 text

@jbfletch_ Why should I not do advanced Math Stat? Java ● No elegant support for 2D dynamic arrays ● Lack of support for more advanced DoE (Design of Experiments) tests, etc. Pandas, R!

Slide 8

Slide 8 text

A Case for Kafka Streams @jbfletch_ I want to run my analysis as soon as all my assumptions are met. I don’t want to wait until a window closes if I have everything I need…

Slide 9

Slide 9 text

A Case for Kafka Streams @jbfletch_ INTRODUCING…….

Slide 10

Slide 10 text

A Case for Kafka Streams @jbfletch_ INTRODUCING……. Statistical assumption templates! aka SATs

Slide 11

Slide 11 text

Example Assumption Templates @jbfletch_ ● Number of Observations == 30

Slide 12

Slide 12 text

Example Assumption Templates @jbfletch_ ● Number of Observations == 30 ● Number of Unique States (as in NY) represented == 50

Slide 13

Slide 13 text

Example Assumption Templates @jbfletch_ ● Number of Observations == 30 ● Number of Unique States (as in NY) represented == 50 ● Specific combinations for Design of Experiments

Slide 14

Slide 14 text

Example Assumption Templates @jbfletch_ ● Number of Observations == 30 ● Number of Unique States (as in NY) represented == 50 ● Specific combinations for Design of Experiments ● Study Group completeness

Slide 15

Slide 15 text

Example Assumption Templates @jbfletch_ ● Number of Observations == 30 ● Number of Unique States (as in NY) represented == 50 ● Specific combinations for Design of Experiments ● Study Group completeness ● Number of observations required for a given model prediction confidence
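One way to read the list above: a statistical assumption template is a predicate over the observations gathered so far, and the analysis fires only once the predicate holds. A minimal sketch (the interface and helper names are hypothetical, not from the talk):

```java
import java.util.Collection;
import java.util.function.Predicate;

public class AssumptionTemplates {
    // A SAT is just a predicate over the observations gathered so far.
    interface AssumptionTemplate<T> extends Predicate<Collection<T>> {}

    // "Number of Observations == 30" becomes observationCount(30).
    static <T> AssumptionTemplate<T> observationCount(int n) {
        return obs -> obs.size() == n;
    }

    // "Number of Unique States represented == 50" becomes uniqueValues(50)
    // applied to the extracted state codes.
    static AssumptionTemplate<String> uniqueValues(int n) {
        return obs -> obs.stream().distinct().count() == n;
    }
}
```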

Slide 16

Slide 16 text

Kafka Streams Overview for Analytics @jbfletch_

Slide 17

Slide 17 text

Kafka Streams Dedicated Cluster VS Java Library @jbfletch_

Slide 18

Slide 18 text

Kafka Streams @jbfletch_ Kafka Streams DSL (Domain-Specific Language) vs. Processor API (PAPI)?

Slide 19

Slide 19 text

Time (credit: @MatthiasJSax) Stream Time vs. Wall-Clock Time @jbfletch_

Slide 20

Slide 20 text

Time Wall-Clock Time @jbfletch_ ● The actual time based on system time or your watch ● Not fully available in all DSL Methods ● Used when there is no way to drive time by incoming messages

Slide 21

Slide 21 text

Time @MatthiasJSax Stream Time @jbfletch_

Slide 22

Slide 22 text

Time @MatthiasJSax Stream Time @jbfletch_ ● Per Partition aka Stream Task ● Advances Based on Incoming Messages at a Partition Level

Slide 23

Slide 23 text

@jbfletch_ Stream Time 101 this.testTopology.input() .at(1000).add("1",obs1) Stream Time: 1000

Slide 24

Slide 24 text

@jbfletch_ Stream Time 101 this.testTopology.input() .at(1000).add("1",obs1) .at(1100).add("1",obs2) Stream Time: 1100

Slide 25

Slide 25 text

@jbfletch_ Stream Time 101 this.testTopology.input() .at(1000).add("1",obs1) .at(1100).add("1",obs2) .at(1200).add("3",obs3) Stream Time: 1200

Slide 26

Slide 26 text

@jbfletch_ Stream Time 101 this.testTopology.input() .at(1000).add("1",obs1) .at(1100).add("1",obs2) .at(1200).add("3",obs3) .at(1300).add("4",obs4) Stream Time: 1300

Slide 27

Slide 27 text

@jbfletch_ Stream Time 101
this.testTopology.input()
    .at(1000).add("1",obs1)
    .at(1100).add("1",obs2)
    .at(1200).add("3",obs3)
    .at(1300).add("4",obs4)
    .at(1400).add("1",obs5)
    .at(1500).add("6",obs6)
    .at(1510).add("10",obs10)
    .at(1520).add("11",obs11)
    .at(1530).add("1",obs13)
    .at(1540).add("4",obs14)
    .at(1550).add("1",obs15)
    .at(1560).add("4",obs16)
    .at(12000).add("7",obs7)
    .at(12100).add("1",obs8)
    .at(12200).add("4",obs9);
Stream Time: 12200
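The rule behind these slides: stream time is the maximum record timestamp a task has observed so far, so a late (out-of-order) record never moves it backwards. A tiny plain-Java sketch of that rule (names are mine, not Kafka's API):

```java
public class StreamTimeSketch {
    // Stream time after each record: it only ever advances,
    // because it is the max timestamp seen so far.
    static long[] streamTimes(long[] recordTimestamps) {
        long[] out = new long[recordTimestamps.length];
        long streamTime = Long.MIN_VALUE;
        for (int i = 0; i < recordTimestamps.length; i++) {
            streamTime = Math.max(streamTime, recordTimestamps[i]);
            out[i] = streamTime;
        }
        return out;
    }
}
```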

Slide 28

Slide 28 text

@jbfletch_ Windowing!

Slide 29

Slide 29 text

@jbfletch_ Windowing Steps 1. Group data

Slide 30

Slide 30 text

@jbfletch_ Group Data ● What do I need to reason about? e.g. Store Location, Shoe Type, Feature ID → groupBy

Slide 31

Slide 31 text

@jbfletch_ Windowing Steps 1. Group data 2. Specify the window type

Slide 32

Slide 32 text

@jbfletch_ Tumbling Windows

Slide 33

Slide 33 text

@jbfletch_ Tumbling Windows ● Non-Overlapping

Slide 34

Slide 34 text

@jbfletch_ Tumbling Windows ● Non-Overlapping ● Controlled by a Fixed-Size Window

Slide 35

Slide 35 text

@jbfletch_ Tumbling Windows ● Non-Overlapping ● Controlled by a Fixed-Size Window ● Uniquely Identifiable as: Memory: @[start epoch]/[end epoch], Changelog: @[start epoch]
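Because tumbling windows are fixed-size and non-overlapping, every record lands in exactly one window, whose start is the timestamp rounded down to a multiple of the window size. A sketch of that alignment rule and of the "@[start epoch]/[end epoch]" identifier (plain Java, names illustrative):

```java
public class TumblingWindowSketch {
    // A record at `ts` falls into exactly one window: [start, start + size).
    static long windowStart(long ts, long sizeMs) {
        return ts - ts % sizeMs;
    }

    // The "@[start epoch]/[end epoch]" key the slide describes.
    static String windowKey(long ts, long sizeMs) {
        long start = windowStart(ts, sizeMs);
        return "@" + start + "/" + (start + sizeMs);
    }
}
```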

Slide 36

Slide 36 text

@jbfletch_ Hopping Windows

Slide 37

Slide 37 text

@jbfletch_ Hopping Windows ● Overlapping

Slide 38

Slide 38 text

@jbfletch_ Hopping Windows ● Overlapping ● Fixed Size

Slide 39

Slide 39 text

@jbfletch_ Hopping Windows ● Overlapping ● Fixed Size ● Controlled by Window Size and Advance

Slide 40

Slide 40 text

@jbfletch_ Hopping Windows ● Overlapping ● Fixed Size ● Controlled by Window Size and Advance ● Uniquely Identifiable as: Memory: @[start epoch]/[end epoch], Changelog: @[start epoch]
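Because hopping windows overlap, a single record can belong to several windows: every window whose start is a multiple of the advance and lies within one window-size of the record's timestamp. A plain-Java sketch of that membership rule (names are mine, not Kafka's API):

```java
import java.util.ArrayList;
import java.util.List;

public class HoppingWindowSketch {
    // All epoch-aligned window start times whose window [s, s + size) contains `ts`.
    static List<Long> windowStartsFor(long ts, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        long s = (ts / advanceMs) * advanceMs; // latest start at or before ts
        while (s > ts - sizeMs && s >= 0) {    // window [s, s + size) still contains ts
            starts.add(0, s);                  // keep ascending order
            s -= advanceMs;
        }
        return starts;
    }
}
```

For example, with a 10-second window advancing every 5 seconds, a record at t=12000 ms belongs to the windows starting at 5000 and 10000.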

Slide 41

Slide 41 text

@jbfletch_ Sliding Windows aka Look Back/Forward

Slide 42

Slide 42 text

@jbfletch_ Sliding Windows aka Look Back ● Slides continuously along a timeline

Slide 43

Slide 43 text

@jbfletch_ Sliding Windows aka Look Back ● Slides continuously along a timeline ● Fixed Size

Slide 44

Slide 44 text

@jbfletch_ Sliding Windows aka Look Back ● Slides continuously along a timeline ● Fixed Size ● Controlled by max time difference between two records of the same key

Slide 45

Slide 45 text

@jbfletch_ Sliding Windows aka Look Back ● Slides continuously along a timeline ● Fixed Size ● Controlled by max time difference between two records of the same key ● Uniquely Identifiable as: Memory: @[start epoch]/[end epoch], Changelog: @[start epoch]

Slide 46

Slide 46 text

@jbfletch_ Sliding Windows aka Look Back (Time Difference = 5000)

Key             | Value | Record Time | Stream Time
A               | 1     | 8000        | 8000
A               | 2     | 9200        | 9200
A               | 3     | 12400       | 12400
Phil            | 496   | 13200       | 13200
Angela Lansbury | 96    | 14500      | 14500

We'd have the following 5 windows for key A:
● window [3000;8000] contains [1] (created when the first record enters the window)
● window [4200;9200] contains [1,2] (created when the second record enters the window)
● window [7400;12400] contains [1,2,3] (created when the third record enters the window)
● window [8001;13001] contains [2,3] (created when the first record drops out of the window)
● window [9201;14201] contains [3] (created when the second record drops out of the window)
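The five windows above follow mechanically from two rules: a window ends at each record's timestamp, and another opens just after each record expires. A plain-Java sketch that reproduces the slide's windows for key A (the helper names are mine, not Kafka's API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SlidingWindowSketch {
    /** Sliding windows over one key's (timestamp, value) events with the given
     *  max time difference: one window [ts - diff, ts] per record that enters,
     *  and one window [ts + 1, ts + 1 + diff] per record that drops out. */
    static Map<String, List<Integer>> slidingWindows(long[] ts, int[] vals, long diff) {
        Map<String, List<Integer>> windows = new LinkedHashMap<>();
        for (int i = 0; i < ts.length; i++)               // record enters
            add(windows, ts[i] - diff, ts[i], ts, vals);
        for (int i = 0; i < ts.length - 1; i++)           // record drops out
            add(windows, ts[i] + 1, ts[i] + 1 + diff, ts, vals);
        return windows;
    }

    static void add(Map<String, List<Integer>> w, long start, long end, long[] ts, int[] vals) {
        List<Integer> contents = new ArrayList<>();
        for (int i = 0; i < ts.length; i++)
            if (ts[i] >= start && ts[i] <= end) contents.add(vals[i]);
        if (!contents.isEmpty()) w.put("[" + start + ";" + end + "]", contents);
    }
}
```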

Slide 47

Slide 47 text

@jbfletch_ Session Windows

Slide 48

Slide 48 text

@jbfletch_ Session Windows ● Non-Overlapping with Gaps

Slide 49

Slide 49 text

@jbfletch_ Session Windows ● Non-Overlapping with Gaps ● Unfixed Size!

Slide 50

Slide 50 text

@jbfletch_ Session Windows ● Non-Overlapping with Gaps ● Unfixed Size! ● Controlled by defining an inactivity gap for seeing records with a given key

Slide 51

Slide 51 text

@jbfletch_ Session Windows ● Non-Overlapping with Gaps ● Unfixed Size! ● Controlled by defining an inactivity gap for seeing records with a given key ● Uniquely Identifiable as: Memory: @[start epoch]/[end epoch], Changelog: @[start epoch]/[end epoch]
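Session windows can be sketched as: walk a key's record timestamps in order and start a new session whenever the gap to the previous record exceeds the inactivity gap. A plain-Java illustration of that rule (not the Kafka Streams internals):

```java
import java.util.ArrayList;
import java.util.List;

public class SessionWindowSketch {
    // Split sorted timestamps into [start, end] sessions separated
    // by more than gapMs of inactivity.
    static List<long[]> sessions(long[] sortedTs, long gapMs) {
        List<long[]> out = new ArrayList<>();
        long start = sortedTs[0], end = sortedTs[0];
        for (int i = 1; i < sortedTs.length; i++) {
            if (sortedTs[i] - end > gapMs) {   // inactivity gap exceeded: close the session
                out.add(new long[]{start, end});
                start = sortedTs[i];
            }
            end = sortedTs[i];
        }
        out.add(new long[]{start, end});
        return out;
    }
}
```

With a 5-second gap, records at 1000, 2000, 9000, 9500 ms form two sessions: [1000;2000] and [9000;9500].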

Slide 52

Slide 52 text

@jbfletch_ Windowing Steps 1. Group data 2. Specify the window type 3. Aggregate

Slide 53

Slide 53 text

@jbfletch_ Aggregate ● Simple Aggregates - Count, Summation, etc.

Slide 54

Slide 54 text

@jbfletch_ Aggregate ● Simple Aggregates - Count, Summation, etc. ● Create an array containing all values for further analysis

Slide 55

Slide 55 text

@jbfletch_ Aggregate ● Simple Aggregates - Count, Summation, etc. ● Create an array containing all values for further analysis ● Continuous to Discrete Mappings
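A "continuous to discrete mapping" is just bucketing: collapse a continuous measurement into a category during the aggregate step so downstream tests can treat it as discrete. A hypothetical example (the metric and thresholds are invented for illustration):

```java
public class DiscreteMapping {
    // Map a continuous wait time (ms) onto discrete levels,
    // e.g. for a chi-square test over categories.
    static String waitBucket(double waitMs) {
        if (waitMs < 100) return "fast";
        if (waitMs < 500) return "ok";
        return "slow";
    }
}
```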

Slide 56

Slide 56 text

@jbfletch_ Putting it all Together! Task: Calculate the following descriptive statistics when and only when you have at least 30 observations for each class of sneaker: ● Count ● Min ● Max ● Mean ● Skewness ● Kurtosis

Slide 57

Slide 57 text

@jbfletch_ Putting it all Together! We want fixed-size, non-overlapping windows, so we choose tumbling and define our window size:
TimeWindows tumblingWindow = TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(10));

Slide 58

Slide 58 text

@jbfletch_ Putting it all Together! We define our input KTable from our baseStream and create our groupBy, windowedBy and aggregate:
TimeWindows tumblingWindow = TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(10));
KTable<Windowed<String>, List<Integer>> observationsByClass = baseStream
    .groupBy((k, v) -> v.path("sneakerID").asText())
    .windowedBy(tumblingWindow)
    .aggregate(() -> new ArrayList<>(),
        (key, value, aggregate) -> {
            aggregate.add(value.path("peeps").asInt());
            return aggregate;
        },
        Materialized.<String, List<Integer>, WindowStore<Bytes, byte[]>>as("classes").withValueSerde(listSerde));

Slide 59

Slide 59 text

@jbfletch_ Putting it all Together! We define our input KTable from our baseStream and create our groupBy, windowedBy and aggregate:
TimeWindows tumblingWindow = TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(10));
KTable<Windowed<String>, List<Integer>> observationsByClass = baseStream
    .groupBy((k, v) -> v.path("sneakerID").asText())
    .windowedBy(tumblingWindow)
    .aggregate(() -> new ArrayList<>(),
        (key, value, aggregate) -> {
            aggregate.add(value.path("peeps").asInt());
            return aggregate;
        },
        Materialized.<String, List<Integer>, WindowStore<Bytes, byte[]>>as("classes").withValueSerde(listSerde));

Slide 60

Slide 60 text

@jbfletch_ Putting it all Together! We define our input KTable from our baseStream and create our groupBy, windowedBy and aggregate:
TimeWindows tumblingWindow = TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(10));
KTable<Windowed<String>, List<Integer>> observationsByClass = baseStream
    .groupBy((k, v) -> v.path("sneakerID").asText())
    .windowedBy(tumblingWindow)
    .aggregate(() -> new ArrayList<>(),
        (key, value, aggregate) -> {
            aggregate.add(value.path("peeps").asInt());
            return aggregate;
        },
        Materialized.<String, List<Integer>, WindowStore<Bytes, byte[]>>as("classes").withValueSerde(listSerde));

Slide 61

Slide 61 text

@jbfletch_ Putting it all Together! Time to apply our statistical assumption template (SAT):
observationsByClass
    .toStream()
    .filter((k, v) -> v.size() == 30)
    .map((stringWindowed, integers) -> {
        DescriptiveStatistics stats = new DescriptiveStatistics();
        double[] array = integers.stream().mapToDouble(Integer::doubleValue).toArray();
        Arrays.stream(array).forEach(obs -> stats.addValue(obs));
        System.out.println("Number of Values: " + String.valueOf(stats.getN()));
        System.out.println("Min Value: " + String.valueOf(stats.getMin()));
        System.out.println("Max Value: " + String.valueOf(stats.getMax()));
        System.out.println("Mean: " + String.valueOf(stats.getMean()));
        System.out.println("Skewness: " + String.valueOf(stats.getSkewness()));
        System.out.println("Kurtosis: " + String.valueOf(stats.getKurtosis()));
        return KeyValue.pair(stringWindowed, integers);
    })
    .print(Printed.<Windowed<String>, List<Integer>>toSysOut().withLabel("Full Key and Values"));

Slide 62

Slide 62 text

@jbfletch_ Putting it all Together! Create and add the observations in our window to a new stats object:
observationsByClass
    .toStream()
    .filter((k, v) -> v.size() == 30)
    .map((stringWindowed, integers) -> {
        DescriptiveStatistics stats = new DescriptiveStatistics();
        double[] array = integers.stream().mapToDouble(Integer::doubleValue).toArray();
        Arrays.stream(array).forEach(obs -> stats.addValue(obs));
        System.out.println("Number of Values: " + String.valueOf(stats.getN()));
        System.out.println("Min Value: " + String.valueOf(stats.getMin()));
        System.out.println("Max Value: " + String.valueOf(stats.getMax()));
        System.out.println("Mean: " + String.valueOf(stats.getMean()));
        System.out.println("Skewness: " + String.valueOf(stats.getSkewness()));
        System.out.println("Kurtosis: " + String.valueOf(stats.getKurtosis()));
        return KeyValue.pair(stringWindowed, integers);
    })
    .print(Printed.<Windowed<String>, List<Integer>>toSysOut().withLabel("Full Key and Values"));

Slide 63

Slide 63 text

@jbfletch_ Putting it all Together! And finally…calculate our required descriptive statistics!
observationsByClass
    .toStream()
    .filter((k, v) -> v.size() == 30)
    .map((stringWindowed, integers) -> {
        DescriptiveStatistics stats = new DescriptiveStatistics();
        double[] array = integers.stream().mapToDouble(Integer::doubleValue).toArray();
        Arrays.stream(array).forEach(obs -> stats.addValue(obs));
        System.out.println("Number of Values: " + String.valueOf(stats.getN()));
        System.out.println("Min Value: " + String.valueOf(stats.getMin()));
        System.out.println("Max Value: " + String.valueOf(stats.getMax()));
        System.out.println("Mean: " + String.valueOf(stats.getMean()));
        System.out.println("Skewness: " + String.valueOf(stats.getSkewness()));
        System.out.println("Kurtosis: " + String.valueOf(stats.getKurtosis()));
        return KeyValue.pair(stringWindowed, integers);
    })
    .print(Printed.<Windowed<String>, List<Integer>>toSysOut().withLabel("Full Key and Values"));

Slide 64

Slide 64 text

@jbfletch_ GitHub: https://github.com/jbfletch/kafkastreamsmathstat