Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge

Slide 1

Slide 1 text

Real-Time IoT Analytics with Apache Pulsar August 6th, 2019 David Kjerrumgaard

Slide 2

Slide 2 text

Apache Pulsar • Cloud Native Messaging Platform developed at Yahoo! • Horizontally Scalable – Topics, Storage • Provides message ordering, durability, and delivery guarantees • Supports both Queuing and Pub/Sub messaging • Decoupled Serving and Storage Layers allow for edge deployment

Slide 3

Slide 3 text

Defining IoT Analytics • It is NOT JUST loading sensor data into a data lake to create predictive analytic models. While this is a crucial piece of the puzzle, it is not the only one. • IoT Analytics requires the ability to ingest, aggregate, and process an endless stream of real-time data coming off a wide variety of sensor devices “at the edge”. • IoT Analytics renders real-time decisions at the edge of the network to either optimize operational performance or detect anomalies for immediate remediation.

Slide 4

Slide 4 text

What Makes IoT Analytics Different?

Slide 5

Slide 5 text

IoT Analytics Challenges • IoT deals with machine-generated data consisting of discrete observations, such as temperature, vibration, and pressure, that are produced at very high rates. • We need an architecture that: • Allows us to quickly identify and react to anomalous events • Reduces the volume of data transmitted back to the data lake. • In this talk, we will present a solution based on Apache Pulsar Functions that distributes the analytics processing across all tiers of the IoT data ingestion pipeline.

Slide 6

Slide 6 text

IoT Data Ingestion Pipeline

Slide 7

Slide 7 text

Apache Pulsar Functions

Slide 8

Slide 8 text

Pulsar Functions The Apache Pulsar platform provides a flexible, serverless computing framework that allows you to execute user-defined functions to process and transform data. • Implemented as simple methods, while still allowing you to leverage existing libraries and code written in Java or Python. • Functions execute against every single event that is published to a specified topic and write their results to another topic, forming a logical directed acyclic graph. • Enable dynamic filtering, transformation, routing, and analytics. • Can run anywhere a JVM can, including edge devices.
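
As a rough illustration of the "simple method" style described on this slide, the sketch below follows Pulsar's language-native Python convention of a module-level `process` function that receives one event and returns the value to publish. The JSON payload shape, the field names (`sensor`, `temp_f`), and the 90.0 threshold are assumptions made for this example and are not taken from the talk.

```python
import json

# Hypothetical alert threshold in degrees Celsius (an assumption, not from the talk).
MAX_TEMP_C = 90.0


def process(input):
    """Invoked once per event; the returned string is what gets published downstream."""
    reading = json.loads(input)
    # Transform: convert the raw Fahrenheit reading to Celsius.
    temp_c = (reading["temp_f"] - 32.0) * 5.0 / 9.0
    # Enrich: attach a simple flag that a downstream function could route on.
    return json.dumps({
        "sensor": reading["sensor"],
        "temp_c": round(temp_c, 1),
        "alert": temp_c > MAX_TEMP_C,
    })
```

Because the function is plain Python with no framework imports, it can be exercised locally with a sample payload before it is deployed against a topic.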

Slide 9

Slide 9 text

Building Blocks for IoT Analytics

Slide 10

Slide 10 text

Distributed Probabilistic Analytics with Apache Pulsar Functions

Slide 11

Slide 11 text

Probabilistic Analysis • Often, it is sufficient to provide an approximate value when it is impossible and/or impractical to provide a precise value. In many cases, having an approximate answer within a given time frame is better than waiting for an exact answer. • Probabilistic algorithms can provide approximate values when the event stream is either too large to store in memory, or the data is moving too fast to process. • Instead of keeping such enormous volumes of data on hand, we leverage algorithms that require only a few kilobytes of data.

Slide 12

Slide 12 text

Data Sketches • A central theme throughout most of these probabilistic data structures is the concept of data sketches, which are designed to retain only as much of the data as is necessary to make an accurate estimation of the correct answer. • Typically, sketches are implemented as bit arrays or maps, thereby requiring memory on the order of kilobytes and making them ideal for resource-constrained environments, e.g. on the edge. • Sketching algorithms need to see each incoming item only once, and are therefore ideal for processing infinite streams of data.

Slide 13

Slide 13 text

Data Sketch Properties • Configurable Accuracy • Sketches sized correctly can be 100% accurate • Error rate is inversely proportional to size of a Sketch • Fixed Memory Utilization • Maximum Sketch size is configured in advance • Memory cost of a query is thus known in advance • Allows Non-additive Operations to be Additive • Sketches can be merged into a single Sketch without overcounting • Allows tasks to be parallelized and combined later • Allows results to be combined across windows of execution
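
To make the "non-additive operations become additive" property concrete, here is a minimal sketch using a linear-counting bitmap for distinct counts. The talk does not name this particular structure; it is chosen here only because it is small enough to show in full. The class name, the 1024-bit size, and the meter IDs are all illustrative assumptions.

```python
import hashlib
import math


class DistinctSketch:
    """Linear-counting bitmap: estimates the number of distinct items in fixed memory."""

    def __init__(self, num_bits=1024):
        self.num_bits = num_bits  # configured in advance, so memory use is bounded
        self.bits = [False] * num_bits

    def add(self, item):
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        self.bits[int.from_bytes(digest, "big") % self.num_bits] = True

    def merge(self, other):
        """Bitwise OR: an item seen by both sketches maps to the same bit, so it counts once."""
        if other.num_bits != self.num_bits:
            raise ValueError("sketches must be the same size to merge")
        merged = DistinctSketch(self.num_bits)
        merged.bits = [a or b for a, b in zip(self.bits, other.bits)]
        return merged

    def estimate(self):
        zeros = self.bits.count(False)
        if zeros == 0:
            return float("inf")  # saturated: the sketch was sized too small
        return -self.num_bits * math.log(zeros / self.num_bits)


# Two edge nodes observe overlapping sets of (hypothetical) meter IDs.
node_a, node_b = DistinctSketch(), DistinctSketch()
for i in range(0, 300):
    node_a.add(f"meter-{i}")
for i in range(200, 500):
    node_b.add(f"meter-{i}")

# The merged sketch approximates the 500 distinct IDs rather than the naive sum of 600.
combined = node_a.merge(node_b)
```

Adding the two per-node distinct counts would double count the 100 shared meters, whereas merging the sketches yields exactly the bitmap that a single node would have built from the whole stream.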

Slide 14

Slide 14 text

Sketch Example • Let’s walk through a demonstration to show exactly what I mean by sketches, and to show that we do not need 100% of the data in order to make an accurate prediction of what the picture contains. • How much of the data did you require to identify the main item in the picture?

Slide 15

Slide 15 text

Operations Supported by Sketches

Slide 16

Slide 16 text

Some Sketchy Functions

Slide 17

Slide 17 text

Event Frequency • A commonly computed statistic is the frequency at which a specific element occurs within an endless data stream with repeated elements, which enables us to answer questions such as: “How many times has element X occurred in the data stream?”. • Consider trying to analyze and sample the IoT sensor data for just a single industrial plant that can produce millions of readings per second. There isn’t enough time to perform the calculations or store the data. • In such a scenario you can choose to forgo an exact answer, which we will never be able to compute in time, for an approximate answer that is within an acceptable range of accuracy.

Slide 18

Slide 18 text

Count-Min Sketch • The Count-Min Sketch algorithm uses two elements: • A K-by-M matrix of counters, each initialized to 0, where each row corresponds to a hash function • A collection of K independent hash functions h(x). • When an element is added to the sketch, each of the hash functions is applied to the element. Each hash value is treated as an index into that hash function's row of the matrix, and the corresponding counter is incremented by 1. • Now that we have an approximate count for each element we have seen stored in the K-by-M matrix, we can quickly determine how many times an element X has occurred previously in the stream by applying each of the hash functions to the element, retrieving the corresponding counters, and using the SMALLEST value as the approximate event count. Hash collisions can only inflate a counter, so the smallest value is the least overcounted.
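
The update and query steps on this slide can be sketched in a few lines of Python. This is an illustrative version, not the code from the talk: the class name, the default `depth` and `width`, and the use of a row-salted BLAKE2b hash as a stand-in for independent hash functions are all assumptions.

```python
import hashlib


class CountMinSketch:
    """A depth-by-width matrix of counters with one hash function per row."""

    def __init__(self, depth=4, width=256):
        self.depth = depth  # number of hash functions (rows)
        self.width = width  # counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting a single hash with the row number stands in for
        # a family of independent hash functions.
        digest = hashlib.blake2b(
            str(item).encode(), digest_size=8, salt=row.to_bytes(8, "big")
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        # Update: increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Query: collisions only ever inflate counters, so the minimum
        # across rows is the tightest available upper bound.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
for _ in range(5):
    sketch.add("meter-7")
sketch.add("meter-9", count=2)
frequency = sketch.estimate("meter-7")  # never below the true count of 5
```

The estimate can overcount but never undercount, and memory stays fixed at `depth * width` counters no matter how many events pass through.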

Slide 19

Slide 19 text

Pulsar Function: Event Frequency

Slide 20

Slide 20 text

K-Frequency-Estimation, aka “Heavy Hitters” • A common use of the Count-Min algorithm is maintaining lists of frequent items, commonly referred to as the “Heavy Hitters”. • The K-Frequency-Estimation problem can also be solved by using the Count-Min Sketch algorithm. The logic for updating the counts is exactly the same as in the Event Frequency use case. • However, an additional list of length K is maintained to track the top-K elements seen so far.

Slide 21

Slide 21 text

Pulsar Function: Top K • Each of the hash functions is applied to the element. Each hash value is treated as an index into that hash function's row of the counter matrix, and the corresponding counter is incremented by 1. • Calculate the event frequency for the element as we did in the event frequency use case, taking the SMALLEST value in the list as the approximate event count. • Compare the calculated event frequency of this element against the smallest value in the top-K elements array, and if it is LARGER, remove the smallest value and replace it with the new element.
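
The three steps above can be sketched as a single `add` method. This is a hedged illustration rather than the function shown in the talk: the `TopK` class, its parameters, the salted BLAKE2b hashing, and the sample stream are assumptions, and the top-K list is kept as a small dictionary for brevity.

```python
import hashlib


class TopK:
    """Count-Min counters plus a list of at most k heavy hitters."""

    def __init__(self, k=3, depth=4, width=256):
        self.k = k
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]
        self.top = {}  # element -> estimated count, never more than k entries

    def _index(self, item, row):
        digest = hashlib.blake2b(
            str(item).encode(), digest_size=8, salt=row.to_bytes(8, "big")
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item):
        # Step 1: increment one counter per hash function.
        counts = []
        for row in range(self.depth):
            col = self._index(item, row)
            self.table[row][col] += 1
            counts.append(self.table[row][col])
        # Step 2: the smallest counter is the approximate frequency.
        estimate = min(counts)
        # Step 3: keep the element if it is tracked, if there is room,
        # or if it beats the current smallest heavy hitter.
        if item in self.top or len(self.top) < self.k:
            self.top[item] = estimate
            return
        smallest = min(self.top, key=self.top.get)
        if estimate > self.top[smallest]:
            del self.top[smallest]
            self.top[item] = estimate

    def heavy_hitters(self):
        return sorted(self.top.items(), key=lambda kv: kv[1], reverse=True)


tracker = TopK(k=3)
stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 20 + [f"x{i}" for i in range(20)]
for event in stream:
    tracker.add(event)
leaders = [name for name, _ in tracker.heavy_hitters()]
```

Only the counter matrix and the k-entry dictionary are retained, so memory stays constant regardless of how many distinct elements the stream contains.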

Slide 22

Slide 22 text

Pulsar Function: Top K

Slide 23

Slide 23 text

Anomaly Detection • Most anomaly detectors use a manually configured threshold value that does not adapt to even simple patterns or variances. • Instead of using a single static value for our thresholds, we should consider using quantiles. • In statistics and probability, quantiles are used to represent probability distributions; the most common of these are percentiles.

Slide 24

Slide 24 text

Anomaly Detection with Quantiles • The data structure known as t-digest was developed by Ted Dunning as a way to accurately estimate extreme quantiles for very large data sets with limited memory use. • This capability makes t-digest particularly useful for calculating quantiles that can be used to select a good threshold for anomaly detection. • The advantage of this approach is that the threshold automatically adapts to the dataset as it collects more data.
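
The idea of a threshold that adapts to the data can be illustrated without a t-digest library. The sketch below is explicitly NOT a t-digest: it swaps in an exact quantile computed over a bounded window of recent readings, which is enough to show the adaptive behaviour but lacks t-digest's compact, mergeable summary. The class name, the window and warm-up sizes, and the sample readings are assumptions for illustration.

```python
from collections import deque


class QuantileThreshold:
    """Flags readings above a quantile of recent data (exact stand-in for t-digest)."""

    def __init__(self, quantile=0.99, window=1000, warmup=30):
        self.quantile = quantile
        self.warmup = warmup  # readings required before anything is flagged
        self.values = deque(maxlen=window)  # bounded memory: oldest readings drop off

    def threshold(self):
        ordered = sorted(self.values)
        rank = min(len(ordered) - 1, int(self.quantile * len(ordered)))
        return ordered[rank]

    def is_anomaly(self, value):
        """Compare against the current threshold, then learn from the reading."""
        anomalous = len(self.values) >= self.warmup and value > self.threshold()
        self.values.append(value)
        return anomalous


detector = QuantileThreshold()
# Hypothetical readings that cycle between 20.0 and roughly 20.9.
normal_flags = [detector.is_anomaly(20.0 + (i % 10) * 0.1) for i in range(100)]
spike_flag = detector.is_anomaly(35.0)
```

Sorting the window on every reading is wasteful at high event rates, which is exactly the cost a t-digest avoids; the threshold logic around it would stay the same if a real t-digest implementation supplied the quantile.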

Slide 25

Slide 25 text

Pulsar Function: T-Digest

Slide 26

Slide 26 text

IoT Analytics Pipeline Using Apache Pulsar Functions

Slide 27

Slide 27 text

Identifying Real-Time Energy Consumption Patterns • A network of smart meters enables utility companies to gain greater visibility into their customers’ energy consumption. • Increase/decrease energy generation to meet the demand. • Implement dynamic notifications to encourage consumers to use less energy during peak demand periods. • Provide real-time revenue forecasts to senior business leaders. • Identify faulty meters and schedule maintenance calls.

Slide 28

Slide 28 text

Smart Meter Analytics Flow Logic

Slide 29

Slide 29 text

Smart Meter Analytics - Step 1

Slide 30

Slide 30 text

Smart Meter Analytics - Step 2

Slide 31

Slide 31 text

Smart Meter Analytics - Step 3

Slide 32

Slide 32 text

Smart Meter Analytics - Step 4

Slide 33

Slide 33 text

Smart Meter Analytics - Step 5

Slide 34

Slide 34 text

Smart Meter Analytics - Step 6

Slide 35

Slide 35 text

Summary & Review

Slide 36

Slide 36 text

Summary & Review • IoT Analytics is an extremely complex problem, and modern streaming platforms are not well suited to solving this problem. • Apache Pulsar provides a platform for implementing distributed analytics on the edge to decrease the data capture time. • Apache Pulsar Functions allows you to leverage existing probabilistic analysis techniques to provide approximate values, within an acceptable degree of accuracy. • Both techniques allow you to act upon your data while the business value is still high.

Slide 37

Slide 37 text

More Information on Apache Pulsar • My Book, Pulsar in Action: https://www.manning.com/books/pulsar-in-action (ApacheCon Discount Code: ctwapachecon19) • Apache Pulsar documentation: http://pulsar.apache.org • Streamlio Tutorials: https://streaml.io/resources/tutorials • Streamlio Blogs: https://streaml.io/blog • Slack: apache-pulsar.slack.com