Slide 1

Slide 1 text

Alice and the Mad Hatter: Predict or not to predict

Slide 2

Slide 2 text

Marianna Diachuk • Data Scientist at Restream • Women Who Code Kyiv Data Science Lead • Speaker, writer

Slide 3

Slide 3 text

Roksolana Diachuk • Big Data Developer at Captify • Diversity & Inclusion ambassador at Captify • Women Who Code Kyiv Data Engineering Lead • Speaker

Slide 4

Slide 4 text

In previous episodes (talks)…

Slide 5

Slide 5 text

Functional forest

Slide 6

Slide 6 text

magic-db-cluster-0

Slide 7

Slide 7 text

2 years later

Slide 8

Slide 8 text

magic-db-cluster-0 “Long time no see!”

Slide 9

Slide 9 text

NAME           READY   STATUS    AGE
launcher-crd   1/1     Running   33s

magic-db-cluster-0

Slide 10

Slide 10 text

3 months passed

Slide 11

Slide 11 text

{
  "predictions": [
    ["B-natural phenomenon", "O", "B-geographical entity", "B-time indicator"]
  ]
}

Slide 12

Slide 12 text

{ "instances": ["Pods and higher-order functions are in danger"] }

Test

Slide 13

Slide 13 text

5 days passed

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Alice! Alice!

Slide 16

Slide 16 text

“Long time no see!” “Happy to see you too, Alice!”

Slide 17

Slide 17 text

“Sorry, Alice! We don’t have enough time for stories”

Slide 18

Slide 18 text

“I was just waiting for you!” “What for?”

Slide 19

Slide 19 text

“Oh, don’t be rude, Alice! I will only entertain you a bit with my riddle” “Like it explains anything”

Slide 20

Slide 20 text

“Soon it will all make sense” “I hope so”

Slide 21

Slide 21 text

What was lost can be found only in the anomalies of this world

Slide 22

Slide 22 text

“Any ideas?” “Oh yes, I have a pretty clear understanding of this riddle”

Slide 23

Slide 23 text

“Anomalies?” “We need to find anomalies in the data”

Slide 24

Slide 24 text

“An anomaly is some kind of deviation from the standard”

Slide 25

Slide 25 text

[Diagram: “Cat” / “Not a cat” classification, illustrating supervised learning vs. unsupervised learning]

Slide 26

Slide 26 text

K-means

Slide 27

Slide 27 text

“Wait! You’re talking about ML models” “Exactly!”

Slide 28

Slide 28 text

Data ingestion → Data analysis → Data transformation → ML model building → Model evaluation and validation → Serving

Slide 29

Slide 29 text

Source data → ML model → Anomalies detected

Slide 30

Slide 30 text

“How are we going to solve it together?”

Slide 31

Slide 31 text

Data ingestion → Data analysis → Data transformation → ML model building → Model evaluation and validation → Serving

Slide 32

Slide 32 text

“Let’s solve it in Scala!”

Slide 33

Slide 33 text

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def loadData(rawData: RDD[String]): RDD[Vector] = {
  val dataAndLabel = rawData.map { line =>
    val buffer = line.split(',').toBuffer        // split the CSV record into fields
    val label = buffer.remove(buffer.length - 1) // the last field is the label
    val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
    (vector, label)
  }
  …
}

Slide 34

Slide 34 text

def loadData(rawData: RDD[String]): RDD[Vector] = {
  …
  val data = dataAndLabel.map(_._1).cache()
  normalization(data)
}
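For context, a minimal sketch of feeding loadData from the driver side, assuming the records live in a comma-separated text file (the path and app name are illustrative, not from the slides):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("anomaly-detection"))
val rawData = sc.textFile("data/events.csv") // hypothetical input path
val normalizedData = loadData(rawData)
normalizedData.take(3).foreach(println)      // inspect a few normalised vectors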

Slide 35

Slide 35 text

“We also need to normalise the data” “What’s data normalisation?”

Slide 36

Slide 36 text

def normalization(data: RDD[Vector]): RDD[Vector] = {
  val dataAsArray = data.map(_.toArray)
  val numCols = dataAsArray.first().length
  val n = dataAsArray.count()
  val sums = dataAsArray.reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
  val sumSquares = dataAsArray.fold(new Array[Double](numCols))(
    (a, b) => a.zip(b).map(t => t._1 + t._2 * t._2))
  val stdevs = sumSquares.zip(sums).map {
    case (sumSq, sum) => math.sqrt(n * sumSq - sum * sum) / n
  }
  val means = sums.map(_ / n)
  …
}
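The slide elides the final step. A common way to finish, assuming the z-score style of scaling that the means and stdevs above suggest, is to rescale each feature; this is a sketch of that assumption, not the talk's exact code:

// Assumed final step: per-feature z-score scaling using the means/stdevs above.
data.map { v =>
  val scaled = v.toArray.zipWithIndex.map { case (value, i) =>
    if (stdevs(i) <= 0) value - means(i) else (value - means(i)) / stdevs(i)
  }
  Vectors.dense(scaled)
}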

Slide 37

Slide 37 text

“Let’s work on the pipeline” “First part is done. What’s next?”

Slide 38

Slide 38 text

object AnomalyDetection {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf())
    val normalizedData = loadData(sc.textFile(args(0))) // input path passed as the first argument
    val model = trainModel(normalizedData)
    val centroid = model.clusterCenters(0).toString
    …
  }
}

Slide 39

Slide 39 text

object AnomalyDetection {
  def main(args: Array[String]) {
    …
    val distances = normalizedData.map(d => distToCentroid(d, model))
    val threshold = distances.top(2000).last
  }
}
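With the threshold in hand, a natural next step (a sketch, not shown on the slides) is to keep only the points whose distance to the centroid exceeds it:

// Sketch: anything farther from the centroid than the threshold is flagged.
val anomalies = normalizedData.filter(d => distToCentroid(d, model) > threshold)
anomalies.take(10).foreach(println)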

Slide 40

Slide 40 text

“Now we need to actually train a model”

Slide 41

Slide 41 text

def trainModel(normalizedData: RDD[Vector]): KMeansModel = {
  val kmeans = new KMeans()
  kmeans.setK(1)
  kmeans.setRuns(10)
  val model = kmeans.run(normalizedData)
  model
}
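A note on the design choice here: with k = 1 the model fits a single centroid, so an anomaly is simply a point unusually far from the centre of mass of the data, which is exactly what the distance threshold on the previous slide measures. (On Spark 2.x and later, setRuns no longer has any effect and can be dropped.)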

Slide 42

Slide 42 text

“Let’s dive deeper into distToCentroid function”

Slide 43

Slide 43 text

def distToCentroid(data: Vector, model: KMeansModel): Double = {
  val centroid = model.clusterCenters(model.predict(data))
  Vectors.sqdist(data, centroid)
}

Slide 44

Slide 44 text

“Ok, how are we going to run this job?” “We’ll use Kubernetes”

Slide 45

Slide 45 text

“Kubernetes is always a good idea”

Slide 46

Slide 46 text

[Diagram: Spark on Kubernetes: the client runs spark-submit; the Kubernetes apiserver and scheduler start the Spark driver, which manages Executor 1, Executor 2, and Executor 3]

Slide 47

Slide 47 text

./spark-submit \
  --master k8s://https://host:port \
  --deploy-mode cluster \
  --name anomaly-detection \
  --class anomaly.AnomalyDetection \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=spark:latest \
  local:///spark/jars/anomaly-detection-3.0.1.jar
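Two details of this command worth noting: the local:// scheme tells Spark the jar is already inside the container image rather than on the submitting machine, and the spark service account referenced here must exist in the cluster with permission to create executor pods.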

Slide 48

Slide 48 text

01010100 01101000 01100101 00100000 01110000 01100001 01110011 …

Slide 49

Slide 49 text

“Those are just some random numbers” “Oh, they are not random at all”

Slide 50

Slide 50 text

“I’ll give you just a small hint. Those numbers are binary” “We can see that”
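As a quick aside, the numbers on slide 48 decode as 8-bit ASCII. A minimal Scala sketch (not from the talk) shows the opening characters:

// Decode space-separated 8-bit binary groups into ASCII characters.
val bits = "01010100 01101000 01100101 00100000 01110000 01100001 01110011"
val decoded = bits.split(" ").map(b => Integer.parseInt(b, 2).toChar).mkString
println(decoded) // prints "The pas", the start of the final riddle's answer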

Slide 51

Slide 51 text

“I think I know how to…”

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

The past is the key to the future

Slide 55

Slide 55 text

To be continued…

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

Contact info • dead_flowers22 • roksolanad • dark_matter_88 • marianna-d