
Alice and the Mad Hatter: Predict or not to predict

Roksolana
October 29, 2021


"Alice and Mad Hatter should solve a mystical riddle that will help them to find their way back home. Alice is a big data engineer and Mad Hatter is a skilled data scientist. In order to solve the riddle, they need to build a prediction using a machine learning model. Will knowledge of Scala help Alice to find a solution? And will she be able to collaborate with the Mad Hatter? You will find out in this talk."
Co-presented at Scale By The Bay 2021 - virtually with Marianna Diachuk


Transcript

1. Marianna Diachuk • Data Scientist at Restream • Women Who Code Kyiv Data Science Lead • Speaker, writer

2. Roksolana Diachuk • Big Data Developer at Captify • Diversity & Inclusion ambassador at Captify • Women Who Code Kyiv Data Engineering Lead • Speaker
3. “Oh, don’t be rude, Alice! I will only entertain you a bit with my riddle” “Like it explains anything”
4. def loadData(rawData: RDD[String]): RDD[Vector] = {
     val dataAndLabel = rawData.map { line =>
       val buffer = line.split(',').toBuffer // split the CSV line into fields (assumed; the slide shows an empty buffer)
       val label = buffer.remove(buffer.length - 1) // the last field is the label
       val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
       (vector, label)
     }
     …
   }
5. def loadData(rawData: RDD[String]): RDD[Vector] = {
     …
     val data = dataAndLabel.map(_._1).cache()
     normalization(data)
   }
6. def normalization(data: RDD[Vector]): RDD[Vector] = {
     val dataArray = data.map(_.toArray)
     val numCols = dataArray.first().length
     val n = dataArray.count()
     val sums = dataArray.reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
     val sumSquares = dataArray.fold(new Array[Double](numCols))(
       (a, b) => a.zip(b).map(t => t._1 + t._2 * t._2))
     val stdevs = sumSquares.zip(sums).map {
       case (sumSq, sum) => math.sqrt(n * sumSq - sum * sum) / n
     }
     val means = sums.map(_ / n)
     …
   }
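The slide elides the final step. A minimal sketch of what the elided part would typically do with these statistics, standardizing each feature as (value - mean) / stdev and guarding against a zero deviation (the normalize helper below is hypothetical, not from the slides):

   // Hypothetical sketch: apply the computed statistics to every vector.
   def normalize(v: Vector, means: Array[Double], stdevs: Array[Double]): Vector = {
     val standardized = v.toArray.zipWithIndex.map { case (value, i) =>
       if (stdevs(i) <= 0) value - means(i) // avoid dividing by a zero deviation
       else (value - means(i)) / stdevs(i)
     }
     Vectors.dense(standardized)
   }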
7. object AnomalyDetection {
     def main(args: Array[String]): Unit = {
       val sc = new SparkContext(new SparkConf())
       // loadData expects an RDD[String]; the input path is assumed to arrive as the first argument
       val normalizedData = loadData(sc.textFile(args(0)))
       val model = trainModel(normalizedData)
       val centroid = model.clusterCenters(0).toString
       …
     }
   }
8. object AnomalyDetection {
     def main(args: Array[String]): Unit = {
       …
       val distances = normalizedData.map(d => distToCentroid(d, model))
       val threshold = distances.top(2000).last
     }
   }
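The threshold here is the distance of the 2000th-farthest point from the centroid. A minimal sketch of how it could then flag anomalies, assuming the values computed above (this filter step is not shown on the slides):

   // Hypothetical usage: points farther from the centroid than the threshold are anomalies.
   val anomalies = normalizedData.filter(d => distToCentroid(d, model) > threshold)
   anomalies.take(10).foreach(println)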
9. def trainModel(normalizedData: RDD[Vector]): KMeansModel = {
     val kmeans = new KMeans()
     kmeans.setK(1) // a single cluster: every point is measured against one centroid
     kmeans.setRuns(10) // number of parallel runs (RDD-based MLlib API)
     val model = kmeans.run(normalizedData)
     model
   }
10. def distToCentroid(data: Vector, model: KMeansModel): Double = {
      val centroid = model.clusterCenters(model.predict(data)) // centroid of the assigned cluster
      Vectors.sqdist(data, centroid) // squared Euclidean distance
    }
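For reference, the snippets above use the RDD-based MLlib API; the imports they assume (not shown on the slides) would be roughly:

   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.rdd.RDD
   import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
   import org.apache.spark.mllib.linalg.{Vector, Vectors}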
11. ./spark-submit \
      --master k8s://https://host:port \
      --deploy-mode cluster \
      --name anomaly-detection \
      --class anomaly.AnomalyDetection \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.container.image=spark:latest \
      local:///spark/jars/anomaly-detection-3.0.1.jar