
Alice and the Mad Hatter: Predict or not to predict

Roksolana
October 29, 2021


"Alice and Mad Hatter should solve a mystical riddle that will help them to find their way back home. Alice is a big data engineer and Mad Hatter is a skilled data scientist. In order to solve the riddle, they need to build a prediction using a machine learning model. Will knowledge of Scala help Alice to find a solution? And will she be able to collaborate with the Mad Hatter? You will find out in this talk."
Co-presented at Scale By The Bay 2021 - virtually with Marianna Diachuk


Transcript

1. Marianna Diachuk • Data Scientist at Restream • Women Who Code Kyiv Data Science Lead • Speaker, writer

2. Roksolana Diachuk • Big Data Developer at Captify • Diversity & Inclusion ambassador at Captify • Women Who Code Kyiv Data Engineering Lead • Speaker
3. “Oh, don’t be rude, Alice! I will only entertain you a bit with my riddle” “Like it explains anything”
4. def loadData(rawData: RDD[String]): RDD[Vector] = {
     val dataAndLabel = rawData.map { line =>
       val buffer = line.split(',').toBuffer // split the CSV line into fields (assumed; the slide shows an empty buffer)
       val label = buffer.remove(buffer.length - 1) // the last field is the label
       val vector = Vectors.dense(buffer.map(_.toDouble).toArray)
       (vector, label)
     }
     …
   }
5. def loadData(rawData: RDD[String]): RDD[Vector] = {
     …
     val data = dataAndLabel.map(_._1).cache()
     normalization(data)
   }
6. def normalization(data: RDD[Vector]): RDD[Vector] = {
     val dataArray = data.map(_.toArray)
     val numCols = dataArray.first().length
     val n = dataArray.count()
     val sums = dataArray.reduce((a, b) => a.zip(b).map(t => t._1 + t._2))
     val sumSquares = dataArray.fold(new Array[Double](numCols))(
       (a, b) => a.zip(b).map(t => t._1 + t._2 * t._2))
     val stdevs = sumSquares.zip(sums).map {
       case (sumSq, sum) => math.sqrt(n * sumSq - sum * sum) / n
     }
     val means = sums.map(_ / n)
     …
   }
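The slide elides the final step. A minimal sketch of what the elided part would typically do with these statistics, standardizing each feature as (value - mean) / stdev and guarding against a zero deviation (the normalize helper below is hypothetical, not from the slides):

   // Hypothetical sketch: apply the computed statistics to every vector.
   def normalize(v: Vector, means: Array[Double], stdevs: Array[Double]): Vector = {
     val standardized = v.toArray.zipWithIndex.map { case (value, i) =>
       if (stdevs(i) <= 0) value - means(i) // avoid dividing by a zero deviation
       else (value - means(i)) / stdevs(i)
     }
     Vectors.dense(standardized)
   }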
7. object AnomalyDetection {
     def main(args: Array[String]): Unit = {
       val sc = new SparkContext(new SparkConf())
       // loadData expects an RDD[String]; the input path is assumed to arrive as the first argument
       val normalizedData = loadData(sc.textFile(args(0)))
       val model = trainModel(normalizedData)
       val centroid = model.clusterCenters(0).toString
       …
     }
   }
8. object AnomalyDetection {
     def main(args: Array[String]): Unit = {
       …
       val distances = normalizedData.map(d => distToCentroid(d, model))
       val threshold = distances.top(2000).last
     }
   }
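The threshold here is the distance of the 2000th-farthest point from the centroid. A minimal sketch of how it could then flag anomalies, assuming the values computed above (this filter step is not shown on the slides):

   // Hypothetical usage: points farther from the centroid than the threshold are anomalies.
   val anomalies = normalizedData.filter(d => distToCentroid(d, model) > threshold)
   anomalies.take(10).foreach(println)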
9. def trainModel(normalizedData: RDD[Vector]): KMeansModel = {
     val kmeans = new KMeans()
     kmeans.setK(1) // a single cluster: every point is measured against one centroid
     kmeans.setRuns(10) // number of parallel runs (RDD-based MLlib API)
     val model = kmeans.run(normalizedData)
     model
   }
10. def distToCentroid(data: Vector, model: KMeansModel): Double = {
      val centroid = model.clusterCenters(model.predict(data)) // centroid of the assigned cluster
      Vectors.sqdist(data, centroid) // squared Euclidean distance
    }
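For reference, the snippets above use the RDD-based MLlib API; the imports they assume (not shown on the slides) would be roughly:

   import org.apache.spark.{SparkConf, SparkContext}
   import org.apache.spark.rdd.RDD
   import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
   import org.apache.spark.mllib.linalg.{Vector, Vectors}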
11. ./spark-submit \
      --master k8s://https://host:port \
      --deploy-mode cluster \
      --name anomaly-detection \
      --class anomaly.AnomalyDetection \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      --conf spark.kubernetes.container.image=spark:latest \
      local:///spark/jars/anomaly-detection-3.0.1.jar