
Anomaly Detection & Clustering with Apache Spark

Talk by Sean Owen, Director of Data Science @ Cloudera, at the Data Science London meetup (@ds_ldn)

Data Science London

July 20, 2014

Transcript

  1. 1 Anomaly Detection with Apache Spark: Workshop. Sean Owen / Director of Data Science / Cloudera
  2. 4 Anomaly Detection
     • Anomalous…
       • Server metrics
       • Access patterns
       • Transactions
     • Labeled, or not
       • Sometimes we know examples of “unusual”
       • Sometimes not
     • Applications
       • Network security
       • IT monitoring
       • Fraud detection
       • Error detection
  3. 5 Clustering
     • Find areas of dense data
     • Unusual = far from any cluster
     • What is “far”?
     • Unsupervised learning
     • Supervise with labels to improve, interpret
     en.wikipedia.org/wiki/Cluster_analysis
  4. 6 k-means++ clustering
     • Simple, well-known, parallel
     • Assign points, update centers, repeat
     • Goal: points close to nearest cluster center
     • Must choose k = number of clusters
     mahout.apache.org/users/clustering/fuzzy-k-means.html
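     A minimal, non-distributed sketch of the loop described on this slide (assign points, update centers, repeat), for illustration only. The names KMeansSketch, nearest, and run are not from the deck, and k-means++ seeding is omitted; `initial` stands in for whatever seed centers are chosen.

     // Sketch of plain Lloyd's iteration: assignment step, then update step.
     object KMeansSketch {
       type Point = Array[Double]

       def distance(a: Point, b: Point): Double =
         math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

       def nearest(p: Point, centers: IndexedSeq[Point]): Int =
         centers.indices.minBy(i => distance(p, centers(i)))

       def mean(points: Seq[Point]): Point = {
         val dims = points.head.length
         Array.tabulate(dims)(d => points.map(_(d)).sum / points.length)
       }

       def run(points: Seq[Point], initial: IndexedSeq[Point], iterations: Int): IndexedSeq[Point] =
         (1 to iterations).foldLeft(initial) { (centers, _) =>
           // Assign every point to its nearest center...
           val byCenter = points.groupBy(p => nearest(p, centers))
           // ...then move each center to the mean of its assigned points.
           centers.indices.map(i => byCenter.get(i).map(mean).getOrElse(centers(i)))
         }
     }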
  5. 8 KDD Cup 1999
     • Annual ML competition: www.sigkdd.org/kddcup/index.php
     • 1999: Network intrusion detection
     • 4.9M connections
     • Most normal, some known attacks
     • Not a realistic sample!
  6. 10 Apache Spark: Something For Everyone
     • Scala-based
       • Expressive, efficient
       • JVM-based
     • Consistent Scala-like API
       • RDD works like a collection
       • RDDs for everything
       • Like Apache Crunch is Collection-like
     • Distributed
       • Hadoop-friendly
       • Integrate with where data, cluster already is
       • ETL no longer separate
     • Interactive REPL
     • MLlib
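     To illustrate the “RDD works like a collection” point above: the same operations compose on an RDD as on a local Scala collection. This example is not from the deck; it assumes the Spark shell's sc, as used in the later slides.

     // The same map/filter chain, first on a local collection, then on an RDD.
     val nums = Seq(1, 2, 3, 4, 5)
     val localResult = nums.map(_ * 2).filter(_ > 4)               // plain Scala collection

     val rdd = sc.parallelize(nums)                                 // RDD[Int]
     val distributedResult = rdd.map(_ * 2).filter(_ > 4).collect() // same API shape, runs on Spark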
  7. 12
     val rawData = sc.textFile("/user/srowen/kddcup.data", 120)
     rawData: org.apache.spark.rdd.RDD[String] = MappedRDD[13] at textFile at <console>:15

     rawData.count
     ...
     res1: Long = 4898431

     rawData.take(1)
     ...
     res3: Array[String] = Array(0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.)
  8. 14
     val dataAndLabel = rawData.map { line =>
       val buffer = line.split(",").toBuffer
       buffer.remove(1, 3)                           // drop the three categorical columns (protocol, service, flag)
       val label = buffer.remove(buffer.length - 1)  // last column is the attack/normal label
       val vector = buffer.map(_.toDouble).toArray
       (vector, label)
     }

     val data = dataAndLabel.map(_._1).cache()

     import org.apache.spark.mllib.clustering._
     val kmeans = new KMeans()
     val model = kmeans.run(data)

     val clusterLabelCount = dataAndLabel.map {
       case (data, label) => (model.predict(data), label)
     }.countByValue

     clusterLabelCount.toList.sorted.foreach {
       case ((cluster, label), count) =>
         println(f"$cluster%1s$label%18s$count%8s")
     }
  9. 15
     0  back.              2203
     0  buffer_overflow.     30
     0  ftp_write.            8
     0  guess_passwd.        53
     0  imap.                12
     0  ipsweep.          12481
     0  land.                21
     0  loadmodule.           9
     0  multihop.             7
     0  neptune.        1072017
     0  nmap.              2316
     0  normal.          972781
     0  perl.                 3
     0  phf.                  4
     0  pod.                264
     0  portsweep.        10412
     0  rootkit.             10
     0  satan.            15892
     0  smurf.          2807886
     0  spy.                  2
     0  teardrop.           979
     0  warezclient.       1020
     0  warezmaster.         20
     1  portsweep.            1

     Terrible.
  10. 17
      import scala.math._
      import org.apache.spark.rdd._

      def distance(a: Array[Double], b: Array[Double]) =
        sqrt(a.zip(b).map(p => p._1 - p._2).map(d => d * d).sum)

      def distToCentroid(datum: Array[Double], model: KMeansModel) =
        distance(model.clusterCenters(model.predict(datum)), datum)

      // Average distance from each point to its nearest centroid; lower is better for a fixed k
      def clusteringScore(data: RDD[Array[Double]], k: Int) = {
        val kmeans = new KMeans()
        kmeans.setK(k)
        val model = kmeans.run(data)
        data.map(datum => distToCentroid(datum, model)).mean
      }

      val kScores = (5 to 40 by 5).par.map(k =>
        (k, clusteringScore(data, k)))
  11. 18 (figure-only slide)

  12. 20
      kmeans.setRuns(10)
      kmeans.setEpsilon(1.0e-6)
      (30 to 100 by 10)

      (30,  886.974050712821)
      (40,  747.4268153420192)
      (50,  370.2801596900413)
      (60,  325.883722754848)
      (70,  276.05785104442657)
      (80,  193.53996444359856)
      (90,  162.72596475533814)
      (100, 133.19275833671574)
  13. 21
      library(rgl)

      clusters_data <- read.csv(pipe("hadoop fs -cat data/part-00000"))
      clusters <- clusters_data[1]
      data <- data.matrix(clusters_data[-c(1)])

      random_projection <- matrix(data = rnorm(3*ncol(data)), ncol = 3)
      random_projection_norm <- random_projection /
        sqrt(rowSums(random_projection*random_projection))

      projected_data <- data.frame(data %*% random_projection_norm)

      num_clusters <- nrow(unique(clusters))
      palette <- rainbow(num_clusters)
      colors = sapply(clusters, function(c) palette[c])
      plot3d(projected_data, col = colors, size = 1)
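      The deck does not show how data/part-00000 was produced. A plausible sketch from the Spark shell, purely as an assumption, would write each point's cluster assignment followed by its feature values as one CSV line so the R script above can read it back from HDFS.

      // Hypothetical sketch, not from the deck: write "cluster,feature1,feature2,..." lines.
      val assignments = normalizedData.map { datum =>
        val cluster = model.predict(datum)
        (cluster.toString +: datum.map(_.toString)).mkString(",")
      }
      assignments.coalesce(1).saveAsTextFile("data")   // coalesce(1) just to get a single data/part-00000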
  14. 22 (figure-only slide)

  15. 24 Normalization
      • “z score”:  (x - µ) / σ
      • Normalize away scale differences
      • (Mean doesn’t matter)
      • Assumes normal-ish distribution
  16. 25
      val numCols = data.take(1)(0).length
      val n = data.count
      // Per-column sums and sums of squares, used for column-wise mean and standard deviation
      val sums = data.reduce((a, b) =>
        a.zip(b).map(t => t._1 + t._2))
      val sumSquares = data.fold(new Array[Double](numCols))(
        (a, b) => a.zip(b).map(t => t._1 + t._2 * t._2))
      val stdevs = sumSquares.zip(sums).map {
        case (sumSq, sum) => sqrt(n * sumSq - sum * sum) / n
      }
      val means = sums.map(_ / n)

      def normalize(f: Array[Double]) =
        (f, means, stdevs).zipped.map((value, mean, stdev) =>
          if (stdev <= 0) (value - mean) else (value - mean) / stdev)

      val normalizedData = data.map(normalize(_)).cache()

      val kScores = (50 to 120 by 10).par.map(k =>
        (k, clusteringScore(normalizedData, k)))
  17. 26
      (50,  0.008184436460307516)
      (60,  0.005003794119180148)
      (70,  0.0036252446694127255)
      (80,  0.003448993315406253)
      (90,  0.0028508261816040984)
      (100, 0.0024371619202127343)
      (110, 0.002273862516438719)
      (120, 0.0022075535103855447)
  18. 27 (figure-only slide)

  19. 30
      val protocols = rawData.map(
        _.split(",")(1)).distinct.collect.zipWithIndex.toMap
      ...

      val dataAndLabel = rawData.map { line =>
        val buffer = line.split(",").toBuffer
        val protocol = buffer.remove(1)
        val vector = buffer.map(_.toDouble)

        // One-hot encode the categorical protocol column
        val newProtocolFeatures = new Array[Double](protocols.size)
        newProtocolFeatures(protocols(protocol)) = 1.0
        ...
        vector.insertAll(1, newProtocolFeatures)
        ...
        (vector.toArray, label)
      }
  20. 31
      (50,  0.09807063330707691)
      (60,  0.07344136010921463)
      (70,  0.05098421746285664)
      (80,  0.04059365147197857)
      (90,  0.03647143491690264)
      (100, 0.02384443440377552)
      (110, 0.016909326439972006)
      (120, 0.01610738339266529)
      (130, 0.014301399891441647)
      (140, 0.008563067306283041)
  21. 32 (figure-only slide)

  22. 35 Using Labels with Entropy
      • Measures mixed-ness:  Σ -p log p
      • Bad clusters have very mixed labels
      • Function of cluster’s label frequencies, p(x)
      • Good clustering = low entropy
  23. 36
      def entropy(counts: Iterable[Int]) = {
        val values = counts.filter(_ > 0)
        val sum: Double = values.sum
        values.map { v =>
          val p = v / sum
          -p * log(p)
        }.sum
      }

      // Average entropy over clusters, weighted by cluster size
      def clusteringScore(data: RDD[Array[Double]],
                          labels: RDD[String],
                          k: Int) = {
        ...
        val labelsInCluster =
          data.map(model.predict(_)).zip(labels).groupByKey.values
        val labelCounts = labelsInCluster.map(
          _.groupBy(l => l).map(t => t._2.length))
        val n = data.count
        labelCounts.map(m => m.sum * entropy(m)).sum / n
      }
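      The slide elides how this score is invoked; a hypothetical call, assuming a labels RDD taken from dataAndLabel and a range of k chosen for illustration, could look like:

      // Hypothetical usage, not shown on the slide.
      val labels = dataAndLabel.map(_._2).cache()
      val kScores = (30 to 120 by 10).par.map(k =>
        (k, clusteringScore(normalizedData, labels, k)))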
  24. 37
      (30,  1.0266922080881913)
      (40,  1.0226914826265483)
      (50,  1.019971839275925)
      (60,  1.0162839563855304)
      (70,  1.0108882243857347)
      (80,  1.0076114958062241)
      (95,  0.4731290640152461)
      (100, 0.5756131018520718)
      (105, 0.9090079450132587)
      (110, 0.8480807836884104)
      (120, 0.3923520444828631)
  25. 38
      72  ipsweep.            1
      72  normal.            85
      77  ipsweep.            6
      77  land.               9
      77  neptune.         1597
      77  normal.          4775
      77  portsweep.          2
      77  satan.             20
      90  buffer_overflow.    1
      90  guess_passwd.      45
      90  ipsweep.           36
      90  neptune.         4600
      90  normal.           598
      90  portsweep.         54
      90  satan.              6
      90  warezclient.        1
      93  ftp_write.          3
      93  loadmodule.         1
      93  multihop.           1
      93  normal.          4635
      93  phf.                4
      93  portsweep.          1
      93  spy.                1
  26. 40
      val kmeans = new KMeans()
      kmeans.setK(95)
      kmeans.setRuns(10)
      kmeans.setEpsilon(1.0e-6)
      val model = kmeans.run(normalizedData)

      val distances = normalizedData.map(datum =>
        (distToCentroid(datum, model), datum))

      // Use the 100th-greatest centroid distance as the anomaly threshold
      val outliers = distances.top(100)(Ordering.by(_._1))
      val threshold = outliers.last._1

      def anomaly(datum: Array[Double], model: KMeansModel) =
        distToCentroid(normalize(datum), model) > threshold
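      One way to apply the resulting detector, not shown in the deck: filter the original, un-normalized vectors through anomaly() and inspect a few flagged connections together with their labels.

      // Hypothetical usage, not from the deck.
      val flagged = dataAndLabel.filter { case (datum, label) => anomaly(datum, model) }
      flagged.take(10).foreach { case (datum, label) =>
        println(label + " " + datum.mkString(","))
      }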
  27. 42 From Here to Production?
      • Real data set!
      • Algorithmic
        • Distance metrics
        • k-means|| init
        • Algorithms
      • Uniquely ID data points
      • Real-Time
        • with Spark Streaming?
      • Continuous Pipeline
      • Visualization
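      A rough sketch of the “Real-Time with Spark Streaming?” idea from the slide above, entirely an assumption and not part of the deck: it reuses the parsing from the earlier slide and the anomaly() function, and the host and port are placeholders.

      // Hypothetical sketch, not from the deck: score connection records as they
      // arrive with Spark Streaming, reusing the trained KMeansModel and anomaly().
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc = new StreamingContext(sc, Seconds(10))
      val lines = ssc.socketTextStream("collector-host", 9999)   // placeholder source

      val anomalies = lines.map { line =>
        val buffer = line.split(",").toBuffer
        buffer.remove(1, 3)                    // same parsing as the earlier slide
        buffer.remove(buffer.length - 1)       // drop the trailing label, if present
        buffer.map(_.toDouble).toArray
      }.filter(datum => anomaly(datum, model))

      anomalies.print()                        // or write alerts somewhere durable
      ssc.start()
      ssc.awaitTermination()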