Fast Memory-Efficient Anomaly Detection in Streaming Heterogeneous Graphs

Emaad Manzoor
August 15, 2016

20-minute talk at ACM SIGKDD 2016.
Project website: http://sbustreamspot.github.io/

Transcript

  1. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs

    Emaad Manzoor (Stony Brook University), Sadegh M. Milajerdi (University of Illinois at Chicago), Leman Akoglu (Stony Brook University)
  2. Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs / StreamSpot

    bit.ly/streamspot

    StreamSpot tracks anomalous heterogeneous graph objects as they evolve from a stream of typed edges.

    Edge stream: B A (t = 100), E B (t = 200), C A (t = 300), E F (t = 400)
  3. Stream of Typed Edges

    Edge stream: B A (t = 100), E B (t = 200), C A (t = 300), E F (t = 400)
  4. Stream of Typed Edges, Evolving Heterogeneous Graphs (t = 100)

    Graph 1: B A <100>
  5. Stream of Typed Edges, Evolving Heterogeneous Graphs (t = 200)

    Graph 1: B A <100>
    Graph 2: E B <200>
  6. Stream of Typed Edges, Evolving Heterogeneous Graphs (t = 300)

    Graph 1: B A <100>, C A <300>
    Graph 2: E B <200>
  7. Stream of Typed Edges, Evolving Heterogeneous Graphs (t = 400)

    Graph 1: B A <100>, C A <300>
    Graph 2: E B <200>, E F <400>
  8. Anomaly Scores

    An "Anomaly Scores" row is added beneath the evolving graphs (no scores shown yet).
  9. Anomaly Scores

    s = 0.2
  10. Anomaly Scores

    s = 0.2, s = 0.2, s = 0.4
  11. Anomaly Scores

    s = 0.2, s = 0.2, s = 0.4, s = 0.4, s = 0.6
  12. Anomaly Scores

    s = 0.2, s = 0.2, s = 0.4, s = 0.4, s = 0.1, s = 0.6, s = 0.6
  13. Challenges

    Anomaly scores: s = 0.2, s = 0.2, s = 0.4, s = 0.6, s = 0.4, s = 0.6, s = 0.1

    Challenges
    • Fast
    • Incremental
    • Bounded space
    • Accurate
  14. Motivation I

    Real-time detection of malicious/compromised application software (Transparent Computing)

    Input: Stream of system calls
    Goal: Flag suspicious applications quickly, accurately and in real time

    | time | subject | type | object | flow |
    |---|---|---|---|---|
    | 100 | proc/10639 | fork | proc/10640 | 1 |
    | 200 | proc/10640 | exec | file/“/bin/sh” | 1 |
    | 300 | proc/10650 | read | file/stdin | 2 |
    | 400 | proc/10640 | stat | mem/bfc5598 | 1 |
    | 500 | proc/10660 | read | sock/0.0.0.0 | 2 |
    | … | … | … | … | … |
  15. Motivation I

    Real-time detection of malicious/compromised application software (Transparent Computing)

    Input: Stream of system events (the table from slide 14, with the third column labelled "event")
    Goal: Flag suspicious applications quickly, accurately and in real time

    Resulting graphs, one per flow:
    FLOW 1: proc 10639 -<100, fork>-> proc 10640; proc 10640 -<200, execve>-> file /bin/sh; proc 10640 -<400, stat>-> mem 0xbfc5598
    FLOW 2: proc 10650 -<300, read>-> file stdin; proc 10660 -<500, read>-> sock 0.0.0.0
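
The step from the event table to one graph per flow can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `Edge` tuple, the `node_type` rule (type = prefix before the first `/`) and the `build_graphs` helper are assumptions made for the example; only the five rows are taken from the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class Edge(NamedTuple):
    time: int
    src: str    # subject, e.g. "proc/10639"
    event: str  # edge type, e.g. "fork"
    dst: str    # object, e.g. "proc/10640"


def node_type(node: str) -> str:
    """Assumed convention: the node type is the prefix before the first '/'."""
    return node.split("/", 1)[0]


# Rows copied from the slide: (time, subject, event, object, flow).
EVENTS = [
    (100, "proc/10639", "fork", "proc/10640", 1),
    (200, "proc/10640", "exec", 'file/"/bin/sh"', 1),
    (300, "proc/10650", "read", "file/stdin", 2),
    (400, "proc/10640", "stat", "mem/bfc5598", 1),
    (500, "proc/10660", "read", "sock/0.0.0.0", 2),
]


def build_graphs(events):
    """Group typed edges by flow id, keeping them in timestamp order."""
    graphs = defaultdict(list)
    for time, src, event, dst, flow in sorted(events):
        graphs[flow].append(Edge(time, src, event, dst))
    return dict(graphs)


graphs = build_graphs(EVENTS)
for flow, edges in graphs.items():
    for e in edges:
        print(flow, e.time, node_type(e.src), e.event, node_type(e.dst))
```

Each flow id plays the role of a graph id, so flow 1 accumulates three edges and flow 2 accumulates two.
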
  16. Motivation II

    Real-time detection of anomalous users from filesystem navigation

    Input: Stream of filesystem navigation traces
    Goal: Flag suspicious users quickly, accurately and in real time

    | time | source | destination | user |
    |---|---|---|---|
    | 100 | /streamspot | /streamspot/paper | 1 |
    | 200 | /streamspot/paper | /streamspot/code | 1 |
    | 300 | /streamspot | /streamspot/code | 2 |
    | 400 | /streamspot/code | /streamspot/data | 1 |
    | 500 | /streamspot | /streamspot/data | 2 |
    | … | … | … | … |
  17. Motivation II

    Real-time detection of anomalous users from filesystem navigation

    Input: Stream of filesystem navigation traces (the table from slide 16)
    Goal: Flag suspicious users quickly, accurately and in real time

    Resulting graphs, one per user:
    USER 1: /streamspot -<100>-> /streamspot/paper; /streamspot/paper -<200>-> /streamspot/code; /streamspot/code -<400>-> /streamspot/data
    USER 2: /streamspot -<300>-> /streamspot/code; /streamspot -<500>-> /streamspot/data
  18. Related Work

    StreamSpot
    • Operates one edge at a time
    • Handles multiple simultaneously evolving graphs
    • Embraces heterogeneity in node/edge types

    Prior work (slide labels: graph at a time; graph at a time; edge at a time, connectivity anomalies)
    • Aggarwal, Charu C., Yuchen Zhao, and Philip S. Yu. Outlier detection in graph streams. ICDE 2011.
    • Aggarwal, Charu C., Yuchen Zhao, and Philip S. Yu. On clustering graph streams. SDM 2010.
    • Kostakis, Orestis. Classy: fast clustering streams of call-graphs. Data Mining and Knowledge Discovery, 2014.
  19. StreamSpot in a Nutshell

    • Data to streaming heterogeneous graphs
    • Graph representation/comparison: streaming and bounded space (projection vector yG = [+3, -1, -1], sketch xG = [+1, -1, -1])
    • Clustering-based anomaly detection
    • Anomaly score: distance to nearest cluster centroid
  20. Outline

    • Problem
    • Motivation
    • Related Work
    • Method Overview
    • Method Details
    • Evaluation
    • Summary
  21. Graph Representation

    Graphs to vectors via shingling

    k-shingle strings are constructed by an Ordered k-hop Breadth-First Traversal (OkBFT) from each node.

    Example (k = 1; source node types are shown in bold on the slide): a graph with node types A, B, C, D, E and an edge labelled <600,o>
    k-shingles from each node: AxByC, BpEoD, C, D, E
  22. Graph Representation

    Graphs to vectors via shingling

    The shingle (frequency) vector zG of a graph G contains the frequency of each k-shingle in G.

    Graph G shingles: AxByC, BpEoD, C, D, E (edge <600,o>)
    Graph G’ shingles: AxByC, BrArA, C, A, A (edge <650,r>)

    | shingle | zG | zG’ |
    |---|---|---|
    | A | 0 | 2 |
    | C | 1 | 1 |
    | D | 1 | 0 |
    | E | 1 | 0 |
    | AxByC | 1 | 1 |
    | BrArA | 0 | 1 |
    | BpEoD | 1 | 0 |
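
A small Python sketch of 1-hop shingling reproduces the two frequency vectors above. It assumes that a node's shingle is its own type followed by the (edge type, destination type) pairs of its outgoing edges in timestamp order; the node ids, the helper name `shingles_k1`, and every timestamp other than 600 and 650 are invented for the example.

```python
from collections import Counter


def shingles_k1(node_types, edges):
    """Return the shingle frequency vector for k = 1.

    node_types: {node_id: type}; edges: [(src, dst, edge_type, time)].
    """
    out = {v: [] for v in node_types}
    for src, dst, etype, time in edges:
        out[src].append((time, etype, node_types[dst]))
    vec = Counter()
    for v, nbrs in out.items():
        # Visit outgoing edges in timestamp order (the "ordered" traversal).
        s = node_types[v] + "".join(et + nt for _, et, nt in sorted(nbrs))
        vec[s] += 1
    return vec


# Graph G from the slide; only timestamp 600 appears there, the rest are made up.
G_nodes = {"a": "A", "b": "B", "c": "C", "d": "D", "e": "E"}
G_edges = [("a", "b", "x", 100), ("a", "c", "y", 200),
           ("b", "e", "p", 500), ("b", "d", "o", 600)]

# Graph G' from the slide; it contains three distinct nodes of type A.
G2_nodes = {"a": "A", "b": "B", "c": "C", "a2": "A", "a3": "A"}
G2_edges = [("a", "b", "x", 100), ("a", "c", "y", 200),
            ("b", "a2", "r", 650), ("b", "a3", "r", 700)]

z_G = shingles_k1(G_nodes, G_edges)
z_G2 = shingles_k1(G2_nodes, G2_edges)
print(dict(z_G))
print(dict(z_G2))
```

Nodes without outgoing edges contribute a shingle consisting of their type alone, which is why C, D and E appear as single-character shingles.
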
  23. Graph Comparison

    Cosine similarity between shingle vectors

    $$\mathrm{sim}(G, G') = \frac{z_G \cdot z_{G'}}{\lVert z_G \rVert \, \lVert z_{G'} \rVert}$$

    Over the shingles (A, C, D, E, AxByC, BrArA, BpEoD): zG = [0, 1, 1, 1, 1, 0, 1], zG’ = [2, 1, 0, 0, 1, 1, 0]
  24. Graph Comparison

    Cosine similarity between shingle vectors

    $$\mathrm{sim}(G, G') = \frac{z_G \cdot z_{G'}}{\lVert z_G \rVert \, \lVert z_{G'} \rVert}$$

    Issue
    • Shingle universe is large!
    • Shingle universe is unknown!
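
As a worked instance of the cosine formula, the following sketch evaluates it on the two shingle vectors from the slides, stored as sparse dictionaries. The function name and the choice to return 0.0 for an empty vector are assumptions of this example.

```python
import math


def cosine_similarity(z1, z2):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(c * z2.get(s, 0) for s, c in z1.items())
    n1 = math.sqrt(sum(c * c for c in z1.values()))
    n2 = math.sqrt(sum(c * c for c in z2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0  # convention chosen for this sketch
    return dot / (n1 * n2)


z_G = {"C": 1, "D": 1, "E": 1, "AxByC": 1, "BpEoD": 1}
z_G2 = {"A": 2, "C": 1, "AxByC": 1, "BrArA": 1}

sim = cosine_similarity(z_G, z_G2)
print(round(sim, 3))  # 2 / sqrt(5 * 7) = 2 / sqrt(35), about 0.338
```

Only the shingles C and AxByC are shared, so the dot product is 2, while the norms are sqrt(5) and sqrt(7).
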
  25. Compact Graph Representation

    Sketching graphs with SimHash

    Achlioptas, Dimitris. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 2003.
  27. Compact Graph Representation

    Sketching graphs with SimHash

    Probability[D1 and D2 are on the same side of R] ∝ cosine-sim(D1, D2)
  28. Compact Graph Representation

    Sketching graphs with SimHash

    Estimate Probability[D1 and D2 are on the same side of R] ∝ cosine-sim(D1, D2) by sampling random vectors r1, ..., rL
  29. Compact Graph Representation

    Sketching graphs with SimHash

    Compute L dot-products zG · r1, ..., zG · rL and store them in an L-element projection vector yG.

    Example (L = 3):
    zG = [2, 1, 0, 0, 1, 1, 0]
    r1 = [+1, +1, -1, -1, +1, -1, -1]
    r2 = [+1, -1, +1, +1, -1, -1, +1]
    r3 = [-1, +1, -1, +1, -1, +1, -1]
    yG = [zG · r1, zG · r2, zG · r3] = [+3, -1, -1]
  30. Compact Graph Representation

    Sketching graphs with SimHash

    Store the sign of each element of yG in an L-bit sketch xG.

    xG = sign(yG) = sign([+3, -1, -1]) = [+1, -1, -1]
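
The projection and sign steps can be checked directly against the numbers on these two slides. In this sketch the vectors r1, r2, r3 are copied from the slide rather than sampled at random, and `sign(0)` is mapped to +1, which matches the sketch shown for an all-zero projection vector later in the deck; the function names are illustrative.

```python
def project(z, R):
    """Projection vector y: one dot product per random vector in R."""
    return [sum(zi * ri for zi, ri in zip(z, r)) for r in R]


def sketch(y):
    """L-bit sketch x: the sign of each projection (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in y]


z_G = [2, 1, 0, 0, 1, 1, 0]
R = [
    [+1, +1, -1, -1, +1, -1, -1],  # r1
    [+1, -1, +1, +1, -1, -1, +1],  # r2
    [-1, +1, -1, +1, -1, +1, -1],  # r3
]

y_G = project(z_G, R)
x_G = sketch(y_G)
print(y_G, x_G)  # [3, -1, -1] [1, -1, -1]
```
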
  31. Compact Graph Representation

    Comparing graphs with SimHash

    sim(G, G’) ∝ fraction of bits that agree

    xG = [+1, -1, -1], xG’ = [+1, -1, +1]
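
Comparing two sketches reduces to counting agreeing bits. The sketch below does that for the two example sketches and, as an assumption that goes beyond the slide (which only states proportionality), converts the agreement rate into a cosine estimate using the usual SimHash relation P[bits agree] = 1 - θ/π, where θ is the angle between the original vectors.

```python
import math


def bit_agreement(x1, x2):
    """Fraction of sketch positions on which the two sketches agree."""
    if len(x1) != len(x2):
        raise ValueError("sketches must have the same length")
    return sum(a == b for a, b in zip(x1, x2)) / len(x1)


def estimated_cosine(x1, x2):
    """Cosine estimate from the agreement rate (standard SimHash relation)."""
    theta = math.pi * (1.0 - bit_agreement(x1, x2))
    return math.cos(theta)


x_G = [+1, -1, -1]
x_G2 = [+1, -1, +1]

agree = bit_agreement(x_G, x_G2)
print(agree)                        # 2 of 3 bits agree
print(estimated_cosine(x_G, x_G2))  # cos(pi / 3), i.e. about 0.5
```

With only L = 3 bits the estimate is very coarse; the talk's experiments use sketch sizes up to L = 1000.
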
  32. Compact Graph Representation

    zG = [2, 1, 0, 0, 1, 1, 0]; r1, r2, r3 as on slide 29; yG = [+3, -1, -1]; xG = [+1, -1, -1]
  33. Compact Graph Representation

    Add s3: zG = [2, 1, 1, 0, 1, 1, 0] (third entry incremented); the slide still shows yG = [+3, -1, -1] and xG = [+1, -1, -1]
  35. Compact Graph Representation

    zG = [2, 1, 0, 0, 1, 1, 0]; r1, r2, r3 as on slide 29; yG = [+3, -1, -1]; xG = [+1, -1, -1]

    • Shingle universe is large!
    • Shingle universe is unknown!
  36. Streaming Graph Representation

    StreamHash: Why cache when you can hash?

    Each entry of a random vector ri is replaced by a hash of the corresponding shingle: ri = [hi(s1), hi(s2), ..., hi(s7)], for example [+1, +1, -1, -1, +1, -1, -1]

    • hi(s): s → {+1, -1}
    • Store L StreamHash functions h1, ..., hL
  37. Streaming Graph Representation

    StreamHash: Why cache when you can hash?

    h1, ..., hL take the place of r1, ..., rL
    • Each function drawn from a universal hash family
    • Multilinear family: each function represented by |shingle|max integers
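
One way to realize a hash function h: shingle → {+1, -1} in the spirit of a multilinear family is sketched below. This is an illustrative reading, not the StreamSpot implementation: the 64-bit word size, the extra offset integer `m[0]` (so `max_len + 1` integers per function), the byte-level encoding and the use of the top bit as the sign are all assumptions of this example.

```python
import random

MASK64 = (1 << 64) - 1


def make_streamhash(max_len, rng):
    """Draw one hash function, represented by max_len + 1 random 64-bit integers."""
    m = [rng.getrandbits(64) for _ in range(max_len + 1)]

    def h(shingle: str) -> int:
        data = shingle.encode("utf-8")
        if len(data) > max_len:
            raise ValueError("shingle longer than max_len")
        acc = m[0]
        for i, byte in enumerate(data):
            # Multilinear form: m[0] + sum of m[i + 1] * byte_i, modulo 2**64.
            acc = (acc + m[i + 1] * byte) & MASK64
        # Use the most significant bit as the sign.
        return 1 if (acc >> 63) & 1 else -1

    return h


rng = random.Random(42)  # fixed seed so the example is repeatable
hashes = [make_streamhash(64, rng) for _ in range(3)]  # L = 3 functions
signs = [h("AxByC") for h in hashes]
print(signs)
```

Because the sign is computed from the shingle string itself, no random vector over the (large, unknown) shingle universe needs to be stored; only the L sets of integers are kept.
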
  38. Incremental Sketch Construction

    Add s3: zG = [2, 1, 1, 0, 1, 1, 0]; r1, r2, r3 as on slide 29; the slide shows yG = [+3, -1, -1] and xG = [+1, -1, -1]
  39. Incremental Sketch Construction

    Adding a shingle s3, L = 3

    h1(s3) = -1, h2(s3) = +1, h3(s3) = -1

    Old projection vector: yG = [0, 0, 0], with sketch xG = [+1, +1, +1]
    StreamHash update vector: [-1, +1, -1]
    New projection vector: yG = [0, 0, 0] + [-1, +1, -1] = [-1, +1, -1], with sketch xG = sign(yG) = [-1, +1, -1]
  40. Incremental Sketch Construction

    Removing a shingle s3, L = 3 (the slide title still reads "Adding", but the update vector is subtracted)

    h1(s3) = -1, h2(s3) = +1, h3(s3) = -1

    Old projection vector: yG = [-1, +1, -1], with sketch xG = [-1, +1, -1]
    StreamHash update vector: [-1, +1, -1]
    New projection vector: yG = [-1, +1, -1] - [-1, +1, -1] = [0, 0, 0], with sketch xG = sign(yG) = [+1, +1, +1]
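
The add and remove steps on these two slides amount to element-wise addition and subtraction of the update vector, followed by taking signs. The following sketch replays the slide's numbers; the function names are illustrative, and the update vector is hard-coded from the slide instead of being computed by real hash functions.

```python
def sign_sketch(y):
    """Sketch bits from a projection vector (0 maps to +1, as on the slide)."""
    return [1 if v >= 0 else -1 for v in y]


def add_shingle(y, update):
    """Projection vector after one more occurrence of the shingle."""
    return [a + u for a, u in zip(y, update)]


def remove_shingle(y, update):
    """Projection vector after one fewer occurrence of the shingle."""
    return [a - u for a, u in zip(y, update)]


update_s3 = [-1, +1, -1]  # (h1(s3), h2(s3), h3(s3)) from the slide

y_old = [0, 0, 0]
y_added = add_shingle(y_old, update_s3)
y_removed = remove_shingle(y_added, update_s3)

print(y_added, sign_sketch(y_added))      # [-1, 1, -1] [-1, 1, -1]
print(y_removed, sign_sketch(y_removed))  # [0, 0, 0] [1, 1, 1]
```
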
  41. Streaming Graph Representation

    On each new edge:
    • Construct the set of shingles to update
    • Hash the shingles to update
    • Update the projection vector and sketch

    Time per edge: O(L·|shingle|max + L) for k = 1
  42. Streaming Graph Representation

    On each new edge:
    • Construct the set of shingles to update
    • Hash the shingles to update
    • Update the projection vector and sketch

    Time per edge: O(L·|shingle|max + L) for k = 1
    • O(L·|shingle|max): hash L times
    • O(L): update sketch and projection
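
The three per-edge steps can be sketched for k = 1 as follows. This is a simplified reading under stated assumptions: edges arrive in timestamp order, a new edge only changes the shingle of its source node, there is no edge eviction, and the function `h` is a stand-in built on `hashlib.blake2b` rather than the multilinear StreamHash family. The class and method names are invented for the example.

```python
import hashlib

L = 8  # sketch size, kept small for the example


def h(l, shingle):
    """Stand-in for the l-th StreamHash function: string -> +1 or -1."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=1,
                             salt=l.to_bytes(8, "little")).digest()
    return 1 if digest[0] & 1 else -1


class StreamingGraph:
    def __init__(self):
        self.y = [0] * L   # projection vector
        self.shingle = {}  # node id -> current 1-hop shingle string

    def _apply(self, s, sign):
        # Hash the shingle L times and update the projection vector.
        for l in range(L):
            self.y[l] += sign * h(l, s)

    def _ensure(self, node, ntype):
        # A node seen for the first time contributes a shingle equal to its type.
        if node not in self.shingle:
            self.shingle[node] = ntype
            self._apply(ntype, +1)

    def add_edge(self, src, src_type, dst, dst_type, edge_type):
        self._ensure(src, src_type)
        self._ensure(dst, dst_type)
        old = self.shingle[src]
        new = old + edge_type + dst_type  # the source's shingle grows by one hop
        self._apply(old, -1)              # remove the outdated shingle
        self._apply(new, +1)              # add the updated shingle
        self.shingle[src] = new

    def sketch(self):
        return [1 if v >= 0 else -1 for v in self.y]


g = StreamingGraph()
g.add_edge("a", "A", "b", "B", "x")
g.add_edge("a", "A", "c", "C", "y")
g.add_edge("b", "B", "e", "E", "p")
g.add_edge("b", "B", "d", "D", "o")
print(sorted(g.shingle.values()))  # ['AxByC', 'BpEoD', 'C', 'D', 'E']
print(g.sketch())
```

After the four edges of the example graph G, the projection vector equals the one obtained by hashing the final shingle set in a single batch, which is the property the incremental update relies on.
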
  43. Streaming Graph Representation

    Space complexity: O(num_graphs × L + num_cached_edges)
  44. Streaming Graph Representation

    Space complexity: O(num_graphs × L + num_cached_edges)

    The number of cached edges is controlled by evicting the oldest edges first.
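
Oldest-first eviction under a fixed edge limit can be sketched with a double-ended queue. The `EdgeCache` class and its `add` method are hypothetical names for this example; how the evicted edge is then reflected in the affected graph's projection vector is not shown on the slide and is left out here.

```python
from collections import deque


class EdgeCache:
    """Keeps at most `limit` edges; the oldest edge is evicted first."""

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.edges = deque()

    def add(self, edge):
        """Cache an edge and return the evicted edge, or None if nothing was evicted."""
        evicted = None
        if len(self.edges) >= self.limit:
            evicted = self.edges.popleft()
        self.edges.append(edge)
        return evicted


cache = EdgeCache(limit=2)
evicted = [cache.add(e) for e in ["e1", "e2", "e3"]]
print(evicted)            # [None, None, 'e1']
print(list(cache.edges))  # ['e2', 'e3']
```

With the limit fixed, the `num_cached_edges` term in the space bound stays constant regardless of how long the stream runs.
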
  45. Clusters and Anomalies

    Figure: Cluster 1 contains graphs over nodes {A, B, E, C} (edges at t = 100, 300, 500) and {A, B, C} (edges at t = 200, 400); Cluster 2 contains a graph over nodes {C, D, E} (edges at t = 100, 200)

    • Bootstrap K clusters
    • Cluster centroid: "average" graph
    • Update clusters: constant time
    • Anomaly score: distance to nearest centroid
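
A rough sketch of centroid maintenance and anomaly scoring is given below. It assumes that the "average" graph is the mean of the member graphs' projection vectors and that distance is cosine distance on those vectors; the talk does not spell out either choice, so both are labelled assumptions, as are the `Cluster` class and its method names.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 1.0  # convention chosen for this sketch
    return 1.0 - dot / (nu * nv)


class Cluster:
    def __init__(self, L):
        self.total = [0.0] * L  # running sum of member projection vectors
        self.size = 0

    def add(self, y):
        self.total = [t + v for t, v in zip(self.total, y)]
        self.size += 1

    def update_member(self, y_old, y_new):
        # Cost is O(L), independent of how many graphs the cluster holds.
        self.total = [t - o + n for t, o, n in zip(self.total, y_old, y_new)]

    def centroid(self):
        return [t / self.size for t in self.total]


def anomaly_score(y, clusters):
    """Distance from y to the nearest non-empty cluster centroid."""
    return min(cosine_distance(y, c.centroid()) for c in clusters if c.size > 0)


c1 = Cluster(3)
c1.add([3, -1, -1])
c1.add([1, -1, -1])
c2 = Cluster(3)
c2.add([-2, 2, 0])

score = anomaly_score([2, -1, -1], [c1, c2])
print(c1.centroid(), score)
```

Here the query vector coincides with the centroid of the first cluster, so its score is numerically zero; a graph far from every centroid would receive a score closer to the maximum.
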
  46. Outline

    • Motivation
    • Problem
    • Related Work
    • Overview
    • Details
    • Evaluation
    • Summary
  47. Evaluation

    • How accurate?
    • How fast?
    • Performance as parameters are configured?
  48. Experiment Setup: Datasets

    5 benign browser-based scenarios (3 datasets)

    2 malicious browser-based scenarios
    • Flash Player drive-by download (CVE-2015-5119)*
    • JRE untrusted code execution (CVE-2012-4681)

    Table 1: Dataset summary: Training scenarios and test edges (attack + 25% ben…

    | Dataset | Scenarios | # Graphs | Avg. \|V\| | Avg. \|E\| |
    |---|---|---|---|---|
    | YDC | YouTube, Download, CNN | 300 | 8705 | 239648 |
    | GFC | GMail, VGame, CNN | 300 | 8151 | 148414 |
    | ALL | YouTube, Download, CNN, GMail, VGame | 500 | 8315 | 173857 |

    [Figure 4: Distribution of pairwise cosine distances; panels (a) YDC, (b) GFC, (c) ALL]
  49. Experiment Setup: Evaluation Settings and Metrics

    • Offline
      • Assume all graphs present offline
      • Compute precision-recall and ROC curves
    • Online
      • One edge at a time
      • Control no. of simultaneous graphs
      • Instantaneous AP and AUC
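
For readers unfamiliar with the two metrics, here is a small self-contained sketch of how AP and AUC can be computed from anomaly scores and ground-truth labels. It is a plain-Python illustration, not the evaluation code used in the talk: AP is computed as the mean precision at the rank of each true anomaly (a common step-wise approximation of the area under the PR curve), and AUC as the probability that a random anomaly outranks a random benign graph. The scores and labels are made up.

```python
def roc_auc(scores, labels):
    """Probability that a positive scores higher than a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of precision@rank over the ranks at which positives appear."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Hypothetical anomaly scores; label 1 marks an attack graph.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 0, 0, 1, 0]

auc = roc_auc(scores, labels)
ap = average_precision(scores, labels)
print(auc, ap)  # 4/6 (about 0.667) and 0.75
```

In the online setting these would be recomputed from the current scores of all graphs at regular points in the stream, which is what "instantaneous" refers to.
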
  50. Experiment Setup: Offline Evaluation

    Precision-Recall Curves
    • Compared to iForest with graph structural features (the excerpt lists, among others, diameter, density and number of nodes/edges)
    • Training data percentage = 25%, curves averaged over 10 samples

    Paper excerpt: "The curves (averaged over 10 independent random samples) for all the datasets are shown in Figure 7. Note that even with 25% of the data, static StreamSpot is effective in correctly ranking the attack graphs and achieves an average precision (AP, area under the PR curve) of more than 0.9 and a near-ideal AUC (area under ROC curve)."

    [Figure 7: (top) Precision-Recall (PR) and (bottom) ROC curves averaged over 10 samples (p = 25%); panels (a) YDC, (b) GFC, (c) ALL; series: Static StreamSpot, iForest]
  51. Experiment Setup: Evaluation Settings and Metrics

    • Offline
      • Assume all graphs present offline
      • Compute precision-recall and ROC curves
    • Online
      • One edge at a time
      • Control no. of simultaneous graphs
      • Instantaneous AP and AUC
  52. Experiment Setup: Online Evaluation

    Sketch size L = 1000, 50 simultaneous scenarios

    [Figure 9: Performance of StreamSpot at different instants of the stream for all datasets (L = 1000); panels (a) YDC, (b) GFC, (c) ALL; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AUC, AP]
  53. Experiment Setup: Online Evaluation

    Sketch size L = 1000, 50 simultaneous scenarios

    • AP and AUC drop when a "new set" of graphs starts growing

    [Figure 9: Performance of StreamSpot at different instants of the stream for all datasets (L = 1000); panels (a) YDC, (b) GFC, (c) ALL]
  54. Experiment Setup: Online Evaluation

    Sketch size L = 1000, 50 simultaneous scenarios

    • AP and AUC drop when a "new set" of graphs starts growing
    • But recover quickly!

    [Figure 9: Performance of StreamSpot at different instants of the stream for all datasets (L = 1000); panels (a) YDC, (b) GFC, (c) ALL]
  55. Experiment Setup: Online Evaluation: Sketch Size L

    50 simultaneous scenarios; panels for L = 1000, L = 100 and L = 10

    [Figure 9 (L = 1000) and Figure 11: Performance of StreamSpot on ALL for different values of the sketch size; panels (a) L = 100, (b) L = 10]
  56. Experiment Setup: Online Evaluation: Edge Limit N

    • Sketch size L = 1000, 100 simultaneous scenarios
    • N = 15% and N = 10%, with N as a percentage of the "full" stream size of ~25M edges

    [Figure 12: Performance of StreamSpot on ALL (L = 1000); panels (a) Limit = 15%, (b) Limit = 10%; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AP, AUC]
  57. Experiment Setup: Online Evaluation: Memory Usage

    • "Full" stream size of ~25M edges
    • Recorded at the end of the stream
    • N = 15%: ~75 MB

    [Plot: Memory usage (MB) vs. Edge limit; series: Graphs]
    [Plot: Memory usage (KB) vs. Edge limit; series: Sketches, Projection Vectors]
  58. Experiment Setup: Online Evaluation: Running Time

    • Per-edge time < 70 microseconds
    • L = 1000: >10,000 edges/second
    • L = 100: >100,000 edges/second
    • Intel Xeon(R) at 2.1 GHz with 1 TB RAM
  59. Outline

    • Motivation
    • Problem
    • Related Work
    • Overview
    • Details
    • Evaluation
    • Summary
  60. Summary

    • Data to streaming heterogeneous graphs
    • Shingles + StreamHash (h1, ..., hL): streaming, bounded-space graph representation and comparison

    Challenges
    • Accurate: AP ~ 1.0
    • Fast: >100,000 edges/s
    • Incremental
    • Bounded space

    Code/data at bit.ly/streamspot
  61. Overview

    • Centroid-based clustering and anomaly scoring (figure: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Graph G)
    • Anomaly score: distance to nearest cluster centroid
    • Constant-space, streaming graph representation and comparison: Shingling + Sketching + Hashing (h1, ..., hL); yG = [+1, +1, -1], xG = [+1, +1, -1]
  62. Faster Hashing

    Lemire, Daniel, and Owen Kaser. "Strongly universal string hashing is fast." The Computer Journal, 2013.

    • 0.5 CPU cycles/byte
    • Observed shingle size ~50 bytes
    • ~25 ns/shingle
    • StreamSpot's straightforward multilinear hash: ~70 µs/shingle
    • Possible to go 1000x faster!
    • Observed input rate: 500,000 events/second
    • Intel Xeon(R) at 2.1 GHz with 1 TB RAM
  63. Clusters and Anomalies

    Figure: the Cluster 1 / Cluster 2 example from slide 45

    • Time complexity: O(KL) per edge with K clusters
    • Space complexity: O(c + KL) with c graphs in memory (controlled by edge limit N)
  64. Experiment Setup: Offline Evaluation

    ROC
    • Compared to iForest with graph structural features
    • Training data percentage = 25%, curves averaged over 10 samples

    Paper excerpt: "... with 25% of the data, static StreamSpot is effective in correctly ranking the attack graphs and achieves an average precision (AP, area under the PR curve) of more than 0.9 and a near-ideal AUC (area under ROC curve). [...] Finally, in Figure 8 we show how the AP and AUC change as the training data percentage p is varied from p = 10% to p = 90%."

    [Figure 7: (top) Precision-Recall (PR) and (bottom) ROC curves averaged over 10 samples (p = 25%); panels (a) YDC, (b) GFC, (c) ALL; series: Static StreamSpot, iForest]
  65. Experiment Setup: Online Evaluation

    Sketch size L = 1000, 50 simultaneous scenarios

    • AP and AUC drop when a "new set" of graphs starts growing
    • But recover quickly!
    • Accuracy is based on cluster assignments and is poor due to "permissive" clusters

    [Figure 9: Performance of StreamSpot at different instants of the stream for all datasets (L = 1000); panels (a) YDC, (b) GFC, (c) ALL; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AUC, AP]
  66. Experiment Setup: Online Evaluation (CVE-2012)

    Sketch size L = 1000, 50 simultaneous scenarios
  67. Experiment Setup: Online Evaluation: Simultaneous Scenarios

    • Sketch size L = 1000
    • Panels: 20 simultaneous scenarios, 100 simultaneous scenarios
    • Slower recovery with more simultaneous scenarios

    Paper excerpt: "In all experiments, 75% of the benign graphs were used for bootstrap clustering."

    [Figure 10: StreamSpot performance on ALL (measured at every 10K edges) when (left) B = 20 graphs and (right) B = 100 graphs arrive simultaneously; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AP, AUC]
  68. Experiment Setup: Online Evaluation: Memory Usage

    • "Full" stream size of ~25M edges
    • Recorded at the end of the stream

    [Plot: Memory usage (MB) vs. Edge limit; series: Graphs]
    [Plot: Memory usage (KB) vs. Edge limit; series: Sketches, Projection Vectors]