Slide 1

Slide 1 text

Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs
Emaad Manzoor (Stony Brook University)
Sadegh M. Milajerdi (University of Illinois at Chicago)
Leman Akoglu (Stony Brook University)

Slide 2

Slide 2 text

Fast Memory-efficient Anomaly Detection in Streaming Heterogeneous Graphs / StreamSpot bit.ly/streamspot
StreamSpot tracks anomalous heterogeneous graph objects as they evolve from a stream of typed edges.
[Figure: timeline of typed edges (B, A) at t = 100, (E, B) at t = 200, (C, A) at t = 300, (E, F) at t = 400]

Slide 3

Slide 3 text

Stream of Typed Edges
[Figure: edge stream (B, A) at t = 100, (E, B) at t = 200, (C, A) at t = 300, (E, F) at t = 400]

Slide 4

Slide 4 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
[Figure: edge stream (B, A) at t = 100, (E, B) at t = 200, (C, A) at t = 300, (E, F) at t = 400; snapshot at t = 100: graph 1 has edge (B, A, 100)]

Slide 5

Slide 5 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
[Figure: edge stream (B, A) at t = 100, (E, B) at t = 200, (C, A) at t = 300, (E, F) at t = 400; snapshot at t = 200: graph 1 has edge (B, A, 100), graph 2 has edge (E, B, 200)]

Slide 6

Slide 6 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
[Figure: edge stream (B, A) at t = 100, (E, B) at t = 200, (C, A) at t = 300, (E, F) at t = 400; snapshot at t = 300: graph 1 has edges (B, A, 100) and (C, A, 300), graph 2 has edge (E, B, 200)]

Slide 7

Slide 7 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
[Figure: edge stream (B, A) at t = 100, (E, B) at t = 200, (C, A) at t = 300, (E, F) at t = 400; snapshot at t = 400: graph 1 has edges (B, A, 100) and (C, A, 300), graph 2 has edges (E, B, 200) and (E, F, 400)]

Slide 8

Slide 8 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
Anomaly Scores
[Figure: the edge stream and the graph snapshots from t = 100 to t = 400; each snapshot is to be annotated with an anomaly score]

Slide 9

Slide 9 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
Anomaly Scores
[Figure: the edge stream and the graph snapshots from t = 100 to t = 400; score shown so far: s = 0.2]

Slide 10

Slide 10 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
Anomaly Scores
[Figure: the edge stream and the graph snapshots from t = 100 to t = 400; scores shown so far: s = 0.2, 0.2, 0.4]

Slide 11

Slide 11 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
Anomaly Scores
[Figure: the edge stream and the graph snapshots from t = 100 to t = 400; scores shown so far: s = 0.2, 0.2, 0.4, 0.4, 0.6]

Slide 12

Slide 12 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
Anomaly Scores
[Figure: the edge stream and the graph snapshots from t = 100 to t = 400; scores shown: s = 0.2, 0.2, 0.4, 0.4, 0.1, 0.6, 0.6]

Slide 13

Slide 13 text

Stream of Typed Edges
Evolving Heterogeneous Graphs
Anomaly Scores
[Figure: the edge stream and the graph snapshots from t = 100 to t = 400; scores shown: s = 0.2, 0.2, 0.4, 0.6, 0.4, 0.6, 0.1]
Challenges
• Fast
• Incremental
• Bounded space
• Accurate

Slide 14

Slide 14 text

Motivation I
Realtime detection of malicious/compromised application software (TRANSPARENT COMPUTING)
Input: Stream of system calls
Goal: Flag suspicious applications quickly, accurately and in real-time

time  subject     type  object          flow
100   proc/10639  fork  proc/10640      1
200   proc/10640  exec  file/“/bin/sh”  1
300   proc/10650  read  file/stdin      2
400   proc/10640  stat  mem/bfc5598     1
500   proc/10660  read  sock/0.0.0.0    2
…     …           …     …               …

Slide 15

Slide 15 text

Motivation I
Realtime detection of malicious/compromised application software (TRANSPARENT COMPUTING)
Input: Stream of system events
Goal: Flag suspicious applications quickly, accurately and in real-time

time  subject     event  object          flow
100   proc/10639  fork   proc/10640      1
200   proc/10640  exec   file/“/bin/sh”  1
300   proc/10650  read   file/stdin      2
400   proc/10640  stat   mem/bfc5598     1
500   proc/10660  read   sock/0.0.0.0    2
…     …           …      …               …

[Figure: one graph per flow.
FLOW 1: proc 10639 <100, fork> proc 10640; proc 10640 <200, execve> file /bin/sh; proc 10640 <400, stat> mem 0xbfc5598
FLOW 2: proc 10650 <300, read> file stdin; proc 10660 <500, read> sock 0.0.0.0]
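The conversion from an event table to per-flow graphs can be sketched as follows. This is a minimal illustration, not StreamSpot's actual ingestion code: the tuple layout, the `node_type` helper and the choice to use the text before the first "/" as the node type are assumptions made for the example.

```python
from collections import defaultdict

# Rows copied from the slide's table: (time, subject, event, object, flow).
EVENTS = [
    (100, "proc/10639", "fork", "proc/10640", 1),
    (200, "proc/10640", "exec", 'file/"/bin/sh"', 1),
    (300, "proc/10650", "read", "file/stdin", 2),
    (400, "proc/10640", "stat", "mem/bfc5598", 1),
    (500, "proc/10660", "read", "sock/0.0.0.0", 2),
]


def node_type(node):
    """Hypothetical helper: treat the prefix before the first '/' as the type."""
    return node.split("/", 1)[0]


def group_by_flow(events):
    """Turn a stream of typed events into one typed edge list per flow."""
    graphs = defaultdict(list)
    for time, subject, event, obj, flow in events:
        edge = (node_type(subject), subject, event, node_type(obj), obj, time)
        graphs[flow].append(edge)
    return dict(graphs)


graphs = group_by_flow(EVENTS)
for flow, edges in sorted(graphs.items()):
    print(flow, edges)
```

Each flow identifier ends up with its own timestamped, typed edge list, which is the "evolving heterogeneous graph" the later slides sketch and score.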

Slide 16

Slide 16 text

Motivation II
Realtime detection of anomalous users from filesystem navigation
Input: Stream of filesystem navigation traces
Goal: Flag suspicious users quickly, accurately and in real-time

time  source             destination        user
100   /streamspot        /streamspot/paper  1
200   /streamspot/paper  /streamspot/code   1
300   /streamspot        /streamspot/code   2
400   /streamspot/code   /streamspot/data   1
500   /streamspot        /streamspot/data   2
…     …                  …                  …

Slide 17

Slide 17 text

Motivation II
Realtime detection of anomalous users from filesystem navigation
Input: Stream of filesystem navigation traces
Goal: Flag suspicious users quickly, accurately and in real-time

time  source             destination        user
100   /streamspot        /streamspot/paper  1
200   /streamspot/paper  /streamspot/code   1
300   /streamspot        /streamspot/code   2
400   /streamspot/code   /streamspot/data   1
500   /streamspot        /streamspot/data   2
…     …                  …                  …

[Figure: one graph per user.
USER 1: /streamspot <100> /streamspot/paper <200> /streamspot/code <400> /streamspot/data
USER 2: /streamspot <300> /streamspot/code; /streamspot <500> /streamspot/data]

Slide 18

Slide 18 text

Related Work
Prior approaches are labelled on the slide as: graph at a time (two of the works) and edge at a time, connectivity anomalies (one work).
• Aggarwal, Charu C., Yuchen Zhao, and Philip S. Yu. Outlier detection in graph streams. ICDE 2011.
• Aggarwal, Charu C., Yuchen Zhao, and Philip S. Yu. On Clustering Graph Streams. SDM 2010.
• Kostakis, Orestis. Classy: fast clustering streams of call-graphs. Data Mining and Knowledge Discovery 2014.
StreamSpot
• Operates one edge at a time
• Multiple simultaneously evolving graphs
• Embraces heterogeneity in node/edge types

Slide 19

Slide 19 text

StreamSpot in a Nutshell
• Data to Streaming Heterogeneous Graphs
• Graph Representation/Comparison: streaming and bounded space (projection vector yG = [+3, -1, -1], sketch xG = [+1, -1, -1])
• Clustering-based Anomaly Detection
• Anomaly Score: distance to nearest cluster centroid

Slide 20

Slide 20 text

Outline
• Problem
• Motivation
• Related Work
• Method Overview
• Method Details
• Evaluation
• Summary

Slide 21

Slide 21 text

Graph Representation
Graphs to vectors via shingling
k-shingle strings are constructed by an Ordered k-hop Breadth-First Traversal (OkBFT) from each node.
[Figure: example graph with nodes A, B, C, D, E and an edge labelled <600,o>]
k-shingles from each node for k = 1 (source node types in bold): AxByC, BpEoD, C, D, E
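A rough sketch of 1-hop shingling for the example graph is given below. Only the edge <600,o> and the resulting shingle strings appear on the slide; the other timestamps, the lowercase node ids, and the rule of visiting outgoing edges in timestamp order are assumptions made so that the example reproduces the shingles AxByC and BpEoD.

```python
from collections import Counter, defaultdict

# Node id -> node type. The ids are hypothetical; the types follow the slide.
NODES = {"a": "A", "b": "B", "c": "C", "d": "D", "e": "E"}

# (source, edge type, destination, timestamp). Only <600,o> is from the slide;
# the remaining timestamps are made up for illustration.
EDGES = [
    ("a", "x", "b", 100),
    ("a", "y", "c", 200),
    ("b", "p", "e", 500),
    ("b", "o", "d", 600),
]


def one_hop_shingles(nodes, edges):
    """Return the multiset of k = 1 shingles, one shingle per node."""
    outgoing = defaultdict(list)
    for src, etype, dst, time in edges:
        outgoing[src].append((time, etype, dst))

    shingles = Counter()
    for node, ntype in nodes.items():
        shingle = ntype  # the shingle starts with the source node type
        # Assumed ordering: outgoing edges are visited by timestamp.
        for _, etype, dst in sorted(outgoing[node]):
            shingle += etype + nodes[dst]
        shingles[shingle] += 1
    return shingles


shingles = one_hop_shingles(NODES, EDGES)
print(shingles)
```

Nodes without outgoing edges contribute a shingle that is just their type, which is why C, D and E appear as shingles on their own.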

Slide 22

Slide 22 text

Graph Representation
Graphs to vectors via shingling
The shingle (frequency) vector for each graph G contains the frequencies of each k-shingle in G.
[Figure: Graph G with an edge labelled <600,o> and shingles AxByC, BpEoD, C, D, E; Graph G’ with an edge labelled <650,r> and shingles AxByC, BrArA, C, A, A]

shingle  zG  zG’
A        0   2
C        1   1
D        1   0
E        1   0
AxByC    1   1
BrArA    0   1
BpEoD    1   0

Slide 23

Slide 23 text

Graph Comparison
Cosine similarity between shingle vectors
$\mathrm{sim}(G, G') = \frac{z_G \cdot z_{G'}}{\lVert z_G \rVert \, \lVert z_{G'} \rVert}$

shingle  zG  zG’
A        0   2
C        1   1
D        1   0
E        1   0
AxByC    1   1
BrArA    0   1
BpEoD    1   0
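As a worked example, the cosine similarity of the two shingle vectors in the table can be computed directly. The sketch below stores each vector as a `Counter` keyed by shingle string; that representation and the function name `cosine` are choices made for illustration.

```python
import math
from collections import Counter


def cosine(za, zb):
    """Cosine similarity between two sparse shingle-frequency vectors."""
    dot = sum(count * zb[shingle] for shingle, count in za.items())
    norm_a = math.sqrt(sum(c * c for c in za.values()))
    norm_b = math.sqrt(sum(c * c for c in zb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Non-zero entries of zG and zG' from the slide's table.
z_G = Counter({"C": 1, "D": 1, "E": 1, "AxByC": 1, "BpEoD": 1})
z_G2 = Counter({"A": 2, "C": 1, "AxByC": 1, "BrArA": 1})

sim = cosine(z_G, z_G2)
print(round(sim, 3))
```

The two graphs share only the shingles C and AxByC, so the dot product is 2, the norms are sqrt(5) and sqrt(7), and the similarity is 2 / sqrt(35), roughly 0.34.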

Slide 24

Slide 24 text

Graph Comparison
Cosine similarity between shingle vectors
$\mathrm{sim}(G, G') = \frac{z_G \cdot z_{G'}}{\lVert z_G \rVert \, \lVert z_{G'} \rVert}$

shingle  zG  zG’
A        0   2
C        1   1
D        1   0
E        1   0
AxByC    1   1
BrArA    0   1
BpEoD    1   0

Issue
• Shingle universe is large!
• Shingle universe is unknown!

Slide 25

Slide 25 text

Compact Graph Representation
Sketching graphs with SimHash
Achlioptas, Dimitris. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences. 2003.

Slide 26

Slide 26 text

Compact Graph Representation
Sketching graphs with SimHash

Slide 27

Slide 27 text

Compact Graph Representation
Sketching graphs with SimHash
[Figure: a random hyperplane R; pairs of vectors lying on the SAME SIDE of R]
Probability[D1 and D2 are on the same side of R] ∝ cosine-sim(D1, D2)

Slide 28

Slide 28 text

Compact Graph Representation
Sketching graphs with SimHash
Estimate Probability[D1 and D2 are on the same side of R] ∝ cosine-sim(D1, D2) by sampling r1, …, rL

Slide 29

Slide 29 text

Compact Graph Representation
Sketching graphs with SimHash
Compute L dot-products zG · r1, …, zG · rL and store them in an L-element projection vector yG.
zG = [2, 1, 0, 0, 1, 1, 0]
r1 = [+1, +1, -1, -1, +1, -1, -1]
r2 = [+1, -1, +1, +1, -1, -1, +1]
r3 = [-1, +1, -1, +1, -1, +1, -1]
yG = [+3, -1, -1]

Slide 30

Slide 30 text

Compact Graph Representation
Sketching graphs with SimHash
Store the sign of each element of yG in an L-bit sketch xG.
zG = [2, 1, 0, 0, 1, 1, 0]
r1 = [+1, +1, -1, -1, +1, -1, -1]
r2 = [+1, -1, +1, +1, -1, -1, +1]
r3 = [-1, +1, -1, +1, -1, +1, -1]
xG = sign(yG) = sign([+3, -1, -1]) = [+1, -1, -1]
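The projection and sign steps can be checked with the numbers on the slide. The sketch below uses plain Python lists; the function names are illustrative, and mapping a zero projection to +1 follows the convention shown on the later incremental-update slides.

```python
# Shingle vector and random +1/-1 vectors copied from the slide (L = 3).
Z_G = [2, 1, 0, 0, 1, 1, 0]
R = [
    [+1, +1, -1, -1, +1, -1, -1],  # r1
    [+1, -1, +1, +1, -1, -1, +1],  # r2
    [-1, +1, -1, +1, -1, +1, -1],  # r3
]


def project(z, random_vectors):
    """Projection vector y: one dot product z . r_i per random vector."""
    return [sum(zi * ri for zi, ri in zip(z, r)) for r in random_vectors]


def sketch(y):
    """L-bit sketch x: the sign of each projection (zero maps to +1)."""
    return [1 if value >= 0 else -1 for value in y]


y_G = project(Z_G, R)
x_G = sketch(y_G)
print(y_G, x_G)
```

Running this reproduces yG = [+3, -1, -1] and xG = [+1, -1, -1] from the slide.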

Slide 31

Slide 31 text

Compact Graph Representation
Comparing graphs with SimHash
Fraction of bits that agree between xG and xG’ ∝ sim(G, G’)
xG = [+1, -1, -1]
xG’ = [+1, -1, +1]
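Comparing two sketches reduces to counting agreeing positions. The sketch below also converts the agreement fraction into a cosine estimate using the standard SimHash relation P[bits agree] = 1 − θ/π, where θ is the angle between the original vectors; that conversion is not shown on the slide and is included here as an assumption about how one might use the fraction.

```python
import math


def agreement(xa, xb):
    """Fraction of sketch positions on which the two sketches agree."""
    if len(xa) != len(xb):
        raise ValueError("sketches must have the same length")
    return sum(a == b for a, b in zip(xa, xb)) / len(xa)


def estimated_cosine(xa, xb):
    """Cosine estimate from P[agree] = 1 - theta / pi (standard SimHash)."""
    theta = math.pi * (1.0 - agreement(xa, xb))
    return math.cos(theta)


# Sketches from the slide.
x_G = [+1, -1, -1]
x_G2 = [+1, -1, +1]

frac = agreement(x_G, x_G2)
est = estimated_cosine(x_G, x_G2)
print(frac, est)
```

With only L = 3 bits the estimate is very coarse; the evaluation slides use much larger sketches such as L = 1000.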

Slide 32

Slide 32 text

Compact Graph Representation
zG = [2, 1, 0, 0, 1, 1, 0]
r1 = [+1, +1, -1, -1, +1, -1, -1]
r2 = [+1, -1, +1, +1, -1, -1, +1]
r3 = [-1, +1, -1, +1, -1, +1, -1]
yG = [+3, -1, -1], xG = [+1, -1, -1]

Slide 33

Slide 33 text

Compact Graph Representation
Add s3: zG = [2, 1, 1, 0, 1, 1, 0]
r1 = [+1, +1, -1, -1, +1, -1, -1]
r2 = [+1, -1, +1, +1, -1, -1, +1]
r3 = [-1, +1, -1, +1, -1, +1, -1]
yG = [+3, -1, -1], xG = [+1, -1, -1] (values shown before the update)

Slide 34

Slide 34 text

Compact Graph Representation
Add s3: zG = [2, 1, 1, 0, 1, 1, 0]
r1 = [+1, +1, -1, -1, +1, -1, -1]
r2 = [+1, -1, +1, +1, -1, -1, +1]
r3 = [-1, +1, -1, +1, -1, +1, -1]
yG = [+3, -1, -1], xG = [+1, -1, -1] (values shown before the update)

Slide 35

Slide 35 text

Compact Graph Representation
zG = [2, 1, 0, 0, 1, 1, 0]
r1 = [+1, +1, -1, -1, +1, -1, -1]
r2 = [+1, -1, +1, +1, -1, -1, +1]
r3 = [-1, +1, -1, +1, -1, +1, -1]
yG = [+3, -1, -1], xG = [+1, -1, -1]
• Shingle universe is large!
• Shingle universe is unknown!

Slide 36

Slide 36 text

Streaming Graph Representation
StreamHash: why cache when you can hash?
[Figure: the entries of a random vector ri = [+1, +1, -1, -1, +1, -1, -1] shown as hash values hi(s1), …, hi(s7) of the shingles]
• hi(s): s → {+1, -1}
• Store L StreamHash functions h1, …, hL

Slide 37

Slide 37 text

Streaming Graph Representation
StreamHash: why cache when you can hash?
[Figure: random vectors r1, …, rL shown alongside hash functions h1, …, hL]
• Each function drawn from a universal hash family
• Multilinear family: each function represented by |shingle|max integers
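One way to realise such a ±1 hash is sketched below. It is an illustrative multilinear-style hash, not StreamSpot's implementation: the class name `StreamHash`, the 64-bit arithmetic, the extra constant coefficient, the UTF-8 encoding of shingles and the use of the top bit as the sign are all assumptions.

```python
import random

MASK64 = (1 << 64) - 1


class StreamHash:
    """Maps a shingle string to +1 or -1 with a multilinear-style hash."""

    def __init__(self, max_len, seed):
        rng = random.Random(seed)
        # One random integer per shingle position, plus a constant term.
        self.coeffs = [rng.getrandbits(64) for _ in range(max_len + 1)]

    def __call__(self, shingle):
        data = shingle.encode("utf-8")
        if len(data) > len(self.coeffs) - 1:
            raise ValueError("shingle longer than the configured maximum")
        acc = self.coeffs[0]
        for byte, coeff in zip(data, self.coeffs[1:]):
            acc = (acc + coeff * byte) & MASK64
        # Assumed convention: the most significant bit decides the sign.
        return 1 if acc >> 63 else -1


def make_family(L, max_len, seed=0):
    """Draw L independent hash functions h_1, ..., h_L."""
    rng = random.Random(seed)
    return [StreamHash(max_len, rng.getrandbits(32)) for _ in range(L)]


family = make_family(L=3, max_len=64, seed=7)
print([h("AxByC") for h in family])
```

Because each function is determined by its stored integers, the value hi(s) for any shingle can be recomputed on demand, so no random vector over the (unknown) shingle universe has to be kept in memory.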

Slide 38

Slide 38 text

Incremental Sketch Construction
Add s3: zG = [2, 1, 1, 0, 1, 1, 0]
r1 = [+1, +1, -1, -1, +1, -1, -1]
r2 = [+1, -1, +1, +1, -1, -1, +1]
r3 = [-1, +1, -1, +1, -1, +1, -1]
yG = [+3, -1, -1], xG = [+1, -1, -1] (values shown before the update)

Slide 39

Slide 39 text

Incremental Sketch Construction
Adding a shingle s3, L = 3
h1(s3) = -1, h2(s3) = +1, h3(s3) = -1
Old projection vector: yG = [0, 0, 0], xG = [+1, +1, +1]
StreamHash update vector: [-1, +1, -1]
New projection vector: yG = [0, 0, 0] + [-1, +1, -1] = [-1, +1, -1], xG = sign(yG) = [-1, +1, -1]

Slide 40

Slide 40 text

Incremental Sketch Construction
Removing a shingle s3, L = 3
h1(s3) = -1, h2(s3) = +1, h3(s3) = -1
Old projection vector: yG = [-1, +1, -1], xG = [-1, +1, -1]
StreamHash update vector: [-1, +1, -1]
New projection vector: yG = [-1, +1, -1] - [-1, +1, -1] = [0, 0, 0], xG = sign(yG) = [+1, +1, +1]
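The two update slides can be replayed with a few lines of code. This is a minimal sketch: the function name `update` and the `direction` argument (+1 to add a shingle occurrence, -1 to remove one) are illustrative, and the hash values of s3 are taken from the slide rather than computed.

```python
def sign(value):
    # As on the slide, a zero projection maps to +1.
    return 1 if value >= 0 else -1


def update(y, shingle_hashes, direction):
    """Add (direction=+1) or remove (direction=-1) one shingle occurrence.

    shingle_hashes holds h_1(s), ..., h_L(s) for the shingle being updated.
    Returns the new projection vector and the new sketch.
    """
    new_y = [yi + direction * hi for yi, hi in zip(y, shingle_hashes)]
    return new_y, [sign(v) for v in new_y]


H_S3 = [-1, +1, -1]  # h1(s3), h2(s3), h3(s3) from the slide

y_added, x_added = update([0, 0, 0], H_S3, +1)
y_removed, x_removed = update(y_added, H_S3, -1)
print(y_added, x_added)
print(y_removed, x_removed)
```

Only the L-element projection vector is touched; the full shingle vector zG never needs to be materialised.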

Slide 41

Slide 41 text

Streaming Graph Representation
On each new edge:
• Construct the set of shingles to update
• Hash the shingles to update
• Update the projection vector and sketch
Time per edge: O(L|shingle|max + L) for k = 1

Slide 42

Slide 42 text

Streaming Graph Representation
On each new edge:
• Construct the set of shingles to update
• Hash the shingles to update
• Update the projection vector and sketch
Time per edge: O(L|shingle|max + L) for k = 1, where the L|shingle|max term covers hashing L times and the L term covers updating the sketch and projection vector.
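The three per-edge steps can be put together in a small sketch for k = 1. Several things here are assumptions for illustration: edges are taken to arrive in timestamp order (so a new edge simply extends the source node's shingle), the hash `h` is a SHA-256 stand-in rather than the multilinear family, and the class and method names are hypothetical.

```python
import hashlib

L = 8  # sketch size, kept small for the example


def h(i, shingle):
    """Stand-in for the i-th +1/-1 hash function (not the multilinear hash)."""
    digest = hashlib.sha256(f"{i}:{shingle}".encode("utf-8")).digest()
    return 1 if digest[0] & 1 else -1


class GraphSketch:
    def __init__(self):
        self.shingle = {}   # node id -> current 1-hop shingle string
        self.y = [0] * L    # projection vector

    def _apply(self, shingle, direction):
        for i in range(L):
            self.y[i] += direction * h(i, shingle)

    def _ensure(self, node, ntype):
        # A node seen for the first time contributes its bare type.
        if node not in self.shingle:
            self.shingle[node] = ntype
            self._apply(ntype, +1)

    def add_edge(self, src, src_type, etype, dst, dst_type):
        self._ensure(src, src_type)
        self._ensure(dst, dst_type)
        old = self.shingle[src]
        new = old + etype + dst_type   # step 1: shingles to update
        self._apply(old, -1)           # steps 2 and 3: hash, then update y
        self._apply(new, +1)
        self.shingle[src] = new

    @property
    def x(self):
        return [1 if v >= 0 else -1 for v in self.y]


g = GraphSketch()
for edge in [("a", "A", "x", "b", "B"), ("a", "A", "y", "c", "C"),
             ("b", "B", "p", "e", "E"), ("b", "B", "o", "d", "D")]:
    g.add_edge(*edge)
print(sorted(g.shingle.values()), g.y, g.x)
```

After the four edges the stored shingles are AxByC, BpEoD, C, D and E, and the projection vector equals what a batch computation over those shingles would give, since every update is a linear add or subtract.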

Slide 43

Slide 43 text

Streaming Graph Representation
Space Complexity: O(num_graphs x L + num_cached_edges)

Slide 44

Slide 44 text

Streaming Graph Representation
Space Complexity: O(num_graphs x L + num_cached_edges)
Control num_cached_edges by evicting edges oldest-first.
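Oldest-first eviction can be illustrated with a bounded FIFO cache. The class name `EdgeCache` and its interface are hypothetical; in a full system the caller would presumably also undo the evicted edge's contribution to its graph's shingles and projection vector, which is not shown here.

```python
from collections import deque


class EdgeCache:
    """Keeps at most `limit` edges, evicting the oldest one first."""

    def __init__(self, limit):
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.edges = deque()

    def add(self, edge):
        """Insert an edge; return the evicted edge, or None if nothing left."""
        evicted = None
        if len(self.edges) == self.limit:
            evicted = self.edges.popleft()
        self.edges.append(edge)
        return evicted


cache = EdgeCache(limit=2)
results = [cache.add(edge) for edge in [("B", "A", 100), ("E", "B", 200), ("C", "A", 300)]]
print(results, list(cache.edges))
```

With a fixed limit the num_cached_edges term of the space bound stays constant regardless of how long the stream runs.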

Slide 45

Slide 45 text

Clusters and Anomalies
[Figure: Cluster 1 and Cluster 2, each containing small example graphs with timestamped edges]
• Bootstrap K clusters
• Cluster centroid: “Average” graph
• Update clusters: Constant time
• Anomaly score: distance to nearest centroid
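A simplified sketch of centroid maintenance and scoring follows. It assumes that a centroid is the mean of its members' projection vectors, that the centroid's sketch is the sign of that mean, and that distance is the fraction of disagreeing sketch bits; the slides only state "average graph" and "distance to nearest centroid", so these specifics and all names are illustrative.

```python
def sign_sketch(y):
    return [1 if v >= 0 else -1 for v in y]


def sketch_distance(xa, xb):
    """Fraction of positions on which two sketches disagree."""
    return sum(a != b for a, b in zip(xa, xb)) / len(xa)


class Cluster:
    def __init__(self, L):
        self.total = [0] * L   # running sum of member projection vectors
        self.size = 0

    def add(self, y):
        self.total = [t + v for t, v in zip(self.total, y)]
        self.size += 1

    def remove(self, y):
        self.total = [t - v for t, v in zip(self.total, y)]
        self.size -= 1

    def centroid_sketch(self):
        # The sign of the mean equals the sign of the sum for size > 0.
        return sign_sketch(self.total)


def anomaly_score(y, clusters):
    """Distance from a graph's sketch to the nearest non-empty centroid."""
    x = sign_sketch(y)
    return min(sketch_distance(x, c.centroid_sketch())
               for c in clusters if c.size > 0)


c1, c2 = Cluster(3), Cluster(3)
c1.add([3, -1, -1])
c1.add([1, -3, -1])
c2.add([-2, 2, 2])
score = anomaly_score([3, -1, 1], [c1, c2])
print(score)
```

Adding or removing a member touches only the L-element running sum, so the cost of a cluster update does not depend on the size of the graphs involved.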

Slide 46

Slide 46 text

Outline
• Motivation
• Problem
• Related Work
• Overview
• Details
• Evaluation
• Summary

Slide 47

Slide 47 text

Evaluation
• How accurate?
• How fast?
• Performance as parameters are configured?

Slide 48

Slide 48 text

Experiment Setup
Datasets
5 benign browser-based scenarios, 3 datasets
2 malicious browser-based scenarios
• Flash Player drive-by download (CVE-2015-5119)*
• JRE untrusted code execution (CVE-2012-4681)

Table 1: Dataset summary: Training scenarios and test edges (attack + 25% ben…
Dataset  Scenarios                             # Graphs  Avg. |V|  Avg. |E|
YDC      YouTube, Download, CNN                300       8705      239648
GFC      GMail, VGame, CNN                     300       8151      148414
ALL      YouTube, Download, CNN, GMail, VGame  500       8315      173857

Slide 49

Slide 49 text

Experiment Setup
Evaluation Settings and Metrics
• Offline
  • Assume all graphs present offline
  • Compute precision-recall and ROC curves
• Online
  • One edge at a time
  • Control no. of simultaneous graphs
  • Instantaneous AP and AUC

Slide 50

Slide 50 text

Experiment Setup
Offline Evaluation: Precision-Recall Curves
Compared to IFOREST with graph structural features
Training data percentage = 25%, curves averaged 10x
Note that even with 25% of the data, static StreamSpot is effective in correctly ranking the attack graphs and achieves an average precision (AP, area under the PR curve) of more than 0.9 and a near-ideal AUC (area under ROC curve).
[Figure 7: (top) Precision-Recall (PR) and (bottom) ROC curves averaged over 10 samples (p = 25%). Panels: (a) YDC, (b) GFC, (c) ALL; series: Static StreamSpot, iForest]

Slide 51

Slide 51 text

Experiment Setup
Evaluation Settings and Metrics
• Offline
  • Assume all graphs present offline
  • Compute precision-recall and ROC curves
• Online
  • One edge at a time
  • Control no. of simultaneous graphs
  • Instantaneous AP and AUC

Slide 52

Slide 52 text

Experiment Setup
Online Evaluation
Sketch size L = 1000, 50 simultaneous scenarios
[Figure 9: Performance of StreamSpot at different instants of the stream for all datasets (L = 1000). Panels: (a) YDC, (b) GFC, (c) ALL; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AUC, AP]

Slide 53

Slide 53 text

Experiment Setup
Online Evaluation
Sketch size L = 1000, 50 simultaneous scenarios
[Figure 9: Performance of StreamSpot at different instants of the stream for all datasets (L = 1000). Panels: (a) YDC, (b) GFC, (c) ALL; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AUC, AP]
AP and AUC drop when a “new set” of graphs starts growing

Slide 54

Slide 54 text

Experiment Setup
Online Evaluation
Sketch size L = 1000, 50 simultaneous scenarios
[Figure 9: Performance of StreamSpot at different instants of the stream for all datasets (L = 1000). Panels: (a) YDC, (b) GFC, (c) ALL; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AUC, AP]
AP and AUC drop when a “new set” of graphs starts growing
But recover quickly!

Slide 55

Slide 55 text

Experiment Setup
Online Evaluation: Sketch Size L
50 simultaneous scenarios
[Figure 11: Performance of StreamSpot on ALL for different values of the sketch size. Panels: L = 1000, L = 100, L = 10]

Slide 56

Slide 56 text

Experiment Setup
Online Evaluation: Edge Limit N
Sketch size L = 1000, 100 simultaneous scenarios
N as a percentage of the “full” stream size of ~25M edges
[Figure 12: Performance of StreamSpot on ALL (L = 1000) for two edge limits. Panels: (a) N = 15%, (b) N = 10%; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AP, AUC]

Slide 57

Slide 57 text

Experiment Setup
Online Evaluation: Memory Usage
“Full” stream size of ~25M edges
Recorded at the end of the stream
N = 15%, ~75MB
[Figure: two plots of memory usage against edge limit. One shows Graphs (memory usage in MB); the other shows Sketches and Projection Vectors (memory usage in KB)]

Slide 58

Slide 58 text

Experiment Setup
Online Evaluation: Running Time
Per-edge time < 70 microseconds
• L = 1000: >10,000 edges/second
• L = 100: >100,000 edges/second
Intel Xeon (R) at 2.1GHz with 1TB RAM

Slide 59

Slide 59 text

Outline
• Motivation
• Problem
• Related Work
• Overview
• Details
• Evaluation
• Summary

Slide 60

Slide 60 text

Summary
• Data to Streaming Heterogeneous Graphs
• Shingles + StreamHash (h1, …, hL): streaming, bounded-space graph representation and comparison
Challenges
• Accurate: AP ~ 1.0
• Fast: >100,000 edges/s
• Incremental
• Bounded space
Code/data at bit.ly/streamspot

Slide 61

Slide 61 text

StreamSpot
bit.ly/streamspot

Slide 62

Slide 62 text

Appendix

Slide 63

Slide 63 text

Overview
• Shingling + Sketching + Hashing (h1, …, hL): constant space and streaming graph representation and comparison
• Centroid-based clustering and anomaly scoring
• Anomaly Score: distance to nearest cluster centroid
[Figure: Graph G with projection vector yG = [+1, +1, -1] and sketch xG = [+1, +1, -1], compared against Cluster 1, Cluster 2, Cluster 3 and Cluster 4]

Slide 64

Slide 64 text

Faster Hashing
Lemire, Daniel, and Owen Kaser. "Strongly universal string hashing is fast." The Computer Journal. 2013.
• Lemire and Kaser: 0.5 CPU cycles / byte; with an observed shingle size of ~50 bytes, ~25ns / shingle
• StreamSpot’s straightforward multilinear hash: ~70us / shingle (Intel Xeon (R) at 2.1GHz with 1TB RAM)
• Possible to go 1000x faster!
• Observed input rate: 500,000 events/second

Slide 65

Slide 65 text

Clusters and Anomalies
[Figure: Cluster 1 and Cluster 2, each containing small example graphs with timestamped edges]
• Time complexity: O(KL) per-edge with K clusters
• Space complexity: O(c + KL) with c graphs in memory (controlled by edge limit N)

Slide 66

Slide 66 text

Experiment Setup
Offline Evaluation: ROC
Compared to IFOREST with graph structural features
Training data percentage = 25%, curves averaged 10x
[Figure 7: (top) Precision-Recall (PR) and (bottom) ROC curves averaged over 10 samples (p = 25%). Panels: (a) YDC, (b) GFC, (c) ALL; series: Static StreamSpot, iForest]

Slide 67

Slide 67 text

Experiment Setup
Online Evaluation
Sketch size L = 1000, 50 simultaneous scenarios
[Figure 9: Performance of StreamSpot at different instants of the stream for all datasets (L = 1000). Panels: (a) YDC, (b) GFC, (c) ALL; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AUC, AP]
AP and AUC drop when a “new set” of graphs starts growing
But recover quickly!
Accuracy is based on cluster assignments; it is poor due to “permissive” clusters

Slide 68

Slide 68 text

Experiment Setup
Online Evaluation (CVE-2012)
Sketch size L = 1000, 50 simultaneous scenarios

Slide 69

Slide 69 text

Experiment Setup
Online Evaluation: Simultaneous Scenarios
Sketch size L = 1000
[Figure 10: StreamSpot performance on ALL (measured at every 10K edges) when (left) B = 20 graphs and (right) B = 100 graphs arrive simultaneously. Panels: 20 simultaneous scenarios, 100 simultaneous scenarios; x-axis: Edges Seen (millions); y-axis: Metric; series: Accuracy, AP, AUC]
In all experiments, 75% of the benign graphs were used for bootstrap clustering.
Slower recovery with more simultaneous scenarios

Slide 70

Slide 70 text

Experiment Setup
Online Evaluation: Memory Usage
“Full” stream size of ~25M edges
Recorded at the end of the stream
[Figure: two plots of memory usage against edge limit. One shows Graphs (memory usage in MB); the other shows Sketches and Projection Vectors (memory usage in KB)]