Time series anomaly discovery with grammar based compression

Time series anomaly discovery with grammar based compression EDBT Paper
#155 Pavel Senin [email protected] Jessica Lin Xing Wang {jessica,xwang24}@gmu.edu Tim Oates Sunil Gandhi* {Oates,sunilga1}@cs.umbc.edu Arnold Boedihardjo Crystal Chen Susan Frankenstein {arnold.p.boedihardjo, crystal.chen, susan.frankenstein}@usace.army.mil 3/26/2015 Time series anomaly discovery with grammar-based compression 1

Time series are ubiqutous Planetary orbits, 10/11th century ICU display
Shape to time series transform Trajectory to time series transform 3/26/2015 Time series anomaly discovery with grammar-based compression 2

Problem definition • Given a sequences of observations • Goal
: Discover surprising/unusual/interesting/anomalous patterns in time series 1 2 , ,..., , m i T t t t t R ∈ Real-life data: - not equidistant - congested/stretched - noise - lost points ECG qtdb/sel102 (excerpt) 3/26/2015 Time series anomaly discovery with grammar-based compression 3

Classic approaches • Brute force all-with-all comparison • Simple statistics
– Compute distribution – Make a decision base on likelihood • Complex statistics – HMM • Transformation into a feature space, such as DFT, DWT, etc. 3/26/2015 Time series anomaly discovery with grammar-based compression 4

Issues with existing approaches 1. Point outliers often do not
capture anomaly 2. Computational complexity 1. Scan over the data estimating a model, scan to find anomalies 2. Distance computation (reported to be ~99% of runtime*) 3. Input parameters, i.e. a need for an apriori knowledge 1. Length of the potential anomalous or recurrent pattern 2. Threshold of divergence, etc. What if there is a complex distribution? 4. Limited interpretability of the feature space * Eamonn Keogh, Jessica Lin, Ada Fu, HOT SAX: Finding the Most Unusual Time Series Subsequence: Algorithms and Applications, ICDM05 3/26/2015 5

Time series discord, definition • Match: Given a positive real
number t (i.e., threshold) and subsequences C and M, if Dist(C, M) ≤ t then subsequence M is a match to C • Non-self match: Given a subsequence C of length n starting at position p of time series T, the subsequence M beginning at q is a non-self match to C at distance Dist(C, M) if |p − q| ≥ n • Time Series Discord: Given a time series T, the time series subsequence C ∈ T is called the discord if it has the largest Euclidean distance to its nearest non-self match 3/26/2015 Time series anomaly discovery with grammar-based compression 6

• Subsequence that is least similar to other subsequences –
“..on 19 different publicly available data sets, comparing 9 different techniques, time series discord is the best overall technique among all techniques”, Chandola et al. 2009 ECG qtdb/sel102 (excerpt) Eamonn Keogh, Jessica Lin, Ada Fu & Helga Van Herie. (2006). Finding unusual medical time series subsequences: algorithms and applications. IEEE Transactions on Information Technology in Biomedicine, 10(3): 429-439. Time Series Discord 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 Poppet pulled significantly out of the solenoid before energizing The De- Energizing phase is normal Space Shuttle Marotta Valve Series 3/26/2015 7 Courtesy of Jessica Lin

HOT-SAX algorithm (Keogh, Lin, Fu, `05) for exact discord discovery
• Guarantees to find the discord through an exhaustive, exact search. • At the first step decomposes the input time series into a map of key and values – words are keys – occurrence indices are values • Intuitively, rare subsequences are good discord candidates – search is ordered (heuristics #1) • Similar subsequences map to the same word – compares them first (heuristics #2) 3/26/2015 Time series anomaly discovery with grammar-based compression 8

Symbolic Aggregate Approximation 0 - - 0 20 40 60
80 100 120 b b b a c c c a 0 20 40 60 80 100 120 C C baabccbc 3/26/2015 Time series anomaly discovery with grammar-based compression 9

Time series discretization • Sliding window – Captures fine, Local
phenomenon • Numerosity Reduction S = caa caa cab cac ccc caa caa caa cab… = caa1 cab3 cac4 ccc5 caa6 cab9 numerosity reduction 3/26/2015 Time series anomaly discovery with grammar-based compression 10

Grammar-Based Anomaly Detection Input string: S = abcabcabcXXXabcabc . (Desired)
Output grammar: Observation: The pattern “XXX” never appears in grammar rules except for the topmost rule (the original string). Approach: For each pattern, just count how many rules it appears in, e.g. “abc” appears in 3 rules. “XXX” appears in none. Advantage: Fast; No prior knowledge on discord length is needed. 3/26/2015 11 Courtesy of Jessica Lin

Our first contribution: Rare Rule Anomaly algorithm • Exact algorithm
based on HOT-SAX: guarantees to find the discord through a nearly exhaustive, exact search. • Uses grammar induction to transform the list of SAX words into a context-free grammar – Terminal symbols and rule correspond subsequences for the search • Intuitively, terminal symbols and rare grammar rules are good discord candidates – search is ordered (new heuristics #1) • Similar subsequences map to the same terminal or grammar rule – compares them first (the same heuristics #2) 3/26/2015 Time series anomaly discovery with grammar-based compression 12

Our second contribution: Rule Density Curve algorithm • As Sequitur
digests new terminals, it continuously changes the grammar: – Creates new rules – Removes obsolete ones • Many rules are overlapping, reflecting the grammar’ hierarchical structure and the correlation between terminal symbol-corresponding subsequences • We account for the rules covering each point of the input time series building a rule density curve 3/26/2015 Time series anomaly discovery with grammar-based compression 13

Rule Density Curve, Live YouTube video? https://www.youtube.com/watch?v=9lH-RG5OtkY 3/26/2015 Time series
anomaly discovery with grammar-based compression 14

Theoretical motivation • Given string s we want to infer
rare patterns • Kolmogorov Complexity – Size of smallest Turing machine that describes s. 3/26/2015 Time series anomaly discovery with grammar-based compression 15 aaaaaaaaaaaa abbaababab More Structured Less Structured KC(s) less KC(s) more

rare patterns • Kolmogorov Complexity – Size of smallest Turing machine that describes s. 3/26/2015 Time series anomaly discovery with grammar-based compression 16 abcabcabcXXXabcabc

rare patterns • Kolmogorov Complexity – Size of smallest Turing machine that describes s. 3/26/2015 Time series anomaly discovery with grammar-based compression 17

Performance 3/26/2015 Time series anomaly discovery with grammar-based compression 18

Hilbert space-filling curve  Curve that passes through every point
of an n- dimensional region  Transform multi-dimensional trajectory data (time, latitude, longitude) into a sequence of scalars preserving locality 3/26/2015 Time series anomaly discovery with grammar-based compression 19

Hilbert space-filling curve • The trajectory becomes a sequence of
scalars {0,3,2,2,2,7,7,8,11,13,13,2,1,1}, i.e., a time series! 3/26/2015 Time series anomaly discovery with grammar-based compression 20

Finding an anomaly in Hilbert curve- transformed trajectory Planted anomaly,
traveled once A week of typical commute 3/26/2015 Time series anomaly discovery with grammar-based compression 21

Examples of true anomalies discovered in the trajectory data Abnormal
behavior of not visiting the parking lot Abnormal path outside from a highly visited area (similar to the planted anomaly) 3/26/2015 Time series anomaly discovery with grammar-based compression 22

23 Pavel Senin, Jessica Lin, Xing Wang, Tim Oates, Sunil
Gandhi, Arnold P. Boedihardjo, Crystal Chen, and Susan Frankenstein. 2015. Time Series Anomaly Discovery with Grammar-Based Compressions. In Proceedings of the 18th International Conference on Extending Database Technology (EDBT). Brussels, Belgium. March 23-27. Visualizing Anomalies Courtesy of Pavel Senin

24 Visualizing Anomalies https://github.com/GrammarViz2/grammarviz2_src Courtesy of Pavel Senin

Conclusions • Grammatical inference provides an effective and efficient mechanism
for discovering short- and long-term correlations in the input string, approximating its Kolmogorov complexity – Sequitur enables streaming data mining • SAX discretization and its numerosity reduction trick enable a significant numerosity reduction of the input data and enable variable-length patterns discovery • GrammarViz 2.0 implements the both proposed algorithms: RRA and Rule Density curve enabling interactive parameters tuning via GUI 3/26/2015 Time series anomaly discovery with grammar-based compression 25

Questions? 3/26/2015 Time series anomaly discovery with grammar-based compression 26

Thank you! • Pavel Senin, University Of Hawaii at Manoa,
ICS, Collaborative Software Development Laboratory • Jessica Lin, Xing Wang, George Mason University, Department of Computer Science • Tim Oates, Sunil Gandhi, University of Maryland, Baltimore County, Department of Computer Science • Arnold P. Boedihardjo, Crystal Chen, Susan Frankenstein, U.S. Army Corps of Engineers, Engineer Research and Development Center 3/26/2015 Time series anomaly discovery with grammar-based compression 27

Time series anomaly discovery with grammar base...

Time series anomaly discovery with grammar based compression

Pavel Senin

More Decks by Pavel Senin

Other Decks in Education

Featured

Transcript