
Kolmogorov complexity & applications. Time series anomaly discovery with grammar-based compression.

Pavel Senin

May 27, 2015



Transcript

  1. Understanding information
     • We live in a society undeniably driven by information.
     • But do we know what "information" is, mathematically?
     • How do we quantify it, or assess its quality?
     • How do we use it for research, or to prove a theorem?
     • How do we refine it?
  2. Information quantification, beginnings
     • It turns out these questions were asked long before information became ubiquitous.
     • As Lance Fortnow noted, 1903 was an interesting year: the first flight was made, and three men were born who happened to be quite determined to find the answers:
       – Alonzo Church (adviser of Alan Turing)
       – John von Neumann
       – Andrey Kolmogorov
     [Photo: first flight, Orville and Wilbur Wright]
  3. Key work introducing Kolmogorov complexity
     • "Three approaches to the quantitative definition of information", A.N. Kolmogorov, 1965.
     • Discusses two earlier approaches:
       – Combinatorial (Ralph Hartley, 1928)
         • probability-independent (alternatives sampled uniformly at random)
         • can be seen as Shannon entropy for the uniform distribution
         • always non-negative
       – Probabilistic (Claude E. Shannon, 1948)
         • rests on probabilistic assumptions
         • may produce a negative value (differential entropy)
     • And proposes a third:
       – Algorithmic, based on the "true information content".
  4. Solomonoff – Kolmogorov – Chaitin
     Solomonoff (1960) – Kolmogorov (1965) – Chaitin (1969):
     The amount of information in a string is the size of the smallest program for an optimal universal Turing machine that generates that string.
  5. Hartley function (1927)
     (Ralph Hartley, Lake Como, Italy, "Transmission of information")
     • Assume there are n mutually exclusive, equiprobable alternatives, and one of them is true, but we don't know which.
     • How can we measure the amount of information gained by learning which one is true, or, equivalently, the uncertainty associated with these n possibilities?
     • Hartley postulated that this function, S_H, mapping natural numbers to reals, shall satisfy a set of axioms:
       – Monotonicity: S_H(n) ≤ S_H(n+1)
       – Branching (additivity): S_H(nm) = S_H(n) + S_H(m)
       – Normalization: S_H(2) = 1
     • Naturally, there is exactly one function satisfying these, the logarithm: S_H(n) = log n (a quick check follows).
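
A quick numeric check of the axioms (mine, not from the deck): a minimal Python sketch assuming the base-2 logarithm, which is what gives the normalization S_H(2) = 1.

```python
import math

# Hartley information S_H(n) = log2(n): check the three axioms numerically.
S_H = math.log2

assert S_H(2) == 1.0                                        # normalization
assert all(S_H(n) <= S_H(n + 1) for n in range(1, 1000))    # monotonicity
assert abs(S_H(6 * 7) - (S_H(6) + S_H(7))) < 1e-12          # additivity
print(S_H(8))  # choosing among 8 equiprobable alternatives yields 3.0 bits
```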
  6. Shannon entropy (1948)
     (later generalized by Alfréd Rényi, 1961)
     • Shannon entropy's properties hold only when the characteristic probabilities (distributions) of the source are known.
     • A message is a random sample of characters drawn from a data stream; Shannon entropy is the expected value of the information contained in each message received.
     • Entropy characterizes the uncertainty about the source of information and increases with the source's randomness; it is maximal when all events are equiprobable.
     • The less likely a message is, the more information it provides when received.
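
As an illustration (mine, not the deck's): empirical Shannon entropy of a symbol stream, H = -Σ p_i log2 p_i, estimated from symbol frequencies; it peaks when the symbols are equiprobable.

```python
import math
from collections import Counter

def entropy(message: str) -> float:
    """Empirical Shannon entropy, in bits per symbol."""
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in Counter(message).values())

print(entropy("abababab"))  # equiprobable {a, b}: 1.0 bit/symbol (maximum)
print(entropy("aaaaaaab"))  # skewed source: ~0.54 bits/symbol
print(entropy("aaaaaaaa"))  # no uncertainty: 0.0
```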
  7. Kolmogorov (i.e., algorithmic) complexity
     • Kolmogorov proposed to change the paradigm:
       – "Discrete forms of storing and processing information are fundamental..."
       – "...it is not clear why information theory should be based so essentially on probability theory..."
       – "...the foundations of information theory must have a finite combinatorial character."
     • In contrast to the previous measures, Kolmogorov's approach deals with finite sequences, i.e., sequences obtained from a source with unknown characteristics.
  8. Kolmogorov's heat conductivity example
     • General, exact form of the heat equation, representing the continuous process of heat transfer:
       ∂u/∂t = α (∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z²)
     • The practical, universally used difference scheme:
       Δ_t u = α (Δ_xx u + Δ_yy u + Δ_zz u)
     • "...Quite probably, with the development of novel computing technique it will be clear that in very many cases it is reasonable to conduct the study of real phenomena avoiding the intermediate stage of stylizing them in the spirit of the ideas of mathematics of the infinite and the continuous, and passing directly to discrete models..." (A.N. Kolmogorov, 1970, Nice, France, International Congress of Mathematicians)
  9. Computability (applicability boundaries)
     1. Partial recursive functions and the lambda calculus are well-grounded theories which provide a formal system in mathematical logic for expressing a process of computation.
     2. Church's hypothesis (the Church–Turing thesis): the class of algorithmically computable functions (i.e., computable with paper and ink) coincides with the class of all partial recursive functions. We assume the Turing machine's equivalence to the lambda calculus.
     3. In addition, there exist definitions of a universal Turing machine which can simulate any arbitrary Turing machine on arbitrary input. We assume the existence of a universal Turing machine, a universal partial recursive function, and their equivalence.
     • Résumé: computers are as powerful as humans, and the Universe is equivalent to a Turing machine... or maybe the Universe is a hypercomputer capable of computing super-recursive functions...
  10. Kolmogorov complexity (conditional)
     • Say we are interested in finding out the quantity of information an object Y conveys about an object X.
     • Computability theory gives us a formalism: if X and Y can be expressed as numbers, there exists a computable (partial recursive) function Φ(P, Y) = X, where P is the "program" describing the computation constructively.
     • The Kolmogorov complexity is then the size of the smallest such program: as there are many possible programs, "...it is natural to consider only [the] minimal in length numbers P that lead to the object...".
  11. Kolmogorov complexity
     • For strings X and Y, an interpreter A, and a program p (just assume that A is a Turing machine): K_A(X|Y) = min{ |p| : A(p, Y) = X }.
     • Kolmogorov formulated and proved the fundamental (invariance) theorem in his work: there exists a partial recursive function U such that, for any other partial recursive function A, K_U(X|Y) ≤ K_A(X|Y) + C_A, where the constant C_A does not depend on X or Y.
     • The proof of the existence of this asymptotically optimal function is based on the existence of a universal partial recursive function.
  12. The catch
     • "...it is important to note that partial recursive functions are not defined everywhere, and there is no fixed method for determining whether application of the program P to an object k will lead to a result or not..."
       – This is an equivalent of the undecidability of the Halting problem.
       – It is also a reflection of Kurt Gödel's incompleteness theorem (a system capable of expressing elementary arithmetic cannot be both consistent and complete; i.e., for a system that proves certain arithmetic truths, there exists an arithmetical statement that is true but not provable in the system).
     • Getting around it: use a compressor, e.g. gzip or JPEG. The better it compresses the string x, the better it approximates K(x) (lossy JPEG is questionable... but it works). A minimal sketch follows.
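
K(x) itself is uncomputable, but as the slide says, any real compressor yields a computable upper bound. A minimal sketch using Python's zlib as my stand-in for the deck's gzip (both use the DEFLATE algorithm):

```python
import os
import zlib

def K_upper_bound(x: bytes) -> int:
    """A computable upper bound on K(x): the size of a compressed copy of x."""
    return len(zlib.compress(x, 9))

print(K_upper_bound(b"T" * 100))       # highly regular: compresses to a few bytes
print(K_upper_bound(os.urandom(100)))  # patternless: compression only adds overhead
```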
  13. Summary on K-complexity
     • Kolmogorov complexity deals with the complexity of objects, defining it as the size of the shortest binary program capable of generating the object: a fascinating concept that describes an object's complexity by scientific means.
     • The concept of "the shortest program" was developed by Solomonoff, Kolmogorov, and Chaitin independently, while working on inductive inference, random objects, and Turing machines respectively. Whereas Solomonoff worked on idealized inference and the universal prior, and Chaitin worked on properties of Turing machinery, Kolmogorov proposed the complexity measure directly.
     • The field of Kolmogorov complexity, while mature, is still an active research area with many problems yet to be solved.
  14. Some properties and implications
     • K(x) ≤ |x| + O(1) for all x (upper bound)
     • K(xx) = K(x) + O(1) (a loop over a program)
     • K(x|x) = O(1) (just print out the input)
     • K(x|ε) = K(x) (the empty string provides no information)
     • K(x|y) ≤ K(x) + O(1) (at worst, y is of no help)
     • K(1^n) ≤ log n + O(1) (it suffices to encode the length n)
     • K(first n digits of π) ≤ log n + O(1) (there is a short program generating π)
     • C(xy) ≤ C(x) + C(y) + O(log(min{C(x), C(y)})) (subadditivity)
  15. Applications (I). Randomness: a case of a cheating casino*
     • Bob proposes to flip a coin with Alice:
       – Alice wins a euro on Heads;
       – Bob wins a euro on Tails...
     • Result: TTTTTT... 100 Tails in a row.
       – Alice lost €100. She feels cheated...
     * Courtesy of M. Li and P. Vitányi, Lectures on Kolmogorov Complexity
  16. Randomness: Alice goes to court*
     • Alice complains: T^100 is not random.
     • Bob asks Alice to produce a random coin-flip sequence.
     • Alice flips her coin 100 times and gets THTTHHTHTHHHTTTTH...
     • But Bob claims Alice's sequence has probability 2^-100, and so does his.
     • How do we define randomness?
     * Courtesy of M. Li and P. Vitányi, Lectures on Kolmogorov Complexity
  17. Randomness
     • By computing the Kolmogorov complexity, or approximating it, we essentially compress the object.
     • Incompressibility: for a constant c > 0, a string x ∈ {0,1}* is c-incompressible if K(x) ≥ |x| - c.
       – For a constant c, we simply say that x is incompressible.
       – Conversely, a string is called compressible if it has a description shorter than the string itself.
     • Incompressible strings lack regularities that could be exploited to obtain a compressed description for them; they are effectively patternless.
     • For a FINITE string x, we say that x is random if K(x) ≥ |x| - c for a small constant c.
  18. Randomness: Alice goes to court*
     • S0 = TTTTT..., 100 tails in a row
       – K(S0) is small: "print 'T' 100 times" is ~20 characters.
     • S1 = THTTHHTHTHHHTTTTH...
       – K(S1) = ? If S1 is truly random, it has no description much shorter than the ~100 symbols of the sequence itself.
     • Lemma. There are at least 2^n - 2^(n-c) + 1 c-incompressible strings of length n.
       Proof. There are only Σ_{k=0..n-c-1} 2^k = 2^(n-c) - 1 programs of length less than n-c. Hence only that many strings (out of 2^n strings of length n in total) can have programs (descriptions) shorter than n-c. QED.
     * Courtesy of M. Li and P. Vitányi, Lectures on Kolmogorov Complexity
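
A back-of-the-envelope version of the courtroom argument (illustrative, not from the deck): compress Bob's run of tails and a fair-coin sequence of the same length. The regular sequence shrinks to almost nothing; the random one stays near its entropy (~1 bit per flip, plus compressor overhead).

```python
import random
import zlib

s0 = "T" * 100                                         # Bob's suspicious run
s1 = "".join(random.choice("HT") for _ in range(100))  # Alice's honest flips

print(len(zlib.compress(s0.encode())))  # tiny: "print 'T' 100 times"
print(len(zlib.compress(s1.encode())))  # much larger: no pattern beyond the 2-symbol alphabet
```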
  19. Randomness
     • Per Martin-Löf visited Kolmogorov in Moscow in 1964-1965.
     • We may have zillions of statistical tests for randomness:
       – A random sequence must have roughly 1/2 0's and 1/2 1's; furthermore, 1/4 each of 00's, 01's, 10's, and 11's.
       – A random sequence of length n cannot have a large block of 0's.
       – ...
     • A truly random sequence shall pass all such tests!
     • The set of all possible (effective) tests is enumerable. Using that fact, Martin-Löf defined a universal P-test for randomness and showed that if a sequence passes the universal test, it passes every enumerated test.
     • Martin-Löf then showed that an effective randomness test cannot distinguish incompressible strings from "truly random" strings once their length exceeds a constant (depending on the test); i.e., all incompressible strings longer than this constant pass the universal test.
  20. Summary on randomness
     • Kolmogorov complexity effectively enables the definition of incompressible (i.e., random) strings: K(x) ≥ |x| - c.
     • There are a lot of incompressible strings: at least 2^n - 2^(n-c) + 1 c-incompressible strings of length n.
     • Per Martin-Löf provided a theoretical framework proving that incompressible sequences are in fact random.
  21. Applications (II). The incompressibility method
     • A general-purpose method for formal proofs, usable as an alternative to counting arguments or probabilistic arguments.
     • To show that, on average, the objects in a given class have a certain property:
       1. Choose a random object from the class.
       2. This object is incompressible, with probability 1.
       3. Prove that the property holds for the object:
       4. Assume that the property does not hold.
       5. Show that we could then compress the object, yielding a contradiction.
  22. Incompressibility example. Theorem: there are infinitely many primes.*
     • Suppose not, so there are only k primes p_1, ..., p_k.
     • Then any m is a product of these: m = p_1^(e_1) · ... · p_k^(e_k).
     • Let m be a Kolmogorov-random number of length n (in binary).
     • m can be described as above by the k exponents e_i.
     • e_i ≤ log m, so |e_i| ≤ log(log m), so |(e_1, ..., e_k)| < 2k log(log m).
     • As m < 2^(n+1), |(e_1, ..., e_k)| < 2k log(n+1), and so K(m) < 2k log(n+1) + C.
     • But for a large m, K(m) > n, since m is random!
     • Contradiction: so there are infinitely many primes.
     * The example is from lectures by Lance Fortnow, prepared from notes taken by Amy Gale in Kaikoura, January 2000.
  23. A selected list of results proven with the incompressibility method (summary)*
     • Ω(n²) lower bound for simulating 2 tapes by 1 (open for 20 years)
     • k heads are better than k-1 heads for PDAs (15 years)
     • k one-way heads cannot do string matching (13 years)
     • 2 heads are better than 2 tapes (10 years)
     • Average-case analysis of heapsort (30 years)
     • k tapes are better than k-1 tapes (20 years)
     • Many theorems in combinatorics, formal languages/automata, parallel computing, VLSI
     • Simplified old proofs (Håstad's lemma)
     • Shellsort average-case lower bound (40 years)
     * Courtesy of M. Li and P. Vitányi, Lectures on Kolmogorov Complexity
  24. Applications (III). Minimum description length
     • MDL is a formalization of Occam's razor:
       – Among competing hypotheses that predict equally well, the one with the fewest assumptions should be selected.
       – Given a set of data, the best description is the one that leads to the best compression of the data (i.e., the shortest description).
     • Introduced by Jorma Rissanen in 1978.
     • MDL "...is based on the following insight: any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally..." (Grünwald, 1998). A toy model-selection sketch follows.
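
A toy two-part MDL sketch (entirely mine; the code lengths are crude stand-ins, e.g. a flat 32 bits per parameter): pick the polynomial degree that minimizes L(H) + L(D|H). Over-flexible models pay for their parameters; under-flexible ones pay for poorly coded residuals.

```python
import numpy as np

def description_length(x, y, degree, bits_per_param=32):
    """Crude two-part code length L(H) + L(D|H), in bits."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    l_model = bits_per_param * (degree + 1)                     # cost of hypothesis
    sigma2 = max(float(residuals.var()), 1e-12)
    l_data = 0.5 * len(x) * np.log2(2 * np.pi * np.e * sigma2)  # Gaussian residual code
    return l_model + l_data

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0.0, 0.3, x.size)  # true degree: 2
best = min(range(8), key=lambda d: description_length(x, y, d))
print("degree chosen by MDL:", best)  # typically 2
```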
  25. MDL in pattern mining
     • Pattern mining is an important concept in data mining, contrasting with modeling: patterns describe only the data.
       – Think motif/sequence discovery (i.e., domains, repeats) in bioinformatics.
     • Obviously, there are far too many possible patterns to examine each candidate.
     • Typically this issue is handled with a minimum-support threshold. But that is only part of the solution, because a support threshold does not limit redundancy.
     • MDL helps here: we keep the patterns that compress the dataset the most (see the sketch below).
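
A crude, illustrative stand-in for this idea (mine; not the actual KRIMP code-table mechanics): keep a pattern only if rewriting the dataset with a one-byte code for it shrinks the total description length.

```python
def dl(data: bytes, code_table) -> int:
    """Description length of data under a code table: the data rewritten
    with one-byte codes, plus each pattern spelled out once."""
    rewritten = data
    for i, p in enumerate(code_table):
        rewritten = rewritten.replace(p, bytes([i + 1]))
    return len(rewritten) + sum(len(p) + 1 for p in code_table)

data = b"milk,bread,beer;milk,bread;milk,bread,beer;eggs;" * 50
candidates = [b"milk,bread,beer;", b"milk,bread;", b"eggs;"]

# Greedy, KRIMP-flavored selection: accept a pattern only if it reduces DL.
table = []
for p in sorted(candidates, key=len, reverse=True):
    if dl(data, table + [p]) < dl(data, table):
        table.append(p)
print(table, dl(data, table), "vs literal:", len(data))
```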
  26. Pattern mining: the KRIMP algorithm
     http://www.patternsthatmatter.org/
     Vreeken, J., van Leeuwen, M., & Siebes, A. (2011). Krimp: mining itemsets that compress. Data Mining and Knowledge Discovery, 23(1), 169-214.
  27. MDL, summary
     • Patterns in data can be ranked by their ability to compress the dataset.
     • Equally sound models can be ranked by their complexity/assumptions.
     • This technique (philosophy) is general and can be applied across research areas and applications.
     • Use it carefully: if none of the distributions under consideration represents the data-generating machinery reasonably well, MDL fails.
     https://xkcd.com/1155/
  28. Applications (IV). Information distance (this is my favorite)
     • Enables measuring the distance between digital objects:
       – two genomes (evolution)
       – two documents (plagiarism detection, authorship/subject recognition)
       – two computer programs (virus detection)
       – two emails (signature verification)
       – two pictures
       – two homepages
       – two songs
       – two YouTube movies
     * Image example courtesy of M. Li and P. Vitányi, Lectures on Kolmogorov Complexity
  29. Normalized Information Distance
     • Normalized Information Distance: NID(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}.
     • Normalized Compression Distance (using bzip, gzip, winrar as practical stand-ins for K): NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}.
     • Normalized Google Distance (based on counts of pages containing x, y, and x and y together).
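
A minimal NCD sketch (mine), using Python's bz2 as the compressor C; per the formula above, similar objects compress well together.

```python
import bz2
import os

def C(b: bytes) -> int:
    return len(bz2.compress(b))

def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

x = b"the quick brown fox jumps over the lazy dog. " * 40
y = x.replace(b"jumps", b"leaps")
print(ncd(x, y))                   # near 0: the two texts share structure
print(ncd(x, os.urandom(len(x))))  # near 1: nothing in common
```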
  30. Whole-genome phylogeny (Li et al., Bioinformatics, 2001)
     • Uses all the information in the genome; needs no evolutionary model (it is universal) and no multiple alignment.
     • The Eutherian orders problem: it has been a disputed issue which two of the three groups of placental mammals are closer: Ferungulates, Primates, or Rodents. In mtDNA:
       – 6 proteins say Primates are closer to Ferungulates;
       – 6 proteins say Primates are closer to Rodents.
  31. Whole-genome phylogeny (Li et al., Bioinformatics, 2001)
     • Hasegawa's group concatenated 12 mtDNA proteins from: rat, house mouse, grey seal, harbor seal, cat, white rhino, horse, finback whale, blue whale, cow, gibbon, gorilla, human, chimpanzee, pygmy chimpanzee, orangutan, Sumatran orangutan, with opossum, wallaroo, and platypus as the outgroup (1998, using the maximum-likelihood method in MOLPHY).
     • Li's group used the complete mtDNA genomes of exactly the same species:
       – computed NCD(x, y) for each pair of species using GenCompress (a DNA-tuned gzip) and applied neighbor joining from the MOLPHY package;
       – constructed exactly the same tree, confirming that Primates and Ferungulates are closer than Rodents.
  32. Summary on information distance
     • Normalized compression distance is a way of measuring the similarity between two objects.
     • It is general, i.e., not application-dependent: a truly "parameter-free, feature-free" data-mining tool.
     • It can be used for clustering heterogeneous data.
     • The Google search engine can even serve as the "compressor", useful for data mining.
  33. Applications (V). Time series anomaly
     [Figures: planetary orbits (10th/11th century); an ICU display; a shape-to-time-series transform; a trajectory-to-time-series transform]
  34. Classic approaches
     • Brute-force all-with-all comparison
     • Simple statistics:
       – compute a distribution
       – make a decision based on likelihood
     • Complex statistics: HMMs
     • Transformation into a feature space, such as DFT, DWT, etc.
     • Current state of the art: the HOT SAX discord-discovery algorithm
  35. Our approach
     • In our approach we follow the steps suggested by Kolmogorov exactly:
       1. Discretization of the continuous signal (SAX) via a sliding window
          • greatly reduces the dimensionality
          • enables variable-length pattern discovery
       2. Grammatical compression (Sequitur)
          • an effective and efficient technique that dynamically compresses the discretized signal into a set of rules
          • enables variable-length pattern discovery
       3. Conditional Kolmogorov complexity K(X|Y)
          • at any time our algorithm is able to pinpoint anomalies with respect to the observed signal
  36. Performance evaluation (orders of magnitude faster than the current state of the art)
     • We propose two algorithms for variable-length anomaly discovery:
       1. Rule-density curve, for approximate anomaly discovery
          • rule-coverage counting; linear time and space; online anomaly discovery
       2. Rare Rule Anomaly (RRA), for exact anomaly discovery
          • a HOT SAX modification whose heuristic uses ranked grammatical rules (after grammar induction, the rule-based representation of terminals and non-terminals is no longer than the original terminal sequence, thus fewer calls to the distance function)
  37. Step 1: Symbolic Aggregate Approximation (SAX)
     [Figure: a time series discretized into the SAX word "baabccbc"]
     We pass a sliding window along the time series, extracting a sequence of words; a minimal sketch follows.
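
A minimal SAX sketch (assumptions mine: window length divisible by the number of segments, alphabet size 3 with the standard Gaussian breakpoints ±0.43; see Lin et al. for the full algorithm):

```python
import numpy as np

CUTS = [-0.43, 0.43]  # Gaussian breakpoints for a 3-letter alphabet

def sax_word(window, n_segments=8):
    """z-normalize a window, reduce it with PAA, map segment means to letters."""
    w = np.asarray(window, dtype=float)
    w = (w - w.mean()) / (w.std() + 1e-12)        # z-normalization
    paa = w.reshape(n_segments, -1).mean(axis=1)  # piecewise aggregate approximation
    return "".join("abc"[np.searchsorted(CUTS, v)] for v in paa)

# Sliding window over a toy signal, extracting one word per window:
ts = np.sin(np.linspace(0, 8 * np.pi, 512))
words = [sax_word(ts[i : i + 64]) for i in range(0, len(ts) - 64, 8)]
print(words[0])
```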
  38. Step 3: Grammar structure analysis, rule density curve
     [Figure:] Input: abcabcabc XXX abcabc. The induced rules R1/R2 cover the repeated abc runs with coverage depth 1-2, while XXX is covered by no rule: coverage depth 0, i.e., incompressible. Anomaly!
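
A sketch of the rule-density idea (illustrative; the real algorithm walks the Sequitur grammar): count, for every position of the discretized series, how many grammar-rule occurrences cover it; a dip toward zero marks the incompressible, anomalous region.

```python
def rule_density(length, rule_intervals):
    """Coverage depth per position, given (start, end) rule occurrences."""
    density = [0] * length
    for start, end in rule_intervals:
        for i in range(start, end):
            density[i] += 1
    return density

# "abcabcabc XXX abcabc": a rule R1 -> abc covers five occurrences; XXX none.
s = "abcabcabc XXX abcabc"
depth = rule_density(len(s), [(0, 3), (3, 6), (6, 9), (14, 17), (17, 20)])
print(min(range(len(s)), key=depth.__getitem__))  # first zero-coverage position,
                                                  # at the anomalous "XXX" region
```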
  39. How good is Sequitur? (better than gzip, worse than arithmetic coding)
     [Table by Richard Ladner, U. Washington]
  40. Applications: trajectory data
     • Trajectory data is intrinsically hard to explore for regularity, since patterns of movement are often driven by unperceived goals and constrained by unknown environmental settings.
     • The data used in this study was gathered from a GPS device which recorded location coordinates and times while commuting during a typical week on foot, by car, and by bicycle.
     • To apply RRA to the trajectory, the multi-dimensional trajectory data (time, latitude, longitude) was transformed into a sequence of scalars.
  41. Hilbert space-filling curve (1891)
     • By mapping each quantized (latitude, longitude) point to its index along a Hilbert curve laid over the area, the trajectory becomes a sequence of scalars {0,3,2,2,2,7,7,8,11,13,13,2,1,1}, i.e., a time series! (A sketch of the mapping follows.)
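
A sketch of the point-to-index mapping (a Python port of the well-known iterative Hilbert-curve conversion; the grid size n must be a power of two). Nearby (x, y) points tend to get nearby curve indices, which is what makes the resulting scalar sequence a meaningful time series.

```python
def _rot(n, x, y, rx, ry):
    """Rotate/flip a quadrant so the sub-curve is oriented canonically."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y

def xy2d(n, x, y):
    """Map grid point (x, y) to its distance d along the Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = _rot(n, x, y, rx, ry)
        s //= 2
    return d

# A GPS trace quantized to a 16x16 grid becomes a sequence of scalars:
trace = [(0, 0), (1, 1), (2, 1), (2, 2)]
print([xy2d(16, x, y) for x, y in trace])
```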
  42. Examples of true anomalies discovered in the trajectory data
     [Figures: abnormal behavior of not visiting the parking lot; an abnormal path outside of a highly visited area (similar to the planted anomaly)]
  43. Résumé
     • Kolmogorov complexity, when approximated with a compressor, enables the ranking of objects based on their information content.
     • This ranking is general, effective, and efficient.
     • Conditional Kolmogorov complexity enables information-quality assessment:
       – How much new information was added?
       – What is the nature of the observed information?
     • Kolmogorov complexity enables the quantification of algorithmic randomness, enabling the discovery of unusual (incompressible) data entities.
  44. Thank you!
     • Jessica Lin, Xing Wang: George Mason University, Department of Computer Science.
     • Tim Oates, Sunil Gandhi: University of Maryland, Baltimore County, Department of Computer Science.
     • Arnold P. Boedihardjo, Crystal Chen, Susan Frankenstein: U.S. Army Corps of Engineers, Engineer Research and Development Center.
     • Paul Vitányi, CWI (for pointers, the book, and the lecture slides).