• Web caching and prefetching: the variation between two prefetch lists is measured as the Jaccard distance between them.
• Client-service issues: distances between the chunk indexes of clients requesting the same content.
• Malware detection and attack recognition: estimating the dissimilarity between malicious and benign code, binaries, and traces.
For the most part, similarity estimation is based on the concept of distance between data sets.
function D with nonnegative real values defined on the Cartesian product X × X, that is, D : X × X → R+. It is called a distance metric on X if for every x, y, z ∈ X:
• D(x, y) = 0 iff x = y (the identity axiom);
• D(x, y) + D(y, z) ≥ D(x, z) (the triangle inequality);
• D(x, y) = D(y, x) (the symmetry axiom).
• A set X equipped with a metric is called a metric space.
• A similarity measure S(x, y) is treated as the inverse of the distance notion: it must follow these rules, but it grows as the differences between the objects x and y shrink, so that S(x, x) > S(x, y) for x ≠ y, in particular
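The three axioms can be checked by brute force on a small sample. The sketch below is only an illustration: the helper names `jaccard_distance` and `satisfies_metric_axioms` and the sample sets are assumptions, not part of the slides, and passing the check on a sample does not prove the axioms in general.

```python
from itertools import product


def jaccard_distance(x: frozenset, y: frozenset) -> float:
    """J_D(x, y) = 1 - |x & y| / |x | y|; two empty sets are at distance 0."""
    union = x | y
    if not union:
        return 0.0
    return 1.0 - len(x & y) / len(union)


def satisfies_metric_axioms(points, dist, tol: float = 1e-12) -> bool:
    """Check identity, symmetry and the triangle inequality on a finite sample."""
    for x, y in product(points, repeat=2):
        if (dist(x, y) <= tol) != (x == y):      # identity axiom
            return False
        if abs(dist(x, y) - dist(y, x)) > tol:   # symmetry axiom
            return False
    for x, y, z in product(points, repeat=3):
        if dist(x, y) + dist(y, z) < dist(x, z) - tol:  # triangle inequality
            return False
    return True


# Hypothetical sample: small sets of characters.
sample = [frozenset(s) for s in ("abc", "abd", "bcd", "xyz", "a")]
print(satisfies_metric_axioms(sample, jaccard_distance))

# Squared difference on numbers breaks the triangle inequality: 1 + 1 < 4.
print(satisfies_metric_axioms([0, 1, 2], lambda a, b: (a - b) ** 2))
```

The second call shows why the triangle inequality has to be verified separately: a function can be symmetric and satisfy the identity axiom and still fail to be a metric.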
representation of the document when we check for similarity?
– E.g., representing a document as a set of characters will not do (why?)
• When we have billions of documents, keeping the full text in memory is not an option.
– We need to find a shorter representation.
• How do we do pairwise comparisons of billions of documents?
– If exact matching were the goal, this would be fine. Can we replicate this idea for similarity?
can have a very broad interpretation…
• Relative to ordering: two strings that contain the same words, but in a different order, should be recognized as similar.
• If one string is just a random anagram of the characters contained in the other, it should be recognized as dissimilar.
• Content: should we consider FFUUU as more different from EEUUU than from PPUUU?
a) The count of position-wise mismatches is insensitive to which tokens do not match, and would give the same dissimilarity for any pair of the sequences FFUUU, EEUUU, and PPUUU.
b) Distances between states render dissimilarities sensitive to which of the elements do not match.
c) If we have any reason to consider P and F more similar than E and F, then FFUUU will be closer to PPUUU than to EEUUU.
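Points a) to c) can be made concrete with a short sketch. The state-to-state costs below are hypothetical values chosen only to express "P and F are more similar than E and F"; the function names are likewise illustrative.

```python
def mismatch_count(x: str, y: str) -> int:
    """Position-wise mismatches between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(x, y))


# Hypothetical costs between states (an assumption, not taken from the slides).
STATE_COST = {
    frozenset("PF"): 0.5,
    frozenset("EF"): 1.0,
    frozenset("EP"): 1.0,
}


def state_distance(a: str, b: str) -> float:
    """Cost of mismatching state a against state b; unknown pairs cost 1."""
    if a == b:
        return 0.0
    return STATE_COST.get(frozenset((a, b)), 1.0)


def weighted_mismatch(x: str, y: str) -> float:
    """Position-wise dissimilarity that depends on which states differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(state_distance(a, b) for a, b in zip(x, y))


for other in ("EEUUU", "PPUUU"):
    print(other, mismatch_count("FFUUU", other), weighted_mismatch("FFUUU", other))
```

With these assumed costs the plain mismatch count is 2 for both pairs, while the weighted variant places FFUUU closer to PPUUU (1.0) than to EEUUU (2.0).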
$\Pr(h_i(x) = h_i(y)) = J_S(x, y) + \frac{1 - J_S(x, y)}{2^k}$
is the probability that a random permutation of the subsets produces the same values. Here k is the number of bits mapped by the hash function h_i, which is a random permutation of the bit vectors' coordinates (the same permutation on all vectors).
MinHashing: replace a long text document by a much shorter, fixed-length MinHash signature.
A MinHash function is defined as the index of the first bit, in the permuted order, that has the value 1. A permutation can be simulated by a hash function of the form h(x) = (ax + b) mod p.
Then the MinHash signature of the set S is:
• [h_π1(S), h_π2(S), h_π3(S), …, h_πk(S)],
• where π1, π2, …, πk are random permutations of the bit vector's coordinates and h_π1, h_π2, h_π3, …, h_πk are their matching MinHash functions.
1. MinHashing: convert large sets to short signatures (lists of integers) while preserving similarity.
2. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.
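A minimal sketch of MinHash signatures built from hash functions of the form h(x) = (ax + b) mod p. The prime `P`, the helper names, the seeds, and the randomly generated element ids are all assumptions made for illustration. Elements are assumed to be integers below `P`, and the full hash values are compared (no k-bit truncation), so the fraction of matching positions estimates J_S directly.

```python
import random

P = 2_147_483_647  # Mersenne prime 2^31 - 1; element ids are assumed to be < P


def make_hash_params(k: int, seed: int = 0) -> list[tuple[int, int]]:
    """Draw k pairs (a, b) for h(x) = (a*x + b) mod P, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]


def minhash_signature(items: set[int], params: list[tuple[int, int]]) -> list[int]:
    """Signature = minimum hash value of the set under each hash function."""
    if not items:
        raise ValueError("MinHash is undefined for an empty set")
    return [min((a * x + b) % P for x in items) for a, b in params]


def estimate_jaccard(sig_x: list[int], sig_y: list[int]) -> float:
    """Fraction of signature positions on which the two sets agree."""
    return sum(u == v for u, v in zip(sig_x, sig_y)) / len(sig_x)


# Hypothetical data: 100 random ids, two sets sharing 60 of them.
rng = random.Random(42)
universe = rng.sample(range(P), 100)
X, Y = set(universe[:80]), set(universe[20:])

params = make_hash_params(256)
estimate = estimate_jaccard(minhash_signature(X, params), minhash_signature(Y, params))
true_js = len(X & Y) / len(X | Y)
print(true_js, estimate)
```

The estimate should land near the true value of 0.6; its accuracy improves as the signature length k grows.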
from a set of n-grams by:
1. Sorting the n-grams alphabetically.
2. Concatenating them to form one string.
Example:
Text: abcdef
3-grams: “abc”, “bcd”, “cde”, “def”
Sorted n-gram set: “abc”, “bcd”, “cde”, “def”
Representing string: abcbcdcdedef
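The two steps can be sketched as follows. The function names `ngrams` and `representing_string` are illustrative, and character-level n-grams are assumed, as in the example above.

```python
def ngrams(text: str, n: int) -> set[str]:
    """Set of all character n-grams of the text (duplicates collapse)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def representing_string(grams: set[str]) -> str:
    """Sort the n-grams alphabetically, then concatenate them."""
    return "".join(sorted(grams))


print(representing_string(ngrams("abcdef", 3)))  # abcbcdcdedef
```

Because the n-grams are collected in a set, repeated n-grams contribute only once to the representing string.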
set of n-grams
• Signatures: short integer vectors that represent the sets and reflect their Jaccard similarity.
• Locality-Sensitive Hashing
• Candidate pairs: those pairs of signatures that we need to test for similarity.
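The slides do not say how candidate pairs are selected. One common textbook construction is banding: split each signature into bands and treat two signatures as a candidate pair if they agree on every row of at least one band. The sketch below assumes that scheme; the function name, the band count, and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures: dict[str, list[int]], bands: int) -> set[tuple[str, str]]:
    """Return pairs of names whose signatures are identical in at least one band."""
    if not signatures:
        return set()
    k = len(next(iter(signatures.values())))
    if k % bands:
        raise ValueError("signature length must be divisible by the number of bands")
    rows = k // bands
    candidates: set[tuple[str, str]] = set()
    for b in range(bands):
        buckets: dict[tuple[int, ...], list[str]] = defaultdict(list)
        for name, sig in signatures.items():
            # Signatures with the same values in this band share a bucket.
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates


# Toy signatures (hypothetical values).
toy = {"d1": [1, 2, 3, 4], "d2": [1, 2, 9, 9], "d3": [7, 7, 7, 7]}
print(lsh_candidate_pairs(toy, bands=2))
```

Only the candidate pairs then need a full similarity test, which avoids comparing every pair of signatures.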
Delete, Insert, and Substitute of one character are the edit operations.
• The edit distance between strings x and y, ED(x, y), is the minimum number of edit operations required to transform x into y.
• Examples:
ED('abc', 'aac') = 1
ED('xabcy', 'aby') = 2
ED('revolution', 'evolution') = 1
ED('kitten', 'sitting') = 3
• Computed by dynamic programming in time quadratic in the strings' length.
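A minimal sketch of the quadratic-time dynamic program, keeping only two rows of the table. The function name `edit_distance` is illustrative; the test strings are the examples listed above.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of single-character deletes, inserts and substitutes."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # transforming x[:i] into the empty string takes i deletes
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


for a, b in [("abc", "aac"), ("xabcy", "aby"),
             ("revolution", "evolution"), ("kitten", "sitting")]:
    print(a, b, edit_distance(a, b))
```

The running time is proportional to the product of the two string lengths, and the two-row variant needs memory proportional to the length of y only.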
a program as “benign” or “malicious”.
• Two possible paradigms: (i) feature-based classification; (ii) formal property verification.
• Malware variants share similar behaviors, yet they have different syntactic structures because of the many obfuscation and code-change techniques they incorporate.
• Requirements for feature-based models: a behavior-based feature model that describes the malicious actions exhibited by a malware instance.
• Possible means: run a dynamic analysis on a recent malware dataset inside a controlled virtual environment and capture traces of the API calls invoked by the malware instances.
calls issued by a program for the duration of its execution is called a system call trace.
• Example: LoadLibrary lpFileName=VERSION.dll Return=SUCCESS
• A compromised program cannot cause significant damage to the underlying system without using system calls, e.g., creating a new process or accessing a file.
• Malware instances depend largely on API calls provided by the operating system to achieve their malicious tasks. Therefore, behavior-based techniques that utilize API calls are promising.
registry, file, process, etc.)
• Event subject, if applicable (i.e., the process that takes the action)
• Action parameters (e.g., registry value, file path, IP address)
• Status of the action (e.g., file handle created, registry removed, etc.)
Example:
• Kernel function: ZwOpenKey
• Parameters: \Registry\Machine\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\iexplore.exe
• Status: Key handle
GetProcAddress GetProcAddress GetProcAddress GetProcAddress GetProcAddress RegOpenKeyExW
0,RegOpenKeyExW LoadLibraryA GetProcAddress GetProcAddress GetProcAddress GetProcAddress GetProcAddress LoadLibraryA
A subset of the Windows API/system calls that are considered informative for differentiating malware from benign software is logged by API monitors while a designated program runs in the system.
metrics:
• J_D, the distance derived from J_S, is a true metric on the space of sets, since the triangle inequality holds. This is why it can be used effectively in clustering algorithms.
• Low computational complexity.
• Formally intended to measure the similarity of simple sets, not strings. In fact, the Jaccard similarity of two traces x, y is simply the ratio of the number of symbols they share to the number of distinct symbols in both.
• Example: replacement attacks. J_S of two traces can incorrectly reflect a change in the control graph (which represents dependencies between the system calls in the traces of the program execution).
• Reason: J_S (or J_D) is only trivially connected to the structure and semantics of the program behavior displayed in traces, so the consequences of attacks may be detected incorrectly.
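The set-versus-string limitation can be seen on two short traces. The traces below are hypothetical (they only reuse API names that appear in the logged trace), and the helper names are illustrative. On single calls the two traces look identical to J_S, even though the order of calls differs completely; shingling the traces into 2-grams of consecutive calls exposes the difference.

```python
def jaccard_similarity(x, y) -> float:
    """J_S = |X & Y| / |X | Y| over the distinct elements of x and y."""
    sx, sy = set(x), set(y)
    union = sx | sy
    return len(sx & sy) / len(union) if union else 1.0


def call_ngrams(trace: list[str], n: int) -> set[tuple[str, ...]]:
    """Set of n-grams of consecutive calls in a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


# Hypothetical traces: same calls, different order and repetition.
trace_a = ["LoadLibraryA", "GetProcAddress", "GetProcAddress", "RegOpenKeyExW"]
trace_b = ["RegOpenKeyExW", "GetProcAddress", "LoadLibraryA"]

print(jaccard_similarity(trace_a, trace_b))                                # 1.0
print(jaccard_similarity(call_ngrams(trace_a, 2), call_ngrams(trace_b, 2)))  # 0.0
```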
the difference in string lengths,
• Allows a probabilistic interpretation.
• However:
• in the general case, the triangle inequality does not hold;
• there is no obvious connection between the probability of two strings falling into the same bucket and their degree of coincidence, so it is difficult to implement efficient hashing of the kind available for Jaccard;
• higher computational complexity (quadratic in the strings' length).
a similarity metric that could combine the advantages of both metrics without their drawbacks? The first problem along the way is how to relate measures as different as Jaccard and NED (one is defined on unordered sets, the other on strings). We can overcome this by using the concept of a representing string (Jeffrey Ullman).
textual data shingling, taking into account that there is solid evidence that these similarity estimation results can be applied to (raw) strings that have an n-gram representation with few repetitions.
string of n-gram sets there is a direct relationship between J_D(X, Y) and the delete/insert-based edit distance E(x, y).
• For any pair of strings x, y, ED(x, y) does not exceed E(x, y). For example, ED('abc', 'aac') = 1 (one substitution), while E('abc', 'aac') = 2 (one deletion and one insertion).
The elements in the representing strings are sorted and have no repetitions, so the length of the strings' longest common subsequence (LCS) is exactly the size of the intersection of the corresponding sets.
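A sketch of that relationship, under the assumption that edit operations act on whole n-grams (each n-gram is one symbol of the representing string). Since the LCS length equals |X ∩ Y|, the delete/insert distance is E = |X| + |Y| − 2|X ∩ Y|, which is the size of the symmetric difference, and therefore J_D = E / |X ∪ Y|. The helper names and the two example texts are illustrative.

```python
def ngrams(text: str, n: int) -> set[str]:
    """Set of all character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def lcs_length(x: list[str], y: list[str]) -> int:
    """Length of the longest common subsequence of two symbol sequences."""
    prev = [0] * (len(y) + 1)
    for sx in x:
        curr = [0]
        for j, sy in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if sx == sy else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(x: list[str], y: list[str]) -> int:
    """Delete/insert-only edit distance: |x| + |y| - 2 * LCS(x, y)."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


X, Y = ngrams("abcdef", 3), ngrams("abcxef", 3)
rep_x, rep_y = sorted(X), sorted(Y)  # representing strings as n-gram sequences

E = indel_distance(rep_x, rep_y)
jaccard_distance = 1 - len(X & Y) / len(X | Y)
print(lcs_length(rep_x, rep_y), len(X & Y))  # both 1
print(E, len(X ^ Y))                         # both 6
print(E / len(X | Y), jaccard_distance)      # both 6/7
```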
as a new similarity metric for classification-via-clustering problems.
• The average value was obtained by averaging over the interval of possible NED values.
• Benefits over J: it can take explicit account of the difference in sizes of the compared sets.
• Benefits over NED: the triangle inequality can be used for clustering different subsets of strings; linear computational complexity.
• The relationship between the true values of J_D and NED and their approximate estimates is shown.
• ANED is more sensitive to semantic differences, yet it is calculated formally, without any conditions on the semantic proximity of the objects being compared.