Slide 1

Slide 1 text

About Using of Strings Similarity Conception in Software Engineering
Sergey Frenkel and Victor Zakharov
Institute of Informatics Problems, FRC “Computer Science and Control”, RAS, Moscow, Russia

Slide 2

Slide 2 text

Motivating Problem
Data set similarity tasks in SW engineering:
• Web caching and prefetching (variation between two prefetch lists as the Jaccard distance between them).
• Client-service issues: distances between chunk indexes of clients requesting the same content.
• Malware detection and attack recognition (estimation of dissimilarity between malicious and benign code, binaries, traces).
For the most part, similarity estimation is based on the concept of distance between the data sets.

Slide 3

Slide 3 text

Distance formality
• A distance (“distance in a space”) is a function D with nonnegative real values defined on the Cartesian product X × X, that is, D : X × X → R+. It is called a distance metric on X if for every x, y, z ∈ X:
  • D(x, y) = 0 iff x = y (the identity axiom);
  • D(x, y) + D(y, z) ≥ D(x, z) (the triangle inequality);
  • D(x, y) = D(y, x) (the symmetry axiom).
• A set X provided with a metric is called a metric space.
• A similarity metric S(x, y) is considered the inverse of the distance notion: it must follow these rules but be greater the smaller the differences between the objects x and y; in particular, S(x, x) > S(x, y) for x ≠ y.
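
The three axioms can be checked mechanically on a finite sample of points. The sketch below is only an illustration: the function name `is_metric_on` and the two toy distances are assumptions, not part of the talk, and a passing check on a sample can refute the axioms but cannot prove them for the whole space.

```python
from itertools import product

def is_metric_on(points, dist, tol=1e-12):
    """Check the metric axioms for `dist` on a finite sample of points."""
    for x, y in product(points, repeat=2):
        d = dist(x, y)
        if d < 0:
            return False  # distances must be nonnegative
        if (d == 0) != (x == y):
            return False  # identity axiom violated
        if abs(d - dist(y, x)) > tol:
            return False  # symmetry axiom violated
    for x, y, z in product(points, repeat=3):
        if dist(x, y) + dist(y, z) + tol < dist(x, z):
            return False  # triangle inequality violated
    return True

sample = [0, 1, 2, 5]
abs_diff = lambda x, y: abs(x - y)          # a true metric on numbers
squared_diff = lambda x, y: (x - y) ** 2    # breaks the triangle inequality

print(is_metric_on(sample, abs_diff))       # True
print(is_metric_on(sample, squared_diff))   # False: d(0,1) + d(1,2) = 2 < d(0,2) = 4
```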

Slide 4

Slide 4 text

Main issues for all tasks
• What is the right representation of the document when we check for similarity?
  – E.g., representing a document as a set of characters will not do (why?)
• When we have billions of documents, keeping the full text in memory is not an option.
  – We need to find a shorter representation.
• How do we do pairwise comparisons of billions of documents?
  – If exact match were the issue, it would be OK; can we replicate this idea?

Slide 5

Slide 5 text

Most popular metrics:
• Jaccard index (both a distance and a similarity)
• Edit (Levenshtein) distance

Slide 6

Slide 6 text

Intuition about similarity (informal semantics)
• The concept of similarity can have a very broad interpretation…
• With respect to ordering: two strings that contain the same words, but in a different order, should be recognized as similar.
• If one string is just a random anagram of the characters contained in the other, then it should be recognized as dissimilar.
• Content: should we consider FFUUU as more different from EEUUU than from PPUUU?
  a) The count of position-wise mismatches is insensitive to which tokens do not match, and would lead to the same dissimilarity for any pair of the sequences FFUUU, EEUUU, and PPUUU.
  b) Distances between states render dissimilarities sensitive to which of the elements do not match.
  c) If we have any reason for considering P and F as more similar than E and F, then FFUUU will be closer to PPUUU than to EEUUU.
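
Points a) to c) can be illustrated with a small sketch. The substitution costs below are hypothetical values chosen only to encode "P and F are more similar than E and F"; the function names are likewise illustrative assumptions.

```python
def mismatch_count(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(x, y) if a != b)

# Hypothetical state-to-state costs (not from the talk): P/F are assumed closer.
SUBSTITUTION_COST = {
    frozenset("PF"): 0.5,
    frozenset("EF"): 1.0,
    frozenset("EP"): 1.0,
}

def weighted_mismatch(x, y, costs=SUBSTITUTION_COST):
    """Position-wise dissimilarity that depends on which tokens mismatch."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for a, b in zip(x, y):
        if a != b:
            total += costs.get(frozenset((a, b)), 1.0)  # default cost 1.0
    return total

# a) plain mismatch counting cannot tell the pairs apart
print(mismatch_count("FFUUU", "EEUUU"), mismatch_count("FFUUU", "PPUUU"))  # 2 2
# c) with state distances, FFUUU is closer to PPUUU than to EEUUU
print(weighted_mismatch("FFUUU", "PPUUU"), weighted_mismatch("FFUUU", "EEUUU"))  # 1.0 2.0
```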

Slide 7

Slide 7 text

Jaccard Similarity
Sim_J = 3/8
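
A minimal sketch of the Jaccard similarity as the ratio of intersection size to union size. The two example sets are hypothetical; they were chosen only so that the result reproduces the value 3/8 shown on the slide, since the sets in the original figure are not recoverable.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Jaccard distance as the complement of the similarity."""
    return 1.0 - jaccard_similarity(a, b)

x = {1, 2, 3, 4, 5}
y = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(x, y))  # 0.375, i.e. 3 shared elements out of 8
print(jaccard_distance(x, y))    # 0.625
```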

Slide 8

Slide 8 text

Classification via clustering

Slide 9

Slide 9 text

Probabilistic interpretation
The probability that a random mapping by a hash function h_i (a random permutation of the bit vectors' coordinates, the same permutation on all vectors) produces the same values for two sets x and y is
Pr(h_i(x) = h_i(y)) = J_S(x, y) + (1 − J_S(x, y)) / 2^k,
where k is the number of bits mapped by the hash function h_i.
MinHashing: replace a long text document by a much shorter, fixed-length MinHash signature.
A MinHash function is defined as the index of the first bit, in the permuted order, to have the value 1. A hash function of the form h(x) = (ax + b) mod p can be used in place of an explicit permutation.
Then the MinHash signature of the set S is:
• [h_π1(S), h_π2(S), h_π3(S), …, h_πk(S)],
• where π1, π2, …, πk are random permutations of the bit vector's coordinates and h_π1, h_π2, h_π3, …, h_πk are their matching MinHash functions.
1. Minhashing: convert large sets to short signatures (lists of integers), while preserving similarity.
2. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.
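
A minimal MinHash sketch, assuming the common practical variant in which each permutation is replaced by a hash h(x) = (ax + b) mod p and the signature entry is the minimum hash value over the set. The prime, the use of `zlib.crc32` to turn string elements into integers, and all function names are assumptions for illustration, not the authors' implementation.

```python
import random
import zlib

P = (1 << 61) - 1  # a Mersenne prime, larger than any 32-bit element id

def make_hash_params(k, seed=0):
    """Draw k pairs (a, b) defining hash functions h(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]

def minhash_signature(items, params):
    """For each hash function, keep the minimum hash value over the set."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in items]
    if not ids:
        raise ValueError("cannot build a signature for an empty set")
    return [min((a * x + b) % P for x in ids) for a, b in params]

def estimate_similarity(sig_x, sig_y):
    """Fraction of positions on which two signatures agree."""
    return sum(u == v for u, v in zip(sig_x, sig_y)) / len(sig_x)

params = make_hash_params(k=128)
sig_a = minhash_signature({"abc", "bcd", "cde", "def"}, params)
sig_b = minhash_signature({"bcd", "cde", "def", "efg"}, params)
print(estimate_similarity(sig_a, sig_b))  # an estimate of the true Jaccard similarity 3/5
```

The estimate is random (it depends on the drawn parameters) and tightens as the signature length k grows.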

Slide 10

Slide 10 text

Locality Sensitive Hashing
Figure: N points are mapped by LSH into hash buckets.
https://ieeexplore.ieee.org/document/4472264
A. Rajaraman and J. Ullman. Mining of Massive Datasets, Ch. 3.4 (2010).

Slide 11

Slide 11 text

n-grams & “shingling”
• We regard a trace as plain text; no semantics are involved.
• n-gram: a substring of length n of a string.
• Collect all the n-grams of each trace and store them in a set.
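
The shingling step can be sketched in a few lines. The function name `ngram_set` is an illustrative assumption; the trace is treated as a plain character string, as stated above.

```python
def ngram_set(text, n):
    """Return the set of all substrings of length n of `text`."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Strings shorter than n yield an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}

print(sorted(ngram_set("abcdef", 3)))  # ['abc', 'bcd', 'cde', 'def']
```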

Slide 12

Slide 12 text

n-gram Sets & Representing Strings
• A representing string is obtained from a set of n-grams by:
1. Sorting the n-grams alphabetically.
2. Concatenating them to form one string.
Example:
Text: abcdef
3-grams: “abc”, “bcd”, “cde”, “def”
Sorted n-gram set, concatenated
Representing string: abcbcdcdedef
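
The two steps above can be sketched as follows. The function name `representing_string` is an illustrative assumption; the first call reproduces the slide's example, and the second shows that repeated n-grams collapse because a set is used.

```python
def representing_string(text, n):
    """Sort the set of n-grams of `text` alphabetically and concatenate them."""
    ngrams = {text[i:i + n] for i in range(len(text) - n + 1)}
    return "".join(sorted(ngrams))

print(representing_string("abcdef", 3))  # abcbcdcdedef
print(representing_string("banana", 2))  # anbana (n-grams: an, ba, na)
```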

Slide 13

Slide 13 text

General scheme of Jaccard-based clustering (http://www.mmds.org)
1. Trace
2. The set of n-grams
3. Signatures: short integer vectors that represent the sets and reflect their Jaccard similarity
4. Locality-Sensitive Hashing
5. Candidate pairs: those pairs of signatures that we need to test for similarity
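
The last two stages can be sketched with the banding technique from Mining of Massive Datasets, Ch. 3.4. This is an assumption: the slides do not state which LSH scheme was used, and the function name and toy signatures below are illustrative only. Each signature is cut into bands; two traces become a candidate pair if they agree on every row of at least one band.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, bands):
    """signatures: dict mapping a trace name to an equal-length list of ints."""
    if not signatures:
        return set()
    length = len(next(iter(signatures.values())))
    if length % bands != 0:
        raise ValueError("signature length must be divisible by the number of bands")
    rows = length // bands
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # Traces whose band slices are identical land in the same bucket.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates

toy = {"t1": [1, 2, 3, 4], "t2": [1, 2, 9, 9], "t3": [7, 8, 5, 6]}
print(lsh_candidate_pairs(toy, bands=2))  # {('t1', 't2')}
```

Only the candidate pairs are then compared with the full similarity measure, which avoids the quadratic number of pairwise comparisons.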

Slide 14

Slide 14 text

Edit Distance
• Edit Distance is defined over strings.
• Deleting, inserting, or substituting one character are the edit operations.
• The Edit Distance between strings x and y, ED(x, y), is the minimum number of edit operations required to transform x into y.
• Examples:
  ED(’abc’, ’aac’) = 1
  ED(’xabcy’, ’aby’) = 2
  ED(’revolution’, ’evolution’) = 1
  ED(’kitten’, ’sitting’) = 3
• Computed by dynamic programming in time quadratic in the strings’ length.
BGU
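
A minimal dynamic-programming sketch of the edit distance with unit costs for delete, insert and substitute. The function name is an illustrative assumption; the calls reproduce the examples listed above.

```python
def edit_distance(x, y):
    """Levenshtein distance between x and y, keeping one DP row at a time."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # transforming x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]

print(edit_distance("abc", "aac"))               # 1
print(edit_distance("xabcy", "aby"))             # 2
print(edit_distance("revolution", "evolution"))  # 1
print(edit_distance("kitten", "sitting"))        # 3
```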

Slide 15

Slide 15 text

Normalized Edit Distance

Slide 16

Slide 16 text

Let’s consider the advantages and disadvantages of applying these metrics to similarity estimation in specific cases (malware traces, for example)…

Slide 17

Slide 17 text

Malware detection problem
• The problem is to classify a program as “benign” or “malicious”.
• Two possible paradigms: (i) feature-based classification, (ii) formal property verification.
• Malware variants share similar behaviors, yet they have different syntactic structure due to the incorporation of many obfuscation and code-change techniques.
• Requirements for feature-based models: a behavior-based feature model that describes the malicious actions exhibited by a malware instance.
• Possible means: dynamic analysis of a recent malware dataset inside a controlled virtual environment, capturing traces of the API calls invoked by malware instances.

Slide 18

Slide 18 text

• Zero-day or unknown malware is created using code obfuscation techniques that modify the parent code.
• These techniques produce offspring copies that have the same functionality but different signatures.

Slide 19

Slide 19 text

Why System Calls?
• Definition: the list of system calls issued by a program for the duration of its execution is called a system call trace.
• Example: LoadLibrary lpFileName=VERSION.dll Return=SUCCESS
• A compromised program cannot cause significant damage to the underlying system without using system calls, e.g. creating a new process or accessing a file.
• Malware instances largely depend on API calls provided by the operating system to achieve their malicious tasks. Therefore, behavior-based techniques that utilize API calls are promising.

Slide 20

Slide 20 text

Possible trace semantics
• Event ID
• Event object (e.g. registry, file, process, etc.)
• Event subject, if applicable (i.e. the process that takes the action)
• Action parameters (e.g. registry value, file path, IP address)
• Status of the action (e.g. file handle created, registry removed, etc.)
Example:
Kernel function: ZwOpenKey
Parameters: \Registry\Machine\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\iexplore.exe
Status: Key handle

Slide 21

Slide 21 text

Traces representation
• Trace of system calls with some of their parameters:
LoadLibrary lpFileName=VERSION.dll
CreateFile hName=C:\WINDOWS\System32\Wbem\wmic.exe
CreateFile hName=C:\WINDOWS\System32\Wbem\wmic.exe
RegQueryValue hKey=HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders\Cache
LoadLibrary lpFileName=advapi32.dll
CreateProcessInternal lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl
CreateProcessInternal lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl
CreateProcess lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl

Slide 22

Slide 22 text

Example Trace:
1416 malware.exe
#237690000 LoadLibrary lpFileName=advapi32.dll Return=SUCCESS
#238830000 RegQueryValue hKey=HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\CriticalSectionTimeout Return=SUCCESS
#244600000 CreateFile hName=C:\WINDOWS\WindowsShell.Manifest desiredAccess=GENERIC_READ creationDisposition=OPEN_EXISTING Return=SUCCESS
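
A sketch of how such a trace could be reduced to its sequence of call names. The line layout assumed here (one event per line, `#<timestamp> <API name> <parameters…>`, with any other line treated as a header) is inferred from the example only, and the function names are hypothetical.

```python
def parse_event(line):
    """Split one event line into (timestamp, api_name, raw_parameters)."""
    parts = line.strip().split(maxsplit=2)
    if len(parts) < 2 or not parts[0].startswith("#"):
        raise ValueError("not an event line: %r" % line)
    timestamp = int(parts[0][1:])  # drop the leading '#'
    rest = parts[2] if len(parts) > 2 else ""
    return timestamp, parts[1], rest

def call_sequence(lines):
    """Keep only the API names of the event lines, in order."""
    return [parse_event(l)[1] for l in lines if l.lstrip().startswith("#")]

trace = [
    "1416 malware.exe",
    "#237690000 LoadLibrary lpFileName=advapi32.dll Return=SUCCESS",
    r"#244600000 CreateFile hName=C:\WINDOWS\WindowsShell.Manifest Return=SUCCESS",
]
print(call_sequence(trace))  # ['LoadLibrary', 'CreateFile']
```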

Slide 23

Slide 23 text

System calls without parameters (e.g. KVM data set)
• 1,RegOpenKeyExW LoadLibraryA GetProcAddress GetProcAddress GetProcAddress GetProcAddress GetProcAddress RegOpenKeyExW
• 0,RegOpenKeyExW LoadLibraryA GetProcAddress GetProcAddress GetProcAddress GetProcAddress GetProcAddress LoadLibraryA
A subset of the Windows API/system calls that are considered informative for differentiating malware from benign software is logged by API monitors while a designated program is running in the system.
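
A sketch of reading one such record. It assumes, without confirmation from the slides, that the number before the comma is a class label and that the call names are separated by whitespace; the function name is hypothetical.

```python
def parse_labeled_trace(record):
    """Split 'label,call call ...' into (int label, list of call names)."""
    label_text, _, calls_text = record.partition(",")
    return int(label_text), calls_text.split()

label, calls = parse_labeled_trace("1,RegOpenKeyExW LoadLibraryA GetProcAddress")
print(label, calls)  # 1 ['RegOpenKeyExW', 'LoadLibraryA', 'GetProcAddress']
```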

Slide 24

Slide 24 text

Detection = Classification

Slide 25

Slide 25 text

J vs. NED: benefits and drawbacks for malware detection
Jaccard metrics:
• J_D (the distance form of J_S) is a true metric on the space of sets, as the triangle inequality holds. This is why it can be used effectively in clustering algorithms.
• Low computational complexity.
• Formally intended to measure the similarity of simple sets, not strings. In fact, the Jaccard similarity/distance of two traces x, y is a simple ratio of the numbers of matching symbols.
• Example: replacement attacks. J_S of two traces can incorrectly reflect a change in the control graph (which represents dependencies between the system calls in the traces of the program execution).
• Reason: the connection between J_S (or J_D) and the structure and semantics of the program behavior displayed in traces is trivial, so the consequences of attacks may be detected incorrectly.

Slide 26

Slide 26 text

Contd.
NED:
• Deals with strings.
• Sensitive to the difference in string lengths.
• Allows a probabilistic interpretation.
However:
• In the general case, the triangle inequality does not hold.
• There is no obvious connection between the probability of two strings falling into the same bucket and how closely they match; therefore, it is difficult to implement efficient hashing as for Jaccard.
• Computational complexity.

Slide 27

Slide 27 text

Clustering

Slide 28

Slide 28 text

• The question naturally arises: is it possible to construct a similarity metric that combines the advantages of both metrics without their drawbacks?
• The first problem on this way is how to relate such different measures as Jaccard and NED (one is defined on unordered sets, the other on strings).
• We can overcome this by using the concept of a representing string (J. Ullman).

Slide 29

Slide 29 text

Representing string
A representing string is the result of shingling the original (raw) textual data. There is solid evidence that similarity estimates obtained on representing strings can be applied to raw strings whose n-gram representation has few repetitions.

Slide 30

Slide 30 text

Correspondence Between Representing Strings And Original Strings
Figure: the average and standard deviation of the difference (delta) between the NED on pairs of original traces and the NED on the same pairs of representing strings, as a function of the n-gram length.

Slide 31

Slide 31 text

Average NED
• α = min(|x|, |y|) / max(|x|, |y|)
Figure: relationship between the Average Normalized Edit Distance, the length ratio α of the string pair, and the Jaccard Distance.

Slide 32

Slide 32 text

Some mathematical reasons
• Thus, for strings that are representing strings of n-gram sets, there is a direct relationship between J_D(X, Y) and the delete/insert-based Edit Distance E(x, y).
• For any pair of strings x, y, ED(x, y) does not exceed E(x, y). For example:
The elements in the representing strings are sorted and have no repetitions. The strings’ longest common subsequence (LCS) is exactly the size of the corresponding sets’ intersection.
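
The LCS statement can be checked on a small case. The sketch below treats each representing string as a sorted sequence of n-gram tokens (an assumption: distances are counted in n-grams, not characters) and uses the standard identity E = |x| + |y| − 2·LCS for the insert/delete-only distance. With LCS = |X ∩ Y| this gives E = |X ∪ Y| − |X ∩ Y|, so J_D(X, Y) = E / |X ∪ Y|. The example sets and function names are hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def indel_distance(x, y):
    """Edit distance when only insertions and deletions are allowed."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

X = {"abc", "bcd", "cde", "def"}
Y = {"bcd", "cde", "def", "efg", "fgh"}
x_tokens, y_tokens = sorted(X), sorted(Y)  # sorted, no repetitions

print(lcs_length(x_tokens, y_tokens), len(X & Y))               # 3 3
print(indel_distance(x_tokens, y_tokens), len(X | Y) - len(X & Y))  # 3 3
print(indel_distance(x_tokens, y_tokens) / len(X | Y))          # 0.5, the Jaccard distance
```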

Slide 33

Slide 33 text

Triangle inequality

Slide 34

Slide 34 text

Experiments for different metrics

Slide 35

Slide 35 text

Malware-benign similarity rates to the nearest medoid

Slide 36

Slide 36 text

Windows API Call Traces
• We trace calls to a selected set of 55 Windows APIs.
• The traces were obtained from the KVM hypervisor Runtime Execution Introspection and Profiling (REIP) system.

Slide 37

Slide 37 text

Goal and Contribution
• The Averaged Normalized Edit Distance (ANED) is proposed as a new similarity metric for classification-via-clustering problems. The average value is obtained by averaging over the interval of possible NED values.
• Benefits over J: the explicit dependence on the difference in sizes of the compared sets can be taken into account.
• Benefits over NED: the triangle inequality property can be used for clustering different subsets of strings; linear computational complexity.
• The relationship between the true values of J_D and NED and their approximate estimates is shown.
• ANED is more sensitive to semantic differences, yet it is calculated formally, without any conditions on the semantic proximity of the objects being compared.