# About Using of Strings Similarity Conception in Software Engineering

Sergey Frenkel and Victor Zakharov

International Conference on Software Testing, Machine Learning and Complex Process Analysis (TMPA-2019)
7-9 November 2019, Tbilisi

TMPA Conference website https://tmpaconf.org/

## ExactproPRO

November 08, 2019

## Transcript

1. ### About Using of Strings Similarity Conception in Software Engineering

Sergey Frenkel and Victor Zakharov, Institute of Informatics Problems, FRC “Computer Science and Control”, RAS, Moscow, Russia
2. ### Motivating Problem

Data set similarity tasks in software engineering:

• Web caching and prefetching: the variation between two prefetch lists can be measured as the Jaccard distance between them.
• Client-service issues: distances between the chunk indexes of clients requesting the same content.
• Malware detection and attack recognition: estimating the dissimilarity between malicious and benign code, binaries, and traces.

For the most part, similarity estimation is based on the concept of a distance between the data sets.
3. ### Distance formality

• A distance (“distance in a space”) is a function D with nonnegative real values defined on the Cartesian product X × X, that is, D : X × X → R+.
• D is called a distance metric on X if for every x, y, z ∈ X:
  – D(x, y) = 0 iff x = y (the identity axiom);
  – D(x, y) + D(y, z) ≥ D(x, z) (the triangle inequality);
  – D(x, y) = D(y, x) (the symmetry axiom).
• A set X provided with a metric is called a metric space.
• A similarity measure S(x, y) is considered the inverse of the distance notion: it must follow these rules, but it is greater the smaller the differences between the objects x and y; in particular, S(x, x) > S(x, y) for x ≠ y.
4. ### Main issues for all tasks

• What is the right representation of the document when we check for similarity?
  – E.g., representing a document as a set of characters will not do (why?)
• When we have billions of documents, keeping the full text in memory is not an option.
  – We need to find a shorter representation.
• How do we do pairwise comparisons of billions of documents?
  – If exact match were the issue, it would be OK; can we replicate this idea?
5. ### Most popular metrics

• Jaccard rate: serves as both a distance and a similarity.
• Edit (Levenshtein) distance.
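As an illustration of why the Jaccard rate serves as both a similarity and a distance, here is a minimal Python sketch. The function names and the sample prefetch lists are hypothetical and are not taken from the talk.

```python
def jaccard_similarity(a, b):
    """J_S(A, B) = |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    """J_D(A, B) = 1 - J_S(A, B)."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical prefetch lists: 2 shared items out of 4 distinct items.
list_a = {"a.html", "b.html", "c.html"}
list_b = {"b.html", "c.html", "d.html"}
print(jaccard_similarity(list_a, list_b))  # 0.5
print(jaccard_distance(list_a, list_b))    # 0.5
```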
6. ### Intuition about similarity (informal semantics)

• The concept of similarity can have a very broad interpretation.
• With respect to ordering: two strings that contain the same words, but in a different order, should be recognized as similar.
• If one string is just a random anagram of the characters contained in the other, then it should be recognized as dissimilar.
• Content: should we consider FFUUU as more different from EEUUU than from PPUUU?
  – a) The count of position-wise mismatches is insensitive to which tokens do not match, and would lead to the same dissimilarity for any pair of the sequences FFUUU, EEUUU, and PPUUU.
  – b) Distances between states render dissimilarities sensitive to which of the elements do not match.
  – c) If we have any reason to consider P and F more similar than E and F, then FFUUU will be closer to PPUUU than to EEUUU.
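The FFUUU / EEUUU / PPUUU example can be checked with a short Python sketch. The substitution cost of 0.5 for the pair (P, F) is an arbitrary, hypothetical value chosen only to express "P and F are more similar than E and F"; the talk does not specify any costs.

```python
def mismatch_count(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(1 for a, b in zip(x, y) if a != b)


# Hypothetical state-to-state costs; unlisted pairs cost 1.0.
COST = {frozenset("PF"): 0.5}


def weighted_mismatch(x, y, cost=COST, default=1.0):
    """Like mismatch_count, but each mismatch is weighted by a pairwise cost."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    total = 0.0
    for a, b in zip(x, y):
        if a != b:
            total += cost.get(frozenset((a, b)), default)
    return total


print(mismatch_count("FFUUU", "EEUUU"), mismatch_count("FFUUU", "PPUUU"))        # 2 2
print(weighted_mismatch("FFUUU", "EEUUU"), weighted_mismatch("FFUUU", "PPUUU"))  # 2.0 1.0
```

With plain mismatch counting all three pairs are equally far apart; with the weighted variant FFUUU ends up closer to PPUUU than to EEUUU.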

9. ### Probabilistic interpretation

The probability that a random mapping by a hash function h_i (a random permutation of the subsets) produces the same values is

Pr(h_i(x) = h_i(y)) = J_S(x, y) + (1 − J_S(x, y)) / 2^k,

where k is the number of bits mapped by the hash function h_i, which is a random permutation of the bit vectors' coordinates (the same permutation on all vectors).

MinHashing: replace a long text document with a much shorter, unified-length MinHash signature. A MinHash function is defined as the index of the first bit, in the permuted order, that has the value 1.

h(x) = (ax + b) mod p

Then the MinHash signature of the set S is:

• [h_π1(S), h_π2(S), h_π3(S), …, h_πk(S)],
• where π1, π2, …, πk are random permutations of the bit vector's coordinates and h_1, h_2, h_3, …, h_k are their matching MinHash functions.

(1) MinHashing: convert large sets to short signatures (lists of integers) while preserving similarity.
(2) Locality-sensitive hashing: focus on pairs of signatures likely to be similar.
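A minimal Python sketch of a MinHash signature built from linear hash functions h(x) = (ax + b) mod p follows. Several choices here are assumptions, not details from the talk: each linear hash stands in for one random permutation, set elements are mapped to integers with `zlib.crc32`, p is the Mersenne prime 2^61 − 1, and all function names are hypothetical.

```python
import random
import zlib

P = (1 << 61) - 1  # Mersenne prime, larger than any crc32 value


def make_hash_funcs(k, seed=0):
    """Draw k coefficient pairs (a, b) for h(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def minhash_signature(items, funcs):
    """Signature = for each hash function, the minimum hash over the set."""
    xs = [zlib.crc32(s.encode("utf-8")) for s in items]
    if not xs:
        raise ValueError("cannot build a signature of an empty set")
    return [min((a * x + b) % P for x in xs) for a, b in funcs]


def estimate_similarity(sig1, sig2):
    """Fraction of agreeing positions, an estimate of the Jaccard similarity."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)


funcs = make_hash_funcs(k=128)
sig_a = minhash_signature({"abc", "bcd", "cde", "def"}, funcs)
sig_b = minhash_signature({"bcd", "cde", "def", "efg"}, funcs)
print(estimate_similarity(sig_a, sig_b))  # an estimate of J_S = 3/5
```

The estimate fluctuates around the true Jaccard similarity, and the fluctuation shrinks as the number of hash functions grows.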
10. ### Locality Sensitive Hashing

[Figure: N points → LSH → hash buckets]

Sources: https://ieeexplore.ieee.org/document/4472264; A. Rajaraman and J. Ullman, Mining of Massive Datasets, Ch. 3.4 (2010).
11. ### n-grams & “shingling”

• We regard a trace as plain text; no semantics are involved.
• n-gram: a substring of length n of a string.
• Collect all the n-grams of each trace and store them in a set.
12. ### n-gram Sets & Representing Strings

A representing string is obtained from a set of n-grams by:

(1) Sorting the n-grams alphabetically.
(2) Concatenating them to form one string.

Example: for the text abcdef, the 3-grams are “abc”, “bcd”, “cde”, “def”; sorting and concatenating them gives the representing string abcbcdcdedef.
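The construction of a representing string can be sketched in a few lines of Python. The function names are hypothetical; the example reproduces the abcdef case from the slide.

```python
def ngram_set(text, n):
    """All distinct substrings of length n (the shingles of the text)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def representing_string(text, n):
    """Sort the n-grams alphabetically and concatenate them."""
    return "".join(sorted(ngram_set(text, n)))


print(sorted(ngram_set("abcdef", 3)))     # ['abc', 'bcd', 'cde', 'def']
print(representing_string("abcdef", 3))   # abcbcdcdedef
```

Because the n-grams are stored in a set, repeated n-grams in the original text appear only once in the representing string.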
13. ### General scheme of Jaccard-based clustering

Source: http://www.mmds.org

Stages of the scheme, in order:

• Trace
• The set of n-grams
• Signatures: short integer vectors that represent the sets and reflect their Jaccard similarity
• Locality-sensitive hashing
• Candidate pairs: those pairs of signatures that we need to test for similarity
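The last stage, selecting candidate pairs from signatures, could look like the sketch below. It assumes the banding technique described in Mining of Massive Datasets: each signature is cut into bands, and two traces become a candidate pair if they agree on every row of at least one band. The function name, the band sizes, and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """signatures: dict name -> list of ints of length bands * rows."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            if len(sig) != bands * rows:
                raise ValueError("signature length must equal bands * rows")
            # The band's slice itself serves as the bucket key.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


toy = {"t1": [1, 2, 3, 4], "t2": [1, 2, 9, 9], "t3": [7, 7, 7, 7]}
print(lsh_candidate_pairs(toy, bands=2, rows=2))  # {('t1', 't2')}
```

Only the candidate pairs then need an exact similarity test, which avoids comparing all pairs of traces.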
14. ### Edit Distance

• Edit distance is defined over strings.
• Deleting, inserting, or substituting one character are the edit operations.
• The edit distance between strings x and y, ED(x, y), is the minimum number of edit operations required to transform x into y.
• Examples:
  – ED('abc', 'aac') = 1
  – ED('xabcy', 'aby') = 2
  – ED('revolution', 'evolution') = 1
  – ED('kitten', 'sitting') = 3
• Computed by dynamic programming in time quadratic in the strings' length.
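The dynamic-programming computation can be sketched as follows. This is the standard two-row Levenshtein recurrence, given as an illustration and not necessarily the implementation used by the authors; it reproduces the four examples from the slide.

```python
def edit_distance(x, y):
    """Minimum number of single-character deletions, insertions, substitutions."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("abc", "aac"))               # 1
print(edit_distance("xabcy", "aby"))             # 2
print(edit_distance("revolution", "evolution"))  # 1
print(edit_distance("kitten", "sitting"))        # 3
```

The running time is O(|x|·|y|), and only two rows of the table are kept in memory.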

16. ### Let’s consider the disadvantages and advantages of applying these metrics to similarity estimates for specific examples (malware traces, for example)…
17. ### Malware detection problem

• The problem is to classify a program as “benign” or “malicious”.
• Two possible paradigms: (i) feature-based classification; (ii) formal property verification.
• Malware variants share similar behaviors, yet they have different syntactic structure due to the incorporation of many obfuscation and code-change techniques.
• Requirement for feature-based models: a behavior-based feature model that describes the malicious actions exhibited by a malware instance.
• Possible means: perform a dynamic analysis on a recent malware dataset inside a controlled virtual environment and capture traces of the API calls invoked by malware instances.
18. ### Zero-day or unknown malware

• Zero-day or unknown malware is created using code obfuscation techniques that can modify the parent code.
• These techniques produce offspring copies that have the same functionality but different signatures.
19. ### Why System Calls?

• Definition: the list of system calls issued by a program for the duration of its execution is called a system call trace.
• Example trace entry: LoadLibrary lpFileName=VERSION.dll Return=SUCCESS
• A compromised program cannot cause significant damage to the underlying system without using system calls, e.g., creating a new process or accessing a file.
• Malware instances largely depend on API calls provided by the operating system to achieve their malicious tasks. Therefore, behavior-based techniques that utilize API calls are promising.
20. ### Possible trace semantics

• Event ID
• Event object (e.g., registry, file, process)
• Event subject, if applicable (i.e., the process that takes the action)
• Action parameters (e.g., registry value, file path, IP address)
• Status of the action (e.g., file handle created, registry removed)

Example:

• Kernel function: ZwOpenKey
• Parameters: \Registry\Machine\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\iexplore.exe
• Status: Key handle
21. ### Traces representation

Trace of system calls with part of their parameters:

• LoadLibrary lpFileName=VERSION.dll
• CreateFile hName=C:\WINDOWS\System32\Wbem\wmic.exe
• CreateFile hName=C:\WINDOWS\System32\Wbem\wmic.exe
• RegQueryValue hKey=HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders\Cache
• LoadLibrary lpFileName=advapi32.dll
• CreateProcessInternal lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl
• CreateProcessInternal lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl
• CreateProcess lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl
22. ### Example Trace

1416 malware.exe

• #237690000 LoadLibrary lpFileName=advapi32.dll Return=SUCCESS
• #238830000 RegQueryValue hKey=HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\CriticalSectionTimeout Return=SUCCESS
• #244600000 CreateFile hName=C:\WINDOWS\WindowsShell.Manifest desiredAccess=GENERIC_READ creationDisposition=OPEN_EXISTING Return=SUCCESS
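To turn a trace like the one above into a sequence that can be shingled, each event has to be split into its fields. The Python sketch below assumes a line layout of `#<timestamp> <API name> <parameters> Return=<status>`, inferred only from this one example; the real trace format may differ, and the regular expression and field names are hypothetical.

```python
import re

# Assumed layout: "#<timestamp> <API> <parameters...> [Return=<status>]"
EVENT = re.compile(r"^#(\d+)\s+(\w+)\s+(.*?)(?:\s+Return=(\w+))?$")


def parse_event(line):
    """Return a dict for an event line, or None for any other line."""
    m = EVENT.match(line.strip())
    if m is None:
        return None
    timestamp, api, args, status = m.groups()
    return {"timestamp": int(timestamp), "api": api, "args": args, "status": status}


TRACE = [
    "1416 malware.exe",
    r"#237690000 LoadLibrary lpFileName=advapi32.dll Return=SUCCESS",
    r"#238830000 RegQueryValue hKey=HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet"
    r"\Control\Session Manager\CriticalSectionTimeout Return=SUCCESS",
    r"#244600000 CreateFile hName=C:\WINDOWS\WindowsShell.Manifest "
    r"desiredAccess=GENERIC_READ creationDisposition=OPEN_EXISTING Return=SUCCESS",
]

events = [e for e in map(parse_event, TRACE) if e is not None]
api_sequence = [e["api"] for e in events]
print(api_sequence)  # ['LoadLibrary', 'RegQueryValue', 'CreateFile']
```

The parameters are kept as one string because values such as registry paths may contain spaces.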

25. ### J vs. NED: benefits and drawbacks for malware detection

Jaccard metrics:

• J_S (J_D) defines a true metric on the space of sets, as the triangle inequality holds for this distance. This is why it can be used effectively in clustering algorithms.
• Low computational complexity.
• Formally intended to measure the similarity of simple sets, not strings. In fact, the Jaccard similarity/distance of two traces x, y is a simple ratio of the numbers of coinciding symbols.
• Example: replacement attacks. J_S of two traces can incorrectly reflect a change in the control graph (which represents dependencies between the system calls in the program's execution traces).
• Reason: the connection between J_S (or J_D) and the structure and semantics of the program behavior displayed in traces is trivial, so the consequences of attacks may be detected incorrectly.
26. ### Contd.

NED:

• Deals with strings.
• Sensitive to the difference in string lengths.
• Allows a probabilistic interpretation.

However:

• In the general case, the triangle inequality does not hold.
• There is no obvious connection between the probability of two strings falling into the same bucket and their coincidence; therefore, it is difficult to implement efficient hashing as for Jaccard.
• Computational complexity.
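The failure of the triangle inequality can be seen on a small example. The sketch below assumes the common normalization NED(x, y) = ED(x, y) / max(|x|, |y|); this definition is not stated in this part of the transcript, so the definition used in the talk may differ. The three strings are hypothetical.

```python
def edit_distance(x, y):
    """Levenshtein distance via the standard two-row recurrence."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


def ned(x, y):
    """Assumed normalization: edit distance divided by the longer length."""
    longest = max(len(x), len(y))
    return edit_distance(x, y) / longest if longest else 0.0


x, y, z = "ab", "abc", "bc"
# NED(x, y) = 1/3, NED(y, z) = 1/3, NED(x, z) = 2/2 = 1
print(ned(x, y) + ned(y, z) < ned(x, z))  # True: the triangle inequality fails
```

Going through the intermediate string y is "shorter" than the direct distance from x to z, which a true metric does not allow.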

28. ### The question naturally arises: is it possible to construct a similarity metric that combines the advantages of both metrics without their drawbacks?

• The first problem along this way is how to relate two such different measures as Jaccard and NED: one is defined on unordered sets, the other on strings.
• We can overcome this by using the concept of a representing string (Jeff Ullman).
29. ### Representing string

A representing string is the result of shingling the original (raw) textual data. There is solid evidence that similarity estimation results obtained on representing strings can be applied to the raw strings whose n-gram representation has few repetitions.
30. ### Correspondence Between Representing Strings And Original Strings

[Figure: the average and standard deviation of the difference (delta) between the NED on pairs of original traces and the NED on the same pairs of representing strings, as a function of the n-gram length.]
31. ### Average NED

• α = min(|x|, |y|) / max(|x|, |y|)

[Figure: relationship between the Average Normalized Edit Distance, the length ratio α of the string pair, and the Jaccard distance.]
32. ### Some mathematical reasons

• Thus, for strings that are representing strings of n-gram sets, there is a direct relationship between J_D(X, Y) and the deletion/insertion-based edit distance E(x, y).
• For any pair of strings x, y, ED(x, y) does not exceed E(x, y).
• The elements in the representing strings are sorted and have no repetitions, so the strings' longest common subsequence (LCS) is exactly the size of the intersection of the corresponding sets.
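The relationship can be checked numerically if each n-gram is treated as a single symbol of the representing string. This token-level view is an assumption made for the sketch; under it, the LCS of two sorted, repetition-free sequences equals |X ∩ Y|, and the deletion/insertion distance |x| + |y| − 2·LCS equals J_D(X, Y)·|X ∪ Y|. The function names and sample strings are hypothetical.

```python
def ngram_set(text, n):
    """All distinct substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        curr = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a, b):
    """Edit distance with deletions and insertions only."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


X, Y = ngram_set("abcdef", 3), ngram_set("bcdefg", 3)
rep_x, rep_y = sorted(X), sorted(Y)  # representing strings as n-gram sequences

jd = 1 - len(X & Y) / len(X | Y)
print(lcs_length(rep_x, rep_y), len(X & Y))         # 3 3
print(indel_distance(rep_x, rep_y), len(X ^ Y))     # 2 2
```

Here J_D = 2/5 and |X ∪ Y| = 5, so J_D·|X ∪ Y| = 2, the same value as the deletion/insertion distance.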

36. ### Windows API Call Traces

• We trace calls to a selected set of 55 Windows APIs.
• The traces were obtained from the KVM hypervisor Runtime Execution Introspection and Profiling (REIP) system.
37. ### Goal and Contribution

• The Averaged Normalized Edit Distance (ANED) is proposed as a new similarity metric for classification-via-clustering problems. The average value is obtained by averaging over the interval of possible NED values.
• Benefit over J: the possibility of taking into account the explicit dependence on the difference in sizes of the compared sets.
• Benefits over NED: the possibility of using the triangle inequality property for clustering different subsets of strings; linear computational complexity.
• The relationship between the true values of J_D and NED and their approximate estimates is shown.
• ANED is more sensitive to semantic differences, yet it is calculated formally, without using any conditions on the semantic proximity of the objects being compared.