Exactpro
November 08, 2019

# About Using of Strings Similarity Conception in Software Engineering

Sergey Frenkel and Victor Zakharov

International Conference on Software Testing, Machine Learning and Complex Process Analysis (TMPA-2019)
7-9 November 2019, Tbilisi

TMPA Conference website https://tmpaconf.org/

## Transcript

Similarity Conception in
Software Engineering
Sergey Frenkel and Victor Zakharov
Institute of Informatics Problems,
FRC “Computer Science and Control”, RAS,
Moscow, Russia

2. Motivating Problem
Data set similarity tasks in software engineering:
• Web caching and prefetching (the variation between two prefetch lists is measured as the Jaccard distance between them).
• Client-service issues: distances between the chunk indexes of clients requesting the same content.
• Malware detection and attack recognition (estimating the dissimilarity between malicious and benign code, binaries, and traces).
For the most part, similarity estimation is based on the concept of distance between data sets.

3. Distance formality
• A distance (“distance in a space”) is a function D with nonnegative real values defined on the Cartesian product X × X, that is, D : X × X → R+. It is called a distance metric on X if for every x, y, z ∈ X:
• D(x, y) = 0 iff x = y (the identity axiom);
• D(x, y) + D(y, z) ≥ D(x, z) (the triangle inequality);
• D(x, y) = D(y, x) (the symmetry axiom).

A set X that is provided with a metric is called a metric space.
• A similarity measure S(x, y) is regarded as the inverse of the distance notion: it should obey analogous rules, but it grows as the differences between the objects x and y shrink. In particular, S(x, x) > S(x, y) for x ≠ y.

4. Main issues for all tasks
• What is the right representation of the document when we check for
similarity?
– E.g., representing a document as a set of characters will not do (why?)
• When we have billions of documents, keeping the full text in memory is not
an option.
– We need to find a shorter representation
• How do we do pairwise comparisons of billions of documents?
– If exact matching were the only issue, this would be manageable; can we replicate that idea for similarity?

5. Most popular metrics
• Jaccard rate (serves as both a distance and a similarity)
• Edit (Levenshtein) distance

6. Intuition about similarity (informal semantic)
• The concept of similarity can have a very
• With respect to ordering: two strings that contain the same words, but in a different order, should be recognized as similar.
• If one string is just a random anagram of the characters contained in the other, then it should be recognized as dissimilar.
• Content: should we consider FFUUU to be more different from EEUUU than from PPUUU?
• a) The count of position-wise mismatches is insensitive to which tokens do not match, and leads to the same dissimilarity for any pair of the sequences FFUUU, EEUUU, and PPUUU.
• b) Distances between states make dissimilarities sensitive to which of the elements do not match.
• c) If we have any reason to consider P and F as more similar than E and F, then FFUUU will be closer to PPUUU than to EEUUU.
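
Points a) to c) can be illustrated with a minimal sketch. The substitution costs below (0.5 for the pair P/F, 1.0 for any other mismatch) are hypothetical values chosen only for illustration; they are not taken from the slides.

```python
def mismatch_count(a, b):
    """Position-wise mismatch count for two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def example_cost(x, y):
    """Hypothetical state-to-state cost: P and F are treated as 'close'."""
    if x == y:
        return 0.0
    if {x, y} == {"P", "F"}:
        return 0.5
    return 1.0


def weighted_mismatch(a, b, cost=example_cost):
    """Sum of per-position costs, sensitive to WHICH tokens differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(cost(x, y) for x, y in zip(a, b))


# Plain counting cannot tell the pairs apart; the weighted version can.
plain = [mismatch_count("FFUUU", s) for s in ("EEUUU", "PPUUU")]
weighted = [weighted_mismatch("FFUUU", s) for s in ("EEUUU", "PPUUU")]
```

With these assumed costs, both pairs have two position-wise mismatches, while the weighted dissimilarity places FFUUU closer to PPUUU than to EEUUU.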

7. Jaccard Similarity
Sim_J = 3/8
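
A minimal sketch of the Jaccard similarity as the ratio of intersection size to union size. The two example sets are hypothetical; they are chosen only so that the intersection has 3 elements and the union has 8, which reproduces the value 3/8 shown on the slide.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance, defined as 1 - Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical sets: 3 shared elements, 8 elements in the union.
A = {"a", "b", "c", "d", "e"}
B = {"c", "d", "e", "f", "g", "h"}
sim = jaccard_similarity(A, B)
```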

8. Classification via clustering

9. Probabilistic interpretation
Probability that a random mapping by a hash function h_i yields the same value for both sets:
Pr(h_i(x_i) = h_i(y_i)) = J_S(x, y) + (1 − J_S) / 2^k
Here k is the number of bits mapped by the hash function h_i, which is a random permutation of the bit vectors' coordinates (the same permutation on all vectors).
MinHashing: replace a long text document with a much shorter, uniform-length MinHash signature.
A MinHash function is defined as the index of the first bit, in the permuted order, that has the value 1. In practice a permutation is simulated by a hash function of the form
h(x) = (ax + b) mod p
Then, the MinHash signature of the set S is:
• [h_π1(S), h_π2(S), h_π3(S), …, h_πk(S)],
• where π_1, π_2, …, π_k are random permutations of the bit vector's coordinates and h_1, h_2, h_3, …, h_k are their matching MinHash functions.
The overall approach has two steps:
1. Minhashing: convert large sets to short signatures (lists of integers) while preserving similarity.
2. Locality-sensitive hashing: focus on pairs of signatures likely to be similar.
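
A minimal MinHash sketch using hash functions of the form h(x) = (ax + b) mod p, as on the slide. The choices below are assumptions made for illustration: the prime p = 2^61 − 1, the use of `zlib.crc32` to turn set elements into integers, the seeded random coefficients, and the function names.

```python
import random
import zlib

P = 2**61 - 1  # a Mersenne prime, used here as the modulus p (assumption)


def make_hash_params(k, seed=0):
    """Draw k coefficient pairs (a, b) for h(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(k)]


def minhash_signature(items, params):
    """One minimum per hash function; `items` must be non-empty."""
    # crc32 gives a deterministic integer code for each element.
    codes = [zlib.crc32(str(x).encode("utf-8")) for x in set(items)]
    return [min((a * c + b) % P for c in codes) for a, b in params]


def estimate_jaccard(sig1, sig2):
    """Fraction of signature positions on which the two sets agree."""
    return sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)


params = make_hash_params(k=512)
sig_a = minhash_signature({"a", "b", "c", "d", "e"}, params)
sig_b = minhash_signature({"c", "d", "e", "f", "g", "h"}, params)
estimate = estimate_jaccard(sig_a, sig_b)  # should be near the true 3/8
```

The signature length k trades accuracy for size: the estimate is an average of k agreement indicators, so its spread shrinks roughly as 1/sqrt(k).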

10. Locality Sensitive Hashing
[Figure: N points are mapped by LSH into hash buckets.]
https://ieeexplore.ieee.org/document/4472264
A. Rajaraman and J. Ullman. Mining of Massive Datasets, Ch. 3.4 (2010).
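
One common way to realize the bucketing step is the banding technique: each signature is cut into bands, and two items become a candidate pair if they agree on every row of at least one band. The sketch below assumes this technique and uses hypothetical names and toy signatures; the slides do not say which LSH scheme was used.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """signatures: dict mapping a name to a list of bands*rows integers."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            if len(sig) != bands * rows:
                raise ValueError("signature length must equal bands * rows")
            # The slice of the signature for this band is the bucket key.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        # Every pair sharing a bucket in this band becomes a candidate.
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


# Toy signatures: "a" and "b" agree on the first band only.
toy = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
pairs = lsh_candidate_pairs(toy, bands=2, rows=2)
```

Only the candidate pairs need an exact similarity check afterwards, which avoids comparing all pairs of items.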

11. n-grams & “shingling”
• We regard a trace as plain text, with no semantics involved.
• n-gram: a substring of length n of a string.
• Collect all the n-grams of each trace and store them in a set.

12. n-gram Sets & Representing Strings
• A representing string is obtained from a set of n-grams by:
1. Sorting the n-grams alphabetically.
2. Concatenating them to form one string.
Example:
Text: abcdef
3-grams: “abc”, “bcd”, “cde”, “def”
Sorted n-gram set, then concatenation
Representing string: abcbcdcdedef
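
The two steps above can be sketched as follows; the function names are illustrative, and the example reproduces the slide's text "abcdef" with 3-grams.

```python
def ngrams(text, n):
    """Set of all substrings of length n (duplicates collapse)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def representing_string(text, n):
    """Sort the n-gram set alphabetically and concatenate it."""
    return "".join(sorted(ngrams(text, n)))


grams = ngrams("abcdef", 3)
rep = representing_string("abcdef", 3)
```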

13. General scheme of Jaccard-based clusterization
Source: http://www.mmds.org
1. Trace.
2. The set of n-grams.
3. Signatures: short integer vectors that represent the sets and reflect their Jaccard similarity.
4. Locality-Sensitive Hashing.
5. Candidate pairs: those pairs of signatures that we need to test for similarity.

14. Edit Distance
• Edit Distance is defined over strings.
• Deleting, inserting, or substituting one character are the edit operations.
• The Edit Distance between strings x and y, ED(x, y), is the minimum number of edit operations required to transform x into y.
• Examples:
ED('abc', 'aac') = 1
ED('xabcy', 'aby') = 2
ED('revolution', 'evolution') = 1
ED('kitten', 'sitting') = 3
• Computed by dynamic programming in time quadratic in the strings' length.
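
A minimal sketch of the standard dynamic-programming computation, keeping only two rows of the table. The function name is illustrative; the assertions in the usage lines mirror the examples listed on the slide.

```python
def edit_distance(x, y):
    """Levenshtein distance with unit-cost delete, insert, substitute."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        curr = [i]  # transforming x[:i] into the empty string costs i deletions
        for j, cy in enumerate(y, 1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


examples = [
    edit_distance("abc", "aac"),
    edit_distance("xabcy", "aby"),
    edit_distance("revolution", "evolution"),
    edit_distance("kitten", "sitting"),
]
```

The running time is O(|x|·|y|) and the memory is O(|y|), which matches the quadratic bound stated above.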

15. Normalized Edit Distance
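
The formula for this slide did not survive in the transcript. The sketch below assumes one common normalization, NED(x, y) = ED(x, y) / max(|x|, |y|), which keeps the value in [0, 1]; the authors may have used a different definition.

```python
def edit_distance(x, y):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        curr = [i]
        for j, cy in enumerate(y, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cx != cy)))
        prev = curr
    return prev[-1]


def normalized_edit_distance(x, y):
    """ASSUMED definition: ED(x, y) / max(|x|, |y|); 0.0 for two empty strings."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 0.0
    return edit_distance(x, y) / longest


ned = normalized_edit_distance("kitten", "sitting")  # 3 edits over length 7
```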

16. Let’s consider the disadvantages and advantages of applying these metrics to similarity estimates for specific examples (malware traces, for example)…

17. Malware detection problem
• The problem is to classify a program as “benign” or “malicious”.
• Two possible paradigms: (i) feature-based classification; (ii) formal property verification.
• Malware variants share similar behaviors, yet they have different syntactic structure due to the incorporation of many obfuscation and code-change techniques.
• Requirement for feature-based models: a behavior-based feature model that describes the malicious actions exhibited by a malware instance.
• Possible means: run a dynamic analysis on a recent malware dataset inside a controlled virtual environment and capture traces of the API calls invoked by malware instances.

18. Zero-day malware
• Zero-day or unknown malware is created using code obfuscation techniques that can modify the parent code.
• These techniques produce offspring copies that have the same functionality but different signatures.

19. Why System Calls?
• Definition: the list of system calls issued by a program for the duration of its execution is called a system call trace.
• A compromised program cannot cause significant damage to the underlying system without using system calls, e.g., creating a new process or accessing a file.
• Malware instances largely depend on API calls provided by the operating system to achieve their malicious tasks. Therefore, behavior-based techniques that utilize API calls are promising.

20. Possible trace semantic
• Event ID
• Event object (e.g., registry, file, process)
• Event subject, if applicable (i.e., the process that takes the action)
• Action parameters (e.g., registry value, file path, IP address)
• Status of the action (e.g., file handle created, registry removed)
Example:
• Kernel function: ZwOpenKey
• Parameters: \Registry\Machine\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\iexplore.exe
• Status: key handle

21. Traces representation
• Trace of system calls with part of parameters
hName=C:\WINDOWS\System32\Wbem\wmic.exe
CreateFile
hName=C:\WINDOWS\System32\Wbem\wmic.exe
RegQueryValue
hKey=HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders\Cache
CreateProcessInternal
lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl
CreateProcessInternal
lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl
CreateProcess
lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl

22. Example Trace:
1416 malware.exe
#237690000
Return=SUCCESS
#238830000
RegQueryValue
hKey=HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\CriticalSectionTimeout
Return=SUCCESS
#244600000
CreateFile
hName=C:\WINDOWS\WindowsShell.Manifest
creationDisposition=OPEN_EXISTING
Return=SUCCESS

23. System calls without parameters (e.g., the KVM data set)
RegOpenKeyExW
A subset of the Windows API and system calls that are considered informative for differentiating malware from benign software is logged by API monitors while a designated program is running in the system.

24. Detection = Classification

25. J vs. NED: benefits and drawbacks for malware detection
Jaccard metrics:
• The Jaccard distance J_D = 1 − J_S is a true metric on the space of sets, since the triangle inequality holds. This is why it can be used effectively in clustering algorithms.
• Low computational complexity.
• Formally intended to measure the similarity of plain sets, not strings.
• In fact, the Jaccard similarity/distance of two traces x, y is simply a ratio of the numbers of coinciding symbols.
• Example: replacement attacks. J_S of two traces can incorrectly reflect a change in the control graph (which represents the dependencies between the system calls in the traces of the program execution).
• Reason: because the connection between J_S (or J_D) and the structure and semantics of the program behavior displayed in traces is trivial, the consequences of attacks may be detected incorrectly.

26. Contd.
• NED:
– deals with strings;
– is sensitive to the difference in string lengths;
– allows a probabilistic interpretation.
• However:
– in the general case, the triangle inequality does not hold;
– there is no obvious connection between the probability of two strings falling into the same bucket and their degree of coincidence, so it is difficult to implement efficient hashing as is done for Jaccard;
– the computational complexity is higher (the underlying edit distance is quadratic in the strings' length).

27. Clustering

28. Combining the two metrics
• The question naturally arises: is it possible to construct a similarity metric that combines the advantages of both metrics without their drawbacks?
• The first problem on this path is how to relate two such different measures as Jaccard and NED, one of which is defined on unordered sets and the other on strings.
• We can overcome this by using the concept of a representing string (Jeffrey Ullman).

29. Representing string
A representing string is the result of shingling the original (raw) textual data. There is solid evidence that similarity estimates obtained on representing strings can be applied to raw strings whose n-gram representation has few repetitions.

30. Correspondence Between Representing Strings
And Original Strings
[Figure: The average and standard deviation of the difference (delta) between the NED on pairs of original traces and the NED on the same pairs of representing strings, as a function of the n-gram length.]

31. Average NED
• α = min(|x|, |y|) / max(|x|, |y|)
[Figure: Relationship between the Average Normalized Edit Distance, the length ratio α of the string pair, and the Jaccard Distance.]

32. Some mathematical reasons
• Thus, for strings that are representing strings of n-gram sets, there is a direct relationship between J_D(X, Y) and the deletion/insertion-based Edit Distance E(x, y).
• For any pair of strings x, y, ED(x, y) does not exceed E(x, y).
• The elements in the representing strings are sorted and have no repetitions.
• The length of the strings' longest common subsequence (LCS) is exactly the size of the intersection of the corresponding sets.
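
The LCS statement can be checked with a small sketch. It works at the token level, treating each n-gram as one symbol of the representing string; that granularity is an assumption made here for clarity. Under it, the insertion/deletion distance is E = |X| + |Y| − 2·LCS, which equals |X ∪ Y| − |X ∩ Y|, that is, J_D(X, Y)·|X ∪ Y|. The example strings and function names are hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, 1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(x, y):
    """Edit distance restricted to insertions and deletions."""
    return len(x) + len(y) - 2 * lcs_length(x, y)


def ngram_tokens(text, n):
    """Sorted, duplicate-free n-gram list (token-level representing string)."""
    return sorted({text[i:i + n] for i in range(len(text) - n + 1)})


x = ngram_tokens("abcdef", 3)  # ['abc', 'bcd', 'cde', 'def']
y = ngram_tokens("abcdxf", 3)  # ['abc', 'bcd', 'cdx', 'dxf']
shared = set(x) & set(y)
union = set(x) | set(y)
e_xy = indel_distance(x, y)
```

Because both token lists are sorted and free of repetitions, every shared n-gram appears in the same relative order in both, so the LCS cannot be shorter or longer than the intersection.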

33. Triangle inequality

34. Experiments for different metrics

35. Malware-benign similarity rates to the nearest medoid

36. Windows API Call Traces
• We trace calls to a selected set of 55 Windows APIs.
• The traces were obtained from the KVM hypervisor Runtime Execution Introspection and Profiling (REIP) system.

37. Goal and Contribution
• The Averaged Normalized Edit Distance (ANED) is proposed as a new similarity metric for classification-via-clustering problems.
– The average value is obtained by averaging over the interval of possible NED values.
• Benefits over J: the explicit dependence on the difference in sizes of the compared sets can be taken into account.
• Benefits over NED: the triangle inequality can be used for clustering different subsets of strings, and the computational complexity is linear.
• The relationship between the true values of J_D and NED and their approximate estimates is shown.
• ANED is more sensitive to semantic differences, yet it is calculated formally, without imposing any conditions on the semantic proximity of the objects being compared.