About Using of Strings Similarity Conception in Software Engineering

Exactpro
November 08, 2019

Sergey Frenkel and Victor Zakharov

International Conference on Software Testing, Machine Learning and Complex Process Analysis (TMPA-2019)
7-9 November 2019, Tbilisi

Video: https://youtu.be/7xtfC0QQtcE

TMPA Conference website https://tmpaconf.org/
TMPA Conference on Facebook https://www.facebook.com/groups/tmpaconf/

Transcript

  1. About Using of Strings
    Similarity Conception in
    Software Engineering
    Sergey Frenkel and Victor Zakharov
    Institute of Informatics Problems,
    FRC “Computer Science and Control”, RAS,
    Moscow, Russia

  2. Motivating Problem
    Data set similarity tasks in SW engineering:
    • Web caching and prefetching (variation between two prefetch lists
    measured as the Jaccard distance between them).
    • Client-service issues: distances between chunk indexes of clients
    requesting the same content.
    • Malware detection and attack recognition
    (estimation of dissimilarity between malicious and benign code,
    binaries, traces).
    For the most part, similarity estimation is based on the concept of
    distance between the data sets.

  3. Distance formality
    • A distance (“distance in a space”) is a function D with nonnegative real
    values defined on the Cartesian product X × X, that is, D : X × X → R+.
    It is called a distance metric on X if for every x, y, z ∈ X:
    • D(x, y) = 0 iff x = y (the identity axiom);
    • D(x, y) + D(y, z) ≥ D(x, z) (the triangle inequality);
    • D(x, y) = D(y, x) (the symmetry axiom).

    A set X provided with a metric is called a metric space.
    • A similarity metric S(x, y) is regarded as the inverse of the distance
    notion: it must follow these rules, but it is greater the smaller the
    differences between the objects x and y; in particular,
    S(x, x) > S(x, y) for x ≠ y.
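
The three axioms can be spot-checked numerically on a finite sample of points. The sketch below is only an illustration: the helper `is_metric` and the two example distances are hypothetical names, and passing the check on a sample is necessary but not sufficient evidence that a function is a metric.

```python
from itertools import product


def is_metric(dist, points, tol=1e-12):
    """Check the metric axioms for `dist` on a finite sample of points."""
    for x, y in product(points, repeat=2):
        d = dist(x, y)
        if d < 0:
            return False
        # Identity axiom: D(x, y) = 0 iff x = y.
        if (d <= tol) != (x == y):
            return False
        # Symmetry axiom: D(x, y) = D(y, x).
        if abs(d - dist(y, x)) > tol:
            return False
    # Triangle inequality: D(x, y) + D(y, z) >= D(x, z).
    for x, y, z in product(points, repeat=3):
        if dist(x, y) + dist(y, z) + tol < dist(x, z):
            return False
    return True


def abs_diff(x, y):
    return abs(x - y)


def squared_diff(x, y):
    return (x - y) ** 2


print(is_metric(abs_diff, [0, 1, 2, 5]))   # True
print(is_metric(squared_diff, [0, 1, 2]))  # False: 1 + 1 < 4 breaks the triangle inequality
```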

  4. Main issues for all tasks
    • What is the right representation of the document when we check for
    similarity?
    – E.g., representing a document as a set of characters will not do (why?)
    • When we have billions of documents, keeping the full text in memory is not
    an option.
    – We need to find a shorter representation
    • How do we do pairwise comparisons of billions of documents?
    – If exact match was the issue it would be ok, can we replicate this idea?

  5. Most popular metrics:
    • Jaccard rate – both distance and similarity
    • Edit (Levenshtein) distance

  6. Intuition about similarity (informal semantics)
    • The concept of similarity can have a very
    broad interpretation…
    • With respect to ordering: two strings which contain the same words,
    but in a different order, should be recognized as being similar.
    • If one string is just a random anagram of the characters contained in
    the other, then it should be recognized as dissimilar.
    • Content: should we consider FFUUU as more different from EEUUU
    than from PPUUU?
    • a) The count of position-wise mismatches is insensitive to which
    tokens do not match, and would lead to the same dissimilarity for
    any pair of the sequences FFUUU, EEUUU, and PPUUU.
    • b) Using distances between states renders dissimilarities sensitive to
    which of the elements do not match.
    • c) If we have any reason to consider P and F as more similar
    than E and F, then FFUUU will be closer to PPUUU than to EEUUU.
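
Points a) to c) can be illustrated with a small sketch. The substitution cost table below is purely hypothetical (it simply assumes P and F are "closer" than other pairs), and the function names are not from the talk.

```python
def mismatch_count(x, y):
    """Count position-wise mismatches between two equal-length strings."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    return sum(1 for a, b in zip(x, y) if a != b)


# Hypothetical costs: replacing F by P (or P by F) costs less than other swaps.
SUBSTITUTION_COST = {frozenset("FP"): 0.5}


def weighted_mismatch(x, y, default_cost=1.0):
    """Position-wise dissimilarity that depends on which tokens differ."""
    if len(x) != len(y):
        raise ValueError("strings must have equal length")
    total = 0.0
    for a, b in zip(x, y):
        if a != b:
            total += SUBSTITUTION_COST.get(frozenset((a, b)), default_cost)
    return total


# a) plain mismatch counting cannot tell the pairs apart
print(mismatch_count("FFUUU", "EEUUU"), mismatch_count("FFUUU", "PPUUU"))  # 2 2
# b), c) with state distances, FFUUU is closer to PPUUU than to EEUUU
print(weighted_mismatch("FFUUU", "PPUUU"), weighted_mismatch("FFUUU", "EEUUU"))  # 1.0 2.0
```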

  7. Jaccard Similarity
    SimJ = 3/8
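
For reference, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union, and the Jaccard distance is one minus that value. The sketch below reproduces a value of 3/8 with two made-up sets (3 shared elements, 8 elements in the union); the sets themselves are an assumption, since the slide figure is not part of the transcript.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical sets: 3 common elements, 8 elements in the union.
left = {1, 2, 3, 4, 5}
right = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(left, right))  # 0.375
print(jaccard_distance(left, right))    # 0.625
```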

  8. Classification via clustering

  9. Probabilistic interpretation
    Probability that a random mapping by a hash function h_i produces the
    same value for x and y:
    Pr(h_i(x) = h_i(y)) = J_S(x, y) + (1 − J_S(x, y)) / 2^k,
    where k is the number of bits mapped by the hash function h_i, and h_i is
    a random permutation of the bit vectors' coordinates (the same permutation
    on all vectors).
    MinHashing: replace a long text document by a much shorter, fixed-length
    MinHash signature.
    A MinHash function is defined as the index of the first bit, in the permuted
    order, to have the value 1.
    h(x) = (ax + b) mod p
    Then, the MinHash signature of the set S is:
    • [h_π1(S), h_π2(S), h_π3(S), …, h_πk(S)],
    • π1, π2, …, πk are random permutations of the bit vector's coordinates and
    h_1, h_2, h_3, …, h_k are their matching MinHash functions.
    1. MinHashing: convert large sets to short signatures (lists of integers), while
    preserving similarity.
    2. Locality-sensitive hashing: focus on pairs of signatures likely to be
    similar.
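
A minimal sketch of MinHash signatures, assuming each permutation is simulated by a hash of the form h(x) = (ax + b) mod p as on the slide. The prime `P`, the use of `zlib.crc32` to turn tokens into integers, and all function names are illustrative choices, not the authors' implementation.

```python
import random
import zlib

P = (1 << 61) - 1  # a Mersenne prime, larger than any 32-bit element id


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for num_hashes functions h(x) = (a*x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(num_hashes)]


def element_id(token):
    """Map a string token to a stable integer id (assumption: CRC32 is good enough here)."""
    return zlib.crc32(token.encode("utf-8"))


def minhash_signature(tokens, params):
    """Signature = minimum hash value of the (non-empty) set under each hash function."""
    ids = [element_id(t) for t in tokens]
    return [min((a * x + b) % P for x in ids) for a, b in params]


def estimate_jaccard(sig_x, sig_y):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for u, v in zip(sig_x, sig_y) if u == v)
    return matches / len(sig_x)


params = make_hash_params(64)
sig_a = minhash_signature({"abc", "bcd", "cde", "def"}, params)
sig_b = minhash_signature({"abc", "bcd", "cdx", "dxf"}, params)
print(len(sig_a), estimate_jaccard(sig_a, sig_b))
```

The estimate converges to the Jaccard similarity of the underlying sets as the number of hash functions grows; with 64 functions it is only a rough approximation.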

  10. Locality Sensitive Hashing
    [Figure: N points mapped by LSH into hash buckets]
    https://ieeexplore.ieee.org/document/4472264
    A. Rajaraman and J. Ullman. Mining of Massive
    Datasets, Ch. 3.4 (2010).
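
One common way to realize LSH on MinHash signatures is the banding technique described in the cited chapter: split each signature into bands, hash each band to a bucket, and treat any two items sharing a bucket in at least one band as a candidate pair. The sketch below assumes this technique; the function name, the parameter names, and the toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """signatures: dict mapping a name to a list of ints of length bands * rows."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            # Items whose signature slices agree on the whole band share a bucket.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates


toy_signatures = {
    "t1": [1, 2, 3, 4],
    "t2": [1, 2, 9, 9],
    "t3": [7, 7, 7, 7],
}
print(lsh_candidate_pairs(toy_signatures, bands=2, rows=2))  # {('t1', 't2')}
```

Only candidate pairs then need an exact similarity computation, which is what avoids comparing all pairs of signatures.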

  11. n-grams & “shingling”
    • We regard a trace as plain text – no
    semantics involved.
    • n-gram : substring of length n of a string.
    • Collect all the n-grams of each trace and
    store them in a set.

  12. n-gram Sets & Representing Strings
    • A representing string is obtained
    from a set of n-grams by:
    1. Sorting the n-grams
    alphabetically.
    2. Concatenating them to form one
    string.
    Example:
    Text: abcdef
    3-grams: “abc”, “bcd”, “cde”, “def”
    Sorted n-gram set → concat → Representing string: abcbcdcdedef
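
The two steps can be sketched as follows for character n-grams, as in the slide example. The function names are illustrative; for system call traces one could equally take n-grams over call names instead of characters.

```python
def ngrams(text, n):
    """Set of all substrings of length n (duplicates collapse in the set)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def representing_string(text, n):
    """Sort the n-gram set alphabetically and concatenate it into one string."""
    return "".join(sorted(ngrams(text, n)))


print(sorted(ngrams("abcdef", 3)))       # ['abc', 'bcd', 'cde', 'def']
print(representing_string("abcdef", 3))  # abcbcdcdedef
```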

  13. General scheme of Jaccard-based clustering
    Source: http://www.mmds.org
    Trace
    → The set of n-grams
    → Signatures: short integer vectors that represent the sets and
    reflect their Jaccard similarity
    → Locality-Sensitive Hashing
    → Candidate pairs: those pairs of signatures that we need
    to test for similarity

  14. Edit Distance
    • Edit Distance is defined over strings.
    • Deletion, insertion, and substitution of one character are the edit operations.
    • The Edit Distance between strings x and y, ED(x, y), is the minimum
    number of edit operations required to transform x into y.
    • Examples:
    ED('abc', 'aac') = 1
    ED('xabcy', 'aby') = 2
    ED('revolution', 'evolution') = 1
    ED('kitten', 'sitting') = 3
    • Computed by dynamic programming in time quadratic in the
    strings' length.
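
The dynamic programming computation can be sketched as below. This is the standard Levenshtein recurrence kept to two rows of the table; the function name is illustrative.

```python
def edit_distance(x, y):
    """Levenshtein distance via dynamic programming, O(len(x) * len(y)) time."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, 1):
        cur = [i]  # transforming x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1,               # delete cx
                cur[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


for a, b in [("abc", "aac"), ("xabcy", "aby"),
             ("revolution", "evolution"), ("kitten", "sitting")]:
    print(a, b, edit_distance(a, b))
```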

  15. Normalized Edit Distance

  16. Let’s consider the disadvantages and
    advantages of applying these metrics to
    similarity estimates for specific examples
    (malware traces, for example)…

  17. Malware detection problem
    • The problem is to classify a program as “benign” or “malicious”.
    • Two possible paradigms: (i) feature-based classification,
    (ii) formal properties verification.
    • Malware variants share similar behaviors, yet they have different
    syntactic structure due to the incorporation of many obfuscation and
    code-change techniques.
    • Requirements for feature-based models:
    a behavior-based feature model that describes the malicious actions
    exhibited by a malware instance.
    Possible means:
    dynamic analysis of a recent malware dataset inside a controlled
    virtual environment, capturing traces of API calls invoked by
    malware instances.

  18. • Zero-day or unknown malware is created
    using code obfuscation techniques that
    modify the parent code
    and produce offspring copies which have the
    same functionality but different
    signatures.

  19. Why System Calls?
    • Definition: the list of system calls issued by a program for the
    duration of its execution is called a system call trace.
    • Example entry: LoadLibrary lpFileName=VERSION.dll Return=SUCCESS
    • A compromised program cannot cause significant damage to the
    underlying system without using system calls, e.g. creating a new
    process, accessing a file, etc.
    • Malware instances largely depend on API calls provided by the
    operating system to achieve their malicious tasks. Therefore,
    behavior-based techniques that utilize API calls are promising.

  20. Possible trace semantics
    • Event ID
    • Event object (e.g. registry, file, process, etc.)
    • Event subject if applicable (i.e. the process that takes the
    action)
    • Action parameters (e.g. registry value, file path, IP address)
    • Status of the action (e.g. file handle created, registry removed,
    etc.)
    Example:
    • Kernel function: ZwOpenKey
    • Parameters: \Registry\Machine\Software\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\iexplore.exe
    • Status: Key handle

  21. Traces representation
    • Trace of system calls with part of the parameters:
    LoadLibrary lpFileName=VERSION.dll
    CreateFile hName=C:\WINDOWS\System32\Wbem\wmic.exe
    CreateFile hName=C:\WINDOWS\System32\Wbem\wmic.exe
    RegQueryValue hKey=HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Explorer\Shell Folders\Cache
    LoadLibrary lpFileName=advapi32.dll
    CreateProcessInternal lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl
    CreateProcessInternal lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl
    CreateProcess lpCommandLine=WMIC csproduct Get UUID /FORMAT:textvaluelist.xsl

  22. Example Trace:
    1416 malware.exe
    #237690000
    LoadLibrary
    lpFileName=advapi32.dll
    Return=SUCCESS
    #238830000
    RegQueryValue
    hKey=HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\CriticalSectionTimeout
    Return=SUCCESS
    #244600000
    CreateFile
    hName=C:\WINDOWS\WindowsShell.Manifest
    desiredAccess=GENERIC_READ
    creationDisposition=OPEN_EXISTING
    Return=SUCCESS

  23. System calls without parameters (e.g. KVM data set)
    • 1,RegOpenKeyExW LoadLibraryA GetProcAddress
    GetProcAddress GetProcAddress GetProcAddress GetProcAddress
    RegOpenKeyExW
    • 0,RegOpenKeyExW LoadLibraryA GetProcAddress
    GetProcAddress GetProcAddress GetProcAddress GetProcAddress
    LoadLibraryA
    A subset of the Windows API/system calls considered
    informative for differentiating malware from benign software is
    logged by API monitors while a designated program is running in
    the system.

  24. Detection=Classification

  25. J vs. NED: benefits and drawbacks for malware detection
    Jaccard metrics:
    • J_S (J_D) is a true metric in the space of sets with such a distance,
    as the triangle inequality holds. This is why it may be
    effectively used in clustering algorithms.
    • Low computational complexity.
    • Formally intended to measure the similarity of simple sets,
    not strings.
    • In fact, the Jaccard similarity/distance of two traces x, y is a
    simple ratio of the numbers of coinciding symbols.
    Example: replacement attacks. J_S of two traces can incorrectly
    reflect a change in the control graph (which represents dependencies
    between the system calls in the traces of the program execution).
    Reason: because the connection between J_S (or J_D) and the
    structure and semantics of the program behavior displayed in
    traces is trivial, incorrect detection of the consequences of
    attacks is possible.
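
The set-versus-string limitation can be seen on a toy case: two traces that issue the same calls in a different order are indistinguishable for a Jaccard measure over sets of single calls. The two traces below are made up for illustration, reusing call names from the earlier slides.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two non-empty collections treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical traces: same calls, different order.
trace_a = "RegOpenKeyExW LoadLibraryA GetProcAddress".split()
trace_b = "GetProcAddress LoadLibraryA RegOpenKeyExW".split()

similarity = jaccard_similarity(trace_a, trace_b)
print(similarity)          # 1.0, although the call order differs
print(trace_a == trace_b)  # False
```

Using n-grams with n > 1 instead of single calls recovers part of the ordering information, which is why the shingling step matters.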

  26. Contd.
    • NED:
    • Deals with strings.
    • Sensitive to the difference in string lengths.
    • Allows a probabilistic interpretation.
    • However:
    • In the general case, the triangle inequality does not hold.
    • There is no obvious connection between the probability of
    two strings getting into the same bucket and their degree of
    coincidence; therefore, it is difficult to implement efficient
    hashing as for Jaccard.
    • Higher computational complexity (quadratic in the strings' length).

  27. Clustering

  28. The question naturally arises: is it possible to construct a
    similarity metric that combines the advantages of
    both metrics without their drawbacks?
    The first problem on this way is how to relate
    such different measures as Jaccard and
    NED (one is defined on unordered sets, the other
    on strings).
    We can overcome this by using the concept of a
    representing string (Jeffrey Ullman).

  29. Representing string
    A representing string is the result of
    shingling the original (raw) textual data.
    There is solid evidence that similarity
    estimates obtained on representing strings
    can be applied to the (raw) strings, provided
    their n-gram representation has few repetitions.

  30. Correspondence Between Representing Strings
    And Original Strings
    The average and standard
    deviation of the difference (delta)
    between the NED on pairs of
    original traces and the NED on the
    same pair of representing strings,
    as a function of the n-gram length.

  31. Average NED
    • α = min(|x|, |y|) / max(|x|, |y|)
    Relationship between the Average Normalized Edit Distance, the length
    ratio α of the string pair, and the Jaccard Distance.

  32. Some mathematical reasons
    • Thus, for strings that are representing strings of n-gram sets, there is a
    direct relationship between J_D(X, Y) and the deletion/insertion-based edit
    distance E(x, y).
    • For any pair of strings x, y, ED(x, y) does not exceed E(x, y).
    • The elements in the representing strings are sorted and have no
    repetitions.
    • The length of the strings' longest common subsequence (LCS) is exactly
    the size of the intersection of the corresponding sets.
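
One way to see this relationship is the sketch below. It rests on an assumption not stated on the slide: the representing strings are treated as sorted sequences of whole n-grams, and each insertion or deletion acts on one n-gram. Under that assumption E = |x| + |y| − 2·LCS, which equals the size of the symmetric difference of the sets, so J_D = 2E / (|X| + |Y| + E). The function names and the two example strings are illustrative.

```python
def ngram_set(text, n):
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


X = ngram_set("abcdef", 3)   # {abc, bcd, cde, def}
Y = ngram_set("abcdxf", 3)   # {abc, bcd, cdx, dxf}
x_tokens, y_tokens = sorted(X), sorted(Y)  # representing strings as token sequences

lcs = lcs_length(x_tokens, y_tokens)
indel_distance = len(x_tokens) + len(y_tokens) - 2 * lcs  # insert/delete only
jaccard_dist = 1 - len(X & Y) / len(X | Y)

print(lcs, len(X & Y))                                               # 2 2
print(jaccard_dist, 2 * indel_distance / (len(X) + len(Y) + indel_distance))
```

Both printed values on the last line agree (2/3 for this example), which is the direct link between J_D and the insertion/deletion distance under the stated assumption.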

  33. Triangle inequality

  34. Experiments for different
    metrics

  35. Malware-benign similarity rates to the nearest
    medoid

  36. Windows API Call Traces
    • We trace calls to a selected set (55) of
    Windows APIs.
    • The traces were obtained from the KVM
    hypervisor Runtime Execution
    Introspection and Profiling (REIP) system.

  37. Goal and Contribution
    • The Averaged Normalized Edit Distance (ANED) is proposed as a new
    similarity metric for classification-via-clustering problems.
    • The average value is obtained by averaging over the interval of
    possible NED values.
    • Benefits over J: the possibility of taking into account the explicit
    dependence on the difference in sizes of the compared sets.
    • Benefits over NED: the possibility of using the triangle
    inequality for clustering different subsets of strings; linear
    computational complexity.
    • The relationship between the true values of J_D and NED and
    their approximate estimates is shown.
    • ANED is more sensitive to semantic differences, yet it is formally
    calculated without using any conditions on the semantic proximity
    of the objects being compared.