Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Martínez at Big Data Spain 2017

Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Martínez at Big Data Spain 2017

In recent years Machine Learning (ML) and especially Deep Learning (DL) have achieved great success in many areas such as visual recognition, NLP or even aiding in medical research.


Big Data Spain 2017
16th - 17th Kinépolis Madrid


Big Data Spain

December 01, 2017


  1. None
  2. Attacking Machine Learning used in AntiVirus with RL Datahack

  3. # whoami Rubén Martínez Sánchez • Twitter: @eldarsilver • Computer

    Engineer (Universidad Politécnica Madrid) • Security Researcher (Pentester) • Certified Etical Hacker (CEH) • Member of MundoHacker (TV Show) • Master Data Science Datahack • Cloudera Developer Training for Apache Spark • Cloudera Developer Training for Apache Hadoop
  4. Agenda# ls() • Static Malware Analysis • Reinforcement Learning (RL)

    • Antivirus Evasion using RL • Demo Antivirus Evasion using RL
  5. # cat Static_Malware_Analysis • Definition  Static Malware Analysis is

    usually performed by dissecting the different resources of the binary file without executing it and studying each component. The binary file can also be disassembled (or reverse engineered) using a disassembler such as IDA or radare. (Wikipedia)  Search for signatures in the executable.
  6. # cat Static_Malware_Analysis • Portable Executable (PE)  The Portable

    Executable (PE) file format is a data structure that contains the information necessary for the Windows OS loader to manage the wrapped executable code.  PE File Format by Saurabh & Chinmaya
  7. # cat Reinforcement_Learning • Definitions  A Reinforcement Learning model

    consists of an angent and an environment.  For each turn, an agent receives a state and may choose one from a set of actions .  The policy is the agent’s behavior, i.e., a mapping from states to actions .  The agent receives the next state and a scalar reward .  http://www.ausy.tu-darmstadt.de/Research/Research Α
  8. # cat Reinforcement_Learning • Definitions  Immediate rewards are generally

    not very helpful while learning a game. So, what we should aim for is long term rewards.  The long term reward of step t will be:  The agent aims to maximize the expectation of such long term return from each state.  The parameter is the discount factor that defines the weight of distant rewards in relation to those obtained sooner.  The discounting by ensured that this sum is finite.
  9. # cat Reinforcement_Learning • Q value  The optimal action-value

    function: Q value  A Neural Network will be used to approximate this function.  Next we can define the policy to choose an action.  The Loss function to update the Network: http://web.stanford.edu/class/cs20si/lectures/slides_14.pdf
  10. # cat Reinforcement_Learning • Actor-Critic Algorithms  The actor produces

    an action given the current state of the environment.  The critic produces a TD (Temporal-Difference) error signal given the state and resultant reward.  If the critic is estimating the action-value function Q(s,a), it will also need the output of the actor.  The output of the critic drives learning in both the actor and the critic.  In Deep Reinforcement Learning, neural networks can be used to represent the actor and critic structures.
  11. # cat Antivirus_Evasion_Using_RL • Overview  The environment → the

    malware sample.  The environment emits the state in the form of a 2350-dimensional feature vector:  PE header metadata.  Section metadata: section name, size and characteristics.  Import & Export Table metadata.  Counts of human readable strings.  Byte histogram.
  12. # cat Antivirus_Evasion_Using_RL • Overview  The agent → the

    algorithm used to change the environment.  The agent sends actions to the environment, and the environment replies with observations and rewards (that is, a score).  There will be an anti-malware engine (the attack target).  Each step will provide:  Reward: value of reward scored by the previous action. 10.0 (pass), 0.0 (fail).  Observation space (object): feature vector summarizing the composition of the malware sample.  Done(bool): Determines whether environment needs to be reset; True means episode was successful.
  13. # cat Antivirus_Evasion_Using_RL • Overview  The actions that can

    be performed on a malware sample in our environment consist of the following binary manipulations: * append_zero * append_random_ascii * append_random_bytes * remove_signature * upx_pack * upx_unpack * change_section_names_from_list * change_section_names_to random * modify_export * remove_debug * break_optional_header_checksum  Over time, the agent learns which combinations lead to the highest rewards, or learns a policy.
  14. #python Demo_Effect.py 