usually performed by dissecting the different resources of the binary file without executing it, and studying each component. The binary file can also be disassembled (or reverse engineered) using a disassembler such as IDA or radare. (Wikipedia) Another common static technique is to search for known signatures in the executable.
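A minimal sketch of what a signature search could look like; the signature set and the `scan` helper are illustrative (real engines use far richer matching, e.g. YARA rules), but the byte patterns shown are real markers: `MZ` is the DOS magic at the start of every PE file, and `UPX!` is left in sections by the UPX packer.

```python
# Naive static "signature" scan: look for known byte patterns in the raw file.
SIGNATURES = {
    "MZ header": b"MZ",     # DOS stub magic at the start of every PE file
    "UPX packer": b"UPX!",  # marker left in the binary by the UPX packer
}

def scan(data: bytes) -> list[str]:
    """Return the names of all signatures found in the raw bytes."""
    return [name for name, pattern in SIGNATURES.items() if pattern in data]

# Example: a minimal fake PE-like blob containing both patterns.
blob = b"MZ" + b"\x90" * 64 + b"UPX!" + b"\x00" * 16
print(scan(blob))  # ['MZ header', 'UPX packer']
```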
consists of an agent and an environment. At each turn, the agent receives a state s and chooses one action a from a set of actions A. The policy is the agent's behavior, i.e., a mapping from states to actions. The agent then receives the next state s' and a scalar reward r. http://www.ausy.tu-darmstadt.de/Research/Research
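The agent-environment loop above can be sketched as follows; `ToyEnv` and `RandomAgent` are illustrative stand-ins, not part of any real framework, and the `step`/`reset` interface mirrors the Gym convention.

```python
import random

class RandomAgent:
    """Simplest possible policy: ignore the state, pick an action uniformly."""
    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        return random.choice(self.actions)

class ToyEnv:
    """Minimal stand-in environment: rewards action 1, episode ends after 3 steps."""
    def reset(self):
        self.t = 0
        return 0  # initial state

    def step(self, action):
        self.t += 1
        # next state, scalar reward, done flag
        return self.t, float(action == 1), self.t >= 3

def run_episode(env, agent, max_steps=100):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                # policy: state -> action
        state, reward, done = env.step(action)   # environment replies
        total_reward += reward
        if done:
            break
    return total_reward
```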
not very helpful while learning a game. So what we should aim for is the long-term reward. The long-term reward from step t will be: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … = Σ_{k=0}^{∞} γ^k r_{t+k+1}. The agent aims to maximize the expectation of this long-term return from each state. The parameter γ ∈ [0, 1) is the discount factor that defines the weight of distant rewards relative to those obtained sooner. Discounting by γ ensures that this sum is finite.
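The discounted return can be computed with a single backward pass over the reward sequence; this helper is a sketch, not from the source.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R_t = sum_k gamma**k * r_{t+k+1} for the start of the sequence."""
    g = 0.0
    for r in reversed(rewards):  # fold from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```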
function: the Q-value Q(s, a), the expected long-term return of taking action a in state s. A neural network will be used to approximate this function. Next we can define the policy to choose an action: π(s) = argmax_a Q(s, a). The loss function to update the network is the squared TD error: L = (r + γ max_{a'} Q(s', a') − Q(s, a))². http://web.stanford.edu/class/cs20si/lectures/slides_14.pdf
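A sketch of that loss with Q as a plain lookup table rather than a neural network; the update rule is the same one a network would minimize by gradient descent. The states, actions, and values here are invented for illustration.

```python
from collections import defaultdict

def td_loss(Q, s, a, r, s_next, actions, gamma=0.99):
    """Squared TD error: (r + gamma * max_a' Q(s', a') - Q(s, a))**2."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    return (target - Q[(s, a)]) ** 2

Q = defaultdict(float)                 # unseen (state, action) pairs default to 0
Q[("s0", "append_zero")] = 0.5         # toy current estimate
loss = td_loss(Q, "s0", "append_zero", 10.0, "s1",
               actions=["append_zero", "upx_pack"])
# target = 10.0 + 0.99 * 0 = 10.0, so loss = (10.0 - 0.5)**2 = 90.25
```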
an action given the current state of the environment. The critic produces a TD (temporal-difference) error signal given the state and the resulting reward. If the critic is estimating the action-value function Q(s, a), it will also need the output of the actor. The critic's output drives learning in both the actor and the critic. In deep reinforcement learning, neural networks are used to represent both the actor and the critic.
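The critic's error signal can be sketched as a single scalar; the value estimates below are toy numbers, and the actor/critic updates are described in comments rather than implemented, just to show how one signal drives both.

```python
def td_error(r, v_s, v_s_next, gamma=0.99):
    """delta = r + gamma * V(s') - V(s): positive means 'better than expected'."""
    return r + gamma * v_s_next - v_s

# Critic update: move V(s) toward the TD target r + gamma * V(s').
# Actor update: increase the probability of the taken action when delta > 0,
# decrease it when delta < 0.
delta = td_error(r=10.0, v_s=2.0, v_s_next=0.0)  # 10.0 + 0.0 - 2.0 = 8.0
```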
malware sample. The environment emits the state in the form of a 2,350-dimensional feature vector:
* PE header metadata
* Section metadata: section name, size and characteristics
* Import & export table metadata
* Counts of human-readable strings
* Byte histogram
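One of the feature groups above, the byte histogram, is simple enough to sketch: a 256-bin count of byte values over the whole file, normalized to sum to 1. The helper name is illustrative.

```python
def byte_histogram(data: bytes) -> list[float]:
    """Normalized 256-bin histogram of byte values in the file."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    total = len(data) or 1  # avoid division by zero on an empty file
    return [c / total for c in counts]

hist = byte_histogram(b"\x00\x00\xff\xff")  # half zeros, half 0xFF bytes
```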
algorithm used to change the environment. The agent sends actions to the environment, and the environment replies with observations and rewards (that is, a score). The environment contains an anti-malware engine (the attack target). Each step provides:
* Reward (float): value of the reward scored by the previous action: 10.0 (pass, i.e., the sample evades the engine) or 0.0 (fail, still detected).
* Observation space (object): feature vector summarizing the composition of the malware sample.
* Done (bool): determines whether the environment needs to be reset; True means the episode was successful.
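The step contract above can be sketched as a single function; `extract_features` and `is_detected` are hypothetical stand-ins for the feature extractor and the anti-malware engine's verdict.

```python
PASS_REWARD, FAIL_REWARD = 10.0, 0.0

def step_result(sample_bytes, extract_features, is_detected):
    """Return the (observation, reward, done) triple for one environment step."""
    evaded = not is_detected(sample_bytes)
    reward = PASS_REWARD if evaded else FAIL_REWARD
    observation = extract_features(sample_bytes)  # e.g. the 2,350-dim vector
    done = evaded  # episode is successful (and env resets) once the sample evades
    return observation, reward, done

obs, r, done = step_result(
    b"MZ...",
    extract_features=lambda b: [len(b)],  # stand-in feature extractor
    is_detected=lambda b: False,          # stand-in engine verdict: evaded
)
```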
be performed on a malware sample in our environment consist of the following binary manipulations:
* append_zero
* append_random_ascii
* append_random_bytes
* remove_signature
* upx_pack
* upx_unpack
* change_section_names_from_list
* change_section_names_to_random
* modify_export
* remove_debug
* break_optional_header_checksum
Over time, the agent learns which combinations lead to the highest rewards, i.e., it learns a policy.
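Dispatching these actions by name might look like the sketch below; only the two append actions are implemented (the simplest, functionality-preserving ones), and the dispatch table and helper names are illustrative, not the environment's real code.

```python
import os

def append_zero(data: bytes) -> bytes:
    """Append a single zero byte to the end of the file."""
    return data + b"\x00"

def append_random_bytes(data: bytes, n: int = 16) -> bytes:
    """Append n random bytes to the end of the file."""
    return data + os.urandom(n)

# Map action names (as the agent sees them) to manipulation functions.
ACTIONS = {
    "append_zero": append_zero,
    "append_random_bytes": append_random_bytes,
}

def apply_action(data: bytes, name: str) -> bytes:
    return ACTIONS[name](data)

mutated = apply_action(b"MZ", "append_zero")  # b"MZ\x00"
```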