Malware Classification Presentation

Machine Learning and Images for Malware Detection and Classification
Konstantinos Kosmidis, June 2017

About me
▪ BSc in Computer Science (AUTH), MSc in Communications and Cybersecurity (IHU)
▪ Dissertation research with Prof. Ch. Kalloniatis
▪ Interested in:
  1. Machine Learning and Deep Learning
  2. Vulnerability Research
• GitHub: https://github.com/consteax
• Twitter: @consteax
This presentation, however, is more oriented toward malware research.

Agenda
▪ Brief Definitions and the Question of the Research Presented
▪ Research Goals and Why Malware Research Is Important
▪ Feature Engineering
▪ Classification Algorithms
▪ Attacking Machine Learning Model
▪ Defending Machine Learning Model
▪ Why It Is Hard to Defend Machine Learning Systems
▪ Future Work, Ideas, Improvements

Definition – Question of the research

Research Goals – Why malware research is important
The general goal is to extend and improve a system by:
❖ Performing malware detection.
❖ Classifying malware into families.
❖ Finding new features and improving existing ones.
❖ Applying a feature selection algorithm that selects the most discriminative features.
Importance of Malware Research
❖ Real-time detection before infection, which enables prevention.
❖ Several incidents so far; in the future: Internet of Things.
Difficulties
❖ Many variants, which makes this a large-scale problem.
❖ Appropriate datasets are difficult to find.
❖ Hardware resources (memory, hard drive).

Advantages – Disadvantages of the Methodology
Advantages of Image-Based Malware Analysis and Classification
❖ Fast.
❖ No execution or disassembly required.
❖ Images give more information about the construction and architecture of the malware.
❖ Helps identify the encryption algorithm of ransomware families.
Disadvantages of Image-Based Malware Analysis and Classification
❖ Data dependent: the analysis depends on existing malware, which makes it hard to detect and counter zero-day attacks.
❖ Characterization: malware converted to images gives little information about what the malware does, beyond the signature (label) assigned by vendors.

Collecting and Presenting Malware
Malware Datasets
• Byte files, asm files, .exe files
• Supervised learning: the labels are the malware families
Overfitting
• Occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. The model then describes random error or noise instead of the underlying relationship.
Cross-Validation
• A technique to evaluate predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k equal-size subsamples; each subsample serves once as the test set while the remaining k-1 are used for training.
Pipeline: Create Features -> Classification Algorithm -> Cross-Validation
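The k-fold procedure described on this slide can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: the helper names (k_fold_indices, cross_validate), the majority-class baseline, and the toy labels are hypothetical and are not taken from the presentation, which does not show its evaluation code.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition range(n_samples) into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, labels, k, train_fn, predict_fn, seed=0):
    """Return the accuracy on each held-out fold (requires k <= len(samples))."""
    scores = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Hypothetical baseline: always predict the most frequent training label.
def train_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def predict_majority(model, x):
    return model

# Toy data: 10 samples, 7 of family "a" and 3 of family "b" (made-up labels).
samples = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
scores = cross_validate(samples, labels, 5, train_majority, predict_majority)
mean_accuracy = sum(scores) / len(scores)
```

Every sample is tested exactly once, so the mean over folds uses all of the data for evaluation without ever scoring a model on samples it was trained on.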

Feature 1 – Segment Count
[Figure: snippet of an asm file, with the fields Segment, Address, Bytes, Opcode, and Operands labeled]
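As a rough sketch of how such a feature could be extracted, the code below counts lines per segment. It assumes each asm line starts with "segment:address", as in IDA-style listings; the sample lines, the segment names, and the function name segment_counts are hypothetical, and the slide does not specify whether the feature counts lines per segment or something else.

```python
from collections import Counter

def segment_counts(asm_lines):
    """Count how many lines belong to each segment (text before the first ':')."""
    counts = Counter()
    for line in asm_lines:
        prefix, sep, _ = line.partition(":")
        if sep and prefix.strip():
            counts[prefix.strip()] += 1
    return counts

# Hypothetical lines in the layout: Segment:Address Bytes Opcode Operands
sample = [
    ".text:00401000 56 push esi",
    ".text:00401001 8D 44 24 08 lea eax, [esp+8]",
    ".rdata:0040A000 48 65 6C 6C db 'Hell'",
    ".data:0040C000 00 00 00 00 dd 0",
]
counts = segment_counts(sample)

# A fixed segment order turns the counts into a feature vector.
segment_names = [".text", ".rdata", ".data", ".rsrc"]
features = [counts.get(name, 0) for name in segment_names]
```

Using a fixed list of segment names keeps the feature vector the same length for every sample, which is what the classifiers later in the deck need.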

Feature 2 – Pixel intensity
❖ Malware can be visualized as grayscale images using the byte files, asm files, or exe files.
Pipeline: Malware Binary -> Binary to 8-bit Vector -> 8-bit Vector to Grayscale Image -> Image
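The conversion above can be sketched with the standard library alone: each byte becomes one pixel intensity in the range 0 to 255, and the flat vector is cut into rows. The image width, the zero padding of the last row, and the PGM output are assumptions made for this illustration; the slide does not state how the width is chosen or which image format was used.

```python
def bytes_to_grayscale(data, width):
    """Reshape raw bytes into rows of `width` pixel intensities (0-255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)  # each byte is already an 8-bit intensity
    remainder = len(pixels) % width
    if remainder:
        # Assumption: pad the last row with black pixels.
        pixels.extend([0] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

def to_pgm(image):
    """Serialize the rows as a binary PGM (P5) grayscale image."""
    height, width = len(image), len(image[0])
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(p for row in image for p in row)

# Stand-in for the bytes of a malware binary.
image = bytes_to_grayscale(bytes([0, 255, 16, 32, 64]), width=2)
pgm_bytes = to_pgm(image)
```

Writing pgm_bytes to a file with the .pgm extension gives an image that common viewers can open, and the flattened pixel rows can serve directly as the pixel-intensity feature vector.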

Classification Algorithms
❖ Random Forest: random decision forests are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
❖ Multilayer Perceptron: a supervised learning algorithm that learns a function f: R^m -> R^o by training on a dataset, where m is the number of dimensions for input and o is the number of dimensions for output. Given a set of features X = {x1, x2, ..., xm} and a target y, it can learn a non-linear function approximator for either classification or regression. It differs from logistic regression in that between the input and the output layer there can be one or more non-linear layers, called hidden layers.
❖ Decision Tree: a non-parametric supervised learning method, here a binary tree, that can be used for classification. Each leaf of the tree carries a class label, and several leaves may share the same label. The goal of a decision tree is to build a model that predicts the target variable from decision rules learned from the extracted data features.
❖ Nearest Centroid: also called the nearest prototype classifier. Each class is represented by the centroid (mean) of its training samples, and an observation is assigned the label of the class whose centroid is closest to it.
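Of the four algorithms, nearest centroid is simple enough to sketch in full. The code below is a minimal illustration using Euclidean distance; the function names, the two-dimensional toy features, and the family labels are hypothetical and are not the implementation used for the results in this presentation.

```python
import math

def fit_centroids(samples, labels):
    """Compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (Euclidean distance)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))

# Toy training set: two made-up families with two features each.
train_x = [[0, 0], [0, 2], [10, 10], [12, 10]]
train_y = ["family_a", "family_a", "family_b", "family_b"]
centroids = fit_centroids(train_x, train_y)

label_near_a = predict(centroids, [1, 1])
label_near_b = predict(centroids, [9, 9])
```

Training only averages the samples of each class and prediction only compares against one centroid per class, which is consistent with the short training and testing times reported for this classifier.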

Classification diagrams-graphs (1)
Random Forest: Accuracy = 91.6%, Training Time = 1.72 secs, Testing Time = 0.0063 secs
Multilayer Perceptron: Accuracy = 87.8%, Training Time = 16.05 secs, Testing Time = 0.091 secs

Classification diagrams-graphs (2)
Decision Tree: Accuracy = 88.8%, Training Time = 6.53 secs, Testing Time = 0.00096 secs
Nearest Centroid: Accuracy = 85.6%, Training Time = 0.218 secs, Testing Time = 0.00211 secs

Attacking Machine Learning Model
❖ Adversarial examples
[Figure: adversarial example, taken from openai.com]
❖ My attack and threat model for making malicious examples
[Diagram: Steganography; labels: Clean Images (MNIST / ImageNet), Malicious Images, Python script, function, New Malicious Images]
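To make the notion of an adversarial example concrete, here is a toy sketch of the fast gradient sign method (FGSM) against a logistic regression model. The slide does not say which crafting method was used, so FGSM is swapped in here purely as a well-known illustration; the weights, the input, and the step size eps are made-up values, and a real image attack would also clip pixels back to the valid range.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """Move x by eps in the direction that increases the cross-entropy loss."""
    p = predict_proba(w, b, x)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Made-up model and input; the true label y is class 1.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1

p_clean = predict_proba(w, b, x)
x_adv = fgsm(w, b, x, y, eps=0.5)
p_adv = predict_proba(w, b, x_adv)
```

In this toy case the perturbed input crosses the decision boundary: p_clean is above 0.5 and p_adv is below it, although each feature moved by only eps.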

Defending Machine Learning Model
❖ Adversarial training: brute force solution.
❖ Defensive distillation: output probabilities of different classes.
❖ Implementation, Image Forensics - Steganalysis: statistical or noise floor consistency analysis.

Why is it hard to defend against adversarial examples?
❖ It is difficult to construct a theoretical model of the adversarial example crafting process.
❖ Machine learning models are required to produce good outputs for every possible input.

Future Work
❖ Combine malicious images with clean images (ImageNet / MNIST).
❖ Besides malicious images and PDFs, cover other online objects: video, voice/sound, rar files.
❖ Test these algorithms on new datasets and develop new, better algorithms.

