Malware Classification Presentation

Konstantinos Kosmidis

June 25, 2017

Transcript

  1. About me
    ▪ BSc in Computer Science (AUTH), MSc in Communications and Cybersecurity (IHU)
    ▪ Research during my dissertation with Prof. Ch. Kalloniatis
    ▪ Interested in:
      1. Machine Learning - Deep Learning
      2. Vulnerability Research
    • GitHub: https://github.com/consteax
    • Twitter: @consteax
    But this presentation is more malware-research oriented.
  2. Agenda
    ▪ Brief Definitions - Question of the research presented
    ▪ Research Goals - Why malware research is important
    ▪ Feature Engineering
    ▪ Classification Algorithms
    ▪ Attacking Machine Learning Model
    ▪ Defending Machine Learning Model
    ▪ Why it is hard to defend Machine Learning Systems
    ▪ Future Work - Ideas - Improvements
  3. Research Goals – Why malware research is important
    The goal in general is to extend and improve a system by:
    ❖ Performing malware detection.
    ❖ Performing classification of malware families.
    ❖ Finding new features and improving old ones.
    ❖ Applying a feature selection algorithm that selects the most discriminative features.
    Importance of Malware Research
    ❖ Real-time detection before infection -> prevention
    ❖ Several incidents -> Future: Internet of Things.
    Difficulties
    ❖ Many variants - large-scale problem.
    ❖ Difficult to find appropriate datasets to process.
    ❖ Hardware resources (memory, hard drive).
  4. Advantages – Disadvantages of the Methodology
    Advantages of Image-Dependent Malware Analysis and Classification
    ❖ Fast
    ❖ No execution or disassembly.
    ❖ Images give more information about the construction/architecture of the malware.
    ❖ Identifying the encryption algorithm of ransomware families
    Disadvantages of Image-Dependent Malware Analysis and Classification
    ❖ Data-dependent: the analysis depends on existing malware, which makes it problematic to counter and detect zero-day attacks.
    ❖ Characterization: malware converted to images does not give much information on what the malware does, other than the signature (label) given by vendors.
  5. Collecting and Presenting Malware
    Malware Datasets
    ▪ Byte files, asm files, .exe files
    ▪ Supervised -> labels, malware families
    Overfitting
    • Occurs when a model is excessively complex, such as having too many parameters relative to the number of observations, so that it fits random error or noise instead of the underlying relationship.
    Cross-Validation
    • A technique to evaluate predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k equal-size subsamples.
    Create Features -> Classification Algorithm -> Cross-Validation
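The k-fold partitioning described on this slide can be sketched in a few lines of plain Python. This is only an illustration, not the code used in the research: the function name `k_fold_splits` and the `seed` parameter are hypothetical, and fold sizes differ by at most one when the sample count is not divisible by k.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; not taken from the presentation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")

    # Randomly partition the sample indices (reproducible through the seed).
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    for train, test in k_fold_splits(10, 5):
        print("train:", sorted(train), "test:", sorted(test))
```

Each sample lands in exactly one test fold, so every sample is used once for evaluation and k-1 times for training.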
  6. Feature 2 – Pixel intensity
    ❖ Malware can be visualized as grayscale images using the byte files, asm files or exe files.
    Malware Binary -> Binary to 8-bit Vector -> 8-bit Vector to Grayscale Image -> Image
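A minimal sketch of the binary-to-image step, assuming each byte maps directly to one pixel intensity (0 = black, 255 = white). The function names and the fixed `width` parameter are assumptions made for illustration; the slides do not state how the image width was chosen. The PGM writer is included only so the result can be opened in an image viewer without extra libraries.

```python
def bytes_to_grayscale(data, width=256):
    """Turn raw bytes into rows of 8-bit pixel intensities.

    Hypothetical helper: `width` is an assumed parameter, and the last row
    is padded with zeros (black) when the data does not fill it.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = list(data)  # each byte becomes an int in 0..255
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def write_pgm(rows, path):
    """Save the rows as a binary PGM (P5) grayscale image."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    with open(path, "wb") as handle:
        handle.write(f"P5\n{width} {height}\n255\n".encode("ascii"))
        for row in rows:
            handle.write(bytes(row))


if __name__ == "__main__":
    # Stand-in bytes; a real run would read a sample with open(path, "rb").
    sample = bytes(range(256)) * 4
    image = bytes_to_grayscale(sample, width=32)
    write_pgm(image, "sample.pgm")
    print(len(image), "rows of", len(image[0]), "pixels")
```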
  7. Classification Algorithms
    ❖ Random Forest: random decision forests are an ensemble learning method for classification, regression and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
    ❖ Multilayer Perceptron: a supervised learning algorithm that learns a function f: R^m -> R^o by training on a dataset, where m is the number of dimensions for input and o is the number of dimensions for output. Given a set of features X = {x1, x2, ..., xm} and a target y, it can learn a non-linear function approximator for either classification or regression. It differs from logistic regression in that between the input and the output layer there can be one or more non-linear layers, called hidden layers.
    ❖ Decision Tree: a binary tree and a non-parametric supervised learning method that can be used for classification. Each leaf of the tree carries a class label, and several leaves may share the same label. A decision tree's goal is to build a prediction model that computes the target variable from decision rules derived from the extracted data features.
    ❖ Nearest Centroid: also called a nearest prototype classifier. Each class is represented by the centroid (mean) of its training samples, and an observation is assigned the label of the class whose centroid is closest to it.
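Of the four classifiers, nearest centroid is simple enough to sketch in full. The following is a plain-Python illustration under the assumption of Euclidean distance; the function names are hypothetical and this is not the implementation used to produce the results in the deck.

```python
import math


def fit_centroids(samples, labels):
    """Compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


if __name__ == "__main__":
    # Toy two-feature data with made-up family names, for illustration only.
    X = [[0, 0], [0, 2], [10, 10], [12, 10]]
    y = ["family_a", "family_a", "family_b", "family_b"]
    model = fit_centroids(X, y)
    print(predict(model, [1, 1]), predict(model, [9, 9]))
```

Training only averages the samples of each class and prediction compares against one point per class, so both steps are cheap.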
  8. Classification diagrams-graphs (1)
    Random Forest
    ▪ Accuracy = 0.916 (91.6%)
    ▪ Training Time = 1.72 secs
    ▪ Testing Time = 0.0063 secs
    Multilayer Perceptron
    ▪ Accuracy = 0.878 (87.8%)
    ▪ Training Time = 16.05 secs
    ▪ Testing Time = 0.091 secs
  9. Classification diagrams-graphs (2)
    Decision Tree
    ▪ Accuracy = 0.888 (88.8%)
    ▪ Training Time = 6.53 secs
    ▪ Testing Time = 0.00096 secs
    Nearest Centroid
    ▪ Accuracy = 0.856 (85.6%)
    ▪ Training Time = 0.218 secs
    ▪ Testing Time = 0.00211 secs
  10. Attacking Machine Learning Model
    ❖ Adversarial examples
    ❖ My attacking-threat model on making malicious examples
    Diagram labels: Steganography; Clean Images (MNIST / ImageNet); Malicious Images; New Malicious Images; Python script; function. Taken from openai.com.
  11. Defend Machine Learning Model
    ▪ Adversarial training: brute force solution
    ▪ Defensive distillation: output probabilities of different classes
    ▪ Implementation, Image Forensics - Steganalysis: statistical or noise floor consistency analysis
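To illustrate the "output probabilities of different classes" idea behind defensive distillation: in the published technique, a model is trained on soft class probabilities produced by a softmax evaluated at a raised temperature, rather than on hard labels. The sketch below shows only that softmax step in plain Python. The function name, the example logits and the temperature value are assumptions for illustration, and nothing here reflects how (or whether) the presenter implemented the defense.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw class scores into probabilities.

    A higher temperature flattens the distribution, giving the softer
    labels that defensive distillation trains on.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.1]  # made-up logits for three classes
    print("T=1: ", softmax(scores))
    print("T=20:", softmax(scores, temperature=20.0))
```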
  12. Why is it hard to defend against adversarial examples?
    ▪ It is difficult to construct a theoretical model of the adversarial example crafting process.
    ▪ Models are required to produce good outputs for every possible input.
  13. Future Work
    ❖ Combine malicious images with clean images - ImageNet/MNIST
    ❖ Besides malicious images/PDFs, there are other online objects -> video, voice/sound, rar files.
    ❖ Test those algorithms on new datasets and make new and better algorithms.
  14. References-Bibliography
    • [1] SARVAM: Search and Retrieval of Malware - Lakshmanan Nataraj, Dhilung Kirat, Giovanni Vigna. Proc. Annual Computer Security Applications Conference (ACSAC) Workshop on Next Generation Malware Attacks and Defense (NGMAD), New Orleans, Dec. 2013. http://sarvam.ece.ucsb.edu
    • [2] Malware Images: Visualization and Automatic Classification - L. Nataraj, S. Karthikeyan, G. Jacob. Proceedings of the 8th International Symposium on Visualization for Cyber Security (VizSec), Article No. 4, 2011.
    • [3] A Comparative Assessment of Malware Classification using Binary Texture Analysis and Dynamic Analysis - Lakshmanan Nataraj, Vinod Yegneswaran, Phillip Porras, Jian Zhang. Proceedings of ACM CCS Workshop on Artificial Intelligence and Security (AISec), 2011.
    • [4] Malware Analysis and Classification: A Survey - Gandotra, E., Bansal, D. and Sofat, S. Journal of Information Security, 5, 56-64, 2014. doi: 10.4236/jis.2014.52006
    GitHub: https://github.com/consteax
    Twitter: @consteax