PCI 2017 - Machine Learning for Malware Detection

Konstantinos Kosmidis
September 30, 2017

Transcript

  1. About us

    Konstantinos Kosmidis: currently Research Fellow in Security Engineering and member of the SESAME project at the Sense Research Cluster, University of Brighton. Interested in:
    1. Machine Learning, Deep Learning
    2. Vulnerability Research and Malware Research

    Christos Kalloniatis: Associate Professor; Director, Privacy Engineering and Social Informatics Laboratory, Department of Cultural Technology and Communication, University of the Aegean.
  2. Agenda

    • Brief definitions and the research question presented
    • Research Goals
    • Experiment Setup
    • Feature Engineering
    • Classification Algorithms
    • Conclusions and Future Work
  3. Why malware detection using ML is important

    How do antivirus products detect malware?
    • Pattern matching on static files.

    How does malware evade antivirus products?
    • It uses polymorphic packers that evade static pattern matching.
    • Malware like Zeus are masters of evasion.

    Why Machine Learning?
    • Many variants of malware: a large-scale problem.
    • Too much data for a researcher to process.

    Difficulties
    • Appropriate datasets are difficult to find.
    • Hardware resources (memory, hard drive).
  4. Research Goals

    The general goal is to extend and improve a system by:
    • Performing malware detection.
    • Performing classification of malware families.
    • Finding new features and improving old ones.
  5. Experiment Setup

    • Ubuntu 16.04 LTS (Xenial Xerus), 64-bit (AMD64), with 16 GB RAM and a 1 TB hard drive.
    • Python programming language: the pyleargist package, TensorFlow (Multilayer Perceptron).
    • The dataset is the Malimg dataset, from Nataraj et al., 2011, "Malware Images: Visualization and Automatic Classification".
  6. Dataset: Collecting and Presenting Malware

    Malimg Malware Dataset
    Supervised dataset: 25 malware families, with labels.
    Around 9000 samples -> asm files, exe files, byte files.
    • Nataraj et al. tested it in various experiments, so it is a dataset on which the conversion to images is known to work and to perform well, which is important.
    • Having 25 malware families in an organized dataset plays a key role and helps feature computation and label computation.
    • The dataset is imbalanced, so building a training set must include stratified sampling of the populations to prevent overfitting and under-generalization on specific malware variants.
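The stratified sampling mentioned on this slide can be sketched as follows. This is a minimal illustration in plain Python, not the presenters' code: the function name `stratified_split`, the 80/20 ratio, and the toy sample counts are all assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each family keeps roughly the same share in both sets."""
    by_family = defaultdict(list)
    for idx, family in enumerate(labels):
        by_family[family].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for family in sorted(by_family):
        indices = by_family[family]
        rng.shuffle(indices)
        # Hold out at least one sample per family when the family has 2+ samples.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy, imbalanced label list (family names from the slides; the counts are made up).
labels = ["Autorun.K"] * 10 + ["Yuner.A"] * 80 + ["C2LOP.P"] * 20
train_idx, test_idx = stratified_split(labels)
```

Splitting per family, rather than over the whole pool, keeps small families such as the toy `Autorun.K` group represented in both the training and the test set.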
  7. Dataset: Collecting and Presenting Malware

    Snippet of an asm file (fields: segment, address, bytes, opcode, operands):

    .text:00401394 B8 1F CD 98 AE       mov  eax, 0AE98CD1Fh
    .text:00401399 F7 E1                mul  ecx
    .text:0040139B C1 EA 1E             shr  edx, 1Eh
    .text:0040139E 69 D2 FA C9 D6 5D    imul edx, 5DD6C9FAh
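Each line of such a listing can be split into the fields named on the slide with a regular expression. The sketch below assumes the disassembler-style layout shown in the snippet (uppercase hex bytes, lowercase mnemonic); the pattern and the name `parse_asm_line` are illustrative and do not come from the talk.

```python
import re

# segment ":" address, then space-separated uppercase hex bytes, then mnemonic and operands.
ASM_LINE = re.compile(
    r"^(?P<segment>\.\w+):\s*(?P<address>[0-9A-Fa-f]{8})\s+"
    r"(?P<bytes>(?:[0-9A-F]{2}\s+)+)"
    r"(?P<opcode>[a-z]\w*)\s*(?P<operands>.*)$"
)

def parse_asm_line(line):
    """Split one listing line into its fields, or return None if it does not match."""
    m = ASM_LINE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["bytes"] = fields["bytes"].split()
    fields["operands"] = [op.strip() for op in fields["operands"].split(",") if op.strip()]
    return fields

parsed = parse_asm_line(".text:00401394 B8 1F CD 98 AE mov eax, 0AE98CD1Fh")
```

The `bytes` field is what feeds an image-based representation; the `opcode` and `operands` fields would only matter for features built from the disassembly.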
  8. Feature: Pixel intensity

    • Malware can be visualized as gray-scale images using the byte files, asm files or exe files.
    • The first 800 pixels are used.

    [Figures: image of an asm file; image of a byte file]
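One way to read the pixel-intensity feature: treat every byte of the file as one gray-scale pixel (0 to 255), wrap the byte stream into rows of a fixed width, and keep the first 800 pixel values as the feature vector. The sketch below assumes this reading; the row width, the zero padding for short files, and the helper names are illustrative choices, not details given in the talk.

```python
def bytes_to_image(data, width=32):
    """Wrap a byte string into rows of `width` gray-scale pixels (values 0..255)."""
    pixels = list(data)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]

def first_pixels(data, n=800):
    """Return the first `n` pixel intensities, zero-padded if the file is shorter."""
    pixels = list(data[:n])
    return pixels + [0] * (n - len(pixels))

# Demo on the 16 instruction bytes from the asm snippet instead of a real malware file.
sample = bytes.fromhex("B81FCD98AE" "F7E1" "C1EA1E" "69D2FAC9D65D")
image = bytes_to_image(sample, width=4)   # 4 rows of 4 pixels
features = first_pixels(sample)           # fixed-length vector of 800 intensities
```

A fixed-length vector matters because the classifiers compared later expect every sample to have the same number of features, regardless of file size.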
  9. Classification Algorithms: Decision Tree

    • There is no misclassification, only minor accuracy errors (blue squares far from the diagonal of the confusion matrix), meaning that under this algorithm some samples share similarities with other specimens.
    • The C2LOP.P malware family may have similarities with C2LOP.gen!g. The names support this assumption, since both belong to the same malware family and are considered variants, but some of their samples also land on Swizzor.gen!I.
    • Labeling-wise, those few samples are either a detection error, or more in-depth features need to be examined to identify whether similarities exist. The same goes for the other malware families that behave this way.
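The off-diagonal cells discussed on this slide can be listed directly from the true and predicted labels. The sketch below is a generic illustration in plain Python; the helper names are hypothetical, and the label lists are invented for the example rather than taken from the experiments.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true family, predicted family) pairs."""
    return Counter(zip(y_true, y_pred))

def off_diagonal(counts):
    """Return the confused (true, predicted) pairs, most frequent first."""
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    return sorted(errors.items(), key=lambda item: (-item[1], item[0]))

# Invented labels for illustration only; these are not results from the talk.
y_true = ["C2LOP.P", "C2LOP.P", "C2LOP.P", "C2LOP.gen!g", "Swizzor.gen!I"]
y_pred = ["C2LOP.P", "C2LOP.gen!g", "Swizzor.gen!I", "C2LOP.gen!g", "Swizzor.gen!I"]
confused = off_diagonal(confusion_counts(y_true, y_pred))
```

Sorting the confused pairs by count makes it easy to see which families a given classifier tends to mix up, which is the comparison the following slides make algorithm by algorithm.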
  10. Classification Algorithms: Perceptron

    • Most samples of the Autorun.K malware family were classified into another malware family.
    • The same goes for C2LOP.gen!g and C2LOP.P, many of whose samples were found similar to the Swizzor malware family.
  11. Classification Algorithms: Multilayer Perceptron

    • Autorun.K is detected but classified into the Yuner.A malware family, so training the algorithm a bit more could improve the results for this type of malware.
    • The confusion matrix shows that the Swizzor variants are classified correctly.
  12. Classification Algorithms: Stochastic Gradient

    • There is a misclassification, which for malware detection counts as an error: Swizzor.gen!I is detected but classified into another, similar malware family, Swizzor.gen!E.
    • Most Autorun.K samples were misclassified as Yuner.A, so the algorithm finds similarities between those two families.
  13. Classification Algorithms: Nearest Centroid

    • There is no misclassification; samples are classified to the correct labels apart from some minor errors.
    • C2LOP.P deserves emphasis: only a small percentage of its samples is detected and classified correctly, while most are classified into other malware families, so this is not the best algorithm for handling similar samples of this kind of malware family.
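For reference, the nearest centroid rule averages the feature vectors of each family and assigns a new sample to the family whose mean is closest. The sketch below is a minimal pure-Python version with Euclidean distance; the talk does not state which implementation or distance was used, and the toy vectors are made up.

```python
import math
from collections import defaultdict

def fit_centroids(X, y):
    """Average the feature vectors of each family into one centroid per family."""
    groups = defaultdict(list)
    for vector, label in zip(X, y):
        groups[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in groups.items()
    }

def predict(centroids, vector):
    """Assign the sample to the family with the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))

# Toy two-dimensional features; real inputs would be the pixel-intensity vectors.
X = [[0, 0], [0, 2], [10, 10], [12, 10]]
y = ["Autorun.K", "Autorun.K", "Yuner.A", "Yuner.A"]
centroids = fit_centroids(X, y)
```

Because each family is reduced to a single mean vector, families whose samples are spread out or overlap with another family's mean are easily absorbed by the neighbor, which is consistent with the behavior described for C2LOP.P.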
  14. Classification Algorithms: Random Forest

    • Best accuracy.
    • Everything was classified under the correct labels, and again the Swizzor variants were detected.
  15. Summary of Results

    Classification Algorithm   Average Training Time (secs)   Testing Time (secs)   Average Accuracy
    Decision Tree              6.53                           0.00096               88%
    Nearest Centroid           0.218                          0.0211                85.6%
    Stochastic Gradient        1.291                          0.0585                87% + 2 misclassifications
    Perceptron                 0.817                          0.0418                90.5% + 4 misclassifications
    Multilayer Perceptron      16.05                          0.091                 87%
    Random Forest              1.72                           0.0063                91.6%
  16. Conclusions

    Advantages of image-based malware analysis and classification
    • Fast.
    • No execution or disassembly.
    • Images give more information about the construction/architecture of the malware.
    • New naming schemes can be created that depend on related malware images.

    Disadvantages of image-based malware analysis and classification
    • Data dependent: the analysis depends on existing malware, which makes it problematic to counter and detect zero-day attacks.
    • Characterization: malware converted to images does not give much information on what the malware does, other than the signature (label) given by vendors.

    Silver bullet? No, there is none.
  17. Future Work

    • Test these algorithms on new datasets and develop new and better algorithms.
    • Apply a feature selection algorithm that selects the most discriminative features.

    In progress
    • Combine malicious images with clean images (ImageNet/MNIST).
    • Besides malicious images/PDFs, there are other online objects to consider: video, voice.
  18. References and Bibliography

    • [1] SARVAM: Search and Retrieval of Malware. Lakshmanan Nataraj, Dhilung Kirat, Giovanni Vigna. Proc. Annual Computer Security Applications Conference (ACSAC) Workshop on Next Generation Malware Attacks and Defense (NGMAD), New Orleans, Dec. 2013. http://sarvam.ece.ucsb.edu
    • [2] Malware Images: Visualization and Automatic Classification. L. Nataraj, S. Karthikeyan, G. Jacob. Proceedings of the 8th International Symposium on Visualization for Cyber Security, Article No. 4 (VizSec, 2011).
    • [3] A Comparative Assessment of Malware Classification using Binary Texture Analysis and Dynamic Analysis. Lakshmanan Nataraj, Vinod Yegneswaran, Phillip Porras, Jian Zhang. Proceedings of ACM CCS Workshop on Artificial Intelligence and Security (AISec, 2011).
    • [4] Malware Analysis and Classification: A Survey. Gandotra, E., Bansal, D. and Sofat, S. Journal of Information Security, 5, 56-64. doi: 10.4236/jis.2014.52006 (2014).

    GitHub: https://github.com/consteax
    Twitter: @consteax

    “Antivirus won’t protect you from the ever increasing percentage of malware out there that are designed to bypass antivirus but it will protect you for all the random unsophisticated attacks out there” Bruce Schneier

    Questions?