PCI 2017 - Machine Learning for Malware Detection

Machine Learning and Images for Malware Detection and Classification
Konstantinos Kosmidis - Christos Kalloniatis
September 2017

About us
Konstantinos Kosmidis
Currently Research Fellow in Security Engineering and member of the SESAME project at the Sense Research Cluster, University of Brighton.
Interested in:
1. Machine Learning - Deep Learning
2. Vulnerability Research & Malware Research
Christos Kalloniatis
Associate Professor; Director, Privacy Engineering and Social Informatics Laboratory, Department of Cultural Technology and Communication, University of the Aegean

Agenda
• Brief Definitions - Question of the research presented
• Research Goals
• Experiment Setup
• Feature Engineering
• Classification Algorithms
• Conclusions - Future Work

Why malware detection using ML is important
How do antivirus products detect malware?
• Pattern matching on static files
How does malware evade antivirus products?
• It uses polymorphic packers that evade static pattern matching
• Malware such as Zeus is a master of evasion
Why Machine Learning?
• Many malware variants: a large-scale problem
• Too much data for a researcher to process
Difficulties
• Appropriate datasets are difficult to find.
• Hardware resources (memory, hard drive).

Research Goals
The overall goal is to extend and improve a system by:
• Performing malware detection.
• Performing classification of malware families.
• Finding new features and improving old ones.

Experiment Setup
• Ubuntu 16.04 LTS (Xenial Xerus), 64-bit (AMD64), with 16 GB RAM and a 1 TB hard drive.
• Python programming language: the pyleargist package and TensorFlow (Multilayer Perceptron).
• The dataset is the Malimg dataset, from the paper Nataraj et al., 2011, "Malware Images: Visualization and Automatic Classification".

Dataset - Collecting and Presenting Malware
Malimg Malware Dataset
Supervised dataset: 25 malware families, with labels
Around 9000 samples -> asm files, exe files, byte files
• Nataraj et al. tested it in various experiments, so the conversion to images is known to work and to perform well on this dataset, which is important.
• Having 25 malware families in an organized dataset plays a key role and helps feature computation and label computation.
• The dataset is imbalanced, so building a training set requires stratified sampling of the populations to prevent overfitting and under-generalization on specific malware variants.
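The stratified sampling mentioned above can be sketched as follows. This is a minimal, standard-library illustration and not the authors' code: the function name `stratified_split`, the 20% test fraction, and the toy label counts are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so every family keeps roughly the same share."""
    rng = random.Random(seed)
    by_family = defaultdict(list)
    for index, family in enumerate(labels):
        by_family[family].append(index)

    train_idx, test_idx = [], []
    for family in sorted(by_family):
        indices = by_family[family]
        rng.shuffle(indices)
        # Hold out at least one sample per family, even for tiny families.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, made-up label counts (not the real Malimg distribution).
labels = ["C2LOP.P"] * 50 + ["Autorun.K"] * 10 + ["Yuner.A"] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

Splitting per family, rather than over the whole pool, keeps rare families from vanishing from either side of the split.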

Dataset - Collecting and Presenting Malware
Snippet of an asm file (columns: segment, address, bytes, opcode, operands)
.text: 00401394 B8 1F CD 98 AE mov eax, 0AE98CD1Fh
.text: 00401399 F7 E1 mul ecx
.text: 0040139B C1 EA 1E shr edx, 1Eh
.text: 0040139E 69 D2 FA C9 D6 5D imul edx, 5DD6C9FAh
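A line of such an asm file can be split into the five columns named on the slide. The sketch below is illustrative only: `parse_asm_line` is a hypothetical helper, and it assumes that raw bytes are written as two uppercase hex digits while mnemonics are lowercase, which holds for the snippet shown but may not hold for every disassembly listing.

```python
import re

# Segment name, optional whitespace, 8-hex-digit address, then the rest.
LINE_RE = re.compile(r"^(\.\w+):\s*([0-9A-Fa-f]{8})\s+(.*)$")
# Assumption: raw bytes are exactly two uppercase hex digits.
BYTE_RE = re.compile(r"^[0-9A-F]{2}$")


def parse_asm_line(line):
    """Return a dict with segment, address, bytes, opcode, operands (or None)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    segment, address, rest = match.groups()
    tokens = rest.split()
    raw_bytes = []
    # Leading byte-like tokens are the instruction encoding.
    while tokens and BYTE_RE.match(tokens[0]):
        raw_bytes.append(int(tokens.pop(0), 16))
    return {
        "segment": segment,
        "address": int(address, 16),
        "bytes": raw_bytes,
        "opcode": tokens[0] if tokens else None,
        "operands": " ".join(tokens[1:]),
    }


parsed = parse_asm_line(".text: 00401399 F7 E1 mul ecx")
print(parsed)
```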

Feature - Pixel intensity
• Malware can be visualized as grayscale images using the byte files, asm files, or exe files.
[Figures: the first 800 pixels; image of an asm file; image of a byte file]
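The idea of treating each byte as one pixel intensity (0-255) can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline: the helper names and the image width are assumptions, while the 800-pixel cut-off follows the figure title above.

```python
def bytes_to_image(data, width=32):
    """Reshape raw bytes into rows of grayscale intensities (0-255)."""
    rows = [list(data[i:i + width]) for i in range(0, len(data), width)]
    if rows and len(rows[-1]) < width:
        # Zero-pad the last row so the image is rectangular.
        rows[-1].extend([0] * (width - len(rows[-1])))
    return rows


def first_pixels(data, count=800):
    """Use the first `count` byte values as a fixed-length feature vector."""
    pixels = list(data[:count])
    # Short files are zero-padded so every sample has the same length.
    return pixels + [0] * (count - len(pixels))


# The seven bytes of the first two instructions in the asm snippet.
data = bytes([0xB8, 0x1F, 0xCD, 0x98, 0xAE, 0xF7, 0xE1])
image = bytes_to_image(data, width=4)
features = first_pixels(data)
print(image)
```

A fixed-length vector matters because the classifiers compared later expect every sample to have the same number of features.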

General Approach and Overview
• Create Features
• Cross Validation
• Classification Algorithms
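The cross-validation step can be sketched as a plain k-fold loop. The slides do not state the number of folds or the tooling used, so `k=5`, the helper names, and the majority-class baseline below are all assumptions chosen to keep the example self-contained.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx); each sample is tested exactly once."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def cross_validate(features, labels, fit, predict, k=5):
    """Return the mean accuracy over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical baseline: always predict the most frequent training label.
def fit_majority(train_features, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, feature_vector):
    return model


labels = ["a"] * 8 + ["b"] * 2
features = list(range(10))
mean_accuracy = cross_validate(features, labels, fit_majority, predict_majority)
print(mean_accuracy)
```

The folds here are contiguous and unshuffled, so with an imbalanced, ordered dataset the indices would normally be shuffled or stratified by family first.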

Classification Algorithms - Decision Tree
• There is no outright misclassification, only minor accuracy errors (blue squares far from the confusion-matrix diagonal), meaning that under this algorithm some samples share similarities with other specimens.
• The C2LOP.P malware family may have similarities with C2LOP.gen!g. The names support this assumption: both belong to the same malware family and are considered variants. Some of their samples, however, land on Swizzor.gen!I.
• Labeling-wise, those few samples are either a detection error, or more in-depth features need to be examined to identify whether similarities exist. The same holds for the other malware families that behave this way.
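The "squares far from the diagonal" reading of a confusion matrix can be reproduced by counting (true family, predicted family) pairs and listing the off-diagonal ones. The sketch below uses made-up predictions purely for illustration; the helper names are hypothetical and the counts are not results from the talk.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def off_diagonal(counts):
    """List confused pairs, most frequent first."""
    pairs = [(t, p, n) for (t, p), n in counts.items() if t != p]
    return sorted(pairs, key=lambda item: -item[2])


def accuracy(counts):
    """Share of samples that sit on the diagonal."""
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / sum(counts.values())


# Made-up toy predictions, not data from the experiments.
true_labels = ["C2LOP.P", "C2LOP.P", "C2LOP.P", "C2LOP.gen!g", "Swizzor.gen!I"]
predicted = ["C2LOP.P", "C2LOP.gen!g", "C2LOP.gen!g", "C2LOP.gen!g", "Swizzor.gen!I"]

counts = confusion_counts(true_labels, predicted)
confused = off_diagonal(counts)
print(confused, accuracy(counts))
```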

Classification Algorithms - Perceptron
• For the Autorun.K malware family, most samples were classified into another malware family.
• The same goes for C2LOP.gen!g and C2LOP.P, many of whose samples were found to be similar to the Swizzor malware family.

Classification Algorithms - Multilayer Perceptron
• Autorun.K is detected but classified into the Yuner.A malware family, so training the algorithm a bit more could improve the results for this type of malware.
• It is evident from the confusion matrix that the Swizzor variants are classified correctly.

Classification Algorithms - Stochastic Gradient
• There is a misclassification. For malware detection, this counts as an error. Swizzor.gen!I is detected but classified into another, similar malware family, Swizzor.gen!E.
• Most Autorun.K samples were misclassified as Yuner.A, so the algorithm finds similarities between those two.

Classification Algorithms - Nearest Centroid
• There is no outright misclassification; samples are mostly classified under the correct labels, apart from some minor errors.
• C2LOP.P deserves emphasis: only a small percentage is detected and classified correctly, while most samples are classified into other malware families, so this is not the best algorithm for handling similar samples of this kind of malware family.
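Why nearest centroid struggles with closely related families is easier to see from the algorithm itself: each family is reduced to a single mean vector, and a sample goes to whichever mean is closest. The following is a generic standard-library sketch of that technique with toy two-dimensional features; it is not the implementation used in the experiments.

```python
import math
from collections import defaultdict


def fit_centroids(features, labels):
    """Average the feature vectors of each family into one centroid."""
    sums, counts = {}, defaultdict(int)
    for vector, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        for i, value in enumerate(vector):
            sums[label][i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, vector):
    """Assign the family whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Toy 2-D feature vectors, invented for illustration.
features = [[0, 0], [0, 2], [10, 10], [10, 12]]
labels = ["Autorun.K", "Autorun.K", "Yuner.A", "Yuner.A"]

centroids = fit_centroids(features, labels)
print(predict(centroids, [1, 1]), predict(centroids, [9, 9]))
```

When two families have nearly identical centroids, as variants of the same family may, a single mean per family cannot separate them.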

Classification Algorithms - Random Forest
• Best accuracy
• Everything is classified under the correct labels, and again the Swizzor variants were detected.

Summarization of Results

Classification Algorithm   Avg. Training Time (s)   Testing Time (s)   Average Accuracy   Notes
Decision Tree              6.53                     0.00096            88%
Nearest Centroid           0.218                    0.0211             85.6%
Stochastic Gradient        1.291                    0.0585             87%                2 misclassifications
Perceptron                 0.817                    0.0418             90.5%              4 misclassifications
Multilayer Perceptron      16.05                    0.091              87%
Random Forest              1.72                     0.0063             91.6%

Conclusions
Advantages of Image-Dependent Malware Analysis and Classification
• Fast
• No execution or disassembly.
• Images give more information about the construction/architecture of the malware.
• Makes it possible to create new naming schemes based on related malware images.
Disadvantages of Image-Dependent Malware Analysis and Classification
• Data-dependent: the analysis depends on existing malware, which makes it problematic for countering and detecting zero-day attacks.
• Characterization: malware converted to images does not give much information on what the malware does, other than the signature (label) given by vendors.
Silver bullet? No, there is none.

Future Work
• Test those algorithms on new datasets and develop new and better algorithms.
• Apply a feature selection algorithm that selects the most discriminative features (in progress).
• Combine malicious images with clean images (ImageNet/MNIST).
• Beyond malicious images/PDFs, there are other online objects to consider: video, voice.

References - Bibliography
• [1] SARVAM: Search and Retrieval of Malware. Lakshmanan Nataraj, Dhilung Kirat, Giovanni Vigna. Proc. Annual Computer Security Applications Conference (ACSAC) Workshop on Next Generation Malware Attacks and Defense (NGMAD), New Orleans, Dec. 2013. http://sarvam.ece.ucsb.edu
• [2] Malware Images: Visualization and Automatic Classification. L. Nataraj, S. Karthikeyan, G. Jacob. Proceedings of the 8th International Symposium on Visualization for Cyber Security (VizSec), Article No. 4, 2011.
• [3] A Comparative Assessment of Malware Classification using Binary Texture Analysis and Dynamic Analysis. Lakshmanan Nataraj, Vinod Yegneswaran, Phillip Porras, Jian Zhang. Proceedings of the ACM CCS Workshop on Artificial Intelligence and Security (AISec), 2011.
• [4] Malware Analysis and Classification: A Survey. Gandotra, E., Bansal, D. and Sofat, S. Journal of Information Security, 5, 56-64, 2014. doi: 10.4236/jis.2014.52006

GitHub: https://github.com/consteax
Twitter: @consteax

“Antivirus won’t protect you from the ever increasing percentage of malware out there that are designed to bypass antivirus but it will protect you for all the random unsophisticated attacks out there” - Bruce Schneier

Questions?
