A3_1_Detection of phished brands logos with Convolutional Neural Networks

Detection of phished brands logos with Convolutional Neural Networks
Vade Secure
Sébastien Goutal, Chief Science Officer

Introduction
• Fact: More and more threats evade traditional filtering technologies by using images:
  • Sextortion, phishing, etc.
• Solution: Apply computer vision to extract relevant content:
  • Text, logo, etc.
• Existing technologies: Google Vision, Microsoft Azure Computer Vision
• Limitation: The list of logos is fixed
• Decision: Build our own logo detection technology
Figure: Example of phishing with an image attached to the email and no relevant content in the body. The image shows a brand logo and typical phishing text.

Data pipeline

Data pipeline (diagram)
• Collect images (duplicates are removed): benign emails, phishing emails, benign webpages, phishing webpages
• Annotate images → 1 713 images with logo
• Split images → 1 203 images and 510 images
• Convolve & square images
• Generate annotated images (from logos, images, dictionaries, fonts, randomness) → 43 663 images
• Augment images → 132 434 images
• Training corpus: 176 097 images (43 663 + 132 434), 30 brands, 66 logos
• Test corpus: 635 images (510 images with logo + 125 images without logo)

Data pipeline – Annotate images
• Human task: identify the logo, draw a bounding box and label the logo
• Minimal size for a logo is 40x30
• There may be several logos in an image
• There may be variants of a brand logo: different geometry, different color, etc. → Same label
• Scaling is costly as annotation is manual
Figure: Annotated examples (Wells Fargo, Yahoo!)

Data pipeline – Convolve & square images
• First purpose: Increase resilience of the CNN regarding logo position
  → Logos are often in the same position (top, top left), which increases CNN overfitting
• Second purpose: Fit the image to the 512x512 square CNN input
• How? Move a 512x512 sliding window over the image and keep the resulting image if at least one logo is visible
Figure: Image with two logos (Office, Microsoft). One resulting image is rejected, another is kept.
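The sliding-window step can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: the stride of 128, the box format, the names `is_visible` and `crops_with_logo`, and the rule that a logo is "visible" only when it lies fully inside the window are all assumptions, since the slides give only the 512x512 window size.

```python
from typing import Iterator, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

WINDOW = 512  # square CNN input size given in the slides
STRIDE = 128  # hypothetical step; the slides do not state one


def is_visible(logo: Box, window: Box) -> bool:
    """Assumption: a logo is visible only if it lies fully inside the window."""
    return (logo[0] >= window[0] and logo[1] >= window[1]
            and logo[2] <= window[2] and logo[3] <= window[3])


def crops_with_logo(width: int, height: int, logos: List[Box],
                    window: int = WINDOW, stride: int = STRIDE
                    ) -> Iterator[Tuple[Box, List[Box]]]:
    """Yield (window, logo boxes relative to the window) for every kept crop."""
    # Images smaller than the window still get one crop at the origin;
    # padding such a crop is left out of this sketch.
    for top in range(0, max(height - window, 0) + 1, stride):
        for left in range(0, max(width - window, 0) + 1, stride):
            win = (left, top, left + window, top + window)
            visible = [
                (l - left, t - top, r - left, b - top)
                for (l, t, r, b) in logos
                if is_visible((l, t, r, b), win)
            ]
            if visible:  # crops without any visible logo are rejected
                yield win, visible


# Hypothetical 1024x512 image with two annotated logos.
logos = [(10, 10, 110, 60), (600, 100, 700, 150)]
kept = list(crops_with_logo(1024, 512, logos))
print(len(kept), "crops kept")
```

Because each kept crop carries its logo boxes translated into window coordinates, the same logo ends up at several different positions across crops, which is the stated purpose of the step.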

Data pipeline – Generate annotated images
• First purpose: Increase diversity of images to reduce CNN overfitting
  → Collected images often have a similar look & feel
• Second purpose: Automate annotation to ease scaling
  → Annotation of collected images is manual: costly, time consuming
• Generation is based on ‘randomness’:
  • Choice of resources: images, words, fonts
  • Position of logo
  • Alterations of logo: downsampling, color balance, contrast, scaling
Diagram: Logos, images, dictionaries, fonts, randomness → Generate annotated images
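One way to picture why generated images need no manual annotation: the generator chooses the logo and its position, so the label and bounding box are known by construction. The sketch below only draws the random choices and records the annotation; it does not render pixels. The `Sample` structure, the function name `generate_sample`, the scale range of 0.5 to 1.5 and all resource names are hypothetical and not taken from the slides.

```python
import random
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

CANVAS = 512  # square CNN input size given in the slides


@dataclass
class Sample:
    background: str
    word: str
    font: str
    label: str                      # brand label, known by construction
    scale: float                    # random scaling applied to the logo
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) on the canvas


def generate_sample(rng: random.Random,
                    logos: Dict[str, Tuple[int, int]],
                    backgrounds: Sequence[str],
                    words: Sequence[str],
                    fonts: Sequence[str]) -> Sample:
    """Pick random resources and a random logo placement; return the annotation."""
    label = rng.choice(sorted(logos))
    width, height = logos[label]
    scale = rng.uniform(0.5, 1.5)  # hypothetical range
    w = min(CANVAS, max(1, round(width * scale)))
    h = min(CANVAS, max(1, round(height * scale)))
    left = rng.randint(0, CANVAS - w)  # random position that keeps the logo on canvas
    top = rng.randint(0, CANVAS - h)
    return Sample(
        background=rng.choice(backgrounds),
        word=rng.choice(words),
        font=rng.choice(fonts),
        label=label,
        scale=scale,
        box=(left, top, left + w, top + h),
    )


# Hypothetical resources: logo sizes in pixels, plus placeholder file names.
rng = random.Random(0)
sample = generate_sample(
    rng,
    logos={"brand_a": (200, 80), "brand_b": (120, 120)},
    backgrounds=["bg_1.png", "bg_2.png"],
    words=["account", "verify", "invoice"],
    fonts=["font_1.ttf", "font_2.ttf"],
)
print(sample.label, sample.box)
```

A real generator would also apply the alterations listed on the slide (downsampling, color balance, contrast) before compositing the logo onto the background.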

Training and prediction

Training
• VGG-16 and ResNet CNNs are used for training and prediction
• CNN input size is increased to 512x512
• Transfer learning:
  • Models are pre-trained on ImageNet (~14M images, 20K classes)
  • Additional training is performed with the training corpus
“Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned.” (Lisa Torrey and Jude Shavlik, University of Wisconsin)

Prediction (diagram)
• Input image
• Resize and pad to fit the 512x512 input
• VGG-16 and ResNet
• Combine predictions (proprietary algorithm)
• Final prediction
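The resize-and-pad step is not detailed in the slides. A common approach, assumed here, is to scale the image so its longer side equals 512 while preserving the aspect ratio, then center it on a 512x512 canvas. The sketch below computes only the geometry (scale, resized size, padding offsets) and shows how a box in the original image maps onto the padded input; the function names are hypothetical.

```python
from typing import Tuple

TARGET = 512  # square CNN input size given in the slides


def resize_and_pad_geometry(width: int, height: int, target: int = TARGET
                            ) -> Tuple[float, Tuple[int, int], Tuple[int, int]]:
    """Return (scale, (new_width, new_height), (pad_left, pad_top)).

    Assumptions: aspect ratio is preserved, small images are upscaled,
    and the resized image is centered on the square canvas.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return scale, (new_w, new_h), (pad_left, pad_top)


def map_box(box: Tuple[int, int, int, int], width: int, height: int,
            target: int = TARGET) -> Tuple[int, int, int, int]:
    """Map a (left, top, right, bottom) box from the original image to the padded input."""
    scale, _, (pad_left, pad_top) = resize_and_pad_geometry(width, height, target)
    left, top, right, bottom = box
    return (round(left * scale) + pad_left, round(top * scale) + pad_top,
            round(right * scale) + pad_left, round(bottom * scale) + pad_top)


# Hypothetical wide banner image of 1024x256 pixels.
print(resize_and_pad_geometry(1024, 256))
print(map_box((100, 40, 300, 120), 1024, 256))
```

Keeping the inverse of this mapping is useful in practice, since boxes predicted on the padded input have to be translated back to the original image.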

Performance evaluation
• Comparison with Google Vision logo detection
• Google Vision logo detection:
  • General purpose (2D and 3D)
  • Number of logos supported: unknown (>200)
• Vade Secure logo detection:
  • 2D only (3D logos are irrelevant in the context of threat detection)
  • Number of logos supported: 66
• Only logos supported by both are considered
• The test set is used (independent from the training set)
Test set:
  Images without logo   125
  Images with logo      510
  Total                 635
Metrics for evaluation:
  $\text{precision} = \frac{TP}{TP + FP}$
  $\text{recall} = \frac{TP}{TP + FN}$
  $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
Results on the test set:
            Precision   Recall   F1 score
  Vade      0.95        0.94     0.94
  Google    0.98        0.76     0.86
• Vade outperforms Google Vision on F1 score (0.94 vs 0.86); Google Vision has slightly higher precision (0.98 vs 0.95)
• High number of FN for Google Vision, hence its lower recall (0.76)
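As a quick consistency check, the F1 scores in the results can be recomputed from the reported precision and recall using the standard definitions. The function names below are illustrative; only the precision and recall values come from the slides.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of detections that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual logos that are detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Precision and recall as reported in the slides.
reported = {"Vade": (0.95, 0.94), "Google": (0.98, 0.76)}
for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
# prints:
# Vade: F1 = 0.94
# Google: F1 = 0.86
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, which is why Google Vision's high precision does not compensate for its recall of 0.76.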

Use case: ASTRD

ASTRD (diagram)
• Emails from feedback loops, emails from honey pots
• Extract images
• Cluster images
• Analyze and label images:
  • QR code scanner (QR codes are used for crypto payments)
  • Optical Character Recognition
  • Natural Language Processing
  • Logo detection
• Classify images
• Images blacklist
• Global Network Intelligence (GNI)

ASTRD – Example with phishing
• Image attached to email, no relevant text in body
• Clue 1: Chase logo
  • How? Extract the logo with the logo detection API
• Clue 2: Typical phishing text
  • How? Extract text with OCR and classify the text with NLP

Thank you for listening!

JPAAWG
