A3_1_Detection of phished brands logos with Convolutional Neural Networks

JPAAWG_2nd_General_Meeting

November 14, 2019

Transcript

  1. 1.

    Detection of phished brands logos with Convolutional Neural Networks
    Vade Secure
    Sébastien Goutal, Chief Science Officer
  2. 2.

    Introduction
    • Fact: More and more threats evade traditional filtering technologies by using images:
      • Sextortion, phishing, etc.
    • Solution: Apply computer vision to extract relevant content:
      • Text, logo, etc.
    • Existing technologies: Google Vision, Microsoft Azure Computer Vision
    • Limitations: List of logos is fixed
    • Decision: Build our own logo detection technology
    Figure: Example of phishing with image attached to email, no relevant content in body (callouts: brand logo, typical phishing text)
  3. 4.

    Data pipeline
    • Collect images (duplicates are removed)
      • Sources: benign emails, phishing emails, benign webpages, phishing webpages
      • 125 images without logo, 1 713 images with logo
    • Annotate images
    • Split images: 510 images (test) and 1 203 images (training)
    • Convolve & square images
    • Augment images: 132 434 images
    • Generate annotated images: 43 663 images
      • Inputs: logos, images, dictionaries, fonts, randomness
    • Test corpus: 635 images (510 with logo + 125 without logo)
    • Training corpus: 176 097 images (132 434 + 43 663), 30 brands, 66 logos
  4. 5.

    Data pipeline: Annotate images
    • Human task: identify logo, draw bounding box and label logo
    • Minimal size for a logo is 40x30
    • There may be several logos in an image
    • There may be variants of a brand logo: different geometry, different color, etc.
      → Same label
    Examples: Wells Fargo, Yahoo!
    Scaling is costly as annotation is manual
  5. 6.

    Data pipeline: Convolve & square images
    • First purpose: Increase resilience of CNN regarding logo position
      → Logos are often in the same position (top, top left) which increases CNN overfitting
    • Second purpose: Fit image to 512x512 square CNN input
    • How? Move a 512x512 sliding window on image and keep image if at least one logo is visible
    Figure: Image with two logos (Office, Microsoft). Labels: image is rejected, image is kept
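
The sliding-window step can be sketched as follows. This is a minimal illustration and not the speaker's implementation: the stride, the function names, and the rule that a logo counts as "visible" only when its bounding box lies fully inside the window are all assumptions made for the example.

```python
from typing import List, Tuple

# Bounding box as (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]

WINDOW = 512  # CNN input size given on the slide


def is_fully_visible(box: Box, wx: int, wy: int, size: int = WINDOW) -> bool:
    """Assumption: a logo is 'visible' when its whole box is inside the window."""
    x0, y0, x1, y1 = box
    return wx <= x0 and wy <= y0 and x1 <= wx + size and y1 <= wy + size


def sliding_windows(width: int, height: int, logos: List[Box],
                    stride: int = 256, size: int = WINDOW):
    """Return (window_origin, translated_boxes) for every window that is kept.

    The stride is a hypothetical parameter; the slide does not state one.
    """
    kept = []
    for wy in range(0, max(height - size, 0) + 1, stride):
        for wx in range(0, max(width - size, 0) + 1, stride):
            # Translate the boxes of visible logos into window coordinates.
            visible = [(x0 - wx, y0 - wy, x1 - wx, y1 - wy)
                       for (x0, y0, x1, y1) in logos
                       if is_fully_visible((x0, y0, x1, y1), wx, wy, size)]
            if visible:  # keep the crop only if at least one logo is visible
                kept.append(((wx, wy), visible))
    return kept


if __name__ == "__main__":
    # A 1024x768 image with one logo top left and one bottom right.
    logos = [(20, 20, 120, 60), (900, 700, 1000, 740)]
    for origin, boxes in sliding_windows(1024, 768, logos):
        print(origin, boxes)
```

Because the same logo ends up at different coordinates in different crops, the training images no longer share a single typical logo position.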
  6. 7.

    Data pipeline: Generate annotated images
    • First purpose: Increase diversity of images to reduce CNN overfitting
      → Collected images often have a similar look & feel
    • Second purpose: Automate annotation to ease scaling
      → Annotation of collected images is manual: costly, time consuming
    • Generation is based on ‘randomness’:
      • Choice of resources: images, words, fonts
      • Position of logo
      • Alterations of logo: down sampling, color balance, contrast, scaling
    Figure: Logos, images, dictionaries, fonts and randomness feed "Generate annotated images"
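
A minimal sketch of why generated images need no manual annotation: the generator chooses the logo's scale and position itself, so the bounding box and label are known by construction. Only the geometry is shown here (no pixels are drawn). The names `Annotation` and `place_logo`, the scale range, and the reuse of the 40x30 minimum size for generated logos are assumptions for illustration.

```python
import random
from dataclasses import dataclass
from typing import Tuple

CANVAS = 512            # square CNN input size from the slides
MIN_W, MIN_H = 40, 30   # minimal logo size from the annotation slide


@dataclass
class Annotation:
    label: str
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def place_logo(label: str, logo_w: int, logo_h: int,
               rng: random.Random, canvas: int = CANVAS) -> Annotation:
    """Pick a random scale and position; return the resulting annotation.

    Hypothetical helper: the scale range (0.5 to 2.0) is an assumption.
    Width and height are clamped independently, so extreme cases may
    change the aspect ratio slightly.
    """
    scale = rng.uniform(0.5, 2.0)
    w = min(max(round(logo_w * scale), MIN_W), canvas)
    h = min(max(round(logo_h * scale), MIN_H), canvas)
    x = rng.randint(0, canvas - w)  # random position, logo stays on canvas
    y = rng.randint(0, canvas - h)
    return Annotation(label, (x, y, x + w, y + h))


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for reproducibility
    for brand in ("Wells Fargo", "Yahoo!"):
        print(place_logo(brand, 120, 60, rng))
```

A real generator would also paste the altered logo onto a random background and render random words in a random font; those steps are omitted to keep the sketch dependency-free.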
  7. 9.

    Training
    • VGG-16 and ResNet CNNs used for training and prediction
    • CNN input size increased to 512x512
    • Transfer learning:
      • Models are pre-trained on ImageNet (~14M images, 20K classes)
      • Additional training performed with training corpus
    "Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned." (Lisa Torrey and Jude Shavlik, University of Wisconsin)
  8. 10.

    Prediction
    • Input image
    • Resize and pad to fit 512x512 input
    • VGG-16 and ResNet
    • Combine predictions (proprietary algorithm)
    • Final prediction
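
The "resize and pad" step can be illustrated with the geometry alone. The slide does not say how resizing and padding are done, so this sketch assumes the aspect ratio is preserved, the longer side is scaled to 512, and the padding is split evenly on both sides. The function name `fit_to_square` is hypothetical.

```python
from typing import Dict


def fit_to_square(width: int, height: int, size: int = 512) -> Dict[str, int]:
    """Compute the resized dimensions and padding needed to fill a square.

    Assumptions (not stated on the slide): aspect ratio is preserved and
    the image is centered in the square.
    """
    scale = size / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (size - new_w) // 2
    pad_top = (size - new_h) // 2
    return {
        "width": new_w,
        "height": new_h,
        "pad_left": pad_left,
        "pad_right": size - new_w - pad_left,
        "pad_top": pad_top,
        "pad_bottom": size - new_h - pad_top,
    }


if __name__ == "__main__":
    # A wide 1024x256 banner becomes 512x128 with 192 px above and below.
    print(fit_to_square(1024, 256))
```

Preserving the aspect ratio avoids distorting logos, which matters when the network has been trained on undistorted 512x512 crops.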
  9. 11.

    Performance evaluation
    Test set:
      Images without logo   125
      Images with logo      510
      Total                 635
    • Comparison with Google Vision logo detection
    • Google Vision logo detection:
      • General purpose (2D and 3D)
      • Number of logos supported unknown (>200)
    • Vade Secure logo detection:
      • 2D only (3D logo irrelevant in the context of threat detection)
      • Number of logos supported: 66
    • Only logos supported by both are considered
    • Test set is used (independent from training set)
    Metrics for evaluation:
      $\mathrm{precision} = \frac{TP}{TP + FP}$
      $\mathrm{recall} = \frac{TP}{TP + FN}$
      $F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$
    Results:
      Test set   Precision   Recall   F1 score
      Vade       0.95        0.94     0.94
      Google     0.98        0.76     0.86
    • Vade outperforms Google Vision
    • High number of FN for Google Vision
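
As a quick consistency check, the F1 scores can be recomputed from the other two reported values, assuming the three result columns are precision, recall, and F1 score in that order. The helper names below are illustrative; the slides do not give the underlying TP, FP, and FN counts, so none are used here.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of detections that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual logos that are detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Precision and recall as reported on the slide.
    reported = {"Vade": (0.95, 0.94), "Google": (0.98, 0.76)}
    for name, (p, r) in reported.items():
        print(f"{name}: F1 = {f1_score(p, r):.2f}")
    # Vade: F1 = 0.94
    # Google: F1 = 0.86
```

Google Vision's lower recall at higher precision corresponds to the slide's remark about its high number of false negatives.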
  10. 13.

    ASTRD
    • Inputs: emails from feedback loops, emails from honey pots
    • Extract images
    • Cluster images
    • Analyze and label images:
      • QR code scanner (QR codes are used for crypto payments)
      • Optical Character Recognition
      • Natural Language Processing
      • Logo detection
    • Classify images
    • Images blacklist
    • Global Network Intelligence (GNI)
  11. 14.

    ASTRD – Example with phishing
    Image attached to email, no relevant text in body
    • Clue 1: Chase logo
      How? Extract logo with logo detection API
    • Clue 2: Typical phishing text
      How? Extract text with OCR and classify text with NLP