JGS594 Lecture 20

Software Engineering for Machine Learning
Spam Detection
(202204)

Javier Gonzalez-Sanchez

April 14, 2022

Transcript

  1. jgs
    SER 594
    Software Engineering for
    Machine Learning
    Lecture 20: Supervised Learning (spam detection)
    Dr. Javier Gonzalez-Sanchez
    [email protected]
    javiergs.engineering.asu.edu | javiergs.com
    PERALTA 230U
    Office Hours: By appointment

  2. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 2
    jgs
    Machine Learning

  3. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 3
    jgs
    Definition
    § Supervised learning is the machine learning task of learning a function that
    maps an input to an output based on example input-output pairs.

  4. jgs
    Text Mining
    Supervised Learning

  5. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 5
    jgs
    Text Mining
    Extracting information from text documents (natural language).
    § generate a search index;
    § text categorization into domains;
    § text clustering to organize a set of documents;
    § sentiment analysis to identify subjective information;
    § concept or entity extraction (people, places, organizations);
    § document summarization (identify important points);
    § learning relations between named entities; and
    § spam detection in email messages.

  6. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 6
    jgs
    Previously …
    § Text mining as an unsupervised technique that looks for patterns in a corpus of
    text – identify words (topics) that appear in a statistically meaningful way
    § analyze a large archive of text documents (blog posts, emails, tweets,
    book chapters, etc.) and understand what the archive contains
    § The most well-known algorithm is Latent Dirichlet Allocation (LDA)
    2003, Blei, Ng, & Jordan (available on Canvas)

  7. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 7
    jgs
    Question
    Can we make classification of documents
    a supervised effort?

  8. jgs
    Tools
    Document Classification

  9. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 9
    jgs
    Library
    § Mallet (MAchine Learning for Language Toolkit), a Java-based package for
    statistical natural-language processing, document classification, clustering,
    topic modeling, information extraction, and other machine-learning
    applications to text.
    § Andrew McCallum, University of Massachusetts Amherst, 2002

  10. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 10
    jgs
    Mallet
    § Download
    http://mallet.cs.umass.edu/download.php

  11. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 11
    jgs
    Mallet API
    mallet.jar mallet-deps.jar

  12. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 12
    jgs
    Data | Email Spam Dataset
    § 2000, Androutsopoulos et al. (reorganized - 2010, Ng)
    § Pre-processed emails
    § Consists of 350 spam and 350 non-spam (for training)
    § Consists of 130 spam and 130 non-spam (for testing)
    § http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html
    § Download ex6DataEmails.zip (the version that is NOT preprocessed)

  13. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 13
    jgs
    Import Data for Training
    File iterator
    cc.mallet.pipe.iterator.FileIterator
    FileIterator iterator = new FileIterator(
    § A list of File[] directories with text files
    § A file filter that specifies which files to select within a directory
    § A pattern that is applied to a filename to produce a class label
    )
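
The third argument above, a pattern applied to the file name to produce a class label, can be illustrated without Mallet. The sketch below is plain Java and is not Mallet's `FileIterator`; the class name `LabelFromPath`, the regular expression, and the folder names `spam-train` and `nonspam-train` are assumptions made for illustration. It treats the directory that directly contains a file as that file's label.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LabelFromPath {
    // Hypothetical pattern: capture the name of the directory
    // that directly contains the file.
    static final Pattern LAST_DIR = Pattern.compile(".*/([^/]+)/[^/]+$");

    /** Returns the parent directory name of the given path as the class label. */
    static String labelOf(String path) {
        // Normalize Windows separators so one pattern covers both styles.
        Matcher m = LAST_DIR.matcher(path.replace('\\', '/'));
        if (!m.matches()) {
            throw new IllegalArgumentException("No parent directory in: " + path);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        // Assumed folder layout, not confirmed by the slides.
        System.out.println(labelOf("ex6DataEmails/spam-train/msg1.txt"));
        System.out.println(labelOf("ex6DataEmails/nonspam-train/msg2.txt"));
    }
}
```

With this kind of pattern, placing each email in a folder named after its class is enough to label the whole training set.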

  14. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 14
    jgs
    We did This

  15. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 15
    jgs
    New Code

  16. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 16
    jgs
    We did This

  17. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 17
    jgs
    New Code

  18. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 18
    jgs
    Features + Label
    This is now a classification problem!
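
To make "features + label" concrete, here is a minimal plain-Java sketch of one common way to turn a message into features: a bag of words, that is, a count per token. It is not the Mallet pipeline from the slides; the class name `BagOfWords`, the tokenization rule, and the sample message are assumptions for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class BagOfWords {
    /** Lowercases the text, splits on non-letters, and counts each token. */
    static Map<String, Integer> features(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!token.isEmpty()) {          // split can yield a leading empty token
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical message, not taken from the dataset.
        Map<String, Integer> x = features("Free money, free offer");
        String y = "spam";                   // the label that comes with the example
        System.out.println(x + " -> " + y);  // {free=2, money=1, offer=1} -> spam
    }
}
```

Each training example is then a pair (feature counts, label), which is the input-output form that a supervised classifier expects.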

  19. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 19
    jgs
    Machine Learning

  20. jgs
    Classification
    Naive Bayes
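
As a rough picture of what a Naive Bayes text classifier does, the sketch below implements a small multinomial Naive Bayes with add-one (Laplace) smoothing in plain Java. It is not Mallet's implementation and makes no claim about its internals; the class name `TinyNaiveBayes`, the tokenizer, and the four training messages are assumptions for illustration. A message is scored per label as log P(label) plus the sum of log P(token | label), and the highest score wins.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labeled document to the counts. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns the label with the highest log-probability, or null if untrained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        List<String> tokens = tokenize(text);
        for (String label : docsPerLabel.keySet()) {
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            int labelTokens = tokensPerLabel.getOrDefault(label, 0);
            for (String token : tokens) {
                if (!vocabulary.contains(token)) continue; // ignore unseen words
                int count = wordCounts.get(label).getOrDefault(token, 0);
                // Add-one smoothing keeps unseen (token, label) pairs above zero.
                score += Math.log((count + 1.0) / (labelTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical training messages, not taken from the dataset.
        TinyNaiveBayes nb = new TinyNaiveBayes();
        nb.train("spam", "win free money now");
        nb.train("spam", "free offer click now");
        nb.train("nonspam", "meeting agenda for monday");
        nb.train("nonspam", "lunch on monday");
        System.out.println(nb.classify("free money offer")); // spam
        System.out.println(nb.classify("monday meeting"));   // nonspam
    }
}
```

Working in log space avoids numeric underflow when many small probabilities are multiplied, which matters for long emails.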

  21. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 21
    jgs
    Data and Pipeline
    // yes, we are still using Mallet.
    // Nothing else.

  22. jgs
    Evaluation

  23. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 23
    jgs
    Testing
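
The code shown on this slide is not part of the transcript. As a hedged illustration of what testing a spam classifier usually reports, the plain-Java sketch below compares predicted labels with true labels and derives accuracy, precision, recall, and F1 for one class. The class name `BinaryMetrics` and the five sample labels are assumptions, not results from the lecture.

```java
class BinaryMetrics {
    final int tp, fp, tn, fn;

    /** Counts the confusion-matrix cells, treating `positive` as the class of interest. */
    BinaryMetrics(String[] actual, String[] predicted, String positive) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("Label arrays differ in length");
        }
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            boolean isPositive = actual[i].equals(positive);
            boolean saidPositive = predicted[i].equals(positive);
            if (isPositive && saidPositive) tp++;
            else if (!isPositive && saidPositive) fp++;
            else if (isPositive) fn++;
            else tn++;
        }
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    // Each ratio is NaN when its denominator is zero (for example, no positive predictions).
    double accuracy()  { return (tp + tn) / (double) (tp + fp + tn + fn); }
    double precision() { return tp / (double) (tp + fp); }
    double recall()    { return tp / (double) (tp + fn); }
    double f1() {
        double p = precision(), r = recall();
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Hypothetical labels, not measured on the email dataset.
        String[] actual    = {"spam", "spam", "nonspam", "nonspam", "spam"};
        String[] predicted = {"spam", "nonspam", "nonspam", "spam", "spam"};
        BinaryMetrics m = new BinaryMetrics(actual, predicted, "spam");
        System.out.printf("accuracy=%.2f precision=%.2f recall=%.2f f1=%.2f%n",
                m.accuracy(), m.precision(), m.recall(), m.f1());
    }
}
```

For spam filtering, precision on the spam class is often watched closely, because a false positive means a legitimate email was flagged.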

  24. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 24
    jgs
    What is this?

  25. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 25
    jgs
    Evaluation (1000)

  26. jgs
    Summary
    Text Mining

  27. jgs

  28. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 28
    jgs
    But,
    How does a Classifier work?

  29. Javier Gonzalez-Sanchez | SER 594 | Spring 2022 | 29
    jgs
    Questions

  30. jgs
    SER 594 Software Engineering for Machine Learning
    Javier Gonzalez-Sanchez, Ph.D.
    [email protected]
    Spring 2022
    Copyright. These slides can only be used as study material for the class CSE205 at Arizona State University.
    They cannot be distributed or used for another purpose.
