Supervised learning, machine learning, classifiers, big data! What in the world are all of these things? For a beginning programmer, the questions described as “machine learning” questions can be mystifying at best.
In this talk I will define the scope of one machine learning problem, identifying an email as ham or spam, from the perspective of a beginner (not a master of all things “machine learning”) and show how Python can help us learn, in a simple way, how to classify a piece of email.
To begin we must ask: what is spam? How do I know it “when I see it”? From previous experience, of course! We will provide human-labeled examples of spam and ham to our model so that it can estimate how likely a new email is to be spam or ham. This approach, using examples whose labels we already know to determine the most likely label for a new example, is called supervised learning, and the model we will use is the Naive Bayes classifier.
Our model will look at the words in the body of an email, counting how often each word appears in spam emails and in ham emails, and how often spam and ham themselves occur. Once we know the prior likelihood of spam and which words make an email look like spam, we can try applying a label to a new example.
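The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the talk: the function names `tokenize`, `train`, and `classify`, the four toy emails, and the add-one (Laplace) smoothing are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Split an email body into lowercase words (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()            # how often each label (spam/ham) occurs
    word_counts = defaultdict(Counter)  # how often each word occurs under each label
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return label_counts, word_counts


def classify(text, label_counts, word_counts):
    """Return the label with the highest Naive Bayes score for the text."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_examples = sum(label_counts.values())

    scores = {}
    for label in label_counts:
        # Start from the log of the prior: how common this label is overall.
        score = math.log(label_counts[label] / total_examples)
        words_in_label = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing (an assumption of this sketch) keeps a word
            # never seen under this label from zeroing out the whole score.
            count = word_counts[label][word] + 1
            score += math.log(count / (words_in_label + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical human-labeled training emails.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("lunch meeting at noon", "ham"),
    ("project notes for the meeting", "ham"),
]
label_counts, word_counts = train(training)
print(classify("free money prize", label_counts, word_counts))
print(classify("meeting notes", label_counts, word_counts))
```

Working with logarithms turns the product of many small probabilities into a sum, which avoids numerical underflow on longer emails.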
Through this exercise we will see at a basic level what types of questions machine learning asks, learn to model “learning” with Python, and understand how learning can be measured.
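One simple way to measure learning is accuracy: the fraction of held-out, human-labeled emails that a classifier labels correctly. The sketch below assumes a hypothetical stand-in classifier, `keyword_rule`, and four made-up emails; any function that maps text to a label could be passed in its place.

```python
def accuracy(predict, labeled_examples):
    """Fraction of (text, label) pairs for which predict(text) equals label."""
    correct = sum(1 for text, label in labeled_examples if predict(text) == label)
    return correct / len(labeled_examples)


def keyword_rule(text):
    """Hypothetical stand-in classifier: call anything containing 'free' spam."""
    return "spam" if "free" in text.lower() else "ham"


# Made-up held-out examples that the classifier did not see during training.
held_out = [
    ("free gift inside", "spam"),
    ("agenda for monday", "ham"),
    ("claim your prize", "spam"),
    ("free parking at the office", "ham"),
]
score = accuracy(keyword_rule, held_out)
print(f"accuracy: {score:.2f}")
```

Measuring on emails kept out of training matters because a model can memorize its training examples and still label new mail poorly.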