Naive Bayes Classifiers

vhqviet
February 13, 2018

Transcript

  1. Text Classification: definition
     • Input:
       • A document d
       • A fixed set of classes C = {c1, c2, …, cJ}
     • Output:
       • A predicted class c ∈ C

     Classification Methods: Supervised Machine Learning
     • Input:
       • A document d
       • A fixed set of classes C = {c1, c2, …, cJ}
       • A training set of m hand-labeled documents (d1, c1), …, (dm, cm)
     • Output:
       • A learned classifier: d → c
  2. Naive Bayes
     • So called because it is a Bayesian classifier (Bayes' rule) that makes a simplifying
       (naive) assumption about how the features interact.
     • Relies on a very simple representation of the document: a bag of words.

     Take for example two text samples:
       D1: The quick brown fox jumps over the lazy dog
       D2: Never jump over the lazy dog quickly

     The text samples then form a dictionary:
       { 'brown': 0, 'dog': 1, 'fox': 2, 'jump': 3, 'jumps': 4, 'lazy': 5,
         'never': 6, 'over': 7, 'quick': 8, 'quickly': 9, 'the': 10 }

     Vectors are then formed to represent the count of each word:
       D1: [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 2]
       D2: [0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1]
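     The counting above is mechanical enough that a short sketch may help. The following is a
     minimal illustration, not from the slides: the helper name bag_of_words is my own, and it
     assumes case-insensitive whitespace tokenization.

       from collections import Counter

       def bag_of_words(text, vocabulary):
           """Count how often each vocabulary word occurs in `text` (case-insensitive)."""
           counts = Counter(text.lower().split())
           return [counts[word] for word in vocabulary]

       d1 = "The quick brown fox jumps over the lazy dog"
       d2 = "Never jump over the lazy dog quickly"

       # Sorted word types from both documents, indexed as in the dictionary above.
       vocabulary = sorted(set(d1.lower().split()) | set(d2.lower().split()))

       print(bag_of_words(d1, vocabulary))  # [1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 2]
       print(bag_of_words(d2, vocabulary))  # [0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1]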
  3. Naive Bayes
     • Naive Bayes is a probabilistic classifier: for a document d, out of all classes c ∈ C the
       classifier returns the class ĉ which has the maximum posterior probability (the most
       likely class) given the document:

         ĉ = argmax_{c ∈ C} P(c|d)

       argmax: whereas the max operator produces the maximum value, the argmax operator produces
       the input value at which that maximum is obtained.
     • Bayes' rule is stated as:

         P(c|d) = P(d|c) · P(c) / P(d)
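     As a small illustration of the max/argmax distinction (my own toy example, not from the
     slides), over a made-up posterior P(c|d):

       posterior = {"positive": 0.2, "negative": 0.7, "neutral": 0.1}

       max_value = max(posterior.values())             # 0.7        (the max)
       best_class = max(posterior, key=posterior.get)  # 'negative' (the argmax)
       print(max_value, best_class)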
  4. Naive Bayes
     Substituting Bayes' rule:

       ĉ = argmax_{c ∈ C} P(c|d)
         = argmax_{c ∈ C} P(d|c) · P(c) / P(d)      (Bayes' rule)
         = argmax_{c ∈ C} P(d|c) · P(c)

     where P(d|c) is the likelihood of the document and P(c) is the prior probability of the
     class.

     We would be computing P(d|c) · P(c) / P(d) for every possible class, but P(d) doesn't change
     from class to class, so the denominator P(d) can be dropped.
  5. Naive Bayes
       ĉ = argmax_{c ∈ C} P(c|d)
         = argmax_{c ∈ C} P(d|c) · P(c) / P(d)
         = argmax_{c ∈ C} P(d|c) · P(c)

     • d refers to all of the text of the document. It is represented by its attributes (words)
       d = (d1, d2, …, dn), where di is the i-th attribute (word) of the document:

         ĉ = argmax_{c ∈ C} P(d1, d2, …, dn | c) · P(c)

     ⇒ Without some simplifying assumptions, estimating the probability of every possible
       combination of attributes (for example, every possible set of words and positions) would
       require huge numbers of parameters and impossibly large training sets.
  6. Naive Bayes
     Naive Bayes classifiers therefore make two simplifying assumptions:
     • The first is the bag-of-words assumption: assume position doesn't matter.
       • The word "love" has the same effect on classification whether it occurs as the first,
         middle, or last word in the document.
       • Assume that the attributes d1, d2, …, dn only encode word identity, not position.
     • The second is the naive Bayes assumption: the probabilities P(di|c) are independent given
       the class c, so they can be 'naively' multiplied:

         P(d1, d2, …, dn | c) = P(d1|c) · P(d2|c) · … · P(dn|c) = ∏_i P(di|c)
  7. Naive Bayes
     The final equation for the class chosen by a naive Bayes classifier is:

       c_NB = argmax_{c ∈ C} P(c) · ∏_i P(di|c)

     To apply the naive Bayes classifier to text, simply walk an index through every word
     position in the document:

       c_NB = argmax_{c ∈ C} P(c) · ∏_{i ∈ positions} P(wi|c)

     • Naive Bayes calculations, like calculations for language modeling, are done in log space,
       to avoid underflow and increase speed:

       c_NB = argmax_{c ∈ C} [ log P(c) + Σ_{i ∈ positions} log P(wi|c) ]
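     A minimal sketch of the log-space decision rule, assuming hypothetical precomputed
     dictionaries log_prior[c] and log_likelihood[c][w] (these names are mine, not from the
     slides); words unseen in training are skipped, matching the treatment in the later slides:

       def predict(words, classes, log_prior, log_likelihood):
           """Return argmax_c [ log P(c) + sum_i log P(w_i|c) ] over the given classes."""
           scores = {}
           for c in classes:
               scores[c] = log_prior[c]
               for w in words:
                   if w in log_likelihood[c]:   # drop words not seen in training
                       scores[c] += log_likelihood[c][w]
           return max(scores, key=scores.get)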
  8. Naive Bayes
     Training the Naive Bayes Classifier
     • Let Nc be the number of documents in the training data with class c and Ndoc be the total
       number of documents. Then:

         P̂(c) = Nc / Ndoc

     • Assume a feature is just the existence of a word in the document's bag of words. The
       likelihood P̂(wi|c) is the fraction of times the word wi appears among all words in all
       documents of class c:

         P̂(wi|c) = count(wi, c) / Σ_{w ∈ V} count(w, c)

     • The vocabulary V consists of all the word types in all classes, not just the words in one
       class c.
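     A minimal sketch of these two estimates; the function name train_naive_bayes and the
     assumption that documents arrive as (list-of-words, label) pairs are my own choices:

       from collections import Counter

       def train_naive_bayes(documents):
           """`documents` is a list of (list_of_words, class_label) pairs."""
           n_doc = len(documents)
           classes = {c for _, c in documents}
           # Prior: P(c) = Nc / Ndoc
           prior = {c: sum(1 for _, y in documents if y == c) / n_doc for c in classes}

           # count(w, c) for every word w in every class c
           word_counts = {c: Counter() for c in classes}
           for words, c in documents:
               word_counts[c].update(words)

           # Unsmoothed likelihood: P(wi|c) = count(wi, c) / sum_w count(w, c)
           likelihood = {}
           for c in classes:
               total = sum(word_counts[c].values())
               likelihood[c] = {w: n / total for w, n in word_counts[c].items()}
           return prior, likelihood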
  9. Naive Bayes
     Problem
     • When trying to estimate the likelihood of the word “fantastic” given the class “positive”,
       suppose there are no training documents that contain the word “fantastic” and are
       classified as “positive”. In such a case the probability will be zero:

         P̂("fantastic"|positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0

       ⇒ Because the likelihoods are multiplied together, this causes the probability of the
       whole class to be zero.
     • The simplest solutions are add-one (Laplace) smoothing and removing all the “unknown”
       words:

         P̂(wi|c) = (count(wi, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1)
                  = (count(wi, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
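     A sketch of the smoothed estimate, again under my own assumed data layout of
     (list-of-words, label) pairs:

       from collections import Counter

       def smoothed_likelihood(documents):
           """Add-one (Laplace) smoothed P(wi|c) for every vocabulary word and class."""
           vocabulary = {w for words, _ in documents for w in words}
           classes = {c for _, c in documents}

           word_counts = {c: Counter() for c in classes}
           for words, c in documents:
               word_counts[c].update(words)

           likelihood = {}
           for c in classes:
               total = sum(word_counts[c].values()) + len(vocabulary)   # Σ count(w, c) + |V|
               likelihood[c] = {w: (word_counts[c][w] + 1) / total for w in vocabulary}
           return likelihood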
 10. Naive Bayes
     Worked example
     • Use a sentiment analysis domain with the two classes positive (+) and negative (-).

       Training:
         (-) just plain boring
         (-) entirely predictable and lacks energy
         (-) no surprises and very few laughs
         (+) very impressive
         (+) the most fun film of the summer
       Test:
         (?) predictable with no fun

     • Priors:   P(-) = 3/5    P(+) = 2/5
     • The word “with” doesn't occur in the training set, so it is removed from the test
       document.
     • Likelihoods with add-one smoothing P(wi|c) = (count(wi, c) + 1) / (Σ count(w, c) + |V|),
       where the negative class has 14 tokens, the positive class has 9 tokens, and |V| = 20:

         P(predictable|-) = (1 + 1) / (14 + 20)     P(predictable|+) = (0 + 1) / (9 + 20)
         P(no|-)          = (1 + 1) / (14 + 20)     P(no|+)          = (0 + 1) / (9 + 20)
         P(fun|-)         = (0 + 1) / (14 + 20)     P(fun|+)         = (1 + 1) / (9 + 20)

     • Scoring the test document “predictable no fun”:

         P(-) · P(s|-) = 3/5 × (2 × 2 × 1) / 34³ = 6.1 × 10⁻⁵
         P(+) · P(s|+) = 2/5 × (1 × 1 × 2) / 29³ = 3.2 × 10⁻⁵

       Since 6.1 × 10⁻⁵ > 3.2 × 10⁻⁵, the classifier chooses the class negative.
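     The arithmetic above can be checked with a short self-contained script (my own sketch, using
     the slide's training data and add-one smoothing; the unseen word "with" is dropped):

       from collections import Counter

       training = [
           ("just plain boring", "-"),
           ("entirely predictable and lacks energy", "-"),
           ("no surprises and very few laughs", "-"),
           ("very impressive", "+"),
           ("the most fun film of the summer", "+"),
       ]
       test = "predictable with no fun".split()

       vocabulary = {w for doc, _ in training for w in doc.split()}
       classes = {"+", "-"}
       prior = {c: sum(1 for _, y in training if y == c) / len(training) for c in classes}
       counts = {c: Counter(w for doc, y in training if y == c for w in doc.split())
                 for c in classes}

       scores = {}
       for c in classes:
           total = sum(counts[c].values()) + len(vocabulary)   # Σ count(w, c) + |V|
           scores[c] = prior[c]
           for w in test:
               if w in vocabulary:                             # drop "with" (unseen in training)
                   scores[c] *= (counts[c][w] + 1) / total

       print(scores)                        # '-' ≈ 6.1e-05, '+' ≈ 3.3e-05
       print(max(scores, key=scores.get))   # 'negative' (-) wins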