Slide 1

Slide 1 text

Naive Bayes Algorithms. Advanced Machine Learning with Python, Session 4, ACM SIGKDD. Christine Doig, Senior Data Scientist

Slide 2

Slide 2 text

Presenter Bio Christine Doig is a Senior Data Scientist at Continuum Analytics, where she worked on MEMEX, a DARPA-funded project helping stop human trafficking. She has 5+ years of experience in analytics, operations research, and machine learning in a variety of industries, including energy, manufacturing, and banking. Christine holds an M.S. in Industrial Engineering from the Polytechnic University of Catalonia in Barcelona. She is an open source advocate and has spoken at many conferences, including PyData, EuroPython, SciPy and PyCon. Christine Doig
 Senior Data Scientist Continuum Analytics @ch_doig

Slide 3

Slide 3 text

Previous talks
• Topic Modeling, Machine Learning with Python, ACM SIGKDD’15
• Scale your data, not your process. Welcome to the Blaze ecosystem!, EuroPython’15
• Reproducible Multilanguage Data Science with conda, PyData Dallas’15
• Building Python Data Applications with Blaze and Bokeh, SciPy’15
• Navigating the Data Science Python Ecosystem, PyConES’15
• The State of Python for Data Science, PySS’15
• Beginner’s Guide to Machine Learning Competitions, PyTexas’15
Christine Doig
 Senior Data Scientist Continuum Analytics @ch_doig

Slide 4

Slide 4 text

INTRODUCTION

Slide 5

Slide 5 text


Slide 6

Slide 6 text

Machine Learning taxonomy (diagram): Supervised (labeled data) vs. Unsupervised (not labeled). Supervised covers Classification (categorical variable to predict, Y) and Regression (quantitative variable to predict, Y); Unsupervised covers Clustering.

Slide 7

Slide 7 text

Naive Bayes Algorithms for Classification (diagram): discrete-valued inputs (X) are handled by MultinomialNB (occurrence counts) and BernoulliNB (binary/boolean features); continuous inputs (X) are handled by GaussianNB.
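All three variants live in scikit-learn’s naive_bayes module; a minimal sketch of instantiating them (default parameters assumed):

# The three Naive Bayes variants covered in this session, as implemented in scikit-learn.
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

multinomial_clf = MultinomialNB()  # discrete inputs: occurrence counts
bernoulli_clf = BernoulliNB()      # discrete inputs: binary/boolean features
gaussian_clf = GaussianNB()        # continuous inputs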

Slide 8

Slide 8 text

Naive Bayes Algorithms
• Probability Review – Joint distribution, Conditional probability – Bayes Theorem, Naive Bayes classifiers
• Multinomial NB
• Bernoulli NB
• Gaussian NB

Slide 9

Slide 9 text

PROBABILITY CONCEPTS REVIEW

Slide 10

Slide 10 text

Data: a table whose rows are instances (instance 1 … instance 4) and whose columns are the features X1, X2, X3 and the label Y.

Slide 11

Slide 11 text

Data: the same table, with the feature columns X1, X2, X3 grouped as X and the label column as Y.

Slide 12

Slide 12 text

Data - Simple Example (the variable to predict, Y, is categorical and the X features are discrete):

        Rainy  Ice cream  Traffic jam  Label Y
day 1     0        1           1        happy
day 2     0        1           1        happy
day 3     1        0           1        sad
day 4     0        0           0        happy

Slide 13

Slide 13 text

Data - Simple Example (20 days):

         Rainy  Ice cream  Traffic jam  Label Y
day 1      0        1           1        happy
day 2      0        1           1        happy
day 3      1        0           1        sad
day 4      0        0           0        happy
day 5      0        0           1        sad
day 6      0        1           1        sad
day 7      1        1           0        sad
day 8      0        1           1        happy
day 9      0        0           0        happy
day 10     1        0           1        sad
day 11     0        0           1        sad
day 12     0        1           1        happy
day 13     1        0           1        sad
day 14     0        1           0        happy
day 15     0        1           1        happy
day 16     0        1           0        happy
day 17     1        1           1        happy
day 18     0        0           0        happy
day 19     0        0           1        sad
day 20     1        1           1        happy

Slide 14

Slide 14 text

Joint Probability Distribution In the study of probability, given at least two random variables X, Y, ..., that are defined on a probability space, the joint probability distribution for X, Y, ... is a probability distribution that gives the probability that each of X, Y, ... falls in any particular range or discrete set of values specified for that variable.

Slide 15

Slide 15 text

Joint Probability Distribution

        Rainy  Ice cream  Traffic jam  Label Y
day 1     0        1           1        happy
day 2     0        1           1        happy
day 3     1        0           1        sad
day 4     0        0           0        happy

Count how many times we encounter each situation. Divide by the total number of instances:

Rainy  Ice cream  Probability
  1        1         0.15
  1        0         0.15
  0        1         0.40
  0        0         0.30
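A minimal sketch of this counting in Python, using the full 20-day dataset from the previous slide (which is what produces the 0.15 / 0.15 / 0.40 / 0.30 table); the column names are my own:

import pandas as pd

# The 20-day example dataset; only the two variables involved in this
# joint distribution are included here.
df = pd.DataFrame({
    "rainy":     [0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
    "ice_cream": [1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1],
})

# Count each (rainy, ice_cream) combination and divide by the number of instances.
joint = df.groupby(["rainy", "ice_cream"]).size() / len(df)
print(joint)  # (0,0): 0.30, (0,1): 0.40, (1,0): 0.15, (1,1): 0.15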

Slide 16

Slide 16 text

Conditional probability A conditional probability measures the probability of an event given that (by assumption, presumption, assertion or evidence) another event has occurred.

Slide 17

Slide 17 text

Conditional probability - Computing probabilities

Rainy  Ice cream  Probability
  1        1         0.15
  1        0         0.15
  0        1         0.40
  0        0         0.30

What’s the probability of not rainy? Pr(¬Rainy) = 0.70
What’s the probability that if it’s raining, I’m going to eat an ice cream? Pr(Ice cream | Rainy) = 0.15 / 0.3 = 0.5
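The same two quantities, computed directly from the joint table (a small sketch; the dictionary keys are my own encoding of the rows):

# Joint probabilities from the table above, keyed by (rainy, ice_cream).
joint = {(1, 1): 0.15, (1, 0): 0.15, (0, 1): 0.40, (0, 0): 0.30}

# Marginal: Pr(not rainy) is the sum of the rows where rainy == 0.
p_not_rainy = joint[(0, 1)] + joint[(0, 0)]        # 0.70

# Conditional: Pr(ice cream | rainy) = Pr(rainy, ice cream) / Pr(rainy).
p_rainy = joint[(1, 1)] + joint[(1, 0)]            # 0.30
p_ice_cream_given_rainy = joint[(1, 1)] / p_rainy  # 0.5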

Slide 18

Slide 18 text

Bayes’ theorem Bayes' theorem describes the probability of an event, based on conditions that might be related to the event.
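In symbols, using the same notation as the worked examples in this deck: Pr(A | B) = Pr(B | A) * Pr(A) / Pr(B).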

Slide 19

Slide 19 text

Bayes’ theorem

Rainy  Ice cream  Probability
  1        1         0.15
  1        0         0.15
  0        1         0.40
  0        0         0.30

Pr(Ice cream | Rainy) = 0.15 / 0.3 = 0.5
Pr(Rainy | Ice cream) = 0.15 / 0.55 = 0.27
Pr(Rainy | Ice cream) = Pr(Ice cream | Rainy) * Pr(Rainy) / Pr(Ice cream) = 0.5 * 0.3 / 0.55 = 0.27

Slide 20

Slide 20 text

Naive Bayes. Naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector of n features (independent variables), it assigns to this instance probabilities for each of K possible outcomes or classes.

Slide 21

Slide 21 text

Naive Bayes

         Rainy  Ice cream  Traffic jam  Label Y
day 1      0        1           1        happy
day 2      0        1           1        happy
day 3      1        0           1        sad
day 4      0        0           0        happy
day 5      0        0           1        sad
day 6      0        1           1        sad
day 7      1        1           0        sad
day 8      0        1           1        happy
day 9      0        0           0        happy
day 10     1        0           1        sad

We want Pr(happy | RAINY, ICE CREAM, TRAFFIC JAM) and Pr(sad | RAINY, ICE CREAM, TRAFFIC JAM).

PROBLEM: if the number of features n is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible.

Slide 22

Slide 22 text

Joint Probability Distribution: with n binary variables there are 2^n scenarios.

2 variables - 4 scenarios:
Rainy  Ice cream  Probability
  1        1         0.15
  1        0         0.15
  0        1         0.40
  0        0         0.30

3 variables - 8 scenarios:
Rainy  Ice cream  Traffic Jam  Probability
  1        1           1          0.10
  1        1           0          0.05
  1        0           1          0.15
  1        0           0          0
  0        1           1          0.30
  0        1           0          0.10
  0        0           1          0.15
  0        0           0          0.15

Slide 23

Slide 23 text

Naive Bayes

         Rainy  Ice cream  Traffic jam  Label Y
day 1      0        1           1        happy
day 2      0        1           1        happy
day 3      1        0           1        sad
day 4      0        0           0        happy
day 5      0        0           1        sad
day 6      0        1           1        sad
day 7      1        1           0        sad
day 8      0        1           1        happy
day 9      0        0           0        happy
day 10     1        0           1        sad

Bayes’ theorem expresses what we want to compute, but estimating it directly from the joint distribution is infeasible!

Slide 24

Slide 24 text

Bayes’ theorem + Chain rule + Naive conditional independence. The denominator doesn’t matter because it doesn’t depend on the class.
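Written out, the three steps named above are:

Bayes’ theorem: Pr(y | x1, ..., xn) = Pr(y) * Pr(x1, ..., xn | y) / Pr(x1, ..., xn)
Chain rule: Pr(x1, ..., xn | y) = Pr(x1 | y) * Pr(x2 | y, x1) * ... * Pr(xn | y, x1, ..., xn-1)
Naive conditional independence: Pr(xi | y, x1, ..., xi-1) = Pr(xi | y)

So, dropping the denominator (which doesn’t depend on the class):
Pr(y | x1, ..., xn) ∝ Pr(y) * Pr(x1 | y) * Pr(x2 | y) * ... * Pr(xn | y)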

Slide 25

Slide 25 text

Naive Bayes Classifier
How am I going to compute probabilities?
How am I going to assign a class? MAP decision rule (maximum a posteriori).
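Concretely, the probabilities are estimated as relative frequencies in the training data (as the next slide does by counting rows of the table), and the MAP rule picks the class with the largest posterior:

y_hat = argmax over y of Pr(y) * Pr(x1 | y) * ... * Pr(xn | y)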

Slide 26

Slide 26 text

Naive Bayes Classifier

         Rainy  Ice cream  Traffic jam  Label Y
day 1      0        1           1        happy
day 2      0        1           1        happy
day 3      1        0           1        sad
day 4      0        0           0        happy
day 5      0        0           1        sad
day 6      0        1           1        sad
day 7      1        1           0        sad
day 8      0        1           1        happy
day 9      0        0           0        happy
day 10     1        0           1        sad

(The factors below are relative frequencies estimated from the full 20-day table shown earlier.)

Pr(happy | RAINY, ¬ICE CREAM, TRAFFIC JAM) ∝ Pr(happy) * Pr(rainy | happy) * Pr(¬ice cream | happy) * Pr(traffic jam | happy) = 0.6 * 0.167 * 0.25 * 0.583333 = 0.015
How often am I happy? What are the chances that it’s rainy, given that I am happy?
Pr(sad | RAINY, ¬ICE CREAM, TRAFFIC JAM) ∝ Pr(sad) * Pr(rainy | sad) * Pr(¬ice cream | sad) * Pr(traffic jam | sad) = 0.4 * 0.5 * 0.75 * 0.875 = 0.13

Slide 27

Slide 27 text

Naive Bayes Classifier

         Rainy  Ice cream  Traffic jam  Label Y
day 1      0        1           1        happy
day 2      0        1           1        happy
day 3      1        0           1        sad
day 4      0        0           0        happy
day 5      0        0           1        sad
day 6      0        1           1        sad
day 7      1        1           0        sad
day 8      0        1           1        happy
day 9      0        0           0        happy
day 10     1        0           1        sad

Pr(happy | RAINY, ¬ICE CREAM, TRAFFIC JAM) ∝ Pr(happy) * Pr(rainy | happy) * Pr(¬ice cream | happy) * Pr(traffic jam | happy) = 0.6 * 0.167 * 0.25 * 0.583333 = 0.015
Pr(sad | RAINY, ¬ICE CREAM, TRAFFIC JAM) ∝ Pr(sad) * Pr(rainy | sad) * Pr(¬ice cream | sad) * Pr(traffic jam | sad) = 0.4 * 0.5 * 0.75 * 0.875 = 0.13
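A small Python sketch reproducing the hand computation above (the factors are the relative frequencies from the 20-day table):

# Class scores for the instance (rainy=1, ice cream=0, traffic jam=1),
# up to the shared denominator Pr(x), which doesn't affect the decision.
score_happy = 0.6 * 0.167 * 0.25 * 0.583333  # Pr(happy) * Pr(xi | happy) terms, ~0.015
score_sad = 0.4 * 0.5 * 0.75 * 0.875         # Pr(sad) * Pr(xi | sad) terms, ~0.131

# MAP decision rule: pick the class with the larger score.
prediction = "happy" if score_happy > score_sad else "sad"
print(prediction)  # -> sad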

Slide 28

Slide 28 text

Naive Bayes Algorithms (recap of the diagram): discrete-valued inputs (X) are handled by MultinomialNB (occurrence counts) and BernoulliNB (binary/boolean features); continuous inputs (X) by GaussianNB. The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of Pr(xi | y).

Slide 29

Slide 29 text

MULTINOMIAL NB

Slide 30

Slide 30 text

Multinomial Naive Bayes

         Rainy  Ice cream  Traffic jam  Label Y
day 1      0        2           1        happy
day 2      2        3           1        happy
day 3      1        0           3        sad
day 4      0        3           0        happy
day 5      1        2           1        sad
day 6      0        1           1        sad
day 7      2        4           0        sad
day 8      0        1           1        happy
day 9      0        2           3        happy
day 10     1        0           1        sad

Not whether it rained or not, whether I had ice cream or not, or whether there was a traffic jam or not: now the data tells me how many times it rained, how many ice creams I had, and how many traffic jams I was in each day!

Slide 31

Slide 31 text

Multinomial Naive Bayes

         Rainy  Ice cream  Traffic jam  Label Y
day 1      0        2           1        happy
day 2      2        3           1        happy
day 3      1        0           3        sad
day 4      0        3           0        happy
day 5      1        2           1        sad
day 6      0        1           1        sad
day 7      2        4           0        sad
day 8      0        1           1        happy
day 9      0        2           3        happy
day 10     1        0           1        sad

Not whether it rained or not, whether I had ice cream or not, or whether there was a traffic jam or not: now the data tells me how many times it rained, how many ice creams I had, and how many traffic jams I was in each day!

Slide 32

Slide 32 text

Multinomial Distribution
• The multinomial distribution is a generalization of the binomial distribution.
• The multinomial distribution gives the probability of any particular combination of numbers of successes for the various categories.
• A feature vector is then a histogram, with xi counting the number of times event i was observed in a particular instance.

Slide 33

Slide 33 text

Multinomial Naive Bayes
The smoothing prior alpha accounts for features not present in the learning samples and prevents zero probabilities in further computations.
- alpha = 1 is called Laplace smoothing
- alpha < 1 is called Lidstone smoothing
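For reference, scikit-learn’s MultinomialNB estimates the smoothed probability of feature i given class y as

theta_yi = (N_yi + alpha) / (N_y + alpha * n)

where N_yi is the count of feature i over the training samples of class y, N_y is the total count of all features for class y, and n is the number of features.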

Slide 34

Slide 34 text

Multinomial Naive Bayes Application
Application: text classification (where the data are typically represented as word count vectors, although tf-idf vectors are also known to work well in practice). Rows are documents, columns are words, and the label is the type of document: happy or sad?

         Rainy  Ice cream  Traffic jam  Label Y
doc 1      0        2           1        happy
doc 2      2        3           1        happy
doc 3      1        0           3        sad
doc 4      0        3           0        happy
doc 5      1        2           1        sad
doc 6      0        1           1        sad
doc 7      2        4           0        sad
doc 8      0        1           1        happy
doc 9      0        2           3        happy
doc 10     1        0           1        sad
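A minimal sketch of that pipeline in scikit-learn; the tiny corpus and its labels are illustrative, not from the slides:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: each document becomes a vector of word counts,
# i.e. the "documents x words" matrix sketched above.
docs = [
    "sunny park ice cream ice cream",
    "rain rain traffic jam late",
    "ice cream beach sunny",
    "traffic jam rain umbrella",
]
labels = ["happy", "sad", "happy", "sad"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of word counts

clf = MultinomialNB(alpha=1.0)      # alpha=1.0 -> Laplace smoothing
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["sunny beach ice cream"])))  # -> ['happy']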

Slide 35

Slide 35 text

BERNOULLI NB

Slide 36

Slide 36 text

Bernoulli Naive Bayes

         Rainy  Ice cream  Traffic jam  Label Y
day 1      0        1           1        happy
day 2      0        1           1        happy
day 3      1        0           1        sad
day 4      0        0           0        happy
day 5      0        0           1        sad
day 6      0        1           1        sad
day 7      1        1           0        sad
day 8      0        1           1        happy
day 9      0        0           0        happy
day 10     1        0           1        sad

Bernoulli NB’s decision rule differs from multinomial NB’s rule in that it explicitly penalizes the non-occurrence of a feature i that is an indicator for class y, where the multinomial variant would simply ignore a non-occurring feature.

Slide 37

Slide 37 text

Bernoulli Naive Bayes
• There may be multiple features, but each one is assumed to be a binary-valued (Bernoulli, boolean) variable.
• If handed any other kind of data, a BernoulliNB instance may binarize its input (depending on the binarize parameter).
• Text classification => word occurrence vectors.
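A minimal sketch fitting BernoulliNB on the 10-day binary table from the previous slide:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# The 10-day binary table: columns are [rainy, ice cream, traffic jam].
X = np.array([
    [0, 1, 1], [0, 1, 1], [1, 0, 1], [0, 0, 0], [0, 0, 1],
    [0, 1, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0], [1, 0, 1],
])
y = ["happy", "happy", "sad", "happy", "sad",
     "sad", "sad", "happy", "happy", "sad"]

# binarize=0.0 (the default) thresholds non-binary inputs; the data here
# is already 0/1, so it has no effect.
clf = BernoulliNB(alpha=1.0, binarize=0.0)
clf.fit(X, y)

# A rainy, no-ice-cream, traffic-jam day, as in the earlier hand computation.
print(clf.predict([[1, 0, 1]]))  # -> ['sad']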

Slide 38

Slide 38 text

GAUSSIAN NB

Slide 39

Slide 39 text

Gaussian Naive Bayes

         Rainy  Ice cream  Traffic jam  Label Y
day 1    100.4    10.5        50.4      happy
day 2      0.2     1.5        30.4      happy
day 3     20.6     0.34       10.3      sad
day 4      0.4     0.5         0        happy
day 5      0.5     0.4        10.3      sad
day 6      0.6     1.5        15.2      sad
day 7     50.3     1.24        0        sad
day 8      0.6     1.2        10.2      happy
day 9      0.15    0           0        happy
day 10    10.4     0          20.3      sad

For continuous data, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution.
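A minimal sketch fitting GaussianNB on the continuous table above (decimal commas rewritten as decimal points):

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Columns are [rainy, ice cream, traffic jam], now continuous-valued.
X = np.array([
    [100.4, 10.5, 50.4], [0.2, 1.5, 30.4], [20.6, 0.34, 10.3], [0.4, 0.5, 0.0],
    [0.5, 0.4, 10.3], [0.6, 1.5, 15.2], [50.3, 1.24, 0.0], [0.6, 1.2, 10.2],
    [0.15, 0.0, 0.0], [10.4, 0.0, 20.3],
])
y = ["happy", "happy", "sad", "happy", "sad",
     "sad", "sad", "happy", "happy", "sad"]

# GaussianNB fits a per-class mean and variance for each feature and assumes
# each feature is normally distributed within each class.
clf = GaussianNB()
clf.fit(X, y)

print(clf.predict([[5.0, 2.0, 12.0]]))  # classify a new, unseen day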

Slide 40

Slide 40 text

RECAP

Slide 41

Slide 41 text

Naive Bayes :)
• Inference is cheap - fast!
• Few parameters
• Empirically successful classifier

Slide 42

Slide 42 text

Naive Bayes :(
• Assumes independence of features – doesn’t model interrelationships between attributes
• Bad estimator, so the probability outputs are not really useful

Slide 43

Slide 43 text

Naive Bayes Algorithms for Classification (recap): discrete-valued inputs (X) are handled by MultinomialNB (occurrence counts) and BernoulliNB (binary/boolean features); continuous inputs (X) by GaussianNB.

Slide 44

Slide 44 text

Thank you! :) Christine Doig Twitter: @ch_doig

Slide 45

Slide 45 text

Scikit-learn Naive Bayes Resources
http://scikit-learn.org/stable/modules/naive_bayes.html
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
http://scikit-learn.org/stable/auto_examples/text/mlcomp_sparse_document_classification.html
http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html