Slide 1

Slide 1 text

Predicting box office hits & Oscar winners using things you found on the Internet
Deborah Hanus @deborahhanus

Slide 2

Slide 2 text

Background

Slide 3

Slide 3 text

Credit: Gil Press, Forbes 82% 13% 5%

Slide 4

Slide 4 text

• 82% DATA WRANGLING
• 13% EDA & ML
• 5% OTHER
• 62% DATA WRANGLING
• 20% DATA UNDERSTANDING

Slide 5

Slide 5 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 6

Slide 6 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 7

Slide 7 text

What factors drive movie revenue?
Image: unclaimedmoney.com

Slide 8

Slide 8 text

Define a question you can answer
• Not so good (vague): Will my movie be a box office hit?
• Good (likelihood): What is the likelihood that my movie will be a box office hit given that it has X features?
• Good (correlation): What attributes of a movie are correlated with box office success?

Slide 9

Slide 9 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 10

Slide 10 text

What is good data?
• Relevant
• Structured
• (Relatively) complete

Slide 11

Slide 11 text

Where to find good data?

Slide 12

Slide 12 text

How to get good data
• Use an API
• Write a web scraper
• Get all the text
• Make the text queryable

Slide 13

Slide 13 text

Writing a web scraper
• Make an HTTP request to get the HTML
Requests example
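
A minimal sketch of this step, assuming the Requests library is installed. The function name `fetch_html`, the User-Agent string, and the 10-second timeout are illustrative choices and do not come from the talk.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML body of `url` as text, raising on HTTP errors."""
    # Accept an existing Session so repeated requests can reuse one connection.
    http = session or requests.Session()
    response = http.get(
        url,
        headers={"User-Agent": "example-scraper/0.1"},  # hypothetical identifier
        timeout=timeout,
    )
    # Turn 4xx/5xx responses into exceptions instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with a page address returns the raw HTML as a string, ready to hand to a parser.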

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

Writing a web scraper
http://www.boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=%2017.htm

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Writing a web scraper
• Make HTML queryable using BeautifulSoup & PyQuery
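
A small sketch of making HTML queryable with BeautifulSoup, assuming the `bs4` package is installed. The HTML snippet, the `chart` table id, and the `parse_chart` helper are hypothetical stand-ins for a real downloaded page, whose markup will differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded chart page.
SAMPLE_HTML = """
<table id="chart">
  <tr><th>Title</th><th>Gross</th></tr>
  <tr><td>Movie A</td><td>$1,200,000</td></tr>
  <tr><td>Movie B</td><td>$350,000</td></tr>
</table>
"""


def parse_chart(html):
    """Extract (title, gross_in_dollars) pairs from the sample table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="chart")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 2:
            continue  # skip the header row, which has <th> cells only
        title, gross = cells
        # Strip the currency formatting so the value can be used as a number.
        rows.append((title, int(gross.replace("$", "").replace(",", ""))))
    return rows


movies = parse_chart(SAMPLE_HTML)
```

On a real page, "View Source" in the browser is the quickest way to find which tags and ids to query.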

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Right click: View Source

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

Common problems
• Rate limiting
• API Keys
• Selenium
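
One common way to stay under a site's rate limit is to pause between requests. The sketch below is an illustration only: the `Throttle` class and its parameters are hypothetical, and the right interval depends on the site being scraped.

```python
import time


class Throttle:
    """Wait so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                # Too soon since the previous call: sleep off the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `throttle.wait()` immediately before each HTTP request spaces the requests out without any other change to the scraper.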

Slide 27

Slide 27 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 28

Slide 28 text

Exploratory Data Analysis
Factors we can explore:
• Movie budget
• IMDb rating
• Power studios
• Opening weekend
• How many opening theaters
• Seasonality
• MPAA rating

Slide 29

Slide 29 text

Exploratory Data Analysis
Gross revenue vs. # opening theaters
~3500

Slide 30

Slide 30 text

Exploratory Data Analysis
Gross revenue vs. Quality rating
No relationship

Slide 31

Slide 31 text

Exploratory Data Analysis
Gross revenue vs. Opening gross
Predictive
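
One way to put a number on comparisons like these is the Pearson correlation coefficient. The sketch below uses made-up values, not the talk's data, and the `pearson` helper is written out by hand for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values in millions of dollars, for illustration only.
opening_gross = [10.0, 25.0, 40.0, 55.0, 80.0]
total_gross = [30.0, 70.0, 110.0, 160.0, 230.0]
r = pearson(opening_gross, total_gross)
```

A value of `r` near 1 or -1 suggests a strong linear relationship, while a value near 0 matches a "no relationship" scatter plot.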

Slide 32

Slide 32 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 33

Slide 33 text

Exploratory Data Analysis
Multivariate regression
Predictive
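
A minimal sketch of a multivariate linear regression, assuming NumPy is installed. It uses NumPy's least-squares solver, which may not be the tool used in the talk, and the data is synthetic: the target is built from known coefficients so the fit can be checked.

```python
import numpy as np

# Made-up feature matrix: columns are budget and opening-weekend gross (in $M).
X = np.array([
    [10.0, 5.0],
    [20.0, 8.0],
    [30.0, 20.0],
    [40.0, 15.0],
    [50.0, 30.0],
])
# Synthetic target built as 1 + 0.5*budget + 3*opening, so the fit is checkable.
y = 1.0 + 0.5 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient is the intercept.
design = np.column_stack([np.ones(len(X)), X])
coef, residuals, rank, _ = np.linalg.lstsq(design, y, rcond=None)
intercept, beta_budget, beta_opening = coef
```

With real data the coefficients will not be recovered exactly, and the size of each one hints at how much that factor contributes once the others are held fixed.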

Slide 34

Slide 34 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 35

Slide 35 text

What did we find?
• Budget helps (but only a little).
• Timing is important. December is a great release date.
• PG & G rated movies make more.
• Money made in opening weekend is important.

Slide 36

Slide 36 text

What does it take to win an Oscar?

Slide 37

Slide 37 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 38

Slide 38 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 39

Slide 39 text

What makes an Oscar winner?
Image: superawesomevectors.com

Slide 40

Slide 40 text

Define a question you can answer
• Not so good (vague): Will this movie win an Oscar?
• Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
• Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?

Slide 41

Slide 41 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 42

Slide 42 text

Acquiring Data
IMDbPY
Drama

Slide 43

Slide 43 text

Define a question you can answer
• Not so good (vague): Will this movie win an Oscar?
• Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
• Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?
• Best (conditional correlation): Given that a movie has been nominated for an Oscar, what attributes are correlated with winning?

Slide 44

Slide 44 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 45

Slide 45 text

Exploratory Data Analysis
Factors we can explore:
• Movie nomination category
• Thematic content (e.g. family, violence, war, father-son relationship, smoking)
• Movie genre
• Where the movie was made
• When the movie debuted

Slide 46

Slide 46 text

Exploratory Data Analysis
Countries associated with winning Oscar movies

Slide 47

Slide 47 text

Exploratory Data Analysis
Number of winning movies per month

Slide 48

Slide 48 text

Exploratory Data Analysis
Ratio of winning films to nominated films by month
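
A ratio like this can be computed by counting nominees and winners per release month. The records below are hypothetical, and `win_ratio_by_month` is an illustrative helper, not code from the talk.

```python
from collections import Counter

# Hypothetical records: (release_month, won) for nominated films.
nominees = [
    (1, False), (1, False),
    (6, False), (6, True),
    (12, True), (12, True), (12, False), (12, False),
]


def win_ratio_by_month(records):
    """Map each release month to winners / nominees for that month."""
    nominated = Counter(month for month, _ in records)
    winners = Counter(month for month, won in records if won)
    # Counter returns 0 for months with nominees but no winners.
    return {month: winners[month] / nominated[month] for month in sorted(nominated)}


ratios = win_ratio_by_month(nominees)
```

Dividing by the number of nominees matters here: a month with many winners may simply be a month with many nominated releases.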

Slide 49

Slide 49 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 50

Slide 50 text

What do we want from a model?
• Binary output: winner/non-winner
• More accurate than baseline

Slide 51

Slide 51 text

What do we want from a model?
• Binary output: winner/non-winner
• More accurate than baseline

Slide 52

Slide 52 text

Potential models
• Logistic regression (Ridge, Lasso, ElasticNet)
• Support vector machine (SVM)
• Ensemble methods

Slide 53

Slide 53 text

What do we want from a model?
• Binary output: winner/non-winner
• More accurate than baseline

Slide 54

Slide 54 text

Establish Baselines
• All winners: Accuracy = 29%
• All losers: Accuracy = 71%
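
These baselines follow directly from the class balance. The sketch below assumes a synthetic set of 100 nominees with the same 29% / 71% split as the slide; the labels are made up to match that split and are not the talk's dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Synthetic labels matching the slide's split: 1 = winner, 0 = non-winner.
y_true = [1] * 29 + [0] * 71

# Baseline 1: predict that every nominee wins.
all_winners = accuracy(y_true, [1] * len(y_true))
# Baseline 2: predict that every nominee loses.
all_losers = accuracy(y_true, [0] * len(y_true))
```

A model is only worth using if it beats the stronger of the two, here the "all losers" baseline.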

Slide 55

Slide 55 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 56

Slide 56 text

To select a model, think about what it gets wrong.

Slide 57

Slide 57 text

Selecting a model
Confusion Matrix

Slide 58

Slide 58 text

Selecting a model
• Accuracy = (TP+TN)/(TP+TN+FN+FP)
• Recall = TP/(TP+FN)
• Precision = TP/(TP+FP)
• F1 = 2*(Precision*Recall)/(Precision+Recall)
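
As a worked example, the sketch below computes these metrics from confusion-matrix counts, with F1 taken as the harmonic mean of precision and recall. The counts are hypothetical, and the helper assumes at least one actual positive and one predicted positive so that no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, recall, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # share of actual winners that were found
    precision = tp / (tp + fp)     # share of predicted winners that were right
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=20, tn=60, fp=10, fn=10)
```

With these counts, accuracy is 0.8 while recall, precision, and F1 are each 2/3, which shows how accuracy alone can flatter a model on imbalanced data.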

Slide 59

Slide 59 text

Selecting a model
Receiver Operating Characteristic (ROC)
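
An ROC curve plots the true positive rate against the false positive rate as the decision threshold is lowered. The sketch below builds the curve by hand on made-up labels and scores; `roc_points` and `auc` are illustrative helpers, not code from the talk.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, starting at (0, 0)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time, from highest to lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels (1 = winner) and model scores, for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
area = auc(curve)
```

An area of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of winners above non-winners.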

Slide 60

Slide 60 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data

Slide 61

Slide 61 text

What did we find?
• Nominated films made in Italy & Spain have a good chance at winning.
• Films are more likely to win if they are released later in the year.
• Tone down the gore (unless it is a war film).
• If a film is nominated for best picture, its odds of winning are good.
• If a film is nominated for best cinematography, its odds are less good.

Slide 62

Slide 62 text

What did our model get wrong?

Slide 63

Slide 63 text

Analyze errors
• What did your model misclassify?
• Are any of those errors systematic?
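
One simple check for systematic errors is to compare the error rate across groups of examples. The sketch below groups hypothetical predictions by genre; the records and the `error_rate_by_group` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical evaluation records: (genre, actual_label, predicted_label).
results = [
    ("drama", 1, 1), ("drama", 0, 0), ("drama", 1, 1),
    ("war", 1, 0), ("war", 1, 0), ("war", 0, 0),
    ("comedy", 0, 1), ("comedy", 0, 0),
]


def error_rate_by_group(records):
    """Map each group to the fraction of its examples that were misclassified."""
    totals = Counter(group for group, _, _ in records)
    errors = Counter(group for group, actual, predicted in records
                     if actual != predicted)
    # Counter returns 0 for groups with no errors.
    return {group: errors[group] / totals[group] for group in totals}


rates = error_rate_by_group(results)
```

A group whose error rate is far above the others, like "war" in this toy data, is a sign that the model is failing in a systematic way rather than at random.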

Slide 64

Slide 64 text

Image: NYT Coded Gaze: Joy Buolamwini

Slide 65

Slide 65 text

Labeling sensitive content

Slide 66

Slide 66 text

Always analyze your errors

Slide 67

Slide 67 text

What did we do?
• Defined an answerable question
• Built a web scraper
• Explored the data
• Fit a model to the data
• Analyzed our errors

Slide 68

Slide 68 text

Team

Slide 69

Slide 69 text

Where to go from here?
Building a scraper
• Requests: HTTP for Humans
• BeautifulSoup or PyQuery
Analyzing data
• Jupyter
• scikit-learn
• Statsmodels
Example Projects
• http://oscarpredictor.github.io
Deborah Hanus @deborahhanus