Predicting box office hits &
Oscar winners using things
you found on the Internet
Deborah Hanus
@deborahhanus
Slide 2
Slide 2 text
Background
Slide 3
Slide 3 text
Credit: Gil Press, Forbes
82%
13%
5%
Slide 4
Slide 4 text
82% DATA WRANGLING
13% EDA & ML
5% OTHER
62% DATA WRANGLING
20% DATA UNDERSTANDING
Slide 5
Slide 5 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 7
Slide 7 text
What factors drive movie revenue?
Image: unclaimedmoney.com
Slide 8
Slide 8 text
Not so good: Vague
Will my movie be a box office hit?
Good: Likelihood
What is the likelihood that my movie will be a box
office hit given that it has X features?
Good: Correlation
What attributes of a movie are correlated with box
office success?
Define a question you can answer
Slide 9
Slide 9 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 10
Slide 10 text
• Relevant
• Structured
• (Relatively) complete
What is good data?
Slide 11
Slide 11 text
Where to find good data?
Slide 12
Slide 12 text
• Use an API
• Write a web scraper
• Get all the text
• Make the text queryable
How to get good data
Slide 13
Slide 13 text
• Make an HTTP request to get the HTML
Writing a web scraper
Requests Example
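A minimal sketch of this step with the Requests library. The function names `build_request` and `fetch_html`, the User-Agent string, and the timeout are illustrative assumptions, not the speaker's code. Preparing the request separately makes it possible to inspect the final URL before anything is sent.

```python
import requests


def build_request(url, params=None):
    """Prepare a GET request without sending it, so the final URL can be inspected."""
    headers = {"User-Agent": "example-scraper/0.1"}  # hypothetical identifier
    return requests.Request("GET", url, params=params, headers=headers).prepare()


def fetch_html(url, params=None, timeout=10):
    """Send the prepared request and return the page's HTML as text."""
    prepared = build_request(url, params)
    with requests.Session() as session:
        response = session.send(prepared, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url, params={...})` returns the raw HTML string, which is the input to the parsing step.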
Slide 15
Slide 15 text
Writing a web scraper
http://www.boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=%2017.htm
Slide 18
Slide 18 text
• Make HTML queryable using BeautifulSoup & PyQuery
Writing a web scraper
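A short sketch of what "queryable" can look like with BeautifulSoup. The HTML fragment, the table id `chart`, and the function name `parse_chart` are invented for illustration; a real page needs selectors matched to its own markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded chart page.
SAMPLE_HTML = """
<table id="chart">
  <tr><th>Title</th><th>Gross</th></tr>
  <tr><td><a href="/a">Movie A</a></td><td>$1,200,000</td></tr>
  <tr><td><a href="/b">Movie B</a></td><td>$950,000</td></tr>
</table>
"""


def parse_chart(html):
    """Return a list of (title, gross_in_dollars) tuples from the chart table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="chart")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 2:
            continue  # skip the header row, which uses <th> cells
        title = cells[0].get_text(strip=True)
        gross_text = cells[1].get_text(strip=True)
        gross = int(gross_text.replace("$", "").replace(",", ""))
        rows.append((title, gross))
    return rows
```

Converting the dollar strings to integers at parse time keeps the later analysis steps free of string cleanup.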
Slide 20
Slide 20 text
Right-click: View Source
Slide 26
Slide 26 text
• Rate limiting
• API Keys
• Selenium
Common problems
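For the first problem, rate limiting, one common approach is to space requests out on the client side. The `Throttle` class below is a hypothetical helper, not part of any library; the clock and sleep functions are injectable only so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A scraper would call `throttle.wait()` immediately before each HTTP request. A suitable interval depends on the site's terms and any published limits.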
Slide 27
Slide 27 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 28
Slide 28 text
Factors we can explore:
• Movie budget
• IMDB Rating
• Power Studios
• Opening Weekend
• How many opening theaters
• Seasonality
• MPAA Rating
Exploratory Data Analysis
Slide 29
Slide 29 text
Gross revenue vs. # opening theaters
Exploratory Data Analysis
~3500
Slide 30
Slide 30 text
Gross revenue vs. Quality rating
Exploratory Data Analysis
No relationship
Slide 31
Slide 31 text
Gross revenue vs. Opening gross
Exploratory Data Analysis
Predictive
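One way to put a number on scatter plots like these is the Pearson correlation coefficient. This is a sketch on made-up toy values, not the talk's data; `pearson_r` is a hypothetical helper written out in plain Python.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy values (millions of dollars), invented for illustration only.
opening_gross = [10, 20, 30, 40, 50]
total_gross = [32, 58, 95, 118, 151]
r = pearson_r(opening_gross, total_gross)
```

A value of `r` near 1 or -1 suggests a strong linear relationship; a value near 0 is consistent with the "no relationship" reading of a plot.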
Slide 32
Slide 32 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 33
Slide 33 text
Multivariate regression
Exploratory Data Analysis
Predictive
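A minimal sketch of a multivariate linear regression fit by ordinary least squares with NumPy. The talk does not show its model code, so the feature choice (budget and opening gross), the function names, and the toy rows are assumptions. The target is generated from a known rule so the recovered coefficients can be checked.

```python
import numpy as np


def fit_linear_model(features, target):
    """Ordinary least squares; returns [intercept, coef_1, ..., coef_k]."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(target, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    X_design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coefs


def predict(coefs, features):
    """Apply fitted coefficients to new feature rows."""
    X = np.asarray(features, dtype=float)
    return coefs[0] + X @ coefs[1:]


# Toy rows invented for illustration: [budget, opening_gross], millions of dollars.
X_toy = [[10, 5], [20, 8], [30, 20], [40, 18], [50, 35]]
# Target built from a known rule (5 + 0.5*budget + 2*opening) so the fit is checkable.
y_toy = [5 + 0.5 * b + 2.0 * o for b, o in X_toy]
coefs = fit_linear_model(X_toy, y_toy)
```

On real data the fit will not be exact. Statsmodels and scikit-learn provide the same fit along with diagnostics such as standard errors.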
Slide 34
Slide 34 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 35
Slide 35 text
• Budget helps (but only a little).
• Timing is important. December is a great
release date.
• PG & G rated movies make more.
• Money made in opening weekend is important.
What did we find?
Slide 36
Slide 36 text
What does it take to win an Oscar?
Slide 37
Slide 37 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 39
Slide 39 text
What makes an Oscar winner?
Image: superawesomevectors.com
Slide 40
Slide 40 text
Not so good: Vague
Will this movie win an Oscar?
Good: Likelihood
What is the likelihood that this movie will win an
Oscar given that it has X features?
Good: Correlation
What attributes of a movie are correlated with the
movie winning an Oscar?
Define a question you can answer
Slide 41
Slide 41 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 42
Slide 42 text
Acquiring Data
IMDbPY Drama
Slide 43
Slide 43 text
Not so good: Vague
Will this movie win an Oscar?
Good: Likelihood
What is the likelihood that this movie will win an Oscar given
that it has X features?
Good: Correlation
What attributes of a movie are correlated with the movie winning
an Oscar?
Define a question you can answer
Best: Conditional correlation
Given that a movie has been nominated for an Oscar,
what attributes are correlated with winning?
Slide 44
Slide 44 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 45
Slide 45 text
Factors we can explore:
• Movie nomination category
• Thematic content (e.g. family, violence, war, father-son relationship, smoking)
• Movie genre
• Where the movie was made
• When the movie debuted
Exploratory Data Analysis
Slide 46
Slide 46 text
Exploratory Data Analysis
Countries associated with winning Oscar movies
Slide 47
Slide 47 text
Exploratory Data Analysis
Number of winning movies per month
Slide 48
Slide 48 text
Exploratory Data Analysis
Ratio of winning films to nominated films by month
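Raw counts of winners per month can mislead, because months with more nominated films will tend to have more winners too. Dividing by the number of nominees corrects for that. The sketch below uses hypothetical records; the field names `month` and `won` are assumptions.

```python
from collections import Counter


def win_ratio_by_month(nominees):
    """Map release month to the share of that month's nominated films that won."""
    nominated = Counter(film["month"] for film in nominees)
    won = Counter(film["month"] for film in nominees if film["won"])
    return {month: won[month] / total for month, total in nominated.items()}


# Hypothetical records, invented for illustration only.
films = [
    {"month": 12, "won": True},
    {"month": 12, "won": False},
    {"month": 12, "won": True},
    {"month": 6, "won": False},
    {"month": 6, "won": True},
    {"month": 3, "won": False},
]
ratios = win_ratio_by_month(films)
```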
Slide 49
Slide 49 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 50
Slide 50 text
• Binary output - winner/non-winner
• More accurate than baseline
What do we want from a model?
Slide 54
Slide 54 text
• All winners
Accuracy = 29%
• All losers
Accuracy = 71%
Establish Baselines
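The two baselines are constant predictors: guess the same class for every film. A small sketch, using synthetic labels chosen only to reproduce the 29% / 71% split on the slide; the function name is hypothetical.

```python
def constant_baseline_accuracy(labels, guess):
    """Accuracy of a model that predicts `guess` for every example."""
    return sum(1 for label in labels if label == guess) / len(labels)


# Synthetic labels matching the slide's class balance (1 = winner, 0 = loser).
labels = [1] * 29 + [0] * 71
all_winners = constant_baseline_accuracy(labels, 1)
all_losers = constant_baseline_accuracy(labels, 0)
```

Any model worth keeping has to beat the better of the two, here the "all losers" guess at 71%.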
Slide 55
Slide 55 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 56
Slide 56 text
To select a model, think about
what it gets wrong.
Slide 57
Slide 57 text
Selecting a model
Confusion Matrix
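A confusion matrix for a binary classifier is four counts. The sketch below tallies them from true and predicted labels; the label encoding (1 = winner) and the example values are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted winner, actually won
        elif pred == positive:
            fp += 1  # predicted winner, actually lost
        elif truth == positive:
            fn += 1  # predicted loser, actually won
        else:
            tn += 1  # predicted loser, actually lost
    return tp, fp, fn, tn


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
counts = confusion_counts(y_true, y_pred)
```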
Slide 58
Slide 58 text
• Accuracy = (TP+TN)/(TP+TN+FN+FP)
• Recall = TP/(TP+FN)
• Precision = TP/(TP+FP)
• F1 = 2*(Precision*Recall)/(Precision+Recall)
Selecting a model
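These metrics follow directly from the four confusion-matrix counts. In this sketch F1 is computed as the harmonic mean of precision and recall, 2PR/(P+R). Returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }


# Example counts, invented for illustration only.
metrics = classification_metrics(tp=2, fp=1, fn=1, tn=4)
```

With imbalanced classes such as Oscar winners versus nominees, accuracy alone can look good for a model that never predicts a winner, which is why recall, precision, and F1 are worth reporting alongside it.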
Slide 59
Slide 59 text
Selecting a model
Receiver-Operating Characteristic (ROC)
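An ROC curve plots the true positive rate against the false positive rate as the decision threshold sweeps from strict to lenient. The sketch below computes the points and the area under the curve in plain Python; the scores and labels are toy values, and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from sweeping a threshold over the scores."""
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs at least one positive and one negative example")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy labels and model scores, invented for illustration only.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)
```

An area of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of winners above losers.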
Slide 60
Slide 60 text
• Define a question you can answer
• Acquire good data
• Understand data
• Fit a model & analyze error
• Draw conclusions from the data
How to build & use a great dataset
Slide 61
Slide 61 text
• Nominated films made in Italy & Spain have a good chance at
winning.
• Films are more likely to win if they are released later in the
year.
• Tone down the gore (unless it is a war film).
• If a film is nominated for best picture, its odds of winning are
good.
• If a film is nominated for best cinematography, its odds are less
good.
What did we find?
Slide 62
Slide 62 text
What did our model get wrong?
Slide 63
Slide 63 text
• What did your model misclassify?
• Are any of those errors systematic?
Analyze errors
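One simple check for systematic errors is to compare the misclassification rate across groups. The sketch below uses hypothetical prediction records; the field names `genre`, `actual`, and `predicted` are assumptions.

```python
from collections import defaultdict


def error_rate_by_group(records, group_key):
    """Map each value of `group_key` to the share of its records misclassified."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec["predicted"] != rec["actual"]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical predictions, invented for illustration only.
records = [
    {"genre": "war", "actual": 1, "predicted": 0},
    {"genre": "war", "actual": 1, "predicted": 0},
    {"genre": "war", "actual": 0, "predicted": 0},
    {"genre": "drama", "actual": 1, "predicted": 1},
    {"genre": "drama", "actual": 0, "predicted": 0},
    {"genre": "drama", "actual": 0, "predicted": 1},
    {"genre": "drama", "actual": 0, "predicted": 0},
]
rates = error_rate_by_group(records, "genre")
```

A group whose error rate is far above the others is a candidate for a systematic error, and a reason to look at how that group is represented in the training data.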
Slide 64
Slide 64 text
Image: NYT
Coded Gaze: Joy Buolamwini
Slide 65
Slide 65 text
Labeling sensitive content
Slide 66
Slide 66 text
Always analyze your errors
Slide 67
Slide 67 text
• Defined an answerable question
• Built a web scraper
• Explored the data
• Fit a model to the data
• Analyzed our errors
What did we do?
Slide 68
Slide 68 text
Team
Slide 69
Slide 69 text
Building a scraper
Requests - HTTP for Humans
BeautifulSoup or PyQuery
Analyzing data
Jupyter
scikit-learn
Statsmodels
Example Projects
http://oscarpredictor.github.io
Where to go from here?
Deborah Hanus
@deborahhanus