Lights, Camera, Action!
Predicting Box Office hits using things you found on the Internet
Deborah Hanus
@deborahhanus
Slide 2
My Path
Slide 3
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data
How to build & use a great dataset
Slide 5
What factors drive movie revenue?
Image: unclaimedmoney.com
Slide 6
Not so good: Vague
Will my movie be a box office hit?
Good: Likelihood
What is the likelihood that my movie will be a
box office hit given that it has X features?
Good: Correlation
What attributes of a movie are correlated with
box office success?
Define a question you can answer
Slide 7
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data
How to build & use a great dataset
Slide 8
• Relevant
• Structured
• (Relatively) complete
What is good data?
Slide 9
Where to find good data?
Slide 10
• Use an API
• Write a web scraper
• Get all the text
• Make the text queryable
How to get good data
Slide 11
• Make an HTTP request to get the HTML
Writing a web scraper
Requests Example
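The request step above might look like the following minimal sketch, assuming the Requests library is installed. The function name `fetch_html`, the timeout value, and the User-Agent string are illustrative choices, not taken from the slides.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML body of `url`, raising on HTTP errors."""
    # Accept an existing session so callers can reuse connections;
    # fall back to the top-level requests API otherwise.
    http = session or requests
    response = http.get(
        url,
        headers={"User-Agent": "movie-data-tutorial/0.1"},  # hypothetical value
        timeout=timeout,
    )
    # Turn 4xx/5xx responses into exceptions instead of parsing error pages.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with a chart page URL returns the page source as a string, ready to hand to an HTML parser.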
Slide 13
Writing a web scraper
http://www.boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=%2017.htm
Slide 16
• Make HTML queryable using BeautifulSoup
& PyQuery
Writing a web scraper
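To illustrate what "queryable" means here, a small sketch using BeautifulSoup (assuming the `beautifulsoup4` package is installed). The HTML snippet, the `chart` table id, and the three-column layout are invented for the example; the real page's markup will differ, so the lookups would need adjusting after inspecting the page source.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded chart page (an assumption,
# not the actual Box Office Mojo markup).
SAMPLE_HTML = """
<table id="chart">
  <tr><th>Title</th><th>Studio</th><th>Gross</th></tr>
  <tr><td>Example Movie A</td><td>Studio X</td><td>$1,000</td></tr>
  <tr><td>Example Movie B</td><td>Studio Y</td><td>$250</td></tr>
</table>
"""


def parse_chart(html):
    """Extract one dict per data row from the chart table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="chart")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # The header row has <th> cells only, so it yields no <td> text.
        if len(cells) != 3:
            continue
        title, studio, gross = cells
        rows.append({
            "title": title,
            "studio": studio,
            # Strip currency formatting, e.g. "$1,000" becomes 1000.
            "gross": int(gross.replace("$", "").replace(",", "")),
        })
    return rows
```

`parse_chart(SAMPLE_HTML)` returns a list of dictionaries, one per movie, which is a convenient shape for loading into an analysis tool later.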
Slide 18
Right-click: View Source
Slide 24
• Rate limiting
• API Keys
• Selenium
Common problems
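For the rate limiting problem, one common approach is to pause between requests and back off when the server answers with HTTP 429 (Too Many Requests). The sketch below covers only that case, not API keys or Selenium. The `fetch` callable, its `(status_code, body)` return shape, and the default delays are assumptions made for illustration.

```python
import time


def fetch_all(urls, fetch, delay=1.0, max_retries=3, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetch(url)` is any callable returning (status_code, body). It is a
    hypothetical stand-in for a real HTTP call.
    """
    results = {}
    for url in urls:
        for attempt in range(max_retries + 1):
            status, body = fetch(url)
            if status != 429:  # 429 = Too Many Requests
                results[url] = body if status == 200 else None
                break
            # Back off exponentially before retrying: delay, 2*delay, ...
            sleep(delay * 2 ** attempt)
        else:
            results[url] = None  # gave up after repeated 429 responses
        sleep(delay)  # fixed pause before moving on to the next URL
    return results
```

Passing `sleep` as a parameter keeps the function easy to test without actually waiting.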
Slide 25
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data
How to build & use a great dataset
Slide 26
Factors we can explore:
• Movie budget
• IMDb Rating
• Power Studios
• Opening Weekend
• Number of opening theaters
• Seasonality
• MPAA Rating
Exploratory Data Analysis
Slide 27
Gross revenue vs. # opening theaters
Exploratory Data Analysis
~3500
Slide 28
Gross revenue vs. Quality rating
Exploratory Data Analysis
No relationship :(
Slide 29
Gross revenue vs. Opening gross
Exploratory Data Analysis
Predictive
Slide 30
Multivariate regression
Exploratory Data Analysis
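As a rough illustration of multivariate regression, the sketch below fits ordinary least squares with two predictors using NumPy. The data is synthetic and noise-free, generated from known coefficients purely to show the mechanics; it is not box office data, and the slides do not say which fitting routine was actually used. The helper name `fit_linear` is hypothetical.

```python
import numpy as np


def fit_linear(features, target):
    """Ordinary least squares: returns (intercept, coefficients)."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(target, dtype=float)
    # Prepend a column of ones so the model learns an intercept.
    design = np.column_stack([np.ones(len(X)), X])
    solution, *_ = np.linalg.lstsq(design, y, rcond=None)
    return solution[0], solution[1:]


# Synthetic data built from known coefficients (not real figures):
# y = 2 + 3*x1 + 0.5*x2
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 2 + 3 * X_demo[:, 0] + 0.5 * X_demo[:, 1]

intercept, coefs = fit_linear(X_demo, y_demo)
```

With real movie data, the columns of `features` would be factors such as budget or number of opening theaters, and libraries like statsmodels or scikit-learn would additionally report fit diagnostics.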
Slide 31
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data
How to build & use a great dataset
Slide 32
• Budget helps (but only a little).
• Timing is important. December is a great
release date.
• PG- and G-rated movies make more.
• Money made in opening weekend is important.
What did we find?
Slide 33
What does it take to win an Oscar?
Slide 34
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data
How to build & use a great dataset
Slide 36
What makes an Oscar winner?
Image: superawesomevectors.com
Slide 37
Not so good: Vague
Will this movie win an Oscar?
Good: Likelihood
What is the likelihood that this movie will win
an Oscar given that it has X features?
Good: Correlation
What attributes of a movie are correlated with
the movie winning an Oscar?
Define a question you can answer
Slide 38
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data
How to build & use a great dataset
Slide 39
Not so good: Vague
Will this movie win an Oscar?
Good: Likelihood
What is the likelihood that this movie will win
an Oscar given that it has X features?
Good: Correlation
What attributes of a movie are correlated with
the movie winning an Oscar?
Define a question you can answer
Best: Conditional correlation
Given that a movie has been nominated for an Oscar,
what attributes are correlated with winning?
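One simple way to look at the conditional question is to keep only nominated films and compare win rates across the values of an attribute. This is a win rate among nominees rather than a correlation coefficient, and the records and field names (`genre`, `nominated`, `won`) are invented for the example.

```python
from collections import defaultdict


def win_rate_among_nominees(films, attribute):
    """Share of nominated films that won, grouped by `attribute`."""
    nominated = defaultdict(int)
    won = defaultdict(int)
    for film in films:
        if not film["nominated"]:
            continue  # condition on nomination: ignore everything else
        key = film[attribute]
        nominated[key] += 1
        won[key] += film["won"]  # True counts as 1, False as 0
    return {key: won[key] / nominated[key] for key in nominated}


# Hypothetical records, not real nomination data.
films = [
    {"genre": "war", "nominated": True, "won": True},
    {"genre": "war", "nominated": True, "won": False},
    {"genre": "comedy", "nominated": True, "won": False},
    {"genre": "comedy", "nominated": False, "won": False},
]
rates = win_rate_among_nominees(films, "genre")
```

The same function works for any attribute stored on each record, such as release month or country, as long as the field is present on every film.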
Slide 40
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data
How to build & use a great dataset
Slide 41
Factors we can explore:
• Movie nomination category
• Thematic content (e.g. smoking, family, violence,
war, father-son relationship)
• Movie genre
• Where the movie was made
• When the movie debuts
Exploratory Data Analysis
Slide 42
Exploratory Data Analysis
Countries associated with Oscar-winning movies
Slide 43
Exploratory Data Analysis
Number of winning movies per month
Slide 44
Exploratory Data Analysis
Ratio of winning films to nominated films by month
Slide 45
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data
How to build & use a great dataset
Slide 46
• Nominated films from Italy & Spain are likely to
win
• Films are more likely to win if they are released
later in the year
• Tone down the gore (unless it is a war film)
• If you are nominated for Best Picture, your odds
of winning are good.
• If you are nominated for Best Cinematography,
your odds are not as good.
What did we find?
Slide 47
• Defined an answerable question
• Built a web scraper using Requests &
BeautifulSoup
• Explored the data
• Used it to answer our questions
What did we do?
Slide 48
Team
Slide 49
Building a scraper
Requests - HTTP for Humans
BeautifulSoup or PyQuery
Analyzing data
Jupyter
scikit-learn
statsmodels
Example Projects
http://oscarpredictor.github.io
Where to go from here?
Deborah Hanus
@deborahhanus