Slide 1

Slide 1 text

Lights Camera Action!
Predicting Box Office hits using things you found on the Internet
Deborah Hanus @deborahhanus

Slide 2

Slide 2 text

My Path

Slide 3

Slide 3 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 4

Slide 4 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 5

Slide 5 text

What factors drive movie revenue?
Image: unclaimedmoney.com

Slide 6

Slide 6 text

Define a question you can answer
• Not so good (vague): Will my movie be a box office hit?
• Good (likelihood): What is the likelihood that my movie will be a box office hit given that it has X features?
• Good (correlation): What attributes of a movie are correlated with box office success?

Slide 7

Slide 7 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 8

Slide 8 text

What is good data?
• Relevant
• Structured
• (Relatively) complete

Slide 9

Slide 9 text

Where to find good data?

Slide 10

Slide 10 text

How to get good data
• Use an API
• Write a web scraper
• Get all the text
• Make the text queryable

Slide 11

Slide 11 text

Writing a web scraper
• Make an HTTP request to get the HTML
Requests example
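The HTTP request step could look something like the minimal sketch below, which uses the Requests library named on this slide. The base URL is the Box Office Mojo address shown later in the deck; the helper names `chart_params` and `fetch_html`, the query parameter values, and the timeout are assumptions made for illustration, and the site's URL scheme may have changed since the talk.

```python
import requests

# Base URL taken from the Box Office Mojo slide in this deck.
BASE_URL = "http://www.boxofficemojo.com/yearly/chart/"


def chart_params(year, page=1):
    """Build the query string parameters for one yearly chart page.

    Hypothetical helper: the parameter names mirror the URL on the slide,
    but the values the site accepts are an assumption.
    """
    return {"page": page, "view": "releasedate", "view2": "domestic", "yr": year}


def fetch_html(year, page=1, timeout=10):
    """Make an HTTP GET request and return the page's HTML as text."""
    response = requests.get(BASE_URL, params=chart_params(year, page), timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text
```

Calling `fetch_html(2017)` would send the request; passing the parameters as a dict lets Requests handle the URL encoding.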

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Writing a web scraper
http://www.boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=%2017.htm

Slide 14

Slide 14 text

No content

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

Writing a web scraper
• Make HTML queryable using BeautifulSoup & PyQuery
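As a rough illustration of making HTML queryable, the sketch below parses a small table with BeautifulSoup. The HTML snippet, its `chart` class name, the movie titles, and the figures are all made up for this example and are not the real page structure; `parse_chart` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page. The table layout, class
# name, titles, and figures are illustrative assumptions, not real data.
SAMPLE_HTML = """
<table class="chart">
  <tr><th>Title</th><th>Gross</th></tr>
  <tr><td><a href="/movies/?id=a.htm">Movie A</a></td><td>$1,234,567</td></tr>
  <tr><td><a href="/movies/?id=b.htm">Movie B</a></td><td>$890,000</td></tr>
</table>
"""


def parse_chart(html):
    """Return a list of {'title', 'gross'} dicts, one per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="chart")
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 2:  # the header row uses <th>, so it has no <td> cells
            continue
        title = cells[0].get_text(strip=True)
        # Strip the currency symbol and thousands separators before converting.
        gross = int(cells[1].get_text(strip=True).replace("$", "").replace(",", ""))
        rows.append({"title": title, "gross": gross})
    return rows


movies = parse_chart(SAMPLE_HTML)
```

On a real page, the tag and class names to search for would come from inspecting the page source in the browser.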

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Right-click: View Source

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Common problems
• Rate limiting
• API keys
• Selenium
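For the rate limiting problem, one common approach is simply to space requests out. The sketch below is an assumption about how that might be done, not something shown in the slides; `throttled` is a hypothetical helper, and the one-second default interval is arbitrary.

```python
import time


def throttled(items, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Yield items no faster than one per `min_interval` seconds.

    `clock` and `sleep` are injectable so the pacing can be tested
    without actually waiting.
    """
    last = None
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)  # pause until the minimum interval has passed
        last = clock()
        yield item
```

A scraper could then loop with `for url in throttled(urls, min_interval=2.0):` and make one request per iteration. Sites that publish their own limits or a robots.txt policy should take precedence over any fixed delay.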

Slide 25

Slide 25 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 26

Slide 26 text

Exploratory Data Analysis
Factors we can explore:
• Movie budget
• IMDb rating
• Power studios
• Opening weekend
• How many opening theaters
• Seasonality
• MPAA rating

Slide 27

Slide 27 text

Exploratory Data Analysis
Gross revenue vs. number of opening theaters
~3500

Slide 28

Slide 28 text

Exploratory Data Analysis
Gross revenue vs. quality rating
No relationship :(

Slide 29

Slide 29 text

Exploratory Data Analysis
Gross revenue vs. opening gross
Predictive
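These plots compare gross revenue with one factor at a time. One way to put a number on such a relationship is Pearson's correlation coefficient, sketched below. The deck does not say which statistic was used, so this is an assumption, and the dollar figures are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up figures in millions of dollars, for illustration only.
opening_gross = [10, 25, 40, 60, 95]
total_gross = [35, 70, 130, 170, 300]

r = pearson(opening_gross, total_gross)
print(f"correlation between opening gross and total gross: {r:.3f}")
```

A value near 1 or -1 suggests a strong linear relationship, while a value near 0 matches the "no relationship" case.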

Slide 30

Slide 30 text

Exploratory Data Analysis
Multivariate regression
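A multivariate regression fits gross revenue against several factors at once. The sketch below assumes ordinary least squares and uses NumPy rather than the statsmodels package listed at the end of the deck, purely to keep the example small. The data is synthetic, generated from known coefficients so the fit can be checked; none of the numbers come from the talk.

```python
import numpy as np


def fit_linear(features, target):
    """Ordinary least squares fit; returns [intercept, coef_1, ..., coef_k]."""
    # Prepend a column of ones so the first coefficient is the intercept.
    X = np.column_stack([np.ones(len(target)), features])
    coefs, _residuals, _rank, _singular = np.linalg.lstsq(X, target, rcond=None)
    return coefs


# Synthetic, noise-free data (assumption: gross is a linear function of
# budget, opening theaters, and opening gross with the coefficients below).
rng = np.random.default_rng(0)
n = 50
budget = rng.uniform(1, 200, n)
theaters = rng.uniform(100, 4000, n)
opening = rng.uniform(0.1, 100, n)
true_coefs = np.array([5.0, 0.2, 0.01, 2.5])  # intercept, budget, theaters, opening

features = np.column_stack([budget, theaters, opening])
gross = true_coefs[0] + features @ true_coefs[1:]

estimated = fit_linear(features, gross)
```

With real, noisy data the estimates would not match any "true" values exactly, and a package such as statsmodels would also report standard errors and p-values for each coefficient.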

Slide 31

Slide 31 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 32

Slide 32 text

What did we find?
• Budget helps (but only a little).
• Timing is important. December is a great release date.
• PG & G rated movies make more.
• Money made in opening weekend is important.

Slide 33

Slide 33 text

What does it take to win an Oscar?

Slide 34

Slide 34 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 35

Slide 35 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 36

Slide 36 text

What makes an Oscar winner?
Image: superawesomevectors.com

Slide 37

Slide 37 text

Define a question you can answer
• Not so good (vague): Will this movie win an Oscar?
• Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
• Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?

Slide 38

Slide 38 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 39

Slide 39 text

Define a question you can answer
• Not so good (vague): Will this movie win an Oscar?
• Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
• Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?
• Best (conditional correlation): Given that a movie has been nominated for an Oscar, what attributes are correlated with winning?
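The conditional question can be read as: restrict the dataset to nominated films, then compare win rates across the values of some attribute. The sketch below shows that calculation. The records, film names, and the `genre` attribute are made up for illustration and are not real Oscar data; `win_rate_given_nominated` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up nomination records for illustration only; not real Oscar data.
# Every record is a nominated film, so all rates below are conditional
# on nomination.
NOMINATIONS = [
    {"film": "Film A", "genre": "drama", "won": True},
    {"film": "Film B", "genre": "drama", "won": False},
    {"film": "Film C", "genre": "drama", "won": True},
    {"film": "Film D", "genre": "comedy", "won": False},
    {"film": "Film E", "genre": "comedy", "won": True},
    {"film": "Film F", "genre": "comedy", "won": False},
    {"film": "Film G", "genre": "comedy", "won": False},
]


def win_rate_given_nominated(records, attribute):
    """Fraction of nominated films that won, grouped by `attribute`."""
    nominated = defaultdict(int)
    won = defaultdict(int)
    for record in records:
        value = record[attribute]
        nominated[value] += 1
        won[value] += int(record["won"])
    return {value: won[value] / nominated[value] for value in nominated}


rates = win_rate_given_nominated(NOMINATIONS, "genre")
```

The same grouping works for any attribute stored on the records, such as nomination category, country, or release month.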

Slide 40

Slide 40 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 41

Slide 41 text

Exploratory Data Analysis
Factors we can explore:
• Movie nomination category
• Thematic content (e.g. smoking, family, violence, war, father-son relationship)
• Movie genre
• Where the movie was made
• When the movie debuts

Slide 42

Slide 42 text

Exploratory Data Analysis
Countries associated with winning Oscar movies

Slide 43

Slide 43 text

Exploratory Data Analysis
Number of winning movies per month

Slide 44

Slide 44 text

Exploratory Data Analysis
Ratio of winning films to nominated films by month

Slide 45

Slide 45 text

How to build & use a great dataset
• Define a question you can answer
• Acquire good data
• Explore the data
• Draw conclusions from the data

Slide 46

Slide 46 text

What did we find?
• Nominated films from Italy & Spain are likely to win.
• Films are more likely to win if they are released later in the year.
• Tone down the gore (unless it is a war film).
• If you are nominated for best picture, your odds of winning are good.
• If you are nominated for best cinematography, your odds are less good.

Slide 47

Slide 47 text

What did we do?
• Defined an answerable question
• Built a web scraper using Requests & BeautifulSoup
• Explored the data
• Used it to answer our questions

Slide 48

Slide 48 text

Team

Slide 49

Slide 49 text

Where to go from here?
Building a scraper
• Requests - HTTP for Humans
• BeautifulSoup or PyQuery
Analyzing data
• Jupyter
• scikit-learn
• statsmodels
Example projects
• http://oscarpredictor.github.io
Deborah Hanus @deborahhanus