Deborah Hanus - Lights, camera, action! Scraping a great dataset to predict Oscar winners

Using Jupyter notebooks and scikit-learn, you’ll predict whether a movie is likely to [win an Oscar](http://oscarpredictor.github.io/) or be a box office hit. Together, we’ll step through the creation of an effective dataset: asking a question your data can answer, writing a web scraper, and answering those questions using nothing but Python libraries and data from the Internet.

https://us.pycon.org/2017/schedule/presentation/395/

PyCon 2017

May 21, 2017

Transcript

  1. Lights Camera Action! Predicting Box Office hits using things you found on the Internet
    Deborah Hanus @deborahhanus
  2. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  3. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  4. Define a question you can answer
    • Not so good (vague): Will my movie be a box office hit?
    • Good (likelihood): What is the likelihood that my movie will be a box office hit given that it has X features?
    • Good (correlation): What attributes of a movie are correlated with box office success?
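
The "likelihood" framing on this slide maps naturally onto a classifier that outputs probabilities. The following is a minimal sketch, not the talk's actual model: it assumes scikit-learn's `LogisticRegression`, and the feature columns, the data values, and the helper name `hit_probability` are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data for illustration only: one row per movie.
# Columns: budget (millions USD), opening theaters (thousands).
X = np.array([
    [10, 0.5], [20, 1.0], [35, 1.5], [50, 2.0],
    [80, 2.5], [120, 3.0], [150, 3.5], [200, 4.0],
])
# 1 = "box office hit" under whatever threshold you choose, 0 = not a hit.
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)


def hit_probability(budget_musd, theaters_k):
    """Estimated P(hit | features) for one hypothetical movie."""
    features = np.array([[budget_musd, theaters_k]])
    # Column 1 of predict_proba is the probability of class 1 ("hit").
    return float(model.predict_proba(features)[0, 1])


print(f"P(hit) = {hit_probability(90, 2.8):.2f}")
```

The point of the sketch is the shape of the answer: a probability conditioned on features, rather than a yes/no guess about a single movie.
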
  5. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  6. How to get good data
    • Use an API
    • Write a web scraper
    • Get all the text
    • Make the text queryable
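
One way to read "make the text queryable" is to load the scraped records into a small database. The sketch below uses Python's built-in `sqlite3` module as an assumption; the talk does not say which storage was used, and the table layout, the helper name `build_movie_db`, and the rows are placeholders rather than real data.

```python
import sqlite3

# Placeholder records standing in for scraped output (not real movies).
scraped_rows = [
    {"title": "Movie A", "mpaa_rating": "PG", "release_month": 12, "gross_musd": 310.0},
    {"title": "Movie B", "mpaa_rating": "R", "release_month": 6, "gross_musd": 45.0},
    {"title": "Movie C", "mpaa_rating": "G", "release_month": 12, "gross_musd": 120.0},
]


def build_movie_db(rows):
    """Load scraped dictionaries into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movies ("
        "title TEXT, mpaa_rating TEXT, release_month INTEGER, gross_musd REAL)"
    )
    # Named placeholders (:title, ...) are filled from each dictionary.
    conn.executemany(
        "INSERT INTO movies VALUES (:title, :mpaa_rating, :release_month, :gross_musd)",
        rows,
    )
    conn.commit()
    return conn


conn = build_movie_db(scraped_rows)
december = conn.execute(
    "SELECT title FROM movies WHERE release_month = 12 ORDER BY title"
).fetchall()
print(december)
```

Once the records are in a table, questions such as "which movies opened in December?" become one-line queries instead of ad hoc text searches.
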
  7. Writing a web scraper
    • Make an HTTP request to get the HTML
    Requests example
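
The slide's Requests example is not preserved in this transcript, so here is a minimal sketch of what that step could look like. The function name `fetch_html`, the timeout value, and the User-Agent string are assumptions made for illustration; only `requests.get`, `raise_for_status`, and `.text` come from the Requests library itself.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Make an HTTP GET request and return the page's HTML as text.

    `session` may be a requests.Session (useful for reusing connections);
    if omitted, the module-level requests.get is used.
    """
    client = session if session is not None else requests
    response = client.get(
        url,
        timeout=timeout,
        # Hypothetical User-Agent; identify your scraper honestly.
        headers={"User-Agent": "example-movie-scraper/0.1"},
    )
    # Raise an exception on 4xx/5xx instead of silently parsing an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("http://oscarpredictor.github.io")` would return that page's HTML as a string, ready to hand to a parser. Check a site's terms of use and robots.txt before scraping it.
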
  8. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  9. Exploratory Data Analysis
    Factors we can explore:
    • Movie budget
    • IMDB Rating
    • Power Studios
    • Opening Weekend
    • How many opening theaters
    • Seasonality
    • MPAA Rating
  10. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  11. What did we find?
    • Budget helps (but only a little).
    • Timing is important. December is a great release date.
    • PG & G rated movies make more.
    • Money made in opening weekend is important.
  12. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  13. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  14. Define a question you can answer
    • Not so good (vague): Will this movie win an Oscar?
    • Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
    • Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?
  15. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  16. Define a question you can answer
    • Not so good (vague): Will this movie win an Oscar?
    • Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
    • Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?
    • Best (conditional correlation): Given that a movie has been nominated for an Oscar, what attributes are correlated with winning?
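
The "conditional" part of this question amounts to filtering to nominees first and only then comparing attributes against wins. A minimal sketch under stated assumptions: the records, the field names, and the helper `win_rate_among_nominees` are invented for illustration and are not real Oscar data or the talk's code.

```python
from collections import defaultdict

# Invented placeholder records, not real Oscar data.
movies = [
    {"title": "Film A", "nominated": True, "won": True, "release_quarter": 4},
    {"title": "Film B", "nominated": True, "won": False, "release_quarter": 2},
    {"title": "Film C", "nominated": False, "won": False, "release_quarter": 4},
    {"title": "Film D", "nominated": True, "won": True, "release_quarter": 4},
    {"title": "Film E", "nominated": True, "won": False, "release_quarter": 4},
    {"title": "Film F", "nominated": True, "won": False, "release_quarter": 1},
]


def win_rate_among_nominees(records, attribute):
    """Return {attribute value: share of nominees with that value that won}."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        if not record["nominated"]:
            continue  # condition on nomination: non-nominees are excluded
        key = record[attribute]
        totals[key] += 1
        wins[key] += int(record["won"])
    return {key: wins[key] / totals[key] for key in totals}


rates = win_rate_among_nominees(movies, "release_quarter")
for quarter in sorted(rates):
    print(f"Q{quarter}: {rates[quarter]:.2f}")
```

Conditioning on nomination keeps the large pool of never-nominated films from swamping the comparison, which is why the slide calls this version of the question the best of the three.
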
  17. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  18. Exploratory Data Analysis
    Factors we can explore:
    • Movie nomination category
    • Thematic content (e.g. smoking, family, violence, war, father-son relationship)
    • Movie genre
    • Where the movie was made
    • When the movie debuts
  19. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  20. What did we find?
    • Nominated films from Italy & Spain are likely to win
    • Films are more likely to win if they are released later in the year
    • Tone down the gore (unless it is a war film)
    • If you are nominated for best picture, your odds for winning are good.
    • If you are nominated for best cinematography, your odds are less good.
  21. What did we do?
    • Defined an answerable question
    • Built a web scraper using Requests & BeautifulSoup
    • Explored the data
    • Used it to answer our questions
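
To make the BeautifulSoup half of the scraper concrete, here is a minimal parsing sketch. The HTML snippet, the table id `nominees`, and the function name `parse_nominees` are made up for illustration; a real page would need selectors matched to its actual markup.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
SAMPLE_HTML = """
<table id="nominees">
  <tr><th>Film</th><th>Category</th><th>Result</th></tr>
  <tr><td>Film A</td><td>Best Picture</td><td>Winner</td></tr>
  <tr><td>Film B</td><td>Best Cinematography</td><td>Nominee</td></tr>
</table>
"""


def parse_nominees(html):
    """Turn the rows of the (hypothetical) nominees table into dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="nominees")
    records = []
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
        if len(cells) != 3:
            continue  # the header row has <th> cells, so it yields no <td>
        film, category, result = cells
        records.append(
            {"film": film, "category": category, "won": result == "Winner"}
        )
    return records


records = parse_nominees(SAMPLE_HTML)
print(records)
```

The output is a list of plain dictionaries, a convenient shape to load into a table or DataFrame for the exploration step.
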
  22. Where to go from here?
    Building a scraper
    • Requests - HTTP for Humans
    • BeautifulSoup or PyQuery
    Analyzing data
    • Jupyter
    • scikit-learn
    • statsmodels
    Example projects
    • http://oscarpredictor.github.io
    Deborah Hanus @deborahhanus