Deborah Hanus - Lights, camera, action! Scraping a great dataset to predict Oscar winners

Using Jupyter notebooks and scikit-learn, you’ll predict whether a movie is likely to [win an Oscar](http://oscarpredictor.github.io/) or be a box office hit. Together, we’ll step through the creation of an effective dataset: asking a question your data can answer, writing a web scraper, and answering those questions using nothing but Python libraries and data from the Internet.
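
As a rough illustration of the prediction step described above, the sketch below fits a scikit-learn classifier and reads off the estimated likelihood of a hit for one movie. It is not the talk's actual model: the choice of `LogisticRegression`, the two feature columns, the helper name `hit_probability`, and every number in the toy table are assumptions made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy rows, NOT real movie data.
# Columns (hypothetical): budget in millions, opening theaters in thousands.
X = np.array(
    [
        [5, 0.5],
        [10, 1.0],
        [20, 1.5],
        [40, 2.0],
        [80, 3.0],
        [120, 3.5],
        [150, 4.0],
        [200, 4.2],
    ],
    dtype=float,
)
# Made-up labels: 1 = box office hit, 0 = not a hit.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Logistic regression is one common way to estimate a likelihood from features.
model = LogisticRegression(max_iter=1000)
model.fit(X, y)


def hit_probability(fitted_model, features):
    """Return the estimated probability of class 1 for one feature row."""
    row = np.array([features], dtype=float)
    return float(fitted_model.predict_proba(row)[0, 1])


low = hit_probability(model, [5, 0.5])
high = hit_probability(model, [200, 4.2])
print(f"small release: {low:.2f}, large release: {high:.2f}")
```

Phrasing the output as a probability rather than a yes/no answer matches the "likelihood given X features" form of question that the slides recommend.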

https://us.pycon.org/2017/schedule/presentation/395/

PyCon 2017

May 21, 2017

Transcript

  1. 1.

    Lights Camera Action! Predicting Box Office hits using things you found on the Internet
    Deborah Hanus @deborahhanus
  2. 2.
  3. 3.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  4. 4.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  5. 6.

    Define a question you can answer
    • Not so good (vague): Will my movie be a box office hit?
    • Good (likelihood): What is the likelihood that my movie will be a box office hit given that it has X features?
    • Good (correlation): What attributes of a movie are correlated with box office success?
  6. 7.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  7. 10.

    How to get good data
    • Use an API
    • Write a web scraper
    • Get all the text
    • Make the text queryable
  8. 11.

    Writing a web scraper
    • Make an HTTP request to get the HTML
    Requests Example
  9. 12.
  10. 14.
  11. 15.
  12. 17.
  13. 19.
  14. 20.
  15. 21.
  16. 22.
  17. 23.
  18. 25.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  19. 26.

    Exploratory Data Analysis
    Factors we can explore:
    • Movie budget
    • IMDB Rating
    • Power Studios
    • Opening Weekend
    • How many opening theaters
    • Seasonality
    • MPAA Rating
  20. 31.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  21. 32.

    What did we find?
    • Budget helps (but only a little).
    • Timing is important. December is a great release date.
    • PG & G rated movies make more.
    • Money made in opening weekend is important.
  22. 34.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  23. 35.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  24. 37.

    Define a question you can answer
    • Not so good (vague): Will this movie win an Oscar?
    • Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
    • Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?
  25. 38.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  26. 39.

    Define a question you can answer
    • Not so good (vague): Will this movie win an Oscar?
    • Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
    • Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?
    • Best (conditional correlation): Given that a movie has been nominated for an Oscar, what attributes are correlated with winning?
  27. 40.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  28. 41.

    Exploratory Data Analysis
    Factors we can explore:
    • Movie nomination category
    • Thematic content (e.g. smoking, family, violence, war, father-son relationship)
    • Movie genre
    • Where the movie was made
    • When the movie debuts
  29. 45.

    How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Explore the data
    • Draw conclusions from the data
  30. 46.

    What did we find?
    • Nominated films from Italy & Spain are likely to win.
    • Films are more likely to win if they are released later in the year.
    • Tone down the gore (unless it is a war film).
    • If you are nominated for best picture, your odds for winning are good.
    • If you are nominated for best cinematography, your odds are less good.
  31. 47.

    What did we do?
    • Defined an answerable question
    • Built a web scraper using Requests & BeautifulSoup
    • Explored the data
    • Used it to answer our questions
  32. 48.
  33. 49.

    Where to go from here?
    Building a scraper
    • Requests - HTTP for Humans
    • BeautifulSoup or PyQuery
    Analyzing data
    • Jupyter
    • scikit-learn
    • statsmodels
    Example Projects
    • http://oscarpredictor.github.io
    Deborah Hanus @deborahhanus
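
The scraping steps named in the transcript (make an HTTP request to get the HTML, then make the text queryable) could be sketched with Requests and BeautifulSoup roughly as follows. The slides' own code did not survive extraction, so this is only an assumed shape: the helper names `fetch_html` and `extract_rows`, the table layout, and the placeholder movie rows in `SAMPLE_HTML` are all hypothetical.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Make an HTTP GET request and return the page's HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_rows(html):
    """Turn every table row that has <td> cells into a list of strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows use <th>, so they yield no cells here
            rows.append(cells)
    return rows


# Hypothetical stand-in for a downloaded page; titles and numbers are placeholders.
SAMPLE_HTML = """
<table>
  <tr><th>Title</th><th>Budget</th></tr>
  <tr><td>Example Movie A</td><td>1000000</td></tr>
  <tr><td>Example Movie B</td><td>2500000</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Keeping the download in `fetch_html` separate from the parsing in `extract_rows` means the parsing logic can be developed against a saved page in a notebook without hitting the site repeatedly; on a real page, `extract_rows(fetch_html(url))` would chain the two.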