2017 - Predicting Oscar winners & box office hits

PyBay
August 21, 2017

Description

Using Jupyter notebooks and scikit-learn, you’ll predict whether a movie is likely to win an Oscar or be a box office hit. Together, we’ll step through the creation of an effective dataset: asking a question your data can answer, writing a web scraper, and answering those questions using nothing but Python libraries and data from the Internet.

Abstract

This talk is for engineers, data scientists, and movie lovers who want to learn how to scrape information from the Internet and then use Python libraries (and some domain knowledge) to answer interesting questions with that data. It should be informative for people with a wide range of skill levels, but I expect it to be especially useful for anyone getting started with data science, HTTP requests, pandas, and scikit-learn.

By the end of this talk, you should (a) understand how to scrape and manage small to medium data sets, (b) know how to overcome the most common roadblocks (e.g., dealing with timeouts or API keys), (c) understand the tools and steps you need to answer interesting questions in data science, and (d) have access to an example project in a Jupyter notebook that you can use as a template or extend.

Bio

Deborah is a PhD student studying machine learning at Harvard University, and she graduated from MIT with an M.Eng. in Electrical Engineering & Computer Science. Her work in machine learning has ranged from developing models of human perception to exploring medical data. She has also been awarded the NSF, Fulbright, and ACM/Intel Computational & Data Science Fellowships. She has spoken at PyTennessee, SciPy Conf, AI With the Best, QConNY, and PyCon US.

https://www.youtube.com/watch?v=5R1OwnGjRQM

Transcript

  1. Predicting box office hits & Oscar winners using things you found on the Internet
    Deborah Hanus, @deborahhanus
  2. [Chart labels] 82% data wrangling, 13% EDA & ML, 5% other. 62% data wrangling, 20% data understanding.
  3. How to build & use a great dataset
    • Define a question you can answer
    • Acquire good data
    • Understand data
    • Fit a model & analyze error
    • Draw conclusions from the data
  5. Define a question you can answer
    • Not so good (vague): Will my movie be a box office hit?
    • Good (likelihood): What is the likelihood that my movie will be a box office hit given that it has X features?
    • Good (correlation): What attributes of a movie are correlated with box office success?
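A correlation question like the one on this slide can be checked directly once the data is collected. The sketch below is a minimal, standard-library illustration of Pearson correlation. The function name `pearson` and the idea of comparing a feature column (such as budget) against box office gross are assumptions made for illustration, not code from the talk.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Hypothetical helper for illustration: xs could be movie budgets and
    ys the matching box office grosses.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

A value near 1 or -1 suggests a strong linear relationship, and a value near 0 suggests little linear relationship. With the data in a pandas DataFrame, `DataFrame.corr()` computes the same statistic for every pair of numeric columns.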
  7. How to get good data
    • Use an API
    • Write a web scraper
    • Get all the text
    • Make the text queryable
  8. Writing a web scraper
    • Make an HTTP request to get the HTML
    [Requests example]
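The slide's Requests example did not survive in the transcript, so the following is only a sketch of what such a step might look like, not the speaker's code. It assumes the Requests library is installed. The names `fetch_html` and `extract_title`, the retry count, and the backoff delay are all hypothetical. To stay dependency-light it parses with the standard library's `html.parser` rather than BeautifulSoup or PyQuery, which the talk recommends.

```python
import time
from html.parser import HTMLParser

import requests


class TitleParser(HTMLParser):
    """Collects the text found inside <title> tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url, get=requests.get, retries=3, timeout=10, backoff=1.0):
    """Return the page HTML, retrying when a request times out.

    `get` is injectable so the function can be exercised without network
    access; the defaults are illustrative assumptions, not tuned values.
    """
    for attempt in range(retries):
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # raise on 4xx/5xx responses
            return response.text
        except requests.exceptions.Timeout:
            if attempt == retries - 1:
                raise  # out of retries: surface the timeout to the caller
            time.sleep(backoff * (attempt + 1))  # wait longer each time


def extract_title(html):
    """Pull the page title out of an HTML string."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()
```

Passing an explicit `timeout` matters because Requests otherwise waits indefinitely for a slow server, which is one of the common scraping roadblocks.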
  10. Exploratory Data Analysis
    Factors we can explore:
    • Movie budget
    • IMDB Rating
    • Power Studios
    • Opening Weekend
    • How many opening theaters
    • Seasonality
    • MPAA Rating
  13. What did we find?
    • Budget helps (but only a little).
    • Timing is important. December is a great release date.
    • PG- and G-rated movies make more.
    • Money made in opening weekend is important.
  16. Define a question you can answer
    • Not so good (vague): Will this movie win an Oscar?
    • Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
    • Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?
  18. Define a question you can answer
    • Not so good (vague): Will this movie win an Oscar?
    • Good (likelihood): What is the likelihood that this movie will win an Oscar given that it has X features?
    • Good (correlation): What attributes of a movie are correlated with the movie winning an Oscar?
    • Best (conditional correlation): Given that a movie has been nominated for an Oscar, what attributes are correlated with winning?
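The conditional question amounts to filtering to nominated films first and only then comparing win rates across an attribute. A minimal sketch follows, assuming each film is a dict with hypothetical `nominated` and `won` boolean fields plus whatever attribute is being examined. Neither the field names nor the function name come from the talk.

```python
from collections import defaultdict


def win_rate_among_nominees(films, attribute):
    """Win rate per attribute value, computed over nominated films only.

    `films` is a list of dicts; the "nominated" and "won" keys are
    hypothetical field names chosen for this illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # attribute value -> [wins, nominees]
    for film in films:
        if not film["nominated"]:
            continue  # condition on nomination: skip everything else
        tally = counts[film[attribute]]
        tally[0] += int(film["won"])
        tally[1] += 1
    return {value: wins / total for value, (wins, total) in counts.items()}
```

Conditioning on nomination removes the very large pool of films that were never in contention, so the remaining comparison is between plausible winners rather than between nominees and everything else.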
  20. Exploratory Data Analysis
    Factors we can explore:
    • Movie nomination category
    • Thematic content (e.g. family, violence, war, father-son relationship, smoking)
    • Movie genre
    • Where the movie was made
    • When the movie debuted
  23. Selecting a model
    • Accuracy = (TP+TN)/(TP+TN+FN+FP)
    • Recall = TP/(TP+FN)
    • Precision = TP/(TP+FP)
    • F1 = 2*(Precision*Recall)/(Precision+Recall)
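These metrics can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions, with F1 as the harmonic mean of precision and recall, 2PR/(P+R). The function name and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts.

    Returns 0.0 for any metric whose denominator is zero (an illustrative
    convention, not the only reasonable one).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
    }
```

When winners are rare, as with Oscars, accuracy alone can look high for a model that never predicts a win, which is why recall, precision, and F1 are worth reporting alongside it.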
  25. What did we find?
    • Nominated films made in Italy & Spain have a good chance at winning.
    • Films are more likely to win if they are released later in the year.
    • Tone down the gore (unless it is a war film).
    • If a film is nominated for best picture, its odds of winning are good.
    • If a film is nominated for best cinematography, its odds are less good.
  26. Analyze errors
    • What did your model misclassify?
    • Are any of those errors systematic?
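One way to look for systematic errors is to break the misclassification rate down by some attribute of the examples, such as genre or nomination category. The sketch below assumes parallel lists of true labels, predicted labels, and a grouping attribute. The function name and the grouping idea are illustrative assumptions, not the talk's code.

```python
from collections import Counter


def error_rate_by_group(y_true, y_pred, groups):
    """Fraction of misclassified examples within each group.

    The three arguments are parallel sequences; `groups` holds a
    hypothetical attribute (for example, genre) for each example.
    """
    totals = Counter(groups)
    errors = Counter(
        group
        for truth, pred, group in zip(y_true, y_pred, groups)
        if truth != pred
    )
    # Counter returns 0 for groups that had no errors.
    return {group: errors[group] / totals[group] for group in totals}
```

A group whose error rate is much higher than the others points to a feature the model handles poorly, which is a natural place to look for missing or mislabeled data.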
  27. What did we do?
    • Defined an answerable question
    • Built a web scraper
    • Explored the data
    • Fit a model to the data
    • Analyzed our errors
  28. Where to go from here?
    • Building a scraper: Requests (HTTP for Humans), BeautifulSoup or PyQuery
    • Analyzing data: Jupyter, scikit-learn, Statsmodels
    • Example projects: http://oscarpredictor.github.io
    Deborah Hanus, @deborahhanus