Predict_the_Oscars.pdf

Predicting the Oscars with data science
March 2017

Data Science Process
• Frame the question.
• Collect the raw data.
• Process the data.
• Explore the data.
• Communicate results.

Frame the question
• Who will win the Oscar for Best Picture?

Collect the Data
• What kind of data do we need?
• Financial data (budget, box office…)
• Reviews, ratings and scores.
• Awards and nominations.

Process the data
• How is the data “dirty” and how can we fix it?
• User input, redundancies, missing data…
• Formatting: adapt the data to meet certain specifications.
• Cleaning: detecting and correcting corrupt or inaccurate records.

Explore the data
• What are the meaningful patterns in the data?
• How meaningful is each data point for our predictions?

Communicate results
• Tell the story at the right technical level for each audience.
• Make sure to focus on What’s In It For You (WIIFY!).
• Be objective, don’t lie with statistics.
• Be visual! Show, don’t just tell.

Goals
• Introduction to a data scientist's tools and methods:
  • Jupyter notebooks, numpy, pandas, sklearn…
• Overview of basic machine learning concepts:
  • Data formatting and cleaning, Decision trees, Overfitting, Random Forests…

Jupyter Notebooks
• One of a data scientist’s everyday tools.
• Find the link in our classroom tool:
  • (bit.ly/atl-oscars)
• Contains cells with code. They have already been executed for you.

NumPy
• The fundamental package for scientific computing with Python.
• Provides powerful multi-dimensional array objects.
• Many methods for fast operations on arrays.
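
As a rough illustration of what "fast operations on arrays" means in practice, the sketch below builds a small two-dimensional array and aggregates it without an explicit loop. The figures and the variable names are made up for the example and do not come from the workshop data.

```python
import numpy as np

# Hypothetical box-office figures (in millions): 3 films x 2 weekends.
box_office = np.array([[10.0, 8.0],
                       [25.0, 15.0],
                       [5.0, 6.0]])

# Vectorized operations act on the whole array at once.
totals = box_office.sum(axis=1)        # total per film (sum across columns)
share = box_office / box_office.sum()  # each cell as a fraction of the grand total

print(box_office.shape)  # (3, 2)
print(totals)            # [18. 40. 11.]
```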

Pandas
• Fundamental high-level building block for doing practical, real-world data analysis in Python.
• Built on top of NumPy.
• Offers data structures and operations for manipulating numerical tables and time series.

Scikit-learn
• Python module for machine learning.
• Built on NumPy and SciPy.
• Provides tools for classification, regression, clustering, model selection and preprocessing.

Initial imports and loading data with Pandas
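
The code from this slide is not part of the transcript, so the following is only a minimal sketch of loading tabular data with pandas. The column names and rows are hypothetical, and an in-memory CSV stands in for the real data file so that the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for the real data file: a tiny CSV with made-up columns and values.
csv_text = """film,year,rate,budget,winner
Film A,2001,PG-13,50.0,1
Film B,2001,R,20.0,0
Film C,2002,R,,0
Film D,2002,PG,35.0,1
"""

# pd.read_csv accepts a file path or any file-like object.
films = pd.read_csv(io.StringIO(csv_text))

print(films.shape)   # (rows, columns)
print(films.dtypes)  # the type pandas inferred for each column
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.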

Understanding your data
• .head(n) method: Returns the first n rows.
• .value_counts() method: Returns the count of each unique value in a column.
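
A small sketch of both methods on a made-up DataFrame (the `film` and `rate` columns are assumptions for illustration, not the workshop's actual schema):

```python
import pandas as pd

# Hypothetical data for illustration only.
films = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "rate": ["PG-13", "R", "R", "PG"],
})

first_two = films.head(2)                   # first 2 rows of the table
rate_counts = films["rate"].value_counts()  # count per unique value, most frequent first

print(first_two)
print(rate_counts)
```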

Formatting your Data

Formatting your Data
• Rate values are in a non-numeric format, so we need to assign each rate a unique integer that the model can work with.
• With the .ix indexer you select a subset of rows and assign a value to a chosen column for those rows. (.ix was deprecated in pandas 0.20 and removed in 1.0; .loc is its label-based replacement.)
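
The row-subset assignment described above might look like the sketch below. It uses `.loc` rather than `.ix`, because `.ix` no longer exists in current pandas releases. The rating labels, the integer codes and the `rate_code` column name are hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
films = pd.DataFrame({
    "film": ["Film A", "Film B", "Film C", "Film D"],
    "rate": ["PG-13", "R", "R", "PG"],
})

# Start with a sentinel so that any unmapped rate is easy to spot afterwards.
films["rate_code"] = -1

# Select the rows matching each rate and assign that rate's integer code.
films.loc[films["rate"] == "PG", "rate_code"] = 0
films.loc[films["rate"] == "PG-13", "rate_code"] = 1
films.loc[films["rate"] == "R", "rate_code"] = 2

print(films)
```

Any consistent mapping to unique integers would do; the specific numbers carry no meaning of their own.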

Cleaning your Data
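
The cleaning code from this slide is not in the transcript. As one possible sketch of handling the redundancies and missing data mentioned earlier, the example below drops duplicate rows and fills a missing value with the column median. The data is made up, and the median is just one common choice of fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one duplicated row and one missing budget.
films = pd.DataFrame({
    "film": ["Film A", "Film B", "Film B", "Film C"],
    "budget": [50.0, 20.0, 20.0, np.nan],
})

# Redundancies: drop rows that are exact duplicates of an earlier row.
films = films.drop_duplicates().reset_index(drop=True)

# Missing data: fill gaps with the median of the values that are present.
films["budget"] = films["budget"].fillna(films["budget"].median())

print(films)
```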

Decision Trees
• It breaks down a dataset into smaller and smaller subsets.
• The final result is a model with a tree structure that has:
  • Decision nodes: ask a question and have two or more branches.
  • Leaf nodes: represent a classification or decision.

Classification vs Regression
• Classification: predict categories.
  • Identifying group membership.
• Regression: predict values.
  • Involves estimating or predicting a response.
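
To make the distinction concrete, here is a small sketch that fits a classifier and a regressor from scikit-learn on the same made-up input. The feature (number of nominations) and both targets are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical single feature: number of nominations per film.
nominations = np.array([[2], [3], [4], [8], [10], [12]])

won = np.array([0, 0, 0, 1, 1, 1])                             # category: lost / won
box_office = np.array([20.0, 35.0, 30.0, 90.0, 120.0, 150.0])  # continuous value

classifier = DecisionTreeClassifier(random_state=0).fit(nominations, won)
regressor = DecisionTreeRegressor(random_state=0).fit(nominations, box_office)

new_film = np.array([[13]])
predicted_class = classifier.predict(new_film)  # one of the known class labels
predicted_value = regressor.predict(new_film)   # a number on the target's scale

print(predicted_class, predicted_value)
```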

Classification

Classification?

Creating your first Decision Tree
You will use the scikit-learn and numpy libraries to build your first decision tree. We will need the following to build a decision tree:
• target: A one-dimensional numpy array containing the target from the training data.
• features: A multidimensional numpy array containing the features/predictors from the training data.
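
A minimal sketch of the two inputs and the fitting step, assuming a pandas DataFrame of training data. The column names (`nominations`, `rating`, `winner`) and the values are hypothetical stand-ins for the workshop's dataset.

```python
import pandas as pd
from sklearn import tree

# Hypothetical training data for illustration only.
train = pd.DataFrame({
    "nominations": [10, 3, 8, 2, 12, 4],
    "rating":      [8.5, 7.0, 8.1, 6.5, 8.8, 7.2],
    "winner":      [1, 0, 1, 0, 1, 0],
})

target = train["winner"].values                     # one-dimensional numpy array
features = train[["nominations", "rating"]].values  # two-dimensional numpy array

# Fit a decision tree; random_state makes the result repeatable.
my_tree = tree.DecisionTreeClassifier(random_state=0)
my_tree = my_tree.fit(features, target)

print(target.shape, features.shape)
```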

Creating your first Decision Tree

Importances and Score
• .feature_importances_ attribute: tells us how much each feature contributes to the final result.
• .score() method: returns the mean accuracy of the fitted model on the data and labels passed to it.
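
Both can be read off a fitted tree, as in this sketch. The tiny dataset is made up; note that scoring on the same rows used for training, as done here, says little about how the model performs on new data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [nominations, rating]; target: 1 = winner.
features = np.array([[10, 8.5], [3, 7.0], [8, 8.1],
                     [2, 6.5], [12, 8.8], [4, 7.2]])
target = np.array([1, 0, 1, 0, 1, 0])

my_tree = DecisionTreeClassifier(random_state=0).fit(features, target)

importances = my_tree.feature_importances_        # one value per feature, summing to 1
train_accuracy = my_tree.score(features, target)  # mean accuracy on the rows passed in

print(importances)
print(train_accuracy)
```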

Importances and Score

Predicting
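
The prediction code from this slide is not in the transcript. A minimal sketch, assuming a fitted tree and new rows with the same columns in the same order as the training features (all values below are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [nominations, rating]; target: 1 = winner.
features = np.array([[10, 8.5], [3, 7.0], [8, 8.1],
                     [2, 6.5], [12, 8.8], [4, 7.2]])
target = np.array([1, 0, 1, 0, 1, 0])
my_tree = DecisionTreeClassifier(random_state=0).fit(features, target)

# New, unseen films described with the same two features.
test_features = np.array([[11, 8.6], [1, 6.0]])
prediction = my_tree.predict(test_features)  # one predicted class per row

print(prediction)
```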

Overfitting
• The resulting model is too closely tied to the training set.
• It doesn’t generalize to new data, which is the point of prediction.
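
One way to see overfitting is to compare accuracy on the training rows with accuracy on held-out rows. The sketch below does this on synthetic noisy data generated by scikit-learn (not the Oscars data), and uses `max_depth` as one of several possible ways to limit a tree's complexity.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with about 20% of labels randomly reassigned, standing in for noisy records.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# An unrestricted tree can memorize the training rows, noise included.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# A depth-limited tree is forced to learn coarser rules.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(name,
          "train:", model.score(X_train, y_train),
          "test:", model.score(X_test, y_test))
```

A large gap between the training and test accuracy of the unrestricted tree is the symptom to look for.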

Random Forest Classifier
• Random Forest Classifiers use many Decision Trees to build a classifier.
• We introduce a bit of randomness.
• Each Tree can give a different answer (a vote). The final classification is the most common amongst the Trees.
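
A minimal sketch with scikit-learn's `RandomForestClassifier`, again on made-up data. The parameter values are illustrative choices. One detail worth knowing: scikit-learn's implementation averages each tree's predicted class probabilities rather than counting hard votes, which usually leads to the same answer.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: [nominations, rating]; target: 1 = winner.
features = np.array([[10, 8.5], [3, 7.0], [8, 8.1],
                     [2, 6.5], [12, 8.8], [4, 7.2]])
target = np.array([1, 0, 1, 0, 1, 0])

# n_estimators is the number of trees; each tree is trained on a random
# bootstrap sample of the rows and considers a random subset of features per split.
forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=1)
forest = forest.fit(features, target)

prediction = forest.predict(np.array([[11, 8.6], [1, 6.0]]))

print(len(forest.estimators_))      # the individual fitted trees
print(forest.feature_importances_)  # importances averaged over the trees
print(prediction)
```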

Random Forest Classifier

Importances and Score

Predicting with Random Forest Classifiers

Results

1976 Rocky

1984 Amadeus

1996 The English Patient

2009 The Hurt Locker

And the Oscar goes to…

La La Land!!

The End
Nothing happened after that. Right?? RIGHT??

We can predict the Oscars
Except for 2017 ¯\_(ツ)_/¯

What is Thinkful?
Online skills bootcamp with 1-on-1 mentorship: learn anytime and anywhere, and get a job, guaranteed. Anyone who’s committed can learn to code.

Our Philosophy
• 1-on-1 mentorship is the best way to learn
• Flexibility matters: learn anywhere, anytime
• We only make money when you get a job…

Our Results — Job Guarantee Bhaumik Liz

Data Science Bootcamp
Syllabus: Python Toolkit, Statistics & Probability, Experimentation, Machine Learning, Communicating Data, Algorithms and Big Data

Web Development Bootcamp
Syllabus: Beginner and Intermediate Frontend Development, Backend Development, CS Fundamentals, Product Engineering

Special Prep Course Offer
• Three-week program, includes six mentor sessions
• Web: HTML/CSS, JavaScript, jQuery, Responsive Design
• Data: Basic Python & Stats, Data Science Toolkit, Project
• Option to continue into web development bootcamp
• Prep courses cost $500 (can apply to cost of full bootcamp)
• Talk to us about special offers for both programs
