R - Python ID Meetup, Feb 2015

R Data Mining 101
Python ID Meetup, February 2015
@aliakbars / @okiriza

R: What Is It?
A statistical programming language
• Free (GNU GPL)
• Interpreted, command-line based
• Many up-to-date packages
• Great IDE: RStudio
• Highly sought after for data science (as is Python)

First Look

Slight Catch: Vectorised Operations

Python
    import math
    A = range(1, 6)
    B = [x + 5 for x in A]
    C = [A[i]*B[i] for i in range(len(A))]
    D = [math.log(x) for x in C]

R
    A <- 1:5
    B <- A + 5
    C <- A*B
    D <- log(C)

Operators and built-in functions accept vectors (lists of values) as input
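
To see the contrast from the Python side, here is a minimal sketch of a list wrapper whose operators work element-wise, roughly the way R vectors behave. The class `Vec` and its methods are hypothetical names written only for illustration. Unlike R, this sketch refuses vectors of different lengths instead of recycling the shorter one.

```python
import math


class Vec:
    """Hypothetical list wrapper with R-like element-wise operators."""

    def __init__(self, values):
        self.values = list(values)

    def _pair(self, other):
        # A scalar is repeated against every element, like `A + 5` in R.
        if isinstance(other, Vec):
            if len(other.values) != len(self.values):
                raise ValueError("vectors must have the same length")
            return other.values
        return [other] * len(self.values)

    def __add__(self, other):
        return Vec(a + b for a, b in zip(self.values, self._pair(other)))

    def __mul__(self, other):
        return Vec(a * b for a, b in zip(self.values, self._pair(other)))

    def log(self):
        return Vec(math.log(x) for x in self.values)


A = Vec(range(1, 6))  # 1, 2, 3, 4, 5
B = A + 5             # 6, 7, 8, 9, 10
C = A * B             # 6, 14, 24, 36, 50
D = C.log()           # natural log of each element of C
```

With such a wrapper the Python lines read almost like the R ones, which is the convenience the slide points at.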

Our Data Set
• 13k + 4k rows (records) of house advertisements, with 10 columns (variables) each: city, time of posting, price, sale or rent, rental term, type of property, land size, building size, bedroom, bathroom
• Data are from 2 cities (anonymised) over a 2-year period (2012-2013)
• Objective: predict house price for quarter 4 of 2013, given the other 9 variables
• Data generously provided by:

Data Understanding: Summary Statistics
Perhaps too detailed?
Need to clean up price
-1 bedroom (and bathroom)?
Are we going to analyse rent?
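
A rough Python sketch of a per-column summary (minimum, quartiles, mean, maximum) follows. It assumes that -1 is a placeholder for "not stated", which the slide only hints at, and the function name `summarise` and the sample values are made up for illustration.

```python
import statistics


def summarise(values, missing=-1):
    """Return min, quartiles, mean and max, skipping the missing-value code."""
    clean = sorted(v for v in values if v != missing)
    q1, median, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    return {
        "n_missing": len(values) - len(clean),
        "min": clean[0],
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(clean),
        "q3": q3,
        "max": clean[-1],
    }


# Made-up bedroom counts; -1 is assumed to mean "not stated".
bedrooms = [3, 2, -1, 4, 3, -1, 5, 2, 3]
stats = summarise(bedrooms)
```

Counting the placeholder values separately keeps them from dragging the minimum and the mean down.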

Data Understanding: Plots
Sometimes need to clean data for exploration
Extremely skewed histogram, an obvious sign of outliers
More than 1 trillion rupiahs for a house?
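
The effect of a single extreme price on a histogram can be sketched in a few lines of Python. The prices below are made up, and the helper `histogram` is a hypothetical equal-width binning function, not what the talk used.

```python
def histogram(values, n_bins):
    """Count values in n_bins equal-width bins spanning min..max."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        # All values identical: everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    counts = [0] * n_bins
    for v in values:
        # The maximum sits on the upper edge; place it in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Made-up prices in millions of rupiah, plus one entry above 1 trillion.
prices = [350, 500, 750, 900, 1200, 1500, 2500, 4000, 1_500_000]
counts = histogram(prices, n_bins=5)
for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```

Every ordinary listing collapses into the first bin, so the plot says nothing about the bulk of the data until the extreme value is dealt with.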

Data Preparation: Removing Rows and Columns
Retain only sale properties (not rent)
Remove irrelevant columns
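
In Python terms, this step could look like the sketch below. The column names (`listing_type`, `rental_term`) are hypothetical, and dropping exactly these two columns is an assumption: once rentals are gone they carry no information, but the slide does not say which columns were removed.

```python
def keep_sale_rows(rows, drop_columns):
    """Keep rows advertised for sale and drop the named columns."""
    return [
        {key: value for key, value in row.items() if key not in drop_columns}
        for row in rows
        if row["listing_type"] == "sale"
    ]


# Made-up advertisements for illustration.
ads = [
    {"city": "A", "listing_type": "sale", "rental_term": None, "price": 850},
    {"city": "B", "listing_type": "rent", "rental_term": "yearly", "price": 40},
    {"city": "A", "listing_type": "sale", "rental_term": None, "price": 1200},
]
sales = keep_sale_rows(ads, drop_columns={"listing_type", "rental_term"})
```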

Data Preparation: Feature Engineering
Introduce month column
Introduce quarter column
Introduce year column
Set them as categories (not numeric)
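
A small Python sketch of deriving the three columns from the posting time is given below. The key `posted_at` and the timestamp format are assumptions, since the slides do not show the raw values. The new columns are stored as strings so that a model treats them as categories rather than numbers.

```python
from datetime import datetime


def add_time_features(row, time_key="posted_at"):
    """Return a copy of row with categorical year, month and quarter."""
    posted = datetime.strptime(row[time_key], "%Y-%m-%d %H:%M:%S")
    enriched = dict(row)
    enriched["year"] = str(posted.year)
    enriched["month"] = f"{posted.month:02d}"
    # Months 1-3 map to Q1, 4-6 to Q2, and so on.
    enriched["quarter"] = f"Q{(posted.month - 1) // 3 + 1}"
    return enriched


ad = {"posted_at": "2013-11-05 09:30:00", "price": 950}
features = add_time_features(ad)
```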

Data Preparation: Handling Outliers
Remove the outliers
CAUTION: In practice, outliers may actually be valid data, so handle or remove them with great care.
600 rows removed
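
The slides do not say which rule was used to pick the 600 rows. As one common option, the sketch below applies the interquartile-range (IQR) rule in Python. The rule, the multiplier `k=1.5`, the function names, and the prices are all assumptions for illustration.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def split_outliers(values, k=1.5):
    """Return (kept, removed) so removed values can be inspected by hand."""
    low, high = iqr_bounds(values, k)
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if not (low <= v <= high)]
    return kept, removed


# Made-up prices in millions of rupiah, plus one entry above 1 trillion.
prices = [350, 500, 750, 900, 1200, 1500, 2500, 4000, 1_500_000]
kept, removed = split_outliers(prices)
```

Returning the removed values instead of silently discarding them leaves room for the manual check that the caution above asks for.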

Data Cleaning: Data Splitting
Define data splitting and split the data
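
How the data was split is not spelled out in the slides. The Python sketch below assumes that quarter 4 of 2013 is held out as test data, in line with the stated objective, and that the remaining rows are shuffled into training and validation sets. The 20% validation share, the seed, and the function name are arbitrary choices for illustration.

```python
import random


def split_data(rows, test_year="2013", test_quarter="Q4",
               validation_share=0.2, seed=42):
    """Return (train, validation, test) lists of rows."""
    def is_test(row):
        return row["year"] == test_year and row["quarter"] == test_quarter

    test = [r for r in rows if is_test(r)]
    rest = [r for r in rows if not is_test(r)]
    # A seeded shuffle keeps the split reproducible between runs.
    random.Random(seed).shuffle(rest)
    n_validation = round(len(rest) * validation_share)
    return rest[n_validation:], rest[:n_validation], test


# Made-up rows: 5 advertisements per quarter over 2012-2013.
periods = [(y, q) for y in ("2012", "2013") for q in ("Q1", "Q2", "Q3", "Q4")]
rows = [{"id": i, "year": y, "quarter": q}
        for i, (y, q) in enumerate(p for p in periods for _ in range(5))]
train, validation, test = split_data(rows)
```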

Modeling: Simple Linear Regression
Define model formula and build model
Overall model fit, e.g. coefficients, p-values, model diagnostics
Evaluate on validation set
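
As a minimal Python sketch of this step, the code below fits a one-predictor least-squares line and scores it with RMSE on a validation set. Using building size as the predictor is an assumption, the numbers are made up, and p-values and diagnostics are left out.

```python
import math


def fit_simple(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up data: building size in m2, price in millions of rupiah.
train_size = [50, 80, 100, 150, 200]
train_price = [400, 610, 750, 1100, 1450]
valid_size = [60, 120]
valid_price = [480, 880]

intercept, slope = fit_simple(train_size, train_price)
predicted = [intercept + slope * x for x in valid_size]
validation_rmse = rmse(valid_price, predicted)
```

RMSE is in the same unit as the price, which makes it easy to compare candidate models on the validation set.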

Modeling: Multiple Regression
New model with year and quarter variables
RMSE smaller by >60 million rupiahs
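
Adding categorical variables to a regression can be sketched in plain Python with dummy (0/1) columns and the normal equations. The example below is illustrative only: the data are made up, it uses only a year dummy to stay small (a quarter column would be passed the same way), and all function names are hypothetical.

```python
def design_row(row, numeric, categorical):
    """Intercept, numeric values, then 0/1 dummies (first level is baseline)."""
    values = [1.0] + [float(row[name]) for name in numeric]
    for name, levels in categorical.items():
        values += [1.0 if row[name] == level else 0.0 for level in levels[1:]]
    return values


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(rows, target, numeric, categorical):
    """Least squares via the normal equations (X'X) beta = X'y."""
    X = [design_row(r, numeric, categorical) for r in rows]
    y = [float(r[target]) for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up rows generated from price = 100 + 5 * size + 200 if year is 2013.
rows = [
    {"size": 50, "year": "2012", "price": 350},
    {"size": 100, "year": "2012", "price": 600},
    {"size": 60, "year": "2012", "price": 400},
    {"size": 80, "year": "2013", "price": 700},
    {"size": 120, "year": "2013", "price": 900},
    {"size": 150, "year": "2013", "price": 1050},
]
beta = fit(rows, "price", ["size"], {"year": ["2012", "2013"]})
```

Here `beta` holds the intercept, the size coefficient, and the shift for 2013 relative to the 2012 baseline.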

Before Tapping into Test Data
• Need to explore and clean the data more!
• Potential model improvements:
  • Include more variables
  • Include variable interaction (e.g. land x building size)
  • Build separate model for each city
  • Try other regression models (e.g. SVR, kNN, regularization)
• If possible, gather additional relevant variables (e.g. kecamatan, property agent, nearby facilities)
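
Two of the listed ideas are easy to sketch in Python: an interaction column (land size times building size) and grouping rows by city so that each group can get its own model. The column names and sample rows are hypothetical.

```python
from collections import defaultdict


def add_interaction(row):
    """Return a copy of row with a land x building size column."""
    enriched = dict(row)
    enriched["land_x_building"] = row["land_size"] * row["building_size"]
    return enriched


def group_by_city(rows):
    """Map each city to its rows, ready for one model per city."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["city"]].append(row)
    return dict(groups)


# Made-up advertisements for illustration.
ads = [
    {"city": "A", "land_size": 120, "building_size": 90},
    {"city": "B", "land_size": 200, "building_size": 150},
    {"city": "A", "land_size": 72, "building_size": 45},
]
enriched = [add_interaction(ad) for ad in ads]
by_city = group_by_city(enriched)
```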

No Turning Back: Evaluation
Fit final model with all training data
Predict on test data

What We Did: Aligning with CRISP-DM
1. No business understanding or data collection
2. Data understanding: summary statistics, plotting
3. Data preparation: excluding data, feature engineering, outlier handling, data splitting
4. Modeling: simple linear regression, multiple regression, model validation
5. Evaluation

Where to Learn R
• Coursera’s Data Science specialization (no need to pay)
• R-bloggers
• stackoverflow.com, stats.stackexchange.com
• Books: R for Everyone, The Art of R Programming

Python vs R for Data Analysis
Python + numpy + scipy + pandas + scikit-learn
• Faster
• More programming features
• Better for production
R
• More statistical communities
• More statistical libraries
• Better for visualisation / reporting
quora.com/Which-is-better-for-data-analysis-R-or-Python
blog.udacity.com/2015/01/python-vs-r-learn-first.html

Okiriza Wibisono
