Okiriza Wibisono
February 21, 2015

# R - Python ID Meetup, Feb 2015

## Transcript

1. R
Data Mining 101
Python ID Meetup, February 2015
@aliakbars / @okiriza

2. R: What Is It?
A statistical programming language
• Free (GNU GPL)
• Interpreted, command-line based
• Many up-to-date packages
• Great IDE: RStudio
• Highly sought after for data science (also Python)

3. First Look

4. Slight Catch:
Vectorised Operations
Python
import math
A = range(1, 6)
B = [x + 5 for x in A]
C = [A[i] * B[i] for i in range(len(A))]
D = [math.log(x) for x in C]
R
A <- 1:5
B <- A + 5
C <- A*B
D <- log(C)
Operators and built-in functions accept vectors (lists of values) as input

5. Our Data Set
• 13k + 4k rows (records) of house advertisements, with 10 columns (variables) each: city, time of posting, price, sale or rent, rental term, type of property, land size, building size, bedroom, bathroom
• Data are from 2 cities (anonymised) over a 2-year period (2012-2013)
• Objective: predict house price for quarter 4 of 2013, given the other 9 variables
• Data generously provided by:

6. Data Understanding:
Summary Statistics
Perhaps too detailed?
Need to clean up price
-1 bedroom (and bathroom)?
Are we going to analyse rent?
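
As a rough Python counterpart to the summary statistics on this slide, the sketch below computes a minimum, quartiles, mean and maximum with the standard library. The records, the field names `price` and `bedroom`, and the helper `summarise` are all hypothetical, chosen only to mirror the oddities flagged above (an implausible price, a bedroom count of -1).

```python
import statistics

# Hypothetical listings (not the real data set); -1 mimics the odd bedroom value.
listings = [
    {"price": 450_000_000, "bedroom": 2},
    {"price": 800_000_000, "bedroom": 3},
    {"price": 1_200_000_000, "bedroom": 4},
    {"price": 2_000_000_000_000, "bedroom": -1},  # suspicious price and bedroom count
    {"price": 650_000_000, "bedroom": 3},
]


def summarise(values):
    """Return min, quartiles, mean and max for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


price_summary = summarise([row["price"] for row in listings])
bedroom_summary = summarise([row["bedroom"] for row in listings])
print(price_summary)
print(bedroom_summary)
```

A mean far above the median, or a minimum of -1 for a count, is the kind of signal that prompts the questions on the slide.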

7. Data Understanding:
Plots
Sometimes need to clean data for exploration
Extremely skewed histogram, obvious sign of outliers
More than 1 trillion rupiahs for a house?

8. Data Preparation:
Removing Rows and Columns
Retain only sale properties (not rent)
Remove irrelevant columns
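
A minimal Python sketch of the same two steps, keeping only sale listings and dropping columns that are no longer needed. The records, the field names (`listing_type`, `rental_term`) and the choice of which columns count as irrelevant are assumptions made for illustration; the slide does not list them.

```python
# Hypothetical records and field names (not the real data set).
listings = [
    {"city": "A", "listing_type": "sale", "rental_term": None, "price": 750_000_000},
    {"city": "B", "listing_type": "rent", "rental_term": "yearly", "price": 40_000_000},
    {"city": "A", "listing_type": "sale", "rental_term": None, "price": 1_100_000_000},
]

# Assumed choice: these columns carry no information once only sales remain.
IRRELEVANT_COLUMNS = {"listing_type", "rental_term"}

# Filter rows first (sale only), then drop the unwanted columns from each row.
sales = [
    {key: value for key, value in row.items() if key not in IRRELEVANT_COLUMNS}
    for row in listings
    if row["listing_type"] == "sale"
]
print(sales)
```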

9. Data Preparation:
Feature Engineering
Introduce month column
Introduce quarter column
Introduce year column
Set them as categories (not numeric)
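
The idea can be sketched in Python as follows: derive month, quarter and year from the posting time and store them as strings so that later steps treat them as categories rather than numbers. The field name `posted_at`, its timestamp format and the helper `add_time_features` are hypothetical.

```python
from datetime import datetime


def add_time_features(row):
    """Return a copy of row with month, quarter and year stored as strings."""
    posted = datetime.strptime(row["posted_at"], "%Y-%m-%d %H:%M:%S")
    enriched = dict(row)
    enriched["month"] = str(posted.month)
    enriched["quarter"] = "Q" + str((posted.month - 1) // 3 + 1)
    enriched["year"] = str(posted.year)
    return enriched


example = add_time_features({"posted_at": "2013-11-05 09:30:00", "price": 900_000_000})
print(example["month"], example["quarter"], example["year"])  # prints: 11 Q4 2013
```

Keeping the values as strings avoids implying that, say, quarter 4 is "twice" quarter 2 when the columns enter a regression.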

10. Data Preparation:
Handling Outliers
Remove the outliers
CAUTION: In practice, outliers may actually be valid data, so they need to be handled or removed with great care.
600 rows removed
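
The slide does not say which rule was used to flag outliers. As one common option, the sketch below applies Tukey's 1.5 x IQR fences to the price column; this rule, the multiplier `k`, the helper `remove_price_outliers` and the sample prices are all assumptions, not the authors' method.

```python
import statistics


def remove_price_outliers(rows, k=1.5):
    """Keep rows whose price lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    prices = [row["price"] for row in rows]
    q1, _, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [row for row in rows if lower <= row["price"] <= upper]


# Hypothetical prices in rupiahs; the last one exceeds a trillion.
listings = [{"price": p} for p in (450e6, 650e6, 800e6, 950e6, 1.2e9, 2e12)]
cleaned = remove_price_outliers(listings)
print(len(listings) - len(cleaned), "rows removed")
```

Because a flagged row may still be a valid listing, it is worth inspecting what was removed before discarding it for good.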

11. Data Preparation:
Data Splitting
Define data splitting and split the data
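
How the split was defined is not shown in the transcript. A minimal sketch, assuming a random 80/20 split of the training rows into training and validation sets with a fixed seed for reproducibility (the fraction, the seed and the helper `split_rows` are hypothetical):

```python
import random


def split_rows(rows, validation_fraction=0.2, seed=42):
    """Shuffle a copy of rows and return (training, validation) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


rows = [{"id": i} for i in range(10)]  # hypothetical stand-in for the listings
train, validation = split_rows(rows)
print(len(train), len(validation))  # prints: 8 2
```

For time-ordered data such as these advertisements, splitting by posting date instead of at random is another reasonable choice.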

12. Modeling:
Simple Linear Regression
Define model formula and build model
Overall model fit, e.g. coefficients, p-values, model diagnostics
Evaluate on validation set
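
To make the fit-then-validate loop concrete, here is a small sketch of least-squares regression with one predictor, scored by RMSE on a held-out validation set. The predictor (building size), the numbers and both helper functions are hypothetical; the transcript does not state which variable the simple model used.

```python
import math


def fit_simple_regression(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical data: building size (m^2) and price (million rupiahs).
train_size = [50, 80, 120, 200]
train_price = [400, 640, 960, 1600]
valid_size = [100, 150]
valid_price = [850, 1150]

intercept, slope = fit_simple_regression(train_size, train_price)
predictions = [intercept + slope * x for x in valid_size]
validation_rmse = rmse(valid_price, predictions)
print(intercept, slope, validation_rmse)
```

The model is fitted on the training rows only; the validation rows are used solely to compute the error.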

13. Modeling:
Multiple Regression
RMSE smaller by >60 million rupiahs
New model with year and quarter variables
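
A sketch of what adding categorical year and quarter variables can look like in plain Python: each category becomes a 0/1 dummy column, and the coefficients come from the normal equations. The data are synthetic, the dummy coding (one indicator for 2013, one for Q4) is a simplification, and both helpers are hypothetical; in R, `lm` handles factor columns automatically.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple_regression(features, targets):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    x = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic rows: (building size, is_2013 dummy, is_q4 dummy).
features = [(50, 0, 0), (80, 0, 1), (120, 1, 0), (200, 1, 1), (100, 0, 0), (150, 1, 1)]
# Synthetic prices generated from known coefficients so the fit can be checked.
prices = [100 + 8 * size + 50 * y13 + 20 * q4 for size, y13, q4 in features]

coefficients = fit_multiple_regression(features, prices)
print(coefficients)  # approximately [100, 8, 50, 20]
```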

14. Before Tapping into Test Data
• Need to explore and clean the data more!
• Potential model improvements:
• Include more variables
• Include variable interaction (e.g. land x building size)
• Build separate model for each city
• Try other regression models (e.g. SVR, kNN, regularization)
• If possible, gather additional relevant variables (e.g. kecamatan (subdistrict), property agent, nearby facilities)

15. No Turning Back:
Evaluation
Predict on test data
Fit final model with all training data

16. What We Did:
Aligning with CRISP-DM
1. No data collection or business understanding
2. Data understanding
summary statistics, plotting
3. Data preparation
excluding data, feature engineering, outlier handling, data splitting
4. Modeling
simple linear regression, multiple regression, model validation
5. Evaluation

17. Where to Learn R
• Coursera’s Data Science specialization (no need to pay)
• R-bloggers
• stackoverflow.com, stats.stackexchange.com
• Books: R for Everyone, The Art of R Programming

18. Python vs R
for Data Analysis
Python
+ numpy + scipy + pandas + scikit-learn
• Faster
• More programming features
• Better for production
R
• More statistical communities
• More statistical libraries
• Better for visualisation / reporting

quora.com/Which-is-better-for-data-analysis-R-or-Python
blog.udacity.com/2015/01/python-vs-r-learn-first.html