Okiriza Wibisono
February 21, 2015

# R - Python ID Meetup, Feb 2015

## Okiriza Wibisono

February 21, 2015

## Transcript

/ @okiriza

2. ### R: What Is It?

A statistical programming language
• Free (GNU GPL)
• Interpreted, command-line based
• Lots of up-to-date packages
• Great IDE: RStudio
• Highly sought after for data science (as is Python)

4. ### Slight Catch: Vectorised Operations

Python

    import math

    A = range(1, 6)
    B = [x + 5 for x in A]
    C = [A[i]*B[i] for i in range(len(A))]
    D = [math.log(x) for x in C]

R

    A <- 1:5
    B <- A + 5
    C <- A*B
    D <- log(C)

Operators and built-in functions accept vectors (lists of values) as input
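
In Python, elementwise arithmetic like this normally comes from a library such as numpy. As a rough, stdlib-only illustration of what "operators accept vectors" means, the sketch below defines a hypothetical `Vec` class (not part of R or of any Python library) whose `+` and `*` work elementwise, mirroring the R snippet line by line.

```python
import math


class Vec:
    """Hypothetical helper: a list of numbers with elementwise operators."""

    def __init__(self, values):
        self.values = list(values)

    def _pair(self, other):
        # A scalar is repeated for every element, as R does for `A + 5`.
        if isinstance(other, Vec):
            if len(other.values) != len(self.values):
                raise ValueError("vectors must have the same length")
            return other.values
        return [other] * len(self.values)

    def __add__(self, other):
        return Vec(a + b for a, b in zip(self.values, self._pair(other)))

    def __mul__(self, other):
        return Vec(a * b for a, b in zip(self.values, self._pair(other)))

    def map(self, func):
        # Stand-in for vectorised built-ins such as R's log().
        return Vec(func(v) for v in self.values)


A = Vec(range(1, 6))   # R: A <- 1:5
B = A + 5              # R: B <- A + 5
C = A * B              # R: C <- A*B
D = C.map(math.log)    # R: D <- log(C)
```

The point is only that the loop moves inside the operator, so the calling code reads like arithmetic on single values.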

5. ### Our Data Set

• 13k + 4k rows (records) of house advertisements, with 10 columns (variables) each: city, time of posting, price, sale or rent, rental term, type of property, land size, building size, bedrooms, bathrooms
• Data are from 2 cities (anonymised) over a 2-year period (2012-2013)
• Objective: predict house prices for quarter 4 of 2013, given the other 9 variables
• Data generously provided by:

6. ### Data Understanding: Summary Statistics

• Perhaps too detailed?
• Need to clean up price
• -1 bedroom (and bathroom)?
• Are we going to analyse rent?

7. ### Data Understanding: Plots

• Sometimes need to clean data for exploration
• Extremely skewed histogram, an obvious sign of outliers
• More than 1 trillion rupiahs for a house?

8. ### Data Preparation: Removing Rows and Columns

• Retain only sale properties (not rent)
• Remove irrelevant columns
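
As a minimal sketch of these two steps in plain Python, assuming the advertisements are held as a list of dictionaries: the column names (`listing_type`, `rental_term`) and the values are made up for illustration and are not taken from the actual data set.

```python
# Hypothetical records; column names and values are invented for illustration.
ads = [
    {"city": "A", "listing_type": "sale", "rental_term": None, "price": 850_000_000},
    {"city": "B", "listing_type": "rent", "rental_term": "yearly", "price": 40_000_000},
    {"city": "A", "listing_type": "sale", "rental_term": None, "price": 1_200_000_000},
]

# Columns assumed to carry no information once only sale listings remain.
DROP_COLUMNS = {"listing_type", "rental_term"}

sales = [
    {key: value for key, value in ad.items() if key not in DROP_COLUMNS}
    for ad in ads
    if ad["listing_type"] == "sale"  # retain only sale properties
]
```

Filtering rows first and dropping columns second matters here, because the sale/rent column is needed for the filter before it can be discarded.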

9. ### Data Preparation: Feature Engineering

• Introduce month column
• Introduce quarter column
• Introduce year column
• Set them as categories (not numeric)
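
The three derived columns can be sketched in Python as follows, assuming each record carries its time of posting as a `datetime` under a hypothetical `posted_at` key. Storing the results as strings is one simple way to make sure a model treats them as categories and not as numbers.

```python
from datetime import datetime


def add_time_features(record):
    """Return a copy of record with month, quarter and year as string categories."""
    posted = record["posted_at"]
    enriched = dict(record)
    # Strings, not integers, so that month 12 is not read as "12 times" month 1.
    enriched["month"] = str(posted.month)
    enriched["quarter"] = "Q" + str((posted.month - 1) // 3 + 1)
    enriched["year"] = str(posted.year)
    return enriched


# Made-up advertisement posted in November 2013.
ad = {"posted_at": datetime(2013, 11, 5, 14, 30), "price": 850_000_000}
features = add_time_features(ad)
```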

10. ### Data Preparation: Handling Outliers

• Remove the outliers
• CAUTION: In practice, outliers may actually be valid data, so they need to be handled or removed with great care.
• 600 rows removed
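
The slides do not say which rule was used to pick the 600 rows. One common choice, shown here purely as an assumption, is the 1.5 × IQR rule: anything far outside the middle half of the data is dropped. The prices below are made up.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Invented prices in millions of rupiah; the last one mimics a listing
# priced above 1 trillion rupiah.
prices = [300, 450, 500, 520, 610, 700, 850, 1_200_000]
kept = remove_outliers_iqr(prices)
```

A rule like this only flags candidates. In line with the caution above, flagged rows are worth inspecting before they are deleted.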

data

12. ### Modeling: Simple Linear Regression

• Define model formula and build model
• Overall model fit, e.g. coefficients, p-values, model diagnostics
• Evaluate on validation set
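
To make the fit-then-validate loop concrete, here is a small stdlib-only Python sketch of ordinary least squares with a single predictor, scored by RMSE on a held-out validation set. The choice of building size as the predictor and all numbers are assumptions for illustration, and this is not the R code used in the talk.

```python
import math


def fit_simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Invented data: building size (m^2) and price (millions of rupiah).
train_size, train_price = [50, 80, 100, 150, 200], [420, 610, 830, 1180, 1590]
valid_size, valid_price = [70, 120], [560, 950]

intercept, slope = fit_simple_ols(train_size, train_price)
predictions = [intercept + slope * x for x in valid_size]
validation_rmse = rmse(valid_price, predictions)
```

The model never sees the validation rows while fitting, so `validation_rmse` gives a fairer picture of prediction error than the error on the training rows.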

13. ### Modeling: Multiple Regression

• RMSE smaller by >60 million rupiahs
• New model with year and quarter variables
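
For reference, RMSE (root mean squared error) is conventionally defined as below, where the symbols are notation introduced here and not taken from the slides: $n$ is the number of validation rows, $y_i$ the actual price and $\hat{y}_i$ the predicted price of row $i$.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because the squared errors are averaged and then square-rooted, RMSE is in the same unit as the price itself, which is why the improvement can be quoted directly in rupiahs.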

14. ### Before Tapping into Test Data

• Need to explore and clean the data more!
• Potential model improvements:
  • Include more variables
  • Include variable interactions (e.g. land x building size)
  • Build a separate model for each city
  • Try other regression models (e.g. SVR, kNN, regularization)
  • If possible, gather additional relevant variables (e.g. kecamatan (sub-district), property agent, nearby facilities)

15. ### No Turning Back: Evaluation

• Predict on test data
• Fit final model with all training data

16. ### What We Did: Aligning with CRISP-DM

1. No data collection or business understanding
2. Data understanding: summary statistics, plotting
3. Data preparation: excluding data, feature engineering, outlier handling, data splitting
4. Modeling: simple linear regression, multiple regression, model validation
5. Evaluation

17. ### Where to Learn R

• Coursera’s Data Science specialization (no need to pay)
• R-bloggers
• stackoverflow.com, stats.stackexchange.com
• Books: R for Everyone, The Art of R Programming

18. ### Python vs R for Data Analysis

Python + numpy + scipy + pandas + scikit-learn
• Faster
• More programming features
• Better for production

R
• Larger statistical community
• More statistical libraries
• Better for visualisation / reporting

quora.com/Which-is-better-for-data-analysis-R-or-Python

blog.udacity.com/2015/01/python-vs-r-learn-first.html