R - Python ID Meetup, Feb 2015

Okiriza Wibisono

February 21, 2015

  1. R: What Is It?

    • A statistical programming language
    • Free (GNU GPL)
    • Interpreted, command-line based
    • Lots of up-to-date packages
    • Great IDE: RStudio
    • Highly sought after for data science (as is Python)
  2. Slight Catch: Vectorised Operations

    Python
      import math
      A = range(1, 6)
      B = [x + 5 for x in A]
      C = [A[i]*B[i] for i in range(len(A))]
      D = [math.log(x) for x in C]
    R
      A <- 1:5
      B <- A + 5
      C <- A*B
      D <- log(C)
    In R, operators and built-in functions accept vectors (lists of values) as input
  3. Our Data Set

    • 13k + 4k rows (records) of house advertisements, with 10 columns (variables) each: city, time of posting, price, sale or rent, rental term, type of property, land size, building size, bedroom, bathroom
    • Data are from 2 cities (anonymised) over a 2-year period (2012-2013)
    • Objective: predict house price for quarter 4 2013, given the other 9 variables
    • Data generously provided by:
  4. Data Understanding: Summary Statistics

    Perhaps too detailed?
    Need to clean up price
    -1 bedroom (and bathroom)?
    Are we going to analyse rent?
  5. Data Understanding: Plots

    Sometimes need to clean data for exploration
    Extremely skewed histogram, an obvious sign of outliers
    More than 1 trillion rupiah for a house?
  6. Data Preparation: Feature Engineering

    Introduce month column
    Introduce quarter column
    Introduce year column
    Set them as categories (not numeric)
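The slide's own code is not preserved in this transcript, so the following is only a minimal Python sketch of the same idea. The field name `time_of_posting`, the `YYYY-MM-DD` date format, and the sample record are assumptions made for illustration. The new columns are stored as strings so they are treated as categories rather than numbers.

```python
from datetime import datetime

def add_time_features(record, time_key="time_of_posting"):
    """Return a copy of record with month, quarter and year as string categories."""
    # Assumed date format; the real data set's format is not shown in the slides.
    posted = datetime.strptime(record[time_key], "%Y-%m-%d")
    enriched = dict(record)
    enriched["month"] = str(posted.month)
    enriched["quarter"] = "Q" + str((posted.month - 1) // 3 + 1)
    enriched["year"] = str(posted.year)
    return enriched

# Hypothetical advertisement record, invented for this example.
ad = {"city": "A", "time_of_posting": "2013-11-05", "price": 750_000_000}
features = add_time_features(ad)
```

Keeping the values as strings mirrors the "categories (not numeric)" point: a model should not read month 12 as "twice" month 6.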
  7. Data Preparation: Handling Outliers

    Remove the outliers
    CAUTION: In practice, outliers may actually be valid data, so handle or remove them with great care.
    600 rows removed
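The transcript does not say which rule was used to pick the outliers. As one common possibility, and purely as an assumption, the sketch below applies the 1.5 x IQR rule to the price column in Python. The function name, the `price` key, and the toy prices are all hypothetical.

```python
import statistics

def remove_price_outliers(rows, key="price", k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR].

    This IQR rule is an assumed stand-in, not the method from the talk.
    Returns the kept rows and the number of rows removed.
    """
    values = [row[key] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [row for row in rows if low <= row[key] <= high]
    return kept, len(rows) - len(kept)

# Toy prices in millions of rupiah, invented for this example;
# the last one mimics a "more than 1 trillion" listing.
rows = [{"price": p} for p in [400, 450, 500, 550, 600, 650, 700, 2_000_000]]
kept, removed = remove_price_outliers(rows)
```

Returning the removed count makes it easy to report how many rows were dropped, and to review them by hand before discarding them, in line with the caution above.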
  8. Modeling: Simple Linear Regression

    Define model formula and build model
    Overall model fit, e.g. coefficients, p-values, model diagnostics
    Evaluate on validation set
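The talk builds this model in R, and that code is not preserved here. The following is a rough Python sketch of the same two steps: fit a one-variable least-squares line on training data, then measure error on a held-out validation set. The choice of building size as the predictor, the RMSE metric, and all numbers are assumptions for illustration.

```python
import math

def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def rmse(intercept, slope, xs, ys):
    """Root mean squared error of the fitted line on (xs, ys)."""
    errors = [(intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(errors) / len(errors))

# Toy data invented for this example: building size vs price.
train_size = [50, 80, 100, 120, 150]
train_price = [250, 400, 500, 600, 750]
valid_size = [60, 200]
valid_price = [310, 990]

intercept, slope = fit_simple_linear(train_size, train_price)
validation_rmse = rmse(intercept, slope, valid_size, valid_price)
```

Scoring on data the model never saw during fitting is what keeps the test set untouched until the very end.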
  9. Before Tapping into Test Data

    • Need to explore and clean the data more!
    • Potential model improvements:
      • Include more variables
      • Include variable interactions (e.g. land x building size)
      • Build a separate model for each city
      • Try other regression models (e.g. SVR, kNN, regularization)
    • If possible, gather additional relevant variables (e.g. kecamatan (sub-district), property agent, nearby facilities)
  10. What We Did: Aligning with CRISP-DM

    1. No data collection or business understanding
    2. Data understanding: summary statistics, plotting
    3. Data preparation: excluding data, feature engineering, outlier handling, data splitting
    4. Modeling: simple linear regression, multiple regression, model validation
    5. Evaluation
  11. Where to Learn R

    • Coursera’s Data Science specialization (no need to pay)
    • R-bloggers
    • stackoverflow.com, stats.stackexchange.com
    • Books: R for Everyone, The Art of R Programming
  12. Python vs R for Data Analysis

    Python + numpy + scipy + pandas + scikit-learn
    • Faster
    • More programming features
    • Better for production
    R
    • More statistical communities
    • More statistical libraries
    • Better for visualisation / reporting
    quora.com/Which-is-better-for-data-analysis-R-or-Python
    blog.udacity.com/2015/01/python-vs-r-learn-first.html