R - Python ID Meetup, Feb 2015

Okiriza Wibisono

February 21, 2015

Transcript

  1. R Data Mining 101
     Python ID Meetup, February 2015
     @aliakbars / @okiriza
  2. R: What Is It?
     A statistical programming language
     • Free (GNU GPL)
     • Interpreted, command-line based
     • Many up-to-date packages
     • Great IDE: RStudio
     • Highly sought after for data science (as is Python)
  3. First Look

  4. Slight Catch: Vectorised Operations
     Python: A = range(1, 6); B = [x + 5 for x in A]; C = [A[i]*B[i] for i in range(len(A))]; D = [math.log(x) for x in C]
     R: A <- 1:5; B <- A + 5; C <- A*B; D <- log(C)
     In R, operators and built-in functions accept vectors (lists of values) as input.
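The Python side of the comparison runs once `math` is imported. The following is a minimal sketch that keeps the slide's names `A` to `D`; wrapping `range` in `list` and pairing elements with `zip` are small liberties taken here so the values print readably.

```python
import math

# Same vectors as on the slide, built one element at a time.
A = list(range(1, 6))               # 1, 2, 3, 4, 5
B = [x + 5 for x in A]              # add 5 to every element
C = [a * b for a, b in zip(A, B)]   # element-wise product of A and B
D = [math.log(x) for x in C]        # natural log of every element

print(B)
print(C)
```

Each R statement on the slide replaces one of these explicit loops, which is the point of the "slight catch".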
  5. Our Data Set
     • 13k + 4k rows (records) of house advertisements, with 10 columns (variables) each: city, time of posting, price, sale or rent, rental term, type of property, land size, building size, bedroom, bathroom
     • Data are from 2 cities (anonymised) over a 2-year period (2012-2013)
     • Objective: predict house price for quarter 4 of 2013, given the other 9 variables
     • Data generously provided by:
  6. Data Understanding: Summary Statistics
     Perhaps too detailed? Need to clean up price. -1 bedroom (and bathroom)? Are we going to analyse rent?
  7. Data Understanding: Plots
     Sometimes the data need cleaning before exploration. An extremely skewed histogram is an obvious sign of outliers. More than 1 trillion rupiah for a house?
  8. Data Preparation: Removing Rows and Columns
     Retain only sale properties (not rent). Remove irrelevant columns.
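A rough Python sketch of this step, for readers following along without R. The field names (`sale_or_rent`, `rental_term`) and the choice of which columns count as irrelevant are assumptions made for illustration; the slide does not name them.

```python
def keep_sales_only(records, drop_columns=("sale_or_rent", "rental_term")):
    """Keep sale advertisements and drop columns that no longer carry information.

    The column names are hypothetical; adjust them to the real data set.
    """
    cleaned = []
    for record in records:
        if record["sale_or_rent"] != "sale":
            continue  # skip rental advertisements
        cleaned.append({k: v for k, v in record.items() if k not in drop_columns})
    return cleaned

ads = [
    {"city": "A", "price": 750, "sale_or_rent": "sale", "rental_term": None},
    {"city": "B", "price": 5, "sale_or_rent": "rent", "rental_term": "monthly"},
]
sales = keep_sales_only(ads)
print(sales)
```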
  9. Data Preparation: Feature Engineering
     Introduce month column. Introduce quarter column. Introduce year column. Set them as categories (not numeric).
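The three derived columns can be sketched in Python as follows. The record layout and the `posted_at` field name are hypothetical, and storing the values as strings is only a stand-in for R's categorical (factor) type.

```python
from datetime import datetime

def add_time_features(record):
    """Return a copy of record with month, quarter and year added as strings."""
    posted = record["posted_at"]  # hypothetical name for the time-of-posting column
    enriched = dict(record)
    enriched["month"] = str(posted.month)
    enriched["quarter"] = str((posted.month - 1) // 3 + 1)  # months 1-3 -> 1, ..., 10-12 -> 4
    enriched["year"] = str(posted.year)
    return enriched

ad = {"posted_at": datetime(2013, 11, 5), "price": 750_000_000}
features = add_time_features(ad)
print(features["month"], features["quarter"], features["year"])  # prints: 11 4 2013
```

Keeping the values non-numeric stops a regression from treating month 12 as "twelve times" month 1.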
  10. Data Preparation: Handling Outliers
     Remove the outliers (600 rows removed).
     CAUTION: In practice, outliers may actually be valid data, so they need to be handled or removed with great care.
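The slides do not say which rule was used to flag outliers. As one possible illustration, the sketch below applies the common 1.5 x IQR rule to prices; both the rule and the sample values are assumptions, not the authors' method.

```python
import statistics

def remove_price_outliers(prices, k=1.5):
    """Keep prices within [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule)."""
    q1, _, q3 = statistics.quantiles(prices, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [p for p in prices if lower <= p <= upper]

# Made-up prices in millions of rupiah; the last one mimics a data-entry error.
prices = [400, 450, 500, 550, 600, 650, 700, 2_000_000]
kept = remove_price_outliers(prices)
print(kept)
```

In line with the caution on the slide, flagged rows are worth inspecting by hand before they are dropped.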
  11. Data Cleaning: Data Splitting
     Define the data splitting and split the data.
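One way the split could look in Python, assuming the test set is quarter 4 of 2013 (the stated prediction target) and the remaining rows are divided at random into training and validation sets. The 80/20 ratio, the seed and the field names are assumptions.

```python
import random

def split_records(records, validation_fraction=0.2, seed=42):
    """Return (train, validation, test); test is every Q4 2013 record."""
    test = [r for r in records if (r["year"], r["quarter"]) == (2013, 4)]
    rest = [r for r in records if (r["year"], r["quarter"]) != (2013, 4)]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(rest)
    n_val = int(len(rest) * validation_fraction)
    return rest[n_val:], rest[:n_val], test

# 40 made-up records: 5 per quarter over 2012-2013.
records = [{"year": y, "quarter": q, "price": 500 + 10 * q}
           for y in (2012, 2013) for q in (1, 2, 3, 4) for _ in range(5)]
train, validation, test = split_records(records)
print(len(train), len(validation), len(test))  # prints: 28 7 5
```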
  12. Modeling: Simple Linear Regression
     Define the model formula and build the model. Check the overall model fit, e.g. coefficients, p-values, model diagnostics. Evaluate on the validation set.
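To make the fit-then-validate loop concrete, here is a small pure-Python sketch of ordinary least squares with one predictor, scored by RMSE on a held-out set. The choice of building size as the predictor and all numbers are invented for illustration; the talk itself builds the model in R.

```python
import math

def fit_simple_regression(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Made-up training data: price (millions) = 50 + 4.5 * building size (m2).
train_size, train_price = [60, 80, 100, 120], [320, 410, 500, 590]
intercept, slope = fit_simple_regression(train_size, train_price)

# Made-up validation data that deviates from the line by 5 in each direction.
valid_size, valid_price = [90, 110], [460, 540]
predictions = [intercept + slope * s for s in valid_size]
validation_rmse = rmse(valid_price, predictions)
print(round(slope, 3), round(intercept, 3), round(validation_rmse, 3))
```

RMSE is expressed in the same unit as the price, which is why a later slide can report the improvement in rupiah.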
  13. Modeling: Multiple Regression
     New model with year and quarter variables. RMSE is smaller by more than 60 million rupiah.
  14. Before Tapping into Test Data
     • Need to explore and clean the data more!
     • Potential model improvements:
       • Include more variables
       • Include variable interactions (e.g. land x building size)
       • Build a separate model for each city
       • Try other regression models (e.g. SVR, kNN, regularization)
     • If possible, gather additional relevant variables (e.g. kecamatan, property agent, nearby facilities)
  15. No Turning Back: Evaluation
     Predict on test data. Fit final model with all training data.
  16. What We Did: Aligning with CRISP-DM
     1. No data collection nor business understanding
     2. Data understanding: summary statistics, plotting
     3. Data preparation: excluding data, feature engineering, outlier handling, data splitting
     4. Modeling: simple linear regression, multiple regression, model validation
     5. Evaluation
  17. Where to Learn R
     • Coursera’s Data Science specialization (no need to pay)
     • R-bloggers
     • stackoverflow.com, stats.stackexchange.com
     • Books: R for Everyone, The Art of R Programming
  18. Python vs R for Data Analysis
     Python + numpy + scipy + pandas + scikit-learn
     • Faster
     • More programming features
     • Better for production
     R
     • More statistical communities
     • More statistical libraries
     • Better for visualisation / reporting
     quora.com/Which-is-better-for-data-analysis-R-or-Python
     blog.udacity.com/2015/01/python-vs-r-learn-first.html