(GNU GPL) • Interpreted, command-line based • Lots of and up to date packages • Great IDE: RStudio • Highly sought after for data science (also Python)
= [x + 5 for x in A] C = [A[i]*B[i] for i in range(len(A))] D = [math.log(x) for x in C] R A <- 1:5 B <- A + 5 C <- A*B D <- log(C) Operators and built- in functions accept vectors (list of values) as input
house advertisements, with 10 columns (variables) each: city, time of posting, price, sale or rent, rental term, type of property, land size, building size, bedroom, bathroom • Data are from 2 cities (anonymised) over 2-year period (2012-2013) • Objective: predict house price for quarter 4 2013, given the other 9 variables • Data generously provided by:
clean the data more! • Potential model improvements: • Include more variables • Include variable interaction (e.g. land x building size) • Build separate model for each city • Try other regression models (e.g. SVR, kNN, regularization) • If possible, gather additional relevant variables (e.g. kecamatan, property agent, nearby facilities)
collection nor business understanding 2. Data understanding summary statistics, plotting 3. Data preparation excluding data, feature engineering, outlier handling, data splitting 4. Modeling simple linear regression, multiple regression, model validation 5. Evaluation
scipy + pandas + scikit-learn • Faster • More programming features • Better for production R • More statistical communities • More statistical libraries • Better for visualisation / reporting quora.com/Which-‐is-‐be1er-‐for-‐data-‐analysis-‐R-‐or-‐Python blog.udacity.com/2015/01/python-‐vs-‐r-‐learn-‐first.html