Outline
• Machine Learning refresher
• scikit-learn
• Demo: interactive predictive modeling on Census Data with IPython notebook / pandas / scikit-learn
• What’s new in scikit-learn 0.17 and what’s next
Predictive modeling ~= machine learning
• Make predictions of outcomes on new data
• Extract the structure of historical data
• Statistical tools to summarize the training data into an executable predictive model
• Alternative to hard-coded rules written by experts
Predictive Modeling Data Flow
[Diagram. Training: text docs, images, sounds, transactions → feature vectors + labels → machine learning algorithm → model. Prediction: new text doc, image, sound, transaction → feature vector → model → expected label.]
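[Diagram. Train data + train labels → model → fitted model. Test data → fitted model → predicted labels. Predicted labels + test labels → evaluation.]
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    model = LogisticRegression(C=1)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy_score(y_test, y_pred)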
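The slide assumes that `X_train`, `y_train`, `X_test` and `y_test` already exist. A minimal end-to-end sketch of the same fit / predict / evaluate flow follows. The synthetic dataset from `make_classification` and the 75/25 split are assumptions made for illustration, and the `sklearn.model_selection` import assumes a scikit-learn release newer than 0.17.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real labeled dataset (assumption of this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 25% of the samples to evaluate on data the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(C=1)      # C is the inverse regularization strength
model.fit(X_train, y_train)          # learn from the training data and labels
y_pred = model.predict(X_test)       # predicted labels for the test data
accuracy = accuracy_score(y_test, y_pred)
print(f"Held-out accuracy: {accuracy:.3f}")
```

Evaluating on the held-out split rather than on the training data is what makes the score an estimate of performance on new data.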
Random Forests
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    model = RandomForestClassifier(n_estimators=200)
    model.fit(X_train, y_train)
    y_predicted = model.predict(X_test)
    f1_score(y_test, y_predicted)
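One detail worth noting about the snippet above: `f1_score` defaults to `average="binary"`, so the call as written only works for two-class problems. A hedged sketch of the multiclass case follows; the three-class synthetic dataset and the choice of macro averaging are assumptions for illustration, not part of the original slide.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem (assumption of this sketch).
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
y_predicted = model.predict(X_test)

# With more than two classes an explicit averaging strategy is required;
# "macro" gives every class the same weight regardless of its frequency.
f1_macro = f1_score(y_test, y_predicted, average="macro")
print(f"Macro-averaged F1: {f1_macro:.3f}")
```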
SAG solver
• LogisticRegression, LogisticRegressionCV and Ridge regression with solver='sag'
• Fastest convergence rate for a large number of samples:
  • As good an optimizer as L-BFGS but often faster
  • As efficient as SGD after a few epochs
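A small sketch of switching solvers, under stated assumptions: the data is synthetic, and the features are standardized first because SAG's fast convergence is only expected when features share roughly the same scale. It compares held-out accuracy only and makes no timing claim.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a large dataset (assumption of this sketch).
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for solver in ("sag", "lbfgs"):
    # Both solvers minimize the same L2-penalized objective, so they should
    # reach nearly the same model; only the optimization path differs.
    clf = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver=solver, C=1.0, max_iter=1000, random_state=0),
    )
    clf.fit(X_train, y_train)
    scores[solver] = accuracy_score(y_test, clf.predict(X_test))

for solver, score in scores.items():
    print(f"{solver}: accuracy={score:.3f}")
```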
Faster regression trees
• Do not compute the constant part of the criterion when searching for the best split value
• Speedup between 1.1x and 2x depending on data and CPU architecture
• Still not as fast as XGBoost but getting closer
Barnes-Hut t-SNE
• t-SNE: dimensionality reduction that preserves small distances
• Very useful for visualizing high-dimensional data when PCA does not work
• BH-t-SNE: approximate method that skips unnecessary pairwise distance computations
• Still work to do to cut memory usage
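A minimal sketch of requesting the Barnes-Hut approximation through the `method` parameter of `TSNE`. The digits dataset and the 300-sample subset are assumptions chosen to keep the run short; they are not from the slides.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 64-dimensional images of handwritten digits, bundled with scikit-learn.
digits = load_digits()
X = digits.data[:300]  # small subset so the sketch runs in a few seconds

# method="barnes_hut" selects the approximate algorithm;
# method="exact" would compute all pairwise interactions instead.
tsne = TSNE(n_components=2, method="barnes_hut", random_state=0)
embedding = tsne.fit_transform(X)

print(embedding.shape)
```

The resulting two columns can be passed directly to a scatter plot, colored by `digits.target[:300]`, to inspect the cluster structure.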
Incremental Scalers
• Incremental fitting with partial_fit method
• Pre-process data that does not fit in RAM
• MaxAbsScaler, MinMaxScaler, StandardScaler
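A sketch of incremental fitting with `partial_fit`. The in-memory array and the hypothetical `iter_chunks` helper simulate a stream of chunks; with real out-of-core data the chunks would be read from disk instead.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# In-memory stand-in for a dataset too large for RAM (assumption of this sketch).
X = rng.normal(loc=5.0, scale=3.0, size=(10_000, 4))


def iter_chunks(data, chunk_size):
    """Hypothetical helper: yield consecutive row blocks of `data`."""
    for start in range(0, data.shape[0], chunk_size):
        yield data[start:start + chunk_size]


scaler = StandardScaler()
for chunk in iter_chunks(X, chunk_size=1000):
    # Each call updates the running mean and variance with one chunk only.
    scaler.partial_fit(chunk)

# The fitted statistics can then be applied chunk by chunk as well.
first_chunk_scaled = scaler.transform(X[:1000])
print(scaler.mean_, scaler.scale_)
```

The incrementally accumulated statistics match what a single `fit` on the full array would produce, up to floating-point error.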
Latent Dirichlet Allocation
• Probabilistic model of word counts in documents based on topics
• Online solver: incremental fitting on data that does not fit in RAM
• Based on an implementation by Matt Hoffman adapted for the scikit-learn common API
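A hedged sketch of the online solver fed in mini-batches. The toy corpus is invented for illustration, and the `n_components` parameter name assumes a recent scikit-learn release (the 0.17 release described in these slides called it `n_topics`).

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption of this sketch), mixing two rough themes.
docs = [
    "encryption key chip security",
    "public key encryption clipper chip",
    "security technology encryption keys",
    "clipper chip security key",
    "hockey team season game",
    "team players win game league",
    "season hockey league players",
    "game play team win",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # LDA models raw word counts

lda = LatentDirichletAllocation(
    n_components=2, learning_method="online", random_state=0
)
batch_size = 4
for start in range(0, counts.shape[0], batch_size):
    # Each call updates the topics from one mini-batch of documents.
    lda.partial_fit(counts[start:start + batch_size])

# Vocabulary terms ordered by column index.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
for topic_idx, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in np.argsort(weights)[::-1][:5]]
    print(f"Topic #{topic_idx}: {' '.join(top_terms)}")

doc_topics = lda.transform(counts)  # per-document topic proportions
```

With a corpus this small the topics are not expected to be meaningful; the point is the shape of the incremental API.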
LDA topics on 20 newsgroups
Topic #9: key chip encryption keys clipper use security public technology bit
Topic #11: memory use video bus monitor board ground pc ram need
Topic #14: game team games play season hockey league players bike win
Topic #19: drive scsi disk mac problem hard card apple drives controller
New solver for NMF
• Coordinate Descent (CD) solver to replace the Projected Gradient (PG) solver
• CD is less sensitive to the initialization scheme than PG
• Change in hyper-parameters to make them more consistent with other scikit-learn models
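A minimal sketch of selecting the coordinate descent solver explicitly. The low-rank synthetic matrix and the `nndsvd` initialization are assumptions for illustration; regularization hyper-parameters are left at their defaults because their names have changed across releases.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.RandomState(0)
# Non-negative matrix that is exactly rank 2 (assumption of this sketch).
W_true = rng.uniform(size=(30, 2))
H_true = rng.uniform(size=(2, 12))
X = W_true @ H_true

nmf = NMF(n_components=2, solver="cd", init="nndsvd", max_iter=500, random_state=0)
W = nmf.fit_transform(X)   # sample-by-component activations
H = nmf.components_        # component-by-feature dictionary

# Relative Frobenius error of the factorization X ≈ W H.
relative_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"Relative reconstruction error: {relative_error:.4f}")
```

For topic extraction as on the next slide, `X` would instead be a TF-IDF matrix and the rows of `H` would be read as topics.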
NMF topics on 20 newsgroups
Topic #15: card video monitor vga bus cards drivers color driver ram
Topic #16: team games players year season hockey play teams nhl league
Topic #18: jesus christ christian bible christians faith law sin church christianity
Topic #19: encryption chip clipper government privacy law escrow algorithm enforcement secure
Recently merged in master
• Multi-layer perceptrons with Adam / SGD / L-BFGS solvers (pure numpy, no GPU)
• Major refactoring of Gaussian Processes
• Anomaly detection with Isolation Forests
• More flexible API for Cross-Validation splitters
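A hedged sketch touching three of these items. The slide does not name the classes; `MLPClassifier`, `IsolationForest` and the `sklearn.model_selection` module are the names these features carry in later scikit-learn releases, and the synthetic data and hyper-parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Cross-validation splitter configured independently of the data.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Small multi-layer perceptron trained with the L-BFGS solver.
mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(
        hidden_layer_sizes=(16,), solver="lbfgs", max_iter=500, random_state=0
    ),
)
cv_scores = cross_val_score(mlp, X, y, cv=cv)
print("MLP accuracy per fold:", cv_scores)

# Isolation Forest: fit on the data, then label new points
# as inliers (+1) or outliers (-1).
rng = np.random.RandomState(0)
far_points = rng.uniform(low=-10, high=10, size=(10, X.shape[1]))
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(X)
outlier_labels = forest.predict(far_points)
print("Isolation Forest labels:", outlier_labels)
```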