
SciRuby Machine Learning Current Status and Future

RubyKaigi 2016

Kenta Murata

September 09, 2016

Transcript

  1. @mrkn
     ✓ Kenta Murata
     ✓ CRuby committer
     ✓ Started contributing to SciRuby last year
     ✓ Recruit Holdings Co., Ltd., Media Technology Lab.

  2. enumerable-statistics.gem
     ✓ Compute statistical summaries as fast and precisely as possible
       ‣Array#sum, Enumerable#sum (for Ruby < 2.4)
       ‣Array#mean, Enumerable#mean
       ‣Array#variance, Enumerable#variance
       ‣etc.

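     A minimal usage sketch (the variance value assumes the gem defaults to
     the unbiased sample variance; it is illustrative, not from the slides):

     require 'enumerable/statistics'

     ary = [1, 2, 3, 4]
     ary.sum       # => 10
     ary.mean      # => 2.5
     ary.variance  # => 5/3 ~= 1.667, assuming sample variance by default
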
  3. Data Science Workflow
     ✓ Collecting data
     ✓ Exploratory data analysis
     ✓ Cleansing data
     ✓ Integrating multiple data sources
     ✓ Preprocessing
     ✓ Making machine learning models
     ✓ Applying to the real world

  4. Machine Learning Related Processes
     ✓ Collecting data
     ✓ Exploratory data analysis
     ✓ Cleansing data
     ✓ Integrating multiple data sources
     ✓ Preprocessing
     ✓ Making machine learning models
     ✓ Applying to the real world

  5. How many things can be done with Ruby?
     ✓ Exploratory data analysis
     ✓ Cleansing data
     ✓ Integrating multiple data sources
     ✓ Preprocessing
     ✓ Making machine learning models

  7. How many things can be done in Ruby?
     ✓ Generally, nothing
     ✓ Python can handle the total workflow
     ✓ That's why everyone uses Python

  8. Change the Current Situation
     ✓ Make Ruby available for data science
     ✓ What is wrong now?
     ✓ I'll make it clear in this talk

  9. Help!!
     ✓ Join SciRuby development
     ✓ A lot of issues are waiting for your contributions
     ✓ Discussion happens in Slack
       ‣https://sciruby-slack.herokuapp.com

  10. Why do we use machine learning?
     ✓ We want to make business decisions from real data
     ✓ The use of machine learning algorithms is optional
     ✓ We need machine learning to drive our business with "big data"

  11. Machine Learning can do
     ✓ Tasks impossible for humans
     ✓ Tasks whose solutions are difficult to program by hand
     ✓ Tasks where the way to solve them is unknown

  12. Supervised learning
     ✓ To learn a general rule that maps inputs to outputs from given example input-output pairs
     ✓ Two types of problems:
       ‣Classification - to predict what the weather will be tomorrow
       ‣Regression - to estimate the expected highest temperature tomorrow
     ✓ Example use cases:
       ‣Recommender systems
       ‣Sentiment analysis

  13. Unsupervised learning
     ✓ To extract the structural features of the input data distribution
     ✓ Typical problem types:
       ‣Clustering
       ‣Density estimation
       ‣Dimensionality reduction
     ✓ Example use cases:
       ‣Exploratory data analysis
       ‣Outlier detection

  14. Reinforcement learning
     ✓ To learn rules of decision making in a dynamic environment
     ✓ Typical problem types:
       ‣Multi-armed bandit problems
       ‣Adaptive scheduling
       ‣Automatic control
     ✓ Example use cases:
       ‣Shogi AI
       ‣Autonomous car driving

  15. liblinear-ruby.gem example

     require 'liblinear'

     # model parameters
     parameters = { solver_type: Liblinear::L2R_LR }

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # train
     model = Liblinear.train(parameters, labels, examples)

     # predict (the result will be 1)
     puts Liblinear.predict(model, [0.5, 0.5])

  16. liblinear-ruby.gem features
     ✓ Just a wrapper of liblinear
     ✓ Logistic regression
       ‣Classification
     ✓ Linear SVC
       ‣Classification
     ✓ Linear SVR
       ‣Regression
     ✓ Cross validation

  17. liblinear-ruby.gem example

     require 'liblinear'
     require 'enumerable/statistics' # provides Array#sum on Ruby < 2.4

     # model parameters
     parameters = { solver_type: Liblinear::L2R_LR }

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # train
     model = Liblinear.train(parameters, labels, examples)

     # predict (the result will be 1)
     puts Liblinear.predict(model, [0.5, 0.5])

     # cross validation
     fold = 5 # means 5-fold cross validation
     results = Liblinear.cross_validation(fold, parameters, labels, examples)
     accuracy = results.zip(labels).map {|a, b| a == b ? 1.0 : 0.0 }.sum / labels.length
     puts "Cross validation accuracy: #{accuracy}"

  18. rb-libsvm.gem features
     ✓ Just a wrapper of libsvm
     ✓ C-SVC, nu-SVC
       ‣Classification
     ✓ epsilon-SVR, nu-SVR
       ‣Regression
     ✓ One-class SVM
       ‣Unsupervised outlier detection
     ✓ Cross validation

  19. rb-libsvm.gem example

     require 'libsvm'

     # model parameters
     parameter = Libsvm::SvmParameter.new
     parameter.svm_type = Libsvm::SvmType::C_SVC
     parameter.kernel_type = Libsvm::KernelType::RBF
     parameter.cache_size = 1 # in megabytes
     parameter.eps = 0.001
     parameter.c = 10

     # labels of training data
     labels = [1, -1]

     # training data
     examples = [[1, 0, 1], [-1, 0, -1]].map {|xs| Libsvm::Node.features(xs) }

     # train model
     problem = Libsvm::Problem.new
     problem.set_examples(labels, examples)
     model = Libsvm::Model.train(problem, parameter)

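     The slide stops at training; a prediction step with rb-libsvm's
     Model#predict would look roughly like this:

     # predict (the result should be 1)
     pred = model.predict(Libsvm::Node.features([1, 0, 1]))
     puts pred
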
  20. decisiontree.gem features
     ✓ ID3 decision tree
       ‣Classification only
     ✓ No parameter configuration
       ‣e.g. criterion, minimum samples per leaf, etc.
     ✓ No cross validation
     ✓ Pure Ruby implementation

  21. decisiontree.gem usage

     require 'decisiontree'

     # training data (the last item of each row is the label)
     feature_names = ['hunger', 'color']
     examples = [
       [8, 'red',  'angry'],
       [6, 'red',  'angry'],
       [7, 'red',  'angry'],
       [7, 'blue', 'not angry'],
       [2, 'red',  'not angry'],
       [3, 'blue', 'not angry'],
       [2, 'blue', 'not angry'],
       [1, 'red',  'not angry']
     ]

     # train model
     tree = DecisionTree::ID3Tree.new(
       feature_names, examples, 'not angry',
       color: :discrete, hunger: :continuous
     )
     tree.train

     # prediction (the last item of the test row is the true label)
     pred = tree.predict([7, 'red', 'angry'])
     puts "Predicted: #{pred} (true label: angry)"

  22. With Existing Gems
     ✓ Several machine learning algorithms are provided for classification, regression, clustering, etc.
     ✓ We must use each algorithm through library-specific code because the gems all have different APIs

  23. Issues of Existing Gems
     ✓ Different ways to specify model parameters
     ✓ Different ways to supply training data, and different data formats
     ✓ Many gems don't support cross validation
     ✓ Not for practical use because of their toy implementations

  24. Machine Learning in the Real World
     ✓ We can't look at the whole data
     ✓ We can't know in advance which algorithms suit the given data
     ✓ We must try, compare, and combine as many algorithms as possible

  25. Try, Compare, and Combine Multiple Algorithms
     ✓ Need to unify data formats
     ✓ Need to apply cross validation to all algorithms
     ✓ Need to unify the interfaces of algorithms, both for searching optimal hyperparameters and for combining algorithms

  26. In the Current SciRuby
     ✓ We can't build practical machine learning systems with SciRuby
     ✓ Python can, with scikit-learn

  27. SciPy stack
     ✓ NumPy
       ‣Dense tensors
     ✓ SciPy
       ‣Scientific functions
       ‣Sparse matrices
     ✓ pandas
       ‣Data frames
     ✓ Matplotlib
       ‣Visualization infrastructure
     ✓ Jupyter Notebook
     ✓ Etc.

  28. Scikit-learn is elegant
     ✓ Input data is a feature matrix and a label vector for all algorithms
     ✓ The input data type can be any object compatible with NumPy's ndarray
     ✓ Machine learning models follow a unified interface

  29. Logistic regression

     from sklearn.linear_model import LogisticRegression
     from sklearn.cross_validation import cross_val_score

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # learning
     classifier = LogisticRegression(penalty="l2")
     classifier.fit(examples, labels)

     # prediction
     print(classifier.predict([[0.5, 0.5]]))

     # 5-fold cross validation
     classifier = LogisticRegression(penalty="l2")
     scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
     print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

  30. dmlc/xgboost

     import xgboost as xgb
     from sklearn.cross_validation import cross_val_score

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # learning
     classifier = xgb.XGBClassifier()
     classifier.fit(examples, labels)

     # prediction
     print(classifier.predict([[0.5, 0.5]]))

     # 5-fold cross validation
     classifier = xgb.XGBClassifier()
     scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
     print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

  31. Grid search

     import numpy
     from sklearn.linear_model import LogisticRegression
     from sklearn.grid_search import GridSearchCV

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # finding the best parameter set by grid search
     parameters = {
         'penalty': ['l2', 'l1'],
         'C': numpy.logspace(-4, 4, 10)
     }
     classifier = GridSearchCV(LogisticRegression(), parameters, cv=5)
     classifier.fit(examples, labels)

     # report the best parameters
     best_params = classifier.best_estimator_.get_params()
     print('Best parameters = {}'.format(best_params))

  32. Combination with Pipeline

     import numpy
     from sklearn import svm
     from sklearn.decomposition import PCA
     from sklearn.grid_search import GridSearchCV
     from sklearn.pipeline import Pipeline

     # features and labels are prepared beforehand, as in the previous examples

     # combine PCA and SVC in a pipeline (input -> PCA -> SVC -> output)
     pipeline = Pipeline([
         ('pca', PCA()),
         ('svc', svm.SVC())
     ])

     # finding the best parameter set by grid search
     parameters = {
         'pca__n_components': range(2, 6),
         'svc__kernel': ['linear', 'rbf'],
         'svc__C': numpy.logspace(-4, 4, 10),
         'svc__gamma': numpy.logspace(-4, 4, 10)
     }
     classifier = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1)
     classifier.fit(features, labels)

     # report the best parameter set
     best_params = classifier.best_estimator_.get_params()
     print('Best parameters = {}'.format(best_params))

  33. With scikit-learn
     ✓ We can prepare training data in a common format
       ‣vectors for labels and matrices for features
     ✓ We can use all the algorithms through the same interface
     ✓ We can build combined models with Pipeline
     ✓ We can run grid search to optimize hyperparameters

  34. Two ways
     ✓ Make scikit-learn itself available from Ruby
     ✓ Make our own scikit-learn-like libraries written in Ruby

  35. Use scikit-learn itself
     ✓ Learn from PyCall.jl and ScikitLearn.jl
     ✓ PyCall.jl
       ‣Call Python from Julia code
     ✓ ScikitLearn.jl
       ‣Binding to scikit-learn via PyCall.jl
     ✓ Make pycall.gem and scikit-learn.gem

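     A hypothetical sketch of what such a pycall.gem could look like,
     modeled on PyCall.jl (pyfrom, PyCall::Import, and keyword-argument
     passing are assumptions about a gem that is still only planned):

     require 'pycall/import'
     include PyCall::Import

     # import scikit-learn's LogisticRegression into Ruby
     pyfrom 'sklearn.linear_model', import: :LogisticRegression

     labels   = [-1, -1, 1, 1]
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     classifier = LogisticRegression.new(penalty: 'l2')
     classifier.fit(examples, labels)

     # prediction (should print 1)
     puts classifier.predict([[0.5, 0.5]])
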
  36. Make scikit-learn-like libraries
     ✓ Very hard work (a sketch of the target interface follows below)
     ✓ Need a Cython-like system to make writing extension libraries easy
       ‣rubex, planned by v0dro
     ✓ Numerical arrays

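     To make the target concrete, here is a hypothetical scikit-learn-style
     estimator in plain Ruby; NearestCentroidClassifier and its fit/predict
     methods are illustrative, not any existing gem's API:

     # Every model exposes the same interface: fit(examples, labels) and
     # predict(examples), so models become interchangeable.
     class NearestCentroidClassifier
       # fit learns one centroid (mean vector) per label
       def fit(examples, labels)
         @centroids = examples.zip(labels).group_by(&:last).map do |label, pairs|
           xs = pairs.map(&:first)
           [label, xs.transpose.map {|col| col.sum.to_f / col.size }]
         end.to_h
         self
       end

       # predict returns the label of the nearest centroid for each example
       def predict(examples)
         examples.map do |x|
           @centroids.min_by {|_, c|
             c.zip(x).inject(0.0) {|s, (ci, xi)| s + (ci - xi)**2 }
           }.first
         end
       end
     end

     classifier = NearestCentroidClassifier.new
     classifier.fit([[-2, -2], [-1, -1], [1, 1], [2, 2]], [-1, -1, 1, 1])
     p classifier.predict([[0.5, 0.5]])  # => [1]
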
  37. NMatrix
     ✓ Slow implementation
     ✓ Lacks linear algebra operations for sparse matrices
     ✓ Installation issues

  38. NumBuffer
     ✓ What is it?
       ‣Support for exchanging numerical array data among different libraries
     ✓ I'm the only developer
     ✓ More contributors needed

  39. Benchmark

     $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
     Benchmark.ips do |x|
       ar = Array.new(100*100) { rand }
       nm = NMatrix.random [100*100]
       na = Numo::DFloat.new(100*100).rand
       x.report("ar") { Array.new(ar.length) {|i| ar[i] + ar[i] } }
       x.report("nm") { nm + nm }
       x.report("na") { na + na }
     end
     '
     Warming up --------------------------------------
               ar   111.000 i/100ms
               nm    59.000 i/100ms
               na     3.133k i/100ms
     Calculating -------------------------------------
               ar     1.068k (±12.3%) i/s -   5.328k in 5.078079s
               nm   618.334 (±10.0%) i/s -   3.068k in 5.021136s
               na    34.110k (±19.0%) i/s - 166.049k in 5.028910s

  40. Benchmark

     $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
     Benchmark.ips do |x|
       nm = NMatrix.random [100, 100]
       na = Numo::DFloat.new(100, 100).rand
       x.report("nm") { nm.dot nm }
       x.report("na") { na.inplace.dot na }
     end
     '
     Warming up --------------------------------------
               nm   189.000 i/100ms
               na    60.000 i/100ms
     Calculating -------------------------------------
               nm     2.083k (± 8.0%) i/s - 10.395k in 5.022906s
               na   658.759 (± 7.4%) i/s -  3.300k in 5.039515s

  41. NMatrix and NArray compatibility
     ✓ Which is best?
       ‣Neither of them is best right now
     ✓ Interface and feature incompatibility
       ‣NumBuffer can't resolve this issue
     ✓ I want the two to be unified
       ‣NMatrix is good for sparse matrices
       ‣NArray is good for dense arrays

  42. Some Achievements
     ✓ Tutorials
       ‣100 narray exercises (by masa16 & kozo2)
       ‣10 minutes to daru (by kozo2)
       ‣pandas cookbook with daru (by kozo2)
       ‣Rewrite pandas doc with daru (by chart-linux)
     ✓ Installation
       ‣IRuby on Windows (by kimura)
       ‣ZeroMQ related things (by kozo2 & mrkn)

  43. Some Achievements
     ✓ NLP
       ‣Survey (by himkt)
     ✓ Machine Learning
       ‣Survey (by mrkn)
     ✓ Visualization
       ‣New plotly binding (by y4ashida)
     ✓ Other Languages
       ‣Ruby support in runr (by y4ashida)

  44. Let's go forward
     ✓ Join SciRuby development
       ‣English is preferred, but Japanese is OK
     ✓ A lot of issues are waiting for your contributions
       ‣Not only for machine learning
     ✓ Discuss in Slack
       ‣https://sciruby-slack.herokuapp.com