Slide 1

Slide 1 text

SciRuby Machine Learning: Current Status and Future
Kenta Murata
2016.09.09, Kyoto, Japan

Slide 2

Slide 2 text

self.introduce

Slide 3

Slide 3 text

@mrkn ✓ Kenta Murata ✓ CRuby committer ✓ Started contributing to SciRuby last year ✓ Recruit Holdings Co., Ltd., Media Technology Lab.

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

my gems ✓ bigdecimal ✓ daru-td ✓ iruby-rails ✓ enumerable-statistics

Slide 6

Slide 6 text

enumerable-statistics.gem ✓ Compute statistical summaries as fast and precisely as possible ‣Array#sum, Enumerable#sum (for Ruby < 2.4) ‣Array#mean, Enumerable#mean ‣Array#variance, Enumerable#variance ‣etc.
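For instance, a minimal usage sketch (the values in the comments follow from the input; the exact ddof default for variance is an assumption to verify against the gem's docs):

require 'enumerable/statistics'

data = [1, 2, 3, 4]
p data.sum      # => 10 (provided by the gem even on Ruby < 2.4)
p data.mean     # => 2.5
p data.variance # sample variance; the exact ddof default is an assumption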

Slide 7

Slide 7 text

enumerable-statistics.gem

Slide 8

Slide 8 text

Agenda ✓ Introduction ✓ Machine Learning ✓ SciRuby's Current Status ✓ Scikit-learn ✓ Future

Slide 9

Slide 9 text

Introduction

Slide 10

Slide 10 text

I want to ✓ Do machine learning with Ruby

Slide 11

Slide 11 text

Machine Learning w/ Ruby ✓ What does it mean?

Slide 12

Slide 12 text

NG ✓ Writing machine learning algorithms in Ruby

Slide 13

Slide 13 text

OK ✓ Performing data science work with Ruby

Slide 14

Slide 14 text

Data Science Workflow ✓ Collecting data ✓ Exploratory data analysis ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing ✓ Making machine learning model ✓ Applying to real world

Slide 15

Slide 15 text

Machine Learning Related Processes ✓ Collecting data ✓ Exploratory data analysis ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing ✓ Making machine learning model ✓ Applying to real world

Slide 16

Slide 16 text

How many things can be done with Ruby? ✓ Exploratory data analysis ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing ✓ Making machine learning model

Slide 17

Slide 17 text

How many things can be done with Ruby? ✓ Exploratory data analysis ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing ✓ Making machine learning model

Slide 18

Slide 18 text

How many things can be done in Ruby? ✓ Almost nothing ✓ Python can handle the whole workflow ✓ That's why everyone uses Python

Slide 19

Slide 19 text

Change the Current Situation ✓ Make Ruby usable for data science ✓ What is wrong now? ✓ I'll make that clear in this talk

Slide 20

Slide 20 text

First, the Most Important Thing in This Talk

Slide 21

Slide 21 text

Help!! ✓ Join SciRuby development ✓ A lot of issues are waiting for your contribution ✓ Discuss in Slack ✓ https://sciruby-slack.herokuapp.com

Slide 22

Slide 22 text

Machine Learning

Slide 23

Slide 23 text

Why do we use machine learning? ✓ We want to make business decisions from real data ✓ The use of machine learning algorithms is optional ✓ We need machine learning to drive our business with "big data"

Slide 24

Slide 24 text

Machine Learning can do ✓ Tasks impossible for humans ✓ Tasks whose solutions are difficult to program by hand ✓ Tasks where the way to solve them is unknown

Slide 25

Slide 25 text

For example ✓ Recommendation ✓ Outlier detection ✓ Sentiment analysis ✓ etc.

Slide 26

Slide 26 text

Machine Learning Problems ✓ Supervised learning ✓ Unsupervised learning ✓ Reinforcement learning

Slide 27

Slide 27 text

Supervised learning ✓ To learn a general rule that maps inputs to outputs from given example input-output pairs ✓ Two types of problems: ‣Classification - e.g. predicting tomorrow's weather ‣Regression - e.g. estimating tomorrow's expected highest temperature ✓ Example use cases: ‣Recommender systems ‣Sentiment analysis

Slide 28

Slide 28 text

Unsupervised learning ✓ To extract the structural features of the input data distribution ✓ Typical problem types: ‣Clustering ‣Density estimation ‣Dimensionality reduction ✓ Example use cases: ‣Exploratory data analysis ‣Outlier detection

Slide 29

Slide 29 text

Reinforcement learning ✓ To learn rules of decision making in a dynamic environment ✓ Typical problem types: ‣Multi-armed bandit problem ‣Adaptive scheduling ‣Automatic control ✓ Example use cases: ‣Shogi AI ‣Autonomous driving

Slide 30

Slide 30 text

Machine Learning Problems ✓ Supervised learning ✓ Unsupervised learning ✓ Reinforcement learning

Slide 31

Slide 31 text

This talk focuses on ✓ Supervised learning ✓ Unsupervised learning ✓ Reinforcement learning

Slide 32

Slide 32 text

SciRuby Machine Learning: Current Status

Slide 33

Slide 33 text

Existing Gems for Machine Learning ✓ liblinear-ruby.gem ✓ rb-libsvm.gem ✓ decisiontree.gem ✓ etc.

Slide 34

Slide 34 text

liblinear-ruby.gem example

require 'liblinear'

# model parameters
parameters = { solver_type: Liblinear::L2R_LR }

# labels of training data
labels = [-1, -1, 1, 1]

# training data
examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

# train
model = Liblinear.train(parameters, labels, examples)

# predict (the result will be 1)
puts Liblinear.predict(model, [0.5, 0.5])

Slide 35

Slide 35 text

liblinear-ruby.gem features ✓ Just a wrapper of liblinear ✓ Logistic regression ‣Classification ✓ Linear SVC ‣Classification ✓ Linear SVR ‣Regression ✓ Cross validation

Slide 36

Slide 36 text

liblinear-ruby.gem example

require 'liblinear'
require 'enumerable/statistics' # provides Array#sum on Ruby < 2.4

# model parameters
parameters = { solver_type: Liblinear::L2R_LR }

# labels of training data
labels = [-1, -1, 1, 1]

# training data
examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

# train
model = Liblinear.train(parameters, labels, examples)

# predict (the result will be 1)
puts Liblinear.predict(model, [0.5, 0.5])

# cross validation
fold = 5 # means 5-fold cross validation
results = Liblinear.cross_validation(fold, parameters, labels, examples)
accuracy = results.zip(labels).map {|a, b| a == b ? 1.0 : 0.0 }.sum / labels.length
puts "Cross validation accuracy: #{accuracy}"

Slide 37

Slide 37 text

rb-libsvm.gem features ✓ Just a wrapper of libsvm ✓ C-SVC, nu-SVC ‣Classification ✓ epsilon-SVR, nu-SVR ‣Regression ✓ One-class SVM ‣Unsupervised outlier detection ✓ Cross validation

Slide 38

Slide 38 text

rb-libsvm.gem example

require 'libsvm'
require 'enumerable/statistics'

# model parameters
parameter = Libsvm::SvmParameter.new
parameter.svm_type = Libsvm::SvmType::C_SVC
parameter.kernel_type = Libsvm::KernelType::RBF
parameter.cache_size = 1 # in megabytes
parameter.eps = 0.001
parameter.c = 10

# labels of training data
labels = [1, -1]

# training data
examples = [[1, 0, 1], [-1, 0, -1]].map {|xs| Libsvm::Node.features(xs) }

# train model
problem = Libsvm::Problem.new
problem.set_examples(labels, examples)
model = Libsvm::Model.train(problem, parameter)
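The slide's example stops after training; a prediction step, following the gem's Model#predict API (the expected label here is an assumption based on the training data), would look roughly like this:

# predict the label of a new point (close to the first training example)
puts model.predict(Libsvm::Node.features([0.5, 0, 0.5])) # expected: 1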

Slide 39

Slide 39 text

decisiontree.gem features ✓ ID3 decision tree ‣Classification only ✓ No parameter configuration ‣e.g. criterion, minimum samples per leaf, etc. ✓ No cross validation ✓ Pure Ruby implementation

Slide 40

Slide 40 text

decisiontree.gem usage

require 'decisiontree'

# feature names (each row below is [hunger, color, label])
feature_names = ['hunger', 'color']

# training data (last items are labels)
examples = [
  [8, 'red',  'angry'],
  [6, 'red',  'angry'],
  [7, 'red',  'angry'],
  [7, 'blue', 'not angry'],
  [2, 'red',  'not angry'],
  [3, 'blue', 'not angry'],
  [2, 'blue', 'not angry'],
  [1, 'red',  'not angry']
]

# train model ('not angry' is the default label)
tree = DecisionTree::ID3Tree.new(
  feature_names, examples, 'not angry',
  hunger: :continuous, color: :discrete
)
tree.train

# prediction (the gem takes a full row; the trailing label slot is ignored)
pred = tree.predict([7, 'red', 'angry'])
puts "Predicted: #{pred} (expected: angry)"

Slide 41

Slide 41 text

Etc. ✓ ai4r.gem ✓ classifier-reborn.gem ✓ data_mining.gem ✓ etc.

Slide 42

Slide 42 text

With Existing Gems ✓ Several machine learning algorithms are provided for classification, regression, clustering, etc. ✓ We must use each algorithm through its library-specific API, because the libraries all differ

Slide 43

Slide 43 text

Issues of Existing Gems ✓ Different ways to specify model parameters ✓ Different formats and handling of training data ✓ Many gems don't support cross validation ✓ Not for practical use because of their toy implementations

Slide 44

Slide 44 text

Real World Machine Learning

Slide 45

Slide 45 text

Real World Data ✓ Large amount of data ✓ High-dimensional features ✓ A lot of missing values

Slide 46

Slide 46 text

Machine Learning in the Real World ✓ We can't look at the whole data ✓ We can't know in advance which algorithms suit the given data ✓ We must try, compare, and combine as many algorithms as possible

Slide 47

Slide 47 text

Try, Compare, and Combine Multiple Algorithms ✓ Need to unify data formats ✓ Need to apply cross validation to all algorithms ✓ Need to unify the interfaces of algorithms to search for optimal hyperparameters and to combine algorithms (see the sketch below)
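For illustration, the kind of unified interface this calls for could look like the duck-typed Ruby sketch below; everything here (the NearestCentroid class, the fit/predict contract) is hypothetical, not an existing gem API:

# Hypothetical sketch: every model exposes the same fit/predict contract,
# so models become interchangeable in cross validation, grid search,
# and pipelines.
class NearestCentroid
  # a deliberately tiny "model", used only to illustrate the interface
  def fit(examples, labels)
    @centroids = examples.zip(labels).group_by(&:last).map do |label, pairs|
      columns = pairs.map(&:first).transpose
      [label, columns.map {|col| col.inject(:+).fdiv(col.length) }]
    end
    self
  end

  def predict(examples)
    examples.map do |x|
      @centroids.min_by {|_, c|
        c.zip(x).inject(0.0) {|s, (ci, xi)| s + (ci - xi)**2 }
      }.first
    end
  end
end

# any object responding to fit and predict would work here
model = NearestCentroid.new.fit([[-2, -2], [-1, -1], [1, 1], [2, 2]], [-1, -1, 1, 1])
p model.predict([[0.5, 0.5]]) # => [1]

With such a shared contract, one cross-validation or grid-search routine would work for every model, which is exactly what the existing gems' divergent APIs make impossible.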

Slide 48

Slide 48 text

In Current SciRuby ✓ We can't build practical machine learning systems with SciRuby ✓ Python can, with scikit-learn

Slide 49

Slide 49 text

Scikit-learn

Slide 50

Slide 50 text

What is scikit-learn? ✓ A machine learning framework for the SciPy stack

Slide 51

Slide 51 text

SciPy stack ✓ NumPy ‣Dense tensors ✓ SciPy ‣Scientific functions ‣Sparse matrices ✓ pandas ‣Data frames ✓ Matplotlib ‣Visualization infrastructure ✓ Jupyter Notebook ✓ Etc.

Slide 52

Slide 52 text

What is scikit-learn? ✓ A machine learning framework for the SciPy stack ✓ Python's machine learning standard

Slide 53

Slide 53 text

Scikit-learn is elegant ✓ Input data is a feature matrix and a label vector for all algorithms ✓ Input data can be any object compatible with NumPy's ndarray ✓ Machine learning models follow a unified interface

Slide 54

Slide 54 text

Logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score

# labels of training data
labels = [-1, -1, 1, 1]

# training data
examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

# learning
classifier = LogisticRegression(penalty="l2")
classifier.fit(examples, labels)

# prediction
print(classifier.predict([[0.5, 0.5]]))

# 5-fold cross validation, scored by ROC AUC
# (a real dataset needs at least 5 samples per class for cv=5)
classifier = LogisticRegression(penalty="l2")
scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
print("ROC AUC: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Slide 55

Slide 55 text

dmlc/xgboost

import xgboost as xgb
from sklearn.cross_validation import cross_val_score

# labels of training data
labels = [-1, -1, 1, 1]

# training data
examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

# learning
classifier = xgb.XGBClassifier()
classifier.fit(examples, labels)

# prediction
print(classifier.predict([[0.5, 0.5]]))

# 5-fold cross validation, scored by ROC AUC
classifier = xgb.XGBClassifier()
scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
print("ROC AUC: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Slide 56

Slide 56 text

Grid search

import numpy
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV

# labels of training data
labels = [-1, -1, 1, 1]

# training data
examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

# finding the best parameter set by grid search
parameters = {
    'penalty': ['l2', 'l1'],
    'C': numpy.logspace(-4, 4, 10)
}
classifier = GridSearchCV(LogisticRegression(), parameters, cv=5)
classifier.fit(examples, labels)

# report best parameters
best_params = classifier.best_estimator_.get_params()
print('Best parameters = {}'.format(best_params))

Slide 57

Slide 57 text

Combination with Pipeline

import numpy
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline

# combine PCA and SVC by pipeline (Input -> PCA -> SVC -> Output)
pipeline = Pipeline([
    ('pca', PCA()),
    ('svc', svm.SVC())
])

# finding the best parameter set by grid search
parameters = {
    'pca__n_components': range(2, 6),
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': numpy.logspace(-4, 4, 10),
    'svc__gamma': numpy.logspace(-4, 4, 10)
}
classifier = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1)
classifier.fit(features, labels)  # features/labels prepared beforehand

# report best parameter set
best_params = classifier.best_estimator_.get_params()
print('Best parameters = {}'.format(best_params))

Slide 58

Slide 58 text

With scikit-learn ✓ We can prepare training data in a common format ‣vectors for labels and matrices for features ✓ We can use all algorithms through the same interface ✓ We can build combined models using pipelines ✓ We can grid-search to optimize hyperparameters

Slide 59

Slide 59 text

Scikit-learn is a standard ✓ Several libraries provide a scikit-learn-compatible interface ‣xgboost ‣tensorflow

Slide 60

Slide 60 text

Scikit-learn is an ideal framework for machine learning

Slide 61

Slide 61 text

The Future of SciRuby in Machine Learning

Slide 62

Slide 62 text

Key Point ✓ Make a scikit-learn-like framework available to Ruby programs

Slide 63

Slide 63 text

Two ways ✓ Make scikit-learn itself available from Ruby ✓ Build our own scikit-learn-like libraries written in Ruby

Slide 64

Slide 64 text

Use scikit-learn itself ✓ Learn from PyCall.jl and ScikitLearn.jl ✓ PyCall.jl ‣Calls Python code from Julia ✓ ScikitLearn.jl ‣A binding to scikit-learn via PyCall.jl ✓ Make pycall.gem and scikit-learn.gem (sketch below)
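Since pycall.gem is only proposed at this point, the following is a hypothetical sketch of what such a binding might look like; the module and method names are assumptions modeled on PyCall.jl:

require 'pycall' # hypothetical gem, does not exist yet

# import a Python module through the assumed bridge
linear_model = PyCall.import_module('sklearn.linear_model')

# drive a Python estimator from Ruby with the usual fit/predict workflow
classifier = linear_model.LogisticRegression.new(penalty: 'l2')
classifier.fit([[-2, -2], [-1, -1], [1, 1], [2, 2]], [-1, -1, 1, 1])
puts classifier.predict([[0.5, 0.5]]) # expected output: [1]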

Slide 65

Slide 65 text

Make scikit-learn-like libraries ✓ Very hard work ✓ Need a Cython-like system to make writing extension libraries easy ‣rubex, planned by v0dro ✓ Numerical arrays

Slide 66

Slide 66 text

Numerical array issues ✓ NMatrix ✓ Numo::NArray ✓ NumBuffer

Slide 67

Slide 67 text

NMatrix ✓ Slow implementation ✓ Lack of linear algebra operations for sparse matrices ✓ Installation issues

Slide 68

Slide 68 text

Numo::NArray ✓ Lack of sparse matrix features ✓ Too few supported libraries

Slide 69

Slide 69 text

NumBuffer ✓ What is it? ‣Supports exchanging numerical array data among different libraries ✓ I'm the only developer ✓ Need more contributors

Slide 70

Slide 70 text

Benchmark

$ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
  Benchmark.ips do |x|
    ar = Array.new(100*100) { rand }
    nm = NMatrix.random [100*100]
    na = Numo::DFloat.new(100*100).rand
    x.report("ar") { Array.new(ar.length) {|i| ar[i] + ar[i] } }
    x.report("nm") { nm + nm }
    x.report("na") { na + na }
  end
'

Warming up --------------------------------------
  ar    111.000 i/100ms
  nm     59.000 i/100ms
  na      3.133k i/100ms
Calculating -------------------------------------
  ar      1.068k (±12.3%) i/s -   5.328k in 5.078079s
  nm    618.334  (±10.0%) i/s -   3.068k in 5.021136s
  na     34.110k (±19.0%) i/s - 166.049k in 5.028910s

Slide 71

Slide 71 text

Benchmark

$ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
  Benchmark.ips do |x|
    nm = NMatrix.random [100, 100]
    na = Numo::DFloat.new(100, 100).rand
    x.report("nm") { nm.dot nm }
    x.report("na") { na.inplace.dot na }
  end
'

Warming up --------------------------------------
  nm    189.000 i/100ms
  na     60.000 i/100ms
Calculating -------------------------------------
  nm      2.083k (± 8.0%) i/s - 10.395k in 5.022906s
  na    658.759  (± 7.4%) i/s -  3.300k in 5.039515s

Slide 72

Slide 72 text

NMatrix and NArray compatibility ✓ Which is best? ‣Neither of them is best right now ✓ Interface and feature incompatibility ‣NumBuffer can't resolve this issue ✓ I want them to be unified ‣NMatrix is good for sparse matrices ‣NArray is good for dense arrays

Slide 73

Slide 73 text

SciRuby JP ✓ SciRuby developer community in Japan ✓ Performed survey studies this summer

Slide 74

Slide 74 text

Some Achievements ✓ Tutorials ‣100 narray exercises (by masa16 & kozo2) ‣10 minutes to daru (by kozo2) ‣pandas cookbook with daru (by kozo2) ‣Rewrite of the pandas docs with daru (by chart-linux) ✓ Installation ‣IRuby on Windows (by kimura) ‣ZeroMQ-related things (by kozo2 & mrkn)

Slide 75

Slide 75 text

Some Achievements ✓ NLP ‣Survey (by himkt) ✓ Machine Learning ‣Survey (by mrkn) ✓ Visualization ‣New plotly binding (by y4ashida) ✓ Other Languages ‣Ruby support in runr (by y4ashida)

Slide 76

Slide 76 text

Let's go forward ✓ Join SciRuby development ‣English is preferred, but Japanese is OK ✓ A lot of issues are waiting for your contribution ‣Not only for machine learning ✓ Discuss in Slack ‣https://sciruby-slack.herokuapp.com

Slide 77

Slide 77 text

No content
