
SciRuby Machine Learning Current Status and Future

RubyKaigi 2016

Kenta Murata

September 09, 2016

Transcript

  1. @mrkn
     ✓ Kenta Murata
     ✓ CRuby committer
     ✓ Started contributing to SciRuby last year
     ✓ Recruit Holdings Co., Ltd., Media Technology Lab.

  2. enumerable-statistics.gem
     ✓ Compute statistical summaries as fast and precisely as possible
       ‣Array#sum, Enumerable#sum (for Ruby < 2.4)
       ‣Array#mean, Enumerable#mean
       ‣Array#variance, Enumerable#variance
       ‣etc.

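     A minimal usage sketch (the variance value assumes the gem defaults to
     the unbiased sample variance; it is illustrative, not from the slides):

     require 'enumerable/statistics'

     ary = [1, 2, 3, 4]
     ary.sum       # => 10
     ary.mean      # => 2.5
     ary.variance  # => 5/3 ~= 1.667, assuming sample variance by default
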
  3. Data Science Workflow
     ✓ Collecting data
     ✓ Exploratory data analysis
     ✓ Cleansing data
     ✓ Integrating multiple data sources
     ✓ Preprocessing
     ✓ Making machine learning models
     ✓ Applying to the real world

  4. Machine Learning Related Processes
     ✓ Collecting data
     ✓ Exploratory data analysis
     ✓ Cleansing data
     ✓ Integrating multiple data sources
     ✓ Preprocessing
     ✓ Making machine learning models
     ✓ Applying to the real world

  5. How many things can be done with Ruby?
     ✓ Exploratory data analysis
     ✓ Cleansing data
     ✓ Integrating multiple data sources
     ✓ Preprocessing
     ✓ Making machine learning models

  7. How many things can be done in Ruby?
     ✓ Generally, nothing
     ✓ Python can handle the total workflow
     ✓ That's why everyone uses Python

  8. Change the Current Situation
     ✓ Make Ruby available for data science
     ✓ What is wrong now?
     ✓ I'll make it clear in this talk

  9. Help!!
     ✓ Join SciRuby development
     ✓ A lot of issues are waiting for your contributions
     ✓ Discussion happens in Slack
       ‣https://sciruby-slack.herokuapp.com

  10. Why do we use machine learning?
     ✓ We want to make business decisions from real data
     ✓ The use of machine learning algorithms is optional
     ✓ We need machine learning to drive our business with "big data"

  11. Machine Learning can do
     ✓ Tasks impossible for humans
     ✓ Tasks whose solutions are difficult to program by hand
     ✓ Tasks where the way to solve them is unknown

  12. Supervised learning
     ✓ To learn a general rule that maps inputs to outputs from given example input-output pairs
     ✓ Two types of problems:
       ‣Classification - to predict what the weather will be tomorrow
       ‣Regression - to estimate the expected highest temperature tomorrow
     ✓ Example use cases:
       ‣Recommender systems
       ‣Sentiment analysis

  13. Unsupervised learning
     ✓ To extract the structural features of the input data distribution
     ✓ Typical problem types:
       ‣Clustering
       ‣Density estimation
       ‣Dimensionality reduction
     ✓ Example use cases:
       ‣Exploratory data analysis
       ‣Outlier detection

  14. Reinforcement learning
     ✓ To learn rules of decision making in a dynamic environment
     ✓ Typical problem types:
       ‣Multi-armed bandit problems
       ‣Adaptive scheduling
       ‣Automatic control
     ✓ Example use cases:
       ‣Shogi AI
       ‣Autonomous car driving

  15. liblinear-ruby.gem example

     require 'liblinear'

     # model parameters
     parameters = { solver_type: Liblinear::L2R_LR }

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # train
     model = Liblinear.train(parameters, labels, examples)

     # predict (the result will be 1)
     puts Liblinear.predict(model, [0.5, 0.5])

  16. liblinear-ruby.gem features
     ✓ Just a wrapper of liblinear
     ✓ Logistic regression
       ‣Classification
     ✓ Linear SVC
       ‣Classification
     ✓ Linear SVR
       ‣Regression
     ✓ Cross validation

  17. liblinear-ruby.gem example

     require 'liblinear'
     require 'enumerable/statistics' # provides Array#sum on Ruby < 2.4

     # model parameters
     parameters = { solver_type: Liblinear::L2R_LR }

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # train
     model = Liblinear.train(parameters, labels, examples)

     # predict (the result will be 1)
     puts Liblinear.predict(model, [0.5, 0.5])

     # cross validation
     fold = 5 # means 5-fold cross validation
     results = Liblinear.cross_validation(fold, parameters, labels, examples)
     accuracy = results.zip(labels).map {|a, b| a == b ? 1.0 : 0.0 }.sum / labels.length
     puts "Cross validation accuracy: #{accuracy}"

  18. rb-libsvm.gem features
     ✓ Just a wrapper of libsvm
     ✓ C-SVC, nu-SVC
       ‣Classification
     ✓ epsilon-SVR, nu-SVR
       ‣Regression
     ✓ One-class SVM
       ‣Unsupervised outlier detection
     ✓ Cross validation

  19. rb-libsvm.gem example

     require 'libsvm'

     # model parameters
     parameter = Libsvm::SvmParameter.new
     parameter.svm_type = Libsvm::SvmType::C_SVC
     parameter.kernel_type = Libsvm::KernelType::RBF
     parameter.cache_size = 1 # in megabytes
     parameter.eps = 0.001
     parameter.c = 10

     # labels of training data
     labels = [1, -1]

     # training data
     examples = [[1, 0, 1], [-1, 0, -1]].map {|xs| Libsvm::Node.features(xs) }

     # train model
     problem = Libsvm::Problem.new
     problem.set_examples(labels, examples)
     model = Libsvm::Model.train(problem, parameter)

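     The slide stops at training; a prediction step with rb-libsvm's
     Model#predict would look roughly like this:

     # predict (the result should be 1)
     pred = model.predict(Libsvm::Node.features([1, 0, 1]))
     puts pred
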
  20. decisiontree.gem features
     ✓ ID3 decision tree
       ‣Classification only
     ✓ No parameter configuration
       ‣e.g. criterion, minimum samples per leaf, etc.
     ✓ No cross validation
     ✓ Pure Ruby implementation

  21. decisiontree.gem usage

     require 'decisiontree'

     # training data (the last item of each row is the label)
     feature_names = ['hunger', 'color']
     examples = [
       [8, 'red',  'angry'],
       [6, 'red',  'angry'],
       [7, 'red',  'angry'],
       [7, 'blue', 'not angry'],
       [2, 'red',  'not angry'],
       [3, 'blue', 'not angry'],
       [2, 'blue', 'not angry'],
       [1, 'red',  'not angry']
     ]

     # train model
     tree = DecisionTree::ID3Tree.new(
       feature_names, examples, 'not angry',
       color: :discrete, hunger: :continuous
     )
     tree.train

     # prediction (the last item of the test row is the true label)
     pred = tree.predict([7, 'red', 'angry'])
     puts "Predicted: #{pred} (true label: angry)"

  22. With Existing Gems
     ✓ Several machine learning algorithms are provided for classification, regression, clustering, etc.
     ✓ We must use each algorithm through library-specific code because the gems all have different APIs

  23. Issues of Existing Gems
     ✓ Different ways to specify model parameters
     ✓ Different ways to supply training data, and different data formats
     ✓ Many gems don't support cross validation
     ✓ Not for practical use because of their toy implementations

  24. Machine Learning in the Real World
     ✓ We can't look at the whole data
     ✓ We can't know in advance which algorithms suit the given data
     ✓ We must try, compare, and combine as many algorithms as possible

  25. Try, Compare, and Combine Multiple Algorithms
     ✓ Need to unify data formats
     ✓ Need to apply cross validation to all algorithms
     ✓ Need to unify the interfaces of algorithms, both for searching optimal hyperparameters and for combining algorithms

  26. In the Current SciRuby
     ✓ We can't build practical machine learning systems with SciRuby
     ✓ Python can, with scikit-learn

  27. SciPy stack
     ✓ NumPy
       ‣Dense tensors
     ✓ SciPy
       ‣Scientific functions
       ‣Sparse matrices
     ✓ pandas
       ‣Data frames
     ✓ Matplotlib
       ‣Visualization infrastructure
     ✓ Jupyter Notebook
     ✓ Etc.

  28. Scikit-learn is elegant
     ✓ Input data is a feature matrix and a label vector for all algorithms
     ✓ The input data type can be any object compatible with NumPy's ndarray
     ✓ Machine learning models follow a unified interface

  29. Logistic regression

     from sklearn.linear_model import LogisticRegression
     from sklearn.cross_validation import cross_val_score

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # learning
     classifier = LogisticRegression(penalty="l2")
     classifier.fit(examples, labels)

     # prediction
     print(classifier.predict([[0.5, 0.5]]))

     # 5-fold cross validation
     classifier = LogisticRegression(penalty="l2")
     scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
     print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

  30. dmlc/xgboost

     import xgboost as xgb
     from sklearn.cross_validation import cross_val_score

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # learning
     classifier = xgb.XGBClassifier()
     classifier.fit(examples, labels)

     # prediction
     print(classifier.predict([[0.5, 0.5]]))

     # 5-fold cross validation
     classifier = xgb.XGBClassifier()
     scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
     print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

  31. Grid search

     import numpy
     from sklearn.linear_model import LogisticRegression
     from sklearn.grid_search import GridSearchCV

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # finding the best parameter set by grid search
     parameters = {
         'penalty': ['l2', 'l1'],
         'C': numpy.logspace(-4, 4, 10)
     }
     classifier = GridSearchCV(LogisticRegression(), parameters, cv=5)
     classifier.fit(examples, labels)

     # report the best parameters
     best_params = classifier.best_estimator_.get_params()
     print('Best parameters = {}'.format(best_params))

  32. Combination with Pipeline

     import numpy
     from sklearn import svm
     from sklearn.decomposition import PCA
     from sklearn.grid_search import GridSearchCV
     from sklearn.pipeline import Pipeline

     # features and labels are prepared beforehand, as in the previous examples

     # combine PCA and SVC in a pipeline (input -> PCA -> SVC -> output)
     pipeline = Pipeline([
         ('pca', PCA()),
         ('svc', svm.SVC())
     ])

     # finding the best parameter set by grid search
     parameters = {
         'pca__n_components': range(2, 6),
         'svc__kernel': ['linear', 'rbf'],
         'svc__C': numpy.logspace(-4, 4, 10),
         'svc__gamma': numpy.logspace(-4, 4, 10)
     }
     classifier = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1)
     classifier.fit(features, labels)

     # report the best parameter set
     best_params = classifier.best_estimator_.get_params()
     print('Best parameters = {}'.format(best_params))

  33. With scikit-learn
     ✓ We can prepare training data in a common format
       ‣vectors for labels and matrices for features
     ✓ We can use all the algorithms through the same interface
     ✓ We can build combined models with Pipeline
     ✓ We can run grid search to optimize hyperparameters

  34. Two ways
     ✓ Make scikit-learn itself available from Ruby
     ✓ Make our own scikit-learn-like libraries written in Ruby

  35. Use scikit-learn itself
     ✓ Learn from PyCall.jl and ScikitLearn.jl
     ✓ PyCall.jl
       ‣Call Python from Julia code
     ✓ ScikitLearn.jl
       ‣Binding to scikit-learn via PyCall.jl
     ✓ Make pycall.gem and scikit-learn.gem

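     A hypothetical sketch of what such a pycall.gem could look like,
     modeled on PyCall.jl (pyfrom, PyCall::Import, and keyword-argument
     passing are assumptions about a gem that is still only planned):

     require 'pycall/import'
     include PyCall::Import

     # import scikit-learn's LogisticRegression into Ruby
     pyfrom 'sklearn.linear_model', import: :LogisticRegression

     labels   = [-1, -1, 1, 1]
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     classifier = LogisticRegression.new(penalty: 'l2')
     classifier.fit(examples, labels)

     # prediction (should print 1)
     puts classifier.predict([[0.5, 0.5]])
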
  36. Make scikit-learn-like libraries
     ✓ Very hard work (a sketch of the target interface follows below)
     ✓ Need a Cython-like system to make writing extension libraries easy
       ‣rubex, planned by v0dro
     ✓ Numerical arrays

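     To make the target concrete, here is a hypothetical scikit-learn-style
     estimator in plain Ruby; NearestCentroidClassifier and its fit/predict
     methods are illustrative, not any existing gem's API:

     # Every model exposes the same interface: fit(examples, labels) and
     # predict(examples), so models become interchangeable.
     class NearestCentroidClassifier
       # fit learns one centroid (mean vector) per label
       def fit(examples, labels)
         @centroids = examples.zip(labels).group_by(&:last).map do |label, pairs|
           xs = pairs.map(&:first)
           [label, xs.transpose.map {|col| col.sum.to_f / col.size }]
         end.to_h
         self
       end

       # predict returns the label of the nearest centroid for each example
       def predict(examples)
         examples.map do |x|
           @centroids.min_by {|_, c|
             c.zip(x).inject(0.0) {|s, (ci, xi)| s + (ci - xi)**2 }
           }.first
         end
       end
     end

     classifier = NearestCentroidClassifier.new
     classifier.fit([[-2, -2], [-1, -1], [1, 1], [2, 2]], [-1, -1, 1, 1])
     p classifier.predict([[0.5, 0.5]])  # => [1]
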
  37. NMatrix
     ✓ Slow implementation
     ✓ Lacks linear algebra operations for sparse matrices
     ✓ Installation issues

  38. NumBuffer
     ✓ What is it?
       ‣Support for exchanging numerical array data among different libraries
     ✓ I'm the only developer
     ✓ More contributors needed

  39. Benchmark

     $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
     Benchmark.ips do |x|
       ar = Array.new(100*100) { rand }
       nm = NMatrix.random [100*100]
       na = Numo::DFloat.new(100*100).rand
       x.report("ar") { Array.new(ar.length) {|i| ar[i] + ar[i] } }
       x.report("nm") { nm + nm }
       x.report("na") { na + na }
     end
     '
     Warming up --------------------------------------
               ar   111.000 i/100ms
               nm    59.000 i/100ms
               na     3.133k i/100ms
     Calculating -------------------------------------
               ar     1.068k (±12.3%) i/s -   5.328k in 5.078079s
               nm   618.334 (±10.0%) i/s -   3.068k in 5.021136s
               na    34.110k (±19.0%) i/s - 166.049k in 5.028910s

  40. Benchmark

     $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
     Benchmark.ips do |x|
       nm = NMatrix.random [100, 100]
       na = Numo::DFloat.new(100, 100).rand
       x.report("nm") { nm.dot nm }
       x.report("na") { na.inplace.dot na }
     end
     '
     Warming up --------------------------------------
               nm   189.000 i/100ms
               na    60.000 i/100ms
     Calculating -------------------------------------
               nm     2.083k (± 8.0%) i/s - 10.395k in 5.022906s
               na   658.759 (± 7.4%) i/s -  3.300k in 5.039515s

  41. NMatrix and NArray compatibility
     ✓ Which is best?
       ‣Neither of them is best right now
     ✓ Interface and feature incompatibility
       ‣NumBuffer can't resolve this issue
     ✓ I want the two to be unified
       ‣NMatrix is good for sparse matrices
       ‣NArray is good for dense arrays

  42. Some Achievements
     ✓ Tutorials
       ‣100 narray exercises (by masa16 & kozo2)
       ‣10 minutes to daru (by kozo2)
       ‣pandas cookbook with daru (by kozo2)
       ‣Rewrite pandas doc with daru (by chart-linux)
     ✓ Installation
       ‣IRuby on Windows (by kimura)
       ‣ZeroMQ related things (by kozo2 & mrkn)

  43. Some Achievements
     ✓ NLP
       ‣Survey (by himkt)
     ✓ Machine Learning
       ‣Survey (by mrkn)
     ✓ Visualization
       ‣New plotly binding (by y4ashida)
     ✓ Other Languages
       ‣Ruby support in runr (by y4ashida)

  44. Let's go forward
     ✓ Join SciRuby development
       ‣English is preferred, but Japanese is OK
     ✓ A lot of issues are waiting for your contributions
       ‣Not only for machine learning
     ✓ Discuss in Slack
       ‣https://sciruby-slack.herokuapp.com