SciRuby Machine Learning Current Status and Future

Kenta Murata
RubyKaigi 2016
September 09, 2016

Transcript

  1. SciRuby Machine Learning Current Status and Future (Kenta Murata, 2016.09.09, Kyoto, Japan)
  2. self.introduce

  3. @mrkn ✓ Kenta Murata ✓ CRuby committer ✓ Started contributing to SciRuby last year ✓ Recruit Holdings Co., Ltd., Media Technology Lab.
  4. None
  5. my gems ✓ bigdecimal ✓ daru-td ✓ iruby-rails ✓ enumerable-statistics

  6. enumerable-statistics.gem ✓ Computes statistical summaries as fast and as precisely as possible ‣Array#sum, Enumerable#sum (for Ruby < 2.4) ‣Array#mean, Enumerable#mean ‣Array#variance, Enumerable#variance ‣etc.
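
     For illustration, a minimal usage sketch (assuming the gem is installed; the default normalization used by #variance is documented in the gem):

     require 'enumerable/statistics'

     data = [1, 2, 3, 4, 5]
     data.sum       # => 15
     data.mean      # => 3.0
     data.variance  # variance of the data
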
  7. enumerable-statistics.gem

  8. Agenda ✓ Introduction ✓ Machine Learning ✓ SciRuby's Current Status ✓ Scikit-learn ✓ Future
  9. Introduction

  10. I want to ✓ Do machine learning with Ruby

  11. Machine Learning w/ Ruby ✓ What does it mean?

  12. NG ✓ Write machine learning algorithms in Ruby

  13. OK ✓ Perform data science work with Ruby

  14. Data Science Workflow ✓ Collecting data ✓ Exploratory data analysis ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing ✓ Making machine learning model ✓ Applying to real world
  15. Machine Learning Related Processes ✓ Collecting data ✓ Exploratory data analysis ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing ✓ Making machine learning model ✓ Applying to real world
  16. How many things can be done with Ruby? ✓ Exploratory data analysis ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing ✓ Making machine learning model
  17. How many things can be done with Ruby? ✓ Exploratory data analysis ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing ✓ Making machine learning model
  18. How many things can be done in Ruby? ✓ Generally, none ✓ Python can cover the whole workflow ✓ That's why everyone uses Python
  19. Change the Current Situation ✓ Make Ruby available for Data Science ✓ What is wrong now? ✓ I'll make that clear in this talk
  20. First, the Most Important Thing in This Talk

  21. Help!! ✓ Join SciRuby development ✓ A lot of issues are waiting for your contribution ✓ Discussion happens in Slack ✓ https://sciruby-slack.herokuapp.com
  22. Machine Learning

  23. Why do we use machine learning? ✓ We want to make business decisions from real data ✓ The use of machine learning algorithms is optional ✓ We need machine learning to drive our business with "big data"
  24. Machine Learning can do ✓ Tasks impossible for humans ✓ Tasks whose solutions are difficult to program by hand ✓ Tasks where how to solve them is unknown
  25. For example ✓ Recommendation ✓ Outlier detection ✓ Sentiment analysis ✓ etc.
  26. Machine Learning Problems ✓ Supervised learning ✓ Unsupervised learning ✓ Reinforcement learning
  27. Supervised learning ✓ Learn a general rule that maps inputs to outputs from given example input-output pairs ✓ Two types of problems: ‣Classification - predict what tomorrow's weather will be ‣Regression - estimate tomorrow's expected highest temperature ✓ Example use cases: ‣Recommender systems ‣Sentiment analysis
  28. Unsupervised learning ✓ Extract the structural features of the input data distribution ✓ Typical problem types: ‣Clustering ‣Density estimation ‣Dimensionality reduction ✓ Example use cases: ‣Exploratory data analysis ‣Outlier detection
  29. Reinforcement learning ✓ Learn rules of decision making in a dynamic environment ✓ Typical problem types: ‣Multi-armed bandit problem ‣Adaptive scheduling ‣Automatic control ✓ Example use cases: ‣Shogi AI ‣Automatic car driving
  30. Machine Learning Problems ✓ Supervised learning ✓ Unsupervised learning ✓ Reinforcement learning
  31. This talk focuses on ✓ Supervised learning ✓ Unsupervised learning ✓ Reinforcement learning
  32. SciRuby Machine Learning Current Status

  33. Existing Gems for Machine Learning ✓ liblinear-ruby.gem ✓ rb-libsvm.gem ✓ decisiontree.gem ✓ etc.
  34. liblinear-ruby.gem example

     require 'liblinear'

     # model parameters
     parameters = { solver_type: Liblinear::L2R_LR }

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # train
     model = Liblinear.train(parameters, labels, examples)

     # predict (the result will be 1)
     puts Liblinear.predict(model, [0.5, 0.5])
  35. liblinear-ruby.gem features ✓ Just a wrapper of liblinear ✓ Logistic regression ‣Classification ✓ Linear SVC ‣Classification ✓ Linear SVR ‣Regression ✓ Cross validation
  36. liblinear-ruby.gem example

     require 'liblinear'

     # model parameters
     parameters = { solver_type: Liblinear::L2R_LR }

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # train
     model = Liblinear.train(parameters, labels, examples)

     # predict (the result will be 1)
     puts Liblinear.predict(model, [0.5, 0.5])

     # cross validation
     fold = 5  # means 5-fold cross validation
     results = Liblinear.cross_validation(fold, parameters, labels, examples)
     accuracy = results.zip(labels).map {|a, b| a == b ? 1.0 : 0.0 }.sum / labels.length
     puts "Cross validation accuracy: #{accuracy}"
  37. rb-libsvm.gem features ✓ Just a wrapper of libsvm ✓ C-SVC, nu-SVC ‣Classification ✓ epsilon-SVR, nu-SVR ‣Regression ✓ One-class SVM ‣Unsupervised outlier detection ✓ Cross validation
  38. rb-libsvm.gem example

     require 'libsvm'
     require 'enumerable/statistics'

     # model parameters
     parameter = Libsvm::SvmParameter.new
     parameter.svm_type = Libsvm::SvmType::C_SVC
     parameter.kernel_type = Libsvm::KernelType::RBF
     parameter.cache_size = 1 # in megabytes
     parameter.eps = 0.001
     parameter.c = 10

     # labels of training data
     labels = [1, -1]

     # training data
     examples = [[1, 0, 1], [-1, 0, -1]].map {|xs| Libsvm::Node.features(xs) }

     # train model
     problem = Libsvm::Problem.new
     problem.set_examples(labels, examples)
     model = Libsvm::Model.train(problem, parameter)
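
     The slide stops after training; a prediction with the trained model would look like the following minimal continuation (not shown on the slide):

     # predict the label of a new example (should print 1)
     puts model.predict(Libsvm::Node.features([1, 0, 1]))
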
  39. decisiontree.gem features ✓ ID3 decision tree ‣Classification only ✓ No parameter configuration ‣e.g. criterion, minimum samples in leaf, etc. ✓ No cross validation ✓ Pure Ruby implementation
  40. decisiontree.gem usage

     require 'decisiontree'

     # training data (the last item of each row is the label)
     feature_names = ['hunger', 'color']
     examples = [
       [8, 'red',  'angry'],
       [6, 'red',  'angry'],
       [7, 'red',  'angry'],
       [7, 'blue', 'not angry'],
       [2, 'red',  'not angry'],
       [3, 'blue', 'not angry'],
       [2, 'blue', 'not angry'],
       [1, 'red',  'not angry']
     ]

     # train model
     tree = DecisionTree::ID3Tree.new(
       feature_names, examples, 'not angry',
       hunger: :continuous, color: :discrete
     )
     tree.train

     # prediction
     pred = tree.predict([7, 'red', 'angry'])
     puts "Predicted: #{pred} (true label: angry)"
  41. Etc. ✓ ai4r.gem ✓ classifier-reborn.gem ✓ data_mining.gem ✓ etc.

  42. With Existing Gems ✓ Several machine learning algorithms are provided for classification, regression, clustering, etc. ✓ But each algorithm must be used through its own library-specific API, because the APIs all differ
  43. Issues of Existing Gems ✓ Different ways to specify model parameters ✓ Different ways and formats of training data ✓ Many gems don't support cross validation ✓ Not practical to use because of their toy implementations
  44. Real World Machine Learning

  45. Real World Data ✓ Large amount of data ✓ High-dimensional features ✓ A lot of missing values
  46. Machine Learning in the Real World ✓ We can't look at the whole data ✓ We can't know in advance which algorithms suit the given data ✓ We must try, compare, and combine as many algorithms as possible
  47. Try, Compare, and Combine multiple algorithms ✓ Need to unify data formats ✓ Need to apply cross validation to all algorithms ✓ Need to unify the interfaces of the algorithms so we can search for optimal hyperparameters and combine algorithms (see the sketch below)
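
     To make the argument concrete, here is a minimal sketch of what such a unified interface could look like in Ruby. It is purely illustrative (no such gem existed at the time of this talk) and the class and method names are made up, but it shows how cross validation can be written once for any model that responds to fit/predict:

     # A toy "model": predicts 1 if the mean of the features is positive.
     class MeanSignClassifier
       def fit(examples, labels)
         self # nothing to learn for this toy model
       end

       def predict(examples)
         examples.map { |xs| xs.inject(:+) / xs.length.to_f > 0 ? 1 : -1 }
       end
     end

     # Works for any object that responds to fit/predict.
     def cross_validation_accuracy(model, examples, labels, folds: 2)
       indices = (0...examples.length).to_a
       chunks  = indices.each_slice((examples.length / folds.to_f).ceil).to_a
       scores  = chunks.map do |test_idx|
         train_idx = indices - test_idx
         model.fit(examples.values_at(*train_idx), labels.values_at(*train_idx))
         preds = model.predict(examples.values_at(*test_idx))
         hits  = preds.zip(labels.values_at(*test_idx)).count { |p, t| p == t }
         hits / test_idx.length.to_f
       end
       scores.inject(:+) / scores.length
     end

     labels   = [-1, -1, 1, 1]
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]
     puts cross_validation_accuracy(MeanSignClassifier.new, examples, labels)
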
  48. In Current SciRuby ✓ We can't build practical machine learning systems with SciRuby ✓ Python can, with scikit-learn
  49. Scikit-learn

  50. What is scikit-learn ✓ A machine learning framework for the SciPy stack

  51. SciPy stack ✓ NumPy ‣Dense tensors ✓ SciPy ‣Scientific functions ‣Sparse matrices ✓ pandas ‣Data frames ✓ Matplotlib ‣Visualization infrastructure ✓ Jupyter Notebook ✓ Etc.
  52. What is scikit-learn ✓ A machine learning framework for the SciPy stack ✓ Python's standard for machine learning
  53. Scikit-learn is elegant ✓ The input data is a feature matrix and a label vector for all algorithms ✓ The input data type can be any object compatible with NumPy's ndarray ✓ Machine learning models follow a unified interface
  54. Logistic regression

     from sklearn.linear_model import LogisticRegression
     from sklearn.cross_validation import cross_val_score

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # learning
     classifier = LogisticRegression(penalty="l2")
     classifier.fit(examples, labels)

     # prediction
     print(classifier.predict([[0.5, 0.5]]))

     # 5-fold cross validation
     classifier = LogisticRegression(penalty="l2")
     scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
     print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
  55. dmlc/xgboost

     import xgboost as xgb
     from sklearn.cross_validation import cross_val_score

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # learning
     classifier = xgb.XGBClassifier()
     classifier.fit(examples, labels)

     # prediction
     print(classifier.predict([[0.5, 0.5]]))

     # 5-fold cross validation
     classifier = xgb.XGBClassifier()
     scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
     print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
  56. Grid search

     import numpy
     from sklearn.linear_model import LogisticRegression
     from sklearn.grid_search import GridSearchCV

     # labels of training data
     labels = [-1, -1, 1, 1]

     # training data
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     # finding the best parameter set by grid search
     parameters = {
       'penalty': ['l2', 'l1'],
       'C': numpy.logspace(-4, 4, 10)
     }
     classifier = GridSearchCV(LogisticRegression(), parameters, cv=5)
     classifier.fit(examples, labels)

     # report the best parameters
     best_params = classifier.best_estimator_.get_params()
     print('Best parameters = {}'.format(best_params))
  57. Combination with Pipeline

     import numpy
     from sklearn import svm
     from sklearn.decomposition import PCA
     from sklearn.grid_search import GridSearchCV
     from sklearn.pipeline import Pipeline

     # combine PCA and SVC by a pipeline
     pipeline = Pipeline([
       ('pca', PCA()),
       ('svc', svm.SVC())
     ])

     # finding the best parameter set by grid search
     parameters = {
       'pca__n_components': range(2, 6),
       'svc__kernel': ['linear', 'rbf'],
       'svc__C': numpy.logspace(-4, 4, 10),
       'svc__gamma': numpy.logspace(-4, 4, 10)
     }
     classifier = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1)
     # features and labels are the training data prepared beforehand
     classifier.fit(features, labels)

     # report the best parameter set
     best_params = classifier.best_estimator_.get_params()
     print('Best parameters = {}'.format(best_params))
  58. With scikit-learn ✓ We can prepare training data in a common format ‣vectors for labels and matrices for features ✓ We can use all the algorithms through the same interface ✓ We can build a combined model by using Pipeline ✓ We can run grid search to optimize hyperparameters
  59. Scikit-learn is a standard ✓ Several libraries provide a scikit-learn-compatible interface ‣xgboost ‣tensorflow
  60. Scikit-learn is an ideal framework for machine learning

  61. The Future of SciRuby in Machine Learning

  62. Key Point ✓ Make a scikit-learn-like thing available for Ruby programs
  63. Two ways ✓ Make scikit-learn itself available from Ruby ✓ Write our own scikit-learn-like libraries in Ruby
  64. Use scikit-learn itself ✓ Learn from PyCall.jl and ScikitLearn.jl ✓ PyCall.jl ‣Calls Python things from Julia code ✓ ScikitLearn.jl ‣A binding to scikit-learn via PyCall.jl ✓ Make pycall.gem and scikit-learn.gem (a sketch follows)
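
     A hypothetical sketch of what calling scikit-learn through such a pycall gem might look like. The gem was only planned at the time of this talk, so the API below is illustrative, modeled on PyCall.jl:

     # Hypothetical: pycall.gem did not exist yet; names are illustrative.
     require 'pycall/import'
     include PyCall::Import

     pyfrom 'sklearn.linear_model', import: :LogisticRegression

     labels   = [-1, -1, 1, 1]
     examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

     classifier = LogisticRegression.new(penalty: 'l2')
     classifier.fit(examples, labels)
     puts classifier.predict([[0.5, 0.5]]).to_a
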
  65. Make scikit-learn-like libraries ✓ Very hard work ✓ Need a Cython-like system to make writing extension libraries easy ‣rubex, planned by v0dro ✓ Numerical arrays
  66. Numerical array issues ✓ NMatrix ✓ Numo::NArray ✓ NumBuffer

  67. NMatrix ✓ Slow implementation ✓ Lack of linear algebra operations for sparse matrices ✓ Installation issues
  68. Numo::NArray ✓ Lack of sparse matrix features ✓ Too few supported libraries
  69. NumBuffer ✓ What is it? ‣Supports exchanging numerical array data among different libraries ✓ I am the only developer ✓ More contributors are needed
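
     To illustrate the problem NumBuffer targets, here is a hedged sketch of exchanging data between the two array libraries without a shared buffer format. The NMatrix and Numo constructors are real; the point is that the data currently has to be copied through a plain Ruby Array:

     require 'nmatrix'
     require 'numo/narray'

     na = Numo::DFloat.new(3, 3).rand

     # Without a common buffer, the data is copied element by element
     # through a plain Ruby Array to reach the other library.
     nm = NMatrix.new([3, 3], na.to_a.flatten, dtype: :float64)
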
  70. Benchmark

     $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
       Benchmark.ips do |x|
         ar = Array.new(100*100) { rand }
         nm = NMatrix.random [100*100]
         na = Numo::DFloat.new(100*100).rand
         x.report("ar") { Array.new(ar.length) {|i| ar[i] + ar[i] } }
         x.report("nm") { nm + nm }
         x.report("na") { na + na }
       end
     '
     Warming up --------------------------------------
                   ar   111.000 i/100ms
                   nm    59.000 i/100ms
                   na     3.133k i/100ms
     Calculating -------------------------------------
                   ar     1.068k (±12.3%) i/s -   5.328k in 5.078079s
                   nm   618.334 (±10.0%) i/s -   3.068k in 5.021136s
                   na    34.110k (±19.0%) i/s - 166.049k in 5.028910s
  71. Benchmark

     $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
       Benchmark.ips do |x|
         nm = NMatrix.random [100, 100]
         na = Numo::DFloat.new(100, 100).rand
         x.report("nm") { nm.dot nm }
         x.report("na") { na.inplace.dot na }
       end
     '
     Warming up --------------------------------------
                   nm   189.000 i/100ms
                   na    60.000 i/100ms
     Calculating -------------------------------------
                   nm     2.083k (± 8.0%) i/s -  10.395k in 5.022906s
                   na   658.759 (± 7.4%) i/s -   3.300k in 5.039515s
  72. NMatrix and NArray compatibility ✓ Which is best? ‣Neither of them is best right now ✓ Interface and feature incompatibility ‣NumBuffer can't resolve this issue ✓ I want the two to be unified ‣NMatrix is good for sparse matrices ‣NArray is good for dense arrays
  73. SciRuby JP ✓ SciRuby developer community in Japan ✓ Performed a survey study this summer
  74. Some Achievements ✓ Tutorials ‣100 narray exercises (by masa16 & kozo2) ‣10 minutes to daru (by kozo2) ‣pandas cookbook with daru (by kozo2) ‣Rewrite pandas doc with daru (by chart-linux) ✓ Installation ‣IRuby on Windows (by kimura) ‣ZeroMQ related things (by kozo2 & mrkn)
  75. Some Achievements ✓ NLP ‣Survey (by himkt) ✓ Machine Learning ‣Survey (by mrkn) ✓ Visualization ‣New plotly binding (by y4ashida) ✓ Other Languages ‣Ruby support in runr (by y4ashida)
  76. Let's go forward ✓ Join the SciRuby contributors ‣English is preferred, but Japanese is OK ✓ A lot of issues are waiting for your contribution ‣Not only for machine learning ✓ Discuss in Slack ‣https://sciruby-slack.herokuapp.com