SciRuby Machine Learning Current Status and Future

Kenta Murata
RubyKaigi 2016
September 09, 2016

Transcript

  1. SciRuby Machine Learning Current Status and Future
     Kenta Murata, 2016.09.09, Kyoto, Japan
  2. self.introduce

  3. @mrkn ✓ Kenta Murata ✓ CRuby committer ✓ Started contributing to SciRuby
     last year ✓ Recruit Holdings Co., Ltd. Media Technology Lab.
  5. my gems ✓ bigdecimal ✓ daru-td ✓ iruby-rails ✓ enumerable-statistics

  6. enumerable-statistics.gem ✓ Compute statistical summaries as quickly and
     precisely as possible ‣Array#sum, Enumerable#sum (for Ruby < 2.4)
     ‣Array#mean, Enumerable#mean ‣Array#variance, Enumerable#variance ‣etc.
  7. enumerable-statistics.gem
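
     A minimal usage sketch of the gem, for illustration (the return values in
     the comments are computed by hand; Array#variance's default degrees of
     freedom is not stated on the slide, so check the gem's docs):

       require 'enumerable/statistics'

       values = [1, 2, 3, 4]
       puts values.sum       # => 10
       puts values.mean      # => 2.5
       puts values.variance  # sample variance (see the gem's docs for the ddof option)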

  8. Agenda ✓ Introduction ✓ Machine Learning ✓ SciRuby's Current Status
     ✓ Scikit-learn ✓ Future
  9. Introduction

  10. I want to ✓ Do machine learning with Ruby

  11. Machine Learning w/ Ruby ✓ What does it mean?

  12. NG ✓ Writing machine learning algorithms in Ruby

  13. OK ✓ Performing data science work with Ruby

  14. Data Science Workflow ✓ Collecting data ✓ Exploratory data analysis
      ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing
      ✓ Making a machine learning model ✓ Applying it to the real world
  15. Machine Learning Related Processes ✓ Collecting data ✓ Exploratory data
      analysis ✓ Cleansing data ✓ Integrating multiple data sources
      ✓ Preprocessing ✓ Making a machine learning model ✓ Applying it to the real world
  16. How many things can be done with Ruby? ✓ Exploratory data analysis
      ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing
      ✓ Making a machine learning model
  17. How many things can be done with Ruby? ✓ Exploratory data analysis
      ✓ Cleansing data ✓ Integrating multiple data sources ✓ Preprocessing
      ✓ Making a machine learning model
  18. How many things can be done in Ruby? ✓ Almost nothing
      ✓ Python can handle the entire workflow ✓ That's why everyone uses Python
  19. Change the Current Situation ✓ Make Ruby viable for data science
      ✓ What is wrong now? ✓ I'll make it clear in this talk
  20. First, the Most Important Thing in This Talk

  21. Help!! ✓ Join SciRuby development ✓ A lot of issues are waiting for your
      contribution ✓ Discussion happens in Slack ✓ https://sciruby-slack.herokuapp.com
  22. Machine Learning

  23. Why do we use machine learning? ✓ We want to make business decisions
      from real data ✓ The use of machine learning algorithms is optional
      ✓ We need machine learning to drive our business with "big data"
  24. Machine Learning can do ✓ Tasks impossible for humans ✓ Tasks whose
      solutions are difficult to program by hand ✓ Tasks for which the solution
      method is unknown
  25. For example ✓ Recommendation ✓ Outlier detection ✓ Sentiment analysis ✓ etc.
  26. Machine Learning Problems ✓ Supervised learning ✓ Unsupervised learning
      ✓ Reinforcement learning
  27. Supervised learning ✓ To learn a general rule that maps inputs to outputs
      from given example input-output pairs ✓ Two types of problems:
      ‣Classification - to predict what the weather will be tomorrow
      ‣Regression - to estimate the expected highest temperature tomorrow
      ✓ Example use cases: ‣Recommender system ‣Sentiment analysis
  28. Unsupervised learning ✓ To extract the structural features of the input
      data distribution ✓ Typical problem types: ‣Clustering ‣Density estimation
      ‣Dimensionality reduction ✓ Example use cases: ‣Exploratory data analysis
      ‣Outlier detection
  29. Reinforcement learning ✓ To learn rules of decision making in a dynamic
      environment ✓ Typical problem types: ‣Multi-armed bandit problem
      ‣Adaptive scheduling ‣Automatic control ✓ Example use cases: ‣Shogi AI
      ‣Autonomous driving
  30. Machine Learning Problems ✓ Supervised learning ✓ Unsupervised learning
      ✓ Reinforcement learning
  31. This talk focuses on ✓ Supervised learning ✓ Unsupervised learning
      ✓ Reinforcement learning
  32. SciRuby Machine Learning Current Status

  33. Existing Gems for Machine Learning ✓ liblinear-ruby.gem ✓ rb-libsvm.gem
      ✓ decisiontree.gem ✓ etc.
  34. liblinear-ruby.gem example

      require 'liblinear'

      # model parameters
      parameters = { solver_type: Liblinear::L2R_LR }

      # labels of training data
      labels = [-1, -1, 1, 1]

      # training data
      examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

      # train
      model = Liblinear.train(parameters, labels, examples)

      # predict (the result will be 1)
      puts Liblinear.predict(model, [0.5, 0.5])
  35. liblinear-ruby.gem features ✓ Just a wrapper of liblinear ✓ Logistic
      regression ‣Classification ✓ Linear SVC ‣Classification ✓ Linear SVR
      ‣Regression ✓ Cross validation
  36. liblinear-ruby.gem example

      require 'liblinear'

      # model parameters
      parameters = { solver_type: Liblinear::L2R_LR }

      # labels of training data
      labels = [-1, -1, 1, 1]

      # training data
      examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

      # train
      model = Liblinear.train(parameters, labels, examples)

      # predict (the result will be 1)
      puts Liblinear.predict(model, [0.5, 0.5])

      # cross validation
      fold = 5  # means 5-fold cross validation
      results = Liblinear.cross_validation(fold, parameters, labels, examples)
      accuracy = results.zip(labels).map {|a, b| a == b ? 1.0 : 0.0 }.sum / labels.length
      puts "Cross validation accuracy: #{accuracy}"
  37. rb-libsvm.gem features ✓ Just a wrapper of libsvm ✓ C-SVC, nu-SVC
      ‣Classification ✓ epsilon-SVR, nu-SVR ‣Regression ✓ One-class SVM
      ‣Unsupervised outlier detection ✓ Cross validation
  38. rb-libsvm.gem example

      require 'libsvm'
      require 'enumerable/statistics'

      # model parameters
      parameter = Libsvm::SvmParameter.new
      parameter.svm_type = Libsvm::SvmType::C_SVC
      parameter.kernel_type = Libsvm::KernelType::RBF
      parameter.cache_size = 1  # in megabytes
      parameter.eps = 0.001
      parameter.c = 10

      # labels of training data
      labels = [1, -1]

      # training data
      examples = [[1, 0, 1], [-1, 0, -1]].map {|xs| Libsvm::Node.features(xs) }

      # train model
      problem = Libsvm::Problem.new
      problem.set_examples(labels, examples)
      model = Libsvm::Model.train(problem, parameter)
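
      The slide stops after training. Prediction with the trained model would
      look like this (Libsvm::Model#predict is rb-libsvm's documented API; the
      query point is an assumption for illustration):

        # predict the label of a new point (expected: 1)
        puts model.predict(Libsvm::Node.features([1, 0, 1]))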
  39. decisiontree.gem features ✓ ID3 decision tree ‣Classification only
      ✓ No parameter configuration ‣e.g. criterion, minimum samples in leaf, etc.
      ✓ No cross validation ✓ Pure Ruby implementation
  40. decisiontree.gem usage

      require 'decisiontree'

      # training data (the last item of each example is its label)
      feature_names = ['hunger', 'color']
      examples = [
        [8, 'red', 'angry'],
        [6, 'red', 'angry'],
        [7, 'red', 'angry'],
        [7, 'blue', 'not angry'],
        [2, 'red', 'not angry'],
        [3, 'blue', 'not angry'],
        [2, 'blue', 'not angry'],
        [1, 'red', 'not angry']
      ]

      # train model
      tree = DecisionTree::ID3Tree.new(
        feature_names, examples, 'not angry',
        color: :discrete, hunger: :continuous
      )
      tree.train

      # prediction (expected: "angry")
      pred = tree.predict([7, 'red', 'angry'])
      puts "Predicted: #{pred} (expected: angry)"
  41. Etc. ✓ ai4r.gem ✓ classifier-reborn.gem ✓ data_mining.gem ✓ etc.

  42. With Existing Gems ✓ Several machine learning algorithms are provided
      for classification, regression, clustering, etc. ✓ Each must be used
      through its own library-specific code, because every library exposes a
      different API
  43. Issues of Existing Gems ✓ Different ways to specify model parameters
      ✓ Different ways and formats for supplying training data ✓ Many gems
      don't support cross validation ✓ Not suitable for practical use because
      of their toy implementations
  44. Real World Machine Learning

  45. Real World Data ✓ Large amounts of data ✓ High-dimensional features
      ✓ A lot of missing values
  46. Machine Learning in the Real World ✓ We can't look at the whole data
      ✓ We can't know in advance which algorithms suit the given data
      ✓ We must try, compare, and combine as many algorithms as possible
  47. Try, Compare, and Combine Multiple Algorithms ✓ Need to unify data
      formats ✓ Need to apply cross validation to all algorithms ✓ Need to
      unify the algorithms' interfaces so we can search for optimal
      hyperparameters and combine algorithms
  48. In Current SciRuby ✓ We can't build practical machine learning systems
      with SciRuby ✓ Python can, with scikit-learn
  49. Scikit-learn

  50. What is scikit-learn ✓ A machine learning framework for the SciPy stack

  51. SciPy stack ✓ NumPy ‣Dense tensors ✓ SciPy ‣Scientific functions
      ‣Sparse matrices ✓ pandas ‣Data frames ✓ Matplotlib ‣Visualization
      infrastructure ✓ Jupyter Notebook ✓ Etc.
  52. What is scikit-learn ✓ A machine learning framework for the SciPy stack
      ✓ Python's de facto standard for machine learning
  53. Scikit-learn is elegant ✓ Input data is a feature matrix plus a label
      vector for every algorithm ✓ The input type can be any object compatible
      with NumPy's ndarray ✓ All machine learning models follow a unified interface
  54. Logistic regression

      from sklearn.linear_model import LogisticRegression
      from sklearn.cross_validation import cross_val_score

      # labels of training data
      labels = [-1, -1, 1, 1]

      # training data
      examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

      # learning
      classifier = LogisticRegression(penalty="l2")
      classifier.fit(examples, labels)

      # prediction
      print(classifier.predict([[0.5, 0.5]]))

      # 5-fold cross validation
      classifier = LogisticRegression(penalty="l2")
      scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
      print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
  55. dmlc/xgboost

      import xgboost as xgb
      from sklearn.cross_validation import cross_val_score

      # labels of training data
      labels = [-1, -1, 1, 1]

      # training data
      examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

      # learning
      classifier = xgb.XGBClassifier()
      classifier.fit(examples, labels)

      # prediction
      print(classifier.predict([[0.5, 0.5]]))

      # 5-fold cross validation
      classifier = xgb.XGBClassifier()
      scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
      print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
  56. Grid search

      import numpy
      from sklearn.linear_model import LogisticRegression
      from sklearn.grid_search import GridSearchCV

      # labels of training data
      labels = [-1, -1, 1, 1]

      # training data
      examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

      # Finding the best parameter set by grid search
      parameters = {
          'penalty': ['l2', 'l1'],
          'C': numpy.logspace(-4, 4, 10)
      }
      classifier = GridSearchCV(LogisticRegression(), parameters, cv=5)
      classifier.fit(examples, labels)

      # Report the best parameters
      best_params = classifier.best_estimator_.get_params()
      print('Best parameters = {}'.format(best_params))
  57. Combination with Pipeline

      import numpy
      from sklearn import svm
      from sklearn.decomposition import PCA
      from sklearn.grid_search import GridSearchCV
      from sklearn.pipeline import Pipeline

      # Combine PCA and SVC in a pipeline
      pipeline = Pipeline([
          ('pca', PCA()),
          ('svc', svm.SVC())
      ])

      # Finding the best parameter set by grid search
      parameters = {
          'pca__n_components': range(2, 6),
          'svc__kernel': ['linear', 'rbf'],
          'svc__C': numpy.logspace(-4, 4, 10),
          'svc__gamma': numpy.logspace(-4, 4, 10)
      }
      classifier = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1)
      classifier.fit(features, labels)  # features and labels prepared beforehand

      # Report the best parameter set
      best_params = classifier.best_estimator_.get_params()
      print('Best parameters = {}'.format(best_params))
  58. With scikit-learn ✓ We can prepare training data in a common format
      ‣vectors for labels and matrices for features ✓ We can use all the
      algorithms through the same interface ✓ We can build combined models
      with Pipeline ✓ We can run grid search to optimize hyperparameters
  59. Scikit-learn is a standard ✓ Several libraries provide a
      scikit-learn-compatible interface ‣xgboost ‣tensorflow
  60. Scikit-learn is an ideal framework for machine learning

  61. The Future of SciRuby in Machine Learning

  62. Key Point ✓ Make a scikit-learn-like framework available to Ruby programs
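
      To make "scikit-learn-like" concrete: every estimator exposes the same
      fit/predict contract. Below is a minimal sketch in plain Ruby;
      NearestCentroid is an invented toy estimator used only to illustrate the
      interface, not an existing gem:

        # A toy estimator following the scikit-learn-style fit/predict contract.
        class NearestCentroid
          # learn one centroid (per-dimension mean) for each label
          def fit(examples, labels)
            @centroids = examples.zip(labels).group_by(&:last).map do |label, pairs|
              xs = pairs.map(&:first)
              [label, xs.transpose.map {|col| col.inject(:+).to_f / col.size }]
            end
            self
          end

          # assign each example the label of its nearest centroid
          def predict(examples)
            examples.map do |x|
              @centroids.min_by {|_, c| c.zip(x).map {|a, b| (a - b)**2 }.inject(:+) }.first
            end
          end
        end

        model = NearestCentroid.new
        model.fit([[-2, -2], [-1, -1], [1, 1], [2, 2]], [-1, -1, 1, 1])
        p model.predict([[0.5, 0.5]])  # => [1]

      Because every model shares this contract, cross validation, grid search,
      and pipelines can be written once against the interface instead of once
      per library.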
  63. Two ways ✓ Make scikit-learn itself available from Ruby
      ✓ Write our own scikit-learn-like libraries in Ruby
  64. Use scikit-learn itself ✓ Learn from PyCall.jl and ScikitLearn.jl
      ✓ PyCall.jl ‣Call Python from Julia code ✓ ScikitLearn.jl ‣Binding to
      scikit-learn via PyCall.jl ✓ Make pycall.gem and scikit-learn.gem
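
      A sketch of what such a pycall.gem could look like from Ruby. The gem did
      not exist yet at the time of this talk, so the require path,
      PyCall::Import, and pyfrom below are assumptions modeled on PyCall.jl:

        require 'pycall/import'
        include PyCall::Import

        # import scikit-learn's LogisticRegression through the Python bridge
        pyfrom 'sklearn.linear_model', import: :LogisticRegression

        labels   = [-1, -1, 1, 1]
        examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]

        classifier = LogisticRegression.new(penalty: 'l2')
        classifier.fit(examples, labels)
        puts classifier.predict([[0.5, 0.5]])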
  65. Make scikit-learn-like libraries ✓ Very hard work ✓ Need a Cython-like
      system to make writing extension libraries easy ‣rubex, planned by v0dro
      ✓ Numerical arrays
  66. Numerical array issues ✓ NMatrix ✓ Numo::NArray ✓ NumBuffer

  67. NMatrix ✓ Slow implementation ✓ Lack of linear algebra operations for
      sparse matrices ✓ Installation issues
  68. Numo::NArray ✓ Lack of sparse matrix features ✓ Too few supported libraries
  69. NumBuffer ✓ What is it? ‣Supports exchanging numerical array data among
      different libraries ✓ I am currently the only developer ✓ More
      contributors needed
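
      Today, without such a shared buffer, moving data between the two array
      libraries means copying element by element through plain Ruby arrays. A
      minimal sketch of that status quo (assuming NMatrix#to_a returns nested
      arrays and Numo's bracket constructor accepts them):

        require 'nmatrix'
        require 'numo/narray'

        # build a 2x2 NMatrix, then copy it into a Numo::NArray via Ruby arrays
        nm = NMatrix.new([2, 2], [1.0, 2.0, 3.0, 4.0], dtype: :float64)
        na = Numo::DFloat[*nm.to_a]  # full copy, not shared memory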
  70. Benchmark

      $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
        Benchmark.ips do |x|
          ar = Array.new(100*100) { rand }
          nm = NMatrix.random [100*100]
          na = Numo::DFloat.new(100*100).rand
          x.report("ar") { Array.new(ar.length) {|i| ar[i] + ar[i] } }
          x.report("nm") { nm + nm }
          x.report("na") { na + na }
        end
      '
      Warming up --------------------------------------
                ar   111.000 i/100ms
                nm    59.000 i/100ms
                na     3.133k i/100ms
      Calculating -------------------------------------
                ar     1.068k (±12.3%) i/s -   5.328k in 5.078079s
                nm   618.334 (±10.0%) i/s -   3.068k in 5.021136s
                na    34.110k (±19.0%) i/s - 166.049k in 5.028910s
  71. Benchmark

      $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
        Benchmark.ips do |x|
          nm = NMatrix.random [100, 100]
          na = Numo::DFloat.new(100, 100).rand
          x.report("nm") { nm.dot nm }
          x.report("na") { na.inplace.dot na }
        end
      '
      Warming up --------------------------------------
                nm   189.000 i/100ms
                na    60.000 i/100ms
      Calculating -------------------------------------
                nm     2.083k (± 8.0%) i/s -  10.395k in 5.022906s
                na   658.759 (± 7.4%) i/s -   3.300k in 5.039515s
  72. NMatrix and NArray compatibility ✓ Which is best? ‣Neither of them is
      best right now ✓ Their interfaces and features are incompatible
      ‣NumBuffer can't resolve this issue ✓ I want the two to be unified
      ‣NMatrix is good for sparse matrices ‣NArray is good for dense arrays
  73. SciRuby JP ✓ The SciRuby developer community in Japan ✓ Performed survey
      studies this summer
  74. Some Achievements ✓ Tutorials ‣100 narray exercises (by masa16 & kozo2)
      ‣10 minutes to daru (by kozo2) ‣pandas cookbook with daru (by kozo2)
      ‣Rewrite pandas doc with daru (by chart-linux) ✓ Installation
      ‣IRuby on Windows (by kimura) ‣ZeroMQ related things (by kozo2 & mrkn)
  75. Some Achievements ✓ NLP ‣Survey (by himkt) ✓ Machine Learning
      ‣Survey (by mrkn) ✓ Visualization ‣New plotly binding (by y4ashida)
      ✓ Other Languages ‣Ruby support in runr (by y4ashida)
  76. Let's go forward ✓ Join SciRuby contribution ‣English is preferred, but
      Japanese is OK ✓ A lot of issues are waiting for your contribution
      ‣Not only for machine learning ✓ Discuss in Slack ‣https://sciruby-slack.herokuapp.com