SciRuby Machine Learning Current Status and Future

Kenta Murata
RubyKaigi 2016
September 09, 2016

Transcript

  1. SciRuby

    Machine Learning
    Current Status and Future
    Kenta Murata
    2016.09.09 Kyoto Japan


  2. self.introduce


  3. @mrkn
    ✓ Kenta Murata
    ✓ CRuby committer
    ✓ Contributing to SciRuby since last year
    ✓ Recruit Holdings Co., Ltd.

    Media Technology Lab.



  5. my gems
    ✓ bigdecimal
    ✓ daru-td
    ✓ iruby-rails
    ✓ enumerable-statistics


  6. enumerable-statistics.gem
    ✓ Compute statistical summaries as fast
    and as precisely as possible (usage sketched below)
    ‣Array#sum, Enumerable#sum (for Ruby < 2.4)
    ‣Array#mean, Enumerable#mean
    ‣Array#variance, Enumerable#variance
    ‣etc.
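
    A minimal usage sketch of the methods listed above (note: whether #variance returns the sample or the population variance depends on the gem's default, so its output is not asserted here):

    require 'enumerable/statistics'
    values = [1, 2, 3, 4]
    p values.sum       # => 10
    p values.mean      # => 2.5
    p values.variance  # sample vs. population depends on the gem's default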


  7. enumerable-statistics.gem


  8. Agenda
    ✓ Introduction
    ✓ Machine Learning
    ✓ SciRuby's Current Status
    ✓ Scikit-learn
    ✓ Future


  9. Introduction


  10. I want to
    ✓ Do machine learning with Ruby


  11. Machine Learning w/
    Ruby
    ✓ What does it mean?


  12. NG
    ✓ Write machine learning algorithms
    in Ruby


  13. OK
    ✓ Perform data science work with
    Ruby


  14. Data Science
    Workflow
    ✓ Collecting data
    ✓ Exploratory data analysis
    ✓ Cleansing data
    ✓ Integrating multiple data sources
    ✓ Preprocessing
    ✓ Making machine learning models
    ✓ Applying models to the real world


  15. Machine Learning
    Related Processes
    ✓ Collecting data
    ✓ Exploratory data analysis
    ✓ Cleansing data
    ✓ Integrating multiple data sources
    ✓ Preprocessing
    ✓ Making machine learning models
    ✓ Applying models to the real world


  16. How many things can
    be done with Ruby?
    ✓ Exploratory data analysis
    ✓ Cleansing data
    ✓ Integrating multiple data sources
    ✓ Preprocessing
    ✓ Making machine learning models


  18. How many things can
    be done in Ruby?
    ✓ Almost none of them
    ✓ Python can handle the whole workflow
    ✓ That's why everyone uses Python


  19. Change the Current
    Situation
    ✓ Make Ruby usable for data science
    ✓ What is wrong now?
    ✓ I'll make it clear in this talk


  20. First, the Most Important Thing

    in This Talk


  21. Help!!
    ✓ Join SciRuby development
    ✓ A lot of issues are waiting for your
    contribution
    ✓ Discussion in Slack
    ✓ https://sciruby-slack.herokuapp.com


  22. Machine Learning


  23. Why do we use machine
    learning?
    ✓ We want to make business
    decisions from real data
    ✓ The use of machine learning
    algorithms is optional
    ✓ We need machine learning to drive
    our business with "big data"


  24. Machine Learning can do
    ✓ Tasks impossible for humans
    ✓ Tasks whose solutions are difficult to
    program by hand
    ✓ Tasks whose solution method is unknown


  25. For example
    ✓ Recommendation
    ✓ Outlier detection
    ✓ Sentiment analysis
    ✓ etc.


  26. Machine Learning Problems
    ✓ Supervised learning
    ✓ Unsupervised learning
    ✓ Reinforcement learning


  27. Supervised learning
    ✓ To learn a general rule that maps inputs to outputs from the given
    example input-output pairs
    ✓ Two types of problems:
    ‣Classification
    - To predict tomorrow's weather
    ‣Regression
    - To estimate tomorrow's expected highest temperature (sketched below)
    ✓ Example use cases:
    ‣Recommender system
    ‣Sentiment analysis
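
    As a toy illustration of the regression case, the sketch below learns a rule y ≈ a*x + b from example input-output pairs by ordinary least squares; the data values are invented for illustration:

    # example input-output pairs: noisy samples of y = 2x
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 3.9, 6.2, 8.1]
    n  = xs.length.to_f
    mx = xs.inject(:+) / n
    my = ys.inject(:+) / n
    # least-squares slope and intercept
    a = xs.zip(ys).inject(0.0) {|s, (x, y)| s + (x - mx) * (y - my) } /
        xs.inject(0.0) {|s, x| s + (x - mx)**2 }
    b = my - a * mx
    # the learned rule generalizes to unseen inputs
    puts "prediction for x = 5: #{(a * 5 + b).round(2)}"  # => 10.15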


  28. Unsupervised learning
    ✓ To extract the structural features of the input data distribution
    ✓ Typical problem types:
    ‣Clustering (sketched below)
    ‣Density estimation
    ‣Dimensionality reduction
    ✓ Example use cases:
    ‣Exploratory data analysis
    ‣Outlier detection
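
    As a toy illustration of clustering (the first problem type above), the sketch below runs a one-dimensional k-means with k = 2 in plain Ruby; the data points are invented for illustration:

    points    = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
    centroids = [points.first, points.last]  # naive initialization
    10.times do
      # assignment step: group each point with its nearest centroid
      clusters = points.group_by {|p| centroids.min_by {|c| (p - c).abs } }
      # update step: move each centroid to the mean of its cluster
      centroids = centroids.map do |c|
        members = clusters[c] || [c]
        members.inject(:+) / members.length
      end
    end
    p centroids  # => roughly [1.0, 5.07]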


  29. Reinforcement
    learning
    ✓ To learn rules of decision making in a dynamic environment
    ✓ Typical problem types:
    ‣Multi-armed bandit problem
    ‣Adaptive scheduling
    ‣Automatic control
    ✓ Example use cases:
    ‣Shogi AI
    ‣Self-driving cars


  30. Machine Learning Problems
    ✓ Supervised learning
    ✓ Unsupervised learning
    ✓ Reinforcement learning


  31. This talk focuses on
    ✓ Supervised learning
    ✓ Unsupervised learning
    ✓ Reinforcement learning


  32. SciRuby Machine Learning

    Current Status


  33. Existing Gems

    for Machine Learning
    ✓ liblinear-ruby.gem
    ✓ rb-libsvm.gem
    ✓ decisiontree.gem
    ✓ etc.


  34. liblinear-ruby.gem example
    require 'liblinear'
    # model parameters
    parameters = { solver_type: Liblinear::L2R_LR }
    # labels of training data
    labels = [-1, -1, 1, 1]
    # training data
    examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]
    # train
    model = Liblinear.train(parameters, labels, examples)
    # predict (the result will be 1)
    puts Liblinear.predict(model, [0.5, 0.5])


  35. liblinear-ruby.gem
    features
    ✓ Just a wrapper of LIBLINEAR
    ✓ Logistic regression
    ‣Classification
    ✓ Linear SVC
    ‣Classification
    ✓ Linear SVR
    ‣Regression
    ✓ Cross validation


  36. liblinear-ruby.gem example
    require 'liblinear'
    # model parameters
    parameters = { solver_type: Liblinear::L2R_LR }
    # labels of training data
    labels = [-1, -1, 1, 1]
    # training data
    examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]
    # train
    model = Liblinear.train(parameters, labels, examples)
    # predict (the result will be 1)
    puts Liblinear.predict(model, [0.5, 0.5])
    # cross validation
    fold = 5 # Means 5-fold cross validation
    results = Liblinear.cross_validation(fold, parameters, labels, examples)
    accuracy = results.zip(labels).map {|a, b| a == b ? 1.0 : 0.0 }.sum / labels.length
    puts "Cross validation accuracy: #{accuracy}"


  37. rb-libsvm.gem
    features
    ✓ Just a wrapper of LIBSVM
    ✓ C-SVC, nu-SVC
    ‣Classification
    ✓ epsilon-SVR, nu-SVR
    ‣Regression
    ✓ One-class SVM
    ‣Unsupervised outlier detection
    ✓ Cross validation


  38. rb-libsvm.gem example
    require 'libsvm'
    require 'enumerable/statistics'
    # model parameters
    parameter = Libsvm::SvmParameter.new
    parameter.svm_type = Libsvm::SvmType::C_SVC
    parameter.kernel_type = Libsvm::KernelType::RBF
    parameter.cache_size = 1 # in megabytes
    parameter.eps = 0.001
    parameter.c = 10
    # labels of training data
    labels = [1, -1]
    # training data
    examples = [[1, 0, 1], [-1, 0, -1]].map {|xs| Libsvm::Node.features(xs) }
    # train model
    problem = Libsvm::Problem.new
    problem.set_examples(labels, examples)
    model = Libsvm::Model.train(problem, parameter)
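
    The slide stops after training; prediction is a one-liner (a minimal continuation, using Model#predict as documented in rb-libsvm's README):

    # predict (should return 1.0 for a point near the positive example)
    pred = model.predict(Libsvm::Node.features([1, 0, 1]))
    puts "Predicted label: #{pred}"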


  39. decisiontree.gem
    features
    ✓ ID3 decision tree
    ‣Classification only
    ✓ No parameter configuration
    ‣e.g. criterion, minimum samples in leaf, etc.
    ✓ No cross validation
    ✓ Pure Ruby implementation


  40. decisiontree.gem usage
    require 'decisiontree'
    # training data (the last item in each row is the label)
    feature_names = ['hunger', 'color']
    examples = [
    [8, 'red', 'angry'],
    [6, 'red', 'angry'],
    [7, 'red', 'angry'],
    [7, 'blue', 'not angry'],
    [2, 'red', 'not angry'],
    [3, 'blue', 'not angry'],
    [2, 'blue', 'not angry'],
    [1, 'red', 'not angry']
    ]
    # train model ('not angry' is the default label)
    tree = DecisionTree::ID3Tree.new(
    feature_names, examples, 'not angry',
    hunger: :continuous, color: :discrete
    )
    tree.train
    # prediction
    pred = tree.predict([7, 'red', 'angry'])
    puts "Predicted: #{pred} (actual: angry)"


  41. Etc.
    ✓ ai4r.gem
    ✓ classifier-reborn.gem
    ✓ data_mining.gem
    ✓ etc.


  42. With Existing Gems
    ✓ Several machine learning
    algorithms are provided for
    classification, regression,
    clustering, etc.
    ✓ We must use each algorithm through a
    library-specific interface because
    every gem has a different API


  43. Issues of Existing Gems
    ✓ Different ways to specify model
    parameters
    ✓ Different ways and formats of training
    data
    ✓ Many gems don't support cross validation
    ✓ Many are toy implementations that are
    not suitable for practical use


  44. Real World Machine Learning


  45. Real World Data
    ✓ Large amount of data
    ✓ High-dimensional features
    ✓ A lot of missing values


  46. Machine Learning

    in Real World
    ✓ We can't look at the whole of the data
    ✓ We can't know in advance which
    algorithms suit the given data
    ✓ We must try, compare, and combine
    as many algorithms as possible


  47. Try, Compare, and Combine
    multiple algorithms
    ✓ Need to unify data formats
    ✓ Need to apply cross validation for all
    algorithms
    ✓ Need to unify the interfaces of
    algorithms, both for searching optimal
    hyperparameters and for combining
    algorithms (sketched below)
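
    A hypothetical sketch of that unified interface: if every estimator followed the same #fit/#predict duck type, one generic cross-validation helper would serve every algorithm. All names below are invented; nothing like this exists in SciRuby yet:

    # works with any estimator object that responds to #fit and #predict
    def cross_validation_accuracy(estimator, examples, labels, folds: 5)
      slice_size = (examples.length.to_f / folds).ceil
      fold_indices = (0...examples.length).each_slice(slice_size).to_a
      scores = fold_indices.map do |test_idx|
        train_idx = (0...examples.length).to_a - test_idx
        # a real framework would clone the estimator for each fold
        estimator.fit(examples.values_at(*train_idx), labels.values_at(*train_idx))
        preds = estimator.predict(examples.values_at(*test_idx))
        preds.zip(labels.values_at(*test_idx)).count {|p, t| p == t } / test_idx.length.to_f
      end
      scores.inject(:+) / scores.length
    end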


  48. In Current SciRuby
    ✓ We can't build practical machine
    learning systems with SciRuby today
    ✓ Python can, with scikit-learn


  49. Scikit-learn


  50. What is scikit-learn
    ✓ Machine learning framework for
    the SciPy stack


  51. SciPy stack
    ✓ NumPy
    ‣Dense tensors
    ✓ SciPy
    ‣Scientific functions
    ‣Sparse matrices
    ✓ pandas
    ‣Data frames
    ✓ Matplotlib
    ‣Visualization infrastructure
    ✓ Jupyter Notebook
    ✓ Etc.


  52. What is scikit-learn
    ✓ Machine learning framework for
    the SciPy stack
    ✓ The de facto standard for machine
    learning in Python


  53. Scikit-learn is elegant
    ✓ Input data is a feature matrix plus a
    label vector, for all algorithms
    ✓ Input data can be any object
    compatible with NumPy's ndarray
    ✓ All machine learning models follow a
    unified interface


  54. Logistic regression
    from sklearn.linear_model import LogisticRegression
    from sklearn.cross_validation import cross_val_score
    # labels of training data
    labels = [-1, -1, 1, 1]
    # training data
    examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]
    # learning
    classifier = LogisticRegression(penalty="l2")
    classifier.fit(examples, labels)
    # prediction
    print(classifier.predict([[0.5, 0.5]]))
    # 5-fold cross validation
    classifier = LogisticRegression(penalty="l2")
    scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


  55. dmlc/xgboost
    import xgboost as xgb
    from sklearn.cross_validation import cross_val_score
    # labels of training data
    labels = [-1, -1, 1, 1]
    # training data
    examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]
    # learning
    classifier = xgb.XGBClassifier()
    classifier.fit(examples, labels)
    # prediction
    print(classifier.predict([[0.5, 0.5]]))
    # 5-fold cross validation
    classifier = xgb.XGBClassifier()
    scores = cross_val_score(classifier, examples, labels, cv=5, scoring='roc_auc')
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


  56. Grid search
    import numpy
    from sklearn.linear_model import LogisticRegression
    from sklearn.grid_search import GridSearchCV
    # labels of training data
    labels = [-1, -1, 1, 1]
    # training data
    examples = [[-2, -2], [-1, -1], [1, 1], [2, 2]]
    # Finding best parameter set by grid search
    parameters = {
    'penalty' : ['l2', 'l1'],
    'C' : numpy.logspace(-4, 4, 10)
    }
    classifier = GridSearchCV(LogisticRegression(), parameters, cv=5)
    classifier.fit(examples, labels)
    # Report best parameters
    best_params = classifier.best_estimator_.get_params()
    print('Best parameters = {}'.format(best_params))


  57. Combination with Pipeline
    import numpy
    from sklearn import svm
    from sklearn.decomposition import PCA
    from sklearn.grid_search import GridSearchCV
    from sklearn.pipeline import Pipeline
    # Combine PCA and SVC by pipeline
    pipeline = Pipeline([
    ('pca', PCA()),
    ('svc', svm.SVC())
    ])
    # Finding best parameter set by grid search
    parameters = {
    'pca__n_components' : range(2, 6),
    'svc__kernel' : ['linear', 'rbf'],
    'svc__C' : numpy.logspace(-4, 4, 10),
    'svc__gamma' : numpy.logspace(-4, 4, 10)
    }
    classifier = GridSearchCV(pipeline, parameters, cv=5, n_jobs=-1)
    classifier.fit(features, labels)  # features and labels prepared elsewhere
    # Report best parameter set
    best_params = classifier.best_estimator_.get_params()
    print('Best parameters = {}'.format(best_params))
    (diagram: pipeline from Input through PCA and SVC to Output)

  58. With scikit-learn
    ✓ We can prepare training data in a common format
    ‣vectors for labels and matrices for features
    ✓ We can use all the algorithms in the same
    interface
    ✓ We can combine models by using
    Pipeline
    ✓ We can run grid search to optimize
    hyperparameters


  59. Scikit-learn is a
    standard
    ✓ Several libraries provide a scikit-learn-
    compatible interface
    ‣xgboost
    ‣tensorflow


  60. Scikit-learn is an ideal
    framework for machine
    learning


  61. The Future of SciRuby in
    Machine Learning


  62. Key Point
    ✓ Make a scikit-learn-like framework
    available to Ruby programs


  63. Two ways
    ✓ Make scikit-learn itself available
    from Ruby
    ✓ Write our own scikit-learn-like
    libraries in Ruby


  64. Use scikit-learn itself
    ✓ Learn from PyCall.jl and ScikitLearn.jl
    ✓ PyCall.jl
    ‣Calls Python code from Julia
    ✓ ScikitLearn.jl
    ‣A binding to scikit-learn via PyCall.jl
    ✓ Make pycall.gem and scikit-learn.gem (sketched below)
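
    A hypothetical sketch of what such a pycall.gem could look like from the Ruby side; the gem does not exist yet, and the API below is invented, modeled on PyCall.jl:

    require 'pycall'  # hypothetical gem
    # import scikit-learn through the Python bridge
    linear_model = PyCall.import_module('sklearn.linear_model')
    classifier = linear_model.LogisticRegression.new(penalty: 'l2')
    classifier.fit([[-2, -2], [-1, -1], [1, 1], [2, 2]], [-1, -1, 1, 1])
    p classifier.predict([[0.5, 0.5]])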


  65. Make scikit-learn like
    libraries
    ✓ Very hard work
    ✓ Needs a Cython-like system to make
    writing extension libraries easy
    ‣rubex, planned by v0dro
    ✓ Needs numerical arrays


  66. Numerical array issues
    ✓ NMatrix
    ✓ Numo::NArray
    ✓ NumBuffer


  67. NMatrix
    ✓ Slow implementation
    ✓ Lack of linear algebra operations
    for sparse matrices
    ✓ Installation issues


  68. Numo::NArray
    ✓ Lack of sparse matrix features
    ✓ Supported by too few libraries


  69. NumBuffer
    ✓ What is it?
    ‣Supports exchanging numerical
    array data among different libraries
    ✓ I'm the only developer
    ✓ More contributors needed


  70. Benchmark
    $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
    Benchmark.ips do |x|
    ar = Array.new(100*100) { rand }
    nm = NMatrix.random [100*100]
    na = Numo::DFloat.new(100*100).rand
    x.report('ar') { Array.new(ar.length) {|i| ar[i] + ar[i] } }
    x.report('nm') { nm + nm }
    x.report('na') { na + na }
    end
    '
    Warming up --------------------------------------
    ar 111.000 i/100ms
    nm 59.000 i/100ms
    na 3.133k i/100ms
    Calculating -------------------------------------
    ar 1.068k (±12.3%) i/s - 5.328k in 5.078079s
    nm 618.334 (±10.0%) i/s - 3.068k in 5.021136s
    na 34.110k (±19.0%) i/s - 166.049k in 5.028910s


  71. Benchmark
    $ ruby -r benchmark/ips -r nmatrix -r numo/narray -e '
    Benchmark.ips do |x|
    nm = NMatrix.random [100, 100]
    na = Numo::DFloat.new(100, 100).rand
    x.report('nm') { nm.dot nm }
    x.report('na') { na.inplace.dot na }
    end
    '
    Warming up --------------------------------------
    nm 189.000 i/100ms
    na 60.000 i/100ms
    Calculating -------------------------------------
    nm 2.083k (± 8.0%) i/s - 10.395k in 5.022906s
    na 658.759 (± 7.4%) i/s - 3.300k in 5.039515s


  72. NMatrix and NArray
    compatibility
    ✓ Which is best?
    ‣Neither is best right now
    ✓ Their interfaces and features are incompatible (sketched below)
    ‣NumBuffer can't resolve this issue
    ✓ I want the two to be unified
    ‣NMatrix is good for sparse matrices
    ‣NArray is good for dense arrays
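
    Today the only portable bridge between the two is a copy through plain Ruby arrays, exactly the overhead a unified design (or NumBuffer) should remove. A sketch, assuming NMatrix#to_a yields nested rows and Numo::DFloat.cast accepts nested arrays:

    require 'nmatrix'
    require 'numo/narray'
    nm = NMatrix.random([3, 3])
    # no shared memory format, so every element is copied via a Ruby Array
    na  = Numo::DFloat.cast(nm.to_a)
    nm2 = NMatrix.new([3, 3], na.to_a.flatten)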


  73. SciRuby JP
    ✓ SciRuby developer community in
    Japan
    ✓ Performed survey studies this
    summer


  74. Some Achievements
    ✓ Tutorials
    ‣100 narray exercises (by masa16 & kozo2)
    ‣10 minutes to daru (by kozo2)
    ‣pandas cookbook with daru (by kozo2)
    ‣Rewrite pandas doc with daru (by chart-linux)
    ✓ Installation
    ‣IRuby on Windows (by kimura)
    ‣ZeroMQ related things (by kozo2 & mrkn)


  75. Some Achievements
    ✓ NLP
    ‣Survey (by himkt)
    ✓ Machine Learning
    ‣Survey (by mrkn)
    ✓ Visualization
    ‣New plotly binding (by y4ashida)
    ✓ Other Languages
    ‣Ruby support in runr (by y4ashida)


  76. Let's go forward
    ✓ Join the SciRuby contributors
    ‣English is preferred, but Japanese is OK
    ✓ A lot of issues are waiting for your contribution
    ‣Not only machine learning ones
    ✓ Discuss in Slack
    ‣https://sciruby-slack.herokuapp.com

