
Large Scale Machine Learning @ PyCon Sei (2015)


Machine learning focuses on constructing algorithms that make predictions from data. These algorithms usually need to analyse huge amounts of data, incurring high computational costs and requiring easily scalable solutions to be applied effectively. These factors have fostered an ever-increasing interest in scaling up machine learning applications.

Scikit-learn is one of the most popular machine learning libraries in Python, providing implementations of several machine learning methods, along with datasets and (performance) evaluation tools.

This talk presents some recipes for scaling up machine learning algorithms with scikit-learn. It walks through several examples and case studies, presented in a problem-to-solution fashion to encourage discussion during and after the talk.

The talk is intended for an intermediate audience. It requires (very) basic math skills and a good knowledge of the Python language. Good knowledge of the numpy and scipy packages is also a plus.

Valerio Maggio

April 18, 2015

Transcript

  1. multiprocessing (+)
     • Very simple API (part of the std. library)
     • Cross-platform support (even Windows!)
     • Improved in Python 3.4
     • Py 3.4.3: spawn instead of fork on Unix
     • Some support for shared memory
     • Support for synchronisation (Lock)
  2. multiprocessing (-)
     • Very bad documentation
     • (Almost) no docstrings
     • Bad support for KeyboardInterrupt
     • Very tricky to use shared-memory values with NumPy
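As a minimal sketch of the API the slides refer to (the `square` worker is my own toy example, not from the talk):

```python
from multiprocessing import Pool

def square(x):
    # A plain top-level function: required so that child processes
    # can import it under the spawn start method mentioned above.
    return x * x

if __name__ == "__main__":
    # Pool distributes the calls across 4 worker processes.
    with Pool(processes=4) as pool:
        print(pool.map(square, range(5)))  # [0, 1, 4, 9, 16]
```

The `if __name__ == "__main__"` guard matters precisely because of the spawn-vs-fork point above: under spawn, each worker re-imports the main module.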
  3. • Transparent disk-caching of the output values and lazy re-evaluation (memoization)
     • Easy simple parallel computing
     • Logging and tracing of the execution
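These bullet points match the feature list of the joblib library; assuming that is the tool the slide refers to, a minimal sketch of its two headline features, disk-caching and parallelism (the `expensive` function is illustrative):

```python
import tempfile
from joblib import Memory, Parallel, delayed

# Transparent disk-caching: repeated calls with the same arguments
# are served from the on-disk cache (memoization).
memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def expensive(x):
    return x ** 2

# Easy simple parallel computing: run the function over the inputs
# with two worker processes.
results = Parallel(n_jobs=2)(delayed(expensive)(i) for i in range(5))
print(results)  # [0, 1, 4, 9, 16]
```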
  4. PARALLEL ML USE CASES
     • Stateless Feature Extraction
     • Model Assessment with CrossValidation
     • Model Selection with Grid Search
     • Ensemble Methods
     • In-Loop Averaged Models
  5. EMBARRASSINGLY PARALLEL ML USE CASES
     • Stateless Feature Extraction
     • Model Assessment with CrossValidation
     • Model Selection with Grid Search
     • Ensemble Methods
     • In-Loop Averaged Models
  6. IPC PARALLEL ML USE CASES
     • Stateless Feature Extraction
     • Model Assessment with CrossValidation
     • Model Selection with Grid Search
     • Ensemble Methods
     • In-Loop Averaged Models
  7. SCALABILITY ISSUES
     • Builds an in-memory vocabulary from text tokens to integer feature indices
     • A big Python dict: slow to (un)pickle
     • Large corpus: ~10^6 tokens
     • Vocabulary == statefulness == sync barrier
     • No easy way to run in parallel
  8. THE HASHING TRICK
     • Replace the Python dict with a hash function
     • Needs no memory storage
     • Hashing is stateless: it can run in parallel!
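A minimal sketch of the trick using only the standard library (hashlib stands in for the faster MurmurHash that scikit-learn uses; function names and the tiny `n_features` are my own):

```python
import hashlib

def token_index(token, n_features=16):
    # Stateless: any process computes the same index for the same
    # token, with no shared vocabulary dict to build or synchronise.
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hashed_counts(tokens, n_features=16):
    # Fixed-size count vector; collisions are possible, but become
    # rare when n_features is large enough.
    vec = [0] * n_features
    for token in tokens:
        vec[token_index(token, n_features)] += 1
    return vec

print(hashed_counts("to be or not to be".split()))
```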
  9. HASHING VECTORIZER
     • (+) Enables streaming and parallel text classification
     • (-) What about collisions?
     • (-) No Inverse Document Frequency
     • (-) No easy way to retrieve the original terms
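In scikit-learn the trick is packaged as `HashingVectorizer` (shown with the modern API; some parameter names have changed since 2015):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# No fit step and no vocabulary_ attribute: transform is stateless,
# so chunks of a corpus can be vectorized in parallel.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform([
    "the quick brown fox",
    "jumps over the lazy dog",
])
print(X.shape)  # (2, 262144)
```

The fixed `n_features` output dimensionality is also what makes this vectorizer suitable for the out-of-core setting discussed later.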
  10. TEXT CLASSIFICATION IN PARALLEL
      [diagram: the corpus is split into three (labels, text data) chunks, each vectorized in parallel (vec) into (labels, vec data)]
  11. TEXT CLASSIFICATION IN PARALLEL
      [diagram: each vectorized chunk trains its own classifier (clf1, clf2, clf3) in parallel]
  12. TEXT CLASSIFICATION IN PARALLEL
      [diagram: clf1 + clf2 + clf3 are collected and averaged into a single classifier; model: Perceptron]
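The last step above collects the per-chunk Perceptrons and averages them into one model. One way to sketch that averaging step (`average_models` is my own illustrative helper, not a scikit-learn API; in the real pipeline each chunk would be fitted in a separate process):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

def average_models(models):
    # Linear models can be combined by averaging their weight vectors;
    # all models must share the same classes and feature space.
    avg = Perceptron()
    avg.classes_ = models[0].classes_
    avg.coef_ = np.mean([m.coef_ for m in models], axis=0)
    avg.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    return avg

X, y = make_classification(n_samples=200, random_state=0)
chunks = [(X[:100], y[:100]), (X[100:], y[100:])]
models = [Perceptron(random_state=0).fit(cx, cy) for cx, cy in chunks]
clf = average_models(models)
print(clf.predict(X[:5]))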
  13. CROSS VALIDATION
      [diagram: one split of the data, training on partitions 1 and 2 and testing on partition 3]
  14. [diagram: all three folds; each partition serves once as the test set while the other two are used for training]
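The folds above are independent of one another, so cross-validation parallelizes with a single argument (modern scikit-learn API; the 2015 talk would have used the older `sklearn.cross_validation` module):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# One independent fit-and-score job per fold, spread across all cores.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=3, n_jobs=-1)
print(scores)
```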
  15. OUT-OF-CORE LEARNING
      • (Def.): the task of training a machine learning model on a dataset that does not fit in main memory
      • A feature extraction layer with fixed output dimensionality
      • Knowing the list of all classes in advance
      • A machine learning algorithm that supports incremental learning (a.k.a. online learning): the partial_fit method in scikit-learn
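A minimal sketch of the out-of-core loop with `partial_fit` (synthetic mini-batches stand in for chunks streamed from disk; the label rule is illustrative):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # the full list of classes, known in advance

rng = np.random.RandomState(0)
for _ in range(10):
    # Each mini-batch would normally be read from disk; here we
    # synthesize one whose label depends on the first feature.
    X_batch = rng.randn(50, 5)
    y_batch = (X_batch[:, 0] > 0).astype(int)
    # partial_fit updates the model incrementally, one chunk at a time,
    # so the full dataset never has to be in memory at once.
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.randn(100, 5)
y_test = (X_test[:, 0] > 0).astype(int)
print("accuracy:", clf.score(X_test, y_test))
```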