
Large Scale Machine Learning @ BigDive 4

Valerio Maggio

June 26, 2015

Transcript

  1. multiprocessing
     • Very simple API (part of the standard library)
     • Cross-platform support (even Windows!)
     • Improved in Python 3.4
     • Py 3.4.3: spawn start method available as an alternative to fork on Unix
     • Some support for shared memory
     • Support for synchronisation (Lock)
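A minimal sketch of the API described above, using only documented multiprocessing calls; the worker function and the pool size are illustrative:

    from multiprocessing import Pool

    def square(x):
        # Executed in a separate worker process.
        return x * x

    if __name__ == "__main__":  # guard required, especially under the spawn start method
        with Pool(processes=4) as pool:            # Pool() defaults to one worker per core
            results = pool.map(square, range(10))  # scatter, compute, gather
        print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]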
  2. multiprocessing (-)
     • Very bad documentation
     • (Almost) no docstrings
     • Bad support for KeyboardInterrupt
     • Very tricky to use shared-memory values with NumPy
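To illustrate the last point, a hedged sketch of the standard recipe: allocate a lock-protected multiprocessing.Array and re-wrap its raw buffer as an ndarray on each side. Matching the NumPy dtype to the ctypes typecode and re-wrapping in every process is the easy-to-get-wrong part:

    import multiprocessing as mp
    import numpy as np

    def fill(shared, start, stop):
        # The worker must re-wrap the raw shared buffer as an ndarray itself.
        arr = np.frombuffer(shared.get_obj())  # default float64 matches typecode 'd'
        arr[start:stop] = np.arange(start, stop)

    if __name__ == "__main__":
        shared = mp.Array("d", 8)  # lock-protected shared array of 8 doubles
        p = mp.Process(target=fill, args=(shared, 0, 8))
        p.start()
        p.join()
        # get_obj() bypasses the lock; safe here with a single writer.
        print(np.frombuffer(shared.get_obj()))  # [0. 1. 2. 3. 4. 5. 6. 7.]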
  3. joblib
     • Transparent disk-caching of output values and lazy re-evaluation (memoization)
     • Easy, simple parallel computing
     • Logging and tracing of the execution
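The bullets above are joblib's advertised features; a minimal sketch of the first two (the cache directory is arbitrary, and the first positional argument of Memory was named cachedir in 2015-era joblib, location today):

    from joblib import Memory, Parallel, delayed

    memory = Memory("./joblib_cache", verbose=0)  # transparent disk-caching

    @memory.cache
    def expensive(x):
        # Recomputed only for unseen arguments; otherwise loaded from disk.
        return x ** 2

    # Easy parallel computing: dispatch the calls to 2 worker processes.
    results = Parallel(n_jobs=2)(delayed(expensive)(i) for i in range(10))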
  4. PARALLEL ML USE CASES
     • Stateless Feature Extraction
     • Model Assessment with Cross-Validation
     • Model Selection with Grid Search
     • Ensemble Methods
     • In-Loop Averaged Models
  5. EMBARRASSINGLY PARALLEL ML USE CASES
     • Stateless Feature Extraction
     • Model Assessment with Cross-Validation
     • Model Selection with Grid Search
     • Ensemble Methods
     • In-Loop Averaged Models
  6. IPC PARALLEL ML USE CASES
     • Stateless Feature Extraction
     • Model Assessment with Cross-Validation
     • Model Selection with Grid Search
     • Ensemble Methods
     • In-Loop Averaged Models
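In scikit-learn, most of the embarrassingly parallel cases are a single n_jobs argument away. A sketch using the modern import paths (sklearn.model_selection; the 2015-era equivalents lived in sklearn.cross_validation and sklearn.grid_search):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = load_iris(return_X_y=True)

    # Model assessment: each CV fold is an independent job.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=5, n_jobs=-1)

    # Model selection: every (parameter, fold) pair is an independent job.
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        param_grid={"C": [0.1, 1.0, 10.0]},
                        cv=5, n_jobs=-1).fit(X, y)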
  7. SCALABILITY ISSUES
     • Builds an in-memory vocabulary mapping text tokens to integer feature indices
     • A big Python dict: slow to (un)pickle
     • Large corpus: ~10⁶ tokens
     • Vocabulary == statefulness == sync barrier
     • No easy way to run in parallel
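The statefulness is easy to see on scikit-learn's CountVectorizer: the dict printed below is exactly the state that would have to be shared and synchronised between workers:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the dog barked"]
    vec = CountVectorizer().fit(docs)

    # The fitted state: one big Python dict, token -> integer column index.
    print(vec.vocabulary_)
    # {'the': 4, 'cat': 1, 'sat': 3, 'dog': 2, 'barked': 0}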
  8. THE HASHING TRICK
     • Replace the Python dict with a hash function
     • Does not need any memory storage
     • Hashing is stateless: it can run in parallel!
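The trick in one line: a sketch using scikit-learn's MurmurHash3 helper (the same hash family HashingVectorizer uses internally); the token and table size are illustrative:

    from sklearn.utils.murmurhash import murmurhash3_32

    n_features = 2 ** 20
    # Stateless token -> column mapping: no dict, nothing to synchronise.
    col = murmurhash3_32("cat", positive=True) % n_features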
  9. HASHING VECTORIZER
     • (+) Makes streaming and parallel text classification possible
     • (-) What about collisions?
     • (-) No inverse document frequency
     • (-) No easy way to retrieve the original terms
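A minimal sketch of the vectorizer itself: no fit step and no vocabulary_ attribute, just a fixed-width sparse output, so each worker can vectorize its own shard independently:

    from sklearn.feature_extraction.text import HashingVectorizer

    vec = HashingVectorizer(n_features=2 ** 20)  # fixed output dimensionality
    X = vec.transform(["the cat sat", "the dog barked"])  # stateless: no fit needed
    print(X.shape)  # (2, 1048576)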
  10. TEXT CLASSIFICATION IN PARALLEL
      [Diagram: the labelled text data is split into three shards (Labels/Text Data 1-3); each shard is vectorized independently (vec), yielding Labels/Vec Data 1-3.]
  11. TEXT CLASSIFICATION IN PARALLEL
      [Diagram: as slide 10, with a separate classifier (clf1, clf2, clf3) trained on each vectorized shard.]
  12. TEXT CLASSIFICATION IN PARALLEL
      [Diagram: clf1 + clf2 + clf3 = clf, i.e. collect and average the per-shard models. Classifier: Perceptron model.]
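A hedged sketch of slides 10-12 combined, using joblib for the map step; train_partition and the toy shards are illustrative. Averaging coef_ and intercept_ is valid because the Perceptron is a linear model, and it assumes every shard sees the same set of classes:

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import Perceptron

    def train_partition(texts, labels):
        # Hashing is stateless, so each worker vectorizes its own shard.
        X = HashingVectorizer(n_features=2 ** 18).transform(texts)
        return Perceptron().fit(X, labels)

    partitions = [  # three toy (texts, labels) shards
        (["good movie", "bad film"], [1, 0]),
        (["great plot", "awful acting"], [1, 0]),
        (["loved it", "hated it"], [1, 0]),
    ]

    clfs = Parallel(n_jobs=3)(delayed(train_partition)(t, y)
                              for t, y in partitions)

    # Collect and average the per-shard linear models into one classifier.
    clf = clfs[0]
    clf.coef_ = np.mean([c.coef_ for c in clfs], axis=0)
    clf.intercept_ = np.mean([c.intercept_ for c in clfs], axis=0)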
  13. CROSS VALIDATION
      [Diagram: one split: train on partitions 1 and 2 (Train Labels/Data 1-2), test on partition 3 (Test Labels/Data 3).]
  14. [Diagram: all three folds of the same data: train on 1+2, test on 3; train on 1+3, test on 2; train on 2+3, test on 1.]
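A sketch of the same 3-fold scheme with the modern KFold API (in 2015 this lived in sklearn.cross_validation as KFold(n, n_folds=3)):

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(12).reshape(6, 2)
    y = np.array([0, 1, 0, 1, 0, 1])

    # Each of the 3 partitions takes one turn as the test set.
    for train_idx, test_idx in KFold(n_splits=3).split(X):
        print("train:", train_idx, "test:", test_idx)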
  15. OUT-OF-CORE LEARNING
      • (Def.) The task of training a machine learning model on a dataset that does not fit in main memory.
      • Requires a feature extraction layer with fixed output dimensionality
      • Requires knowing the list of all classes in advance
      • Requires a machine learning algorithm that supports incremental learning (a.k.a. online learning)
      • In scikit-learn: the partial_fit method
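A minimal end-to-end sketch tying the requirements together: HashingVectorizer for fixed-dimensional features, the class list given up front, and SGDClassifier.partial_fit for incremental learning; iter_minibatches is a hypothetical stand-in for streaming chunks from disk:

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vec = HashingVectorizer(n_features=2 ** 18)  # fixed output dimensionality
    clf = SGDClassifier()
    classes = np.array([0, 1])                   # all classes known in advance

    def iter_minibatches():
        # Hypothetical: in practice this would stream chunks from disk.
        yield ["good", "bad"], np.array([1, 0])
        yield ["fine", "poor"], np.array([1, 0])

    for texts, labels in iter_minibatches():
        X = vec.transform(texts)  # stateless, safe on any chunk
        # classes= is required on the first call; repeating it is allowed.
        clf.partial_fit(X, labels, classes=classes)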