
Large Scale Machine Learning @ BigDive 4

Valerio Maggio

June 26, 2015

Transcript

  1. multiprocessing
     • Very simple API (part of the standard library)
     • Cross-platform support (even Windows!)
     • Improved in Python 3.4
     • Py 3.4.3: spawn start method available as an alternative to fork on Unix
     • Some support for shared memory
     • Support for synchronisation (Lock)
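A minimal sketch of the API described above, using only documented multiprocessing calls; the worker function and the pool size are illustrative:

    from multiprocessing import Pool

    def square(x):
        # Executed in a separate worker process.
        return x * x

    if __name__ == "__main__":  # guard required, especially under the spawn start method
        with Pool(processes=4) as pool:            # Pool() defaults to one worker per core
            results = pool.map(square, range(10))  # scatter, compute, gather
        print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]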
  2. multiprocessing (-)
     • Very bad documentation
     • (Almost) no docstrings
     • Bad support for KeyboardInterrupt
     • Very tricky to use shared-memory values with NumPy
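To illustrate the last point, a hedged sketch of the standard recipe: allocate a lock-protected multiprocessing.Array and re-wrap its raw buffer as an ndarray on each side. Matching the NumPy dtype to the ctypes typecode and re-wrapping in every process is the easy-to-get-wrong part:

    import multiprocessing as mp
    import numpy as np

    def fill(shared, start, stop):
        # The worker must re-wrap the raw shared buffer as an ndarray itself.
        arr = np.frombuffer(shared.get_obj())  # default float64 matches typecode 'd'
        arr[start:stop] = np.arange(start, stop)

    if __name__ == "__main__":
        shared = mp.Array("d", 8)  # lock-protected shared array of 8 doubles
        p = mp.Process(target=fill, args=(shared, 0, 8))
        p.start()
        p.join()
        # get_obj() bypasses the lock; safe here with a single writer.
        print(np.frombuffer(shared.get_obj()))  # [0. 1. 2. 3. 4. 5. 6. 7.]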
  3. joblib
     • Transparent disk-caching of output values and lazy re-evaluation (memoization)
     • Easy, simple parallel computing
     • Logging and tracing of the execution
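The bullets above are joblib's advertised features; a minimal sketch of the first two (the cache directory is arbitrary, and the first positional argument of Memory was named cachedir in 2015-era joblib, location today):

    from joblib import Memory, Parallel, delayed

    memory = Memory("./joblib_cache", verbose=0)  # transparent disk-caching

    @memory.cache
    def expensive(x):
        # Recomputed only for unseen arguments; otherwise loaded from disk.
        return x ** 2

    # Easy parallel computing: dispatch the calls to 2 worker processes.
    results = Parallel(n_jobs=2)(delayed(expensive)(i) for i in range(10))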
  4. PARALLEL ML USE CASES
     • Stateless Feature Extraction
     • Model Assessment with Cross-Validation
     • Model Selection with Grid Search
     • Ensemble Methods
     • In-Loop Averaged Models
  5. EMBARRASSINGLY PARALLEL ML USE CASES
     • Stateless Feature Extraction
     • Model Assessment with Cross-Validation
     • Model Selection with Grid Search
     • Ensemble Methods
     • In-Loop Averaged Models
  6. IPC PARALLEL ML USE CASES
     • Stateless Feature Extraction
     • Model Assessment with Cross-Validation
     • Model Selection with Grid Search
     • Ensemble Methods
     • In-Loop Averaged Models
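In scikit-learn, most of the embarrassingly parallel cases are a single n_jobs argument away. A sketch using the modern import paths (sklearn.model_selection; the 2015-era equivalents lived in sklearn.cross_validation and sklearn.grid_search):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = load_iris(return_X_y=True)

    # Model assessment: each CV fold is an independent job.
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=5, n_jobs=-1)

    # Model selection: every (parameter, fold) pair is an independent job.
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        param_grid={"C": [0.1, 1.0, 10.0]},
                        cv=5, n_jobs=-1).fit(X, y)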
  7. SCALABILITY ISSUES
     • Builds an in-memory vocabulary mapping text tokens to integer feature indices
     • A big Python dict: slow to (un)pickle
     • Large corpus: ~10⁶ tokens
     • Vocabulary == statefulness == sync barrier
     • No easy way to run in parallel
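The statefulness is easy to see on scikit-learn's CountVectorizer: the dict printed below is exactly the state that would have to be shared and synchronised between workers:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the dog barked"]
    vec = CountVectorizer().fit(docs)

    # The fitted state: one big Python dict, token -> integer column index.
    print(vec.vocabulary_)
    # {'the': 4, 'cat': 1, 'sat': 3, 'dog': 2, 'barked': 0}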
  8. THE HASHING TRICK
     • Replace the Python dict with a hash function
     • Does not need any memory storage
     • Hashing is stateless: it can run in parallel!
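The trick in one line: a sketch using scikit-learn's MurmurHash3 helper (the same hash family HashingVectorizer uses internally); the token and table size are illustrative:

    from sklearn.utils.murmurhash import murmurhash3_32

    n_features = 2 ** 20
    # Stateless token -> column mapping: no dict, nothing to synchronise.
    col = murmurhash3_32("cat", positive=True) % n_features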
  9. HASHING VECTORIZER
     • (+) Makes streaming and parallel text classification possible
     • (-) What about collisions?
     • (-) No inverse document frequency
     • (-) No easy way to retrieve the original terms
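A minimal sketch of the vectorizer itself: no fit step and no vocabulary_ attribute, just a fixed-width sparse output, so each worker can vectorize its own shard independently:

    from sklearn.feature_extraction.text import HashingVectorizer

    vec = HashingVectorizer(n_features=2 ** 20)  # fixed output dimensionality
    X = vec.transform(["the cat sat", "the dog barked"])  # stateless: no fit needed
    print(X.shape)  # (2, 1048576)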
  10. TEXT CLASSIFICATION IN PARALLEL
      [Diagram: the labelled text data is split into three shards (Labels/Text Data 1-3); each shard is vectorized independently (vec), yielding Labels/Vec Data 1-3.]
  11. TEXT CLASSIFICATION IN PARALLEL
      [Diagram: as slide 10, with a separate classifier (clf1, clf2, clf3) trained on each vectorized shard.]
  12. TEXT CLASSIFICATION IN PARALLEL
      [Diagram: clf1 + clf2 + clf3 = clf, i.e. collect and average the per-shard models. Classifier: Perceptron model.]
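A hedged sketch of slides 10-12 combined, using joblib for the map step; train_partition and the toy shards are illustrative. Averaging coef_ and intercept_ is valid because the Perceptron is a linear model, and it assumes every shard sees the same set of classes:

    import numpy as np
    from joblib import Parallel, delayed
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import Perceptron

    def train_partition(texts, labels):
        # Hashing is stateless, so each worker vectorizes its own shard.
        X = HashingVectorizer(n_features=2 ** 18).transform(texts)
        return Perceptron().fit(X, labels)

    partitions = [  # three toy (texts, labels) shards
        (["good movie", "bad film"], [1, 0]),
        (["great plot", "awful acting"], [1, 0]),
        (["loved it", "hated it"], [1, 0]),
    ]

    clfs = Parallel(n_jobs=3)(delayed(train_partition)(t, y)
                              for t, y in partitions)

    # Collect and average the per-shard linear models into one classifier.
    clf = clfs[0]
    clf.coef_ = np.mean([c.coef_ for c in clfs], axis=0)
    clf.intercept_ = np.mean([c.intercept_ for c in clfs], axis=0)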
  13. CROSS VALIDATION
      [Diagram: one split: train on partitions 1 and 2 (Train Labels/Data 1-2), test on partition 3 (Test Labels/Data 3).]
  14. [Diagram: all three folds of the same data: train on 1+2, test on 3; train on 1+3, test on 2; train on 2+3, test on 1.]
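A sketch of the same 3-fold scheme with the modern KFold API (in 2015 this lived in sklearn.cross_validation as KFold(n, n_folds=3)):

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(12).reshape(6, 2)
    y = np.array([0, 1, 0, 1, 0, 1])

    # Each of the 3 partitions takes one turn as the test set.
    for train_idx, test_idx in KFold(n_splits=3).split(X):
        print("train:", train_idx, "test:", test_idx)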
  15. OUT-OF-CORE LEARNING
      • (Def.) The task of training a machine learning model on a dataset that does not fit in main memory.
      • Requires a feature extraction layer with fixed output dimensionality
      • Requires knowing the list of all classes in advance
      • Requires a machine learning algorithm that supports incremental learning (a.k.a. online learning)
      • In scikit-learn: the partial_fit method
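A minimal end-to-end sketch tying the requirements together: HashingVectorizer for fixed-dimensional features, the class list given up front, and SGDClassifier.partial_fit for incremental learning; iter_minibatches is a hypothetical stand-in for streaming chunks from disk:

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vec = HashingVectorizer(n_features=2 ** 18)  # fixed output dimensionality
    clf = SGDClassifier()
    classes = np.array([0, 1])                   # all classes known in advance

    def iter_minibatches():
        # Hypothetical: in practice this would stream chunks from disk.
        yield ["good", "bad"], np.array([1, 0])
        yield ["fine", "poor"], np.array([1, 0])

    for texts, labels in iter_minibatches():
        X = vec.transform(texts)  # stateless, safe on any chunk
        # classes= is required on the first call; repeating it is allowed.
        clf.partial_fit(X, labels, classes=classes)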