Parallel and Large Scale Machine Learning with scikit-learn

**Machine Learning** focuses on *constructing algorithms for making predictions
from data*. These algorithms are usually formulated in two canonical settings,
namely *Supervised* and *Unsupervised learning*.
They typically need to analyze *huge* amounts of data, which entails high
computational costs and demands easily scalable solutions in order to be
applied effectively.

The **distributed** and **parallel** processing of very large datasets has been employed for
decades in specialized, high-budget settings.
However, the evolution of hardware architectures and the advent of
*cloud* technologies have nowadays brought dramatic progress in programming frameworks,
along with a diversity of parallel computing platforms [[Bekkerman et al.]][0].

These factors have fostered an ever-increasing interest in *scaling up* machine
learning applications. This growing attention to large scale machine learning
is also due to the spread of very large datasets, which are often stored on
distributed storage platforms (a.k.a. *cloud storage*), motivating the
development of learning algorithms that can be properly distributed.

[**Scikit-learn**](http://scikit-learn.org/stable/) is an actively
developed Python library, built on top of the solid `numpy` and `scipy`
packages.

Scikit-learn (`sklearn`) is an *all-in-one* software solution, providing
implementations of several machine learning methods, along with datasets and
(performance) evaluation tools.

Thanks to its nice and intuitive API, this library can be easily integrated with
other *Python-powered* solutions for parallel and distributed computations.
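
As a taste of that API, here is a minimal sketch of the common `fit`/`predict` (and `score`) estimator interface, using the `digits` dataset bundled with scikit-learn; the estimator and parameter values are illustrative choices, not taken from the talk. Note the `n_jobs` parameter, which many estimators expose to use multiple cores on a single machine:

```python
# Minimal sketch of the scikit-learn estimator API (illustrative, not from the talk).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
X, y = digits.data, digits.target
X_train, y_train = X[:1200], y[:1200]      # simple holdout split
X_test, y_test = X[1200:], y[1200:]

# n_jobs=-1 asks the estimator to use all available CPU cores
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X_train, y_train)
print("test accuracy: %.3f" % clf.score(X_test, y_test))
```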

In this talk, different solutions to scaling machine learning computations with
scikit-learn will be presented. The *scaling* process will be
described at two different "complexity" levels:
(1) *Single* Machine with Multiple Cores; and (2) *Multiple* Machines with
Multiple Cores.

Working code examples will be discussed during the talk. For the former case,
the presented solutions apply the [`multiprocessing`](http://docs.python.org/2/library/multiprocessing.html),
[`joblib`](https://pythonhosted.org/joblib/), and [`mpi4py`](http://mpi4py.scipy.org)
Python libraries; for the latter,
[`iPython.parallel`](http://ipython.org/ipython-doc/dev/parallel/) and
Map-Reduce based solutions (e.g., [Disco](http://discoproject.org)) will be considered.

The talk is intended for an intermediate-to-advanced audience.

It requires basic math skills and a good knowledge of the Python language.

Good knowledge of the `numpy` and `scipy` packages is also a plus.

[0]: http://www.amazon.com/dp/B00AKE1ZUU "Scaling Up Machine Learning"

Valerio Maggio

May 24, 2014

Transcript

  1. Title slide: ML, Python powered, with scikit-learn (pyML). The Python ML landscape: scikit-learn, shogun, orange, mlpy, pybrain, numpy, scipy.
  2. ESSENCE OF MACHINE LEARNING: ‣ a pattern exists ‣ we cannot pin it down mathematically ‣ we have data on it.
  3. multiprocessing: • part of the standard library • simple API • cross-platform support (even Windows!) • some support for shared memory • support for synchronisation (locks). (A minimal usage sketch follows below.)
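
A minimal sketch of the `multiprocessing` API mentioned on this slide, using `Pool.map` to fan a toy function out over worker processes (the work function is a stand-in, not code from the talk):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':        # guard required on Windows, where workers are spawned, not forked
    pool = Pool(processes=4)      # four worker processes
    print(pool.map(square, range(10)))   # [0, 1, 4, 9, ..., 81]
    pool.close()
    pool.join()
```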
  4. joblib: • transparent disk-caching of output values and lazy re-evaluation (memoization) • easy, simple parallel computing • logging and tracing of the execution. (See the sketch below.)
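
The feature list above matches `joblib`'s documented features; here is a minimal sketch of the two most relevant ones, disk-caching (`Memory`) and simple parallel computing (`Parallel`/`delayed`). The cached function, cache directory and toy data are my own illustrative choices:

```python
from joblib import Memory, Parallel, delayed

# transparent disk-caching of output values (memoization); the first argument is the
# cache directory (called `cachedir` in 2014-era joblib, `location` in newer releases)
memory = Memory('/tmp/joblib_cache', verbose=0)

@memory.cache
def expensive(x):
    return x ** 2            # stands in for a costly computation

print(expensive(3))          # computed, then cached on disk
print(expensive(3))          # served from the cache

# easy simple parallel computing: fan calls out over all available cores
def square(x):
    return x * x

print(Parallel(n_jobs=-1)(delayed(square)(i) for i in range(10)))
```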
  5. SCALABILITY ISSUES: • builds an in-memory vocabulary mapping text tokens to integer feature indices • a big Python dict: slow to (un)pickle • large corpus: ~10^6 tokens • vocabulary == statefulness == sync barrier • no easy way to run in parallel.
  6. THE HASHING TRICK: • replace the Python dict by a hash function • does not need any memory storage • hashing is stateless: it can run in parallel! (See the sketch below.)
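
A minimal sketch of the hashing trick as implemented by scikit-learn's `HashingVectorizer`: since there is no fitted vocabulary, separate chunks of a corpus can be vectorized independently (e.g., in different processes) and still land in the same feature space. The toy documents are made up:

```python
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2 ** 16)    # 65536 features, as in the benchmark slide

chunk_1 = ["the quick brown fox", "jumps over the lazy dog"]
chunk_2 = ["machine learning with python", "scikit-learn is python powered"]

X1 = vec.transform(chunk_1)    # no fit() needed: hashing is stateless
X2 = vec.transform(chunk_2)    # could run in another process with an identical vectorizer
print(X1.shape, X2.shape)      # both matrices live in the same 65536-dimensional space
```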
  7. SENTIMENT140 (TfidfVectorizer):
     Loading 20 newsgroups dataset for all categories.
     11314 documents - 22.055MB (training set); 7532 documents - 13.801MB (testing set).
     Extracting features from the training dataset using a sparse vectorizer:
     done in 12.881007s at 1.712MB/s; n_samples: 11314, n_features: 129792.
     Extracting features from the test dataset using the same vectorizer:
     done in 4.043470s at 3.413MB/s; n_samples: 7532, n_features: 129792.
  8. SENTIMENT140 (HashingVectorizer):
     Loading 20 newsgroups dataset for all categories.
     11314 documents - 22.055MB (training set); 7532 documents - 13.801MB (testing set).
     Extracting features from the training dataset using a sparse vectorizer:
     done in 5.281561s at 4.176MB/s; n_samples: 11314, n_features: 65536.
     Extracting features from the test dataset using the same vectorizer:
     done in 3.413027s at 4.044MB/s; n_samples: 7532, n_features: 65536.
     (A sketch of the benchmark follows below.)
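
A hedged sketch of the kind of benchmark behind the two slides above: vectorize the 20 newsgroups training set with a vocabulary-based `TfidfVectorizer` and with the stateless `HashingVectorizer`, and compare wall-clock times (exact numbers will of course differ from the slides):

```python
from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

data_train = fetch_20newsgroups(subset='train')

for vec in (TfidfVectorizer(), HashingVectorizer(n_features=2 ** 16)):
    t0 = time()
    X = vec.fit_transform(data_train.data)    # fit() is a no-op for HashingVectorizer
    print("%s: done in %.3fs, n_samples: %d, n_features: %d"
          % (vec.__class__.__name__, time() - t0, X.shape[0], X.shape[1]))
```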
  9. TEXT CLASSIFICATION IN PARALLEL (diagram): the corpus is split into partitions (Text Data 1/2/3, each with Labels 1/2/3); each partition is vectorized independently (vec) into Vec Data 1/2/3.
  10. TEXT CLASSIFICATION IN PARALLEL (diagram, continued): a classifier (clf1, clf2, clf3) is trained on each vectorized partition.
  11. TEXT CLASSIFICATION IN PARALLEL (diagram, continued): the partial classifiers clf1, clf2 and clf3 are collected and averaged into a single classifier clf. Classifier: Perceptron model. (See the sketch below.)
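
A hedged sketch of the scheme in slides 9-11: split the labelled corpus into partitions, vectorize each partition with a stateless `HashingVectorizer` and train a `Perceptron` on it in parallel (here via `joblib`), then collect and average the per-partition models into a single classifier. Function names and the toy partitions are mine, not from the talk:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron

vec = HashingVectorizer(n_features=2 ** 16)

def fit_partition(texts, labels):
    X = vec.transform(texts)                  # stateless: safe to call in every worker
    return Perceptron().fit(X, labels)

def average_models(models):
    # all partitions must see the same set of classes for the shapes to match
    avg = models[0]
    avg.coef_ = np.mean([m.coef_ for m in models], axis=0)
    avg.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    return avg

if __name__ == '__main__':
    partitions = [                            # toy stand-ins for (Text Data i, Labels i)
        (["good movie", "bad movie"], [1, 0]),
        (["great film", "awful film"], [1, 0]),
        (["loved it", "hated it"], [1, 0]),
    ]
    models = Parallel(n_jobs=3)(delayed(fit_partition)(texts, labels)
                                for texts, labels in partitions)
    clf = average_models(models)
    print(clf.predict(vec.transform(["good film", "awful movie"])))
```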
  12. THE PERCEPTRON MODEL (diagram): inputs x_0, x_1, ..., x_m; weights w_{j,0}, w_{j,1}, ..., w_{j,m}; outputs o_1, ..., o_n. Each neuron j computes the weighted sum h_j = sum_{i=0..m} w_{j,i} x_i and outputs o_j = g(h_j), where g is the activation function and x_0 the bias input.
  13. THE PERCEPTRON LEARNING: • Initialisation: all the weights in W are initialised (to get started) • Training: where the (actual) learning takes place • Recall: compute the activation of each neuron in the network (so we may feed inputs and see the outputs).
  14. (1) Initialization: set all the weights w_ij to small random (positive or negative) numbers. (2) Training, for T iterations: for each input vector, compute the activation of each neuron j using the activation function g, then update each weight individually.

  15. (1) Initialization: set all the weights w_ij to small random (positive or negative) numbers. (2) Training. (3) Recall: compute the activation of neuron j using the activation function g.
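
Finally, a minimal `numpy` sketch of the perceptron algorithm as outlined on the last three slides: (1) initialise the weights to small random numbers, (2) train for T iterations with the standard perceptron update rule, (3) recall by computing the activations on new inputs. The step activation, the learning rate `eta` and the toy OR data are my own illustrative choices:

```python
import numpy as np

def train_perceptron(X, y, T=20, eta=0.1):
    # (1) Initialisation: small random weights, including a bias weight for x_0 = 1
    n_samples, n_features = X.shape
    Xb = np.hstack([np.ones((n_samples, 1)), X])
    W = 0.05 * np.random.randn(n_features + 1)

    # (2) Training for T iterations
    for _ in range(T):
        for xi, ti in zip(Xb, y):
            o = 1 if np.dot(W, xi) > 0 else 0   # activation function g: step function
            W -= eta * (o - ti) * xi            # update each weight individually
    return W

def recall(W, X):
    # (3) Recall: compute the activation of the neuron for new inputs
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (Xb.dot(W) > 0).astype(int)

# toy example: learn the logical OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
W = train_perceptron(X, y)
print(recall(W, X))    # expected: [0 1 1 1]
```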