Parallel and Large Scale Machine Learning with scikit-learn

**Machine Learning** focuses on *constructing algorithms for making predictions
from data*. These algorithms are usually formulated in two canonical settings,
namely *Supervised* and *Unsupervised learning*.
They typically need to analyze *huge* amounts of data, which entails high
computational costs and demands easily scalable solutions in order to be
applied effectively.

The **distributed** and **parallel** processing of very large datasets has been employed for
decades in specialized, high-budget settings.
However, the evolution of hardware architectures and the advent of
*cloud* technologies have nowadays brought dramatic progress in programming frameworks,
along with a diversity of parallel computing platforms [[Bekkerman et al.]][0].

These factors have fostered an ever-increasing interest in *scaling up* machine
learning applications. This growing attention to large scale machine learning
is also due to the spread of very large datasets, which are often stored on
distributed storage platforms (a.k.a. *cloud storage*), motivating the
development of learning algorithms that can be properly distributed.

[**Scikit-learn**](http://scikit-learn.org/stable/) is an actively
developed Python library, built on top of the solid `numpy` and `scipy`
packages.

Scikit-learn (`sklearn`) is an *all-in-one* software solution, providing
implementations of several machine learning methods, along with datasets and
(performance) evaluation tools.

Thanks to its nice and intuitive API, this library can be easily integrated with
other *Python-powered* solutions for parallel and distributed computations.
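
As a taste of that API, here is a minimal sketch of the common `fit`/`predict` (and `score`) estimator interface, using the `digits` dataset bundled with scikit-learn; the estimator and parameter values are illustrative choices, not taken from the talk. Note the `n_jobs` parameter, which many estimators expose to use multiple cores on a single machine:

```python
# Minimal sketch of the scikit-learn estimator API (illustrative, not from the talk).
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

digits = load_digits()
X, y = digits.data, digits.target
X_train, y_train = X[:1200], y[:1200]      # simple holdout split
X_test, y_test = X[1200:], y[1200:]

# n_jobs=-1 asks the estimator to use all available CPU cores
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X_train, y_train)
print("test accuracy: %.3f" % clf.score(X_test, y_test))
```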

In this talk, different solutions to scaling machine learning computations with
scikit-learn will be presented. The *scaling* process will be
described at two different "complexity" levels:
(1) *Single* Machine with Multiple Cores; and (2) *Multiple* Machines with
Multiple Cores.

Working code examples will be discussed during the talk. For the former case,
the presented solutions apply the [`multiprocessing`](http://docs.python.org/2/library/multiprocessing.html),
[`joblib`](https://pythonhosted.org/joblib/), and [`mpi4py`](http://mpi4py.scipy.org)
Python libraries; for the latter,
[`iPython.parallel`](http://ipython.org/ipython-doc/dev/parallel/) and
Map-Reduce based solutions (e.g., [Disco](http://discoproject.org)) will be considered.

The talk is intended for an intermediate-to-advanced audience.

It requires basic math skills and a good knowledge of the Python language.

Good knowledge of the `numpy` and `scipy` packages is also a plus.

[0]: http://www.amazon.com/dp/B00AKE1ZUU "Scaling Up Machine Learning"

Valerio Maggio

May 24, 2014

Transcript

  1. Title slide: ML, Python powered, with scikit-learn (pyML). The Python ML landscape: scikit-learn, shogun, orange, mlpy, pybrain, numpy, scipy.
  2. ESSENCE OF MACHINE LEARNING: ‣ a pattern exists ‣ we cannot pin it down mathematically ‣ we have data on it.
  3. multiprocessing: • part of the standard library • simple API • cross-platform support (even Windows!) • some support for shared memory • support for synchronisation (locks). (A minimal usage sketch follows below.)
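
A minimal sketch of the `multiprocessing` API mentioned on this slide, using `Pool.map` to fan a toy function out over worker processes (the work function is a stand-in, not code from the talk):

```python
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':        # guard required on Windows, where workers are spawned, not forked
    pool = Pool(processes=4)      # four worker processes
    print(pool.map(square, range(10)))   # [0, 1, 4, 9, ..., 81]
    pool.close()
    pool.join()
```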
  4. joblib: • transparent disk-caching of output values and lazy re-evaluation (memoization) • easy, simple parallel computing • logging and tracing of the execution. (See the sketch below.)
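
The feature list above matches `joblib`'s documented features; here is a minimal sketch of the two most relevant ones, disk-caching (`Memory`) and simple parallel computing (`Parallel`/`delayed`). The cached function, cache directory and toy data are my own illustrative choices:

```python
from joblib import Memory, Parallel, delayed

# transparent disk-caching of output values (memoization); the first argument is the
# cache directory (called `cachedir` in 2014-era joblib, `location` in newer releases)
memory = Memory('/tmp/joblib_cache', verbose=0)

@memory.cache
def expensive(x):
    return x ** 2            # stands in for a costly computation

print(expensive(3))          # computed, then cached on disk
print(expensive(3))          # served from the cache

# easy simple parallel computing: fan calls out over all available cores
def square(x):
    return x * x

print(Parallel(n_jobs=-1)(delayed(square)(i) for i in range(10)))
```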
  5. SCALABILITY ISSUES: • builds an in-memory vocabulary mapping text tokens to integer feature indices • a big Python dict: slow to (un)pickle • large corpus: ~10^6 tokens • vocabulary == statefulness == sync barrier • no easy way to run in parallel.
  6. THE HASHING TRICK: • replace the Python dict by a hash function • does not need any memory storage • hashing is stateless: it can run in parallel! (See the sketch below.)
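
A minimal sketch of the hashing trick as implemented by scikit-learn's `HashingVectorizer`: since there is no fitted vocabulary, separate chunks of a corpus can be vectorized independently (e.g., in different processes) and still land in the same feature space. The toy documents are made up:

```python
from sklearn.feature_extraction.text import HashingVectorizer

vec = HashingVectorizer(n_features=2 ** 16)    # 65536 features, as in the benchmark slide

chunk_1 = ["the quick brown fox", "jumps over the lazy dog"]
chunk_2 = ["machine learning with python", "scikit-learn is python powered"]

X1 = vec.transform(chunk_1)    # no fit() needed: hashing is stateless
X2 = vec.transform(chunk_2)    # could run in another process with an identical vectorizer
print(X1.shape, X2.shape)      # both matrices live in the same 65536-dimensional space
```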
  7. SENTIMENT140 (TfidfVectorizer):
     Loading 20 newsgroups dataset for all categories.
     11314 documents - 22.055MB (training set); 7532 documents - 13.801MB (testing set).
     Extracting features from the training dataset using a sparse vectorizer:
     done in 12.881007s at 1.712MB/s; n_samples: 11314, n_features: 129792.
     Extracting features from the test dataset using the same vectorizer:
     done in 4.043470s at 3.413MB/s; n_samples: 7532, n_features: 129792.
  8. SENTIMENT140 (HashingVectorizer):
     Loading 20 newsgroups dataset for all categories.
     11314 documents - 22.055MB (training set); 7532 documents - 13.801MB (testing set).
     Extracting features from the training dataset using a sparse vectorizer:
     done in 5.281561s at 4.176MB/s; n_samples: 11314, n_features: 65536.
     Extracting features from the test dataset using the same vectorizer:
     done in 3.413027s at 4.044MB/s; n_samples: 7532, n_features: 65536.
     (A sketch of the benchmark follows below.)
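
A hedged sketch of the kind of benchmark behind the two slides above: vectorize the 20 newsgroups training set with a vocabulary-based `TfidfVectorizer` and with the stateless `HashingVectorizer`, and compare wall-clock times (exact numbers will of course differ from the slides):

```python
from time import time
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

data_train = fetch_20newsgroups(subset='train')

for vec in (TfidfVectorizer(), HashingVectorizer(n_features=2 ** 16)):
    t0 = time()
    X = vec.fit_transform(data_train.data)    # fit() is a no-op for HashingVectorizer
    print("%s: done in %.3fs, n_samples: %d, n_features: %d"
          % (vec.__class__.__name__, time() - t0, X.shape[0], X.shape[1]))
```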
  9. TEXT CLASSIFICATION IN PARALLEL (diagram): the corpus is split into partitions (Text Data 1/2/3, each with Labels 1/2/3); each partition is vectorized independently (vec) into Vec Data 1/2/3.
  10. TEXT CLASSIFICATION IN PARALLEL (diagram, continued): a classifier (clf1, clf2, clf3) is trained on each vectorized partition.
  11. TEXT CLASSIFICATION IN PARALLEL (diagram, continued): the partial classifiers clf1, clf2 and clf3 are collected and averaged into a single classifier clf. Classifier: Perceptron model. (See the sketch below.)
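
A hedged sketch of the scheme in slides 9-11: split the labelled corpus into partitions, vectorize each partition with a stateless `HashingVectorizer` and train a `Perceptron` on it in parallel (here via `joblib`), then collect and average the per-partition models into a single classifier. Function names and the toy partitions are mine, not from the talk:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron

vec = HashingVectorizer(n_features=2 ** 16)

def fit_partition(texts, labels):
    X = vec.transform(texts)                  # stateless: safe to call in every worker
    return Perceptron().fit(X, labels)

def average_models(models):
    # all partitions must see the same set of classes for the shapes to match
    avg = models[0]
    avg.coef_ = np.mean([m.coef_ for m in models], axis=0)
    avg.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
    return avg

if __name__ == '__main__':
    partitions = [                            # toy stand-ins for (Text Data i, Labels i)
        (["good movie", "bad movie"], [1, 0]),
        (["great film", "awful film"], [1, 0]),
        (["loved it", "hated it"], [1, 0]),
    ]
    models = Parallel(n_jobs=3)(delayed(fit_partition)(texts, labels)
                                for texts, labels in partitions)
    clf = average_models(models)
    print(clf.predict(vec.transform(["good film", "awful movie"])))
```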
  12. THE PERCEPTRON MODEL (diagram): inputs x_0, x_1, ..., x_m; weights w_{j,0}, w_{j,1}, ..., w_{j,m}; outputs o_1, ..., o_n. Each neuron j computes the weighted sum h_j = sum_{i=0..m} w_{j,i} x_i and outputs o_j = g(h_j), where g is the activation function and x_0 the bias input.
  13. THE PERCEPTRON LEARNING: • Initialisation: all the weights in W are initialised (to get started) • Training: where the (actual) learning takes place • Recall: compute the activation of each neuron in the network (so we may feed inputs and see the outputs).
  14. (1) Initialization: set all the weights w_ij to small random (positive or negative) numbers. (2) Training, for T iterations: for each input vector, compute the activation of each neuron j using the activation function g, then update each weight individually.

  15. (1) Initialization: set all the weights w_ij to small random (positive or negative) numbers. (2) Training. (3) Recall: compute the activation of neuron j using the activation function g.
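
Finally, a minimal `numpy` sketch of the perceptron algorithm as outlined on the last three slides: (1) initialise the weights to small random numbers, (2) train for T iterations with the standard perceptron update rule, (3) recall by computing the activations on new inputs. The step activation, the learning rate `eta` and the toy OR data are my own illustrative choices:

```python
import numpy as np

def train_perceptron(X, y, T=20, eta=0.1):
    # (1) Initialisation: small random weights, including a bias weight for x_0 = 1
    n_samples, n_features = X.shape
    Xb = np.hstack([np.ones((n_samples, 1)), X])
    W = 0.05 * np.random.randn(n_features + 1)

    # (2) Training for T iterations
    for _ in range(T):
        for xi, ti in zip(Xb, y):
            o = 1 if np.dot(W, xi) > 0 else 0   # activation function g: step function
            W -= eta * (o - ti) * xi            # update each weight individually
    return W

def recall(W, X):
    # (3) Recall: compute the activation of the neuron for new inputs
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (Xb.dot(W) > 0).astype(int)

# toy example: learn the logical OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
W = train_perceptron(X, y)
print(recall(W, X))    # expected: [0 1 1 1]
```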