Slide 1

Slide 1 text

Getting Started with Vowpal Wabbit
Ike Okonkwo (@ikeondata), Data Scientist @MerchantAtlas
1.19.15, SF Vowpal Wabbit Meetup

Slide 2

Slide 2 text

Agenda
• Installation
• Background
• Demo
• Other features
• References

Slide 3

Slide 3 text

About Me
• Data Scientist, Merchant Atlas (enterprise digital sales automation using machine learning)
• Organizer, SF Vowpal Wabbit Meetup
• Background: Physics / Electrical Engineering; Industrial & Systems Engineering

Slide 4

Slide 4 text

Installation
• Local install: https://www.github.com/JohnLangford/vowpal_wabbit
• Docker image: bradleypallen/ml-dev
• On OS X: http://yet-another-data-blog.blogspot.com/2014/08/getting-started-with-vowpal-wabbit-part.html
• On Windows: http://mlwave.com/install-vowpal-wabbit-on-windows-and-cygwin/

Slide 5

Slide 5 text

Background
• John Langford (Yahoo Research / Microsoft Research)
• Fast, scalable, out-of-core machine learning
• Can learn on terafeature datasets (10^12 features)
• Supports online learning and feature hashing
• Learning reductions
• Cloud deployment via Azure ML
• Progressive validation, linear learning, fixed memory footprint
Terafeature learning paper: http://arxiv.org/pdf/1110.4198v3.pdf
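To make feature hashing, online learning, and progressive validation concrete, here is a minimal pure-Python sketch. It is not Vowpal Wabbit's implementation: the hash is `zlib.crc32` standing in for the MurmurHash3 that VW uses, the update is plain SGD on squared loss rather than VW's default update rule, and the function names are hypothetical. What it shows is that the weight table has a fixed size regardless of how many distinct features appear, and that each example is scored before the model learns from it.

```python
import zlib

NUM_BITS = 18                    # assumed table size: 2^18 weights
NUM_WEIGHTS = 1 << NUM_BITS


def feature_index(namespace, feature):
    """Map a (namespace, feature) pair to a slot in the fixed-size weight table.

    crc32 is a stand-in hash chosen to stay in the standard library.
    """
    key = f"{namespace}^{feature}".encode("utf-8")
    return zlib.crc32(key) % NUM_WEIGHTS


def progressive_validation(examples, learning_rate=0.1):
    """One online pass: predict, record the loss, then update.

    examples: iterable of (label, [(namespace, feature, value), ...]).
    Returns the weight table and the average progressive squared loss.
    """
    weights = [0.0] * NUM_WEIGHTS
    total_loss = 0.0
    count = 0
    for label, features in examples:
        indexed = [(feature_index(ns, name), value) for ns, name, value in features]
        prediction = sum(weights[i] * value for i, value in indexed)
        error = prediction - label
        total_loss += error ** 2          # scored BEFORE the model sees the label
        count += 1
        for i, value in indexed:          # plain SGD step on squared loss
            weights[i] -= learning_rate * error * value
    return weights, total_loss / max(count, 1)
```

Because the loss is accumulated before each update, the running average behaves like a held-out estimate on a single pass, without setting aside a separate validation set.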

Slide 6

Slide 6 text

Input Format
• Label: -1 or 1 for binary, 1..n for multiclass
• Weight: a positive number giving the importance of the example relative to others (default: 1)
• Namespace: a string used to group features
• Feature: string[:float]
[Label] [Weight] |Namespace Feature ... Feature |Namespace Feature ... Feature
https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format
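As an illustration of the input format, the following sketch builds one example line from Python data. The helper name `format_example` and the sample feature names are hypothetical; the layout follows the `[Label] [Weight] |Namespace Feature ...` pattern described on this slide, with the namespace placed directly after the bar.

```python
def format_example(label, namespaces, importance=None):
    """Build one line of VW-style input.

    namespaces: dict mapping namespace name -> dict of feature name -> value.
    A value of None emits the bare feature name (no ":float" part).
    """
    head = [str(label)]
    if importance is not None:
        head.append(str(importance))

    sections = []
    for namespace, features in namespaces.items():
        tokens = [namespace]
        for name, value in features.items():
            # Spaces, ':' and '|' are separators in the format, so reject them in names.
            if any(ch in name for ch in " :|"):
                raise ValueError(f"invalid feature name: {name!r}")
            tokens.append(name if value is None else f"{name}:{value}")
        sections.append("|" + " ".join(tokens))

    return " ".join(head) + " " + " ".join(sections)


print(format_example(1, {"petal": {"length": 1.4, "width": 0.2}}, importance=2.0))
```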

Slide 7

Slide 7 text

Wrappers / Input
• Ingest: text, binary, or compressed data; I/O via file, pipe, or TCP
• Python: pyvw, rosetta, wabbit_wappa, vowpal_porpoise
• R: rvowpalwabbit

Slide 8

Slide 8 text

Useful Command Line Arguments
• -f <file_name>: save the model
• -t: test mode (no learning)
• -i <file_name>: load a saved model (predictor)
• -p <file_name>: save predictions
• --passes <n>: iterate over the data n times
• --loss_function <name>: loss function (default: squared)
• --l1, --l2: lasso (L1) and ridge (L2) regularization
• --oaa, --ect, --csoaa: multiclass classification
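The flags above combine into a train-then-predict workflow. The sketch below only assembles the argument lists and prints them; it does not run `vw`. The helper names and file names are hypothetical. Two flags not on the slide are used: `-d` to name the data file and `-c` to enable a cache file, which `vw` requires when making more than one pass.

```python
import shlex


def train_command(data_path, model_path, passes=1,
                  loss_function="squared", num_classes=None):
    """Argument list for training and saving a model with -f."""
    args = ["vw", "-d", data_path, "-f", model_path,
            "--loss_function", loss_function]
    if passes > 1:
        args += ["--passes", str(passes), "-c"]   # multiple passes need a cache
    if num_classes is not None:
        args += ["--oaa", str(num_classes)]       # one-against-all multiclass
    return args


def predict_command(data_path, model_path, predictions_path):
    """Argument list for test mode: load with -i, write predictions with -p."""
    return ["vw", "-d", data_path, "-t", "-i", model_path, "-p", predictions_path]


print(shlex.join(train_command("iris.train.vw", "iris.model",
                               passes=10, loss_function="logistic", num_classes=3)))
print(shlex.join(predict_command("iris.test.vw", "iris.model", "preds.txt")))
```

Returning a list rather than a single string means the result can be passed directly to `subprocess.run` without shell quoting concerns.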

Slide 9

Slide 9 text

Demo

Slide 10

Slide 10 text

Demo: Iris
• 3 classes
• 150 examples
• 4 features
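A three-class problem like Iris is a natural fit for the one-against-all reduction behind `--oaa`. The sketch below shows the idea only, not VW's code: a multiclass label becomes one binary label per class, and the predicted class is the one whose binary scorer gives the highest score. Function names are hypothetical, and classes are numbered from 1 as in the input format.

```python
def one_against_all_labels(label, num_classes):
    """Turn a multiclass label (1..num_classes) into one +1/-1 label per class."""
    if not 1 <= label <= num_classes:
        raise ValueError("label must be between 1 and num_classes")
    return [1 if k == label else -1 for k in range(1, num_classes + 1)]


def predict_class(scores):
    """Pick the class whose binary scorer is most confident.

    scores[k - 1] is the score of the "class k vs. rest" scorer.
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best + 1


print(one_against_all_labels(2, 3))
print(predict_class([-0.3, 0.9, 0.1]))
```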

Slide 11

Slide 11 text

Demo: MNIST
• 10 classes (0-9)
• 60,000 examples
• 784 features (28 x 28 pixels)
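One way to prepare MNIST for `vw` is to write each image as a sparse example. The sketch below makes several assumptions that are not from the slides: pixels arrive as 784 integers in 0..255, digit d is mapped to label d + 1 because multiclass labels start at 1, zero pixels are dropped, and the namespace is called `pixel`. The helper name is hypothetical.

```python
def mnist_to_vw(digit, pixels):
    """Encode one 28 x 28 image as a sparse multiclass example line."""
    if len(pixels) != 784:
        raise ValueError("expected 784 pixel values (28 x 28)")
    label = digit + 1                     # shift 0..9 to 1..10
    features = [
        f"{index}:{value / 255:.3f}"      # scale intensities to [0, 1]
        for index, value in enumerate(pixels)
        if value != 0                     # omit zero pixels to keep the line sparse
    ]
    return f"{label} |pixel " + " ".join(features)


image = [0] * 784
image[0] = 255
image[783] = 51
print(mnist_to_vw(7, image))
```

Most MNIST pixels are background, so dropping zeros shortens each line considerably without changing what a linear model sees.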

Slide 12

Slide 12 text

Demo: RCV1 (Reuters Corpus Volume 1)
• 2 classes (CCAT or not)
• 781k examples (train), 23k (test)
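For a binary task such as this one, predictions saved with `-p` can be scored against the true labels in a few lines. This sketch assumes the model was trained with labels -1 and 1, that each prediction line starts with a real-valued score whose sign indicates the class, and that an optional tag may follow the score. The sigmoid is included for the logistic-loss case, where the score can be read as a margin. Function names are hypothetical.

```python
import math


def sigmoid(margin):
    """Numerically stable logistic function mapping a margin to (0, 1)."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def accuracy(prediction_lines, labels):
    """Fraction of examples where the sign of the score matches the label."""
    if not labels:
        raise ValueError("no labels to score against")
    correct = 0
    for line, label in zip(prediction_lines, labels):
        score = float(line.split()[0])    # ignore any tag after the score
        predicted = 1 if score >= 0 else -1
        correct += predicted == label
    return correct / len(labels)


print(accuracy(["1.5", "-0.2 tagA", "0.3"], [1, -1, -1]))
```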

Slide 13

Slide 13 text

Other Features
• AllReduce: distributed linear learning
• Contextual bandits
• Matrix factorization
• Sequence prediction
• Topic modeling (LDA)
• Variety of loss functions and optimizers
• Utilities: perf, vw-varinfo, vw-hypersearch, vw-top-errors

Slide 14

Slide 14 text

References
• FastML: http://fastml.com/
• MLWave: http://mlwave.com/
• Kaggle competition boards
• John Langford: github.com/JohnLangford/vowpal_wabbit
• Terafeature Linear Learning: http://arxiv.org/pdf/1110.4198v3.pdf
• Docker image: https://registry.hub.docker.com/u/bradleypallen/ml-dev/
• NYU Large Scale Learning: http://cilvr.cs.nyu.edu/doku.php?id=courses:bigdata:start
