Getting Started with Vowpal Wabbit

Ike Okonkwo
January 19, 2015

Introductory talk on Vowpal Wabbit


Transcript

  1. About Me
    • Data Scientist
    • Merchant Atlas (enterprise digital sales automation using machine learning)
    • Organizer - SF Vowpal Wabbit Meetup
    • Background
      • Physics / Electrical Engineering
      • Industrial & Systems Engineering
  2. Installation
    • Local install: https://www.github.com/JohnLangford/vowpal_wabbit
    • Docker image: bradleypallen/ml-dev
    • On OSX: http://yet-another-data-blog.blogspot.com/2014/08/getting-started-with-vowpal-wabbit-part.html
    • On Windows: http://mlwave.com/install-vowpal-wabbit-on-windows-and-cygwin/
  3. Background
    • John Langford - Yahoo Research / MS Research
    • Fast, out-of-core, scalable ML
    • Can learn on terafeature datasets (10^12)
    • Supports online learning / feature hashing
    • Learning reductions
    • Cloud deployment via Azure ML
    • Progressive validation, linear learning, fixed memory footprint
    • Terafeature learning: http://arxiv.org/pdf/1110.4198v3.pdf
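The online learning, feature hashing, progressive validation and fixed memory bullets fit together, and a small sketch may make that concrete. This is not Vowpal Wabbit's implementation: the hash function (`zlib.crc32`), the table size, the learning rate and every function name below are assumptions chosen for illustration.

```python
import zlib

NUM_BITS = 18                  # assumed table size: 2**18 weights
TABLE_SIZE = 1 << NUM_BITS


def hash_feature(namespace, name):
    """Map a (namespace, feature name) pair to a slot in the weight table."""
    key = (namespace + "^" + name).encode("utf-8")
    return zlib.crc32(key) & (TABLE_SIZE - 1)


def progressive_validation(examples, learning_rate=0.1):
    """One online pass of squared-loss gradient descent over hashed features.

    Each example is (label, {feature_name: value}). Every example is scored
    before the weights are updated with its label, so the running average
    loss is measured only on data the model has not yet learned from.
    """
    weights = [0.0] * TABLE_SIZE       # fixed memory, independent of data size
    total_loss = 0.0
    count = 0
    for label, features in examples:
        slots = [(hash_feature("f", n), v) for n, v in features.items()]
        prediction = sum(weights[i] * v for i, v in slots)
        total_loss += (prediction - label) ** 2    # score first ...
        count += 1
        for i, v in slots:                         # ... then learn
            # Gradient of the squared loss; the factor 2 is folded
            # into the learning rate.
            weights[i] -= learning_rate * (prediction - label) * v
    average_loss = total_loss / count if count else 0.0
    return average_loss, weights
```

Because the weight table has a fixed number of slots, memory use does not grow with the number of distinct feature names; distinct names may collide in the same slot, which is the trade-off of hashing.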
  4. Input Format
    • Label: {-1, 1} for binary, 1..n for multi-class
    • Weight: a positive number giving the importance of this example relative to others. Default: 1
    • Namespace: a string used for grouping features
    • Feature: string[:float]
    • Line layout: [Label] [Weight] |Namespace Feature ... Feature |Namespace Feature ... Feature (the namespace name follows | with no space)
    • https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format
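The line layout above can be produced with a few lines of Python. This is a minimal sketch: `to_vw_line` and its arguments are hypothetical names, not part of any VW wrapper, and it only covers the label, weight, namespace and feature fields listed on the slide.

```python
def to_vw_line(label, namespaces, weight=None):
    """Build one example line in the VW text input format.

    namespaces maps a namespace name to a dict of {feature: value}.
    A value of None emits the bare feature name (implicit value 1).
    """
    head = str(label) if weight is None else f"{label} {weight}"
    sections = []
    for namespace, features in namespaces.items():
        tokens = [namespace]
        for name, value in features.items():
            # Space, '|' and ':' are separators in the format, so they
            # cannot appear inside a feature name.
            if any(ch in name for ch in " |:"):
                raise ValueError(f"invalid feature name: {name!r}")
            tokens.append(name if value is None else f"{name}:{value}")
        # The namespace name must directly follow the bar.
        sections.append("|" + " ".join(tokens))
    return head + " " + " ".join(sections)
```

For example, `to_vw_line(1, {"user": {"age": 0.25, "premium": None}, "item": {"price": 3.5}}, weight=2)` returns `1 2 |user age:0.25 premium |item price:3.5`.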
  5. Wrappers / Input
    • Ingest: text, binary, compressed data
    • I/O: file, pipe, TCP
    • Python: pyvw, rosetta, wabbit_wappa, vowpal_porpoise
    • R: rvowpalwabbit
  6. Useful Command Line Arguments
    • -f <file_name> : save model
    • -t : test mode (no learning)
    • -i <file_name> : load a saved model
    • -p <file_name> : save predictions
    • --passes <n> : iterate over data n times
    • --loss_function <name> : loss function, default: squared
    • --l1, --l2 : lasso and ridge regularization
    • --oaa, --ect, --csoaa : multiclass classification
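A typical train-then-predict cycle combines several of these flags. The sketch below only assembles the argument lists and does not run `vw`; the helper names and file paths are hypothetical. One detail not on the slide: `vw` needs a cache file to make more than one pass, so `-c` is added whenever `passes > 1`.

```python
def train_command(data_path, model_path, passes=1, loss_function="squared",
                  l1=None, l2=None, oaa=None):
    """Argument list for training a model and saving it with -f."""
    cmd = ["vw", data_path, "-f", model_path,
           "--loss_function", loss_function]
    if passes > 1:
        # Multiple passes require a cache file (-c).
        cmd += ["--passes", str(passes), "-c"]
    if l1 is not None:
        cmd += ["--l1", str(l1)]
    if l2 is not None:
        cmd += ["--l2", str(l2)]
    if oaa is not None:
        cmd += ["--oaa", str(oaa)]     # one-against-all with this many classes
    return cmd


def predict_command(data_path, model_path, predictions_path):
    """Argument list for scoring data with a saved model, without learning."""
    return ["vw", data_path, "-t", "-i", model_path, "-p", predictions_path]
```

With `vw` installed, either list can be passed to `subprocess.run(cmd, check=True)`.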
  7. Demo
    • MNIST
    • 10 classes (0-9)
    • 60000 examples
    • 784 features (28 x 28)
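To feed MNIST to `vw`, each image has to become one line of input. The sketch below is one possible conversion, not the one used in the demo: the `pixels` namespace, the scaling to [0, 1] and the function name are assumptions. Since multi-class labels run from 1 to n, digit 0 is mapped to label 10 here.

```python
def mnist_to_vw(digit, pixels):
    """Convert one MNIST image to a VW multi-class example line.

    digit is the class 0-9; pixels is a flat sequence of 784 intensities
    in the range 0-255. Zero pixels are skipped, so the line is sparse.
    """
    if len(pixels) != 784:
        raise ValueError("expected 28 x 28 = 784 pixels")
    label = digit if digit != 0 else 10      # labels must lie in 1..10
    features = " ".join(
        f"{i}:{p / 255:.4f}" for i, p in enumerate(pixels) if p
    )
    return f"{label} |pixels {features}"
```

Lines built this way can be trained with `--oaa 10`; predicted label 10 then has to be mapped back to digit 0.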
  8. Demo
    • RCV1 (Reuters Corpus Volume 1)
    • 2 classes (CCAT or not)
    • 781k examples (train), 23k (test)
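The "CCAT or not" task maps directly onto the binary labels -1 and 1. A minimal sketch of building one example line follows; the `words` namespace, the function name and the token-count representation are assumptions, and tokens are assumed to contain no spaces, colons or bars.

```python
def rcv1_to_vw(categories, token_counts):
    """Build a binary VW example: label 1 if the document is in CCAT, else -1.

    categories is a collection of category codes for the document;
    token_counts maps each token to its count in the document.
    """
    label = 1 if "CCAT" in categories else -1
    features = " ".join(
        f"{token}:{count}" for token, count in sorted(token_counts.items())
    )
    return f"{label} |words {features}"
```

For example, `rcv1_to_vw({"CCAT", "C15"}, {"profit": 2, "bank": 1})` returns `1 |words bank:1 profit:2`.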
  9. Other Features
    • Allreduce - distributed linear learning
    • Contextual bandits
    • Matrix factorization
    • Sequence predictions
    • Topic modeling / LDA
    • Variety of loss functions and optimizers
    • Utilities: perf, vw-varinfo, vw-hypersearch, vw-top-errors
  10. References
    • FastML: http://fastml.com/
    • MLWave: http://mlwave.com/
    • Kaggle Competition boards
    • John Langford: github.com/JohnLangford/vowpal_wabbit
    • Terafeature Linear Learning: http://arxiv.org/pdf/1110.4198v3.pdf
    • Docker image: https://registry.hub.docker.com/u/bradleypallen/ml-dev/
    • NYU Large Scale Learning: http://cilvr.cs.nyu.edu/doku.php?id=courses:bigdata:start