Cloud Machine Learning Comparison

A developer-centric comparison of machine learning as a service offerings from Amazon Web Services, Microsoft Azure, and Google Cloud Platform.

Sandeep Parikh

October 19, 2015

Transcript

  1. Agenda: What we’ll cover

    Who offers machine learning services? How do the services work? How do you use them in your applications?
  2. Hi, I’m Sandeep

    I’m a Solutions Architect at Google, working on Google Cloud Platform; I spend my time designing and documenting architectural patterns and solutions. Before that I worked for MongoDB. Before that... well, a bunch of other places. I’ve been in Austin for about 12 years so I get to complain about everything. @crcsmnky
  3. Just To Be Clear

    I am not a data scientist. This talk is not sanctioned by my employer. I will do my best to be unbiased.
  4. Required Reading

    Thanks to Inês Almeida [1] for putting together an incredibly detailed and thorough Machine Learning Service Comparison [2]. She covers data sourcing, data preprocessing, model building, and model evaluation.
    [1] https://blog.onliquid.com/author/isbalmeida/
    [2] https://blog.onliquid.com/machine-learning-service-benchmark/
  5. Developer Centric Approach

    Not all who need machine learning are data scientists. It could be used to bootstrap a larger effort. Perhaps other challenges take precedence.
  6. Example Application

    Browse and rate movies, using the MovieLens 10M dataset [1]. Thanks to the fine folks at GroupLens [2].
    [1] https://movielens.org/
    [2] http://grouplens.org/
  7. Code and Components

    Code: https://github.com/crcsmnky/movieweb
    Built using Python, Flask, and MongoDB. Updated to use machine learning real-time endpoints from each service. Will push updated code soon™.
  8. What’s Missing From MovieWeb?

    How do you know if a user will like a particular movie?
  9. Steps and Goals

    Use MovieLens training data, which contains userId, movieId, and rating. Train a model. Evaluate the model. Create a “recommendation” service. Profit!
  10. AWS

  11. AWS: Schema configuration

    By default the schema is read in as: userId (numeric), movieId (numeric), rating (numeric). But I need: userId (numeric), movieId (numeric), rating (categorical).
  12. AWS: Predictions

    Generate a real-time predictions endpoint: supply userId and movieId, and it returns the predicted rating. Or upload a dataset and run batch predictions: a one-time job, but you need to generate the dataset (about 8 GB) and repeat as needed (daily? weekly?).
  13. Azure: Create Experiment

    Create a new datasource and upload ratings.csv. Then jump into Azure ML Studio... wow! After some trial and error, I landed on an approach.
  14. Azure: Predictions

    Generate a real-time predictions endpoint: supply userId and movieId, and it returns the predicted rating (and probabilities of other labels). Or upload a dataset and run batch predictions: a one-time job, but you need to generate the dataset and repeat as needed (daily? weekly?).
  15. Azure: Limits

    20 requests per endpoint (up to 10,000 endpoints). Up to 10 GB for training data. No mention of batch limitations or daily limits.
  16. Google: Upload Data

    Create a Cloud Storage bucket and upload ratings.csv... then it’s all API-driven. But it does support Predictive Model Markup Language (PMML) for metadata, transformations, etc.
  17. Google: Model Availability

    Model training is asynchronous, so you have to check when it’s ready.
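Checking when an asynchronously trained model is ready usually comes down to a polling loop. Here is a minimal sketch of that idea, not the service's actual client: `get_status` is a hypothetical zero-argument callable that you would wire to whatever status call the service exposes, and the `"RUNNING"`, `"DONE"`, and `"ERROR"` strings are assumed placeholder values.

```python
import time


def wait_until_trained(get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                       sleep=time.sleep):
    """Poll get_status() until it reports "DONE".

    get_status is a hypothetical callable; the status strings used here are
    assumptions and must be adapted to the real service's responses.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status == "DONE":
            return status
        if status == "ERROR":
            raise RuntimeError("model training failed")
        if waited >= timeout_seconds:
            raise TimeoutError("model was not ready in time")
        # Wait before asking again so the API is not hammered.
        sleep(poll_seconds)
        waited += poll_seconds
```

The `sleep` parameter is injectable only so the loop can be exercised without real waiting.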
  18. Google: Real-time Predictions

    Once the model is trained and available, send individual API requests to predict a rating. Batch predictions are available, done by batching API requests.
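"Batching API requests" on the client side could look roughly like the sketch below, which sends individual prediction calls in fixed-size groups. It assumes a hypothetical `predict_one(user_id, movie_id)` function wrapping a single real-time prediction request; the batch size of 20 is an arbitrary illustration, not a documented limit.

```python
from concurrent.futures import ThreadPoolExecutor


def predict_in_batches(pairs, predict_one, batch_size=20):
    """Return {(user_id, movie_id): rating} for every pair.

    predict_one is a hypothetical wrapper around one prediction request.
    Each batch is issued concurrently, and the next batch starts only
    after the previous one has finished.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        for start in range(0, len(pairs), batch_size):
            batch = pairs[start:start + batch_size]
            # pool.map yields results in the same order as the inputs.
            ratings = pool.map(lambda pair: predict_one(*pair), batch)
            results.update(zip(batch, ratings))
    return results
```

Threads are used here because each call mostly waits on the network; a real client would also want retries and rate limiting.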
  19. Google: Real-time Predictions

    Using the real-time predictions API: supply userId and movieId, and it returns the predicted rating (and probabilities of other labels).
  20. Takeaways

    AWS has a basic but powerful interface: it supports regression or multiclass classification and easy model evaluation. Azure has a crazy robust Studio interface with lots of algorithms and power, including the ability to bring your own Python or R code. Google has a limited interface and everything is opaque, but it supports standard PMML for metadata and transformation. It is missing a quality UI, and scoring/eval are all behind the scenes. Marketing would tout this as “simple”.
  21. Takeaways

    AWS and Azure have support for batch predictions, which would be useful for this application. Azure’s algorithm and code support is top-notch; it generates code for C#, Python, and R for batch and streaming endpoints. Google requires you to manually batch, but you can update your trained model as more ratings come in, reducing re-training time.
  22. But Wait, There’s More

    I’ve made a huge mistake! My training dataset had only userId and movieId; those aren’t nearly enough to predict a rating using multiclass classification. Training data should have included userId and movie metadata (like year, genres, title) to generate the best model.
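Enriching the training data with movie metadata might be sketched as a simple join. The file layout is an assumption: CSV inputs with `userId,movieId,rating` and `movieId,title,genres` headers, `|`-separated genres, and the release year in parentheses at the end of the title. The function names are hypothetical.

```python
import csv
import re

# Assumes titles end with the release year in parentheses, e.g. "Toy Story (1995)".
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def load_movies(movie_lines):
    """Index movie metadata by movieId (column names are assumed)."""
    movies = {}
    for row in csv.DictReader(movie_lines):
        match = YEAR_PATTERN.search(row["title"])
        movies[row["movieId"]] = {
            "title": row["title"],
            "year": int(match.group(1)) if match else None,
            "genres": row["genres"].split("|"),
        }
    return movies


def build_training_rows(rating_lines, movies):
    """Yield one training row per rating, enriched with movie metadata."""
    for row in csv.DictReader(rating_lines):
        movie = movies.get(row["movieId"])
        if movie is None:
            continue  # skip ratings for movies with no metadata
        yield {"userId": row["userId"], "movieId": row["movieId"],
               "rating": row["rating"], **movie}
```

Both functions accept any iterable of text lines, so an open file object works as well as a list of strings.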
  23. Other Approaches

    Collaborative filtering: “If you and I like the same thing, I might like other things you like”... or something like that. Other tools can do this pretty easily with userId and movieId. See http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
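To make the "if you and I like the same thing" intuition concrete, here is a tiny neighborhood-based sketch using only userId, movieId, and rating. It is a toy illustration with hypothetical function names, not the alternating least squares (ALS) approach that Spark MLlib implements, and it would not scale to the full dataset.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two {movie_id: rating} dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user_id, movie_id):
    """Similarity-weighted average of other users' ratings for movie_id.

    ratings maps user_id -> {movie_id: rating}. Returns None when no
    similar user has rated the movie.
    """
    target = ratings[user_id]
    weighted_sum = 0.0
    weight_total = 0.0
    for other_id, other in ratings.items():
        if other_id == user_id or movie_id not in other:
            continue
        similarity = cosine_similarity(target, other)
        weighted_sum += similarity * other[movie_id]
        weight_total += similarity
    if weight_total == 0:
        return None
    return weighted_sum / weight_total
```

Users whose tastes overlap more with the target user get more say in the predicted rating; users with no movies in common contribute nothing.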
  24. Final Thoughts

    Garbage In == Garbage Out. Well, Garbage Applied == Garbage Out too. You can’t just throw algorithms against the wall and see what sticks. That’s not data science, and your output and user experience will suffer.
  25. Final Thoughts

    Compare output from each service to see how they performed. I didn’t even talk about cost or pricing, so YMMV. Consider other approaches: it’s not just about finding well-performing algorithms; you must also consider what makes sense for your use case and/or application.
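One way to compare output across services is to score each one's predictions against the same held-out ratings. The sketch below assumes you have already collected a list of predicted ratings per service, in the same order as the actual ratings. The function names and the choice of metrics (exact-match accuracy and RMSE) are illustrative assumptions, not something the talk prescribes.

```python
from math import sqrt


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual rating exactly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted ratings."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def compare_services(actual, predictions_by_service):
    """Return {service_name: {"accuracy": ..., "rmse": ...}}."""
    report = {}
    for name, predicted in predictions_by_service.items():
        if len(predicted) != len(actual):
            raise ValueError(f"{name}: expected {len(actual)} predictions")
        report[name] = {
            "accuracy": accuracy(actual, predicted),
            "rmse": rmse(actual, predicted),
        }
    return report
```

Accuracy treats ratings as labels, while RMSE rewards near misses, so looking at both gives a fuller picture than either alone.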
  26. Final Thoughts

    These tools are very powerful but they aren’t a panacea. Expertise and an understanding of the underlying analyses are critical to making this useful. Be careful going down this road: make sure you’ve understood the data science problem before leveraging