Cloud Machine Learning Comparison

Machine Learning as a Service: An Entertaining Comparison, by Sandeep Parikh

Agenda: What we’ll cover
• Who offers machine learning services?
• How do the services work?
• How do you use them in your applications?

Hi, I’m Sandeep
I’m a Solutions Architect at Google, working on Google Cloud Platform; I spend my time designing and documenting architectural patterns and solutions. Before that I worked for MongoDB. Before that... well, a bunch of other places. I’ve been in Austin for about 12 years so I get to complain about everything.
@crcsmnky

Just To Be Clear
• I am not a data scientist
• This talk is not sanctioned by my employer
• I will do my best to be unbiased

The Contenders
• Amazon Web Services
• Google Cloud Platform
• Microsoft Azure

Required Reading
Thanks to Inês Almeida [1] for putting together an incredibly detailed and thorough Machine Learning Service Comparison [2]. She covers:
• Data sourcing
• Data preprocessing
• Model building
• Model evaluation
[1] https://blog.onliquid.com/author/isbalmeida/
[2] https://blog.onliquid.com/machine-learning-service-benchmark/

Developer Centric Approach
• Not all who need machine learning are data scientists
• Could be used to bootstrap a larger effort
• Perhaps other challenges take precedence

Example Application
• Browse and rate movies
• Using the MovieLens 10M dataset [1]
• Thanks to the fine folks at GroupLens [2]
[1] https://movielens.org/
[2] http://grouplens.org/

Code and Components
• https://github.com/crcsmnky/movieweb
• Built using Python, Flask, MongoDB
• Updated to use machine learning real-time endpoints from each service
• Will push updated code soon™

What’s Missing From MovieWeb?
How do you know if a user will like a particular movie?

Steps and Goals
• Use MovieLens training data
  • Contains userId, movieId, rating
• Train a model
• Evaluate model
• Create “recommendation” service
• Profit!
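The first step, getting a ratings.csv with userId, movieId, rating, can be sketched roughly as follows. This assumes the MovieLens 10M ratings file uses the `UserID::MovieID::Rating::Timestamp` layout; the function name and the two sample lines are made up for illustration and are not taken from the MovieWeb code.

```python
import csv
import io


def ratings_dat_to_csv(dat_lines, out_file):
    """Write userId,movieId,rating rows from '::'-delimited lines.

    Returns the number of rating rows written (header excluded).
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["userId", "movieId", "rating"])
    count = 0
    for line in dat_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The timestamp column is dropped; the services only need three fields.
        user_id, movie_id, rating, _timestamp = line.split("::")
        writer.writerow([user_id, movie_id, rating])
        count += 1
    return count


# Hypothetical sample lines in the assumed format.
sample = ["1::122::5::838985046", "1::185::4.5::838983525"]
buf = io.StringIO()
n = ratings_dat_to_csv(sample, buf)
csv_text = buf.getvalue()
```

In practice you would pass an open ratings file and an open output file instead of the in-memory sample.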

AWS: Create Datasource
ratings.csv → S3

AWS: Create Model

AWS: Correcting Model Type

AWS: Schema configuration
• By default schema read in as: userId (numeric), movieId (numeric), rating (numeric)
• But I need: userId (numeric), movieId (numeric), rating (categorical)
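The corrected schema might look something like the sketch below, built as a Python dict and serialized to JSON. The field names follow the Amazon ML data schema format as best I recall it, so treat them as assumptions and check the service documentation before relying on them; the talk itself made this change through the console.

```python
import json

# Assumed Amazon ML schema layout; only the three columns from ratings.csv.
schema = {
    "version": "1.0",
    "targetAttributeName": "rating",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "userId", "attributeType": "NUMERIC"},
        {"attributeName": "movieId", "attributeType": "NUMERIC"},
        # Default inference would make this NUMERIC (a regression target).
        # CATEGORICAL turns the problem into multiclass classification.
        {"attributeName": "rating", "attributeType": "CATEGORICAL"},
    ],
}

schema_json = json.dumps(schema, indent=2)
attribute_types = {a["attributeName"]: a["attributeType"] for a in schema["attributes"]}
```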

AWS: Create Model

AWS: Evaluate Model
• 1-Click (patented!)

AWS: Predictions
• Generate real-time predictions endpoint
  • Supply userId, movieId
  • Returns predicted rating
• Upload dataset and run batch predictions
  • One time, but need to generate dataset (about 8 GB)
  • Repeat as-needed (daily? weekly?)
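A real-time call could be sketched like this, assuming boto3's `machinelearning` client and its `predict(MLModelId=..., Record=..., PredictEndpoint=...)` call. The model ID and endpoint URL are placeholders, and `FakeMLClient` is a hypothetical stand-in so the sketch runs without AWS credentials; its response shape is my assumption of what a multiclass model returns.

```python
def predict_rating(ml_client, model_id, endpoint_url, user_id, movie_id):
    """Ask a real-time endpoint for the predicted rating label."""
    response = ml_client.predict(
        MLModelId=model_id,
        # Record values are sent as strings, keyed by schema attribute name.
        Record={"userId": str(user_id), "movieId": str(movie_id)},
        PredictEndpoint=endpoint_url,
    )
    return response["Prediction"]["predictedLabel"]


class FakeMLClient:
    """Hypothetical stub mimicking the assumed response shape."""

    def predict(self, MLModelId, Record, PredictEndpoint):
        self.last_record = Record
        return {"Prediction": {"predictedLabel": "4",
                               "predictedScores": {"4": 0.6, "5": 0.4}}}


fake = FakeMLClient()
label = predict_rating(fake, "ml-EXAMPLE", "https://example.invalid/endpoint", 42, 122)
```

With real credentials, `ml_client` would be `boto3.client("machinelearning")` and the endpoint URL would come from the model's real-time endpoint settings.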

AWS: Limits

Azure: Create Experiment
• Create new datasource and upload ratings.csv
• Then jump into Azure ML Studio... wow!
• After some trial and error, landed on an approach

Azure: Studio

Azure: Training Experiment
• Choose features
• Split 70/30
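Outside of Studio, the 70/30 split amounts to something like the following. This is a plain-Python sketch for illustration, not what the Studio split module does internally; the function name, seed, and placeholder rows are made up.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder (userId, movieId, rating) rows.
rows = [(u, m, 3.0) for u in range(10) for m in range(10)]
train, test = split_rows(rows)
```

The held-out 30% is what the evaluation step scores the trained model against.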

Azure: Evaluation Results

Azure: Convert to Predictive Experiment
• First, run training experiment
• Then, you’ve got a predictive experiment

Azure: Predictions
• Generate real-time predictions endpoint
  • Supply userId, movieId
  • Returns predicted rating (and probabilities of other labels)
• Upload dataset and run batch predictions
  • One time, but need to generate dataset
  • Repeat as-needed (daily? weekly?)

Azure: Limits
• 20 requests per endpoint (up to 10,000 endpoints)
• Up to 10 GB for training data
• No mention of batch limitations or daily limits

Google Cloud Platform

Google: Upload Data
• Create a Cloud Storage bucket and upload ratings.csv
• ...Then it’s all API-driven
• But, it does support Predictive Model Markup Language (PMML) for metadata, transformations, etc.

Google: Train API

Google: Model Availability
• Model training is asynchronous, so you have to check when it’s ready
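Checking for readiness usually ends up as a polling loop along these lines. The status strings ("RUNNING", "DONE", and values starting with "ERROR") are my recollection of what the service reported and should be treated as assumptions; `get_status` is a hypothetical callable that would wrap the actual API request.

```python
import time


def wait_until_trained(get_status, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll get_status() until training finishes.

    Returns True when the status is "DONE", False if max_polls is exhausted,
    and raises RuntimeError if the status reports an error.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == "DONE":
            return True
        if status.startswith("ERROR"):
            raise RuntimeError("training failed: " + status)
        sleep(poll_seconds)
    return False


# Simulated status sequence so the sketch runs without any network access.
statuses = iter(["RUNNING", "RUNNING", "DONE"])
waits = []
ready = wait_until_trained(lambda: next(statuses), poll_seconds=5.0, sleep=waits.append)
```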

Google: Real-time Predictions
• Once model is trained and available, send individual API requests to predict rating
• Batch predictions available, done by batching API requests
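Manual batching boils down to chunking the (userId, movieId) pairs and looping over individual requests. A minimal sketch, where `predict_one` is a hypothetical callable wrapping a single prediction request and the daily limit is passed in by the caller:

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_batches(predict_one, pairs, batch_size=100, daily_limit=2_000_000):
    """Run one prediction request per pair, grouped into batches."""
    pairs = list(pairs)
    if len(pairs) > daily_limit:
        raise ValueError("request count exceeds the daily prediction quota")
    results = {}
    for batch in chunked(pairs, batch_size):
        for user_id, movie_id in batch:
            results[(user_id, movie_id)] = predict_one(user_id, movie_id)
        # A batch boundary is a natural place to checkpoint results or pause.
    return results


def fake_predict(user_id, movie_id):
    """Stand-in for a real API request; always answers "4"."""
    return "4"


pairs = [(u, m) for u in range(3) for m in range(5)]
batches = list(chunked(pairs, 4))
results = predict_in_batches(fake_predict, pairs, batch_size=4)
```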

Google: Real-time Predictions
• Using real-time predictions API
  • Supply userId, movieId
  • Returns predicted rating (and probabilities of other labels)

Google: Limits
• 2,000,000 predictions per day
• Training data up to 2.5 GB

Wrap Up

Takeaways
• AWS has a basic but powerful interface: it supports regression or multiclass classification and easy model evaluation.
• Azure has a crazy robust Studio interface with lots of algorithms and power, including the ability to use BYO Python or R code.
• Google has a limited interface and everything is opaque, but it supports standard PMML for metadata and transformation. It is missing a quality UI, and scoring/eval are all behind the scenes. Marketing would tout this as “simple”.

Takeaways
• AWS and Azure have support for batch predictions, which would be useful for this application.
• Azure’s algorithm and code support is top-notch. It generates code for C#, Python, and R for batch and streaming endpoints.
• Google requires you to manually batch, but you can update your trained model as more ratings come in, reducing re-training time.

But Wait, There’s More
• I’ve made a huge mistake!
• My training dataset had userId and movieId - those aren’t nearly enough to predict using multiclass classification
• Training data should have included userId and movie metadata (like year, genres, title) to generate the best model
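A richer training set could be assembled by joining each rating with its movie's metadata, roughly as sketched here. This assumes the MovieLens movies file uses `MovieID::Title::Genres` with the release year in parentheses at the end of the title and genres separated by `|`; the function names and sample rows are illustrative only.

```python
import re


def parse_movie(line):
    """Parse one 'MovieID::Title::Genres' line into (movie_id, metadata)."""
    movie_id, title, genres = line.strip().split("::")
    match = re.search(r"\((\d{4})\)\s*$", title)
    year = int(match.group(1)) if match else None
    return movie_id, {"title": title, "year": year, "genres": genres.split("|")}


def enrich(ratings, movies):
    """Yield training rows that carry movie metadata alongside each rating."""
    for user_id, movie_id, rating in ratings:
        meta = movies.get(movie_id)
        if meta is None:
            continue  # no metadata for this movie, so skip the rating
        yield {"userId": user_id, "title": meta["title"], "year": meta["year"],
               "genres": meta["genres"], "rating": rating}


# Hypothetical sample rows in the assumed formats.
movie_lines = ["1::Toy Story (1995)::Adventure|Animation|Children|Comedy|Fantasy"]
movies = dict(parse_movie(line) for line in movie_lines)
ratings = [("42", "1", "5"), ("42", "999", "3")]
rows = list(enrich(ratings, movies))
```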

Other Approaches
• Collaborative Filtering
  • “If you and I like the same thing, I might like other things you like”
  • ...Or something like that
• Other tools can do this pretty easily with userId and movieId
• See http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html
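To make the idea concrete, here is a tiny neighborhood-based collaborative filter in plain Python. It is a toy sketch of "users who rate alike", and it is not the matrix-factorization (ALS) approach the Spark MLlib page describes; the users, movies, and ratings are invented.

```python
from math import sqrt


def similarity(a, b):
    """Cosine similarity over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for movie."""
    weighted_sum = total_weight = 0.0
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        weight = similarity(ratings[user], theirs)
        weighted_sum += weight * theirs[movie]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


# Invented ratings: alice agrees with bob and disagrees with carol.
ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 4, "C": 5},
    "carol": {"A": 1, "B": 2, "C": 1},
}
guess = predict(ratings, "alice", "C")
unknown = predict(ratings, "alice", "Z")
```

The prediction for alice lands closer to bob's rating than carol's because bob's similarity weight is higher. Note this needs only userId, movieId, and rating, which is exactly the data the original training set had.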

Final Thoughts
• Garbage In == Garbage Out
• Well, Garbage Applied == Garbage Out too
• Can’t just throw algorithms against the wall and see what sticks
• That’s not data science - your output and user experience will suffer

Final Thoughts
• Compare output from each service to see how they performed
• Didn’t even talk about cost or pricing so YMMV
• Consider other approaches
  • Not just about finding well-performing algorithms
  • Must also consider what makes sense for your use case and/or application

Final Thoughts
• These tools are very powerful but they aren’t a panacea
• Expertise and an understanding of the underlying analyses are critical to making this useful
• Be careful going down this road - make sure you’ve understood the data science problem before leveraging these tools

Thanks!
Where can you find me?
• @crcsmnky
• http://github.com/crcsmnky
Questions?
