Diagnosing Cancer with Azure Machine Learning Preview

Craig Stuntz • Improving Enterprises
https://www.flickr.com/photos/nasamarshall/12815430035 https://www.flickr.com/photos/javism/8737879875

Slides https://speakerdeck.com/craigstuntz

Machine Learning

Functions
y = x²
If I give you the formula, it’s easy to produce the curve. A bit harder to do in reverse, but maybe you recognize the shape? Machine learning in a nutshell: derive algorithms from data. “Running programs backwards.”
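As a minimal sketch of “running programs backwards,” the snippet below first produces points from the formula, then recovers the formula from the points by least squares. The model form y = a·x² and the helper name `fit_quadratic_coefficient` are assumptions made for illustration; they are not from the talk.

```python
def fit_quadratic_coefficient(xs, ys):
    """Least-squares estimate of a in the assumed model y = a * x**2."""
    numerator = sum((x ** 2) * y for x, y in zip(xs, ys))
    denominator = sum(x ** 4 for x in xs)
    return numerator / denominator


# Forward direction: formula -> points on the curve.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

# Reverse direction: points -> formula.
a = fit_quadratic_coefficient(xs, ys)
print(f"recovered model: y = {a:.2f} * x^2")
```

Real machine learning rarely knows the shape of the formula in advance, which is why choosing a model family is part of the work.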

Real-World Machine Learning
• Spam filters
• Shopping recommendations
• Pricing
• Credit fraud detection
• Identify cats on YouTube http://arxiv.org/pdf/1112.6209v5.pdf

Linear Regression
http://commons.wikimedia.org/wiki/File:Linear_regression.svg
Real data doesn’t always fit the curve. The red line is a model of a real-world system. There is error. Is it in the model, the measurements, or is the real world just complicated? There is no clear answer without more information.
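A small sketch of ordinary least-squares line fitting shows where that error appears. The data points are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements that roughly follow y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)

# Residuals are the error left over after the model has done its best.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept, residuals)
```

The residuals are nonzero even for the best-fitting line, and the numbers alone do not say whether the model, the measurements, or the world is responsible.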

Overfitting
http://commons.wikimedia.org/wiki/File:Overfit.png
The curved line passes through all points, but the straight line is a better model: it better reflects data we haven’t seen yet.
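The effect can be reproduced with a toy sketch: a polynomial that passes through every training point is compared against a straight line on one held-out point. The data values are invented for illustration, and Lagrange interpolation stands in for “the curved line”; it is an assumption, not necessarily how the linked figure was produced.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def lagrange_interpolate(xs, ys, x):
    """Evaluate the unique polynomial through all (xs, ys) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# Made-up training data: roughly y = x with alternating noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.5, 0.5, 2.5, 2.5, 4.5, 4.5]
# One made-up point the models have not seen.
test_x, test_y = 6.0, 6.2

slope, intercept = fit_line(train_x, train_y)
line_error = abs(slope * test_x + intercept - test_y)

curve_train_error = max(
    abs(lagrange_interpolate(train_x, train_y, x) - y)
    for x, y in zip(train_x, train_y)
)
curve_error = abs(lagrange_interpolate(train_x, train_y, test_x) - test_y)

print(f"curve: train error {curve_train_error:.3f}, unseen error {curve_error:.3f}")
print(f"line:  unseen error {line_error:.3f}")
```

The interpolating curve has zero training error yet misses the unseen point badly, while the line stays close to it.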

Categorization
http://commons.wikimedia.org/wiki/File:CART_tree_titanic_survivors.png
We’ve discussed regression. Categorization is… This is a decision tree to predict Titanic survivors. Regression and categorization are both supervised learning.
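Once learned, a decision tree is just a series of nested questions. The sketch below hand-codes one; the feature names and split thresholds are assumptions loosely modeled on the linked figure and should be checked against it before being relied on.

```python
def predict_survival(sex, age, siblings_spouses):
    """Hand-written decision tree; thresholds are illustrative assumptions."""
    if sex != "male":
        return True   # first split: sex
    if age > 9.5:
        return False  # second split: age
    # third split: number of siblings/spouses aboard
    return siblings_spouses <= 2.5


print(predict_survival("female", 30, 0))
print(predict_survival("male", 40, 0))
print(predict_survival("male", 5, 1))
```

A learning algorithm such as CART chooses these questions and thresholds from labeled examples rather than having a person write them.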

Unsupervised Learning
http://commons.wikimedia.org/wiki/File:KMeans-Gaussian-data.svg
Everything so far presumed there were examples with known values. This is k-means clustering. It answers “What can you tell me about X?” instead of “Predict Y for X.” Supervised learning covers regression and categorization; unsupervised learning covers clustering.
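A minimal one-dimensional sketch of k-means (Lloyd's algorithm) follows. The points and starting centroids are made up, and `k_means_1d` is a hypothetical helper name; note that no labels are supplied anywhere.

```python
def k_means_1d(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[1.0, 2.0])
print(centroids, clusters)
```

Even with poorly chosen starting centroids, the algorithm finds the two groups in this data; on harder data the result can depend on the starting points.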

The Language of Data
ML is full of jargon: features, target variable, categorical or nominal data, continuous data, examples, classification, two-class data.

Data Sets
Training, Validation, Test
For supervised learning, we often partition/sample the data.
Training set: adjust weights
Validation set: minimize overfitting
Test set: test the final system. Omitted in simple examples.
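One simple way to partition is to shuffle and slice, sketched below. The 60/20/20 proportions, the fixed seed, and the name `split_dataset` are assumptions for illustration; the talk does not prescribe them.

```python
import random


def split_dataset(examples, train_fraction=0.6, validation_fraction=0.2, seed=0):
    """Shuffle the examples, then slice into training, validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Every example lands in exactly one set, so the test set stays unseen until the final evaluation.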

Classification Imbalance
The dataset is imbalanced. We can use oversampling or undersampling. For some problems it’s better to have a false positive than a false negative, or vice versa.
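Random oversampling can be sketched as repeating minority-class examples until every class is the same size. The labels, the data, and the helper name `oversample_minority` are made up for illustration.

```python
import random
from collections import Counter


def oversample_minority(examples, label_of, seed=0):
    """Duplicate randomly chosen examples of smaller classes
    until every class matches the size of the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for example in examples:
        by_label.setdefault(label_of(example), []).append(example)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Made-up imbalanced data: nine examples of one class, one of the other.
examples = [("benign", i) for i in range(9)] + [("malignant", 0)]
balanced = oversample_minority(examples, label_of=lambda e: e[0])
counts = Counter(label for label, _ in balanced)
print(counts)
```

Undersampling does the opposite, discarding majority-class examples, which loses data but avoids exact duplicates.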

Evaluation
Accuracy: (TP+TN)/n
Recall: fraction of actual positives that the classifier identified, TP/(TP+FN)
Precision: fraction of predicted positives that are actually positive, TP/(TP+FP)
Receiver Operating Characteristic (ROC): AUC is useful, but you still need to look at the curve. Also, some algorithms have different error characteristics for FP vs. FN.
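The three ratio metrics follow directly from the confusion-matrix counts, as in this sketch. The counts are made up, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, recall, and precision from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,
        "recall": tp / (tp + fn),     # share of actual positives found
        "precision": tp / (tp + fp),  # share of positive predictions that were right
    }


# Made-up counts for a two-class problem.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)
```

The sketch assumes at least one actual positive and one positive prediction; otherwise recall or precision would divide by zero.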

Recall vs. Precision
• For an imperfect model, you can construct a classifier which is perfect for recall or precision, but not both.
• A classifier which always reports positive has perfect recall but low precision. (“Favors” false positive.)
• A classifier which always reports negative never produces a false positive, so its precision is trivially perfect (strictly, undefined), but its recall is zero. (“Favors” false negative.)
• Real-world problems want the best mix of both, with a bias dictated by the problem itself.
• Use a cost function to influence the model.
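One way to see the trade-off is to sweep a decision threshold over scored examples and pick the threshold with the lowest total cost. The scores, labels, thresholds, and cost weights below are all invented, and the helper names are hypothetical.

```python
def confusion_counts(scores, labels, threshold):
    """Return (tp, fp, tn, fn) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def cheapest_threshold(scores, labels, thresholds, fp_cost, fn_cost):
    """Pick the threshold that minimizes fp_cost * FP + fn_cost * FN."""
    def cost(threshold):
        _, fp, _, fn = confusion_counts(scores, labels, threshold)
        return fp_cost * fp + fn_cost * fn
    return min(thresholds, key=cost)


# Made-up classifier scores and true labels.
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
labels = [False, False, True, False, True, True]
thresholds = [0.0, 0.35, 0.5, 0.65, 0.8, 1.0]

# Threshold 0.0 always reports positive: no false negatives, many false positives.
always_positive = confusion_counts(scores, labels, 0.0)

# When a miss is ten times worse than a false alarm, the cheapest threshold drops.
miss_is_worse = cheapest_threshold(scores, labels, thresholds, fp_cost=1, fn_cost=10)
# When a false alarm is ten times worse than a miss, it rises.
false_alarm_is_worse = cheapest_threshold(scores, labels, thresholds, fp_cost=10, fn_cost=1)
print(always_positive, miss_is_worse, false_alarm_is_worse)
```

The same classifier scores yield different operating points depending on which kind of error the problem penalizes more.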

Workflow
Collect Data
Prepare: Clean, normalize, reduce dimensionality
Analyze, Consider Goal, Choose Algorithm
Train Algorithm
Test Algorithm
Iterate Until Satisfactory
Use System
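As one example of the “normalize” step, min-max scaling maps a feature onto the range 0 to 1 so that features with large units do not dominate. This is a sketch of one common technique, not necessarily the one used in the talk; the values and the name `min_max_normalize` are made up.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([10.0, 20.0, 15.0, 30.0])
print(normalized)
```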

Azure Machine Learning “Predictions as a Service”

Azure Machine Learning
• Experiment, create web services for predictions, then sell them on the Azure Marketplace
• Machine learning “IDE”
• Algorithms from Xbox, Bing, and more
• First-class R support
• Data from SQL Azure, HDInsight
Features

Pricing
Free tier                 Limited duration, nodes, API
Studio experiment hour    $0.38
Studio predictions        Free
API hour                  $0.75
1000 API predictions      $0.18
Free tier: No Azure account required, max 1 hour experiment duration, single node, staging API only (no production).
Standard tier: Need Azure account, but no other limit.

Still in Beta

Where to Learn More
• Predictive Modeling with Azure ML Studio video
• Machine Learning in Action, by Peter Harrington
• Andrew Ng’s Machine Learning class, Stanford/Coursera
• UC Irvine Machine Learning Dataset Repository

CRAIG STUNTZ @CraigStuntz [email protected] http://blogs.teamb.com/craigstuntz http://www.meetup.com/Papers-We-Love-Columbus/ Questions?
