Diagnosing Cancer with Azure Machine Learning Preview

Craig Stuntz

November 10, 2014

Transcript

  1. Diagnosing Cancer with Azure Machine Learning Preview
    Craig Stuntz • Improving Enterprises
    https://www.flickr.com/photos/nasamarshall/12815430035
    https://www.flickr.com/photos/javism/8737879875
  2. Functions
    y = x². If I give you the formula, it’s easy to produce the curve. A bit harder to do in reverse, but maybe you recognize the shape? Machine learning in a nutshell: Derive algorithms from data. “Running programs backwards.”
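As a rough illustration of "running programs backwards", the sketch below starts from sampled points and recovers the coefficient of the formula that produced them. It assumes the curve has the form y = a * x**2 and uses an ordinary least-squares estimate; the function name `fit_quadratic_coefficient` and the sample range are made up for this example.

```python
def fit_quadratic_coefficient(points):
    """Least-squares estimate of a in y = a * x**2 (assumed model form)."""
    numerator = sum((x ** 2) * y for x, y in points)
    denominator = sum(x ** 4 for x, _ in points)
    return numerator / denominator


# Forward direction: the formula produces the curve.
samples = [(x, x ** 2) for x in range(-3, 4)]

# Reverse direction: the data produces the formula's coefficient.
a = fit_quadratic_coefficient(samples)
print(a)
```

With noise-free samples the recovered coefficient matches the one used to generate the data; with real measurements it would only approximate it.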
  3. Real-World Machine Learning
    • Spam filters
    • Shopping recommendations
    • Pricing
    • Credit fraud detection
    • Identify cats on YouTube http://arxiv.org/pdf/1112.6209v5.pdf
  4. Linear Regression
    http://commons.wikimedia.org/wiki/File:Linear_regression.svg
    Real data doesn’t always fit the curve. The red line is a model of a real-world system. There is error. Is it in the model, the measurements, or is the real world just complicated? There is no clear answer without more information.
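A minimal sketch of fitting a line to data that does not sit exactly on it, assuming simple one-variable least squares. The data points are invented for illustration, and `fit_line` is a hypothetical helper, not something from the talk or from Azure ML.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements that roughly follow y = x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.0]

slope, intercept = fit_line(xs, ys)

# Residuals: the error left over between each measurement and the model.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(slope, intercept, residuals)
```

The residuals are nonzero even for the best-fitting line, and nothing in the fit itself says whether they come from the model, the measurements, or the system being measured.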
  5. Categorization
    http://commons.wikimedia.org/wiki/File:CART_tree_titanic_survivors.png
    We’ve discussed regression. Categorization is… This is a decision tree to predict Titanic survivors. Regression and categorization are supervised learning.
  6. Unsupervised Learning
    http://commons.wikimedia.org/wiki/File:KMeans-Gaussian-data.svg
    Everything so far presumed there were examples with known values. This is k-means clustering. “What can you tell me about X” instead of “Predict Y for X.” Supervised (regression, categorization) / unsupervised (clustering).
  7. The Language of Data
    ML is full of jargon: features, target variable, categorical or nominal data, continuous data, examples, classification, two-class data.
  8. Data Sets: Training, Validation, Test
    For supervised learning, we often partition/sample data.
    Training set: Adjust weights.
    Validation set: Minimize overfitting.
    Test set: Test final system. Omitted in simple examples.
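One way to partition a dataset into the three sets is a shuffled split, sketched here. The 60/20/20 proportions, the fixed seed, and the name `split_dataset` are assumptions chosen for the example; the talk does not prescribe them.

```python
import random


def split_dataset(examples, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle a copy of the examples and cut it into three disjoint parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Every example lands in exactly one set, so the test set stays unseen while weights are adjusted and overfitting is checked.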
  9. Classification Imbalance
    The dataset is imbalanced. Can use oversampling or undersampling. For some problems it’s better to have a false positive than a false negative, or vice versa.
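A minimal sketch of random oversampling, assuming binary labels 0 and 1 and at least one example of each class: minority examples are repeated at random until both classes are the same size. The helper name `oversample_minority` and the toy data are hypothetical.

```python
import random


def oversample_minority(examples, labels, seed=0):
    """Duplicate random minority-class examples until the classes are balanced."""
    rng = random.Random(seed)
    positives = [(x, y) for x, y in zip(examples, labels) if y == 1]
    negatives = [(x, y) for x, y in zip(examples, labels) if y == 0]
    minority, majority = sorted([positives, negatives], key=len)
    # Draw with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Toy data: 8 negative examples and 2 positive ones.
examples = list(range(10))
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

balanced = oversample_minority(examples, labels)
positive_count = sum(1 for _, y in balanced if y == 1)
print(len(balanced), positive_count)
```

Undersampling would go the other way and discard majority examples instead. Either technique is normally applied to the training data only, so that evaluation still reflects the real class balance.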
  10. Evaluation
    Accuracy: (TP+TN)/n.
    Recall: fraction of actual positives that the classifier found, TP/(TP+FN).
    Precision: fraction of predicted positives that are actually positive, TP/(TP+FP).
    Receiver Operating Characteristic (ROC): AUC is useful, but you still need to look at the curve.
    Also, some algorithms have different error characteristics for FP vs. FN.
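The three formulas can be computed directly from the confusion-matrix counts. This is a small sketch on made-up labels (1 = positive, 0 = negative); the function names are illustrative only.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    # Fraction of actual positives that were found.
    return tp / (tp + fn)


def precision(tp, fp):
    # Fraction of predicted positives that were correct.
    return tp / (tp + fp)


actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(accuracy(tp, tn, fp, fn), recall(tp, fn), precision(tp, fp))
```

Here one positive is missed and one negative is flagged, so accuracy is 6/8 while recall and precision are both 2/3.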
  11. Recall vs. Precision
    • For an imperfect model, you can construct a classifier which is perfect for recall or precision, but not both.
    • A classifier which always reports positive has perfect recall but low precision: its precision equals the fraction of examples that are actually positive. (“Favors” false positives.)
    • A classifier which always reports negative never produces a false positive, so its precision is trivially perfect (strictly, TP/(TP+FP) is 0/0 and undefined), but its recall is zero. (“Favors” false negatives.)
    • Real-world problems want the best mix of both, with a bias dictated by the problem itself.
    • Use a cost function to influence the model.
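The two degenerate classifiers can be checked on a toy dataset. In this sketch the labels are invented (2 positives out of 10), and an undefined ratio (0/0) is reported as `None`, which is a choice made for this example rather than a standard convention.

```python
def recall_and_precision(actual, predicted):
    """Return (recall, precision); None marks an undefined 0/0 ratio."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    return recall, precision


actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A "classifier" that ignores its input and always answers positive.
always_positive = [1] * len(actual)
# A "classifier" that ignores its input and always answers negative.
always_negative = [0] * len(actual)

pos_recall, pos_precision = recall_and_precision(actual, always_positive)
neg_recall, neg_precision = recall_and_precision(actual, always_negative)
print(pos_recall, pos_precision)
print(neg_recall, neg_precision)
```

Always answering positive finds every positive but is right only as often as positives occur; always answering negative makes no false-positive mistakes but finds nothing, which is why a useful model has to trade the two off.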
  12. Workflow
    • Collect Data
    • Prepare: Clean, normalize, reduce dimensionality
    • Analyze, Consider Goal, Choose Algorithm
    • Train Algorithm
    • Test Algorithm
    • Iterate Until Satisfactory
    • Use System
  13. Azure Machine Learning
    • Experiment, create web services for predictions, then sell them on the Azure Marketplace
    • Machine learning “IDE”
    • Algorithms from Xbox, Bing, and more
    • First-class R support
    • Data from SQL Azure, HDInsight
    Features
  14. Pricing
    Free tier: Limited duration, nodes, API
    Studio experiment hour: $0.38
    Studio predictions: Free
    API hour: $0.75
    1000 API predictions: $0.18
    Free tier: No Azure account required, max 1 hour experiment duration, single node, staging API only (no production). Standard tier: Need Azure account, but no other limit.
  15. Where to Learn More
    • Predictive Modeling with Azure ML Studio video
    • Machine Learning in Action, by Peter Harrington
    • Andrew Ng’s Machine Learning class, Stanford/Coursera
    • UC Irvine Machine Learning Dataset Repository