Predictive Modeling with IPython and scikit-learn

Talk given at Strata Santa Clara 2014. #strataconf

Olivier Grisel

February 12, 2014

Transcript

  1. Predictive Modeling with IPython & scikit-learn Strata - 2014 -

    Santa Clara
  2. Outline • Predictive Modeling & scikit-learn • Application to “Learning

    to Rank” for web search • Forests of Randomized Trees • The Python ecosystem: IPython, StarCluster
  3. Training! text docs! images! sounds! transactions Predictive Modeling Data Flow

  4. Training! text docs! images! sounds! transactions Labels Predictive Modeling Data

    Flow
  5. Training! text docs! images! sounds! transactions Labels Machine! Learning! Algorithm

    Predictive Modeling Data Flow Feature vectors
  6. Training! text docs! images! sounds! transactions Labels Machine! Learning! Algorithm

    Model Predictive Modeling Data Flow Feature vectors
  7. New! text doc! image! sound! transaction Model Expected! Label Predictive

    Modeling Data Flow Training! text docs! images! sounds! transactions Labels Machine! Learning! Algorithm Feature vectors Feature vector
  8. New! text doc! image! sound! transaction Model Expected! Label Predictive

    Modeling Data Flow Feature vector
  9. Possible Applications • Text Classification / Sequence Tagging NLP •

    Spam Filtering, Sentiment Analysis... • Computer Vision / Speech Recognition • Learning To Rank - IR and advertisement • Science: Statistical Analysis of the Brain, Astronomy, Biology, Social Sciences...
  10. scikit-learn • Library of Machine Learning algorithms • Focus on standard

    methods (e.g. ESL-II) • Open Source (BSD) • Simple fit / predict / transform API • Python / NumPy / SciPy / Cython • Model Assessment, Selection & Ensembles
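The later code slides demonstrate fit / predict; the transform side of the API can be sketched with a feature scaler (a minimal example of my own, not from the deck):

```python
# Sketch of scikit-learn's transform API: preprocessing objects are
# estimators too, with fit / transform instead of fit / predict.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0],
                    [2.0, 150.0],
                    [3.0, 100.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)  # fit, then transform, in one call

print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # ... and unit standard deviation
```

The same fitted scaler can then rescale test data with scaler.transform(X_test), keeping train and test on the same scale.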
  11. [Figure: toy supervised learning dataset] Feature matrix X: one

    row per location, one column per descriptor:

    location   Latitude  Altitude  Dist. to closest river  Alt. of closest river  Slope  Orientation
    1          46.       200.      1                       0                      0.0    N
    2          -30.      150.      2.                      149                    0.1    S
    3          87.       50        1000                    10                     0.1    W
    4          45.       10        10.                     1                      0.4    NW
    5          5.        2.        67.                     1.                     0.2    E

    Target y: one-hot encoded Vegetation Cover Type (Rain forest, Grassland, Arid, Ice) for each location
  12. Support Vector Machine

    from sklearn.svm import SVC

    model = SVC(kernel="rbf", C=1.0, gamma=1e-4)
    model.fit(X_train, y_train)

    y_predicted = model.predict(X_test)

    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
  13. Linear Classifier

    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(alpha=1e-4, penalty="elasticnet")
    model.fit(X_train, y_train)

    y_predicted = model.predict(X_test)

    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
  14. Random Forests

    from sklearn.ensemble import RandomForestClassifier

    model = RandomForestClassifier(n_estimators=200)
    model.fit(X_train, y_train)

    y_predicted = model.predict(X_test)

    from sklearn.metrics import f1_score
    f1_score(y_test, y_predicted)
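The three snippets above assume pre-split X_train / y_train arrays; a self-contained variant on synthetic data (dataset and split are mine, and the import paths follow current scikit-learn, where train_test_split moved to sklearn.model_selection after this 2014 talk):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature vectors / labels of the data-flow slides.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

y_predicted = model.predict(X_test)
print(f1_score(y_test, y_predicted))
```

Swapping RandomForestClassifier for SVC or SGDClassifier leaves everything else unchanged, which is the point of the shared fit / predict API.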
  16. scikit-learn contributors • GitHub-centric contribution workflow • each pull request

    needs 2 x [+1] reviews • code + tests + doc + example • 92% test coverage / Continuous Integration • 2-3 major releases per year + bug-fix releases • 80+ contributors for release 0.14
  17. Example: Learning to Rank

  18. Example: Learning to Rank • Learning to rank web search

    results • Input: numerical descriptors for query / result pairs • Target: relevance score • 0: irrelevant • 1: somewhat relevant • 2: relevant
  19. Input Features • Result page descriptors: • PageRank, Click Through

    Rate, last update time… • Query / Result page descriptors • BM25, TF*IDF cosine similarity • Ratio of covered query terms • User context descriptors: past user interactions (clicks, +1), time of day, day of month, month of year and user language • … typically more than 40 descriptors and up to several hundred
  20. Quantifying Success • Measure discrepancy between predicted and true relevance

    scores • Traditional Regression Metrics: • Mean Absolute Error • Explained Variance • But the ranking quality is more important than the predicted scores…
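For concreteness, both traditional metrics on toy relevance scores (the numbers are mine):

```python
from sklearn.metrics import explained_variance_score, mean_absolute_error

y_true = [0, 2, 1, 2, 0]            # true relevance levels
y_pred = [0.1, 1.8, 0.9, 1.5, 0.2]  # regressor output

print(mean_absolute_error(y_true, y_pred))       # -> 0.22
print(explained_variance_score(y_true, y_pred))  # fraction of variance explained
```

Both treat every query/result pair independently, which is why a rank-aware metric such as NDCG is needed.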
  21. NDCG: a ranking metric

  22. NDCG in Greek • DCG@k = Discounted Cumulative Gain at rank k
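A minimal sketch of the metric (slide 56 later shows "NDCG in Python" only as an image; here I use the common 2^rel - 1 gain with a log2 discount, and the helper names are mine):

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted Cumulative Gain at rank k: sum of (2^rel - 1) / log2(rank + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # ranks 1..k -> log2(2..k+1)
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg_at_k(relevances, k):
    """DCG normalized by the ideal (descending) ordering; result lies in [0, 1]."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([2, 1, 0], 3))  # already ideally ordered -> 1.0
print(ndcg_at_k([0, 1, 2], 3))  # worst ordering -> below 1.0
```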
  23. Data from Microsoft Bing • http://research.microsoft.com/en-us/projects/mslr • 10K or 30K

    anonymized queries (terms and results URLs) • 10K queries: • ~1.2M search results • 136 descriptors • 5 target relevance levels • ~650MB in NumPy
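The MSLR files use an svmlight-style format with qid annotations, which scikit-learn parses directly via load_svmlight_file. A round-trip sketch on a tiny stand-in file (the file name and toy values are mine):

```python
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# A tiny stand-in for an MSLR-style file: descriptors, relevance, query ids.
X = np.array([[0.1, 0.9],
              [0.4, 0.2],
              [0.8, 0.5]])
y = np.array([2, 0, 1])    # relevance levels (0, 1, 2)
qid = np.array([1, 1, 2])  # which query each result belongs to

dump_svmlight_file(X, y, "tiny_mslr.txt", query_id=qid)
X2, y2, qid2 = load_svmlight_file("tiny_mslr.txt", query_id=True)
print(y2, qid2)  # relevance scores and query ids round-trip
```

With query_id=True the loader returns the query ids alongside X and y, which is what a ranking metric like NDCG needs to group results per query.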
  24. Disclaimer: this is not Big Data • Couple of GB:

    fits in RAM on my laptop • But painful to download / upload over the internet. • Processing and modeling can be CPU intensive (and sometimes distributed).
  25. Growing randomized trees

  26. Training a Decision Tree

  27. Training a Decision Tree Term Match! Rate

  28. Training a Decision Tree Term Match! Rate < 0.2 >

    0.2
  29. Training a Decision Tree Term Match! Rate Score == 0

    < 0.2 > 0.2
  30. Training a Decision Tree Term Match! Rate Score == 0

    < 0.2 PageRank > 0.2 < 3 > 3
  31. Training a Decision Tree Term Match! Rate Score == 0

    < 0.2 PageRank > 0.2 < 3 Score == 1 > 3
  32. Training a Decision Tree Term Match! Rate Score == 0

    < 0.2 PageRank > 0.2 < 3 Score == 1 > 3 Score == 2
  33. Training a Randomized Tree • Pick a random subset of

    features (e.g. TFIDF, BM25, PageRank, CTR…) • Find the feature that best splits the dataset • Randomize the split threshold between observed min and max values • Send each half of the split dataset to build the 2 subtrees
  34. Training a Forest • Train n random trees independently •

    Use different PRNG seeds • At prediction time, make each tree predict its best guess and: • make them vote (classification) • average predicted values (regression)
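ExtraTreesRegressor implements exactly this recipe (independent randomized trees, averaged at prediction time); the synthetic data below is mine, and n_jobs=-1 spreads tree growing over all local cores, matching the "8 CPUs" slide that follows:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=200)  # toy relevance-like target

forest = ExtraTreesRegressor(n_estimators=50, n_jobs=-1, random_state=0)
forest.fit(X, y)

# The forest prediction is the average of the individual tree predictions.
tree_preds = [tree.predict(X[:1])[0] for tree in forest.estimators_]
print(forest.predict(X[:1])[0], np.mean(tree_preds))
```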
  35. Extra Trees: one node with 8 CPUs

  36. Growing randomized trees on the cloud

  37. 10 x 8-core cluster on EC2 in 20 min

  39. >>> Configuring cluster took 12.865 mins

    >>> Starting cluster took 20.144 mins
  40. • Notebook interface: in-browser, interactive data exploration environment • IPython.parallel:

    async load-balancing API for interactively dispatching processing tasks • Based on ZeroMQ and msgpack for IPC
  42. Grow random trees in parallel in the cloud

  43. Fetch back all the trees as a big forest on

    one node
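A local sketch of that merge step (two locally trained forests stand in for forests grown on different EC2 nodes; concatenating the fitted estimators_ lists is the trick pyrallel relied on, though it mutates attributes rather than using public API):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(100, 4))
y = X.sum(axis=1)

# "Remote" forests, grown independently with different seeds.
f1 = ExtraTreesRegressor(n_estimators=10, random_state=1).fit(X, y)
f2 = ExtraTreesRegressor(n_estimators=10, random_state=2).fit(X, y)

# Merge into one big forest on the driver node.
big = f1
big.estimators_ = f1.estimators_ + f2.estimators_
big.n_estimators = len(big.estimators_)
print(big.n_estimators)  # -> 20
```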
  45. One big notebook: http://j.mp/pyrallel-mslr

  49. Results • NDCG@10: ~0.53 for 200 trees on MSLR-WEB10K •

    Could be improved by: • increasing the number of trees (but the model gets too big in memory and slower to predict) • replacing base trees by bagged GBRT models • LambdaMART list-wise ranking models (under development for inclusion in sklearn) • Linear regression model baseline: NDCG@10: ~0.45
  51. http://blog.kaggle.com/2014/02/06/winning-personalized-web-search-team-dataiku/

  52. Thank you! • http://ipython.org • http://scikit-learn.org • http://star.mit.edu/cluster • https://github.com/pydata/pyrallel

    • https://speakerdeck.com/ogrisel @ogrisel
  53. Backup slides

  55. Loading the data with Scikit-learn

  56. NDCG in Python

  60. scikit-learn users • We support users on Stack Overflow &

    the ML mailing list • 650+ questions tagged with [scikit-learn] • Many competitors + benchmarks • 500+ answers on the 0.13 release user survey • 60% academics / 40% from industry • Some data-driven startups use sklearn
  65. scikit-learn users • We support users on Stack Overflow &

    the ML mailing list • 820+ questions tagged with [scikit-learn] • Many competitors + benchmarks • 500+ answers on the 0.13 release user survey • 60% academics / 40% from industry • Some data-driven startups use sklearn