
Parametric Machine Learning & Record Linkage

HaFl
October 08, 2013


The presentation will first provide a basic introduction to the field of Machine Learning and its core concepts and procedures. Thereafter, the natural evolution of parametric algorithms will be shown, from linear regression to logistic regression and, if time allows, linear SVM. Finally, a real-life use case from TrustYou dealing with Record Linkage is presented.


Transcript

  1. Parametric Machine Learning & Record Linkage

  2. Machine Learning

  3. Many use cases

  4. Definition of Machine Learning: "Field of study that gives
    computers the ability to learn without being explicitly programmed"
    (Arthur Samuel, 1959). Rules vs. hidden patterns. Questions: What is
    "intelligence"? What is "artificial intelligence"?
  5. The Process: training. Data Sources → Feature Engineering →
    Learning Algorithm → Model (Parameters)
  6. The Process: ... and using. Data Sources → Feature Engineering →
    Model → Prediction
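
As a concrete illustration of the two phases, a minimal Python sketch; the feature function and all numbers are invented placeholders, not from the talk:

    import numpy as np

    def features(raw):
        # hypothetical feature engineering: bias, raw value, raw value squared
        return np.array([1.0, raw, raw ** 2])

    # training: data sources -> features -> learning algorithm -> model parameters
    X = np.array([features(x) for x in [0.0, 1.0, 2.0, 3.0]])
    t = np.array([0.1, 0.9, 4.2, 8.8])
    w, *_ = np.linalg.lstsq(X, t, rcond=None)  # least squares as the "learning algorithm"

    # using: the SAME feature engineering, then the model makes a prediction
    print(features(2.5) @ w)

Note that the feature engineering step is shared by both phases, which is exactly why it appears in both process diagrams.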
  7. Parametric Machine Learning

  8. Parametric vs. non-parametric. Parametric: inferring a finite set
    of parameters; the parameters define the distribution of the data,
    e.g. f(x) = w_0 + w_1 x_1. Examples: Linear Regression, Logistic
    Regression, Linear SVM. Non-parametric: training examples are
    explicitly used as parameters; the complexity of the functions is
    allowed to grow with more training data. Examples: Random Forest,
    non-linear SVM (kernel), Artificial Neural Networks.
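
A small sketch of the distinction, using 1-nearest-neighbour as the non-parametric stand-in (my choice for brevity; it is not one of the slide's examples), with invented data:

    import numpy as np

    X = np.array([0.0, 1.0, 2.0, 3.0])
    t = np.array([0.0, 1.1, 1.9, 3.2])

    # parametric: a fixed, finite set of parameters (w0, w1), however large N gets
    w1, w0 = np.polyfit(X, t, 1)
    parametric = lambda x: w0 + w1 * x

    # non-parametric (1-nearest neighbour): the training examples themselves
    # act as the "parameters", so the model grows with the training data
    nonparametric = lambda x: t[np.argmin(np.abs(X - x))]

    print(parametric(1.5), nonparametric(1.5))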
  9. Hypothesis of Linear Regression:
    h_w(x) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M
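
The hypothesis transcribed directly into Python (the example weights are made up):

    def h(w, x):
        # h_w(x) = w_0 + w_1*x + w_2*x^2 + ... + w_M*x^M
        return sum(w_j * x ** j for j, w_j in enumerate(w))

    print(h([1.0, 2.0, 0.5], 3.0))  # 1 + 2*3 + 0.5*3^2 = 11.5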
  10. Cost function (least squares):
    J(w) = 1/(2N) * Σ_{n=1}^{N} (h_w(x_n) - t_n)^2
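
The same cost as a sketch, assuming weights stored lowest degree first (the data points are invented):

    import numpy as np

    def J(w, x, t):
        # J(w) = 1/(2N) * sum_{n=1..N} (h_w(x_n) - t_n)^2
        h = np.polyval(w[::-1], x)  # np.polyval wants highest degree first
        return np.mean((h - t) ** 2) / 2

    x = np.array([0.0, 1.0, 2.0])
    t = np.array([1.0, 3.0, 5.0])
    print(J(np.array([1.0, 2.0]), x, t))  # perfect fit h(x) = 1 + 2x -> 0.0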
  11. Bias / variance / overfitting. [Plots: polynomial hypotheses
    h_w(x) = w_0 + w_1 x + ... + w_M x^M fitted by minimizing
    J(w) = 1/(2N) * Σ_{n=1}^{N} (h_w(x_n) - t_n)^2 to the same N = 10
    points, for degrees M = 0, 1, 3, 9]
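
A rough numerical re-run of that experiment; the sin(2πx) target and the noise level are my assumptions, chosen to mimic Bishop's figure:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 10)                    # N = 10
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

    for M in (0, 1, 3, 9):                           # the degrees from the slide
        w = np.polyfit(x, t, M)
        err = np.mean((np.polyval(w, x) - t) ** 2) / 2
        print(M, round(err, 5))  # M = 9 drives training error to ~0 by fitting the noise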
  12. One solution is more data. [Plots: the same high-degree fit with
    N = 15 and N = 100 training points]
  13. ... another one is regularization:
    J(w) = 1/(2N) * Σ_{n=1}^{N} (h_w(x_n) - t_n)^2
    becomes
    J(w) = 1/(2N) * Σ_{n=1}^{N} (h_w(x_n) - t_n)^2 + λ * Σ w^2
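
The regularized cost as a sketch (λ and the sample data are invented):

    import numpy as np

    def J_reg(w, x, t, lam):
        # the least-squares cost from before, plus the penalty lambda * sum(w^2),
        # which keeps the weights small and so damps the high-degree wiggles
        h = np.polyval(w[::-1], x)
        return np.mean((h - t) ** 2) / 2 + lam * np.sum(np.asarray(w) ** 2)

    x = np.array([0.0, 0.5, 1.0])
    t = np.array([0.0, 1.0, 0.0])
    print(J_reg(np.array([0.0, 4.0, -4.0]), x, t, lam=0.1))  # 0 data cost + 3.2 penalty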
  14. [Plot: binary "Spam?" labels, (No) = 0 and (Yes) = 1, against a
    feature running from 0 to 14] but...
  15. Logistic Regression. [Same plot: binary "Spam?" labels, (No) = 0
    and (Yes) = 1, against a feature from 0 to 14, now with a logistic
    curve]
  16. The "logistic" part. Goal: 0 ≤ h_w(x) ≤ 1. Linear:
    h_w(x) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = w^T x.
    Logistic function: f(x) = 1 / (1 + e^(-x)), giving
    h_w(x) = 1 / (1 + e^(-w^T x)).
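
A minimal transcription of the sigmoid and the resulting hypothesis (the example numbers are made up):

    import numpy as np

    def sigmoid(z):
        # f(z) = 1 / (1 + e^(-z)): squashes any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def h(w, x):
        # h_w(x) = sigmoid(w^T x), so the goal 0 <= h_w(x) <= 1 holds by construction
        return sigmoid(np.dot(w, x))

    print(sigmoid(0.0), h([1.0, -2.0], [1.0, 3.0]))  # 0.5, sigmoid(-5) ~ 0.0067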
  17. Cost function of logistic regression:
    Cost(h_w(x), y) = -log(h_w(x))      if y = 1
                      -log(1 - h_w(x))  if y = 0
    [Plots: both cost curves over h_w(x) from 0 to 1]
  18. Putting it all together.
    Hypothesis: h_w(x) = 1 / (1 + e^(-w^T x))
    Cost: Cost(h_w(x), y) = -log(h_w(x)) if y = 1, -log(1 - h_w(x)) if y = 0
    Optimization objective:
    J(w) = -1/N * Σ_{i=1}^{N} [ y^(i) log(h_w(x^(i))) + (1 - y^(i)) log(1 - h_w(x^(i))) ]
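
The objective transcribed into vectorized Python (the two data points are invented):

    import numpy as np

    def J(w, X, y):
        # J(w) = -1/N * sum_i [ y_i*log(h_w(x_i)) + (1 - y_i)*log(1 - h_w(x_i)) ]
        h = 1.0 / (1.0 + np.exp(-(X @ w)))
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

    X = np.array([[1.0, -2.0], [1.0, 3.0]])  # first column is the bias feature
    y = np.array([0.0, 1.0])
    print(J(np.array([0.0, 1.0]), X, y))     # ~0.088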
  19. On the road to linear SVM. With z = w^T x, the two log costs
    -log(h_w(x)) = -log(1 / (1 + e^(-z))) and
    -log(1 - h_w(x)) = -log(1 - 1 / (1 + e^(-z)))
    are approximated by cost_1(z) and cost_0(z), giving the optimization
    objective
    J(w) = C * Σ_{i=1}^{N} [ y^(i) cost_1(w^T x^(i)) + (1 - y^(i)) cost_0(w^T x^(i)) ] + 1/2 * Σ w^2
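
The slide does not define cost_1 and cost_0, so the hinge surrogates with margins at ±1 below are an assumption (they are the standard choice for linear SVM); the data is invented:

    import numpy as np

    def cost1(z):
        # surrogate for -log(h_w(x)): zero once z = w^T x >= 1
        return np.maximum(0.0, 1.0 - z)

    def cost0(z):
        # surrogate for -log(1 - h_w(x)): zero once z <= -1
        return np.maximum(0.0, 1.0 + z)

    def J(w, X, y, C=1.0):
        z = X @ w
        return C * np.sum(y * cost1(z) + (1 - y) * cost0(z)) + 0.5 * np.sum(w ** 2)

    X = np.array([[1.0, 2.0], [1.0, -2.0]])
    y = np.array([1.0, 0.0])
    print(J(np.array([0.0, 1.0]), X, y))  # both points beyond the margin -> only 0.5 penalty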
  20. Record Linkage

  21. The Problem. Match two hotel databases of 500k records each, e.g.
    Name: Bellagio Las Vegas, Address: 3600 S Las Vegas Blvd, Las Vegas,
    NV 89109 vs. Name: Bellagio Resort, Address: 3600 South Las Vegas
    Boulevard, Las Vegas, NV 89109. Two challenges: 1. matching,
    2. performance.
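
A toy version of the matching challenge, with a hypothetical normalize step; the abbreviation table is illustrative, not TrustYou's actual rule set:

    # the two Bellagio addresses from the slide only agree after normalization
    ABBREV = {"s": "south", "blvd": "boulevard"}

    def normalize(address):
        tokens = address.lower().replace(",", "").split()
        return " ".join(ABBREV.get(tok, tok) for tok in tokens)

    a = "3600 S Las Vegas Blvd, Las Vegas, NV 89109"
    b = "3600 South Las Vegas Boulevard, Las Vegas, NV 89109"
    print(normalize(a) == normalize(b))  # True: the records now agree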
  22. Alright, let's do it. To beat: Precision = 90%, Recall = 60%.
    Approach: Preprocessing → Feature Engineering → Logistic Regression.
    First result: Accuracy = 99% !!!
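
Why 99% accuracy can coexist with the precision/recall figures to beat: candidate pairs are overwhelmingly non-matches, so accuracy rewards a model that rejects almost everything. A sketch with invented confusion counts, chosen only to reproduce the slide's numbers:

    # 1,000,000 candidate pairs, of which only 10,000 are true matches
    tp, fn = 6_000, 4_000     # true matches found / missed
    fp, tn = 600, 989_400     # false alarms / correctly rejected non-matches

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(accuracy, precision, recall)  # ~0.995, ~0.91, 0.60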
  23. That's awesome

  24. [Image-only slide]
  25. Why it didn't work. Candidate causes: wrong parameter settings,
    inferior algorithm, bad train/test data selection, bad data.
    [Diagram: Data Sources → Feature Engineering → Learning Algorithm →
    Model]
  26. How we made it work, on both matching and performance:
    Normalization, Data enrichment, better DB data quality and Random
    Forest against the bad data and the inferior algorithm; Hadoop +
    Pig and Partitioning (Quadkey) for performance.
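
A sketch of the quadkey idea behind the partitioning, following the standard Bing Maps tile scheme; the implementation details are my assumptions, not the talk's code:

    import math

    def quadkey(lat, lon, level=12):
        # each digit selects a quadrant of the current tile, so nearby
        # hotels share a long common key prefix
        x = (lon + 180.0) / 360.0
        s = math.sin(math.radians(lat))
        y = 0.5 - math.log((1.0 + s) / (1.0 - s)) / (4.0 * math.pi)
        n = 1 << level
        tx = min(n - 1, max(0, int(x * n)))
        ty = min(n - 1, max(0, int(y * n)))
        digits = []
        for i in range(level, 0, -1):
            mask = 1 << (i - 1)
            d = 0
            if tx & mask:
                d += 1
            if ty & mask:
                d += 2
            digits.append(str(d))
        return "".join(digits)

    # candidate pairs are only generated inside a partition (same key),
    # which keeps the 500k x 500k comparison tractable
    print(quadkey(36.1126, -115.1767))  # the Bellagio's tile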
  27. Precision > 97%, Recall > 80%

  28. 3 Things to remember

  29. 1: Step-by-step leads to clarity

  30. 2: The devil is in the details

  31. 3: Keep going until you smile

  32. Attribution. Images: (CC) from Flickr; © "Pattern Recognition and
    Machine Learning" by Christopher M. Bishop (2006). Colors: from
    COLOURlovers, "Giant Goldfish" palette by manekineko.
    Images in detail*:
    Slide 1/2/7/17: by xdxd_vs_xdxd; concept & realisation: Oriana
    Persico & Salvatore Iaconesi, Lara Mezzapelle & Giacomo Deriu;
    curated by Marco Aion Mangani & Alice Zannoni; produced by Miria
    Baccolini, BT'F Gallery.
    Slide 3: "Backgammon" by Caroline Jones (CJ Woodworking &),
    photographed by Kate Fisher (Fishbone1); "Siri" by Sean MacEntee;
    "Spam" by ario_; "IBM Watson" by John Tolva (jntolva); "Google
    driverless car" by Ben (loudtiger); "TrustYou" © TrustYou GmbH.
    Slide 5/6/18/22: "Data Sources" by Tim Morgan.
    Slide 5/6/22: "Feature Engineering" by Duke University Archives
    (Duke Yearlook); "Learning Algorithm" by Tom Keene (anthillsocial);
    "Model" by Steve Brokaw (SEBImages), model: Alexis Farley, MUA:
    Mandi Lucas, hair: Jennifer Odom.
    Slide 6: "Prediction" by Gemma Stiles.
    Slide 9/10: by Christopher M. Bishop, Pattern Recognition and
    Machine Learning, pages 4 & 6, Springer, 2006.
    Slide 20/24: by Bill Gracey.
    Slide 21: by GREG ANDREWS (gregthemayor).
    Slide 26: by Gal (baboonTM).
    Slide 27: by harusday.
    Slide 28: by Viewminder.
    * The quoted names for each image are not the actual titles but help
    to identify the appropriate image when there are multiple images on
    one slide.