HaFl
October 08, 2013

# Parametric Machine Learning & Record Linkage

The presentation will first provide a basic introduction to the field of Machine Learning and its core concepts and procedures. Thereafter, the natural evolution of parametric algorithms will be shown – from linear regression to logistic regression and, if time allows, linear SVM. Finally, a real-life use case from TrustYou dealing with Record Linkage is presented.


## Transcript

4. ### Questions

   Definition of Machine Learning: „Field of study that gives computers the ability to learn without being explicitly programmed“ (Arthur Samuel, 1959). Rules vs. hidden patterns. What is „intelligence“? What is „artificial intelligence“?
5. ### Training ... The Process

   Data Sources → Feature Engineering → Learning Algorithm → Model (Parameters)
6. ### ... and Using: The Process

   Data Sources → Feature Engineering → Model → Prediction

8. ### Examples: Parametric vs. Non-parametric

   Parametric: inferring a finite set of parameters; the parameters define the distribution of the data. Examples: Linear Regression, Logistic Regression, Linear SVM – e.g. $f(x) = w_0 + w_1 x_1$.

   Non-parametric: training examples are explicitly used as parameters; the complexity of the function is allowed to grow with more training data. Examples: Random Forest, non-linear SVM (kernel), Artificial Neural Networks.
9. ### Hypothesis of Linear Regression

   $$h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$$
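As a minimal sketch (not from the slides), the polynomial hypothesis can be evaluated directly in plain Python:

```python
def h(w, x):
    """Polynomial hypothesis h_w(x) = w_0 + w_1*x + ... + w_M*x^M."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))

# w_0 = 1, w_1 = 2, w_2 = 0.5  (degree M = 2)
print(h([1.0, 2.0, 0.5], 3.0))  # 1 + 2*3 + 0.5*9 = 11.5
```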
10. ### Cost function – least squares

    $$J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2$$
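A hedged sketch of this least-squares cost in plain Python; the helper `h` is the polynomial hypothesis from the previous slide:

```python
def h(w, x):
    # polynomial hypothesis h_w(x) = w_0 + w_1*x + ... + w_M*x^M
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def J(w, xs, ts):
    """Least-squares cost J(w) = 1/(2N) * sum_n (h_w(x_n) - t_n)^2."""
    N = len(xs)
    return sum((h(w, x) - t) ** 2 for x, t in zip(xs, ts)) / (2 * N)

# a model that fits t = 2x exactly has zero cost
print(J([0.0, 2.0], [1, 2, 3], [2, 4, 6]))  # 0.0
```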
11. ### Bias / variance / overfitting

    [Plots: polynomial fits to N = 10 data points for M = 0, 1, 3 and 9, and the M = 9 model fit again with N = 100]

    $$h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$$

    $$J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2$$
13. ### ...another one is regularization

    $$J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2 + \lambda \sum w^2$$
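A sketch of the regularized cost, assuming the L2 penalty is applied to all weights as written on the slide (in practice the bias term $w_0$ is often excluded):

```python
def h(w, x):
    # polynomial hypothesis h_w(x) = w_0 + w_1*x + ... + w_M*x^M
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def J_reg(w, xs, ts, lam):
    """Least-squares cost plus an L2 penalty lambda * sum(w_j^2)."""
    N = len(xs)
    data_term = sum((h(w, x) - t) ** 2 for x, t in zip(xs, ts)) / (2 * N)
    penalty = lam * sum(w_j ** 2 for w_j in w)
    return data_term + penalty

# perfect fit, so only the penalty remains: 0.1 * (0^2 + 2^2) = 0.4
print(J_reg([0.0, 2.0], [1, 2, 3], [2, 4, 6], 0.1))
```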
14. ### Spam? (No / Yes)

    [Plot: binary spam labels against a feature value, fitted with a straight line] but...
15. ### Logistic Regression

    [Plot: the same binary Spam? (No / Yes) data, fitted with a logistic curve]
16. ### The „Logistic“ part

    Goal: $0 \le h_w(x) \le 1$

    Linear: $h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = w^T x$

    Logistic: $f(x) = \frac{1}{1 + e^{-x}}$, giving $h_w(x) = \frac{1}{1 + e^{-w^T x}}$
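A minimal sketch of the logistic hypothesis; the sigmoid squashes any real-valued score $w^T x$ into $(0, 1)$:

```python
import math

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)); output always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(w, x):
    """Logistic hypothesis h_w(x) = sigmoid(w^T x)."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)))

print(sigmoid(0.0))               # 0.5 -- the decision boundary
print(h([1.0, -1.0], [2.0, 2.0])) # sigmoid(0) = 0.5
```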
17. ### Cost function of logistic regression

    $$\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log(h_w(x)) & \text{if } y = 1 \\ -\log(1 - h_w(x)) & \text{if } y = 0 \end{cases}$$

    [Plots: $-\log(h_w(x))$ and $-\log(1 - h_w(x))$ as functions of $h_w(x)$ on $[0, 1]$]
18. ### Putting it all together

    Hypothesis: $h_w(x) = \frac{1}{1 + e^{-w^T x}}$

    Cost function: $\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log(h_w(x)) & \text{if } y = 1 \\ -\log(1 - h_w(x)) & \text{if } y = 0 \end{cases}$

    Optimization objective:

    $$J(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y^{(i)} \log h_w(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_w(x^{(i)})\right) \right]$$
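The full cross-entropy objective can be sketched in a few lines of plain Python (an illustration, not the talk's actual code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, X, y):
    """J(w) = -1/N * sum_i [ y_i*log(h_i) + (1-y_i)*log(1-h_i) ],
    with h_i = sigmoid(w^T x_i)."""
    N = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h_i = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        total += y_i * math.log(h_i) + (1 - y_i) * math.log(1 - h_i)
    return -total / N

# with w = 0 every prediction is 0.5, so the cost is log(2) ~ 0.693
print(cross_entropy([0.0, 0.0], [[1.0, 2.0], [1.0, 3.0]], [1, 0]))
```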
19. ### Optimization objective – on the road to linear SVM

    With $z = w^T x$:

    $$-\log(h_w(x)) = -\log\frac{1}{1 + e^{-z}}, \qquad -\log(1 - h_w(x)) = -\log\left(1 - \frac{1}{1 + e^{-z}}\right)$$

    Replacing these log terms with piecewise-linear approximations $\mathrm{cost}_1$ and $\mathrm{cost}_0$ yields the SVM objective:

    $$J(w) = C \sum_{i=1}^{N} \left[ y^{(i)} \mathrm{cost}_1(w^T x^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(w^T x^{(i)}) \right] + \frac{1}{2} \sum w^2$$
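The slide does not define $\mathrm{cost}_1$ and $\mathrm{cost}_0$; a common choice (as in Andrew Ng's course formulation, an assumption here) is hinge-style piecewise-linear functions:

```python
def cost_1(z):
    # piecewise-linear stand-in for -log(sigmoid(z)): zero once z >= 1
    return max(0.0, 1.0 - z)

def cost_0(z):
    # piecewise-linear stand-in for -log(1 - sigmoid(z)): zero once z <= -1
    return max(0.0, 1.0 + z)

def svm_objective(w, X, y, C):
    """J(w) = C * sum_i [ y_i*cost_1(w^T x_i) + (1-y_i)*cost_0(w^T x_i) ]
    + 1/2 * sum(w_j^2)."""
    hinge = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        hinge += y_i * cost_1(z) + (1 - y_i) * cost_0(z)
    return C * hinge + 0.5 * sum(w_j ** 2 for w_j in w)

# a positive example beyond the margin incurs no hinge cost,
# leaving only the regularizer 0.5 * 1^2 = 0.5
print(svm_objective([1.0, 0.0], [[2.0, 0.0]], [1], 1.0))
```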

21. ### The Problem

    Match hotels between two databases of ~500k records each, e.g. the Bellagio Resort:

    Name: Bellagio Las Vegas
    Address: 3600 S Las Vegas Blvd, Las Vegas, NV 89109
    vs.
    Address: 3600 South Las Vegas Boulevard, Las Vegas, NV 89109

    Two challenges: 1. matching, 2. performance.
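How such address pairs can be turned into a classifier feature is not detailed in the slides; a hedged sketch using normalization plus a stdlib string-similarity score (the abbreviation map is an illustrative assumption, not TrustYou's actual rules):

```python
import difflib

# illustrative abbreviation map -- an assumption for this sketch
ABBREV = {"s": "south", "n": "north", "blvd": "boulevard", "ave": "avenue"}

def normalize(address):
    tokens = address.lower().replace(",", "").split()
    return " ".join(ABBREV.get(t, t) for t in tokens)

def similarity(a, b):
    """String-similarity feature in [0, 1] for the matching classifier."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

a = "3600 S Las Vegas Blvd, Las Vegas, NV 89109"
b = "3600 South Las Vegas Boulevard, Las Vegas, NV 89109"
print(similarity(a, b))  # 1.0 -- identical after normalization
```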
22. ### Alright, let's do it

    To beat: Precision = 90%, Recall = 60%.
    Approach: Preprocessing → Feature Engineering → Logistic Regression.
    First result: Accuracy = 99% !!!
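Why 99% accuracy is suspicious here: true matches are a tiny fraction of all candidate pairs, so a model that never predicts a match already scores ~99%. A small illustration with made-up numbers:

```python
# 1000 candidate pairs, only 10 true matches (imbalance assumed for illustration)
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000           # a useless model: always predict "no match"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(accuracy)  # 0.99 -- looks impressive
print(recall)    # 0.0  -- but not a single match found
```

This is why the baseline is stated in precision and recall rather than accuracy.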

25. ### Why it didn't work

    Data Sources → Feature Engineering → Learning Algorithm → Model – possible failure points: wrong parameter settings, inferior algorithm, bad train/test data selection, bad data.
26. ### How we made it work

    Addressing each failure point (wrong parameter settings, inferior algorithm, bad train/test data selection, bad data) for both matching and performance: Hadoop + Pig, normalization, data enrichment, better DB data quality, Random Forest, partitioning (Quadkey).
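The quadkey partitioning mentioned above can be sketched as follows: using the Bing-Maps-style tile scheme, nearby coordinates share a tile key, so candidate pairs only need to be generated within a partition instead of across all 500k × 500k combinations (the coordinates below are illustrative):

```python
import math

def quadkey(lat, lon, level):
    """Bing-Maps-style quadkey for a lat/lon at a given zoom level."""
    # Web-Mercator projection to [0, 1) x [0, 1)
    sin_lat = math.sin(math.radians(lat))
    x = (lon + 180.0) / 360.0
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    n = 1 << level
    tx = min(n - 1, max(0, int(x * n)))   # tile coordinates
    ty = min(n - 1, max(0, int(y * n)))
    # interleave the tile-coordinate bits into base-4 digits
    key = ""
    for i in range(level, 0, -1):
        digit = 0
        mask = 1 << (i - 1)
        if tx & mask:
            digit += 1
        if ty & mask:
            digit += 2
        key += str(digit)
    return key

# two near-identical hotel coordinates land in the same partition
print(quadkey(36.1126, -115.1767, 12) == quadkey(36.1127, -115.1765, 12))
```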