# Parametric Machine Learning & Record Linkage

The presentation first provides a basic introduction to the field of Machine Learning and its core concepts and procedures. Thereafter, the natural evolution of parametric algorithms is shown, from linear regression to logistic regression and, if time allows, linear SVM. Finally, a real-life use case from TrustYou dealing with Record Linkage is presented.

October 08, 2013

## Transcript

4. ### Questions: Definition of Machine Learning

"Field of study that gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959). Hand-written rules vs. hidden patterns. What is "intelligence"? What is "artificial intelligence"?
5. ### Training: The Process

Data Sources → Feature Engineering → Learning Algorithm → Model (Parameters)
6. ### ... and Using: The Process

Data Sources → Feature Engineering → Model → Prediction

8. ### Examples: Parametric vs. Non-Parametric

Parametric: inferring a finite set of parameters; the parameters define the distribution of the data, e.g. $f(x) = w_0 + w_1 x_1$. Examples: Linear Regression, Logistic Regression, Linear SVM.

Non-parametric: the training examples are explicitly used as parameters; the complexity of the function is allowed to grow with more training data. Examples: Random Forest, non-linear SVM (kernel), Artificial Neural Networks.
9. ### Hypothesis of Linear Regression

$h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$
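
The polynomial hypothesis can be sketched in a few lines of NumPy (a minimal illustration; the function names are ours, not the talk's):

```python
import numpy as np

def h(w, x):
    """Polynomial hypothesis h_w(x) = w_0 + w_1*x + ... + w_M*x^M.

    w is the parameter vector of length M+1, x a scalar or array.
    """
    # np.polyval expects coefficients ordered from highest to lowest degree
    return np.polyval(w[::-1], x)

w = np.array([1.0, 2.0, 3.0])   # h_w(x) = 1 + 2x + 3x^2
print(h(w, 2.0))                # 1 + 4 + 12 = 17.0
```
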
10. ### Cost Function: Least Squares

$J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2$
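
The least-squares cost translates directly to code. A minimal sketch (variable names illustrative):

```python
import numpy as np

def J(w, x, t):
    """Least-squares cost J(w) = 1/(2N) * sum_n (h_w(x_n) - t_n)^2."""
    residuals = np.polyval(w[::-1], x) - t
    return residuals @ residuals / (2 * len(x))

x = np.array([0.0, 1.0, 2.0])
t = np.array([0.0, 1.0, 2.0])
print(J(np.array([0.0, 1.0]), x, t))   # perfect fit h_w(x) = x: cost 0.0
print(J(np.array([1.0, 1.0]), x, t))   # every residual is 1: 3/(2*3) = 0.5
```
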
11. ### Bias / Variance / Overfitting

[Plots: polynomial fits of the hypothesis $h_w(x) = w_0 + w_1 x + \dots + w_M x^M$ under the least-squares cost $J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2$, with $N = 10$ training points and degrees $M = 0, 1, 3, 9$, and the effect of increasing $N$ to $100$.]
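
The effect in those plots can be reproduced with a toy experiment (our own sketch, with synthetic data in the style of Bishop's sinusoid example): training error can only shrink as the degree $M$ grows, because each polynomial space contains the smaller ones, but large $M$ fits the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=N)   # noisy targets

# M = 9 interpolates all 10 points exactly: zero training error,
# but a wildly oscillating curve between them (overfitting).
train_err = {}
for M in (0, 1, 3, 9):
    w = np.polyfit(x, t, deg=M)                      # least-squares fit
    train_err[M] = np.mean((np.polyval(w, x) - t) ** 2) / 2
    print(M, train_err[M])
```
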
13. ### ... Another One Is Regularization

Unregularized: $J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2$

Regularized: $J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2 + \lambda \sum_j w_j^2$
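
Adding the penalty term is a one-line change to the cost. A minimal sketch (whether the bias $w_0$ is included in the penalty varies by convention; here every coefficient is penalized):

```python
import numpy as np

def J_reg(w, x, t, lam):
    """Regularized least squares: data term plus lam * sum(w_j^2)."""
    r = np.polyval(w[::-1], x) - t
    return r @ r / (2 * len(x)) + lam * np.sum(w ** 2)

x = np.array([0.0, 1.0, 2.0])
t = np.array([0.0, 1.0, 2.0])
w = np.array([0.0, 1.0])          # fits the data exactly
print(J_reg(w, x, t, 0.0))        # no penalty: 0.0
print(J_reg(w, x, t, 0.5))        # 0.5 * (0^2 + 1^2) = 0.5
```

Large weights are what let a high-degree polynomial oscillate, so penalizing them trades a little training error for a smoother fit.
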
14. ### Classification: Spam?

[Plot: binary "Spam?" labels, No = 0 and Yes = 1, against a feature ranging from 0 to 14, fitted with a straight line. But...]
15. ### Logistic Regression

[Plot: the same "Spam?" data, No = 0 and Yes = 1, over the feature range 0 to 14, now fitted with a logistic curve.]
16. ### The "Logistic" Part

Goal: $0 \le h_w(x) \le 1$

Logistic function: $f(x) = \frac{1}{1 + e^{-x}}$

Linear: $h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = w^T x$

Logistic: $h_w(x) = \frac{1}{1 + e^{-w^T x}}$
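
A minimal sketch of the squashed hypothesis (function names are ours):

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^{-z}); maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(w, x):
    """Logistic-regression hypothesis h_w(x) = sigmoid(w^T x)."""
    return sigmoid(w @ x)

print(sigmoid(0.0))    # 0.5: w^T x = 0 is the decision boundary
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```
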
17. ### Cost Function of Logistic Regression

$\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log\left( h_w(x) \right) & \text{if } y = 1 \\ -\log\left( 1 - h_w(x) \right) & \text{if } y = 0 \end{cases}$

[Plots: $-\log\left( h_w(x) \right)$ and $-\log\left( 1 - h_w(x) \right)$ as functions of $h_w(x)$ over $[0, 1]$.]
18. ### Putting It All Together

Hypothesis: $h_w(x) = \frac{1}{1 + e^{-w^T x}}$

Cost function: $\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log\left( h_w(x) \right) & \text{if } y = 1 \\ -\log\left( 1 - h_w(x) \right) & \text{if } y = 0 \end{cases}$

Optimization objective: $J(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y^{(i)} \log\left( h_w(x^{(i)}) \right) + (1 - y^{(i)}) \log\left( 1 - h_w(x^{(i)}) \right) \right]$
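
The full cross-entropy objective in NumPy, as a minimal sketch (a production version would clip $h$ away from 0 and 1 for numerical safety):

```python
import numpy as np

def J(w, X, y):
    """Cross-entropy cost for logistic regression:

    J(w) = -1/N * sum_i [ y^(i) log h_w(x^(i))
                          + (1 - y^(i)) log(1 - h_w(x^(i))) ]
    X is the (N, d) design matrix, y the vector of 0/1 labels.
    """
    h = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # first column: bias
y = np.array([0.0, 1.0, 1.0])
print(J(np.zeros(2), X, y))   # w = 0 gives h = 0.5 everywhere: log(2)
```
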
19. ### Optimization Objective: On the Road to Linear SVM

With $z = w^T x$, the two log terms become

$-\log\left( h_w(x) \right) = -\log\left( \frac{1}{1 + e^{-z}} \right)$ and $-\log\left( 1 - h_w(x) \right) = -\log\left( 1 - \frac{1}{1 + e^{-z}} \right)$.

Replacing them with piecewise-linear approximations $\mathrm{cost}_1$ and $\mathrm{cost}_0$ gives the linear-SVM objective:

$J(w) = C \sum_{i=1}^{N} \left[ y^{(i)} \, \mathrm{cost}_1\!\left( w^T x^{(i)} \right) + (1 - y^{(i)}) \, \mathrm{cost}_0\!\left( w^T x^{(i)} \right) \right] + \frac{1}{2} \sum_j w_j^2$
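
One common choice for those piecewise-linear costs is the hinge loss, which is zero once the example is classified with margin. A sketch under that assumption (the slide does not spell out the exact shape of $\mathrm{cost}_1$/$\mathrm{cost}_0$):

```python
import numpy as np

def cost_1(z):
    """Piecewise-linear stand-in for -log(sigmoid(z)); zero for z >= 1."""
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    """Piecewise-linear stand-in for -log(1 - sigmoid(z)); zero for z <= -1."""
    return np.maximum(0.0, 1.0 + z)

def J(w, X, y, C):
    """Linear-SVM objective: C * data term + (1/2) * sum(w_j^2)."""
    z = X @ w
    return C * np.sum(y * cost_1(z) + (1 - y) * cost_0(z)) + 0.5 * np.sum(w ** 2)

print(cost_1(2.0), cost_0(-2.0))   # both 0: confident correct predictions cost nothing
```
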

21. ### The Problem

Match hotels across two databases of 500k records each, e.g. the Bellagio Resort:

Name: Bellagio Las Vegas, Address: 3600 S Las Vegas Blvd, Las Vegas, NV 89109
vs.
Address: 3600 South Las Vegas Boulevard, Las Vegas, NV 89109

Two challenges: (1) matching quality, (2) performance.
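
The two Bellagio addresses differ only in abbreviation, which is exactly what normalization in a record-linkage pipeline smooths over. A toy sketch (the abbreviation table and function are illustrative, not TrustYou's actual preprocessing):

```python
# Toy address normalization: lowercase, drop commas, expand abbreviations.
ABBREV = {"s": "south", "n": "north", "e": "east", "w": "west",
          "blvd": "boulevard", "st": "street", "ave": "avenue"}

def normalize(address: str) -> str:
    tokens = address.lower().replace(",", " ").split()
    return " ".join(ABBREV.get(tok, tok) for tok in tokens)

a = normalize("3600 S Las Vegas Blvd, Las Vegas, NV 89109")
b = normalize("3600 South Las Vegas Boulevard, Las Vegas, NV 89109")
print(a == b)   # the two spellings now compare equal
```
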
22. ### Alright, Let's Do It

To beat: Precision = 90%, Recall = 60%. Approach: Preprocessing → Feature Engineering → Logistic Regression. First result: Accuracy = 99% !!!
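
That 99% is less impressive than it looks: among candidate hotel pairs, true matches are a tiny minority, so a classifier that never predicts "match" scores high accuracy with zero recall. An illustrative calculation (the 1:99 class ratio is made up for the example):

```python
# Illustrative imbalance: 99 non-matching pairs for every matching pair.
n_match, n_nonmatch = 1_000, 99_000

# A degenerate classifier that always answers "no match":
tp, fn = 0, n_match          # it finds none of the true matches
tn, fp = n_nonmatch, 0       # but is right on every non-match

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
print(accuracy, recall)      # 0.99 accuracy, 0.0 recall
```

This is why the slide frames the goal in precision and recall rather than accuracy.
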

25. ### Why It Didn't Work

[Diagram: Data Sources → Feature Engineering → Learning Algorithm → Model]

- Wrong parameter settings
- Inferior algorithm
- Bad train/test data selection
- Bad data
26. ### How We Made It Work

For both matching quality and performance, the failure modes above (wrong parameter settings, inferior algorithm, bad train/test data selection, bad data) were addressed with: Hadoop, Pig, Normalization, Data enrichment, Better DB data quality, Random Forest, Partitioning (Quadkey).
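
Quadkey partitioning refers to the Bing-Maps-style tile keys: nearby points share quadkey prefixes, so grouping hotels by prefix avoids comparing all 500k × 500k pairs. A simplified sketch of the standard quadkey computation (our own illustration, not TrustYou's actual partitioning code):

```python
import math

def quadkey(lat: float, lon: float, level: int) -> str:
    """Bing-Maps-style quadkey for a lat/lon at a given zoom level."""
    # Web-Mercator projection to integer tile coordinates
    n = 2 ** level
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_rad) + 1 / math.cos(lat_rad)) / math.pi) / 2.0 * n)
    # Interleave the bits of x and y into base-4 digits
    key = ""
    for i in range(level, 0, -1):
        digit = 0
        mask = 1 << (i - 1)
        if x & mask:
            digit += 1
        if y & mask:
            digit += 2
        key += str(digit)
    return key

# A quadkey at a coarser level is a prefix of the same point's finer key,
# so a prefix of fixed length defines a spatial partition.
print(quadkey(36.1126, -115.1767, 16))   # 16-digit key on the Las Vegas Strip
```
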

32. ### Attribution

Images: (CC) from Flickr; © "Pattern Recognition and Machine Learning" by Christopher M. Bishop (2006). Colors: from colourlovers, "Giant Goldfish" palette by manekineko.

Images in detail*:

- Slide 1/2/7/17: by xdxd_vs_xdxd. Concept & realisation: Oriana Persico & Salvatore Iaconesi, Lara Mezzapelle & Giacomo Deriu. Curated by: Marco Aion Mangani & Alice Zannoni. Produced by: Miria Baccolini - BT'F Gallery
- Slide 3: "Backgammon": by Caroline Jones (CJ Woodworking &), photographed by Kate Fisher (Fishbone1); "Siri": by Sean MacEntee; "Spam": by ario_; "IBM Watson": by John Tolva (jntolva); "Google driverless car": by Ben (loudtiger); "TrustYou": © TrustYou GmbH
- Slide 5/6/18/22: "Data Sources": by Tim Morgan
- Slide 5/6/22: "Feature Engineering": by Duke University Archives (Duke Yearlook); "Learning Algorithm": by Tom Keene (anthillsocial); "Model": by Steve Brokaw (SEBImages), model: Alexis Farley, MUA: Mandi Lucas, hair: Jennifer Odom
- Slide 6: "Prediction": by Gemma Stiles
- Slide 9/10: by Christopher M. Bishop: Pattern Recognition and Machine Learning, pages 4 & 6, Springer, 2006
- Slide 20/24: by Bill Gracey
- Slide 21: by GREG ANDREWS (gregthemayor)
- Slide 26: by Gal (baboonTM)
- Slide 27: by harusday
- Slide 28: by Viewminder

\* The quoted names for each image are not the actual titles but help to identify the appropriate image when there are multiple images on one slide.