HaFl
October 08, 2013

# Parametric Machine Learning & Record Linkage

The presentation will first provide a basic introduction to the field of Machine Learning and its core concepts and procedures. Thereafter, the natural evolution of parametric algorithms will be shown – from linear regression to logistic regression and, if time allows, linear SVM. Finally, a real-life use case from TrustYou dealing with Record Linkage is presented.


## Transcript

4. ### Questions

   Definition of Machine Learning: „Field of study that gives computers the ability to learn without being explicitly programmed“ (Arthur Samuel, 1959). Rules vs. hidden patterns. What is „intelligence“? What is „artificial intelligence“?
5. ### Training ... The Process

   Data Sources → Feature Engineering → Learning Algorithm → Model (Parameters)
6. ### ... and Using: The Process

   Data Sources → Feature Engineering → Model → Prediction

8. ### Examples: Parametric vs. Non-parametric

   Parametric: inferring a finite set of parameters; the parameters define the distribution of the data. Examples: Linear Regression, Logistic Regression, Linear SVM – e.g. $f(x) = w_0 + w_1 x_1$.

   Non-parametric: training examples are explicitly used as parameters; the complexity of the function is allowed to grow with more training data. Examples: Random Forest, non-linear SVM (kernel), Artificial Neural Networks.
9. ### Hypothesis of Linear Regression

   $$h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$$
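As a minimal sketch (not from the slides), the polynomial hypothesis can be evaluated directly in plain Python:

```python
def h(w, x):
    """Polynomial hypothesis h_w(x) = w_0 + w_1*x + ... + w_M*x^M."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))

# w_0 = 1, w_1 = 2, w_2 = 0.5  (degree M = 2)
print(h([1.0, 2.0, 0.5], 3.0))  # 1 + 2*3 + 0.5*9 = 11.5
```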
10. ### Cost function – least squares

    $$J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2$$
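A hedged sketch of this least-squares cost in plain Python; the helper `h` is the polynomial hypothesis from the previous slide:

```python
def h(w, x):
    # polynomial hypothesis h_w(x) = w_0 + w_1*x + ... + w_M*x^M
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def J(w, xs, ts):
    """Least-squares cost J(w) = 1/(2N) * sum_n (h_w(x_n) - t_n)^2."""
    N = len(xs)
    return sum((h(w, x) - t) ** 2 for x, t in zip(xs, ts)) / (2 * N)

# a model that fits t = 2x exactly has zero cost
print(J([0.0, 2.0], [1, 2, 3], [2, 4, 6]))  # 0.0
```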
11. ### Bias / variance / overfitting

    [Plots: polynomial fits to N = 10 data points for M = 0, 1, 3 and 9, and the M = 9 model fit again with N = 100]

    $$h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$$

    $$J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2$$
13. ### ...another one is regularization

    $$J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2 + \lambda \sum w^2$$
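A sketch of the regularized cost, assuming the L2 penalty is applied to all weights as written on the slide (in practice the bias term $w_0$ is often excluded):

```python
def h(w, x):
    # polynomial hypothesis h_w(x) = w_0 + w_1*x + ... + w_M*x^M
    return sum(w_j * x ** j for j, w_j in enumerate(w))

def J_reg(w, xs, ts, lam):
    """Least-squares cost plus an L2 penalty lambda * sum(w_j^2)."""
    N = len(xs)
    data_term = sum((h(w, x) - t) ** 2 for x, t in zip(xs, ts)) / (2 * N)
    penalty = lam * sum(w_j ** 2 for w_j in w)
    return data_term + penalty

# perfect fit, so only the penalty remains: 0.1 * (0^2 + 2^2) = 0.4
print(J_reg([0.0, 2.0], [1, 2, 3], [2, 4, 6], 0.1))
```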
14. ### Spam? (No / Yes)

    [Plot: binary spam labels against a feature value, fitted with a straight line] but...
15. ### Logistic Regression

    [Plot: the same binary Spam? (No / Yes) data, fitted with a logistic curve]
16. ### The „Logistic“ part

    Goal: $0 \le h_w(x) \le 1$

    Linear: $h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = w^T x$

    Logistic: $f(x) = \frac{1}{1 + e^{-x}}$, giving $h_w(x) = \frac{1}{1 + e^{-w^T x}}$
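A minimal sketch of the logistic hypothesis; the sigmoid squashes any real-valued score $w^T x$ into $(0, 1)$:

```python
import math

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)); output always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def h(w, x):
    """Logistic hypothesis h_w(x) = sigmoid(w^T x)."""
    return sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)))

print(sigmoid(0.0))               # 0.5 -- the decision boundary
print(h([1.0, -1.0], [2.0, 2.0])) # sigmoid(0) = 0.5
```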
17. ### Cost function of logistic regression

    $$\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log(h_w(x)) & \text{if } y = 1 \\ -\log(1 - h_w(x)) & \text{if } y = 0 \end{cases}$$

    [Plots: $-\log(h_w(x))$ and $-\log(1 - h_w(x))$ as functions of $h_w(x)$ on $[0, 1]$]
18. ### Putting it all together

    Hypothesis: $h_w(x) = \frac{1}{1 + e^{-w^T x}}$

    Cost function: $\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log(h_w(x)) & \text{if } y = 1 \\ -\log(1 - h_w(x)) & \text{if } y = 0 \end{cases}$

    Optimization objective:

    $$J(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y^{(i)} \log h_w(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_w(x^{(i)})\right) \right]$$
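The full cross-entropy objective can be sketched in a few lines of plain Python (an illustration, not the talk's actual code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, X, y):
    """J(w) = -1/N * sum_i [ y_i*log(h_i) + (1-y_i)*log(1-h_i) ],
    with h_i = sigmoid(w^T x_i)."""
    N = len(X)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h_i = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        total += y_i * math.log(h_i) + (1 - y_i) * math.log(1 - h_i)
    return -total / N

# with w = 0 every prediction is 0.5, so the cost is log(2) ~ 0.693
print(cross_entropy([0.0, 0.0], [[1.0, 2.0], [1.0, 3.0]], [1, 0]))
```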
19. ### Optimization objective – on the road to linear SVM

    With $z = w^T x$:

    $$-\log(h_w(x)) = -\log\frac{1}{1 + e^{-z}}, \qquad -\log(1 - h_w(x)) = -\log\left(1 - \frac{1}{1 + e^{-z}}\right)$$

    Replacing these log terms with piecewise-linear approximations $\mathrm{cost}_1$ and $\mathrm{cost}_0$ yields the SVM objective:

    $$J(w) = C \sum_{i=1}^{N} \left[ y^{(i)} \mathrm{cost}_1(w^T x^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(w^T x^{(i)}) \right] + \frac{1}{2} \sum w^2$$
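The slide does not define $\mathrm{cost}_1$ and $\mathrm{cost}_0$; a common choice (as in Andrew Ng's course formulation, an assumption here) is hinge-style piecewise-linear functions:

```python
def cost_1(z):
    # piecewise-linear stand-in for -log(sigmoid(z)): zero once z >= 1
    return max(0.0, 1.0 - z)

def cost_0(z):
    # piecewise-linear stand-in for -log(1 - sigmoid(z)): zero once z <= -1
    return max(0.0, 1.0 + z)

def svm_objective(w, X, y, C):
    """J(w) = C * sum_i [ y_i*cost_1(w^T x_i) + (1-y_i)*cost_0(w^T x_i) ]
    + 1/2 * sum(w_j^2)."""
    hinge = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x_i))
        hinge += y_i * cost_1(z) + (1 - y_i) * cost_0(z)
    return C * hinge + 0.5 * sum(w_j ** 2 for w_j in w)

# a positive example beyond the margin incurs no hinge cost,
# leaving only the regularizer 0.5 * 1^2 = 0.5
print(svm_objective([1.0, 0.0], [[2.0, 0.0]], [1], 1.0))
```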

21. ### The Problem

    Match hotels between two databases of ~500k records each, e.g. the Bellagio Resort:

    Name: Bellagio Las Vegas
    Address: 3600 S Las Vegas Blvd, Las Vegas, NV 89109
    vs.
    Address: 3600 South Las Vegas Boulevard, Las Vegas, NV 89109

    Two challenges: 1. matching, 2. performance.
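How such address pairs can be turned into a classifier feature is not detailed in the slides; a hedged sketch using normalization plus a stdlib string-similarity score (the abbreviation map is an illustrative assumption, not TrustYou's actual rules):

```python
import difflib

# illustrative abbreviation map -- an assumption for this sketch
ABBREV = {"s": "south", "n": "north", "blvd": "boulevard", "ave": "avenue"}

def normalize(address):
    tokens = address.lower().replace(",", "").split()
    return " ".join(ABBREV.get(t, t) for t in tokens)

def similarity(a, b):
    """String-similarity feature in [0, 1] for the matching classifier."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()

a = "3600 S Las Vegas Blvd, Las Vegas, NV 89109"
b = "3600 South Las Vegas Boulevard, Las Vegas, NV 89109"
print(similarity(a, b))  # 1.0 -- identical after normalization
```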
22. ### Alright, let's do it

    To beat: Precision = 90%, Recall = 60%.
    Approach: Preprocessing → Feature Engineering → Logistic Regression.
    First result: Accuracy = 99% !!!
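Why 99% accuracy is suspicious here: true matches are a tiny fraction of all candidate pairs, so a model that never predicts a match already scores ~99%. A small illustration with made-up numbers:

```python
# 1000 candidate pairs, only 10 true matches (imbalance assumed for illustration)
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000           # a useless model: always predict "no match"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(accuracy)  # 0.99 -- looks impressive
print(recall)    # 0.0  -- but not a single match found
```

This is why the baseline is stated in precision and recall rather than accuracy.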

25. ### Why it didn't work

    Data Sources → Feature Engineering → Learning Algorithm → Model – possible failure points: wrong parameter settings, inferior algorithm, bad train/test data selection, bad data.
26. ### How we made it work

    Addressing each failure point (wrong parameter settings, inferior algorithm, bad train/test data selection, bad data) for both matching and performance: Hadoop + Pig, normalization, data enrichment, better DB data quality, Random Forest, partitioning (Quadkey).
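The quadkey partitioning mentioned above can be sketched as follows: using the Bing-Maps-style tile scheme, nearby coordinates share a tile key, so candidate pairs only need to be generated within a partition instead of across all 500k × 500k combinations (the coordinates below are illustrative):

```python
import math

def quadkey(lat, lon, level):
    """Bing-Maps-style quadkey for a lat/lon at a given zoom level."""
    # Web-Mercator projection to [0, 1) x [0, 1)
    sin_lat = math.sin(math.radians(lat))
    x = (lon + 180.0) / 360.0
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    n = 1 << level
    tx = min(n - 1, max(0, int(x * n)))   # tile coordinates
    ty = min(n - 1, max(0, int(y * n)))
    # interleave the tile-coordinate bits into base-4 digits
    key = ""
    for i in range(level, 0, -1):
        digit = 0
        mask = 1 << (i - 1)
        if tx & mask:
            digit += 1
        if ty & mask:
            digit += 2
        key += str(digit)
    return key

# two near-identical hotel coordinates land in the same partition
print(quadkey(36.1126, -115.1767, 12) == quadkey(36.1127, -115.1765, 12))
```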