
Parametric Machine Learning & Record Linkage

HaFl
October 08, 2013


The presentation will first provide a basic introduction to the field of Machine Learning and its core concepts and procedures. Thereafter, the natural evolution of parametric algorithms will be shown, from linear regression to logistic regression and, if time allows, linear SVM. Finally, a real-life use case from TrustYou dealing with Record Linkage is presented.


Transcript

  1. Parametric Machine Learning & Record Linkage

  2. Machine Learning

  3. Many use cases

  4. Definition of Machine Learning: "Field of study that gives
    computers the ability to learn without being explicitly programmed"
    (Arthur Samuel, 1959). Rules vs. hidden patterns. Questions: What is
    "intelligence"? What is "artificial intelligence"?
  5. The Process: training. Data Sources → Feature Engineering →
    Learning Algorithm → Model (Parameters)
  6. The Process: ... and using. Data Sources → Feature Engineering →
    Model → Prediction
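
As a concrete illustration of the two phases, a minimal Python sketch; the feature function and all numbers are invented placeholders, not from the talk:

    import numpy as np

    def features(raw):
        # hypothetical feature engineering: bias, raw value, raw value squared
        return np.array([1.0, raw, raw ** 2])

    # training: data sources -> features -> learning algorithm -> model parameters
    X = np.array([features(x) for x in [0.0, 1.0, 2.0, 3.0]])
    t = np.array([0.1, 0.9, 4.2, 8.8])
    w, *_ = np.linalg.lstsq(X, t, rcond=None)  # least squares as the "learning algorithm"

    # using: the SAME feature engineering, then the model makes a prediction
    print(features(2.5) @ w)

Note that the feature engineering step is shared by both phases, which is exactly why it appears in both process diagrams.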
  7. Parametric Machine Learning

  8. Parametric vs. non-parametric. Parametric: inferring a finite set
    of parameters; the parameters define the distribution of the data,
    e.g. f(x) = w_0 + w_1 x_1. Examples: Linear Regression, Logistic
    Regression, Linear SVM. Non-parametric: training examples are
    explicitly used as parameters; the complexity of the functions is
    allowed to grow with more training data. Examples: Random Forest,
    non-linear SVM (kernel), Artificial Neural Networks.
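
A small sketch of the distinction, using 1-nearest-neighbour as the non-parametric stand-in (my choice for brevity; it is not one of the slide's examples), with invented data:

    import numpy as np

    X = np.array([0.0, 1.0, 2.0, 3.0])
    t = np.array([0.0, 1.1, 1.9, 3.2])

    # parametric: a fixed, finite set of parameters (w0, w1), however large N gets
    w1, w0 = np.polyfit(X, t, 1)
    parametric = lambda x: w0 + w1 * x

    # non-parametric (1-nearest neighbour): the training examples themselves
    # act as the "parameters", so the model grows with the training data
    nonparametric = lambda x: t[np.argmin(np.abs(X - x))]

    print(parametric(1.5), nonparametric(1.5))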
  9. Hypothesis of Linear Regression:
    h_w(x) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M
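
The hypothesis transcribed directly into Python (the example weights are made up):

    def h(w, x):
        # h_w(x) = w_0 + w_1*x + w_2*x^2 + ... + w_M*x^M
        return sum(w_j * x ** j for j, w_j in enumerate(w))

    print(h([1.0, 2.0, 0.5], 3.0))  # 1 + 2*3 + 0.5*3^2 = 11.5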
  10. Cost function (least squares):
    J(w) = 1/(2N) * Σ_{n=1}^{N} (h_w(x_n) - t_n)^2
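
The same cost as a sketch, assuming weights stored lowest degree first (the data points are invented):

    import numpy as np

    def J(w, x, t):
        # J(w) = 1/(2N) * sum_{n=1..N} (h_w(x_n) - t_n)^2
        h = np.polyval(w[::-1], x)  # np.polyval wants highest degree first
        return np.mean((h - t) ** 2) / 2

    x = np.array([0.0, 1.0, 2.0])
    t = np.array([1.0, 3.0, 5.0])
    print(J(np.array([1.0, 2.0]), x, t))  # perfect fit h(x) = 1 + 2x -> 0.0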
  11. Bias / variance / overfitting. [Plots: polynomial hypotheses
    h_w(x) = w_0 + w_1 x + ... + w_M x^M fitted by minimizing
    J(w) = 1/(2N) * Σ_{n=1}^{N} (h_w(x_n) - t_n)^2 to the same N = 10
    points, for degrees M = 0, 1, 3, 9]
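
A rough numerical re-run of that experiment; the sin(2πx) target and the noise level are my assumptions, chosen to mimic Bishop's figure:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 10)                    # N = 10
    t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

    for M in (0, 1, 3, 9):                           # the degrees from the slide
        w = np.polyfit(x, t, M)
        err = np.mean((np.polyval(w, x) - t) ** 2) / 2
        print(M, round(err, 5))  # M = 9 drives training error to ~0 by fitting the noise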
  12. One solution is more data. [Plots: the same high-degree fit with
    N = 15 and N = 100 training points]
  13. ... another one is regularization:
    J(w) = 1/(2N) * Σ_{n=1}^{N} (h_w(x_n) - t_n)^2
    becomes
    J(w) = 1/(2N) * Σ_{n=1}^{N} (h_w(x_n) - t_n)^2 + λ * Σ w^2
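
The regularized cost as a sketch (λ and the sample data are invented):

    import numpy as np

    def J_reg(w, x, t, lam):
        # the least-squares cost from before, plus the penalty lambda * sum(w^2),
        # which keeps the weights small and so damps the high-degree wiggles
        h = np.polyval(w[::-1], x)
        return np.mean((h - t) ** 2) / 2 + lam * np.sum(np.asarray(w) ** 2)

    x = np.array([0.0, 0.5, 1.0])
    t = np.array([0.0, 1.0, 0.0])
    print(J_reg(np.array([0.0, 4.0, -4.0]), x, t, lam=0.1))  # 0 data cost + 3.2 penalty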
  14. [Plot: binary "Spam?" labels, (No) = 0 and (Yes) = 1, against a
    feature running from 0 to 14] but...
  15. Logistic Regression. [Same plot: binary "Spam?" labels, (No) = 0
    and (Yes) = 1, against a feature from 0 to 14, now with a logistic
    curve]
  16. The "logistic" part. Goal: 0 ≤ h_w(x) ≤ 1. Linear:
    h_w(x) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = w^T x.
    Logistic function: f(x) = 1 / (1 + e^(-x)), giving
    h_w(x) = 1 / (1 + e^(-w^T x)).
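
A minimal transcription of the sigmoid and the resulting hypothesis (the example numbers are made up):

    import numpy as np

    def sigmoid(z):
        # f(z) = 1 / (1 + e^(-z)): squashes any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def h(w, x):
        # h_w(x) = sigmoid(w^T x), so the goal 0 <= h_w(x) <= 1 holds by construction
        return sigmoid(np.dot(w, x))

    print(sigmoid(0.0), h([1.0, -2.0], [1.0, 3.0]))  # 0.5, sigmoid(-5) ~ 0.0067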
  17. Cost function of logistic regression:
    Cost(h_w(x), y) = -log(h_w(x))      if y = 1
                      -log(1 - h_w(x))  if y = 0
    [Plots: both cost curves over h_w(x) from 0 to 1]
  18. Putting it all together.
    Hypothesis: h_w(x) = 1 / (1 + e^(-w^T x))
    Cost: Cost(h_w(x), y) = -log(h_w(x)) if y = 1, -log(1 - h_w(x)) if y = 0
    Optimization objective:
    J(w) = -1/N * Σ_{i=1}^{N} [ y^(i) log(h_w(x^(i))) + (1 - y^(i)) log(1 - h_w(x^(i))) ]
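
The objective transcribed into vectorized Python (the two data points are invented):

    import numpy as np

    def J(w, X, y):
        # J(w) = -1/N * sum_i [ y_i*log(h_w(x_i)) + (1 - y_i)*log(1 - h_w(x_i)) ]
        h = 1.0 / (1.0 + np.exp(-(X @ w)))
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

    X = np.array([[1.0, -2.0], [1.0, 3.0]])  # first column is the bias feature
    y = np.array([0.0, 1.0])
    print(J(np.array([0.0, 1.0]), X, y))     # ~0.088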
  19. On the road to linear SVM. With z = w^T x, the two log costs
    -log(h_w(x)) = -log(1 / (1 + e^(-z))) and
    -log(1 - h_w(x)) = -log(1 - 1 / (1 + e^(-z)))
    are approximated by cost_1(z) and cost_0(z), giving the optimization
    objective
    J(w) = C * Σ_{i=1}^{N} [ y^(i) cost_1(w^T x^(i)) + (1 - y^(i)) cost_0(w^T x^(i)) ] + 1/2 * Σ w^2
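
The slide does not define cost_1 and cost_0, so the hinge surrogates with margins at ±1 below are an assumption (they are the standard choice for linear SVM); the data is invented:

    import numpy as np

    def cost1(z):
        # surrogate for -log(h_w(x)): zero once z = w^T x >= 1
        return np.maximum(0.0, 1.0 - z)

    def cost0(z):
        # surrogate for -log(1 - h_w(x)): zero once z <= -1
        return np.maximum(0.0, 1.0 + z)

    def J(w, X, y, C=1.0):
        z = X @ w
        return C * np.sum(y * cost1(z) + (1 - y) * cost0(z)) + 0.5 * np.sum(w ** 2)

    X = np.array([[1.0, 2.0], [1.0, -2.0]])
    y = np.array([1.0, 0.0])
    print(J(np.array([0.0, 1.0]), X, y))  # both points beyond the margin -> only 0.5 penalty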
  20. Record Linkage

  21. The Problem. Match two hotel databases of 500k records each, e.g.
    Name: Bellagio Las Vegas, Address: 3600 S Las Vegas Blvd, Las Vegas,
    NV 89109 vs. Name: Bellagio Resort, Address: 3600 South Las Vegas
    Boulevard, Las Vegas, NV 89109. Two challenges: 1. matching,
    2. performance.
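
A toy version of the matching challenge, with a hypothetical normalize step; the abbreviation table is illustrative, not TrustYou's actual rule set:

    # the two Bellagio addresses from the slide only agree after normalization
    ABBREV = {"s": "south", "blvd": "boulevard"}

    def normalize(address):
        tokens = address.lower().replace(",", "").split()
        return " ".join(ABBREV.get(tok, tok) for tok in tokens)

    a = "3600 S Las Vegas Blvd, Las Vegas, NV 89109"
    b = "3600 South Las Vegas Boulevard, Las Vegas, NV 89109"
    print(normalize(a) == normalize(b))  # True: the records now agree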
  22. Alright, let's do it. To beat: Precision = 90%, Recall = 60%.
    Approach: Preprocessing → Feature Engineering → Logistic Regression.
    First result: Accuracy = 99% !!!
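
Why 99% accuracy can coexist with the precision/recall figures to beat: candidate pairs are overwhelmingly non-matches, so accuracy rewards a model that rejects almost everything. A sketch with invented confusion counts, chosen only to reproduce the slide's numbers:

    # 1,000,000 candidate pairs, of which only 10,000 are true matches
    tp, fn = 6_000, 4_000     # true matches found / missed
    fp, tn = 600, 989_400     # false alarms / correctly rejected non-matches

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(accuracy, precision, recall)  # ~0.995, ~0.91, 0.60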
  23. That's awesome

  24. [Image-only slide]
  25. Why it didn't work. Candidate causes: wrong parameter settings,
    inferior algorithm, bad train/test data selection, bad data.
    [Diagram: Data Sources → Feature Engineering → Learning Algorithm →
    Model]
  26. How we made it work, on both matching and performance:
    Normalization, Data enrichment, better DB data quality and Random
    Forest against the bad data and the inferior algorithm; Hadoop +
    Pig and Partitioning (Quadkey) for performance.
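
A sketch of the quadkey idea behind the partitioning, following the standard Bing Maps tile scheme; the implementation details are my assumptions, not the talk's code:

    import math

    def quadkey(lat, lon, level=12):
        # each digit selects a quadrant of the current tile, so nearby
        # hotels share a long common key prefix
        x = (lon + 180.0) / 360.0
        s = math.sin(math.radians(lat))
        y = 0.5 - math.log((1.0 + s) / (1.0 - s)) / (4.0 * math.pi)
        n = 1 << level
        tx = min(n - 1, max(0, int(x * n)))
        ty = min(n - 1, max(0, int(y * n)))
        digits = []
        for i in range(level, 0, -1):
            mask = 1 << (i - 1)
            d = 0
            if tx & mask:
                d += 1
            if ty & mask:
                d += 2
            digits.append(str(d))
        return "".join(digits)

    # candidate pairs are only generated inside a partition (same key),
    # which keeps the 500k x 500k comparison tractable
    print(quadkey(36.1126, -115.1767))  # the Bellagio's tile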
  27. Precision > 97%, Recall > 80%

  28. 3 Things to remember

  29. 1: Step-by-step leads to clarity

  30. 2: The devil is in the details

  31. 3: Keep going until you smile

  32. Attribution. Images: (CC) from Flickr; © "Pattern Recognition and
    Machine Learning" by Christopher M. Bishop (2006). Colors: from
    COLOURlovers, "Giant Goldfish" palette by manekineko.
    Images in detail*:
    Slide 1/2/7/17: by xdxd_vs_xdxd; concept & realisation: Oriana
    Persico & Salvatore Iaconesi, Lara Mezzapelle & Giacomo Deriu;
    curated by Marco Aion Mangani & Alice Zannoni; produced by Miria
    Baccolini, BT'F Gallery.
    Slide 3: "Backgammon" by Caroline Jones (CJ Woodworking &),
    photographed by Kate Fisher (Fishbone1); "Siri" by Sean MacEntee;
    "Spam" by ario_; "IBM Watson" by John Tolva (jntolva); "Google
    driverless car" by Ben (loudtiger); "TrustYou" © TrustYou GmbH.
    Slide 5/6/18/22: "Data Sources" by Tim Morgan.
    Slide 5/6/22: "Feature Engineering" by Duke University Archives
    (Duke Yearlook); "Learning Algorithm" by Tom Keene (anthillsocial);
    "Model" by Steve Brokaw (SEBImages), model: Alexis Farley, MUA:
    Mandi Lucas, hair: Jennifer Odom.
    Slide 6: "Prediction" by Gemma Stiles.
    Slide 9/10: by Christopher M. Bishop, Pattern Recognition and
    Machine Learning, pages 4 & 6, Springer, 2006.
    Slide 20/24: by Bill Gracey.
    Slide 21: by GREG ANDREWS (gregthemayor).
    Slide 26: by Gal (baboonTM).
    Slide 27: by harusday.
    Slide 28: by Viewminder.
    * The quoted names for each image are not the actual titles but help
    to identify the appropriate image when there are multiple images on
    one slide.