Slide 1

Slide 1 text

Parametric Machine Learning
Record Linkage

Slide 2

Slide 2 text

Machine Learning

Slide 3

Slide 3 text

Many use cases

Slide 4

Slide 4 text

Questions
Definition of Machine Learning: "Field of study that gives computers the ability to learn without being explicitly programmed" (Arthur Samuel, 1959)
Rules vs. Hidden Patterns
What is "Intelligence"? What is "Artificial Intelligence"?

Slide 5

Slide 5 text

The Process: training ...
Data Sources → Feature Engineering → Learning Algorithm → Model (Parameters)

Slide 6

Slide 6 text

The Process: ... and using
Data Sources → Feature Engineering → Model → Prediction

Slide 7

Slide 7 text

Parametric Machine Learning

Slide 8

Slide 8 text

Parametric vs. non-parametric

Parametric: inferring a finite set of parameters; the parameters define the distribution of the data.
Examples: Linear Regression, Logistic Regression, Linear SVM, e.g. f(x) = w_0 + w_1 x_1

Non-parametric: training examples are explicitly used as parameters; the complexity of the function is allowed to grow with more training data.
Examples: Random Forest, non-linear SVM (kernel), Artificial Neural Networks

Slide 9

Slide 9 text

Hypothesis of Linear Regression

h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M
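
As a minimal sketch of this hypothesis (NumPy-based; the weights are illustrative), evaluating the degree-M polynomial for a given weight vector:

```python
import numpy as np
from numpy.polynomial import polynomial as P

# h_w(x) = w_0 + w_1*x + w_2*x^2 + ... + w_M*x^M
# polyval expects the coefficients ordered from w_0 upwards.
w = np.array([1.0, 2.0, 0.5])   # illustrative weights, M = 2
print(P.polyval(3.0, w))        # 1 + 2*3 + 0.5*9 = 11.5
```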

Slide 10

Slide 10 text

Cost function – least squares

J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2
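
A direct transcription of this cost into NumPy (a sketch; h_w is the polynomial hypothesis from the previous slide):

```python
import numpy as np
from numpy.polynomial import polynomial as P

def least_squares_cost(w, x, t):
    """J(w) = 1/(2N) * sum_n (h_w(x_n) - t_n)^2."""
    predictions = P.polyval(x, w)   # h_w(x_n) for every sample
    return np.sum((predictions - t) ** 2) / (2 * len(x))

# Illustrative data: three points on the line t = 2x, fit exactly by w = (0, 2).
print(least_squares_cost([0.0, 2.0], np.array([1.0, 2.0, 3.0]),
                         np.array([2.0, 4.0, 6.0])))   # 0.0
```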

Slide 11

Slide 11 text

Bias / variance / overfitting

[Polynomial fits to the same N = 10 points for M = 0, 1, 3, and 9]

h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M
J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2
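
To reproduce the effect shown here (the figures follow Bishop's sin(2πx) example, so that target function is an assumption), fit polynomials of increasing degree to the same ten noisy points and watch the training cost collapse even as the fit gets wilder:

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)                                # N = 10
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)   # noisy targets

for M in (0, 1, 3, 9):
    w = P.polyfit(x, t, M)   # degree-M least-squares fit
    J = np.sum((P.polyval(x, w) - t) ** 2) / (2 * len(x))
    print(f"M = {M}: training cost J(w) = {J:.5f}")
```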

Slide 12

Slide 12 text

One solution is more data

[The same high-degree fit for N = 15 and for N = 100]

Slide 13

Slide 13 text

...another one is regularization

Without regularization:
J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2

With an L2 penalty:
J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2 + \lambda \sum w^2
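
The penalty only adds one line to the cost from Slide 10 (a sketch; λ is illustrative, and whether w_0 is penalized varies by convention):

```python
import numpy as np
from numpy.polynomial import polynomial as P

def regularized_cost(w, x, t, lam):
    """Least-squares cost plus an L2 penalty: J(w) + lambda * sum(w_j^2)."""
    w = np.asarray(w)
    data_term = np.sum((P.polyval(x, w) - t) ** 2) / (2 * len(x))
    return data_term + lam * np.sum(w ** 2)   # penalizes all weights, as on the slide
```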

Slide 14

Slide 14 text

Spam? (No = 0, Yes = 1)
[Plot: the binary labels fitted with a straight line]
but...

Slide 15

Slide 15 text

Logistic Regression
Spam? (No = 0, Yes = 1)
[Plot: the same binary labels fitted with a logistic curve]

Slide 16

Slide 16 text

The "Logistic" part

Goal: 0 \le h_w(x) \le 1

Linear: h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = w^T x
Logistic: f(x) = \frac{1}{1 + e^{-x}}, so h_w(x) = \frac{1}{1 + e^{-w^T x}}
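
A minimal sketch of the squashing step: the linear score w^T x can be any real number, and the sigmoid maps it into (0, 1):

```python
import numpy as np

def sigmoid(z):
    """f(z) = 1 / (1 + exp(-z)); output is always strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def h(w, x):
    """Logistic hypothesis h_w(x) = sigmoid(w^T x)."""
    return sigmoid(np.dot(w, x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # ~0.0000, 0.5, ~1.0
```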

Slide 17

Slide 17 text

Cost function of logistic regression

\mathrm{Cost}(h_w(x), y) =
\begin{cases}
-\log(h_w(x)) & \text{if } y = 1 \\
-\log(1 - h_w(x)) & \text{if } y = 0
\end{cases}

[Plots of -\log(h_w(x)) and -\log(1 - h_w(x)) against h_w(x)]

Slide 18

Slide 18 text

Putting it all together

Hypothesis:
h_w(x) = \frac{1}{1 + e^{-w^T x}}

Cost function:
\mathrm{Cost}(h_w(x), y) =
\begin{cases}
-\log(h_w(x)) & \text{if } y = 1 \\
-\log(1 - h_w(x)) & \text{if } y = 0
\end{cases}

Optimization objective:
J(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y^{(i)} \log h_w(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_w(x^{(i)})\right) \right]
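
The whole objective fits in a few lines of NumPy (a sketch; X is assumed to be an N×D matrix of feature rows, y a 0/1 label vector):

```python
import numpy as np

def logistic_cost(w, X, y):
    """J(w) = -1/N * sum_i [y_i*log(h_i) + (1 - y_i)*log(1 - h_i)]."""
    h = 1.0 / (1.0 + np.exp(-X @ w))   # h_w(x^(i)) for every row of X
    eps = 1e-12                        # numerical guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```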

Slide 19

Slide 19 text

On the road to linear SVM

With z = w^T x:
-\log(h_w(x)) = -\log\frac{1}{1 + e^{-z}}
-\log(1 - h_w(x)) = -\log\left(1 - \frac{1}{1 + e^{-z}}\right)

[Plots of both cost terms against z]

Optimization objective:
J(w) = C \sum_{i=1}^{N} \left[ y^{(i)} \,\mathrm{cost}_1(w^T x^{(i)}) + (1 - y^{(i)}) \,\mathrm{cost}_0(w^T x^{(i)}) \right] + \frac{1}{2} \sum w^2
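
cost_1 and cost_0 are usually taken to be hinge-style, piecewise-linear surrogates for the two log-loss curves; the slide does not spell out their shape, so treat this as the common convention rather than the deck's definition:

```python
import numpy as np

def cost_1(z):
    """Surrogate for -log(h): zero for z >= 1, rising linearly below."""
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    """Surrogate for -log(1 - h): zero for z <= -1, rising linearly above."""
    return np.maximum(0.0, 1.0 + z)
```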

Slide 20

Slide 20 text

Record Linkage

Slide 21

Slide 21 text

The Problem

Match hotels between two databases of ~500k records each, by name and address. The same hotel may appear as "Bellagio Resort, 3600 South Las Vegas Boulevard, Las Vegas, NV 89109" in one and as "Bellagio Las Vegas, 3600 S Las Vegas Blvd, Las Vegas, NV 89109" in the other.

Two challenges: 1. matching, 2. performance
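
The deck does not list the concrete features, but a string-similarity score over names and addresses is the kind of signal such a matcher feeds into feature engineering; a rough sketch using only the standard library:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; real systems use tuned measures."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("3600 S Las Vegas Blvd, Las Vegas, NV 89109",
                 "3600 South Las Vegas Boulevard, Las Vegas, NV 89109"))
```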

Slide 22

Slide 22 text

Alright, let's do it
To beat: Precision = 90%, Recall = 60%
Approach: Preprocessing → Feature Engineering → Logistic Regression
First result: Accuracy = 99% !!!
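
A skeletal version of the Feature Engineering → Logistic Regression step (hypothetical toy data; the real pipeline ran on the hotel features above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: each row holds similarity features for one candidate
# record pair (e.g. name similarity, address similarity); y marks true matches.
X = np.array([[0.95, 0.90], [0.20, 0.10], [0.88, 0.75], [0.15, 0.30]])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[0.90, 0.80]]))   # match probability for a new pair
```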

Slide 23

Slide 23 text

That's awesome

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

Why it didn't work
[Process diagram: Data Sources → Feature Engineering → Learning Algorithm → Model]
- Wrong parameter settings
- Inferior algorithm
- Bad train/test data selection
- Bad data
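
One detail worth illustrating (an assumption about this dataset, not stated on the slide): in record linkage the overwhelming majority of candidate pairs are non-matches, so a model can score 99% accuracy while missing the matches; that is why the targets were set in terms of precision and recall:

```python
import numpy as np

y_true = np.array([0] * 990 + [1] * 10)   # made-up labels: matches are rare
y_pred = np.zeros(1000, dtype=int)        # a model that never predicts "match"

accuracy = np.mean(y_true == y_pred)      # 0.99, looks great...
tp = np.sum((y_pred == 1) & (y_true == 1))
recall = tp / np.sum(y_true == 1)         # ...but recall is 0.0
print(accuracy, recall)
```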

Slide 26

Slide 26 text

How we made it work
The suspects (wrong parameter settings, inferior algorithm, bad train/test data selection, bad data) were tackled on two fronts:
Matching: Normalization, Data enrichment, Better DB data quality, Random Forest
Performance: Hadoop, Pig, Partitioning (Quadkey), Random Forest
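
For the "Partitioning (Quadkey)" part, a sketch of how a quadkey is computed (standard Bing-maps tile scheme; zoom level and coordinates are illustrative). Records whose coordinates share a quadkey land in the same map tile, so only pairs within a tile need to be compared:

```python
import math

def quadkey(lat: float, lon: float, level: int = 15) -> str:
    """Quadkey of the map tile containing (lat, lon) at the given zoom level."""
    sin_lat = math.sin(math.radians(lat))
    x = (lon + 180.0) / 360.0
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    n = 1 << level
    tx = min(n - 1, max(0, int(x * n)))   # tile column
    ty = min(n - 1, max(0, int(y * n)))   # tile row
    digits = []
    for i in range(level, 0, -1):         # interleave the bits of tx and ty
        mask = 1 << (i - 1)
        digits.append(str((1 if tx & mask else 0) + (2 if ty & mask else 0)))
    return "".join(digits)

print(quadkey(36.113, -115.176))   # approximate Bellagio coordinates
```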

Slide 27

Slide 27 text

Precision > 97%, Recall > 80%

Slide 28

Slide 28 text

3 Things to remember

Slide 29

Slide 29 text

1. Step-by-step leads to clarity

Slide 30

Slide 30 text

2. The devil is in the details

Slide 31

Slide 31 text

3. Keep going until you smile

Slide 32

Slide 32 text

Attribution

Images: (CC) from Flickr; © "Pattern Recognition and Machine Learning" by Christopher M. Bishop (2006)
Colors: from colourlovers, "Giant Goldfish" palette by manekineko

Images in detail*:
Slide 1/2/7/17: by xdxd_vs_xdxd; concept & realisation: Oriana Persico & Salvatore Iaconesi, Lara Mezzapelle & Giacomo Deriu; curated by: Marco Aion Mangani & Alice Zannoni; produced by: Miria Baccolini, BT'F Gallery
Slide 3: "Backgammon" by Caroline Jones (CJ Woodworking &), photographed by Kate Fisher (Fishbone1); "Siri" by Sean MacEntee; "Spam" by ario_; "IBM Watson" by John Tolva (jntolva); "Google driverless car" by Ben (loudtiger); "TrustYou" © TrustYou GmbH
Slide 5/6/18/22: "Data Sources" by Tim Morgan
Slide 5/6/22: "Feature Engineering" by Duke University Archives (Duke Yearlook); "Learning Algorithm" by Tom Keene (anthillsocial); "Model" by Steve Brokaw (SEBImages); model: Alexis Farley, MUA: Mandi Lucas, hair: Jennifer Odom
Slide 6: "Prediction" by Gemma Stiles
Slide 9/10: by Christopher M. Bishop, Pattern Recognition and Machine Learning, pages 4 & 6, Springer, 2006
Slide 20/24: by Bill Gracey
Slide 21: by GREG ANDREWS (gregthemayor)
Slide 26: by Gal (baboonTM)
Slide 27: by harusday
Slide 28: by Viewminder

* The quoted names are not the actual image titles; they help identify the right image when there are multiple images on one slide.