
Florian Hartl - Parametric Machine Learning & Record Linkage

The presentation first provides a basic introduction to the field of Machine Learning and its core concepts and procedures. Thereafter, the natural evolution of parametric algorithms is shown – from linear regression to logistic regression and, if time allows, linear SVM. Finally, a real-life use case from TrustYou dealing with Record Linkage is presented.


Munich DataGeeks

October 08, 2013

Transcript

  1. Definition of Machine Learning: „Field of study that gives computers
     the ability to learn without being explicitly programmed“ (Arthur
     Samuel, 1959). Rules vs. hidden patterns. Questions: What is
     „intelligence“? What is „artificial intelligence“?
  2. Supervised Learning – the process: Data Sources → Feature Engineering →
     Learning Algorithm → Model (parameters); the model is fitted on
     labeled training data.
  3. Parametric vs. non-parametric. Parametric: inferring a finite set of
     parameters; the parameters define the distribution of the data.
     Examples: Linear Regression, Logistic Regression, Linear SVM, e.g.
     f(x) = w_0 + w_1 x_1. Non-parametric: training examples are explicitly
     used as parameters; the complexity of the function is allowed to grow
     with more training data. Examples: Random Forest, non-linear SVM
     (kernel), Artificial Neural Networks.
  4. Hypothesis of Linear Regression:
     h_w(x) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M
  5. Cost function – least squares:
     J(w) = 1/(2N) · Σ_{n=1}^{N} ( h_w(x_n) − t_n )^2
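The least-squares cost on this slide translates almost line-for-line into numpy. This is a minimal sketch, not from the deck; the polynomial hypothesis helper and the toy data are invented for illustration:

```python
import numpy as np

def h(w, x):
    """Polynomial hypothesis h_w(x) = w0 + w1*x + ... + wM*x^M."""
    # np.polyval expects the highest-degree coefficient first
    return np.polyval(w[::-1], x)

def least_squares_cost(w, x, t):
    """J(w) = 1/(2N) * sum_n (h_w(x_n) - t_n)^2."""
    n = len(x)
    return np.sum((h(w, x) - t) ** 2) / (2 * n)

# toy data generated exactly by t = 1 + 2x, so the true weights cost 0
x = np.array([0.0, 1.0, 2.0])
t = np.array([1.0, 3.0, 5.0])
print(least_squares_cost(np.array([1.0, 2.0]), x, t))  # → 0.0
```

Any other weight vector yields a strictly positive cost, which is what gradient descent (or the closed-form normal equations) minimizes.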
  6. [Plot: Spam? labels (No = 0, Yes = 1) against a feature on 0–14,
     fitted with linear regression – looks workable at first, but…]
  7. [The same Spam? (No = 0, Yes = 1) plot, now fitted with an S-shaped
     curve: Logistic Regression]
  8. Goal: 0 ≤ h_w(x) ≤ 1. The „linear“ part:
     h_w(x) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M = w^T x.
     The „logistic“ part: f(x) = 1 / (1 + e^{−x}), giving
     h_w(x) = 1 / (1 + e^{−w^T x}).
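As a quick illustration (not from the deck), composing the linear part with the logistic function squashes any real score into (0, 1); the weights and input below are made up:

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)); maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(w, x):
    """Logistic-regression hypothesis h_w(x) = sigmoid(w^T x)."""
    return sigmoid(w @ x)

w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])   # w^T x = 0, right on the decision boundary
print(h(w, x))  # → 0.5
```

A score of exactly 0 maps to probability 0.5; large positive scores approach 1 and large negative scores approach 0.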
  9. Cost function of logistic regression:
     Cost(h_w(x), y) = −log(h_w(x))      if y = 1
                       −log(1 − h_w(x))  if y = 0
     [Plots of −log(h_w(x)) and −log(1 − h_w(x)) over h_w(x) ∈ [0, 1]]
  10. Putting it all together. Hypothesis: h_w(x) = 1 / (1 + e^{−w^T x}).
      Cost function: Cost(h_w(x), y) = −log(h_w(x)) if y = 1,
      −log(1 − h_w(x)) if y = 0. Optimization objective:
      J(w) = −(1/N) · Σ_{i=1}^{N} [ y^(i) log(h_w(x^(i)))
             + (1 − y^(i)) log(1 − h_w(x^(i))) ]
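The combined objective J(w) (the average cross-entropy) is short to write down in numpy. A minimal sketch with invented toy data; the first column of X serves as the bias term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(w, X, y):
    """J(w) = -1/N * sum_i [y_i log h(x_i) + (1 - y_i) log(1 - h(x_i))]."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy data; column 0 of X is the constant bias feature x0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 1.0])
# with w = 0, every prediction is 0.5, so the cost is log(2) ≈ 0.693
print(cross_entropy_cost(np.zeros(2), X, y))
```

Training amounts to minimizing this function in w, e.g. with gradient descent; unlike least squares on 0/1 labels, it heavily penalizes confident wrong predictions.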
  11. Optimization objective – on the road to linear SVM. With z = w^T x:
      −log(h_w(x)) = −log(1 / (1 + e^{−z})) and
      −log(1 − h_w(x)) = −log(1 − 1 / (1 + e^{−z})).
      Replacing these curves with piecewise-linear costs cost_1(z) and
      cost_0(z) yields the SVM objective:
      J(w) = C · Σ_{i=1}^{N} [ y^(i) cost_1(w^T x^(i))
             + (1 − y^(i)) cost_0(w^T x^(i)) ] + (1/2) · Σ_j w_j^2
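The deck leaves cost_1 and cost_0 abstract; a common concrete choice (an assumption here, not stated on the slide) is the pair of hinge functions below, which upper-bound the two log-loss curves and are zero beyond a margin of 1:

```python
import numpy as np

def cost1(z):
    """Hinge-style stand-in for -log(sigmoid(z)), used when y = 1."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Hinge-style stand-in for -log(1 - sigmoid(z)), used when y = 0."""
    return np.maximum(0.0, 1.0 + z)

def svm_objective(w, X, y, C=1.0):
    """J(w) = C * sum_i [y_i cost1(w^T x_i) + (1-y_i) cost0(w^T x_i)]
              + 1/2 * ||w||^2."""
    z = X @ w
    data_term = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    return C * data_term + 0.5 * np.sum(w ** 2)

# two points classified with margin >= 1: only the regularizer remains
X = np.array([[2.0], [-2.0]])
y = np.array([1.0, 0.0])
print(svm_objective(np.array([1.0]), X, y))  # → 0.5
```

The flat zero region is what produces the SVM's large-margin behavior: points already beyond the margin contribute nothing, so only the borderline points (support vectors) shape w.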
  12. The problem: database matching of hotel records by name and address.
      Database 1 (500k records): Name: Belagio Resort, Address: 3600 S Las
      Vegas Blvd, Las Vegas, NV 89109. Database 2 (500k records): Name:
      Bellagio Las Vegas, Address: 3600 South Las Vegas Boulevard, Las
      Vegas, NV 89109. Two challenges: 1. matching, 2. performance.
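One way to turn such a record pair into input for a classifier (an illustrative sketch, not TrustYou's actual pipeline) is to compute per-field string similarities as features, here with the standard library's SequenceMatcher:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def features(rec_a, rec_b):
    """Feature vector for a candidate pair: name and address similarity."""
    return [similarity(rec_a["name"], rec_b["name"]),
            similarity(rec_a["address"], rec_b["address"])]

a = {"name": "Belagio Resort",
     "address": "3600 S Las Vegas Blvd, Las Vegas, NV 89109"}
b = {"name": "Bellagio Las Vegas",
     "address": "3600 South Las Vegas Boulevard, Las Vegas, NV 89109"}
print(features(a, b))  # two scores in [0, 1]; the addresses score high
```

The classifier then learns a decision boundary over such feature vectors ("match" vs. "no match") instead of hand-tuned similarity thresholds.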
  13. Alright, let’s do it. To beat: Precision = 90%, Recall = 60%.
      Approach: Preprocessing → Feature Engineering → Logistic Regression.
      First result: Accuracy = 99% !!!
  14. Why it didn’t work (somewhere in the pipeline Data Sources → Feature
      Engineering → Learning Algorithm → Model): wrong parameter settings?
      Inferior algorithm? Bad train/test data selection? Bad data?
  15. How we made it work. Matching (addressing wrong parameter settings,
      inferior algorithm, bad train/test data selection, bad data):
      normalization, data enrichment, better DB data quality, Random
      Forest. Performance: Hadoop Pig, partitioning (Quadkey), Random
      Forest.
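The algorithm swap mentioned above (logistic regression replaced by a random forest) looks roughly like this with scikit-learn; the two-feature setup and the synthetic "match if both similarities are high" data are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# pretend features per candidate pair: [name_similarity, address_similarity]
X = rng.random((200, 2))
# synthetic labels: call it a match when the similarities are high on average
y = (X.mean(axis=1) > 0.6).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict([[0.9, 0.95]]))  # a pair with very similar name/address
```

Because each tree sees a bootstrap sample and random feature subsets, the forest is far less sensitive to parameter settings and noisy features than a single linear model, which matches the "inferior algorithm" fix on this slide.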
  16. Attribution. Images: (CC) from Flickr; © “Pattern Recognition and
      Machine Learning” by Christopher M. Bishop (2006). Colors: from
      COLOURlovers, “Giant Goldfish” palette by manekineko. Images in
      detail: Slide 1/2/7/17: by xdxd_vs_xdxd, concept & realisation:
      Oriana Persico & Salvatore Iaconesi, Lara Mezzapelle & Giacomo Deriu,
      curated by: Marco Aion Mangani & Alice Zannoni, produced by: Miria
      Baccolini - BT'F Gallery. Slide 3: “Backgammon”: by Caroline Jones
      (CJ Woodworking &), photographed by Kate Fisher (Fishbone1); “Siri”:
      by Sean MacEntee; “Spam”: by ario_; “IBM Watson”: by John Tolva
      (jntolva); “Google driverless car”: by Ben (loudtiger); “TrustYou”:
      © TrustYou GmbH. Slide 5/6/18/22: “Data Sources”: by Tim Morgan.
      Slide 5/6/22: “Feature Engineering”: by Duke University Archives
      (Duke Yearlook); “Learning Algorithm”: by Tom Keene (anthillsocial);
      “Model”: by Steve Brokaw (SEBImages) – model: Alexis Farley, MUA:
      Mandi Lucas, hair: Jennifer Odom. Slide 6: “Prediction”: by Gemma
      Stiles. Slide 9/10: by Christopher M. Bishop: Pattern Recognition and
      Machine Learning, pages 4 & 6, Springer, 2006. Slide 20/24: by Bill
      Gracey. Slide 21: by GREG ANDREWS (gregthemayor). Slide 26: by Gal
      (baboonTM). Slide 27: by harusday. Slide 28: by Viewminder.
      * The quoted names for each image are not the actual titles but help
      to identify the appropriate image if there are multiple images on one
      slide.