
Florian Hartl - Parametric Machine Learning & Record Linkage


The presentation first provides a basic introduction to the field of Machine Learning and its core concepts and procedures. It then shows the natural evolution of parametric algorithms, from linear regression to logistic regression and, if time allows, linear SVM. Finally, it presents a real-life use case from TrustYou dealing with Record Linkage.

MunichDataGeeks

October 08, 2013


Transcript

  1. Definition of Machine Learning

    „Field of study that gives computers the ability to learn without being explicitly programmed“ (Arthur Samuel, 1959). Rules vs. Hidden Patterns. Questions: What is „Intelligence“? What is „Artificial Intelligence“?
  2. The Process

    Supervised Learning (training ...): Data Sources ..., Feature Engineering, Learning Algorithm, Model, Parameters
  3. Parametric vs. non-parametric

    Parametric: inferring a finite set of parameters. Parameters define the distribution of the data. Non-parametric: training examples are explicitly used as parameters. Complexity of functions is allowed to grow with more training data. Algorithms shown: Linear Regression, Logistic Regression, Linear SVM, Random Forest, Non-linear SVM (kernel), Artificial Neural Networks. Examples: $f(x) = w_0 + w_1 x_1$ ...
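
To make the distinction on slide 3 concrete, here is a minimal sketch on a made-up one-dimensional dataset (not from the talk). `fit_linear` compresses the data into two parameters, while `predict_1nn`, a hypothetical 1-nearest-neighbour helper chosen only for illustration, has to keep every training example around.

```python
def fit_linear(xs, ts):
    """Closed-form least-squares fit of f(x) = w0 + w1 * x (parametric)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_t = sum(ts) / n
    cov = sum((x - mean_x) * (t - mean_t) for x, t in zip(xs, ts))
    var = sum((x - mean_x) ** 2 for x in xs)
    w1 = cov / var
    w0 = mean_t - w1 * mean_x
    return w0, w1


def predict_1nn(xs, ts, x):
    """Non-parametric: the training examples themselves act as the model."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
    return ts[nearest]


# Toy data (made up for illustration): t = 1 + 2x exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]

w0, w1 = fit_linear(xs, ts)               # two numbers summarise the data
linear_prediction = w0 + w1 * 1.4         # the training data is no longer needed
nn_prediction = predict_1nn(xs, ts, 1.4)  # needs xs and ts at prediction time
```

After fitting, the linear model can discard the training set, whereas the nearest-neighbour predictor grows with every example added.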
  4. Hypothesis of Linear Regression

    $h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$
  5. Cost function – least squares

    $J(w) = \frac{1}{2N} \sum_{n=1}^{N} \left( h_w(x_n) - t_n \right)^2$
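
The hypothesis and the least-squares cost from slides 4 and 5 can be sketched in a few lines of plain Python. The function names and the toy data are assumptions made for illustration; they do not come from the talk.

```python
def hypothesis(w, x):
    """Polynomial hypothesis h_w(x) = w[0] + w[1]*x + ... + w[M]*x**M."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))


def least_squares_cost(w, xs, ts):
    """J(w) = 1/(2N) * sum_n (h_w(x_n) - t_n)^2."""
    n = len(xs)
    return sum((hypothesis(w, x) - t) ** 2 for x, t in zip(xs, ts)) / (2 * n)


# Toy data (made up): targets generated by t = 1 + 2x.
xs = [0.0, 1.0, 2.0]
ts = [1.0, 3.0, 5.0]

perfect_cost = least_squares_cost([1.0, 2.0], xs, ts)  # weights match the data
worse_cost = least_squares_cost([0.0, 2.0], xs, ts)    # intercept is off by 1
```

Weights that reproduce the targets exactly give a cost of zero; every other choice of weights gives a positive cost, which is what a learning algorithm then tries to minimise.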
  6. [Plot: „Spam?“ axis from 0 (No) to 1 (Yes), second axis from 0 to 14]

    but...
  7. Logistic Regression

    [Plot: „Spam?“ axis from 0 (No) to 1 (Yes), second axis from 0 to 14]
  8. The „Logistic“ part

    Goal: $0 \le h_w(x) \le 1$. Linear: $h_w(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = w^T x$. Logistic: $f(x) = \frac{1}{1 + e^{-x}}$, hence $h_w(x) = \frac{1}{1 + e^{-w^T x}}$.
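
As a rough illustration of slide 8, the following sketch wraps the linear term $w^T x$ in the logistic function so that the output always lies between 0 and 1. The helper names and example weights are assumptions, and the input vector is assumed to carry a constant 1 as its first entry so that `w[0]` acts as the bias.

```python
import math


def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_hypothesis(w, x):
    """h_w(x) = sigmoid(w^T x); x[0] is assumed to be the constant 1."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x))
    return sigmoid(z)


# Made-up weights and inputs, for illustration only.
p_mid = logistic_hypothesis([0.0, 0.0], [1.0, 5.0])     # z = 0
p_high = logistic_hypothesis([-4.0, 1.0], [1.0, 10.0])  # z = 6
p_low = logistic_hypothesis([-4.0, 1.0], [1.0, 0.0])    # z = -4
```

A value of `z = 0` maps to exactly 0.5, large positive `z` approaches 1, and large negative `z` approaches 0, which satisfies the goal stated on the slide.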
  9. Cost function of logistic regression

    $\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log(h_w(x)) & \text{if } y = 1 \\ -\log(1 - h_w(x)) & \text{if } y = 0 \end{cases}$ [Plots: $-\log(h_w(x))$ and $-\log(1 - h_w(x))$ against $h_w(x)$ from 0 to 1]
  10. Putting it all together

    Hypothesis: $h_w(x) = \frac{1}{1 + e^{-w^T x}}$. Cost function: $\mathrm{Cost}(h_w(x), y) = \begin{cases} -\log(h_w(x)) & \text{if } y = 1 \\ -\log(1 - h_w(x)) & \text{if } y = 0 \end{cases}$. Optimization objective: $J(w) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y^{(i)} \log(h_w(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_w(x^{(i)})) \right]$
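
The optimization objective on slide 10 can be evaluated directly. This is a minimal sketch with made-up data; `logistic_cost` is a hypothetical name, and each input vector is assumed to start with a constant 1 for the bias weight.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_cost(w, xs, ys):
    """J(w) = -1/N * sum_i [y*log(h) + (1-y)*log(1-h)] with h = sigmoid(w^T x)."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x)))
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(xs)


# Toy data (made up): the label is 1 when the second feature is large.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
ys = [0, 0, 1, 1]

cost_zero = logistic_cost([0.0, 0.0], xs, ys)   # every prediction is 0.5
cost_good = logistic_cost([-4.0, 2.0], xs, ys)  # separates the two classes
```

With all weights at zero every prediction is 0.5 and the cost equals log 2; weights that push the predictions toward the correct labels lower it.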
  11. On the road to linear SVM

    With $z = w^T x$: $-\log(h_w(x)) = -\log\left(\frac{1}{1 + e^{-z}}\right)$ and $-\log(1 - h_w(x)) = -\log\left(1 - \frac{1}{1 + e^{-z}}\right)$ [each plotted against $z$]. Optimization objective: $J(w) = C \sum_{i=1}^{N} \left[ y^{(i)} \mathrm{cost}_1(w^T x^{(i)}) + (1 - y^{(i)}) \mathrm{cost}_0(w^T x^{(i)}) \right] + \frac{1}{2} \sum_{j} w_j^2$
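
The transcript does not define $\mathrm{cost}_1$ and $\mathrm{cost}_0$. A common choice, assumed here and not confirmed by the slides, is the pair of hinge-style functions that approximate the two logarithmic curves with straight lines. Under that assumption the objective on slide 11 could be sketched as follows; all names and data are illustrative.

```python
def cost_1(z):
    """Assumed hinge surrogate for -log(h): zero once z >= 1."""
    return max(0.0, 1.0 - z)


def cost_0(z):
    """Assumed hinge surrogate for -log(1 - h): zero once z <= -1."""
    return max(0.0, 1.0 + z)


def svm_objective(w, xs, ys, C=1.0):
    """J(w) = C * sum_i [y*cost_1(z) + (1-y)*cost_0(z)] + 1/2 * sum_j w_j^2."""
    data_term = 0.0
    for x, y in zip(xs, ys):
        z = sum(w_j * x_j for w_j, x_j in zip(w, x))
        data_term += y * cost_1(z) + (1 - y) * cost_0(z)
    # Assumption: every weight is regularised; the slide gives no index range.
    return C * data_term + 0.5 * sum(w_j ** 2 for w_j in w)


# Toy data (made up); x[0] is the constant 1 for the bias weight.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
ys = [0, 0, 1, 1]

objective_zero = svm_objective([0.0, 0.0], xs, ys)   # data term only
objective_sep = svm_objective([-4.0, 2.0], xs, ys)   # regulariser only
```

The two evaluations show the trade-off that `C` controls: zero weights pay only the data term, while weights that separate the classes with a margin pay only the regularisation term.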
  12. The Problem

    Match records between two databases (500k each). Hotel names: „Belagio Resort“ and „Bellagio Las Vegas“. Addresses: „3600 S Las Vegas Blvd, Las Vegas, NV 89109“ and „3600 South Las Vegas Boulevard, Las Vegas, NV 89109“. Challenges: (1) matching, (2) performance.
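
The two address spellings on slide 12 differ only in abbreviations and punctuation. One simple way to bring them together, sketched here under the assumption of a small hand-written abbreviation table (hypothetical, not TrustYou's actual rules), is to normalise both strings before comparing them.

```python
import re

# Hypothetical abbreviation table; a real system would need a much larger one.
ABBREVIATIONS = {
    "s": "south",
    "n": "north",
    "e": "east",
    "w": "west",
    "blvd": "boulevard",
    "st": "street",
    "ave": "avenue",
}


def normalize_address(address):
    """Lower-case, strip punctuation and expand known abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


a = normalize_address("3600 S Las Vegas Blvd, Las Vegas, NV 89109")
b = normalize_address("3600 South Las Vegas Boulevard, Las Vegas, NV 89109")
same_address = a == b
```

Both inputs reduce to the same normalised string, so an exact comparison succeeds where a comparison of the raw strings would fail. Misspelled names such as „Belagio“ would still need a fuzzy similarity measure on top.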
  13. Alright, let's do it

    To beat: Precision = 90%, Recall = 60%. Approach: Preprocessing → Feature Engineering → Log Reg. First result: Accuracy = 99% !!!
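
Slide 13 compares a baseline stated as precision and recall with a first result stated as accuracy. The transcript does not say how the 99% came about. One possible reading, offered here purely as an assumption, is class imbalance: when true matches are rare, accuracy can look excellent while precision and recall stay poor. The counts below are hypothetical and chosen only to show the effect.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical counts: 10,000 candidate pairs, only 100 of them true matches.
precision, recall, accuracy = classification_metrics(tp=20, fp=20, fn=80, tn=9880)
```

In this made-up scenario accuracy is 99% although only half of the predicted matches are correct and four out of five true matches are missed, which is why precision and recall are the more informative pair for record linkage.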
  14. Why it didn't work

    Data Sources ..., Feature Engineering, Learning Algorithm, Model. Wrong parameter settings. Inferior algorithm. Bad train/test data selection. Bad data.
  15. How we made it work

    Goals: matching, performance. Problems: wrong parameter settings, inferior algorithm, bad train/test data selection, bad data. Remedies: Hadoop Pig, Normalization, Data enrichment, Better DB data quality, Random Forest, Partitioning (Quadkey), Random Forest.
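
Slide 15 lists „Partitioning (Quadkey)“ without further detail. A plausible reading, stated here as an assumption, is that hotels are grouped by map tile so that only records in the same tile are compared, which avoids comparing every record in one database with every record in the other. The sketch below uses the publicly documented Bing Maps tile scheme for the quadkey; the coordinates, the zoom level and the third record are made up for illustration.

```python
import math
from collections import defaultdict


def tile_to_quadkey(tile_x, tile_y, level):
    """Interleave the bits of the tile coordinates into a base-4 string."""
    digits = []
    for i in range(level, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def latlon_to_quadkey(lat, lon, level):
    """Project WGS84 coordinates to Web Mercator tiles, then to a quadkey."""
    lat = max(-85.05112878, min(85.05112878, lat))
    sin_lat = math.sin(math.radians(lat))
    x = (lon + 180.0) / 360.0
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    n = 1 << level
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    return tile_to_quadkey(tile_x, tile_y, level)


LEVEL = 10  # hypothetical zoom level; coarser levels give larger blocks

# Approximate, made-up coordinates for illustration.
records = [
    ("Belagio Resort", 36.1126, -115.1767),
    ("Bellagio Las Vegas", 36.1127, -115.1766),
    ("Hypothetical Munich Hotel", 48.1370, 11.5750),
]

blocks = defaultdict(list)
for name, lat, lon in records:
    blocks[latlon_to_quadkey(lat, lon, LEVEL)].append(name)
```

Only the two Las Vegas records end up in the same block and would be passed on to the matching model. Records close to a tile border can fall into neighbouring tiles, so a real pipeline would probably also check adjacent tiles.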
  16. Attribution

    Images: (CC) from Flickr, © “Pattern Recognition and Machine Learning” by Christopher M. Bishop (2006)
    Colors: from colourlovers, “Giant Goldfish” palette by manekineko
    Images in detail*:
    Slide 1/2/7/17: by xdxd_vs_xdxd, Concept & realisation: Oriana Persico & Salvatore Iaconesi, Lara Mezzapelle & Giacomo Deriu, Curated by: Marco Aion Mangani & Alice Zannoni, Produced by: Miria Baccolini - BT'F Gallery
    Slide 3: “Backgammon”: by Caroline Jones (CJ Woodworking &), photographed by Kate Fisher (Fishbone1); “Siri”: by Sean MacEntee; “Spam”: by ario_; “IBM Watson”: by John Tolva (jntolva); “Google driverless car”: by Ben (loudtiger); “TrustYou”: ©TrustYou GmbH
    Slide 5/6/18/22: “Data Sources”: by Tim Morgan
    Slide 5/6/22: “Feature Engineering”: by Duke University Archives (Duke Yearlook); “Learning Algorithm”: by Tom Keene (anthillsocial); “Model”: by Steve Brokaw (SEBImages) – Model: Alexis Farley, MUA: Mandi Lucas, Hair: Jennifer Odom
    Slide 6: “Prediction”: by Gemma Stiles
    Slide 9/10: by Christopher M. Bishop: Pattern Recognition and Machine Learning, page 4 & 6, Springer, 2006
    Slide 20/24: by Bill Gracey
    Slide 21: by GREG ANDREWS (gregthemayor)
    Slide 26: by Gal (baboonTM)
    Slide 27: by harusday
    Slide 28: by Viewminder
    * The quoted names for each image are not the actual titles but help to identify the appropriate image if there are multiple images on one slide.