
Florian Hartl - Parametric Machine Learning & Record Linkage

The presentation first provides a basic introduction to the field of Machine Learning and its core concepts and procedures. Thereafter, the natural evolution of parametric algorithms is shown – from linear regression to logistic regression and, if time allows, linear SVM. Finally, a real-life use case from TrustYou dealing with Record Linkage is presented.


Munich DataGeeks

October 08, 2013

Transcript

  1. Definition of Machine Learning: „Field of study that gives computers
     the ability to learn without being explicitly programmed“ (Arthur
     Samuel, 1959). Rules vs. hidden patterns. Questions: What is
     „intelligence“? What is „artificial intelligence“?
  2. Supervised Learning – the process: Data Sources → Feature Engineering →
     Learning Algorithm → Model (parameters); the model is fitted on
     labeled training data.
  3. Parametric vs. non-parametric. Parametric: inferring a finite set of
     parameters; the parameters define the distribution of the data.
     Examples: Linear Regression, Logistic Regression, Linear SVM, e.g.
     f(x) = w_0 + w_1 x_1. Non-parametric: training examples are explicitly
     used as parameters; the complexity of the function is allowed to grow
     with more training data. Examples: Random Forest, non-linear SVM
     (kernel), Artificial Neural Networks.
  4. Hypothesis of Linear Regression:
     h_w(x) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M
  5. Cost function – least squares:
     J(w) = 1/(2N) · Σ_{n=1}^{N} ( h_w(x_n) − t_n )^2
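The least-squares cost on this slide translates almost line-for-line into numpy. This is a minimal sketch, not from the deck; the polynomial hypothesis helper and the toy data are invented for illustration:

```python
import numpy as np

def h(w, x):
    """Polynomial hypothesis h_w(x) = w0 + w1*x + ... + wM*x^M."""
    # np.polyval expects the highest-degree coefficient first
    return np.polyval(w[::-1], x)

def least_squares_cost(w, x, t):
    """J(w) = 1/(2N) * sum_n (h_w(x_n) - t_n)^2."""
    n = len(x)
    return np.sum((h(w, x) - t) ** 2) / (2 * n)

# toy data generated exactly by t = 1 + 2x, so the true weights cost 0
x = np.array([0.0, 1.0, 2.0])
t = np.array([1.0, 3.0, 5.0])
print(least_squares_cost(np.array([1.0, 2.0]), x, t))  # → 0.0
```

Any other weight vector yields a strictly positive cost, which is what gradient descent (or the closed-form normal equations) minimizes.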
  6. [Plot: Spam? labels (No = 0, Yes = 1) against a feature on 0–14,
     fitted with linear regression – looks workable at first, but…]
  7. [The same Spam? (No = 0, Yes = 1) plot, now fitted with an S-shaped
     curve: Logistic Regression]
  8. Goal: 0 ≤ h_w(x) ≤ 1. The „linear“ part:
     h_w(x) = w_0 + w_1 x + w_2 x^2 + … + w_M x^M = w^T x.
     The „logistic“ part: f(x) = 1 / (1 + e^{−x}), giving
     h_w(x) = 1 / (1 + e^{−w^T x}).
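As a quick illustration (not from the deck), composing the linear part with the logistic function squashes any real score into (0, 1); the weights and input below are made up:

```python
import numpy as np

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)); maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(w, x):
    """Logistic-regression hypothesis h_w(x) = sigmoid(w^T x)."""
    return sigmoid(w @ x)

w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])   # w^T x = 0, right on the decision boundary
print(h(w, x))  # → 0.5
```

A score of exactly 0 maps to probability 0.5; large positive scores approach 1 and large negative scores approach 0.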
  9. Cost function of logistic regression:
     Cost(h_w(x), y) = −log(h_w(x))      if y = 1
                       −log(1 − h_w(x))  if y = 0
     [Plots of −log(h_w(x)) and −log(1 − h_w(x)) over h_w(x) ∈ [0, 1]]
  10. Putting it all together. Hypothesis: h_w(x) = 1 / (1 + e^{−w^T x}).
      Cost function: Cost(h_w(x), y) = −log(h_w(x)) if y = 1,
      −log(1 − h_w(x)) if y = 0. Optimization objective:
      J(w) = −(1/N) · Σ_{i=1}^{N} [ y^(i) log(h_w(x^(i)))
             + (1 − y^(i)) log(1 − h_w(x^(i))) ]
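The combined objective J(w) (the average cross-entropy) is short to write down in numpy. A minimal sketch with invented toy data; the first column of X serves as the bias term:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(w, X, y):
    """J(w) = -1/N * sum_i [y_i log h(x_i) + (1 - y_i) log(1 - h(x_i))]."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# toy data; column 0 of X is the constant bias feature x0 = 1
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 1.0])
# with w = 0, every prediction is 0.5, so the cost is log(2) ≈ 0.693
print(cross_entropy_cost(np.zeros(2), X, y))
```

Training amounts to minimizing this function in w, e.g. with gradient descent; unlike least squares on 0/1 labels, it heavily penalizes confident wrong predictions.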
  11. Optimization objective – on the road to linear SVM. With z = w^T x:
      −log(h_w(x)) = −log(1 / (1 + e^{−z})) and
      −log(1 − h_w(x)) = −log(1 − 1 / (1 + e^{−z})).
      Replacing these curves with piecewise-linear costs cost_1(z) and
      cost_0(z) yields the SVM objective:
      J(w) = C · Σ_{i=1}^{N} [ y^(i) cost_1(w^T x^(i))
             + (1 − y^(i)) cost_0(w^T x^(i)) ] + (1/2) · Σ_j w_j^2
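The deck leaves cost_1 and cost_0 abstract; a common concrete choice (an assumption here, not stated on the slide) is the pair of hinge functions below, which upper-bound the two log-loss curves and are zero beyond a margin of 1:

```python
import numpy as np

def cost1(z):
    """Hinge-style stand-in for -log(sigmoid(z)), used when y = 1."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """Hinge-style stand-in for -log(1 - sigmoid(z)), used when y = 0."""
    return np.maximum(0.0, 1.0 + z)

def svm_objective(w, X, y, C=1.0):
    """J(w) = C * sum_i [y_i cost1(w^T x_i) + (1-y_i) cost0(w^T x_i)]
              + 1/2 * ||w||^2."""
    z = X @ w
    data_term = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    return C * data_term + 0.5 * np.sum(w ** 2)

# two points classified with margin >= 1: only the regularizer remains
X = np.array([[2.0], [-2.0]])
y = np.array([1.0, 0.0])
print(svm_objective(np.array([1.0]), X, y))  # → 0.5
```

The flat zero region is what produces the SVM's large-margin behavior: points already beyond the margin contribute nothing, so only the borderline points (support vectors) shape w.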
  12. The problem: database matching of hotel records by name and address.
      Database 1 (500k records): Name: Belagio Resort, Address: 3600 S Las
      Vegas Blvd, Las Vegas, NV 89109. Database 2 (500k records): Name:
      Bellagio Las Vegas, Address: 3600 South Las Vegas Boulevard, Las
      Vegas, NV 89109. Two challenges: 1. matching, 2. performance.
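One way to turn such a record pair into input for a classifier (an illustrative sketch, not TrustYou's actual pipeline) is to compute per-field string similarities as features, here with the standard library's SequenceMatcher:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def features(rec_a, rec_b):
    """Feature vector for a candidate pair: name and address similarity."""
    return [similarity(rec_a["name"], rec_b["name"]),
            similarity(rec_a["address"], rec_b["address"])]

a = {"name": "Belagio Resort",
     "address": "3600 S Las Vegas Blvd, Las Vegas, NV 89109"}
b = {"name": "Bellagio Las Vegas",
     "address": "3600 South Las Vegas Boulevard, Las Vegas, NV 89109"}
print(features(a, b))  # two scores in [0, 1]; the addresses score high
```

The classifier then learns a decision boundary over such feature vectors ("match" vs. "no match") instead of hand-tuned similarity thresholds.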
  13. Alright, let’s do it. To beat: Precision = 90%, Recall = 60%.
      Approach: Preprocessing → Feature Engineering → Logistic Regression.
      First result: Accuracy = 99% !!!
  14. Why it didn’t work (somewhere in the pipeline Data Sources → Feature
      Engineering → Learning Algorithm → Model): wrong parameter settings?
      Inferior algorithm? Bad train/test data selection? Bad data?
  15. How we made it work. Matching (addressing wrong parameter settings,
      inferior algorithm, bad train/test data selection, bad data):
      normalization, data enrichment, better DB data quality, Random
      Forest. Performance: Hadoop Pig, partitioning (Quadkey), Random
      Forest.
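The algorithm swap mentioned above (logistic regression replaced by a random forest) looks roughly like this with scikit-learn; the two-feature setup and the synthetic "match if both similarities are high" data are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# pretend features per candidate pair: [name_similarity, address_similarity]
X = rng.random((200, 2))
# synthetic labels: call it a match when the similarities are high on average
y = (X.mean(axis=1) > 0.6).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict([[0.9, 0.95]]))  # a pair with very similar name/address
```

Because each tree sees a bootstrap sample and random feature subsets, the forest is far less sensitive to parameter settings and noisy features than a single linear model, which matches the "inferior algorithm" fix on this slide.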
  16. Attribution. Images: (CC) from Flickr; © “Pattern Recognition and
      Machine Learning” by Christopher M. Bishop (2006). Colors: from
      COLOURlovers, “Giant Goldfish” palette by manekineko. Images in
      detail: Slide 1/2/7/17: by xdxd_vs_xdxd, concept & realisation:
      Oriana Persico & Salvatore Iaconesi, Lara Mezzapelle & Giacomo Deriu,
      curated by: Marco Aion Mangani & Alice Zannoni, produced by: Miria
      Baccolini - BT'F Gallery. Slide 3: “Backgammon”: by Caroline Jones
      (CJ Woodworking &), photographed by Kate Fisher (Fishbone1); “Siri”:
      by Sean MacEntee; “Spam”: by ario_; “IBM Watson”: by John Tolva
      (jntolva); “Google driverless car”: by Ben (loudtiger); “TrustYou”:
      © TrustYou GmbH. Slide 5/6/18/22: “Data Sources”: by Tim Morgan.
      Slide 5/6/22: “Feature Engineering”: by Duke University Archives
      (Duke Yearlook); “Learning Algorithm”: by Tom Keene (anthillsocial);
      “Model”: by Steve Brokaw (SEBImages) – model: Alexis Farley, MUA:
      Mandi Lucas, hair: Jennifer Odom. Slide 6: “Prediction”: by Gemma
      Stiles. Slide 9/10: by Christopher M. Bishop: Pattern Recognition and
      Machine Learning, pages 4 & 6, Springer, 2006. Slide 20/24: by Bill
      Gracey. Slide 21: by GREG ANDREWS (gregthemayor). Slide 26: by Gal
      (baboonTM). Slide 27: by harusday. Slide 28: by Viewminder.
      * The quoted names for each image are not the actual titles but help
      to identify the appropriate image if there are multiple images on one
      slide.