Slide 1

Speaker: Anthony Chong, Head of Optimization, Adaptly

Slide 2

STATISTICAL TESTING IN STARTUPS
Anthony Chong
February 28, 2013

Slide 3

Confusing Causation

Causation vs. correlation: a topic that has been talked about to death. But the two are so hardwired into us all that I don't know a single person who has been guilt-free in confusing them. ML/data scientists are no exception.

Slide 4

Confusing Causation

UC Berkeley Graduate School Admissions, Fall 1973

           Admitted   Denied   Acceptance Rate
Men        3738       4704     44.28%
Women      1494       2827     34.58%

Source: Bickel, Hammel, and O'Connell. Science 187, 4175 (1975)

This is the strawman argument from Berkeley professors in a famous Science publication on the difficulties of detecting bias. It appears that there is a gender bias here! Be careful: this is the difference between correlation and causation...

Slide 5

Simpson's Paradox

                     Men                     Women
Department   Applicants  Accept. Rate   Applicants  Accept. Rate
A                   825        62%             108        82%
B                   560        63%              25        68%
C                   325        37%             593        34%
D                   417        33%             375        35%
E                   191        28%             393        24%
F                   272         6%             341         7%

Paradoxically, the gender bias seems to be strong in the OTHER direction. This highlights how difficult it is to find the right thing to measure.
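
To see the paradox mechanically, here is a minimal sketch (my own illustration, not from the talk). The admitted counts are back-derived from the rounded rates on the slide, so the pooled figures are approximate:

```python
# Reproduce Simpson's paradox from the per-department Berkeley data above.
DEPARTMENTS = {
    #        (men_applicants, men_rate, women_applicants, women_rate)
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (272, 0.06, 341, 0.07),
}

def pooled_rate(arms):
    """Aggregate acceptance rate: total admitted / total applicants."""
    admitted = sum(round(n * r) for n, r in arms)
    applied = sum(n for n, _ in arms)
    return admitted / applied

men = [(n, r) for n, r, _, _ in DEPARTMENTS.values()]
women = [(n, r) for _, _, n, r in DEPARTMENTS.values()]

# Women are admitted at a higher rate in 4 of the 6 departments, yet the
# pooled rate favors men: women disproportionately applied to the more
# selective departments.
print(f"pooled men:   {pooled_rate(men):.1%}")    # ~46.0%
print(f"pooled women: {pooled_rate(women):.1%}")  # ~30.4%
```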

Slide 6

Objectives

Inferring causation: "This bidding strategy causes costs to decrease"
- Machine learning is not helpful here
- Gold standard: randomized testing

Correlation: "People who like this movie tend to like this other one"
- Use past observations for predictive modeling

We're going to focus on inferring causation here. ML is great for correlation (e.g., clustering via k-means): it leverages correlations to produce predictive models. However, having no causal structure makes it pretty useless for inferring causation.
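
A minimal sketch of the correlation side (my own illustration, assuming scikit-learn; the data and feature names are hypothetical): k-means groups points purely by the correlation structure of the features, which supports prediction and recommendation but says nothing about cause.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hypothetical groups of users, described by (hours watched, ratings given).
X = np.vstack([rng.normal([2, 2], 0.5, (100, 2)),
               rng.normal([6, 5], 0.5, (100, 2))])

# k-means recovers the two groups from co-occurrence alone.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Useful for "people who like this tend to like that"; it cannot tell you
# whether watching one title causes interest in another.
```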

Slide 7

Causal Relationships

[Diagram: a directed acyclic graph with (C) Factors pointing to both (A) Putative Cause and (B) Response, and (A) pointing to (B).]

See Larry Wasserman's blog: http://normaldeviate.wordpress.com/2012/06/18/48/

A and B are observed. C (a large vector of variables) is not.

Slide 8

Causal Relationships

[Same DAG: (C) Factors → (A) Putative Cause and (C) → (B) Response, with (A) → (B).]

From Wasserman:

In a randomized study, where we assign the value of A to subjects randomly, we break the arrow from C to A. In this case, it is easy to check that

    P(b | set A = a) = P(b | a)

This raises an interesting question: have there been randomized studies to follow up results from observational studies? In fact, there have. In a recent article, Young and Karr (2011) found 12 such randomized studies following up on 52 positive claims from observational studies. They then asked: of the 52 claims, how many were verified by the randomized studies? The answer is depressing: zero.
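
A minimal simulation of the DAG above (my own illustration, not from the talk; the effect sizes are arbitrary), showing how randomizing A removes the bias that the hidden factor C introduces in observational data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
true_effect = 1.0   # the causal effect of A on B we hope to recover

def response(a, c):
    """B depends on both the putative cause A and the hidden factor C."""
    return true_effect * a + 2.0 * c + rng.normal(0, 1, len(a))

# Observational world: C pushes subjects toward A = 1 and also raises B.
c = rng.binomial(1, 0.5, n)
a_obs = rng.binomial(1, 0.2 + 0.6 * c)          # arrow C -> A intact
b_obs = response(a_obs, c)
naive = b_obs[a_obs == 1].mean() - b_obs[a_obs == 0].mean()

# Randomized world: A is assigned by coin flip, breaking the C -> A arrow.
a_rnd = rng.binomial(1, 0.5, n)
b_rnd = response(a_rnd, c)
randomized = b_rnd[a_rnd == 1].mean() - b_rnd[a_rnd == 0].mean()

print(f"naive observational estimate: {naive:.2f}")       # biased, ~2.2
print(f"randomized estimate:          {randomized:.2f}")  # ~1.0, the truth
```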

Slide 9

Implementation

First step: pick a criterion for evaluation. This is the main thing to peg your progress on. Especially in the early stages of a startup, you're probably super excited about tackling some brand-new problem few people know much about. It's easy to get carried away with modeling some new behavior. The thing that grounds you, though, is pegging the accuracy/success of what you're working on against that metric. This is key as you continue to develop the system you've prototyped.

Slide 10

Picking the Proper Criterion

Beware picking a criterion "for which it is easy to beat the control by doing something clearly 'wrong' from a business perspective."

Source: Kohavi, et al. Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained. KDD (2012)

Find the right thing to optimize for.

Microsoft example: revenue and queries going up can be bad for long-term value.

Adaptly: not optimizing just for cost/engagement. The criterion is how closely we are following client parameters.

Slide 11

As you know, this is Bing. Actually, you probably don't know, because you still Google things [even after the Bing challenge].

Kohavi, et al.: Bing, Microsoft's search engine, had a bug in an experiment which resulted in very poor results being shown to users, yet key metrics (distinct queries per user and revenue per user) went up significantly.

Slide 12

Setting up Randomized Tests

Isolate individual variables
- A/B testing
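
A minimal sketch of evaluating one such split (my own illustration; the conversion counts are hypothetical), using a standard two-sided two-proportion z-test:

```python
from math import sqrt, erf

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_b - p_a, p_value

# Hypothetical arms: 10,000 users each, 4.1% vs 4.8% conversion.
lift, p = two_proportion_z_test(410, 10_000, 480, 10_000)
print(f"lift: {lift:+.2%}, p = {p:.3f}")   # roughly +0.70%, p ~ 0.016
```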

Slide 13

Setting up Randomized Tests

Isolate individual variables
- A/B testing
- KEY KEY KEY: do A/A testing as well. It tests your methodology; most of what you "measure" may very well just be noise.

[Mention?] Carryover effect: the same users having a negative experience in an earlier test can contaminate the next one. This can be slightly mitigated and tested for using A/A tests.
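
A minimal A/A sketch (my own illustration; the traffic numbers are hypothetical): split statistically identical traffic in two and run the same test many times. At alpha = 0.05, roughly 5% of runs should come out "significant"; a much higher rate means the assignment or the test itself is broken.

```python
import numpy as np
from math import sqrt, erf

rng = np.random.default_rng(7)
alpha, runs, n, rate = 0.05, 2_000, 5_000, 0.04

def p_value(conv_a, conv_b, n):
    """Two-sided two-proportion z-test for equal-sized arms."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    z = (conv_b - conv_a) / (n * se)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Both "arms" draw from the exact same conversion rate: any significant
# result is, by construction, a false positive.
false_positives = sum(
    p_value(rng.binomial(n, rate), rng.binomial(n, rate), n) < alpha
    for _ in range(runs)
)
print(f"A/A false-positive rate: {false_positives / runs:.1%}")  # expect ~5%
```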

Slide 14

Power of Statistical Tests

- With m simultaneous tests, each test's significance budget shrinks to roughly 1/m of the total (see: Bonferroni correction), so power goes down
- Curse of dimensionality

People run into problems with too many simultaneous tests. Consider a standard causal experiment you are trying to cook up: there are WAY too many variables to account for using proper statistics.
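
A minimal sketch of the Bonferroni correction (my own illustration; the p-values are hypothetical): with m simultaneous tests, each test is held to alpha / m so the family-wise error rate stays at alpha, which is exactly why individually "significant" results can evaporate.

```python
def bonferroni(p_values, alpha=0.05):
    """Return which hypotheses survive at family-wise level alpha."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Four hypothetical tests; three clear 0.05 individually, but with the
# corrected threshold of 0.05 / 4 = 0.0125 only one survives.
p_values = [0.003, 0.021, 0.048, 0.20]
print(bonferroni(p_values))   # [True, False, False, False]
```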

Slide 15

Myths in Startups

- Use an easier criterion
- Reduce the power of tests

You have finite time and resources. But this is not about finding simpler metrics to measure, nor about reducing your "confidence level." Don't be afraid of assumptions; just be cognizant of the ones you're making. Seek compounding gains in learning.

Slide 16

Myths in Startups

- Use an easier criterion
- Reduce the power of tests

Instead:
- Find key low-hanging fruit

Slide 17

Myths in Startups

- Use an easier criterion
- Reduce the power of tests

Instead:
- Find key low-hanging fruit
- Make assumptions

Slide 18

Myths in Startups

- Use an easier criterion
- Reduce the power of tests

Instead:
- Find key low-hanging fruit
- Make assumptions
- Seek compounding gains in learning

Slide 19

Myths in Startups

- Use an easier criterion
- Reduce the power of tests

Instead:
- Find key low-hanging fruit
- Make assumptions
- Seek compounding gains in learning
- Avoid having to reinvent the wheel

Slide 20

General Tips

- Do not be afraid to use heuristics based on your expertise!
- Tests should not be ignorant of the state of the organization.

Do not be afraid to use heuristics based on your expertise! Everyone (and their mother) knows what a linear regression is; think of Nate Silver tweaking parameters. Tests should not be ignorant of the state of the organization: there is no reason to treat clients as uniform inputs of the same type, even internally within your own organization (e.g., sales as a pipeline of different types).
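
In that spirit, a minimal sketch (my own illustration; the spend/clicks data are hypothetical) of leaning on a tool everyone understands, an ordinary least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
spend = rng.uniform(100, 1_000, 50)             # hypothetical ad spend
clicks = 0.3 * spend + rng.normal(0, 20, 50)    # hypothetical noisy response

# One line of numpy turns domain expertise ("spend drives clicks roughly
# linearly") into a usable estimate you can sanity-check by eye.
slope, intercept = np.polyfit(spend, clicks, deg=1)
print(f"clicks ~= {slope:.2f} * spend + {intercept:.1f}")
```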

Slide 21

Hack+Startup Presented by First Round Capital