Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Statistical Testing in Startups

Statistical Testing in Startups

A talk from Anthony Chong at Hack+Startup on Feb 28, 2013.

hackplusstartup.com

First Round Capital

February 28, 2013
Tweet

More Decks by First Round Capital

Other Decks in Technology

Transcript

  1. Confusing Causation Causa%on  vs.  Correla%on:    a  topic  that  has

     been  talked  about  to  death.     But  the  two  things  are  so  hardwired  into  us  all,  that  I  don’t  know  a   single  person  who  has  been  guilt  free  in  confusing  the  two.    ML/Data   Scien%sts  are  no  excep%on  to  this
  2. Confusing Causation Admitted Denied Acceptance Rate Men 3738 4704 44.28%

    Women 1494 2827 34.58% UC Berkeley Graduate School Admissions, Fall 1973 Source: Bickel, Hammel, and O’Connell. Science 187, 4175 (1975) Strawman  argument  from  Berkeley  Professors  in  a  famous   Science  publica%on  on  the  difficul%es  in  detec%ng  bias It  appears  that  there  is  a  gender  bias  here!   Be  careful!    This  is  the  difference  between  correla%on  and   causa%on…
  3. Simpson’s Paradox Department Applicants Acceptance Rate Applicants Acceptance Rate A

    825 62% 108 82% B 560 63% 25 68% C 325 37% 593 34% D 417 33% 375 35% E 191 28% 393 24% F 272 6% 341 7% Men Women Paradoxically,  gender  bias  seems  to  be  strong  in  the   OTHER  direc%on.   Highlights  how  difficult  it  is  to  find  the  right  thing  to   measure.
  4. Objectives “This bidding strategy causes costs to decrease” - Machine

    Learning not helpful - Gold Standard: Randomized Testing “People who like this movie tend to like this other one” - Use past observations for predictive modeling Inferring Causation Correlation We’re  going  to  focus  on  Inferring  Causa%on   here.     ML  is  great  for  correla%on:  (eg.   Classifica%on  via  k-­‐means)    It  leverages   correla%ons  to  produce  predic%ve   modeling.    However,  no  causal  structure   causes  it  to  be  preUy  useless  for  inferring   causa%on
  5. Causal Relationships Directed Acyclic Graph. See Larry Wasserman’s blog: http://normaldeviate.wordpress.com/2012/06/18/48/

    (C) Factors (A) Putative Cause (B) Response A  and  B  are  observed.    C  (large  variable   vector)  is  not.  
  6. Causal Relationships (C) Factors (A) Putative Cause (B) Response From

     Wasserman: In  a  randomized  studey,  where  we  assign  the  value  of  A  to  subjects  randomly,  we   break  the  arrow  from  C  to  X.    In  this  case,  it  is  easy  to  check  that   P(b  |  Set  A  =  a)  =  p(b  |  a) This  raises  an  interes%ng  ques%on:  have  there  been  randomized  studies  to  followup   results  from  observaional  studies?  In  fact,  there  have.  In  a  recent  ar%cle,  (Young  and   Karr,  2011)  found  12  such  randomized  studies  following  up  on  52  posi%ve  claims  from   observa%onal  studies.  They  then  asked,  of  52  claims,  how  many  were  verified  by  the   randomized  studies?  The  answer  is  depressing:  zero.
  7. Implementation First  step:    pick  a  criteria  for  evalua%on.  

     This  is  the  main  thing  to  peg  your  progress  on.     Especially  in  the  early  stages  of  a  startup,  you’re  probably  super  excited  about  tackling  some   brand  new  problem  few  people  know  much  about.    It’s  easy  to  get  super  carried  away  with   modeling  some  sort  of  new  behavior.  The  thing  that  grounds  you  though–  is  pegging  the   accuracy/success  of  what  you’re  working  on  against  a  new  metric.    This  is  key  as  you  con%nue   to  develop  the  system  you’ve  created  a  prototype  for.  
  8. Picking the Proper Criteria Beware picking a criteria “for which

    it is easy to beat the control by doing something clearly ‘wrong’ from a business perspective” Source: Kohavi, et al. Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained. KDD (2012) Find  the  right  thing  to  op%mize  for.   Microsof  Example:  Revenue  and  querries  going  up–  bad  long  term   value Adaptly:    not  op%mizing  just  for  cost/engagement.    Criteria  is  how   closely  are  we  following  client  parameters.
  9. As  you  know,  this  is  Bing.    Actually–  you  probably

     don’t  know,  because  you  s%ll  Google  things,   [even  afer  the  Bing  challenge] Kohavi,  et  al.: Bing,  Microsof’s  search  engine,  had  a  bug  in  an  experiment,  which  resulted  in  very  poor  
  10. Setting up Randomized Tests Isolate  individual  variables A/B  tes%ng KEY

     KEY  KEY:    do  A/A  tes%ng  as  well.  :  tests   your  methodology.    Most  of  what  you   “measure”  may  very  well  just  be  noise. [Men%on?]  Carryover  effect…  same  users   having  nega%ve  experience.    Can  be   slightly  mi%gated/tested  for  using  A/A  
  11. Power of Statistical Tests ¨ Power of tests goes down

    by ~1/m, for m tests See: Bonferroni Correction ¨ Curse of dimensionality People  run  into  problems  with  too  many   simultaneous  tests.    Consider  a  standard   causal  experiment  you  are  trying  to  cook   up.    WAAAY  too  many  variables  to  account   for  using  proper  sta%s%cs.
  12. Myths in Startups ¨ Use an easier criterion ¨ Reduce

    the power of tests You  have  finite  %me  and  resources Not  about  finding  simpler  metrics  to  measure Nor  reducing  “confidence  level” Don’t  be  afraid  of  assump%ons  –  just  be  cognizant  of  the  ones  you’re   making Seek  compounding  gains  in  learning
  13. Myths in Startups ¨ Use an easier criterion ¨ Reduce

    the power of tests Instead: ¨ Find key low hanging fruit
  14. Myths in Startups ¨ Use an easier criterion ¨ Reduce

    the power of tests Instead: ¨ Find key low hanging fruit ¨ Make assumptions
  15. Myths in Startups ¨ Use an easier criterion ¨ Reduce

    the power of tests Instead: ¨ Find key low hanging fruit ¨ Make assumptions ¨ Seek compounding gains in learning
  16. Myths in Startups ¨ Use an easier criterion ¨ Reduce

    the power of tests Instead: ¨ Find key low hanging fruit ¨ Make assumptions ¨ Seek compounding gains in learning ¨ Avoid having to reinvent the wheel
  17. General Tips ¨ Do not be afraid to use heuristics

    based on your expertise! ¨ Tests should not be ignorant of state of organization. Do  not  be  afraid  to  use  heuris%cs  based  on  your  exper%se!    Everyone  (and  their   mother)  knows  what  a  linear  regression  is.    Nate  Silver,  tweaking  parameters Tests  should  not  be  ignorant  of  state  of  organiza%on.    No  reason  to  treat  clients  as   uniform  inputs  of  the  same  type.    Even  internally  to  your  own  organiza%on  (eg.  sales   as  a  pipe  of  different  types)