First Round Capital
February 28, 2013

# Statistical Testing in Startups

A talk from Anthony Chong at Hack+Startup on Feb 28, 2013.

hackplusstartup.com


## Transcript

3. ### Confusing Causation

Causation vs. correlation: a topic that has been talked about to death. But the two things are so hardwired into us all that I don’t know a single person who has been guilt-free in confusing the two. ML/data scientists are no exception to this.
4. ### Confusing Causation

| | Admitted | Denied | Acceptance Rate |
|---|---|---|---|
| Men | 3738 | 4704 | 44.28% |
| Women | 1494 | 2827 | 34.58% |

UC Berkeley Graduate School Admissions, Fall 1973. Source: Bickel, Hammel, and O’Connell. Science 187, 4175 (1975)

Strawman argument from Berkeley professors in a famous Science publication on the difficulties in detecting bias. It appears that there is a gender bias here! Be careful! This is the difference between correlation and causation…
5. ### Simpson’s Paradox

| Department | Men: Applicants | Men: Acceptance Rate | Women: Applicants | Women: Acceptance Rate |
|---|---|---|---|---|
| A | 825 | 62% | 108 | 82% |
| B | 560 | 63% | 25 | 68% |
| C | 325 | 37% | 593 | 34% |
| D | 417 | 33% | 375 | 35% |
| E | 191 | 28% | 393 | 24% |
| F | 272 | 6% | 341 | 7% |

Paradoxically, gender bias seems to be strong in the OTHER direction. This highlights how difficult it is to find the right thing to measure.
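
To make the reversal concrete, here is a minimal Python sketch that pools the six per-department rows from the table above. The variable and function names are illustrative and not from the talk. Note that these six departments do not add up to the applicant totals on the previous slide, so the pooled rates computed here differ from the 44.28% and 34.58% shown there.

```python
# Per-department figures transcribed from the slide:
# (men applicants, men acceptance rate, women applicants, women acceptance rate)
DEPARTMENTS = {
    "A": (825, 0.62, 108, 0.82),
    "B": (560, 0.63, 25, 0.68),
    "C": (325, 0.37, 593, 0.34),
    "D": (417, 0.33, 375, 0.35),
    "E": (191, 0.28, 393, 0.24),
    "F": (272, 0.06, 341, 0.07),
}


def pooled_rate(applicants_and_rates):
    """Applicant-weighted acceptance rate across departments."""
    pairs = list(applicants_and_rates)
    admitted = sum(n * rate for n, rate in pairs)
    total = sum(n for n, _ in pairs)
    return admitted / total


men_pooled = pooled_rate((m_n, m_r) for m_n, m_r, _, _ in DEPARTMENTS.values())
women_pooled = pooled_rate((w_n, w_r) for _, _, w_n, w_r in DEPARTMENTS.values())

# Departments where women were accepted at a higher rate than men.
women_ahead = [d for d, (_, m_r, _, w_r) in DEPARTMENTS.items() if w_r > m_r]

print(f"pooled: men {men_pooled:.1%}, women {women_pooled:.1%}")
print("women ahead in departments:", ", ".join(women_ahead))
```

Pooled, men come out well ahead, yet women have the higher rate in four of the six departments. The gap in the pooled numbers comes from where people applied: most women applied to the departments with low acceptance rates.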
6. ### Objectives

Inferring Causation: “This bidding strategy causes costs to decrease”
- Machine Learning not helpful
- Gold Standard: Randomized Testing

Correlation: “People who like this movie tend to like this other one”
- Use past observations for predictive modeling

We’re going to focus on inferring causation here. ML is great for correlation (e.g. clustering via k-means): it leverages correlations to produce predictive models. However, it has no causal structure, which makes it pretty useless for inferring causation.
7. ### Causal Relationships

Directed Acyclic Graph. See Larry Wasserman’s blog: http://normaldeviate.wordpress.com/2012/06/18/48/

Diagram nodes: (C) Factors, (A) Putative Cause, (B) Response.

A and B are observed. C (a large vector of variables) is not.
8. ### Causal Relationships

Diagram nodes: (C) Factors, (A) Putative Cause, (B) Response.

From Wasserman:

In a randomized study, where we assign the value of A to subjects randomly, we break the arrow from C to A. In this case, it is easy to check that P(b | Set A = a) = p(b | a).

This raises an interesting question: have there been randomized studies to follow up results from observational studies? In fact, there have. In a recent article, (Young and Karr, 2011) found 12 such randomized studies following up on 52 positive claims from observational studies. They then asked, of the 52 claims, how many were verified by the randomized studies? The answer is depressing: zero.
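
The effect of breaking the arrow from C to A can be illustrated with a small simulation. This is a sketch under invented assumptions, not anything from the talk or from Wasserman's post: C is a uniform hidden factor, B depends only on C, and A has no causal effect on B at all. The function name and all probabilities are hypothetical.

```python
import random


def difference_in_response(n, randomize, seed=0):
    """Return P(B=1 | A=1) - P(B=1 | A=0) in a simulated population.

    By construction A has NO causal effect on B: B depends only on the
    hidden factor C. All names and probabilities here are illustrative.
    """
    rng = random.Random(seed)
    with_a, without_a = [], []
    for _ in range(n):
        c = rng.random()  # hidden factor C, uniform on [0, 1)
        if randomize:
            a = rng.random() < 0.5  # coin flip: the arrow from C to A is broken
        else:
            a = rng.random() < c  # observational: C drives who gets A
        b = rng.random() < c  # response B depends only on C
        (with_a if a else without_a).append(b)
    return sum(with_a) / len(with_a) - sum(without_a) / len(without_a)


observational = difference_in_response(20_000, randomize=False)
randomized = difference_in_response(20_000, randomize=True)
print(f"observational difference: {observational:+.3f}")
print(f"randomized difference:    {randomized:+.3f}")
```

Under these assumptions the observational data shows a large gap in B between the two groups (about 1/3 in expectation) even though A does nothing, while the randomized assignment shows a gap close to zero.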
9. ### Implementation

First step: pick a criterion for evaluation. This is the main thing to peg your progress on.

Especially in the early stages of a startup, you’re probably super excited about tackling some brand new problem few people know much about. It’s easy to get carried away with modeling some sort of new behavior. The thing that grounds you, though, is pegging the accuracy/success of what you’re working on against a new metric. This is key as you continue to develop the system you’ve created a prototype for.
10. ### Picking the Proper Criteria

Beware picking a criterion “for which it is easy to beat the control by doing something clearly ‘wrong’ from a business perspective”.

Source: Kohavi, et al. Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained. KDD (2012)

Find the right thing to optimize for.

Microsoft example: revenue and queries going up, but bad long-term value.

Adaptly: not optimizing just for cost/engagement. The criterion is how closely we are following client parameters.
11. ### As you know, this is Bing.

Actually, you probably don’t know, because you still Google things [even after the Bing challenge].

Kohavi, et al.: Bing, Microsoft’s search engine, had a bug in an experiment, which resulted in very poor

13. ### Setting up Randomized Tests

Isolate individual variables. A/B testing.

KEY KEY KEY: do A/A testing as well. It tests your methodology. Most of what you “measure” may very well just be noise.

[Mention?] Carryover effect: the same users having a negative experience. Can be slightly mitigated/tested for using A/A.
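
As a rough illustration of why A/A testing checks your methodology, the following Python sketch runs many simulated A/A tests in which both arms share the same true conversion rate. The two-proportion z-test, the 10% conversion rate, the sample sizes, and all function names are assumptions chosen for the example, not details from the talk.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def aa_false_positive_rate(trials=500, users_per_arm=1000, rate=0.10,
                           alpha=0.05, seed=0):
    """Fraction of A/A tests (identical arms) that look 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        conv_a = sum(rng.random() < rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < rate for _ in range(users_per_arm))
        p = two_proportion_p_value(conv_a, users_per_arm, conv_b, users_per_arm)
        if p < alpha:
            hits += 1
    return hits / trials


fp_rate = aa_false_positive_rate()
print(f"A/A tests flagged as significant: {fp_rate:.1%}")
```

With a sound setup, the share of "significant" A/A results should sit near the chosen alpha. If a real A/A test on your own pipeline flags differences much more often than that, the measurement or assignment process is a more likely culprit than the product.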
14. ### Power of Statistical Tests

- Power of tests goes down by ~1/m, for m tests. See: Bonferroni Correction
- Curse of dimensionality

People run into problems with too many simultaneous tests. Consider a standard causal experiment you are trying to cook up: there are WAAAY too many variables to account for using proper statistics.
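
The cost of running m simultaneous tests can be sketched numerically. The Bonferroni correction divides the per-test significance threshold by m, which is the 1/m factor the slide refers to. The snippet below assumes the tests are independent, which is a simplification, and the function names are illustrative.

```python
def family_wise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / m


alpha = 0.05
for m in (1, 5, 20, 100):
    per_test = bonferroni_threshold(alpha, m)
    print(
        f"m={m:>3}  uncorrected FWER={family_wise_error_rate(alpha, m):.3f}  "
        f"Bonferroni per-test alpha={per_test:.5f}  "
        f"corrected FWER={family_wise_error_rate(per_test, m):.3f}"
    )
```

Without correction, 20 independent tests at alpha = 0.05 give roughly a 64% chance of at least one false positive. With the correction, each test must clear a much stricter threshold, so real effects become harder to detect with the same amount of data.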
15. ### Myths in Startups

- Use an easier criterion
- Reduce the power of tests

You have finite time and resources. It’s not about finding simpler metrics to measure, nor about reducing the “confidence level”. Don’t be afraid of assumptions; just be cognizant of the ones you’re making. Seek compounding gains in learning.
16. ### Myths in Startups

- Use an easier criterion
- Reduce the power of tests

Instead:
- Find key low hanging fruit
17. ### Myths in Startups

- Use an easier criterion
- Reduce the power of tests

Instead:
- Find key low hanging fruit
- Make assumptions
18. ### Myths in Startups

- Use an easier criterion
- Reduce the power of tests

Instead:
- Find key low hanging fruit
- Make assumptions
- Seek compounding gains in learning
19. ### Myths in Startups

- Use an easier criterion
- Reduce the power of tests

Instead:
- Find key low hanging fruit
- Make assumptions
- Seek compounding gains in learning
- Avoid having to reinvent the wheel
20. ### General Tips

- Do not be afraid to use heuristics based on your expertise!
- Tests should not be ignorant of state of organization.

Do not be afraid to use heuristics based on your expertise! Everyone (and their mother) knows what a linear regression is. Nate Silver, tweaking parameters.

Tests should not be ignorant of the state of the organization. No reason to treat clients as uniform inputs of the same type. Even internally to your own organization (e.g. sales as a pipe of different types).