Beyond Data Mining; Towards “Idea Engineering”

by Tim Menzies

PROMISE'13: The 9th International Conference on Predictive Models in Software Engineering

Transcript

  1. Idea Engineering   [email protected]   PROMISE'13   Oct'13

     0. algorithm mining
     1. landscape mining
     2. decision mining
     3. discussion mining
     Timeline: yesterday, today, tomorrow, future
  4. The Premises of PROMISE (2005)

     – Wanted: predictions
        • Nope. Users want decision, or engagement
     – Data mining will reveal "the truth" about SE
        • [Dejaeger: TSE'11], [Hall: TSE'12], [Shepperd: COW'13]
        • Not (better learners = better conclusions)
     – Sooner or later: enough data for general conclusions
        • Found more differences than generalities
        • Special issues: [IST'13], [ESEj'13]
        • Best papers, ASE'11, MSR'12
        • Menzies, Zimmermann et al. [TSE'13]
        • Lots of local models
  6. Landscape mining: look before you leap

     • Report what is true about the data
        – Not trivia on how algorithms walk that data
     • Map the landscape
        – Reason on each part of the map
     • E.g. landscape mining
        – Unsupervised iterative dichotomization
        – Cluster, prune
        – Then generate rules
     • Different to "leap before you look"
        – i.e. skew learning by class variable
        – then study the results
     • E.g. C4.5, CART, Fayyad-Irani, etc.
        – Supervised iterative dichotomization
     • E.g. 61% of 300+ effort estimation papers
        – Algorithm tinkering, without end
  7. IDEA Engineering = <landscape, decisions, discussion>

     Find landscape = cluster data, assign "heights"
     Find decisions = report delta highs to lows
     Monitor discussions = watch, help, communities explore deltas
  8. Spectral Landscape Mining

     • Spectrum = condition that is not limited to a specific set of values but varies in a continuum.
     • Groups together a broad range of conditions or behaviors under one single title
     • In mathematics, the spectrum of a (finite-dimensional) matrix is the set of its eigenvalues.
     • Nyström algorithms: approximations to eigenvalues
        – FASTMAP: linear time
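A minimal sketch of how a FASTMAP-style projection could drive the "unsupervised iterative dichotomization" mentioned earlier. This is an illustration under assumptions, not the deck's actual code: the function names, the Euclidean distance, the median split, and the `min_size` stopping rule are all hypothetical choices. Pivot finding costs a linear number of distance computations per split; the sort used here for the median adds a log factor.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest(row, rows):
    """Return the row in `rows` that is farthest from `row`."""
    return max(rows, key=lambda r: dist(row, r))

def fastmap_axis(rows, rng):
    """FASTMAP heuristic: pick two distant pivots, then place every row
    on the line between them using the cosine rule."""
    anyone = rng.choice(rows)
    east = farthest(anyone, rows)
    west = farthest(east, rows)
    c = dist(east, west)
    if c == 0:                      # all rows identical
        return [0.0] * len(rows)
    xs = []
    for r in rows:
        a, b = dist(r, east), dist(r, west)
        xs.append((a * a + c * c - b * b) / (2 * c))
    return xs

def dichotomize(rows, min_size, rng):
    """Recursively split rows at the median of their projection.
    No class variable is consulted, so the split is unsupervised."""
    if len(rows) < 2 * min_size:
        return [rows]
    xs = fastmap_axis(rows, rng)
    order = sorted(range(len(rows)), key=lambda i: xs[i])
    mid = len(rows) // 2
    left = [rows[i] for i in order[:mid]]
    right = [rows[i] for i in order[mid:]]
    return dichotomize(left, min_size, rng) + dichotomize(right, min_size, rng)

# Hypothetical data: two well-separated blobs of four points each.
blob_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
blob_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
clusters = dichotomize(blob_a + blob_b, min_size=4, rng=random.Random(1))
```

On this toy data the first split separates the two blobs, and each half is then too small to split again.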
  9. Project data on first 2 PCA; grid that data

     e.g. Nasa93dem
     [Figure: two rows of four grid plots; panel titles: $_Hell^0.5, log(-months), log(-defects), log(-effort)]
     1) project 23 dimensions into 2
     2a) cluster
     2b) replace clusters with centroids
     MOEA: score = effort + defects + months
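Steps 2a and 2b can be sketched as follows, assuming the rows have already been projected to two dimensions scaled to [0, 1]. The tuple layout, the `cells` parameter, and the unweighted sum in `score` are assumptions for illustration; in practice the three objectives would probably be normalized before being added.

```python
from collections import defaultdict

def grid_centroids(points, cells):
    """points: list of (x, y, effort, defects, months) with x, y in [0, 1].
    Buckets the points into a cells-by-cells grid and returns
    {(col, row): centroid}, one centroid per occupied cell."""
    buckets = defaultdict(list)
    for p in points:
        col = min(int(p[0] * cells), cells - 1)   # clamp x == 1.0 into last cell
        row = min(int(p[1] * cells), cells - 1)
        buckets[(col, row)].append(p)
    return {
        cell: tuple(sum(vals) / len(members) for vals in zip(*members))
        for cell, members in buckets.items()
    }

def score(centroid):
    """Score as written on the slide: effort + defects + months."""
    _, _, effort, defects, months = centroid
    return effort + defects + months

# Hypothetical projected data, not taken from Nasa93dem.
points = [
    (0.1, 0.1, 10, 2, 4),
    (0.2, 0.3, 20, 4, 6),
    (0.9, 0.8, 100, 30, 20),
]
centroids = grid_centroids(points, cells=2)
scores = {cell: score(c) for cell, c in centroids.items()}
```

Here the first two points share a grid cell and collapse into one centroid, so later reasoning runs over two centroids instead of three rows.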
  10. Sanity check: What information loss?

     • E.g. POI-3
        – 400+ examples
        – 20 centroids
     • Prediction via:
        – Extrapolation between two nearest centroids
     • Works as well as
        – Random forest, Naïve Bayes
           • For defect prediction (10 data sets)
        – Linear regression, M5'
           • For effort estimation (10 data sets)
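One plausible reading of "extrapolation between two nearest centroids", sketched under assumptions: the test row is placed on the line through its two nearest centroids (cosine rule) and the target value is read off linearly along that line. The `predict` function and the (features, target) layout are hypothetical; the slide does not spell out the exact scheme.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(row, centroids):
    """centroids: list of (features, target) pairs, at least two of them.
    Returns the target interpolated (or extrapolated, if the row falls
    outside the segment) along the line through the two nearest centroids."""
    nearest = sorted(centroids, key=lambda c: dist(row, c[0]))[:2]
    (f1, t1), (f2, t2) = nearest
    c = dist(f1, f2)
    if c == 0:                       # coincident centroids: average them
        return (t1 + t2) / 2
    a, b = dist(row, f1), dist(row, f2)
    x = (a * a + c * c - b * b) / (2 * c)   # position of row along f1 -> f2
    return t1 + (t2 - t1) * x / c

# Hypothetical centroids: (features, target).
centroids = [((0, 0), 10), ((10, 0), 30), ((100, 100), 500)]
halfway = predict((5, 0), centroids)
nearer_first = predict((2, 0), centroids)
```

A row midway between two centroids gets the mean of their targets; a row closer to one centroid gets a value closer to that centroid's target.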
  11. Planning = Inter-cluster contrast sets

     • Find delta between neighbors that go worse to better
     • Very small rules, found in log-linear time
     • Menzies et al. [TSE'13]
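A small sketch of the contrast-set idea: for each cluster, find the nearest neighbor with a better score and report only the attribute settings that differ. Everything here is an assumption for illustration: the attribute names, the Hamming distance, and the convention that a lower score is better. This naive version compares every pair of clusters, so it does not reproduce the log-linear running time cited on the slide.

```python
def hamming(a, b):
    """Number of attributes in `a` whose value differs in `b`."""
    return sum(1 for k in a if a[k] != b.get(k))

def contrast(worse, better):
    """Attribute settings of `better` that differ from `worse`."""
    return {k: v for k, v in better.items() if worse.get(k) != v}

def plans(clusters, distance=hamming):
    """clusters: {name: (attributes, score)}; lower score is better.
    Returns {name: (better_neighbor, delta)} for every cluster that has
    at least one better-scoring cluster."""
    out = {}
    for name, (attrs, s) in clusters.items():
        better = [(n, a) for n, (a, t) in clusters.items() if t < s]
        if not better:
            continue                 # already the best cluster
        n, a = min(better, key=lambda pair: distance(attrs, pair[1]))
        out[name] = (n, contrast(attrs, a))
    return out

# Hypothetical cluster centroids with made-up attributes and scores.
clusters = {
    "c1": ({"team": "large", "lang": "C", "reuse": "low"}, 90),
    "c2": ({"team": "large", "lang": "C", "reuse": "high"}, 60),
    "c3": ({"team": "small", "lang": "Java", "reuse": "high"}, 30),
}
advice = plans(clusters)
```

Each delta reads as a very small rule: to move from "c1" toward its better neighbor "c2", only the reuse setting changes.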
  12. Applications

     • Prediction
     • Planning
     • Monitoring
     • Multi-objective optimization
        – Cluster first on N objectives
     • Anomaly detection
     • Incremental theory revision
     • Compression
     • Privacy
     • etc.
  13. Idea Engineering

     0. algorithm mining
     1. landscape mining
     2. decision mining
     3. discussion mining
     Timeline: yesterday, today, tomorrow, future

     Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear

     Q: why call it mining?
     • A1: because all the primitives for the above are in the data mining literature
        • So we know how to get from here to there
     • A2: because data mining scales