Torsten Schön - How to gain a foothold in the world of classification

This talk is a basic introduction to classification. I will explain some common classification algorithms and the fundamentals of the field. Before training a model, it is crucial to explore and understand the underlying data. Based on these findings, proper feature engineering and feature selection must be performed to get appropriate results. After choosing a model and classifying data instances, we will look at different methods of evaluating the results, using techniques such as cross-validation. All of this will be demonstrated by analyzing a public data set in a free cloud-based data analysis tool called dotplot designer.

MunichDataGeeks

February 26, 2014
Transcript

  1. How to gain a foothold in the world of classification
     Torsten Schön, dotplot GmbH
  2. Overview
     • What is classification?
     • Workflow
     • Preprocessing
     • Basic classifiers
     • Evaluation
     27.02.14   How to gain a foothold in the world of classification
  3. What is classification?
     • Prediction model
     • Supervised learning
     • A set of historical data is available with known class values
     • Task: Predict to which class/category a new unseen item belongs
  4. What is classification?
     Terminology:
     • Dataset: The complete set of measured data
     • Attributes/Features: Parameters measured for each instance (usually columns)
     • Instance: A single item for which parameters are measured (usually rows)
  5. What is classification?
     Example:
     • A set of blood parameters is measured from 50 cancer patients and from 50 control persons
     • 2-class problem: Cancer vs. Healthy
     • To test if a new patient has cancer, the same blood parameters are measured and classification is used to predict the class
  6. General Workflow
     [Diagram: Training Data (class values are known) → Classification Model; Test Data (unknown class) → Classification Model → Predicted class values]
  7. Detailed Workflow
     [Diagram: Training Data → Preprocessing → Model selection → Classification Model; Test Data → Preprocessing → Classification Model → Predicted class values]
     • Preprocessing: Feature selection, feature engineering, impute missing values, …
     • Model selection: Cross-Validation, Accuracy, ROC, …
  8. Preprocessing: Feature Selection
     • Select discriminant features only
     • Save execution time
     • Remove noise effects
     • Two kinds of methods:
       – Ranking
       – Subset evaluation
  9. Preprocessing: Ranking (Filters)
     • Features are ranked by a score
       – Correlation
       – Information gain
       – …
     • Number of selected features must be given manually
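The ranking idea can be sketched in a few lines of plain Python. This is a minimal illustration, assuming numeric features and a 0/1 class label; the feature names (`marker_a`, `marker_b`, `marker_c`) and the toy values are hypothetical and not taken from the talk. Each feature is scored by the absolute Pearson correlation with the label, and the number of features to keep (`top_n`) is passed in manually, as the slide notes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no ranking information
    return cov / math.sqrt(var_x * var_y)

def rank_features(rows, labels, names, top_n):
    """Return the top_n feature names ordered by |correlation| with the label."""
    scores = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), name))
    scores.sort(reverse=True)
    return [name for _, name in scores[:top_n]]

# Hypothetical toy data: 0 = healthy, 1 = cancer
rows = [
    [1.0, 5.0, 0.3],
    [2.0, 4.0, 0.9],
    [3.0, 6.0, 0.1],
    [4.0, 5.0, 0.8],
]
labels = [0, 0, 1, 1]
names = ["marker_a", "marker_b", "marker_c"]

selected = rank_features(rows, labels, names, top_n=2)
print(selected)
```

Correlation only captures linear relationships with the label; other scores such as information gain could be swapped in for `pearson` without changing the ranking logic.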
  10. Preprocessing: Subset Evaluation (Filter)
     • A search algorithm is used to find best features
     • Number of selected features is determined by the algorithm
     Subset Evaluation (Wrapper)
     • A model is learned and evaluated on the subset to find best features
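A wrapper approach could look roughly like the following sketch. It assumes a greedy forward search and uses leave-one-out accuracy of a 1-nearest-neighbor model as the evaluation; both choices are illustrative assumptions, not the method used in the talk, and the toy data is hypothetical. The search stops on its own once no additional feature improves the score, so the number of selected features is decided by the algorithm.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor model on the given feature indices."""
    correct = 0
    for i, query in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # never compare an instance with itself
            dist = sum((query[f] - other[f]) ** 2 for f in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)

def forward_selection(rows, labels, n_features):
    """Greedily add the feature that improves accuracy most; stop when nothing helps."""
    selected, best_score = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise
rows = [[1.0, 7.0], [1.2, 2.0], [0.9, 5.0], [3.0, 6.5], [3.1, 2.2], [2.9, 4.8]]
labels = [0, 0, 0, 1, 1, 1]

selected, score = forward_selection(rows, labels, n_features=2)
print(selected, score)
```

Because a model is trained for every candidate subset, wrappers are usually much slower than ranking filters.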
  11. Preprocessing: Feature Engineering
     • Transform or compute features to better match requirements
     • Text analysis: A plain text field cannot be used for classification
     • Extract key words as nominal features, count number of words, letters, …
     • Start and end time → duration
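As an illustration of the transformations listed above, the sketch below derives a duration from start and end times and a few simple features from a free-text field. The record layout, the field names (`start`, `end`, `comment`), and the keyword `error` are assumptions made up for this example.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp layout

def engineer_features(record):
    """Turn raw start/end timestamps and a text field into classifier-friendly features."""
    start = datetime.strptime(record["start"], TIME_FORMAT)
    end = datetime.strptime(record["end"], TIME_FORMAT)
    text = record["comment"]
    return {
        # start and end time -> duration
        "duration_minutes": (end - start).total_seconds() / 60,
        # simple counts derived from the plain text
        "word_count": len(text.split()),
        "letter_count": sum(ch.isalpha() for ch in text),
        # a key word turned into a nominal (here: boolean) feature
        "mentions_error": "error" in text.lower(),
    }

# Hypothetical record
record = {
    "start": "2014-02-26 18:30",
    "end": "2014-02-26 19:15",
    "comment": "Login error after update",
}
features = engineer_features(record)
print(features)
```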
  12. Preprocessing: Estimate Missing Values
     • Some algorithms require complete datasets
     • Missing values need to be imputed
     • Simplest: Mean and mode
     • More advanced techniques lead to better results (a scientific field of its own)
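The simplest strategy, mean for numeric attributes and mode for nominal ones, might be sketched as follows. Missing values are assumed to be represented as `None`, and the example columns are hypothetical.

```python
from statistics import mean, mode

def impute_column(values, numeric):
    """Replace None with the column mean (numeric) or the most frequent value (nominal)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns with one missing value each
ages = [30, None, 50, 40]
sexes = ["f", "m", None, "f"]

ages_imputed = impute_column(ages, numeric=True)
sexes_imputed = impute_column(sexes, numeric=False)
print(ages_imputed, sexes_imputed)
```

Note that the fill value should be computed on the training data only and then reused for the test data, so that no information leaks from the test set.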
  13. Preprocessing: Add Noise
     • Generalization of the algorithm is most important!
     • Adding artificial noise to the training data can help the model generalize better
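One common way to add artificial noise, shown here only as a sketch, is to append jittered copies of the training instances. Zero-mean Gaussian noise, the `sigma` value, and the number of copies are assumptions for illustration; the talk does not specify the kind of noise used.

```python
import random

def augment_with_noise(rows, labels, sigma, copies=1, seed=0):
    """Return the original training data plus noisy copies of every instance."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    new_rows, new_labels = list(rows), list(labels)
    for row, label in zip(rows, labels):
        for _ in range(copies):
            new_rows.append([x + rng.gauss(0.0, sigma) for x in row])
            new_labels.append(label)  # the class value stays unchanged
    return new_rows, new_labels

# Hypothetical training data
rows = [[1.0, 2.0], [3.0, 4.0]]
labels = ["healthy", "cancer"]

aug_rows, aug_labels = augment_with_noise(rows, labels, sigma=0.1, copies=2)
print(len(aug_rows), aug_labels)
```

Noise should be added to the training data only; the test data stays untouched so that the evaluation reflects real inputs.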
  14. Classification Algorithms
     • There are many different classification models
     • Important:
       – Generalization
       – Robustness to noise
       – Speed
       – Performance
       – …
     • "No free lunch" theorem
  15. Classification Algorithms: k-Nearest Neighbors
     • Selects the k closest instances from the training set
     • Similarity measure needed
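A minimal k-nearest-neighbors classifier can be sketched as below. Euclidean distance is assumed as the similarity measure, and the predicted class is taken as the majority class among the k closest training instances; the toy data is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest neighbors."""
    # Distance from the query to every training instance, closest first
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data with two features per instance
train_rows = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
train_labels = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]

pred_low = knn_predict(train_rows, train_labels, [1.5, 1.5], k=3)
pred_high = knn_predict(train_rows, train_labels, [6.5, 6.5], k=3)
print(pred_low, pred_high)
```

Because the distance depends on feature scales, features are usually normalized before k-NN is applied.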
  16. Classification Algorithms: Support Vector Machine (SVM)
     • Learns support vectors which separate training instances
     • Can be
       – Higher dimensions
       – Non-linear
       – multiple
  17. Classification Algorithms: Random Forest
     • Learns a "forest" of decision trees of randomly different structures
     • Majority of the votes of single trees is final result
     • Works well in many areas as it is very robust to noise and against overfitting
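The voting step of a random forest can be illustrated in isolation. The sketch below does not learn any trees: the three "trees" are hand-written stand-ins with made-up thresholds and feature names, used only to show how individual votes are combined into the final result.

```python
from collections import Counter

def forest_predict(trees, instance):
    """Return the class that receives the majority of the individual tree votes."""
    votes = Counter(tree(instance) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for learned decision trees (thresholds are made up)
trees = [
    lambda x: "cancer" if x["marker_a"] > 2.5 else "healthy",
    lambda x: "cancer" if x["marker_b"] > 5.5 else "healthy",
    lambda x: "cancer" if x["marker_a"] + x["marker_b"] > 8.0 else "healthy",
]

patient = {"marker_a": 3.0, "marker_b": 5.2}
prediction = forest_predict(trees, patient)
print(prediction)
```

In an actual random forest each tree would be learned from a random sample of the training instances and features, which is what makes the trees differ from one another.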
  18. Evaluation
     • Evaluate different models and preprocessing steps by comparing model performance
     • Use only the training set for evaluation
     • Often used: Cross-Validation
       – Split the training data into k parts of equal size
       – Use each part once as test set and remaining k-1 parts as training sets.
       – Average the results
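The three cross-validation steps above can be sketched as follows. The code assumes accuracy as the performance measure and a simple 1-nearest-neighbor classifier as the model under test; both, along with the toy data, are illustrative choices. In practice the instances would normally be shuffled before splitting.

```python
def k_fold_indices(n, k):
    """Split the indices 0..n-1 into k consecutive folds of (nearly) equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(rows, labels, k, fit_predict):
    """Use each fold once as test set, train on the other k-1 folds, average accuracy."""
    accuracies = []
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [c for i, c in enumerate(labels) if i not in held_out]
        correct = sum(
            fit_predict(train_rows, train_labels, rows[i]) == labels[i]
            for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

def nearest_neighbor(train_rows, train_labels, query):
    """Assumed model under test: predict the label of the closest training instance."""
    best = min(
        range(len(train_rows)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_rows[i], query)),
    )
    return train_labels[best]

# Hypothetical one-feature data set with alternating classes
rows = [[1.0], [5.0], [1.2], [5.1], [0.9], [4.8]]
labels = [0, 1, 0, 1, 0, 1]

mean_accuracy = cross_validate(rows, labels, k=3, fit_predict=nearest_neighbor)
print(mean_accuracy)
```

Any other model can be evaluated the same way by passing a different `fit_predict` function, which makes it straightforward to compare models and preprocessing steps on the same folds.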