

Torsten Schön - How to gain a foothold in the world of classification

This talk is a basic introduction to classification. I will explain some common classification algorithms and the fundamentals of the field. Before training a model, it is crucial to explore and understand the underlying data. Based on these findings, proper feature engineering and feature selection must be performed to get appropriate results. After choosing a model and classifying data instances, we will look at different methods of evaluating the results, using techniques like cross-validation. All of this will be demonstrated by analyzing a public data set in a free cloud-based data analysis tool called dotplot designer.


Munich DataGeeks

February 26, 2014


Transcript

  1. How to gain a foothold in the world of classification
     Torsten Schön
     dotplot GmbH
  2. Overview
     • What is classification?
     • Workflow
     • Preprocessing
     • Basic classifiers
     • Evaluation
     27.02.14   How to gain a foothold in the world of classification   2
  3. What is classification?
     • Prediction model
     • Supervised learning
     • A set of historical data is available with known class values
     • Task: Predict to which class/category a new unseen item belongs
  4. What is classification?
     • Terminology:
     • Dataset: the complete set of measurements
     • Attributes/Features: Parameters measured for each instance (usually columns)
     • Instance: A single item for which parameters are measured (usually rows)
  5. What is classification?
     Example:
     • A set of blood parameters is measured from 50 cancer patients and from 50 control persons
     • 2-class problem: Cancer vs. Healthy
     • To test if a new patient has cancer, the same blood parameters are measured and classification is used to predict the class
  6. General Workflow
     [Diagram. Labels: Training Data (class values are known); Test Data (unknown class); Classification Model; Predicted class values]
  7. Detailed Workflow
     [Diagram. Labels: Training Data; Test Data; Preprocessing (feature selection, feature engineering, impute missing values, …), shown twice; Model selection (Cross-Validation, Accuracy, ROC, …); Classification Model; Predicted class values]
  8. Preprocessing
     Feature Selection
     • Select discriminant features only
     • Save execution time
     • Remove noise effects
     • Two kinds of methods:
       – Ranking
       – Subset evaluation
  9. Preprocessing
     Ranking (Filters)
     • Features are ranked by a score
       – Correlation
       – Information gain
       – …
     • Number of selected features must be given manually
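A ranking filter as described on this slide can be sketched in a few lines. This is only an illustration, assuming numeric features, a numeric 0/1 class label, and the absolute Pearson correlation as the score; the function names `pearson` and `rank_features` and the toy data are made up for this example and do not come from the talk.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no ranking information
    return cov / math.sqrt(var_x * var_y)

def rank_features(rows, labels, n_selected):
    """Return the indices of the n_selected features with the highest |correlation|."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scores[:n_selected]]

# Hypothetical toy data: feature 0 follows the class, feature 1 does not,
# feature 2 is constant.
rows = [[1.0, 5.0, 3.0], [2.0, 1.0, 3.0], [8.0, 4.0, 3.0], [9.0, 2.0, 3.0]]
labels = [0, 0, 1, 1]
top = rank_features(rows, labels, n_selected=1)
print(top)
```

Note that `n_selected` has to be chosen by hand, which is exactly the limitation of ranking methods named on the slide.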
  10. Preprocessing
      Subset Evaluation (Filter)
      • A search algorithm is used to find the best features
      • Number of selected features is determined by the algorithm
      Subset Evaluation (Wrapper)
      • A model is learned and evaluated on the subset to find the best features
  11. Preprocessing
      Feature Engineering
      • Transform or compute features to better match requirements
      • Text analysis: A plain text field cannot be used for classification
      • Extract key words as nominal features, count number of words, letters …
      • Start and end time → duration
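The transformations named on this slide might look like the following sketch. The record layout (`start`, `end`, `comment`), the timestamp format, and the key word "urgent" are assumptions chosen for illustration only.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp format

def engineer_features(record):
    """Turn raw start/end times and a free-text field into usable features."""
    start = datetime.strptime(record["start"], TIME_FORMAT)
    end = datetime.strptime(record["end"], TIME_FORMAT)
    text = record["comment"]
    words = text.split()
    # Strip common punctuation before looking for the key word.
    cleaned = [w.strip(".,:;!?").lower() for w in words]
    return {
        "duration_minutes": (end - start).total_seconds() / 60,  # start and end time -> duration
        "word_count": len(words),
        "letter_count": sum(ch.isalpha() for ch in text),
        "mentions_urgent": "urgent" in cleaned,  # key word as a nominal feature
    }

# Hypothetical record
record = {
    "start": "2014-02-26 18:30",
    "end": "2014-02-26 19:15",
    "comment": "Urgent: call back today!",
}
features = engineer_features(record)
print(features)
```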
  12. Preprocessing
      Estimate Missing Values
      • Some algorithms require complete datasets
      • Missing values need to be imputed
      • Simplest: Mean and mode
      • More advanced techniques lead to better results (a scientific field of its own)
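The simplest strategy from this slide, mean for numeric columns and mode for nominal ones, can be sketched as follows. Representing a missing value as `None` and the helper name `impute_column` are assumptions of this example.

```python
from statistics import mean, mode

def impute_column(values):
    """Replace None with the mean (numeric column) or the mode (nominal column)."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate from
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)
    else:
        fill = mode(observed)
    return [fill if v is None else v for v in values]

# Hypothetical columns with one missing value each
ages = [30, None, 50, 40]
smoker = ["no", "yes", None, "no"]

ages_imputed = impute_column(ages)
smoker_imputed = impute_column(smoker)
print(ages_imputed, smoker_imputed)
```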
  13. Preprocessing
      Add Noise
      • Generalization of the algorithm is most important!
      • Adding artificial noise to the training data can lead the model to generalize better
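One possible way to add artificial noise is to jitter each numeric feature with zero-mean Gaussian noise. The slide does not say which kind of noise was meant, so the Gaussian choice, the parameter `sigma`, and the function name are assumptions of this sketch.

```python
import random

def add_gaussian_noise(rows, sigma, seed=0):
    """Return a copy of the training rows with zero-mean Gaussian noise added.

    Only the feature values are changed; class labels stay untouched.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]

# Hypothetical training rows
rows = [[1.0, 2.0], [3.0, 4.0]]
noisy = add_gaussian_noise(rows, sigma=0.1, seed=42)
print(noisy)
```

The noise level should stay small relative to the spread of each feature; otherwise the class structure itself is washed out.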
  14. Classification Algorithms
      • There are many different classification models
      • Important:
        – Generalization
        – Robustness to noise
        – Speed
        – Performance
        – …
      • “No free lunch” Theorem
  15. Classification Algorithms
      k-Nearest Neighbors
      • Selects the k closest instances from the training set
      • Similarity measure needed
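A minimal k-nearest-neighbors classifier could be sketched like this. Euclidean distance is assumed as the similarity measure and the class is decided by majority vote among the k neighbors; the toy data reuses the cancer/healthy example from earlier in the talk with invented values.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Predict the class of `query` from its k closest training instances."""
    # Similarity measure: Euclidean distance (an assumption of this sketch).
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    # Majority vote among the k nearest neighbors.
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical two-feature training set
train_rows = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
train_labels = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]

print(knn_predict(train_rows, train_labels, [2, 2]))
print(knn_predict(train_rows, train_labels, [8.5, 8.5]))
```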
  16. Classification Algorithms
      Support Vector Machine (SVM)
      • Learns support vectors which separate training instances
      • Can be
        – Higher dimensions
        – Non-linear
        – multiple
  17. Classification Algorithms
      Random Forest
      • Learns a “forest” of decision trees of randomly different structures
      • Majority of the votes of single trees is final result
      • Works well in many areas as it is very robust to noise and against overfitting
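The voting idea can be illustrated with a heavily simplified stand-in. Instead of full decision trees, this sketch trains one-split "stumps", each on a bootstrap sample and one randomly chosen feature, and lets them vote. It is not a complete random forest; the threshold rule (feature mean) and all names are assumptions made to keep the example short.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-split tree on one feature, splitting at the feature mean."""
    threshold = sum(row[feature] for row in rows) / len(rows)
    left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
    right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
    fallback = Counter(labels).most_common(1)[0][0]
    left_label = Counter(left).most_common(1)[0][0] if left else fallback
    right_label = Counter(right).most_common(1)[0][0] if right else fallback
    return feature, threshold, left_label, right_label

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))           # random structure
        forest.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)
        )
    return forest

def predict_forest(forest, row):
    """The majority of the votes of the single trees is the final result."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical two-feature training set
rows = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
labels = ["healthy", "healthy", "healthy", "cancer", "cancer", "cancer"]
forest = train_forest(rows, labels)
print(predict_forest(forest, [1.5, 1.5]), predict_forest(forest, [8.5, 8.5]))
```

A single stump trained on an unlucky bootstrap sample can be wrong, but the vote across many stumps smooths this out, which is the robustness the slide refers to.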
  18. Evaluation
      • Evaluate different models and preprocessing steps by comparing model performance
      • Use only the training set for evaluation
      • Often used: Cross-Validation
        – Split the training data into k parts of equal size
        – Use each part once as test set and remaining k-1 parts as training sets.
        – Average the results
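The three cross-validation steps can be sketched as follows. A 1-nearest-neighbor rule with Euclidean distance is used as a placeholder model, and accuracy as the performance measure; both are assumptions of this example, as are the function names and the toy data. Parts are formed by interleaving indices, so their sizes differ by at most one when the data does not divide evenly.

```python
import math

def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices); every item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test

def nearest_neighbor_predict(train_rows, train_labels, query):
    """Placeholder model: label of the single closest training instance."""
    best = min(range(len(train_rows)), key=lambda i: math.dist(train_rows[i], query))
    return train_labels[best]

def cross_validate(rows, labels, k):
    """Return the accuracy averaged over the k folds."""
    accuracies = []
    for train, test in k_fold_splits(len(rows), k):
        train_rows = [rows[i] for i in train]
        train_labels = [labels[i] for i in train]
        correct = sum(
            nearest_neighbor_predict(train_rows, train_labels, rows[i]) == labels[i]
            for i in test
        )
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Hypothetical training data (the held-back test data is never touched here)
rows = [[1, 1], [8, 8], [1, 2], [9, 8], [2, 1], [8, 9]]
labels = ["healthy", "cancer", "healthy", "cancer", "healthy", "cancer"]
mean_accuracy = cross_validate(rows, labels, k=3)
print(mean_accuracy)
```

To compare models or preprocessing steps, the same splits would be reused for each candidate and the averaged scores compared.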