Torsten Schön - How to gain a foothold in the world of classification

How to gain a foothold in the world of classification
Torsten Schön
dotplot GmbH

Overview
• What is classification?
• Workflow
• Preprocessing
• Basic classifiers
• Evaluation

27.02.14 How to gain a foothold in the world of classification

What is classification?
• Prediction model
• Supervised learning
• A set of historical data is available with known class values
• Task: Predict to which class/category a new, unseen item belongs

What is classification?
• Terminology:
• Dataset: The complete set of measured data
• Attributes/Features: Parameters measured for each instance (usually columns)
• Instance: A single item for which parameters are measured (usually rows)

What is classification?
Example:
• A set of blood parameters is measured from 50 cancer patients and from 50 control persons
• 2-class problem: Cancer vs. Healthy
• To test whether a new patient has cancer, the same blood parameters are measured and classification is used to predict the class

General Workflow
[Diagram: Training Data (class values are known) → Classification Model; Test Data (unknown class) → Classification Model → Predicted class values]
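
The workflow in this diagram can be sketched in a few lines. This is only a minimal illustration: the nearest-centroid model, the function names `train` and `predict`, and the toy measurements are assumptions chosen for brevity, not something taken from the slides.

```python
from statistics import mean

def train(instances, labels):
    """Learn one centroid (mean feature vector) per class from labeled data."""
    by_class = {}
    for x, y in zip(instances, labels):
        by_class.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*rows)] for y, rows in by_class.items()}

def predict(model, x):
    """Return the class whose centroid is closest to the unseen instance x."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda y: sq_dist(model[y]))

# Training data: class values are known (toy numbers, purely illustrative).
train_X = [[1.0, 1.2], [0.9, 1.0], [3.0, 3.1], [3.2, 2.9]]
train_y = ["healthy", "healthy", "cancer", "cancer"]
model = train(train_X, train_y)

# Test data: class is unknown, the model predicts it.
test_X = [[1.1, 0.9], [2.8, 3.3]]
predicted = [predict(model, x) for x in test_X]
print(predicted)
```

Any other classifier could take the place of the centroid model; the point is the separation between learning from labeled data and predicting for unlabeled data.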

Detailed Workflow
[Diagram: Training Data → Preprocessing (feature selection, feature engineering, impute missing values, …) → Model selection (cross-validation, accuracy, ROC, …) → Classification Model; Test Data → Preprocessing → Classification Model → Predicted class values]

Preprocessing
Feature Selection
• Select discriminant features only
• Save execution time
• Remove noise effects
• Two kinds of methods:
– Ranking
– Subset evaluation

Preprocessing
Ranking (Filters)
• Features are ranked by a score
– Correlation
– Information gain
– …
• Number of selected features must be given manually
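
As a rough sketch of a ranking filter, the following scores each feature by the absolute Pearson correlation with a numeric class label and keeps the top `top_n`. The feature names and data are made up for illustration, and `top_n` is the manually chosen number of features mentioned above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)

def rank_features(X, y, names, top_n):
    """Score every feature against the class and return the top_n names."""
    scores = {
        name: abs(pearson([row[i] for row in X], y))
        for i, name in enumerate(names)
    }
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranked[:top_n], scores

# Hypothetical data: three features, binary class encoded as 0/1.
X = [[1.0, 5.0, 7.0], [2.0, 3.0, 7.0], [3.0, 6.0, 7.0], [4.0, 4.0, 7.0]]
y = [0, 0, 1, 1]
names = ["marker_a", "marker_b", "marker_c"]
selected, scores = rank_features(X, y, names, top_n=2)
print(selected)
```

Each feature is scored on its own, so a ranking filter is cheap but cannot detect features that are only useful in combination.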

Preprocessing
Subset Evaluation (Filter)
• A search algorithm is used to find the best features
• Number of selected features is determined by the algorithm
Subset Evaluation (Wrapper)
• A model is learned and evaluated on the subset to find the best features
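
One possible wrapper is sketched below: a greedy forward search that adds the feature which most improves the accuracy of a model evaluated on the candidate subset. The choice of a 1-nearest-neighbor model with leave-one-out accuracy, and the toy data, are assumptions made to keep the example self-contained.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor model using only `features`."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((xi[f] - X[j][f]) ** 2 for f in features),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)

def forward_selection(X, y, n_features):
    """Greedily add features while the wrapped model keeps improving."""
    selected, best_score = [], 0.0
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(loo_accuracy(X, y, selected + [f]), f) for f in candidates]
        if not scored:
            break
        score, feature = max(scored, key=lambda s: s[0])
        if score <= best_score:
            break  # no candidate improves the model, so the search stops
        selected.append(feature)
        best_score = score
    return selected, best_score

# Feature 0 separates the classes, feature 1 is noise (illustrative data).
X = [[1.0, 9.0], [1.2, 2.0], [0.8, 5.0], [3.0, 8.0], [3.1, 1.0], [2.9, 5.5]]
y = [0, 0, 0, 1, 1, 1]
selected, score = forward_selection(X, y, n_features=2)
print(selected, score)
```

Here the number of selected features falls out of the search itself, in contrast to the ranking approach where it has to be supplied.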

Preprocessing
Feature Engineering
• Transform or compute features to better match requirements
• Text analysis: A plain text field cannot be used for classification
• Extract key words as nominal features, count the number of words, letters, …
• Start and end time → duration
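
A small sketch of these transformations, assuming a hypothetical record with a free-text `note` field and ISO-formatted `start` and `end` timestamps. The field names and the key word list are invented for the example.

```python
from datetime import datetime

KEYWORDS = ("fever", "cough")  # hypothetical key words of interest

def engineer(record):
    """Turn a raw record into numeric and nominal features."""
    text = record["note"].lower()
    words = text.split()
    start = datetime.fromisoformat(record["start"])
    end = datetime.fromisoformat(record["end"])
    features = {
        "n_words": len(words),
        "n_letters": sum(ch.isalpha() for ch in text),
        # start and end time become a single duration feature (minutes)
        "duration_min": (end - start).total_seconds() / 60,
    }
    for kw in KEYWORDS:
        features[f"has_{kw}"] = kw in words  # key word presence as nominal feature
    return features

record = {
    "note": "Patient reports fever and fatigue",
    "start": "2014-02-27T09:00",
    "end": "2014-02-27T09:45",
}
features = engineer(record)
print(features)
```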

Preprocessing
Estimate Missing Values
• Some algorithms require complete datasets
• Missing values need to be imputed
• Simplest: Mean and mode
• More advanced techniques lead to better results (a scientific field of its own)
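
The simplest strategy can be sketched as follows: numeric columns are filled with the column mean, nominal columns with the most frequent value. Representing missing values as `None` is an assumption of this example.

```python
from statistics import mean, mode

def impute(rows):
    """Fill None entries with the column mean (numeric) or mode (nominal)."""
    fill = []
    for col in zip(*rows):
        known = [v for v in col if v is not None]
        if all(isinstance(v, (int, float)) for v in known):
            fill.append(mean(known))
        else:
            fill.append(mode(known))
    return [[f if v is None else v for v, f in zip(row, fill)] for row in rows]

rows = [[4.0, "A"], [None, "B"], [6.0, None], [5.0, "B"]]
completed = impute(rows)
print(completed)
```

The fill values should be computed on the training data only and then reused for the test data, so that no information leaks from the test set.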

Preprocessing
Add Noise
• Generalization of the algorithm is most important!
• Adding artificial noise to the training data can lead the model to generalize more
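
One way to add artificial noise is to append jittered copies of each training instance. The sketch below assumes numeric features and Gaussian noise; the parameter names `sigma` and `copies` are illustrative, and a suitable noise level depends on the scale of each feature.

```python
import random

def add_noise(X, y, sigma, copies=1, seed=0):
    """Return the original data plus `copies` noisy duplicates of every instance."""
    rng = random.Random(seed)  # fixed seed so the augmentation is reproducible
    noisy_X, noisy_y = [list(row) for row in X], list(y)
    for _ in range(copies):
        for row, label in zip(X, y):
            noisy_X.append([v + rng.gauss(0.0, sigma) for v in row])
            noisy_y.append(label)  # the class label is kept unchanged
    return noisy_X, noisy_y

X = [[1.0, 2.0], [3.0, 4.0]]
y = ["a", "b"]
aug_X, aug_y = add_noise(X, y, sigma=0.1, copies=2)
print(len(aug_X), aug_y)
```

Noise is added to the training data only; the test data stays untouched.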

Classification Algorithms
• There are many different classification models
• Important:
– Generalization
– Robustness to noise
– Speed
– Performance
– …
• “No free lunch” theorem

Classification Algorithms
k-Nearest Neighbors
• Selects the k closest instances from the training set
• Similarity measure needed
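
A minimal k-nearest-neighbors sketch, assuming Euclidean distance as the similarity measure and a majority vote among the k closest training instances. The data is a toy example.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Predict the class of x by majority vote among its k nearest neighbors."""
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train_X = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, [1.5, 1.5]))
print(knn_predict(train_X, train_y, [6.5, 6.5]))
```

Because the prediction depends directly on distances, features on very different scales usually need to be normalized first.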

Classification Algorithms
Support Vector Machine (SVM)
• Learns support vectors which separate training instances
• Can be
– Higher dimensions
– Non-linear
– multiple
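
The "higher dimensions" and "non-linear" points can be illustrated without training an actual SVM. In the sketch below, 1-D data that no single threshold can separate becomes linearly separable after mapping x to (x, x²). The separating hyperplane is hand-picked, not learned, and the data is invented for the illustration.

```python
def feature_map(x):
    """Map a 1-D value into 2-D: x -> (x, x^2)."""
    return (x, x * x)

def linear_decision(point, w, b):
    """Linear decision rule sign(w . point + b) with classes +1 / -1."""
    return 1 if sum(wi * pi for wi, pi in zip(w, point)) + b > 0 else -1

def best_threshold_accuracy(xs, labels):
    """Best accuracy any single cut on the raw 1-D value can reach."""
    best = 0.0
    for t in xs:
        for sign in (1, -1):
            hits = sum((sign if x > t else -sign) == y for x, y in zip(xs, labels))
            best = max(best, hits / len(xs))
    return best

xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]  # class +1 lies on both sides of class -1

# Hand-picked hyperplane x^2 = 1 in the mapped space (an assumption, not learned).
w, b = (0.0, 1.0), -1.0
predicted = [linear_decision(feature_map(x), w, b) for x in xs]
print(best_threshold_accuracy(xs, labels), predicted == labels)
```

An SVM with a non-linear kernel achieves a similar effect implicitly and additionally chooses the hyperplane with the largest margin.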

Classification Algorithms
Random Forest
• Learns a “forest” of decision trees of randomly different structures
• Majority of the votes of the single trees is the final result
• Works well in many areas as it is very robust to noise and against overfitting
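
The voting idea can be sketched with a strongly simplified forest. As an assumption for brevity, each "tree" here is only a decision stump (a single split) trained on a bootstrap sample and one randomly chosen feature; real random forests grow full trees and pick random feature subsets at every split.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Best single-threshold split on one feature (a depth-1 decision tree)."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [label for row, label in zip(X, y) if row[feature] <= t]
        right = [label for row, label in zip(X, y) if row[feature] > t]
        if not right:
            continue
        l_lab = Counter(left).most_common(1)[0][0]
        r_lab = Counter(right).most_common(1)[0][0]
        errors = sum(l != l_lab for l in left) + sum(r != r_lab for r in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # all values identical: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (feature, X[0][feature], majority, majority)
    return (feature, best[1], best[2], best[3])

def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample of instances
        feature = rng.randrange(len(X[0]))        # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict_forest(forest, x):
    """Final result is the majority of the votes of the single trees."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]

X = [[1.0, 1.1], [0.8, 1.3], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]]
y = ["a", "a", "a", "b", "b", "b"]
forest = fit_forest(X, y)
print(predict_forest(forest, [0.0, 0.0]), predict_forest(forest, [10.0, 10.0]))
```

Individual stumps can be wrong, for example when a bootstrap sample happens to contain only one class, but the majority vote averages such errors out.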

Evaluation
• Evaluate different models and preprocessing steps by comparing model performance
• Use only the training set for evaluation
• Often used: Cross-Validation
– Split the training data into k parts of equal size
– Use each part once as test set and the remaining k-1 parts as training set
– Average the results
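
The three cross-validation steps can be sketched as below. The `fit` and `predict` callables stand for any model; the majority-class baseline used here is just a placeholder assumption, and folds differ in size by at most one when the data does not divide evenly.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k consecutive folds of (nearly) equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(X, y, k, fit, predict):
    """Use each fold once as test set, train on the rest, average the accuracy."""
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [label for i, label in enumerate(y) if i not in held_out]
        model = fit(train_X, train_y)
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k

# Placeholder model: always predicts the most frequent training class.
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, x):
    return model

X = [[0], [1], [2], [3], [4], [5]]
y = ["a", "a", "a", "b", "a", "a"]
score = cross_validate(X, y, k=3, fit=fit_majority, predict=predict_majority)
print(round(score, 3))
```

In practice the instances are usually shuffled before splitting, often stratified by class, so that every fold reflects the overall class distribution.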

MunichDataGeeks
