Slide 1

Slide 1 text

Modeling using WEKA

Slide 2

Slide 2 text

Index • WEKA Introduction • WEKA file formats • Loading data • Univariate analysis • Data Manipulation • Feature Selection • Creating Training, Validation and Test Sets • Model Execution – Logistic Regression • Model Analysis – ROC Curve • Model Analysis – Cost/Benefit Analysis • Re-apply model on new data • WEKA Pluses and Limitations

Slide 3

Slide 3 text

Introduction • Weka is a collection of machine learning algorithms for data mining tasks • The algorithms can either be applied directly to a dataset or called from your own Java code • Weka's native data format is ARFF (Attribute-Relation File Format)

Slide 4

Slide 4 text

Dataset File formats
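To make the ARFF layout concrete, the sketch below builds a small ARFF file as text. The relation name, attribute names, and rows are invented for illustration; only the @relation, @attribute, and @data keywords and the "?" marker for missing values come from the format itself.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, kind) pairs, where kind is either the string
    "numeric" or a list of nominal labels.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attributes list their allowed labels in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical example data, not taken from the slides.
example = to_arff(
    "customers",
    [("age", "numeric"), ("income", "numeric"), ("churn", ["yes", "no"])],
    [(34, 52000, "no"), (51, None, "yes")],
)
```

Writing `example` to a file with an .arff extension should give something the Explorer can open, assuming the attribute types match the data.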

Slide 5

Slide 5 text

Load Data Set

Slide 6

Slide 6 text

Univariate Analysis

Slide 7

Slide 7 text

Univariate Analysis • Current relation – Dataset name, number of records, and number of attributes in the dataset. • Attributes – Lists all attributes; select one for univariate analysis

Slide 8

Slide 8 text

Univariate Analysis • Selected attribute – Provides information about the attribute type, missing values, distinct values, etc.
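The counts in this panel are simple to reproduce. A minimal sketch, assuming None stands for a missing value and that "unique" means a value that occurs exactly once (the function name and sample data are illustrative):

```python
def summarize(values):
    """Return count, missing, distinct, and unique counts for one attribute."""
    present = [v for v in values if v is not None]
    counts = {}
    for v in present:
        counts[v] = counts.get(v, 0) + 1
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # Assumption: "unique" counts values that appear exactly once.
        "unique": sum(1 for c in counts.values() if c == 1),
    }


summary = summarize([3, 7, 7, None, 9])
```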

Slide 9

Slide 9 text

Univariate Analysis • Selected attribute – histogram • Shows how the attribute's values are distributed.

Slide 10

Slide 10 text

Univariate Analysis • All attribute visualization/plots

Slide 11

Slide 11 text

Data Manipulation • Changing data type of field • Missing values update • Creating BINS from data • Standardize data • Outlier Treatment • Creating new calculated fields

Slide 12

Slide 12 text

1. Convert NA to 0 • Flow > Preprocess > Edit > Right Click on Attribute > Replace Values

Slide 13

Slide 13 text

1. Convert NA to 0

Slide 14

Slide 14 text

2. Changing data types of attribute • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute >

Slide 15

Slide 15 text

2. Changing data types

Slide 16

Slide 16 text

3. Creating BINS • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > Discretize

Slide 17

Slide 17 text

3. Creating BINS • Provide the attribute index and the number of bins to create, then click ‘Apply’

Slide 18

Slide 18 text

3. Creating BINS • Click on the attribute to see the bins and their distribution
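What binning does to a numeric attribute can be sketched in a few lines. This assumes equal-width bins between the minimum and maximum value; the function name and sample ages are illustrative, and the Discretize filter has further options (such as equal-frequency binning) that are not covered here.

```python
def equal_width_bins(values, bins):
    """Assign each value a bin index from 0 to bins - 1 using equal-width cuts."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum would land at index `bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


ages = [18, 22, 35, 41, 58, 60]
bin_index = equal_width_bins(ages, 3)
```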

Slide 19

Slide 19 text

3. Creating Custom BINS • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > AddExpression • ifelse(a2 > 0, ifelse(a2 > 10,ifelse(a2 > 20,4, 3), 2), 1)
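The nested ifelse is easier to read when unrolled. The sketch below mirrors the expression above in Python; the function name is hypothetical, and a2 stands for the second attribute as in the expression.

```python
def custom_bin(a2):
    """Mirror of ifelse(a2 > 0, ifelse(a2 > 10, ifelse(a2 > 20, 4, 3), 2), 1)."""
    if a2 > 0:
        if a2 > 10:
            return 4 if a2 > 20 else 3
        return 2
    return 1


# Boundary values fall into the lower bin because every test is a strict ">".
bins = [custom_bin(v) for v in (-5, 0, 5, 10, 15, 20, 25)]
```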

Slide 20

Slide 20 text

4. Standardize data • Rescales all numeric attributes in the data to zero mean and unit variance. • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > Standardize

Slide 21

Slide 21 text

4. Standardize data
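Standardizing replaces each value x with (x - mean) / standard deviation. A minimal sketch for one attribute, assuming the sample standard deviation (n - 1 in the denominator); whether the filter uses the sample or population form is not stated in the slides.

```python
import statistics


def standardize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    return [(v - mean) / sd for v in values]


z = standardize([2.0, 4.0, 6.0, 8.0])
```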

Slide 22

Slide 22 text

4. Standardize data – Log values • Converts specific numeric attributes to their log values. • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > NumericTransform

Slide 23

Slide 23 text

4. Standardize data – Log values • Provide the index of the attribute to be converted to its log value. • Set the method name to log. Other methods such as abs, round, and floor can be used here as well.
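The effect of the log method on one column can be sketched as below. It assumes the natural logarithm and strictly positive inputs; the function name and the row layout (lists indexed by column) are illustrative.

```python
import math


def log_transform(rows, column):
    """Return copies of rows with the chosen column replaced by its natural log."""
    out = []
    for row in rows:
        new_row = list(row)
        # math.log raises ValueError for zero or negative input,
        # so such values need handling before this step.
        new_row[column] = math.log(new_row[column])
        out.append(new_row)
    return out


logged = log_transform([[1, math.e], [2, 1.0]], column=1)
```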

Slide 24

Slide 24 text

5. Identify Outliers • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > InterquartileRange • Outliers can be identified for a single attribute or for all attributes together

Slide 25

Slide 25 text

5. Identify Outliers
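The idea behind an interquartile-range check is to flag values that lie far outside the middle half of the data. A minimal sketch, assuming a cut-off of 3 x IQR beyond the quartiles; the factor, the quartile method, and the function name are assumptions of this sketch, so check the filter's options for the values Weka actually uses.

```python
import statistics


def iqr_outliers(values, factor=3.0):
    """Return one True/False flag per value: True if it lies outside the IQR fence."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v < low or v > high for v in values]


readings = [10, 12, 11, 13, 12, 11, 10, 500]
flags = iqr_outliers(readings)
```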

Slide 26

Slide 26 text

5. Remove Outliers • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Instance > RemoveWithValues

Slide 27

Slide 27 text

5. Remove Outliers • Parameters: attributeIndex = index of the Outlier attribute, nominalIndices = index of the nominal value that marks an outlier in the Outlier attribute
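Removing outliers then comes down to dropping the rows whose flag attribute has the outlier value. A sketch under the assumption that the flag is stored as "yes"/"no" in the last column (the labels, function name, and data are illustrative):

```python
def remove_with_value(rows, attribute_index, value):
    """Keep only the rows whose attribute at attribute_index differs from value."""
    return [row for row in rows if row[attribute_index] != value]


# Hypothetical rows: (measurement, outlier flag).
flagged = [(12, "no"), (500, "yes"), (11, "no")]
cleaned = remove_with_value(flagged, attribute_index=1, value="yes")
```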

Slide 28

Slide 28 text

5. Transform Outliers • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > AddExpression • This option creates a new field, e.g. ifelse(a2 > 1000, 200, 1)

Slide 29

Slide 29 text

6. New Calculated fields • Useful when a new field has to be derived from existing fields • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > AddExpression

Slide 30

Slide 30 text

6. New Calculated fields • Provide the expression/equation and the name of the new field

Slide 31

Slide 31 text

Feature Selection/Attribute selection 1. Info Gain 2. Correlation

Slide 32

Slide 32 text

Feature Selection – Info Gain • Flow > Select Attribute > Attribute evaluator > Choose >
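Information gain measures how much knowing an attribute reduces the entropy of the class. A minimal sketch for a nominal attribute (function names and the toy data are illustrative; this shows the textbook definition, not Weka's implementation):

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def info_gain(feature, labels):
    """Entropy of the class minus the weighted entropy after splitting on feature."""
    total = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


target = ["y", "y", "n", "n"]
gain_informative = info_gain(["a", "a", "b", "b"], target)  # splits the classes cleanly
gain_useless = info_gain(["a", "b", "a", "b"], target)      # tells nothing about the class
```

Attributes with higher gain are the stronger candidates to keep.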

Slide 33

Slide 33 text

Feature Selection – Correlation • Flow > Select Attribute > Attribute evaluator > Choose >
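Correlation-based ranking scores each attribute by how strongly it moves with the class. A sketch of the Pearson correlation coefficient for two numeric lists (the function name and data are illustrative, and a numerically coded class is assumed):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson([1, 2, 3], [3, 2, 1])
```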

Slide 34

Slide 34 text

Features for Model • Features selected for model

Slide 35

Slide 35 text

Creating Training, Validation and Test Sets

Slide 36

Slide 36 text

Creating data sets 1. Dividing data into 60%-20%-20% (Train-Test-Evaluate) 2. Weka built-in test options

Slide 37

Slide 37 text

Creating data sets • For 60%-20%-20% • Step 1- • Flow > Preprocess > Filter > Choose > Unsupervised > Instance > Resample

Slide 38

Slide 38 text

Creating data sets • Step 2 – Parameters for Resample • Flow > Preprocess > Filter > Choose > Unsupervised > Instance > Resample • Set noReplacement = True and sampleSizePercent = 60 > OK > Apply

Slide 39

Slide 39 text

Creating data sets • Step 3 – Check: after Apply, the Current relation panel shows the number of records selected • Step 4 – Save the result as filename_train.arff • Step 5 – Click on ‘Undo’ to get back to the original data set • Step 6 – Change the Resample parameters again: invertSelection = True, noReplacement = True, sampleSizePercent = 60

Slide 40

Slide 40 text

Creating data sets • Step 7 – Apply and check results as below

Slide 41

Slide 41 text

Creating data sets • Step 8 – Don’t save the results; this is the remaining 40% of the data • Step 9 – Open the Resample parameters and set: invertSelection = False, noReplacement = True, sampleSizePercent = 50 • OK > Apply, then check the results

Slide 42

Slide 42 text

Creating data sets • Step 10 – Check the results • Step 11 – Save the results as Test Data. • Step 12 – Click on Undo to get back the earlier 40% of the data set • Step 13 – Set the parameters invertSelection = True, noReplacement = True, sampleSizePercent = 50 • Step 14 – OK > Apply • Step 15 – Save as Evaluation Data
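The outcome of these steps is three non-overlapping subsets holding 60%, 20%, and 20% of the records. The same split can be sketched outside the GUI as follows; the function name and seed are illustrative, and the random selection will not match the records Weka's Resample filter picks.

```python
import random


def split_60_20_20(rows, seed=1):
    """Shuffle rows and cut them into train (60%), test (20%), and evaluation (20%)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100
    # The remaining 40% is halved, as in the second Resample pass at 50%.
    n_test = (len(shuffled) - n_train) // 2
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]
    return train, test, evaluation


records = list(range(100))
train, test, evaluation = split_60_20_20(records)
```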

Slide 43

Slide 43 text

Creating data sets

Slide 44

Slide 44 text

Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Use Training Set – With this option, the loaded data set is used as the training set to create the model, and the model is evaluated on that same data

Slide 45

Slide 45 text

Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Use Supplied Test Set – With this option, a separate data set is supplied as the test set; the model itself is still created from the loaded training data

Slide 46

Slide 46 text

Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Use Supplied Test Set – With this option, the selected data set is used as the test set to evaluate the model

Slide 47

Slide 47 text

Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Cross Validation – With this option, the data set is divided into folds (10 by default). A model is built on all folds but one and tested on the held-out fold, once per fold. Weka reports the evaluation results combined across the folds, while the model shown on the UI is built on the full data set
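The fold mechanics can be sketched as follows. This is a plain k-fold split without the stratification a tool may apply for nominal classes; the function names are illustrative, and `evaluate` is a placeholder for "train on these rows, score on those".

```python
def k_fold_indices(n, k=10):
    """Return k (train_indices, test_indices) pairs covering indices 0..n-1."""
    folds = [list(range(i, n, k)) for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        pairs.append((train, test))
    return pairs


def cross_val_score(n, k, evaluate):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


pairs = k_fold_indices(25, 10)
```

Each record is used for testing exactly once and for training in the other k - 1 rounds.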

Slide 48

Slide 48 text

Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Percentage Split – With this option, the data set is divided by the given percentage into a training set, used to create the model, and a test set, used to evaluate it

Slide 49

Slide 49 text

Logistic Regression

Slide 50

Slide 50 text

Logistic Regression • Flow > Classify > Functions > Logistic
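For orientation, logistic regression models the probability of the positive class as sigmoid(w . x + b). The sketch below fits such a model with plain gradient descent on toy data. This is a swapped-in, simplified optimizer for illustration, not the method Weka's Logistic classifier uses, and all names, the learning rate, and the epoch count are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 for one feature vector x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent on the log loss."""
    n, n_features = len(xs), len(xs[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(xs, ys):
            err = predict_proba(w, b, x) - y
            for j in range(n_features):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: one feature, class 1 for the larger values.
w, b = fit_logistic([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
```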

Slide 51

Slide 51 text

Logistic Regression • Parameter selection

Slide 52

Slide 52 text

Logistic Regression • Model Results:

Slide 53

Slide 53 text

Model Analysis • Flow > Right Click on Model > Visualize threshold curve > ROC Curve

Slide 54

Slide 54 text

Model Analysis- ROC Curve
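A ROC curve plots the true positive rate against the false positive rate as the classification threshold is lowered. A minimal sketch of how the points and the area under the curve can be computed from predicted scores (function names and data are illustrative; labels are 1 for positive and 0 for negative):

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, threshold high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Records with the same score cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


perfect = auc(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
mixed = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```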

Slide 55

Slide 55 text

Model Analysis- Cost Benefit Analysis • Flow > Classify > Right click model > Cost/Benefit Analysis

Slide 56

Slide 56 text

Model Analysis – Cost/Benefit Analysis • Flow > Classify > Right click model > Cost/Benefit Analysis > Threshold Bar • Sliding the bar under the Threshold label changes the classification threshold, and the accuracy and threshold curve update accordingly
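Moving the threshold trades false positives against false negatives, which is what the cost figures reflect. A sketch of that calculation, assuming a record is predicted positive when its score is at or above the threshold; the per-error costs, function names, and data are illustrative.

```python
def confusion_at(scores, labels, threshold):
    """Return (tp, fp, tn, fn) when predicting 1 for scores >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def total_cost(scores, labels, threshold, cost_fp=1.0, cost_fn=1.0):
    """Total misclassification cost at the given threshold."""
    _, fp, _, fn = confusion_at(scores, labels, threshold)
    return fp * cost_fp + fn * cost_fn


scores, labels = [0.9, 0.6, 0.4, 0.2], [1, 0, 1, 0]
cost_at_half = total_cost(scores, labels, 0.5, cost_fp=1.0, cost_fn=5.0)
cost_at_low = total_cost(scores, labels, 0.3, cost_fp=1.0, cost_fn=5.0)
```

When a missed positive is costlier than a false alarm, a lower threshold can reduce the total cost, as the two values here show.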

Slide 57

Slide 57 text

Save prediction output to file • Flow > Classify > Test Options > More Options > Output Predictions > click the text bar to provide the file name • Parameters: Choose = output file type, attributes = first-last to include all fields, outputFile = name of the file to save the data to

Slide 58

Slide 58 text

Re-apply model on new data

Slide 59

Slide 59 text

WEKA Pluses: • Platform independent and portable; the Java library can be called from your own Java code • User-friendly GUI with built-in visualization • Simpler to get started with than R • Large collection of different data mining algorithms • Well suited to classification and cluster modeling • Ease of designing solutions. • Provides 3 ways to use the software: the GUI, a Java API, and a command line interface (CLI) • Can work with Spark and big data through additional packages, from the Experimenter or in batch mode.

Slide 60

Slide 60 text

WEKA Limitations: • Visualizations can be managed better in R with packages like ggplot • Not very flexible for data manipulation • Accepts only a limited set of file formats, like CSV and ARFF • Limited documentation is available for the Explorer.

Slide 61

Slide 61 text

THANK YOU !

Slide 62

Slide 62 text

Decision Tree

Slide 63

Slide 63 text

Decision Tree: Algorithm selection

Slide 64

Slide 64 text

Decision Tree: Setting params for algo

Slide 65

Slide 65 text

Decision Tree: Execution

Slide 66

Slide 66 text

Tree Visualization