
Weka steps for logistic regression

Machine learning in WEKA

sudhakar_c
September 23, 2016

Transcript

  1. Index • WEKA Introduction • WEKA file formats • Loading data • Univariate analysis • Data Manipulation • Feature Selection • Creating Training, Validation and Test Sets • Model Execution – Logistic Regression • Model Analysis – ROC Curve • Model Analysis – Cost/Benefit Analysis • Re-apply model on new data • WEKA Pluses and Limitations
  2. Introduction • Weka is a collection of machine learning algorithms for data mining tasks • The algorithms can either be applied directly to a dataset or called from your own Java code • ARFF – Attribute-Relation File Format, Weka's native data format
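For reference, a minimal ARFF file looks like this (the relation and attribute names here are purely illustrative):

```
@relation credit_sample

@attribute age     numeric
@attribute income  numeric
@attribute default {yes,no}

@data
45,32000,no
23,18000,yes
```

Numeric and nominal attributes are declared in the header; each `@data` row lists values in attribute order.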
  3. Univariate Analysis • Current Relation – dataset name, number of records, number of attributes in the dataset • Attribute details – all attributes available to select for univariate analysis
  4. Univariate Analysis • Selected attribute – provides information about attribute type, missing values, distinct values, etc.
  5. Data Manipulation • Changing the data type of a field • Updating missing values • Creating bins from data • Standardizing data • Outlier treatment • Creating new calculated fields
  6. 1. Convert NA to 0 • Flow > Preprocess > Edit > Right-click on Attribute > Replace Values
  7. 2. Changing data types of an attribute • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute >
  8. 3. Creating Bins • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > Discretize
  9. 3. Creating Bins • Provide the attribute number and the number of bins to be created, then click ‘Apply’
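Discretize's default mode cuts the attribute's range into equal-width intervals. The underlying idea can be sketched in plain Java (this mirrors the concept, not Weka's exact implementation):

```java
public class EqualWidthBins {
    // Assign a value to one of nBins equal-width intervals over [min, max].
    public static int bin(double value, double min, double max, int nBins) {
        double width = (max - min) / nBins;
        int b = (int) ((value - min) / width);
        // The maximum value lands on the upper edge; clamp it into the last bin.
        return Math.min(b, nBins - 1);
    }
}
```

With range [0, 100] and 4 bins, each bin is 25 wide, so 50 falls into bin 2 and 100 is clamped into bin 3.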
  10. 3. Creating Custom Bins • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > AddExpression • ifelse(a2 > 0, ifelse(a2 > 10, ifelse(a2 > 20, 4, 3), 2), 1)
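The nested ifelse expression above maps attribute a2 into four custom bins. The same decision logic, flattened into plain Java:

```java
public class CustomBins {
    // Equivalent of: ifelse(a2 > 0, ifelse(a2 > 10, ifelse(a2 > 20, 4, 3), 2), 1)
    public static int bin(double a2) {
        if (a2 > 20) return 4;   // largest values
        if (a2 > 10) return 3;   // (10, 20]
        if (a2 > 0)  return 2;   // (0, 10]
        return 1;                // zero or negative
    }
}
```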
  11. 4. Standardize data • Converts all numeric attributes in the data to zero mean and unit variance • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > Standardize
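What the Standardize filter computes is the classic z-score, z = (x − mean) / stddev. A stdlib-only sketch (whether Weka divides by n or n−1 is an implementation detail; the population form is used here):

```java
public class Standardize {
    // Transform values to zero mean and unit variance: z = (x - mean) / stddev.
    public static double[] standardize(double[] x) {
        double mean = 0;
        for (double v : x) mean += v;
        mean /= x.length;
        double var = 0;
        for (double v : x) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / x.length);   // population standard deviation
        double[] z = new double[x.length];
        for (int i = 0; i < x.length; i++) z[i] = (x[i] - mean) / sd;
        return z;
    }
}
```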
  12. 4. Standardize data – Log values • To apply a log transform to specific numeric attributes • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > NumericTransform
  13. 4. Standardize data – Log values • Provide the number of the attribute to be converted to its log value • Also provide the method name – log. Other methods such as abs, round or floor can be provided here as well
  14. 5. Identify Outliers • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > InterquartileRange • Outliers can be identified per attribute or for all attributes together
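The InterquartileRange filter flags values that fall outside a fence built from the quartiles. A plain-Java sketch of the rule (the multiplier is a filter parameter in Weka; the conventional 1.5 is used here, and the quantile interpolation is illustrative rather than Weka's exact method):

```java
import java.util.Arrays;

public class IqrOutliers {
    // Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR].
    public static boolean[] flag(double[] x, double factor) {
        double[] s = x.clone();
        Arrays.sort(s);
        double q1 = quantile(s, 0.25), q3 = quantile(s, 0.75);
        double iqr = q3 - q1;
        double lo = q1 - factor * iqr, hi = q3 + factor * iqr;
        boolean[] out = new boolean[x.length];
        for (int i = 0; i < x.length; i++) out[i] = x[i] < lo || x[i] > hi;
        return out;
    }

    // Linear-interpolation quantile on sorted data.
    private static double quantile(double[] s, double p) {
        double pos = p * (s.length - 1);
        int i = (int) pos;
        double frac = pos - i;
        return i + 1 < s.length ? s[i] + frac * (s[i + 1] - s[i]) : s[i];
    }
}
```

For {1,…,9, 100}, Q1 ≈ 3.25 and Q3 ≈ 7.75, so only 100 falls outside the 1.5×IQR fence.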
  15. 5. Remove Outliers • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Instance > RemoveWithValues
  16. 5. Remove Outliers • Params: attributeIndices – number of the Outlier attribute, nominalIndices – index of the nominal outlier value in the Outlier attribute
  17. 5. Transform Outliers • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > AddExpression – this option creates a new field, e.g. ifelse(a2 > 1000, 200, 1)
  18. 6. New Calculated fields • Helpful in case a new field is to be derived from existing fields • Flow > Preprocess > Filter > Choose > Filters > Unsupervised > Attribute > AddExpression
  19. 6. New Calculated fields • Helpful in case a new field is to be derived from existing fields • Provide the expression/equation and the new field name
  20. Creating data sets • For a 60%-20%-20% split • Step 1 – Flow > Preprocess > Filter > Choose > Unsupervised > Instance > Resample
  21. Creating data sets • Step 2 – Parameters for Resample • Flow > Preprocess > Filter > Choose > Unsupervised > Instance > Resample • Set noReplacement = True, sampleSizePercent = 60 > OK > Apply
  22. Creating data sets • Step 3 – Check: after Apply, the current relation shows the number of records selected • Step 4 – Save the result as filename_train.arff • Step 5 – Click ‘Undo’ to get back to the original data set • Step 6 – Change the Resample parameters again: invertSelection = True, noReplacement = True, sampleSizePercent = 60
  23. Creating data sets • Step 8 – Don’t save the results • Step 9 – Open the Resample parameters and set: invertSelection = False, noReplacement = True, sampleSizePercent = 50 • OK > Apply, then check the results
  24. Creating data sets • Step 10 – Check the results • Step 11 – Save the results as the test data • Step 12 – Click ‘Undo’ to get back to the earlier 40% of the dataset • Step 13 – Parameters: invertSelection = True, noReplacement = True, sampleSizePercent = 50 • Step 14 – OK > Apply • Step 15 – Save as the evaluation data
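The Resample/Undo sequence above carves the data into disjoint 60/20/20 portions without replacement. The same effect, sketched in plain Java over row indices (the class and method names are illustrative, not part of Weka):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TrainValTestSplit {
    // Shuffle row indices and cut them 60/20/20, mirroring the slides'
    // Resample (noReplacement = true) + invertSelection sequence.
    public static List<List<Integer>> split(int nRows, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < nRows; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        int trainEnd = (int) (nRows * 0.6);
        int testEnd  = trainEnd + (int) (nRows * 0.2);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(idx.subList(0, trainEnd));        // 60% training
        parts.add(idx.subList(trainEnd, testEnd));  // 20% test
        parts.add(idx.subList(testEnd, nRows));     // 20% evaluation
        return parts;
    }
}
```

Because sampling is without replacement and the selection is inverted between steps, every row lands in exactly one of the three sets.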
  25. Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Use Training Set – with this option, the selected data set is used as the training set to create the model
  26. Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Use Supplied Test Set – with this option, a separately supplied data set is used as the test set while the model is created on the loaded data
  27. Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Use Supplied Test Set – with this option, the supplied data set is used as the test set to test the model
  28. Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Cross Validation – with this option, the data set is divided into 10 folds; Weka trains and evaluates a model on each fold combination internally and averages the fold results to show the final evaluation on the UI
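In k-fold cross-validation each row is held out for testing exactly once. The fold-assignment idea can be sketched in plain Java (Weka additionally shuffles and stratifies the data first; this shows only the partitioning):

```java
public class CrossValidation {
    // Assign each of nRows rows to one of k folds; fold f holds out
    // the rows with index % k == f, and trains on the rest.
    public static int[] foldOf(int nRows, int k) {
        int[] fold = new int[nRows];
        for (int i = 0; i < nRows; i++) fold[i] = i % k;
        return fold;
    }
}
```

With 100 rows and 10 folds, each fold holds out exactly 10 rows, so every row is tested once and trained on nine times.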
  29. Creating data sets 2. Weka inbuilt – Flow > Classify > Test Options > Percentage Split – with this option, the data set is divided into a training and a test set for model creation
  30. Model Analysis • Flow > Right-click on Model > Visualize threshold curve > ROC Curve
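The ROC curve plots the true-positive rate against the false-positive rate across all classification thresholds, and the area under it (which Weka reports alongside the curve) equals the probability that a randomly chosen positive instance scores higher than a randomly chosen negative one. A plain-Java sketch of that rank-based (Mann-Whitney) computation, not Weka's implementation:

```java
public class RocAuc {
    // AUC = P(score of a random positive > score of a random negative),
    // counting ties as half a win (the Mann-Whitney formulation).
    public static double auc(double[] scores, boolean[] positive) {
        double wins = 0;
        int nPos = 0;
        for (int i = 0; i < scores.length; i++) {
            if (!positive[i]) continue;
            nPos++;
            for (int j = 0; j < scores.length; j++) {
                if (positive[j]) continue;
                if (scores[i] > scores[j]) wins += 1;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        int nNeg = scores.length - nPos;
        return wins / (nPos * (double) nNeg);
    }
}
```

A perfectly separating model scores 1.0; one that ranks every positive below every negative scores 0.0.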
  31. Model Analysis – Cost/Benefit Analysis • Flow > Classify > Right-click model > Cost/Benefit Analysis
  32. Model Analysis – Cost/Benefit Analysis • Flow > Classify > Right-click model > Cost/Benefit Analysis > Threshold bar • Sliding the bar under the Threshold label changes the accuracy and the threshold curve
  33. Save prediction output to file • Flow > Classify > Test Options > More Options > Output Predictions > text box to provide the file name • Parameters: Choose – to provide the file type, attributes – first-last to get all fields, outputFile – file name to save the data
  34. WEKA Pluses: • Platform independent and portable; the Java library can be invoked from any program in any language • User-friendly GUI with built-in visualization; simpler to use than R; large collection of different data mining algorithms • Strong results for classification and clustering models • Ease of designing solutions • Provides 3 ways to use the software: the GUI, a Java API, and a command line interface (CLI) • Can work with Spark and big data using other packages in the Experimenter or in batch mode
  35. WEKA Limitations: • Visualizations can be managed better in R with packages like ggplot • Not very flexible for data manipulation • Accepts only a limited set of file formats, like CSV and ARFF • Limited documentation available for the Explorer