Topics - Weka

Introduction to Machine Learning – Data Analysis in Weka –
Gregory Ditzler
Drexel University
Ecological and Evolutionary Signal Processing & Informatics Lab
Department of Electrical & Computer Engineering
Philadelphia, PA, USA
[email protected]
http://github.com/gditzler/eces436-week1
April 21, 2014

Overview of the Support Vector Machine
Problem: given a binary classification problem, the soft margin between the classes is maximized by solving:

Maximize
$$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j t_i t_j k(x_i, x_j)$$
Subject to
$$0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i t_i = 0$$
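To make the dual problem concrete, here is a minimal sketch in plain Python that evaluates L(α) and checks the two constraints for a given α. It does not solve the optimization. The function names, the linear kernel, and the two-point toy data set are assumptions chosen for illustration; they are not from the slides.

```python
def linear_kernel(x, z):
    """Plain dot product, used here as the simplest possible kernel."""
    return sum(a * b for a, b in zip(x, z))


def dual_objective(alpha, t, X, kernel=linear_kernel):
    """Evaluate L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j t_i t_j k(x_i, x_j)."""
    n = len(alpha)
    quadratic = sum(
        alpha[i] * alpha[j] * t[i] * t[j] * kernel(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quadratic


def is_feasible(alpha, t, C, tol=1e-9):
    """Check 0 <= alpha_i <= C and sum_i alpha_i t_i = 0 (up to a tolerance)."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    balanced = abs(sum(a * ti for a, ti in zip(alpha, t))) <= tol
    return in_box and balanced


# Hypothetical toy problem: one point per class on the x-axis.
X = [(1.0, 0.0), (-1.0, 0.0)]
t = [1, -1]
alpha = [0.5, 0.5]

objective_value = dual_objective(alpha, t, X)
feasible = is_feasible(alpha, t, C=1.0)
```

A quadratic programming solver would search over all feasible α for the largest value of `dual_objective`; the sketch only shows what is being maximized and what "feasible" means.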

Overview of the Support Vector Machine
[Figure: contours y = 1, y = 0, and y = −1, with the margin marked.]
Maximization of the margin between two classes given by y ∈ {−1, +1}. (Graphic from: C. Bishop, PRML, 2006.)

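Support Vector Machine Implementation
Optimization of L(α) is convex and can be solved using quadratic programming. The resulting discriminant is

$$g(x) = \sum_{i=1}^{n} \alpha_i t_i \Phi(x_i)^T \Phi(x) + w_0 = \sum_{i=1}^{n} \alpha_i t_i k(x_i, x) + w_0$$

where

$$w_0 = \frac{1}{|S|} \sum_{i \in S} \Big( t_i - \sum_{j \in S} \alpha_j t_j k(x_i, x_j) \Big)$$

and S is the set of support vectors.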
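The discriminant g(x) and the bias w0 can be sketched directly from the sums above. This is an illustrative sketch, not Weka's implementation: the function names and toy data are assumptions, and S is taken to be every index with α_i above a small tolerance.

```python
def linear_kernel(x, z):
    """Plain dot product kernel (an assumption; any kernel function works here)."""
    return sum(a * b for a, b in zip(x, z))


def bias(alpha, t, X, kernel=linear_kernel, tol=1e-12):
    """w0 = 1/|S| * sum_{i in S} ( t_i - sum_{j in S} alpha_j t_j k(x_i, x_j) )."""
    S = [i for i, a in enumerate(alpha) if a > tol]  # assumed support-vector set
    total = 0.0
    for i in S:
        total += t[i] - sum(alpha[j] * t[j] * kernel(X[i], X[j]) for j in S)
    return total / len(S)


def decision(x, alpha, t, X, w0, kernel=linear_kernel):
    """g(x) = sum_i alpha_i t_i k(x_i, x) + w0; the sign gives the predicted class."""
    return sum(a * ti * kernel(xi, x) for a, ti, xi in zip(alpha, t, X)) + w0


# Hypothetical toy problem: one training point per class.
X = [(1.0, 0.0), (-1.0, 0.0)]
t = [1, -1]
alpha = [0.5, 0.5]

w0 = bias(alpha, t, X)
g_pos = decision((2.0, 0.0), alpha, t, X, w0)
g_neg = decision((-1.0, 0.0), alpha, t, X, w0)
```

Note that only kernel evaluations appear; Φ itself is never computed.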

What are kernels?
In layman's terms: a kernel is a measure of similarity between two patterns x and x′.
Consider a similarity measure of the form
$$k : X \times X \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x')$$
k(x, x′) returns a real-valued quantity measuring the similarity between x and x′.
One simple measure of similarity is the canonical dot product:
  it computes the cosine of the angle between two vectors, provided they are normalized to length 1;
  a vector space equipped with a dot product forms a pre-Hilbert space.
Kernels represent patterns in some dot product space H via a mapping Φ : X → H.

Feature Space: X → H
A few notes about our mapping into H via Φ:
1. The mapping lets us define a similarity measure from the dot product in H: $k(x, x') := \Phi(x)^T \Phi(x')$
2. Patterns are dealt with geometrically; efficient learning algorithms may be applied using linear algebra.
3. The selection of Φ(·) leads to a large selection of similarity measures.

The "Kernel Illusion"

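The Kernel Trick
Consider the following dot product:

$$\begin{bmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{bmatrix}^T \begin{bmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{bmatrix} = x_1^2 x_1^2 + 2 x_1^2 x_2^2 + x_2^2 x_2^2 = (x_1^2 + x_2^2)^2 = \left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^T \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right)^2 = (x^T x)^2$$

$$\Phi(x)^T \Phi(x) = k(x, x)$$

Can you think of a kernel where the dot product occurs in an infinite dimensional space? Why?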
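The identity above can be checked numerically. The slide evaluates Φ(x)ᵀΦ(x) for a single point; the same algebra holds for two different points x and z, which is what the sketch below tests. The names `phi` and `poly_kernel` and the sample points are assumptions made for illustration.

```python
import math


def phi(x):
    """Explicit feature map for 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, z):
    """Homogeneous quadratic kernel k(x, z) = (x^T z)^2, computed in the input space."""
    return dot(x, z) ** 2


# Hypothetical sample points.
x = (1.0, 2.0)
z = (3.0, -1.0)

explicit = dot(phi(x), phi(z))   # dot product taken in the 3-D feature space
implicit = poly_kernel(x, z)     # same value without ever building phi
```

The kernel needs one 2-D dot product and a square, while the explicit route builds two 3-D vectors first; the gap widens quickly for higher degrees and dimensions.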

Gaussian kernels implement dot products in an infinite dimensional space
The Gaussian kernel is defined as
$$k(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)$$
The term in the exponent, $-\gamma \|x - x'\|^2$, is a scalar value computed in the vector space of the data. Recall the dot product for two arbitrary vectors is given by
$$x^T x' = \sum_{i=1}^{d} x_i x'_i$$
Thus all we need to show is that the calculation of the Gaussian kernel occurs with an infinite number of elements:
$$\Phi(x)^T \Phi(x') = k(x, x') = \exp\left(-\gamma \|x - x'\|^2\right) = \sum_{n=0}^{\infty} \frac{\left(-\gamma \|x - x'\|^2\right)^n}{n!}$$
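As a rough numerical illustration of the series above, the sketch below compares the closed-form Gaussian kernel with a truncated version of the infinite sum. The function names, the value of γ, and the sample points are assumptions, not values from the slides.

```python
import math


def sq_dist(x, z):
    """Squared Euclidean distance ||x - z||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, z))


def gaussian_kernel(x, z, gamma):
    """Closed form: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sq_dist(x, z))


def gaussian_kernel_series(x, z, gamma, terms):
    """Partial sum of sum_{n>=0} u^n / n! with u = -gamma * ||x - z||^2."""
    u = -gamma * sq_dist(x, z)
    total, term = 0.0, 1.0          # term starts at u^0 / 0! = 1
    for n in range(terms):
        total += term
        term *= u / (n + 1)         # advance u^n / n! to u^(n+1) / (n+1)!
    return total


# Hypothetical inputs.
x, z, gamma = (0.0, 0.0), (1.0, 1.0), 0.5

exact = gaussian_kernel(x, z, gamma)
approx_3 = gaussian_kernel_series(x, z, gamma, terms=3)
approx_30 = gaussian_kernel_series(x, z, gamma, terms=30)
```

With only a few terms the partial sum is visibly off; it approaches the closed form as more terms are included, which is the sense in which no finite number of terms reproduces the kernel exactly.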

A simple kernel example
$$\Phi(x)^T \Phi(x) = k(x, x)$$
[Figure: Original space]
[Figure: Feature space]

What is Weka?
The Weka (Gallirallus australis) is a flightless bird species of the rail family that is native only to New Zealand.
This bird's conservation status is currently listed as vulnerable.
The Weka is thought to be a curious bird. Folklore has tales of the Weka stealing shiny items and sugar.
How is a Weka going to help us in this class?
Photo credit: Jörg Hempel

What is Weka? (take two)
Waikato Environment for Knowledge Analysis
Weka is a project run at the University of Waikato aimed at making machine learning methods available to the public.
Weka is a collection of machine learning algorithms for data mining tasks.
Weka has tools that can be applied directly to your data or integrated into code that you are writing.
Weka is open source and written in Java.
It is possible to apply Weka to Big Data problems, and Massive Online Analysis (also developed at the University of Waikato) can handle large-volume data streams.

Weka is open source

Attribute Relation File Format (ARFF)
ARFF is the file format that Weka expects when you are going to perform classification or regression. Experiments in Weka are output in ARFF format.
The Header Section
The header contains information about the data, such as the expected formats and the number of attributes in a data set.
A relation statement gives the data set a name. A relation is like naming a file, but it lets the computer program know the task.
Comments begin with "%". They are not interpreted by the parser.
The Data Section
The data are listed in a CSV format. The features (i.e., attributes) appear in the order they were defined in the header section (examples to come).
Dense and sparse formats are available.

ARFF Example
% 1. Title: Iris Plants Database
% 2. Sources:
%    (a) Creator: R.A. Fisher
%    (b) Donor: Michael Marshall (MARSHALL%[email protected])
%    (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
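To show how the header and data sections fit together, here is a deliberately small ARFF reader in plain Python. It is a sketch under strong assumptions: dense format only, and no quoted names or values containing spaces or commas. It is not Weka's parser, and `parse_arff` is a hypothetical helper name. The sample text reuses the header and the first two data rows of the Iris example.

```python
def parse_arff(text):
    """Return (relation, attributes, rows) from a simple dense ARFF string."""
    relation = None
    attributes = []      # list of (name, declared type) in header order
    rows = []            # each row is a list of raw string values
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue                         # skip blank lines and comments
        keyword = line.lower()
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, dtype = line.split(None, 2)
            attributes.append((name, dtype))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


sample = """% 1. Title: Iris Plants Database
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(sample)
```

Each data row has one value per declared attribute, in header order, so `rows[0][4]` is the class label of the first instance.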

Attributes
The format for the @attribute statement is:
  @attribute <attribute-name> <datatype>
The datatype can be one of:
  Numeric
  Integer (treated as numeric)
  Real (treated as numeric)
  Nominal, for example {cat,dog}. Any value that contains a space must be placed in quotes.
  String
  Date. The default format string accepts the ISO-8601 combined date and time format, yyyy-MM-dd'T'HH:mm:ss:
    @attribute <name> date [<date-format>]
Examples:
  @ATTRIBUTE weight NUMERIC
  @ATTRIBUTE age INTEGER
  @ATTRIBUTE height REAL
  @ATTRIBUTE sex {male,female}
  @ATTRIBUTE lastname STRING
  @ATTRIBUTE birthdate DATE "yyyy-MM-dd"
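If date attributes are read outside of Weka, the Java-style patterns above need translating. The sketch below assumes Python's `strptime` format codes as the target; the pattern mapping is written by hand for these two formats only, and the sample values are made up for illustration.

```python
from datetime import datetime

# Hand-translated equivalents of the two Java-style patterns shown above.
ISO_COMBINED = "%Y-%m-%dT%H:%M:%S"   # yyyy-MM-dd'T'HH:mm:ss (the ARFF default)
DATE_ONLY = "%Y-%m-%d"               # yyyy-MM-dd (the birthdate example)


def parse_arff_date(value, fmt=ISO_COMBINED):
    """Parse one date value; raises ValueError if it does not match fmt."""
    return datetime.strptime(value, fmt)


# Hypothetical values, not taken from any data set.
stamp = parse_arff_date("2014-04-21T09:30:00")
birthdate = parse_arff_date("1988-07-01", DATE_ONLY)
```

A value that does not match the declared format raises `ValueError`, which is a convenient way to catch malformed entries early.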

Sparse ARFF
Sparse Data Representations
Some data sets contain many entries in the data matrix that are zero. We refer to these matrices as sparse.
Saving a tuple such as (x-index, y-index, value) can be more space efficient than saving every entry in the matrix.
The header remains the same; however, the data entries are different. Each sparse entry is an index-value pair, with attribute indices starting at 0.
Dense Format
  @DATA
  0, X, 0, Y, "class A"
  0, 0, W, 0, "class B"
Sparse Format
  @DATA
  {1 X, 3 Y, 4 "class A"}
  {2 W, 4 "class B"}
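The dense-to-sparse conversion shown above can be sketched in a few lines. The helper name `to_sparse` is hypothetical, and the sketch assumes values arrive as strings and that a zero is written exactly as "0" (a value such as "0.0" would be kept).

```python
def to_sparse(row):
    """Convert one dense ARFF row (list of strings) to a sparse ARFF data line.

    Zero entries are dropped; every other entry is written as "<index> <value>"
    with attribute indices starting at 0.
    """
    entries = [f"{index} {value}" for index, value in enumerate(row) if value != "0"]
    return "{" + ", ".join(entries) + "}"


# The two dense rows from the example above.
dense_rows = [
    ["0", "X", "0", "Y", '"class A"'],
    ["0", "0", "W", "0", '"class B"'],
]

sparse_lines = [to_sparse(row) for row in dense_rows]
```

The space saving grows with the fraction of zeros: a row with a thousand attributes and three nonzero values shrinks to three index-value pairs.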

The Weka GUI
The Weka GUI gives users a straightforward way to evaluate a classifier and run comparisons of multiple classifiers.
Explorer: basic interface to Weka's tools
Experimenter: set up experiments with multiple classifiers on multiple data sets
KnowledgeFlow: pipeline-style experiment development, similar to Simulink
Simple CLI: simple command line interface to the Weka tools

Examining/Loading the Data

Choosing the Classifier + Evaluation Method

Configuring Parameters (AdaBoost)

Visualize Tree Models

ROC Curves and AUC

Cost of the model

Evaluating a Classifier at the Command Line
One of the drawbacks to using the Weka GUI in its default setting is that it limits the JVM heap size. From the command line, the heap size can be set explicitly with -Xmx ($dataset and $rounds are assumed to be set beforehand):
>> memory=1024m
>> seed=12
>> java -Xmx$memory -cp weka.jar weka.classifiers.meta.AdaBoostM1 \
     -t data/train.arff -T data/test.arff -i -k -d $dataset-ada.model \
     -I $rounds -Q -S $seed > summary-adaboost.txt

Demo Time!
