Gregory Ditzler
Ecological and Evolutionary Signal Processing & Informatics Lab
Department of Electrical & Computer Engineering
Drexel University, Philadelphia, PA, USA
gregory.ditzler@gmail.com
http://github.com/gditzler/eces436-week1
April 21, 2014
In the classification problem, the soft margin between the classes is maximized by solving:

Maximize
$$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j t_i t_j k(x_i, x_j)$$

Subject to
$$0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i t_i = 0$$
This dual problem can be solved using quadratic programming. The resulting discriminant is
$$g(x) = \sum_{i=1}^{n} \alpha_i t_i \Phi(x_i)^{\mathsf{T}}\Phi(x) + w_0 = \sum_{i=1}^{n} \alpha_i t_i k(x_i, x) + w_0$$
where, with $S$ the set of support vectors,
$$w_0 = \frac{1}{|S|}\sum_{i \in S}\left(t_i - \sum_{j \in S} \alpha_j t_j k(x_i, x_j)\right)$$
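As a sketch of how the discriminant above is evaluated, the following snippet (a minimal illustration, not Weka's implementation; the one-dimensional data set, the $\alpha$ values, and the choice of a linear kernel are all assumed for demonstration) computes $w_0$ from the support set and then $g(x)$:

```python
import numpy as np

def linear_kernel(a, b):
    # canonical dot product used as the kernel
    return float(np.dot(a, b))

def bias_w0(alphas, targets, X, S, kernel):
    # w0 = (1/|S|) * sum_{i in S} ( t_i - sum_{j in S} a_j t_j k(x_i, x_j) )
    return np.mean([targets[i] - sum(alphas[j] * targets[j] * kernel(X[j], X[i])
                                     for j in S)
                    for i in S])

def g(x, alphas, targets, X, w0, kernel):
    # g(x) = sum_i a_i t_i k(x_i, x) + w0
    return sum(alphas[i] * targets[i] * kernel(X[i], x)
               for i in range(len(X))) + w0

# hypothetical 1-D example: two support vectors with equal weight
X = [np.array([1.0]), np.array([-1.0])]
t = [1.0, -1.0]
alphas = [0.5, 0.5]
S = [0, 1]

w0 = bias_w0(alphas, t, X, S, linear_kernel)
print(w0, g(np.array([2.0]), alphas, t, X, w0, linear_kernel))
```

With these particular numbers the two terms are symmetric, so the bias works out to zero and the discriminant reduces to $g(x) = x$.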
A kernel is a measure of similarity between two patterns $x$ and $x'$. Consider a similarity measure of the form
$$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}, \qquad (x, x') \mapsto k(x, x')$$
where $k(x, x')$ returns a real-valued quantity measuring the similarity between $x$ and $x'$. One simple measure of similarity is the canonical dot product, which computes the cosine of the angle between two vectors, provided they are normalized to length 1. The dot product of two vectors forms a pre-Hilbert space. Kernels represent patterns in some dot product space $\mathcal{H}$ via a map
$$\Phi : \mathcal{X} \to \mathcal{H}$$
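For instance, this small sketch (the two vectors are chosen arbitrarily for illustration) shows the canonical dot product returning the cosine of the angle once both vectors are normalized to unit length:

```python
import numpy as np

x = np.array([3.0, 4.0])
xp = np.array([4.0, 3.0])

# normalize to length 1 so the dot product equals cos(angle between x and x')
xn = x / np.linalg.norm(x)
xpn = xp / np.linalg.norm(xp)

k = float(np.dot(xn, xpn))
print(k)  # 24/25 = 0.96
```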
Mapping into $\mathcal{H}$ via $\Phi$ has three benefits:
1. The mapping lets us define a similarity measure from the dot product in $\mathcal{H}$: $k(x, x') := \Phi(x)^{\mathsf{T}}\Phi(x')$
2. Patterns are dealt with geometrically; efficient learning algorithms may be applied using linear algebra.
3. The selection of $\Phi(\cdot)$ leads to a large selection of similarity measures.
The Gaussian kernel is defined as
$$k(x, x') = \exp\left(-\gamma \|x - x'\|^2\right)$$
The term in the exponent, $-\gamma \|x - x'\|^2$, is a scalar value computed in the vector space of the data. Recall that the dot product of two arbitrary vectors is given by $x^{\mathsf{T}}x' = \sum_{i=1}^{d} x_i x'_i$. Thus all we need to show is that the Gaussian kernel corresponds to a dot product with an infinite number of elements; expanding the exponential as a power series,
$$\Phi(x)^{\mathsf{T}}\Phi(x') = k(x, x') = \exp\left(-\gamma \|x - x'\|^2\right) = \sum_{n=0}^{\infty} \frac{\left(-\gamma \|x - x'\|^2\right)^n}{n!}$$
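The infinite expansion can be checked numerically. This sketch ($\gamma$ and the two vectors are chosen arbitrarily) compares the closed form of the Gaussian kernel with a truncated power series:

```python
import math
import numpy as np

def gaussian_kernel(x, xp, gamma=0.5):
    # closed form: exp(-gamma * ||x - x'||^2)
    return math.exp(-gamma * float(np.sum((x - xp) ** 2)))

def gaussian_kernel_series(x, xp, gamma=0.5, terms=30):
    # truncated power series: sum_n z^n / n!  with  z = -gamma * ||x - x'||^2
    z = -gamma * float(np.sum((x - xp) ** 2))
    return sum(z ** n / math.factorial(n) for n in range(terms))

x = np.array([1.0, 2.0])
xp = np.array([2.0, 0.0])
print(gaussian_kernel(x, xp), gaussian_kernel_series(x, xp))  # both approach exp(-2.5)
```

A few dozen terms already agree with the closed form to machine precision, illustrating that the kernel implicitly computes a dot product in an infinite-dimensional feature space.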
The Weka is a flightless bird species of the rail family that is native only to New Zealand. The bird's conservation status is currently listed as vulnerable. The Weka is thought to be a curious bird; folklore has tales of the Weka stealing shiny items and sugar. How is a Weka going to help us in this class? Photo credit: Jörg Hempel
Weka is a project run at the University of Waikato aimed at making machine learning methods available to the public. Weka is a collection of machine learning algorithms for data mining tasks; its tools can be applied directly to your data or integrated into code that you are writing. Weka is open source and written in Java. It is possible to apply Weka to problems of Big Data, and Massive Online Analysis (also developed at the University of Waikato) can handle large-volume data streams.
ARFF is the file format that Weka expects when you are going to perform classification or regression. Experiments in Weka are output in ARFF format.

The Header Section
The header contains information about the data, such as expected formats and the number of attributes in the data set. A relation setting is used to give the data set a name; a relation is like naming a file, but it lets the program know what the task is. Comments begin with "%" and are not interpreted by the parser.

The Data Section
The data are listed in a CSV-like format. The features (i.e., attributes) appear in the order they were defined in the header section (examples to come). Dense and sparse formats are available.
Attributes are declared as @attribute <name> <datatype>, where <datatype> is one of:

Numeric
Integer (treated as numeric)
Real (treated as numeric)
Nominal, for example {cat,dog}; any value that contains a space must be placed in quotes
String
Date, declared as @attribute <name> date [<date-format>]; the default format string accepts the ISO-8601 combined date and time format yyyy-MM-dd'T'HH:mm:ss

Examples:
@ATTRIBUTE weight NUMERIC
@ATTRIBUTE age INTEGER
@ATTRIBUTE height REAL
@ATTRIBUTE sex {male,female}
@ATTRIBUTE lastname STRING
@ATTRIBUTE birthdate DATE "yyyy-MM-dd"
Many data sets have most entries in the data matrix equal to zero; we refer to these matrices as sparse. Saving a tuple such as (x-index, y-index, value) can be more space efficient than saving every entry in the matrix. The header remains the same; however, the data entries differ.

Dense Format
@DATA
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

Sparse Format
@DATA
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}
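Putting the header and data sections together, a complete ARFF file might look like the following. This is a hypothetical example: the relation name, attributes, and values are made up for illustration, reusing the attribute declarations shown earlier.

```
% pets.arff -- a hypothetical example
@RELATION pets

@ATTRIBUTE weight NUMERIC
@ATTRIBUTE sex    {male,female}
@ATTRIBUTE class  {"class A","class B"}

@DATA
12.5, male, "class A"
8.0, female, "class B"
```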
The Weka GUI provides an easy way to evaluate a classifier and run comparisons of multiple classifiers.

Explorer: basic interface to Weka's tools
Experimenter: set up experiments with multiple classifiers on multiple data sets
KnowledgeFlow: pipeline-style experiment development, similar to Simulink
Simple CLI: simple command line interface to the Weka tools
One of the drawbacks to using the Weka GUI in its default setting is that it limits the size of the JVM heap. Running Weka from the command line lets you set the heap size yourself with -Xmx:

>> memory=1024m
>> seed=12
>> java -Xmx$memory -cp weka.jar weka.classifiers.meta.AdaBoostM1 \
     -t data/train.arff -T data/test.arff -i -k -d $dataset-ada.model \
     -I $rounds -Q -S $seed > summary-adaboost.txt