Topics - Weka

Gregory Ditzler
April 30, 2014



  1. Introduction to Machine Learning – Data Analysis in Weka –

    Gregory Ditzler, Drexel University. Ecological and Evolutionary Signal Processing & Informatics Lab, Department of Electrical & Computer Engineering, Philadelphia, PA, USA. April 21, 2014
  2. Overview of the Support Vector Machine Problem: given a binary

    classification problem, the soft margin between the classes is maximized by solving:

    Maximize   L(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j t_i t_j k(x_i, x_j)

    Subject to   0 \le \alpha_i \le C,   \sum_{i=1}^{n} \alpha_i t_i = 0
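The dual objective above can be evaluated directly for any candidate α. A minimal pure-Python sketch (toy two-point data and a hand-picked feasible α of my own choosing, not a QP solver):

```python
def dual_objective(alpha, t, K):
    """L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j t_i t_j K[i][j]."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * t[i] * t[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

# Toy two-point problem with a linear kernel k(x_i, x_j) = x_i^T x_j
X = [[1.0, 0.0], [-1.0, 0.0]]
t = [1.0, -1.0]
K = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
alpha = [0.5, 0.5]                   # feasible: 0 <= alpha_i <= C, sum(alpha_i * t_i) = 0
print(dual_objective(alpha, t, K))   # 0.5
```

In practice the maximization over α is handed to a quadratic programming routine; this sketch only shows what the objective computes.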
  3. Overview of the Support Vector Machine

    [Figure: maximization of the margin between two classes given by y ∈ {−1, +1}, showing the decision boundary y = 0 and the margin boundaries y = ±1.] (Graphic from: C. Bishop, PRML, 2006.)
  4. Support Vector Machine Implementation Optimization of L(α) is convex and

    can be solved using quadratic programming.

    g(x) = \sum_{i=1}^{n} \alpha_i t_i \Phi(x_i)^T \Phi(x) + w_0 = \sum_{i=1}^{n} \alpha_i t_i k(x_i, x) + w_0

    where w_0 = \frac{1}{|S|} \sum_{i \in S} \left( t_i - \sum_{j \in S} \alpha_j t_j k(x_i, x_j) \right)
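Given a solved α and the support set S, the decision function and bias follow directly from these formulas. A pure-Python sketch with a hypothetical hand-chosen α (the toy data and values are illustrative, not a fitted model):

```python
def k_lin(a, b):
    """Linear kernel: plain dot product."""
    return sum(x * y for x, y in zip(a, b))

def bias(alpha, t, X, S, k):
    """w0 = (1/|S|) sum_{i in S} ( t_i - sum_{j in S} alpha_j t_j k(x_i, x_j) )."""
    total = sum(t[i] - sum(alpha[j] * t[j] * k(X[i], X[j]) for j in S) for i in S)
    return total / len(S)

def g(x, alpha, t, X, S, k):
    """Decision function g(x) = sum_i alpha_i t_i k(x_i, x) + w0."""
    return sum(alpha[i] * t[i] * k(X[i], x) for i in S) + bias(alpha, t, X, S, k)

# Hypothetical dual solution for a toy two-point problem
X = [[1.0, 0.0], [-1.0, 0.0]]
t = [1.0, -1.0]
alpha = [0.5, 0.5]
S = [0, 1]                                    # support-vector indices
print(g([2.0, 0.0], alpha, t, X, S, k_lin))   # 2.0 (positive => class +1)
```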
  5. What are kernels? In layman’s terms: A kernel is a

    measure of similarity between two patterns x and x'. Consider a similarity measure of the form

    k : X \times X \to \mathbb{R}, \quad (x, x') \mapsto k(x, x')

    where k(x, x') returns a real-valued quantity measuring the similarity between x and x'. One simple measure of similarity is the canonical dot product, which computes the cosine of the angle between two vectors, provided they are normalized to length 1. The dot product of two vectors forms a pre-Hilbert space. Kernels represent patterns in some dot product space H via a map \Phi : X \to H.
  6. Feature Space: X → H A few notes about our

    mapping into H via Φ:
    1. The mapping lets us define a similarity measure from the dot product in H: k(x, x') := \Phi(x)^T \Phi(x')
    2. Patterns are dealt with geometrically; efficient learning algorithms may be applied using linear algebra.
    3. The selection of Φ(·) leads to a large selection of similarity measures.
  7. The

  8. The Kernel Trick Consider the following dot product:

    \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}^T \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix} = x_1^2 x_1^2 + 2 x_1^2 x_2^2 + x_2^2 x_2^2 = (x_1^2 + x_2^2)^2 = (x^T x)^2

    so \Phi(x)^T \Phi(x) = k(x, x). Can you think of a kernel where the dot product occurs in an infinite dimensional space? Why?
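The identity on this slide can be checked numerically: the explicit 3-dimensional feature map and the squared dot product in the 2-dimensional input space give the same number. A small sketch (the helper names are my own):

```python
import math

def phi(x):
    """Explicit feature map whose dot product equals (x^T x)^2 on R^2."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

x = [3.0, 4.0]
lhs = dot(phi(x), phi(x))    # computed in the 3-D feature space
rhs = dot(x, x) ** 2         # computed in the 2-D input space
print(lhs, rhs)              # both ≈ 625, up to float rounding
```

This is the kernel trick: the right-hand side never materializes the feature vector, which matters when Φ maps into a very high (or infinite) dimensional space.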
  9. Gaussian kernels implement dot products in an infinite dimensional space

    The Gaussian kernel is defined as k(x, x') = \exp(-\gamma \|x - x'\|^2). The term in the exponent, -\gamma \|x - x'\|^2, is a scalar value computed in the vector space of the data. Recall the dot product for two arbitrary vectors is given by x^T x' = \sum_{i=1}^{d} x_i x'_i. Thus all we need to show is that the calculation of the Gaussian kernel occurs with an infinite number of elements:

    \Phi(x)^T \Phi(x') = k(x, x') = \exp(-\gamma \|x - x'\|^2) = \sum_{n=0}^{\infty} \frac{(-\gamma \|x - x'\|^2)^n}{n!}
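The series expansion above can be checked numerically: a truncated exponential series converges to the closed-form kernel value. A small sketch (γ and the sample points are arbitrary choices of mine):

```python
import math

def gaussian_kernel(x, xp, gamma=0.5):
    """Closed form: exp(-gamma * ||x - x'||^2)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * d2)

def series_approx(x, xp, gamma=0.5, terms=30):
    """Partial sum of the series: sum_n (-gamma ||x - x'||^2)^n / n!"""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xp))
    return sum((-gamma * d2) ** n / math.factorial(n) for n in range(terms))

x, xp = [1.0, 2.0], [2.0, 0.0]
print(gaussian_kernel(x, xp))   # exp(-2.5) ≈ 0.0821
print(series_approx(x, xp))     # converges to the same value
```

Each term of the series corresponds to a polynomial feature block, which is why the implicit feature space is infinite dimensional.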
  10. A simple kernel example Φ(x)^T Φ(x) = k(x, x)

    [Figure: original space (left) and feature space (right).]
  11. What is Weka? The Weka (Gallirallus australis) is a flightless

    bird species of the rail family that is native only to New Zealand. The bird’s conservation status is currently listed as vulnerable. The Weka is thought to be a curious bird; folklore has tales of the Weka stealing shiny items and sugar. How is a Weka going to help us in this class? Photo credit: Jörg Hempel
  12. What is Weka? (take two) Waikato Environment for Knowledge Analysis

    Weka is a project run at the University of Waikato aimed at making machine learning methods available to the public. Weka is a collection of machine learning algorithms for data mining tasks. Weka has tools that can be applied directly to your data, or integrated into code that you are writing. Weka is open source and written in Java. It’s possible to apply Weka to problems of Big Data, and Massive Online Analysis (also developed at the University of Waikato) can handle large-volume data streams.
  13. Weka is open source

  14. Attribute Relation File Format (ARFF) ARFF is the file format

    that Weka expects when you are going to perform classification or regression. Experiments in Weka are output in ARFF format.
    The Header Section: The header contains information about the data, such as expected formats and the number of attributes in a data set. A relation setting is used to give the data set a name; much like naming a file, it tells the program which task the data belong to. Comments begin with “%” and are not interpreted by the parser.
    The Data Section: The data are listed in a CSV format. The features (i.e., attributes) appear in the order they were defined in the header section (examples to come). Dense and sparse formats are available.
  15. ARFF Example

    % 1. Title: Iris Plants Database
    % 2. Sources:
    %    (a) Creator: R.A. Fisher
    %    (b) Donor: Michael Marshall (
    %    (c) Date: July, 1988
    %
    @RELATION iris
    @ATTRIBUTE sepallength NUMERIC
    @ATTRIBUTE sepalwidth NUMERIC
    @ATTRIBUTE petallength NUMERIC
    @ATTRIBUTE petalwidth NUMERIC
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
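A file in this shape is easy to generate programmatically. A minimal sketch (the function, file name, and attribute subset are my own illustrations; real data may need quoting and escaping that this skips):

```python
def write_arff(path, relation, attributes, rows):
    """Write a minimal dense ARFF file: header (@RELATION, @ATTRIBUTE) then @DATA rows."""
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for name, dtype in attributes:
            f.write(f"@ATTRIBUTE {name} {dtype}\n")
        f.write("\n@DATA\n")
        for row in rows:
            f.write(",".join(str(v) for v in row) + "\n")

write_arff(
    "iris_mini.arff", "iris",
    [("sepallength", "NUMERIC"), ("sepalwidth", "NUMERIC"),
     ("class", "{Iris-setosa,Iris-versicolor}")],
    [(5.1, 3.5, "Iris-setosa"), (4.9, 3.0, "Iris-setosa")],
)
```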
  16. Attributes The format for the @attribute statement is:

    @attribute <attribute-name> <datatype>
    where datatype is one of:
    Numeric
    Integer (treated as numeric)
    Real (treated as numeric)
    Nominal, for example {cat,dog}; any value that contains a space must be placed in quotes
    String
    Date; the default format string accepts the ISO-8601 combined date and time format yyyy-MM-dd'T'HH:mm:ss, declared as @attribute <name> date [<date-format>]
    Examples:
    @ATTRIBUTE weight NUMERIC
    @ATTRIBUTE age INTEGER
    @ATTRIBUTE height REAL
    @ATTRIBUTE sex {male,female}
    @ATTRIBUTE lastname STRING
    @ATTRIBUTE birthdate DATE "yyyy-MM-dd"
  17. Sparse ARFF Sparse Data Representations Some data sets contain many

    entries in the data matrix marked with a zero. We refer to these matrices as sparse. Saving a tuple such as (x-index, y-index, value) can be more space efficient than saving every entry in the matrix. The header remains the same; however, the data entries are different.
    Dense Format:
    @DATA
    0, X, 0, Y, "class A"
    0, 0, W, 0, "class B"
    Sparse Format:
    @DATA
    {1 X, 3 Y, 4 "class A"}
    {2 W, 4 "class B"}
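The dense-to-sparse conversion on this slide is mechanical: keep only the non-zero entries and record each one's 0-based position. A small sketch (the helper name is my own):

```python
def to_sparse(row):
    """Convert one dense ARFF data row to the sparse {index value} form.
    Indices are 0-based; entries equal to 0 are omitted."""
    entries = [f"{i} {v}" for i, v in enumerate(row) if v != 0]
    return "{" + ", ".join(entries) + "}"

print(to_sparse([0, "X", 0, "Y", '"class A"']))   # {1 X, 3 Y, 4 "class A"}
print(to_sparse([0, 0, "W", 0, '"class B"']))     # {2 W, 4 "class B"}
```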
  18. The Weka GUI The Weka GUI gives users a straightforward

    way to evaluate classifiers and run comparisons of multiple classifiers.
    Explorer: basic interface to Weka’s tools
    Experimenter: set up experiments with multiple classifiers on multiple data sets
    KnowledgeFlow: pipeline experimental development, similar to Simulink
    Simple CLI: simple command line interface to the Weka tools
  19. Examining/Loading the Data

  20. Choosing the Classifier + Evaluation Method

  21. Configuring Parameters (Adaboost)

  22. Visualize Tree Models

  23. ROC Curves and AUC

  24. Cost of the model

  25. Evaluating a Classifier at the Command Line One of the

    drawbacks to using the Weka GUI in its default setting is that it limits the size of the JVM heap.
    >> memory=1024m
    >> seed=12
    >> java -Xmx$memory -cp weka.jar weka.classifiers.meta.AdaBoostM1 \
         -t data/train.arff -T data/test.arff -i -k -d $dataset-ada.model \
         -I $rounds -Q -S $seed > summary-adaboost.txt
  26. Demo Time!