
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Big Data Spain 2015


Preprocessing data is one of the most effort-consuming tasks in Machine Learning (ML). In the Big Data context, the models automatically derived from data should be as simple, interpretable and fast as possible, and achieving that requires using the best variables, that is, the best features of the data.

Although several libraries already tackle ML tasks on Big Data, that is not yet the case for feature selection (FS) algorithms, nor for other preprocessing techniques such as discretization: the existing FS methods do not scale well when dealing with Big Data. In this presentation we show our efforts and new ideas for parallelizing standard FS methods for use in Big Data environments.

Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-11.html

Big Data Spain

October 22, 2015



Transcript

  1. BEGIN AT THE BEGINNING: FEATURE SELECTION FOR BIG DATA. AMPARO ALONSO-BETANZOS. BIG DATA SPAIN 2015, Madrid.
  2. Begin at the Beginning. "Begin at the beginning," the King said, very gravely, "and go on till you come to the end: then stop."
  3. The first step: preprocessing the data. Peter Norvig, Google Research Director.
  4. "Not everything that counts can be counted, and not everything that can be counted counts." Equality is not the way.
  5. Feature selection: basic flavors (filters).
     • Advantages: independence of the classifier; low computational cost; fast; good generalization ability.
     • Disadvantages: no interaction with the classifier.
     • Examples: CFS, consistency-based, INTERACT, ReliefF, FCBF, InfoGain, mRMR.
  6. Basic shapes of filters. Filters come in several flavors: subset filters vs. ranker filters, and univariate vs. multivariate methods. Feature selection techniques do not scale well with Big Data.
  7. Distributed feature selection: scaling up FS. Allocating the learning process among several workstations is a natural way of scaling up learning algorithms. Advantages: reduction in execution time, resource sharing, and better performance.
  8. MLlib, why? It is built on Apache Spark, a fast and general engine for large-scale data processing. Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk; runs on Hadoop 2 clusters; and lets you write applications quickly in Java, Scala, or Python.
  9. Implementing FS based on an IT framework. We implemented a generic FS framework for Big Data based on Information Theory, built around three terms: relevance, redundancy, and conditional redundancy. Reference: Brown G, Pocock A, Zhao MJ, Luján M (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27-66.
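     For reference, the unifying criterion of Brown et al. scores a candidate feature X_k against the already-selected set S as relevance minus redundancy plus conditional redundancy:

         J(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) + \gamma \sum_{X_j \in S} I(X_k; X_j \mid Y)

     Particular choices of \beta and \gamma recover well-known criteria: \beta = 1/|S|, \gamma = 0 gives mRMR, and \beta = \gamma = 1/|S| gives JMI.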
  10. The long and winding road: discretization is needed! Numerical attributes must be transformed into discrete or nominal attributes with a finite number of intervals.
  11. The algorithm: MDLP. Proposal: a complete re-design of the MDLP (Minimum Description Length Principle) discretization method. Sort all points in the dataset in a single distributed operation using a Spark primitive, then evaluate the boundary points (per feature) in parallel, as sketched below.
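     A minimal PySpark sketch of those two steps, with illustrative names (this is not the authors' implementation):

         from pyspark import SparkContext

         sc = SparkContext(appName="mdlp-sketch")

         # Toy RDD of (feature_vector, class_label) pairs.
         points = sc.parallelize([([5.1, 3.5], 0), ([6.2, 2.9], 1), ([4.8, 3.0], 0)])

         # Flatten to ((feature_index, value), label) and sort once with a
         # single distributed primitive, so each feature's values are in order.
         by_feature = points.flatMap(
             lambda p: [((j, v), p[1]) for j, v in enumerate(p[0])]
         ).sortByKey()

         # A boundary point lies between consecutive values of the same feature
         # whose class labels differ; MDLP only evaluates these candidate cuts.
         # (Boundaries straddling partition edges are ignored for brevity.)
         def boundaries(part):
             prev = None
             for (j, v), label in part:
                 if prev is not None and prev[0] == j and prev[2] != label:
                     yield (j, (prev[1] + v) / 2.0)
                 prev = (j, v, label)

         candidate_cuts = by_feature.mapPartitions(boundaries)
         print(candidate_cuts.collect())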
  12. Re-design of the FS framework for Spark. The complexity of the framework is determined by the computations of relevance and redundancy. Proposal: a complete re-design of Brown's framework. Columnar transformation: the access pattern of most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark.
  13. Re-design of the FS framework for Spark (continued).
      • Columnar transformation: the access pattern of most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark (see the sketch after this list).
      • Caching variables: relevance is computed and cached once at the start; the marginal and joint proportions derived from these operations are also cached, and this information is replicated.
      • Greedy approach: only one feature is selected in each iteration, so the quadratic complexity becomes a complexity determined by the number of features selected.
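     A minimal sketch (not the authors' code) of the columnar transformation: turn a row-wise RDD into a feature-wise, cached one, so each relevance/redundancy computation reads a single column instead of scanning every row. It reuses the SparkContext sc from the sketch above:

         # Toy RDD of discretized rows.
         rows = sc.parallelize([[1, 0, 2], [0, 1, 2], [1, 1, 0]])

         columns = (rows.zipWithIndex()                 # keep row ids for alignment
             .flatMap(lambda rv: [(j, (rv[1], x)) for j, x in enumerate(rv[0])])
             .groupByKey()                              # gather one feature per key
             .mapValues(lambda vs: [x for _, x in sorted(vs)])  # restore row order
             .cache())                                  # reused on every greedy step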
  14. Other attempts. Parallel implementation of mRMR on GPU: https://github.com/sramirez/fast-mRMR. Implementations of other FS algorithms: ReliefF, CFS, SVM-RFE (working on the scalability studies).
  15. Distributed Feature Selection (DFS). Data can be located at different sites: different parts of a company, or different cooperating organizations. A very large dataset can also be distributed over several processors and the partial results then combined. DFS goal: to reduce the computational time while maintaining the classification performance.
  16. Discretization: how does it work?
      • Parameters: 50 intervals and 100,000 max candidates per partition.
      • Classifier: Naive Bayes from MLlib, lambda = 1, iterations = 100.
      • Hardware: 16 nodes (12 cores per node), 64 GB RAM.
      • Software: Hadoop 2.5 and Apache Spark 1.2.0.
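     As a rough illustration of that classifier setup with MLlib's RDD-based API (the path and split are placeholders; sc is a SparkContext):

         from pyspark.mllib.classification import NaiveBayes
         from pyspark.mllib.util import MLUtils

         # Load a (discretized) dataset in LibSVM format and train Naive Bayes
         # with the additive-smoothing parameter lambda = 1, as on the slide.
         data = MLUtils.loadLibSVMFile(sc, "data/discretized.libsvm")
         train, test = data.randomSplit([0.8, 0.2], seed=17)
         model = NaiveBayes.train(train, lambda_=1.0)

         # Accuracy on the held-out split.
         pairs = test.map(lambda p: (model.predict(p.features), p.label))
         accuracy = pairs.filter(lambda t: t[0] == t[1]).count() / float(test.count())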
  17. Feature selection: experimental results (datasets).
      • Parameters: FS algorithm = mRMR, level of parallelism = 864 partitions.
      • Classifiers: Naive Bayes and SVM (default parameters), from MLlib.
      • Hardware: 18 nodes (12 cores per node), 64 GB RAM.
      • Software: Hadoop 2.5 and Apache Spark 1.2.0.
  18. CPU vs CUDA. [Timing plots comparing a low number of possible values (< 64) with a high number of possible values (up to 256).]
  19. GPU: real datasets.
      Dataset  | Patterns   | Features | Values
      KDDCup99 | 4,000,000  | 41       | 255
      Higgs    | 11,000,000 | 21       | 255
  20. Experimental framework.
      • Subset filters: CFS (Correlation-based Feature Selection), INTERACT, consistency-based filter.
      • Ranker filters: InfoGain, ReliefF.
      • Classifiers: C4.5, Naïve Bayes, IB1, SVM.
  21. Improving the method: a new approach. Horizontal partitioning of the datasets, maintaining the class distribution in each partition (see the sketch below); application of the filter to each partition; and combination of the results, with a merging procedure based on the theoretical complexity of the feature subsets.
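     A simple sketch of stratified (class-preserving) horizontal partitioning; the function name and interface are illustrative:

         import random
         from collections import defaultdict

         def stratified_partitions(labels, k, seed=0):
             """Split sample indices into k partitions whose class
             distributions mirror the full dataset's."""
             rng = random.Random(seed)
             by_class = defaultdict(list)
             for i, y in enumerate(labels):
                 by_class[y].append(i)
             parts = [[] for _ in range(k)]
             for idxs in by_class.values():
                 rng.shuffle(idxs)
                 for pos, i in enumerate(idxs):
                     parts[pos % k].append(i)
             return parts

         # Example: 4 partitions of a 60/40 binary dataset stay roughly 60/40.
         labels = [0] * 60 + [1] * 40
         print([sum(labels[i] for i in p) for p in stratified_partitions(labels, 4)])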
  22. The complexity measurement. Calculate the complexity of each candidate subset of features with the Fisher discriminant ratio, where μ_i, σ_i^2 and p_i are the mean, variance and proportion of the i-th class. The measure is independent of the classifier and improves the running time.
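     The slide does not show the formula itself; one common multiclass form of the Fisher discriminant ratio consistent with that legend is (an assumption here; the exact variant used in the talk may differ):

         f = \frac{\sum_{i=1}^{C} \sum_{j=i+1}^{C} p_i \, p_j \, (\mu_i - \mu_j)^2}{\sum_{i=1}^{C} p_i \, \sigma_i^2}

     Larger f means the classes are easier to separate under the candidate subset, so subsets can be ranked without training a classifier.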
  23. Number of selected features.
                    Connect 4 | Isolet | Madelon | Ozone | Spambase | Mnist
      Full set         42     |  617   |   500   |  72   |    57    |  717
      Centralized       7     |  186   |    18   |  20   |    19    |   61
      Distrib-Comp      8     |  105   |     9   |   8   |    18    |   77
  24. Experimental results: microarray datasets. DNA microarray data is a good candidate for vertically distributed feature selection, since the data needs to be split by features; this type of data usually has redundant features. (A sketch of vertical partitioning follows.)
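     A simple sketch of vertical partitioning (the function is illustrative): split the feature indices into disjoint groups so each node runs the filter on its own slice of features.

         def vertical_partitions(n_features, k):
             """Split feature indices into k disjoint, interleaved groups."""
             return [list(range(i, n_features, k)) for i in range(k)]

         # Example: 617 Isolet features over 8 nodes -> groups of ~77 features.
         print([len(g) for g in vertical_partitions(617, 8)])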
  25. Vertical partitioning with the complexity measure: using the complexity measurement instead of accuracy.
      Dataset | Features | Training | Test   | Classes
      Isolet  |   617    |   6,238  |  1,236 |   26
      Madelon |   500    |   1,600  |    800 |    2
      Mnist   |   717    |  40,000  | 20,000 |    2
  26. Complexity measure: running time. Average speedup per dataset: Isolet 2318.45, Madelon 26.13, MNIST 1483.80. Overall average: 573.4337.
  27. GPU implementation of FS. Parallel computing paradigm: the CUDA platform on an NVIDIA GTX 780 Ti. IT-based algorithm: mRMR (minimum Redundancy Maximum Relevance), in a CPU-optimized version (fast-mRMR) and a GPU version.
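     In the notation of the framework on slide 9, mRMR greedily picks, at each step, the feature that maximizes relevance minus average redundancy (β = 1/|S|, γ = 0):

         J_{mRMR}(X_k) = I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} I(X_k; X_j)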
  28. Previous ideas. GPUs compute image histograms efficiently, and the same pattern accelerates the computation of mutual information (MI) on the GPU; image processing likewise works with up to 256 values. Pipeline: input data -> threads -> partial histograms -> final results.
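     A rough CPU reference (NumPy, not the CUDA kernels) for histogram-based mutual information over byte-valued features, the quantity each GPU histogram feeds:

         import numpy as np

         def mutual_info(x, y, bins=256):
             # Joint histogram of two byte-valued vectors, then
             # MI = sum of p(x,y) * log(p(x,y) / (p(x) p(y))) over non-zero cells.
             joint = np.histogram2d(x, y, bins=bins, range=[[0, bins], [0, bins]])[0]
             pxy = joint / joint.sum()
             px = pxy.sum(axis=1, keepdims=True)
             py = pxy.sum(axis=0, keepdims=True)
             nz = pxy > 0
             return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())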
  29. Data access pattern. Reorder the data, and discretize each variable so that a value fits in an 8-bit number (max 256 different values per feature).
  30. Scalability: CPU optimization (I). [Plots of time (s) vs. number of patterns and vs. maximum possible values per feature.]
  31. CPU vs CUDA. [Timing plots comparing a low number of possible values (< 64) with a high number of possible values (up to 256).]
  32. GPU: real datasets.
      Dataset  | Patterns   | Features | Values
      KDDCup99 | 4,000,000  | 41       | 255
      Higgs    | 11,000,000 | 21       | 255
  33. References.
      • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Distributed feature selection: an application to microarray data classification. Applied Soft Computing 30:136-150, 2015. DOI: 10.1016/j.asoc.2015.01.035.
      • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Recent advances and emerging challenges of feature selection in the context of Big Data. Knowledge-Based Systems, 2015. DOI: 10.1016/j.knosys.2015.05.014.
      • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Feature Selection for High-Dimensional Data. Springer-Verlag, 2015 (in production).