
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Big Data Spain 2015

Preprocessing data is one of the most effort-consuming tasks in Machine Learning (ML). In the Big Data context, the models automatically derived from data should be as simple, interpretable and fast as possible, and achieving that requires using the best variables, that is, the best features of the data.

Although several libraries already tackle ML tasks on Big Data, that is not yet the case for feature selection (FS) algorithms or for other preprocessing techniques such as discretization, and the existing FS methods do not scale well when dealing with Big Data. In this presentation we show our efforts and new ideas for parallelizing standard FS methods for use in Big Data environments.

Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-11.html


Big Data Spain

October 22, 2015

Transcript

  1. None
  2. BEGIN AT THE BEGINNING: FEATURE SELECTION FOR BIG DATA AMPARO

    ALONSO-BETANZOS BIG DATA SPAIN 2015, Madrid
  3. BIG DATA HISPANO, 2015 2 Begin at the Beginning. “Begin at the beginning,” the King said, very gravely, “and go on till you come to the end: then stop.”
  4. BIG DATA HISPANO, 2015 3 The first step: Preprocessing the

    data Peter Norvig Google Research Director
  5. BIG DATA HISPANO, 2015 4 Not everything that counts can

    be counted, and not everything that can be counted counts. Equality is not the way
  6. BIG DATA HISPANO, 2015 5 Why Feature reduction?

  7. BIG DATA HISPANO, 2015 6 Arriving at the best features

  8. BIG DATA HISPANO, 2015 7 Feature selection. Benefits

  9. BIG DATA HISPANO, 2015 8 Feature selection: basic flavors (filters). Advantages: • Independence of the classifier • Low computational cost • Fast • Good generalization ability. Disadvantages: no interaction with the classifier. Examples: CFS, consistency-based, INTERACT, ReliefF, FCBF, InfoGain, mRMR.
  10. BIG DATA HISPANO, 2015 9 Basic shapes of filters, in several ways: subset filters vs. ranker filters; univariate vs. multivariate methods. Feature selection techniques do not scale well with Big Data.
  11. BIG DATA HISPANO, 2015 10 Distributed Feature Selection. Scaling up FS • Allocating the learning process among several workstations is a natural way of scaling up learning algorithms. • Advantages: – Reduction in execution time – Resource sharing – Better performance
  12. BIG DATA HISPANO, 2015 11 Cluster computing MLlib Distributed implementation

    of a FS method
  13. BIG DATA HISPANO, 2015 12 MLlib, why? • It is built on Apache Spark, a fast and general engine for large-scale data processing. • Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. • Runs on Hadoop 2 clusters. • Write applications quickly in Java, Scala, or Python.
  14. BIG DATA HISPANO, 2015 13 MLlib contents: no FS algorithms!!!

  15. BIG DATA HISPANO, 2015 14 Implementing FS based on the IT framework: Relevance, Redundancy, Conditional Redundancy • Implemented a generic FS framework for Big Data based on Information Theory • Brown G, Pocock A, Zhao MJ, Lujan M (2012). Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66.
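The framework cited on this slide scores each candidate feature by its relevance to the class, penalized by redundancy with already-selected features and compensated by conditional redundancy. A minimal sketch of that criterion in plain Python (function names and the equal weights beta = gamma = 1 are illustrative assumptions, not the paper's tuned coefficients):

```python
# Sketch of the unifying criterion from Brown et al. (2012):
#   J(X) = I(X;Y) - beta * sum_j I(X;Xj) + gamma * sum_j I(X;Xj|Y)
from collections import Counter
import math

def mutual_info(x, y):
    """I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def cond_mutual_info(x, y, z):
    """I(X;Y|Z) via the chain rule: I(X;(Y,Z)) - I(X;Z)."""
    return mutual_info(x, list(zip(y, z))) - mutual_info(x, z)

def criterion(candidate, selected, label, beta=1.0, gamma=1.0):
    """Relevance - redundancy + conditional redundancy for one candidate."""
    relevance = mutual_info(candidate, label)
    redundancy = sum(mutual_info(candidate, s) for s in selected)
    cond_red = sum(cond_mutual_info(candidate, s, label) for s in selected)
    return relevance - beta * redundancy + gamma * cond_red
```

Particular choices of beta and gamma recover familiar filters (e.g. pure relevance ranking when both are zero), which is what makes the framework a good single target for parallelization.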
  16. BIG DATA HISPANO, 2015 15 The long and winding road

    Discretization is needed!! Transform numerical attributes into discrete or nominal attributes with a finite number of intervals.
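As a toy illustration of the transformation just described (not the method used in the talk, which is class-aware MDLP), an unsupervised equal-width discretizer maps each numeric value to one of k interval indices:

```python
def equal_width_bins(values, k):
    """Map each numeric value to one of k equal-width interval indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0   # guard against a constant feature
    return [min(int((v - lo) / width), k - 1) for v in values]
```

The IT-based filters above all assume discrete inputs, which is why a scalable discretizer has to come first in the pipeline.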
  17. BIG DATA HISPANO, 2015 16 Stages in the discretization process

  18. BIG DATA HISPANO, 2015 17 The algorithm: MDLP • Proposal: complete re-design of the MDLP discretization method (Minimum Description Length Principle) – Sort all points in the dataset in a single distributed operation using a Spark primitive. – Evaluate boundary points (per feature) in a parallel way.
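The sequential idea behind those two steps, sketched in Python (the Spark version distributes the sort and evaluates features in parallel; the function name is hypothetical):

```python
def boundary_points(values, labels):
    """Candidate MDLP cut points: midpoints between consecutive distinct
    values whose class labels differ, after a global sort of the feature."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, c1), (v2, c2) in zip(pairs, pairs[1:])
            if v1 != v2 and c1 != c2]
```

MDLP then accepts or rejects each candidate cut with an entropy-based description-length test; restricting the search to boundary points is what keeps the per-feature evaluation cheap.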
  19. BIG DATA HISPANO, 2015 18 Is it worth it?

  20. BIG DATA HISPANO, 2015 19 Original criteria and reformulation in

    framework
  21. BIG DATA HISPANO, 2015 20 Re-design of the FS framework for Spark • The complexity of the framework is determined by the computations of relevance and redundancy • Proposal: complete re-design of Brown's framework – Columnar transformation: the access pattern of most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark.
  22. BIG DATA HISPANO, 2015 21 Re-design of the FS framework for Spark • The complexity of the framework is determined by the computations of relevance and redundancy • Proposal: complete re-design of Brown's framework – Columnar transformation: the access pattern of most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark. – Caching variables: relevance is computed and cached once at the start; the marginal and joint proportions derived from these operations are also cached, and this information is replicated. – Greedy approach: only one feature is selected in each iteration, so the quadratic complexity becomes a complexity determined by the number of features selected.
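A minimal single-machine sketch of the greedy scheme just described, with relevance cached once and one feature added per iteration (mRMR-style relevance-minus-redundancy scoring; all names are illustrative):

```python
from collections import Counter
import math

def mi(x, y):
    """Mutual information in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def greedy_select(features, label, k):
    """features: dict name -> column. Returns k feature names, best first."""
    relevance = {f: mi(col, label) for f, col in features.items()}  # cached once
    selected = []
    while len(selected) < k:
        best = max((f for f in features if f not in selected),
                   key=lambda f: relevance[f]
                   - sum(mi(features[f], features[s]) for s in selected))
        selected.append(best)
    return selected
```

Selecting k features costs on the order of k·d mutual-information evaluations instead of d², which is the point the slide makes about escaping quadratic complexity.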
  23. BIG DATA HISPANO, 2015 22 Scalability results: Selection time Dna

    dataset
  24. BIG DATA HISPANO, 2015 23 Scalability : Cores ECDBL14 dataset

  25. BIG DATA HISPANO, 2015 24 Is it useful?

  26. BIG DATA HISPANO, 2015 25 Time for creating the classification

    model
  27. BIG DATA HISPANO, 2015 26 Spark-Infotheoretic FS

  28. BIG DATA HISPANO, 2015 27 Other attempts Parallel Implementation of

    mRMR on GPU https://github.com/sramirez/fast-mRMR Implementation of other FS algorithms: ReliefF, CFS, SVM-RFE (working on the scalability studies)
  29. BIG DATA HISPANO, 2015 28 Distributed Feature Selection (DFS). Data can be located in different sites: • Different parts of a company • Different cooperating organizations • A very large data set can be distributed over several processors and the results then combined. DFS goal: to reduce the computational time while maintaining the classification performance.
  30. BIG DATA HISPANO, 2015 29 DFS. Types of partition By

    samples By features
  31. BIG DATA HISPANO, 2015 30 DFS with rankers NDCG values

  32. BIG DATA HISPANO, 2015 31 Distributed FS by Samples HORIZONTAL

    PARTITION
  33. BIG DATA HISPANO, 2015 32 Distributed FS by Features VERTICAL

    PARTITION
  34. Thank You!!! 33 BIG DATA SPAIN 2015, Madrid

  35. BIG DATA HISPANO, 2015 34 Discretization. How does it work? • Parameters: 50 intervals and 100,000 max candidates per partition. • Classifier: Naive Bayes from MLlib, lambda = 1, iterations = 100. • Hardware: 16 nodes (12 cores per node), 64 GB RAM. • Software: Hadoop 2.5 and Apache Spark 1.2.0.
  36. BIG DATA HISPANO, 2015 35 Feature Selection. Experimental results. DATASETS • Parameters: FS algorithm = mRMR, level of parallelism = 864 partitions. • Classifier: Naive Bayes and SVM (default parameters), from MLlib. • Hardware: 18 nodes (12 cores per node), 64 GB RAM. • Software: Hadoop 2.5 and Apache Spark 1.2.0.
  37. BIG DATA HISPANO, 2015 36 CPU vs CUDA Low number

    of possible values (< 64) High number of possible values (up to 256)
  38. BIG DATA HISPANO, 2015 37 GPU. Real Datasets

      DATASET    PATTERNS    FEATURES  VALUES
      KDDCup99    4000000          41     255
      Higgs      11000000          21     255
  39. BIG DATA HISPANO, 2015 38 Horizontal partitioning: By samples

  40. BIG DATA HISPANO, 2015 1. Horizontal partitioning of the datasets

    39
  41. BIG DATA HISPANO, 2015 40 2. Application of the filter

    to the subsets
  42. BIG DATA HISPANO, 2015 41 3. Combination of the results
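The three steps of slides 40–42, sketched end to end on one machine (the round-robin split, the MI-based filter, and the vote threshold are all simplifying assumptions for illustration; the actual experiments use the filters listed on the next slide):

```python
from collections import Counter
import math

def mi(x, y):
    """Mutual information in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def dfs_by_samples(columns, label, n_parts, top, min_votes):
    """1) split the samples, 2) filter each subset, 3) combine by votes."""
    votes = Counter()
    n = len(label)
    for p in range(n_parts):
        idx = range(p, n, n_parts)                      # 1) horizontal split
        yp = [label[i] for i in idx]
        scores = {f: mi([col[i] for i in idx], yp)      # 2) filter per subset
                  for f, col in columns.items()}
        for f in sorted(scores, key=scores.get, reverse=True)[:top]:
            votes[f] += 1
    return {f for f, v in votes.items() if v >= min_votes}  # 3) combine
```

Each partition is small enough for a standard filter, and the vote threshold controls how aggressive the final combined subset is.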

  43. BIG DATA HISPANO, 2015 42 Experimental Framework. FILTERS • Subset filters: CFS (Correlation-based Feature Selection), INTERACT, consistency-based filter • Ranker filters: InfoGain, ReliefF. CLASSIFIERS • C4.5 • Naïve Bayes • IB1 • SVM
  44. BIG DATA HISPANO, 2015 43 Results with C4.5

  45. BIG DATA HISPANO, 2015 44 Improving the method: new approach • Horizontal partitioning of the datasets – Partitioning of the data maintaining the class distribution • Application of the filter to the subsets • Combination of the results – Merging procedure: theoretical complexity of feature subsets
  46. BIG DATA HISPANO, 2015 45 The complexity measurement • Calculate the complexity of each candidate subset of features • Fisher discriminant ratio, where μi, σi² and pi are the mean, variance and proportion of the i-th class. Independence from the classifier; temporal improvement.
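The complexity measure named on this slide, written out for a single feature in one common multi-class form: between-class scatter over within-class scatter, built from the per-class means μi, variances σi² and proportions pi (the function name and this exact variant are assumptions; the paper's definition may differ in detail):

```python
def fisher_ratio(values, labels):
    """Multi-class Fisher discriminant ratio for one feature.
    Higher values mean the classes are easier to separate."""
    n = len(values)
    stats = {}
    for c in set(labels):
        xs = [v for v, l in zip(values, labels) if l == c]
        mean = sum(xs) / len(xs)
        var = sum((v - mean) ** 2 for v in xs) / len(xs)
        stats[c] = (mean, var, len(xs) / n)       # (mu_i, sigma_i^2, p_i)
    classes = list(stats)
    between = sum(stats[a][2] * stats[b][2] * (stats[a][0] - stats[b][0]) ** 2
                  for a in classes for b in classes if a != b)
    within = sum(p * var for _, var, p in stats.values())
    return between / within if within else float('inf')
```

Because it needs only class statistics, scoring a candidate subset this way avoids training a classifier at every merge step, which is where the temporal improvement comes from.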
  47. BIG DATA HISPANO, 2015 46 Number of selected features

                    Connect 4  Isolet  Madelon  Ozone  Spambase  Mnist
      Full set             42     617      500     72        57    717
      Centralized           7     186       18     20        19     61
      Distrib-Comp          8     105        9      8        18     77
  48. BIG DATA HISPANO, 2015 47 Classification accuracy (II) CFS ReliefF

  49. BIG DATA HISPANO, 2015 48 Time. Distributed vs Centralized (maximum)

  50. BIG DATA HISPANO, 2015 49 Runtime. Comparing Distributed approaches (Average)

  51. BIG DATA HISPANO, 2015 50 Maximum Run Time (MNIST)

  52. BIG DATA HISPANO, 2015 51 Runtime to obtain threshold of

    votes
  53. BIG DATA HISPANO, 2015 52 Vertical partitioning: By features

  54. BIG DATA HISPANO, 2015 53 Vertical partitioning

  55. BIG DATA HISPANO, 2015 54 Different approaches tested

  56. BIG DATA HISPANO, 2015 55 Experimental results. Microarray datasets. DNA microarray data is a good candidate for vertically distributed feature selection, since the data needs to be split by features. This type of data usually has redundant features.
  57. BIG DATA HISPANO, 2015 56 Vertical distribution. Accuracy and time

  58. BIG DATA HISPANO, 2015 57 Vertical partition. Complexity measure • Using the complexity measurement instead of accuracy. Number of features:

               Features  Training   Test  Classes
      Isolet        617      6238   1236       26
      Madelon       500      1600    800        2
      Mnist         717     40000  20000        2
  59. BIG DATA HISPANO, 2015 58 Experimental results. Classification accuracy

  60. BIG DATA HISPANO, 2015 59 Complexity measure. Time

  61. BIG DATA HISPANO, 2015 60 Complexity measure. Time. Average speedup: Isolet 2318.45, Madelon 26.13, MNIST 1483.80. Overall average: 573.4337
  62. BIG DATA HISPANO, 2015 61 GPU implementation

  63. BIG DATA HISPANO, 2015 62 GPU implementation of FS • Parallel computing paradigm • CUDA platform: NVIDIA GTX 780 Ti • IT-based algorithm: mRMR (minimum Redundancy Maximum Relevance) – CPU-optimized version (fast-mRMR) – GPU version
  64. BIG DATA HISPANO, 2015 63 Previous ideas • GPUs compute image histograms efficiently • Accelerate the computation of MI on the GPU • Image processing (up to 256 values). Pipeline: input data → threads → partial histograms → final results
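The histogram idea in plain Python: once features are discretized to at most 256 values, I(X;Y) reduces to filling a joint histogram (the step GPUs accelerate with per-thread partial histograms) and summing over its non-zero cells. A sequential sketch with an assumed function name:

```python
import math

def mi_from_histogram(x, y, bins=256):
    """Mutual information in bits from a joint histogram of byte values."""
    n = len(x)
    joint = [[0] * bins for _ in range(bins)]
    for a, b in zip(x, y):        # on the GPU: many threads build partial
        joint[a][b] += 1          # histograms that are merged afterwards
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for a, row in enumerate(joint)
               for b, c in enumerate(row) if c)
```

The 256-value cap is what makes the joint histogram small enough to live in fast GPU shared memory, which is why the discretization step matters for the CUDA version.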
  65. BIG DATA HISPANO, 2015 64 Data access pattern • Reorder • Variable discretization: each value is an 8-bit number (max 256 different values per feature)
  66. BIG DATA HISPANO, 2015 65 Datasets: performance and scalability

  67. BIG DATA HISPANO, 2015 66 Results. CPU optimization: time (s) per dataset, with 400 and 100 features
  68. BIG DATA HISPANO, 2015 67 Scalability of the CPU optimization (I): time (s) vs. number of patterns and maximum possible values per feature
  69. BIG DATA HISPANO, 2015 68 Scalability of the CPU optimization (II): time (s) vs. number of features
  70. BIG DATA HISPANO, 2015 69 CPU vs CUDA Low number

    of possible values (< 64) High number of possible values (up to 256)
  71. BIG DATA HISPANO, 2015 70 GPU. Real Datasets

      DATASET    PATTERNS    FEATURES  VALUES
      KDDCup99    4000000          41     255
      Higgs      11000000          21     255
  72. BIG DATA HISPANO, 2015 71 Challenges : Millions of Dimensions

  73. BIG DATA HISPANO, 2015 72 Challenges: Scalability

  74. BIG DATA HISPANO, 2015 73 Challenges: Distributed FS

  75. BIG DATA HISPANO, 2015 74 Challenges: Real-time FS

  76. BIG DATA HISPANO, 2015 75 Challenges: Cost-based FS

  77. BIG DATA HISPANO, 2015 76 Challenges: Visualization and Interpretability

  78. BIG DATA HISPANO, 2015 77 References • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Distributed feature selection: An application to microarray data classification. Applied Soft Computing 30:136–150, 2015. DOI: 10.1016/j.asoc.2015.01.035. • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Recent advances and emerging challenges of feature selection in the context of Big Data. Knowledge-Based Systems, 2015. DOI: 10.1016/j.knosys.2015.05.014. • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Feature Selection for High-Dimensional Data. Springer, 2015 (in production).
  79. Thank You!!! 78 BIG DATA SPAIN 2015, Madrid