
Feature selection for Big Data: advances and challenges by Verónica Bolón-Canedo at Big Data Spain 2017

In an era of growing data complexity and volume and the advent of Big Data, feature selection has a key role to play in reducing high dimensionality in machine learning problems.


Big Data Spain 2017
November 16th - 17th Kinépolis Madrid


Big Data Spain

December 01, 2017


  2. Feature selection for Big Data: advances and challenges, by Verónica Bolón-Canedo

  3. Big Data: Volume, Velocity, Variety, Veracity, Value, Variability, Visualization, Validity, Vulnerability, Volatility, Variables
  4. The more data, the better… right? The curse of dimensionality
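
One common way to make the curse of dimensionality concrete is distance concentration: as the number of features grows, the nearest and farthest neighbours of a point become almost equally far away. The sketch below is an illustration under assumed settings (uniform random data, 200 points, the function name `relative_contrast`), none of which come from the talk.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max distance - min distance) / min distance from a random
    query point to n_points random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as dimensionality grows, so "nearest" loses meaning.
contrasts = {d: relative_contrast(200, d) for d in (2, 20, 200, 2000)}
for dims, contrast in contrasts.items():
    print(f"{dims:>5} features: relative contrast = {contrast:.3f}")
```

With more features and the same number of samples, distance-based learners have less and less to discriminate on, which is one motivation for discarding features that carry no signal.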

  5. Feature selection: “Feature selection is the process of selecting the relevant features and discarding the irrelevant and redundant ones.” Note: we are not talking about feature extraction for dimensionality reduction! PCA, t-SNE, manifold learning? No, they lose the meaning of the original features.
  6. What is a relevant feature? Imagine that you are trying to guess the price of a car…
    • Relevant: engine size, age, mileage, presence of rust, ...
    • Irrelevant: color of windscreen wipers, stickers on windows, ...
    • Redundant: age / mileage
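
The car example can be sketched with synthetic data. Everything numeric here is an assumption made up for illustration (the price formula, the noise levels, 500 cars); only the feature names come from the slide. Pearson correlation is used as a simple stand-in for "relevance" and "redundancy".

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


rng = random.Random(42)
n = 500
age = [rng.uniform(0, 15) for _ in range(n)]                  # years
mileage = [15000 * a + rng.gauss(0, 5000) for a in age]       # tracks age closely
engine_size = [rng.uniform(1.0, 3.0) for _ in range(n)]       # litres
wiper_color = [rng.randint(0, 4) for _ in range(n)]           # arbitrary colour code
# Assumed price model: depends on engine size and age only.
price = [20000 + 5000 * e - 1000 * a + rng.gauss(0, 1000)
         for e, a in zip(engine_size, age)]

print("relevant   age vs price:        ", round(pearson(age, price), 2))
print("relevant   engine vs price:     ", round(pearson(engine_size, price), 2))
print("irrelevant wiper color vs price:", round(pearson(wiper_color, price), 2))
print("redundant  age vs mileage:      ", round(pearson(age, mileage), 2))
```

Age and mileage are each informative about price, but because they are almost perfectly correlated with each other, keeping both adds little.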
  7. Why feature selection?
    • General data reduction: to limit storage requirements and increase algorithm speed
    • Feature set reduction: to save resources in the next round of data collection
    • Performance improvement: to gain in predictive accuracy
    • Data understanding: to gain knowledge about the process that generated the data, or for visualization
  8. Feature selection methods
    • Subset vs Ranker
    • Filters vs Embedded vs Wrappers
    • Univariate vs Multivariate
    Sorry… There is no one-size-fits-all method!
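
To make two of these distinctions concrete, here is a minimal sketch of a univariate filter that returns a ranking, next to a multivariate filter that returns a subset. The function names, the thresholds and the toy data are assumptions for illustration; they are not methods presented in the talk.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_features(columns, target):
    """Univariate ranker: score each feature on its own, return all of them ordered."""
    scores = {name: abs(pearson(col, target)) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def select_subset(columns, target, min_relevance=0.3, max_redundancy=0.9):
    """Multivariate subset method: keep relevant features that are not
    redundant with a feature that was already kept."""
    selected = []
    for name in rank_features(columns, target):
        if abs(pearson(columns[name], target)) < min_relevance:
            break  # ranking is sorted, so everything after this is weaker
        if all(abs(pearson(columns[name], columns[s])) < max_redundancy
               for s in selected):
            selected.append(name)
    return selected


target = [1, 2, 3, 4, 5, 6, 7, 8]
columns = {
    "f_strong": [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2],
    "f_redundant": [2.4, 3.6, 6.8, 7.2, 10.6, 11.8, 14.2, 16.0],  # ~2 * f_strong
    "f_weak": [4, 1, 2, 7, 3, 8, 5, 6],
    "f_noise": [5, 1, 4, 2, 3, 4, 1, 5],
}
ranking = rank_features(columns, target)
subset = select_subset(columns, target)
print("ranking:", ranking)
print("subset: ", subset)
```

The ranker leaves the cut-off to the user and cannot see that `f_redundant` duplicates `f_strong`; the subset method makes that call itself, at the price of comparing features with each other. Both are filters: neither one trains a model, which is what a wrapper or an embedded method would do.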
  9. Feature selection is successful!

  10. If you want to know more about feature selection...

  12. Big Dimensionality [chart over the decades 1980s, 1990s, 2000s; values shown: 3,000,000, 1500, 100, 100]

  13. Big Dimensionality
    • > 29 million features, > 20 million samples
    • > 54 million features, > 149 million samples
  14. Scalability: In scaling up learning algorithms, the issue is not so much one of speeding up a slow algorithm, as one of turning an impracticable algorithm into a practical one. “Good enough” solutions, as “fast” as possible and as “efficiently” as possible.
  15. Scalability: Model complexity, Univariate vs Multivariate, Parameter tuning, Stability, Distributed

  16. Distributed feature selection
    • Data is, sometimes, distributed in origin
    • Privacy issues
    • Vertical or horizontal distribution?
    • Overlap between partitions?
    • How to aggregate partial results?
  17. Distributed feature selection. Arrow’s impossibility theorem: “When having at least two rankers (nodes), and at least three options to rank (features), it is impossible to design an aggregation function that satisfies in a strong way a set of desirable conditions at once”
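
As an illustration of the aggregation problem, the sketch below combines per-node rankings with a Borda-style count (sum of positions). This is only one possible aggregation function, chosen here for simplicity; the slides do not say which one was used, and by Arrow's theorem no choice satisfies every desirable condition. The node rankings are made up, and the sketch assumes a horizontal partition in which every node ranks the same features.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge per-node feature rankings (best first) into one global ranking.

    Each feature accumulates its position in every node's ranking;
    a lower total means a better overall rank.
    """
    totals = defaultdict(int)
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            totals[feature] += position
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(totals, key=lambda f: (totals[f], f))


# Hypothetical rankings produced independently on three data partitions.
node_rankings = [
    ["age", "mileage", "engine_size", "wiper_color"],
    ["engine_size", "age", "mileage", "wiper_color"],
    ["age", "engine_size", "wiper_color", "mileage"],
]
combined = borda_aggregate(node_rankings)
print(combined)
```

Only the rankings leave each node, not the raw samples, which is relevant when privacy is the reason the data is distributed in the first place.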
  18. Distributed feature selection: good enough solutions in terms of accuracy.
    Bolón-Canedo, Verónica, et al. "Exploring the consequences of distributed feature selection in DNA microarray data." In Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1665-1672 (2017).
  19. Parallel feature selection

  20. Parallel feature selection

  21. Real-time processing: Spam detection, Video/image detection, Portable devices, CAD systems

  22. Online feature selection
    • Pre-selecting features
    • No subsequent online classification
    • Classifiers not flexible with respect to input features
    • Find flexible feature selection methods capable of modifying the selected subset of features as new training samples arrive
    • Methods that can be executed in a dynamic feature space, initially empty, that would add features as new information arrives
  23. Online feature selection: Chi2, k-means, One-layer ANN
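
The slide only names its building blocks, so the following is not a reconstruction of the pipeline from the talk. It is a minimal sketch of one ingredient, assuming binary features and binary labels: a chi-square relevance score that is updated one sample at a time, so the feature ranking can change as new training samples arrive. The class name `OnlineChi2` is hypothetical.

```python
class OnlineChi2:
    """Running chi-square statistic for one binary feature vs a binary class."""

    def __init__(self):
        # counts[feature_value][class_label], both in {0, 1}
        self.counts = [[0, 0], [0, 0]]

    def update(self, feature_value: int, label: int) -> None:
        """Account for one new training sample."""
        self.counts[feature_value][label] += 1

    def score(self) -> float:
        """Chi-square statistic of the 2x2 contingency table seen so far."""
        n = sum(sum(row) for row in self.counts)
        row_totals = [sum(row) for row in self.counts]
        col_totals = [self.counts[0][j] + self.counts[1][j] for j in range(2)]
        chi2 = 0.0
        for i in range(2):
            for j in range(2):
                expected = row_totals[i] * col_totals[j] / n if n else 0.0
                if expected > 0:  # skip empty cells to avoid division by zero
                    chi2 += (self.counts[i][j] - expected) ** 2 / expected
        return chi2


informative = OnlineChi2()    # feature value always equals the label
uninformative = OnlineChi2()  # feature value is independent of the label
for _ in range(4):
    informative.update(1, 1)
    informative.update(0, 0)
for _ in range(2):
    for feature_value in (0, 1):
        for label in (0, 1):
            uninformative.update(feature_value, label)

print(informative.score(), uninformative.score())
```

Only the four counters are stored per feature, so the cost of an update does not depend on how many samples have been seen.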

  24. Feature cost

  25. Feature cost

  26. Feature cost: a real case. In tear film lipid layer classification, the time (cost) for extracting the features is not the same for every feature and should be minimized.
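
One simple way to account for feature cost, shown here as an assumption rather than as the method used in the talk, is to penalize each feature's relevance score by its cost, weighted by a trade-off parameter. The feature names, relevance values and costs below are invented for illustration, and `lam` is a hypothetical parameter name.

```python
def cost_aware_ranking(relevance, cost, lam):
    """Rank features by relevance minus lam * cost (higher is better).

    lam = 0 ignores cost entirely; larger values favour cheap features.
    """
    scores = {f: relevance[f] - lam * cost[f] for f in relevance}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Made-up relevance scores and normalized extraction costs (e.g. time).
relevance = {"cheap_good": 0.70, "costly_best": 0.80, "cheap_poor": 0.10}
cost = {"cheap_good": 0.1, "costly_best": 1.0, "cheap_poor": 0.1}

ignore_cost = cost_aware_ranking(relevance, cost, lam=0.0)
penalize_cost = cost_aware_ranking(relevance, cost, lam=0.5)
print(ignore_cost)
print(penalize_cost)
```

With the penalty switched on, the slightly less relevant but much cheaper feature moves to the top, which is the kind of trade-off a cost-sensitive selector has to make.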
  27. Visualization and interpretability
    Typical approach: feature extraction. Loss of interpretability!
    A model is only as good as its features, so features play a preponderant role in model interpretability.
    Two-fold need for interpretability and transparency in feature selection and model creation processes:
    • More interactive model visualizations to better interact with the model and visualize future scenarios
    • A more interactive feature selection process where, using interactive visualizations, it is possible to iterate through different feature subsets
  28. Visualization and interpretability
    Digital Diogenes Syndrome: organizations need to gather data in a meaningful way.
    Data-rich/Knowledge-poor, Data-rich/Knowledge-rich
    Krause, J., Perer, A., & Bertini, E. (2014). INFUSE: Interactive feature selection for predictive modeling of high dimensional data. IEEE Transactions on Visualization and Computer Graphics, 20(12), 1614-1623.
  29. What is big in Big Data? New opportunity to develop methods in computationally constrained platforms!
  30. Take home message
    1. If you have never considered applying feature selection to your problem, give it a try!
    2. If you are interested in feature selection, it is a prolific open line of research facing the new challenges that Big Data has brought.