
Feature selection for Big Data: advances and challenges by Verónica Bolón-Canedo at Big Data Spain 2017

In an era of growing data complexity and volume, and with the advent of Big Data, feature selection has a key role to play in reducing the high dimensionality of machine learning problems.

https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges

Big Data Spain 2017
November 16th-17th, Kinépolis Madrid

Big Data Spain

December 01, 2017


Transcript

1. Feature selection. "Feature selection is the process of selecting the relevant features and discarding the irrelevant and redundant ones." Note: not talking about feature extraction for dimensionality reduction! PCA, t-SNE, manifold learning? No, they lose the meaning of the original features.
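To make the distinction concrete, here is a minimal illustrative sketch (assuming scikit-learn and its bundled breast cancer dataset, not an example from the talk): selection keeps original, nameable features, while extraction (PCA) replaces them with combinations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

data = load_breast_cancer()
X, y = data.data, data.target

# Feature selection: the surviving columns are still the original, named features.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("Selected:", data.feature_names[selector.get_support()])

# Feature extraction: each PCA component mixes all 30 original features,
# so the output columns no longer correspond to interpretable quantities.
X_pca = PCA(n_components=5).fit_transform(X)
print("PCA output shape:", X_pca.shape)
```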
2. What is a relevant feature? Imagine that you are trying to guess the price of a car…
   • Relevant: engine size, age, mileage, presence of rust, ...
   • Irrelevant: color of windscreen wipers, stickers on windows, ...
   • Redundant: age / mileage
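To ground the car analogy, redundancy typically shows up as high pairwise correlation between features, and irrelevance as near-zero correlation with the target. A toy numpy sketch with synthetic data and hypothetical variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(0, 20, 500)                        # relevant
mileage = age * 12000 + rng.normal(0, 5000, 500)     # redundant with age
wiper_color = rng.integers(0, 3, 500).astype(float)  # irrelevant
price = 30000 - 1200 * age + rng.normal(0, 1000, 500)

# Redundancy: two features carrying (almost) the same information.
print(np.corrcoef(age, mileage)[0, 1])        # ~1.0: keep one, drop the other
# Irrelevance: no relationship with the quantity we want to predict.
print(np.corrcoef(wiper_color, price)[0, 1])  # ~0.0
```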
3. Why feature selection?
   • General data reduction: to limit storage requirements and increase algorithm speed
   • Feature set reduction: to save resources in the next round of data collection
   • Performance improvement: to gain in predictive accuracy
   • Data understanding: to gain knowledge about the process that generated the data, or for visualization
4. Feature selection methods: subset vs. ranker, filters vs. embedded vs. wrappers, univariate vs. multivariate. Sorry… there is no one-size-fits-all method!
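As a rough map of that taxonomy, here is one illustrative scikit-learn method per family (a sketch, not a recommendation; as the slide says, the right choice depends on the problem):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter (univariate): scores each feature independently of any classifier.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper (multivariate): repeatedly fits a model to search feature subsets.
wrap = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)

# Embedded: selection happens inside model training (here, L1 sparsity).
embed = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)

for name, sel in [("filter", filt), ("wrapper", wrap), ("embedded", embed)]:
    print(name, sel.get_support(indices=True))
```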
5. Big Dimensionality: > 29 million features, > 20 million samples, > 54 million features, > 149 million samples
6. Scalability. "In scaling up learning algorithms, the issue is not so much one of speeding up a slow algorithm, as one of turning an impracticable algorithm into a practical one." The goal: "good enough" solutions, obtained as fast and as efficiently as possible.
7. Distributed feature selection
   • Data is, sometimes, distributed in origin
   • Privacy issues
   • Vertical or horizontal distribution?
   • Overlap between partitions?
   • How to aggregate partial results?
8. Distributed feature selection. Arrow's impossibility theorem: "When having at least two rankers (nodes), and at least three options to rank (features), it is impossible to design an aggregation function that satisfies in a strong way a set of desirable conditions at once."
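In practice the theorem is sidestepped by accepting an imperfect aggregator. A minimal sketch of Borda-style mean-rank aggregation over an assumed horizontal (sample-wise) partitioning, with mutual information as the per-node ranker (illustrative, not the exact scheme from the cited work):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Horizontal distribution: each node holds a disjoint slice of the samples.
perm = np.random.default_rng(0).permutation(len(y))
nodes = np.array_split(perm, 4)

# Each node ranks all features locally (rank 0 = most relevant).
ranks = []
for idx in nodes:
    scores = mutual_info_classif(X[idx], y[idx], random_state=0)
    ranks.append(np.argsort(np.argsort(-scores)))

# Borda-style aggregation: average the per-node ranks, keep the top k.
mean_rank = np.mean(ranks, axis=0)
print("Aggregated top features:", np.argsort(mean_rank)[:5])
```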
9. Distributed feature selection: good enough solutions in terms of accuracy. Bolón-Canedo, V., et al. "Exploring the consequences of distributed feature selection in DNA microarray data." Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1665-1672, 2017.
10. Online feature selection. Pre-selecting features means there is no subsequent online classification, and classifiers are not flexible with respect to their input features. Goals: find flexible feature selection methods capable of modifying the selected subset of features as new training samples arrive, and methods that can be executed in a dynamic feature space, initially empty, that adds features as new information arrives.
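One way to picture such a method is a selector that keeps running per-class statistics and re-ranks features as each mini-batch arrives. A toy sketch (an illustrative online Fisher-score ranker for binary classes, not one of the methods surveyed in the talk):

```python
import numpy as np

class OnlineFisherSelector:
    """Toy streaming selector: accumulates per-class sums so the
    selected subset can change as new training samples arrive.
    Assumes binary labels (0/1) and that both classes have been seen."""

    def __init__(self, n_features, k):
        self.k = k
        self.n = np.zeros(2)
        self.s = np.zeros((2, n_features))    # running sums
        self.s2 = np.zeros((2, n_features))   # running sums of squares

    def partial_fit(self, X, y):
        for c in (0, 1):
            Xc = X[y == c]
            self.n[c] += len(Xc)
            self.s[c] += Xc.sum(axis=0)
            self.s2[c] += (Xc ** 2).sum(axis=0)

    def selected(self):
        mu = self.s / self.n[:, None]
        var = self.s2 / self.n[:, None] - mu ** 2
        fisher = (mu[0] - mu[1]) ** 2 / (var[0] + var[1] + 1e-12)
        return np.argsort(-fisher)[: self.k]
```

Streaming usage would be `partial_fit(X_batch, y_batch)` on each arriving batch, then `selected()` for the current top-k; handling a growing feature space would additionally mean resizing the statistics arrays as new features appear.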
11. Feature cost: a real case. In tear film lipid layer classification, the time (cost) of extracting the features is not the same for all of them, and it should be minimized.
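A simple way to encode this trade-off is to penalize each feature's relevance by its normalized acquisition cost; the `lam` weight below is a hypothetical knob, and this is just the shape of the idea, not the cost-based formulation used in the tear film work:

```python
import numpy as np

def cost_aware_ranking(relevance, cost, lam=0.5):
    """Rank features by relevance penalized by (normalized) acquisition cost."""
    return np.argsort(-(relevance - lam * cost))

relevance = np.array([0.9, 0.85, 0.4, 0.3])
cost = np.array([5.0, 0.5, 0.5, 4.0])  # e.g., extraction time in seconds

# The cheap, nearly-as-relevant feature 1 now outranks the expensive feature 0.
print(cost_aware_ranking(relevance, cost / cost.max(), lam=0.5))
```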
12. Visualization and interpretability. Typical approach: feature extraction. Loss of interpretability! A model is only as good as its features, so features play a preponderant role in model interpretability. There is a two-fold need for interpretability and transparency in the feature selection and model creation processes:
   • More interactive model visualizations, to better interact with the model and visualize future scenarios
   • A more interactive feature selection process where, using interactive visualizations, it is possible to iterate through different feature subsets
13. Visualization and interpretability. Digital Diogenes Syndrome: organizations need to gather data in a meaningful way, moving from data-rich/knowledge-poor to data-rich/knowledge-rich. Krause, J., Perer, A., & Bertini, E. (2014). INFUSE: interactive feature selection for predictive modeling of high dimensional data. IEEE Transactions on Visualization and Computer Graphics, 20(12), 1614-1623.
14. What is big in Big Data? A new opportunity to develop methods on computationally constrained platforms!
15. Take-home message. 1. If you have never considered applying feature selection to your problem, give it a try! 2. If you are interested in feature selection, it is a prolific open line of research facing the new challenges that Big Data has brought.