
Feature selection for Big Data: advances and challenges by Verónica Bolón-Canedo at Big Data Spain 2017

In an era of growing data complexity and volume, and with the advent of Big Data, feature selection has a key role to play in reducing the high dimensionality of machine learning problems.

https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges

Big Data Spain 2017
November 16th-17th, Kinépolis Madrid

Big Data Spain

December 01, 2017


Transcript

1. Feature selection. "Feature selection is the process of selecting the relevant features and discarding the irrelevant and redundant ones." Note: not talking about feature extraction for dimensionality reduction! PCA, t-SNE, manifold learning? No, they lose the meaning of the original features.
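To make the distinction concrete, here is a minimal illustrative sketch (assuming scikit-learn and its bundled breast cancer dataset, not an example from the talk): selection keeps original, nameable features, while extraction (PCA) replaces them with combinations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

data = load_breast_cancer()
X, y = data.data, data.target

# Feature selection: the surviving columns are still the original, named features.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("Selected:", data.feature_names[selector.get_support()])

# Feature extraction: each PCA component mixes all 30 original features,
# so the output columns no longer correspond to interpretable quantities.
X_pca = PCA(n_components=5).fit_transform(X)
print("PCA output shape:", X_pca.shape)
```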
2. What is a relevant feature? Imagine that you are trying to guess the price of a car…
   • Relevant: engine size, age, mileage, presence of rust, ...
   • Irrelevant: color of windscreen wipers, stickers on windows, ...
   • Redundant: age / mileage
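To ground the car analogy, redundancy typically shows up as high pairwise correlation between features, and irrelevance as near-zero correlation with the target. A toy numpy sketch with synthetic data and hypothetical variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
age = rng.uniform(0, 20, 500)                        # relevant
mileage = age * 12000 + rng.normal(0, 5000, 500)     # redundant with age
wiper_color = rng.integers(0, 3, 500).astype(float)  # irrelevant
price = 30000 - 1200 * age + rng.normal(0, 1000, 500)

# Redundancy: two features carrying (almost) the same information.
print(np.corrcoef(age, mileage)[0, 1])        # ~1.0: keep one, drop the other
# Irrelevance: no relationship with the quantity we want to predict.
print(np.corrcoef(wiper_color, price)[0, 1])  # ~0.0
```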
3. Why feature selection?
   • General data reduction: to limit storage requirements and increase algorithm speed
   • Feature set reduction: to save resources in the next round of data collection
   • Performance improvement: to gain in predictive accuracy
   • Data understanding: to gain knowledge about the process that generated the data, or for visualization
4. Feature selection methods: subset vs. ranker, filters vs. embedded vs. wrappers, univariate vs. multivariate. Sorry… there is no one-size-fits-all method!
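As a rough map of that taxonomy, here is one illustrative scikit-learn method per family (a sketch, not a recommendation; as the slide says, the right choice depends on the problem):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter (univariate): scores each feature independently of any classifier.
filt = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Wrapper (multivariate): repeatedly fits a model to search feature subsets.
wrap = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)

# Embedded: selection happens inside model training (here, L1 sparsity).
embed = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)

for name, sel in [("filter", filt), ("wrapper", wrap), ("embedded", embed)]:
    print(name, sel.get_support(indices=True))
```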
5. Big Dimensionality: > 29 million features, > 20 million samples, > 54 million features, > 149 million samples
6. Scalability. "In scaling up learning algorithms, the issue is not so much one of speeding up a slow algorithm, as one of turning an impracticable algorithm into a practical one." The goal: "good enough" solutions, obtained as fast and as efficiently as possible.
7. Distributed feature selection
   • Data is, sometimes, distributed in origin
   • Privacy issues
   • Vertical or horizontal distribution?
   • Overlap between partitions?
   • How to aggregate partial results?
8. Distributed feature selection. Arrow's impossibility theorem: "When having at least two rankers (nodes), and at least three options to rank (features), it is impossible to design an aggregation function that satisfies in a strong way a set of desirable conditions at once."
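In practice the theorem is sidestepped by accepting an imperfect aggregator. A minimal sketch of Borda-style mean-rank aggregation over an assumed horizontal (sample-wise) partitioning, with mutual information as the per-node ranker (illustrative, not the exact scheme from the cited work):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Horizontal distribution: each node holds a disjoint slice of the samples.
perm = np.random.default_rng(0).permutation(len(y))
nodes = np.array_split(perm, 4)

# Each node ranks all features locally (rank 0 = most relevant).
ranks = []
for idx in nodes:
    scores = mutual_info_classif(X[idx], y[idx], random_state=0)
    ranks.append(np.argsort(np.argsort(-scores)))

# Borda-style aggregation: average the per-node ranks, keep the top k.
mean_rank = np.mean(ranks, axis=0)
print("Aggregated top features:", np.argsort(mean_rank)[:5])
```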
9. Distributed feature selection: good enough solutions in terms of accuracy. Bolón-Canedo, V., et al. "Exploring the consequences of distributed feature selection in DNA microarray data." Proceedings of the International Joint Conference on Neural Networks (IJCNN), pp. 1665-1672, 2017.
10. Online feature selection. Pre-selecting features means there is no subsequent online classification, and classifiers are not flexible with respect to their input features. Goals: find flexible feature selection methods capable of modifying the selected subset of features as new training samples arrive, and methods that can be executed in a dynamic feature space, initially empty, that adds features as new information arrives.
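One way to picture such a method is a selector that keeps running per-class statistics and re-ranks features as each mini-batch arrives. A toy sketch (an illustrative online Fisher-score ranker for binary classes, not one of the methods surveyed in the talk):

```python
import numpy as np

class OnlineFisherSelector:
    """Toy streaming selector: accumulates per-class sums so the
    selected subset can change as new training samples arrive.
    Assumes binary labels (0/1) and that both classes have been seen."""

    def __init__(self, n_features, k):
        self.k = k
        self.n = np.zeros(2)
        self.s = np.zeros((2, n_features))    # running sums
        self.s2 = np.zeros((2, n_features))   # running sums of squares

    def partial_fit(self, X, y):
        for c in (0, 1):
            Xc = X[y == c]
            self.n[c] += len(Xc)
            self.s[c] += Xc.sum(axis=0)
            self.s2[c] += (Xc ** 2).sum(axis=0)

    def selected(self):
        mu = self.s / self.n[:, None]
        var = self.s2 / self.n[:, None] - mu ** 2
        fisher = (mu[0] - mu[1]) ** 2 / (var[0] + var[1] + 1e-12)
        return np.argsort(-fisher)[: self.k]
```

Streaming usage would be `partial_fit(X_batch, y_batch)` on each arriving batch, then `selected()` for the current top-k; handling a growing feature space would additionally mean resizing the statistics arrays as new features appear.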
11. Feature cost: a real case. In tear film lipid layer classification, the time (cost) of extracting the features is not the same for all of them, and it should be minimized.
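A simple way to encode this trade-off is to penalize each feature's relevance by its normalized acquisition cost; the `lam` weight below is a hypothetical knob, and this is just the shape of the idea, not the cost-based formulation used in the tear film work:

```python
import numpy as np

def cost_aware_ranking(relevance, cost, lam=0.5):
    """Rank features by relevance penalized by (normalized) acquisition cost."""
    return np.argsort(-(relevance - lam * cost))

relevance = np.array([0.9, 0.85, 0.4, 0.3])
cost = np.array([5.0, 0.5, 0.5, 4.0])  # e.g., extraction time in seconds

# The cheap, nearly-as-relevant feature 1 now outranks the expensive feature 0.
print(cost_aware_ranking(relevance, cost / cost.max(), lam=0.5))
```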
12. Visualization and interpretability. Typical approach: feature extraction. Loss of interpretability! A model is only as good as its features, so features play a preponderant role in model interpretability. There is a two-fold need for interpretability and transparency in the feature selection and model creation processes:
   • More interactive model visualizations, to better interact with the model and visualize future scenarios
   • A more interactive feature selection process where, using interactive visualizations, it is possible to iterate through different feature subsets
13. Visualization and interpretability. Digital Diogenes Syndrome: organizations need to gather data in a meaningful way, moving from data-rich/knowledge-poor to data-rich/knowledge-rich. Krause, J., Perer, A., & Bertini, E. (2014). INFUSE: interactive feature selection for predictive modeling of high dimensional data. IEEE Transactions on Visualization and Computer Graphics, 20(12), 1614-1623.
14. What is big in Big Data? A new opportunity to develop methods on computationally constrained platforms!
15. Take-home message. 1. If you have never considered applying feature selection to your problem, give it a try! 2. If you are interested in feature selection, it is a prolific open line of research facing the new challenges that Big Data has brought.