
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Big Data Spain 2015


Preprocessing data is one of the most effort-consuming tasks in Machine Learning (ML). In the Big Data context, the models automatically derived from data should be as simple, interpretable and fast as possible, and achieving that requires using the best variables, that is, the best features of the data.

Although several libraries already tackle ML tasks on Big Data, that is not yet the case for feature selection (FS) algorithms, nor for other preprocessing techniques such as discretization: the existing FS methods do not scale well when dealing with Big Data. In this presentation we show our efforts and new ideas for parallelizing standard FS methods for use in Big Data environments.

Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-11.html

Big Data Spain

October 22, 2015



Transcript

  1. BEGIN AT THE BEGINNING: FEATURE SELECTION FOR BIG DATA. AMPARO ALONSO-BETANZOS. BIG DATA SPAIN 2015, Madrid.
  2. Begin at the Beginning. "Begin at the beginning," the King said, very gravely, "and go on till you come to the end: then stop."
  3. The first step: preprocessing the data. Peter Norvig, Google Research Director.
  4. "Not everything that counts can be counted, and not everything that can be counted counts." Equality is not the way.
  5. Feature selection: basic flavors (filters).
     • Advantages: independence of the classifier; low computational cost; fast; good generalization ability.
     • Disadvantages: no interaction with the classifier.
     • Examples: CFS, consistency-based, INTERACT, ReliefF, FCBF, InfoGain, mRMR.
  6. Basic shapes of filters. Filters come in several flavors: subset filters vs. ranker filters, and univariate vs. multivariate methods. Feature selection techniques do not scale well with Big Data.
  7. Distributed feature selection: scaling up FS. Allocating the learning process among several workstations is a natural way of scaling up learning algorithms. Advantages: reduction in execution time, resource sharing, and better performance.
  8. MLlib, why? It is built on Apache Spark, a fast and general engine for large-scale data processing. Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk; runs on Hadoop 2 clusters; and lets you write applications quickly in Java, Scala, or Python.
  9. Implementing FS based on an IT framework. We implemented a generic FS framework for Big Data based on Information Theory, built around three terms: relevance, redundancy, and conditional redundancy. Reference: Brown G, Pocock A, Zhao MJ, Luján M (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27-66.
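     For reference, the unifying criterion of Brown et al. scores a candidate feature X_k against the already-selected set S as relevance minus redundancy plus conditional redundancy:

         J(X_k) = I(X_k; Y) - \beta \sum_{X_j \in S} I(X_k; X_j) + \gamma \sum_{X_j \in S} I(X_k; X_j \mid Y)

     Particular choices of \beta and \gamma recover well-known criteria: \beta = 1/|S|, \gamma = 0 gives mRMR, and \beta = \gamma = 1/|S| gives JMI.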
  10. The long and winding road: discretization is needed! Numerical attributes must be transformed into discrete or nominal attributes with a finite number of intervals.
  11. The algorithm: MDLP. Proposal: a complete re-design of the MDLP (Minimum Description Length Principle) discretization method. Sort all points in the dataset in a single distributed operation using a Spark primitive, then evaluate the boundary points (per feature) in parallel, as sketched below.
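     A minimal PySpark sketch of those two steps, with illustrative names (this is not the authors' implementation):

         from pyspark import SparkContext

         sc = SparkContext(appName="mdlp-sketch")

         # Toy RDD of (feature_vector, class_label) pairs.
         points = sc.parallelize([([5.1, 3.5], 0), ([6.2, 2.9], 1), ([4.8, 3.0], 0)])

         # Flatten to ((feature_index, value), label) and sort once with a
         # single distributed primitive, so each feature's values are in order.
         by_feature = points.flatMap(
             lambda p: [((j, v), p[1]) for j, v in enumerate(p[0])]
         ).sortByKey()

         # A boundary point lies between consecutive values of the same feature
         # whose class labels differ; MDLP only evaluates these candidate cuts.
         # (Boundaries straddling partition edges are ignored for brevity.)
         def boundaries(part):
             prev = None
             for (j, v), label in part:
                 if prev is not None and prev[0] == j and prev[2] != label:
                     yield (j, (prev[1] + v) / 2.0)
                 prev = (j, v, label)

         candidate_cuts = by_feature.mapPartitions(boundaries)
         print(candidate_cuts.collect())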
  12. Re-design of the FS framework for Spark. The complexity of the framework is determined by the computations of relevance and redundancy. Proposal: a complete re-design of Brown's framework. Columnar transformation: the access pattern of most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark.
  13. Re-design of the FS framework for Spark (continued).
      • Columnar transformation: the access pattern of most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark (see the sketch after this list).
      • Caching variables: relevance is computed and cached once at the start; the marginal and joint proportions derived from these operations are also cached, and this information is replicated.
      • Greedy approach: only one feature is selected in each iteration, so the quadratic complexity becomes a complexity determined by the number of features selected.
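     A minimal sketch (not the authors' code) of the columnar transformation: turn a row-wise RDD into a feature-wise, cached one, so each relevance/redundancy computation reads a single column instead of scanning every row. It reuses the SparkContext sc from the sketch above:

         # Toy RDD of discretized rows.
         rows = sc.parallelize([[1, 0, 2], [0, 1, 2], [1, 1, 0]])

         columns = (rows.zipWithIndex()                 # keep row ids for alignment
             .flatMap(lambda rv: [(j, (rv[1], x)) for j, x in enumerate(rv[0])])
             .groupByKey()                              # gather one feature per key
             .mapValues(lambda vs: [x for _, x in sorted(vs)])  # restore row order
             .cache())                                  # reused on every greedy step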
  14. Other attempts. Parallel implementation of mRMR on GPU: https://github.com/sramirez/fast-mRMR. Implementations of other FS algorithms: ReliefF, CFS, SVM-RFE (working on the scalability studies).
  15. Distributed Feature Selection (DFS). Data can be located at different sites: different parts of a company, or different cooperating organizations. A very large dataset can also be distributed over several processors and the partial results then combined. DFS goal: to reduce the computational time while maintaining the classification performance.
  16. Discretization: how does it work?
      • Parameters: 50 intervals and 100,000 max candidates per partition.
      • Classifier: Naive Bayes from MLlib, lambda = 1, iterations = 100.
      • Hardware: 16 nodes (12 cores per node), 64 GB RAM.
      • Software: Hadoop 2.5 and Apache Spark 1.2.0.
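     As a rough illustration of that classifier setup with MLlib's RDD-based API (the path and split are placeholders; sc is a SparkContext):

         from pyspark.mllib.classification import NaiveBayes
         from pyspark.mllib.util import MLUtils

         # Load a (discretized) dataset in LibSVM format and train Naive Bayes
         # with the additive-smoothing parameter lambda = 1, as on the slide.
         data = MLUtils.loadLibSVMFile(sc, "data/discretized.libsvm")
         train, test = data.randomSplit([0.8, 0.2], seed=17)
         model = NaiveBayes.train(train, lambda_=1.0)

         # Accuracy on the held-out split.
         pairs = test.map(lambda p: (model.predict(p.features), p.label))
         accuracy = pairs.filter(lambda t: t[0] == t[1]).count() / float(test.count())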
  17. Feature selection: experimental results (datasets).
      • Parameters: FS algorithm = mRMR, level of parallelism = 864 partitions.
      • Classifiers: Naive Bayes and SVM (default parameters), from MLlib.
      • Hardware: 18 nodes (12 cores per node), 64 GB RAM.
      • Software: Hadoop 2.5 and Apache Spark 1.2.0.
  18. CPU vs CUDA. [Timing plots comparing a low number of possible values (< 64) with a high number of possible values (up to 256).]
  19. GPU: real datasets.
      Dataset  | Patterns   | Features | Values
      KDDCup99 | 4,000,000  | 41       | 255
      Higgs    | 11,000,000 | 21       | 255
  20. Experimental framework.
      • Subset filters: CFS (Correlation-based Feature Selection), INTERACT, consistency-based filter.
      • Ranker filters: InfoGain, ReliefF.
      • Classifiers: C4.5, Naïve Bayes, IB1, SVM.
  21. Improving the method: a new approach. Horizontal partitioning of the datasets, maintaining the class distribution in each partition (see the sketch below); application of the filter to each partition; and combination of the results, with a merging procedure based on the theoretical complexity of the feature subsets.
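     A simple sketch of stratified (class-preserving) horizontal partitioning; the function name and interface are illustrative:

         import random
         from collections import defaultdict

         def stratified_partitions(labels, k, seed=0):
             """Split sample indices into k partitions whose class
             distributions mirror the full dataset's."""
             rng = random.Random(seed)
             by_class = defaultdict(list)
             for i, y in enumerate(labels):
                 by_class[y].append(i)
             parts = [[] for _ in range(k)]
             for idxs in by_class.values():
                 rng.shuffle(idxs)
                 for pos, i in enumerate(idxs):
                     parts[pos % k].append(i)
             return parts

         # Example: 4 partitions of a 60/40 binary dataset stay roughly 60/40.
         labels = [0] * 60 + [1] * 40
         print([sum(labels[i] for i in p) for p in stratified_partitions(labels, 4)])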
  22. The complexity measurement. Calculate the complexity of each candidate subset of features with the Fisher discriminant ratio, where μ_i, σ_i^2 and p_i are the mean, variance and proportion of the i-th class. The measure is independent of the classifier and improves the running time.
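     The slide does not show the formula itself; one common multiclass form of the Fisher discriminant ratio consistent with that legend is (an assumption here; the exact variant used in the talk may differ):

         f = \frac{\sum_{i=1}^{C} \sum_{j=i+1}^{C} p_i \, p_j \, (\mu_i - \mu_j)^2}{\sum_{i=1}^{C} p_i \, \sigma_i^2}

     Larger f means the classes are easier to separate under the candidate subset, so subsets can be ranked without training a classifier.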
  23. Number of selected features.
                    Connect 4 | Isolet | Madelon | Ozone | Spambase | Mnist
      Full set         42     |  617   |   500   |  72   |    57    |  717
      Centralized       7     |  186   |    18   |  20   |    19    |   61
      Distrib-Comp      8     |  105   |     9   |   8   |    18    |   77
  24. Experimental results: microarray datasets. DNA microarray data is a good candidate for vertically distributed feature selection, since the data needs to be split by features; this type of data usually has redundant features. (A sketch of vertical partitioning follows.)
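     A simple sketch of vertical partitioning (the function is illustrative): split the feature indices into disjoint groups so each node runs the filter on its own slice of features.

         def vertical_partitions(n_features, k):
             """Split feature indices into k disjoint, interleaved groups."""
             return [list(range(i, n_features, k)) for i in range(k)]

         # Example: 617 Isolet features over 8 nodes -> groups of ~77 features.
         print([len(g) for g in vertical_partitions(617, 8)])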
  25. Vertical partitioning with the complexity measure: using the complexity measurement instead of accuracy.
      Dataset | Features | Training | Test   | Classes
      Isolet  |   617    |   6,238  |  1,236 |   26
      Madelon |   500    |   1,600  |    800 |    2
      Mnist   |   717    |  40,000  | 20,000 |    2
  26. Complexity measure: running time. Average speedup per dataset: Isolet 2318.45, Madelon 26.13, MNIST 1483.80. Overall average: 573.4337.
  27. GPU implementation of FS. Parallel computing paradigm: the CUDA platform on an NVIDIA GTX 780 Ti. IT-based algorithm: mRMR (minimum Redundancy Maximum Relevance), in a CPU-optimized version (fast-mRMR) and a GPU version.
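     In the notation of the framework on slide 9, mRMR greedily picks, at each step, the feature that maximizes relevance minus average redundancy (β = 1/|S|, γ = 0):

         J_{mRMR}(X_k) = I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} I(X_k; X_j)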
  28. Previous ideas. GPUs compute image histograms efficiently, and the same pattern accelerates the computation of mutual information (MI) on the GPU; image processing likewise works with up to 256 values. Pipeline: input data -> threads -> partial histograms -> final results.
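     A rough CPU reference (NumPy, not the CUDA kernels) for histogram-based mutual information over byte-valued features, the quantity each GPU histogram feeds:

         import numpy as np

         def mutual_info(x, y, bins=256):
             # Joint histogram of two byte-valued vectors, then
             # MI = sum of p(x,y) * log(p(x,y) / (p(x) p(y))) over non-zero cells.
             joint = np.histogram2d(x, y, bins=bins, range=[[0, bins], [0, bins]])[0]
             pxy = joint / joint.sum()
             px = pxy.sum(axis=1, keepdims=True)
             py = pxy.sum(axis=0, keepdims=True)
             nz = pxy > 0
             return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())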
  29. Data access pattern. Reorder the data, and discretize each variable so that a value fits in an 8-bit number (max 256 different values per feature).
  30. Scalability: CPU optimization (I). [Plots of time (s) vs. number of patterns and vs. maximum possible values per feature.]
  31. CPU vs CUDA. [Timing plots comparing a low number of possible values (< 64) with a high number of possible values (up to 256).]
  32. GPU: real datasets.
      Dataset  | Patterns   | Features | Values
      KDDCup99 | 4,000,000  | 41       | 255
      Higgs    | 11,000,000 | 21       | 255
  33. References.
      • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Distributed feature selection: an application to microarray data classification. Applied Soft Computing 30:136-150, 2015. DOI: 10.1016/j.asoc.2015.01.035.
      • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Recent advances and emerging challenges of feature selection in the context of Big Data. Knowledge-Based Systems, 2015. DOI: 10.1016/j.knosys.2015.05.014.
      • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Feature Selection for High-Dimensional Data. Springer-Verlag, 2015 (in production).