• Big data is a problem
  - What does big data mean for the future of machine learning?
  - Scalability is extremely important!
• How can we help the life sciences?
  - What is it and how can we reduce its effects?
• Where are the open problems?
  - Where are the applications?
  - What are the fundamental challenges we still face?
Image: http://venturebeat.files.wordpress.com/2012/01/big-data.jpg
• Who is there? • What are they doing?
Example taxonomic assignments:
  Bacteria,Planctomycetes,Phycisphaerae,WD2101
  Bacteria,Firmicutes,Clostridia,Clostridiales,Lachnospiraceae
  Bacteria,Bacteroidetes,Saprospirae,Saprospirales,Chitinophagaceae
  Bacteria,Firmicutes,Clostridia,Clostridiales,Lachnospiraceae,Lachnospira
  Bacteria,Proteobacteria,Alphaproteobacteria,Sphingomonadales,Sphingomonadaceae,Kaistobacter
  Bacteria,Bacteroidetes,Flavobacteriia,Flavobacteriales,Flavobacteriaceae,Flavobacterium
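The lineages above are comma-separated, rank-ordered strings. As a minimal sketch (the rank names and helper below are hypothetical, not part of any tool mentioned here), such a string can be split into named taxonomic levels:

# Hypothetical helper: split a rank-ordered lineage string into named levels.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def parse_lineage(lineage):
    """Map a comma-separated lineage onto taxonomic ranks (missing ranks -> None)."""
    levels = [t.strip() for t in lineage.split(",") if t.strip()]
    return {rank: (levels[i] if i < len(levels) else None)
            for i, rank in enumerate(RANKS)}

print(parse_lineage("Bacteria,Firmicutes,Clostridia,Clostridiales,Lachnospiraceae,Lachnospira"))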
Data are growing rapidly in the internet era
• What does it mean for data to be “big”?
• The Five V’s: volume, velocity, variety, veracity, and value
  - volume: not only the number of samples, but also the dimensionality
  - velocity: data arrive in a stream
  - value: not of cost, but of importance
• A lot of data are being generated in today’s technological climate
  - Pro: some data are useful
  - Con: some data are not useful
• Twenty years of machine learning research has produced a wide body of work on detecting value
  - the “volume” of twenty years ago is not the volume of today
  - we want algorithms that handle distributed data, are easily implemented in parallel, and are statistically sound
Data sets are composed of a collection of variables (features) and at least one dependent variable (class)
• Selecting only the most relevant features not only leads to a model of lower complexity, but also aids knowledge discovery
  - Applications: life sciences, clinical tests, and detecting relationships in social networks
  - Large data complicate things! We need improvements to algorithms even for selection tasks that may seem trivial.
• How many variables should be selected? It’s combinatorial!
Wrapper Methods
• Build a classifier, measure loss, adapt the feature set, repeat…
• Easy to overfit
• Too computationally complex
Embedded Methods
• Jointly optimize classifier and variable-selector parameters
• E.g., a linear model with L1 penalization
Filter Methods
• Optimize the feature set independently of a classifier
• Fast, but we need ways to scale them to big data (see the sketch below)
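To make the filter/embedded distinction concrete, here is a small sketch on synthetic data using scikit-learn; the dataset, the choice of k, and the L1 strength C are illustrative assumptions, not values from this work:

# Sketch contrasting a filter and an embedded selector on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)

# Filter: rank features by mutual information, independent of any classifier.
filt = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("filter picks:", sorted(filt.get_support(indices=True)))

# Embedded: an L1-penalized linear model selects features while it is fit.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("embedded picks:", sorted(i for i, w in enumerate(lasso.coef_[0]) if w != 0))

A wrapper would instead retrain and score the classifier for each candidate feature set, which is the source of its computational cost and its tendency to overfit.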
NPFS scales a generic filter subset-selection algorithm to large data, while:
  - detecting the relevant set size from an initial condition
  - remaining independent of what a user feels is “relevant”
  - working with the decisions of a base selection algorithm
• Scalability is important, and NPFS models parallelism from a programmatic perspective
  - Nicely fits into a MapReduce approach to parallelism
  - How many parallel tasks does NPFS allow? How many slots are available?
• Concept: generate bootstrap data sets, perform feature selection on each, then reduce the importance detections to a Neyman-Pearson hypothesis test (see the sketch below)
G. Ditzler et al., “A bootstrap based Neyman-Pearson test for identifying variable importance,” IEEE Transactions on Neural Networks and Learning Systems, 2014.
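A minimal sketch of that bootstrap-and-test concept, assuming scikit-learn’s mutual-information filter as the base selector (function and parameter names are illustrative; this is not the published NPFS implementation):

# Bootstrap feature selection reduced to a binomial (Neyman-Pearson style) test.
import numpy as np
from scipy.stats import binom
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def npfs_sketch(X, y, k=5, n_bootstraps=50, alpha=0.01, rng=np.random.default_rng(0)):
    n, K = X.shape
    detections = np.zeros(K)
    for _ in range(n_bootstraps):
        idx = rng.integers(0, n, size=n)                 # bootstrap sample (map step)
        sel = SelectKBest(mutual_info_classif, k=k).fit(X[idx], y[idx])
        detections[sel.get_support(indices=True)] += 1   # count each feature's selections
    # Under H0 a feature is picked by chance with probability k/K; reject H0 when
    # its detection count exceeds the binomial critical value (reduce step).
    crit = binom.ppf(1 - alpha, n_bootstraps, k / K)
    return np.where(detections > crit)[0]

Each bootstrap run is independent, which is what makes the map step embarrassingly parallel; only the per-feature detection counts need to be reduced into the hypothesis test.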
NPFS (and most FS algorithms) need to consider the entire feature space
  - from a software perspective, NPFS could be problematic if the feature set size is too large
  - what if features are missing from the data?
• SLSS uses bandit learning algorithms to sequentially learn a variable’s importance by considering subsets of the space (see the sketch below)
  - Bandits: UCB1, Exp3, epsilon-greedy, & Thompson sampling
  - holds a distribution over the features to represent importance
• SLSS can be scaled to massive data using the bag of little bootstraps
G. Ditzler et al., “Sequential learning and ranking of variable importance,” in preparation, 2014.
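As an illustration of the bandit idea (not the SLSS algorithm itself), a UCB1-style sketch can treat each feature as an arm and use a relevance score on a small subsample as the reward; the reward choice, subsample size, and round count are assumptions:

# UCB1-style ranking of feature importance; the reward is a mutual-information
# score on a random subsample, which is only one possible choice.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def ucb1_feature_ranking(X, y, n_rounds=200, subsample=100, rng=np.random.default_rng(0)):
    n, K = X.shape
    pulls = np.zeros(K)
    means = np.zeros(K)
    for t in range(1, n_rounds + 1):
        # UCB1 index: exploit the current mean, explore rarely pulled features.
        ucb = means + np.sqrt(2.0 * np.log(t) / np.maximum(pulls, 1e-12))
        arm = int(np.argmax(ucb)) if t > K else t - 1    # pull every arm once first
        idx = rng.choice(n, size=min(subsample, n), replace=False)
        reward = mutual_info_classif(X[idx, arm:arm + 1], y[idx])[0]
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]
    return np.argsort(-means)    # features ranked by estimated importance

Because each round touches only one feature on a small subsample, the per-step cost stays bounded even when the full feature space is too large to score at once.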
• … fatty acid transport that is associated with obesity and insulin-resistant states
• ATPases catalyze dephosphorylation reactions to release energy
• Glycosyl transferase is hypothesized to result in recruitment of bacteria to the gut mucosa and increased inflammation
G. Ditzler et al., “Feature selection for metagenomic data analysis,” Encyclopedia of Metagenomics, 2014.
G. Ditzler et al., “Fizzy: Feature Selection for Metagenomics,” in preparation, 2014.
Scalability is of Utmost Importance
  - NPFS & SLSS are scalable to large data sets, and have served the microbial ecology community through open-source implementations
  - The life sciences are generating a tremendous amount of data and need help from the computational sciences for analysis
Future Work
• Advanced Frameworks for Big Data Subset Selection
  - IEEE Computational Intelligence Magazine (CIM) recently published a special issue on big data and the curse of big dimensionality
  - Millions of features & beyond, while being mindful of veracity
  - Migrating from a batch-based learning setting to a purely online setting with non-linear selection techniques
• Calibrated Prediction for Domain Adaptation in Time-Series
  - Tuning model parameters in changing domains using unlabeled data (inspired by Ditzler et al., 2014-IJCNN), and assessing the stability of predictions in uncertain environments
G. Ditzler et al., “Domain adaptation bounds for multiple expert systems under concept drift,” in International Joint Conference on Neural Networks, Beijing, China, 2014. (Best Paper Award)
Reichenberger (Drexel) • Diamantino Caseiro (Google) • Robi Polikar (Rowan) • Steve Essinger (Pandora) • Yemin Lan (Drexel) • Steve Pastor (Drexel) • Steve Woloszynek (Drexel)