The focus of this course is on enriching the available own data (i.e., data owned by the organization) with external repositories (with special attention to Open Data), in order to gain insights into the organization's business domain. Source: Oscar Romero, Open Data, MIRI-FIB-UPC
Data mining: extracting interesting and useful patterns and relationships from data, drawing on statistics and AI (neural networks and machine learning). Source: Oscar Romero, Open Data, MIRI-FIB-UPC. Process diagram: CRISP-DM, the Cross-Industry Standard Process for Data Mining. Source: Wikipedia
•Frameworks: skeletons whose behavior can be modified by user-written code -> giving you the pieces to build your own tools (IKEA style) -> capability to build. •Libraries: implemented algorithms -> giving you a tool that you can use; you still need to know how the tool works -> capability to use.
Languages 1.0 — storing and processing on distributed systems: the open-source Hadoop project, a distributed framework. Written in Java. Scalable, but you have to think in MapReduce. Source: Doug Cutting
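To make "think in MapReduce" concrete, here is a minimal sketch of the classic word count as a Hadoop Streaming job in Python. The file names and the job invocation are illustrative; Hadoop Streaming simply pipes input splits through the mapper, sorts by key, and pipes the sorted stream through the reducer.

#!/usr/bin/env python
# mapper.py -- emits (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py -- Hadoop delivers lines sorted by key, so equal words are adjacent
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))

A typical (illustrative) launch: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input in/ -output out/ — note how even this trivial problem must be reshaped into map and reduce phases.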
[Diagram: Languages 2.0, distributed systems — HPC/HPDA; Streams: SAMOA; Cluster mgmt.: YARN; Complex DAG: Tez; Applications/data flow: Cascalog, Lingual, Pattern, MLlib; Numba/Blaze]
Summary: what are the problems with plain MapReduce? • Fit your problem into MapReduce • Java • Optimization • Performance • Moving data around
•Fit your problem into MapReduce: build frameworks and tools on top of MapReduce — Hive (HQL), Pig (ETL/data flow/scripts), Cascading (data flow) — and DM/ML libraries on top of MapReduce — Mahout, Pattern.
•Java: Hadoop Streaming; R (RHadoop/RSpark) and Python (PySpark); Scala (Scalding, Spark) and Clojure (Cascalog).
•Optimization: application frameworks that allow a complex directed-acyclic graph of tasks for processing data — Tez, Spark.
•Performance: interactivity and iteration — Spark; streaming and real-time — Storm, Spark Streaming.
•Moving data around: bring the algorithms to the DB — MADlib.
Next section: 2. Frameworks and Libraries -> present each technology. 3. Scalability of modeling algorithms -> compare approaches across libraries.
2.1 Mahout: a suite of scalable machine learning libraries under the Apache Software License. •Hides the underlying MapReduce processes from the user. •Provides implemented algorithms. *From April 25th, 2014: "The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them." Why? Spark. Source: http://mahout.apache.org/ Contains: user- and item-based recommenders, matrix-factorization-based recommenders, K-Means and Fuzzy K-Means clustering, Latent Dirichlet Allocation, Singular Value Decomposition, logistic regression classifier, (complementary) naive Bayes classifier, random forest classifier.
2.1 Mahout — Example: compute the similarity between users as the correlation coefficient between their interactions, then define which similar users we want to leverage for the recommender. A plain-Python sketch of the same idea follows below.
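Mahout's own recommender API is Java; purely to illustrate the idea (Pearson correlation between users, then a prediction from the most similar ones), here is a toy NumPy version. The ratings matrix and the neighbourhood size k are made up.

# Sketch of a user-based recommender: Pearson correlation between users'
# interaction vectors, then a prediction from the k most similar users.
import numpy as np

ratings = np.array([[5, 3, 0, 1],    # toy user-item matrix (0 = unrated)
                    [4, 0, 0, 1],
                    [1, 1, 0, 5],
                    [0, 1, 5, 4]], dtype=float)

def pearson(u, v):
    mask = (u > 0) & (v > 0)          # compare only co-rated items
    if mask.sum() < 2:
        return 0.0
    c = np.corrcoef(u[mask], v[mask])[0, 1]
    return 0.0 if np.isnan(c) else c

def predict(user, item, k=2):
    sims = [(pearson(ratings[user], ratings[v]), v)
            for v in range(len(ratings)) if v != user and ratings[v, item] > 0]
    top = sorted(sims, reverse=True)[:k]   # the k most similar users
    num = sum(s * ratings[v, item] for s, v in top)
    den = sum(abs(s) for s, v in top)
    return num / den if den else 0.0

print(predict(user=0, item=2))        # predicted rating for an unrated item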
Cascading: a software abstraction layer for Apache Hadoop intended to let developers write their data applications once and then deploy them on any big data infrastructure. In Cascading you define workflows: pipes of data routed through familiar operators such as GroupBy, Count, Filter, Merge, Each (map), Every (aggregate), CoGroup (MapReduce join), HashJoin (local join), etc. In Cascading vocabulary: "Build a Cascade from Flows which connect Taps via Pipes built into Assemblies to process Tuples". A workflow can be visualized as a flow diagram. "Scalding the Crunchy Pig for Cascading into the Hive": http://thewit.ch/scalding_crunchy_pig
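Cascading itself is a Java API; purely as an illustration of the "tuples flowing through pipes" model, here is a toy Python pipeline with Each- and GroupBy/Count-like stages (all names and data invented).

# Conceptual only: each stage consumes and yields tuples, like Cascading pipes.
from itertools import groupby

def each(tuples, fn):                 # analogous to Cascading's Each (map)
    for t in tuples:
        yield fn(t)

def group_by_count(tuples, key):      # analogous to GroupBy followed by Count
    for k, grp in groupby(sorted(tuples, key=key), key=key):
        yield (k, sum(1 for _ in grp))

source = [("ana", 3), ("bob", 1), ("ana", 7)]          # a "Tap"
pipe = each(source, lambda t: (t[0].upper(), t[1]))    # transform each tuple
for row in group_by_count(pipe, key=lambda t: t[0]):   # the sink
    print(row)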
Pattern — Predictive Model Markup Language (PMML). •Imports model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc. •An XML-based file format developed by the Data Mining Group. •Implemented algorithms: Random Forest, Linear Regression, Hierarchical Clustering, K-Means Clustering, Logistic Regression. Example: Linear Regression (package cascading.pattern.pmml.iris). *Popular open-source projects have been built on top of Cascading, such as Scalding (Scala) and Cascalog (Clojure), which include significant machine learning libraries.
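To make the PMML idea concrete, here is a hand-written, minimal linear-regression PMML fragment and a Python snippet that scores one record from it. This is not a file shipped by Pattern; real Pattern deployments would evaluate the model inside a Cascading flow.

# Parse a tiny PMML RegressionModel and score one record with it.
import xml.etree.ElementTree as ET

PMML = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="sepal_length" coefficient="0.8"/>
      <NumericPredictor name="sepal_width" coefficient="-0.4"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

table = ET.fromstring(PMML).find(".//RegressionTable")
record = {"sepal_length": 5.1, "sepal_width": 3.5}

score = float(table.get("intercept"))
for p in table.findall("NumericPredictor"):
    score += float(p.get("coefficient")) * record[p.get("name")]
print(score)   # intercept + sum(coefficient * value)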
MADlib — bring the modeling algorithms to the databases. MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods. Source: http://doc.madlib.net/latest/group__grp__linreg.html
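A hedged sketch of the in-database style, assuming a PostgreSQL database with MADlib installed and a table houses(price, tax, bedroom) — the table, columns, and connection string are illustrative, and the training call follows the linear-regression page cited above.

# Train and apply a linear regression *inside* the database:
# the data never leaves PostgreSQL.
import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

# madlib.linregr_train(source_table, out_table, dependent, independent)
cur.execute("""
    SELECT madlib.linregr_train(
        'houses', 'houses_model', 'price', 'ARRAY[1, tax, bedroom]')
""")

cur.execute("SELECT coef FROM houses_model")   # fitted coefficients
print(cur.fetchone())
conn.commit()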
Spark™ is a fast and general engine for large-scale data processing. Characteristics: -Speed: DAG execution engine and in-memory computing; run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. -Ease of use: write applications quickly in Java, Scala or Python. -Generality: combine SQL, streaming, and ML. -Integrated with Hadoop: HDFS, YARN or the Mesos cluster manager; can read any existing Hadoop data. Sources: http://databricks.com/spark/ , http://spark.apache.org/
RDDs (Resilient Distributed Datasets): a distributed memory abstraction for performing in-memory computations on large clusters in a fault-tolerant manner, by logging the transformations used to build a dataset (its lineage) rather than the actual data*. An RDD is read-only and can be created from: -data in stable storage (e.g. HDFS); -other RDDs, via deterministic operations called transformations (map, filter, join). RDDs do not need to be materialized at all times. *It can be helpful to checkpoint some RDDs to stable storage. Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
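A small PySpark sketch (1.x-era API) of these ideas: an RDD built from stable storage, derived RDDs created by transformations (growing the lineage), and caching. The input path is illustrative.

from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

lines  = sc.textFile("log.txt")                # RDD from stable storage (path illustrative)
errors = lines.filter(lambda l: "ERROR" in l)  # transformation: nothing runs yet
words  = errors.flatMap(lambda l: l.split())   # another derived RDD (lineage grows)
words.cache()                                  # keep in memory across reuses

print(words.count())                           # action: triggers the whole lineage
print(words.count())                           # second pass is served from memory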
The scheduler performs a topological sort to determine the execution sequence of the DAG, tracing all the way back to the source nodes or to nodes that represent a cached RDD. Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
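To see what that topological sort does, here is a toy DAG of RDD dependencies sorted into an execution order with Kahn's algorithm — plain Python, purely illustrative; Spark's scheduler is of course far more involved.

# Each entry lists the RDDs a node depends on; the sort yields an order in
# which every RDD is computed only after all of its parents.
from collections import deque

deps = {"lines": [], "errors": ["lines"], "pairs": ["errors"],
        "counts": ["pairs"], "joined": ["counts", "lines"]}

indeg = {n: len(p) for n, p in deps.items()}
children = {n: [] for n in deps}
for n, ps in deps.items():
    for p in ps:
        children[p].append(n)

queue = deque(n for n, d in indeg.items() if d == 0)
order = []
while queue:
    n = queue.popleft()
    order.append(n)
    for c in children[n]:
        indeg[c] -= 1
        if indeg[c] == 0:
            queue.append(c)
print(order)   # ['lines', 'errors', 'pairs', 'counts', 'joined']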
Why is Spark suitable for mining large data sets? -Iteration. -Interactivity. Many machine learning algorithms are iterative in nature because they run iterative optimization procedures; they can thus run much faster by keeping their data in memory. Spark supports two types of shared variables: •broadcast variables: cache a value in memory on all nodes; •accumulators: variables that are only "added" to, such as counters and sums. Users can control two other aspects of RDDs: •persistence: indicate which RDDs will be reused and choose a storage strategy for them (e.g., in-memory storage); •partitioning: ask that an RDD's elements be partitioned across machines based on a key in each record. Controlling how different RDDs are co-partitioned (with the same keys) across machines can reduce inter-machine data shuffling within a cluster. A short sketch of these shared variables follows below.
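A minimal PySpark sketch of broadcast variables, accumulators, and persistence together (the lookup table and data are invented).

from pyspark import SparkContext

sc = SparkContext("local", "shared-vars")

lookup = sc.broadcast({"a": 1, "b": 2})     # read-only value cached on all nodes
bad = sc.accumulator(0)                      # workers only add; driver reads

def score(x):
    if x not in lookup.value:
        bad.add(1)                           # count unknown keys across workers
        return 0
    return lookup.value[x]

rdd = sc.parallelize(["a", "b", "c", "a"]).map(score)
rdd.persist()                                # mark for in-memory reuse
print(rdd.sum(), bad.value)                  # action runs the job, then read counter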
Supervised learning: data (labeled, with known true values) is split into learning data (training + validation) and testing data. A model (e.g. linear regression, SVM, ..., plus a loss function) is fit on the learning data to find the best parameters, yielding a learning error and a testing error; the best model is then applied to new, unlabeled data to produce predictions.
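A compact NumPy sketch of that loop — fit on learning data, measure error on held-out testing data, predict on a new point (all numbers synthetic).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=100)

X_train, X_test = X[:80], X[80:]     # learning vs. testing data
y_train, y_test = y[:80], y[80:]

# least-squares fit = linear regression with squared loss
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_err = np.mean((X_train @ w - y_train) ** 2)   # learning error
test_err  = np.mean((X_test @ w - y_test) ** 2)     # testing error
print(train_err, test_err)

x_new = np.array([[1.7]])            # new, unlabeled data
print(x_new @ w)                     # prediction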
Gradient descent: think of it as "the path a drop of water or a ball would follow if left somewhere at random". Source: Machine Learning, Coursera, Andrew Ng
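The "ball rolling downhill" picture in code: batch gradient descent on a least-squares loss, each step following the negative gradient computed from all examples (learning rate and data synthetic).

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)

w, lr = np.zeros(2), 0.1
for step in range(100):
    grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    w -= lr * grad                       # one step "downhill"
print(w)                                 # approaches [2, -1]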
Subgradient methods represent an approximation in which only a subset of all training examples is used in each gradient computation; when the size of the subsample is reduced to a single instance, we arrive at stochastic gradient descent. Source: Slow learners are fast, http://papers.nips.cc/paper/3888-slow-learners-are-fast.pdf
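Shrinking the batch as the paper describes: the same problem with mini-batch gradients, which becomes pure stochastic gradient descent when batch_size = 1 (all values illustrative).

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=200)

w, lr, batch_size = np.zeros(2), 0.05, 10   # batch_size=1 -> pure SGD
for epoch in range(30):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]    # gradient from a subsample only
        grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
        w -= lr * grad
print(w)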
•"Stochastic" option: a single machine performs stochastic gradient descent; several processing cores perform stochastic gradient descent independently of each other while sharing a common parameter vector which is updated asynchronously. •"Batch" option: linear function classes where parts of the function can be computed independently on several cores. Source: Slow learners are fast, http://papers.nips.cc/paper/3888-slow-learners-are-fast.pdf
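A hedged sketch of the "stochastic" option on one machine: several processes run SGD on a shared parameter vector in shared memory and update it asynchronously, without locking. This is only a toy least-squares illustration of the idea in the paper, not its implementation.

# Asynchronous, lock-free SGD: the parameter vector lives in shared memory.
import multiprocessing as mp
import numpy as np

def worker(shared_w, X, y, lr, n_steps, seed):
    rng = np.random.default_rng(seed)
    w = np.frombuffer(shared_w.get_obj())   # NumPy view onto the shared vector
    for _ in range(n_steps):
        i = rng.integers(len(y))
        grad = (X[i] @ w - y[i]) * X[i]     # gradient on one example
        w -= lr * grad                       # asynchronous, unsynchronized update

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    true_w = np.arange(5.0)
    y = X @ true_w + 0.01 * rng.normal(size=1000)
    shared_w = mp.Array("d", 5)              # the common parameter vector
    procs = [mp.Process(target=worker, args=(shared_w, X, y, 0.01, 2000, s))
             for s in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(np.frombuffer(shared_w.get_obj()))  # should approach true_w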
•Mahout: stochastic gradient descent. •MLlib in Spark: stochastic gradient descent. •Vowpal Wabbit: a variant of online gradient descent; conjugate gradient (CG), mini-batch, and data-dependent learning rates are included. •Jubatus: loose model sharing — the key is to share only models rather than data between distributed servers; iterative parameter mixture (UPDATE / MIX / ANALYZE). Sources: Jubatus: An Open Source Platform for Distributed Online Machine Learning, http://biglearn.org/2013/files/papers/biglearning2013_submission_24.pdf ; http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
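For instance, MLlib's SGD-based logistic regression in the 1.x-era PySpark API (the points are synthetic; each iteration is a gradient step computed in parallel across the cluster).

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local", "mllib-sgd")

points = sc.parallelize([LabeledPoint(1.0, [1.0, 2.0]),
                         LabeledPoint(0.0, [-1.0, -2.0]),
                         LabeledPoint(1.0, [2.0, 1.5]),
                         LabeledPoint(0.0, [-2.0, -1.5])])

model = LogisticRegressionWithSGD.train(points, iterations=100)
print(model.predict([1.5, 1.5]))   # expect class 1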
Summary. -DAG (Directed Acyclic Graph) schedulers. -Performance: interactive results, speed, in-memory computing. -Building tools for different capabilities, hiding the underlying processes: the importance of end users. Open-source frameworks/libraries: -Hadoop/MapReduce: mining large datasets as a batch process. -Cascading/Pattern: building data applications that incorporate ML/statistics algorithms. -MADlib: bring the algorithms to the database; familiarity with SQL. -Spark/MLbase: DAG, in-memory, optimization of model selection. Scalability of modeling algorithms — different numerical optimization techniques for large datasets: -Stochastic gradient descent. -Subgradient or "mini-batch" gradient descent. -Batch gradient descent.
Some online tutorials/exercises:
R in Hadoop: http://hortonworks.com/hadoop-tutorial/using-rhadoop-to-predict-visitors-amount/ , http://www.revolutionanalytics.com/sites/default/files/revoscalerdecisiontrees.pdf
Tutorials for Cascading: http://www.cascading.org/documentation/tutorials/
Spark: A Data Science Case Study: http://hortonworks.com/blog/spark-data-science-case-study/
Cascading Pattern to translate from R to Hadoop (example: anti-fraud classifier used in e-commerce apps): http://blog.cloudera.com/blog/2013/11/how-to-use-cascading-pattern-with-r-and-cdh/
Movie recommendation with Scalding: http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
Vowpal Wabbit: http://zinkov.com/posts/2013-08-13-vowpal-tutorial/
•Python for Data Science: IPython/IPython Notebook, NumPy, SciPy, Pandas, Matplotlib, scikit-learn, Orange, mlpy, Numba, Blaze, PyTables.
Presentation at PyBCN: http://chdoig.github.io/pybcn-python4science/
ADM paper: A Primer on Python for DM (ask me).
Python vs R: http://inside-bigdata.com/2013/12/09/data-science-wars-python-vs-r/ , http://www.kaggle.com/forums/t/5243/pros-and-cons-of-r-vs-python-sci-kit-learn
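A taste of that stack working together — NumPy arrays inside a Pandas frame fed to scikit-learn (toy data, for illustration only).

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"x1": np.arange(10.0),
                   "x2": np.arange(10.0) ** 2})
df["y"] = 2 * df["x1"] - 0.5 * df["x2"]

model = LinearRegression().fit(df[["x1", "x2"]], df["y"])
print(model.coef_)          # ~[2, -0.5]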
Q1: Rank the following implementations by their performance (worst -> best) in mining a big data set (imagine 1 TB of data). Why? Q2: Name three issues/problems with writing plain MapReduce in Java for DM/ML algorithms. Why?