Slide 2

Objectives
1. Big Data challenges, trends and development
2. Open source frameworks/libraries
3. Scalability of modeling algorithms

Christine Doig. Maryam Pashmi. May 2014

Slide 3

Index
0. Introduction
1. State of the art
2. Frameworks and libraries
   2.1 MapReduce - Mahout
   2.2 Cascading - Pattern
   2.3 MADlib
   2.4 Spark - MLlib
3. Scalability of modeling algorithms
4. Summary
5. Resources and Recommendations
6. Exercise

Slide 5

0. Introduction. Open Data context

Open Data syllabus: "The focus of this course is on enriching the organization's own data with external repositories (special attention will be paid to Open Data), in order to gain insights into the organization's business domain."

Source: Oscar Romero, Open Data, MIRI-FIB-UPC

Slide 6

0. Introduction. Data Mining

Data mining: the process of discovering interesting and useful patterns and relationships in data. It draws on statistics and AI (neural networks and machine learning).
Source: Oscar Romero, Open Data, MIRI-FIB-UPC

Process diagram: CRISP-DM (Cross Industry Standard Process for Data Mining). Source: Wikipedia

Slide 7

0. Introduction. Big Data. Source: Intel

Slide 8

0. Introduction. Big Data. Source: http://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

Slide 9

0. Introduction. Big Data. Source: http://worrydream.com/ABriefRantOnTheFutureOfInteractionDesign/

Slide 11

0. Introduction. Big Data. "Don't use Hadoop, your data is not that big."
http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html

Slide 12

0. Introduction. Concepts

• Frameworks: an abstraction that provides generic functionality and can be modified by user-written code -> giving you the pieces to build your own tools (Ikea style) -> capability to build.
• Libraries: implemented algorithms -> giving you a tool you can use. You still need to know how the tool works -> capability to use.

Slide 13

Mining Big Data Sets

0. Introduction
1. State of the art
2. Frameworks and libraries
   2.1 MapReduce - Mahout
   2.2 Cascading - Pattern
   2.3 MADlib
   2.4 Spark - MLlib
3. Scalability of modeling algorithms
4. Summary
5. Resources and Recommendations
6. Exercise

Slide 14

1. State of the art.

Slide 16

Once upon a time… Business Intelligence (RDBMS, SQL); Statistics, Machine Learning, Data mining.

Slide 17

1.0: Business Intelligence (RDBMS, SQL). Statistics, Machine Learning, Data mining (programming languages). Storing and processing on distributed systems.

Open source project Hadoop: a distributed framework
- Written in Java
- Scalable, but you have to think in MapReduce

Source: Doug Cutting

Slide 20

1.0 Storing and processing on distributed systems. Solutions:
- Hive (HQL)
- Mahout (ML library)
- Hadoop Streaming (HS)

Slide 21

2.0 Hadoop 2.0: YARN (cluster management), separating processing from cluster resource management.

Slide 22

2.0 Applications / data flow, on YARN cluster management:
- Python: ML/Stats libraries
- Cascading: data flows
- Storm: stream processing

Slide 23

2.0 On top of Cascading:
- Lingual: SQL
- Pattern: PMML (R/SAS)
- Cascalog: Clojure
- Scalding: Scala

Slide 24

2.0 Mining streams: frameworks/libraries such as SAMOA.

Slide 25

2.0 Problem -> real-time performance/optimization. Spark: RDDs (in-memory), MLlib, complex DAGs.

Slide 28

HPC/HPDA: Tez (complex DAGs), Numba/Blaze. Vowpal Wabbit: distributed online machine learning. MLbase.

Slide 29

1. State of the art - Summary

What are the problems with plain MapReduce?
• Fitting your problem into MapReduce
• Java
• Optimization
• Performance
• Moving data around

Slide 30

1. State of the art. Solutions & Trends

• Fit into MapReduce: frameworks and tools built on top of MapReduce: Hive (HQL), Pig (ETL/data flow/scripts), Cascading (data flow). DM/ML libraries built on top of MapReduce: Mahout, Pattern.
• Java: Hadoop Streaming; R (RHadoop/RSpark) and Python (PySpark); Scala (Scalding, Spark) and Clojure (Cascalog).
• Optimization: application frameworks that allow a complex directed acyclic graph of tasks for processing data: Tez/Spark.
• Performance: interactivity and iteration: Spark. Streaming and real time: Storm, Spark Streaming.
• Moving data around: bring the algorithms to the DB: MADlib.

Slide 33

1. State of the art. Frameworks & Libraries

Next sections:
2. Frameworks and libraries -> present each technology
3. Scalability of modeling algorithms -> compare the approaches of the libraries

Slide 35

2.1 MapReduce

Overview of MapReduce. Source: "A Comparison of Join Algorithms for Log Processing in MapReduce"

A simple model to express relatively sophisticated distributed programs:
- Processes [key, value] pairs
- Signature: (map, reduce)

Benefits: hides parallelization, transparently distributes data, balances the workload, provides fault tolerance, and is elastically scalable.

Note: this is just a brief introduction to MapReduce.
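The [key, value] / (map, reduce) model above can be sketched in a few lines of plain Python. This is a single-process toy, not Hadoop: all function names here are our own, and the real framework adds the distribution, workload balancing and fault tolerance the slide mentions.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the user mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group pairs by key: the framework's sort/shuffle step between map and reduce."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped, reducer):
    for key, values in grouped:
        yield reducer(key, values)

# Word count, the canonical MapReduce example.
def wc_mapper(line):
    for word in line.split():
        yield word, 1

def wc_reducer(word, counts):
    return word, sum(counts)

lines = ["big data big models", "big clusters"]
result = dict(reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer))
print(result)  # {'big': 3, 'data': 1, 'models': 1, 'clusters': 1}
```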

Slide 36

2.1 Mahout

An Apache Software Foundation project to create scalable and robust machine learning libraries under the Apache License.
• Hides the underlying MapReduce processes from the user.
• Provides implemented algorithms.

Contains: user- and item-based recommenders, matrix-factorization recommenders, K-Means and Fuzzy K-Means clustering, Latent Dirichlet Allocation, Singular Value Decomposition, logistic regression classifier, (complementary) naive Bayes classifier, random forest classifier.

*As of April 25th: the Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on, but will keep and maintain its widely used MapReduce algorithms. Why? Spark.

Source: http://mahout.apache.org/

Slide 37

2.1 Mahout - Example

Creating a user-based recommender:
1. Load data from a file.
2. Compute the correlation coefficient between users' interactions.
3. Define which similar users we want to leverage for the recommender.
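Mahout's actual API for this is Java (e.g. FileDataModel, PearsonCorrelationSimilarity, GenericUserBasedRecommender); as a language-agnostic illustration of the same three steps, here is a tiny pure-Python sketch with made-up ratings data:

```python
from math import sqrt

# Toy user -> {item: rating} data, standing in for step 1 (loading a data file).
ratings = {
    "alice": {"a": 5.0, "b": 3.0, "c": 4.0},
    "bob":   {"a": 4.0, "b": 3.5, "c": 4.5, "d": 2.0},
    "carol": {"a": 1.0, "b": 5.0, "d": 4.5},
}

def pearson(u, v):
    """Step 2: correlation between two users over their co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][i] for i in common]
    ys = [ratings[v][i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)) * sqrt(sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def recommend(user, k=2):
    """Step 3: score unseen items via the k most similar users' ratings."""
    neighbours = sorted((n for n in ratings if n != user),
                        key=lambda n: pearson(user, n), reverse=True)[:k]
    scores = {}
    for n in neighbours:
        w = pearson(user, n)
        for item, r in ratings[n].items():
            if item not in ratings[user] and w > 0:
                scores[item] = scores.get(item, 0.0) + w * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("alice"))  # ['d']
```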

Slide 39

2.2 Cascading

Cascading is a software abstraction layer for Apache Hadoop that lets developers write their data applications once and then deploy them on any big data infrastructure.

In Cascading you define workflows: pipes of data routed through familiar operators such as GroupBy, Count, Filter, Merge, Each (map), Every (aggregate), CoGroup (MR join), HashJoin (local join), etc.

In Cascading vocabulary: "Build a Cascade from Flows which connect Taps via Pipes built into Assemblies to process Tuples."

A workflow can be visualized as a flow diagram.

"Scalding the Crunchy Pig for Cascading into the Hive": http://thewit.ch/scalding_crunchy_pig
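To make the pipe vocabulary concrete, here is a hypothetical Pipe class in Python that mimics Each, Filter, GroupBy and an Every-style aggregation. It is a metaphor for the programming model only, not Cascading's Java API:

```python
from collections import defaultdict

class Pipe:
    """Toy pipe assembly: tuples flow through chained operators."""
    def __init__(self, tuples):
        self.tuples = list(tuples)          # the "Tap": source data

    def each(self, fn):                     # Each ~ map over tuples
        return Pipe(fn(t) for t in self.tuples)

    def filter(self, pred):                 # Filter ~ keep matching tuples
        return Pipe(t for t in self.tuples if pred(t))

    def group_by(self, key):                # GroupBy ~ bucket tuples by a field
        groups = defaultdict(list)
        for t in self.tuples:
            groups[t[key]].append(t)
        return groups

# Count words by length (> 3 chars): Each -> Filter -> GroupBy -> Every(count)
words = Pipe([{"word": w} for w in "taps pipes flows to the sink".split()])
grouped = (words
           .each(lambda t: {**t, "length": len(t["word"])})
           .filter(lambda t: t["length"] > 3)
           .group_by("length"))
counts = {k: len(v) for k, v in grouped.items()}  # Every ~ aggregate per group
print(counts)  # {4: 2, 5: 2}
```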

Slide 40

2.2 Cascading

Cascading for the Impatient: http://docs.cascading.org/impatient/impatient5.html

Slide 41

2.2 Cascading. Pattern

PMML: Predictive Model Markup Language
• Imports model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc.
• An XML-based file format developed by the Data Mining Group.
• Implemented algorithms: Random Forest, Linear Regression, Hierarchical Clustering, K-Means Clustering, Logistic Regression.

Example: Linear Regression. package cascading.pattern.pmml.iris;

*Popular open source projects have been built on top of Cascading, such as Scalding (Scala) and Cascalog (Clojure), which include significant machine learning libraries.

Slide 46

2.3 MADlib

Philosophy: bring the modeling algorithms to the databases.

MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods.

Source: http://doc.madlib.net/latest/group__grp__linreg.html
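As a sketch of what "bringing the algorithm to the database" looks like, MADlib's linear-regression module is driven entirely from SQL, roughly as below. Table and column names follow the linked documentation's housing example, but treat the details as illustrative rather than exact:

```sql
-- Train a linear model in-database on a "houses" table
-- (illustrative; see the linked MADlib linregr docs for the canonical call).
SELECT madlib.linregr_train(
    'houses',                     -- source table
    'houses_linregr',             -- output table holding the fitted model
    'price',                      -- dependent variable
    'ARRAY[1, tax, bath, size]'   -- intercept + independent variables
);

-- Inspect the coefficients: the data never leaves the database.
SELECT coef, r2 FROM houses_linregr;
```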

Slide 53

2.4 Spark. Overview

Apache Spark is a fast and general engine for large-scale data processing.

Characteristics:
- Speed: DAG execution engine and in-memory computing. Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
- Ease of use: write applications quickly in Java, Scala or Python.
- Generality: combine SQL, streaming and ML.
- Integrated with Hadoop: runs on HDFS with the YARN or Mesos cluster managers, and can read any existing Hadoop data.

Sources: http://databricks.com/spark/, http://spark.apache.org/

Slide 54

2.4 Spark. RDD (I)

RDD (Resilient Distributed Datasets):
- A distributed memory abstraction for performing in-memory computations on large clusters in a fault-tolerant manner, by logging the transformations used to build a dataset (its lineage) rather than the actual data.*
- Read-only.

An RDD can be created from:
- data in stable storage (e.g. HDFS)
- other RDDs, through deterministic operations called transformations (map, filter, join)

RDDs do not need to be materialized at all times.

*It can still be helpful to checkpoint some RDDs to stable storage.

Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
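The lineage idea can be illustrated with a toy class: transformations only record themselves, and an action replays the recorded chain against the source data, which is what makes a lost partition recomputable. This is our own sketch, not the PySpark API:

```python
# Toy stand-in for Spark's RDD: lazy transformations recorded as lineage.
class ToyRDD:
    def __init__(self, source, lineage=()):
        self.source = source        # data in "stable storage"
        self.lineage = lineage      # chain of (op_name, fn) pairs

    def map(self, fn):              # transformation: record only, compute nothing
        return ToyRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):         # transformation: record only, compute nothing
        return ToyRDD(self.source, self.lineage + (("filter", pred),))

    def collect(self):
        """An action: replay the lineage against the source data."""
        data = list(self.source)
        for op, fn in self.lineage:
            data = [fn(x) for x in data] if op == "map" else [x for x in data if fn(x)]
        return data

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(len(rdd.lineage))   # 2 transformations recorded, nothing computed yet
print(rdd.collect())      # [0, 4, 16]
```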

Slide 55

2.4 Spark. RDD (II)

Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Slide 59

2.4 Spark. RDD (II)

The scheduler performs a topological sort to determine the execution sequence of the DAG, tracing all the way back to the source nodes, or to nodes that represent a cached RDD.

Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Slide 62

2.4 Spark for DM

Why is Spark suitable for mining large data sets?
- Iteration
- Interactivity

Many machine learning algorithms are iterative in nature because they run iterative optimization procedures. They can thus run much faster by keeping their data in memory.

Spark supports two types of shared variables:
• broadcast variables: cache a value in memory on all nodes.
• accumulators: variables that are only "added" to, such as counters and sums.

Users can control two other aspects of RDDs:
• persistence: indicate which RDDs will be reused and choose a storage strategy for them (e.g., in-memory storage).
• partitioning: ask that an RDD's elements be partitioned across machines based on a key in each record. Co-partitioning different RDDs (on the same keys) across machines can reduce inter-machine data shuffling within a cluster.

Slide 63

2.4 Spark. MLlib example: KNN example

Slide 64

2.4 Spark. MLbase

Source: http://mlbase.org/

Slide 69

3. Scalability of modeling algorithms - Machine Learning Revisited

Supervised learning, with labeled data (true values known):
- Split the data into learning data (training + validation) and testing data.
- Choose a model (e.g. linear regression, SVM, ...) and a loss function.
- Fit the best model (parameters), tracking the learning error and the testing error.
- Use the model to predict on new, unlabeled data.
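A minimal numeric example of the loop above (data, split sizes and the one-parameter model are invented for illustration): split labeled data, fit by least squares, and measure both errors.

```python
import random

random.seed(0)
# Toy labeled dataset: y = 3x plus a little noise.
data = [(x, 3.0 * x + random.gauss(0.0, 0.1)) for x in range(20)]
random.shuffle(data)
train, test = data[:15], data[15:]   # learning data vs. held-out testing data

# Model: y = w * x, fitted by least squares (closed form for one parameter).
w = sum(x * y for x, y in train) / sum(x * x for x, y in train)

def mse(pairs):
    """Mean squared error of the fitted model on a set of labeled pairs."""
    return sum((y - w * x) ** 2 for x, y in pairs) / len(pairs)

print(w, mse(train), mse(test))  # w close to 3; both errors small
```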

Slide 70

3. Scalability of modeling algorithms

Slide 71

3. Scalability of modeling algorithms

What are the problems in performing numerical optimization on large datasets?
• Iterations (in MapReduce, intermediate results are written to disk)
• Inverse matrices

Solutions: approximations -> parallelizable optimization techniques

Examples:
• Stochastic Gradient Descent
• Limited-memory BFGS
• Reduced-rank matrices
  • Singular Value Decomposition
  • Kernel matrix

References:
"Logistic Regression for Data Mining and High-Dimensional Classification", Paul Komarek. http://repository.cmu.edu/cgi/viewcontent.cgi?article=1221&context=robotics
"Logistic Regression and Newton's Method". http://www.maths.lth.se/matstat/kurser/masm22/lecture/NewtonRaphson.pdf
"Robust Stochastic Approximation Approach to Stochastic Programming". http://www2.isye.gatech.edu/~nemirovs/SIOPT_RSA_2009.pdf
"Pegasos: Primal Estimated sub-GrAdient SOlver for SVM". http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf
"Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models". http://cs.nyu.edu/~silberman/papers/efficient_maxentNIPS2009.pdf
Source: "A comparison of numerical optimizers for logistic regression", Thomas Minka. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.7017&rep=rep1&type=pdf

Slide 72

3. Scalability of modeling algorithms - "Batch" Gradient Descent

Source: Machine Learning, Coursera, Andrew Ng

Slide 75

3. Scalability of modeling algorithms - "Batch" Gradient Descent

Think of it as the path a drop of water or a ball would follow if left somewhere at random.

Source: Machine Learning, Coursera, Andrew Ng
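A batch gradient-descent step for a one-parameter linear model, as a self-contained sketch (the toy data, learning rate and step count are ours, not from the course). Every step uses ALL training examples to compute the gradient, which is exactly what becomes expensive on large datasets:

```python
# Batch gradient descent for y ~ w * x on noiseless toy data with true w = 3.
data = [(x, 3.0 * x) for x in range(1, 11)]
w, lr = 0.0, 0.005

for step in range(200):
    # Gradient of the mean squared error over the FULL batch.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 3))  # converges to ~3.0
```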

Slide 83

3. Scalability of modeling algorithms - Stochastic Gradient Descent

Subgradient methods are an approximation in which only a subset of all training examples is used in each gradient computation; when the size of the subsample is reduced to a single instance, we arrive at stochastic gradient descent.

Source: Slow learners are fast. http://papers.nips.cc/paper/3888-slow-learners-are-fast.pdf
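Stochastic gradient descent on a toy one-parameter linear model (data and learning rate invented for illustration): each update uses a single randomly chosen example rather than the full dataset, so one pass over big data becomes many cheap updates.

```python
import random

# SGD for y ~ w * x on noiseless toy data with true w = 3.
random.seed(1)
data = [(x, 3.0 * x) for x in range(1, 11)]
w, lr = 0.0, 0.005

for step in range(2000):
    x, y = random.choice(data)   # the "subsample" reduced to a single instance
    grad = 2 * (w * x - y) * x   # gradient at one point only
    w -= lr * grad

print(round(w, 2))  # noisy path, but close to 3.0
```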

Slide 84

3. Scalability of modeling algorithms - Alternatives for large datasets

• "Stochastic" option:
  - A single machine performs stochastic gradient descent.
  - Several processing cores perform stochastic gradient descent independently of each other while sharing a common parameter vector that is updated asynchronously.
• "Batch" option: linear function classes where parts of the function can be computed independently on several cores.

Source: Slow learners are fast. http://papers.nips.cc/paper/3888-slow-learners-are-fast.pdf

Slide 85

3. Scalability of modeling algorithms - Approaches of different libraries

• Mahout: stochastic gradient descent.
• MLlib (Spark): stochastic gradient descent.
• Vowpal Wabbit: a variant of online gradient descent. Conjugate gradient (CG), mini-batch and data-dependent learning rates are included.
• Jubatus: loose model sharing. The key is to share only models, rather than data, between distributed servers. Iterative parameter mixture: UPDATE, MIX, ANALYZE.

Jubatus: An Open Source Platform for Distributed Online Machine Learning. http://biglearn.org/2013/files/papers/biglearning2013_submission_24.pdf
http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf

Slide 87

4. Summary

Big Data challenges, trends and development
- Optimization: DAG (directed acyclic graph) schedulers
- Performance: interactive results, speed, in-memory computing
- Building tools for different capabilities, hiding the underlying processes: importance of end users

Open source frameworks/libraries
- Hadoop/MapReduce: mining large datasets as a batch process
- Cascading/Pattern: building data applications that incorporate ML/statistics algorithms
- MADlib: bring the algorithms to the database; familiarity of SQL
- Spark/MLbase: DAG, in-memory, optimization of model selection

Scalability of modeling algorithms: different numerical optimization techniques for large datasets
- Stochastic gradient descent
- Subgradient or "mini-batch" gradient descent
- Batch gradient descent

Slide 89

5. Recommendations and Resources

Related MOOC courses:
• Machine Learning, Coursera
• Introduction to Hadoop and MapReduce, Udacity
• Introduction to Data Science, Coursera
• Learning from Data, edX
• Web Intelligence and Big Data, Coursera

Other courses:
• Mining Massive Datasets, Stanford: http://infolab.stanford.edu/~ullman/mmds.html (book and slides)

O'Reilly books: Safari Online Books (subscription)

Slide 90

5. Recommendations and Resources

• Map-Reduce for Machine Learning on Multicore: http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf
• Spark: http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
• Jubatus: http://biglearn.org/2013/files/papers/biglearning2013_submission_24.pdf
• SAMOA: http://yahoo.github.io/samoa/SAMOA-Developers-Guide-0-0-1.pdf
• HPDA: http://www.hpcuserforum.com/presentations/tuscon2013/IDCHPDABigDataHPC.pdf
• MADlib: http://madlib.net/ and http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-38.pdf

Slide 91

5. Recommendations and Resources

Some online tutorials/exercises:
• R in Hadoop: http://hortonworks.com/hadoop-tutorial/using-rhadoop-to-predict-visitors-amount/ and http://www.revolutionanalytics.com/sites/default/files/revoscalerdecisiontrees.pdf
• Tutorials for Cascading: http://www.cascading.org/documentation/tutorials/
• Spark: A Data Science Case Study: http://hortonworks.com/blog/spark-data-science-case-study/
• Cascading Pattern to translate from R to Hadoop (example: an anti-fraud classifier used in e-commerce apps): http://blog.cloudera.com/blog/2013/11/how-to-use-cascading-pattern-with-r-and-cdh/
• Movie recommendations with Scalding: http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/
• Vowpal Wabbit: http://zinkov.com/posts/2013-08-13-vowpal-tutorial/

Slide 92

5. Recommendations and Resources

Python for data science:
• IPython / IPython Notebook
• NumPy
• SciPy
• Pandas
• Matplotlib
• Scikit-learn
• Orange
• Mlpy
• Numba
• Blaze
• PyTables

Presentation at PyBCN: http://chdoig.github.io/pybcn-python4science/
ADM paper: A Primer on Python for DM (ask me)
Python vs R: http://inside-bigdata.com/2013/12/09/data-science-wars-python-vs-r/ and http://www.kaggle.com/forums/t/5243/pros-and-cons-of-r-vs-python-sci-kit-learn

Slide 93

Mining Big Data Sets

Questions?

Slide 95

Exercise 1

Q1: Order the following implementations by their performance (worst -> best) in mining a big data set (imagine 1 TB of data). Why?

Q2: Name three issues/problems with writing plain MapReduce in Java for DM/ML algorithms. Why?