
Mining Big Data Sets


Open Data.
Master in Innovation and Research in Informatics, UPC, Barcelona, 2014.

Christine Doig

May 29, 2014

Transcript

  1. Objectives 1. Big Data challenges, trends and development 2. Open

    Source frameworks/libraries 3. Scalability of modeling algorithms Christine Doig. Maryam Pashmi. May 2014
  2. Index 0. Introduction 1.State of the art 2.Frameworks and libraries

    2.1 MapReduce - Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  4. 0. Introduction. Open Data context Open Data Syllabus: The focus

    of this course is on enriching the available own data (i.e., data owned by the organization) with external repositories (special attention will be paid on Open Data), in order to gain insights into the organization business domain Source: Oscar Romero, Open Data, MIRI-FIB-UPC Christine Doig. Maryam Pashmi. May 2014
  5. 0. Introduction. Data Mining Data Mining: The process of discovering

    interesting and useful patterns and relationships in data. Statistics, AI (neural networks and machine learning) Source: Oscar Romero, Open Data, MIRI-FIB-UPC Process diagram CRISP-DM: Cross Industry Standard Process for Data Mining. Source: Wikipedia Christine Doig. Maryam Pashmi. May 2014
  6. 0. Introduction. Big Data. Don't use Hadoop, your data is

    not that big. http://www.chrisstucchio.com/blog/2013/hadoop_hatred.html Christine Doig. Maryam Pashmi. May 2014
  7. 0. Introduction. Concepts •Frameworks: an abstraction that provides generic functionality and can

    be modified by user-written code -> Giving you the pieces to build your tools, e.g. Ikea style -> Capability to build ! •Libraries: implemented algorithms -> Giving you a tool you can use; you still need to know how the tool works -> Capability to use Christine Doig. Maryam Pashmi. May 2014
  8. Mining Big Data Sets 0. Introduction 1.State of the art

    2.Frameworks and libraries 2.1 MapReduce - Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  9. Business Intelligence RDBMS. SQL. Statistics. Machine Learning. Data mining Once

    upon a time… Christine Doig. Maryam Pashmi. May 2014
  10. Business Intelligence RDBMS. SQL. Statistics. Machine Learning. Data mining Programming

    Languages 1.0 Storing: Processing: Open Source project Hadoop: Distributed framework -Written in Java -Scalable, but think in MapReduce Christine Doig. Maryam Pashmi. May 2014 Distributed systems
  11. Business Intelligence RDBMS. SQL. Statistics. Machine Learning. Data mining Programming

    Languages 1.0 Storing: Processing: Open Source project Hadoop: Distributed framework -Written in Java -Scalable, but think in MapReduce Christine Doig. Maryam Pashmi. May 2014 Distributed systems Source: Doug Cutting
  13. Business Intelligence RDBMS. SQL. Statistics. Machine Learning. Data mining Programming

    Languages 1.0 Storing: Processing: Solutions: -Hive, Mahout -Hadoop Streaming HS HS ML library HQL Christine Doig. Maryam Pashmi. May 2014 Distributed systems
  14. Statistics. Machine Learning. Business Intelligence RDBMS. SQL. Data mining Programming

    Languages YARN Cluster mgmt. 2.0 Hadoop 2.0: YARN: Separating processing from cluster resource management Christine Doig. Maryam Pashmi. May 2014 Distributed systems
  15. Statistics. Machine Learning. Business Intelligence RDBMS. SQL. Data mining Programming

    Languages Streams YARN Cluster mgmt. 2.0 Applications. Data flow Python: ML/Stats libraries Cascading: Data flows Storm: Stream processing Christine Doig. Maryam Pashmi. May 2014 Distributed systems 1 3 2 1 2 3
  16. Business Intelligence RDBMS. SQL. Statistics. Machine Learning. Data mining Programming

    Languages Streams YARN Cluster mgmt. 2.0 Cascalog Lingual Pattern Applications. Data flow Lingual: SQL Pattern: PMML (R/SAS) Cascalog: Clojure Scalding: Scala Christine Doig. Maryam Pashmi. May 2014 Distributed systems
  17. Statistics. Machine Learning. Business Intelligence RDBMS. SQL. Data mining Programming

    Languages Streams SAMOA YARN Cluster mgmt. 2.0 Cascalog Lingual Pattern Applications. Data flow Mining Streams: Frameworks/Libraries Christine Doig. Maryam Pashmi. May 2014 Distributed systems
  18. Statistics. Machine Learning. Business Intelligence RDBMS. SQL. Data mining Programming

    Languages Streams SAMOA YARN Cluster mgmt. 2.0 Cascalog Lingual Pattern MLlib Complex DAG Applications. Data flow Problem -> Real-time Performance/optimization SPARK: RDD (in-memory) Christine Doig. Maryam Pashmi. May 2014 Distributed systems
  20. Statistics. Machine Learning. Business Intelligence RDBMS. SQL. Data mining Programming

    Languages HPC HPDA Streams SAMOA YARN Cluster mgmt. Tez Complex DAG 2.0 Cascalog Lingual Pattern MLlib Complex DAG Applications. Data flow Numba/Blaze Christine Doig. Maryam Pashmi. May 2014 Distributed systems 1 2 3
  21. Statistics. Machine Learning. Business Intelligence RDBMS. SQL. Data mining Programming

    Languages HPC HPDA Streams SAMOA YARN Cluster mgmt. Tez Complex DAG 2.0 Cascalog Lingual Pattern MLlib Complex DAG Applications. Data flow Numba/Blaze Christine Doig. Maryam Pashmi. May 2014 Distributed systems Vowpal Wabbit: distributed online machine learning. MLbase
  22. 1. State of the art. State of the art -

    Summary What are the problems with plain MapReduce? • Fit your problem into MapReduce • Java • Optimization • Performance • Moving data around Christine Doig. Maryam Pashmi. May 2014
  23. 1. State of the art. Solutions & Trends •Fit into

    MapReduce: Building frameworks and tools on top of MapReduce: -Hive (HQL), Pig (ETL/data flow/scripts), Cascading (data flow). Building DM/ML libraries on top of MapReduce: -Mahout, Pattern •Java: -Hadoop Streaming -R (RHadoop/RSpark) & Python (PySpark) -Scala (Scalding, Spark) & Clojure (Cascalog) •Optimization: Building an application framework that allows a complex directed acyclic graph of tasks for processing data: -Tez/Spark •Performance: Interactivity and iteration: Spark. Streaming and real-time: Storm, Spark Streaming •Moving data around: Bring the algorithms to the DB: MADlib Christine Doig. Maryam Pashmi. May 2014
  24. 1. State of the art. Frameworks&Libraries. Christine Doig. Maryam Pashmi.

    May 2014 Next Section: 2. Frameworks and Libraries -> Present each technology
  25. 1. State of the art. Frameworks&Libraries. Christine Doig. Maryam Pashmi.

    May 2014 Next Section: 2. Frameworks and Libraries -> Present each technology 3. Scalability of Modeling algorithm -> Approach comparison between libraries
  26. Mining Big Data Sets 0. Introduction 1.State of the art

    2.Frameworks and libraries 2.1 MapReduce - Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  27. Overview of MapReduce. Source: “A Comparison of Join Algorithms for

    Log Processing in MapReduce” 2.1 MapReduce Christine Doig. Maryam Pashmi. May 2014 A simple model for expressing relatively sophisticated distributed programs: -Processes [key, value] pairs -Signature (map, reduce) ! Benefits: hides parallelization, transparently distributes data, balances workload, provides fault tolerance, elastically scalable MAPPER REDUCER Important note: just a brief introduction to MapReduce
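The (map, reduce) signature over [key, value] pairs can be sketched in a few lines of plain Python. This is an illustrative single-machine word count, not part of the deck; Hadoop distributes the same three phases (map, shuffle/sort, reduce) across machines:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit one [key, value] pair per word, as a Hadoop mapper would.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Combine all values that share a key.
    return (word, sum(counts))

def map_reduce(lines):
    # Simulate the shuffle phase: sort mapper output by key, then group by key.
    pairs = sorted((kv for line in lines for kv in mapper(line)),
                   key=itemgetter(0))
    return dict(reducer(k, (v for _, v in g))
                for k, g in groupby(pairs, key=itemgetter(0)))

print(map_reduce(["big data", "big sets"]))  # {'big': 2, 'data': 1, 'sets': 1}
```

The benefits listed above come precisely from this narrow interface: because user code only supplies `mapper` and `reducer`, the framework is free to parallelize, rebalance, and retry them.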
  28. An Apache Software Foundation project to create scalable & robust

    machine learning libraries under the Apache Software License. •Hides the underlying MapReduce processes from the user. •Provides implemented algorithms. ! 2.1 Mahout *From April 25th: The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. We will however keep our widely used MapReduce algorithms in the codebase and maintain them. ! Why? Spark Christine Doig. Maryam Pashmi. May 2014 Source: http://mahout.apache.org/ Contains: User- and item-based recommenders Matrix factorization based recommenders K-Means, Fuzzy K-Means clustering Latent Dirichlet Allocation Singular Value Decomposition Logistic regression classifier (Complementary) Naive Bayes classifier Random forest classifier
  29. Creating a user-based recommender: load data from a file, compute

    the correlation coefficient between user interactions, and define which similar users we want to leverage for the recommender. 2.1 Mahout - Example Christine Doig. Maryam Pashmi. May 2014
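Mahout's own recommender API is Java (classes such as PearsonCorrelationSimilarity, NearestNUserNeighborhood, GenericUserBasedRecommender). As a language-neutral sketch of the same three steps, here is a hedged pure-Python analogue over a hypothetical toy ratings dictionary (everything below is illustrative, not Mahout code):

```python
from math import sqrt

ratings = {  # hypothetical user -> {item: rating} data, as loaded from a file
    "u1": {"a": 5.0, "b": 3.0, "c": 4.0},
    "u2": {"a": 4.0, "b": 3.0, "c": 5.0},
    "u3": {"a": 4.0, "b": 2.0},
}

def pearson(u, v):
    # Step 2: correlation coefficient over co-rated items only.
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][i] for i in common]
    ys = [ratings[v][i] for i in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0

def recommend(user, n_neighbours=2):
    # Step 3: keep the n most similar users, then score items
    # the target user has not rated yet, weighted by similarity.
    others = sorted((u for u in ratings if u != user),
                    key=lambda u: pearson(user, u), reverse=True)[:n_neighbours]
    scores = {}
    for u in others:
        for item, r in ratings[u].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + pearson(user, u) * r
    return sorted(scores, key=scores.get, reverse=True)
```

For instance, `recommend("u3")` suggests item `"c"`, the only item u3's two most similar users rated that u3 has not.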
  30. Mining Big Data Sets 0. Introduction 1.State of the art

    2.Frameworks and libraries 2.1 MapReduce - Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  31. 2.2 Cascading. Christine Doig. Maryam Pashmi. May 2014 Cascading is

    a software abstraction layer for Apache Hadoop that is intended to allow developers to write their data applications once and then deploy those applications on any big data infrastructure. ! In Cascading you define workflows, pipes of data that get routed through familiar operators such as "GroupBy", "Count", "Filter", "Merge", “Each” (map), “Every” (aggregate), “CoGroup” (MR join), “HashJoin” (Local join), etc. ! In Cascading vocabulary: "Build a Cascade from Flows which connect Taps via Pipes built into Assemblies to process Tuples” ! A workflow can be visualized as a "flow diagram”. ! ! “Scalding the Crunchy Pig for Cascading into the Hive“ http://thewit.ch/scalding_crunchy_pig
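To make the vocabulary above concrete, here is a hypothetical pure-Python analogue of a pipe assembly; none of these names are Cascading's actual Java API, only a sketch of how tuples flow from a source Tap through Pipes such as Each, Filter, and a GroupBy with an aggregating Every:

```python
from itertools import groupby

def Each(fn):                     # like Cascading's Each: apply fn per tuple
    return lambda tuples: (fn(t) for t in tuples)

def Filter(pred):                 # keep only tuples matching the predicate
    return lambda tuples: (t for t in tuples if pred(t))

def GroupByCount(key):            # a GroupBy followed by Every(Count)
    def run(tuples):
        for k, g in groupby(sorted(tuples, key=key), key=key):
            yield (k, sum(1 for _ in g))
    return run

def flow(source, *pipes):         # a Flow connects a source Tap through Pipes
    data = iter(source)
    for pipe in pipes:
        data = pipe(data)
    return list(data)

words = ["Hadoop", "Spark", "hadoop", "hive"]
result = flow(words,
              Each(str.lower),
              Filter(lambda w: w.startswith("h")),
              GroupByCount(lambda w: w))
# result == [("hadoop", 2), ("hive", 1)]
```

The point of the abstraction is visible even in this toy: the workflow is declared once as a composition of operators, and the planner (here, trivially, `flow`) decides how to execute it, which is what lets real Cascading target different big data back ends.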
  32. 2.2 Cascading. Christine Doig. Maryam Pashmi. May 2014 Cascading for

    the Impatient: http://docs.cascading.org/impatient/impatient5.html
  33. 2.2 Cascading. Pattern Christine Doig. Maryam Pashmi. May 2014 PMML:

    Predictive Model Markup Language •Import model descriptions from R, SAS, Weka, RapidMiner, KNIME, SQL Server, etc. •XML-based file format developed by the Data Mining Group •Implemented algorithms: Random Forest, Linear Regression, Hierarchical Clustering and K-Means Clustering, Logistic Regression. Example: Linear Regression. package cascading.pattern.pmml.iris; *Popular open source projects have been built on top of Cascading, such as Scalding (Scala) and Cascalog (Clojure), which include significant Machine Learning libraries.
  37. Mining Big Data Sets 0. Introduction 1.State of the art

    2.Frameworks and libraries 2.1 MapReduce - Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  38. 2.3 MADlib Christine Doig. Maryam Pashmi. May 2014 Philosophy: Bring

    the modeling algorithms to the database. MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods. Source: http://doc.madlib.net/latest/group__grp__linreg.html
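The linked linear regression module (linregr_train) can run data-parallel inside the database because the sufficient statistics of ordinary least squares, X'X and X'y, are plain sums: each segment computes a partial state and the states merge associatively. A hedged pure-Python sketch of that transition/merge/final pattern on toy data (the helper names here are illustrative, not MADlib's API):

```python
def partial(rows):
    # Per-segment transition state: accumulate X'X and X'y over (x, y) rows.
    d = len(rows[0][0])
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def merge(a, b):
    # Merge function: segment states simply add, which is why this parallelizes.
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty

def solve2(xtx, xty):
    # Final step: solve the 2x2 normal equations (Cramer's rule).
    (a, b), (c, d) = xtx
    det = a * d - b * c
    return ((xty[0] * d - b * xty[1]) / det,
            (a * xty[1] - c * xty[0]) / det)

# Two "segments" of rows x = [1, x1] (intercept term first), y = 1 + 2*x1.
seg1 = partial([([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)])
seg2 = partial([([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)])
coef = solve2(*merge(seg1, seg2))  # -> (1.0, 2.0): intercept 1, slope 2
```

Only the tiny merged state moves between nodes, never the rows themselves, which is exactly the "bring the algorithms to the data" philosophy on the slide.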
  44. Mining Big Data Sets 0. Introduction 1.State of the art

    - Big Data Mining 2.Frameworks and libraries 2.1 MapReduce – Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  45. 2.4 Spark. Overview Christine Doig. Maryam Pashmi. May 2014 Apache

    Spark™ is a fast and general engine for large-scale data processing. ! Characteristics: -Speed: DAG execution engine and in-memory computing. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. -Ease of Use: Write applications quickly in Java, Scala or Python. -Generality: Combine SQL, streaming, and ML. -Integrated with Hadoop: HDFS, YARN or Mesos cluster manager, can read any existing Hadoop data. ! ! ! ! Source: http://databricks.com/spark/ Source: http://spark.apache.org/
  46. 2.4 Spark. RDD(I) Christine Doig. Maryam Pashmi. May 2014 RDD

    (Resilient Distributed Datasets): -a distributed memory abstraction for performing in-memory computations on large clusters in a fault-tolerant manner: •logging the transformations used to build a dataset (its lineage) rather than the actual data*. -read-only. RDDs can be created from: -data in stable storage (e.g. HDFS) -other RDDs, with deterministic operations called transformations (map, filter, join) ! RDDs do not need to be materialized at all times. ! *It can be helpful to checkpoint some RDDs to stable storage ! ! ! ! Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
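A miniature illustration of the key idea above (a hypothetical class, not Spark's API): the dataset records the lineage of transformations rather than materialized data, and only computes when an action is called.

```python
class MiniRDD:
    def __init__(self, source, lineage=()):
        self.source = source          # data in "stable storage"
        self.lineage = lineage        # recorded transformations, no data yet

    def map(self, fn):
        # Transformations are lazy: they just extend the lineage.
        return MiniRDD(self.source, self.lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self.source, self.lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the lineage from the source. Losing a partition
        # would mean replaying this same lineage over that partition's input,
        # which is how fault tolerance works without replicating the data.
        data = iter(self.source)
        for op, fn in self.lineage:
            data = map(fn, data) if op == "map" else filter(fn, data)
        return list(data)

rdd = MiniRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
rdd.collect()  # [0, 4, 16]
```

Note that constructing `rdd` touches no data at all, matching the slide's point that RDDs need not be materialized at all times.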
  47. 2.4 Spark. RDD(II) Christine Doig. Maryam Pashmi. May 2014 Source:

    Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  51. 2.4 Spark. RDD(II) Christine Doig. Maryam Pashmi. May 2014 The

    scheduler performs a topological sort to determine the execution sequence of the DAG, tracing all the way back to the source nodes or to nodes that represent a cached RDD. Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  54. 2.4 Spark for DM Christine Doig. Maryam Pashmi. May 2014

    Why is Spark suitable for mining large data sets? ! -Iteration -Interactivity ! Many machine learning algorithms are iterative in nature because they run iterative optimization procedures. They can thus run much faster by keeping their data in memory. ! Spark supports two types of shared variables: •broadcast variables: cache a value in memory on all nodes. •accumulators: variables that are only “added” to, such as counters and sums. ! Users can control two other aspects of RDDs: •persistence: indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage). •partitioning: ask that an RDD’s elements be partitioned across machines based on a key in each record. Controlling how different RDDs are co-partitioned (with the same keys) across machines can reduce inter-machine data shuffling within a cluster. ! !
  55. Mining Big Data Sets 0. Introduction 1.State of the art

    2.Frameworks and libraries 2.1 MapReduce - Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  56. 3. Scalability of modeling algorithms Machine Learning Revisited Christine Doig.

    Maryam Pashmi. May 2014 Data (labeled, true value) Supervised learning Learning data (training + validation) Testing data Model (e.g. Linear Regression, SVM…) (+ Loss function) Best model (parameters) Learning error Testing error New data (unlabeled, unknown value) Prediction
  57. 3. Scalability of modeling algorithms What are the problems in

    performing numerical optimization on large datasets? ! • Iterations (MR: intermediate results are written to disk) • Inverse matrices ! Solutions: Approximations -> Parallelizable optimization techniques ! Examples: • Stochastic Gradient Descent • Limited-memory BFGS • Reduced-rank matrix •Singular Value Decomposition •Kernel matrix ! ! ! Christine Doig. Maryam Pashmi. May 2014 “Logistic Regression for Data Mining and High-Dimensional Classification”, Paul Komarek. http://repository.cmu.edu/cgi/viewcontent.cgi?article=1221&context=robotics “Logistic Regression and Newton's Method”, http://www.maths.lth.se/matstat/kurser/masm22/lecture/NewtonRaphson.pdf “Robust Stochastic Approximation Approach to Stochastic Programming” http://www2.isye.gatech.edu/~nemirovs/SIOPT_RSA_2009.pdf Pegasos: Primal Estimated sub-GrAdient SOlver for SVM: http://ttic.uchicago.edu/~shai/papers/ShalevSiSr07.pdf Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models: http://cs.nyu.edu/~silberman/papers/efficient_maxentNIPS2009.pdf Source: “A comparison of numerical optimizers for logistic regression”, Thomas Minka, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.7017&rep=rep1&type=pdf
  58. 3. Scalability of modeling algorithms “Batch” Gradient Descent Christine Doig.

    Maryam Pashmi. May 2014 Source: Machine Learning, Coursera, Andrew Ng
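As a concrete sketch of the batch update rule this slide refers to, here is a toy, NumPy-free implementation for simple linear regression under squared-error loss (an illustration, not the course's code): every single step sums the gradient over all m training examples, which is exactly what becomes expensive on large datasets.

```python
def batch_gd(xs, ys, lr=0.05, steps=2000):
    # "Batch" gradient descent: each update uses ALL m examples.
    m = len(xs)
    t0 = t1 = 0.0                                  # theta_0, theta_1
    for _ in range(steps):
        errs = [(t0 + t1 * x) - y for x, y in zip(xs, ys)]
        g0 = sum(errs) / m                         # full-dataset gradients
        g1 = sum(e * x for e, x in zip(errs, xs)) / m
        t0, t1 = t0 - lr * g0, t1 - lr * g1        # simultaneous update
    return t0, t1

# Toy data generated by y = 1 + 2x: the iterates descend the convex
# cost surface, like the slide's ball rolling downhill.
t0, t1 = batch_gd([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
# converges toward intercept 1, slope 2
```

On 1TB of data each of those `sum(...)` lines is a full pass over the dataset, and in MapReduce each pass writes its intermediate results to disk, which is the scalability problem the following slides address.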
  61. 3. Scalability of modeling algorithms “Batch” Gradient Descent Christine Doig.

    Maryam Pashmi. May 2014 Source: Machine Learning, Coursera, Andrew Ng Think of it as “The path a drop of water/or a ball would follow if left somewhere at random”.
  69. 3. Scalability of modeling algorithms Stochastic Gradient Descent Christine Doig.

    Maryam Pashmi. May 2014 Subgradient methods represent an approximation in which only a subset of all training examples is used in each gradient computation, and when the size of the subsample is reduced to a single instance, we arrive at stochastic gradient descent. Source: Slow learners are fast: http://papers.nips.cc/paper/3888-slow-learners-are-fast.pdf
  70. 3. Scalability of modeling algorithms Alternatives in large datasets Christine

    Doig. Maryam Pashmi. May 2014 •“Stochastic” option: •Single machine performs stochastic gradient descent. •Several processing cores perform stochastic gradient descent independently of each other while sharing a common parameter vector which is updated asynchronously. ! •“Batch” option: linear function classes where parts of the function can be computed independently on several cores. Source: Slow learners are fast: http://papers.nips.cc/paper/3888-slow-learners-are-fast.pdf
  71. 3. Scalability of modeling algorithms Approaches of different libraries: !

    •Mahout: Stochastic Gradient Descent ! •MLlib in Spark: Stochastic Gradient Descent ! •Vowpal Wabbit: a variant of online gradient descent. Conjugate gradient (CG), mini-batch, and data-dependent learning rates are included. ! •Jubatus: Loose model sharing. The key is to share only models rather than data between distributed servers. Iterative Parameter Mixture. •UPDATE •MIX •ANALYZE ! Christine Doig. Maryam Pashmi. May 2014 Jubatus: An Open Source Platform for Distributed Online Machine Learning. http://biglearn.org/2013/files/papers/biglearning2013_submission_24.pdf http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
  72. Mining Big Data Sets 0. Introduction 1.State of the art

    2.Frameworks and libraries 2.1 MapReduce - Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  73. 4. Summary Big Data challenges, trends and development -Optimization: DAG

    (Directed Acyclic Graph) scheduling -Performance: interactive results, speed, in-memory -Building tools for different capabilities, hiding underlying processes: importance of end users Open Source frameworks/libraries -Hadoop/MapReduce: mining large datasets as a batch process -Cascading/Pattern: building data applications that incorporate ML/Stats algorithms -MADlib: bring the algorithms to the database. Familiarity with SQL -Spark/MLbase: DAG, in-memory, optimization of model selection Scalability of modeling algorithms Different numerical optimization techniques for large datasets -Stochastic Gradient Descent -Subgradient or “mini-batch” Gradient Descent -Batch Gradient Descent Christine Doig. Maryam Pashmi. May 2014
  74. Mining Big Data Sets 0. Introduction 1.State of the art

    2.Frameworks and libraries 2.1 MapReduce - Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  75. Christine Doig. Maryam Pashmi. May 2014 5. Recommendations and Resources

    •Related MOOC courses: •Machine Learning, Coursera •Introduction to Hadoop and MapReduce, Udacity •Introduction to Data Science, Coursera •Learning from data, edX •Web Intelligence and Big Data, Coursera ! •Other courses: •Mining Massive Datasets: Stanford course http://infolab.stanford.edu/~ullman/mmds.html •Book •Slides ! •O’Reilly books -Safari Online books, subscription !
  76. Christine Doig. Maryam Pashmi. May 2014 5. Recommendations and Resources

    •Map-Reduce for Machine Learning on Multicore http://www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf •Spark http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf •Jubatus http://biglearn.org/2013/files/papers/biglearning2013_submission_24.pdf •Samoa: http://yahoo.github.io/samoa/SAMOA-Developers-Guide-0-0-1.pdf ! •HPDA: http://www.hpcuserforum.com/presentations/tuscon2013/IDCHPDABigDataHPC.pdf •MADlib: http://madlib.net/ http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-38.pdf ! ! !
  77. Christine Doig. Maryam Pashmi. May 2014 5. Recommendations and Resources

    Some online tutorials/exercises: R in Hadoop: http://hortonworks.com/hadoop-tutorial/using-rhadoop-to-predict-visitors-amount/ http://www.revolutionanalytics.com/sites/default/files/revoscalerdecisiontrees.pdf ! Tutorials for Cascading: http://www.cascading.org/documentation/tutorials/ ! Spark: A Data Science Case Study http://hortonworks.com/blog/spark-data-science-case-study/ ! Cascading Pattern to translate from R to Hadoop. Example: anti-fraud classifier used in e-commerce apps http://blog.cloudera.com/blog/2013/11/how-to-use-cascading-pattern-with-r-and-cdh/ ! Movie Recommendation with Scalding: http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and- scalding/ ! Vowpal wabbit: http://zinkov.com/posts/2013-08-13-vowpal-tutorial/ !
  78. Christine Doig. Maryam Pashmi. May 2014 5. Recommendations and Resources

    •Python for Data Science: •IPython/IPython notebook •Numpy •Scipy •Pandas •Matplotlib •Scikit-learn •Orange •Mlpy •Numba •Blaze •Pytables ! ! Presentation at Pybcn: http://chdoig.github.io/pybcn-python4science/ ADM Paper: A Primer on Python for DM (ask me) Python vs R: http://inside-bigdata.com/2013/12/09/data- science-wars-python-vs-r/ http://www.kaggle.com/forums/t/5243/pros-and- cons-of-r-vs-python-sci-kit-learn
  79. Mining Big Data Sets 0. Introduction 1.State of the art

    - Big Data Mining 2.Frameworks and libraries 2.1 MapReduce - Mahout 2.2 Cascading – Pattern 2.3 MADlib 2.4 Spark - MLlib 3.Scalability of modeling algorithms 4.Summary 5.Resources and Recommendations 6.Exercise Christine Doig. Maryam Pashmi. May 2014
  80. Exercise 1: Christine Doig. Maryam Pashmi. May 2014 Q1: Order

    the following implementations by their performance (worst -> best) in mining a big data set (imagine 1TB of data). Why? ! Q2: Name three issues/problems in writing plain MapReduce in Java for DM/ML algorithms. Why?