
Big Data Intelligence & Analytics Part 2: Scalable Advanced Massive Online Analysis

Big Data Intelligence & Analytics Part 2:
- introduction to data stream engines
- introduction to big data stream mining
- introduction to SAMOA: Scalable Advanced Massive Online Analysis
- strategies to parallelize and scale data stream algorithms

Mário Cordeiro

October 19, 2019

Transcript

  1. Big Data Intelligence & Analytics Who am I • Hometown:

    • Miranda do Douro, Bragança, Portugal
  3. Big Data Intelligence & Analytics Who am I • Education:

    • 2000: Degree in Electrical and Computer Engineering, FEUP • Digital Television and Digital Broadcast • 2008: Master in Sciences in Informatics Engineering, FEUP • Information Retrieval • Since 2011: PhD student in Doctoral Program in Informatics Engineering, FEUP • Event detection on social network data streams • Dynamic Complex Networks
  4. Big Data Intelligence & Analytics Work and research interests •

    Professional Work • Senior Engineer at Critical Manufacturing (CM was acquired by ASM Pacific Technology) | 5 Source: https://www.forbes.com/sites/bernardmarr/2018/09/02/what-is-industry-4-0-heres-a-super-easy-explanation-for-anyone/
  5. Big Data Intelligence & Analytics Work and research interests Real-time

    enterprise-wide visualization and monitoring is crucial for high tech industries: | 6 Critical Manufacturing FabLive: https://www.criticalmanufacturing.com/en/critical-manufacturing-mes/fablive
  6. Big Data Intelligence & Analytics Work and research interests •

    Lecturer: • Invited assistant at ISEP-DEI, since 2010 • Computer Networks (RCOMP), Data Structures (ESINF), Algorithmic and Programming (APROG) • Former invited assistant at FEUP-DEI, 2012-2016 • Computing Theory (TCOM), Computers Laboratory (LCOM) | 7
  7. Big Data Intelligence & Analytics Work and research interests •

    Selected papers: • Cordeiro, M., Sarmento, R. P., Brazdil, P., Kimura, M., Gama, J., Identifying, Ranking and Tracking Community Leaders in Evolving Social Networks. International Conference on Complex Networks and their Applications (COMPLEX NETWORKS), 2019 • Cordeiro, M., Sarmento, R. P., Brazdil, P., Gama, J., Evolving Networks and Social Network Analysis: Methods and Techniques. Journalism and Social Media - New Trends and Cultural Implications (IntechOpen Book Chapter), 2018 • Sarmento, R. P., Cordeiro, M., Brazdil, P., Gama, J., Efficient Incremental Laplace Centrality Algorithm for Dynamic Networks. International Conference on Complex Networks and their Applications (COMPLEX NETWORKS), 2017 • Cordeiro, M., Sarmento, R. P., Gama, J., Evolving Networks Dynamic Community Detection using Locality Modularity Optimization. Social Network Analysis and Mining (SNAM), 2016 • M. Cordeiro, Twitter event detection: combining wavelet analysis and topic inference summarization, DSIE’12, the Doctoral Symposium on Informatics Engineering, 2012 | 8
  8. Big Data Intelligence & Analytics Work and research interests Identifying,

    Ranking and Tracking Community Leaders in Evolving Social Networks: | 9 Zachary karate club, classical vs hierarchical community detection
  9. Big Data Intelligence & Analytics Work and research interests Identifying,

    Ranking and Tracking Community Leaders in Evolving Social Networks: | 10 Temporal collaboration network of Jure Leskovec and Andrew Ng Temporal Zachary karate club
  10. Big Data Intelligence & Analytics Road Map Session 1: •

    Big data science • Issues with (small or big) data quality (examples in healthcare data) • Streaming data sources (examples in energy providers data) • Approximate vs exact computations (practical examples) Session 2: • From streaming to ubiquitous data sources • Distributed streaming versions of state-of-the-art data mining algorithms • Real-world application examples of such algorithms Session 3: • MOA: Massive Online Analysis Session 4: • SAMOA: Scalable Advanced Massive Online Analysis | 12
  11. Big Data Intelligence & Analytics The journey Data Data Streams

    Big Data Big Data Stream | 13 Single Node Multi Node Real-time Analytics Batch Analytics
  13. Big Data Intelligence & Analytics About this presentation • Adapted

    from 2018 Albert Bifet Big Data Intelligence & Analytics slides | 15
  14. Big Data Intelligence & Analytics Digital Universe EMC Digital Universe

    with Research & Analysis by IDC The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things April 2014 | 17
  15. Big Data Intelligence & Analytics Digital Universe | 18 Figure:

    https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
  16. Big Data Intelligence & Analytics Digital Universe | 19 Figure:

    https://www.wsj.com/articles/to-keep-track-of-worlds-data-youll-need-more-than-a-yottabyte-11552048200
  17. Big Data Intelligence & Analytics Digital Universe | 20 Figure:

    https://www.visualcapitalist.com/how-much-data-is-generated-each-day/
  18. Big Data Intelligence & Analytics Digital Universe | 21 Figure:

    https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
  19. Big Data Intelligence & Analytics Digital Universe | 22 Figure:

    https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
  20. Big Data Intelligence & Analytics Big Data 6V’s | 23

    Image: https://searchdatamanagement.techtarget.com/definition/big-data
  21. Big Data Intelligence & Analytics Big Data 10V’s | 24

    Source: https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx
  22. Big Data Intelligence & Analytics Hadoop Map Reduce • Hadoop

    architecture deals with datasets, not data streams | 25 Big Data and Google's Three Papers I - GFS and MapReduce: https://bowenli86.github.io/2016/10/23/distributed%20system/data/Big-Data-and-Google-s-Three-Papers-I-GFS-and-MapReduce/
  23. Big Data Intelligence & Analytics Hadoop Map Reduce • three

    operations: • Map • Shuffle • Reduce | 26 Dean et al. MapReduce: Simplified Data Processing on Large Clusters, 2004: http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
  24. Big Data Intelligence & Analytics Hadoop Map Reduce MapReduce example

    counts the appearance of each word in a set of documents:

    function map(String name, String document):
      // name: document name
      // document: document contents
      for each word w in document:
        emit (w, 1)

    function reduce(String word, Iterator partialCounts):
      // word: a word
      // partialCounts: a list of aggregated partial counts
      sum = 0
      for each pc in partialCounts:
        sum += pc
      emit (word, sum)

    | 27 Dean et al. MapReduce: Simplified Data Processing on Large Clusters, 2004: http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
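A quick runnable companion to the pseudocode above: a single-process Python sketch of the same word count, with the shuffle phase simulated by grouping mapped pairs by key (the sample documents are invented for illustration):

    # Single-process sketch of MapReduce word count; the "shuffle" step
    # groups the mapped (word, 1) pairs by key, as the framework would.
    from collections import defaultdict

    def map_phase(name, document):
        # emit (w, 1) for each word w in the document
        return [(word, 1) for word in document.split()]

    def reduce_phase(word, partial_counts):
        # sum the partial counts for one word
        return (word, sum(partial_counts))

    documents = {"d1": "the quick brown fox", "d2": "the lazy dog"}
    shuffled = defaultdict(list)
    for name, document in documents.items():
        for word, count in map_phase(name, document):
            shuffled[word].append(count)

    for word, counts in shuffled.items():
        print(reduce_phase(word, counts))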
  25. Big Data Intelligence & Analytics Hadoop Map Reduce • Distributed

    File Systems: | 28 Ghemawat et al. The Google File System, 2003: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
  26. Big Data Intelligence & Analytics Requirements • We should have

    some ways of coupling programs like garden hose–screw in another segment when it becomes necessary to massage data in another way. This is the way of IO also. • Our loader should be able to do link-loading and controlled establishment. • Our library filing scheme should allow for rather general indexing, responsibility, generations, data path switching. • It should be possible to get private system components (all routines are system components) for buggering around with. - M. D. McIlroy, 1968 | 29–30
  28. Big Data Intelligence & Analytics Unix Pipelines cat colors.txt |

    sort | uniq -c | sort -rnk 1 | head -3 > favcolors.txt | 32
  29. Big Data Intelligence & Analytics Unix Pipelines cat colors.txt |

    sort -c | uniq -c | head -3 > favcolors.txt cat colors.txt | sort | uniq -c | sort -rnk 1 | head -3 > favcolors.txt | 33 Online: https://www.datascienceatthecommandline.com
  30. Big Data Intelligence & Analytics Unix Pipelines | 34–38 Apache

    Kafka, Samza, and the Unix Philosophy of Distributed Data: Martin Kleppmann, Confluent: https://www.confluent.io/blog/apache-kafka-samza-and-the-unix-philosophy-of-distributed-data/ (slides 30–34 step through successive figures from this article)
  35. Big Data Intelligence & Analytics Real Time Processing Jay Kreps,

    LinkedIn The Log: What every software engineer should know about real-time data’s unifying abstraction | 39 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  36. Big Data Intelligence & Analytics • The Log • perhaps

    the simplest possible storage abstraction. • It is an append-only, totally-ordered sequence of records ordered by time Real Time Processing | 40 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
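As a toy illustration of this abstraction (not LinkedIn's implementation; the class and method names below are invented), here is a minimal append-only log in Python:

    # Minimal sketch of "the log": an append-only, totally ordered
    # sequence of records, addressed by offset.
    class Log:
        def __init__(self):
            self._records = []

        def append(self, record):
            # records are only ever added at the end;
            # the list index doubles as the record's offset
            self._records.append(record)
            return len(self._records) - 1

        def read_from(self, offset):
            # readers consume sequentially from any offset they remember
            return self._records[offset:]

    log = Log()
    log.append({"event": "page_view", "user": 1})
    log.append({"event": "click", "user": 2})
    print(log.read_from(0))  # replay the whole history, in order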
  37. Big Data Intelligence & Analytics Real Time Processing • The

    Log • horizontal scaling by chopping the log up into partitions: | 41 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  38. Big Data Intelligence & Analytics Real Time Processing • The

    Log • building out custom data loads for each data source and destination was clearly infeasible • full connectivity would end up with O(N²) pipelines. | 42 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  39. Big Data Intelligence & Analytics Real Time Processing • The

    Log • Unified log: | 43 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  40. Big Data Intelligence & Analytics Real Time Processing • The

    Log • Event sourcing | 44 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  41. Big Data Intelligence & Analytics Real Time Processing | 45

    Source: https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/ • The Log • Event sourcing based architecture https://martinfowler.com/eaaDev/EventSourcing.html
  42. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. | 47 Source: https://kafka.apache.org/intro
  43. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    Components of Apache Kafka • topics: categories that Kafka uses to maintain feeds of messages • producers: processes that publish messages to a Kafka topic • consumers: processes that subscribe to topics and process the feed of published messages • broker: server that is part of the cluster that runs Kafka | 48 Source: https://kafka.apache.org/intro
  44. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    Under the hood: • The Kafka cluster maintains a partitioned log. • Each partition is an ordered, immutable sequence of messages that is continually appended to a commit log. • The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition. | 49 Source: https://kafka.apache.org/intro
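A minimal producer/consumer sketch of these ideas, assuming the third-party kafka-python client and a broker at localhost:9092 (both assumptions, not from the slides):

    # Sketch using the kafka-python client (pip install kafka-python)
    # against a hypothetical broker at localhost:9092.
    from kafka import KafkaConsumer, KafkaProducer

    # Producer: each message is appended to a partition of the topic's log.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("colors", b"red")
    producer.send("colors", b"blue")
    producer.flush()

    # Consumer: read the partitioned log from the beginning; every message
    # carries its partition and its offset within that partition.
    consumer = KafkaConsumer(
        "colors",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating when the log is drained
    )
    for message in consumer:
        print(message.partition, message.offset, message.value)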
  45. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups: | 50 Source: https://kafka.apache.org/intro
  46. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    Guarantees: • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. • A consumer instance sees messages in the order they are stored in the log. • For a topic with replication factor N, Kafka tolerates up to N-1 server failures without losing any messages committed to the log. | 51 Source: https://kafka.apache.org/intro
  47. Big Data Intelligence & Analytics Apache Samza • Samza is

    a stream processing framework with the following features: • Simple API: it provides a very simple callback-based ”process message” API comparable to MapReduce. • Managed state: Samza manages snapshotting and restoration of a stream processor’s state. • Fault tolerance: Whenever a machine fails, Samza works with YARN to transparently migrate your tasks to another machine. • Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost. • Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in. • Pluggable: Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments. • Processor isolation: Samza works with Apache YARN | 53 Source: http://samza.apache.org/
  48. Big Data Intelligence & Analytics Apache Samza • Storm and

    Samza are fairly similar. Both systems provide: 1. a partitioned stream model, 2. a distributed execution environment, 3. an API for stream processing, 4. fault tolerance, 5. Kafka integration | 55 Source: http://samza.apache.org/
  49. Big Data Intelligence & Analytics Apache Samza • Samza components:

    • Streams: A stream is composed of immutable messages of a similar type or category • Jobs: code that performs a logical transformation on a set of input streams to append output messages to a set of output streams • Samza parallel components: • Partitions: Each stream is broken into one or more partitions. Each partition in the stream is a totally ordered sequence of messages. • Tasks: A job is scaled by breaking it into multiple tasks. The task is the unit of parallelism of the job, just as the partition is to the stream. | 56 Source: http://samza.apache.org/
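To make the partition/task relationship concrete, here is an illustrative Python sketch (not the Samza API; all names are invented): messages are routed to partitions by key, and one task instance consumes each partition in order.

    # Illustrative sketch (not the Samza API): a stream is split into
    # partitions by message key; a job scales by running one task per partition.
    import zlib

    NUM_PARTITIONS = 4

    def partition_for(key):
        # stable hashing: the same key always lands in the same partition,
        # so each partition remains a totally ordered sequence of messages
        return zlib.crc32(key.encode()) % NUM_PARTITIONS

    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in [("user1", "login"), ("user2", "click"), ("user1", "logout")]:
        partitions[partition_for(key)].append((key, value))

    # the task is the unit of parallelism: one task owns one partition
    for task_id, messages in enumerate(partitions):
        for key, value in messages:
            print(f"task {task_id} processing {key}: {value}")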
  50. Big Data Intelligence & Analytics Apache Samza (parenthesis) • The

    dataflow pattern: • Adopted by many other platforms: • https://streamsets.com/ • https://beam.apache.org/ • https://cloud.google.com/dataflow/ • https://airflow.apache.org/ | 58 StreamSets Demo - Connected Car with StreamSets Data Collector: https://www.youtube.com/watch?time_continue=308&v=qAyFvC4c2n4
  51. Big Data Intelligence & Analytics Apache Samza • Samza architecture:

    • Samza is made up of three layers: 1. A streaming layer. 2. An execution layer. 3. A processing layer • Samza provides out of the box support for all three layers. 1. Streaming: Kafka 2. Execution: YARN 3. Processing: Samza API | 59 Source: http://samza.apache.org/
  52. Big Data Intelligence & Analytics Apache Samza • Samza architecture:

    • These three pieces fit together to form Samza: • This architecture follows a similar pattern to Hadoop (which also uses YARN as execution layer, HDFS for storage, and MapReduce as processing API): | 60 Source: http://samza.apache.org/
  53. Big Data Intelligence & Analytics Apache Samza • Samza, Yarn

    and Kafka integration: | 61 Source: http://samza.apache.org/ (RM: Resource Manager, NM: Node Manager, AM: Application Master)
  54. Big Data Intelligence & Analytics Apache Storm • Apache S4

    from Yahoo: | 63 No longer an active project.
  55. Big Data Intelligence & Analytics Apache Storm A Storm application

    has essentially four components/abstractions: • Topology: the logic for any real-time application is packaged in the form of a topology – which is essentially a network of bolts and spouts. | 64 Source: http://storm.apache.org
  56. Big Data Intelligence & Analytics Apache Storm A Storm application

    has essentially four components/abstractions: • Streams: streams are a sequence of tuples that are created and processed in real-time in a distributed environment. | 65 Source: http://storm.apache.org Tuples are the main data structures in a Storm cluster. These are named lists of values where the values can be anything from integers, longs, shorts, bytes, doubles, strings, booleans, and floats, to byte arrays.
  57. Big Data Intelligence & Analytics Apache Storm A Storm application

    has essentially four components/abstractions: • Spout: A spout is the source of streams in a Storm topology. It is responsible for getting in touch with the actual data source, receiving data continuously, transforming those data into the actual stream of tuples and finally sending them to the bolts to be processed. | 66 Source: http://storm.apache.org
  58. Big Data Intelligence & Analytics Apache Storm A Storm application

    has essentially four components/abstractions: • Bolt: Bolts are responsible for performing all the processing of the topology. They form the processing logic unit of a Storm application. One can utilise bolts to perform many essential operations like filtering, functions, joins, aggregations, connecting to databases, and many more. | 67 Source: http://storm.apache.org
  59. Big Data Intelligence & Analytics Apache Storm Stream groupings: •

    spouts and bolts execute in parallel as many tasks across the cluster. If you look at how a topology is executing at the task level, it looks something like this: | 68 Source: http://storm.apache.org
  60. Big Data Intelligence & Analytics Apache Storm General Architecture and

    Important Components • Storm cluster nodes: • Nimbus node: (master node, similar to the Hadoop JobTracker): • Uploads computations for execution • Distributes code across the cluster • Launches workers across the cluster • Monitors computation and reallocates workers as needed • ZooKeeper nodes: • coordinates the Storm cluster • Supervisor nodes: • communicates with Nimbus through Zookeeper, starts and stops workers according to signals from Nimbus | 69 Source: https://www.upgrad.com/blog/everything-you-need-to-know-about-apache-storm/
  61. Big Data Intelligence & Analytics Apache Storm Storm Abstractions: •

    Tuples: an ordered list of elements. • Streams: an unbounded sequence of tuples. • Spouts: sources of streams in a computation • Bolts: process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases. • Topologies: the overall calculation, represented visually as a network of spouts and bolts | 70 Source: http://storm.apache.org
  62. Big Data Intelligence & Analytics Apache Storm Main Storm Groupings:

    • Shuffle grouping: Tuples are randomly distributed, but each bolt is guaranteed to get an equal number of tuples. • Fields grouping: The stream is partitioned by the fields specified in the grouping. • Partial Key grouping: The stream is partitioned by the fields specified in the grouping, but tuples are load-balanced between two downstream bolts. • All grouping: The stream is replicated across all the bolt’s tasks. • Global grouping: The entire stream goes to the task with the lowest id. | 71 Source: http://storm.apache.org
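An illustrative Python sketch (not Storm's API; names invented) of how the first two groupings route tuples to downstream bolt tasks:

    # Routing sketch: shuffle grouping balances tuples across tasks,
    # fields grouping sends equal field values to the same task.
    import itertools
    import zlib

    NUM_TASKS = 3
    _round_robin = itertools.cycle(range(NUM_TASKS))

    def shuffle_grouping(_tup):
        # even distribution: every task gets roughly the same tuple count
        return next(_round_robin)

    def fields_grouping(tup, field):
        # partition by a field: a given value always reaches the same task
        return zlib.crc32(str(tup[field]).encode()) % NUM_TASKS

    for tup in [{"word": w} for w in ["storm", "heron", "storm", "samza"]]:
        print(tup["word"],
              "shuffle ->", shuffle_grouping(tup),
              "| fields ->", fields_grouping(tup, "word"))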
  63. Big Data Intelligence & Analytics Apache Storm Storm characteristics for

    real-time data processing workloads: 1. Fast 2. Scalable 3. Fault-tolerant 4. Reliable 5. Easy to operate | 72 Source: http://storm.apache.org
  64. Big Data Intelligence & Analytics Apache Heron A realtime, distributed,

    fault-tolerant stream processing engine from Twitter. It has a wide array of architectural improvements over its predecessor: 1. Off-the-shelf scheduler 2. Handling spikes and congestion 3. Easy debugging 4. Compatibility with Storm 5. Scalability and latency | 74 Source: https://apache.github.io/incubator-heron/
  65. Big Data Intelligence & Analytics Apache Heron Heron Architecture: |

    75 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  66. Big Data Intelligence & Analytics Apache Heron Topology Architecture: |

    76 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  67. Big Data Intelligence & Analytics Apache Heron Throughput with acks

    enabled: | 77 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  68. Big Data Intelligence & Analytics Apache Heron Latency with acks

    enabled: | 78 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  69. Big Data Intelligence & Analytics Apache Heron Highlights: 1. Able

    to re-use the code written using Storm 2. Efficient in terms of resource usage 3. 3x reduction in hardware 4. Now open-source: https://blog.twitter.com/engineering/en_us/topics/open-source/2016/open-sourcing-twitter-heron.html | 79 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  70. Big Data Intelligence & Analytics Apache Spark IBM and Apache

    Spark: (2015) | 81 Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
  71. Big Data Intelligence & Analytics Apache Spark IBM and Apache

    Spark: (2018) | 82 Source: https://developer.ibm.com/blogs/ibm-continues-commitment-to-apache-spark/
  72. Big Data Intelligence & Analytics Apache Spark What is Apache

    Spark Apache Spark is a fast and general engine for large-scale data processing: • Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk • Ease of Use: Write applications quickly in Java, Scala, Python, R and SQL. • Generality: Combine SQL, streaming, and complex analytics. • Runs Everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources. | 83 Source: https://spark.apache.org/ Logistic regression in Hadoop and Spark
  73. Big Data Intelligence & Analytics Apache Spark Spark API

    Spark’s Python API:
    text_file = spark.textFile("hdfs://...")
    text_file.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)

    Spark’s Scala API:
    val textFile = sc.textFile("hdfs://...")
    val counts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)

    | 85 Source: https://spark.apache.org/
  74. Big Data Intelligence & Analytics Apache Spark Apache Spark Project

    • Spark started as a research project at UC Berkeley • Matei Zaharia created Spark during his PhD • Ion Stoica was his advisor • DataBricks is the Spark start-up | 86 Source: https://spark.apache.org/ https://cs.stanford.edu/~matei/ Total Funding Amount $497M Source: https://www.crunchbase.com/organization/databricks
  75. Big Data Intelligence & Analytics Apache Spark Resilient Distributed Datasets

    (RDDs) • An RDD is a fault-tolerant collection of elements that can be operated on in parallel. • RDDs are created: • parallelizing an existing collection in your driver program, or • referencing a dataset in an external storage system | 87 Source: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  76. Big Data Intelligence & Analytics Apache Spark Spark API: Parallelized

    Collections

    Spark’s Python API:
    data = [1, 2, 3, 4, 5]
    distData = sc.parallelize(data)

    Spark’s Scala API:
    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)

    Spark’s Java API:
    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> distData = sc.parallelize(data);

    | 88 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  77. Big Data Intelligence & Analytics Apache Spark Spark API: External

    Datasets

    Spark’s Python API:
    >>> distFile = sc.textFile("data.txt")

    Spark’s Scala API:
    scala> val distFile = sc.textFile("data.txt")
    distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26

    Spark’s Java API:
    JavaRDD<String> distFile = sc.textFile("data.txt");

    | 89 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  78. Big Data Intelligence & Analytics Apache Spark Spark API: RDD

    Operations

    Spark’s Python API:
    lines = sc.textFile("data.txt")
    lineLengths = lines.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)

    Spark’s Scala API:
    val lines = sc.textFile("data.txt")
    val lineLengths = lines.map(s => s.length)
    val totalLength = lineLengths.reduce((a, b) => a + b)

    Spark’s Java API:
    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
    int totalLength = lineLengths.reduce((a, b) -> a + b);

    | 90 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  79. Big Data Intelligence & Analytics Apache Spark Spark API: Working

    with Key-Value Pairs

    Spark’s Python API:
    lines = sc.textFile("data.txt")
    pairs = lines.map(lambda s: (s, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    Spark’s Scala API:
    val lines = sc.textFile("data.txt")
    val pairs = lines.map(s => (s, 1))
    val counts = pairs.reduceByKey((a, b) => a + b)

    Spark’s Java API:
    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

    | 91 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  80. Big Data Intelligence & Analytics Apache Spark Spark API: Shared

    Variables

    Spark’s Python API:
    >>> broadcastVar = sc.broadcast([1, 2, 3])
    <pyspark.broadcast.Broadcast object at 0x102789f10>
    >>> broadcastVar.value
    [1, 2, 3]

    Spark’s Scala API:
    scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
    broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
    scala> broadcastVar.value
    res0: Array[Int] = Array(1, 2, 3)

    Spark’s Java API:
    Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
    broadcastVar.value(); // returns [1, 2, 3]

    | 92 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  81. Big Data Intelligence & Analytics Apache Spark Three Apache Spark

    APIs: • Resilient Distributed Datasets (RDDs) • DataFrames • Datasets | 93 Source: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  82. Big Data Intelligence & Analytics Apache Spark Spark Cluster |

    94 Source: https://spark.apache.org/docs/latest/cluster-overview.html
  83. Big Data Intelligence & Analytics Apache Spark Spark Cluster •

    Spark is agnostic to the underlying cluster manager. • The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master. • Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. Each driver schedules its own tasks. • The driver must listen for and accept incoming connections from its executors throughout its lifetime. • Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. | 95 Source: https://spark.apache.org/docs/latest/cluster-overview.html
  84. Big Data Intelligence & Analytics Apache Spark Apache Spark Streaming

    • Spark Streaming is an extension of Spark that allows processing data stream using micro-batches of data. | 96 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
  85. Big Data Intelligence & Analytics Apache Spark Discretized Streams (DStreams)

    • Discretized Stream or DStream represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming the input stream. • Internally, a DStream is represented by a continuous series of RDDs | 97 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
  86. Big Data Intelligence & Analytics Apache Spark Discretized Streams (DStreams)

    • Any operation applied on a DStream translates to operations on the underlying RDDs. | 98 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
  87. Big Data Intelligence & Analytics Apache Spark Discretized Streams (DStreams)

    • Spark Streaming provides windowed computations, which allow transformations over a sliding window of data. | 99 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
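A hedged PySpark sketch of a windowed word count (it assumes the socket source on localhost:9999 used elsewhere in the guide): reduceByKeyAndWindow below recomputes counts over a 30-second window sliding every 10 seconds.

    # Windowed count with the Spark Streaming (DStream) Python API.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "WindowedWordCount")
    ssc = StreamingContext(sc, 1)      # 1-second micro-batches
    ssc.checkpoint("checkpoint")       # required for incremental windowing

    lines = ssc.socketTextStream("localhost", 9999)
    pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))

    windowed = pairs.reduceByKeyAndWindow(
        lambda a, b: a + b,   # add counts entering the window
        lambda a, b: a - b,   # subtract counts leaving the window
        windowDuration=30,
        slideDuration=10,
    )
    windowed.pprint()

    ssc.start()
    ssc.awaitTermination()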
  88. Big Data Intelligence & Analytics Apache Spark Example

    // Create a local StreamingContext with two working threads and batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split each line into words
    val words = lines.flatMap(_.split(" "))

    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate

    | 100 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
  89. Big Data Intelligence & Analytics Apache Spark Spark SQL and

    DataFrames • Spark SQL is a Spark module for structured data processing. • It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine. • A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database. | 101 Source: https://spark.apache.org/sql/
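A hedged PySpark sketch of the DataFrame abstraction (the rows and column names are invented for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

    # a DataFrame is a distributed collection organized into named columns
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )

    # relational-style operations on columns...
    df.filter(df.age > 30).select("name").show()

    # ...or plain SQL against the same data
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()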
  90. Big Data Intelligence & Analytics Apache Spark Spark Machine Learning

    Libraries • MLLib contains the original API built on top of RDDs. • spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines. | 102 Source: https://spark.apache.org/docs/latest/ml-pipeline.html
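A hedged spark.ml sketch of a small pipeline built on DataFrames (the two training rows are invented):

    # spark.ml pipeline: tokenize text, hash words to features,
    # then fit a logistic regression on the resulting columns.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PipelineExample").getOrCreate()
    training = spark.createDataFrame(
        [("spark is great", 1.0), ("boring batch jobs", 0.0)],
        ["text", "label"],
    )

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)

    # fit() runs the stages in order and returns a reusable PipelineModel
    model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training)
    model.transform(training).select("text", "prediction").show()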
  92. Big Data Intelligence & Analytics Apache Spark Spark GraphX •

    GraphX optimizes the representation of vertex and edge types when they are primitive data types • The property graph is a directed multigraph with user defined objects attached to each vertex and edge. | 104 Source: https://spark.apache.org/docs/latest/graphx-programming-guide.html
  93. Big Data Intelligence & Analytics Apache Spark Spark GraphX

    // Assume the SparkContext has already been constructed
    val sc: SparkContext

    // Create an RDD for the vertices
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                           (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

    // Create an RDD for edges
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                           Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

    // Define a default user in case there are relationships with missing users
    val defaultUser = ("John Doe", "Missing")

    // Build the initial Graph
    val graph = Graph(users, relationships, defaultUser)

    | 105 Source: https://spark.apache.org/docs/latest/graphx-programming-guide.html
  94. Big Data Intelligence & Analytics Apache SAMOA Apache SA(MOA) Vision

    • Data Stream mining platform • Library of state-of-the-art algorithms for practitioners • Development and collaboration framework for researchers • Algorithms & Systems | 107 Source: https://samoa.incubator.apache.org/
  95. Big Data Intelligence & Analytics Apache SAMOA Importance: • Example:

    spam detection in comments on Yahoo News • Trends change in time • Need to retrain model with new data | 108 Source: https://samoa.incubator.apache.org/
  96. Big Data Intelligence & Analytics Apache SAMOA Internet of Things

    | 109 Source: https://samoa.incubator.apache.org/
  97. Big Data Intelligence & Analytics Apache SAMOA Big Data Stream

    • Volume + Velocity (+ Variety) • Too large for single commodity server main memory • Too fast for single commodity server CPU • A solution should be: • Distributed • Scalable | 110 Image: https://twitter.com/GrandCanyonNPS
  98. Big Data Intelligence & Analytics Apache SAMOA Big Data Processing

    Engines • Low latency • High latency (not real time) | 111 Source: https://samoa.incubator.apache.org/
  99. Big Data Intelligence & Analytics Apache SAMOA Machine Learning over

    Big Data Streams • Classification • Regression • Clustering • Frequent Pattern Mining | 112 Source: https://samoa.incubator.apache.org/
  100. Big Data Intelligence & Analytics What is Apache SAMOA? Streaming

    Model: • Sequence is potentially infinite • High amount of data, high speed of arrival • Change over time (concept drift) • Approximation algorithms (small error with high probability) • Single pass, one data item at a time • Sub-linear space and time per data item Apache SAMOA | 113 Source: https://samoa.incubator.apache.org/
  101. Big Data Intelligence & Analytics Taxonomy: Apache SAMOA | 114

    Source: https://samoa.incubator.apache.org/
  102. Big Data Intelligence & Analytics Architecture: Apache SAMOA | 115

    Source: https://samoa.incubator.apache.org/ Flink Machine Learning Stream Engines
  103. Big Data Intelligence & Analytics Status: • Implementation of parallel

    algorithms: • Classification (Vertical Hoeffding Tree) • Clustering (CluStream) • Regression (Adaptive Model Rules) • Execution engines Apache SAMOA | 116 Source: https://samoa.incubator.apache.org/
  104. Big Data Intelligence & Analytics Is SAMOA useful for you:

    • Only if you need to deal with: • Large fast data • Evolving process (model updates) • Regression (Adaptive Model Rules) • What is happening now? • Use feedback in real-time • Adapt to changes faster Apache SAMOA | 117 Source: https://samoa.incubator.apache.org/
  105. Big Data Intelligence & Analytics ML Developer API: Apache SAMOA

    | 118 Source: https://samoa.incubator.apache.org/
  106. Big Data Intelligence & Analytics ML Developer API:

    TopologyBuilder builder;

    Processor sourceOne = new SourceProcessor();
    builder.addProcessor(sourceOne);
    Stream streamOne = builder.createStream(sourceOne);

    Processor sourceTwo = new SourceProcessor();
    builder.addProcessor(sourceTwo);
    Stream streamTwo = builder.createStream(sourceTwo);

    Processor join = new JoinProcessor();
    builder.addProcessor(join)
        .connectInputShuffle(streamOne)
        .connectInputKey(streamTwo);

    Apache SAMOA | 119 Source: https://samoa.incubator.apache.org/ (diagram: streamOne feeds the join processor via a shuffle connection, streamTwo via a key connection)
  107. Big Data Intelligence & Analytics Decision Tree: • Nodes are

    tests on attributes • Branches are possible outcomes • Leaves are class assignments Apache SAMOA | 120 Source: https://samoa.incubator.apache.org/ (example attributes: Road Tested, Mileage, Age, …)
  108. Big Data Intelligence & Analytics Hoeffding Tree: • Sample of

    stream enough for near-optimal decision • Estimate merit of alternatives from prefix of stream • Choose sample size based on statistical principles • When to expand a leaf? Let $x_a$ be the most informative attribute and $x_b$ the second most informative one. Hoeffding bound: split if $\bar{G}(x_a) - \bar{G}(x_b) > \epsilon$, where $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$ Apache SAMOA | 121 P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00: https://homes.cs.washington.edu/~pedrod/papers/kdd00.pdf
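A small Python sketch of that split test, under the usual assumptions that $R$ is the range of the gain metric, $\delta$ the allowed error probability, and $n$ the number of examples seen at the leaf:

    import math

    def hoeffding_bound(value_range, delta, n):
        # epsilon = sqrt(R^2 ln(1/delta) / (2n)): with probability 1 - delta,
        # the observed mean over n examples is within epsilon of the true mean
        return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

    def should_split(gain_best, gain_second, value_range, delta, n):
        # expand the leaf once the best attribute's observed advantage
        # exceeds epsilon, i.e. it is (probably) the truly best attribute
        return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)

    # information gain with 2 classes has range R = log2(2) = 1
    print(should_split(0.30, 0.18, value_range=1.0, delta=1e-7, n=2000))  # True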
  109. Big Data Intelligence & Analytics Parallel Decision Trees: • Which

    kind of parallelism? Apache SAMOA | 122 Source: https://samoa.incubator.apache.org/
  110. Big Data Intelligence & Analytics Parallel Decision Trees: • Which

    kind of parallelism? • Task Apache SAMOA | 123 Source: https://samoa.incubator.apache.org/
  111. Big Data Intelligence & Analytics Parallel Decision Trees: • Which

    kind of parallelism? • Task • Data • Horizontal • Vertical Apache SAMOA | 124–126 Source: https://samoa.incubator.apache.org/ (diagram: the data matrix, with instances as rows and attributes as columns, sliced horizontally or vertically)
  114. Big Data Intelligence & Analytics Horizontal Parallelism: Apache SAMOA |

    127 Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010: http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf (diagram: a single attribute is tracked in multiple nodes; aggregation is needed to compute splits)
  115. Big Data Intelligence & Analytics Hoeffding Tree Profiling: Apache SAMOA

    | 128 Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010: http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
  116. Big Data Intelligence & Analytics Vertical Parallelism: Apache SAMOA |

    129 Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf (diagram: a single attribute is tracked in a single node)
  117. Big Data Intelligence & Analytics Advantages Vertical Parallelism: • High

    number of attributes => high level of parallelism • (e.g., documents) • Vs task parallelism • Parallelism observed immediately • Vs horizontal parallelism • Reduced memory usage (no model replication) • Parallelized split computation Apache SAMOA | 130 Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf
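An illustrative sketch (not SAMOA code; all names invented) of what vertical parallelism buys: each worker owns a disjoint slice of the attributes and accumulates split statistics only for that slice, so the model is never replicated.

    # Vertical parallelism sketch: instances are split column-wise and
    # each worker tracks statistics only for the attributes it owns.
    from collections import defaultdict

    NUM_WORKERS = 2
    ATTRIBUTES = ["road_tested", "mileage", "age", "color"]

    def worker_for(attribute_index):
        return attribute_index % NUM_WORKERS

    # stats[worker][(attribute, value, label)] -> count, later used to
    # compute each attribute's split merit locally on its worker
    stats = [defaultdict(int) for _ in range(NUM_WORKERS)]

    instances = [
        ({"road_tested": "yes", "mileage": "low", "age": "new", "color": "red"}, "buy"),
        ({"road_tested": "no", "mileage": "high", "age": "old", "color": "blue"}, "skip"),
    ]
    for features, label in instances:
        for i, attribute in enumerate(ATTRIBUTES):
            # only the owning worker ever sees this attribute
            stats[worker_for(i)][(attribute, features[attribute], label)] += 1

    for worker_id, local in enumerate(stats):
        print(f"worker {worker_id} tracks {sorted({k[0] for k in local})}")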
  118. Big Data Intelligence & Analytics Vertical Hoeffding Tree: Apache SAMOA

    | 131 Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf
  119. Big Data Intelligence & Analytics Accuracy: Apache SAMOA | 132

    Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf
  120. Big Data Intelligence & Analytics Performance: Apache SAMOA | 133

    Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf
  121. Big Data Intelligence & Analytics Summary: • Streaming is an

    important V of Big Data • Mining big data streams is an open field • MOA: Massive Online Analysis • Available and open-source http://moa.cms.waikato.ac.nz/ • SAMOA: A Platform for Mining Big Data Streams • Available and open-source (incubating @ASF) http://samoa.incubator.apache.org Apache SAMOA | 134
  122. Big Data Intelligence & Analytics Open Challenges: • Distributed stream

    mining algorithms • Active & semi-supervised learning + crowdsourcing • Millions of classes (e.g., Wikipedia pages) • Multi-target learning • System issues (load balancing, communication) • Programming paradigms and abstractions Apache SAMOA | 135
  123. Big Data Intelligence & Analytics The journey Data Data Streams

    Big Data Big Data Stream | 137 Single Node Multi Node Real-time Analytics Batch Analytics
  124. Big Data Intelligence & Analytics • Streaming is the future

    and is happening now • Moving from Data to Data Streams requires • new algorithms, methods and techniques • approximate models, randomized methods, sketches, sampling, etc. • Mining big data streams is an open field • open fields usually mean huge opportunities • Moving from Big Data to Big Data Streams is not just changing technology • Parallelization, distribution, etc. Conclusions | 138
  125. Big Data Intelligence & Analytics Data Streams Books | 139

    Machine Learning for Data Streams with Practical Examples in MOA By Albert Bifet, Ricard Gavaldà, Geoff Holmes and Bernhard Pfahringer Online: https://moa.cms.waikato.ac.nz/book/ Mining of Massive Datasets, 2nd Edition By Jure Leskovec, Anand Rajaraman, Jeff Ullman Online: http://www.mmds.org/#ver21v Knowledge Discovery from Data Streams By João Gama Online: http://www.liaad.up.pt/area/jgama/DataStreamsCRC.pdf