
Big Data Intelligence & Analytics Part 2: Scalable Advanced Massive Online Analysis

Big Data Intelligence & Analytics Part 2:
- introduction to data stream engines
- introduction to big data stream mining
- introduction to SAMOA: Scalable Advanced Massive Online Analysis
- strategies to parallelize and scale data stream algorithms

Mário Cordeiro

October 19, 2019

Transcript

  1. Big Data Intelligence & Analytics Who am I • Hometown:

    • Miranda do Douro, Bragança, Portugal
  3. Big Data Intelligence & Analytics Who am I • Education:

    • 2000: Degree in Electrical and Computer Engineering, FEUP • Digital Television and Digital Broadcast • 2008: Master in Sciences in Informatics Engineering, FEUP • Information Retrieval • Since 2011: PhD student in Doctoral Program in Informatics Engineering, FEUP • Event detection on social network data streams • Dynamic Complex Networks
  4. Big Data Intelligence & Analytics Work and research interests •

    Professional Work • Senior Engineer at Critical Manufacturing (CM was acquired by ASM Pacific Technology) | 5 Source: https://www.forbes.com/sites/bernardmarr/2018/09/02/what-is-industry-4-0-heres-a-super-easy-explanation-for-anyone/
  5. Big Data Intelligence & Analytics Work and research interests Real-time

    enterprise-wide visualization and monitoring is crucial for high tech industries: | 6 Critical Manufacturing FabLive: https://www.criticalmanufacturing.com/en/critical-manufacturing-mes/fablive
  6. Big Data Intelligence & Analytics Work and research interests •

    Lecturer: • Invited assistant at ISEP-DEI, since 2010 • Computer Networks (RCOMP), Data Structures (ESINF), Algorithmic and Programming (APROG) • Former invited assistant at FEUP-DEI, 2012-2016 • Computing Theory (TCOM), Computers Laboratory (LCOM) | 7
  7. Big Data Intelligence & Analytics Work and research interests •

    Selected papers: • Cordeiro, M., Sarmento, R. P., Brazdil, P., Kimura, M., Gama, J., Identifying, Ranking and Tracking Community Leaders in Evolving Social Networks. International Conference on Complex Networks and their Applications (COMPLEX NETWORKS), 2019 • Cordeiro, M., Sarmento, R. P., Brazdil, P., Gama, J., Evolving Networks and Social Network Analysis: Methods and Techniques. Journalism and Social Media - New Trends and Cultural Implications (IntechOpen Book Chapter), 2018 • Sarmento, R. P., Cordeiro, M., Brazdil, P., Gama, J., Efficient Incremental Laplace Centrality Algorithm for Dynamic Networks. International Conference on Complex Networks and their Applications (COMPLEX NETWORKS), 2017 • Cordeiro, M., Sarmento, R. P., Gama, J., Evolving Networks Dynamic Community Detection using Locality Modularity Optimization. Social Network Analysis and Mining (SNAM), 2016 • M. Cordeiro, Twitter event detection: combining wavelet analysis and topic inference summarization, DSIE’12, the Doctoral Symposium on Informatics Engineering, 2012 | 8
  8. Big Data Intelligence & Analytics Work and research interests Identifying,

    Ranking and Tracking Community Leaders in Evolving Social Networks: | 9 Zachary karate club, classical vs hierarchical community detection
  9. Big Data Intelligence & Analytics Work and research interests Identifying,

    Ranking and Tracking Community Leaders in Evolving Social Networks: | 10 Temporal collaboration network of Jure Leskovec and Andrew Ng Temporal Zachary karate club
  10. Big Data Intelligence & Analytics Road Map Session 1: •

    Big data science • Issues with (small or big) data quality (examples in healthcare data) • Streaming data sources (examples in energy providers data) • Approximate vs exact computations (practical examples) Session 2: • From streaming to ubiquitous data sources • Distributed streaming versions of state-of-the-art data mining algorithms • Real-world application examples of such algorithms Session 3: • MOA: Massive Online Analysis Session 4: • SAMOA: Scalable Advanced Massive Online Analysis | 12
  11. Big Data Intelligence & Analytics The journey Data Data Streams

    Big Data Big Data Stream | 13 Single Node Multi Node Real-time Analytics Batch Analytics
  13. Big Data Intelligence & Analytics About this presentation • Adapted

    from 2018 Albert Bifet Big Data Intelligence & Analytics slides | 15
  14. Big Data Intelligence & Analytics Digital Universe EMC Digital Universe

    with Research & Analysis by IDC The Digital Universe of Opportunities: Rich Data and the Increasing Value of the Internet of Things April 2014 | 17
  15. Big Data Intelligence & Analytics Digital Universe | 18 Figure:

    https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
  16. Big Data Intelligence & Analytics Digital Universe | 19 Figure:

    https://www.wsj.com/articles/to-keep-track-of-worlds-data-youll-need-more-than-a-yottabyte-11552048200
  17. Big Data Intelligence & Analytics Digital Universe | 20 Figure:

    https://www.visualcapitalist.com/how-much-data-is-generated-each-day/
  18. Big Data Intelligence & Analytics Digital Universe | 21 Figure:

    https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
  19. Big Data Intelligence & Analytics Digital Universe | 22 Figure:

    https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
  20. Big Data Intelligence & Analytics Big Data 6V’s | 23

    Image: https://searchdatamanagement.techtarget.com/definition/big-data
  21. Big Data Intelligence & Analytics Big Data 10V’s | 24

    Source: https://tdwi.org/articles/2017/02/08/10-vs-of-big-data.aspx
  22. Big Data Intelligence & Analytics Hadoop Map Reduce • Hadoop

    architecture deals with datasets, not data streams | 25 Big Data and Google's Three Papers I - GFS and MapReduce: https://bowenli86.github.io/2016/10/23/distributed%20system/data/Big-Data-and-Google-s-Three-Papers-I-GFS-and-MapReduce/
  23. Big Data Intelligence & Analytics Hadoop Map Reduce • three

    operations: • Map • Shuffle • Reduce | 26 Dean et al. MapReduce: Simplified Data Processing on Large Clusters, 2004: http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
  24. Big Data Intelligence & Analytics Hadoop Map Reduce MapReduce example

    counts the appearance of each word in a set of documents:

    function map(String name, String document):
      // name: document name
      // document: document contents
      for each word w in document:
        emit (w, 1)

    function reduce(String word, Iterator partialCounts):
      // word: a word
      // partialCounts: a list of aggregated partial counts
      sum = 0
      for each pc in partialCounts:
        sum += pc
      emit (word, sum)

    | 27 Dean et al. MapReduce: Simplified Data Processing on Large Clusters, 2004: http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
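A quick runnable companion to the pseudocode above: a single-process Python sketch of the same word count, with the shuffle phase simulated by grouping mapped pairs by key (the sample documents are invented for illustration):

    # Single-process sketch of MapReduce word count; the "shuffle" step
    # groups the mapped (word, 1) pairs by key, as the framework would.
    from collections import defaultdict

    def map_phase(name, document):
        # emit (w, 1) for each word w in the document
        return [(word, 1) for word in document.split()]

    def reduce_phase(word, partial_counts):
        # sum the partial counts for one word
        return (word, sum(partial_counts))

    documents = {"d1": "the quick brown fox", "d2": "the lazy dog"}
    shuffled = defaultdict(list)
    for name, document in documents.items():
        for word, count in map_phase(name, document):
            shuffled[word].append(count)

    for word, counts in shuffled.items():
        print(reduce_phase(word, counts))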
  25. Big Data Intelligence & Analytics Hadoop Map Reduce • Distributed

    File Systems: | 28 Ghemawat et al. The Google File System, 2003: http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
  26. Big Data Intelligence & Analytics Requirements • We should have

    some ways of coupling programs like garden hose–screw in another segment when it becomes necessary to massage data in another way. This is the way of IO also. • Our loader should be able to do link-loading and controlled establishment. • Our library filing scheme should allow for rather general indexing, responsibility, generations, data path switching. • It should be possible to get private system components (all routines are system components) for buggering around with. - M. D. McIlroy, 1968 | 29–30
  28. Big Data Intelligence & Analytics Unix Pipelines cat colors.txt |

    sort | uniq -c | sort -rnk 1 | head -3 > favcolors.txt | 32
  29. Big Data Intelligence & Analytics Unix Pipelines cat colors.txt |

    sort -c | uniq -c | head -3 > favcolors.txt cat colors.txt | sort | uniq -c | sort -rnk 1 | head -3 > favcolors.txt | 33 Online: https://www.datascienceatthecommandline.com
  30. Big Data Intelligence & Analytics Unix Pipelines | 34–38 Apache

    Kafka, Samza, and the Unix Philosophy of Distributed Data: Martin Kleppmann, Confluent: https://www.confluent.io/blog/apache-kafka-samza-and-the-unix-philosophy-of-distributed-data/ (slides 30–34 step through successive figures from this article)
  35. Big Data Intelligence & Analytics Real Time Processing Jay Kreps,

    LinkedIn The Log: What every software engineer should know about real-time data’s unifying abstraction | 39 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  36. Big Data Intelligence & Analytics • The Log • perhaps

    the simplest possible storage abstraction. • It is an append-only, totally-ordered sequence of records ordered by time Real Time Processing | 40 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
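As a toy illustration of this abstraction (not LinkedIn's implementation; the class and method names below are invented), here is a minimal append-only log in Python:

    # Minimal sketch of "the log": an append-only, totally ordered
    # sequence of records, addressed by offset.
    class Log:
        def __init__(self):
            self._records = []

        def append(self, record):
            # records are only ever added at the end;
            # the list index doubles as the record's offset
            self._records.append(record)
            return len(self._records) - 1

        def read_from(self, offset):
            # readers consume sequentially from any offset they remember
            return self._records[offset:]

    log = Log()
    log.append({"event": "page_view", "user": 1})
    log.append({"event": "click", "user": 2})
    print(log.read_from(0))  # replay the whole history, in order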
  37. Big Data Intelligence & Analytics Real Time Processing • The

    Log • horizontal scaling by chopping the log up into partitions: | 41 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  38. Big Data Intelligence & Analytics Real Time Processing • The

    Log • building out custom data loads for each data source and destination was clearly infeasible • full connectivity would end up with O(N²) pipelines. | 42 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  39. Big Data Intelligence & Analytics Real Time Processing • The

    Log • Unified log: | 43 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  40. Big Data Intelligence & Analytics Real Time Processing • The

    Log • Event sourcing | 44 Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  41. Big Data Intelligence & Analytics Real Time Processing | 45

    Source: https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/ • The Log • Event sourcing based architecture https://martinfowler.com/eaaDev/EventSourcing.html
  42. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    Apache Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. | 47 Source: https://kafka.apache.org/intro
  43. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    Components of Apache Kafka • topics: categories that Kafka uses to maintain feeds of messages • producers: processes that publish messages to a Kafka topic • consumers: processes that subscribe to topics and process the feed of published messages • broker: server that is part of the cluster that runs Kafka | 48 Source: https://kafka.apache.org/intro
  44. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    Under the hood: • The Kafka cluster maintains a partitioned log. • Each partition is an ordered, immutable sequence of messages that is continually appended to a commit log. • The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition. | 49 Source: https://kafka.apache.org/intro
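A minimal producer/consumer sketch of these ideas, assuming the third-party kafka-python client and a broker at localhost:9092 (both assumptions, not from the slides):

    # Sketch using the kafka-python client (pip install kafka-python)
    # against a hypothetical broker at localhost:9092.
    from kafka import KafkaConsumer, KafkaProducer

    # Producer: each message is appended to a partition of the topic's log.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("colors", b"red")
    producer.send("colors", b"blue")
    producer.flush()

    # Consumer: read the partitioned log from the beginning; every message
    # carries its partition and its offset within that partition.
    consumer = KafkaConsumer(
        "colors",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,  # stop iterating when the log is drained
    )
    for message in consumer:
        print(message.partition, message.offset, message.value)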
  45. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups: | 50 Source: https://kafka.apache.org/intro
  46. Big Data Intelligence & Analytics Apache Kafka from LinkedIn •

    Guarantees: • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. • A consumer instance sees messages in the order they are stored in the log. • For a topic with replication factor N, Kafka tolerates up to N-1 server failures without losing any messages committed to the log. | 51 Source: https://kafka.apache.org/intro
  47. Big Data Intelligence & Analytics Apache Samza • Samza is

    a stream processing framework with the following features: • Simple API: it provides a very simple callback-based ”process message” API comparable to MapReduce. • Managed state: Samza manages snapshotting and restoration of a stream processor’s state. • Fault tolerance: Whenever a machine fails, Samza works with YARN to transparently migrate your tasks to another machine. • Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost. • Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in. • Pluggable: Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments. • Processor isolation: Samza works with Apache YARN | 53 Source: http://samza.apache.org/
  48. Big Data Intelligence & Analytics Apache Samza • Storm and

    Samza are fairly similar. Both systems provide: 1. a partitioned stream model, 2. a distributed execution environment, 3. an API for stream processing, 4. fault tolerance, 5. Kafka integration | 55 Source: http://samza.apache.org/
  49. Big Data Intelligence & Analytics Apache Samza • Samza components:

    • Streams: A stream is composed of immutable messages of a similar type or category • Jobs: code that performs a logical transformation on a set of input streams to append output messages to a set of output streams • Samza parallel components: • Partitions: Each stream is broken into one or more partitions. Each partition in the stream is a totally ordered sequence of messages. • Tasks: A job is scaled by breaking it into multiple tasks. The task is the unit of parallelism of the job, just as the partition is to the stream. | 56 Source: http://samza.apache.org/
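To make the partition/task relationship concrete, here is an illustrative Python sketch (not the Samza API; all names are invented): messages are routed to partitions by key, and one task instance consumes each partition in order.

    # Illustrative sketch (not the Samza API): a stream is split into
    # partitions by message key; a job scales by running one task per partition.
    import zlib

    NUM_PARTITIONS = 4

    def partition_for(key):
        # stable hashing: the same key always lands in the same partition,
        # so each partition remains a totally ordered sequence of messages
        return zlib.crc32(key.encode()) % NUM_PARTITIONS

    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, value in [("user1", "login"), ("user2", "click"), ("user1", "logout")]:
        partitions[partition_for(key)].append((key, value))

    # the task is the unit of parallelism: one task owns one partition
    for task_id, messages in enumerate(partitions):
        for key, value in messages:
            print(f"task {task_id} processing {key}: {value}")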
  50. Big Data Intelligence & Analytics Apache Samza (parenthesis) • The

    dataflow pattern: • Adopted by many other platforms: • https://streamsets.com/ • https://beam.apache.org/ • https://cloud.google.com/dataflow/ • https://airflow.apache.org/ | 58 StreamSets Demo - Connected Car with StreamSets Data Collector: https://www.youtube.com/watch?time_continue=308&v=qAyFvC4c2n4
  51. Big Data Intelligence & Analytics Apache Samza • Samza architecture:

    • Samza is made up of three layers: 1. A streaming layer. 2. An execution layer. 3. A processing layer • Samza provides out of the box support for all three layers. 1. Streaming: Kafka 2. Execution: YARN 3. Processing: Samza API | 59 Source: http://samza.apache.org/
  52. Big Data Intelligence & Analytics Apache Samza • Samza architecture:

    • These three pieces fit together to form Samza: • This architecture follows a similar pattern to Hadoop (which also uses YARN as execution layer, HDFS for storage, and MapReduce as processing API): | 60 Source: http://samza.apache.org/
  53. Big Data Intelligence & Analytics Apache Samza • Samza, Yarn

    and Kafka integration: | 61 Source: http://samza.apache.org/ (RM: Resource Manager, NM: Node Manager, AM: Application Master)
  54. Big Data Intelligence & Analytics Apache Storm • Apache S4

    from Yahoo: | 63 No longer an active project.
  55. Big Data Intelligence & Analytics Apache Storm A Storm application

    has essentially four components/abstractions: • Topology: the logic for any real-time application is packaged in the form of a topology – which is essentially a network of bolts and spouts. | 64 Source: http://storm.apache.org
  56. Big Data Intelligence & Analytics Apache Storm A Storm application

    has essentially four components/abstractions: • Streams: streams are a sequence of tuples that are created and processed in real-time in a distributed environment. | 65 Source: http://storm.apache.org Tuples are the main data structures in a Storm cluster. These are named lists of values where the values can be anything from integers, longs, shorts, bytes, doubles, strings, booleans, and floats, to byte arrays.
  57. Big Data Intelligence & Analytics Apache Storm A Storm application

    has essentially four components/abstractions: • Spout: A spout is the source of streams in a Storm topology. It is responsible for getting in touch with the actual data source, receiving data continuously, transforming those data into the actual stream of tuples and finally sending them to the bolts to be processed. | 66 Source: http://storm.apache.org
  58. Big Data Intelligence & Analytics Apache Storm A Storm application

    has essentially four components/abstractions: • Bolt: Bolts are responsible for performing all the processing of the topology. They form the processing logic unit of a Storm application. One can utilise bolts to perform many essential operations like filtering, functions, joins, aggregations, connecting to databases, and many more. | 67 Source: http://storm.apache.org
  59. Big Data Intelligence & Analytics Apache Storm Stream groupings: •

    spouts and bolts execute in parallel as many tasks across the cluster. If you look at how a topology is executing at the task level, it looks something like this: | 68 Source: http://storm.apache.org
  60. Big Data Intelligence & Analytics Apache Storm General Architecture and

    Important Components • Storm cluster nodes: • Nimbus node: (master node, similar to the Hadoop JobTracker): • Uploads computations for execution • Distributes code across the cluster • Launches workers across the cluster • Monitors computation and reallocates workers as needed • ZooKeeper nodes: • coordinates the Storm cluster • Supervisor nodes: • communicates with Nimbus through Zookeeper, starts and stops workers according to signals from Nimbus | 69 Source: https://www.upgrad.com/blog/everything-you-need-to-know-about-apache-storm/
  61. Big Data Intelligence & Analytics Apache Storm Storm Abstractions: •

    Tuples: an ordered list of elements. • Streams: an unbounded sequence of tuples. • Spouts: sources of streams in a computation • Bolts: process input streams and produce output streams. They can: run functions; filter, aggregate, or join data; or talk to databases. • Topologies: the overall calculation, represented visually as a network of spouts and bolts | 70 Source: http://storm.apache.org
  62. Big Data Intelligence & Analytics Apache Storm Main Storm Groupings:

    • Shuffle grouping: Tuples are randomly distributed, but each bolt is guaranteed to get an equal number of tuples. • Fields grouping: The stream is partitioned by the fields specified in the grouping. • Partial Key grouping: The stream is partitioned by the fields specified in the grouping, but tuples are load-balanced between two downstream bolts. • All grouping: The stream is replicated across all the bolt’s tasks. • Global grouping: The entire stream goes to the task with the lowest id. | 71 Source: http://storm.apache.org
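An illustrative Python sketch (not Storm's API; names invented) of how the first two groupings route tuples to downstream bolt tasks:

    # Routing sketch: shuffle grouping balances tuples across tasks,
    # fields grouping sends equal field values to the same task.
    import itertools
    import zlib

    NUM_TASKS = 3
    _round_robin = itertools.cycle(range(NUM_TASKS))

    def shuffle_grouping(_tup):
        # even distribution: every task gets roughly the same tuple count
        return next(_round_robin)

    def fields_grouping(tup, field):
        # partition by a field: a given value always reaches the same task
        return zlib.crc32(str(tup[field]).encode()) % NUM_TASKS

    for tup in [{"word": w} for w in ["storm", "heron", "storm", "samza"]]:
        print(tup["word"],
              "shuffle ->", shuffle_grouping(tup),
              "| fields ->", fields_grouping(tup, "word"))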
  63. Big Data Intelligence & Analytics Apache Storm Storm characteristics for

    real-time data processing workloads: 1. Fast 2. Scalable 3. Fault-tolerant 4. Reliable 5. Easy to operate | 72 Source: http://storm.apache.org
  64. Big Data Intelligence & Analytics Apache Heron A realtime, distributed,

    fault-tolerant stream processing engine from Twitter. It has a wide array of architectural improvements over its predecessor: 1. Off-the-shelf scheduler 2. Handling spikes and congestion 3. Easy debugging 4. Compatibility with Storm 5. Scalability and latency | 74 Source: https://apache.github.io/incubator-heron/
  65. Big Data Intelligence & Analytics Apache Heron Heron Architecture: |

    75 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  66. Big Data Intelligence & Analytics Apache Heron Topology Architecture: |

    76 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  67. Big Data Intelligence & Analytics Apache Heron Throughput with acks

    enabled: | 77 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  68. Big Data Intelligence & Analytics Apache Heron Latency with acks

    enabled: | 78 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  69. Big Data Intelligence & Analytics Apache Heron Highlights: 1. Able

    to re-use the code written using Storm 2. Efficient in terms of resource usage 3. 3x reduction in hardware 4. Now open-source: https://blog.twitter.com/engineering/en_us/topics/open-source/2016/open-sourcing-twitter-heron.html | 79 Source: https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
  70. Big Data Intelligence & Analytics Apache Spark IBM and Apache

    Spark: (2015) | 81 Source: https://www-03.ibm.com/press/us/en/pressrelease/47107.wss
  71. Big Data Intelligence & Analytics Apache Spark IBM and Apache

    Spark: (2018) | 82 Source: https://developer.ibm.com/blogs/ibm-continues-commitment-to-apache-spark/
  72. Big Data Intelligence & Analytics Apache Spark What is Apache

    Spark Apache Spark is a fast and general engine for large-scale data processing: • Speed: Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk • Ease of Use: Write applications quickly in Java, Scala, Python, R and SQL. • Generality: Combine SQL, streaming, and complex analytics. • Runs Everywhere: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud. It can access diverse data sources. | 83 Source: https://spark.apache.org/ Logistic regression in Hadoop and Spark
  73. Big Data Intelligence & Analytics Apache Spark Spark API

    Spark’s Python API:
    text_file = spark.textFile("hdfs://...")
    text_file.flatMap(lambda line: line.split())
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)

    Spark’s Scala API:
    val textFile = sc.textFile("hdfs://...")
    val counts = textFile.flatMap(line => line.split(" "))
                         .map(word => (word, 1))
                         .reduceByKey(_ + _)

    | 85 Source: https://spark.apache.org/
  74. Big Data Intelligence & Analytics Apache Spark Apache Spark Project

    • Spark started as a research project at UC Berkeley • Matei Zaharia created Spark during his PhD • Ion Stoica was his advisor • DataBricks is the Spark start-up | 86 Source: https://spark.apache.org/ https://cs.stanford.edu/~matei/ Total Funding Amount $497M Source: https://www.crunchbase.com/organization/databricks
  75. Big Data Intelligence & Analytics Apache Spark Resilient Distributed Datasets

    (RDDs) • An RDD is a fault-tolerant collection of elements that can be operated on in parallel. • RDDs are created: • parallelizing an existing collection in your driver program, or • referencing a dataset in an external storage system | 87 Source: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  76. Big Data Intelligence & Analytics Apache Spark Spark API: Parallelized

    Collections

    Spark’s Python API:
    data = [1, 2, 3, 4, 5]
    distData = sc.parallelize(data)

    Spark’s Scala API:
    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)

    Spark’s Java API:
    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
    JavaRDD<Integer> distData = sc.parallelize(data);

    | 88 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  77. Big Data Intelligence & Analytics Apache Spark Spark API: External

    Datasets

    Spark’s Python API:
    >>> distFile = sc.textFile("data.txt")

    Spark’s Scala API:
    scala> val distFile = sc.textFile("data.txt")
    distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26

    Spark’s Java API:
    JavaRDD<String> distFile = sc.textFile("data.txt");

    | 89 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  78. Big Data Intelligence & Analytics Apache Spark Spark API: RDD

    Operations

    Spark’s Python API:
    lines = sc.textFile("data.txt")
    lineLengths = lines.map(lambda s: len(s))
    totalLength = lineLengths.reduce(lambda a, b: a + b)

    Spark’s Scala API:
    val lines = sc.textFile("data.txt")
    val lineLengths = lines.map(s => s.length)
    val totalLength = lineLengths.reduce((a, b) => a + b)

    Spark’s Java API:
    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaRDD<Integer> lineLengths = lines.map(s -> s.length());
    int totalLength = lineLengths.reduce((a, b) -> a + b);

    | 90 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  79. Big Data Intelligence & Analytics Apache Spark Spark API: Working

    with Key-Value Pairs

    Spark’s Python API:
    lines = sc.textFile("data.txt")
    pairs = lines.map(lambda s: (s, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    Spark’s Scala API:
    val lines = sc.textFile("data.txt")
    val pairs = lines.map(s => (s, 1))
    val counts = pairs.reduceByKey((a, b) => a + b)

    Spark’s Java API:
    JavaRDD<String> lines = sc.textFile("data.txt");
    JavaPairRDD<String, Integer> pairs = lines.mapToPair(s -> new Tuple2(s, 1));
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey((a, b) -> a + b);

    | 91 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  80. Big Data Intelligence & Analytics Apache Spark Spark API: Shared

    Variables

    Spark’s Python API:
    >>> broadcastVar = sc.broadcast([1, 2, 3])
    <pyspark.broadcast.Broadcast object at 0x102789f10>
    >>> broadcastVar.value
    [1, 2, 3]

    Spark’s Scala API:
    scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
    broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
    scala> broadcastVar.value
    res0: Array[Int] = Array(1, 2, 3)

    Spark’s Java API:
    Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});
    broadcastVar.value(); // returns [1, 2, 3]

    | 92 Source: https://spark.apache.org/docs/latest/rdd-programming-guide.html
  81. Big Data Intelligence & Analytics Apache Spark Three Apache Spark

    APIs: • Resilient Distributed Datasets (RDDs) • DataFrames • Datasets | 93 Source: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
  82. Big Data Intelligence & Analytics Apache Spark Spark Cluster |

    94 Source: https://spark.apache.org/docs/latest/cluster-overview.html
  83. Big Data Intelligence & Analytics Apache Spark Spark Cluster •

    Spark is agnostic to the underlying cluster manager. • The Spark driver is the program that declares the transformations and actions on RDDs of data and submits such requests to the master. • Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. Each driver schedules its own tasks. • The driver must listen for and accept incoming connections from its executors throughout its lifetime. • Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. | 95 Source: https://spark.apache.org/docs/latest/cluster-overview.html
  84. Big Data Intelligence & Analytics Apache Spark Apache Spark Streaming

    • Spark Streaming is an extension of Spark that allows processing data stream using micro-batches of data. | 96 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
  85. Big Data Intelligence & Analytics Apache Spark Discretized Streams (DStreams)

    • Discretized Stream or DStream represents a continuous stream of data: either the input data stream received from a source, or the processed data stream generated by transforming the input stream. • Internally, a DStream is represented by a continuous series of RDDs | 97 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
  86. Big Data Intelligence & Analytics Apache Spark Discretized Streams (DStreams)

    • Any operation applied on a DStream translates to operations on the underlying RDDs. | 98 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
  87. Big Data Intelligence & Analytics Apache Spark Discretized Streams (DStreams)

    • Spark Streaming provides windowed computations, which allow transformations over a sliding window of data. | 99 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
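A hedged PySpark sketch of a windowed word count (it assumes the socket source on localhost:9999 used elsewhere in the guide): reduceByKeyAndWindow below recomputes counts over a 30-second window sliding every 10 seconds.

    # Windowed count with the Spark Streaming (DStream) Python API.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "WindowedWordCount")
    ssc = StreamingContext(sc, 1)      # 1-second micro-batches
    ssc.checkpoint("checkpoint")       # required for incremental windowing

    lines = ssc.socketTextStream("localhost", 9999)
    pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda w: (w, 1))

    windowed = pairs.reduceByKeyAndWindow(
        lambda a, b: a + b,   # add counts entering the window
        lambda a, b: a - b,   # subtract counts leaving the window
        windowDuration=30,
        slideDuration=10,
    )
    windowed.pprint()

    ssc.start()
    ssc.awaitTermination()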
  88. Big Data Intelligence & Analytics Apache Spark Example

    // Create a local StreamingContext with two working threads and batch interval of 1 second.
    // The master requires 2 cores to prevent a starvation scenario.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Create a DStream that will connect to hostname:port, like localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split each line into words
    val words = lines.flatMap(_.split(" "))

    // Count each word in each batch
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)

    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()

    ssc.start()             // Start the computation
    ssc.awaitTermination()  // Wait for the computation to terminate

    | 100 Source: https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
  89. Big Data Intelligence & Analytics Apache Spark Spark SQL and

    DataFrames • Spark SQL is a Spark module for structured data processing. • It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine. • A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database. | 101 Source: https://spark.apache.org/sql/
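A hedged PySpark sketch of the DataFrame abstraction (the rows and column names are invented for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

    # a DataFrame is a distributed collection organized into named columns
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )

    # relational-style operations on columns...
    df.filter(df.age > 30).select("name").show()

    # ...or plain SQL against the same data
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()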
  90. Big Data Intelligence & Analytics Apache Spark Spark Machine Learning

    Libraries • MLLib contains the original API built on top of RDDs. • spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines. | 102 Source: https://spark.apache.org/docs/latest/ml-pipeline.html
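A hedged spark.ml sketch of a small pipeline built on DataFrames (the two training rows are invented):

    # spark.ml pipeline: tokenize text, hash words to features,
    # then fit a logistic regression on the resulting columns.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import HashingTF, Tokenizer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PipelineExample").getOrCreate()
    training = spark.createDataFrame(
        [("spark is great", 1.0), ("boring batch jobs", 0.0)],
        ["text", "label"],
    )

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(maxIter=10)

    # fit() runs the stages in order and returns a reusable PipelineModel
    model = Pipeline(stages=[tokenizer, hashing_tf, lr]).fit(training)
    model.transform(training).select("text", "prediction").show()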
  92. Big Data Intelligence & Analytics Apache Spark Spark GraphX •

    GraphX optimizes the representation of vertex and edge types when they are primitive data types • The property graph is a directed multigraph with user defined objects attached to each vertex and edge. | 104 Source: https://spark.apache.org/docs/latest/graphx-programming-guide.html
  93. Big Data Intelligence & Analytics Apache Spark Spark GraphX

    // Assume the SparkContext has already been constructed
    val sc: SparkContext

    // Create an RDD for the vertices
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                           (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))

    // Create an RDD for edges
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                           Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))

    // Define a default user in case there are relationships with missing users
    val defaultUser = ("John Doe", "Missing")

    // Build the initial Graph
    val graph = Graph(users, relationships, defaultUser)

    | 105 Source: https://spark.apache.org/docs/latest/graphx-programming-guide.html
  94. Big Data Intelligence & Analytics Apache SAMOA Apache SA(MOA) Vision

    • Data Stream mining platform • Library of state-of-the-art algorithms for practitioners • Development and collaboration framework for researchers • Algorithms & Systems | 107 Source: https://samoa.incubator.apache.org/
  95. Big Data Intelligence & Analytics Apache SAMOA Importance: • Example:

    spam detection in comments on Yahoo News • Trends change in time • Need to retrain model with new data | 108 Source: https://samoa.incubator.apache.org/
  96. Big Data Intelligence & Analytics Apache SAMOA Internet of Things

    | 109 Source: https://samoa.incubator.apache.org/
  97. Big Data Intelligence & Analytics Apache SAMOA Big Data Stream

    • Volume + Velocity (+ Variety) • Too large for single commodity server main memory • Too fast for single commodity server CPU • A solution should be: • Distributed • Scalable | 110 Image: https://twitter.com/GrandCanyonNPS
  98. Big Data Intelligence & Analytics Apache SAMOA Big Data Processing

    Engines • Low latency • High latency (not real time) | 111 Source: https://samoa.incubator.apache.org/
  99. Big Data Intelligence & Analytics Apache SAMOA Machine Learning over

    Big Data Streams • Classification • Regression • Clustering • Frequent Pattern Mining | 112 Source: https://samoa.incubator.apache.org/
  100. Big Data Intelligence & Analytics What is Apache SAMOA? Streaming

    Model: • Sequence is potentially infinite • High amount of data, high speed of arrival • Change over time (concept drift) • Approximation algorithms (small error with high probability) • Single pass, one data item at a time • Sub-linear space and time per data item Apache SAMOA | 113 Source: https://samoa.incubator.apache.org/
  101. Big Data Intelligence & Analytics Taxonomy: Apache SAMOA | 114

    Source: https://samoa.incubator.apache.org/
  102. Big Data Intelligence & Analytics Architecture: Apache SAMOA | 115

    Source: https://samoa.incubator.apache.org/ Flink Machine Learning Stream Engines
  103. Big Data Intelligence & Analytics Status: • Implementation of parallel

    algorithms: • Classification (Vertical Hoeffding Tree) • Clustering (CluStream) • Regression (Adaptive Model Rules) • Execution engines Apache SAMOA | 116 Source: https://samoa.incubator.apache.org/
  104. Big Data Intelligence & Analytics Is SAMOA useful for you:

    • Only if you need to deal with: • Large fast data • Evolving process (model updates) • Regression (Adaptive Model Rules) • What is happening now? • Use feedback in real-time • Adapt to changes faster Apache SAMOA | 117 Source: https://samoa.incubator.apache.org/
  105. Big Data Intelligence & Analytics ML Developer API: Apache SAMOA

    | 118 Source: https://samoa.incubator.apache.org/
  106. Big Data Intelligence & Analytics ML Developer API:

    TopologyBuilder builder;

    Processor sourceOne = new SourceProcessor();
    builder.addProcessor(sourceOne);
    Stream streamOne = builder.createStream(sourceOne);

    Processor sourceTwo = new SourceProcessor();
    builder.addProcessor(sourceTwo);
    Stream streamTwo = builder.createStream(sourceTwo);

    Processor join = new JoinProcessor();
    builder.addProcessor(join)
        .connectInputShuffle(streamOne)
        .connectInputKey(streamTwo);

    Apache SAMOA | 119 Source: https://samoa.incubator.apache.org/ (diagram: streamOne feeds the join processor via a shuffle connection, streamTwo via a key connection)
  107. Big Data Intelligence & Analytics Decision Tree: • Nodes are

    tests on attributes • Branches are possible outcomes • Leaves are class assignments Apache SAMOA | 120 Source: https://samoa.incubator.apache.org/ (example attributes: Road Tested, Mileage, Age, …)
  108. Big Data Intelligence & Analytics Hoeffding Tree: • Sample of

    stream enough for near-optimal decision • Estimate merit of alternatives from prefix of stream • Choose sample size based on statistical principles • When to expand a leaf? Let $x_a$ be the most informative attribute and $x_b$ the second most informative one. Hoeffding bound: split if $\bar{G}(x_a) - \bar{G}(x_b) > \epsilon$, where $\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$ Apache SAMOA | 121 P. Domingos and G. Hulten, “Mining High-Speed Data Streams,” KDD ’00: https://homes.cs.washington.edu/~pedrod/papers/kdd00.pdf
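A small Python sketch of that split test, under the usual assumptions that $R$ is the range of the gain metric, $\delta$ the allowed error probability, and $n$ the number of examples seen at the leaf:

    import math

    def hoeffding_bound(value_range, delta, n):
        # epsilon = sqrt(R^2 ln(1/delta) / (2n)): with probability 1 - delta,
        # the observed mean over n examples is within epsilon of the true mean
        return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

    def should_split(gain_best, gain_second, value_range, delta, n):
        # expand the leaf once the best attribute's observed advantage
        # exceeds epsilon, i.e. it is (probably) the truly best attribute
        return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)

    # information gain with 2 classes has range R = log2(2) = 1
    print(should_split(0.30, 0.18, value_range=1.0, delta=1e-7, n=2000))  # True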
  109. Big Data Intelligence & Analytics Parallel Decision Trees: • Which

    kind of parallelism? Apache SAMOA | 122 Source: https://samoa.incubator.apache.org/
  110. Big Data Intelligence & Analytics Parallel Decision Trees: • Which

    kind of parallelism? • Task Apache SAMOA | 123 Source: https://samoa.incubator.apache.org/
  111. Big Data Intelligence & Analytics Parallel Decision Trees: • Which

    kind of parallelism? • Task • Data • Horizontal • Vertical Apache SAMOA | 124–126 Source: https://samoa.incubator.apache.org/ (diagram: the data matrix, with instances as rows and attributes as columns, sliced horizontally or vertically)
  114. Big Data Intelligence & Analytics Horizontal Parallelism: Apache SAMOA |

    127 Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010: http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf (diagram: a single attribute is tracked in multiple nodes; aggregation is needed to compute splits)
  115. Big Data Intelligence & Analytics Hoeffding Tree Profiling: Apache SAMOA

    | 128 Y. Ben-Haim and E. Tom-Tov, “A Streaming Parallel Decision Tree Algorithm,” JMLR, vol. 11, pp. 849–872, 2010: http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf
  116. Big Data Intelligence & Analytics Vertical Parallelism: Apache SAMOA |

    129 Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf (diagram: a single attribute is tracked in a single node)
  117. Big Data Intelligence & Analytics Advantages Vertical Parallelism: • High

    number of attributes => high level of parallelism • (e.g., documents) • Vs task parallelism • Parallelism observed immediately • Vs horizontal parallelism • Reduced memory usage (no model replication) • Parallelized split computation Apache SAMOA | 130 Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf
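An illustrative sketch (not SAMOA code; all names invented) of what vertical parallelism buys: each worker owns a disjoint slice of the attributes and accumulates split statistics only for that slice, so the model is never replicated.

    # Vertical parallelism sketch: instances are split column-wise and
    # each worker tracks statistics only for the attributes it owns.
    from collections import defaultdict

    NUM_WORKERS = 2
    ATTRIBUTES = ["road_tested", "mileage", "age", "color"]

    def worker_for(attribute_index):
        return attribute_index % NUM_WORKERS

    # stats[worker][(attribute, value, label)] -> count, later used to
    # compute each attribute's split merit locally on its worker
    stats = [defaultdict(int) for _ in range(NUM_WORKERS)]

    instances = [
        ({"road_tested": "yes", "mileage": "low", "age": "new", "color": "red"}, "buy"),
        ({"road_tested": "no", "mileage": "high", "age": "old", "color": "blue"}, "skip"),
    ]
    for features, label in instances:
        for i, attribute in enumerate(ATTRIBUTES):
            # only the owning worker ever sees this attribute
            stats[worker_for(i)][(attribute, features[attribute], label)] += 1

    for worker_id, local in enumerate(stats):
        print(f"worker {worker_id} tracks {sorted({k[0] for k in local})}")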
  118. Big Data Intelligence & Analytics Vertical Hoeffding Tree: Apache SAMOA

    | 131 Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf
  119. Big Data Intelligence & Analytics Accuracy: Apache SAMOA | 132

    Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf
  120. Big Data Intelligence & Analytics Performance: Apache SAMOA | 133

    Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet, Arinto Murdopo, “VHT: Vertical Hoeffding Tree”, 2016: https://arxiv.org/pdf/1607.08325.pdf
  121. Big Data Intelligence & Analytics Summary: • Streaming is an

    important V of Big Data • Mining big data streams is an open field • MOA: Massive Online Analysis • Available and open-source http://moa.cms.waikato.ac.nz/ • SAMOA: A Platform for Mining Big Data Streams • Available and open-source (incubating @ASF) http://samoa.incubator.apache.org Apache SAMOA | 134
  122. Big Data Intelligence & Analytics Open Challenges: • Distributed stream

    mining algorithms • Active & semi-supervised learning + crowdsourcing • Millions of classes (e.g., Wikipedia pages) • Multi-target learning • System issues (load balancing, communication) • Programming paradigms and abstractions Apache SAMOA | 135
  123. Big Data Intelligence & Analytics The journey Data Data Streams

    Big Data Big Data Stream | 137 Single Node Multi Node Real-time Analytics Batch Analytics
  124. Big Data Intelligence & Analytics • Streaming is the future

    and is happening now • Moving from Data to Data Streams requires • new algorithms, methods and techniques • approximate models, randomized methods, sketches, sampling, etc. • Mining big data streams is an open field • open fields usually mean huge opportunities • Moving from Big Data to Big Data Streams is not just changing technology • Parallelization, distribution, etc. Conclusions | 138
  125. Big Data Intelligence & Analytics Data Streams Books | 139

    Machine Learning for Data Streams with Practical Examples in MOA By Albert Bifet, Ricard Gavaldà, Geoff Holmes and Bernhard Pfahringer Online: https://moa.cms.waikato.ac.nz/book/ Mining of Massive Datasets, 2nd Edition By Jure Leskovec, Anand Rajaraman, Jeff Ullman Online: http://www.mmds.org/#ver21v Knowledge Discovery from Data Streams By João Gama Online: http://www.liaad.up.pt/area/jgama/DataStreamsCRC.pdf