Slide 1

Slide 1 text

Big Data Applications with Spring XD
Thomas Darimont, Software Engineer, Pivotal Inc.
@thomasdarimont
Unless otherwise indicated, these slides are © 2013-2014 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

Slide 2

Slide 2 text

THE FASTEST PATH TO NEW BUSINESS VALUE

Slide 3

Slide 3 text

Agenda
• Overview
• Concepts
• Architecture
• Q&A

Slide 4

Slide 4 text

Big Data?

Slide 5

Slide 5 text

Big Data
Dimensions: Volume, Velocity, Variety, Veracity
• 500 million tweets each day
• 2.3 trillion GB of data are created each day
• 100 million wearables
• 90% of enterprise data is unstructured
• 100 sensors in each car
• 22 billion sensors by 2020
• 86% suspect data inaccuracy
• 30% revenue loss due to bad data quality
Data points: McKinsey, Twitter, Gartner, IBM

Slide 7

Slide 7 text

Common Problems
• Batch and streaming handled by multiple platforms
• Fragmented Big Data ecosystem
• Simple things are not simple

Slide 8

Slide 8 text

Spring XD to the rescue
Common problems:
• Batch and streaming handled by multiple platforms
• Fragmented Big Data ecosystem
• Simple things are not simple
What Spring XD offers:
• Unified stream and batch platform
• Scalable, distributed and fault-tolerant runtime
• Multi-platform batch workflow orchestration
• Easy to extend and to integrate with other technologies
• Portable: on-prem/IaaS, YARN, Docker, Mesos, PCF
• Easy to use: many out-of-the-box components
• Built on proven Spring technologies
• Unified platform for “NoSQL” and predictive analytics

Slide 9

Slide 9 text

Spring XD - 10,000 Foot View
[Diagram: the Spring XD Runtime with Streams and Jobs (ingest, workflow, export, taps), connected bidirectionally to Compute, HDFS, RDBMS, NoSQL, Redis, and R / SAS for predictive modelling]

Slide 10

Slide 10 text

XD = eXtreme Data
“One stop shop for developing and deploying Big Data Apps”

Slide 11

Slide 11 text

Spring IO Platform - 1.0.3

Slide 13

Slide 13 text

Agility: Easy to Set Up and Run
Store incoming HTTP data in HDFS

Slide 14

Slide 14 text

Agility: Easy to Set Up and Run
Writing HTTP data to HDFS
…that simple!

Slide 15

Slide 15 text

Demo Setup and Run

Slide 16

Slide 16 text

Streams

Slide 17

Slide 17 text

Streams
http | transform --expression=payload.toUpperCase() | hdfs
• http: Source
• transform: Processor (0 .. N processors)
• --expression=payload.toUpperCase(): module option
• hdfs: Sink

Slide 18

Slide 18 text

Streams
• Programming model for processing data streams
• Defines how data is collected, processed, and stored or forwarded
• DSL analogous to Unix pipes and filters
• Source | Processor 0…* | Sink
• Basic streams can be composed into more sophisticated streams
[Diagram: Source -> Processor (Input, Output) -> Sink]
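The Source | Processor | Sink model above can be illustrated with a small Python sketch. It is not Spring XD code: the function names (`http_source`, `transform`, `collecting_sink`, `run_stream`) are hypothetical, and the sink collects into a list instead of writing to HDFS. It only mirrors the shape of `http | transform --expression=payload.toUpperCase() | hdfs`.

```python
from typing import Callable, Iterable, Iterator, List


def http_source(payloads: Iterable[str]) -> Iterator[str]:
    """Hypothetical source: emits incoming payloads one at a time."""
    for payload in payloads:
        yield payload


def transform(messages: Iterable[str],
              expression: Callable[[str], str]) -> Iterator[str]:
    """Processor: applies an expression to every payload passing through."""
    for message in messages:
        yield expression(message)


def collecting_sink(messages: Iterable[str], store: List[str]) -> None:
    """Hypothetical sink: appends to a list (stand-in for an HDFS writer)."""
    for message in messages:
        store.append(message)


def run_stream(payloads: Iterable[str], store: List[str]) -> None:
    """Wires source | processor | sink together, like the pipe symbol."""
    collecting_sink(transform(http_source(payloads), str.upper), store)


stored: List[str] = []
run_stream(["hello", "spring xd"], stored)
print(stored)  # ['HELLO', 'SPRING XD']
```

Because each stage only consumes an iterable and produces another, further processors can be chained between source and sink without changing the existing stages.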

Slide 19

Slide 19 text

Modules
• Types: Source, Processor, Sink, Jobs
• Unit of execution
• Implemented as a Spring ApplicationContext
• Modules in a Stream -> Spring Integration
• Job Modules -> Spring Batch
Spring Integration message flow: Producer -> send(message) -> Message Channel -> receive() -> Consumer
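The producer / channel / consumer flow can be sketched in a few lines of Python. This is an analogy only, not the Spring Integration API: the `MessageChannel` class below is a hypothetical stand-in built on the standard library's `queue.Queue`, and returning `None` on timeout is an assumption made for this sketch.

```python
import queue
from typing import Any, Optional


class MessageChannel:
    """Hypothetical in-memory channel decoupling a producer from a consumer."""

    def __init__(self) -> None:
        self._queue: "queue.Queue[Any]" = queue.Queue()

    def send(self, message: Any) -> None:
        """Called by the producer; does not need to know the consumer."""
        self._queue.put(message)

    def receive(self, timeout: float = 1.0) -> Optional[Any]:
        """Called by the consumer; returns None if nothing arrives in time."""
        try:
            return self._queue.get(timeout=timeout)
        except queue.Empty:
            return None


channel = MessageChannel()
channel.send({"payload": "hello"})          # producer side
received = channel.receive()                 # consumer side
nothing_left = channel.receive(timeout=0.01)
```

The point of the channel is that producer and consumer only share the channel, so either side can be replaced independently.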

Slide 20

Slide 20 text

Stream Modules
Source (#20): HTTP, SFTP, Tail, File, Mail, Syslog, TCP, TCP Client, Reactor IP, JMS, RabbitMQ, Time, MQTT, Kafka, JDBC, GemFire CQ, GemFire Source, Twitter Search, Twitter Stream, Stdout Capture
Processor (#13): Filter, Transform, Splitter, Aggregator, HTTP Client, Shell Command, Script, Groovy, Python, Java, JPMML-Evaluator, JSON-to-Tuple, Object-to-JSON
Sink (#20): Log, File, JDBC, TCP, MQTT, Mongo, Mail, Null Sink, Redis, RabbitMQ, HDFS, HDFS Dataset, Shell Command, GemFire Server, Splunk Server, Dynamic Router, Counter (+1), Gauge (+1)
Roll your own!

Slide 21

Slide 21 text

Demo Streams

Slide 22

Slide 22 text

Taps & Analytics

Slide 23

Slide 23 text

Taps
• Consume data along the stream processing pipeline
• Original stream stays unaffected
• Collect metrics and perform analytics
• Works for Jobs too!
stream create demo1tap --definition "tap:stream:demo1 > counter"
[Diagram: Stream (Source -> Processor -> Sink …) on the Message Bus, with a Tap feeding its own Processor -> Sink …]
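A tap can be pictured as a listener that sees every message while the original stream continues unchanged. The Python sketch below is an illustration under that assumption, not how Spring XD implements taps on the message bus; `tapped` and `MessageCounter` are hypothetical names.

```python
from typing import Callable, Iterable, Iterator, List


class MessageCounter:
    """Hypothetical tap target: counts the messages it is shown."""

    def __init__(self) -> None:
        self.value = 0

    def __call__(self, message: str) -> None:
        self.value += 1


def tapped(messages: Iterable[str],
           listeners: List[Callable[[str], None]]) -> Iterator[str]:
    """Shows each message to every tap listener, then passes it on untouched."""
    for message in messages:
        for listener in listeners:
            listener(message)
        yield message


counter = MessageCounter()
main_sink: List[str] = []
for message in tapped(["a", "b", "c"], [counter]):
    main_sink.append(message)  # the original stream's sink

print(counter.value, main_sink)  # 3 ['a', 'b', 'c']
```

The original sink receives exactly the same messages whether or not a listener is attached, which is the property the slide calls "original stream stays unaffected".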

Slide 24

Slide 24 text

Analytics
• Backed by Redis and in-memory
• Access via REST API
• Counters: Simple and Field Value Counter, Aggregate Counter
• Gauges: Gauge, Rich Gauge
• Predictive model evaluation
  • Is this transaction fraudulent?
  • What group does this user belong to?
• JPMML / PMML model
• Interoperable with R, Rattle, KNIME, RapidMiner, MADlib, Python, Spark
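To make the counter and gauge ideas concrete, here is a minimal in-memory sketch in Python. The class names echo the slide, but the fields and methods are assumptions for illustration; in particular, treating a rich gauge as "last value plus count, min, max and average" is this sketch's reading, and nothing here touches Redis or the REST API.

```python
from collections import Counter
from dataclasses import dataclass


class FieldValueCounter:
    """Counts how often each value of one message field occurs."""

    def __init__(self, field_name: str) -> None:
        self.field_name = field_name
        self.counts: Counter = Counter()

    def process(self, message: dict) -> None:
        value = message.get(self.field_name)
        if value is not None:
            self.counts[value] += 1


@dataclass
class RichGauge:
    """Keeps the last value together with simple running statistics."""

    count: int = 0
    value: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")
    total: float = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.value = value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
        self.total += value

    @property
    def average(self) -> float:
        return self.total / self.count if self.count else 0.0


languages = FieldValueCounter("lang")
for tweet in [{"lang": "en"}, {"lang": "de"}, {"lang": "en"}]:
    languages.process(tweet)

latency = RichGauge()
for sample in [10.0, 20.0, 30.0]:
    latency.update(sample)
```

After this runs, `languages.counts` holds one count per language and `latency` holds the last sample alongside its minimum, maximum and average.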

Slide 25

Slide 25 text

Demo Taps & Analytics

Slide 26

Slide 26 text

Predictive Analytics
• Classification
• Clustering
• Regression
• Association Rules
[Diagram labels: Question, Data, Algorithm, Model, New Data, Prediction]

Slide 27

Slide 27 text

Predictive Models
• Model: a parameterised algorithm
• Model building
  • Derive a parameterised algorithm from the data
  • Slow process
  • Usually large data volume -> done offline as a batch process
• Model scoring
  • Use the model to predict new information
  • Fast process
  • Can be done as part of stream processing
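The split between slow offline model building and fast online scoring can be shown with a deliberately tiny example. The sketch below fits a one-variable least-squares line as the "model building" step and evaluates it per message as the "scoring" step. The function names and the data are made up for illustration; a real deployment would use a proper modelling tool and a format such as PMML.

```python
from typing import Dict, Sequence


def build_model(xs: Sequence[float], ys: Sequence[float]) -> Dict[str, float]:
    """Offline step: derive the parameters (slope, intercept) from data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def score(model: Dict[str, float], x: float) -> float:
    """Online step: cheap evaluation of the parameterised algorithm."""
    return model["intercept"] + model["slope"] * x


# Batch phase: the training data is illustrative only (y = 2x + 1).
model = build_model([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Streaming phase: score each new observation as it arrives.
predictions = [score(model, x) for x in [5.0, 10.0]]
```

Only the small `model` dictionary has to travel from the batch side to the stream side, which is why scoring fits naturally into a stream processor.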

Slide 28

Slide 28 text

PMML
• Predictive Model Markup Language
• Open standard maintained by the Data Mining Group (DMG)
• XML-based DSL for predictive models
• Can be interpreted
• 15 model types (Naive Bayes, General Regression, Neural Networks, etc.)
• First version: 1999; current version: 4.2.1
• “Lingua Franca for Predictive Models”
• “Bridge the Gap between Data Scientists and Engineers”

Slide 29

Slide 29 text

Anatomy of a PMML Model
• Pre-processing
• Predictive model
  • Algorithm description(s)
  • Parameterisation (“trained model”)
• Post-processing
  • Transform model output
  • Thresholds / business rules
Source: PMML in Action, 2nd Edition, 2012, p. 7.

Slide 30

Slide 30 text

Predictive Analytics with Spring XD
• XD module analytic-pmml
• Introduced in Spring XD 1.0.0 M6 (April 2014)
• Real-time evaluation and scoring
• Based on JPMML-Evaluator
• Wide range of model types
• spring-xd-modules/analytics-ml-pmml on GitHub

Slide 31

Slide 31 text

Demo Offline Model Learning & Online Model Scoring

Slide 32

Slide 32 text

Jobs

Slide 33

Slide 33 text

Jobs
• Create, launch and monitor jobs
• Define how batch processing steps are orchestrated
• Builds on top of Spring Batch and Spring Hadoop
• Out of the box: CSV to JDBC, FTP to HDFS, JDBC to HDFS, HDFS to JDBC, HDFS to MongoDB
• … or roll your own with Spring Batch!

Slide 34

Slide 34 text

Demo Jobs

Slide 35

Slide 35 text

Architecture

Slide 36

Slide 36 text

Spring XD - Node Types
• Admin Node
  • Embedded servlet container
  • REST endpoints via Spring MVC / HATEOAS
  • Container management and module deployment
  • Maps processing tasks into processing modules
  • Monitors and updates runtime state
• Container Node
  • Hosts module instances
  • Performs the actual processing

Slide 37

Slide 37 text

Spring XD Runtimes
[Diagram: a single-node runtime (XD Admin, an XD Container hosting modules, and ZooKeeper) next to a multi-node runtime (XD Admin, several XD Containers hosting modules, and ZooKeeper, spread across separate JVMs)]

Slide 38

Slide 38 text

Single-Node Runtime
Stream: Source | Sink
[Diagram: XD Admin, XD Container and ZooKeeper; a Control Bus and a Data Bus; the Source's output is connected to the Sink's input]
Transport: Redis, Rabbit, Local, Other

Slide 39

Slide 39 text

About ZooKeeper
• Toolset for building distributed systems
• Originates from Hadoop but is not tied to it!
• Coordination of distributed processes
• Shared configuration
• Replicated hierarchical data store, much like a file system
• Curator: client library for interacting with ZooKeeper

Slide 40

Slide 40 text

ZooKeeper in Spring XD
• Centralised storage for stream and job definitions
• Tracking of containers and modules
• Notification of cluster structure changes
• Notification of module changes

Slide 41

Slide 41 text

Multi-Node Distributed Runtime
• Deploy and un-deploy modules on containers
• Dedicated message bus (“control bus”)
  • Deployment requests
• Dynamically discover new containers
• Reassign modules when containers fail
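The last point, reassigning modules when a container fails, can be sketched as a small rebalancing function. This is an illustration only: the slides do not describe Spring XD's actual reassignment policy, so the "least loaded surviving container" rule, the `reassign` function and the container names are all assumptions.

```python
from typing import Dict, List


def reassign(assignments: Dict[str, List[str]],
             failed: str) -> Dict[str, List[str]]:
    """Moves the failed container's modules to the least loaded survivors."""
    survivors = sorted(name for name in assignments if name != failed)
    if not survivors:
        raise RuntimeError("no containers left to host modules")
    result = {name: list(assignments[name]) for name in survivors}
    for module in assignments.get(failed, []):
        # Pick the survivor currently hosting the fewest modules.
        target = min(survivors, key=lambda name: len(result[name]))
        result[target].append(module)
    return result


before = {"c1": ["http"], "c2": ["worker", "hdfs"], "c3": ["worker"]}
after = reassign(before, failed="c2")
print(after)  # {'c1': ['http', 'worker'], 'c3': ['worker', 'hdfs']}
```

The input mapping is left unmodified, so a caller could compare `before` and `after` to decide which modules actually need to be redeployed.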

Slide 42

Slide 42 text

DIRT - Distributed Runtime
[Diagram: the XD Shell sends HTTP POST /streams/aStream with the definition “M1 | M2” to an XD Admin; XD Admins and XD Containers share container state through ZooKeeper; modules M1 and M2 each run in a Spring app context on the XD Containers, connected by a Control Bus and a Data Bus]
Transport: Redis, Rabbit, Local, Other

Slide 43

Slide 43 text

Message Bus
• Source | Processor | Sink
• The pipe symbol denotes the message bus
• Binds module inputs and outputs to a transport
• Performs serialisation (Kryo, pluggable)
• Transports: RabbitMQ, Redis, Kafka, Local (in memory), JMS

Slide 44

Slide 44 text

Deployment Manifest

Slide 45

Slide 45 text

Deployment Manifest
• Stream/Job definition -> logical view
• Deployment manifest -> physical view
• Important properties relate to
  • Module count
  • Module placement
  • Data partitioning
  • Concurrency
• Defined on stream deployment

Slide 46

Slide 46 text

Deployment Manifest - Module Count
Stream definition: http | worker | hdfs
stream deploy --name s1 --properties "module.http.count=2,module.worker.count=4,module.hdfs.count=3"
Deployed instances: 2 x http, 4 x worker, 3 x hdfs
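How such a property string maps a logical stream onto a number of module instances can be sketched as follows. This is not Spring XD's parser: `parse_deployment_properties` and `instance_counts` are hypothetical helpers, the naive comma split would break on values that themselves contain commas, and the default of one instance per module is an assumption of this sketch.

```python
from typing import Dict


def parse_deployment_properties(properties: str) -> Dict[str, Dict[str, str]]:
    """Parses 'module.<name>.<property>=<value>' entries separated by commas."""
    manifest: Dict[str, Dict[str, str]] = {}
    for entry in properties.split(","):
        entry = entry.strip()
        if not entry:
            continue
        key, _, value = entry.partition("=")
        parts = key.strip().split(".")
        if len(parts) < 3 or parts[0] != "module":
            raise ValueError(f"unexpected property key: {key!r}")
        module_name, prop = parts[1], ".".join(parts[2:])
        manifest.setdefault(module_name, {})[prop] = value.strip()
    return manifest


def instance_counts(definition: str,
                    manifest: Dict[str, Dict[str, str]]) -> Dict[str, int]:
    """Logical view (definition) + manifest -> instances per module."""
    modules = [part.strip().split()[0] for part in definition.split("|")]
    return {m: int(manifest.get(m, {}).get("count", 1)) for m in modules}


manifest = parse_deployment_properties(
    "module.http.count=2, module.worker.count=4, module.hdfs.count=3")
counts = instance_counts("http | worker | hdfs", manifest)
print(counts)  # {'http': 2, 'worker': 4, 'hdfs': 3}
```

The stream definition stays the same regardless of the counts, which is the logical versus physical split the deployment manifest is meant to provide.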

Slide 47

Slide 47 text

Deployment Manifest - Module Placement
Stream definition: http | worker | hdfs
stream deploy --name s1 --properties "module.http.count=2,module.worker.count=4,module.hdfs.count=3,module.http.criteria=group.contains('WEB')"
Start a container in the WEB group: xd/bin/xd-container --groups="WEB"
The http module instances are placed on containers that belong to the WEB group.

Slide 48

Slide 48 text

Deployment Manifest - Data Partitioning
Stream definition: http | worker | hdfs
stream deploy --name s1 --properties … module.http.producer.partitionKeyExpression=payload.customerId
Each worker module instance will always receive and process the same set of customer IDs!
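The guarantee that the same customer ID always reaches the same worker follows from mapping the partition key deterministically onto a worker index. The sketch below assumes a "hash modulo partition count" scheme and uses CRC32 as a stand-in hash; the slides do not say how Spring XD selects the partition, so treat `select_partition` and the sample messages as hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Set


def select_partition(partition_key: str, partition_count: int) -> int:
    """Maps a key to a partition index in [0, partition_count)."""
    # CRC32 is deterministic across runs, unlike Python's built-in str hash.
    return zlib.crc32(partition_key.encode("utf-8")) % partition_count


WORKER_COUNT = 4  # matches module.worker.count=4 from the slide
messages: List[dict] = [
    {"customerId": "alice", "amount": 10},
    {"customerId": "bob", "amount": 5},
    {"customerId": "alice", "amount": 7},
    {"customerId": "carol", "amount": 3},
    {"customerId": "bob", "amount": 1},
]

# Route every message by its customerId, as the partition key expression does.
seen_by_worker: Dict[int, Set[str]] = defaultdict(set)
for message in messages:
    worker = select_partition(message["customerId"], WORKER_COUNT)
    seen_by_worker[worker].add(message["customerId"])
```

Each customer ID ends up in exactly one worker's set, so per-customer state can be kept locally on that worker.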

Slide 49

Slide 49 text

Demo Partitioning

Slide 50

Slide 50 text

How does it fit in?
• Alternative to Flume, Oozie, Sqoop, Storm
• Complementary to many technologies
  • Big SQL: Impala, HAWQ
  • Spark batch processing and streaming
  • Elasticsearch, Cassandra, …
• Platform for data integration / data movement

Slide 51

Slide 51 text

Lambda Architecture
[Diagram: Spring XD (stream processing, batch processing, analytics, ingest, export, workflow orchestration, predictive analytics) alongside a Speed Layer with real-time views, a Batch Layer with batch views, a Data Lake, HAWQ, GemFire, and a Serving Layer of Spring Boot applications]

Slide 52

Slide 52 text

What’s next?

Slide 53

Slide 53 text

Breaking News
Spring XD 1.1 M1 (and 1.0.2) were released just yesterday!
• Java Config module definitions
• Batch jobs with Apache Spark
• Python support
• Kafka support
• LDAP authentication
• Kerberized Hadoop cluster

Slide 54

Slide 54 text

Roadmap - 1.1 RC1 and beyond
• Spring Boot-like module deployment
• Deployment targets: Docker, CF Service, Mesos
• Security ACLs
• Kafka-based message bus
• Reactive Streams integration
• Spark Streaming

Slide 55

Slide 55 text

Learn more
• Project: http://projects.spring.io/spring-xd
• GitHub: https://github.com/spring-projects/spring-xd
• Wiki: https://github.com/spring-projects/spring-xd/wiki
• Samples: https://github.com/spring-projects/spring-xd-samples
• JIRA: https://jira.spring.io/browse/XD
• Stack Overflow: http://stackoverflow.com/questions/tagged/spring-xd

Slide 56

Slide 56 text

Takeaway - Spring XD
• Unified runtime for both real-time and batch use cases
• Scalable, distributed and fault-tolerant runtime
• Increased productivity through out-of-the-box components
• Closed-loop analytics through online (stream) and offline (batch) data
• Swiss-army knife of data movement and data pipelines
• Repeatable ‘turnkey’ solution for next-generation data-centric use cases

Slide 57

Slide 57 text

Admin UI
http://localhost:9393/admin-ui/#/streams/definitions