Developing Big Data Applications with Spring XD

Unless otherwise indicated, these slides are © 2013-2014 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

Big Data Applications with Spring XD
Thomas Darimont, Software Engineer, Pivotal Inc.
@thomasdarimont

THE FASTEST PATH TO NEW BUSINESS VALUE

Agenda
• Overview
• Concepts
• Architecture
• Q&A

Big Data?

Big Data: Volume, Velocity, Variety, Veracity
• 500 million tweets each day
• 2.3 trillion GB of data are created each day
• 100 million wearables
• 90% of enterprise data is unstructured
• 100 sensors in each car
• 22 billion sensors by 2020
• 86% suspect data inaccuracy
• 30% revenue loss due to bad data quality
Data points: McKinsey, Twitter, Gartner, IBM

Common Problems
• Batch and Streaming handled by multiple platforms
• Fragmented Big Data Ecosystem
• Simple things are not simple

Spring XD to the rescue
• Batch and Streaming handled by multiple platforms
• Fragmented Big Data Ecosystem
• Simple things are not simple

• Unified Stream and Batch Platform
• Scalable, Distributed and Fault Tolerant Runtime
• Multi-platform Batch Workflow Orchestration
• Easy to extend and integrate other technologies
• Portable: On-prem/IaaS, YARN, Docker, Mesos, PCF
• Easy to use: many out-of-the-box components
• Built on proven Spring technologies
• “NoSQL” and Predictive Analytics unified platform

Spring XD - 10,000 Foot View
[Diagram labels: Spring XD Runtime; Streams, Jobs, taps; ingest, workflow, export; bidirectional links to Compute, HDFS, RDBMS, NoSQL, R, SAS, Predictive Modelling, Redis; shell prompt (>_)]

eXtreme Data (XD)
“One stop shop for developing and deploying Big Data Apps”

Spring IO Platform - 1.0.3

Agility: Easy to Setup and Run
Store incoming HTTP data into HDFS

Agility: Easy to Setup and Run
Writing HTTP Data to HDFS … that simple!

Demo Setup and Run

Streams

Streams
http | transform --expression=payload.toUpperCase() | hdfs
• http: Source
• transform: 0..N processors
• hdfs: Sink
• --expression=payload.toUpperCase(): module option

Streams
• Programming model for processing data streams
• How data is collected, processed, and stored or forwarded
• DSL analogous to Unix Pipes and Filters
• Source | Processor 0…* | Sink
• Basic streams can be composed into more sophisticated streams
[Diagram labels: Source, Processor, Sink, Input, Output]
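The Source | Processor 0…* | Sink model above can be sketched with plain Python generators. This is only an illustration of the pipes-and-filters idea, not Spring XD code: the function names (source, transform, sink, run_stream) are hypothetical, and a list stands in for HDFS.

```python
from typing import Callable, Iterable, Iterator, List


def source(items: Iterable[str]) -> Iterator[str]:
    """Stand-in for a source module (e.g. http): emits payloads one by one."""
    yield from items


def transform(stream: Iterator[str], fn: Callable[[str], str]) -> Iterator[str]:
    """Stand-in for a processor module: applies fn to every payload."""
    for payload in stream:
        yield fn(payload)


def sink(stream: Iterator[str], store: List[str]) -> None:
    """Stand-in for a sink module (e.g. hdfs): here it appends to a list."""
    for payload in stream:
        store.append(payload)


def run_stream(items: Iterable[str],
               processors: List[Callable[[str], str]],
               store: List[str]) -> None:
    """Wire source | processor 0..N | sink, like the pipe symbol in the DSL."""
    stream = source(items)
    for fn in processors:
        stream = transform(stream, fn)
    sink(stream, store)


# Roughly mirrors: http | transform --expression=payload.toUpperCase() | hdfs
stored: List[str] = []
run_stream(["hello", "spring xd"], [str.upper], stored)

# A stream with zero processors is still valid: source | sink
passthrough: List[str] = []
run_stream(["unchanged"], [], passthrough)
```

Because each stage only consumes an iterator and yields payloads, stages can be chained in any number, which is the property the DSL borrows from Unix pipes.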

Modules
• Source, Processor, Sink, Jobs
• Unit of execution
• Implemented as a Spring ApplicationContext
• Modules in a Stream -> Spring Integration
• Job Modules -> Spring Batch
Spring Integration Message Flow: Producer -> send(message) -> Message Channel -> receive() -> Consumer

Stream Modules
• Source (#20): HTTP, SFTP, Tail, File, Mail, Syslog, TCP / TCP Client, Reactor IP, JMS, RabbitMQ, Time, MQTT, Kafka, JDBC, Gemfire (CQ, Source), Twitter (Search, Stream), Stdout Capture
• Processor (#13): Filter, Transform, Splitter, Aggregator, HTTP Client, Shell Command, Script, Groovy, Python, Java, JPMML-Evaluator, JSON-to-Tuple, Object-to-JSON
• Sink (#20): Log, File, JDBC, TCP, MQTT, Mongo, Mail, Null Sink, Redis, RabbitMQ, HDFS, HDFS Dataset, Shell Command, GemFire Server, Splunk Server, Dynamic Router, Counter + 1, Gauge + 1
• Roll your own!

Demo Streams

Taps & Analytics

Taps
• Consume data along the Stream processing pipeline
• Original stream stays unaffected
• Collect metrics and perform analytics
• Works for Jobs too!
stream create demo1tap --definition "tap:stream:demo1 > counter"
[Diagram: Stream (Source, Processor, Sink, …) on the Message Bus, with a Tap (Processor, Sink, …) attached]
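How a tap can observe a stream without disturbing it is easiest to see with a small publish/subscribe sketch. The LocalBus class and the channel name "demo1.0" are hypothetical stand-ins for the message bus, chosen for illustration only; they are not Spring XD classes or naming rules.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List


class LocalBus:
    """Tiny in-memory publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[Any], None]) -> None:
        self._subscribers[channel].append(handler)

    def publish(self, channel: str, payload: Any) -> None:
        # Every subscriber gets its own delivery of the payload.
        for handler in list(self._subscribers[channel]):
            handler(payload)


bus = LocalBus()

# The original stream's sink keeps receiving everything.
sink_received: List[str] = []
bus.subscribe("demo1.0", sink_received.append)

# The tap is just one more subscriber on the same channel; here it counts.
tap_count: Dict[str, int] = {"demo1tap": 0}


def count(_payload: Any) -> None:
    tap_count["demo1tap"] += 1


bus.subscribe("demo1.0", count)

for payload in ["a", "b", "c"]:
    bus.publish("demo1.0", payload)
```

The sink still sees all three payloads in order, while the tap independently counts them.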

Analytics
• Backed by Redis & In-memory
• Access via REST API
• Counters: Simple & Field Value Counter, Aggregate Counter
• Gauges: Gauge, Rich Gauge
• Predictive Model Evaluation: Is this transaction fraudulent? What group does this user belong to?
• JPMML, PMML Model
• Interoperable with R, Rattle, KNIME, RapidMiner, MADLib, Python, Spark
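As a rough illustration of what the counter and gauge metrics track, here is an in-memory sketch. The RichGauge class, its fields and the sample events are assumptions made for this example; Spring XD's own implementations and their Redis storage layout are not shown here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class RichGauge:
    """Keeps the last value plus running min, max and mean (illustrative)."""
    value: float = 0.0
    count: int = 0
    minimum: float = float("inf")
    maximum: float = float("-inf")
    mean: float = 0.0

    def record(self, v: float) -> None:
        self.value = v
        self.count += 1
        self.minimum = min(self.minimum, v)
        self.maximum = max(self.maximum, v)
        # Incremental mean, so no history of values has to be stored.
        self.mean += (v - self.mean) / self.count


# Hypothetical payloads flowing through a tapped stream.
events: List[Dict[str, Union[str, float]]] = [
    {"user": "alice", "latency_ms": 120.0},
    {"user": "bob", "latency_ms": 80.0},
    {"user": "alice", "latency_ms": 100.0},
]

simple_counter = 0                      # counts every message
field_value_counter: Counter = Counter()  # counts occurrences per field value
gauge = RichGauge()

for event in events:
    simple_counter += 1
    field_value_counter[str(event["user"])] += 1
    gauge.record(float(event["latency_ms"]))
```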

Demo Taps & Analytics

Predictive Analytics
• Classification
• Clustering
• Regression
• Association Rules
[Diagram labels: Question, Data, Algorithm, Model, New Data, Prediction]

Predictive Models
• Model: parameterised algorithm
• Model Building: derive a parameterised algorithm from the data. Slow process; usually large data volume -> done offline as a batch process
• Model Scoring: use the model to predict new information. Fast process; can be done as part of stream processing
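The split between slow offline model building and fast online scoring can be sketched with a one-variable least-squares fit. The functions build_model and score and the training data are made up for this illustration; a real setup would export the trained model (for example as PMML) rather than keep it in a dict.

```python
from typing import Dict, List


def build_model(xs: List[float], ys: List[float]) -> Dict[str, float]:
    """Offline step: derive the parameters (slope, intercept) from batch data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {"slope": slope, "intercept": intercept}


def score(model: Dict[str, float], x: float) -> float:
    """Online step: a cheap evaluation that fits into per-message processing."""
    return model["slope"] * x + model["intercept"]


# Batch phase: the training data here follows y = 2x + 1.
model = build_model([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Streaming phase: score each new observation as it arrives.
predictions = [score(model, x) for x in [10.0, 0.5]]
```

Building touches the whole data set once; scoring needs only the stored parameters, which is why it can run inside a stream.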

PMML
• Predictive Model Markup Language
• Open standard, maintained by the Data Mining Group (DMG)
• XML based DSL for predictive models
• Can be interpreted
• 15 model types (Naive Bayes, General Regression, Neural Networks, etc.)
• First version 1999, current version 4.2.1
• “Lingua Franca for Predictive Models”
• “Bridge the Gap between Data Scientists and Engineers”

Anatomy of a PMML Model
• Pre Processing
• Predictive Model: algorithm description(s), parameterisation (“trained model”)
• Post Processing: transform model output, thresholds / business rules
Source: PMML in Action, 2nd Edition, 2012, p. 7.

Predictive Analytics with Spring XD
• XD Module analytic-pmml
• Introduced in Spring XD 1.0.0 M6 (April 2014)
• Real-time evaluation and scoring
• Based on JPMML-Evaluator
• Wide range of model types
• spring-xd-modules/analytics-ml-pmml on GitHub

Demo Offline Model Learning & Online Model Scoring

Jobs
• Create, Launch and Monitor Jobs
• Define how batch processing steps are orchestrated
• Builds on top of Spring Batch and Spring Hadoop
• CSV to JDBC, FTP to HDFS, JDBC to HDFS, HDFS to JDBC, HDFS to MongoDB, …
• … or roll your own with Spring Batch!

Demo Jobs

Architecture

Spring XD - Node Types
Admin Node
• Embedded servlet container
• REST endpoints via Spring MVC / HATEOAS
• Container management and module deployment
• Map processing tasks into processing modules
• Monitor and update runtime state
Container Node
• Hosts Module instances
• Performs the actual processing

Spring XD Runtimes
[Diagram: single-node runtime with XD Admin, XD Container (hosting Modules) and ZK; multi-node runtime with XD Admin, several XD Containers (each hosting Modules) and ZK in separate JVMs]

Single-Node Runtime
[Diagram: stream "Source | Sink"; XD Admin, XD Container and ZooKeeper; Control Bus and Data Bus; the Source Output feeds the Sink Input]
Transport: Redis, Rabbit, Local, Other

About ZooKeeper
• Toolset for building distributed systems
• Originates from Hadoop but not tied to it!
• Coordination of distributed processes
• Shared configuration
• Replicated hierarchical data store, much like a file system
• Curator: client library to interact with ZooKeeper

ZooKeeper in Spring XD
• Centralised storage for Stream and Job Definitions
• Tracking of Containers and Modules
• Notification of Cluster structure changes
• Notification of Module changes

Multi-Node Distributed Runtime
• Deploy and un-deploy modules on containers
• Dedicated message bus “control bus”
• Deployment requests
• Dynamically discover new containers
• Reassign modules when containers fail
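The idea of reassigning modules when a container fails can be illustrated with a small function that moves the failed container's modules to the least-loaded survivors. This is a hypothetical sketch, not the algorithm Spring XD uses; the container names (c1, c2, c3) and the least-loaded rule are assumptions for the example.

```python
from typing import Dict, List


def reassign(deployments: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Return a new container -> modules mapping without the failed container."""
    survivors = {c: list(mods) for c, mods in deployments.items() if c != failed}
    if not survivors:
        raise RuntimeError("no container left to host the modules")
    for module in deployments.get(failed, []):
        # Assumption: pick the container hosting the fewest modules,
        # breaking ties by name so the result is deterministic.
        target = min(survivors, key=lambda c: (len(survivors[c]), c))
        survivors[target].append(module)
    return survivors


before = {"c1": ["http"], "c2": ["worker"], "c3": ["hdfs", "worker"]}
after = reassign(before, failed="c3")
```

The input mapping is left untouched, so the state before and after the failure can be compared.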

DIRT - Distributed Runtime
[Diagram: XD Shell sends HTTP POST /streams/aStream “M1 | M2” to XD Admin (three XD Admin instances shown); Container State is kept in ZooKeeper; XD Containers host the modules M1 and M2, each in a Spring App Context; Control Bus and Data Bus connect the nodes]
Transport: Redis, Rabbit, Local, Other

Message Bus
Source | Processor | Sink
• Pipe symbol denotes the Message Bus
• Binds module inputs & outputs to a transport
• Performs serialisation (Kryo, pluggable)
• RabbitMQ, Redis, Kafka, Local (in memory), JMS
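A minimal sketch of what "binding module outputs and inputs to a transport" with serialisation means, assuming an in-memory queue as the transport and JSON as a stand-in for Kryo. The LocalTransport class and the channel name "s1.0" are hypothetical and only serve the illustration.

```python
import json
import queue
from typing import Any, Dict


class LocalTransport:
    """In-memory transport: named queues carrying serialised bytes."""

    def __init__(self) -> None:
        self._queues: Dict[str, "queue.Queue[bytes]"] = {}

    def _queue(self, name: str) -> "queue.Queue[bytes]":
        return self._queues.setdefault(name, queue.Queue())

    def send(self, name: str, payload: Any) -> None:
        # Producer side: serialise before handing over to the transport.
        self._queue(name).put(json.dumps(payload).encode("utf-8"))

    def receive(self, name: str) -> Any:
        # Consumer side: deserialise what arrives from the transport.
        # get_nowait raises queue.Empty if nothing has been sent yet.
        return json.loads(self._queue(name).get_nowait().decode("utf-8"))


transport = LocalTransport()

# Output of the upstream module is bound to the channel ...
transport.send("s1.0", {"customerId": 42})

# ... and the input of the downstream module reads from the same channel.
received = transport.receive("s1.0")
```

Because the modules only see send and receive, the queue could be swapped for another transport without changing module code.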

Deployment Manifest

Deployment Manifest
• Stream/Job Definition -> Logical View
• Deployment Manifest -> Physical View
• Important properties relate to: Module Count, Module Placement, Data Partitioning, Concurrency
• Defined on Stream deployment

Deployment Manifest - Module Count
http | worker | hdfs
stream deploy --name s1 --properties "module.http.count=2,module.worker.count=4,module.hdfs.count=3"
[Diagram: 2 http, 4 worker and 3 hdfs module instances]

Deployment Manifest - Module Placement
http | worker | hdfs
stream deploy --name s1 --properties "module.http.count=2,module.worker.count=4,module.hdfs.count=3,module.http.criteria=group.contains('WEB')"
xd/bin/xd-container --groups="WEB"
[Diagram: 2 http, 4 worker and 3 hdfs module instances; the http instances are marked Web]

Deployment Manifest - Data Partitioning
http | worker | hdfs
stream deploy --name s1 --properties … module.http.producer.partitionKeyExpression=payload.customerId
• worker modules will always receive & process the same set of customer IDs!
[Diagram: 2 http, 4 worker and 3 hdfs module instances; labels: Web, C]
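Why a partition key makes each worker see a stable set of customer IDs can be shown with a hash-modulo sketch. The select_partition function and the use of CRC32 are assumptions for illustration; how Spring XD actually maps a key to a partition is not shown in the slides.

```python
import zlib
from typing import Dict, List, Set


def select_partition(customer_id: str, partition_count: int) -> int:
    """Map a partition key to one of partition_count consumer instances.

    CRC32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings.
    """
    return zlib.crc32(customer_id.encode("utf-8")) % partition_count


WORKER_COUNT = 4  # mirrors module.worker.count=4 from the slide

# Hypothetical payloads; customerId plays the role of payload.customerId.
payloads: List[Dict[str, str]] = [
    {"customerId": "c-1", "item": "book"},
    {"customerId": "c-2", "item": "pen"},
    {"customerId": "c-1", "item": "lamp"},
    {"customerId": "c-3", "item": "desk"},
    {"customerId": "c-2", "item": "ink"},
]

# Which customer IDs end up on which worker instance.
assignments: Dict[int, Set[str]] = {}
for payload in payloads:
    index = select_partition(payload["customerId"], WORKER_COUNT)
    assignments.setdefault(index, set()).add(payload["customerId"])
```

Messages with the same customerId always produce the same index, so per-customer state can live on a single worker instance.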

Demo Partitioning

How does it fit in?
• Alternative to Flume, Oozie, Sqoop, Storm
• Complementary to many technologies
• Big SQL - Impala, HAWQ
• Spark batch processing and streaming
• Elastic Search, Cassandra, …
• Platform for Data Integration / Data Movement

Lambda Architecture
[Diagram labels: Data Lake; HAWQ; Gemfire; Serving Layer (Spring Boot); Speed Layer; Batch Layer; Batch Views; Real-time Views; Stream Processing; Batch Processing; Analytics; Ingest; Export; Workflow Orchestration; Predictive Analytics; Spring XD]

What’s next?

Breaking News
Spring XD 1.1 M1 (and 1.0.2) just released yesterday!
• Java Config Module definitions
• Batch Jobs with Apache Spark
• Python support
• Kafka support
• LDAP Authentication
• Kerberized Hadoop Cluster

Roadmap - 1.1 RC1 and beyond
• Spring Boot like Module Deployment
• Deployment targets: Docker, CF Service, Mesos
• Security ACLs
• Kafka based Message Bus
• Reactive Streams Integration
• Spark Streaming

Learn more
• Project: http://projects.spring.io/spring-xd
• GitHub: https://github.com/spring-projects/spring-xd
• Wiki: https://github.com/spring-projects/spring-xd/wiki
• Samples: https://github.com/spring-projects/spring-xd-samples
• JIRA: https://jira.spring.io/browse/XD
• Stackoverflow: http://stackoverflow.com/questions/tagged/spring-xd

Takeaway - Spring XD
• Unified runtime for both Real-time and Batch use cases
• Scalable, Distributed and Fault Tolerant Runtime
• Increased Productivity through out-of-the-box components
• Closed Loop Analytics through online (stream) and offline (batch) data
• Swiss-army knife of data movement and data pipelines
• Repeatable ‘turnkey’ solution for next generation data-centric use cases

Admin UI http://localhost:9393/admin-ui/#/streams/definitions
