Slide 1

Slide 1 text

FAST DATA AND BIG DATA WITH SPRING XD
SERGIU – SERGIU.BODIU@JPMORAN.COM
KAILASH – KKUTTI@PIVOTAL.IO

Slide 2

Slide 2 text

AGENDA
•  Use Cases and BIG DATA
•  Getting Started with SPRING XD
•  How to BUILD fast applications
•  Data Ingestion
•  Taps
•  Processors
•  Sinks
•  Batch Jobs, Workflow Orchestration and Export
•  Real-Time Analytics

Slide 3

Slide 3 text

WHAT IS BIG DATA
BIG DATA definition: data that won’t fit on a single machine, so by definition it has to be DISTRIBUTED.
It’s hard to estimate the necessary threshold or to give a hard metric.
If the data can be hosted on a single machine, that is a CLASSICAL use case: store it in a database.

Slide 4

Slide 4 text

WHAT IS FAST DATA
FAST DATA definition: data processed in near real time, with low latency.
The next step after BIG data; comprises Volume, Variety, Velocity and Value (4 Vs).
Encompasses event processing, in-memory databases, or hybrid data stores that optimize cache with disk.
Fast Data is nothing new; it was traditionally restricted to a handful of extremely high-value use cases:
-  stock trading, airline schedulers, telecom, etc.

Slide 5

Slide 5 text

DEMO DETAILS
Use case 1:
•  A retailer collaborates with a Telco.
•  The retailer wants to know the “big callers” who are in the vicinity of its shop.
•  Send promotional offers to those callers through various channels such as text messages, big-screen ads and public address system ads.
Big caller = Call duration > 2500 (configurable)
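The big-caller rule above could be sketched roughly as follows. This is an illustrative Python sketch, not the demo's actual code: the record field names (`caller`, `duration`) and the function name `big_callers` are assumptions, and only the threshold of 2500 comes from the slide.

```python
from typing import Iterable, Iterator

# Threshold taken from the slide; exposed as a parameter because the slide
# describes it as configurable.
DEFAULT_BIG_CALLER_THRESHOLD = 2500


def big_callers(records: Iterable[dict],
                threshold: int = DEFAULT_BIG_CALLER_THRESHOLD) -> Iterator[dict]:
    """Yield only the call records whose duration exceeds the threshold."""
    for record in records:
        # 'duration' is a hypothetical field name for the call duration.
        if record["duration"] > threshold:
            yield record


calls = [
    {"caller": "A", "duration": 3000},
    {"caller": "B", "duration": 2500},  # not strictly greater, so excluded
    {"caller": "C", "duration": 120},
]
selected = [record["caller"] for record in big_callers(calls)]
```

Keeping the threshold as an argument mirrors the "configurable" note on the slide; in the demo itself the rule is applied inside the stream rather than in standalone code.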

Slide 6

Slide 6 text

DEMO DETAILS
Use case 1: Solution strategy
•  Ingest data into HDFS (for historical analysis)
•  Tap the data for real-time analytics
•  Push the data to a map-based dashboard

Slide 7

Slide 7 text

DEMO – PART 1.1
Get caller data into HDFS
[Diagram: stream cdr_rawdata_stream: HTTP --800 (CDR Data) | HDFS (CDR Data)]

Slide 8

Slide 8 text

DEMO – PART 1.2
[Diagram: tap cdr_processor_tap on cdr_rawdata_stream: GROOVY (Convert JSON to POJO) | ESPER (Apply Data extraction Rule) | GROOVY (Act on the rule) | Node.JS (Visualize)]

Slide 9

Slide 9 text

DEMO DETAILS
Use case 2:
•  A retailer collaborates with a Telco.
•  For every dropped call, the telco wants to offer a coupon for a retail outlet near the caller’s location.

Slide 10

Slide 10 text

NEED COMPLEX PROCESSING FOR BIG DATA AND FAST DATA
•  More and more enterprises have become customer-centric, with increased demand to create customized experiences.
•  Bring Your Own Device (BYOD) – Data needs to be made available on all devices, with a better user experience and responsive design.
•  Advent of micro-batch and near-real-time processing of data, and real-time data streaming.
•  Integration of new data sources with existing data assets for value creation.

Slide 11

Slide 11 text

WHAT IS CEP
COMPLEX EVENT PROCESSING (CEP)
Algo Traders and Quants writing new trading ALGORITHMS.
Compliance and Risk management require real-time monitoring of LIQUIDITY.
Security and Audit require built-in FRAUD detection.
Telco phone calls and video streaming require a high SLA.

Slide 12

Slide 12 text

LAMBDA ARCHITECTURE

Slide 13

Slide 13 text

WHAT IS SPRING XD
•  Stands for Spring eXtreme Data
•  A data integration platform combined with an execution platform
•  Completely open source
•  Natural evolution of Spring Integration, Spring Batch and Spring Data
•  Supports both big and fast data processing architecture patterns such as MapReduce and stream processing
•  Support for predictive analytics

Slide 14

Slide 14 text

SPRING XD ARCHITECTURE
•  Master/Slave distributed computing architecture
•  Operates in two modes
   -  Standalone/single node
   -  Distributed/clustered
•  Key components are
   -  XD Admin server – Plans the action based on the DSL
   -  XD Container server – The worker
•  Admin server uses Zookeeper for coordination
•  The runtime is called DIRT (Distributed Integration Runtime)

Slide 15

Slide 15 text

SPRING XD ARCHITECTURE
[Diagram: the Admin Server deploys XD Modules (TCP --port=80, HDFS --dire….) into XD Containers]
stream create --name testStream --definition "tcp --port=80 | hdfs --directory=/incoming" --deploy

Slide 16

Slide 16 text

SPRING XD ADMIN SERVER
•  Provides a fairly basic browser-based GUI at localhost:9393/admin-ui
Key components are
•  Embedded Container
•  Exposes REST endpoints
•  Zookeeper for management

Slide 17

Slide 17 text

SPRING XD ATTRACTIONS
•  Unified platform - Stream Processing and Batch Jobs
•  Hadoop Batch workflow orchestration
•  NoSQL Analytics
•  Machine Learning algorithms
•  Runtime that provides critical non-functional requirements
Not yet another framework
•  Scalable, distributed, fault-tolerant
•  Portable: on-premise cluster, YARN, EC2

Slide 18

Slide 18 text

XD DISTRIBUTED MODE
HORIZONTAL SCALING
•  Native mode – user is responsible for starting/stopping services
•  Managed by YARN (Apache Hadoop 2.2.0 or later; this includes Apache Hadoop 2.2.0, Pivotal HD 2.0, Hortonworks HDP 2.1 and Cloudera CDH5)

Slide 19

Slide 19 text

SPRING XD – KEY COMPONENTS OF DATA PROCESSING
•  Streams – Fast data ingestion, processing, storage
   -  E.g. real-time web log processing
•  Jobs – Batch/micro-batch orchestration
   -  E.g. Hadoop jobs
•  Taps – Wire tap pattern implementation. A non-intrusive way to process data

Slide 20

Slide 20 text

STREAMS
•  The programming model for processing event streams
•  A stream is built from modules: (1) an input source, (2) processing steps, (3) an output sink
•  An input source produces messages from an external source. XD supports a variety of sources, e.g. syslog, tcp, http
•  The output from a module is a Spring Message containing a payload of data and a collection of key-value headers
•  Processing steps are optional
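To make the source / processor / sink model concrete, here is a minimal Python sketch. It is a toy model only, not Spring XD or Spring Integration code: the `Message` class merely imitates the payload-plus-headers shape described above, and `run_stream`, `list_source`, `drop_empty` and `to_upper` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Optional


@dataclass
class Message:
    """Toy stand-in for a Spring Message: a payload plus key-value headers."""
    payload: Any
    headers: Dict[str, Any] = field(default_factory=dict)


# A processor returns a (possibly new) message, or None to drop it.
Processor = Callable[[Message], Optional[Message]]


def run_stream(source: Iterable[Message],
               processors: List[Processor],
               sink: Callable[[Message], None]) -> None:
    """Push every message from the source through the processors into the sink."""
    for message in source:
        current: Optional[Message] = message
        for step in processors:
            current = step(current)
            if current is None:  # a processor filtered the message out
                break
        if current is not None:
            sink(current)


def list_source(lines):
    """Hypothetical source: wraps each input line in a Message."""
    for line in lines:
        yield Message(payload=line, headers={"origin": "list"})


def drop_empty(message):
    return message if message.payload else None


def to_upper(message):
    return Message(message.payload.upper(), message.headers)


collected = []
run_stream(list_source(["a", "", "b"]), [drop_empty, to_upper], collected.append)
```

Passing an empty processor list still works, which matches the point that processing steps are optional: the source then feeds the sink directly.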

Slide 21

Slide 21 text

STREAMS – LINEAR PROCESSING EXAMPLE
•  “http --port=8000 | hdfs --directory=/incoming”
•  Unix pipe-like structure
•  Everything that comes in through port 8000 gets saved into HDFS
•  Future support for other protocols like JMS
•  Supports non-linear flows too
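The pipe-like structure of a definition can be illustrated by splitting it into modules and options. The Python sketch below is a deliberately simplified stand-in, not Spring XD's actual DSL parser: `parse_definition` is a hypothetical helper that only handles `--key=value` options and ignores quoting, labels and non-linear flows.

```python
from typing import Dict, List, Tuple


def parse_definition(definition: str) -> List[Tuple[str, Dict[str, str]]]:
    """Split a 'source | ... | sink' definition into (module, options) pairs."""
    modules = []
    for segment in definition.split("|"):
        tokens = segment.split()
        if not tokens:
            raise ValueError("empty module in definition")
        name, options = tokens[0], {}
        for token in tokens[1:]:
            # Only the simple --key=value form is supported in this sketch.
            if not token.startswith("--") or "=" not in token:
                raise ValueError(f"unsupported option syntax: {token!r}")
            key, value = token[2:].split("=", 1)
            options[key] = value
        modules.append((name, options))
    return modules


pipeline = parse_definition("http --port=8000 | hdfs --directory=/incoming")
```

Here `pipeline` holds two entries, the `http` source with its `port` option and the `hdfs` sink with its `directory` option, in the order the data flows.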

Slide 22

Slide 22 text

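STREAMS DEPLOYMENT
[Diagram: XD Admin and Zookeeper coordinate two XD Containers, each running a Spring application context. One container hosts the http module (port=8000) with an outbound adapter; the other hosts the HDFS module (directory=/) with an inbound adapter. The two are connected through the DataBus (e.g. Redis).]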

Slide 23

Slide 23 text

BATCH JOBS, WORKFLOW ORCHESTRATION AND EXPORT
Spring Batch is used to support the workflow orchestration and export use cases. The concept of a workflow translates to a batch job, which can be thought of as a directed graph of steps, each of which is a processing step.
Spring XD ships with a small number of predefined jobs:
-  FTP to HDFS
-  HDFS to JDBC Export
-  HDFS to MongoDB Export
-  JDBC to HDFS Import
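The "directed graph of steps" idea can be sketched in a few lines of Python. This is a conceptual toy, not Spring Batch: `run_job` and the step names (`fetch`, `load`, `export`) are hypothetical, and real Spring Batch jobs add restartability, transactions and conditional transitions that are not modelled here. It uses the standard library's `graphlib` (Python 3.9+) to order the steps.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_job(steps: Dict[str, Callable[[], None]],
            depends_on: Dict[str, Set[str]]) -> List[str]:
    """Run each step after all of its prerequisites; return the execution order."""
    # Map every step to the set of steps that must finish before it.
    graph = {name: depends_on.get(name, set()) for name in steps}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        steps[name]()
    return order


log: List[str] = []
steps = {
    "fetch": lambda: log.append("fetched file"),
    "load": lambda: log.append("loaded into store"),
    "export": lambda: log.append("exported rows"),
}
# 'load' needs 'fetch'; 'export' needs 'load'.
order = run_job(steps, {"load": {"fetch"}, "export": {"load"}})
```

A cycle in the dependencies makes `TopologicalSorter` raise `graphlib.CycleError`, which is a reasonable failure mode for a workflow that can never complete.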

Slide 24

Slide 24 text

LAUNCH AND MONITOR JOBS
JobLauncher triggers the job and JobRepository keeps track of job execution.
Spring XD can launch and monitor MapReduce jobs and Pig/Hive scripts.
Note:
-  Steps can be executed in parallel or remotely.
-  Jobs can be scheduled using a cron expression.
-  Jobs can be executed on demand as a reaction to data on a stream.

Slide 25

Slide 25 text

TAPS
•  A tap allows you to "listen" to data
•  The existing process is not interrupted
•  A tap is a type of stream
•  The syntax to create a tap is very similar to that for a stream
E.g. 1) stream create --name foo1tap --definition "tap:stream:csvToHDFSstream > log" --deploy
E.g. 2) stream create --name foo1tap --definition "tap:job:csvToHDFSstream > log" --deploy

Slide 26

Slide 26 text

TAPS CONTD.
•  A tap can consume data from any point along the target stream’s processing pipeline
   tap:stream:mystream.filter > ....
   -  Taps data after the filter on mystream is applied
•  The stream is unaware of its taps
•  If a stream is recreated, its taps will continue to work
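The wire tap pattern behind taps can be sketched as follows. This is a conceptual Python toy, not how Spring XD implements taps (there they are separate streams fed over the message bus): the `TappableStream` class and its stage names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Stage = Tuple[str, Callable[[Any], Optional[Any]]]


class TappableStream:
    """A linear pipeline whose intermediate outputs can be observed by taps."""

    def __init__(self, stages: List[Stage], sink: Callable[[Any], None]) -> None:
        self.stages = stages
        self.sink = sink
        self._taps: Dict[str, List[Callable[[Any], None]]] = {}

    def tap(self, stage_name: str, listener: Callable[[Any], None]) -> None:
        """Register a listener on the output of the named stage."""
        self._taps.setdefault(stage_name, []).append(listener)

    def send(self, payload: Any) -> None:
        for name, stage in self.stages:
            payload = stage(payload)
            if payload is None:  # the stage filtered the payload out
                return
            # Taps see the data, but the main flow carries on unchanged.
            for listener in self._taps.get(name, []):
                listener(payload)
        self.sink(payload)


tapped: List[int] = []
stored: List[int] = []
stream = TappableStream(
    stages=[("filter", lambda x: x if x > 10 else None),
            ("transform", lambda x: x * 2)],
    sink=stored.append,
)
stream.tap("filter", tapped.append)  # comparable to tapping mystream.filter
for value in (5, 20, 30):
    stream.send(value)
```

The tap on `filter` sees only what passed the filter, before the transform runs, while the sink still receives the fully processed values.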

Slide 27

Slide 27 text

SPRING XD RUNTIME

Slide 28

Slide 28 text

SPRING XD AND AGILITY
•  Get started in less than 10 minutes!
•  XD Single-Node
•  Shell-like DSL
•  No build scripts
•  IDE agnostic
•  Easy to extend

Slide 29

Slide 29 text

SPRING IO PLATFORM

Slide 30

Slide 30 text

CONCLUSION
•  Spring XD may be new, but as we've seen, it builds on mature, battle-tested, production-ready components:
   -  Spring Integration
   -  Spring Data
   -  Spring Batch
•  Provides a lightweight runtime environment that is easily configured and assembled via a DSL with little or no code.
•  Provides a "one stop shop" for developers to get started building and deploying Big Data applications.
•  11+ years of experience in building large-scale enterprise applications across the globe.

Slide 31

Slide 31 text

JPMC & PIVOTAL ARE HIRING!!

Slide 32

Slide 32 text

EXTENDING SPRING XD
•  All 4 module types (source, processor and sink for streams, plus jobs) can be custom built.
•  Standard Spring configuration files with supporting binaries

Slide 33

Slide 33 text

REAL-TIME ANALYTICS
Spring XD provides, out of the box, a few simple analytics tools implemented as an abstract API with implementations for in-memory and Redis, as follows:
-  Simple Counter
-  Field Value Counter: Counts the occurrences of the values of a named field.
-  Aggregate Counter: Popular in tools like Mongo and Redis, this allows you to timeslice data by, for example, minute, hour, month, year and so on.
-  Gauge: Last value
-  Rich Gauge: Last value, running average, min/max
Real-time data analytics
The analytics functionality is provided via modules that can be added to a stream. In that sense, real-time analytics is accomplished via exactly the same model as data ingestion.
Spring XD provides support for the real-time evaluation of various machine learning scoring algorithms as well as simple real-time data analytics using various types of counters and gauges.
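The metrics listed above can be approximated in plain Python to show what each one tracks. This is an in-memory sketch under stated assumptions, not Spring XD's analytics API: `RichGauge`, `field_value_counts` and `aggregate_by_minute` are hypothetical names, and the running average is assumed to be a plain arithmetic mean.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional


class RichGauge:
    """Tracks the last value, running average, minimum and maximum."""

    def __init__(self) -> None:
        self.count = 0
        self.last: Optional[float] = None
        self.minimum: Optional[float] = None
        self.maximum: Optional[float] = None
        self.average = 0.0

    def update(self, value: float) -> None:
        self.count += 1
        self.last = value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)
        # Incremental mean: no need to keep the full history of values.
        self.average += (value - self.average) / self.count


def field_value_counts(records: Iterable[dict], field_name: str) -> Counter:
    """Field value counter: occurrences of each value of one named field."""
    return Counter(r[field_name] for r in records if field_name in r)


def aggregate_by_minute(timestamps: Iterable[datetime]) -> Counter:
    """Aggregate counter: event counts sliced into one-minute buckets."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


gauge = RichGauge()
for reading in (10.0, 20.0, 30.0):
    gauge.update(reading)

regions = field_value_counts(
    [{"region": "north"}, {"region": "south"}, {"region": "north"}, {"store": 7}],
    "region",
)
per_minute = aggregate_by_minute([
    datetime(2015, 1, 1, 9, 0, 5),
    datetime(2015, 1, 1, 9, 0, 50),
    datetime(2015, 1, 1, 9, 1, 0),
])
```

A plain gauge would keep only `last`; the other slices of an aggregate counter (hour, month, year) follow the same bucketing idea with a coarser truncation.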

Slide 34

Slide 34 text

JOBS LIFE CYCLE
•  Register a Job Module – copy XML to $XD_HOME/modules/jobs
•  Create a Job Definition – “job create myJob ….”
•  Deploy a Job – “job create ……. --deploy”
•  Launch a Job – “job launch ……..” (can be launched using cron or as a sink)
•  Job Execution
•  Un-deploy a Job – “job undeploy --name …..”
•  Destroy a Job Definition – “job destroy --name …..”

Slide 35

Slide 35 text

MONITORING & MANAGEMENT
ADMIN UI
-  Modules: lists the available batch job modules and more details (such as the job module options and the module XML configuration file).
-  Definitions: lists the XD batch job definitions and provides actions to deploy or un-deploy those jobs.
-  Deployments: lists all the deployed jobs and provides an option to launch the deployed job. Once the job is deployed, it can be launched through the admin UI as well.
-  Executions: lists the batch job executions and provides an option to restart if the batch job is restartable and stopped/failed.

Slide 36

Slide 36 text

DEMO CODE CAN BE FOUND
https://github.com/kailashnathkutti/spring-xd-examples
https://github.com/kailashnathkutti/spring-xd-examples-web-gui

Slide 37

Slide 37 text

DEMO DETAILS
Use case 3:
•  A retailer wants to know the average sales by region every X seconds.
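One way to picture "average sales by region every X seconds" is a tumbling time window. The Python sketch below is illustrative only and says nothing about how the demo computes it: `average_sales_by_region` is a hypothetical function, and the event shape `(timestamp_seconds, region, amount)` is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

SaleEvent = Tuple[float, str, float]  # (timestamp in seconds, region, amount)


def average_sales_by_region(events: Iterable[SaleEvent],
                            window_seconds: int) -> Dict[int, Dict[str, float]]:
    """Group sales into fixed windows of X seconds and average them per region."""
    # window start -> region -> [running total, number of sales]
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for timestamp, region, amount in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        bucket = sums[window_start][region]
        bucket[0] += amount
        bucket[1] += 1
    return {
        window: {region: total / count for region, (total, count) in regions.items()}
        for window, regions in sums.items()
    }


sales = [
    (0, "north", 10.0),
    (3, "north", 30.0),
    (4, "south", 5.0),
    (12, "north", 8.0),
]
averages = average_sales_by_region(sales, window_seconds=10)
```

With a 10-second window, the first three sales fall into the window starting at 0 and the last one into the window starting at 10. A streaming implementation would emit each window as it closes rather than collecting all events first.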

Slide 38

Slide 38 text

BATCH