Fast data and big data with SpringXD

FAST DATA AND BIG DATA WITH SPRING XD
SERGIU and KAILASH

AGENDA
•  Use Cases and BIG DATA
•  Getting Started with SPRING XD
•  How to BUILD fast applications
•  Data Ingestion
•  Taps
•  Processors
•  Sinks
•  Batch Jobs, Workflow Orchestration and Export
•  Real-Time Analytics

WHAT IS BIG DATA
BIG DATA definition: data that won't fit on a single machine. By definition it has to be DISTRIBUTED. It's hard to estimate the necessary threshold or to give a hard metric. If the data can be hosted on a single machine, that is a CLASSICAL use case: store it in a database.

WHAT IS FAST DATA
FAST DATA definition: data processed in near real time, with low latency. It is the next step after BIG data and comprises Volume, Variety, Velocity and Value (the 4 Vs). It encompasses event processing, in-memory databases, and hybrid data stores that optimize cache with disk. Fast Data is nothing new; it was traditionally restricted to a handful of extremely high-value use cases: stock trading, airline schedulers, telco, etc.

DEMO DETAILS
Use case 1:
•  A retailer collaborates with a Telco.
•  The retailer wants to know the "big callers" who are in the vicinity of their shop.
•  Send promotional offers to those callers through various channels such as text messages, big screen ads and public address system ads.
Big caller = call duration > 2500 (configurable)
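The "big caller" rule can be sketched in a few lines of Python. This is only an illustration of the threshold check, not the demo's actual Groovy/Esper code; the field names `caller` and `duration` are assumptions, and the slide does not state the unit of the 2500 threshold.

```python
import json

# Threshold from the slide; its unit is not stated there.
BIG_CALLER_THRESHOLD = 2500


def is_big_caller(record, threshold=BIG_CALLER_THRESHOLD):
    """Return True when the call duration strictly exceeds the threshold."""
    # "duration" is a hypothetical field name for this sketch.
    return record["duration"] > threshold


def big_callers(json_lines, threshold=BIG_CALLER_THRESHOLD):
    """Parse JSON call records and yield only those from big callers."""
    for line in json_lines:
        record = json.loads(line)
        if is_big_caller(record, threshold):
            yield record


# Made-up sample records, for illustration only.
sample_cdrs = [
    '{"caller": "A", "duration": 3000}',
    '{"caller": "B", "duration": 2500}',
    '{"caller": "C", "duration": 120}',
]
big = [r["caller"] for r in big_callers(sample_cdrs)]
```

Because the threshold is a parameter, it stays configurable, as the slide requires.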

DEMO DETAILS
Use case 1: Solution strategy
•  Ingest data into HDFS (for historical analysis)
•  Tap the data for real-time analytics
•  Push the data to a map-based dashboard

DEMO – PART 1.1
Get caller data into HDFS
[Diagram: stream cdr_rawdata_stream; HTTP --800 | HDFS, with CDR data flowing from the HTTP source into HDFS]

DEMO – PART 1.2
[Diagram: tap cdr_processor_tap on cdr_rawdata_stream; GROOVY (convert JSON to POJO) | ESPER (apply data extraction rule) | GROOVY (act on the rule) | Node.JS (visualize)]

DEMO DETAILS
Use case 2:
•  A retailer collaborates with a Telco.
•  For every dropped call, the telco wants to offer a coupon for a retail outlet near the caller's location.

NEED COMPLEX PROCESSING FOR BIG DATA AND FAST DATA
•  More and more enterprises have become customer-centric, with increased demand to create customized experiences.
•  Bring Your Own Device (BYOD): data needs to be made available on all devices, with a better user experience and responsive design.
•  Advent of micro-batch and near-real-time processing of data, and of real-time data streaming.
•  Integration of new data sources with existing data assets for value creation.

WHAT IS CEP
COMPLEX EVENT PROCESSING (CEP)
•  Algo traders and quants writing new trading ALGORITHMS.
•  Compliance and risk management require real-time monitoring of LIQUIDITY.
•  Security and audit require built-in FRAUD detection.
•  Telco phone calls and video streaming must meet a high SLA.

LAMBDA ARCHITECTURE

WHAT IS SPRING XD
•  Stands for Spring eXtreme Data
•  More of a data integration platform combined with an execution platform
•  Completely open source
•  Natural evolution of Spring Integration, Spring Batch and Spring Data
•  Supports both big and fast data processing architecture patterns such as MapReduce, stream processing, etc.
•  Support for predictive analytics

SPRING XD ARCHITECTURE
•  Master/Slave distributed computing architecture
•  Operates in two modes
    •  Standalone / single node
    •  Distributed / clustered
•  Key components are
    •  XD Admin server: plans the action based on the DSL
    •  XD Container server: the worker
•  Admin server uses ZooKeeper for coordination
•  The runtime is called DIRT (Distributed Intelligent Run Time)

SPRING XD ARCHITECTURE
[Diagram: an Admin Server and two XD Containers, each container hosting an XD Module: TCP –port=80 and HDFS –dire….]
stream create --definition "tcp --port=80 | hdfs --directory=/incoming" --name testStream --deploy

SPRING XD ADMIN SERVER
•  Provides a fairly basic browser-based GUI at localhost:9393/admin-ui
Key components are
•  Embedded Container
•  Exposes REST endpoints
•  ZooKeeper for management

SPRING XD ATTRACTIONS
•  Unified platform: Stream Processing and Batch Jobs
•  Hadoop Batch workflow orchestration
•  NoSQL Analytics
•  Machine Learning algorithms
•  Runtime that provides critical non-functional requirements
Not yet another framework
•  Scalable, distributed, fault-tolerant
•  Portable: on-premise cluster, YARN, EC2

XD DISTRIBUTED MODE
HORIZONTAL SCALING
•  Native mode: the user is responsible for starting and stopping services
•  Managed by YARN (Apache Hadoop 2.2.0 or later; this includes Apache Hadoop 2.2.0, Pivotal HD 2.0, Hortonworks HDP 2.1 and Cloudera CDH5)

SPRING XD – KEY COMPONENTS OF DATA PROCESSING
•  Streams: fast data ingestion, processing and storage
    •  E.g. real-time web log processing
•  Jobs: batch / micro-batch orchestration
    •  E.g. Hadoop jobs
•  Taps: an implementation of the wire tap pattern; a non-intrusive way to process data

STREAMS
•  The programming model for processing event streams
•  A stream is built from modules: (1) an input source, (2) processing steps, (3) an output sink
•  An input source produces messages from an external source. XD supports a variety of sources, e.g. syslog, tcp, http
•  The output from a module is a Spring Message containing a payload of data and a collection of key-value headers
•  Processing steps are optional

STREAMS – LINEAR PROCESSING EXAMPLE
•  "http --port=8000 | hdfs --directory=/incoming"
•  Unix pipe-like structure
•  Everything that comes in through port 8000 gets saved into HDFS
•  Future support for other protocols such as JMS
•  Supports non-linear flows too
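The source | processor | sink model described above can be mimicked with plain Python generators. This is a conceptual sketch only and does not use any Spring XD API; the names `http_source`, `uppercase_processor`, `list_sink` and `run_stream` are made up for illustration, and messages are modelled as dicts with `headers` and `payload`.

```python
def http_source(bodies):
    """Hypothetical source: wraps each incoming body in a message."""
    for body in bodies:
        yield {"headers": {"source": "http"}, "payload": body}


def uppercase_processor(messages):
    """Hypothetical processing step: transforms the payload, keeps headers."""
    for message in messages:
        yield {"headers": message["headers"], "payload": message["payload"].upper()}


def list_sink(messages, store):
    """Hypothetical sink: appends payloads to a list standing in for HDFS."""
    for message in messages:
        store.append(message["payload"])


def run_stream(source, processors, sink):
    """Chain source -> processors -> sink, like 'a | b | c' in the DSL."""
    flow = source
    for processor in processors:
        flow = processor(flow)
    sink(flow)


stored = []
run_stream(http_source(["a", "b"]), [uppercase_processor],
           lambda messages: list_sink(messages, stored))

# Processing steps are optional: a source can feed a sink directly.
raw = []
run_stream(http_source(["x"]), [], lambda messages: list_sink(messages, raw))
```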

STREAMS DEPLOYMENT
[Diagram: XD Admin and ZooKeeper alongside two XD Containers, each hosting a Spring application context. One context runs http (port=8000) with an outbound adapter, the other runs HDFS (directory=/) with an inbound adapter; the two are connected through the DataBus (e.g. Redis).]

BATCH JOBS, WORKFLOW ORCHESTRATION AND EXPORT
Spring Batch is used to support the workflow orchestration and export use cases. The concept of a workflow translates to a batch job, which can be thought of as a directed graph of steps, each of which is a processing step.
Spring XD ships with a small number of predefined jobs:
-  FTP to HDFS
-  HDFS to JDBC Export
-  HDFS to MongoDB Export
-  JDBC to HDFS Import
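To make "a job is a directed graph of steps" concrete, here is a small Python sketch that runs steps in dependency order. It does not use Spring Batch; the step names (`fetch`, `load`, `export`) and the `run_job` helper are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_job(steps, depends_on):
    """Run each step after all the steps it depends on.

    steps: maps a step name to a callable taking the shared log list.
    depends_on: maps a step name to the set of steps that must run first.
    """
    log = []
    for name in TopologicalSorter(depends_on).static_order():
        steps[name](log)
    return log


# Hypothetical three-step export job: fetch -> load -> export.
steps = {
    "fetch": lambda log: log.append("fetch"),
    "load": lambda log: log.append("load"),
    "export": lambda log: log.append("export"),
}
depends_on = {"load": {"fetch"}, "export": {"load"}}
job_log = run_job(steps, depends_on)
```

A real batch framework adds restartability and execution tracking on top of this ordering, which the sketch leaves out.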

LAUNCH AND MONITOR JOBS
The JobLauncher triggers the job and the JobRepository keeps track of job execution. Spring XD can launch and monitor MapReduce jobs and Pig/Hive scripts.
Note:
-  Steps can be executed in parallel or remotely.
-  Jobs can be scheduled using a cron expression.
-  Jobs can be executed on demand as a reaction to data on a stream.

TAPS
•  A tap allows you to "listen" to data
•  The existing process is not interrupted
•  A tap is a type of stream
•  The syntax to create a tap is very similar to that of a stream
E.g. 1) stream create --name foo1tap --definition "tap:stream:csvToHDFSstream > log" --deploy
E.g. 2) stream create --name foo1tap --definition "tap:job:csvToHDFSstream > log" --deploy

TAPS CONTD.
•  A tap can consume data from any point along the target stream's processing pipeline
   tap:stream:mystream.filter > ....
•  This taps the data after the filter on mystream is applied
•  The stream is unaware of its taps
•  If a stream is recreated, its taps continue to work
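The wire tap idea can be sketched in Python with a tiny publish/subscribe channel. This is a conceptual model, not Spring XD or Spring Integration code; the `Channel` class, the `duration` field and the filter condition are assumptions made for the example.

```python
class Channel:
    """Minimal publish/subscribe channel standing in for a point in a stream."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        # Every subscriber receives every message published here.
        for handler in list(self._subscribers):
            handler(message)


# The channel that sits right after the filter step of "mystream".
after_filter = Channel()

stored = []   # stands in for the stream's real sink
tapped = []   # stands in for the tap's sink (e.g. a log)

after_filter.subscribe(stored.append)   # the stream's own sink


def mystream(message):
    """The stream knows only its own filter and channel, not its taps."""
    if message["duration"] > 0:          # hypothetical filter condition
        after_filter.publish(message)


# Attaching the tap does not modify or interrupt mystream.
after_filter.subscribe(tapped.append)

for msg in [{"duration": 5}, {"duration": 0}, {"duration": 9}]:
    mystream(msg)
```

Both `stored` and `tapped` end up with the two messages that passed the filter, while `mystream` itself never refers to the tap.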

SPRING XD RUNTIME

SPRING XD AND AGILITY
•  Get started in less than 10 minutes!
•  XD Single-Node
•  Shell-like DSL
•  No build scripts
•  IDE agnostic
•  Easy to extend

SPRING IO PLATFORM

CONCLUSION
•  Spring XD may be new, but as we've seen, it builds on mature foundations: battle-tested, production-ready components:
    •  Spring Integration
    •  Spring Data
    •  Spring Batch
•  Provides a lightweight runtime environment that is easily configured and assembled via a DSL with little or no code.
•  Provides a "one stop shop" for developers to get started building and deploying Big Data applications.
•  11+ years of experience in building large-scale enterprise applications across the globe.

JPMC & PIVOTAL ARE HIRING!

EXTENDING SPRING XD
•  All 4 module types (source, processor and sink for streams, plus jobs) can be custom built.
•  Standard Spring configuration files with supporting binaries

REAL-TIME ANALYTICS
Spring XD provides a few simple analytics tools out of the box, implemented as an abstract API with implementations for in-memory storage and Redis:
•  Simple Counter
•  Field Value Counter: counts the occurrences of each value of a named field.
•  Aggregate Counter: popular in tools like Mongo and Redis, this allows you to time-slice data by, for example, minute, hour, month, year and so on.
•  Gauge: last value
•  Rich Gauge: last value, running average, min/max
Real-time data analytics
The analytics functionality is provided via modules that can be added to a stream. In that sense, real-time analytics is accomplished via exactly the same model as data ingestion. Spring XD provides support for the real-time evaluation of various machine learning scoring algorithms, as well as simple real-time data analytics using various types of counters and gauges.
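The counters and gauges listed above are simple to model. The following in-memory Python sketch illustrates what each one tracks; it is not Spring XD's implementation or API, the class and method names are assumptions, and the rich gauge here uses a plain arithmetic running mean.

```python
from collections import Counter
from datetime import datetime


class FieldValueCounter:
    """Counts how often each value of one named field occurs."""

    def __init__(self, field):
        self.field = field
        self.counts = Counter()

    def record(self, message):
        self.counts[message[self.field]] += 1


class AggregateCounter:
    """Counts events per minute and per hour bucket (time slicing)."""

    def __init__(self):
        self.by_minute = Counter()
        self.by_hour = Counter()

    def increment(self, when):
        self.by_minute[when.replace(second=0, microsecond=0)] += 1
        self.by_hour[when.replace(minute=0, second=0, microsecond=0)] += 1


class RichGauge:
    """Tracks last value, running mean, minimum and maximum."""

    def __init__(self):
        self.count = 0
        self.last = None
        self.mean = 0.0
        self.min = None
        self.max = None

    def set(self, value):
        self.count += 1
        self.last = value
        # Incremental mean: avoids storing every value seen so far.
        self.mean += (value - self.mean) / self.count
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)


regions = FieldValueCounter("region")
for m in [{"region": "north"}, {"region": "south"}, {"region": "north"}]:
    regions.record(m)

hits = AggregateCounter()
for t in [datetime(2014, 1, 1, 10, 0, 5), datetime(2014, 1, 1, 10, 0, 40),
          datetime(2014, 1, 1, 10, 30, 0)]:
    hits.increment(t)

gauge = RichGauge()
for v in [10, 20, 30]:
    gauge.set(v)
```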

JOBS LIFE CYCLE
•  Register a Job Module: copy the XML to $XD_HOME/modules/jobs
•  Create a Job Definition: "job create myJob …."
•  Deploy a Job: "job create ……. --deploy"
•  Launch a Job: "job launch …….." (can be launched using cron or as a sink)
•  Job Execution
•  Un-deploy a Job: "job undeploy --name ….."
•  Destroy a Job Definition: "job destroy --name ….."

MONITORING & MANAGEMENT
ADMIN UI
•  Modules: lists the available batch job modules and more details (such as the job module options and the module XML configuration file).
•  Definitions: lists the XD batch job definitions and provides actions to deploy or un-deploy those jobs.
•  Deployments: lists all the deployed jobs and provides an option to launch a deployed job. Once a job is deployed, it can also be launched through the admin UI.
•  Executions: lists the batch job executions and provides an option to restart a job if it is restartable and has stopped or failed.

DEMO CODE CAN BE FOUND AT
https://github.com/kailashnathkutti/spring-xd-examples
https://github.com/kailashnathkutti/spring-xd-examples-web-gui

DEMO DETAILS
Use case 3:
•  A retailer wants to know the average sales by region every X seconds.
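One way to think about "average sales by region every X seconds" is a tumbling window keyed by region. The Python sketch below works on an in-memory list rather than a live stream and is not the demo's implementation; the event shape `(timestamp_seconds, region, amount)` and the function name are assumptions.

```python
from collections import defaultdict


def average_sales_by_region(events, window_seconds):
    """Group events into fixed windows and average the amounts per region.

    events: iterable of (timestamp_seconds, region, amount) tuples.
    Returns {window_start: {region: average_amount}}.
    """
    # window_start -> region -> [running total, event count]
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for timestamp, region, amount in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        cell = sums[window_start][region]
        cell[0] += amount
        cell[1] += 1
    return {
        start: {region: total / n for region, (total, n) in per_region.items()}
        for start, per_region in sums.items()
    }


# Made-up sample sales events, with X = 10 seconds.
sales = [
    (0, "north", 10.0),
    (3, "north", 30.0),
    (4, "south", 5.0),
    (12, "north", 8.0),
]
averages = average_sales_by_region(sales, window_seconds=10)
```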
