Use Cases and BIG DATA
Getting Started with Spring XD
How to BUILD fast applications
- Data Ingestion
- Taps
- Processors
- Sinks
- Batch Jobs, Workflow Orchestration and Export
- Real-Time Analytics
fit on a single machine. By definition it has to be DISTRIBUTED. It is hard to estimate the necessary threshold or to give a hard metric. If the data can be hosted on a single machine, that is a CLASSICAL use case: store it in a database.
real-time, low latency. The next step after BIG data (Volume, Variety, Velocity and Value: the 4 Vs). Encompasses event processing, in-memory databases, or hybrid data stores that optimize cache with disk. Fast Data is nothing new; it was traditionally restricted to a handful of extremely high-value use cases:
- stock trading
- airline schedulers
- telco, etc.
a Telco.
• A retailer wants to know the “big callers” who are in the vicinity of its shop.
• Send promotional offers to those callers through various channels such as text messages, big-screen ads and public address system ads.
Big caller = Call duration > 2500 (configurable)
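The "big caller" rule above can be sketched as a simple filter. This is only an illustration: the `CallRecord` shape, the field names and the sample data are hypothetical, and the slide does not state the unit of the 2500 threshold, so the sketch leaves it unitless.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    # Hypothetical record shape; the slides do not define field names.
    caller: str
    duration: int  # same (unstated) unit as the threshold


# Default taken from the slide; described there as configurable.
BIG_CALLER_THRESHOLD = 2500


def big_callers(records, threshold=BIG_CALLER_THRESHOLD):
    """Return each caller that has at least one call longer than threshold."""
    seen = []
    for record in records:
        if record.duration > threshold and record.caller not in seen:
            seen.append(record.caller)
    return seen


calls = [
    CallRecord("alice", 3000),
    CallRecord("bob", 1200),
    CallRecord("carol", 2500),  # equal to the threshold, so not a big caller
    CallRecord("alice", 2600),
]
print(big_callers(calls))
```

In a streaming setup the same predicate would run per message as calls arrive, rather than over a finished list.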
More and more enterprises have become customer-centric, with increased demand to create customized experiences.
• Bring Your Own Device (BYOD) – Data needs to be made available on all devices, with a better user experience and responsive design for all devices.
• Advent of micro-batch and near-real-time processing of data, and real-time data streaming.
• Integration of new data sources with existing data assets for value creation.
Quants writing new trading ALGORITHMS. Compliance and Risk management require real-time monitoring of LIQUIDITY. Security and Audit require built-in FRAUD detection. Telco phone calls and video streaming, ensuring a high SLA.
• More of a data integration platform with an execution platform
• Completely open source
• Natural evolution of Spring Integration, Spring Batch and Spring Data
• Supports both big and fast data processing architecture patterns such as MapReduce, stream processing, etc.
• Support for predictive analytics
in two modes
• Stand-alone / single node
• Distributed / clustered
• Key components are
  • XD Admin server – Plans the action based on the DSL
  • XD Container server – The worker
• Admin server uses ZooKeeper for coordination
• The run time is called DIRT (Distributed Intelligent Run Time)
is responsible for starting/stopping services
• Managed by YARN (Apache Hadoop 2.2.0 or later. This includes Apache Hadoop 2.2.0, Pivotal HD 2.0, Hortonworks HDP 2.1 and Cloudera CDH5)
– Fast data ingestion, processing, store
  • E.g. real-time web log processing
• Jobs – Batch/micro-batch orchestration
  • E.g. Hadoop jobs
• Taps – Wire tap pattern implementation. Non-intrusive way to process data
Need modules such as (1) an input source, (2) processing steps, (3) an output sink
• An input source produces messages from an external source. XD supports a variety of sources, e.g. syslog, tcp, http
• The output from a module is a Spring Message containing a payload of data and a collection of key-value headers
• Processing steps are optional
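The source, processor and sink roles described above can be modelled in a few lines. This is a conceptual sketch in plain Python, not Spring XD or Spring Integration code: the `Message` class only mimics the payload-plus-headers shape, and the `sequence` header and sample payloads are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Iterator, List


@dataclass
class Message:
    # Mimics the shape described above: a payload plus key-value headers.
    payload: Any
    headers: Dict[str, Any] = field(default_factory=dict)


def source(lines: Iterable[str]) -> Iterator[Message]:
    """Input source: wraps each external item in a Message."""
    for n, line in enumerate(lines):
        yield Message(payload=line, headers={"sequence": n})  # hypothetical header


def processor(messages: Iterable[Message], fn: Callable[[Any], Any]) -> Iterator[Message]:
    """Optional processing step: transforms the payload, keeps the headers."""
    for message in messages:
        yield Message(payload=fn(message.payload), headers=dict(message.headers))


def sink(messages: Iterable[Message], out: List[Message]) -> None:
    """Output sink: here it just collects messages in a list."""
    for message in messages:
        out.append(message)


received: List[Message] = []
sink(processor(source(["get /index", "get /cart"]), str.upper), received)
for message in received:
    print(message.headers, message.payload)
```

Because the processing step is optional, `sink(source(...), received)` is also a valid pipeline in this model.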
--directory=/incoming"
• Unix pipe-like structure
• Everything that comes in through port 8000 gets saved into HDFS
• Future support for other protocols such as JMS
• Supports non-linear flows too
to support the workflow orchestration and export use cases. The concept of a workflow translates to a batch job, which can be thought of as a directed graph of steps, each of which is a processing step. Spring XD ships with a small number of predefined jobs:
- FTP to HDFS
- HDFS to JDBC Export
- HDFS to MongoDB Export
- JDBC to HDFS Import
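The idea of a batch job as a directed graph of steps can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not how Spring Batch is implemented; the step names below are hypothetical and only loosely modelled on an FTP-to-HDFS style job.

```python
from graphlib import TopologicalSorter


def run_job(steps, depends_on):
    """Run each step after all of its predecessors; return the execution order."""
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        steps[name]()
    return order


executed = []

# Hypothetical steps; each one only records that it ran.
steps = {
    "fetch_from_ftp": lambda: executed.append("fetch_from_ftp"),
    "copy_to_hdfs": lambda: executed.append("copy_to_hdfs"),
    "verify": lambda: executed.append("verify"),
}

# Maps each step to the set of steps that must finish before it.
depends_on = {
    "copy_to_hdfs": {"fetch_from_ftp"},
    "verify": {"copy_to_hdfs"},
}

print(run_job(steps, depends_on))
```

A graph with independent branches would allow some steps to run in parallel; this sketch always runs them one after another.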
keeps track of job execution. Spring XD can launch and monitor MapReduce jobs and Pig/Hive scripts.
Note:
- Steps can be executed in parallel or remotely.
- Jobs can be scheduled using a cron expression.
- Jobs can be executed on demand as a reaction to data on a stream.
• The existing process is not interrupted
• A tap is a type of stream
• The syntax to create a tap is very similar to that of a stream
E.g. 1) stream create --name foo1tap --definition "tap:stream:csvToHDFSstream > log" --deploy
E.g. 2) stream create --name foo1tap --definition "tap:job:csvToHDFSstream > log" --deploy
point along the target stream’s processing pipeline
tap:stream:mystream.filter > ....
• Taps the data after the filter on mystream is applied
• A stream is unaware of its taps
• If a stream is re-created, its taps will continue to work
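The wire tap pattern described above, including tapping at a named point such as the output of a filter, can be sketched as follows. This is a simplified model, not Spring XD internals: the `Stream` class and its methods are hypothetical, and for brevity the tap registry lives on the pipeline object, whereas the slides describe the stream itself as unaware of its taps.

```python
class Stream:
    """Hypothetical pipeline: named steps followed by a sink."""

    def __init__(self, steps, sink):
        self.steps = steps   # list of (name, fn); fn returns None to drop a payload
        self.sink = sink
        self._taps = {}      # step name -> list of listeners

    def tap(self, step_name, listener):
        """Register a listener that sees a copy of the output of step_name."""
        self._taps.setdefault(step_name, []).append(listener)

    def send(self, payload):
        for name, fn in self.steps:
            payload = fn(payload)
            if payload is None:
                return  # dropped by this step; nothing further is delivered
            for listener in self._taps.get(name, []):
                listener(payload)  # the tap observes; the main flow continues
        self.sink(payload)


main_out = []
tapped = []

stream = Stream(
    steps=[
        ("filter", lambda p: p if "ERROR" in p else None),
        ("transform", str.lower),
    ],
    sink=main_out.append,
)
stream.tap("filter", tapped.append)  # analogous to tapping mystream.filter

for line in ["ERROR disk", "INFO ok", "ERROR net"]:
    stream.send(line)

print(main_out)
print(tapped)
```

The step functions are never modified to support the tap, which is the non-intrusive property the slides emphasise.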
it builds on mature foundations and battle-tested, production-ready components:
• Spring Integration
• Spring Data
• Spring Batch
• Provides a lightweight runtime environment that is easily configured and assembled via a DSL with little or no code.
• Provides a "one stop shop" for developers to get started building a Big Data application and deploying such applications.
• 11+ years of experience in building large-scale enterprise applications across the globe.
few simple analytics tools implemented as an abstract API with implementations for in-memory and Redis, as follows:
• Simple Counter
• Field Value Counter: Counts the occurrences of named fields.
• Aggregate Counter: Popular in tools like Mongo and Redis, this allows you to time-slice data by, for example, minute, hour, month, year and so on.
• Gauge: Last value
• Rich Gauge: Last value, running average, min/max
Real-time data analytics
The analytics functionality is provided via modules that can be added to a stream. In that sense, real-time analytics is accomplished via exactly the same model as data ingestion. Spring XD provides support for the real-time evaluation of various machine learning scoring algorithms as well as simple real-time data analytics using various types of counters and gauges.
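To make the counter and gauge descriptions above concrete, here is an in-memory sketch. The class and method names are illustrative and are not the Spring XD API; the field value counter is interpreted here as counting the values seen for one named field, and the aggregate counter is assumed to keep minute-level buckets that roll up to hours.

```python
from collections import Counter
from datetime import datetime


class FieldValueCounter:
    """Counts how often each value of one named field occurs."""

    def __init__(self, field_name):
        self.field_name = field_name
        self.counts = Counter()

    def process(self, record):
        if self.field_name in record:
            self.counts[record[self.field_name]] += 1


class RichGauge:
    """Tracks last value, running average, min and max."""

    def __init__(self):
        self.last = None
        self.minimum = None
        self.maximum = None
        self._count = 0
        self._total = 0.0

    def set(self, value):
        self.last = value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)
        self._count += 1
        self._total += value

    @property
    def average(self):
        return self._total / self._count if self._count else 0.0


class AggregateCounter:
    """Counts events per minute and rolls them up to coarser time slices."""

    def __init__(self):
        self.by_minute = Counter()

    def increment(self, when, amount=1):
        self.by_minute[when.replace(second=0, microsecond=0)] += amount

    def by_hour(self):
        hours = Counter()
        for minute, n in self.by_minute.items():
            hours[minute.replace(minute=0)] += n
        return hours


tags = FieldValueCounter("tag")
for record in [{"tag": "a"}, {"tag": "b"}, {"tag": "a"}, {"other": 1}]:
    tags.process(record)

gauge = RichGauge()
for value in (10, 4, 7):
    gauge.set(value)

hits = AggregateCounter()
for stamp in (
    datetime(2014, 1, 1, 10, 0, 5),
    datetime(2014, 1, 1, 10, 0, 50),
    datetime(2014, 1, 1, 10, 1, 10),
    datetime(2014, 1, 1, 11, 30, 0),
):
    hits.increment(stamp)

print(tags.counts, gauge.last, gauge.average, hits.by_hour())
```

A plain gauge is the `last` attribute on its own; a Redis-backed variant would keep the same interface and move the state out of process.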
XML to $XD_HOME/modules/jobs
• Create a Job Definition – “job create myJob ….”
• Deploy a Job – “job create ……. --deploy”
• Launch a Job – “job launch ……..” (can be launched using cron or as a sink)
• Job Execution
• Un-deploy a Job – “job undeploy --name …..”
• Destroy a Job Definition – “job destroy --name …..”
job modules and more details (such as the job module options and the module XML configuration file).
• Definitions: lists the XD batch job definitions and provides actions to deploy or un-deploy those jobs.
• Deployments: lists all the deployed jobs and provides an option to launch a deployed job. Once a job is deployed, it can be launched through the admin UI as well.
• Executions: lists the batch job executions and provides an option to restart a batch job if it is restartable and has stopped or failed.