Fast data and big data with SpringXD

Use Cases and BIG DATA
Getting Started with Spring XD
How to BUILD fast applications
- Data Ingestion
- Taps
- Processors
- Sinks
- Batch Jobs, Workflow Orchestration and Export
- Real-Time Analytics

sergiubodiu

August 07, 2014

Transcript

  1. AGENDA
     - Use Cases and BIG DATA
     - Getting Started with SPRING XD
     - How to BUILD fast applications
     - Data Ingestion
     - Taps
     - Processors
     - Sinks
     - Batch Jobs, Workflow Orchestration and Export
     - Real-Time Analytics
  2. WHAT IS BIG DATA
     BIG DATA definition: data that won’t fit on a single machine, so by definition it has to be DISTRIBUTED. It is hard to estimate the necessary threshold or to give a hard metric. If the data can be hosted on a single machine, that is a CLASSICAL use case: store it in a database.
  3. WHAT IS FAST DATA
     FAST DATA definition: data processed in near real time, with low latency.
     - The next step after BIG data; comprises Volume, Variety, Velocity and Value (the 4 Vs).
     - Encompasses event processing, in-memory databases, and hybrid data stores that optimize cache with disk.
     - Fast Data is nothing new; it was traditionally restricted to a handful of extremely high-value use cases: stock trading, airline schedulers, telco, etc.
  4. DEMO DETAILS
     Use case 1:
     - A retailer collaborates with a Telco.
     - The retailer wants to know the “big callers” who are in the vicinity of their shop.
     - Send promotional offers to those callers through various channels such as text messages, big-screen ads and public address system ads.
     Big caller = call duration > 2500 (configurable)
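
The "big caller" rule above can be sketched in a few lines of Python. This is only an illustration of the rule, not the demo's implementation: the `CallRecord` fields (`caller`, `duration`, `near_shop`) and the `big_callers` function are hypothetical names, and the slide does not state the unit of the 2500 threshold.

```python
from dataclasses import dataclass


# Hypothetical CDR shape: these field names are assumptions, not the demo's schema.
@dataclass
class CallRecord:
    caller: str
    duration: int    # call duration, in the same (unstated) unit as the threshold
    near_shop: bool  # True when the caller is in the vicinity of the shop


def big_callers(records, threshold=2500):
    """Return callers near the shop whose call duration exceeds the threshold."""
    return [r.caller for r in records if r.near_shop and r.duration > threshold]


records = [
    CallRecord("alice", 3000, True),
    CallRecord("bob", 2500, True),     # not strictly greater than 2500
    CallRecord("carol", 4000, False),  # long call, but not near the shop
]
print(big_callers(records))
```

Keeping the threshold as a parameter mirrors the "configurable" note on the slide.
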
  5. DEMO DETAILS
     Use case 1: Solution strategy
     - Ingest data into HDFS (for historical analysis)
     - Tap the data for real-time analytics
     - Push the data to a map-based dashboard
  6. DEMO – PART 1.1
     Get caller data into HDFS
     [Diagram labels: HTTP --800; CDR Data; HDFS; CDR Data; stream: cdr_rawdata_stream]
  7. DEMO – PART 1.2
     [Diagram labels: GROOVY (Convert JSON to POJO); ESPER (Apply data extraction rule); GROOVY (Act on the rule); Node.JS (Visualize); stream: cdr_rawdata_stream; tap: cdr_processor_tap]
  8. DEMO DETAILS
     Use case 2:
     - A retailer collaborates with a Telco.
     - For every dropped call, the telco wants to offer a coupon for a retail outlet near the caller’s location.
  9. NEED COMPLEX PROCESSING FOR BIG DATA AND FAST DATA
     - More and more enterprises have become customer-centric, with increased demand to create customized experiences.
     - Bring Your Own Device (BYOD): data needs to be made available on all devices, with a better user experience and responsive design.
     - Advent of micro-batch and near-real-time processing of data, and real-time data streaming.
     - Integration of new data sources with existing data assets for value creation.
  10. WHAT IS CEP
      COMPLEX EVENT PROCESSING (CEP)
      - Algo traders and quants writing new trading ALGORITHMS.
      - Compliance and risk management require real-time monitoring of LIQUIDITY.
      - Security and audit require built-in FRAUD detection.
      - Telco phone calls and video streaming must ensure a high SLA.
  11. WHAT IS SPRING XD
      - Stands for Spring eXtreme Data
      - More of a data integration platform with an execution platform
      - Completely open source
      - Natural evolution of Spring Integration, Spring Batch and Spring Data
      - Supports both big and fast data processing architecture patterns such as MapReduce and stream processing
      - Support for predictive analytics
  12. SPRING XD ARCHITECTURE
      - Master/slave distributed computing architecture
      - Operates in two modes
        - Standalone / single node
        - Distributed / clustered
      - Key components are
        - XD Admin server: plans the action based on the DSL
        - XD Container server: the worker
      - The Admin server uses ZooKeeper for coordination
      - The runtime is called DIRT (Distributed Integration Runtime)
  13. SPRING XD ARCHITECTURE
      [Diagram labels: Admin Server; XD Container; XD Container; XD Module; XD Module; TCP --port=80; HDFS --dire….]
      create stream --definition "tcp --port=80 | hdfs --directory=/incoming" --name testStream --deploy
  14. SPRING XD ADMIN SERVER
      - Provides a fairly basic browser-based GUI at localhost:9393/admin-ui
      Key components are
      - Embedded Container
      - Exposes REST endpoints
      - ZooKeeper for management
  15. SPRING XD ATTRACTIONS
      - Unified platform: Stream Processing and Batch Jobs
      - Hadoop Batch workflow orchestration
      - NoSQL Analytics
      - Machine Learning algorithms
      - A runtime that provides critical non-functional requirements, not yet another framework
        - Scalable, distributed, fault-tolerant
        - Portable: on-premise cluster, YARN, EC2
  16. XD DISTRIBUTED MODE HORIZONTAL SCALING
      - Native mode: the user is responsible for starting/stopping services
      - Managed by YARN (Apache Hadoop 2.2.0 or later; this includes Apache Hadoop 2.2.0, Pivotal HD 2.0, Hortonworks HDP 2.1 and Cloudera CDH5)
  17. SPRING XD – KEY COMPONENTS OF DATA PROCESSING
      - Streams: fast data ingestion, processing, store
        - E.g. real-time web log processing
      - Jobs: batch / micro-batch orchestration
        - E.g. Hadoop jobs
      - Taps: wire tap pattern implementation; a non-intrusive way to process data
  18. STREAMS
      - The programming model for processing event streams
      - A stream needs modules: (1) an input source, (2) processing steps, (3) an output sink
      - An input source produces messages from an external source. XD supports a variety of sources, e.g. syslog, tcp, http
      - The output from a module is a Spring Message containing a payload of data and a collection of key-value headers
      - Processing steps are optional
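
The source / optional processors / sink model described above can be illustrated with a small Python sketch. It is a conceptual analogy only and does not use any Spring XD or Spring Integration API: `Message`, `source`, `run_stream` and `upper` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class Message:
    """A payload plus key-value headers, loosely modelled on the slide's description."""
    payload: Any
    headers: Dict[str, Any] = field(default_factory=dict)


def source(lines: Iterable[str]) -> Iterable[Message]:
    """Hypothetical input source: wraps each external line in a Message."""
    for n, line in enumerate(lines):
        yield Message(payload=line, headers={"sequence": n})


def run_stream(messages: Iterable[Message],
               processors: List[Callable[[Message], Message]],
               sink: Callable[[Message], None]) -> None:
    """Pass every message through the (possibly empty) processor list, then to the sink."""
    for msg in messages:
        for process in processors:
            msg = process(msg)
        sink(msg)


def upper(msg: Message) -> Message:
    """Example processing step: upper-case the payload, keep the headers."""
    return Message(msg.payload.upper(), dict(msg.headers))


collected: List[Message] = []
run_stream(source(["a", "b"]), [upper], collected.append)  # with a processing step
run_stream(source(["c"]), [], collected.append)            # processing steps are optional
print([m.payload for m in collected])
```
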
  19. STREAMS – LINEAR PROCESSING EXAMPLE
      - "http --port=8000 | hdfs --directory=/incoming"
      - Unix pipe-like structure
      - Everything that comes in through port 8000 gets saved into HDFS
      - Future support for other protocols such as JMS
      - Supports non-linear flows too
  20. STREAMS DEPLOYMENT
      [Diagram labels: XD Admin; ZooKeeper; XD Container; XD Container; Spring application context; Spring application context; http port=8000; Outbound adapter; HDFS directory=/; Inbound adapter; DataBus (e.g. Redis)]
  21. BATCH JOBS, WORKFLOW ORCHESTRATION AND EXPORT
      Spring Batch is used to support the workflow orchestration and export use cases. The concept of a workflow translates to a batch job, which can be thought of as a directed graph of steps, each of which is a processing step.
      Spring XD ships with a small number of predefined jobs:
      - FTP to HDFS
      - HDFS to JDBC Export
      - HDFS to MongoDB Export
      - JDBC to HDFS Import
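
To make the "directed graph of steps" idea concrete, here is a minimal Python sketch that runs steps in dependency order. It is not Spring Batch: the step names (`fetch`, `load`, `export`) and the `run_job` helper are hypothetical, and the ordering relies on the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_job(steps, depends_on):
    """Run each step after its prerequisites.

    steps:      step name -> zero-argument callable
    depends_on: step name -> set of step names that must run first
    Returns the list of step names in the order they were executed.
    """
    graph = {name: depends_on.get(name, set()) for name in steps}
    executed = []
    for name in TopologicalSorter(graph).static_order():
        steps[name]()
        executed.append(name)
    return executed


log = []
# Hypothetical three-step job, declared out of order on purpose.
steps = {
    "export": lambda: log.append("export"),
    "fetch": lambda: log.append("fetch"),
    "load": lambda: log.append("load"),
}
depends_on = {"load": {"fetch"}, "export": {"load"}}

order = run_job(steps, depends_on)
print(order)
```
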
  22. LAUNCH AND MONITOR JOBS
      JobLauncher triggers the job and JobRepository keeps track of job execution. Spring XD can launch and monitor MapReduce jobs and Pig/Hive scripts.
      Note:
      - Steps can be executed in parallel or remotely.
      - Jobs can be scheduled using a cron expression.
      - Jobs can be executed on demand as a reaction to data on a stream.
  23. TAPS
      - A tap allows you to "listen" to data
      - The existing process is not interrupted
      - A tap is a type of stream
      - The syntax to create a tap is very similar to that for streams
      E.g. 1) stream create --name foo1tap --definition "tap:stream:csvToHDFSstream > log" --deploy
      E.g. 2) stream create --name foo1tap --definition "tap:job:csvToHDFSstream > log" --deploy
  24. TAPS CONTD.
      - A tap can consume data from any point along the target stream’s processing pipeline: tap:stream:mystream.filter > ....
        - Taps the data after the filter on mystream is applied
      - Streams have no knowledge of taps
      - If a stream is recreated, taps continue to work
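
The wire tap idea, listening after a named stage without changing the stream itself, can be sketched as follows. This is a conceptual Python analogy rather than Spring XD code: the `Stream` class, its `tap` method and the stage names are hypothetical.

```python
class Stream:
    """Toy pipeline: named stages followed by a sink, with optional taps."""

    def __init__(self, stages, sink):
        self.stages = stages  # list of (name, function); a function returns None to drop data
        self.sink = sink
        self.taps = {}        # stage name -> listeners that receive a copy of its output

    def tap(self, stage_name, listener):
        """Listen to the output of a stage; the stages and the sink are left untouched."""
        self.taps.setdefault(stage_name, []).append(listener)

    def send(self, payload):
        for name, fn in self.stages:
            payload = fn(payload)
            if payload is None:  # dropped by a filter stage
                return
            for listener in self.taps.get(name, []):
                listener(payload)
        self.sink(payload)


stored, tapped = [], []
mystream = Stream(
    stages=[
        ("filter", lambda x: x if x % 2 == 0 else None),  # keep even numbers only
        ("transform", lambda x: x * 10),
    ],
    sink=stored.append,
)
mystream.tap("filter", tapped.append)  # analogous in spirit to tap:stream:mystream.filter

for value in range(1, 5):
    mystream.send(value)
print(stored, tapped)
```

The tap sees the data as it is after the filter stage and before the transform, while the sink still receives exactly what it would have received without the tap.
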
  25. SPRING XD AND AGILITY
      - Get started in less than 10 minutes!
      - XD Single-Node
      - Shell-like DSL
      - No build scripts
      - IDE agnostic
      - Easy to extend
  26. CONCLUSION
      - Spring XD may be new, but as we've seen, it builds on mature foundations: battle-tested, production-ready components:
        - Spring Integration
        - Spring Data
        - Spring Batch
      - Provides a lightweight runtime environment that is easily configured and assembled via a DSL with little or no code.
      - Provides a "one stop shop" for developers to get started building and deploying Big Data applications.
      - 11+ years of experience in building large-scale enterprise applications across the globe.
  27. EXTENDING SPRING XD
      - All four module types (source, processor and sink for streams, plus jobs) can be custom built.
      - Standard Spring configuration files with supporting binaries
  28. REAL-TIME ANALYTICS
      Spring XD provides, out of the box, a few simple analytics tools implemented as an abstract API with implementations for in-memory and Redis, as follows:
      - Simple Counter
      - Field Value Counter: counts the occurrences of named fields.
      - Aggregate Counter: popular in tools like Mongo and Redis, this allows you to time-slice data by, for example, minute, hour, month, year and so on.
      - Gauge: last value
      - Rich Gauge: last value, running average, min/max
      Real-time data analytics
      The analytics functionality is provided via modules that can be added to a stream. In that sense, real-time analytics is accomplished via exactly the same model as data ingestion. Spring XD supports the real-time evaluation of various machine learning scoring algorithms as well as simple real-time data analytics using various types of counters and gauges.
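
As a rough illustration of what a field value counter and a rich gauge compute, here is an in-memory Python sketch. It is not the Spring XD analytics API: the class and method names are hypothetical, and the field value counter is interpreted here as counting how often each value of a named field occurs.

```python
from collections import Counter


class FieldValueCounter:
    """Counts how often each value of one named field occurs (interpretation assumed)."""

    def __init__(self, field_name):
        self.field_name = field_name
        self.counts = Counter()

    def process(self, record):
        if self.field_name in record:
            self.counts[record[self.field_name]] += 1


class RichGauge:
    """Tracks last value, running average, minimum and maximum."""

    def __init__(self):
        self.count = 0
        self.last = None
        self.average = 0.0
        self.minimum = None
        self.maximum = None

    def set(self, value):
        self.count += 1
        self.last = value
        # Incremental mean: avoids keeping every value in memory.
        self.average += (value - self.average) / self.count
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)


regions = FieldValueCounter("region")
for record in [{"region": "north"}, {"region": "south"}, {"region": "north"}, {"other": 1}]:
    regions.process(record)

gauge = RichGauge()
for value in [10, 20, 60]:
    gauge.set(value)

print(dict(regions.counts), gauge.last, gauge.average, gauge.minimum, gauge.maximum)
```
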
  29. JOBS LIFE CYCLE
      - Register a Job Module: copy XML to $XD_HOME/modules/jobs
      - Create a Job Definition: “create job myJob ….”
      - Deploy a Job: “create ……. --deploy”
      - Launch a Job: “job launch ……..” (can be launched using cron or as a sink)
      - Job Execution
      - Un-deploy a Job: “job undeploy --name …..”
      - Destroy a Job Definition: “job destroy --name …..”
  30. MONITORING & MANAGEMENT ADMIN UI
      - Modules: lists the available batch job modules and more details (such as the job module options and the module XML configuration file).
      - Definitions: lists the XD batch job definitions and provides actions to deploy or un-deploy those jobs.
      - Deployments: lists all the deployed jobs and provides an option to launch a deployed job. Once a job is deployed, it can be launched through the admin UI as well.
      - Executions: lists the batch job executions and provides an option to restart a job if it is restartable and stopped/failed.
  31. DEMO DETAILS
      Use case 3:
      - A retailer wants to know the average sales by region every X seconds.
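
One way to picture "average sales by region every X seconds" is a fixed (tumbling) time window. The Python sketch below computes the averages over a finite list of events rather than a live stream, and it is not how the demo was built: the event tuple layout and the `average_sales_by_region` name are assumptions made for the example.

```python
from collections import defaultdict


def average_sales_by_region(events, window_seconds):
    """Group sales into fixed windows of window_seconds and average them per region.

    events: iterable of (timestamp_seconds, region, amount) tuples (layout assumed).
    Returns {window_start: {region: average_amount}}.
    """
    # window start -> region -> [running total, number of sales]
    sums = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))
    for timestamp, region, amount in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        bucket = sums[window_start][region]
        bucket[0] += amount
        bucket[1] += 1
    return {
        window: {region: total / n for region, (total, n) in per_region.items()}
        for window, per_region in sorted(sums.items())
    }


events = [
    (0, "north", 10.0),
    (3, "north", 20.0),
    (4, "south", 5.0),
    (7, "north", 40.0),
]
result = average_sales_by_region(events, window_seconds=5)
print(result)
```

Here `window_seconds` plays the role of X from the slide.
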