Developing Big Data Applications with Spring XD

Spring XD is an application platform for big data applications. It provides a scalable, fault-tolerant, distributed runtime environment for data ingestion, analytics, and workflow orchestration, together with a unified model for development, configuration, and extension. The platform gives developers flexible building blocks that can be combined to implement many use cases with little effort. Spring XD builds on proven technologies from the Spring ecosystem, in particular Spring Integration, Spring Batch, and Spring Boot. This talk gives an overview of Spring XD and uses examples to show what a scalable runtime environment for big data applications can look like.

Thomas Darimont

November 20, 2014

Transcript

  1. Big Data Applications with Spring XD. Thomas Darimont, Software Engineer, Pivotal Inc., @thomasdarimont. Unless otherwise indicated, these slides are © 2013-2014 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
  2. Big Data: Volume, Velocity, Variety, Veracity. 500 million tweets each day; 2.3 trillion GB of data are created each day; 100 million wearables; 90% of enterprise data is unstructured; 100 sensors in each car; 22 billion sensors by 2020; 86% suspect data inaccuracy; 30% revenue loss due to bad data quality. Data points: McKinsey, Twitter, Gartner, IBM.
  4. Common Problems: Batch and Streaming handled by multiple platforms; Fragmented Big Data Ecosystem; Simple things are not simple.
  5. Spring XD to the rescue. Problems: Batch and Streaming handled by multiple platforms; Fragmented Big Data Ecosystem; Simple things are not simple. Spring XD: Unified Stream and Batch Platform; Scalable, Distributed and Fault Tolerant Runtime; Multi-platform Batch Workflow Orchestration; Easy to extend and integrate other technologies; Portable: On-prem/IaaS, YARN, Docker, Mesos, PCF; Easy to use, with many out-of-the-box components; Built on proven Spring technologies; “NoSQL” and Predictive Analytics unified platform.
  6. Spring XD - 10,000 Foot View. [Diagram labels: Spring XD Runtime; Streams; Jobs; ingest; workflow; export; taps; bidirectional; Compute; HDFS; RDBMS; NoSQL; R, SAS; Predictive Modelling; shell (>_); Redis]
  7. Streams: Programming model for processing data streams; How data is collected, processed, and stored or forwarded; DSL analogous to Unix Pipes and Filters: Source | Processor 0…* | Sink; Basic streams can be composed into more sophisticated streams. [Diagram labels: Source; Processor; Sink; Input; Output]
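
The Source | Processor 0…* | Sink model on slide 7 can be illustrated outside Spring XD with plain Python generators. This is only a sketch of the pipes-and-filters idea, not how Spring XD implements streams; every name in it (`list_source`, `keep_if`, `transform`, `collect_sink`, `run_stream`) is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def list_source(payloads: Iterable[str]) -> Iterator[str]:
    """Hypothetical source: emits each payload as one message."""
    yield from payloads


def keep_if(messages: Iterable[str], predicate: Callable[[str], bool]) -> Iterator[str]:
    """Processor: forwards only the messages that satisfy the predicate."""
    for message in messages:
        if predicate(message):
            yield message


def transform(messages: Iterable[str], fn: Callable[[str], str]) -> Iterator[str]:
    """Processor: applies fn to every message."""
    for message in messages:
        yield fn(message)


def collect_sink(messages: Iterable[str], store: List[str]) -> None:
    """Sink: drains the stream into a list (stands in for a log or file sink)."""
    store.extend(messages)


def run_stream(source, processors, sink) -> None:
    """Chains source | processor 0..* | sink, like the pipe symbol in the DSL."""
    flow = source
    for processor in processors:
        flow = processor(flow)
    sink(flow)


received: List[str] = []
run_stream(
    list_source(["spring", "", "xd"]),
    [lambda ms: keep_if(ms, bool), lambda ms: transform(ms, str.upper)],
    lambda ms: collect_sink(ms, received),
)
print(received)  # prints ['SPRING', 'XD']
```

Passing an empty processor list connects the source directly to the sink, which matches the "0…*" in the slide.
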
  8. Modules: Source, Processor, Sink, Jobs; Unit of execution; Implemented as a Spring ApplicationContext; Modules in a Stream -> Spring Integration; Job Modules -> Spring Batch. [Diagram: Spring Integration Message Flow; labels: Producer; send(message); Message Channel; receive(); Consumer]
  9. Stream Modules. Source (#20): HTTP, SFTP, Tail, File, Mail, Syslog, TCP / TCP Client, Reactor IP, JMS, RabbitMQ, Time, MQTT, Kafka, JDBC, Gemfire (CQ, Source), Twitter (Search, Stream), Stdout Capture. Processor (#13): Filter, Transform, Splitter, Aggregator, HTTP Client, Shell Command, Script, Groovy, Python, Java, JPMML-Evaluator, JSON-to-Tuple, Object-to-JSON. Sink (#20): Log, File, JDBC, TCP, MQTT, Mongo, Mail, Null Sink, Redis, RabbitMQ, HDFS, HDFS Dataset, Shell Command, GemFire Server, Splunk Server, Dynamic Router, Counter + 1, Gauge + 1. Roll your own!
  10. Taps: Consume data along the Stream processing pipeline; Original stream stays unaffected; Collect metrics and perform analytics; Works for Jobs too! Example: stream create demo1tap --definition "tap:stream:demo1 > counter". [Diagram labels: Stream: Source, Processor, Sink, …; Message Bus; Tap: Processor, Sink, …]
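
The tap idea on slide 10 (a second consumer sees the same messages while the original stream stays unaffected) can be sketched with a tiny in-memory publish/subscribe bus. This is an illustration under assumptions, not Spring XD's message bus: the `LocalBus` class and the channel name `demo1.0` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class LocalBus:
    """Hypothetical in-memory bus: every subscriber of a channel gets each message."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self._handlers[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        for handler in list(self._handlers[channel]):
            handler(message)


bus = LocalBus()
sink_log: List[str] = []       # what the original stream's sink receives
tap_counter = {"demo1": 0}     # what the tap's counter accumulates


def count_message(_message: str) -> None:
    tap_counter["demo1"] += 1


# The original stream's sink and the tap both listen on the same channel.
bus.subscribe("demo1.0", sink_log.append)
bus.subscribe("demo1.0", count_message)

for payload in ["a", "b", "c"]:
    bus.publish("demo1.0", payload)

print(sink_log, tap_counter)
```

The sink receives exactly the messages it would receive without the tap; the tap only adds a second, independent consumer.
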
  11. Analytics: Backed by Redis & In-memory; Access via REST API. Counters: Simple & Field Value Counter, Aggregate Counter. Gauges: Gauge, Rich Gauge. Predictive Model Evaluation: Is this transaction fraudulent? What group does this user belong to? JPMML PMML Model: Interoperable with R, Rattle, KNIME, RapidMiner, MADLib, Python, Spark.
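
To make the metric types on slide 11 more concrete, here is a minimal in-memory sketch of a field value counter and a rich gauge. It assumes that a field value counter counts occurrences of each distinct value of one field, and that a rich gauge tracks the latest value together with a running mean, minimum, and maximum. The names `field_value_counts` and `RichGauge` are hypothetical, and nothing here talks to Redis or to the Spring XD REST API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


def field_value_counts(messages: Iterable[Dict[str, str]], field: str) -> Counter:
    """Counts how often each distinct value of `field` occurs; skips messages without it."""
    return Counter(m[field] for m in messages if field in m)


@dataclass
class RichGauge:
    """Tracks the latest value plus running mean, min, and max (assumed semantics)."""
    count: int = 0
    value: float = 0.0
    mean: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def update(self, value: float) -> None:
        self.count += 1
        self.value = value
        # Incremental mean: avoids storing all samples.
        self.mean += (value - self.mean) / self.count
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)


tweets = [{"lang": "en"}, {"lang": "de"}, {"lang": "en"}, {}]
lang_counts = field_value_counts(tweets, "lang")

gauge = RichGauge()
for sample in (2.0, 4.0, 6.0):
    gauge.update(sample)

print(lang_counts["en"], gauge.mean, gauge.minimum, gauge.maximum)
```
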
  12. Predictive Models. Model: Parameterised Algorithm. Model Building: Derive a parameterised algorithm from the data; Slow process; Usually large data volume -> done offline as a batch process. Model Scoring: Use the model to predict new information; Fast process; Can be done as part of stream processing.
  13. PMML: Predictive Model Markup Language; Open Standard maintained by the Data Mining Group (DMG); XML based DSL for predictive models; Can be interpreted; 15 Model Types (Naive Bayes, General Regression, Neural Networks, etc.); First Version (1999), Current Version 4.2.1; “Lingua Franca for Predictive Models”; “Bridge the Gap between Data Scientists and Engineers”.
  14. Anatomy of a PMML Model. Predictive Model: Algorithm description(s), Parameterisation (“trained model”). Pre Processing. Post Processing: Transform model output, Thresholds / Business rules. Source: PMML in Action, 2nd Edition, 2012, p. 7.
  15. Predictive Analytics with Spring XD: XD Module analytic-pmml; Introduced in Spring XD 1.0.0 M6 (April 2014); Real-time evaluation and scoring; Based on JPMML-Evaluator; Wide range of Model types; spring-xd-modules/analytics-ml-pmml on GitHub.
  16. Jobs: Create, Launch and Monitor Jobs; Define how batch processing steps are orchestrated; Builds on top of Spring Batch and Spring Hadoop; CSV to JDBC, FTP to HDFS, JDBC to HDFS, HDFS to JDBC, HDFS to MongoDB, … or roll your own with Spring Batch!
  17. Spring XD - Node Types. Admin Node: Embedded servlet container; REST endpoints via Spring MVC / HATEOAS; Container management and module deployment; Maps processing tasks into processing modules; Monitors and updates runtime state. Container Node: Hosts Module instances; Performs the actual processing.
  18. Spring XD Runtimes. [Diagram labels: single-node: XD Admin, XD Container, Module, ZK, JVM; multi-node: XD Admin, XD Container (several, each with Module), ZK, separate JVMs]
  19. Single-Node Runtime. [Diagram labels: Source | Sink; XD Admin; XD Container; Source (Output); Sink (Input); Control Bus; Data Bus; Transport: Redis, Rabbit, Local, Other; ZooKeeper]
  20. About ZooKeeper: Toolset for building distributed systems; Originates from Hadoop but not tied to it! Coordination of distributed processes; Shared configuration; Replicated hierarchical data store, much like a file system; Curator: Client Library to interact with ZooKeeper.
  21. ZooKeeper in Spring XD: Centralised storage for Stream and Job Definitions; Tracking of Containers and Modules; Notification of Cluster structure changes; Notification of Module changes.
  22. Multi-Node Distributed Runtime: Deploy and un-deploy modules on containers; Dedicated message bus (“control bus”): Deployment requests; Dynamically discover new containers; Reassign modules when containers fail.
  23. DIRT - Distributed Runtime. [Diagram labels: XD Shell; HTTP POST /streams/aStream “M1 | M2”; XD Admin (x3); ZooKeeper; Container State; Control Bus; XD Container (x2); M1; M2; Spring App Context; Data Bus; Transport: Redis, Rabbit, Local, Other]
  24. Message Bus: Source | Processor | Sink; Pipe Symbol denotes Message Bus; Binds module inputs & outputs to a transport; Performs Serialisation (Kryo, pluggable); RabbitMQ, Redis, Kafka, Local (in memory), JMS.
  25. Deployment Manifest: Stream/Job Definition -> Logical View; Deployment Manifest -> Physical View; Important properties relate to Module Count, Module Placement, Data Partitioning, Concurrency; Defined on Stream deployment.
  26. Deployment Manifest - Module Count: http | worker | hdfs. stream deploy --name s1 --properties module.http.count=2,module.worker.count=4,module.hdfs.count=3 [Diagram: 2 http, 4 worker, and 3 hdfs module instances]
  27. Deployment Manifest - Module Placement: http | worker | hdfs. stream deploy --name s1 --properties module.http.count=2,module.worker.count=4,module.hdfs.count=3,module.http.criteria=group.contains('WEB') Container started with: xd/bin/xd-container --groups="WEB" [Diagram: 2 http instances (Web), 4 worker instances, 3 hdfs instances]
  28. Deployment Manifest - Data Partitioning: http | worker | hdfs. stream deploy --name s1 --properties … module.http.producer.partitionKeyExpression=payload.customerId Worker modules will always receive & process the same set of customer IDs! [Diagram: 2 http instances (Web), 4 worker instances (C), 3 hdfs instances]
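
The guarantee on slide 28 (a given worker always receives the same set of customer IDs) follows from routing on a deterministic function of the partition key. The sketch below shows that idea only; it does not reproduce Spring XD's actual partition selection. CRC32 modulo the partition count is an assumed stand-in, chosen because it is stable across runs, and the names `partition_index` and `route` are hypothetical.

```python
import zlib
from typing import Any, Callable, Dict, List


def partition_index(key: str, partition_count: int) -> int:
    """Maps a key to a partition; the same key always yields the same index."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


def route(
    messages: List[Dict[str, Any]],
    key_of: Callable[[Dict[str, Any]], Any],
    partition_count: int,
) -> List[List[Dict[str, Any]]]:
    """Distributes messages over partitions (one per worker instance) by key."""
    partitions: List[List[Dict[str, Any]]] = [[] for _ in range(partition_count)]
    for message in messages:
        index = partition_index(str(key_of(message)), partition_count)
        partitions[index].append(message)
    return partitions


# Analogous to partitionKeyExpression=payload.customerId with 4 worker instances.
payloads = [{"customerId": c, "seq": i} for i, c in enumerate(["a", "b", "a", "c", "b", "a"])]
workers = route(payloads, lambda m: m["customerId"], 4)

for number, batch in enumerate(workers):
    print(number, sorted({m["customerId"] for m in batch}))
```

Several customer IDs may share one worker, but no customer ID is ever split across two workers, and per-key message order is preserved within a partition.
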
  29. How does it fit in? Alternative to Flume, Oozie, Sqoop, Storm; Complementary to many technologies: Big SQL (Impala, HAWQ), Spark batch processing and streaming, Elasticsearch, Cassandra, …; Platform for Data Integration / Data Movement.
  30. Lambda Architecture. [Diagram labels: Data Lake; HAWQ; Serving Layer; Spring Boot (x4); Speed Layer; Batch Layer; Batch Views; Real-time Views; Gemfire; Stream Processing; Batch Processing; Analytics; Ingest; Export; Workflow Orchestration; Predictive Analytics; Spring XD]
  31. Breaking News: Spring XD 1.1 M1 (and 1.0.2) just released yesterday! Java Config Module definitions; Batch Jobs with Apache Spark; Python support; Kafka support; LDAP Authentication; Kerberized Hadoop Cluster.
  32. Roadmap - 1.1 RC1 and beyond: Spring Boot like Module Deployment; Deployment targets: Docker, CF Service, Mesos; Security ACLs; Kafka based Message Bus; Reactive Streams Integration; Spark Streaming.
  33. Takeaway - Spring XD: Unified runtime for both Real-time and Batch use cases; Scalable, Distributed and Fault Tolerant Runtime; Increased Productivity through out-of-the-box components; Closed Loop Analytics through online (stream) and offline (batch) data; Swiss-army knife of data movement and data pipelines; Repeatable ‘turnkey’ solution for next generation data-centric use cases.