Kenny Bastani
November 02, 2015

Building a Distributed Data Processing Pipeline Using Spring Cloud Data Flow

In this talk I will introduce you to Spring Cloud Data Flow, a tool set for building cloud-native JVM applications as a distributed data processing pipeline. We will take a look at some of the common patterns for processing data streams using Spring Boot applications that consume and produce streams of messages.

We will then dive into a microservice example project of a cloud-native data processing application built using Spring Boot and Spring Cloud Data Flow. Using this example project, I'll show you how to use a container scheduler to spin up a microservice cluster on a development machine that processes a stream of tweets in a data flow pipeline.

Transcript

  1. © 2014 Pivotal Software, Inc. All rights reserved.
    Distributed Data Processing with Spring Cloud Data Flow
    Kenny Bastani, Spring Developer Advocate
  2. Kenny Bastani
    Follow me on Twitter: @kennybastani
  3. Agenda
    1. Agenda
    2. Microservices
    3. What is Spring Boot?
    4. What is Spring Cloud?
    5. Lattice - Cloud Native Platform
    6. Spring Cloud Data Flow
    7. Streaming Analytics Example (Twitter)
    9. Demo
  4. Quick Explanation of Microservices
    • Each team gets one database and one service
    • Shared caches are platform-provided services that are shared for consistency
  5. Cloud-native Microservice Deployment
    • Each microservice can be containerized with its application dependencies
    • Containers get scheduled on virtual machines with an allotted resource policy
  6. Auto-scaling
    • An elastic runtime handles auto-scaling of VMs with cloud providers
    • Microservices should be load balanced horizontally and not vertically
  7. Composition & Orchestration
    • Each microservice needs to communicate outside containers
    • Service discovery provides an automatic method for finding other service dependencies
  8. Spring Boot
    A JVM micro-framework for building microservices
  9. What is Spring Boot?
  10. Spring Boot Roles
  11. Automatic Configuration
    • An application class is annotated with @SpringBootApplication
    • Additional annotations are added to indicate the role of the Spring Boot application
  12. Spring Boot for Microservices
  13. Spring Cloud
    A toolset designed for building distributed systems
  14. What is Spring Cloud?
    • Spring Cloud provides a way to turn Spring Boot microservices into distributed applications
  15. What is Spring Cloud?
    ✴ Service Discovery
    ✴ API Gateway
    ✴ Circuit Breakers
    ✴ Distributed Tracing
  16. Service Discovery & Configuration Service
  17. Configuration Service
  18. Service Discovery
  19. API Gateway
  20. Lattice
    A cloud-native platform for deploying and scaling containers in production
  21. Containers, containers, containers
    • Lattice helps you manage Docker container deployments on clusters of VMs
    • Choose the cloud provider you want; Lattice deploys containers from Docker Hub
  22. Spring Cloud Data Flow
    Java microservices that process data in a pipeline
  23. Spring Cloud Data Flow
    • What is Spring Cloud Data Flow?
      – Spring Cloud Data Flow is a data processing pipeline that uses Spring Boot microservices
      – Each Spring Boot microservice takes in a message and produces a message, containing the data that you're processing
  24. How do we design a data processing pipeline?
  25. Start with understanding your source data
    • What do you want to measure?
      – Trending analytics in real time
  26. Understand the function of your pipeline
    • What does an individual message contain?
      – A single tweet
  27. Understand what you want to measure
    • What do you want to filter?
      – A set of hash tags in the body of a Tweet
  28. Understand the outputs
    • What do I want to measure over time?
      – The velocity of hash tag counts from tweets every second
  29. What is the result of our measurements?
    • A graph that shows the velocity of each tweet and its hash tags over time
    • Real time streaming analytics so we can make fast informed decisions
  30. Data Processing Pipeline Example
    [Diagram: Data -> Source (ingest messages) -> Filter (filter messages) -> Process (transformation) -> Counter (increment counters) -> Database]
  31. Scalable distributed data processing
    [Diagram: the pipeline from the previous slide, with a second row of Source, Filter, Process, and Counter stages feeding the Database]
  32. Input and output channels
    • Each Spring Boot microservice has an input channel and an output channel
    [Diagram: Message queue -> Input Channel -> Spring Boot Service (microservice) -> Output Channel -> Message queue]
  33. Input and output channels
    [Diagram: the same service, with messages arriving from the input message queue and messages leaving to the output message queue]
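The input and output channel idea on the two slides above can be sketched in plain Java. This is only an illustration under stated assumptions: in-memory `BlockingQueue`s stand in for the message queues, and `ChannelService` and `drain` are hypothetical names, not part of Spring Cloud Data Flow.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical stand-in for one microservice in the pipeline: it takes a
// message from its input channel, applies a handler, and puts the result
// on its output channel. In-memory queues play the role of message queues.
class ChannelService {
    private final BlockingQueue<String> input;
    private final BlockingQueue<String> output;
    private final Function<String, String> handler;

    ChannelService(BlockingQueue<String> input,
                   BlockingQueue<String> output,
                   Function<String, String> handler) {
        this.input = input;
        this.output = output;
        this.handler = handler;
    }

    // Handles every message currently waiting on the input channel and
    // returns how many messages were processed.
    int drain() {
        int handled = 0;
        String message;
        while ((message = input.poll()) != null) {
            output.add(handler.apply(message));
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) {
        BlockingQueue<String> in = new LinkedBlockingQueue<>();
        BlockingQueue<String> out = new LinkedBlockingQueue<>();
        in.add("hello");
        in.add("world");
        ChannelService service = new ChannelService(in, out, String::toUpperCase);
        System.out.println("handled " + service.drain() + " messages: " + out);
    }
}
```

Because a service only knows about its two channels, the output queue of one service can be handed to the next service as its input queue, which is how the stages on the following slides chain together.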
  34. Building Real-time Analytics on Twitter hash tags
  35. We want to build something like this:
    [Diagram: Data -> Source (ingest messages) -> Filter (filter messages) -> Process (transformation) -> Counter (increment counters) -> Database, with a second row of Source, Filter, Process, and Counter stages]
  36. Responsibility of a source module
    • Ingest data from multiple sources, such as a streaming REST API endpoint or HDFS
    • Transform a stream into discrete messages that are uniformly distributed, such as an individual tweet
    • Output those messages to an output channel for the next service to process
  37. Ingesting tweets
    • We start by building a Spring Boot source module that imports tweets from Twitter
    [Diagram: Data -> Source Module (ingest tweets) -> Tweet]
  38. Visualize the tweet source module
    [Diagram: Twitter API (Twitter stream) -> Streaming Endpoint -> Spring Boot Twitter Stream Service (source module) -> Output Channel: twitter-stream, carrying tweets as messages]
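A source module's core job, turning a continuous stream into discrete messages, can be sketched as follows. This assumes, purely for illustration, that the stream arrives as newline-delimited text; `TweetSource` and `toMessages` are hypothetical names, and a `StringReader` stands in for the streaming endpoint.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical source module: splits a newline-delimited stream into
// discrete messages, one per non-blank line.
class TweetSource {

    static List<String> toMessages(Reader stream) {
        BufferedReader reader = new BufferedReader(stream);
        return reader.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty()) // skip blank keep-alive lines
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A StringReader stands in for the streaming endpoint.
        Reader fakeStream = new StringReader(
                "First tweet #java2days\n\nSecond tweet #Java\n");
        for (String message : toMessages(fakeStream)) {
            System.out.println("message: " + message);
        }
    }
}
```

In a real source module each message would be written to the output channel as it arrives rather than collected into a list; the list keeps the sketch small.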
  39. Responsibility of a filter module
    • Filter messages from the source module
    • Filter out noise to increase the quality of measurements in downstream modules
    • Example:
      – I only want to measure tweets containing the hash tag #java2days
  40. Visualizing the filter module
    [Diagram: Input Channel: twitter-stream (tweets) -> Spring Boot Filter Service (only tweets containing #java2days) -> Output Channel: processor-stream]
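The filter step described above comes down to a predicate applied to each message. A minimal plain-Java sketch, assuming tweets are plain strings and that hash tags are matched case-insensitively as whole whitespace-separated tokens (both are assumptions of this example; `HashtagFilter` is a hypothetical name):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical filter module: only tweets that contain the given hash tag
// are passed on to the next channel.
class HashtagFilter {

    // True if the tweet contains the hash tag as a whole token, ignoring
    // case and any punctuation that trails the tag.
    static boolean containsHashtag(String tweet, String hashtag) {
        for (String token : tweet.split("\\s+")) {
            String cleaned = token.replaceAll("[^\\w#]+$", "");
            if (cleaned.equalsIgnoreCase(hashtag)) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the tweets that contain the hash tag.
    static List<String> filter(List<String> tweets, String hashtag) {
        return tweets.stream()
                .filter(tweet -> containsHashtag(tweet, hashtag))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "Great talk at #java2days!",
                "Lunch time",
                "Learning #SpringBoot at #Java2Days");
        System.out.println(filter(tweets, "#java2days"));
    }
}
```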
  41. Our pipeline now looks like:
    [Diagram: Twitter Stream API -> Twitter Stream (source) -> Filter Tweets (#java2days) -> …]
  42. Responsibility of a processor module
    • Take a filtered stream of messages and produce multiple output messages by transforming the payload into multiple dimensions of attributes
    • For example:
      – Take a #java2days tweet and parse the other hash tags and output one message per hash tag
      – #java2days -> (#Java, #SpringBoot, #JavaEE)
  43. Processor module
    [Diagram: Input Channel: processor-stream (#java2days tweets) -> Spring Boot Processor Service ("… #java2days …" -> #Java, #SpringBoot, #JavaEE…) -> Output Channel: counter-stream]
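The one-message-in, many-messages-out behaviour of the processor can be sketched like this. It assumes a hash tag is `#` followed by word characters, which is a simplification chosen for the example; `HashtagProcessor` and `process` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical processor module: takes one filtered tweet and produces one
// output message per hash tag, leaving out the tag used by the filter.
class HashtagProcessor {
    // Simplifying assumption: a hash tag is '#' followed by word characters.
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    static List<String> process(String tweet, String filterTag) {
        List<String> messages = new ArrayList<>();
        Matcher matcher = HASHTAG.matcher(tweet);
        while (matcher.find()) {
            String tag = matcher.group();
            if (!tag.equalsIgnoreCase(filterTag)) {
                messages.add(tag); // one output message per remaining tag
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        String tweet = "At #java2days talking #Java, #SpringBoot and #JavaEE";
        System.out.println(process(tweet, "#java2days"));
    }
}
```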
  44. Our pipeline now looks like:
    [Diagram: Twitter Stream API -> Twitter Stream (source) -> Filter Tweets (#java2days) -> Process Hash tags (#Java, #SpringBoot…) -> …]
  45. Responsibility of a counter module
    • Take messages from an input channel and output an increment to multiple buckets that count message attributes over time
    • Save the results to a sink, for example a Redis database
    • Use the Spring Cloud Data Flow admin tool to measure tweets over time
  46. Counter module
    [Diagram: Input Channel: counter-stream (hash tag messages) -> Spring Boot Counter Service (increment counts for hashtags: #Java -> +1, #SpringBoot -> +1) -> Redis DB. Counter metrics: #Java: 201, #SpringBoot: 120, #JavaEE: 111]
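The counting step can be sketched with an in-memory map standing in for the Redis sink. This is an assumption made to keep the example self-contained; `HashtagCounter` is a hypothetical name, and a Redis-backed version would issue an atomic increment per key instead of updating a local map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical counter module: each incoming hash tag message increments
// that tag's counter by one. The map stands in for the Redis database.
class HashtagCounter {
    private final Map<String, Long> counts = new ConcurrentHashMap<>();

    // Increments the counter for the hash tag and returns the new total.
    long increment(String hashtag) {
        return counts.merge(hashtag, 1L, Long::sum);
    }

    // Returns the current total for the hash tag, or 0 if never seen.
    long get(String hashtag) {
        return counts.getOrDefault(hashtag, 0L);
    }

    public static void main(String[] args) {
        HashtagCounter counter = new HashtagCounter();
        for (String tag : new String[] {"#Java", "#SpringBoot", "#Java"}) {
            counter.increment(tag);
        }
        System.out.println("#Java: " + counter.get("#Java"));
        System.out.println("#SpringBoot: " + counter.get("#SpringBoot"));
    }
}
```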
  47. Our pipeline now looks like:
    [Diagram: Twitter Stream API -> Twitter Stream (source) -> tweets -> Filter Tweets (#java2days) -> tweets -> Process Hash tags (#Java, #SpringBoot…) -> hash tags -> Increment Counter Metrics (#Java: 201, #SpringBoot: 123) -> Redis]
  48. Scaling our pipeline
    • Services in the pipeline can be scaled up and down automatically to handle the load and prevent bottlenecks
  49. Auto-scaling
    [Diagram: pipeline stages Source, Filter, Process, Counter. Figures shown: 5,121, 52, 23. Instances: 1, 4, 2, 1. Labels: Scale up, Scale down]
  50. Auto-scaling
    [Diagram: pipeline stages Source, Filter, Process, Counter. Figures shown: 123, 412, 121. Instances: 1, 3, 3, 1. Labels: Scale down, Scale up]
  51. Goal of auto-scaling
    • Keep the data processing pipeline uniform to prevent bottlenecks
    • Optimize the instance count on cloud providers so that cost can be predicted and optimized
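One way to picture the scaling decision is a simple rule that sizes each stage from its backlog. The slides do not describe the actual policy, so everything in this sketch is an assumption: the `ScalingPolicy` name, the per-instance capacity parameter, and the minimum and maximum bounds are all hypothetical.

```java
// Hypothetical scaling rule: run enough instances to cover the backlog at a
// target number of messages per instance, clamped to a fixed range so that
// cost stays predictable.
class ScalingPolicy {

    static int desiredInstances(long backlog, long messagesPerInstance,
                                int minInstances, int maxInstances) {
        if (messagesPerInstance <= 0 || minInstances < 1 || maxInstances < minInstances) {
            throw new IllegalArgumentException("invalid scaling parameters");
        }
        // Ceiling division: a partly filled instance still counts as one.
        long needed = (Math.max(backlog, 0) + messagesPerInstance - 1) / messagesPerInstance;
        return (int) Math.max(minInstances, Math.min(maxInstances, needed));
    }

    public static void main(String[] args) {
        // The capacity of 1,500 messages per instance is a made-up figure.
        System.out.println("large backlog: " + desiredInstances(5121, 1500, 1, 10));
        System.out.println("small backlog: " + desiredInstances(23, 1500, 1, 10));
    }
}
```

The upper bound reflects the second goal on the slide: capping the instance count keeps the cost on a cloud provider predictable even when the backlog spikes.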
  52. Sinking messages into counters
    • Each message in the pipeline has an opportunity to increase multiple counters
    • Counters are like buckets, and we can increment those buckets with a name and timestamp
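The bucket idea above, incrementing a counter identified by a name and a timestamp, can be sketched as a two-level map. The one-second bucket width follows the earlier "every second" measurement goal; `TimeBucketCounter` and its methods are hypothetical names, and the in-memory maps are an assumption standing in for the real sink.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical time-bucketed counter: counts are kept per name and per
// one-second bucket, so the change in a hash tag's count can be read
// second by second.
class TimeBucketCounter {
    // name -> (epoch second -> count), with seconds kept in sorted order
    private final Map<String, Map<Long, Long>> buckets = new ConcurrentHashMap<>();

    // Increments the bucket for the given name at the second containing
    // the timestamp (milliseconds since the epoch).
    void increment(String name, long epochMillis) {
        long second = Math.floorDiv(epochMillis, 1000L);
        buckets.computeIfAbsent(name, key -> new ConcurrentSkipListMap<>())
               .merge(second, 1L, Long::sum);
    }

    // Returns the count recorded for the name in the given second.
    long countAt(String name, long epochSecond) {
        Map<Long, Long> perSecond = buckets.get(name);
        return perSecond == null ? 0L : perSecond.getOrDefault(epochSecond, 0L);
    }

    public static void main(String[] args) {
        TimeBucketCounter counter = new TimeBucketCounter();
        long now = System.currentTimeMillis();
        counter.increment("#Java", now);
        counter.increment("#Java", now);
        System.out.println("#Java this second: "
                + counter.countAt("#Java", Math.floorDiv(now, 1000L)));
    }
}
```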
  53. Twitter Analytics Demo
    • We’re going to brave the demo gods
    • Wish me luck
    • First, let’s review the steps for the demo (in case it fails)
  54. What we will demo:
    [Diagram: Twitter Stream API -> Twitter Stream (source) -> tweets -> Filter Tweets (all tweets) -> tweets -> Process Hash tags (#Fun, #Awesome…) -> hash tags -> Increment Counter Metrics (#Fun: 201, #Awesome: 123) -> Redis]
  55. Let’s go through the steps
    • Start a Redis server
    • Start the twitter streaming module
    • Start the filter module
    • Start the processor module
    • Start the counter module
    • Start Spring Cloud Data Flow Admin UI
  56. Start Redis Server
  57. Start Twitter Stream Module
  58. Start the Filter Module
  59. Start the Processor Module
  60. Start the Counter Module
  61. Start Spring Cloud Data Flow Admin UI
  62. Navigate to the Admin UI
  63. Thanks! Questions?
    http://start.spring.io/