Dataservices - Processing Big Data The Microservice Way

DATASERVICES - PROCESSING (BIG) DATA THE MICROSERVICE WAY
Dr. Josef Adersberger (@adersberger), QAware GmbH
Dataservices are about leveraging the microservices approach for data processing. [50min] We see a big data processing pattern emerging that uses the microservice approach to build an integrated, flexible and distributed system of data processing tasks. We call this the Dataservice pattern. In this presentation we’ll introduce Dataservices: their basic concepts, the technology typically in use (like Kubernetes, Kafka, Cassandra and Spring) and some real-life architectures.

ENTERPRISE ? PROCESSING
http://www.datasciencecentral.com http://www.cardinalfang.net/misc/companies_list.html
I’d like to start with a question: what are the requirements and challenges of modern enterprises in data processing?

BIG DATA - All things distributed:
‣ distributed processing
‣ distributed databases
FAST DATA - Low latency and high throughput:
‣ stream processing
‣ messaging
‣ event-driven
SMART DATA - Data to information:
‣ machine (deep) learning
‣ advanced statistics
‣ natural language processing
‣ semantic web
First, they need to combine the three current aspects of data:
• big data: processing large amounts of data by distributing the processing as well as the storage
• fast data: processing data as close as possible to the point in time it’s created, by means of stream processing and messaging
• smart data: transforming data into information and knowledge by applying advanced statistics, ML, NLP or even semantic web approaches (you may remember it: the thing that got killed by XML)

DATA PROCESSING: data -> information
UIS: information -> user
APIS: information -> systems
SYSTEM INTEGRATION: information -> blended information
Second, they do not want data processing silos. They want data processing systems to be integrated with the surrounding application landscape in order to blend data and information. And they want the gathered information to be accessible to users and other systems.

SOLUTIONS
So what is the state-of-the-art answer to how to build data processing solutions that meet those requirements?

The {big,SMART,FAST} data Swiss Army Knives
The first technologies that come to mind are the well-known Swiss army knives for data processing like Spark, Flink and the by now somewhat outdated Hadoop MapReduce.

Diagram labels: Driver, Distributed Processing, Distributed Data, node, data flow
Icon credits to Nimal Raj (database), Arthur Shlain (console) and alvarobueno (tasklist)
What they have in common is that jobs are centrally planned, controlled and merged on the driver side. The platform (more or less) hides the pain of distributed processing and distributed data storage. An optimizer calculates an optimal distributed execution plan. That’s very efficient for most data processing use cases.

DATA SERVICES
{BIG, FAST, SMART} DATA, MICROSERVICE
As an alternative to the Swiss army knives, we see data service platforms emerging for certain use cases. They try to combine the three flavours of data processing with the microservice architecture paradigm.

BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES
Diagram labels: Microservice (aka Dataservice), Message Queue, Sources, Processors, Sinks
DIRECTED GRAPH OF MICROSERVICES EXCHANGING DATA VIA MESSAGING
The basic idea is to orchestrate data processing as a graph of microservices connected by message queues. A microservice can take one of three roles: sources emit messages, processors consume and produce messages, and sinks swallow messages.
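The three roles can be illustrated with a minimal in-process sketch. It is not SCDF code: threads stand in for microservices, `BlockingQueue`s stand in for the message queues, and the class name, the `END` marker and the upper-casing processor are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

public class DataserviceGraphSketch {

    // Hypothetical end-of-stream marker so that the demo terminates.
    private static final String END = "__END__";

    // Takes messages from a queue and hands them to a handler until END arrives.
    private static void consume(BlockingQueue<String> in, Consumer<String> handler) {
        try {
            for (String m = in.take(); !m.equals(END); m = in.take()) {
                handler.accept(m);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Runs source -> queue -> processor -> queue -> sink and returns what the sink received. */
    public static List<String> run(List<String> input) {
        BlockingQueue<String> sourceToProcessor = new LinkedBlockingQueue<>();
        BlockingQueue<String> processorToSink = new LinkedBlockingQueue<>();
        List<String> received = new ArrayList<>();

        // Source: only emits messages.
        Thread source = new Thread(() -> {
            input.forEach(sourceToProcessor::add);
            sourceToProcessor.add(END);
        });
        // Processor: consumes messages and produces new ones.
        Thread processor = new Thread(() -> {
            consume(sourceToProcessor, m -> processorToSink.add(m.toUpperCase()));
            processorToSink.add(END);
        });
        // Sink: only swallows messages.
        Thread sink = new Thread(() -> consume(processorToSink, received::add));

        List<Thread> services = List.of(source, processor, sink);
        services.forEach(Thread::start);
        for (Thread service : services) {
            try {
                service.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello", "dataservices"))); // prints [HELLO, DATASERVICES]
    }
}
```

In a real dataservice platform each stage is a separately deployed microservice and the queues live in a message broker, so the stages can be scaled and restarted independently.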

BASIC IDEA: COHERENT PLATFORM FOR MICRO- AND DATASERVICES
Diagram labels: CLUSTER OPERATING SYSTEM; IAAS, ON PREM, LOCAL; MICROSERVICES, DATASERVICES; MICROSERVICES PLATFORM, DATASERVICES PLATFORM
As dataservices are microservices, both share a common stack. The base layer is a cluster operating system like Kubernetes, which schedules and orchestrates containerized workloads. On top of Kubernetes, the next layer is a microservice platform like Spring Cloud in combination with Spring Boot, which provides the infrastructure required to build, deploy and run microservices. Dataservices then run on top of a dataservice platform, which deploys the dataservices and their required infrastructure as microservices.

OPEN SOURCE DATASERVICE PLATFORMS
Spring Cloud Data Flow:
‣ Open source project based on the Spring stack
‣ Microservices: Spring Boot
‣ Messaging: Kafka 0.9, Kafka 0.10, RabbitMQ
‣ More: this presentation
JEE:
‣ Standardized API with several open source implementations
‣ Microservices: JavaEE micro container
‣ Messaging: JMS (or Kafka with CDI integration)
‣ More: goo.gl/Tr37pB
Lagom:
‣ Open source by Lightbend (partly commercialised & proprietary)
‣ Microservices: Lagom, Play
‣ Messaging: Akka
‣ More: goo.gl/XeG1fk
Streams:
‣ Stream processing tightly integrated with Kafka
‣ Microservices: main()
‣ Messaging: Kafka
‣ More: goo.gl/oFmvws
There are three major open source dataservice platforms available: Spring Cloud Data Flow, Lagom and, maybe surprisingly, JEE, which I also consider a possible dataservice platform. The slide shows their main differentiators…

ARCHITECT’S VIEW - ON SPRING CLOUD DATA FLOW DATASERVICES
For the rest of the talk I’ll focus on Spring Cloud Data Flow (SCDF), as it’s the most elaborate dataservice platform in my opinion. Let’s begin with its architecture.

BASIC IDEA: DATA PROCESSING WITHIN A GRAPH OF MICROSERVICES
Diagram labels: Sources, Processors, Sinks, Stream, App, Message Broker, Channel
DIRECTED GRAPH OF SPRING BOOT MICROSERVICES EXCHANGING DATA VIA MESSAGING
With the basic idea of dataservices in mind: SCDF uses Spring Boot as the microservice chassis for the dataservices, which are called apps. On the messaging side you have the choice between Kafka and RabbitMQ. The connections between apps and the message broker are called channels. A graph of apps is called a stream.

THE BIG PICTURE
Diagram labels: SPRING CLOUD DATA FLOW SERVER (SCDF SERVER), API, SPI, TARGET RUNTIME, LOCAL, SCDF Shell, SCDF Admin UI, Flo Stream Designer
The SCDF server is the heart of SCDF. It provides an API for clients to submit and control streams (the SCDF term for a dataservice graph). Three clients come with SCDF: (1) a powerful command line shell, (2) a web admin UI and (3) a visual stream designer to compose a stream of dataservices. The SCDF server also provides an SPI to plug in target runtimes. The microservices and the required infrastructure are then deployed onto the chosen target runtime. It’s best practice to also deploy the SCDF server onto the target runtime.

THE BIG PICTURE
Diagram labels: SPRING CLOUD DATA FLOW SERVER (SCDF SERVER), TARGET RUNTIME, MESSAGE BROKER, APP (SPRING BOOT, SPRING FRAMEWORK, SPRING CLOUD STREAM, SPRING INTEGRATION, BINDER), CHANNELS (input/output)
If you do so, the architecture looks like this. All relevant parts run within the target runtime. The SCDF server is the heart and the message broker provides the veins of the platform. The brain resides within the dataservices (called apps). They’re built on the shoulders of giants: an app uses the Spring Cloud Stream API, which provides inbound and outbound messaging channels, payload conversion and, via the so-called binder, an on-ramp to the messaging autobahn.

THE VEINS: SCALABLE DATA LOGISTICS WITH MESSAGING
Diagram labels: Sources, Processors, Sinks
STREAM PARTITIONING: TO BE ABLE TO SCALE MICROSERVICES
BACK PRESSURE HANDLING: TO BE ABLE TO COPE WITH PEAKS
Messaging enables scalable data logistics within the system of microservices. Two design principles of SCDF are very important for scalability: (1) stream partitioning to parallelise processing and (2) back pressure handling to absorb load peaks.

STREAM PARTITIONING
Diagram labels: input (provider), message partitioning, output instances (consumer group)
PARTITION KEY -> PARTITION SELECTOR -> PARTITION INDEX
f(message)->field, f(field)->index, f(index)->pindex
pindex = index % output instances
The idea of stream partitioning is quite simple. The stream of outbound messages of a provider microservice is split into n parts. This allows at most n consumer instances to work in parallel. To do so you have to provide a partition key expression for the outbound messages, which identifies the message field used as the partitioning criterion. You also have to provide a partition selector, which maps the field value onto an index number. The message is then forwarded to the partition whose index is that number modulo the number of microservice instances. Hence the number of instances and the number of partitions are decoupled.
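The chain from partition key to partition index can be sketched in a few lines of plain Java. This is an illustration of the formula on the slide, not the Spring Cloud Stream implementation; the class name, the method signature and the use of `Math.floorMod` (to keep the index non-negative when a selector returns a negative number such as a negative hash code) are assumptions of this sketch.

```java
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToIntFunction;

public class PartitioningSketch {

    /**
     * Computes the partition index for a message.
     *
     * @param partitionKey      f(message) -> field, the value used as partitioning criterion
     * @param partitionSelector f(field) -> index, an arbitrary integer for that value
     * @param instances         number of consumer instances
     */
    public static <M> int partitionIndex(M message,
                                         Function<M, String> partitionKey,
                                         ToIntFunction<String> partitionSelector,
                                         int instances) {
        String field = partitionKey.apply(message);       // f(message) -> field
        int index = partitionSelector.applyAsInt(field);  // f(field) -> index
        return Math.floorMod(index, instances);           // pindex = index % instances, never negative
    }

    public static void main(String[] args) {
        // Hypothetical message: a tweet reduced to two fields.
        Map<String, String> tweet = Map.of("id", "42", "lang", "en");
        int pindex = partitionIndex(tweet, t -> t.get("lang"), String::hashCode, 8);
        System.out.println("partition index: " + pindex);
    }
}
```

Because the selector only depends on the field value, all messages with the same key end up at the same consumer instance, which is what makes stateful processing per key possible.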

BACK PRESSURE HANDLING
1. Signals if (message) pressure is too high
2. Regulates inbound (message) flow
3. (Data) retention lake
Back pressure handling is about protecting the sensitive parts, which may break under too high a message pressure. Those sensitive parts are the microservices. So if SCDF observes that a microservice is no longer able to handle new messages, it dams up the messages within the message broker. Kafka in particular is very good at temporarily storing large amounts of messages.
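The three numbered parts can be mimicked in a small single-threaded sketch: a bounded inbox signals pressure by rejecting messages, the publisher stops forwarding, and a retention buffer plays the role of the broker. The class and method names are hypothetical and this is not how SCDF or Kafka implement it; it only illustrates the principle.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BackPressureSketch {

    // Bounded inbox of the consuming microservice (the sensitive part).
    private final BlockingQueue<String> consumerInbox;
    // Stand-in for the broker's retention: messages wait here while the consumer is busy.
    private final Deque<String> retained = new ArrayDeque<>();

    public BackPressureSketch(int inboxCapacity) {
        this.consumerInbox = new ArrayBlockingQueue<>(inboxCapacity);
    }

    /** Publishes a message; it is retained if the consumer cannot take it right now. */
    public void publish(String message) {
        retained.addLast(message);
        drain();
    }

    /** Forwards retained messages in order while the inbox signals free capacity. */
    private void drain() {
        // offer() returns false when the inbox is full: that is the pressure signal.
        while (!retained.isEmpty() && consumerInbox.offer(retained.peekFirst())) {
            retained.removeFirst();
        }
    }

    /** The consumer handles one message, which frees capacity for a retained one. */
    public String consume() {
        String message = consumerInbox.poll();
        drain();
        return message;
    }

    public int retainedCount() {
        return retained.size();
    }

    public static void main(String[] args) {
        BackPressureSketch broker = new BackPressureSketch(2);
        for (String m : new String[] {"a", "b", "c", "d", "e"}) {
            broker.publish(m);
        }
        System.out.println(broker.retainedCount()); // prints 3
        System.out.println(broker.consume());       // prints a
        System.out.println(broker.retainedCount()); // prints 2
    }
}
```

The sketch is not thread-safe; a real broker additionally persists the retained messages so that they survive a restart.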

DISCLAIMER: THERE IS ALSO A TASK EXECUTION MODEL (WE WILL IGNORE)
‣ short-lived
‣ finite data set
‣ programming model = Spring Cloud Task
‣ starters available for JDBC and Spark as data source/sink
Besides the streaming model described here, which is based on Spring Cloud Stream, SCDF also provides a task execution model based on Spring Cloud Task for short-lived tasks on finite data sets. But we will focus on streaming in this talk.

CONNECTED CAR PLATFORM
Diagram labels: EDGE SERVICE, MQTT Broker (apigee Link), MQTT Source, Data Cleansing, Realtime traffic analytics, KPI ANALYTICS, Spark, DASHBOARD, react-vis, Presto, Masterdata Blending, Camel, Kafka, Kafka, ESB, gRPC
Here’s an illustrative architecture using SCDF. It’s a connected car platform that collects car telemetry data at the edge via the MQTT protocol. The MQTT messages are ingested into an SCDF stream by an MQTT source. The messages are then cleaned (de-duplication, dropping broken messages, …) and blended with master data (like vehicle information). The result is the source for KPI analytics as well as for realtime traffic analytics, which leads to messages being sent back to the vehicles. The whole solution is integrated with the corporate ESB, with Presto as the big data warehouse and with a custom dashboard based on react-vis.

DEVELOPER’S VIEW - ON SPRING CLOUD DATA FLOW DATASERVICES
Now let’s dig into SCDF code.

ASSEMBLING A STREAM
▸ App starters: a set of pre-built apps aka dataservices
▸ Composition of apps with linux-style pipe syntax:
    http | magichappenshere | log
Legend: Starter app, Custom app
For certain use cases you basically don’t have to code at all. SCDF provides a large set of pre-built apps called starter apps. You can compose streams with a linux-style pipe syntax using starter apps as well as custom-made apps.
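To make the reading of such a definition concrete, here is a toy sketch that splits a pipe definition and assigns a role to each app by its position. It is not the SCDF DSL parser: it ignores named channels and does not interpret parameters, and the class name and the "role:app" output format are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PipeSyntaxSketch {

    /** Splits a definition such as "http | magichappenshere | log" into "role:app" entries. */
    public static List<String> roles(String definition) {
        String[] apps = definition.trim().split("\\s*\\|\\s*");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < apps.length; i++) {
            // The first app emits messages, the last one swallows them, all others transform them.
            String role = (i == 0) ? "source"
                        : (i == apps.length - 1) ? "sink"
                        : "processor";
            result.add(role + ":" + apps[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(roles("http | magichappenshere | log"));
        // prints [source:http, processor:magichappenshere, sink:log]
    }
}
```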

MORE PIPES
https://www.pinterest.de/pin/272116002461148164
with parameters:
    twitterstream --consumerKey=<CONSUMER_KEY> --consumerSecret=<CONSUMER_SECRET> --accessToken=<ACCESS_TOKEN> --accessTokenSecret=<ACCESS_TOKEN_SECRET> | log
with explicit input channel & analytics:
    :tweets.twitterstream > field-value-counter --fieldName=lang --name=language
with SpEL expression and explicit output type:
    :tweets.twitterstream > filter --expression=#jsonPath(payload,'$.lang')=='en' --outputType=application/json
Here you see more advanced examples of the pipe syntax:
‣ how to pass parameters to an app
‣ how to decompose streams by referring to named channels
‣ how to use starter sink apps which are integrated into the analytics and visualization capabilities of SCDF within the admin UI
‣ how to use the Spring Expression Language to bring logic into the starter apps
‣ how to specify the output (or input) message data type like JSON, tuples or objects

OUR SAMPLE APPLICATION: WORLD MOOD
https://github.com/adersberger/spring-cloud-dataflow-samples
Diagram labels: twitterstream, twitter ingester (test data), filter (lang=en), log, tweet extractor (text), sentiment analysis (StanfordNLP), field-value-counter. Legend: Starter app, Custom app
To have a non-trivial example, I’ve built a Twitter sentiment analysis application called WorldMood based on SCDF. This graph illustrates the different parts and how they are connected. A stream of tweets is ingested either directly from Twitter (twitterstream, starter app) or from a test data pool (twitter ingester, custom app). Then only English tweets are kept, and the tweet text is extracted and cleaned. Finally, sentiment analysis is performed on the tweet texts and the distribution of the sentiments is aggregated.
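The data flow of WorldMood can be followed in a compact, single-process sketch. Everything here is a stand-in: tweets are plain maps with hypothetical "lang" and "text" keys, and `toySentiment` is a keyword lookup invented for this illustration, not the Stanford NLP model used in the real processor. In SCDF each step is a separate app connected by channels.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WorldMoodSketch {

    /** Toy stand-in for the sentiment step: 0 = negative, 1 = neutral, 2 = positive. */
    static int toySentiment(String text) {
        String lower = text.toLowerCase();
        if (lower.contains("hate")) return 0;
        if (lower.contains("lovely") || lower.contains("sunny")) return 2;
        return 1;
    }

    /** Runs filter -> extract -> sentiment -> count and returns the mood distribution. */
    public static Map<Integer, Long> moodDistribution(List<Map<String, String>> tweets) {
        Map<Integer, Long> distribution = new TreeMap<>();
        for (Map<String, String> tweet : tweets) {
            if (!"en".equals(tweet.get("lang"))) continue;  // filter (lang=en)
            String text = tweet.get("text").trim();         // tweet extractor (text)
            int mood = toySentiment(text);                  // sentiment analysis
            distribution.merge(mood, 1L, Long::sum);        // field-value-counter
        }
        return distribution;
    }

    public static void main(String[] args) {
        List<Map<String, String>> tweets = List.of(
                Map.of("lang", "en", "text", "I hate everybody around me!"),
                Map.of("lang", "de", "text", "Schönes Wetter heute"),
                Map.of("lang", "en", "text", "The world is lovely"),
                Map.of("lang", "en", "text", "Sunny day today!"));
        System.out.println(moodDistribution(tweets)); // prints {0=1, 2=2}
    }
}
```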

DEVELOPING CUSTOM APPS: THE VERY BEGINNING
https://start.spring.io
At the very beginning of implementing custom apps stands the beloved Spring Initializr. You can generate a project skeleton for the app by choosing the build tool, Spring Boot version, packages and dependencies. You have to choose between Stream Kafka and Stream RabbitMQ as a dependency, depending on which message broker you want to use.

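PROGRAMMING MODEL: SOURCE
    import java.util.Iterator;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.cloud.stream.annotation.EnableBinding;
    import org.springframework.cloud.stream.messaging.Source;
    import org.springframework.context.annotation.Bean;
    import org.springframework.integration.annotation.InboundChannelAdapter;
    import org.springframework.integration.annotation.Poller;
    import org.springframework.integration.core.MessageSource;
    import org.springframework.messaging.support.GenericMessage;
    @SpringBootApplication
    @EnableBinding(Source.class) // binds the app to the default output channel
    public class TwitterIngester {
        private Iterator<String> lines;
        // polled every 200 ms; each poll emits one tweet to the output channel
        @Bean
        @InboundChannelAdapter(value = Source.OUTPUT,
                poller = @Poller(fixedDelay = "200", maxMessagesPerPoll = "1"))
        public MessageSource<String> twitterMessageSource() {
            return () -> new GenericMessage<>(emitTweet());
        }
        // cycles through the test data: re-reads the tweets once the iterator is exhausted
        private String emitTweet() {
            if (lines == null || !lines.hasNext()) lines = readTweets();
            return lines.next();
        }
        private Iterator<String> readTweets() {
            //…
        }
    }
And then you can start coding. Here’s an example of a source app.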

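PROGRAMMING MODEL: SOURCE TESTING
    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;
    import org.springframework.cloud.stream.messaging.Source;
    import org.springframework.cloud.stream.test.binder.MessageCollector;
    import org.springframework.messaging.Message;
    import org.springframework.test.context.junit4.SpringRunner;
    @RunWith(SpringRunner.class)
    @SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
    public class TwitterIngesterTest {
        @Autowired
        private Source source;
        @Autowired
        private MessageCollector collector; // captures the messages sent to a channel
        @Test
        public void tweetIngestionTest() throws InterruptedException {
            for (int i = 0; i < 100; i++) {
                // blocks until the source has emitted the next message
                Message<String> message = (Message<String>) collector.forChannel(source.output()).take();
                assert (message.getPayload().length() > 0);
            }
        }
    }
You can use the Spring Cloud Stream testing harness to implement unit tests for apps, as this code sample testing our source shows.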

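PROGRAMMING MODEL: PROCESSOR (WITH STANFORD NLP)
    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.sentiment.SentimentCoreAnnotations;
    import edu.stanford.nlp.trees.Tree;
    import edu.stanford.nlp.util.CoreMap;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.cloud.stream.annotation.EnableBinding;
    import org.springframework.cloud.stream.annotation.StreamListener;
    import org.springframework.cloud.stream.messaging.Processor;
    import org.springframework.messaging.handler.annotation.SendTo;
    import org.springframework.tuple.Tuple;
    import org.springframework.tuple.TupleBuilder;
    @SpringBootApplication
    @EnableBinding(Processor.class)
    public class TweetSentimentProcessor {
        @Autowired
        StanfordNLP nlp;
        @StreamListener(Processor.INPUT) // input channel with default name
        @SendTo(Processor.OUTPUT)        // output channel with default name
        public Tuple analyzeSentiment(String tweet) {
            return TupleBuilder.tuple().of("mood", findSentiment(tweet));
        }
        // the sentiment of the longest sentence is taken as the sentiment of the whole tweet
        public int findSentiment(String tweet) {
            int mainSentiment = 0;
            if (tweet != null && tweet.length() > 0) {
                int longest = 0;
                Annotation annotation = nlp.process(tweet);
                for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
                    Tree tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree.class);
                    int sentiment = RNNCoreAnnotations.getPredictedClass(tree);
                    String partText = sentence.toString();
                    if (partText.length() > longest) {
                        mainSentiment = sentiment;
                        longest = partText.length();
                    }
                }
            }
            return mainSentiment;
        }
    }
This is an example of our processor performing sentiment analysis based on Stanford NLP.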

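PROGRAMMING MODEL: PROCESSOR TESTING
    import static org.hamcrest.Matchers.equalTo;
    import static org.junit.Assert.assertThat;
    import static org.springframework.cloud.stream.test.matcher.MessageQueueMatcher.receivesPayloadThat;
    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;
    import org.springframework.cloud.stream.messaging.Processor;
    import org.springframework.cloud.stream.test.binder.MessageCollector;
    import org.springframework.messaging.support.GenericMessage;
    import org.springframework.test.context.junit4.SpringRunner;
    import org.springframework.tuple.TupleBuilder;
    @RunWith(SpringRunner.class)
    @SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
    public class TweetSentimentProcessorTest {
        @Autowired
        private Processor processor;
        @Autowired
        private MessageCollector collector;
        @Autowired
        private TweetSentimentProcessor sentimentProcessor;
        @Test
        public void testAnalysis() {
            checkFor("I hate everybody around me!");
            checkFor("The world is lovely");
            checkFor("I f***ing hate everybody around me. They're from hell");
            checkFor("Sunny day today!");
        }
        // sends a message to the input channel and expects the matching mood tuple on the output channel
        private void checkFor(String msg) {
            processor.input().send(new GenericMessage<>(msg));
            assertThat(
                    collector.forChannel(processor.output()),
                    receivesPayloadThat(
                            equalTo(TupleBuilder.tuple().of("mood", sentimentProcessor.findSentiment(msg)))));
        }
    }
Here you can see a more complex and fluent way to test your custom apps.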

DEVELOPING THE STREAM DEFINITIONS WITH FLO
http://projects.spring.io/spring-flo/
You can then use Flo to compose your stream of starter and custom apps.

RUNNING IT LOCAL
RUNNING THE DATASERVICES
    $ redis-server &
    $ zookeeper-server-start.sh ./config/zookeeper.properties &
    $ kafka-server-start.sh ./config/server.properties &
    $ java -jar spring-cloud-dataflow-server-local-1.2.0.RELEASE.jar &
    $ java -jar ../spring-cloud-dataflow-shell-1.2.0.RELEASE.jar
    dataflow:> app import --uri [1]
    dataflow:> app register --name tweetsentimentalyzer --type processor --uri file:///libs/worldmoodindex-0.0.2-SNAPSHOT.jar
    dataflow:> stream create tweets-ingestion --definition "twitterstream --consumerKey=A --consumerSecret=B --accessToken=C --accessTokenSecret=D | filter --expression=#jsonPath(payload,'$.lang')=='en' | log" --deploy
    dataflow:> stream create tweets-analyzer --definition ":tweets-ingestion.filter > tweetsentimentalyzer | field-value-counter --fieldName=mood --name=Mood"
    dataflow:> stream deploy tweets-analyzer --properties "deployer.tweetsentimentalyzer.memory=1024m,deployer.tweetsentimentalyzer.count=8,app.transform.producer.partitionKeyExpression=payload.id"
    [1] http://repo.spring.io/libs-release/org/springframework/cloud/stream/app/spring-cloud-stream-app-descriptor/Bacon.RELEASE/spring-cloud-stream-app-descriptor-Bacon.RELEASE.kafka-10-apps-maven-repo-url.properties
And now we’re all set to run the stream on our local computer. Assuming you’ve already downloaded Redis, Kafka and SCDF, this is more or less the shell code to deploy the stream.

But we have to build a solution that scales out, so a local machine is not enough. We need a cluster operating system.

RUNNING IT IN THE CLOUD
RUNNING THE DATASERVICES
    $ git clone https://github.com/spring-cloud/spring-cloud-dataflow-server-kubernetes
    $ kubectl create -f src/etc/kubernetes/kafka-zk-controller.yml
    $ kubectl create -f src/etc/kubernetes/kafka-zk-service.yml
    $ kubectl create -f src/etc/kubernetes/kafka-controller.yml
    $ kubectl create -f src/etc/kubernetes/mysql-controller.yml
    $ kubectl create -f src/etc/kubernetes/mysql-service.yml
    $ kubectl create -f src/etc/kubernetes/kafka-service.yml
    $ kubectl create -f src/etc/kubernetes/redis-controller.yml
    $ kubectl create -f src/etc/kubernetes/redis-service.yml
    $ kubectl create -f src/etc/kubernetes/scdf-config-kafka.yml
    $ kubectl create -f src/etc/kubernetes/scdf-secrets.yml
    $ kubectl create -f src/etc/kubernetes/scdf-service.yml
    $ kubectl create -f src/etc/kubernetes/scdf-controller.yml
    $ kubectl get svc   # look up the external IP of "scdf": <IP>
    $ java -jar ../spring-cloud-dataflow-shell-1.2.0.RELEASE.jar
    dataflow:> dataflow config server --uri http://<IP>:9393
    dataflow:> app import --uri [2]
    dataflow:> app register --type processor --name tweetsentimentalyzer --uri docker:qaware/tweetsentimentalyzer-processor:latest
    dataflow:> …
    [2] http://repo.spring.io/libs-release/org/springframework/cloud/stream/app/spring-cloud-stream-app-descriptor/Bacon.RELEASE/spring-cloud-stream-app-descriptor-Bacon.RELEASE.stream-apps-kafka-09-docker
Here is the shell script to deploy to a pre-existing Kubernetes cluster. First you have to deploy the different parts and configurations to Kubernetes. Then you have to look up the external IP of the SCDF server and bind the shell to this IP. Then you register the Docker variant of the starter apps and register the custom apps. Please note: the custom apps also have to be packaged as Docker containers and pushed to a Docker registry. All further steps for defining and deploying streams are the same as in the local setup.
http://docs.spring.io/spring-cloud-dataflow-server-kubernetes/docs/current-SNAPSHOT/reference/htmlsingle/#_deploying_streams_on_kubernetes

LESSONS LEARNED
Here are our lessons learned from using SCDF alongside our Swiss army knife Spark.

PRO
‣ specialized programming model -> efficient
‣ specialized execution environment -> efficient
‣ support for all types of data (big, fast, smart)
CON
‣ disjoint programming model (data processing <-> services)
‣ maybe a disjoint execution environment (data stack <-> service stack)
BEST USED further on: as default for {big,fast,smart} data processing

PRO
‣ coherent execution environment (runs on microservice stack)
‣ coherent programming model with emphasis on separation of concerns
‣ basically supports all types of data (big, fast, smart)
CON
‣ has limitations on throughput (big & fast data) due to less optimization (like data affinity, query optimizer, …) and message-wise processing
‣ technology immature in certain parts (e.g. diagnosability)
BEST USED FOR
‣ hybrid applications of data processing, system integration, API, UI
‣ moderate throughput data applications with existing dev team
Message by message processing

TWITTER.COM/QAWARE - SLIDESHARE.NET/QAWARE
Thank you! Questions?
[email protected] @adersberger https://github.com/adersberger/spring-cloud-dataflow-samples

BONUS SLIDES

MORE…
▸ Reactive programming
▸ Diagnosability
    public Flux<String> transform(@Input("input") Flux<String> input) {
        return input.map(s -> s.toUpperCase());
    }
There are a lot more things possible with SCDF, like a reactive programming model within the custom apps and diagnosability mechanisms like throughput statistics. But this is too much for this talk.
http://docs.spring.io/spring-cloud-dataflow/docs/1.2.0.BUILD-SNAPSHOT/reference/htmlsingle/#configuration-monitoring-management

PROGRAMMING MODEL: SINK (WITH KOTLIN)
    @EnableBinding(Sink::class)
    @EnableConfigurationProperties(PostgresSinkProperties::class)
    class PostgresSink {
        @Autowired
        lateinit var props: PostgresSinkProperties
        @StreamListener(Sink.INPUT)
        fun processTweet(message: String) {
            Database.connect(props.url, user = props.user, password = props.password, driver = "org.postgresql.Driver")
            transaction {
                SchemaUtils.create(Messages)
                Messages.insert { it[Messages.message] = message }
            }
        }
    }
    object Messages : Table() {
        val id = integer("id").autoIncrement().primaryKey()
        val message = text("message")
    }
And last but not least, an example of a sink programmed in lovely Kotlin.

MICRO ANALYTICS SERVICES Microservice Dashboard Microservice …

BLUEPRINT ARCHITECTURE

THE BIG PICTURE http://cloud.spring.io/spring-cloud-dataflow

BASIC IDEA: BI-MODAL SOURCES AND SINKS
Diagram labels: Sources, Processors, Sinks
READ FROM / WRITE TO: FILE, DATABASE, URL, …
INGEST FROM / DIGEST TO: TWITTER, MQ, LOG, …
More or less “pure” microservices -> magic happens around (in channels)

ARCHITECT’S VIEW
THE SECRET OF BIG DATA PERFORMANCE
Rule 1: Be as close to the data as possible! (CPU cache > memory > local disk > network)
Rule 2: Reduce data volume as early as possible! (as long as you don’t sacrifice parallelization)
Rule 3: Parallelize as much as possible!
Rule 4: Premature diagnosability and optimization
In my opinion, the secret of distributed performance lies in following three basic rules that optimize the two dimensions of distribution:
- vertical processing: how to split up a job into a distributed execution tree
- horizontal processing: how to scale out each execution step
