Rethinking how distributed applications are built

Does it have to be so hard?
Till Rohrmann
[email protected] stsffap

About me
• PMC member of Apache Flink
• Up until recently, software engineer at Alibaba
• Co-founder of dataArtisans, original creators of Apache Flink
• Worked on the distributed runtime of Apache Flink
• Focus: Building a scalable and correct stream processing engine
  ◦ Scheduling, fault tolerance, high availability

Current state of application development

The good old monolith
• Single tier applications
• Consistency fairly easy to achieve
• Not so easy to scale out, rather scaling up
• Not highly available unless you make it
Source: https://de.wikipedia.org/wiki/Monolith#/media/Datei:Expo02_op6987.jpg, Author: Daniel Steger, License: CC-BY-SA-2.5

Internet scale applications
• Amount of data is ever increasing → Applications need to scale
• Users cannot wait → Highly available with low latencies
• Monolith not well suited for these requirements

Microservices to the rescue
• Splitting the monolith up into separate services
• Services communicate over a network
• Potential task parallelism
• Loose coupling to scale development process
  ◦ Services can be owned by different teams, use different tech stacks
• Individual service can be scaled independently
[Diagram labels: Monolith, Service A, Service B, Service C]

Microservice architecture examples
• Netflix’s Cosmos platform
  ◦ Platform to run microservices together with workflows and serverless functions
• Uber runs 2200 critical microservices
  ◦ Increased complexity motivated Domain-oriented microservice architecture
• Amazon scaling market cap to 1.1 trillion $
Source: https://twitter.com/Werner/status/741673514567143424

Promises of microservice architectures
• Loose coupling & separation of concerns
• Scalable software development & agility
• Easy scalability
• Higher resiliency
• Reusable code

Entering the realm of distributed systems
• What if a service invocation gets lost?
• Has the change been applied?
• Do I have to retry? For how long?
• Can I make my request idempotent? What if not?
• Are distributed transactions viable? Do I have to resort to Sagas?
→ Maintaining consistency among multiple services is hard
[Diagram labels: Service A, Service B (three instances), What’s the state?]
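
The retry and idempotency questions above can be made concrete with a small sketch. It is not taken from the talk: `FlakyService`, `call_with_retry` and the request-ID scheme are hypothetical names, and the lost reply is simulated in-process rather than over a network. The idea shown is one common way to make a request idempotent: the caller reuses one request ID across retries, and the callee remembers which IDs it has already applied.

```python
import uuid


class FlakyService:
    """Hypothetical Service B: applies each request at most once, keyed by request ID."""

    def __init__(self, drop_first_n_replies: int = 0):
        self.balance = 0
        self.seen = {}                      # request_id -> cached reply
        self.drop = drop_first_n_replies    # number of replies to "lose"

    def add(self, request_id: str, amount: int) -> int:
        if request_id not in self.seen:     # first delivery: apply the change
            self.balance += amount
            self.seen[request_id] = self.balance
        if self.drop > 0:                   # change is applied, but the reply is lost
            self.drop -= 1
            raise TimeoutError("reply lost")
        return self.seen[request_id]


def call_with_retry(service: FlakyService, amount: int, max_attempts: int = 5) -> int:
    request_id = str(uuid.uuid4())          # same ID for every retry of this call
    for _ in range(max_attempts):
        try:
            return service.add(request_id, amount)
        except TimeoutError:
            continue                        # caller cannot tell whether the change was applied
    raise RuntimeError("gave up after max_attempts")


service = FlakyService(drop_first_n_replies=2)
result = call_with_retry(service, 10)
```

Even though the first two replies are lost, the amount is applied only once. Without the request ID, each retry would add the amount again, which is the "What if not?" case on the slide.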

Reality of microservice architectures
• Scalable services if you make them scalable
  ◦ Data partitioning
  ◦ Shuffling
  ◦ Distributed algorithms
• Highly available services if you make them highly available
  ◦ Cluster membership
  ◦ Replication of state
  ◦ Fault tolerance
• Consistency
  ◦ Keeping multiple processes in sync
  ◦ Distributed failures
• Managing a zoo of services
  ◦ Deployment and operations of a multitude of processes
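
As an illustration of the data partitioning and shuffling items, here is a minimal sketch of hash partitioning. The function names and the record layout are assumptions made for this example; real systems add rebalancing, replication and failure handling on top.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a digest is used to get the same result on every node.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def shuffle(records, num_partitions):
    """Group (key, value) records by target partition, as a shuffle step would."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


records = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4)]
parts = shuffle(records, num_partitions=3)
```

All records with the same key end up in the same partition, so the state for that key can live on a single node.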

Microservices are not the panacea
• Microservices help to decompose your application into smaller parts
• If not done right, then microservices can add a lot of complexity
  ◦ Networked monoliths
  ◦ Higher overhead
  ◦ Services form information barriers
• High availability, scalability, consistency, deployment & operation are on you

What about serverless computing?
• Simplifies problem of deployment & operations
• No more management & capacity planning for compute resources
• Elasticity & cost efficiency
• Too simplistic for many use cases
  ◦ Stateful functions pose challenges
Source: https://en.wikipedia.org/wiki/AWS_Lambda#/media/File:Amazon_Lambda_architecture_logo.svg
Source: https://github.com/knative/community/blob/main/icons/logo.svg

Building distributed scalable applications is hard!
[Diagram labels: Programming, Distributed systems, Domain knowledge, Database systems, Cloud computing; People with the right skill set to build distributed scalable applications]

A different approach

What makes it so hard?
• Traditional application development gives maximum flexibility
• With great power comes great responsibility
  ◦ Reliable communication
  ◦ Consistent state changes
  ◦ Fault tolerance
• Cause: Control flow of the application starts in the business logic

Traditional control flow in a tiered application
• Application is triggered by external event and then calls database
• Database does not know about service dependencies and communication patterns
[Diagram labels: Client, Service A, Service B, Database, Database]
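
A minimal sketch of why this control flow is fragile, assuming a hypothetical order service: the business logic writes to its database and then calls another service, and nothing ties the two steps together. `ServiceA`, `ServiceB` and the order example are invented for illustration.

```python
class ServiceB:
    """Hypothetical downstream service that may be unreachable."""

    def __init__(self, fail: bool = False):
        self.shipments = []
        self.fail = fail

    def ship(self, order_id: str) -> None:
        if self.fail:
            raise ConnectionError("Service B unreachable")
        self.shipments.append(order_id)


class ServiceA:
    """Business logic drives the control flow: write to the database, then call Service B."""

    def __init__(self, service_b: ServiceB):
        self.database = {}              # stand-in for Service A's database
        self.service_b = service_b

    def place_order(self, order_id: str) -> None:
        self.database[order_id] = "PLACED"   # step 1: local state change
        self.service_b.ship(order_id)        # step 2: remote call, may fail


service_b = ServiceB(fail=True)
service_a = ServiceA(service_b)
try:
    service_a.place_order("order-1")
except ConnectionError:
    pass  # the database already says PLACED, but Service B never saw the order
```

After the failed call the two services disagree about the order, and the database cannot help because it does not know that a call to Service B was part of the operation.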

Inverting the control flow
• Putting the database in charge of running the application
• Database is responsible for state and messaging
• Invokes functions with state
• Knows about communication patterns, can retry and make sure that state remains consistent
[Diagram labels: Client, Database, Service A, Service B]
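
The inversion can be sketched with a toy runtime that owns both state and messaging and calls user functions with their state. This is an in-memory illustration only, not the Stateful Functions API: `Runtime`, `greeter`, `printer` and the `(new_state, outgoing_messages)` return convention are assumptions made for this example.

```python
from collections import deque


def greeter(message, state):
    """Hypothetical user function: takes input and state, returns new state and messages."""
    count = (state or 0) + 1
    return count, [("printer", f"hello {message} (seen {count} times)")]


def printer(message, state):
    """Hypothetical user function that keeps a log of everything it received."""
    log = (state or []) + [message]
    return log, []


class Runtime:
    """Toy stand-in for the 'database in charge': it owns state and messaging."""

    def __init__(self, functions):
        self.functions = functions
        self.state = {}            # function name -> state, owned by the runtime
        self.inbox = deque()       # pending (target, message) pairs

    def send(self, target, message):
        self.inbox.append((target, message))

    def run(self):
        while self.inbox:
            target, message = self.inbox.popleft()
            # The runtime hands the function its input and its current state.
            new_state, outgoing = self.functions[target](message, self.state.get(target))
            # State change and outgoing messages are applied together by the runtime.
            self.state[target] = new_state
            self.inbox.extend(outgoing)


runtime = Runtime({"greeter": greeter, "printer": printer})
runtime.send("greeter", "Ada")
runtime.send("greeter", "Ada")
runtime.run()
```

Because the functions never touch storage or the network themselves, the runtime is the single place where retries and consistent state updates would have to be implemented.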

Stateful Functions

Stateful Functions
• Platform-independent stateful serverless stack to build event-driven distributed applications
• Stateful Functions is built on top of Apache Flink
• Stateful Functions runs user-defined stateful functions
  ◦ Invokes functions and forwards messages to other functions
  ◦ Keeps state consistent
• Stateful Functions is a “distributed database” that drives the application forward
[Diagram labels: Ingress, f(i,s), f(i,s), f(i,s), Egress]

Stateful functions high level architecture Source: https://nightlies.apache.org/flink/flink-statefun-docs-release-3.2/docs/concepts/distributed_architecture/

Stateful functions components Source: https://nightlies.apache.org/flink/flink-statefun-docs-release-3.2/docs/concepts/distributed_architecture/

How to achieve arbitrary messaging in a dataflow graph?
• Stateful functions requires arbitrary messaging
  ◦ Functions can invoke other functions
• Flink is a stream processing engine that runs dataflows (DAGs)
• Solution: Introduce feedback channel to support loops
• New function invocations are sent back through feedback channel
[Diagram labels: Ingress, Feedback union, Function dispatcher, Function endpoint, Feedback sender, Egress]
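
The feedback channel can be sketched as a queue that carries function-to-function messages back to the start of an otherwise acyclic pipeline. This is a single-process toy model, not Flink code: `run_dataflow`, the `"egress"` target name and the two example functions are assumptions made for illustration.

```python
from collections import deque


def run_dataflow(ingress, functions):
    """Toy dataflow: ingress -> feedback union -> function dispatcher -> egress / feedback."""
    ingress = deque(ingress)
    feedback = deque()          # feedback channel closing the loop
    egress = []

    while ingress or feedback:
        # Feedback union: merge records from the feedback channel with new input.
        address, payload = feedback.popleft() if feedback else ingress.popleft()
        # Function dispatcher: route the record to the addressed function.
        for target, out in functions[address](payload):
            if target == "egress":
                egress.append(out)
            else:
                # Feedback sender: a function invoking another function.
                feedback.append((target, out))
    return egress


# Hypothetical functions; each returns a list of (target, payload) pairs.
functions = {
    "split": lambda text: [("upper", word) for word in text.split()],
    "upper": lambda word: [("egress", word.upper())],
}

result = run_dataflow([("split", "hello world")], functions)
```

The graph itself stays a simple pipeline; the loop exists only through the feedback queue, which is what lets one function invoke another.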

How to checkpoint loops
• Flink creates globally consistent checkpoints using the ABS (asynchronous barrier snapshotting) algorithm
• General iterations are not supported
• Feedback union operator creates a snapshot of its state and records all events from the feedback channel until it sees the checkpoint barrier
[Diagram labels: Feedback union, Feedback, Input, Checkpoint barrier 1, Checkpoint 1]
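
The logging step can be illustrated with a toy feedback union that sums integers. It is a simplification under stated assumptions, not Flink's implementation: the class, its method names and the single-threaded event order are invented for this sketch. The point it shows is that the checkpoint consists of the operator state at the barrier plus the events that were still travelling through the loop.

```python
class FeedbackUnion:
    """Toy feedback union that sums integers from its input and feedback channels."""

    def __init__(self):
        self.total = 0
        self.logging_for = None     # checkpoint ID currently being recorded
        self.snapshots = {}         # checkpoint ID -> {"state": ..., "in_flight": [...]}

    def on_input(self, value):
        self.total += value

    def on_input_barrier(self, checkpoint_id):
        # Barrier on the regular input: snapshot state, start recording feedback events.
        self.snapshots[checkpoint_id] = {"state": self.total, "in_flight": []}
        self.logging_for = checkpoint_id

    def on_feedback(self, value):
        if self.logging_for is not None:
            # Event was in the loop when the barrier arrived: it belongs to the checkpoint.
            self.snapshots[self.logging_for]["in_flight"].append(value)
        self.total += value

    def on_feedback_barrier(self, checkpoint_id):
        # The barrier has travelled around the loop: stop recording.
        if self.logging_for == checkpoint_id:
            self.logging_for = None

    def restore(self, checkpoint_id):
        snapshot = self.snapshots[checkpoint_id]
        self.total = snapshot["state"]
        for value in snapshot["in_flight"]:   # replay events that were in the loop
            self.total += value


union = FeedbackUnion()
union.on_input(1)
union.on_input_barrier(1)
union.on_feedback(10)        # in flight when the barrier arrived: recorded
union.on_feedback(20)        # still before the barrier on the feedback channel: recorded
union.on_feedback_barrier(1)
union.on_feedback(5)         # after the barrier: not part of checkpoint 1
total_before_restore = union.total
union.restore(1)
```

Restoring checkpoint 1 brings back the state at the barrier and replays the two recorded events, while the event that arrived after the feedback barrier is not part of that checkpoint.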

Benefits of Stateful Functions
• Easy to build scalable and consistent event-driven applications
• Reliable messaging → no retries required
• Consistent state → no distributed transactions/sagas/etc. needed
• Exactly once processing guarantees inherited from Flink
• Segregation of state/message from compute
  ◦ Compute part is “stateless” → easily scalable
→ Removing the distributed complexity from distributed applications

Limitations of Stateful Functions
• Result latency with exactly once processing guarantees is lower bounded by checkpoint interval
  ◦ Checkpoint interval ⪆ couple of seconds
  ◦ Not well suited for near real-time applications (e.g. web applications)
• Single task failure requires recovery of the whole job graph
  ◦ All operators will be redeployed
  ◦ No availability during failure recoveries
→ Not well suited for low latency, high availability applications :-(

What’s causing the limitations?
• Caused by Flink’s checkpointing algorithm
• Globally consistent checkpoint requires all operators to create a consistent checkpoint → coupling of operators and their checkpointing
• Globally consistent checkpoint requires all operators to be reset in case of a recovery
→ Potential solution: Independent checkpointing & recovery of operators

Conclusion

Wrapping it up
• Building distributed applications is complicated
• Making the DB responsible for state & messaging leads to simpler application development
• Stateful Functions shows the idea but has severe limitations
• Need for a better runtime that is highly available and guarantees consistent results with low latencies
→ Good ideas required!

Questions? [email protected] stsffap
