Rethinking how distributed applications are built

In our increasingly connected world, where people are used to managing their lives via digital services, a successful company must build applications that can scale with the popularity of its services. Scalability is not the only requirement: modern applications must also be highly available and fast, because users are not willing to wait. As a result, we have seen a shift from the classic monolith towards microservice architectures, which promise to be more easily scalable. More recently, the emergence of serverless functions has further strengthened this trend.

By implementing a microservice architecture, application developers are suddenly exposed to the realm of distributed applications, with its seemingly limitless scalability but also the pitfalls nobody tells you about upfront. Instead of solving business domain problems, developers find themselves fighting race conditions, distributed failures, inconsistencies and, in general, drastically increased complexity. To solve some of these problems, people introduce endless retries, timeouts, sagas and distributed transactions. These band-aids can quickly result in a system that is not so scalable, brittle and hard to maintain.

The underlying problem is that developers are responsible for ensuring reliable communication and consistent state changes. Having a system that takes care of these aspects could drastically reduce the complexity of developing scalable distributed applications. By inverting the traditional control flow from application-to-database to database-to-application, we can put the database in charge of ensuring reliable communication and consistent state changes, thus freeing the developer from having to think about them.

In this keynote, I want to explore the idea of putting the database in charge of driving the application logic, using the example of Stateful Functions, a library built on top of Apache Flink that follows this idea. I will explain how Stateful Functions achieves scalability and consistency, but also what its limitations are. Based on these results, I would like to sketch the requirements for a runtime that can truly realise the full potential of Stateful Functions and discuss with you ideas for how it could be implemented.

Till Rohrmann

June 28, 2022

Transcript

  1. About me
     • PMC member of Apache Flink
     • Up until recently, software engineer at Alibaba
     • Co-founder of dataArtisans, original creators of Apache Flink
     • Worked on the distributed runtime of Apache Flink
     • Focus: Building a scalable and correct stream processing engine
       ◦ Scheduling, fault tolerance, high availability
  2. The good old monolith
     • Single-tier applications
     • Consistency fairly easy to achieve
     • Not so easy to scale out, rather scaling up
     • Not highly available unless you make it
     Source: https://de.wikipedia.org/wiki/Monolith#/media/Datei:Expo02_op6987.jpg, Author: Daniel Steger, License: CC-BY-SA-2.5
  3. Internet scale applications
     • Amount of data is ever increasing → Applications need to scale
     • Users cannot wait → Highly available with low latencies
     • Monolith not well suited for these requirements
  4. Microservices to the rescue
     • Splitting the monolith up into separate services
     • Services communicate over a network
     • Potential task parallelism
     • Loose coupling to scale development process
       ◦ Services can be owned by different teams, use different tech stacks
     • Individual services can be scaled independently
     [Diagram labels: Monolith, Service A, Service B, Service C]
  5. Microservice architecture examples
     • Netflix’s Cosmos platform
       ◦ Platform to run microservices together with workflows and serverless functions
     • Uber runs 2200 critical microservices
       ◦ Increased complexity motivated Domain-oriented microservice architecture
     • Amazon scaling market cap to $1.1 trillion
     Source: https://twitter.com/Werner/status/741673514567143424
  6. Promises of microservice architectures
     • Loose coupling & separation of concerns
     • Scalable software development & agility
     • Easy scalability
     • Higher resiliency
     • Reusable code
  7. Entering the realm of distributed systems
     • What if a service invocation gets lost?
     • Has the change been applied?
     • Do I have to retry? For how long?
     • Can I make my request idempotent? What if not?
     • Are distributed transactions viable? Do I have to resort to Sagas?
     → Maintaining consistency among multiple services is hard
     [Diagram labels: Service A, Service B (three times), “What’s the state?”]
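The retry and idempotency questions on this slide can be illustrated with a small sketch. It assumes a hypothetical `AccountService` that deduplicates requests by a caller-supplied request ID, and a hypothetical `FlakyChannel` that applies the change but loses the reply. None of these names come from the talk; they only show why a blind retry is safe when the request is idempotent.

```python
class LostReply(Exception):
    """Raised when the change was applied but the reply never arrived."""


class AccountService:
    """Hypothetical service that remembers which request IDs it has applied."""

    def __init__(self):
        self.balance = 0
        self.applied = {}  # request_id -> result returned for that request

    def deposit(self, request_id, amount):
        if request_id in self.applied:
            # Duplicate delivery: return the earlier result, change nothing.
            return self.applied[request_id]
        self.balance += amount
        self.applied[request_id] = self.balance
        return self.balance


class FlakyChannel:
    """Simulated network that drops the first `replies_to_drop` replies."""

    def __init__(self, service, replies_to_drop):
        self.service = service
        self.replies_to_drop = replies_to_drop

    def deposit(self, request_id, amount):
        result = self.service.deposit(request_id, amount)  # change IS applied
        if self.replies_to_drop > 0:
            self.replies_to_drop -= 1
            raise LostReply()
        return result


def deposit_with_retries(channel, request_id, amount, max_attempts=5):
    """Retry with the same request ID until a reply arrives."""
    for _ in range(max_attempts):
        try:
            return channel.deposit(request_id, amount)
        except LostReply:
            continue  # the caller cannot tell whether the change was applied
    raise RuntimeError("gave up after max_attempts")


service = AccountService()
channel = FlakyChannel(service, replies_to_drop=2)
result = deposit_with_retries(channel, "req-1", 10)
print(result, service.balance)
```

Without the request ID check, the same three deliveries would have credited the account three times. The sketch also shows the cost: every service has to carry its own deduplication state and retry policy.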
  8. Reality of microservice architectures
     • Scalable services if you make them scalable
       ◦ Data partitioning
       ◦ Shuffling
       ◦ Distributed algorithms
     • Highly available services if you make them highly available
       ◦ Cluster membership
       ◦ Replication of state
       ◦ Fault tolerance
     • Consistency
       ◦ Keeping multiple processes in sync
       ◦ Distributed failures
     • Managing a zoo of services
       ◦ Deployment and operations of a multitude of processes
  9. Microservices are not the panacea
     • Microservices help to decompose your application into smaller parts
     • If not done right, then microservices can add a lot of complexity
       ◦ Networked monoliths
       ◦ Higher overhead
       ◦ Services form information barriers
     • High availability, scalability, consistency, deployment & operation are on you
  10. What about serverless computing?
      • Simplifies problem of deployment & operations
      • No more management & capacity planning for compute resources
      • Elasticity & cost efficiency
      • Too simplistic for many use cases
        ◦ Stateful functions pose challenges
      Source: https://en.wikipedia.org/wiki/AWS_Lambda#/media/File:Amazon_Lambda_architecture_logo.svg
      Source: https://github.com/knative/community/blob/main/icons/logo.svg
  11. Building distributed scalable applications is hard!
      [Diagram labels: Programming, Distributed systems, Domain knowledge, Database systems, Cloud computing; “People with the right skill set to build distributed scalable applications”]
  12. What makes it so hard?
      • Traditional application development gives maximum flexibility
      • With great power comes great responsibility
        ◦ Reliable communication
        ◦ Consistent state changes
        ◦ Fault tolerance
      • Cause: Control flow of the application starts in the business logic
  13. Traditional control flow in a tiered application
      • Application is triggered by external event and then calls database
      • Database does not know about service dependencies and communication patterns
      [Diagram labels: Client, Service A, Service B, Database, Database]
  14. Inverting the control flow
      • Putting the database in charge of running the application
      • Database is responsible for state and messaging
      • Invokes functions with state
      • Knows about communication patterns, can retry and make sure that state remains consistent
      [Diagram labels: Client, Database, Service A, Service B]
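A minimal sketch of the inverted control flow, assuming a hypothetical in-memory `Runtime` that plays the role of the database: it owns the state, invokes the function with that state, retries on failure, and commits the new state together with the outgoing messages only when the invocation succeeds. The class and function names are illustrative and not an API from the talk.

```python
import copy


class Runtime:
    """Toy stand-in for a 'database' that owns state and drives function calls."""

    def __init__(self, handler, max_attempts=3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.state = {}   # key -> committed state
        self.outbox = []  # committed outgoing messages

    def deliver(self, key, message):
        for _ in range(self.max_attempts):
            # Hand the function a private copy so a failed attempt leaves no trace.
            working = copy.deepcopy(self.state.get(key, {}))
            try:
                new_state, outgoing = self.handler(message, working)
            except Exception:
                continue  # nothing was committed, so retrying is safe
            # Commit the state change and the outgoing messages together.
            self.state[key] = new_state
            self.outbox.extend(outgoing)
            return True
        return False


attempts = {"count": 0}


def count_visits(message, state):
    """Hypothetical business logic that crashes once after touching its state."""
    attempts["count"] += 1
    state["visits"] = state.get("visits", 0) + 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated crash")
    return state, [f"{message}: visit {state['visits']}"]


runtime = Runtime(count_visits)
delivered = runtime.deliver("user-1", "login")
print(delivered, runtime.state, runtime.outbox)
```

The business logic contains no retry loop and no deduplication. The failed first attempt modified only its working copy, so the committed visit count is 1 rather than 2.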
  15. Stateful Functions
      • Platform-independent stateful serverless stack to build event-driven distributed applications
      • Stateful Functions is built on top of Apache Flink
      • Stateful Functions runs user-defined stateful functions
        ◦ Invokes functions and forwards messages to other functions
        ◦ Keeps state consistent
      • Stateful Functions is a “distributed database” that drives the application forward
      [Diagram labels: Ingress, f(i,s), f(i,s), f(i,s), Egress]
  16. How to achieve arbitrary messaging in a dataflow graph?
      • Stateful Functions requires arbitrary messaging
        ◦ Functions can invoke other functions
      • Flink is a stream processing engine that runs dataflows (DAGs)
      • Solution: Introduce feedback channel to support loops
      • New function invocations are sent back through feedback channel
      [Diagram labels: Ingress, Feedback union, Function dispatcher, Function endpoint, Feedback sender, Egress]
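The feedback channel can be sketched as a single loop over two queues. This is a toy, single-process illustration under stated assumptions, not Flink or Stateful Functions code: `run`, `greeter` and `formatter` are hypothetical names, state is kept per function name only, and each function has the shape f(input, state) → (new state, outgoing messages).

```python
from collections import deque


def run(ingress, functions):
    """Drive a toy dataflow: feedback union -> function dispatcher -> egress or feedback."""
    pending = deque(ingress)  # messages arriving from the ingress
    feedback = deque()        # feedback channel carrying new function invocations
    state = {}                # per-function state, owned by the dispatcher
    egress = []
    while pending or feedback:
        # Feedback union: merge the ingress stream with the feedback channel.
        target, payload = feedback.popleft() if feedback else pending.popleft()
        # Function dispatcher: invoke the addressed function with its state.
        new_state, outputs = functions[target](payload, state.get(target))
        state[target] = new_state
        for out_target, out_payload in outputs:
            if out_target == "egress":
                egress.append(out_payload)
            else:
                # Feedback sender: route the invocation back into the loop.
                feedback.append((out_target, out_payload))
    return egress, state


def greeter(name, seen):
    """Counts greetings and invokes another function."""
    seen = (seen or 0) + 1
    return seen, [("formatter", (name, seen))]


def formatter(message, _state):
    """Stateless function that writes to the egress."""
    name, seen = message
    return None, [("egress", f"Hello {name}, greeting #{seen}")]


egress, state = run(
    [("greeter", "Ada"), ("greeter", "Bob")],
    {"greeter": greeter, "formatter": formatter},
)
print(egress)
```

The graph itself stays acyclic apart from the one feedback edge, yet any function can address any other function by sending its message around the loop.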
  17. How to checkpoint loops
      • Flink creates globally consistent checkpoints using the ABS (asynchronous barrier snapshotting) algorithm
      • General iterations are not supported
      • Feedback union operator creates snapshot of its state + records all events from the feedback channel until it sees the checkpoint barrier
      [Diagram labels: Input, Feedback, Feedback union, Checkpoint barrier 1, Checkpoint barrier 1, Checkpoint 1]
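The rule on this slide can be sketched as a small state machine. This is a simplified, single-operator illustration of the described behaviour, not Flink's implementation: `FeedbackUnion`, its counter state and the `BARRIER` marker are assumptions made for the example.

```python
BARRIER = "barrier"  # hypothetical marker standing in for a checkpoint barrier


class FeedbackUnion:
    """Toy operator whose state is the number of events it has processed."""

    def __init__(self):
        self.count = 0         # operator state
        self.snapshot = None   # state captured when the barrier arrives on the input
        self.logged = None     # feedback events recorded while a checkpoint is open
        self.checkpoints = []  # completed checkpoints

    def on_input(self, event):
        if event == BARRIER:
            # Barrier on the regular input: snapshot state, start logging feedback.
            self.snapshot = self.count
            self.logged = []
        else:
            self.count += 1

    def on_feedback(self, event):
        if event == BARRIER:
            # The barrier has travelled around the loop: the checkpoint is complete.
            self.checkpoints.append({"state": self.snapshot, "in_flight": self.logged})
            self.snapshot, self.logged = None, None
            return
        if self.logged is not None:
            # Event was in flight inside the loop when the snapshot was taken.
            self.logged.append(event)
        self.count += 1


op = FeedbackUnion()
op.on_input("a")
op.on_feedback("f1")
op.on_input(BARRIER)     # checkpoint 1 starts
op.on_feedback("f2")     # logged as part of checkpoint 1
op.on_feedback("f3")     # logged as part of checkpoint 1
op.on_feedback(BARRIER)  # checkpoint 1 completes
op.on_feedback("f4")     # belongs to the next checkpoint
print(op.checkpoints)
```

On recovery, restoring the snapshotted state and replaying the logged feedback events would reproduce the work that was circulating in the loop when the checkpoint was taken.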
  18. Benefits of Stateful Functions
      • Easy to build scalable and consistent event-driven applications
      • Reliable messaging → no retries required
      • Consistent state → no distributed transactions/sagas/etc. needed
      • Exactly once processing guarantees inherited from Flink
      • Segregation of state/message from compute
        ◦ Compute part is “stateless” → easily scalable
      → Removing the distributed complexity from distributed applications
  19. Limitations of Stateful Functions
      • Result latency with exactly once processing guarantees is lower bounded by checkpoint interval
        ◦ Checkpoint interval ⪆ couple of seconds
        ◦ Not well suited for near real-time applications (e.g. web applications)
      • Single task failure requires recovery of the whole job graph
        ◦ All operators will be redeployed
        ◦ No availability during failure recoveries
      → Not well suited for low latency, high availability applications :-(
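The latency limitation can be illustrated with a little arithmetic. The sketch assumes that results only become visible once the checkpoint covering them completes, and that checkpoints complete instantly at exact multiples of the interval. Both are simplifications, and `visible_at` is a hypothetical helper, not part of any library.

```python
def visible_at(produced_at_ms, checkpoint_interval_ms):
    """Time at which a result is released if output waits for the next checkpoint."""
    # Round up to the next multiple of the checkpoint interval (integer milliseconds).
    intervals = (produced_at_ms + checkpoint_interval_ms - 1) // checkpoint_interval_ms
    return intervals * checkpoint_interval_ms


interval_ms = 5000
for produced in (2300, 5000, 5001):
    released = visible_at(produced, interval_ms)
    print(f"produced at {produced} ms, visible at {released} ms, waited {released - produced} ms")
```

A result produced just after a checkpoint waits almost a full interval, so with intervals of a few seconds the added delay is far above what an interactive web request can tolerate.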
  20. What’s causing the limitations?
      • Caused by Flink’s checkpointing algorithm
      • Globally consistent checkpoint requires all operators to create a consistent checkpoint → coupling of operators and their checkpointing
      • Globally consistent checkpoint requires all operators to be reset in case of a recovery
      → Potential solution: Independent checkpointing & recovery of operators
  21. Wrapping it up
      • Building distributed applications is complicated
      • Making DB responsible for state & messaging leads to simpler application development
      • Stateful Functions shows the idea but has severe limitations
      • Need for a better runtime that is highly available and guarantees consistent results with low latencies
      → Good ideas required!