
Rethinking how distributed applications are built

In an increasingly connected world where people manage their lives through digital services, a successful company must build applications that scale with the popularity of its services. Scalability is not the only requirement: modern applications must also be highly available and fast, because users are not willing to wait. As a result, we have seen a shift from the classic monolith towards microservice architectures, which promise to be easier to scale. More recently, the emergence of serverless functions has further strengthened this trend.

By adopting a microservice architecture, application developers are suddenly exposed to the realm of distributed applications, with its seemingly limitless scalability but also its pitfalls that nobody tells you about upfront. Instead of solving business domain problems, developers find themselves fighting race conditions, distributed failures, inconsistencies and, in general, drastically increased complexity. To solve some of these problems, people introduce endless retries, timeouts, sagas and distributed transactions. These band-aids can quickly result in a system that is not very scalable, brittle and hard to maintain.

The underlying problem is that developers are responsible for ensuring reliable communication and consistent state changes. A system that takes care of these aspects could drastically reduce the complexity of developing scalable distributed applications. By inverting the traditional control flow from application-to-database to database-to-application, we can put the database in charge of ensuring reliable communication and consistent state changes, thus freeing the developer from having to think about them.

In this keynote, I want to explore the idea of putting the database in charge of driving the application logic, using the example of Stateful Functions, a library built on top of Apache Flink that follows this idea. I will explain how Stateful Functions achieves scalability and consistency, and also what its limitations are. Based on these results, I would like to sketch the requirements for a runtime that can truly realise the full potential of Stateful Functions and discuss with you ideas for how it could be implemented.

Till Rohrmann

June 28, 2022

Transcript

  1. Rethinking how distributed applications are built
    Does it have to be so hard?
    Till Rohrmann
    [email protected]
    stsffap

  2. About me
    ● PMC member of Apache Flink
    ● Up until recently, software engineer at Alibaba
    ● Co-founder of dataArtisans, original creators of Apache Flink
    ● Worked on the distributed runtime of Apache Flink
    ● Focus: Building a scalable and correct stream processing engine
    ○ Scheduling, fault tolerance, high availability

  3. Current state of application development

  4. The good old monolith
    ● Single tier applications
    ● Consistency fairly easy to achieve
    ● Not so easy to scale out, rather scaling up
    ● Not highly available unless you make it so
    Source: https://de.wikipedia.org/wiki/Monolith#/media/Datei:Expo02_op6987.jpg, Author: Daniel Steger, License: CC-BY-SA-2.5

  5. Internet scale applications
    ● Amount of data is ever increasing → Applications need to scale
    ● Users cannot wait → Highly available with low latencies
    ● Monolith not well suited for these requirements

  6. Microservices to the rescue
    ● Splitting the monolith up into separate services
    ● Services communicate over a network
    ● Potential task parallelism
    ● Loose coupling to scale development process
    ○ Services can be owned by different teams, use different tech stacks
    ● Individual services can be scaled independently
    [Diagram: Monolith; Service A, Service B, Service C]

  7. Microservice architecture examples
    ● Netflix’s Cosmos platform
    ○ Platform to run microservices together with workflows and serverless functions
    ● Uber runs 2200 critical microservices
    ○ Increased complexity motivated Domain-oriented microservice architecture
    ● Amazon scaling its market cap to $1.1 trillion
    Source: https://twitter.com/Werner/status/741673514567143424

  8. Promises of microservice architectures
    ● Loose coupling & separation of concerns
    ● Scalable software development & agility
    ● Easy scalability
    ● Higher resiliency
    ● Reusable code

  9. Entering the realm of distributed systems
    ● What if a service invocation gets lost?
    ● Has the change been applied?
    ● Do I have to retry? For how long?
    ● Can I make my request idempotent? What if not?
    ● Are distributed transactions viable? Do I have to resort to Sagas?
    → Maintaining consistency among multiple services is hard
    [Diagram: Service A and three instances of Service B, captioned “What’s the state?”]

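One common answer to the retry questions on this slide is to attach a request ID to every invocation and let the receiver deduplicate. The following is a minimal sketch in plain Python; `PaymentService`, `charge` and `request_id` are hypothetical names chosen for illustration and do not come from the talk.

```python
class PaymentService:
    """Hypothetical receiver that makes `charge` idempotent via request IDs."""

    def __init__(self):
        self.balance = 100
        self._results = {}  # request_id -> result of the first successful call

    def charge(self, request_id, amount):
        # A retried request returns the recorded result instead of charging twice.
        if request_id in self._results:
            return self._results[request_id]
        self.balance -= amount
        self._results[request_id] = self.balance
        return self.balance


service = PaymentService()
first = service.charge("req-1", 30)
retry = service.charge("req-1", 30)  # caller retries because the reply was lost
```

The deduplication table is itself state that has to survive failures and stay consistent with the balance, which is the kind of burden the slide points at.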

  10. Reality of microservice architectures
    ● Scalable services if you make them scalable
    ○ Data partitioning
    ○ Shuffling
    ○ Distributed algorithms
    ● Highly available services if you make them highly available
    ○ Cluster membership
    ○ Replication of state
    ○ Fault tolerance
    ● Consistency
    ○ Keeping multiple processes in sync
    ○ Distributed failures
    ● Managing a zoo of services
    ○ Deployment and operations of a multitude of processes

  11. Microservices are not the panacea
    ● Microservices help to decompose your application into smaller parts
    ● If not done right, then microservices can add a lot of complexity
    ○ Networked monoliths
    ○ Higher overhead
    ○ Services form information barriers
    ● High availability, scalability, consistency, deployment & operation are on you

  12. What about serverless computing?
    ● Simplifies problem of deployment & operations
    ● No more management & capacity planning for compute resources
    ● Elasticity & cost efficiency
    ● Too simplistic for many use cases
    ○ Stateful functions pose challenges
    Source: https://en.wikipedia.org/wiki/AWS_Lambda#/media/File:Amazon_Lambda_architecture_logo.svg
    Source: https://github.com/knative/community/blob/main/icons/logo.svg

  13. Building distributed scalable applications is hard!
    [Diagram: Programming, Distributed systems, Domain knowledge, Database systems, Cloud computing. Caption: People with the right skill set to build distributed scalable applications]

  14. A different approach

  15. What makes it so hard?
    ● Traditional application development gives maximum flexibility
    ● With great power comes great responsibility
    ○ Reliable communication
    ○ Consistent state changes
    ○ Fault tolerance
    ● Cause: Control flow of the application starts in the business logic

  16. Traditional control flow in a tiered application
    ● Application is triggered by external event and then calls database
    ● Database does not know about service dependencies and communication patterns
    [Diagram: Client, Service A, Service B, two Database boxes]

  17. Inverting the control flow
    ● Putting the database in charge of running the application
    ● Database is responsible for state and messaging
    ● Invokes functions with state
    ● Knows about communication patterns, can retry and make sure that state remains consistent
    [Diagram: Client, Database, Service A, Service B]

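The inversion described on this slide can be illustrated with a toy runtime that owns both state and messaging and calls into user code. This is a rough sketch under stated assumptions, not how any real system implements it: `Runtime`, `deliver`, `flaky_counter` and the use of `ConnectionError` to simulate a lost invocation are all hypothetical.

```python
class Runtime:
    """Hypothetical 'database' that owns state and messaging and calls user code."""

    def __init__(self, handler, max_attempts=3):
        self.handler = handler
        self.max_attempts = max_attempts
        self.state = {}   # key -> committed state
        self.outbox = []  # messages committed together with the state

    def deliver(self, key, message):
        for _ in range(self.max_attempts):
            try:
                new_state, outgoing = self.handler(self.state.get(key), message)
            except ConnectionError:
                continue  # retry; nothing has been committed yet
            # Commit the state change and the outgoing messages in one step.
            self.state[key] = new_state
            self.outbox.extend(outgoing)
            return True
        return False


attempts = {"n": 0}


def flaky_counter(state, message):
    """User function: fails on the first call, then counts messages per key."""
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise ConnectionError("simulated lost invocation")
    count = (state or 0) + 1
    return count, [f"seen {message} ({count})"]


runtime = Runtime(flaky_counter)
ok = runtime.deliver("user-1", "click")
```

The user function contains no retry or consistency logic. It receives state, returns new state plus messages, and the runtime decides when to retry and when to commit.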

  18. Stateful Functions

  19. Stateful Functions
    ● Platform-independent stateful serverless stack to build event-driven distributed applications
    ● Stateful Functions is built on top of Apache Flink
    ● Stateful Functions runs user-defined stateful functions
    ○ Invokes functions and forwards messages to other functions
    ○ Keeps state consistent
    ● Stateful Functions is a “distributed database” that drives the application forward
    [Diagram: Ingress, three functions f(i,s), Egress]

  20. Stateful Functions high level architecture
    Source: https://nightlies.apache.org/flink/flink-statefun-docs-release-3.2/docs/concepts/distributed_architecture/

  21. Stateful Functions components
    Source: https://nightlies.apache.org/flink/flink-statefun-docs-release-3.2/docs/concepts/distributed_architecture/

  22. How to achieve arbitrary messaging in a dataflow graph?
    ● Stateful Functions requires arbitrary messaging
    ○ Functions can invoke other functions
    ● Flink is a stream processing engine that runs dataflows (DAGs)
    ● Solution: Introduce feedback channel to support loops
    ● New function invocations are sent back through feedback channel
    [Diagram: Ingress, Feedback union, Function dispatcher, Feedback sender, Egress, Function endpoint]

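The feedback loop on this slide can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, not Flink or Stateful Functions code; `run_dataflow`, `greeter`, `shouter` and the `(target, key, payload)` message shape are assumptions made for the example.

```python
from collections import deque


def run_dataflow(ingress, functions):
    """Drive messages through a loop: feedback union -> dispatcher -> feedback sender."""
    feedback = deque()      # feedback channel carrying function-to-function messages
    pending = deque(ingress)
    egress = []
    state = {}              # (function name, key) -> state

    while pending or feedback:
        # Feedback union: merge the feedback channel with the ingress stream.
        target, key, payload = feedback.popleft() if feedback else pending.popleft()
        # Function dispatcher: invoke the addressed function with its state.
        new_state, outgoing = functions[target](state.get((target, key)), payload)
        state[(target, key)] = new_state
        for out_target, out_key, out_payload in outgoing:
            if out_target == "egress":
                egress.append((out_key, out_payload))
            else:
                # Feedback sender: new invocations go back into the loop.
                feedback.append((out_target, out_key, out_payload))
    return egress, state


def greeter(state, name):
    seen = (state or 0) + 1
    return seen, [("shouter", name, f"hello {name} #{seen}")]


def shouter(state, text):
    return None, [("egress", "out", text.upper())]


egress, state = run_dataflow(
    [("greeter", "ann", "ann"), ("greeter", "ann", "ann")],
    {"greeter": greeter, "shouter": shouter},
)
```

The graph of operators stays acyclic apart from the single feedback edge, yet any function can address any other function by sending a message through that edge.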

  23. How to checkpoint loops
    ● Flink creates globally consistent checkpoints using the ABS (asynchronous barrier snapshotting) algorithm
    ● General iterations are not supported
    ● Feedback union operator creates snapshot of its state + records all events from the feedback channel until it sees the checkpoint barrier
    [Diagram: Feedback union, Input, Feedback, Checkpoint barrier 1 (twice), Checkpoint 1]

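The logging step on this slide can be sketched as follows. This is a simplified model assuming a single operator that sums integer events; the class `FeedbackUnion` and its method names are hypothetical and do not mirror the actual Flink operator.

```python
class FeedbackUnion:
    """Hypothetical operator that sums values arriving on the feedback channel."""

    def __init__(self):
        self.total = 0
        self.snapshot = None  # checkpoint in progress: state copy plus event log
        self.completed = []

    def on_input_barrier(self, checkpoint_id):
        # Barrier arrives on the regular input: snapshot state, start logging feedback.
        self.snapshot = {"id": checkpoint_id, "state": self.total, "logged": []}

    def on_feedback_event(self, value):
        if self.snapshot is not None:
            # In-flight loop event that belongs to the running checkpoint.
            self.snapshot["logged"].append(value)
        self.total += value

    def on_feedback_barrier(self, checkpoint_id):
        # The barrier has travelled around the loop: the checkpoint is complete.
        assert self.snapshot is not None and self.snapshot["id"] == checkpoint_id
        self.completed.append(self.snapshot)
        self.snapshot = None

    def restore(self, checkpoint):
        # Recovery: reset the state, then replay the logged feedback events.
        self.total = checkpoint["state"]
        for value in checkpoint["logged"]:
            self.on_feedback_event(value)


op = FeedbackUnion()
op.on_feedback_event(5)      # processed before the checkpoint starts
op.on_input_barrier(1)
op.on_feedback_event(7)      # in flight while checkpoint 1 is running
op.on_feedback_barrier(1)
op.on_feedback_event(100)    # after the checkpoint; not part of it

recovered = FeedbackUnion()
recovered.restore(op.completed[0])
```

Events that were circulating in the loop when the snapshot was taken would otherwise be lost on recovery, so in this sketch they are stored alongside the state and replayed.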

  24. Benefits of Stateful Functions
    ● Easy to build scalable and consistent event-driven applications
    ● Reliable messaging → no retries required
    ● Consistent state → no distributed transactions/sagas/etc. needed
    ● Exactly once processing guarantees inherited from Flink
    ● Segregation of state/message from compute
    ○ Compute part is “stateless” → easily scalable
    → Removing the distributed complexity from distributed applications

  25. Limitations of Stateful Functions
    ● Result latency with exactly once processing guarantees is lower bounded by checkpoint interval
    ○ Checkpoint interval ⪆ couple of seconds
    ○ Not well suited for near real-time applications (e.g. web applications)
    ● Single task failure requires recovery of the whole job graph
    ○ All operators will be redeployed
    ○ No availability during failure recoveries
    → Not well suited for low latency, high availability applications :-(

  26. What’s causing the limitations?
    ● Caused by Flink’s checkpointing algorithm
    ● Globally consistent checkpoint requires all operators to create a consistent checkpoint → coupling of operators and their checkpointing
    ● Globally consistent checkpoint requires all operators to be reset in case of a recovery
    → Potential solution: Independent checkpointing & recovery of operators

  27. Conclusion

  28. Wrapping it up
    ● Building distributed applications is complicated
    ● Making DB responsible for state & messaging leads to simpler application development
    ● Stateful Functions demonstrates the idea but has severe limitations
    ● Need for a better runtime that is highly available and guarantees consistent results with low latencies → Good ideas required!

  29. Questions?
    [email protected]
    stsffap
