
Towards Robust Large-scale Concurrent and Distributed Programming

Philipp Haller
July 30, 2021

Keynote given at the 20th International Symposium on Parallel and Distributed Computing (ISPDC 2021), 28-30 July 2021, Cluj-Napoca, Romania.

Conference website: https://ispdc2021.utcluj.ro/

Abstract: Software systems must satisfy rapidly increasing demands imposed by emerging applications. For example, new AI applications, such as autonomous driving, require quick responses to an environment that is changing continuously. At the same time, software systems must be fault-tolerant in order to ensure a high degree of availability. As it stands, however, developing these new distributed software systems is extremely challenging even for expert software engineers due to the interplay of concurrency, asynchronicity, and failure of components. The objective of our research is to develop reusable solutions to the above challenges by means of novel programming models and frameworks that can be used to build a wide range of applications. This talk reports on our work on the design, implementation, and foundations of programming models and languages that enable the robust construction of large-scale concurrent and distributed software systems.

Transcript

  1. Towards Robust Large-scale Concurrent and Distributed Programming

    Philipp Haller, Associate Professor
    School of Electrical Engineering and Computer Science
    KTH Royal Institute of Technology, Stockholm, Sweden
    20th International Symposium on Parallel and Distributed Computing (ISPDC 2021)
    30 July 2021, Virtual
  2. Cloud computing continues to grow...

  3. Bugs in cloud-based distributed systems can be catastrophic.
  4. The Northeast blackout of 2003

    A widespread power outage throughout parts of the Northeastern and Midwestern US and the Canadian province of Ontario on August 14, 2003.
    Primary cause: a data-race bug in the alarm system at the control room of FirstEnergy Corporation.
  5. Characteristics of distributed systems

    • Concurrency and parallelism
      – Distributed, networked multi-core computers
    • Asynchronicity
      – Lack of a global clock, except for expensive infrastructure based on GPS and atomic clocks [Corbett et al. 2013]
    • Failure of components
      – Computers can fail/crash
      – Networks can drop/duplicate/reorder messages
    Corbett et al. Spanner: Google's Globally Distributed Database. ACM Trans. Comput. Syst. 31(3): 8:1-8:22 (2013)
  6. Challenges for programming

    The characteristics of distributed systems lead to challenges for software development:
    • (1) Concurrent and parallel programming plagued by hazards
      – Data races, deadlocks, non-determinism, ...
    • (2) Tolerating asynchronicity and failures is difficult
      – Bad selection of protocols/distributed algorithms leads to loss of scalability, performance, availability, or consistency
    • (3) Scalability and performance hard to achieve
      – Data and control dependencies, resource contention, ...
      – Must not violate concurrency safety, fault tolerance, consistency
    Fault tolerance adds to the difficulty!
  7. Perspectives for addressing challenges

    • Computer systems
    • Data management
    • Programming languages and systems
    • Software engineering
    • Computer communication
    • ...
  8. Perspectives for addressing challenges

    • Computer systems
    • Data management
    • Programming languages and systems
    • Software engineering
    • Computer communication
    • ...
    Our perspective
  9. Background & Context

  10. Background & Context

    Scala Language
    • 2005-2014: Scala language core team member
    • 2019 ACM SIGPLAN Programming Languages Software Award
    • Integrates object-oriented and functional programming
    • Reconciles type safety and flexibility
    Close to 600 meetup groups globally
  11. Background & Context

    Scala Language

  12. Background & Context

    Scala Language · Actors in Scala
    • Haller & Odersky. Scala Actors: Unifying thread-based and event-based programming. Theoretical Computer Science 410(2-3): 202-220 (2009) (463 citations)
    • Nomination for 2010 EPFL Doctorate Award
  13. Background & Context

    • Haller & Odersky. Scala Actors: Unifying thread-based and event-based programming. Theoretical Computer Science 410(2-3): 202-220 (2009) (463 citations)
    • Nomination for 2010 EPFL Doctorate Award
    • Production use in Twitter core
    • Scala and Actors for tracking cargo on Dutch rail network
  14. Background & Context

    Scala Language · Actors in Scala
  15. Background & Context

    Scala Language · Actors in Scala · Practical concurrency & distribution
    • Leader of the standardization of Scala's futures package
    • Co-author of Scala's async/await constructs
    • Co-author of Spores, safe distributable closures
    Supported by IBM Spark Technology Center (now IBM CODAIT)
    Released in 2020 as part of all Scala versions
  16. Goal: Robust, large-scale concurrent and distributed programming

    • Reconcile
      – Fault tolerance
      – Scalability
      – Safety
    • Provide programming models and languages applicable to a variety of distributed applications
      – Instead of building single "one-off" systems for specific domains
  17. A safety challenge: data races

    What is a data race?
    • A data race occurs
      – when two tasks (threads, processes, actors) concurrently access the same shared variable (or object field) and
      – at least one of the accesses is a write (an assignment)
    • In practice, data races are difficult to find and fix
  18. A safety challenge: data races

    Q: Don't data races affect only local, non-distributed systems?
    A: No, because:
      – Nodes exploit multi-core processors
      – Processing large amounts of data or requests/tasks requires concurrency
  19. Data race: an example

        var x: Int = 0
        async {
          if (x == 0) {
            x = 1
            assert(x == 1)
          }
        }
        x = 5
  20. Data race: an example

        var x: Int = 0
        async {                  // Start a concurrent computation
          if (x == 0) {
            x = 1
            assert(x == 1)
          }
        }
        x = 5
  21. Data race: an example

        var x: Int = 0
        async {                  // Start a concurrent computation
          if (x == 0) {
            x = 1
            assert(x == 1)       // Checks whether the condition holds
          }
        }
        x = 5
  22. Data race: an example

        var x: Int = 0
        async {                  // Start a concurrent computation
          if (x == 0) {
            x = 1
            assert(x == 1)       // Checks whether the condition holds
          }
        }
        x = 5                    // Concurrent assignment
  23. Data race: an example

        var x: Int = 0
        async {                  // Start a concurrent computation
          if (x == 0) {
            x = 1
            assert(x == 1)       // Assertion may fail!
          }
        }
        x = 5                    // Concurrent assignment
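The failing schedule in the example above can be reproduced deterministically. The following is a minimal sketch, assuming `scala.concurrent.Future` as a stand-in for the slide's `async` block; the object name `DataRaceDemo`, the method `observedAfterWrite`, and the two latches are illustrative additions that force the unlucky interleaving and are not part of the talk.

```scala
import java.util.concurrent.CountDownLatch
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DataRaceDemo {
  // Runs the slide's program once, forcing the unlucky schedule with latches.
  // Returns the value the task sees at the point where the slide asserts x == 1.
  def observedAfterWrite(): Int = {
    var x: Int = 0
    val taskWrote = new CountDownLatch(1) // signalled after the task ran x = 1
    val mainWrote = new CountDownLatch(1) // signalled after main ran x = 5

    val task: Future[Int] = Future {
      if (x == 0) {
        x = 1
        taskWrote.countDown()
        mainWrote.await() // the concurrent assignment happens while we wait
        x                 // the slide's assert(x == 1) would run here
      } else x
    }

    taskWrote.await()
    x = 5                 // concurrent assignment from the main thread
    mainWrote.countDown()
    Await.result(task, 5.seconds)
  }
}
```

Under this forced schedule the task reads 5, so the slide's assertion would fail. Without the latches the outcome depends on thread timing, which is what makes such bugs hard to find.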
  24. What about higher-level abstractions?

    • The example below looks harmless...
    • ...until we inspect class C

        var input: Int = ...
        var future = async {
          var x = new C()          // Creates and uses fresh instance
          x.expensiveComputation(input)
        }
        var z = (new C()).get()

        class C:
          def expensiveComputation(init: Int): Int =
            set(init)
            ...
          def set(x: Int): Unit = Global.f = x
          def get(): Int = Global.f

        object Global:             // Shared singleton object
          var f: Int = 0           // Global.f = global variable! Data race!
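One way to remove the race in the example above is to keep the state inside each instance instead of in the shared singleton. The sketch below is an illustrative rewrite, not code from the talk: `SafeC` and `SafeCDemo` are hypothetical names, `scala.concurrent.Future` stands in for `async`, and the body elided with `...` on the slide is assumed to simply return the stored value.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical rewrite of class C: the state lives in a private field of each
// instance, so two fresh instances share nothing.
class SafeC {
  private var f: Int = 0
  // Assumption: the elided computation just stores and returns its input.
  def expensiveComputation(init: Int): Int = { set(init); get() }
  def set(x: Int): Unit = { f = x }
  def get(): Int = f
}

object SafeCDemo {
  // Mirrors the slide's client code; returns (result of the task, value of z).
  def run(input: Int): (Int, Int) = {
    val future = Future {
      val x = new SafeC()
      x.expensiveComputation(input)
    }
    val z = (new SafeC()).get() // always 0: no other task can reach this instance
    (Await.result(future, 5.seconds), z)
  }
}
```

Because each task only touches an instance it created itself, the result no longer depends on scheduling.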
  25. Idea: Static prevention of data races via type-and-effect checking

    • Prevent data races by restricting the effects of concurrent code
      – No access to shared mutable state, unless it is concurrency-safe (e.g., concurrent collections)
      – Only allowed to instantiate safe classes
    • Safe classes (do not access shared mutable state, directly or indirectly):
      – Methods only access parameters and receiver (this)
      – Methods only instantiate safe classes
      – Field types are either primitive or safe class types
      – Superclasses are safe
  26. Idea: Static prevention of data races via type-and-effect checking

    • Prevent data races by restricting the effects of concurrent code
      – No access to shared mutable state, unless it is concurrency-safe (e.g., concurrent collections)
      – Only allowed to instantiate safe classes
    • Safe classes (do not access shared mutable state, directly or indirectly):
      – Methods only access parameters and receiver (this)
      – Methods only instantiate safe classes
      – Field types are either primitive or safe class types
      – Superclasses are safe

        class C:                   // Not a safe class!
          def expensiveComputation(init: Int): Int =
            set(init)
            ...
          def set(x: Int): Unit = Global.f = x
          def get(): Int = Global.f

        object Global:
          var f: Int = 0
  27. How practical are these restrictions?

    • How common are safe classes in Scala?
    • Such classes can be used safely in concurrent code without changes!
    • Empirical study of over 75,000 LOC of open-source Scala code:

        Project          Version      SLOC     GitHub stats
        Scala stdlib     2.11.7       33,107   ✭5,795 👥 257
        Signal/Collect   8.0.6        10,159   ✭123 👥 11
        GeoTrellis       0.10.0-RC2   35,351   ✭400 👥 38
          -engine                      3,868
          -raster                     22,291
          -spark                       9,192
  28. Results of empirical study

    In the analyzed medium to large open-source Scala projects, 21-67% of all classes are safe:

        Project          #classes/traits   #safe (%)     #dir. unsafe / #unsafe (%)
        Scala stdlib     1,505             644 (43%)     212/861 (25%)
        Signal/Collect   236               159 (67%)     60/77 (78%)
        GeoTrellis
          -engine        190               40 (21%)      124/150 (83%)
          -raster        670               233 (35%)     325/437 (74%)
          -spark         326               101 (31%)     167/225 (74%)
        Total            2,927             1,177 (40%)   888/1,750 (51%)
  29. Further results

    • Formalization as a type-and-effect system:
      – Object-oriented core language with heap separation and concurrency
      – Proof of type soundness
      – Proof of isolation theorem for processes with shared heap and ownership transfer
    • Integration with the actor model of concurrency
    Haller and Loiko. LaCasa: Lightweight affinity and object capabilities in Scala. OOPSLA 2016 (Google Scholar: 31 citations)
    Haller and Odersky. Scala Actors: Unifying thread-based and event-based programming. Theor. Comput. Sci. 410(2-3): 202-220 (2009) (Google Scholar: 463 citations)
  30. Data-race freedom and determinism

    • Due to nondeterminism, concurrent programs are difficult to reason about even when they are data-race free
    • Empirical results suggest that developers prefer deterministic operations when processing collections and streams (based on an analysis of ~5.53 million lines of Java source code)
    Khatchadourian, Tang, Bagherzadeh, Ray. An Empirical Study on the Use and Misuse of Java 8 Streams. FASE 2020: 97-118
  31. From data-race freedom to determinism: Quo vadis?

    Can we strengthen the guarantee of data-race freedom to determinism?
    Requirements:
      – Extension of a general-purpose programming language supporting imperative, object-oriented, and functional programming (potential application to widely-used languages)
        • Global, shared state
        • Pervasive aliasing
      – Applicability to irregular programs
        • Irregular programs = programs organized around pointer-based data structures such as trees and graphs
  32. Our approach

    • Novel concurrent programming model building on:
      – Event-driven concurrency
      – Lattice-based data types
    • Goal: provide a static determinism guarantee:
      "All non-failing executions compute the same result."
      ("Quasi-determinism" [Kuper et al. 2014])
    Kuper et al. Freeze after writing: Quasi-deterministic parallel programming with LVars. POPL 2014
  33. Programming model

    Key constructs:
      – Cells: shared variables storing values of a lattice
      – Event handlers triggered by updates of cells
    [Diagram: cells f, g, h, each with attached code]

        f.when(g) { update =>
          if (update.value == Impure) FinalOutcome(Impure)
          else NoOutcome
        }

    Lattice provides commutative update: newValue := oldValue ⊔ update
  34. Programming model

    Upon computing the value Impure for g:
      – Event handler runs, since f depends on g
      – f is updated with a new value
      – Value of f is final → remove dependencies
    [Diagram: cells f, g, h, each with attached code]

        f.when(g) { update =>
          if (update.value == Impure) FinalOutcome(Impure)
          else NoOutcome
        }
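The commutative update rule, newValue := oldValue ⊔ update, is what makes the final cell value independent of the order in which updates arrive. The following single-threaded sketch illustrates this under stated assumptions: `Lattice`, `SetLattice`, `Cell`, `put`, `onUpdate`, and `CellDemo` are hypothetical names chosen for illustration, not the Reactive Async API, and a set-union lattice is used in place of the purity lattice from the slide.

```scala
// A join-semilattice: `bottom` is the least element, `join` is the least upper bound (⊔).
trait Lattice[V] {
  def bottom: V
  def join(a: V, b: V): V
}

object SetLattice {
  // Sets ordered by inclusion; the join is set union.
  def of[A]: Lattice[Set[A]] = new Lattice[Set[A]] {
    val bottom: Set[A] = Set.empty[A]
    def join(a: Set[A], b: Set[A]): Set[A] = a union b
  }
}

// Hypothetical, single-threaded cell (illustration only).
final class Cell[V](lattice: Lattice[V]) {
  private var current: V = lattice.bottom
  private var handlers: List[V => Unit] = Nil

  def value: V = current

  // newValue := oldValue ⊔ update; handlers run only if the value grew.
  def put(update: V): Unit = {
    val next = lattice.join(current, update)
    if (next != current) {
      current = next
      handlers.foreach(h => h(next))
    }
  }

  // Registers a handler that is triggered by updates of this cell.
  def onUpdate(handler: V => Unit): Unit = { handlers = handler :: handlers }
}

object CellDemo {
  // Applies the updates to g in the given order; f depends on g.
  def run(updates: List[Set[String]]): Set[String] = {
    val g = new Cell[Set[String]](SetLattice.of[String])
    val f = new Cell[Set[String]](SetLattice.of[String])
    g.onUpdate(v => f.put(v))
    updates.foreach(u => g.put(u))
    f.value
  }
}
```

Every permutation of the same updates leaves f with the same value, because the join is commutative, associative, and idempotent.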
  35. Application: Static program analysis

    Example: return type analysis
    Which types does method f possibly return?
    [Diagram: class type hierarchy with D, E, F, G]

        class C {
          def f(x: Int): D = if (x <= 0) g(x) else h(x-1)
          def g(y: Int): E = new E(y)
          def h(z: Int): D = if (z == 0) new F else f(z)
        }
  36. Return type analysis

    • Method f calls methods g and h; method h calls method f
    • Programming model lets us express the resulting dependencies, forming a directed graph:
    [Diagram: nodes f, g, h connected by "calls" edges]

        class C {
          def f(x: Int): D = if (x <= 0) g(x) else h(x-1)
          def g(y: Int): E = new E(y)
          def h(z: Int): D = if (z == 0) new F else f(z)
        }
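The cyclic dependency between f and h can be resolved by iterating to a fixed point over sets of types. The sketch below is a plain sequential illustration of that idea, not the parallel cell-based implementation described in the talk; the object `ReturnTypes` and its two maps are hypothetical encodings of the slide's class C.

```scala
object ReturnTypes {
  // Types instantiated and returned directly by each method body of class C.
  val allocates: Map[String, Set[String]] =
    Map("f" -> Set.empty[String], "g" -> Set("E"), "h" -> Set("F"))

  // Calls whose result the caller returns: f -> g, f -> h, h -> f.
  val returnsResultOf: Map[String, Set[String]] =
    Map("f" -> Set("g", "h"), "g" -> Set.empty[String], "h" -> Set("f"))

  // Iterates to the least fixed point; sets only grow, so the loop terminates.
  def solve(): Map[String, Set[String]] = {
    var result = allocates
    var changed = true
    while (changed) {
      val next = result.map { case (method, types) =>
        method -> returnsResultOf(method).foldLeft(types) { (acc, callee) =>
          acc union result(callee)
        }
      }
      changed = next != result
      result = next
    }
    result
  }
}
```

For the slide's example, `ReturnTypes.solve()` reports that f and h may each return an E or an F, while g returns only E.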
  37. Experimental evaluation: static analysis

    • Integration of our programming model with OPAL, a state-of-the-art JVM bytecode analysis framework
    • Implementation of two different kinds of static analyses
      – A standard purity analysis
      – The IFDS analysis framework (IFDS = Interprocedural Finite Distributive Subset)
    • Implementation of a state-of-the-art taint analysis based on IFDS
      – Finds security vulnerabilities based on dynamic class loading
  38. Parallel static analysis: Results

    [Plot: runtime (s) against number of threads for the scheduling strategies DefaultScheduling, SourcesWithManyTargetsLast, TargetsWithManyTargetsLast, TargetsWithManySourcesLast, and SourcesWithManySourcesLast, compared with OPAL - Sequential and Heros]
    • Heros: best speed-up 2.36x @ 8 threads
    • Reactive Async: 3.53x @ 8 threads, 3.98x @ 16 threads
    • Analysis of Java RE 1.7.0 update 95 (from Doop benchmarks project)
    • System: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz (10 cores) using 16 GB RAM running Ubuntu 18.04.3 and OpenJDK 1.8_212
  39. Reactive Async: Summary

    • Deterministic concurrent programming model
      – Extension of imperative, object-oriented base language (Scala)
      – Resolution of cyclic dependencies
    • Practical, applicable to irregular workloads processing large graphs
    • Result: fastest, most scalable JVM bytecode analysis framework
    Haller, Geries, Eichberg, Salvaneschi. Reactive Async: Expressive deterministic concurrency. ACM SIGPLAN Scala Symposium 2016: 11-20
    Helm, Kübler, Kölzer, Haller, Eichberg, Salvaneschi, Mezini. A programming model for semi-implicit parallelization of static analyses. ISSTA 2020: 428-439
  40. Goal: Robust, large-scale concurrent and distributed programming

    • Reconcile
      – Fault tolerance
      – Scalability
      – Safety
    • Provide programming models and languages applicable to a variety of distributed applications
  41. Context: Massively parallel and distributed systems

    • Data distributed over tens to millions of nodes
    • (Geo-)distribution across datacenters
    • Reliable, large-scale processing of batch and streaming data
    [Diagram labels: Cloud, Data Center, fault tolerance]
  42. Programming model for lineage-based distributed computation

    Guiding question: Can we provide provable fault-tolerance guarantees for a general class of distributed applications?
    Approach:
    • Design of programming model with lineage as first-class construct for fault recovery
      – Lineage = “information that permits reconstructing a dataset”
    • Functional processing of distributed data, similar to big data frameworks like MapReduce, Flink, and Spark
    • Strong static typing
    First step: Precise formalization of fault recovery mechanism
  43. Example: creation and materialization of lineage

    Let’s make an interesting DAG!
    [Diagram labels: Silo[List[Person]], Machine 1, SiloRef[List[Person]], Machine 2, persons]

        val persons: SiloRef[List[Person]] = ...
  44. Example: creation and materialization of lineage

    Let’s make an interesting DAG!
    [Diagram labels: Silo[List[Person]], Machine 1, SiloRef[List[Person]], Machine 2, persons, adults]

        val persons: SiloRef[List[Person]] = ...
        val adults = persons.apply(spore { ps =>
          val res = ps.filter(p => p.age >= 18)
          SiloRef.populate(currentHost, res)
        })
  45. Example: creation and materialization of lineage

    Let’s make an interesting DAG!
    [Diagram labels: Silo[List[Person]], Machine 1, SiloRef[List[Person]], Machine 2, persons, adults, owners, vehicles]

        val persons: SiloRef[List[Person]] = ...
        val adults = persons.apply(spore { ps =>
          val res = ps.filter(p => p.age >= 18)
          SiloRef.populate(currentHost, res)
        })
        val vehicles: SiloRef[List[Vehicle]] = ...
        // adults that own a vehicle
        val owners = adults.apply(spore {
          val localVehicles = vehicles // spore header
          ps => localVehicles.apply(spore {
            val localps = ps // spore header
            vs => SiloRef.populate(currentHost,
              localps.flatMap(p =>
                // list of (p, v) for a single person p
                vs.flatMap { v =>
                  if (v.owner.name == p.name) List((p, v)) else Nil
                }
              )
            )
          })
        })
  46. Example: creation and materialization of lineage

    Let’s make an interesting DAG!
    [Diagram labels: Silo[List[Person]], Machine 1, SiloRef[List[Person]], Machine 2, persons, adults, owners, vehicles]

        val persons: SiloRef[List[Person]] = ...
        val adults = persons.apply(spore { ps =>
          val res = ps.filter(p => p.age >= 18)
          SiloRef.populate(currentHost, res)
        })
        val vehicles: SiloRef[List[Vehicle]] = ...
        // adults that own a vehicle
        val owners = adults.apply(...)
  47. Example: creation and materialization of lineage

    Let’s make an interesting DAG!
    [Diagram labels: Silo[List[Person]], Machine 1, SiloRef[List[Person]], Machine 2, persons, adults, owners, vehicles, sorted, labels]

        val persons: SiloRef[List[Person]] = ...
        val adults = persons.apply(spore { ps =>
          val res = ps.filter(p => p.age >= 18)
          SiloRef.populate(currentHost, res)
        })
        val vehicles: SiloRef[List[Vehicle]] = ...
        // adults that own a vehicle
        val owners = adults.apply(...)
        val sorted = adults.apply(spore { ps =>
          SiloRef.populate(currentHost, ps.sortBy(p => p.age))
        })
        val labels = sorted.apply(spore { ps =>
          SiloRef.populate(currentHost, ps.map(p => "Hi, " + p.name))
        })
  48. Example: creation and materialization of lineage

    Let’s make an interesting DAG!
    [Diagram labels: Silo[List[Person]], Machine 1, SiloRef[List[Person]], Machine 2, persons, adults, owners, vehicles, sorted, labels]
    So far we have just staged the computation; we haven’t yet “kicked it off”.

        val persons: SiloRef[List[Person]] = ...
        val adults = persons.apply(spore { ps =>
          val res = ps.filter(p => p.age >= 18)
          SiloRef.populate(currentHost, res)
        })
        val vehicles: SiloRef[List[Vehicle]] = ...
        // adults that own a vehicle
        val owners = adults.apply(...)
        val sorted = adults.apply(spore { ps =>
          SiloRef.populate(currentHost, ps.sortBy(p => p.age))
        })
        val labels = sorted.apply(spore { ps =>
          SiloRef.populate(currentHost, ps.map(p => "Hi, " + p.name))
        })
  49. Example: creation and materialization of lineage

    Let’s make an interesting DAG!
    [Diagram labels: Silo[List[Person]], Machine 1, SiloRef[List[Person]], Machine 2, persons, adults, owners, vehicles, sorted, labels, λ: List[Person] ⇒ List[String], Silo[List[String]]]

        val persons: SiloRef[List[Person]] = ...
        val adults = persons.apply(spore { ps =>
          val res = ps.filter(p => p.age >= 18)
          SiloRef.populate(currentHost, res)
        })
        val vehicles: SiloRef[List[Vehicle]] = ...
        // adults that own a vehicle
        val owners = adults.apply(...)
        val sorted = adults.apply(spore { ps =>
          SiloRef.populate(currentHost, ps.sortBy(p => p.age))
        })
        val labels = sorted.apply(spore { ps =>
          SiloRef.populate(currentHost, ps.map(p => "Hi, " + p.name))
        })
        labels.persist().send()
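The distinction between staging a computation and materializing it can be illustrated locally. The sketch below is a single-process stand-in written under explicit assumptions: `LocalRef`, `LineageDemo`, and `sourceReads` are hypothetical names, the sample persons are made up, and `apply` takes a plain function rather than a spore returning a new `SiloRef`. It only shows that a reference can carry its lineage and recompute the data on demand.

```scala
final case class Person(name: String, age: Int)

// Hypothetical local stand-in for a SiloRef: it stores only the lineage, i.e.
// a recipe for recomputing the data, and computes nothing until send().
final class LocalRef[A](lineage: () => A) {
  // Stages a transformation: extends the lineage without running it.
  def apply[B](f: A => B): LocalRef[B] = new LocalRef[B](() => f(lineage()))
  // Materializes the data by replaying the lineage from the source.
  def send(): A = lineage()
}

object LineageDemo {
  var sourceReads = 0 // counts how often the source data is actually read

  def labels(): LocalRef[List[String]] = {
    val persons = new LocalRef[List[Person]](() => {
      sourceReads += 1
      List(Person("Ann", 41), Person("Bob", 12), Person("Cy", 23)) // made-up data
    })
    val adults = persons.apply(ps => ps.filter(p => p.age >= 18))
    val sorted = adults.apply(ps => ps.sortBy(p => p.age))
    sorted.apply(ps => ps.map(p => "Hi, " + p.name))
  }
}
```

Calling `LineageDemo.labels()` reads no data; each `send()` replays the whole chain from the source. In this simplified picture, that replay is what a lineage-based system relies on to rebuild lost data after a failure.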
  50. Lineage-based distributed computation: Results

    • Formalization
      – Typed lambda-calculus with distributed hosts, silos and lineages
      – Asynchronous, distributed operational semantics
    • Proof establishing the preservation of lineage mobility
    • Proof of finite materialization of remote, lineage-based data
    First correctness results for a core calculus of lineage-based distributed computation
    Haller, Miller, Müller. A programming model and foundation for lineage-based distributed computation. J. Funct. Program. 28: e7 (2018)
  51. Further directions

    • Security & Privacy
      – Privacy-aware distribution
      – Information-flow security
      Salvaneschi, Köhler, Sokolowski, Haller, Erdweg, Mezini. Language-integrated privacy-aware distributed queries. Proc. ACM Program. Lang. 3(OOPSLA): 167:1-167:30 (2019)
    • Chaos Engineering
      – Testing hypotheses about resilience in production systems
      Zhang, Morin, Haller, Baudry, Monperrus. A chaos engineering system for live analysis and falsification of exception-handling in the JVM. IEEE Transactions on Software Engineering, to appear
    • Distributed Shared State
      – Consistency, availability, partition tolerance
      – Replicated data types
      Zhao, Haller. Replicated data types that unify eventual consistency and observable atomic consistency. J. Log. Algebraic Methods Program. 114: 100561 (2020)
      Zhao, Haller. Consistency types for replicated data in a higher-order distributed programming language. Art Sci. Eng. Program. 5(2): 6 (2021)
      Xin Zhao
  52. 10-year vision (1): From centralized to resilient decentralized computing

    • Overcoming the limitations of centralized clouds: communication latencies, data privacy
    • Enabling applications spanning edge and cloud
    • Challenges:
      – Inherent asynchrony
      – Partial failure
      – Privacy, integrity
    • Objective:
      – Practical programming system for resilient decentralized computing
    Example application: Reinforcement learning deployed across edge and cloud
    [Diagram labels: eventual consistency, Offline Computation, Secure Data Center]
    Spenger, Carbone, Haller. WIP: Pods: Privacy Compliant Scalable Decentralized Data Services. Poly'21 @ VLDB (2021)
  53. 10-year vision (2): Semi-implicit massively parallel and distributed programming

    Multi-core processors and accelerators (GPGPUs, FPGAs, etc.) are the only means to higher scalability.
    Challenges:
    • Safety and liveness hazards, nondeterminism
    • Weak memory models
    • Distribution: partial failure, asynchronicity
    Objectives:
    • Deterministic concurrent programming system
    • Safe, modular composition of deterministic and nondeterministic code
    • Automatic identification of opportunities for parallelization
    Source: C. E. Leiserson et al., Science 368, eaam9744 (2020). DOI: 10.1126/science.aam9744
  54. 10-year vision (3): Safe smart contract programming

    • Increased demands on data protection, e.g., GDPR, CCPA
    • Smart contracts important for management of digital assets
    • Challenges:
      – How to provably enforce consented data access policies?
      – How to prevent critical vulnerabilities in smart contract languages?
    • Objective:
      – Safe smart contract languages, with provable correctness properties
    Image source: ECB
  55. Conclusion

    Goal: Robust, large-scale concurrent and distributed programming (fault tolerance, scalability, safety)
    Methods:
    • Programming languages and systems: Design & implementation
    • Type systems: Theory & practice
    • Experimental evaluation
    • Empirical studies
    Projects: Scala Actors, LaCasa, Reactive Async, Function passing, Spores
    Thank You!