Slide 1

Slide 1 text

Towards Robust Large-scale Concurrent and Distributed Programming
Philipp Haller
Associate Professor, School of Electrical Engineering and Computer Science
KTH Royal Institute of Technology, Stockholm, Sweden
20th International Symposium on Parallel and Distributed Computing (ISPDC 2021)
30 July 2021, Virtual

Slide 2

Slide 2 text

Cloud computing continues to grow...

Slide 3

Slide 3 text

Bugs in cloud-based distributed systems can be catastrophic.

Slide 4

Slide 4 text

The Northeast blackout of 2003: a widespread power outage throughout parts of the Northeastern and Midwestern US and the Canadian province of Ontario on August 14, 2003.
Primary cause: a data-race bug in the alarm system in the control room of FirstEnergy Corporation.

Slide 5

Slide 5 text

Characteristics of distributed systems
• Concurrency and parallelism
  – Distributed, networked multi-core computers
• Asynchronicity
  – Lack of a global clock
  – Except for expensive infrastructure based on GPS and atomic clocks [Corbett et al. 2013]
• Failure of components
  – Computers can fail/crash
  – Networks can drop/duplicate/reorder messages
Corbett et al. Spanner: Google's Globally Distributed Database. ACM Trans. Comput. Syst. 31(3): 8:1-8:22 (2013)

Slide 6

Slide 6 text

Challenges for programming
The characteristics of distributed systems lead to challenges for software development:
• Concurrent and parallel programming is plagued by hazards
  – Data races, deadlocks, non-determinism, ...
• Tolerating asynchronicity and failures is difficult
  – A poor choice of protocols or distributed algorithms leads to loss of scalability, performance, availability, or consistency
• Scalability and performance are hard to achieve
  – Data and control dependencies, resource contention, ...
  – Must not violate concurrency safety, fault tolerance, consistency
Fault tolerance adds to the difficulty!

Slide 7

Slide 7 text

Perspectives for addressing challenges
• Computer systems
• Data management
• Programming languages and systems
• Software engineering
• Computer communication
• ...

Slide 8

Slide 8 text

Perspectives for addressing challenges
• Computer systems
• Data management
• Programming languages and systems (our perspective)
• Software engineering
• Computer communication
• ...

Slide 9

Slide 9 text

Background & Context

Slide 10

Slide 10 text

Background & Context: Scala Language
• Integrates object-oriented and functional programming
• Reconciles type safety and flexibility
• 2005-2014: Scala language core team member
• 2019: ACM SIGPLAN Programming Languages Software Award
• Close to 600 meetup groups globally

Slide 11

Slide 11 text

Background & Context: Scala Language

Slide 12

Slide 12 text

Background & Context: Scala Language, Actors in Scala
• Haller & Odersky. Scala Actors: Unifying thread-based and event-based programming. Theoretical Computer Science 410(2-3): 202-220 (2009)
  – 463 citations
• Nomination for 2010 EPFL Doctorate Award

Slide 13

Slide 13 text

Background & Context
• Haller & Odersky. Scala Actors: Unifying thread-based and event-based programming. Theoretical Computer Science 410(2-3): 202-220 (2009)
  – 463 citations
• Nomination for 2010 EPFL Doctorate Award
Production use in Twitter core
Scala and Actors for tracking cargo on the Dutch rail network

Slide 14

Slide 14 text

Background & Context: Scala Language, Actors in Scala

Slide 15

Slide 15 text

Background & Context: Scala Language, Actors in Scala, Practical concurrency & distribution
• Leader of the standardization of Scala's futures package
• Co-author of Scala's async/await constructs
• Co-author of Spores, safe distributable closures
Supported by IBM Spark Technology Center (now IBM CODAIT)
Released in 2020 as part of all Scala versions

Slide 16

Slide 16 text

Goal: Robust, large-scale concurrent and distributed programming
• Reconcile
  – Fault tolerance
  – Scalability
  – Safety
• Provide programming models and languages applicable to a variety of distributed applications
  – Instead of building single "one-off" systems for specific domains
[Diagram: Fault tolerance / Scalability / Safety]

Slide 17

Slide 17 text

A safety challenge: data races
What is a data race?
• A data race occurs
  – when two tasks (threads, processes, actors) concurrently access the same shared variable (or object field) and
  – at least one of the accesses is a write (an assignment)
• In practice, data races are difficult to find and fix
[Diagram: Fault tolerance / Scalability / Safety]

Slide 18

Slide 18 text

A safety challenge: data races
Q: Don't data races affect only local, non-distributed systems?
A: No, because distributed systems also rely on concurrency within each node:
  – Exploitation of multi-core processors
  – Processing of large amounts of data or requests/tasks requires concurrency
[Diagram: Fault tolerance / Scalability / Safety]

Slide 19

Slide 19 text

Data race: an example

  var x: Int = 0
  async {
    if (x == 0) {
      x = 1
      assert(x == 1)
    }
  }
  x = 5

Slide 20

Slide 20 text

Data race: an example

  var x: Int = 0
  async {                // Start a concurrent computation
    if (x == 0) {
      x = 1
      assert(x == 1)
    }
  }
  x = 5

Slide 21

Slide 21 text

Data race: an example

  var x: Int = 0
  async {                // Start a concurrent computation
    if (x == 0) {
      x = 1
      assert(x == 1)     // Checks whether the condition holds
    }
  }
  x = 5

Slide 22

Slide 22 text

Data race: an example

  var x: Int = 0
  async {                // Start a concurrent computation
    if (x == 0) {
      x = 1
      assert(x == 1)     // Checks whether the condition holds
    }
  }
  x = 5                  // Concurrent assignment

Slide 23

Slide 23 text

Data race: an example

  var x: Int = 0
  async {                // Start a concurrent computation
    if (x == 0) {
      x = 1
      assert(x == 1)     // Assertion may fail!
    }
  }
  x = 5                  // Concurrent assignment
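
One conventional way to remove the race in the example above is to make the check-then-act step atomic. The following is a minimal sketch, assuming the slide's `async` block can be modelled with a standard-library `Future` and the shared variable with `java.util.concurrent.atomic.AtomicInteger`; the object and method names are illustrative and do not come from the talk.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object AtomicCheckThenAct {
  // Returns (whether the concurrent task's check succeeded, final value of x).
  def run(): (Boolean, Int) = {
    val x = new AtomicInteger(0)
    // Assumption: Future stands in for the slide's `async { ... }`.
    val task = Future {
      // Atomically: if x == 0 then x = 1. No other write can slip in between
      // the test and the assignment.
      x.compareAndSet(0, 1)
    }
    x.set(5) // the concurrent assignment from the slide
    val won = Await.result(task, 5.seconds)
    (won, x.get())
  }
}
```

The sketch only makes the test and the write indivisible. Which of the two tasks runs first still depends on the scheduler, so the first component of the result may differ between runs, and an assertion placed after the `compareAndSet` could still observe 5.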

Slide 24

Slide 24 text

What about higher-level abstractions?
• The example below looks harmless...
• ...until we inspect class C

  var input: Int = ...
  var future = async {
    var x = new C()                       // Creates and uses fresh instance
    x.expensiveComputation(input)
  }
  var z = (new C()).get()

  class C:
    def expensiveComputation(init: Int): Int =
      set(init)
      ...
    def set(x: Int): Unit = Global.f = x  // Data race!
    def get(): Int = Global.f

  object Global:                          // Shared singleton object
    var f: Int = 0                        // Global.f = global variable!

Slide 25

Slide 25 text

Idea: Static prevention of data races via type-and-effect checking
• Prevent data races by restricting the effects of concurrent code
  – No access to shared mutable state, unless it is concurrency-safe (e.g., concurrent collections)
  – Only allowed to instantiate safe classes
• Safe classes (do not access shared mutable state, directly or indirectly):
  – Methods only access parameters and receiver (this)
  – Methods only instantiate safe classes
  – Field types are either primitive or safe class types
  – Superclasses are safe

Slide 26

Slide 26 text

Idea: Static prevention of data races via type-and-effect checking
• Prevent data races by restricting the effects of concurrent code
  – No access to shared mutable state, unless it is concurrency-safe (e.g., concurrent collections)
  – Only allowed to instantiate safe classes
• Safe classes (do not access shared mutable state, directly or indirectly):
  – Methods only access parameters and receiver (this)
  – Methods only instantiate safe classes
  – Field types are either primitive or safe class types
  – Superclasses are safe

  class C:                                // Not a safe class!
    def expensiveComputation(init: Int): Int =
      set(init)
      ...
    def set(x: Int): Unit = Global.f = x
    def get(): Int = Global.f

  object Global:
    var f: Int = 0
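
For contrast with class C from the slides, here is a minimal sketch of a variant that would satisfy the safe-class rules as stated: its methods touch only parameters and the receiver, it instantiates nothing, and its only field is primitive. The names `SafeC` and `SafeCDemo` are hypothetical, and the body of `expensiveComputation` is a placeholder because the original body is elided.

```scala
// Hypothetical rewrite of class C: state lives in the instance, so no
// shared mutable state is reachable from its methods.
class SafeC {
  private var f: Int = 0

  def set(x: Int): Unit = { f = x }
  def get(): Int = f

  // Placeholder for the elided computation: stores and returns the input.
  def expensiveComputation(init: Int): Int = {
    set(init)
    get()
  }
}

object SafeCDemo {
  // Two fresh instances do not interfere with each other.
  def run(): (Int, Int) = {
    val a = new SafeC
    val b = new SafeC
    val r = a.expensiveComputation(42)
    (r, b.get())
  }
}
```

Because each instance owns its state, a fresh `SafeC` created inside a concurrent task cannot race with a fresh `SafeC` created elsewhere.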

Slide 27

Slide 27 text

How practical are these restrictions?
• How common are safe classes in Scala?
• Such classes can be used safely in concurrent code without changes!
• Empirical study of over 75,000 LOC of open-source Scala code:

  Project         Version      SLOC     GitHub stats
  Scala stdlib    2.11.7       33,107   ✭ 5,795  👥 257
  Signal/Collect  8.0.6        10,159   ✭ 123    👥 11
  GeoTrellis      0.10.0-RC2   35,351   ✭ 400    👥 38
    -engine                     3,868
    -raster                    22,291
    -spark                      9,192

Slide 28

Slide 28 text

Results of empirical study
In the analyzed medium to large open-source Scala projects, 21-67% of all classes are safe:

  Project         #classes/traits  #safe (%)     #directly unsafe / #unsafe (%)
  Scala stdlib    1,505            644 (43%)     212/861 (25%)
  Signal/Collect  236              159 (67%)     60/77 (78%)
  GeoTrellis
    -engine       190              40 (21%)      124/150 (83%)
    -raster       670              233 (35%)     325/437 (74%)
    -spark        326              101 (31%)     167/225 (74%)
  Total           2,927            1,177 (40%)   888/1,750 (51%)

Slide 29

Slide 29 text

Further results
• Formalization as a type-and-effect system:
  – Object-oriented core language with heap separation and concurrency
  – Proof of type soundness
  – Proof of isolation theorem for processes with shared heap and ownership transfer
• Integration with the actor model of concurrency
Haller and Loiko. LaCasa: Lightweight affinity and object capabilities in Scala. OOPSLA 2016 (Google Scholar: 31 citations)
Haller and Odersky. Scala Actors: Unifying thread-based and event-based programming. Theor. Comput. Sci. 410(2-3): 202-220 (2009) (Google Scholar: 463 citations)

Slide 30

Slide 30 text

Data-race freedom and determinism
• Due to nondeterminism, concurrent programs are difficult to reason about even when they are data-race free
• Empirical results suggest that developers prefer deterministic operations when processing collections and streams (based on an analysis of ~5.53 million lines of Java source code)
Khatchadourian, Tang, Bagherzadeh, Ray. An Empirical Study on the Use and Misuse of Java 8 Streams. FASE 2020: 97-118

Slide 31

Slide 31 text

From data-race freedom to determinism: Quo vadis?
Can we strengthen the guarantee of data-race freedom to determinism?
Requirements:
  – Extension of a general-purpose programming language supporting imperative, object-oriented, and functional programming (potential application to widely-used languages)
    • Global, shared state
    • Pervasive aliasing
  – Applicability to irregular programs
    • Irregular programs = programs organized around pointer-based data structures such as trees and graphs

Slide 32

Slide 32 text

Our approach
• Novel concurrent programming model building on:
  – Event-driven concurrency
  – Lattice-based data types
• Goal: provide a static determinism guarantee:
  "All non-failing executions compute the same result."
  This is "quasi-determinism" [Kuper et al. 2014]
Kuper et al. Freeze after writing: Quasi-deterministic parallel programming with LVars. POPL 2014

Slide 33

Slide 33 text

Programming model
Key constructs:
  – Cells: shared variables storing values of a lattice
  – Event handlers triggered by updates of cells

  f.when(g) { update =>
    if (update.value == Impure) FinalOutcome(Impure)
    else NoOutcome
  }

The lattice provides a commutative update: newValue := oldValue ⊔ update
[Diagram: cells f, g, h]

Slide 34

Slide 34 text

Programming model
Upon computing the value Impure for g:
  – The event handler runs, since f depends on g
  – f is updated with a new value
  – The value of f is final → remove dependencies

  f.when(g) { update =>
    if (update.value == Impure) FinalOutcome(Impure)
    else NoOutcome
  }

[Diagram: cells f, g, h]
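
The interplay of cells, lattice joins, and handlers can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not the Reactive Async API: every name (`Lattice`, `Cell`, `put`, `NextOutcome`, `PurityLattice`, `UnknownPurity`, `CellDemo`) is hypothetical, the purity lattice is assumed to be the chain UnknownPurity ⊑ Pure ⊑ Impure, handlers receive the updated value directly rather than an `update` object, and there is no concurrency or thread safety.

```scala
// All names below are hypothetical; this is not the Reactive Async API.
trait Lattice[V] {
  def bottom: V
  def join(a: V, b: V): V // least upper bound
}

sealed trait Outcome[+V]
case object NoOutcome extends Outcome[Nothing]
final case class NextOutcome[+V](value: V) extends Outcome[V]
final case class FinalOutcome[+V](value: V) extends Outcome[V]

final class Cell[V](lattice: Lattice[V]) {
  private var current: V = lattice.bottom
  private var completed = false
  private var dependents: List[V => Unit] = Nil

  def value: V = current
  def isComplete: Boolean = completed

  // newValue := oldValue ⊔ update; updates to a completed cell are ignored.
  def put(update: V, isFinal: Boolean = false): Unit =
    if (!completed) {
      val joined = lattice.join(current, update)
      val changed = joined != current
      current = joined
      if (isFinal) completed = true
      if (changed || isFinal) dependents.foreach(cb => cb(current))
    }

  // Run `handler` whenever `other` is updated; its outcome updates this cell.
  // Once this cell is final, the callback becomes a no-op, which models
  // "remove dependencies".
  def when(other: Cell[V])(handler: V => Outcome[V]): Unit = {
    val callback: V => Unit = { v =>
      if (!completed) {
        handler(v) match {
          case NoOutcome       => ()
          case NextOutcome(u)  => put(u)
          case FinalOutcome(u) => put(u, isFinal = true)
        }
      }
    }
    other.dependents = callback :: other.dependents
  }
}

sealed trait Purity
case object UnknownPurity extends Purity
case object Pure extends Purity
case object Impure extends Purity

// Assumed ordering: UnknownPurity ⊑ Pure ⊑ Impure.
object PurityLattice extends Lattice[Purity] {
  private def rank(p: Purity): Int = p match {
    case UnknownPurity => 0
    case Pure          => 1
    case Impure        => 2
  }
  val bottom: Purity = UnknownPurity
  def join(a: Purity, b: Purity): Purity = if (rank(a) >= rank(b)) a else b
}

object CellDemo {
  def run(): (Purity, Boolean) = {
    val f = new Cell[Purity](PurityLattice)
    val g = new Cell[Purity](PurityLattice)
    f.when(g) { update =>
      if (update == Impure) FinalOutcome(Impure) else NoOutcome
    }
    g.put(Impure, isFinal = true) // triggers the handler registered on g
    (f.value, f.isComplete)
  }
}
```

Because `join` is commutative, associative, and idempotent, the final value of a cell does not depend on the order in which updates arrive, which is the property the determinism guarantee rests on.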

Slide 35

Slide 35 text

Application: Static program analysis
Example: return type analysis
Which types does method f possibly return?
[Diagram: class type hierarchy over D, E, F, G]

  class C {
    def f(x: Int): D = if (x <= 0) g(x) else h(x-1)
    def g(y: Int): E = new E(y)
    def h(z: Int): D = if (z == 0) new F else f(z)
  }

Slide 36

Slide 36 text

Return type analysis
• Method f calls methods g and h; method h calls method f
• The programming model lets us express the resulting dependencies, forming a directed graph:
[Diagram: nodes f, g, h with "calls" edges f → g, f → h, h → f]

  class C {
    def f(x: Int): D = if (x <= 0) g(x) else h(x-1)
    def g(y: Int): E = new E(y)
    def h(z: Int): D = if (z == 0) new F else f(z)
  }
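
To make the cyclic dependency between f and h concrete, the analysis for this one example can be sketched as a plain sequential fixed-point iteration over sets of type names, joined by set union. This is an illustration only: it does not use the talk's cells and handlers, the object and value names are hypothetical, and the method facts are transcribed by hand from class C above.

```scala
object ReturnTypeAnalysis {
  // Types each method instantiates and returns directly.
  val direct: Map[String, Set[String]] = Map(
    "f" -> Set.empty[String],
    "g" -> Set("E"),
    "h" -> Set("F")
  )

  // Methods whose result each method may return (the "calls" edges).
  val returnsResultOf: Map[String, Set[String]] = Map(
    "f" -> Set("g", "h"),
    "g" -> Set.empty[String],
    "h" -> Set("f")
  )

  // Iterate until no set grows any more; union is the lattice join.
  def analyze(): Map[String, Set[String]] = {
    var result = direct
    var changed = true
    while (changed) {
      changed = false
      for ((method, callees) <- returnsResultOf) {
        val updated = callees.foldLeft(result(method))((acc, c) => acc union result(c))
        if (updated != result(method)) {
          result = result.updated(method, updated)
          changed = true
        }
      }
    }
    result
  }
}
```

The iteration terminates because the sets only grow and the universe of type names is finite; the cycle between f and h is resolved once neither set changes.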

Slide 37

Slide 37 text

Experimental evaluation: static analysis
• Integration of our programming model with OPAL, a state-of-the-art JVM bytecode analysis framework
• Implementation of two different kinds of static analyses
  – A standard purity analysis
  – The IFDS analysis framework (IFDS = Interprocedural Finite Distributive Subset)
• Implementation of a state-of-the-art taint analysis based on IFDS
  – Finds security vulnerabilities based on dynamic class loading

Slide 38

Slide 38 text

Parallel static analysis: Results
[Figure: runtime (s) versus number of threads for the scheduling strategies DefaultScheduling, SourcesWithManyTargetsLast, TargetsWithManyTargetsLast, TargetsWithManySourcesLast, and SourcesWithManySourcesLast, compared with OPAL - Sequential and Heros]
• Heros: best speed-up 2.36x @ 8 threads
• Reactive Async: 3.53x @ 8 threads, 3.98x @ 16 threads
• Analysis of Java RE 1.7.0 update 95 (from Doop benchmarks project)
• System: Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz (10 cores) using 16 GB RAM running Ubuntu 18.04.3 and OpenJDK 1.8_212

Slide 39

Slide 39 text

Reactive Async: Summary
• Deterministic concurrent programming model
  – Extension of imperative, object-oriented base language (Scala)
  – Resolution of cyclic dependencies
• Practical, applicable to irregular workloads processing large graphs
• Result: fastest, most scalable JVM bytecode analysis framework
Haller, Geries, Eichberg, Salvaneschi. Reactive Async: Expressive deterministic concurrency. ACM SIGPLAN Scala Symposium 2016: 11-20
Helm, Kübler, Kölzer, Haller, Eichberg, Salvaneschi, Mezini. A programming model for semi-implicit parallelization of static analyses. ISSTA 2020: 428-439

Slide 40

Slide 40 text

Goal: Robust, large-scale concurrent and distributed programming
• Reconcile
  – Fault tolerance
  – Scalability
  – Safety
• Provide programming models and languages applicable to a variety of distributed applications
[Diagram: Fault tolerance / Scalability / Safety]

Slide 41

Slide 41 text

Context: Massively parallel and distributed systems
• Data distributed over tens to millions of nodes
• (Geo-)distribution across datacenters
• Reliable, large-scale processing of batch and streaming data
[Diagram labels: Cloud, Data Center, fault tolerance]

Slide 42

Slide 42 text

Programming model for lineage-based distributed computation
Guiding question: Can we provide provable fault-tolerance guarantees for a general class of distributed applications?
Approach:
• Design of programming model with lineage as first-class construct for fault recovery
  – Lineage = “information that permits reconstructing a dataset”
• Functional processing of distributed data, similar to big data frameworks like MapReduce, Flink, and Spark
• Strong static typing
First step: Precise formalization of fault recovery mechanism

Slide 43

Slide 43 text

Example: creation and materialization of lineage
Let’s make an interesting DAG!
[Diagram: Machine 1 and Machine 2; persons is a SiloRef[List[Person]] referring to a Silo[List[Person]]]

  val persons: SiloRef[List[Person]] = ...

Slide 44

Slide 44 text

Example: creation and materialization of lineage
Let’s make an interesting DAG!
[Diagram: Machine 1 and Machine 2; persons is a SiloRef[List[Person]] referring to a Silo[List[Person]]; DAG nodes: persons, adults]

  val persons: SiloRef[List[Person]] = ...

  val adults = persons.apply(spore { ps =>
    val res = ps.filter(p => p.age >= 18)
    SiloRef.populate(currentHost, res)
  })

Slide 45

Slide 45 text

Example: creation and materialization of lineage
Let’s make an interesting DAG!
[Diagram: Machine 1 and Machine 2; persons is a SiloRef[List[Person]] referring to a Silo[List[Person]]; DAG nodes: persons, vehicles, adults, owners]

  val persons: SiloRef[List[Person]] = ...
  val vehicles: SiloRef[List[Vehicle]] = ...

  val adults = persons.apply(spore { ps =>
    val res = ps.filter(p => p.age >= 18)
    SiloRef.populate(currentHost, res)
  })

  // adults that own a vehicle
  val owners = adults.apply(spore {
    val localVehicles = vehicles // spore header
    ps => localVehicles.apply(spore {
      val localps = ps // spore header
      vs => SiloRef.populate(currentHost,
        localps.flatMap(p =>
          // list of (p, v) for a single person p
          vs.flatMap { v =>
            if (v.owner.name == p.name) List((p, v)) else Nil
          }
        ))
    })
  })

Slide 46

Slide 46 text

Example: creation and materialization of lineage
Let’s make an interesting DAG!
[Diagram: Machine 1 and Machine 2; persons is a SiloRef[List[Person]] referring to a Silo[List[Person]]; DAG nodes: persons, vehicles, adults, owners]

  val persons: SiloRef[List[Person]] = ...
  val vehicles: SiloRef[List[Vehicle]] = ...

  val adults = persons.apply(spore { ps =>
    val res = ps.filter(p => p.age >= 18)
    SiloRef.populate(currentHost, res)
  })

  // adults that own a vehicle
  val owners = adults.apply(...)

Slide 47

Slide 47 text

Example: creation and materialization of lineage
Let’s make an interesting DAG!
[Diagram: Machine 1 and Machine 2; persons is a SiloRef[List[Person]] referring to a Silo[List[Person]]; DAG nodes: persons, vehicles, adults, owners, sorted, labels]

  val persons: SiloRef[List[Person]] = ...
  val vehicles: SiloRef[List[Vehicle]] = ...

  val adults = persons.apply(spore { ps =>
    val res = ps.filter(p => p.age >= 18)
    SiloRef.populate(currentHost, res)
  })

  // adults that own a vehicle
  val owners = adults.apply(...)

  val sorted = adults.apply(spore { ps =>
    SiloRef.populate(currentHost, ps.sortBy(p => p.age))
  })

  val labels = sorted.apply(spore { ps =>
    SiloRef.populate(currentHost, ps.map(p => "Hi, " + p.name))
  })

Slide 48

Slide 48 text

Example: creation and materialization of lineage
Let’s make an interesting DAG!
[Diagram: Machine 1 and Machine 2; persons is a SiloRef[List[Person]] referring to a Silo[List[Person]]; DAG nodes: persons, vehicles, adults, owners, sorted, labels]

  val persons: SiloRef[List[Person]] = ...
  val vehicles: SiloRef[List[Vehicle]] = ...

  val adults = persons.apply(spore { ps =>
    val res = ps.filter(p => p.age >= 18)
    SiloRef.populate(currentHost, res)
  })

  // adults that own a vehicle
  val owners = adults.apply(...)

  val sorted = adults.apply(spore { ps =>
    SiloRef.populate(currentHost, ps.sortBy(p => p.age))
  })

  val labels = sorted.apply(spore { ps =>
    SiloRef.populate(currentHost, ps.map(p => "Hi, " + p.name))
  })

So far we have only staged the computation; we haven’t yet “kicked it off”.

Slide 49

Slide 49 text

Example: creation and materialization of lineage
Let’s make an interesting DAG!
[Diagram: Machine 1 and Machine 2; persons is a SiloRef[List[Person]] referring to a Silo[List[Person]]; DAG nodes: persons, vehicles, adults, owners, sorted, labels; λ: List[Person] ⇒ List[String] produces a Silo[List[String]]]

  val persons: SiloRef[List[Person]] = ...
  val vehicles: SiloRef[List[Vehicle]] = ...

  val adults = persons.apply(spore { ps =>
    val res = ps.filter(p => p.age >= 18)
    SiloRef.populate(currentHost, res)
  })

  // adults that own a vehicle
  val owners = adults.apply(...)

  val sorted = adults.apply(spore { ps =>
    SiloRef.populate(currentHost, ps.sortBy(p => p.age))
  })

  val labels = sorted.apply(spore { ps =>
    SiloRef.populate(currentHost, ps.map(p => "Hi, " + p.name))
  })

  labels.persist().send()
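
The distinction between staging a computation and materializing it can be modelled locally in a few lines. The sketch below is not the API from the talk: `Lineage`, `Source`, `Applied`, `materialize`, and `LineageDemo` are hypothetical names, everything runs in one process, and the sample persons are made up for illustration. It only shows that a lineage is a description that can be replayed to reconstruct a dataset.

```scala
// Hypothetical local model of lineage; not the API from the talk.
sealed trait Lineage[T] {
  def materialize(): T
}

// A dataset that can be (re)loaded from its origin.
final case class Source[T](load: () => T) extends Lineage[T] {
  def materialize(): T = load()
}

// A transformation recorded on top of a parent lineage.
final case class Applied[S, T](parent: Lineage[S], f: S => T) extends Lineage[T] {
  def materialize(): T = f(parent.materialize())
}

final case class Person(name: String, age: Int)

object LineageDemo {
  var loads = 0 // counts how often the source data is actually read

  val persons: Lineage[List[Person]] = Source { () =>
    loads += 1
    List(Person("Ann", 41), Person("Bob", 12), Person("Cy", 23)) // made-up data
  }

  // Staging only: none of these lines reads the source.
  val adults: Lineage[List[Person]] =
    Applied(persons, (ps: List[Person]) => ps.filter(_.age >= 18))
  val sorted = Applied(adults, (ps: List[Person]) => ps.sortBy(_.age))
  val labels = Applied(sorted, (ps: List[Person]) => ps.map(p => "Hi, " + p.name))
}
```

Calling `LineageDemo.labels.materialize()` plays the role of kicking off the staged computation. Calling it a second time replays the same lineage from the source, which is the idea behind using lineage for fault recovery: a lost result can be recomputed from its description.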

Slide 50

Slide 50 text

Lineage-based distributed computation: Results
• Formalization
  – Typed lambda-calculus with distributed hosts, silos and lineages
  – Asynchronous, distributed operational semantics
• Proof establishing the preservation of lineage mobility
• Proof of finite materialization of remote, lineage-based data
First correctness results for a core calculus of lineage-based distributed computation
Haller, Miller, Müller. A programming model and foundation for lineage-based distributed computation. J. Funct. Program. 28: e7 (2018)

Slide 51

Slide 51 text

Further directions
• Security & Privacy: privacy-aware distribution, information-flow security
  – Salvaneschi, Köhler, Sokolowski, Haller, Erdweg, Mezini. Language-integrated privacy-aware distributed queries. Proc. ACM Program. Lang. 3(OOPSLA): 167:1-167:30 (2019)
• Chaos Engineering: testing hypotheses about resilience in production systems
  – Zhang, Morin, Haller, Baudry, Monperrus. A chaos engineering system for live analysis and falsification of exception-handling in the JVM. IEEE Transactions on Software Engineering, to appear
• Distributed Shared State: consistency, availability, partition tolerance; replicated data types
  – Zhao, Haller. Replicated data types that unify eventual consistency and observable atomic consistency. J. Log. Algebraic Methods Program. 114: 100561 (2020)
  – Zhao, Haller. Consistency types for replicated data in a higher-order distributed programming language. Art Sci. Eng. Program. 5(2): 6 (2021)
Xin Zhao

Slide 52

Slide 52 text

10-year vision (1): From centralized to resilient decentralized computing
• Overcoming the limitations of centralized clouds: communication latencies, data privacy
• Enabling applications spanning edge and cloud
• Challenges:
  – Inherent asynchrony
  – Partial failure
  – Privacy, integrity
• Objective:
  – Practical programming system for resilient decentralized computing
Example application: Reinforcement learning deployed across edge and cloud
[Diagram labels: eventual consistency, Offline Computation, Secure Data Center]
Spenger, Carbone, Haller. WIP: Pods: Privacy Compliant Scalable Decentralized Data Services. Poly'21 @ VLDB (2021)

Slide 53

Slide 53 text

10-year vision (2): Semi-implicit massively parallel and distributed programming
Multi-core processors and accelerators (GPGPUs, FPGAs, etc.) are the only means to higher scalability
Challenges:
• Safety and liveness hazards, nondeterminism
• Weak memory models
• Distribution: partial failure, asynchronicity
Objectives:
• Deterministic concurrent programming system
• Safe, modular composition of deterministic and nondeterministic code
• Automatic identification of opportunities for parallelization
Source: C. E. Leiserson et al., Science 368, eaam9744 (2020). DOI: 10.1126/science.aam9744

Slide 54

Slide 54 text

10-year vision (3): Safe smart contract programming
• Increased demands on data protection, e.g., GDPR, CCPA
• Smart contracts important for management of digital assets
• Challenges:
  – How to provably enforce consented data access policies?
  – How to prevent critical vulnerabilities in smart contract languages?
• Objective:
  – Safe smart contract languages, with provable correctness properties
Image source: ECB

Slide 55

Slide 55 text

Conclusion
Goal: Robust, large-scale concurrent and distributed programming
Methods:
• Programming languages and systems: Design & implementation
• Type systems: Theory & practice
• Experimental evaluation
• Empirical studies
[Diagram: Fault tolerance / Scalability / Safety, with Scala Actors, LaCasa, Reactive Async, Function passing, Spores]
Thank You!