Systematic error management - we ported Rudder to ZIO

Slide 1

Slide 1 text

Systematic error management in applications. We ported Rudder to ZIO. 2019-10-30 [email protected] @fanf42

Slide 2

Slide 2 text

Hi! François ARMAND, CTO, Founder. Free Software Company (“Stay Up”). DevOps automation/compliance app: manages tens of thousands of computers.

Slide 3

Slide 3 text

Hi! François ARMAND, CTO, Founder, Developer. Free Software Company (“Stay Up”). DevOps automation/compliance app: manages tens of thousands of computers.

Slide 4

Slide 4 text

Developer? ● Model the world into code ○ Try to make it useful

Slide 5

Slide 5 text

Developer? ● Model the world into code ○ Try to make it useful ● Nominal case necessary (of course)

Slide 6

Slide 6 text

Developer? ● Model the world into code ○ Try to make it useful ● Nominal case necessary (of course) ● But not sufficient (models are false) ○ Bugs ○ Misunderstanding of needs ○ Open world ○ Damn users using your app

Slide 7

Slide 7 text

This talk ● Systematic management of errors ● Caveat emptor: ○ I’m a Scala dev, mainly ○ Application, not library

Slide 8

Slide 8 text

This talk ● It's an important talk for me ● Much harder to do than expected ○ Based on lots of deeply rooted, fuzzy, experimental knowledge ● Please, please, I beg you: if anything is unclear, come chat with me / ask questions (whatever the medium)

Slide 9

Slide 9 text

Not so popular opinions - 4 hills I would die on

Slide 10

Slide 10 text

Not so popular opinion 1/4: Our work as developers is to discover and assess failure modes

Slide 11

Slide 11 text

Not so popular opinion 2/4: ERRORS are a SOCIAL construction to give AGENCY to the receiver of the error

Slide 12

Slide 12 text

Not so popular opinion 3/4: An application always has at least 3 kinds of users: users, devs, and ops. Don’t forget any.

Slide 13

Slide 13 text

Not so popular opinion 4/4: It’s YOUR work to choose the SEMANTICS between nominal case and error, and to KEEP your PROMISES

Slide 14

Slide 14 text

OK. But in concrete terms?

Slide 15

Slide 15 text

Assess failure modes. Give agency to your users and don’t forget any of them. You are responsible for keeping the promises you made.

Slide 16

Slide 16 text

Assess failure modes. Give agency to your users and don’t forget any of them. You are responsible for keeping the promises you made.
1. Pure, total functions
2. Explicit error channel
3. Failures vs Errors
4. Program to strict interfaces and protocols
5. Composition and tooling

Slide 17

Slide 17 text

Points 1 to 5 are also important, and can be translated, at the architecture / UX / team / ecosystem levels. But let’s keep it simple with code. Assess failure modes. Give agency to your users and don’t forget any of them. You are responsible for keeping the promises you made.

Slide 18

Slide 18 text

1. Pure, total functions: don’t lie about your promises

Slide 19

Slide 19 text

Don’t lie! divide(a: Int, b: Int): Int

Slide 20

Slide 20 text

Don’t lie! divide(a: Int, b: Int): Int Divide by zero?

Slide 21

Slide 21 text

Don’t lie! divide(a: Int, b: Int): Int Divide by zero? ● Non-total functions are a lie ○ your promises are unsound ○ your users can’t react appropriately

Slide 22

Slide 22 text

Don’t lie! getUserFromDB(id: UserId): User

Slide 23

Slide 23 text

Don’t lie! getUserFromDB(id: UserId): User No such user? (non-total)

Slide 24

Slide 24 text

Don’t lie! getUserFromDB(id: UserId): User No such user? (non-total) DB connection error?

Slide 25

Slide 25 text

Don’t lie! getUserFromDB(id: UserId): User No such user? (non-total) DB connection error? ● Non-pure functions are a lie ○ your promises are unsound ○ your users can’t react appropriately

Slide 26

Slide 26 text

Sound promises ● Use total functions ○ or make them total with a union return type ● Use pure functions ○ or make them pure with an IO monad ● Don’t lie to your users ● Allow them to react efficiently
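
A minimal sketch of "make it total with a union return type", using only Scala's standard `Either`. The `DivideByZero` value and the `describe` helper are assumptions made for illustration; they are not taken from the Rudder code base.

```scala
object TotalDivide {
  // Hypothetical error value: the only non-nominal case of integer division.
  case object DivideByZero

  // Total: every pair of Ints maps to a value, and the signature says so.
  def divide(a: Int, b: Int): Either[DivideByZero.type, Int] =
    if (b == 0) Left(DivideByZero) else Right(a / b)

  // The caller has agency: the error case is visible and can be handled.
  def describe(a: Int, b: Int): String =
    divide(a, b) match {
      case Right(q)           => s"$a / $b = $q"
      case Left(DivideByZero) => s"cannot divide $a by zero"
    }
}
```

The signature no longer promises an `Int` it cannot always deliver, so a caller cannot forget the zero case by accident.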

Slide 27

Slide 27 text

2. Explicit error channel: make it unambiguous in your types

Slide 28

Slide 28 text

It’s a signal: make it unambiguous, give agency

Slide 29

Slide 29 text

It’s a signal: make it unambiguous, give agency ● Don’t assume what’s obvious ● It’s an open world out there ● Don’t force users to reverse-engineer possible cases

Slide 30

Slide 30 text

It’s a signal: make it unambiguous, give agency. Which intent is less ambiguous? blobzurg(a: Int, b: Int): Option[Int] blobzurg(a: Int, b: Int): PureResult[DivideByZero, Int]

Slide 31

Slide 31 text

It’s a signal: make it unambiguous, give agency, automate it ● Use the type system to automate classification of errors?

Slide 32

Slide 32 text

It’s a signal: make it unambiguous, give agency, automate it ● Use the type system to automate classification of errors? “A type system is a tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute.” Benjamin Pierce

Slide 33

Slide 33 text

It’s a signal: make it unambiguous, give agency, automate it. “A type system is a tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute.” Benjamin Pierce. By definition, a type system automatically categorizes results ⟹ need for a dedicated error channel + a common error trait. def divide(a: Int, b: Int): PureResult[Int]

Slide 34

Slide 34 text

It’s a signal: make it unambiguous, give agency, automate it
“A type system is a tractable syntactic method for proving the absence of certain program behaviors by classifying phrases according to the kinds of values they compute.” Benjamin Pierce
By definition, a type system automatically categorizes results ⟹ need for a dedicated error channel + a common error trait
trait MyAppError // common properties of errors
type PureResult[A] = Either[MyAppError, A]
def divide(a: Int, b: Int): PureResult[Int]
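
The following sketch fleshes out the slide's `MyAppError` / `PureResult` pair to show why the common parent trait matters: errors from unrelated functions compose in one `for` comprehension without any annotation. The `msg` member, the two concrete error classes and `parseInt` / `divideStrings` are assumptions for illustration only.

```scala
object PureErrors {
  // Common parent trait: every error carries a human-readable message.
  trait MyAppError { def msg: String }
  type PureResult[A] = Either[MyAppError, A]

  // Hypothetical specialized errors.
  final case class DivideByZero(numerator: Int) extends MyAppError {
    def msg = s"cannot divide $numerator by zero"
  }
  final case class NotAnInt(raw: String) extends MyAppError {
    def msg = s"'$raw' is not an integer"
  }

  def parseInt(s: String): PureResult[Int] =
    scala.util.Try(s.trim.toInt).toOption.toRight(NotAnInt(s))

  def divide(a: Int, b: Int): PureResult[Int] =
    if (b == 0) Left(DivideByZero(a)) else Right(a / b)

  // Both kinds of error flow through the same channel; the first one wins.
  def divideStrings(a: String, b: String): PureResult[Int] =
    for {
      x <- parseInt(a)
      y <- parseInt(b)
      q <- divide(x, y)
    } yield q
}
```

Because every error extends `MyAppError`, the compiler categorizes all of them as "error channel" values, and callers can always fall back on `msg`.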

Slide 35

Slide 35 text

It’s a signal: make it unambiguous, give agency, automate it. By definition, a type system automatically categorizes results ⟹ need for a dedicated error channel + a common error trait. Same for effectful functions! def getUser(id: UserId): IOResult[User]

Slide 36

Slide 36 text

It’s a signal: make it unambiguous, give agency, automate it
By definition, a type system automatically categorizes results ⟹ need for a dedicated error channel + a common error trait
Same for effectful functions!
trait MyAppError // common properties of errors
type IOResult[A] = IO[MyAppError, A]
def getUser(id: UserId): IOResult[User]
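
To illustrate the shape of `IO[E, A]` without depending on a library, here is a toy stand-in written for this example. It is deliberately not ZIO's implementation (no async, no fibers, no stack safety); the `IO` class, `unsafeRun`, the `Int` user id and the in-memory `db` are all assumptions. It only shows that an effectful function can return a description whose type names both the error and the value.

```scala
object ToyIO {
  trait MyAppError { def msg: String }
  final case class NoSuchUser(id: Int) extends MyAppError {
    def msg = s"no user with id $id"
  }

  // Toy effect type: a description of a computation that, once run,
  // yields either an error E or a value A. NOT ZIO's IO.
  final class IO[+E, +A](val unsafeRun: () => Either[E, A]) {
    def map[B](f: A => B): IO[E, B] =
      new IO[E, B](() => unsafeRun().map(f))
    def flatMap[E1 >: E, B](f: A => IO[E1, B]): IO[E1, B] =
      new IO[E1, B](() => unsafeRun().flatMap(a => f(a).unsafeRun()))
  }
  type IOResult[A] = IO[MyAppError, A]

  final case class User(id: Int, name: String)

  // Hypothetical mutable "database", standing in for the outside world.
  val db = scala.collection.mutable.Map(1 -> User(1, "alice"))

  // Building the value performs no lookup; the lookup happens on unsafeRun().
  def getUser(id: Int): IOResult[User] =
    new IO[MyAppError, User](() => db.get(id).toRight(NoSuchUser(id)))
}
```

Calling `getUser` is pure: it returns the same description every time, and the side effect is only performed when the description is run at the edge of the program.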

Slide 37

Slide 37 text

It’s a signal: make it unambiguous, give agency, automate it ● Use a dedicated error channel ○ ~ Either[E, A] for pure code ○ else ~ IO[E, A] monad ● Use a parent trait for common error properties… ● and for automatic categorization of errors by the compiler

Slide 38

Slide 38 text

3. Failures vs Errors: models are false by construction

Slide 39

Slide 39 text

Model everything? writeFile(path: String, value: String): IOResult[Unit]

Slide 40

Slide 40 text

Model everything? writeFile(path: String, value: String): IOResult[Unit] java.lang.SecurityException? (JVM permission to access the FS)

Slide 41

Slide 41 text

Model everything? writeFile(path: String, value: String): IOResult[Unit] java.lang.SecurityException? (JVM permission to access the FS) ⟹ where do you set the limit?

Slide 42

Slide 42 text

Systems? Need for a systematic approach to error management. A school of systems

Slide 43

Slide 43 text

Systems? Need for a systematic approach to error management. A school of systems: ○ BOUNDED group of things ○ with a NAME ○ INTERACTING with other systems

Slide 44

Slide 44 text

Systems have a horizon. ○ Nothing exists beyond the horizon

Slide 45

Slide 45 text

Systems have a horizon. Horrors lie beyond. ○ Nothing exists beyond the horizon ○ Like with Lovecraft: if something from beyond interacts with a system, the system becomes inconsistent

Slide 46

Slide 46 text

Errors vs Failures
Errors
● expected non-nominal case
● signal for users
● social construction: you choose alternative or error
● reflected in types
Failures
● unexpected case: by definition, the application is in an unknown state
● the only choice is to stop as cleanly as possible
● not reflected in types
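
One way to sketch this split in plain Scala is with the standard `scala.util.control.NonFatal` extractor: non-fatal exceptions become error values in the result type, while fatal ones such as `OutOfMemoryError` are not matched and keep propagating as failures. The `attempt` helper and the shape of `SystemError` here are assumptions for illustration, not Rudder's actual code.

```scala
import scala.util.control.NonFatal

object ErrorsVsFailures {
  trait MyAppError { def msg: String }
  final case class SystemError(msg: String, cause: Throwable) extends MyAppError

  // Errors: expected, non-fatal exceptions are reflected in the type.
  // Failures: fatal throwables (e.g. OutOfMemoryError) are not matched by
  // NonFatal, so they escape and are never turned into a Left.
  def attempt[A](what: String)(body: => A): Either[MyAppError, A] =
    try Right(body)
    catch { case NonFatal(e) => Left(SystemError(s"$what: ${e.getMessage}", e)) }
}
```

Where exactly the line sits remains a design decision; `NonFatal` is only a convenient default for the "unknown state, stop cleanly" side.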

Slide 47

Slide 47 text

Horizon limit is your choice - by definition. java.lang.SecurityException?

Slide 48

Slide 48 text

Horizon limit is your choice - by definition. java.lang.SecurityException? In Rudder, we have a JS engine (JS from users): execScript(js: String): IOResult[String]

Slide 49

Slide 49 text

Horizon limit is your choice - by definition. java.lang.SecurityException? In Rudder, we have a JS engine (JS from users): execScript(js: String): IOResult[String] ⟹ SecurityException is an expected error case here

Slide 50

Slide 50 text

Horizon limit is your choice - by definition. java.lang.SecurityException? In Rudder, we have a JS engine (JS from users): execScript(js: String): IOResult[String] ⟹ SecurityException is an expected error case here … but nowhere else in Rudder. By our choice.
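
A small sketch of that choice, assuming a hypothetical `unsafeEval` that stands in for a sandboxed JS engine and using `Either` instead of the slide's `IOResult` to stay dependency-free. Only inside this sub-system is `SecurityException` translated into an error value; `ScriptForbidden` is an invented name.

```scala
object ScriptHorizon {
  trait MyAppError { def msg: String }
  final case class ScriptForbidden(msg: String) extends MyAppError

  // Hypothetical sandboxed evaluator, standing in for a real JS engine.
  def unsafeEval(js: String): String =
    if (js.contains("readFile")) throw new SecurityException("file access denied")
    else s"evaluated: $js"

  // Within the script sub-system, SecurityException is inside the horizon:
  // it is an expected case, so it is modelled as an error value.
  // Any other exception is left alone and propagates as a failure.
  def execScript(js: String): Either[MyAppError, String] =
    try Right(unsafeEval(js))
    catch { case e: SecurityException => Left(ScriptForbidden(e.getMessage)) }
}
```

Elsewhere in the application the same exception would simply not be caught, because there it signals an unknown state rather than a user mistake.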

Slide 51

Slide 51 text

4. Program to strict interfaces and protocols: use systems to materialize promises

Slide 52

Slide 52 text

A bit more about systems. Need for a systematic approach to error management. A school of systems: ○ BOUNDED group of things ○ with a NAME ○ INTERACTING with other systems

Slide 53

Slide 53 text

A bit more about systems. Need for a systematic approach to error management. A school of systems: ○ BOUNDED group of things ○ with a NAME ○ INTERACTING via INTERFACES ○ by a PROTOCOL with other systems ○ and PROMISING to have a behavior

Slide 54

Slide 54 text

Example? Typical web application.

Slide 55

Slide 55 text

Example? Typical web application.

Slide 56

Slide 56 text

Example? Typical web application. How to keep contradictory promises? ● Promises to third parties about REST behaviour ● Promises to business and developers about code manageability

Slide 57

Slide 57 text

Make promises, keep them ● Systems allow you to bound responsibilities

Slide 58

Slide 58 text

Make promises, keep them ● Systems allow you to bound responsibilities

Slide 59

Slide 59 text

Make promises, keep them ● Systems allow you to bound responsibilities. Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bound to developers’ understanding of needs (rapid changes)

Slide 60

Slide 60 text

Make promises, keep them ● Systems allow you to bound responsibilities. Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bound to developers’ understanding of needs (rapid changes). Pattern: “A pure heart (core) surrounded by side effects”* * works better in French: “un coeur pur encerclé par les effets de bords”

Slide 61

Slide 61 text

Make promises, keep them ● Systems allow you to bound responsibilities. Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bound to developers’ understanding of needs (rapid changes). Users of the API want stability and want to know what errors can happen

Slide 62

Slide 62 text

Make promises, keep them ● Systems allow you to bound responsibilities. Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bound to developers’ understanding of needs (rapid changes). REST sub-system: ● own ADT / logic (mostly effects) ● lifecycle bound to the REST contract: strict versioning, changes are breaking changes. Users of the API want stability and want to know what errors can happen

Slide 63

Slide 63 text

Make promises, keep them ● Systems allow you to bound responsibilities. Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bound to developers’ understanding of needs (rapid changes). REST sub-system: ● own ADT / logic (mostly effects) ● lifecycle bound to the REST contract: strict versioning, changes are breaking changes. Stable API: interface, strict protocol & promises (nominal cases + errors). Users of the API have agency (able to react efficiently)

Slide 64

Slide 64 text

Make promises, keep them ● Systems allow you to bound responsibilities. Business Core sub-system: ● own ADT / logic (mostly pure) ● lifecycle bound to developers’ understanding of needs (rapid changes). REST sub-system: ● own ADT / logic (mostly effects) ● lifecycle bound to the REST contract: strict versioning, changes are breaking changes. Stable API: interface, strict protocol & promises (nominal cases + errors). Users of the API have agency (able to react efficiently). Translation between sub-systems: API: interface, protocol & promises!

Slide 65

Slide 65 text

Make promises, keep them ● Systems allow you to bound responsibilities ● Translate errors between sub-systems ○ make errors relevant to their users ● It’s a model, it’s false ○ there is NO definitive answer ○ discuss, share, iterate ● The bigger the promises, the stricter the API
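
Translating errors between sub-systems can be as small as one function at the boundary. In this sketch the core error ADT, the `RestError` shape and the status code mapping are all assumptions chosen for illustration; the point is that the REST contract exposes only errors relevant to API users and stays stable while the core evolves.

```scala
object ErrorTranslation {
  // Business core errors: free to change as understanding of needs evolves.
  sealed trait CoreError
  final case class UserNotFound(id: Int) extends CoreError
  final case class InvalidEmail(raw: String) extends CoreError
  final case class StorageUnavailable(detail: String) extends CoreError

  // REST errors: part of a versioned contract, so kept small and stable.
  final case class RestError(status: Int, message: String)

  // Single translation point between the two sub-systems.
  def toRest(e: CoreError): RestError = e match {
    case UserNotFound(id)      => RestError(404, s"user $id not found")
    case InvalidEmail(raw)     => RestError(400, s"invalid email: $raw")
    // Internal details are not leaked to API users.
    case StorageUnavailable(_) => RestError(503, "service temporarily unavailable")
  }

  def respond[A](result: Either[CoreError, A]): Either[RestError, A] =
    result.left.map(toRest)
}
```

Because `CoreError` is sealed, adding a new core error forces the translation to be revisited, which keeps the promise made to API users explicit.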

Slide 66

Slide 66 text

5. Composition and tooling: make it extremely convenient to use

Slide 67

Slide 67 text

Unpopular opinion (sure): Checked exceptions are a good signal for users

Slide 68

Slide 68 text

Unpopular opinion (sure): Checked exceptions are a good signal for users. Who likes them?

Slide 69

Slide 69 text

What’s missing for good error management in code? ● The signal must be unambiguous ○ exceptions are a pile of ambiguity ○ Error? ○ Fatal error? ○ Checked? Unchecked?

Slide 70

Slide 70 text

What’s missing for good error management in code? ● The signal must be unambiguous ○ exceptions are a pile of ambiguity ● Exceptions are A PAIN to use ○ no tooling, no inference, nothing ○ no composition

Slide 71

Slide 71 text

Make it a joy! ● Managing errors should be enjoyable! ○ automatic (in for loop + inference) ○ or as expressive as the nominal case! ● Safely, easily managing errors should be the default! ○ composition (referential transparency…) ○ higher-level resource management: bracket, etc. ● Make the code extremely readable ○ add all the combinators you need! ○ it’s cheap with pure, total functions

Slide 72

Slide 72 text

In Rudder: Why ZIO?

Slide 73

Slide 73 text

Why ZIO? ● You still have to think in systems by yourself

Slide 74

Slide 74 text

Why ZIO? ● You still have to think in systems by yourself ● Then ZIO provides: ○ effect management ○ with an explicit error channel ○ IO[+E, +A] val pureCode = IO.effect(effectfulCode)

Slide 75

Slide 75 text

Why ZIO? ● You still have to think in systems by yourself ● Then ZIO provides: ○ debuggable failures: complex error composition, async code trace

Slide 76

Slide 76 text

Why ZIO? ● You still have to think in systems by yourself ● Then ZIO provides: ○ tons of convenience to manipulate errors ○ composable effects ● Safe, composable resource management

Slide 77

Slide 77 text

Why ZIO? ● You still have to think in systems by yourself ● Then ZIO provides: ○ effect management ○ with an explicit error channel ○ debuggable failures ○ tons of convenience to manipulate errors ○ composable

Slide 78

Slide 78 text

Why ZIO? ● You still have to think in systems by yourself ● Then ZIO provides: ○ effect management ○ with an explicit error channel ○ debuggable failures ○ tons of convenience to manipulate errors ○ composable ● Everything works in parallel, asynchronous code too! ● Inference just works!

Slide 79

Slide 79 text

Why ZIO? ● You still have to think in systems by yourself ● Then ZIO provides: ○ effect management ○ with an explicit error channel ○ debuggable failures ○ tons of convenience to manipulate errors ○ composable ● Everything works in parallel, concurrent code too! ● Inference just works! Lots of details: “Error Management: Future vs ZIO” https://www.slideshare.net/jdegoes/error-management-future-vs-zio

Slide 80

Slide 80 text

In Rudder, with ZIO: what we settled on

Slide 81

Slide 81 text

One error hierarchy ● One error type (trait) providing common tooling

Slide 82

Slide 82 text

Unambiguous type ● One result type for pure terms ● One that encapsulates effects

Slide 83

Slide 83 text

Generic, useful errors ● Java exceptions are translated into SystemError ● Chained adds context for humans ● Accumulated groups several errors into one
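
The slide's code is not part of this transcript, so the following is only a guess at the shape of these three generic errors. The names `RudderError`, `SystemError`, `Chained` and `Accumulated` come from the slides; every field, the `msg` formatting and the `accumulate` helper are assumptions.

```scala
object GenericErrors {
  trait RudderError { def msg: String }

  // A Java exception translated into an error value.
  final case class SystemError(hint: String, cause: Throwable) extends RudderError {
    def msg = s"$hint; cause was: ${cause.getMessage}"
  }
  // Adds human-oriented context on top of an underlying error.
  final case class Chained(hint: String, cause: RudderError) extends RudderError {
    def msg = s"$hint <- ${cause.msg}"
  }
  // Groups several errors into one.
  final case class Accumulated(all: List[RudderError]) extends RudderError {
    def msg = all.map(_.msg).mkString("; ")
  }

  // Evaluate every result and report all errors, not only the first one.
  def accumulate[A](results: List[Either[RudderError, A]]): Either[RudderError, List[A]] = {
    val errors = results.collect { case Left(e) => e }
    if (errors.isEmpty) Right(results.collect { case Right(a) => a })
    else Left(Accumulated(errors))
  }
}
```

`Chained` is what lets each sub-system add its own context while keeping the original cause, and `Accumulated` covers the cases where short-circuiting on the first error would hide useful information.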

Slide 84

Slide 84 text

Specialized errors for sub-systems ● Real code from Rudder ⇒ specialized errors for the LDAP sub-system ⇒ adapts the semantics of a Java lib (exceptions) to pure values that can be composed and behave like other errors in Rudder (printable information)

Slide 85

Slide 85 text

Full example - real code from Rudder ● Inference just works ● Each sub-system adds relevant information ● Simple combinators (in white) used as syntactic sugar: ○ (None, msg) => Unexpected(msg) ○ PureResult[A] => IOResult[A] ○ (err: RudderError[A], msg) => Chained(msg, err) ● Error contextualisation between systems
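
Two of the combinators listed above can be sketched as extension methods on `Option` and `Either`. The method names `notOptional` and `chainError`, the `settings` map and `lookup` are hypothetical, and the `PureResult[A] => IOResult[A]` lifting is left out because it needs an effect type.

```scala
object Combinators {
  trait RudderError { def msg: String }
  final case class Unexpected(msg: String) extends RudderError
  final case class Chained(hint: String, cause: RudderError) extends RudderError {
    def msg = s"$hint <- ${cause.msg}"
  }
  type PureResult[A] = Either[RudderError, A]

  // (None, msg) => Unexpected(msg)
  implicit class OptionOps[A](opt: Option[A]) {
    def notOptional(msg: String): PureResult[A] = opt.toRight(Unexpected(msg))
  }
  // (err, msg) => Chained(msg, err)
  implicit class ResultOps[A](res: PureResult[A]) {
    def chainError(msg: String): PureResult[A] = res.left.map(e => Chained(msg, e))
  }

  // Hypothetical usage: each layer adds the context it knows about.
  val settings = Map("port" -> "8080")
  def lookup(key: String): PureResult[String] =
    settings.get(key).notOptional(s"missing '$key'").chainError("reading settings")
}
```

With pure, total functions such helpers are cheap to write, and they keep the error path as readable as the nominal one.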

Slide 86

Slide 86 text

Assess failure modes. Give agency to your users and don’t forget any of them. You are responsible for keeping the promises you made.
1. Pure, total functions: don’t lie about your promises
2. Explicit error channel: make it unambiguous in your types
3. Failures vs Errors: models are false by construction
4. Program to strict interfaces and protocols: use systems to materialize promises
5. Composition and tooling: make it extremely convenient to use

Slide 87

Slide 87 text

Questions? Contact me / chat with me!
https://twitter.com/fanf42
https://github.com/fanf
https://keybase.io/fanf42
irc/freenode: fanf
[email protected]
Resources
○ Error management: Future vs ZIO. A much more detailed presentation of ZIO error management capabilities. https://www.slideshare.net/jdegoes/error-management-future-vs-zio
○ Understand Things As Interacting Systems. More insights on systems. https://medium.com/@fanf42/understand-things-as-interacting-systems-b273bdba5dec
○ Stay Up! Journey of a Free Software Company. One decade in search for a sustainable model. https://medium.com/@fanf42/stay-up-5b780511109d

Slide 88

Slide 88 text

Some questions asked after the talk
● What about making impossible states unrepresentable from the beginning?
○ That’s a very good point and you should ALWAYS try to do so. The idea is to change the method’s domain of definition (i.e. the shape of its parameters) so that it only works on inputs that can’t raise errors. Typically, in my trivial “divide” example, we should have used a “non-zero integer” for the denominator input.
○ Alexis King (@lexy_lambda) wrote a wonderful article on that, so just go read it, she explains it better than I can: “Parse, don’t validate” https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/
○ We use that technique a lot in Rudder to drive understanding of what is possible. Each time we can restrict the domain of definition, we try to keep that information for later use.
○ Typical examples: parsing a plugin license (we have 4 “xxxLicenses” classes depending on what we know about its state); validating a user policy (again several “SomethingPolicyDraft” with the different hypotheses needed to build the “Something”).
○ The general goal is the same as with error management: assess failure modes, give agency to users to react efficiently.
○ There are still plenty of cases where that technique is hard to use (fuzzy business cases…) or not what you are looking for (you just want to tell users that something is the nominal case, or not, and give them agency to react accordingly).
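
For the “divide” case mentioned in this answer, restricting the domain could look like the sketch below. `NonZeroInt` and its `parse` smart constructor are assumptions for illustration, not types from Rudder.

```scala
object ParseDontValidate {
  // A non-zero integer; the private constructor forces callers through `parse`.
  final class NonZeroInt private (val value: Int)
  object NonZeroInt {
    def parse(i: Int): Option[NonZeroInt] =
      if (i == 0) None else Some(new NonZeroInt(i))
  }

  // No error channel needed: the type of `b` already rules out zero.
  def divide(a: Int, b: NonZeroInt): Int = a / b.value
}
```

The check happens once, at the place where the raw `Int` enters the system, and the proof that it succeeded travels with the value afterwards.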

Slide 89

Slide 89 text

Some questions asked after the talk ● Is SystemError used to catch / materialize failures? ○ No. SystemError is here to translate errors that need to be dealt with (like a connection error to the DB, FS-related problems, etc.) but are encoded in Java as an Exception. SystemError is not used to catch Java’s “OutOfMemoryError”. These exceptions kill Rudder. We use the JVM’s Thread.setDefaultUncaughtExceptionHandler to try to give more information to dev/ops and clean things up before killing the app.
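
`Thread.setDefaultUncaughtExceptionHandler` is a standard JVM API; what a handler should do is application-specific. The sketch below only records the failure in a variable so that the behaviour is observable. The `LastResort` object and its `lastFailure` field are hypothetical, and a real handler would log, release resources and then stop the JVM.

```scala
object LastResort {
  // Hypothetical place to keep the last failure, for illustration only.
  @volatile var lastFailure: Option[String] = None

  def install(): Unit =
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      def uncaughtException(t: Thread, e: Throwable): Unit = {
        // Give dev/ops as much context as possible before stopping.
        lastFailure = Some(s"${t.getName}: ${e.getClass.getSimpleName}: ${e.getMessage}")
        // A real handler would log, clean up, then terminate the application.
      }
    })
}
```

The handler runs on the dying thread, so it is a last chance to report what happened, not a place to recover and continue.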

Slide 90

Slide 90 text

Some questions asked after the talk ● You have only one parent type for errors. Don’t you lose a lot of details, with all the special errors in sub-systems losing their specificities when they are seen as RudderError? ○ This is a very pertinent question, and we spent a lot of time weighing the current design against one where all sub-systems would have their own error type (with no common super type). In the end, we settled on the current design. ○ All in all, the wins in convenience and the joy of just having everything working without boilerplate clearly outweighed the unclear gain of having different error hierarchies. ○ The problem would have been different if Rudder were not one monolithic app and needed separate compilation between services. I think we would have made an “error” lib in that case.

Slide 91

Slide 91 text

Some questions asked after the talk ● We use Future[Either[E,A]] + MTL, why should we switch to ZIO? ○ Well, the decision to switch is yours, and I don’t know the specific context of your company well enough to give advice on that. Nonetheless, here is my personal opinion: ■ pertinent stack traces in concurrent code ● But at the end of the day, you decide!

Slide 92

Slide 92 text

Some questions asked after the talk ● How long did it take to port Rudder to ZIO? ○ It’s complicated :). 1 month part-time (me), plus a lot more time for teaching, refactoring, understanding the limits of the new paradigm, etc.