STC: Towards Haskell in the Cloud

norm2782
December 18, 2011

My STC presentation at UU about the paper "Towards Haskell in the Cloud" by Epstein, Black and Peyton Jones, 2011

Transcript

  1. [Faculty of Science Information and Computing Sciences] Towards Haskell in the Cloud

    Epstein, Black and Peyton Jones, 2011
    Jurriën Stutterheim, 15 December 2011
  2. Cloud Haskell §1

    - EDSL for cloud computing
    - Computing on processors with separate memories, connected via a network
    - Strongly based on Erlang’s message-passing model
    - Concurrent processes have no access to each other’s data
    - Data needs to be communicated explicitly
    - A concurrent process can fail without affecting others
  3. Contributions §1

    - A low-level framework for distributed computing in Haskell
    - Part library, part extended GHC
    - A way to serialise functions and send them across a network
    - Improves over the Erlang solution
    - Type safety and a strong separation between pure and effectful code
    - Computing with pure functions eliminates the need for transactions; computations can be resumed elsewhere in case of failure
  4. Example scenario §1

    - Process a set of numbers 1..n
    - k slave nodes, each node consuming n/k numbers
    - Processing function is transmitted over the network from master to slaves, together with their respective input
    - Master merges answers from slaves and spits out a final answer
    - How does the master orchestrate the slaves?
    - How do we serialise and send a function over a network?
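The scenario on this slide can be sketched on a single machine. This is only a local model under stated assumptions: threads stand in for slave nodes, an `MVar` stands in for the network, and the names `chunks`, `process` and `runMaster` are hypothetical. Nothing is serialised here; that part is the subject of the later slides.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Split a list into at most k chunks of roughly n/k elements (assumes k >= 1).
chunks :: Int -> [a] -> [[a]]
chunks k xs = go xs
  where
    size = max 1 ((length xs + k - 1) `div` k) -- ceiling of n/k
    go [] = []
    go ys = let (h, t) = splitAt size ys in h : go t

-- Hypothetical processing function each slave applies to its share.
process :: [Integer] -> Integer
process = sum . map (\x -> x * x)

-- Master: hand each chunk to a worker thread, then merge the answers.
runMaster :: Int -> Integer -> IO Integer
runMaster k n = do
  boxes <- mapM spawnWorker (chunks k [1 .. n])
  answers <- mapM takeMVar boxes
  return (sum answers)
  where
    spawnWorker chunk = do
      box <- newEmptyMVar
      _ <- forkIO (putMVar box $! process chunk)
      return box
```

Loaded into GHCi, `runMaster 3 10` runs three workers over the numbers 1..10 and sums their partial results.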
  5. Basic elements §2

    Processes
    - Basic unit of concurrency
    - Can send and receive messages
    - Lightweight: low creation and scheduling overhead
    - Only message-passing between processes, shared memory only inside a single process

    Messages
    - Asynchronous and buffered
    - All code related to messaging in ProcessM monad
    - Two styles of message passing
      - Untyped messages (like Erlang)
      - Typed channels
  6. Untyped messaging §2

    Basic functions [1] for untyped messaging:

        send :: Serializable a ⇒ ProcessId → a → ProcessM ()
        expect :: Serializable a ⇒ ProcessM a

    Messages of any serialisable type may be sent and received. Channels (discussed later) allow for restricting the type of the messages.

    [1] Authors call these primitives, which is not entirely true
  7. Serialisation §2

    All types for which there are Binary and Typeable instances are serialisable for Cloud Haskell:

        class (Binary a, Typeable a) ⇒ Serializable a

    - Binary: binary serialisation of Haskell values to and from lazy ByteStrings
    - Typeable: associates type representations with types, giving runtime type information
    - No such instances for types like MVar
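A minimal sketch of how the two superclasses could work together, assuming the `binary` and `bytestring` packages that ship with GHC. The `Greeting` type, the `Wire` format and the functions `toWire` and `fromWire` are hypothetical and not part of Cloud Haskell; instances are written out by hand here to stay within plain Haskell.

```haskell
import Data.Binary (Binary (..), decode, encode)
import qualified Data.ByteString.Lazy as BL
import Data.Typeable (Typeable, typeOf)

-- Serializable has no methods of its own; it only bundles the two constraints.
class (Binary a, Typeable a) => Serializable a

-- Hypothetical message type with a hand-written Binary instance.
newtype Greeting = Greeting String deriving (Eq, Show)

instance Binary Greeting where
  put (Greeting s) = put s
  get = Greeting <$> get

instance Serializable Greeting
instance Serializable Int

-- Wire format assumed for this sketch: a type name plus the encoded payload.
type Wire = (String, BL.ByteString)

toWire :: Serializable a => a -> Wire
toWire x = (show (typeOf x), encode x)

-- Decode only when the tag matches the type the receiver asked for.
-- typeOf does not evaluate its argument, so nothing is decoded on a mismatch.
fromWire :: Serializable a => Wire -> Maybe a
fromWire (tag, payload)
  | tag == show (typeOf value) = Just value
  | otherwise = Nothing
  where
    value = decode payload
```

Binary supplies the bytes and Typeable supplies the tag that lets a receiver reject a message of the wrong type before decoding it.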
  8. Untyped messaging example §2

        data Ping = Ping ProcessId deriving Typeable
        data Pong = Pong ProcessId deriving Typeable
        -- omitted: Binary instances

        ping :: ProcessM ()
        ping = forever $ do
          Pong partner ← expect
          self ← getSelfPid
          send partner (Ping self)

    expect waits for a message and accepts it when its pattern matches
  9. Matching messages of multiple types §2

    expect lets us receive messages of one particular type. receiveWait allows us to match against multiple types:

        receiveWait :: [MatchM q ()] → ProcessM q

    Matching happens in the MatchM monad, which contains a computation to execute when a match is successful. q is the result type of the matched computation. expect is implemented in terms of receiveWait.
  10. Creating MatchM computations §2

    We use match and matchIf to construct the MatchM q () computations:

        match :: Serializable a ⇒ (a → ProcessM q) → MatchM q ()

    matchIf accepts a predicate that must hold for the matching to succeed:

        matchIf :: Serializable a ⇒ (a → Bool) → (a → ProcessM q) → MatchM q ()
  11. Matching example §2

        data Add = Add ProcessId Double Double
        data Div = Div ProcessId Double Double
        data DivByZero = DivByZero

        math :: ProcessM ()
        math = forever $ receiveWait
          [ match (λ(Add pid num1 num2) → send pid (num1 + num2))
          , matchIf (λ(Div _ _ num2) → num2 ≢ 0)
                    (λ(Div pid num1 num2) → send pid (num1 / num2))
          , match (λ(Div pid _ _) → send pid DivByZero)
          ]
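The matching behaviour can be modelled without a network. The sketch below is a pure stand-in built on `Data.Dynamic`, not the Cloud Haskell implementation: the `Match` type, `tryMatch` and `receive` are hypothetical, handlers return plain values instead of running in `ProcessM`, and leaving unmatched messages in the mailbox is an assumption borrowed from Erlang-style selective receive.

```haskell
import Data.Dynamic (Dynamic, fromDynamic, toDyn)
import Data.Typeable (Typeable)

-- A matcher inspects one untyped message and may produce a result.
newtype Match q = Match (Dynamic -> Maybe q)

-- Succeeds when the message has the handler's argument type.
match :: Typeable a => (a -> q) -> Match q
match handler = Match (fmap handler . fromDynamic)

-- Succeeds when the type fits and the predicate holds.
matchIf :: Typeable a => (a -> Bool) -> (a -> q) -> Match q
matchIf ok handler = Match $ \msg ->
  case fromDynamic msg of
    Just x | ok x -> Just (handler x)
    _ -> Nothing

-- Try the matchers in order on a single message.
tryMatch :: [Match q] -> Dynamic -> Maybe q
tryMatch ms msg = foldr pick Nothing ms
  where
    pick (Match m) rest = maybe rest Just (m msg)

-- Scan the mailbox front to back; return the first result together
-- with the mailbox minus the message that was consumed.
receive :: [Match q] -> [Dynamic] -> Maybe (q, [Dynamic])
receive ms = go []
  where
    go _ [] = Nothing
    go skipped (m : rest) = case tryMatch ms m of
      Just q -> Just (q, reverse skipped ++ rest)
      Nothing -> go (m : skipped) rest

data Add = Add Double Double
data Div = Div Double Double

math :: [Match (Either String Double)]
math =
  [ match (\(Add x y) -> Right (x + y))
  , matchIf (\(Div _ y) -> y /= 0) (\(Div x y) -> Right (x / y))
  , match (\(Div _ _) -> Left "DivByZero")
  ]

-- The String message matches nothing and is skipped; the Add is handled.
example :: Maybe (Either String Double)
example = fst <$> receive math [toDyn "noise", toDyn (Add 1 2)]
```

The order of the list matters: the guarded `Div` matcher comes first, so the unguarded one only sees divisions by zero.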
  12. Matching without blocking §2

    receiveWait waits indefinitely for a message, but we also want to be able to stop waiting for messages to arrive after n seconds:

        receiveTimeout :: Int → [MatchM q ()] → ProcessM (Maybe q)

    Example:

        rcvTO = do
          send pid (Query stuff)
          ret ← receiveTimeout 50000 [match (λ(Response answer) → return answer)]
          case ret of
            Nothing → showError "Timeout!"
            Just ans → showAnswer ans
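The same give-up-after-a-while pattern can be tried locally with `System.Timeout.timeout` from base, which takes its limit in microseconds. This is a sketch under assumptions: an `MVar` plays the mailbox, a forked thread plays the remote responder, and `receiveWithin` and `askWithDelay` are hypothetical names.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import System.Timeout (timeout)

-- Wait for a reply for at most `micros` microseconds.
receiveWithin :: Int -> MVar a -> IO (Maybe a)
receiveWithin micros mailbox = timeout micros (takeMVar mailbox)

-- Query a simulated responder that answers after `delay` microseconds,
-- waiting at most 50000 microseconds for the reply.
askWithDelay :: Int -> IO String
askWithDelay delay = do
  mailbox <- newEmptyMVar
  _ <- forkIO (threadDelay delay >> putMVar mailbox "answer")
  reply <- receiveWithin 50000 mailbox
  return $ case reply of
    Nothing -> "Timeout!"
    Just ans -> ans
```

A responder that replies immediately yields the answer, while one that takes half a second runs into the timeout branch.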
  13. Starting and locating processes §2

    How do we start and locate a process?
    - A node is Cloud Haskell’s unit of location
    - NodeId is a unique node identifier that contains an IP address
    - We use spawn to start a new process

        data NodeId = NodeId HostName PortId
        spawn :: NodeId → Closure (ProcessM ()) → ProcessM ProcessId

    We will talk about Closure later
  14. Fault tolerance §2

    - Fault tolerance in Cloud Haskell is (again) based on ideas from Erlang
    - Do not try to recover, but terminate: another process will take over
    - Cloud Haskell/Erlang philosophy: trying to recover will not lead to robustness
    - Made easier by lack of shared state between processes and pure functions
  15. Fault handling between two processes §2

    - A process can monitor another process for termination
    - Notification of termination can be received in two ways
      - A regular Haskell exception
      - A Cloud Haskell message
    - Unidirectional monitoring
      - One process monitors another
      - Choose notification style
    - Bidirectional monitoring
      - One node is informed of the other’s abnormal termination
      - Notification by exception
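Unidirectional monitoring with notification by message can be approximated on one machine with `forkFinally` from base. This is only a local analogy, not the Cloud Haskell API: the `Exit` type and the functions `superviseOnce` and `restartUntilNormal` are hypothetical, and the restart policy is one possible reading of "terminate and let another process take over".

```haskell
import Control.Concurrent (forkFinally)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException)

-- Termination notice delivered to the monitoring side as a message.
data Exit = Normal | Abnormal String deriving (Eq, Show)

-- Start a worker thread, wait for it to terminate, report how it ended.
superviseOnce :: IO () -> IO Exit
superviseOnce worker = do
  notice <- newEmptyMVar
  _ <- forkFinally worker (putMVar notice . toExit)
  takeMVar notice
  where
    toExit :: Either SomeException () -> Exit
    toExit (Left err) = Abnormal (show err)
    toExit (Right ()) = Normal

-- Do not repair a failed worker; start a fresh one instead,
-- making at most `attempts` attempts in total.
restartUntilNormal :: Int -> IO () -> IO Exit
restartUntilNormal attempts worker = do
  exit <- superviseOnce worker
  case exit of
    Abnormal _ | attempts > 1 -> restartUntilNormal (attempts - 1) worker
    _ -> return exit
```

The monitor never inspects the worker's state; it only learns that the worker ended and whether that was normal.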
  16. Messages through channels §3

    - send allowed us to send any message to any node
    - Typed channels give a static guarantee that messages are only sent to nodes that can deal with them
    - Consist of a send port and a receive port
    - Only send ports can be serialised for transport
    - To avoid losing messages during transport, receive ports cannot
    - Messages are extracted from the receive port in FIFO order
    - Send port only accepts messages of a specific type
  17. Main channel functions §3

        newChan :: Serializable a ⇒ ProcessM (SendPort a, ReceivePort a)
        sendChan :: Serializable a ⇒ SendPort a → a → ProcessM ()
        receiveChan :: Serializable a ⇒ ReceivePort a → ProcessM a

    There are also channel-based counterparts of receiveWait
  18. Channel example §3

        ping2 :: SendPort Ping → ReceivePort Pong → ProcessM ()
        ping2 pingout pongin = forever $ do
          Pong partner ← receiveChan pongin
          sendChan partner (Ping pongout)
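Typed channels can be modelled locally by wrapping a `Chan` from base in two newtypes, so that one end can only send and the other can only receive. This is a sketch under assumptions, not the Cloud Haskell implementation: it runs in `IO` rather than `ProcessM`, nothing is serialised, the constructor is called `newChannel` to avoid clashing with `Control.Concurrent.Chan.newChan`, and the `Square` request type and the server are hypothetical.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

-- Two views on one underlying channel.
newtype SendPort a = SendPort (Chan a)
newtype ReceivePort a = ReceivePort (Chan a)

newChannel :: IO (SendPort a, ReceivePort a)
newChannel = do
  c <- newChan
  return (SendPort c, ReceivePort c)

sendChan :: SendPort a -> a -> IO ()
sendChan (SendPort c) = writeChan c

-- Messages come out in the order they were sent (FIFO).
receiveChan :: ReceivePort a -> IO a
receiveChan (ReceivePort c) = readChan c

-- A request carries the typed port on which the reply is expected.
data Square = Square Int (SendPort Int)

squareServer :: ReceivePort Square -> IO ()
squareServer requests = do
  Square n replyTo <- receiveChan requests
  sendChan replyTo (n * n)
  squareServer requests

askSquare :: SendPort Square -> Int -> IO Int
askSquare server n = do
  (replyOut, replyIn) <- newChannel
  sendChan server (Square n replyOut)
  receiveChan replyIn

demo :: IO [Int]
demo = do
  (out, inp) <- newChannel
  _ <- forkIO (squareServer inp)
  mapM (askSquare out) [1, 2, 3]
```

Because the reply port has type `SendPort Int`, the compiler rejects a server that tries to answer with anything other than an `Int`.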
  19. Closures §4

    We want to be able to send functions over the network, so masters can execute them on slaves. A first (wrong) attempt:

        sendFunc :: SendPort (Int → Int) → Int → ProcessM ()
        sendFunc p x = sendChan p (λy → x + y + 1)

    Note that x is free. To serialise a function, one must also serialise its free variables. But how? The types of these free variables are unrelated to the type of the function.
  20. Problem with serialising functions §4

    To serialise a datatype, we normally deconstruct its structure. E.g.:

        instance Serializable a ⇒ Serializable [a]

    If we can serialise a, we can serialise a list of a, because we know how to deconstruct a list. How can we specify that free variables in a function need to be serialisable? The instance below cannot be expressed:

        instance Serializable (types of the free variables of an a → b) ⇒ Serializable (a → b) where
  21. Prior solutions §4

    - Make serialisation a built-in language feature (also used in Erlang)
      - No control over serialisation process
      - Cannot prevent certain types from being serialised (e.g. MVars)
      - Makes cost of serialisation and transmitting implicit
    - Half built-in: Java RMI’s Serializable and Externalizable interfaces
      - Depends on runtime instantiation of a deserialiser for a given type
  22. Cloud Haskell solution §4

    - Observation: functions without free variables can be serialised without performing additional work
    - These are trivially a closure since they do not need an environment
    - Closure: a function together with an environment containing the function’s free variables
    - Provided nodes run the same code, these closures can be serialised as a single linker label
    - Cloud Haskell introduces a new built-in type constructor for identifying readily serialisable types: Static τ
    - And two new terms: static e and unstatic e
  23. The Static type §4

    - Static will be a new built-in type constructor in GHC
    - Also built-in: instance Serializable (Static a)
    - All values of type Static a can be serialised
    - Even without a Serializable constraint on a
    - First evaluates a Static, then serialises the code label of the evaluation result
    - Works for data types as well, e.g. Static Tree
  24. static and unstatic terms §4

    - Will also be built into GHC
    - static and unstatic introduce or eliminate the Static constructor
    - static may only be applied to top-level functions
    - unstatic has no such restrictions
    - If e : τ, then static e : Static τ
    - If e : Static τ, then unstatic e : τ
  25. Ingredients §4

    Now what do we need to serialise a function with free variables?
    - It needs to be top-level, since only top-level functions can be static and static functions are the ones we can serialise
    - A serialised environment containing the function’s free variables, so we can make a closure
    - Some way to reconstruct the serialised function using the serialised environment
  26. From static values to closures §4

    We model a closure as a pair containing an environment and a static function that, when given the environment, gives us our closure. A first attempt (using a GADT):

        data Closure a where
          MkClosure :: Serializable env ⇒ Static (env → a) → env → Closure a

    We require that the environment is serialisable as well. Note that it is existentially quantified. Although this seems reasonable, it will give us problems on the receiving end. Why?
  27. Serialising a closure §4

    We need to be able to serialise a closure for transmission over the network. Recall that we need an instance of Binary for Closure for this to work:

        instance Binary (Closure a) where
          put (MkClosure f env) = put f >> put env
          get = ⊥

    How do we deserialise? At the receiving end, we do not know which deserialiser to use.
  28. From static values to closures §4

    Solution: we do both serialisation and deserialisation of the environment at closure construction time:

        data Closure a where
          MkClosure :: Static (ByteString → a) → ByteString → Closure a

    Functions defined in Data.Binary to help with the conversion to/from ByteString:

        encode :: Binary a ⇒ a → ByteString
        decode :: Binary a ⇒ ByteString → a
  29. sendFunc revisited §4

    Recall the wrong approach:

        sendFunc :: SendPort (Int → Int) → Int → ProcessM ()
        sendFunc p x = sendChan p (λy → x + y + 1)

    Now the right approach:

        sendFunc :: SendPort (Closure (Int → Int)) → Int → ProcessM ()
        sendFunc p x = sendChan p clo
          where clo = MkClosure (static sfun) (encode x)

        sfun :: ByteString → Int → Int
        sfun bs y = let x = decode bs in x + y + 1
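Since `Static` and `static` are proposed GHC built-ins, the closure idea can only be approximated in ordinary Haskell. The sketch below, assuming the `binary` and `bytestring` packages that ship with GHC, replaces the linker label by a string looked up in a table that every node is assumed to share. `StaticLabel`, `staticTable`, `mkAddClosure`, `unClosure`, `toBytes` and `fromBytes` are hypothetical names, and the closure type is fixed to `Int → Int` to keep the example small.

```haskell
import Data.Binary (decode, encode)
import Data.ByteString.Lazy (ByteString)

-- Stand-in for `Static (ByteString -> a)`: a label naming a top-level function.
newtype StaticLabel = StaticLabel String deriving (Eq, Show)

-- A closure is a static label paired with an already-serialised environment.
data Closure = Closure StaticLabel ByteString deriving (Eq, Show)

-- Top-level function without free variables; it rebuilds x from the environment.
sfun :: ByteString -> Int -> Int
sfun env y = let x = decode env in x + y + 1

-- Table assumed to be identical on all nodes, since they run the same code.
staticTable :: [(String, ByteString -> Int -> Int)]
staticTable = [("sfun", sfun)]

-- Sender side: capture x by serialising it into the environment.
mkAddClosure :: Int -> Closure
mkAddClosure x = Closure (StaticLabel "sfun") (encode x)

-- Receiver side: look the label up and apply the function to the environment.
unClosure :: Closure -> Maybe (Int -> Int)
unClosure (Closure (StaticLabel name) env) =
  fmap ($ env) (lookup name staticTable)

-- What would travel over the network: only a label and some bytes.
toBytes :: Closure -> ByteString
toBytes (Closure (StaticLabel name) env) = encode (name, env)

fromBytes :: ByteString -> Closure
fromBytes bs = let (name, env) = decode bs in Closure (StaticLabel name) env
```

The receiver never needs to know the type of the captured variable: `sfun` itself decodes the environment, which is the point of fixing the environment type to `ByteString`.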
  30. Performance §5

    [Figure: "Performance of k-means on EC2 cluster", time (s) against number of mappers, for Cloud Haskell and Hadoop]

    Figure 4. The run-time of the k-means algorithm, implemented under Cloud Haskell and Hadoop. The input data was one million 100-dimensional data points.
  31. Conclusion and future work §5

    - Cloud Haskell provides a good starting point for building distributed applications
    - Already performs quite well
    - Current bottleneck is acquiring and loading data
    - Authors suggest improving file handling will improve performance
    - Still in early stages: Static etc. still need to be implemented in GHC
    - Currently it requires workarounds with Template Haskell
    - Authors propose an additional higher-level framework, revolving around tasks rather than processes
    - Task: an idempotent restartable block of code that produces a well-defined result