§1 EDSL for cloud computing Computing on processors with separate memories, connected via a network Strongly based on Erlang’s message-passing model Concurrent processes have no access to each other’s data Data needs to be communicated explicitly A concurrent process can fail without affecting others
A low-level framework for distributed computing in Haskell Part library, part extended GHC A way to serialise functions and send them across a network Improves over the Erlang solution Type safety and a strong separation between pure and effectful code Computing with pure functions eliminates the need for transactions; computations can be resumed elsewhere in case of failure
§1 Process a set of numbers 1..n k slave nodes, each node consuming n/k numbers Processing function is transmitted over the network from master to slaves, together with their respective input Master merges answers from slaves and spits out a final answer How does the master orchestrate the slaves? How do we serialise and send a function over a network?
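The orchestration question can be made concrete with a minimal single-machine sketch. It uses ordinary GHC threads and MVars from base, not Cloud Haskell, so unlike real slave nodes the workers share memory and no function is serialised. The names chunks and runMaster are hypothetical, and the sketch assumes k ≥ 1.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- Split [1..n] into at most k contiguous chunks of ceiling(n/k) numbers
-- (the last chunk may be shorter). Assumes k >= 1.
chunks :: Int -> Int -> [[Int]]
chunks n k = go [1 .. n]
  where
    size = max 1 ((n + k - 1) `div` k)
    go [] = []
    go xs = let (h, t) = splitAt size xs in h : go t

-- The "master": start one worker thread per chunk, wait for every
-- partial answer, then merge the answers into a final result.
runMaster :: ([Int] -> Int) -> ([Int] -> Int) -> Int -> Int -> IO Int
runMaster work merge n k = do
  boxes   <- mapM spawnWorker (chunks n k)
  results <- mapM takeMVar boxes
  return (merge results)
  where
    spawnWorker chunk = do
      box <- newEmptyMVar
      -- Evaluate the partial result in the worker thread, then hand it back.
      _ <- forkIO (putMVar box $! work chunk)
      return box
```

For example, runMaster sum sum 100 4 sums four chunks of 25 numbers in separate threads and then adds the partial sums.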
§2 Processes Basic unit of concurrency Can send and receive messages Lightweight: low creation and scheduling overhead Only message-passing between processes, shared memory only inside a single process Messages Asynchronous and buffered All code related to messaging in ProcessM monad Two styles of message passing Untyped messages (like Erlang) Typed channels
§2 Basic functions [1] for untyped messaging: send :: Serializable a ⇒ ProcessId → a → ProcessM () expect :: Serializable a ⇒ ProcessM a Messages of any serialisable type may be sent and received. Channels (discussed later) allow for restricting the type of the messages. [1] The authors call these primitives, which is not entirely accurate.
All types for which there are Binary and Typeable instances are serialisable for Cloud Haskell: class (Binary a, Typeable a) ⇒ Serializable a Binary Binary serialisation of Haskell values to and from lazy ByteStrings Typeable Associates type representation to types, giving runtime type information No such instances for types like MVar
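A minimal sketch of how such a class can be written with the binary package and Data.Typeable. The catch-all instance and the helpers toWire and roundTrip are assumptions made for illustration, not necessarily how Cloud Haskell defines them.

```haskell
{-# LANGUAGE FlexibleInstances, UndecidableInstances #-}
import Data.Binary (Binary, decode, encode)
import Data.ByteString.Lazy (ByteString)
import Data.Typeable (Typeable, typeOf)

-- A class with no methods: it only bundles the two superclass constraints.
class (Binary a, Typeable a) => Serializable a

-- Assumed catch-all instance: anything Binary and Typeable is Serializable.
instance (Binary a, Typeable a) => Serializable a

-- Hypothetical helper: pair the runtime type name with the encoded bytes,
-- roughly what an untyped message needs to carry.
toWire :: Serializable a => a -> (String, ByteString)
toWire x = (show (typeOf x), encode x)

-- Hypothetical helper: encode a value and decode it again at the same type.
roundTrip :: Serializable a => a -> a
roundTrip = decode . encode
```

A type such as MVar has no Binary instance, so a call like toWire on an MVar is rejected at compile time.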
of multiple types §2 expect lets us receive messages of one particular type. receiveWait allows us to match against multiple types: receiveWait :: [MatchM q ()] → ProcessM q Matching happens in the MatchM monad, which contains a computation to execute when a match is successful. q is the result type of the matched computation. expect is implemented in terms of receiveWait.
computations §2 We use match and matchIf to construct the MatchM q () computations: match :: Serializable a ⇒ (a → ProcessM q) → MatchM q () matchIf accepts a predicate that must hold for the matching to succeed: matchIf :: Serializable a ⇒ (a → Bool) → (a → ProcessM q) → MatchM q ()
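The matching behaviour can be illustrated with a simplified pure model. This is not the Cloud Haskell API: the mailbox is a plain list of Dynamic values, handlers return a result directly instead of running in ProcessM, and the Match type and receiveFrom function are hypothetical.

```haskell
import Data.Dynamic (Dynamic, fromDynamic, toDyn)
import Data.Maybe (mapMaybe)
import Data.Typeable (Typeable)

-- A matcher inspects one message and may produce a result of type q.
newtype Match q = Match (Dynamic -> Maybe q)

-- Accept any message whose runtime type is the handler's argument type.
match :: Typeable a => (a -> q) -> Match q
match f = Match (fmap f . fromDynamic)

-- As match, but the predicate must also hold for the message.
matchIf :: Typeable a => (a -> Bool) -> (a -> q) -> Match q
matchIf p f = Match $ \d -> case fromDynamic d of
  Just x | p x -> Just (f x)
  _            -> Nothing

-- Scan the mailbox in arrival order. The first message accepted by any
-- matcher is removed and handled; messages nobody accepts stay queued.
receiveFrom :: [Match q] -> [Dynamic] -> Maybe (q, [Dynamic])
receiveFrom ms = go []
  where
    go _ [] = Nothing
    go skipped (d : ds) =
      case mapMaybe (\(Match m) -> m d) ms of
        (q : _) -> Just (q, reverse skipped ++ ds)
        []      -> go (d : skipped) ds
```

In this model, a message of an unexpected type is skipped rather than consumed, and when several matchers accept the same message the earliest one in the list wins.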
blocking §2 receiveWait waits indefinitely for a message, but we also want to be able to stop waiting for messages to arrive after n seconds: receiveTimeout :: Int → [MatchM q ()] → ProcessM (Maybe q) Example: rcvTO = do send pid (Query stuff) ret ← receiveTimeout 50000 [match (λ(Response answer) → return answer)] case ret of Nothing → showError "Timeout!" Just ans → showAnswer ans
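The same Maybe-based timeout pattern can be tried locally with System.Timeout from base. This is a sketch, not Cloud Haskell: the mailbox is a single MVar, the names receiveTimeoutModel and queryWithTimeout are hypothetical, and base's timeout takes its limit in microseconds.

```haskell
import Control.Concurrent.MVar
import System.Timeout (timeout)

-- Wait for a message for at most the given number of microseconds.
-- Nothing means the timeout expired before a message arrived.
receiveTimeoutModel :: Int -> MVar a -> IO (Maybe a)
receiveTimeoutModel micros mailbox = timeout micros (takeMVar mailbox)

-- Mirrors the shape of the slide's example: wait briefly, then branch
-- on whether an answer arrived.
queryWithTimeout :: MVar String -> IO String
queryWithTimeout mailbox = do
  ret <- receiveTimeoutModel 50000 mailbox
  return $ case ret of
    Nothing  -> "Timeout!"
    Just ans -> "Answer: " ++ ans
```

Calling queryWithTimeout on an empty MVar returns the timeout string after roughly 50 milliseconds; on a filled MVar it returns the answer immediately.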
locating processes §2 How do we start and locate a process? A node is Cloud Haskell’s unit of location NodeId is a unique node identifier that contains an IP address We use spawn to start a new process data NodeId = NodeId HostName PortId spawn :: NodeId → Closure (ProcessM ()) → ProcessM ProcessId We will talk about Closure later
§2 Fault tolerance in Cloud Haskell is (again) based on ideas from Erlang Do not try to recover, but terminate: another process will take over Cloud Haskell/Erlang philosophy: trying to recover will not lead to robustness Made easier by lack of shared state between processes and pure functions
between two processes §2 A process can monitor another process for termination Notification of termination can be received in two ways A regular Haskell exception A Cloud Haskell message Unidirectional monitoring One process monitors another Choose notification style Bidirectional monitoring One node is informed of the other’s abnormal termination Notification by exception
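Unidirectional monitoring with notification by message can be approximated on one machine with forkFinally from base. This is only an analogy using threads: the Exit type and the functions spawnMonitored and failingWorker are hypothetical, and nothing here crosses a network.

```haskell
import Control.Concurrent (forkFinally)
import Control.Concurrent.MVar
import Control.Exception (ErrorCall (..), SomeException, throwIO)

-- The notification delivered to the monitoring side.
data Exit = ExitedNormally | ExitedAbnormally String
  deriving (Eq, Show)

-- Start a worker and return a box that receives exactly one
-- notification when the worker terminates, normally or not.
spawnMonitored :: IO () -> IO (MVar Exit)
spawnMonitored worker = do
  notice <- newEmptyMVar
  _ <- forkFinally worker (putMVar notice . toExit)
  return notice
  where
    toExit :: Either SomeException () -> Exit
    toExit (Left e)   = ExitedAbnormally (show e)
    toExit (Right ()) = ExitedNormally

-- A worker that always fails, for trying out the abnormal case.
failingWorker :: IO ()
failingWorker = throwIO (ErrorCall "boom")
```

The monitor does not try to repair the failed worker; in the spirit of the slides it would react to ExitedAbnormally by starting a replacement.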
channels §3 send allowed us to send any message to any process Typed channels give a static guarantee that messages are only sent to processes that can deal with them Consist of a send port and a receive port Only send ports can be serialised for transport Receive ports cannot be serialised, to avoid losing messages during transport Messages are extracted from the receive port in FIFO order Send port only accepts messages of a specific type
functions §3 newChan :: Serializable a ⇒ ProcessM (SendPort a, ReceivePort a) sendChan :: Serializable a ⇒ SendPort a → a → ProcessM () receiveChan :: Serializable a ⇒ ReceivePort a → ProcessM a There are also channel-based counterparts of receiveWait
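The split into a send port and a receive port can be modelled locally on top of Control.Concurrent.Chan. This is a sketch under that assumption, with hypothetical names ending in Model: it runs in IO rather than ProcessM, needs no Serializable constraint, and the ports cannot be sent over a network.

```haskell
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

-- Two views of the same underlying queue. The type parameter fixes
-- the message type accepted by the send port at compile time.
newtype SendPort a = SendPort (Chan a)
newtype ReceivePort a = ReceivePort (Chan a)

-- Create a channel and return its two ends.
newChanModel :: IO (SendPort a, ReceivePort a)
newChanModel = do
  c <- newChan
  return (SendPort c, ReceivePort c)

-- Asynchronous, buffered send: never blocks the sender.
sendChanModel :: SendPort a -> a -> IO ()
sendChanModel (SendPort c) = writeChan c

-- Blocking receive; messages come out in FIFO order.
receiveChanModel :: ReceivePort a -> IO a
receiveChanModel (ReceivePort c) = readChan c
```

Sending a String through a SendPort Int is a type error here, which is the static guarantee typed channels add over untyped send.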
We want to be able to send functions over the network, so masters can execute them on slaves. A first (wrong) attempt: sendFunc :: SendPort (Int → Int) → Int → ProcessM () sendFunc p x = sendChan p (λy → x + y + 1) Note that x is free. To serialize a function, one must also serialize its free variables. But how? The types of these free variables are unrelated to the type of the function.
serialising functions §4 To serialise a datatype, we normally deconstruct its structure. E.g.: instance Serializable a ⇒ Serializable [a] If we can serialise a, we can serialise a list of a, because we know how to deconstruct a list. How can we specify that free variables in a function need to be serialisable? The instance below cannot be expressed: instance Serializable (types of the free variables of an a → b) ⇒ Serializable (a → b) where
§4 Make serialisation a built-in language feature (also used in Erlang) No control over serialisation process Cannot prevent certain types from being serialised (e.g. MVars) Makes cost of serialisation and transmitting implicit Half built-in: Java RMI’s Serializable and Externalizable interfaces Depends on runtime instantiation of a deserialiser for a given type
solution §4 Observation: functions without free variables can be serialised without performing additional work These are trivially a closure since they do not need an environment Closure: a function together with an environment containing the function’s free variables Provided nodes run the same code, these closures can be serialised as a single linker label Cloud Haskell introduces a new built-in type constructor for identifying readily serialisable types: Static τ And two new terms: static e and unstatic e
type §4 Static will be a new built-in type constructor in GHC Also built-in: instance Serializable (Static a) All values of type Static a can be serialised Even without a Serializable constraint on a First evaluates a Static, then serialises the code label of the evaluation result Works for data types as well E.g. Static Tree
unstatic terms §4 Will also be built into GHC static and unstatic introduce or eliminate the Static constructor static may only be applied to top-level functions unstatic has no such restrictions If e : τ, then static e : Static τ If e : Static τ, then unstatic e : τ
Now what do we need to serialise a function with free variables? It needs to be top-level, since only top-level functions can be static and static functions are the ones we can serialise A serialised environment containing the function’s free variables, so we can make a closure Some way to reconstruct the serialised function using the serialised environment
values to closures §4 We model a closure as a pair containing an environment and a static function that, when given the environment, gives us our closure A first attempt (using a GADT): data Closure a where MkClosure :: Serializable env ⇒ Static (env → a) → env → Closure a We require that the environment is serialisable as well. Note that it is existentially quantified. Although this seems reasonable, it will give us problems on the receiving end. Why?
closure §4 We need to be able to serialise a closure for transmission over the network. Recall that we need an instance of Binary for Closure for this to work: instance Binary (Closure a) where put (MkClosure f env) = put f >> put env get = ⊥ How do we deserialise? At the receiving end, we do not know which deserialiser to use.
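The deserialisation problem can be seen in a small sketch that keeps only the existentially quantified environment. The Env type and the names putEnv, intEnv and int64Env are hypothetical. The sketch relies on the binary package writing an Int in the same 64-bit format as an Int64, so the bytes alone do not reveal which type was packed.

```haskell
{-# LANGUAGE GADTs #-}
import Data.Binary (Binary, encode)
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)

-- An environment whose type is hidden; only its Binary instance is kept.
data Env where
  MkEnv :: Binary env => env -> Env

-- Serialising works: pattern matching brings the Binary dictionary
-- for the hidden type back into scope.
putEnv :: Env -> BL.ByteString
putEnv (MkEnv env) = encode env

-- Two environments of different types.
intEnv, int64Env :: Env
intEnv   = MkEnv (7 :: Int)
int64Env = MkEnv (7 :: Int64)

-- There is no getEnv :: BL.ByteString -> Env. decode has to be told the
-- target type, and the receiver has neither the type nor its dictionary.
```

Both environments produce the same byte string, so a receiver holding only the bytes cannot pick the right decoder.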
values to closures §4 Solution: we do both serialisation and deserialisation of the environment at closure construction time: data Closure a where MkClosure :: Static (ByteString → a) → ByteString → Closure a Functions defined in Data.Binary to help with the conversion to/from ByteString: encode :: Binary a ⇒ a → ByteString decode :: Binary a ⇒ ByteString → a
§4 Recall the wrong approach: sendFunc :: SendPort (Int → Int) → Int → ProcessM () sendFunc p x = sendChan p (λy → x + y + 1) Now the right approach: sendFunc :: SendPort (Closure (Int → Int)) → Int → ProcessM () sendFunc p x = sendChan p clo where clo = MkClosure (static sfun) (encode x) sfun :: ByteString → Int → Int sfun bs y = let x = decode bs in x + y + 1
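Since the slides note that static is not yet available in GHC, the round trip can be imitated by modelling a static value as a label in a lookup table that sender and receiver both have. This is a sketch under that assumption: Label, table, serialise, deserialise and unClosure are hypothetical, and Closure is fixed to Int → Int functions to keep the table simple.

```haskell
import Data.Binary (decode, encode)
import qualified Data.ByteString.Lazy as BL

-- Stand-in for Static: a name that every node resolves to the same code.
type Label = String

-- A closure for Int -> Int functions: code label plus encoded environment.
data Closure = Closure Label BL.ByteString

-- The top-level function from the slide: decode x from the environment.
sfun :: BL.ByteString -> Int -> Int
sfun bs y = let x = decode bs :: Int in x + y + 1

-- The table shared by all nodes, assuming they run the same program.
table :: [(Label, BL.ByteString -> Int -> Int)]
table = [("sfun", sfun)]

-- Sender side: capture the free variable x by encoding it.
mkAddClosure :: Int -> Closure
mkAddClosure x = Closure "sfun" (encode x)

-- Wire format: the label and the environment bytes, nothing else.
serialise :: Closure -> BL.ByteString
serialise (Closure l env) = encode (l, env)

deserialise :: BL.ByteString -> Closure
deserialise bs = let (l, env) = decode bs in Closure l env

-- Receiver side: look up the code and apply it to the environment.
unClosure :: Closure -> Maybe (Int -> Int)
unClosure (Closure l env) = fmap ($ env) (lookup l table)
```

Because the environment is already a ByteString, the receiver never needs to know its original type; only sfun, which is found through the label, decodes it.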
[Plot: "Performance of k-means on EC2 cluster", time (s) against number of mappers, for Cloud Haskell and Hadoop] Figure 4. The run-time of the k-means algorithm, implemented under Cloud Haskell and Hadoop. The input data was one million 100-dimensional data points.
future work §5 Cloud Haskell provides a good starting point for building distributed applications Already performs quite well Current bottleneck is acquiring and loading data Authors suggest improving file handling will improve performance Still in early stages: Static etc. still need to be implemented in GHC Currently it requires workarounds with Template Haskell Authors propose an additional higher-level framework, revolving around tasks rather than processes Task: an idempotent restartable block of code that produces a well-defined result