STC: Towards Haskell in the Cloud

[Faculty of Science Information and Computing Sciences]
Towards Haskell in the Cloud
Epstein, Black and Peyton-Jones, 2011
Jurriën Stutterheim
15 December, 2011

1. Introduction

Cloud Haskell §1
- EDSL for cloud computing
  - Computing on processors with separate memories, connected via a network
- Strongly based on Erlang’s message-passing model
  - Concurrent processes have no access to each other’s data
  - Data needs to be communicated explicitly
  - A concurrent process can fail without affecting others

Contributions §1
- A low-level framework for distributed computing in Haskell
  - Part library, part extended GHC
- A way to serialise functions and send them across a network
  - Improves over the Erlang solution
- Type safety and a strong separation between pure and effectful code
  - Computing with pure functions eliminates the need for transactions; computations can be resumed elsewhere in case of failure

Example scenario §1
- Process a set of numbers 1..n
- k slave nodes, each node consuming n/k numbers
- Processing function is transmitted over the network from master to slaves, together with their respective input
- Master merges answers from slaves and spits out a final answer
- How does the master orchestrate the slaves?
- How do we serialise and send a function over a network?
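The split-and-merge shape of this scenario can be sketched in plain Haskell, leaving the network out entirely. This is a minimal local model under stated assumptions, not Cloud Haskell code: `chunksInto`, `slaveWork` and `masterRun` are hypothetical names, and the slaves are ordinary function calls rather than remote processes.

```haskell
-- Split a list into k nearly equal contiguous chunks (hypothetical helper).
chunksInto :: Int -> [a] -> [[a]]
chunksInto k xs = go k (length xs) xs
  where
    go parts remaining ys
      | parts <= 0 = []
      | otherwise =
          let size = (remaining + parts - 1) `div` parts  -- ceiling division
              (chunk, rest) = splitAt size ys
          in chunk : go (parts - 1) (remaining - size) rest

-- Stand-in for the work one slave performs on its share of the input.
slaveWork :: (Int -> Int) -> [Int] -> Int
slaveWork f = sum . map f

-- The master splits 1..n into k chunks, hands each chunk to a "slave",
-- and merges the partial answers by summing them.
masterRun :: Int -> (Int -> Int) -> Int -> Int
masterRun k f n = sum (map (slaveWork f) (chunksInto k [1 .. n]))
```

For example, `masterRun 4 (* 2) 10` splits 1..10 over four workers and sums the doubled values. The two open questions on the slide are exactly the parts this model skips: orchestration and shipping `f` to another machine.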

2. The basics

Basic elements §2
- Processes
  - Basic unit of concurrency
  - Can send and receive messages
  - Lightweight: low creation and scheduling overhead
  - Only message-passing between processes, shared memory only inside a single process
- Messages
  - Asynchronous and buffered
  - All code related to messaging in ProcessM monad
- Two styles of message passing
  - Untyped messages (like Erlang)
  - Typed channels

Untyped messaging §2
Basic functions for untyped messaging (the authors call these primitives, which is not entirely true):

    send :: Serializable a ⇒ ProcessId → a → ProcessM ()
    expect :: Serializable a ⇒ ProcessM a

Messages of any serialisable type may be sent and received. Channels (discussed later) allow for restricting the type of the messages.

Serialisation §2
All types for which there are Binary and Typeable instances are serialisable for Cloud Haskell:

    class (Binary a, Typeable a) ⇒ Serializable a

- Binary: binary serialisation of Haskell values to and from lazy ByteStrings
- Typeable: associates type representations with types, giving runtime type information
- No such instances for types like MVar
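To make the two superclasses concrete, here is a small sketch of a message type that satisfies both. It assumes the `binary` package that ships with GHC; the `Serializable` class is re-declared locally to mirror the slide, and `Reading`, `roundTrip` and `typeName` are hypothetical names for illustration.

```haskell
import Data.Binary (Binary (..), decode, encode)
import Data.Typeable (Typeable, typeOf)

-- Local re-declaration mirroring the class on the slide (no methods of its own).
class (Binary a, Typeable a) => Serializable a

-- A hypothetical message type; recent GHC versions provide Typeable automatically.
data Reading = Reading Int Double deriving (Eq, Show)

-- Hand-written Binary instance: write the fields in order, read them back in order.
instance Binary Reading where
  put (Reading n x) = put n >> put x
  get = Reading <$> get <*> get

instance Serializable Reading

-- Binary gives us the bytes: encode to a lazy ByteString and decode it again.
roundTrip :: Serializable a => a -> a
roundTrip = decode . encode

-- Typeable gives us runtime type information, here just the type's name.
typeName :: Serializable a => a -> String
typeName = show . typeOf
```

A receiver needs both halves: the type information to decide whether a message is of the type it is waiting for, and the Binary instance to rebuild the value.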

Untyped messaging example §2

    data Ping = Ping ProcessId deriving Typeable
    data Pong = Pong ProcessId deriving Typeable
    -- omitted: Binary instances

    ping :: ProcessM ()
    ping = forever $ do
      Pong partner ← expect
      self ← getSelfPid
      send partner (Ping self)

expect waits for a message and accepts it when its type matches the one required by the context (here Pong).

Matching messages of multiple types §2
expect lets us receive messages of one particular type. receiveWait allows us to match against multiple types:

    receiveWait :: [MatchM q ()] → ProcessM q

Matching happens in the MatchM monad, which contains a computation to execute when a match is successful. q is the result type of the matched computation. expect is implemented in terms of receiveWait.

Creating MatchM computations §2
We use match and matchIf to construct the MatchM q () computations:

    match :: Serializable a ⇒ (a → ProcessM q) → MatchM q ()

matchIf accepts a predicate that must hold for the matching to succeed:

    matchIf :: Serializable a ⇒ (a → Bool) → (a → ProcessM q) → MatchM q ()

Matching example §2

    data Add = Add ProcessId Double Double
    data Div = Div ProcessId Double Double
    data DivByZero = DivByZero

    math :: ProcessM ()
    math = forever $ receiveWait
      [ match (λ(Add pid num1 num2) → send pid (num1 + num2))
      , matchIf (λ(Div _ _ num2) → num2 ≢ 0)
                (λ(Div pid num1 num2) → send pid (num1 / num2))
      , match (λ(Div pid _ _) → send pid DivByZero)
      ]
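The type-directed matching can be modelled without any networking by treating the mailbox as a list of `Dynamic` values. This is a pure sketch, not the Cloud Haskell implementation: `Match`, `pureMatch`, `pureMatchIf` and `pureReceive` are hypothetical names, and the scan order (messages in arrival order, handlers in list order) is an assumption modelled on Erlang-style selective receive.

```haskell
import Data.Dynamic (Dynamic, fromDynamic, toDyn)
import Data.Typeable (Typeable)

-- A handler either rejects a message (Nothing) or produces a result.
newtype Match q = Match (Dynamic -> Maybe q)

-- Accept any message of type a.
pureMatch :: Typeable a => (a -> q) -> Match q
pureMatch = pureMatchIf (const True)

-- Accept a message of type a only if the predicate holds.
pureMatchIf :: Typeable a => (a -> Bool) -> (a -> q) -> Match q
pureMatchIf ok handler = Match $ \msg ->
  case fromDynamic msg of
    Just x | ok x -> Just (handler x)
    _             -> Nothing

-- Scan the mailbox in arrival order, trying the handlers in order on each
-- message. Returns the result and the mailbox without the consumed message.
pureReceive :: [Match q] -> [Dynamic] -> Maybe (q, [Dynamic])
pureReceive handlers = go []
  where
    go _ [] = Nothing
    go skipped (m : rest) =
      case [q | Match h <- handlers, Just q <- [h m]] of
        (q : _) -> Just (q, reverse skipped ++ rest)
        []      -> go (m : skipped) rest

-- Simplified message types (no ProcessId, since there is nobody to reply to).
data Add = Add Double Double
data Div = Div Double Double

math :: [Match (Either String Double)]
math =
  [ pureMatch   (\(Add a b) -> Right (a + b))
  , pureMatchIf (\(Div _ b) -> b /= 0) (\(Div a b) -> Right (a / b))
  , pureMatch   (\(Div _ _) -> Left "DivByZero")
  ]
```

With a mailbox such as `[toDyn "noise", toDyn (Div 1 0)]`, the string is skipped because no handler has a matching type, the guarded `Div` handler rejects the zero divisor, and the final handler reports the error.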

Matching without blocking §2
receiveWait waits indefinitely for a message, but we also want to be able to stop waiting for messages to arrive once a given timeout has elapsed:

    receiveTimeout :: Int → [MatchM q ()] → ProcessM (Maybe q)

Example:

    rcvTO = do
      send pid (Query stuff)
      ret ← receiveTimeout 50000 [match (λ(Response answer) → return answer)]
      case ret of
        Nothing → showError "Timeout!"
        Just ans → showAnswer ans
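The same "wait, but give up eventually" shape exists in base as `System.Timeout.timeout`, whose limit is given in microseconds. The sketch below uses an `MVar` as a one-slot mailbox; it is a local analogy rather than the Cloud Haskell mechanism, and `awaitReply` and `askWithDelay` are hypothetical names.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import System.Timeout (timeout)

-- Wait for a reply in the mailbox for at most `micros` microseconds.
-- Nothing means the timeout expired first, like receiveTimeout's Maybe result.
awaitReply :: Int -> MVar a -> IO (Maybe a)
awaitReply micros mailbox = timeout micros (takeMVar mailbox)

-- A responder thread that answers after `delay` microseconds, standing in
-- for a remote process; the caller waits at most `limit` microseconds.
askWithDelay :: Int -> Int -> IO (Maybe String)
askWithDelay delay limit = do
  mailbox <- newEmptyMVar
  _ <- forkIO (threadDelay delay >> putMVar mailbox "answer")
  awaitReply limit mailbox
```

A fast responder (`askWithDelay 1000 1000000`) yields a `Just`, while a responder slower than the limit (`askWithDelay 2000000 50000`) yields `Nothing`, which the caller can turn into an error report as rcvTO does.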

Starting and locating processes §2
- How do we start and locate a process?
- A node is Cloud Haskell’s unit of location
- NodeId is a unique node identifier that contains an IP address
- We use spawn to start a new process

    data NodeId = NodeId HostName PortId
    spawn :: NodeId → Closure (ProcessM ()) → ProcessM ProcessId

We will talk about Closure later

Fault tolerance §2
- Fault tolerance in Cloud Haskell is (again) based on ideas from Erlang
- Do not try to recover, but terminate: another process will take over
- Cloud Haskell/Erlang philosophy: trying to recover will not lead to robustness
- Made easier by pure functions and the lack of shared state between processes

Fault handling between two processes §2
- A process can monitor another process for termination
- Notification of termination can be received in two ways
  - A regular Haskell exception
  - A Cloud Haskell message
- Unidirectional monitoring
  - One process monitors another
  - Choose notification style
- Bidirectional monitoring
  - Each process is informed of the other’s abnormal termination
  - Notification by exception

3. Type safe messaging: channels

Messages through channels §3
- send allowed us to send any message to any process
- Typed channels give a static guarantee that messages are only sent to processes that can deal with them
- Consist of a send port and a receive port
- Only send ports can be serialised for transport
  - Receive ports cannot, to avoid losing messages during transport
- Messages are extracted from the receive port in FIFO order
- Send port only accepts messages of a specific type

Main channel functions §3

    newChan :: Serializable a ⇒ ProcessM (SendPort a, ReceivePort a)
    sendChan :: Serializable a ⇒ SendPort a → a → ProcessM ()
    receiveChan :: Serializable a ⇒ ReceivePort a → ProcessM a

There are also channel-based counterparts of receiveWait

Channel example §3

    ping2 :: SendPort Ping → ReceivePort Pong → ProcessM ()
    ping2 pingout pongin = forever $ do
      Pong partner ← receiveChan pongin
      sendChan partner (Ping pongout)
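The static guarantee of typed channels can be illustrated inside a single program with base's `Chan`. This is a local sketch under stated assumptions, not Cloud Haskell's channels: the `SendPort` and `ReceivePort` types here are stand-ins wrapping one `Chan`, nothing is serialised, and `newLocalChan`, `sendLocal`, `receiveLocal` and `pingOnce` are hypothetical names. The message types carry an `Int` instead of a ProcessId.

```haskell
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

-- Two views of one channel: one end can only write, the other can only read.
newtype SendPort a = SendPort (Chan a)
newtype ReceivePort a = ReceivePort (Chan a)

newLocalChan :: IO (SendPort a, ReceivePort a)
newLocalChan = do
  ch <- newChan
  pure (SendPort ch, ReceivePort ch)

-- The port's type parameter fixes the message type at compile time.
sendLocal :: SendPort a -> a -> IO ()
sendLocal (SendPort ch) = writeChan ch

-- Messages come out in the order they were sent (FIFO).
receiveLocal :: ReceivePort a -> IO a
receiveLocal (ReceivePort ch) = readChan ch

newtype Ping = Ping Int deriving (Eq, Show)
newtype Pong = Pong Int deriving (Eq, Show)

-- Answer one Pong with a Ping carrying the same sequence number.
pingOnce :: SendPort Ping -> ReceivePort Pong -> IO ()
pingOnce pingOut pongIn = do
  Pong n <- receiveLocal pongIn
  sendLocal pingOut (Ping n)
```

Trying to write `sendLocal pingOut (Pong 1)` is rejected by the type checker, which is the property untyped send cannot offer.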

4. Serialising and transmitting functions

Closures §4
We want to be able to send functions over the network, so masters can execute them on slaves. A first (wrong) attempt:

    sendFunc :: SendPort (Int → Int) → Int → ProcessM ()
    sendFunc p x = sendChan p (λy → x + y + 1)

Note that x is free. To serialise a function, one must also serialise its free variables. But how? The types of these free variables are unrelated to the type of the function.

Problem with serialising functions §4
To serialise a datatype, we normally deconstruct its structure. E.g.:

    instance Serializable a ⇒ Serializable [a]

If we can serialise a, we can serialise a list of a, because we know how to deconstruct a list. How can we specify that free variables in a function need to be serialisable? The instance below cannot be expressed:

    instance Serializable (types of the free variables of an a → b) ⇒ Serializable (a → b) where

Prior solutions §4
- Make serialisation a built-in language feature (also used in Erlang)
  - No control over serialisation process
  - Cannot prevent certain types from being serialised (e.g. MVars)
  - Makes cost of serialisation and transmitting implicit
- Half built-in: Java RMI’s Serializable and Externalizable interfaces
  - Depends on runtime instantiation of a deserialiser for a given type

Cloud Haskell solution §4
- Observation: functions without free variables can be serialised without performing additional work
  - These are trivially closures, since they do not need an environment
  - Closure: a function together with an environment containing the function’s free variables
  - Provided nodes run the same code, these closures can be serialised as a single linker label
- Cloud Haskell introduces a new built-in type constructor for identifying readily serialisable types: Static τ
- And two new terms: static e and unstatic e

The Static type §4
- Static will be a new built-in type constructor in GHC
- Also built-in: instance Serializable (Static a)
  - All values of type Static a can be serialised
  - Even without a Serializable constraint on a
- First evaluates a Static, then serialises the code label of the evaluation result
- Works for data types as well
  - E.g. Static Tree

static and unstatic terms §4
- Will also be built into GHC
- static introduces the Static constructor, unstatic eliminates it
  - static may only be applied to top-level functions
  - unstatic has no such restrictions
- If e : τ, then static e : Static τ
- If e : Static τ, then unstatic e : τ

Ingredients §4
Now what do we need to serialise a function with free variables?
- It needs to be top-level, since only top-level functions can be static and static functions are the ones we can serialise
- A serialised environment containing the function’s free variables, so we can make a closure
- Some way to reconstruct the serialised function using the serialised environment

From static values to closures §4
We model a closure as a pair containing an environment and a static function that, when given the environment, yields the value the closure stands for. A first attempt (using a GADT):

    data Closure a where
      MkClosure :: Serializable env ⇒ Static (env → a) → env → Closure a

We require that the environment is serialisable as well. Note that it is existentially quantified. Although this seems reasonable, it will give us problems on the receiving end. Why?

Serialising a closure §4
We need to be able to serialise a closure for transmission over the network. Recall that we need an instance of Binary for Closure for this to work:

    instance Binary (Closure a) where
      put (MkClosure f env) = put f >> put env
      get = ⊥

How do we deserialise? At the receiving end, the existentially quantified type of the environment is unknown, so we do not know which deserialiser to use.

From static values to closures §4
Solution: we fix both serialisation and deserialisation of the environment at closure construction time. The environment is stored as a ByteString, and the static function decodes it:

    data Closure a where
      MkClosure :: Static (ByteString → a) → ByteString → Closure a

Functions defined in Data.Binary to help with the conversion to/from ByteString:

    encode :: Binary a ⇒ a → ByteString
    decode :: Binary a ⇒ ByteString → a

sendFunc revisited §4
Recall the wrong approach:

    sendFunc :: SendPort (Int → Int) → Int → ProcessM ()
    sendFunc p x = sendChan p (λy → x + y + 1)

Now the right approach:

    sendFunc :: SendPort (Closure (Int → Int)) → Int → ProcessM ()
    sendFunc p x = sendChan p clo
      where clo = MkClosure (static sfun) (encode x)

    sfun :: ByteString → Int → Int
    sfun bs y = let x = decode bs in x + y + 1
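Since `Static` is not available as described here, the idea can be approximated by naming top-level functions with string labels and looking them up in a table that every node is assumed to share. This is a sketch of the technique, not the proposed GHC feature: `Label`, `staticTable`, `mkAddClosure`, `toWire`, `fromWire` and `runClosure` are hypothetical names, and the closure type is fixed to `Int → Int` for brevity. It relies on the `binary` and `bytestring` packages that ship with GHC.

```haskell
import Data.Binary (decode, encode)
import qualified Data.ByteString.Lazy as BL

-- A "static" function is modelled by a label naming a top-level function.
type Label = String

-- Closure: the label of a static function plus its already-encoded environment.
data Closure = Closure Label BL.ByteString

-- Top-level function that decodes its own environment, like sfun on the slide.
sfun :: BL.ByteString -> Int -> Int
sfun bs y = let x = decode bs :: Int in x + y + 1

-- Table standing in for "all nodes run the same code" (an assumption).
staticTable :: [(Label, BL.ByteString -> Int -> Int)]
staticTable = [("sfun", sfun)]

-- Sender side: capture the free variable x by encoding it.
mkAddClosure :: Int -> Closure
mkAddClosure x = Closure "sfun" (encode x)

-- Both fields are plain data, so the closure as a whole can go on the wire.
toWire :: Closure -> BL.ByteString
toWire (Closure label env) = encode (label, env)

fromWire :: BL.ByteString -> Closure
fromWire bytes = let (label, env) = decode bytes in Closure label env

-- Receiver side: resolve the label and hand the environment to the function.
-- Nothing means the label is unknown on this node.
runClosure :: Closure -> Maybe (Int -> Int)
runClosure (Closure label env) = fmap ($ env) (lookup label staticTable)
```

The receiver never needs to know the environment's type: it only moves bytes around, and `sfun` alone knows that they decode to an `Int`. For instance, `fmap ($ 10) (runClosure (fromWire (toWire (mkAddClosure 5))))` rebuilds the function with x = 5 and applies it to 10.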

5. Wrapping up

Performance §5
[Figure 4. The run-time of the k-means algorithm, implemented under Cloud Haskell and Hadoop. The input data was one million 100-dimensional data points. Plot title: Performance of k-means on EC2 cluster; axes: Time (s) against Number of mappers.]

Conclusion and future work §5
- Cloud Haskell provides a good starting point for building distributed applications
- Already performs quite well
  - Current bottleneck is acquiring and loading data
  - Authors suggest improving file handling will improve performance
- Still in early stages: Static etc. still need to be implemented in GHC
  - Currently it requires workarounds with Template Haskell
- Authors propose an additional higher-level framework, revolving around tasks rather than processes
  - Task: an idempotent restartable block of code that produces a well-defined result
