STC: Towards Haskell in the Cloud

norm2782
December 18, 2011

My STC presentation at UU about the paper "Towards Haskell in the Cloud" by Epstein, Black and Peyton-Jones, 2011

Transcript

  1. [Faculty of Science
    Information and Computing Sciences]
    Towards Haskell in the Cloud
    Epstein, Black and Peyton-Jones, 2011
    Jurriën Stutterheim
    15 December, 2011

  2. 1. Introduction

  3. Cloud Haskell §1
    EDSL for cloud computing
    Computing on processors with separate memories,
    connected via a network
    Strongly based on Erlang’s message-passing model
    Concurrent processes have no access to each other’s data
    Data needs to be communicated explicitly
    A concurrent process can fail without affecting others

  4. Contributions §1
    A low-level framework for distributed computing in Haskell
    Part library, part extended GHC
    A way to serialise functions and send them across a network
    Improves over the Erlang solution
    Type safety and a strong separation between pure and
    effectful code
    Computing with pure functions eliminates the need for
    transactions; computations can be resumed elsewhere in
    case of failure

  5. Example scenario §1
    Process a set of numbers 1..n
    k slave nodes, each node consuming n/k numbers
    Processing function is transmitted over the network from
    master to slaves, together with their respective input
    Master merges answers from slaves and spits out a final
    answer
    How does the master orchestrate the slaves? How do we
    serialise and send a function over a network?
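
The scenario above can be sketched on a single machine. This is only an illustration with local threads from base, not Cloud Haskell: the names chunks, worker and master are hypothetical, and the processing function is passed as an ordinary Haskell value rather than serialised over a network.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)

-- Split a list into at most k chunks of nearly equal size (assumes k >= 1).
chunks :: Int -> [a] -> [[a]]
chunks k xs = go xs
  where
    size = max 1 ((length xs + k - 1) `div` k)
    go [] = []
    go ys = let (h, t) = splitAt size ys in h : go t

-- A "slave" here is a local thread that applies f to its share of the
-- input and writes the answer into a result box.
worker :: ([Int] -> Int) -> [Int] -> MVar Int -> IO ()
worker f input result = putMVar result $! f input

-- The "master" hands out the chunks, waits for every answer and merges them.
master :: Int -> ([Int] -> Int) -> ([Int] -> Int) -> [Int] -> IO Int
master k f merge input = do
  results <- mapM start (chunks k input)
  merge <$> mapM takeMVar results
  where
    start chunk = do
      box <- newEmptyMVar
      _ <- forkIO (worker f chunk box)
      return box

main :: IO ()
main = do
  total <- master 4 sum sum [1 .. 100]
  print total -- prints 5050
```

The two questions on the slide are exactly what this sketch avoids: the threads share one address space, so nothing has to be serialised.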

  6. 2. The basics

  7. Basic elements §2
    Processes
    Basic unit of concurrency
    Can send and receive messages
    Lightweight: low creation and scheduling overhead
    Only message-passing between processes, shared memory
    only inside a single process
    Messages
    Asynchronous and buffered
    All code related to messaging in ProcessM monad
    Two styles of message passing
    Untyped messages (like Erlang)
    Typed channels

  8. Untyped messaging §2
    Basic functions [1] for untyped messaging:
    send :: Serializable a ⇒ ProcessId → a → ProcessM ()
    expect :: Serializable a ⇒ ProcessM a
    Messages of any serialisable type may be sent and received.
    Channels (discussed later) allow for restricting the type of the
    messages.
    [1] The authors call these primitives, which is not entirely true

  9. Serialisation §2
    All types for which there are Binary and Typeable instances are
    serialisable for Cloud Haskell:
    class (Binary a, Typeable a) ⇒ Serializable a
    Binary: binary serialisation of Haskell values to and from lazy
    ByteStrings
    Typeable: associates a type representation with each type, giving
    runtime type information
    No such instances for types like MVar
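
As a rough sketch of how the class above behaves, the following uses only the binary package and base. The catch-all instance, the Job type and roundTrip are assumptions made for illustration; the slide does not say how instances of Serializable are provided.

```haskell
{-# LANGUAGE FlexibleInstances, UndecidableInstances #-}
import Data.Binary (Binary (..), decode, encode)
import Data.Typeable (Typeable, typeOf)

-- The class from the slide, plus an assumed catch-all instance so that
-- every type with Binary and Typeable instances is Serializable.
class (Binary a, Typeable a) => Serializable a
instance (Binary a, Typeable a) => Serializable a

-- Hypothetical message type. Modern GHC derives Typeable automatically.
data Job = Job Int String deriving (Eq, Show)

-- Hand-written Binary instance: write the fields in order, read them back.
instance Binary Job where
  put (Job n s) = put n >> put s
  get = Job <$> get <*> get

-- Encode to a lazy ByteString and decode again.
roundTrip :: Serializable a => a -> a
roundTrip = decode . encode

main :: IO ()
main = do
  let job = Job 42 "resize"
  print (roundTrip job == job) -- prints True
  print (typeOf job)           -- prints Job
```

A type such as MVar has no Binary instance, so it cannot satisfy the Serializable constraint and is rejected at compile time.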

  10. Untyped messaging example §2
    data Ping = Ping ProcessId deriving Typeable
    data Pong = Pong ProcessId deriving Typeable
    -- omitted: Binary instances
    ping :: ProcessM ()
    ping = forever $ do
      Pong partner ← expect
      self ← getSelfPid
      send partner (Ping self)
    expect blocks until a message of the required type (here Pong)
    arrives and then returns it

  11. Matching messages of multiple types §2
    expect lets us receive messages of one particular type.
    receiveWait allows us to match against multiple types:
    receiveWait :: [MatchM q ()] → ProcessM q
    Matching happens in the MatchM monad, which contains a
    computation to execute when a match is successful. q is the
    result type of the matched computation.
    expect is implemented in terms of receiveWait.

  12. Creating MatchM computations §2
    We use match and matchIf to construct the MatchM q ()
    computations:
    match :: Serializable a ⇒ (a → ProcessM q)
    → MatchM q ()
    matchIf accepts a predicate that must hold for the matching to
    succeed:
    matchIf :: Serializable a ⇒ (a → Bool)
    → (a → ProcessM q) → MatchM q ()

  13. Matching example §2
    data Add = Add ProcessId Double Double
    data Div = Div ProcessId Double Double
    data DivByZero = DivByZero
    math :: ProcessM ()
    math = forever $ receiveWait
      [ match (λ(Add pid num1 num2) →
          send pid (num1 + num2))
      , matchIf (λ(Div _ _ num2) → num2 ≢ 0)
          (λ(Div pid num1 num2) →
            send pid (num1 / num2))
      , match (λ(Div pid _ _) →
          send pid DivByZero)
      ]
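
The matching behaviour can be imitated without any Cloud Haskell machinery. The sketch below is a pure model, not the library's implementation: the mailbox is a list of Dynamic values, Match and receive are hypothetical names, and handlers return a String instead of sending a reply to a ProcessId.

```haskell
import Data.Dynamic (Dynamic, fromDynamic, toDyn)
import Data.Maybe (listToMaybe, mapMaybe)
import Data.Typeable (Typeable)

-- A matcher inspects one untyped message and may produce a result.
type Match r = Dynamic -> Maybe r

-- Accept any message of the handler's argument type.
match :: Typeable a => (a -> r) -> Match r
match handler msg = handler <$> fromDynamic msg

-- Accept a message of the right type only if the predicate holds.
matchIf :: Typeable a => (a -> Bool) -> (a -> r) -> Match r
matchIf ok handler msg = case fromDynamic msg of
  Just x | ok x -> Just (handler x)
  _ -> Nothing

-- Take the first message accepted by any matcher (matchers are tried in
-- order); messages that nothing accepts stay in the mailbox.
receive :: [Match r] -> [Dynamic] -> Maybe (r, [Dynamic])
receive ms = go []
  where
    go _ [] = Nothing
    go skipped (m : rest) =
      case listToMaybe (mapMaybe ($ m) ms) of
        Just r -> Just (r, reverse skipped ++ rest)
        Nothing -> go (m : skipped) rest

data Add = Add Double Double
data Div = Div Double Double

math :: [Match String]
math =
  [ match (\(Add a b) -> show (a + b))
  , matchIf (\(Div _ b) -> b /= 0) (\(Div a b) -> show (a / b))
  , match (\(Div _ _) -> "DivByZero")
  ]

main :: IO ()
main = do
  let mailbox = [toDyn "noise", toDyn (Div 1 0), toDyn (Add 2 3)]
  case receive math mailbox of
    Just (reply, rest) -> putStrLn reply >> print (length rest)
    Nothing -> putStrLn "no match"
```

Here the String message is skipped, the Div with a zero divisor falls through the guarded matcher to the last one, and the Add message remains queued for a later receive.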

  14. Matching without blocking §2
    receiveWait waits indefinitely for a message, but we also want
    to be able to stop waiting for messages to arrive after n seconds:
    receiveTimeout :: Int → [MatchM q ()]
    → ProcessM (Maybe q)
    Example:
    rcvTO = do
      send pid (Query stuff)
      ret ← receiveTimeout 50000
        [match (λ(Response answer) → return answer)]
      case ret of
        Nothing → showError "Timeout!"
        Just ans → showAnswer ans
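
The same "give up after a while" pattern exists for ordinary threads in base. The following sketch uses System.Timeout.timeout, whose argument is in microseconds, with an MVar standing in for the mailbox; askWithTimeout and the simulated responder are hypothetical and unrelated to receiveTimeout's actual implementation.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import System.Timeout (timeout)

-- Wait at most `micros` microseconds for a reply from a simulated
-- responder that needs `delay` microseconds to answer.
askWithTimeout :: Int -> Int -> IO (Maybe String)
askWithTimeout micros delay = do
  reply <- newEmptyMVar
  _ <- forkIO (threadDelay delay >> putMVar reply "answer")
  timeout micros (takeMVar reply)

main :: IO ()
main = do
  fast <- askWithTimeout 500000 1000    -- responder is quick: expect a Just
  slow <- askWithTimeout 50000 2000000  -- responder is too slow: expect Nothing
  print fast
  print slow
```

As with receiveTimeout, the Maybe in the result forces the caller to handle the timeout case explicitly.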

  15. Starting and locating processes §2
    How do we start and locate a process?
    A node is Cloud Haskell’s unit of location
    NodeId is a unique node identifier that contains an IP
    address
    We use spawn to start a new process
    data NodeId = NodeId HostName PortId
    spawn :: NodeId → Closure (ProcessM ())
    → ProcessM ProcessId
    We will talk about Closure later

  16. Fault tolerance §2
    Fault tolerance in Cloud Haskell is (again) based on ideas
    from Erlang
    Do not try to recover, but terminate: another process will
    take over
    Cloud Haskell/Erlang philosophy: trying to recover will not
    lead to robustness
    Made easier by lack of shared state between processes and
    pure functions

  17. Fault handling between two processes §2
    A process can monitor another process for termination
    Notification of termination can be received in two ways
    A regular Haskell exception
    A Cloud Haskell message
    Unidirectional monitoring
    One process monitors another
    Choose notification style
    Bidirectional monitoring
    One node is informed of the other’s abnormal termination
    Notification by exception
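
To make the "notification as a message" style concrete, here is a small analogy with plain GHC threads. It is not Cloud Haskell's monitoring API: monitor is a hypothetical helper built on forkFinally from base, and the "message" is simply a value placed in an MVar when the worker thread ends.

```haskell
import Control.Concurrent (forkFinally)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, evaluate)

-- Run a worker in its own thread and report how it ended: Right with its
-- result on normal termination, Left with the exception otherwise.
monitor :: IO a -> IO (Either SomeException a)
monitor worker = do
  box <- newEmptyMVar
  _ <- forkFinally worker (putMVar box)
  takeMVar box

main :: IO ()
main = do
  ok <- monitor (evaluate (10 `div` 2 :: Int))
  bad <- monitor (evaluate (10 `div` 0 :: Int))
  putStrLn (either (const "worker died") show ok)  -- prints 5
  putStrLn (either (const "worker died") show bad) -- prints worker died
```

The monitoring side does not try to repair the failed worker; it only learns of the termination and can decide to start a replacement, in the spirit of the previous slide.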

  18. 3. Type safe messaging: channels

  19. Messages through channels §3
    send allowed us to send any message to any node
    Typed channels give a static guarantee that messages are
    only sent to nodes that can deal with them
    Consist of a send port and a receive port
    Only send ports can be serialised for transport
    To avoid losing messages during transport, receive ports
    cannot be serialised
    Messages are extracted from the receive port in FIFO order
    Send port only accepts messages of a specific type

  20. Main channel functions §3
    newChan :: Serializable a
    ⇒ ProcessM (SendPort a, ReceivePort a)
    sendChan :: Serializable a ⇒ SendPort a → a
    → ProcessM ()
    receiveChan :: Serializable a ⇒ ReceivePort a
    → ProcessM a
    There are also channel-based counterparts of receiveWait

  21. Channel example §3
    ping2 :: SendPort Ping → ReceivePort Pong
          → ProcessM ()
    ping2 pingout pongin = forever $ do
      Pong partner ← receiveChan pongin
      sendChan partner (Ping pongout)
      -- pongout is not bound on the slide; presumably it is the
      -- send port paired with pongin
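
The split into a send port and a receive port can be imitated locally with newtypes around Control.Concurrent.Chan. This is a sketch under assumptions, not the Cloud Haskell implementation: newTypedChan, ponger and the shapes of Ping and Pong are chosen for illustration, and nothing is serialised.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)

-- Two views on one underlying queue: one end can only send,
-- the other can only receive.
newtype SendPort a = SendPort (Chan a)
newtype ReceivePort a = ReceivePort (Chan a)

newTypedChan :: IO (SendPort a, ReceivePort a)
newTypedChan = do
  c <- newChan
  return (SendPort c, ReceivePort c)

sendChan :: SendPort a -> a -> IO ()
sendChan (SendPort c) = writeChan c

receiveChan :: ReceivePort a -> IO a
receiveChan (ReceivePort c) = readChan c

-- A Ping carries the send port on which the reply is expected.
newtype Ping = Ping (SendPort Pong)
data Pong = Pong

-- Answer every Ping with a Pong on the port it carries.
ponger :: ReceivePort Ping -> IO ()
ponger pingIn = do
  Ping replyTo <- receiveChan pingIn
  sendChan replyTo Pong
  ponger pingIn

main :: IO ()
main = do
  (pingOut, pingIn) <- newTypedChan
  (pongOut, pongIn) <- newTypedChan
  _ <- forkIO (ponger pingIn)
  sendChan pingOut (Ping pongOut)
  Pong <- receiveChan pongIn
  putStrLn "got Pong"
```

Sending anything other than a Ping on pingOut is a type error, which is the static guarantee that untyped send cannot give.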

  22. 4. Serialising and transmitting functions

  23. Closures §4
    We want to be able to send functions over the network, so
    masters can execute them on slaves. A first (wrong) attempt:
    sendFunc :: SendPort (Int → Int) → Int → ProcessM ()
    sendFunc p x = sendChan p (λy → x + y + 1)
    Note that x is free. To serialize a function, one must also
    serialize its free variables.
    But how? The types of these free variables are unrelated to the
    type of the function.

  24. Problem with serialising functions §4
    To serialise a datatype, we normally deconstruct its structure.
    E.g.:
    instance Serializable a ⇒ Serializable [a]
    If we can serialise a, we can serialise a list of a, because we
    know how to deconstruct a list.
    How can we specify that free variables in a function need to be
    serialisable? The instance below cannot be expressed:
    instance Serializable (types of the free variables of an
    a → b) ⇒ Serializable (a → b) where

  25. Prior solutions §4
    Make serialisation a built-in language feature (also used in
    Erlang)
    No control over serialisation process
    Cannot prevent certain types from being serialised (e.g.
    MVars)
    Makes cost of serialisation and transmitting implicit
    Half built-in: Java RMI’s Serializable and
    Externalizable interfaces
    Depends on runtime instantiation of a deserialiser for a
    given type

  26. Cloud Haskell solution §4
    Observation: functions without free variables can be
    serialised without performing additional work
    These are trivially a closure since they do not need an
    environment
    Closure: a function together with an environment
    containing the function’s free variables
    Provided nodes run the same code, these closures can be
    serialised as a single linker label
    Cloud Haskell introduces a new built-in type constructor for
    identifying readily serialisable types: Static τ
    And two new terms: static e and unstatic e

  27. The Static type §4
    Static will be a new built-in type constructor in GHC
    Also built-in: instance Serializable (Static a)
    All values of type Static a can be serialised
    Even without a Serializable constraint on a
    First evaluates a Static, then serialises the code label of the
    evaluation result
    Works for data types as well
    E.g. Static Tree

  28. static and unstatic terms §4
    Will also be built into GHC
    static and unstatic introduce or eliminate the Static
    constructor
    static may only be applied to top-level functions
    unstatic has no such restrictions
    If e : τ, then static e : Static τ
    If e : Static τ, then unstatic e : τ

  29. Ingredients §4
    Now what do we need to serialise a function with free variables?
    It needs to be top-level, since only top-level functions can
    be static and static functions are the ones we can serialise
    A serialised environment containing the function’s free
    variables, so we can make a closure
    Some way to reconstruct the serialised function using the
    serialised environment

  30. From static values to closures §4
    We model a closure as a pair containing an environment and a
    static function that, when given the environment, gives us our
    closure
    A first attempt (using a GADT):
    data Closure a where
    MkClosure :: Serializable env ⇒ Static (env → a)
    → env → Closure a
    We require that the environment is serialisable as well. Note
    that it is existentially quantified.
    Although this seems reasonable, it will give us problems on the
    receiving end. Why?

  31. Serialising a closure §4
    We need to be able to serialise a closure for transmission over
    the network.
    Recall that we need an instance of Binary for Closure for this
    to work:
    instance Binary (Closure a) where
      put (MkClosure f env) = put f >> put env
      get = ⊥
    How do we deserialise? At the receiving end, we do not know
    which deserialiser to use.

  32. From static values to closures §4
    Solution: we do both serialisation and deserialisation of the
    environment at closure construction time:
    data Closure a where
    MkClosure :: Static (ByteString → a) → ByteString
    → Closure a
    Functions defined in Data.Binary to help with the conversion
    to/from ByteString:
    encode :: Binary a ⇒ a → ByteString
    decode :: Binary a ⇒ ByteString → a

  33. sendFunc revisited §4
    Recall the wrong approach:
    sendFunc :: SendPort (Int → Int) → Int → ProcessM ()
    sendFunc p x = sendChan p (λy → x + y + 1)
    Now the right approach:
    sendFunc :: SendPort (Closure (Int → Int))
    → Int → ProcessM ()
    sendFunc p x = sendChan p clo
      where clo = MkClosure (static sfun) (encode x)
    sfun :: ByteString → Int → Int
    sfun bs y = let x = decode bs
                in x + y + 1
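
Since Static and static are not available as described here, the idea can only be approximated. The sketch below replaces Static with a string label looked up in a table that both ends are assumed to share, which plays the role of the linker label; StaticLabel, staticTable, mkClosure and unClosure are hypothetical names, and the closure type is fixed to Int → Int to keep the example short.

```haskell
import Data.Binary (Binary (..), decode, encode)
import Data.ByteString.Lazy (ByteString)

-- Stand-in for Static (ByteString -> a): a label both nodes can resolve.
newtype StaticLabel = StaticLabel String

-- A closure is a static function label plus an already-serialised environment.
data Closure = Closure StaticLabel ByteString

-- Both components are plain data, so put and get are straightforward.
instance Binary Closure where
  put (Closure (StaticLabel l) env) = put l >> put env
  get = do
    l <- get
    env <- get
    return (Closure (StaticLabel l) env)

-- Top-level function that decodes its own environment, as sfun does on the slide.
sfun :: ByteString -> Int -> Int
sfun bs y = let x = decode bs in x + y + 1

-- Hypothetical table playing the role of the linker's symbol table.
staticTable :: [(String, ByteString -> Int -> Int)]
staticTable = [("sfun", sfun)]

-- Sender side: capture x by encoding it into the environment.
mkClosure :: Int -> Closure
mkClosure x = Closure (StaticLabel "sfun") (encode x)

-- Receiver side: resolve the label and apply the function to the environment.
unClosure :: Closure -> Maybe (Int -> Int)
unClosure (Closure (StaticLabel l) env) = ($ env) <$> lookup l staticTable

main :: IO ()
main = do
  let wire = encode (mkClosure 10) -- the bytes that would cross the network
  case unClosure (decode wire) of
    Just f -> print (f 5) -- prints 16
    Nothing -> putStrLn "unknown label"
```

The receiver never needs to know the type of the environment: it only sees a ByteString, and the function named by the label knows how to decode it.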

  34. 5. Wrapping up

  35. Performance §5
    [Figure 4 from the paper: Performance of k-means on EC2 cluster.
    Time (s) plotted against number of mappers (1 to 80) for
    Cloud Haskell and Hadoop.]
    Figure 4. The run-time of the k-means algorithm, implemented
    under Cloud Haskell and Hadoop. The input data was one million
    100-dimensional data points.

  36. Conclusion and future work §5
    Cloud Haskell provides a good starting point for building
    distributed applications
    Already performs quite well
    Current bottleneck is acquiring and loading data
    Authors suggest improving file handling will improve
    performance
    Still in early stages: Static etc. still need to be
    implemented in GHC
    Currently it requires workarounds with Template Haskell
    Authors propose an additional higher-level framework,
    revolving around tasks rather than processes
    Task: an idempotent restartable block of code that
    produces a well-defined result
