The Art of Incremental Stream Processing

Purely functional, elegant, correct, incremental and composable stream processing that is CPU- and memory-efficient: this is our (worthy) goal, but where do we start?

This problem space is being explored extensively across a variety of languages and libraries, each with subtly different trade-offs and not-so-subtly different APIs and terminology. However, these libraries share common goals, and most trace their ancestry to Oleg Kiselyov's original Iteratee work or its free-monad-based derivatives.

This talk aims to build an intuition for stream processing in general, first by developing its core concepts and vocabulary, and then by grounding them in a careful examination of the trade-offs and internals of several production implementations. Of particular interest are the pipes and conduit libraries from the Haskell community, and scalaz-stream from the Scala community.

Mark Hibberd

May 08, 2014

Transcript

  1. The Art of Incremental Stream Processing
    @markhibberd

  2. I have a problem

  4. ~150k Emails

  5. ~4GB Emails

  6. 100s of new email / day

  7. ~5 I want to Read

  8. ~2 I want to Reply To

  9. Messages delivered where
    and when I need them

  10. Ability to locate important
    messages from days past

  11. From oleg-at-okmij.org Thu Sep 18 23:51:59 2008
    To: [email protected]
    Subject: Lazy vs correct IO [Was: A round of golf]
    Message-ID: <[email protected]>
    Date: Thu, 18 Sep 2008 23:51:59 -0700 (PDT)
    Status: OR

    Lennart Augustsson wrote

    > main = do
    >   name:_ <- getArgs
    >   file <- readFile name
    >   print $ length $ lines file

    Given the stance against top-level mutable variables, I have not
    expected to see this Lazy IO code. After all, what could be more against
    the spirit of Haskell than a `pure' function with observable side
    effects. With Lazy IO, one indeed has to choose between correctness
    and performance. The appearance of such code is especially strange
    after the evidence of deadlocks with Lazy IO, presented on this list
    less than a month ago. Let alone unpredictable resource usage and
    reliance on finalizers to close files (forgetting that GHC does not
    guarantee that finalizers will be run at all).

    Is there an alternative?

  13. Zeitgeist

  14.
    find "$MAILDIR" -type f | \
      xargs -n 1 stat -f "%m|%N" | \
      sort -n | \
      cut -d '|' -f 2 | \
      xargs grep -l "$QUERY" | \
      head -5 | \
      xargs less

  15.
    import Data.Function (on)
    import Data.List (isInfixOf, sortBy)

    data Email =
      Email { date :: Int, content :: String }
      deriving (Show, Eq)

    -- First five emails, in date order, whose content mentions the term.
    search :: String -> [Email] -> [Email]
    search term =
        take 5
      . filter (isInfixOf term . content)
      . sortBy (compare `on` date)

  16. Reality Calling

  17. --mmap
    “Use mmap(2) instead of read(2)
    to read input, which can result
    in better performance under
    some circumstances but can
    cause undefined behaviour.”
    — $(man grep)

  18.
    struct file *
    grep_open(const char *path)
    {
        struct file *f;

        f = grep_malloc(sizeof *f);
        memset(f, 0, sizeof *f);
        if (path == NULL) {
            /* Processing stdin implies --line-buffered. */
            lbflag = true;
            f->fd = STDIN_FILENO;
        } else if ((f->fd = open(path, O_RDONLY)) == -1)
            goto error1;

        if (filebehave == FILE_MMAP) {
            struct stat st;

            if ((fstat(f->fd, &st) == -1) || (st.st_size > OFF_MAX) ||
                (!S_ISREG(st.st_mode)))
                filebehave = FILE_STDIO;
            else {
                int flags = MAP_PRIVATE | MAP_NOCORE | MAP_NOSYNC;
    #ifdef MAP_PREFAULT_READ
                flags |= MAP_PREFAULT_READ;
    #endif
                fsiz = st.st_size;
                buffer = mmap(NULL, fsiz, PROT_READ, flags,
                    f->fd, (off_t)0);
                if (buffer == MAP_FAILED)
                    filebehave = FILE_STDIO;
                else {
                    bufrem = st.st_size;
                    bufpos = buffer;
                    madvise(buffer, st.st_size, MADV_SEQUENTIAL);
                }
            }
        }

        if ((buffer == NULL) || (buffer == MAP_FAILED))
            buffer = grep_malloc(MAXBUFSIZ);

        if (filebehave == FILE_GZIP &&
            (gzbufdesc = gzdopen(f->fd, "r")) == NULL)
            goto error2;

    #ifndef WITHOUT_BZIP2
        if (filebehave == FILE_BZIP &&
            (bzbufdesc = BZ2_bzdopen(f->fd, "r")) == NULL)
            goto error2;
    #endif

        /* Fill read buffer, also catches errors early */
        if (bufrem == 0 && grep_refill(f) != 0)
            goto error2;

        /* Check for binary stuff, if necessary */
        if (binbehave != BINFILE_TEXT && memchr(bufpos, '\0', bufrem) != NULL)
            f->binary = true;

        return (f);

    error2:
        close(f->fd);
    error1:
        free(f);
        return (NULL);
    }

  19. “With Lazy IO, one indeed has
    to choose between correctness
    and performance.”
    — Oleg Kiselyov

  20.
    type Maildir =
      FilePath

    data Email =
      Email { date :: Int, content :: String }
      deriving (Show, Eq)

    search :: String -> Maildir -> IO [Email]
    search term =
      {- oh noes! It’s so horrible
         I can’t even show it -}

  21. Is there an alternative?

  22. Intuition 1: A Language

  23. I Need To Produce Values

  24.
    type In i

  25. I Need To Consume Values

  26.
    type In i

    type Out o

  27. I Need To Transform Values

  28.
    type In i

    type Out o

    data Pipeline i o

  29. I May Have Effects

  30.
    type In i m

    type Out o m

    data Pipeline i o m

  31. I May Compute A Value

  32.
    type In i m a

    type Out o m a

    data Pipeline i o m a

  33. A (Simple) Interface

  34.
    type In i m a = Pipeline () i m a

    type Out o m a = Pipeline o Void m a

    data Pipeline i o m a

  35.
    -- In produces values (it awaits only unit);
    -- Out consumes values (it can never yield).
    type In i m a = Pipeline () i m a

    type Out o m a = Pipeline o Void m a

    data Pipeline i o m a
      = Done a
      | Yield o (Pipeline i o m a)
      | Await (i -> Pipeline i o m a)
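
The `Pipeline` type above becomes usable with do-notation once it has a Monad instance. The following is a minimal sketch of one way to write it, not the talk's implementation. It assumes an extra `Effect` constructor (not on the slide) so that the `m` parameter has somewhere to live, and the `feed` driver and `addTwo` stage are hypothetical names introduced only for illustration.

```haskell
import Control.Monad (ap, liftM)
import Data.Functor.Identity (Identity (..))

-- The slide's type plus one extra constructor (an assumption of this
-- sketch) so that a stage can run an action in the base monad m.
data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a))

instance Monad m => Functor (Pipeline i o m) where
  fmap = liftM

instance Monad m => Applicative (Pipeline i o m) where
  pure = Done
  (<*>) = ap

-- Bind grafts the continuation onto every Done leaf.
instance Monad m => Monad (Pipeline i o m) where
  Done a >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g >>= f = Await (\i -> g i >>= f)
  Effect m >>= f = Effect (fmap (>>= f) m)

yield :: o -> Pipeline i o m ()
yield o = Yield o (Done ())

await :: Pipeline i o m i
await = Await Done

-- Hypothetical test driver: feed a list of inputs, collect the outputs,
-- and return Nothing if the input runs out before the stage finishes.
feed :: Monad m => [i] -> Pipeline i o m a -> m ([o], Maybe a)
feed _ (Done a) = return ([], Just a)
feed is (Yield o k) = do
  (os, r) <- feed is k
  return (o : os, r)
feed [] (Await _) = return ([], Nothing)
feed (i : is) (Await g) = feed is (g i)
feed is (Effect m) = m >>= feed is

-- Await two inputs, yield their sum, finish with a label.
addTwo :: Monad m => Pipeline Int Int m String
addTwo = do
  x <- await
  y <- await
  yield (x + y)
  return "done"

main :: IO ()
main = print (runIdentity (feed [1, 2, 3] addTwo))
```

Running `main` feeds three inputs to a stage that only needs two; the third is simply left unread, which is the incremental behaviour the type is meant to capture.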

  36. Intuition 2: Pipelines

  37-44. [Diagram, built up over eight slides: Poll -> Prepare -> Classify -> Deliver,
    with Mail, Features and Scores flowing between successive stages.]

  45.
    Poll     :: In Mail
    Prepare  :: Pipeline Mail Features
    Classify :: Pipeline Features Scores
    Deliver  :: Out Scores

  46. Getting Real

  47. Pipes
    data Proxy i' i o' o m a

    type Producer o m a = Proxy X () () o m a
    type Consumer i m a = Proxy () i () X m a
    type Pipe i o m a = Proxy () i () o m a

    type In i m a

    type Out o m a

    type Pipeline i o m a

  48. Pipes: Explicit Input and Output
    at Each Component

  49. Pipes: Effects On Producers,
    Consumers And Pipes

  50. Pipes: Can Terminate With A
    Value Anywhere In Pipeline

  51. Conduit
    data Pipe l i o u m r

    newtype ConduitM i o m r =
      ConduitM { unConduitM :: Pipe i i o () m r }

    type Source m o = ConduitM () o m ()
    type Sink i m a = ConduitM i Void m a
    type Conduit i m o = ConduitM i o m ()

    type In i m a

    type Out o m a

    type Pipeline i o m a

  52. Conduit: Explicit Input and Output
    at Each Component

  53. Conduit: Effects On Sources,
    Sinks And Conduits

  54. Conduit: Can Only Terminate With A
    Value On a Sink

  55. Scalaz Streams
    sealed abstract class Process[F[_],O]

    type Process0[O] = Process[Env[_,_]#Is, O]
    type Process1[I, O] = Process[Env[I,_]#Is, O]
    type Sink[F[_], O] = Process[F, O => F[Unit]]

    type In i m a

    type Out o m a

    type Pipeline i o m a

  56. Scalaz Streams
    data Process m o

    type Process0 o = forall a. Process (Is a) o
    type Process1 i o = Process (Is i) o
    type Sink m o = Process m (o -> m ())

  57. Scalaz Streams: Model Request And Production
    Rather Than Input and Output

  58. Scalaz Streams: Effects Are Returned As
    Values, Transducers are Pure

  59. Scalaz Streams: Computation of Values
    Modelled Externally
    runFoldMap :: (Monad m, Monoid b) =>
      Process m o -> (o -> m b) -> m b

  60. Pipes
    Poll     :: Producer Mail
    Prepare  :: Pipe Mail Features
    Classify :: Pipe Features Score
    Deliver  :: Consumer Score

  61. Conduit
    Poll     :: Source Mail
    Prepare  :: Conduit Mail Features
    Classify :: Conduit Features Score
    Deliver  :: Sink Score

  62. Scalaz Stream
    Poll     :: Process m Mail
    Prepare  :: Process1 Mail Features
    Classify :: Process1 Features Score
    Deliver  :: Sink m Score

  63. Horizontal Composition

  64-65. [Diagram with the labels Prepare, Classify, Deliver, Poll and "Mail Delivery".]

  66. [Diagram with the labels Throttle, Read, Scan and Poll.]

  67.
    Poll     :: In Event
    Prepare  :: Pipeline Mail Features
    Classify :: Pipeline Features Score
    Deliver  :: Out Event

  68.
    (>|) :: Pipeline i o m a -> Pipeline o o' m a -> Pipeline i o' m a

  69.
    Poll >| Prepare >| Classify >| Deliver :: Pipeline () Void

  70.
    eval :: Pipeline () Void m a -> m a
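
The signatures of `>|` and `eval` above leave their behaviour open. What follows is one possible pull-based reading, offered as a sketch under stated assumptions rather than the talk's definition: the downstream stage drives, and the upstream stage runs only when downstream awaits. The `Effect` constructor and the demo stages `from`, `mapP` and `sumN` are hypothetical additions.

```haskell
import Data.Functor.Identity (Identity (..))
import Data.Void (Void, absurd)

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a)) -- assumption: not on the slides

infixl 7 >|

-- Pull-based composition: downstream drives, and upstream only runs
-- when downstream awaits. Whichever side finishes first ends the whole.
(>|) :: Functor m => Pipeline i x m a -> Pipeline x o m a -> Pipeline i o m a
_ >| Done a = Done a
up >| Yield o k = Yield o (up >| k)
up >| Effect m = Effect (fmap (up >|) m)
up >| Await g = case up of
  Done a -> Done a
  Yield x k -> k >| g x
  Await h -> Await (\i -> h i >| Await g)
  Effect m -> Effect (fmap (>| Await g) m)

-- A closed pipeline awaits only () and can never yield (Void).
eval :: Monad m => Pipeline () Void m a -> m a
eval (Done a) = return a
eval (Yield v _) = absurd v
eval (Await g) = eval (g ())
eval (Effect m) = m >>= eval

-- Hypothetical stages for the demo.
from :: Int -> Pipeline () Int m a
from n = Yield n (from (n + 1))

mapP :: (x -> y) -> Pipeline x y m a
mapP f = Await (\x -> Yield (f x) (mapP f))

sumN :: Int -> Pipeline Int Void m Int
sumN n = go n 0
  where
    go 0 acc = Done acc
    go k acc = Await (\x -> go (k - 1) (acc + x))

main :: IO ()
main = print (runIdentity (eval (from 1 >| mapP (* 2) >| sumN 3)))
```

Because the sink asks for only three values, the infinite source is never run further than that: demand flows right to left and values flow left to right.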

  71. Pipes
    Poll     :: Producer Event
    Prepare  :: Pipe Mail Features
    Classify :: Pipe Features Score
    Deliver  :: Consumer Score

    Poll >-> Prepare >-> Classify >-> Deliver :: Effect

  72. Pipes'
    runEffect :: Effect m a -> m a

  73. Conduit
    Poll     :: Source Event
    Prepare  :: Conduit Mail Features
    Classify :: Conduit Features Score
    Deliver  :: Sink Score

    Poll =$= Prepare =$= Classify =$= Deliver :: Source ()

  74. Conduit'
    Poll $= Prepare =$= Classify =$ Deliver :: Source ()

  75. Conduit''
    Poll $= Prepare =$= Classify $$ Deliver :: m ()

  76. Scalaz Stream
    Poll     :: Process m Event
    Prepare  :: Process1 Mail Features
    Classify :: Process1 Features Score
    Deliver  :: Sink m Score

    Poll |> Prepare |> Classify to Deliver :: Process m ()

  77. Scalaz Stream'
    run :: Process m a -> m ()

  78. Is Composition About
    Combinators or Laws?

  79. [Diagram: Poll shown on its own and composed with id.]
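
One reading of the `id` diagram is the pair of identity laws: composing a stage with an identity stage on either side should not change what it does. The sketch below spot-checks that on a single input for a toy model. It is an illustration, not a proof, and it assumes the same hypothetical `Effect` constructor and pull-based `>|` as a stand-in for whatever definition the talk had in mind; `cat`, `pairs` and `outputs` are illustrative names.

```haskell
import Data.Functor.Identity (Identity (..))

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a)) -- assumption: not on the slides

infixl 7 >|

-- Assumed pull-based composition: downstream drives.
(>|) :: Functor m => Pipeline i x m a -> Pipeline x o m a -> Pipeline i o m a
_ >| Done a = Done a
up >| Yield o k = Yield o (up >| k)
up >| Effect m = Effect (fmap (up >|) m)
up >| Await g = case up of
  Done a -> Done a
  Yield x k -> k >| g x
  Await h -> Await (\i -> h i >| Await g)
  Effect m -> Effect (fmap (>| Await g) m)

-- The identity stage: forward every input unchanged.
cat :: Pipeline i i m a
cat = Await (\i -> Yield i cat)

-- Group inputs two at a time.
pairs :: Pipeline i (i, i) m a
pairs = Await (\x -> Await (\y -> Yield (x, y) pairs))

-- Hypothetical observer: the outputs produced for a finite input list.
outputs :: Pipeline i o Identity a -> [i] -> [o]
outputs (Done _) _ = []
outputs (Yield o k) is = o : outputs k is
outputs (Await _) [] = []
outputs (Await g) (i : is) = outputs (g i) is
outputs (Effect m) is = outputs (runIdentity m) is

main :: IO ()
main = do
  let input = [1 .. 6 :: Int]
  print (outputs pairs input)
  print (outputs (cat >| pairs) input == outputs pairs input) -- left identity
  print (outputs (pairs >| cat) input == outputs pairs input) -- right identity
```

Checks like these are cheap to write as properties, which is one practical argument for treating the laws, and not only the combinators, as part of the interface.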

  80. Vertical Composition

  81. [Diagram with the labels Prepare, Classify, Deliver, Poll and "Mail Delivery".]

  82-87. [Diagram, built up over six slides, with the labels Throttle, Read and Scan;
    Events and then Mail are shown flowing through it.]

  88-99. [Animation over twelve slides with the labels Throttle, Read, Work, Home,
    Work Throttle and Home Throttle, and numeric counters that change from frame to frame.]

  100. Pipes
    [Diagram with the labels Throttle, Read, Work, Home and Work Throttle.]
    Stage types: Producer Event, Pipe Event Event, Pipe Event Mail
    Combinators: >->, >>=, forever

  101. Conduit
    [Diagram with the labels Throttle, Read, Work, Home and Work Throttle.]
    Stage types: Source Event, Conduit Event Event, Conduit Event Mail
    Combinators: $=, =$=, >>=, forever

  102. Scalaz Stream
    [Diagram with the labels Throttle, Read, Work, Home and Work Throttle.]
    Stage types: Process m Event, Process1 Event Event, Process1 Event Mail
    Combinators: |>, fby, repeat
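
The slides above pair the horizontal operators with `>>=` and `forever`, which sequence stages in time within a single position of the pipeline. A minimal sketch of that idea in the toy `Pipeline` model follows. The stage `passTwoSkipOne` is a made-up example and is not the talk's Throttle; the `Effect` constructor and the `outputs` observer are likewise assumptions of this sketch.

```haskell
import Control.Monad (ap, forever, liftM, replicateM_)
import Data.Functor.Identity (Identity (..))

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a)) -- assumption: not on the slides

instance Monad m => Functor (Pipeline i o m) where
  fmap = liftM

instance Monad m => Applicative (Pipeline i o m) where
  pure = Done
  (<*>) = ap

instance Monad m => Monad (Pipeline i o m) where
  Done a >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g >>= f = Await (\i -> g i >>= f)
  Effect m >>= f = Effect (fmap (>>= f) m)

yield :: o -> Pipeline i o m ()
yield o = Yield o (Done ())

await :: Pipeline i o m i
await = Await Done

-- Pass n inputs through unchanged, then finish.
passN :: Monad m => Int -> Pipeline i i m ()
passN n = replicateM_ n (await >>= yield)

-- Discard n inputs, then finish.
skipN :: Monad m => Int -> Pipeline i i m ()
skipN n = replicateM_ n await

-- Vertical composition: run one small stage to completion, then the
-- next, and repeat the pair forever.
passTwoSkipOne :: Monad m => Pipeline i i m a
passTwoSkipOne = forever (passN 2 >> skipN 1)

-- Hypothetical observer: the outputs produced for a finite input list.
outputs :: Pipeline i o Identity a -> [i] -> [o]
outputs (Done _) _ = []
outputs (Yield o k) is = o : outputs k is
outputs (Await _) [] = []
outputs (Await g) (i : is) = outputs (g i) is
outputs (Effect m) is = outputs (runIdentity m) is

main :: IO ()
main = print (outputs passTwoSkipOne [1 .. 7 :: Int])
```

The composite still has the type of a single stage, so it can be placed anywhere a stage is expected in a horizontal pipeline.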

  103. Intuition 3: Parsers

  104.
    type In i m a = Pipeline () i m a

    type Out o m a = Pipeline o Void m a

    data Pipeline i o m a
      = Done a
      | Yield o (Pipeline i o m a)
      | Await (i -> Pipeline i o m a)

  105.
    type In i m a = Pipeline () i m a

    type Out o m a = Pipeline o Void m a

    data Pipeline i o m a
      = Done a
      | Yield o (Pipeline i o m a)
      | Await (i -> Pipeline i o m a)

    -- Emit one value, then finish.
    yield :: o -> Pipeline i o m ()
    yield o = Yield o (Done ())

    -- Wait for one value and return it.
    await :: Pipeline i o m i
    await = Await Done

  106.
    one :: Pipeline i i m ()
    one = do
      i <- await
      yield i

    cat :: Pipeline i i m ()
    cat = forever one

    pairs :: Pipeline i (i, i) m ()
    pairs = forever $ do
      i1 <- await
      i2 <- await
      yield (i1, i2)

  107.
    counter :: Monad m => Pipeline i (Int, i) m ()
    counter = flip evalStateT 0 . forever $ do
      i <- lift await
      n <- get
      lift . yield $ (n, i)
      put (n + 1) -- advance the count; without this every value is tagged 0

    filter :: (i -> Bool) -> Pipeline i i m a
    filter f = forever $ do
      i <- await
      when (f i) $ yield i
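
To see what stages like these produce, they can be run against a plain list. The sketch below does that in a self-contained toy model. Several things here are assumptions and not the talk's code: the `Effect` constructor, the `outputs` observer, the name `keep` (used in place of `filter` to avoid clashing with the Prelude), and a `counter` written with explicit recursion in place of `StateT` so that the example needs nothing beyond `base`.

```haskell
import Control.Monad (ap, forever, liftM, when)
import Data.Functor.Identity (Identity (..))

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a)) -- assumption: not on the slides

instance Monad m => Functor (Pipeline i o m) where
  fmap = liftM

instance Monad m => Applicative (Pipeline i o m) where
  pure = Done
  (<*>) = ap

instance Monad m => Monad (Pipeline i o m) where
  Done a >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g >>= f = Await (\i -> g i >>= f)
  Effect m >>= f = Effect (fmap (>>= f) m)

yield :: o -> Pipeline i o m ()
yield o = Yield o (Done ())

await :: Pipeline i o m i
await = Await Done

-- Group inputs two at a time; a trailing odd input is never emitted.
pairs :: Monad m => Pipeline i (i, i) m a
pairs = forever $ do
  i1 <- await
  i2 <- await
  yield (i1, i2)

-- Tag each input with its position, threading the count by recursion.
counter :: Monad m => Pipeline i (Int, i) m a
counter = go 0
  where
    go n = do
      i <- await
      yield (n, i)
      go (n + 1)

-- Forward only the inputs that satisfy the predicate.
keep :: Monad m => (i -> Bool) -> Pipeline i i m a
keep f = forever $ do
  i <- await
  when (f i) (yield i)

-- Hypothetical observer: the outputs produced for a finite input list.
outputs :: Pipeline i o Identity a -> [i] -> [o]
outputs (Done _) _ = []
outputs (Yield o k) is = o : outputs k is
outputs (Await _) [] = []
outputs (Await g) (i : is) = outputs (g i) is
outputs (Effect m) is = outputs (runIdentity m) is

main :: IO ()
main = do
  print (outputs pairs [1 .. 5 :: Int])
  print (outputs counter "abc")
  print (outputs (keep even) [1 .. 6 :: Int])
```

Each stage reads like a small parser over its input stream: it awaits as many values as it needs, emits a result, and loops.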

  108. Pipes
    yield :: o -> Pipe i o m ()

    await :: Pipe i o m i

    Conduit
    yield :: o -> ConduitM i o m ()

    await :: ConduitM i o m (Maybe i)

    awaitForever :: (i -> ConduitM i o m r)
                 -> ConduitM i o m ()

    Scalaz Stream
    emit :: o -> Process f o

    await1 :: Process1 i i
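
The `Maybe i` in the Conduit `await` signature above is what lets a stage observe the end of its input and still return a result. A minimal sketch of that design choice in a toy model follows. It is not Conduit's implementation: the type below drops the base monad for brevity, and `feedAll` and `sumAll` are hypothetical names.

```haskell
import Control.Monad (ap, liftM)

-- Simplification for this sketch: no base monad m, and Await passes
-- Nothing once the input is exhausted (mirroring the Maybe in the
-- Conduit signature on the slide).
data Pipeline i o a
  = Done a
  | Yield o (Pipeline i o a)
  | Await (Maybe i -> Pipeline i o a)

instance Functor (Pipeline i o) where
  fmap = liftM

instance Applicative (Pipeline i o) where
  pure = Done
  (<*>) = ap

instance Monad (Pipeline i o) where
  Done a >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g >>= f = Await (\mi -> g mi >>= f)

await :: Pipeline i o (Maybe i)
await = Await Done

-- Hypothetical driver: feeds the list, then Nothing from there on.
-- A stage that keeps awaiting after Nothing would make this loop.
feedAll :: [i] -> Pipeline i o a -> ([o], a)
feedAll _ (Done a) = ([], a)
feedAll is (Yield o k) = let (os, a) = feedAll is k in (o : os, a)
feedAll [] (Await g) = feedAll [] (g Nothing)
feedAll (i : is) (Await g) = feedAll is (g (Just i))

-- A sink that sees the end of input and can therefore return a total.
sumAll :: Pipeline Int o Int
sumAll = go 0
  where
    go acc = await >>= maybe (return acc) (\x -> go (acc + x))

main :: IO ()
main = print (snd (feedAll [1, 2, 3, 4] sumAll))
```

With an `await` that returns a bare `i`, as in the Pipes signature, a stage cannot see the end of its input this way, so the result has to come from whichever stage terminates first.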

  109. Subtlety Fights Back

  110. Internal vs External
    Management of Resources

  111. Layered Streams

  112. Constant Memory
    Streaming

  113. How much does elegance
    cost?

  114. to be continued...
    @markhibberd