The Art of Incremental Stream Processing

Purely functional, elegant, correct, incremental and composable stream processing that is CPU and memory efficient. This is our (worthy) goal, but where do we start?

This problem space is being explored extensively across a variety of languages and libraries, each with subtly different trade-offs and not-so-subtly different APIs and terminology. However, these libraries share common goals, and most share a common ancestry in Oleg Kiselyov's original Iteratee work or its free-monad-based derivatives.

This talk aims to build an intuition for stream processing in general, first by establishing its core concepts and language, and then by grounding them in a careful examination of the trade-offs and internals of several productionised implementations. Of particular interest are the pipes and conduit libraries from the Haskell community, and scalaz-stream from the Scala community.

Mark Hibberd

May 08, 2014

Transcript

  1. The Art of Incremental Stream Processing @markhibberd

  2. I have a problem

  3. None
  4. ~150k Emails

  5. ~4GB Emails

  6. 100s of new email / day

  7. ~5 I want to Read

  8. ~2 I want to Reply To

  9. Messages delivered where and when I need them

  10. Ability to locate important messages from days past

  11. From oleg-at-okmij.org Thu Sep 18 23:51:59 2008
    To: haskell-cafe@haskell.org
    Subject: Lazy vs correct IO [Was: A round of golf]
    Message-ID: <20080919065159.616A5AF09@Adric.metnet.fnmoc.navy.mil>
    Date: Thu, 18 Sep 2008 23:51:59 -0700 (PDT)
    Status: OR

    Lennart Augustsson wrote

    > main = do
    >   name:_ <- getArgs
    >   file <- readFile name
    >   print $ length $ lines file

    Given the stance against top-level mutable variables, I have not expected to see this Lazy IO code. After all, what could be more against the spirit of Haskell than a `pure' function with observable side effects. With Lazy IO, one indeed has to choose between correctness and performance. The appearance of such code is especially strange after the evidence of deadlocks with Lazy IO, presented on this list less than a month ago. Let alone unpredictable resource usage and reliance on finalizers to close files (forgetting that GHC does not guarantee that finalizers will be run at all).

    Is there an alternative?
  12. From oleg-at-okmij.org Thu Sep 18 23:51:59 2008
    To: haskell-cafe@haskell.org
    Subject: Lazy vs correct IO [Was: A round of golf]
    Message-ID: <20080919065159.616A5AF09@Adric.metnet.fnmoc.navy.mil>
    Date: Thu, 18 Sep 2008 23:51:59 -0700 (PDT)
    Status: OR

    Lennart Augustsson wrote

    > main = do
    >   name:_ <- getArgs
    >   file <- readFile name
    >   print $ length $ lines file

    Given the stance against top-level mutable variables, I have not expected to see this Lazy IO code. After all, what could be more against the spirit of Haskell than a `pure' function with observable side effects. With Lazy IO, one indeed has to choose between correctness and performance. The appearance of such code is especially strange after the evidence of deadlocks with Lazy IO, presented on this list less than a month ago. Let alone unpredictable resource usage and reliance on finalizers to close files (forgetting that GHC does not guarantee that finalizers will be run at all).

    Is there an alternative?
  13. Zeitgeist

  14.

        find "$MAILDIR" -type f | \
          xargs -n 1 stat -f "%m|%N" | \
          sort -n | \
          cut -d '|' -f 2 | \
          xargs grep -l "$QUERY" | \
          head -5 | \
          xargs less
  15.

        data Email =
          Email { date :: Int, content :: String }
          deriving (Show, Eq)

        search :: String -> [Email] -> [Email]
        search term =
          take 5
            . filter (isInfixOf term . content)
            . sortBy (compare `on` date)
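The pure search on the slide can be tried as written once its imports are added. The sketch below does that; `sampleInbox` is hypothetical data invented for illustration. Note that `sortBy` has to see the whole list before it returns anything, so this version assumes every message already fits in memory.

```haskell
import Data.Function (on)
import Data.List (isInfixOf, sortBy)

data Email =
  Email { date :: Int, content :: String }
  deriving (Show, Eq)

-- | Sort oldest-first, keep the messages that mention the term,
-- and return at most five of them.
search :: String -> [Email] -> [Email]
search term =
  take 5
    . filter (isInfixOf term . content)
    . sortBy (compare `on` date)

-- Hypothetical sample data, not taken from the talk.
sampleInbox :: [Email]
sampleInbox =
  [ Email 3 "Lazy vs correct IO"
  , Email 1 "A round of golf"
  , Email 2 "Re: Lazy IO deadlocks"
  ]

-- Dates of the matching messages, oldest first.
matchingDates :: [Int]
matchingDates = map date (search "Lazy" sampleInbox)
```

In GHCi, `matchingDates` evaluates to `[2,3]`.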
  16. Reality Calling

  17. “Use mmap(2) instead of read(2) to read input, which can result in better performance under some circumstances but can cause undefined behaviour.” (--mmap, $(man grep))
  18.

        struct file *
        grep_open(const char *path)
        {
            struct file *f;

            f = grep_malloc(sizeof *f);
            memset(f, 0, sizeof *f);
            if (path == NULL) {
                /* Processing stdin implies --line-buffered. */
                lbflag = true;
                f->fd = STDIN_FILENO;
            } else if ((f->fd = open(path, O_RDONLY)) == -1)
                goto error1;

            if (filebehave == FILE_MMAP) {
                struct stat st;

                if ((fstat(f->fd, &st) == -1) || (st.st_size > OFF_MAX) ||
                    (!S_ISREG(st.st_mode)))
                    filebehave = FILE_STDIO;
                else {
                    int flags = MAP_PRIVATE | MAP_NOCORE | MAP_NOSYNC;
        #ifdef MAP_PREFAULT_READ
                    flags |= MAP_PREFAULT_READ;
        #endif
                    fsiz = st.st_size;
                    buffer = mmap(NULL, fsiz, PROT_READ, flags,
                        f->fd, (off_t)0);
                    if (buffer == MAP_FAILED)
                        filebehave = FILE_STDIO;
                    else {
                        bufrem = st.st_size;
                        bufpos = buffer;
                        madvise(buffer, st.st_size, MADV_SEQUENTIAL);
                    }
                }
            }

            if ((buffer == NULL) || (buffer == MAP_FAILED))
                buffer = grep_malloc(MAXBUFSIZ);

            if (filebehave == FILE_GZIP &&
                (gzbufdesc = gzdopen(f->fd, "r")) == NULL)
                goto error2;

        #ifndef WITHOUT_BZIP2
            if (filebehave == FILE_BZIP &&
                (bzbufdesc = BZ2_bzdopen(f->fd, "r")) == NULL)
                goto error2;
        #endif

            /* Fill read buffer, also catches errors early */
            if (bufrem == 0 && grep_refill(f) != 0)
                goto error2;

            /* Check for binary stuff, if necessary */
            if (binbehave != BINFILE_TEXT &&
                memchr(bufpos, '\0', bufrem) != NULL)
                f->binary = true;

            return (f);

        error2:
            close(f->fd);
        error1:
            free(f);
            return (NULL);
        }
  19. “With Lazy IO, one indeed has to choose between correctness

    and performance.” — Oleg Kiselyov
  20.

        type Maildir =
          FilePath

        data Email =
          Email { date :: Int, content :: String }
          deriving (Show, Eq)

        search :: String -> Maildir -> IO [Email]
        search term =
          {- oh noes! It’s so horrible
             I can’t even show it -}
  21. Is there an alternative?

  22. Intuition 1: A Language

  23. I Need To Produce Values

  24.

        type In i

  25. I Need To Consume Values

  26.

        type In i

        type Out o
  27. I Need To Transform Values

  28.

        type In i

        type Out o

        data Pipeline i o
  29. I May Have Effects

  30.

        type In i m

        type Out o m

        data Pipeline i o m
  31. I May Compute A Value

  32.

        type In i m a

        type Out o m a

        data Pipeline i o m a
  33. A (Simple) Interface

  34.

        type In i m a = Pipeline i () m a

        type Out o m a = Pipeline Void o m a

        data Pipeline i o m a
  35.

        type In i m a = Pipeline i () m a

        type Out o m a = Pipeline Void o m a

        data Pipeline i o m a
          = Done a
          | Yield o (Pipeline i o m a)
          | Await (i -> Pipeline i o m a)
  36. Intuition 2: Pipelines

  37. Prepare Classify Deliver Poll

  38. Prepare Classify Deliver Poll

  39. Prepare Classify Deliver Poll Mail

  40. Prepare Classify Deliver Poll Mail

  41. Prepare Classify Deliver Poll Mail Features

  42. Prepare Classify Deliver Poll Features

  43. Prepare Classify Deliver Poll Features Scores

  44. Prepare Classify Deliver Poll Scores

  45.

        Poll     :: In Mail
        Prepare  :: Pipeline Mail Features
        Classify :: Pipeline Features Scores
        Deliver  :: Out Scores
  46. Getting Real

  47. Pipes

        data Proxy i' i o' o m a

        type Producer o m a = Proxy X () () o m a
        type Consumer i m a = Proxy () i () X m a
        type Pipe i o m a = Proxy () i () o m a

        type In i m a

        type Out o m a

        type Pipeline i o m a
  48. Pipes: Explicit Input and Output at Each Component

        data Proxy i' i o' o m a

        type Producer o m a = Proxy X () () o m a
        type Consumer i m a = Proxy () i () X m a
        type Pipe i o m a = Proxy () i () o m a
  49. Pipes: Effects On Producers, Consumers And Pipes

        data Proxy i' i o' o m a

        type Producer o m a = Proxy X () () o m a
        type Consumer i m a = Proxy () i () X m a
        type Pipe i o m a = Proxy () i () o m a
  50. Pipes: Can Terminate With A Value Anywhere In Pipeline

        data Proxy i' i o' o m a

        type Producer o m a = Proxy X () () o m a
        type Consumer i m a = Proxy () i () X m a
        type Pipe i o m a = Proxy () i () o m a
  51. Conduit

        data Pipe l i o u m r

        newtype ConduitM i o m r =
          ConduitM { unConduitM :: Pipe i i o () m r }

        type Source m o = ConduitM () o m ()
        type Sink i m a = ConduitM i Void m a
        type Conduit i m o = ConduitM i o m ()

        type In i m a

        type Out o m a

        type Pipeline i o m a
  52. Conduit: Explicit Input and Output at Each Component

        data Pipe l i o u m r

        newtype ConduitM i o m r =
          ConduitM { unConduitM :: Pipe i i o () m r }

        type Source m o = ConduitM () o m ()
        type Sink i m a = ConduitM i Void m a
        type Conduit i m o = ConduitM i o m ()
  53. Conduit: Effects On Sources, Sinks And Conduits

        data Pipe l i o u m r

        newtype ConduitM i o m r =
          ConduitM { unConduitM :: Pipe i i o () m r }

        type Source m o = ConduitM () o m ()
        type Sink i m a = ConduitM i Void m a
        type Conduit i m o = ConduitM i o m ()
  54. Conduit: Can Only Terminate With A Value On a Sink

        data Pipe l i o u m r

        newtype ConduitM i o m r =
          ConduitM { unConduitM :: Pipe i i o () m r }

        type Source m o = ConduitM () o m ()
        type Sink i m a = ConduitM i Void m a
        type Conduit i m o = ConduitM i o m ()
  55. Scalaz Streams

        sealed abstract class Process[F[_],O]

        type Process0[O] = Process[Env[_,_]#Is, O]
        type Process1[I, O] = Process[Env[I,_]#Is, O]
        type Sink[F[_], O] = Process[F, O => F[Unit]]

        type In i m a

        type Out o m a

        type Pipeline i o m a
  56. Scalaz Streams

        data Process m o

        type Process0 o = forall a. Process (Is a) o
        type Process1 i o = Process (Is i) o
        type Sink m o = Process m (o -> m ())

        type In i m a

        type Out o m a

        type Pipeline i o m a
  57. Scalaz Streams: Model Request And Production Rather Than Input and Output

        data Process m o

        type Process0 o = forall a. Process (Is a) o
        type Process1 i o = Process (Is i) o
        type Sink m o = Process m (o -> m ())
  58. Scalaz Streams: Effects Are Returned As Values, Transducers are Pure

        data Process m o

        type Process0 o = forall a. Process (Is a) o
        type Process1 i o = Process (Is i) o
        type Sink m o = Process m (o -> m ())
  59. Scalaz Streams: Computation of Values Modelled Externally

        data Process m o

        type Process0 o = forall a. Process (Is a) o
        type Process1 i o = Process (Is i) o
        type Sink m o = Process m (o -> m ())

        runFoldMap :: (Monad m, Monoid b) =>
          Process m o -> (o -> m b) -> m b
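To make "computation of values modelled externally" concrete, here is a minimal sketch of a driver with the slide's `runFoldMap` signature. The three-constructor `Process` below is a deliberately tiny stand-in invented for illustration (the names `Halt`, `Emit`, `Request` and `emitAll` are assumptions); it is not how scalaz-stream defines its type.

```haskell
import Data.Functor.Identity (Identity (..))
import Data.Monoid (Sum (..))

-- | Hypothetical, simplified process: stop, emit a value and continue,
-- or run an effect in 'm' to obtain the rest of the process.
data Process m o
  = Halt
  | Emit o (Process m o)
  | Request (m (Process m o))

-- | The process itself never returns a value; the driver folds its outputs.
runFoldMap :: (Monad m, Monoid b) => Process m o -> (o -> m b) -> m b
runFoldMap p f = go mempty p
  where
    go acc Halt = pure acc
    go acc (Emit o next) = do
      b <- f o
      let acc' = acc `mappend` b
      acc' `seq` go acc' next
    go acc (Request step) = step >>= go acc

-- | Emit every element of a list, then halt.
emitAll :: [o] -> Process m o
emitAll = foldr Emit Halt

-- Summing three emitted values in the Identity monad.
total :: Int
total = getSum (runIdentity (runFoldMap (emitAll [1, 2, 3]) (pure . Sum)))
```

Here `total` is `6`. The same process could be folded with a different monoid without changing the process.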
  60. Pipes

        Poll     :: Producer Mail
        Prepare  :: Pipe Mail Features
        Classify :: Pipe Features Score
        Deliver  :: Consumer Score
  61. Conduit

        Poll     :: Source Mail
        Prepare  :: Conduit Mail Features
        Classify :: Conduit Features Score
        Deliver  :: Sink Score
  62. Scalaz Stream

        Poll     :: Process m Mail
        Prepare  :: Process1 Mail Features
        Classify :: Process1 Features Score
        Deliver  :: Sink m Score
  63. Horizontal Composition

  64. Prepare Classify Deliver Poll Mail Delivery

  65. Prepare Classify Deliver Poll Mail Delivery

  66. Throttle Read Scan Poll

  67.

        Poll     :: In Event
        Prepare  :: Pipeline Mail Features
        Classify :: Pipeline Features Score
        Deliver  :: Out Event
  68.

        Poll     :: In Event
        Prepare  :: Pipeline Mail Features
        Classify :: Pipeline Features Score
        Deliver  :: Out Event

        (>|) :: Pipeline i o m a -> Pipeline o o' m a -> Pipeline i o' m a
  69.

        Poll     :: In Event
        Prepare  :: Pipeline Mail Features
        Classify :: Pipeline Features Score
        Deliver  :: Out Event

        Poll >| Prepare >| Classify >| Deliver :: Pipeline () Void

        (>|) :: Pipeline i o m a -> Pipeline o o' m a -> Pipeline i o' m a
  70.

        Poll     :: In Event
        Prepare  :: Pipeline Mail Features
        Classify :: Pipeline Features Score
        Deliver  :: Out Event

        Poll >| Prepare >| Classify >| Deliver

        eval :: Pipeline () Void m a -> m a

        (>|) :: Pipeline i o m a -> Pipeline o o' m a -> Pipeline i o' m a
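One way the `(>|)` and `eval` signatures above could be realised for the simple `Pipeline` type is sketched below. Two assumptions are not on the slides: an `Effect` constructor is added so that `m` can actually carry effects, and composition is downstream-driven (the right-hand stage runs until it awaits, then the left-hand stage runs until it yields). The stages `countFrom`, `mapP` and `sumFirst` are hypothetical examples.

```haskell
import Data.Functor.Identity (Identity (..))
import Data.Void (Void, absurd)

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a)) -- assumption: not part of the slide's type

infixl 1 >|

-- | Horizontal composition: outputs of the left stage feed the right stage.
(>|) :: Functor m => Pipeline i x m a -> Pipeline x o m a -> Pipeline i o m a
_  >| Done a    = Done a
up >| Yield o k = Yield o (up >| k)
up >| Effect mk = Effect (fmap (up >|) mk)
up >| Await g   = case up of
  Done a    -> Done a
  Yield x k -> k >| g x
  Await h   -> Await (\i -> h i >| Await g)
  Effect mk -> Effect (fmap (>| Await g) mk)

-- | Run a closed pipeline: it only ever asks for '()' and cannot yield.
eval :: Monad m => Pipeline () Void m a -> m a
eval (Done a)    = pure a
eval (Yield v _) = absurd v
eval (Await g)   = eval (g ())
eval (Effect mk) = mk >>= eval

-- Hypothetical stages for illustration.
countFrom :: Int -> Pipeline i Int m a
countFrom n = Yield n (countFrom (n + 1))

mapP :: (x -> y) -> Pipeline x y m a
mapP f = Await (\x -> Yield (f x) (mapP f))

-- | Sum the first n inputs, then finish with the total.
sumFirst :: Int -> Pipeline Int o m Int
sumFirst = go 0
  where
    go acc 0 = Done acc
    go acc n = Await (\x -> go (acc + x) (n - 1))

-- 2 + 4 + 6, pulled lazily from an infinite source.
example :: Int
example = runIdentity (eval (countFrom 1 >| mapP (* 2) >| sumFirst 3))
```

`example` evaluates to `12`. Because the sink drives the pipeline, the infinite source is only run as far as the sink demands.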
  71. Pipes

        Poll     :: Producer Event
        Prepare  :: Pipe Mail Features
        Classify :: Pipe Features Score
        Deliver  :: Consumer Score

        Poll >-> Prepare >-> Classify >-> Deliver :: Effect
  72. Pipes'

        Poll     :: Producer Event
        Prepare  :: Pipe Mail Features
        Classify :: Pipe Features Score
        Deliver  :: Consumer Score

        Poll >-> Prepare >-> Classify >-> Deliver

        runEffect :: Effect m a -> m a
  73. Conduit

        Poll     :: Source Event
        Prepare  :: Conduit Mail Features
        Classify :: Conduit Features Score
        Deliver  :: Sink Score

        Poll =$= Prepare =$= Classify =$= Deliver :: Source ()
  74. Conduit'

        Poll     :: Source Event
        Prepare  :: Conduit Mail Features
        Classify :: Conduit Features Score
        Deliver  :: Sink Score

        Poll $= Prepare =$= Classify =$ Deliver :: Source ()
  75. Conduit''

        Poll     :: Source Event
        Prepare  :: Conduit Mail Features
        Classify :: Conduit Features Score
        Deliver  :: Sink Score

        Poll $= Prepare =$= Classify $$ Deliver :: m ()
  76. Scalaz Stream

        Poll     :: Process m Event
        Prepare  :: Process1 Mail Features
        Classify :: Process1 Features Score
        Deliver  :: Sink m Score

        Poll |> Prepare |> Classify to Deliver :: Process m ()
  77. Scalaz Stream'

        Poll     :: Process m Event
        Prepare  :: Process1 Mail Features
        Classify :: Process1 Features Score
        Deliver  :: Sink m Score

        Poll |> Prepare |> Classify to Deliver

        run :: Process m a -> m ()
  78. Is Composition About Combinators or Laws?

  79. id Poll id id Poll Poll

  80. Vertical Composition

  81. Prepare Classify Deliver Poll Mail Delivery

  82. Throttle Read Scan Poll

  83. Throttle Read Scan Events

  84. Throttle Read Scan Events

  85. Throttle Read Scan Events Events

  86. Throttle Read Scan Events

  87. Throttle Read Scan Events Mail

  88. Throttle Read Work Home Work Throttle 2 3 5

  89. Throttle Read Work Home Work Throttle 1 5 3

  90. Throttle Read Work Home Work Throttle 1 5 2

  91. Throttle Read Work Home Work Throttle 1 5 2

  92. Throttle Read Work Home Work Throttle 0 5 2

  93. Throttle Read Work Home Work Throttle 0 5 1

  94. Throttle Read Work Home Work Throttle 0 5 1 0

  95. Throttle Read Home Work Home Throttle Work 1 5

  96. Throttle Read Home Work Home Throttle Work 1 4

  97. Throttle Read Home Work Home Throttle Work 0 4

  98. Throttle Read Home Work Home Throttle Work 0 4

  99. Throttle Read Work Home Home Throttle Work 3 4 Work

    Throttle
  100. Pipes

        Labels:    Throttle, Read, Work, Home, Work, Throttle
        Types:     Producer Event, Pipe Event Event, Pipe Event Mail
        Operators: >->, >->, >>=, forever
  101. Conduit

        Labels:    Throttle, Read, Work, Home, Work, Throttle
        Types:     Source Event, Conduit Event Event, Conduit Event Mail
        Operators: $=, =$=, >>=, forever
  102. Scalaz Stream

        Labels:    Throttle, Read, Work, Home, Work, Throttle
        Types:     Process m Event, Process1 Event Event, Process1 Event Mail
        Operators: >|, >|, fby, repeat
  103. Intuition 3: Parsers

  104.

        type In i m a = Pipeline i () m a

        type Out o m a = Pipeline Void o m a

        data Pipeline i o m a
          = Done a
          | Yield o (Pipeline i o m a)
          | Await (i -> Pipeline i o m a)
  105.

        type In i m a = Pipeline i () m a

        type Out o m a = Pipeline Void o m a

        data Pipeline i o m a
          = Done a
          | Yield o (Pipeline i o m a)
          | Await (i -> Pipeline i o m a)

        yield :: o -> Pipeline i o m ()
        yield o = Yield o (Done ())

        await :: Pipeline i o m i
        await = Await Done
  106.

        one :: Pipeline i i m ()
        one = do
          i <- await
          yield i

        cat :: Pipeline i i m ()
        cat = forever one

        pairs :: Pipeline i (i, i) m ()
        pairs = forever $ do
          i1 <- await
          i2 <- await
          yield (i1, i2)
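The `do` blocks on these slides rely on a `Monad` instance for `Pipeline` that the slides do not show. The sketch below supplies one plausible instance so that `one`, `cat` and `pairs` can be run. Assumptions: an `Effect` constructor is added so that `m` has something to do, which in turn puts a `Monad m` constraint on the signatures, and `runList` is a hypothetical test driver that feeds inputs from a list.

```haskell
import Control.Monad (ap, forever, liftM)
import Data.Functor.Identity (Identity (..))

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a)) -- assumption: not part of the slide's type

instance Monad m => Functor (Pipeline i o m) where
  fmap = liftM

instance Monad m => Applicative (Pipeline i o m) where
  pure = Done
  (<*>) = ap
  m *> k = m >>= \_ -> k

-- | Bind grafts the continuation onto every 'Done' leaf.
instance Monad m => Monad (Pipeline i o m) where
  Done a    >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g   >>= f = Await (\i -> g i >>= f)
  Effect mk >>= f = Effect (fmap (>>= f) mk)

yield :: o -> Pipeline i o m ()
yield o = Yield o (Done ())

await :: Pipeline i o m i
await = Await Done

one :: Monad m => Pipeline i i m ()
one = do
  i <- await
  yield i

cat :: Monad m => Pipeline i i m ()
cat = forever one

pairs :: Monad m => Pipeline i (i, i) m ()
pairs = forever $ do
  i1 <- await
  i2 <- await
  yield (i1, i2)

-- | Hypothetical driver: feed inputs from a list and collect the outputs.
-- The result is 'Nothing' if the inputs ran out before the pipeline finished.
runList :: Monad m => [i] -> Pipeline i o m a -> m ([o], Maybe a)
runList _ (Done a) = pure ([], Just a)
runList is (Yield o k) = do
  (os, r) <- runList is k
  pure (o : os, r)
runList [] (Await _) = pure ([], Nothing)
runList (i : is) (Await g) = runList is (g i)
runList is (Effect mk) = mk >>= runList is

catOut :: [Int]
catOut = fst (runIdentity (runList [1, 2, 3] cat))

pairsOut :: [(Char, Char)]
pairsOut = fst (runIdentity (runList "abcde" pairs))
```

`catOut` is `[1,2,3]` and `pairsOut` is `[('a','b'),('c','d')]`; the unpaired `'e'` is consumed but never yielded.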
  107.

        counter :: Monad m => Pipeline i (Int, i) m ()
        counter = flip evalStateT 0 . forever $ do
          i <- lift await
          n <- get
          put (n + 1)
          lift . yield $ (n, i)

        filter :: (i -> Bool) -> Pipeline i i m a
        filter f = forever $ do
          i <- await
          when (f i) $ yield i
  108.

    Pipes

        yield :: o -> Pipe i o m ()

        await :: Pipe i o m i

    Conduit

        yield :: o -> ConduitM i o m ()

        await :: ConduitM i o m (Maybe i)

        awaitForever :: (i -> ConduitM i o m a)
                     -> ConduitM i o m ()

    Scalaz Stream

        emit :: o -> Process f o

        await1 :: Process1 i i
  109. Subtlety Fights Back

  110. Internal vs External Management of Resources

  111. Layered Streams

  112. Constant Memory Streaming

  113. How much does elegance cost?

  114. to be continued... @markhibberd