Slide 1

Slide 1 text

Incremental: The Art of Stream Processing (@markhibberd)

Slide 2

Slide 2 text

I have a problem

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

~150k Emails

Slide 5

Slide 5 text

~4GB Emails

Slide 6

Slide 6 text

100s of new email / day

Slide 7

Slide 7 text

~5 I want to Read

Slide 8

Slide 8 text

~2 I want to Reply To

Slide 9

Slide 9 text

Messages delivered where and when I need them

Slide 10

Slide 10 text

Ability to locate important messages from days past

Slide 11

Slide 11 text

From oleg-at-okmij.org Thu Sep 18 23:51:59 2008
To: [email protected]
Subject: Lazy vs correct IO [Was: A round of golf]
Message-ID: <[email protected]>
Date: Thu, 18 Sep 2008 23:51:59 -0700 (PDT)
Status: OR

Lennart Augustsson wrote

> main = do
>   name:_ <- getArgs
>   file <- readFile name
>   print $ length $ lines file

Given the stance against top-level mutable variables, I have not expected to see this Lazy IO code. After all, what could be more against the spirit of Haskell than a `pure' function with observable side effects. With Lazy IO, one indeed has to choose between correctness and performance. The appearance of such code is especially strange after the evidence of deadlocks with Lazy IO, presented on this list less than a month ago. Let alone unpredictable resource usage and reliance on finalizers to close files (forgetting that GHC does not guarantee that finalizers will be run at all).

Is there an alternative?

Slide 13

Slide 13 text

Zeitgeist

Slide 14

Slide 14 text

find "$MAILDIR" -type f | \
  xargs -n 1 stat -f "%m|%N" | \
  sort -n | \
  cut -d '|' -f 2 | \
  xargs grep -l "$QUERY" | \
  head -5 | \
  xargs less

Slide 15

Slide 15 text

import Data.Function (on)
import Data.List (isInfixOf, sortBy)

data Email =
  Email { date :: Int, content :: String }
  deriving (Show, Eq)

search :: String -> [Email] -> [Email]
search term =
    take 5
  . filter (isInfixOf term . content)
  . sortBy (compare `on` date)
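A small usage sketch of the pure `search` above. The `inbox` value is made-up sample data, not from the talk. Note that `sortBy` has to consume the whole list before it can return the first element, so this version needs every email in memory at once.

```haskell
import Data.Function (on)
import Data.List (isInfixOf, sortBy)

data Email =
  Email { date :: Int, content :: String }
  deriving (Show, Eq)

-- Same definition as on the slide: sort by date, keep matches, take five.
search :: String -> [Email] -> [Email]
search term =
    take 5
  . filter (isInfixOf term . content)
  . sortBy (compare `on` date)

-- Hypothetical sample data for illustration only.
inbox :: [Email]
inbox =
  [ Email 3 "lazy IO considered harmful"
  , Email 1 "lunch?"
  , Email 2 "more on lazy IO"
  ]

main :: IO ()
main = mapM_ print (search "lazy IO" inbox)
```

With this data the two matching emails come back in date order (2, then 3).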

Slide 16

Slide 16 text

Reality Calling

Slide 17

Slide 17 text

--mmap: “Use mmap(2) instead of read(2) to read input, which can result in better performance under some circumstances but can cause undefined behaviour.” (from $(man grep))

Slide 18

Slide 18 text

struct file *
grep_open(const char *path)
{
    struct file *f;

    f = grep_malloc(sizeof *f);
    memset(f, 0, sizeof *f);
    if (path == NULL) {
        /* Processing stdin implies --line-buffered. */
        lbflag = true;
        f->fd = STDIN_FILENO;
    } else if ((f->fd = open(path, O_RDONLY)) == -1)
        goto error1;

    if (filebehave == FILE_MMAP) {
        struct stat st;

        if ((fstat(f->fd, &st) == -1) || (st.st_size > OFF_MAX) ||
            (!S_ISREG(st.st_mode)))
            filebehave = FILE_STDIO;
        else {
            int flags = MAP_PRIVATE | MAP_NOCORE | MAP_NOSYNC;
#ifdef MAP_PREFAULT_READ
            flags |= MAP_PREFAULT_READ;
#endif
            fsiz = st.st_size;
            buffer = mmap(NULL, fsiz, PROT_READ, flags,
                f->fd, (off_t)0);
            if (buffer == MAP_FAILED)
                filebehave = FILE_STDIO;
            else {
                bufrem = st.st_size;
                bufpos = buffer;
                madvise(buffer, st.st_size, MADV_SEQUENTIAL);
            }
        }
    }

    if ((buffer == NULL) || (buffer == MAP_FAILED))
        buffer = grep_malloc(MAXBUFSIZ);

    if (filebehave == FILE_GZIP &&
        (gzbufdesc = gzdopen(f->fd, "r")) == NULL)
        goto error2;

#ifndef WITHOUT_BZIP2
    if (filebehave == FILE_BZIP &&
        (bzbufdesc = BZ2_bzdopen(f->fd, "r")) == NULL)
        goto error2;
#endif

    /* Fill read buffer, also catches errors early */
    if (bufrem == 0 && grep_refill(f) != 0)
        goto error2;

    /* Check for binary stuff, if necessary */
    if (binbehave != BINFILE_TEXT && memchr(bufpos, '\0', bufrem) != NULL)
        f->binary = true;

    return (f);

error2:
    close(f->fd);
error1:
    free(f);
    return (NULL);
}

Slide 19

Slide 19 text

“With Lazy IO, one indeed has to choose between correctness and performance.” — Oleg Kiselyov

Slide 20

Slide 20 text

type Maildir =
  FilePath

data Email =
  Email { date :: Int, content :: String }
  deriving (Show, Eq)

search :: String -> Maildir -> IO [Email]
search term =
  {- oh noes! It’s so horrible
     I can’t even show it -}

Slide 21

Slide 21 text

Is there an alternative?

Slide 22

Slide 22 text

Intuition 1: A Language

Slide 23

Slide 23 text

I Need To Produce Values

Slide 24

Slide 24 text

type In i

Slide 25

Slide 25 text

I Need To Consume Values

Slide 26

Slide 26 text

type In i

type Out o

Slide 27

Slide 27 text

I Need To Transform Values

Slide 28

Slide 28 text

type In i

type Out o

data Pipeline i o

Slide 29

Slide 29 text

I May Have Effects

Slide 30

Slide 30 text

type In i m

type Out o m

data Pipeline i o m

Slide 31

Slide 31 text

I May Compute A Value

Slide 32

Slide 32 text

type In i m a

type Out o m a

data Pipeline i o m a

Slide 33

Slide 33 text

A (Simple) Interface

Slide 34

Slide 34 text

type In i m a = Pipeline () i m a

type Out o m a = Pipeline o Void m a

data Pipeline i o m a

Slide 35

Slide 35 text

type In i m a = Pipeline () i m a

type Out o m a = Pipeline o Void m a

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
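To make the interface concrete, here is a minimal sketch of the `Pipeline` type with the instances needed for `do` notation. It rests on two assumptions that are not on the slide: `m` is threaded through every recursive position, and an extra `Effect` constructor is added so that the `m` parameter can actually run effects. The helpers `effect`, `fromListWith` and `collect` are hypothetical names used only for illustration.

```haskell
import Control.Monad (ap)
import Data.Functor.Identity (Identity (..))

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a))  -- assumption: not shown on the slide

instance Functor m => Functor (Pipeline i o m) where
  fmap f (Done a)    = Done (f a)
  fmap f (Yield o k) = Yield o (fmap f k)
  fmap f (Await g)   = Await (fmap f . g)
  fmap f (Effect m)  = Effect (fmap (fmap f) m)

instance Functor m => Applicative (Pipeline i o m) where
  pure  = Done
  (<*>) = ap

instance Functor m => Monad (Pipeline i o m) where
  Done a    >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g   >>= f = Await (\i -> g i >>= f)
  Effect m  >>= f = Effect (fmap (>>= f) m)

-- Lift a base-monad action into a single pipeline step.
effect :: Functor m => m a -> Pipeline i o m a
effect = Effect . fmap Done

-- Hypothetical source: run an action for each element, then yield it.
fromListWith :: Functor m => (o -> m ()) -> [o] -> Pipeline () o m ()
fromListWith act = mapM_ (\o -> effect (act o) >> Yield o (Done ()))

-- Drive a pipeline that needs no real input and collect what it yields.
collect :: Monad m => Pipeline () o m a -> m [o]
collect (Done _)    = return []
collect (Yield o k) = (o :) <$> collect k
collect (Await f)   = collect (f ())
collect (Effect m)  = m >>= collect

main :: IO ()
main = do
  xs <- collect (fromListWith (\c -> putStrLn ("yielding " ++ [c])) "abc")
  print xs
  print (runIdentity (collect (fromListWith (const (Identity ())) [1, 2, 3 :: Int])))
```

Keeping effects as a constructor means a pipeline stays a plain value until something like `collect` interprets it, which is one of several possible designs.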

Slide 36

Slide 36 text

Intuition 2: Pipelines

Slide 37

Slide 37 text

Prepare Classify Deliver Poll

Slide 38

Slide 38 text

Prepare Classify Deliver Poll

Slide 39

Slide 39 text

Prepare Classify Deliver Poll Mail

Slide 40

Slide 40 text

Prepare Classify Deliver Poll Mail

Slide 41

Slide 41 text

Prepare Classify Deliver Poll Mail Features

Slide 42

Slide 42 text

Prepare Classify Deliver Poll Features

Slide 43

Slide 43 text

Prepare Classify Deliver Poll Features Scores

Slide 44

Slide 44 text

Prepare Classify Deliver Poll Scores

Slide 45

Slide 45 text

Poll     :: In Mail
Prepare  :: Pipeline Mail Features
Classify :: Pipeline Features Scores
Deliver  :: Out Scores

Slide 46

Slide 46 text

Getting Real

Slide 47

Slide 47 text

Pipes

data Proxy i' i o' o m a

type Producer o m a = Proxy X () () o m a
type Consumer i m a = Proxy () i () X m a
type Pipe i o m a = Proxy () i () o m a

type In i m a

type Out o m a

type Pipeline i o m a

Slide 48

Slide 48 text

Pipes

data Proxy i' i o' o m a

type Producer o m a = Proxy X () () o m a
type Consumer i m a = Proxy () i () X m a
type Pipe i o m a = Proxy () i () o m a

Explicit Input and Output at Each Component

Slide 49

Slide 49 text

Pipes

data Proxy i' i o' o m a

type Producer o m a = Proxy X () () o m a
type Consumer i m a = Proxy () i () X m a
type Pipe i o m a = Proxy () i () o m a

Effects On Producers, Consumers And Pipes

Slide 50

Slide 50 text

Pipes

data Proxy i' i o' o m a

type Producer o m a = Proxy X () () o m a
type Consumer i m a = Proxy () i () X m a
type Pipe i o m a = Proxy () i () o m a

Can Terminate With A Value Anywhere In Pipeline

Slide 51

Slide 51 text

Conduit

data Pipe l i o u m r

newtype ConduitM i o m r =
  ConduitM { unConduitM :: Pipe i i o () m r }

type Source m o = ConduitM () o m ()
type Sink i m a = ConduitM i Void m a
type Conduit i m o = ConduitM i o m ()

type In i m a

type Out o m a

type Pipeline i o m a

Slide 52

Slide 52 text

Conduit

data Pipe l i o u m r

newtype ConduitM i o m r =
  ConduitM { unConduitM :: Pipe i i o () m r }

type Source m o = ConduitM () o m ()
type Sink i m a = ConduitM i Void m a
type Conduit i m o = ConduitM i o m ()

Explicit Input and Output at Each Component

Slide 53

Slide 53 text

Conduit

data Pipe l i o u m r

newtype ConduitM i o m r =
  ConduitM { unConduitM :: Pipe i i o () m r }

type Source m o = ConduitM () o m ()
type Sink i m a = ConduitM i Void m a
type Conduit i m o = ConduitM i o m ()

Effects On Sources, Sinks And Conduits

Slide 54

Slide 54 text

Conduit

data Pipe l i o u m r

newtype ConduitM i o m r =
  ConduitM { unConduitM :: Pipe i i o () m r }

type Source m o = ConduitM () o m ()
type Sink i m a = ConduitM i Void m a
type Conduit i m o = ConduitM i o m ()

Can Only Terminate With A Value On a Sink

Slide 55

Slide 55 text

Scalaz Streams

sealed abstract class Process[F[_],O]

type Process0[O] = Process[Env[_,_]#Is, O]
type Process1[I, O] = Process[Env[I,_]#Is, O]
type Sink[F[_], O] = Process[F, O => F[Unit]]

type In i m a

type Out o m a

type Pipeline i o m a

Slide 56

Slide 56 text

Scalaz Streams

data Process m o

type Process0 o = forall a. Process (Is a) o
type Process1 i o = Process (Is i) o
type Sink m o = Process m (o -> m ())

type In i m a

type Out o m a

type Pipeline i o m a

Slide 57

Slide 57 text

Scalaz Streams

data Process m o

type Process0 o = forall a. Process (Is a) o
type Process1 i o = Process (Is i) o
type Sink m o = Process m (o -> m ())

Model Request And Production Rather Than Input and Output

Slide 58

Slide 58 text

Scalaz Streams

data Process m o

type Process0 o = forall a. Process (Is a) o
type Process1 i o = Process (Is i) o
type Sink m o = Process m (o -> m ())

Effects Are Returned As Values, Transducers are Pure

Slide 59

Slide 59 text

Scalaz Streams

data Process m o

type Process0 o = forall a. Process (Is a) o
type Process1 i o = Process (Is i) o
type Sink m o = Process m (o -> m ())

runFoldMap :: (Monad m, Monoid b) =>
  Process m o -> (o -> m b) -> m b

Computation of Values Modelled Externally

Slide 60

Slide 60 text

Pipes

Poll     :: Producer Mail
Prepare  :: Pipe Mail Features
Classify :: Pipe Features Score
Deliver  :: Consumer Score

Slide 61

Slide 61 text

Conduit

Poll     :: Source Mail
Prepare  :: Conduit Mail Features
Classify :: Conduit Features Score
Deliver  :: Sink Score

Slide 62

Slide 62 text

Scalaz Stream

Poll     :: Process m Mail
Prepare  :: Process1 Mail Features
Classify :: Process1 Features Score
Deliver  :: Sink m Score

Slide 63

Slide 63 text

Horizontal Composition

Slide 64

Slide 64 text

Prepare Classify Deliver Poll Mail Delivery

Slide 65

Slide 65 text

Prepare Classify Deliver Poll Mail Delivery

Slide 66

Slide 66 text

Throttle Read Scan Poll

Slide 67

Slide 67 text

Poll     :: In Event
Prepare  :: Pipeline Mail Features
Classify :: Pipeline Features Score
Deliver  :: Out Event

Slide 68

Slide 68 text

Poll     :: In Event
Prepare  :: Pipeline Mail Features
Classify :: Pipeline Features Score
Deliver  :: Out Event

(>|) :: Pipeline i o m a -> Pipeline o o' m a -> Pipeline i o' m a

Slide 69

Slide 69 text

Poll     :: In Event
Prepare  :: Pipeline Mail Features
Classify :: Pipeline Features Score
Deliver  :: Out Event

Poll >| Prepare >| Classify >| Deliver :: Pipeline () Void

(>|) :: Pipeline i o m a -> Pipeline o o' m a -> Pipeline i o' m a

Slide 70

Slide 70 text

Poll     :: In Event
Prepare  :: Pipeline Mail Features
Classify :: Pipeline Features Score
Deliver  :: Out Event

Poll >| Prepare >| Classify >| Deliver :: Pipeline () Void

eval :: Pipeline () Void m a -> m a

(>|) :: Pipeline i o m a -> Pipeline o o' m a -> Pipeline i o' m a
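One way the `(>|)` and `eval` signatures above could be implemented, as a pull-based sketch in which the downstream stage drives and upstream only runs when downstream awaits. The `Effect` constructor and the helpers `nats`, `mapP` and `sumN` are assumptions made for this example, not definitions from the talk.

```haskell
import Control.Monad (ap, forever, replicateM)
import Data.Functor.Identity (Identity (..))
import Data.Void (Void, absurd)

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a))  -- assumption: not shown on the slides

instance Functor m => Functor (Pipeline i o m) where
  fmap f (Done a)    = Done (f a)
  fmap f (Yield o k) = Yield o (fmap f k)
  fmap f (Await g)   = Await (fmap f . g)
  fmap f (Effect m)  = Effect (fmap (fmap f) m)

instance Functor m => Applicative (Pipeline i o m) where
  pure  = Done
  (<*>) = ap

instance Functor m => Monad (Pipeline i o m) where
  Done a    >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g   >>= f = Await (\i -> g i >>= f)
  Effect m  >>= f = Effect (fmap (>>= f) m)

infixl 7 >|

-- Horizontal composition: run downstream until it awaits, then pull upstream.
(>|) :: Functor m => Pipeline i o m a -> Pipeline o o' m a -> Pipeline i o' m a
_  >| Done a    = Done a
up >| Yield o k = Yield o (up >| k)
up >| Effect m  = Effect (fmap (up >|) m)
up >| Await f   = pull up
  where
    pull (Done a)    = Done a
    pull (Yield o k) = k >| f o
    pull (Await g)   = Await (pull . g)
    pull (Effect m)  = Effect (fmap pull m)

-- Run a closed pipeline: it needs no input and can never yield.
eval :: Monad m => Pipeline () Void m a -> m a
eval (Done a)    = return a
eval (Yield v _) = absurd v
eval (Await f)   = eval (f ())
eval (Effect m)  = m >>= eval

-- Hypothetical stages for the demo.
nats :: Pipeline () Int m a
nats = go 1 where go n = Yield n (go (n + 1))

mapP :: Functor m => (i -> o) -> Pipeline i o m a
mapP f = forever (Await (\i -> Yield (f i) (Done ())))

sumN :: Functor m => Int -> Pipeline Int Void m Int
sumN n = sum <$> replicateM n (Await Done)

main :: IO ()
main = print (runIdentity (eval (nats >| mapP (* 2) >| sumN 3)))
```

Because the sink drives, the infinite `nats` source is only asked for three values before the whole pipeline finishes with the sink's result.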

Slide 71

Slide 71 text

Pipes

Poll     :: Producer Event
Prepare  :: Pipe Mail Features
Classify :: Pipe Features Score
Deliver  :: Consumer Score

Poll >-> Prepare >-> Classify >-> Deliver :: Effect

Slide 72

Slide 72 text

Pipes

Poll     :: Producer Event
Prepare  :: Pipe Mail Features
Classify :: Pipe Features Score
Deliver  :: Consumer Score

Poll >-> Prepare >-> Classify >-> Deliver

runEffect :: Effect m a -> m a

Slide 73

Slide 73 text

Conduit

Poll     :: Source Event
Prepare  :: Conduit Mail Features
Classify :: Conduit Features Score
Deliver  :: Sink Score

Poll =$= Prepare =$= Classify =$= Deliver :: Source ()

Slide 74

Slide 74 text

Conduit

Poll     :: Source Event
Prepare  :: Conduit Mail Features
Classify :: Conduit Features Score
Deliver  :: Sink Score

Poll $= Prepare =$= Classify =$ Deliver :: Source ()

Slide 75

Slide 75 text

Conduit

Poll     :: Source Event
Prepare  :: Conduit Mail Features
Classify :: Conduit Features Score
Deliver  :: Sink Score

Poll $= Prepare =$= Classify $$ Deliver :: m ()

Slide 76

Slide 76 text

Scalaz Stream

Poll     :: Process m Event
Prepare  :: Process1 Mail Features
Classify :: Process1 Features Score
Deliver  :: Sink m Score

Poll |> Prepare |> Classify to Deliver :: Process m ()

Slide 77

Slide 77 text

Scalaz Stream

Poll     :: Process m Event
Prepare  :: Process1 Mail Features
Classify :: Process1 Features Score
Deliver  :: Sink m Score

Poll |> Prepare |> Classify to Deliver

run :: Process m a -> m ()

Slide 78

Slide 78 text

Is Composition About Combinators or Laws?

Slide 79

Slide 79 text

id Poll id id Poll Poll
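The laws in question are the category laws: composing with an identity stage changes nothing, and grouping does not matter. The sketch below spot-checks them on one example (a test, not a proof) for a hypothetical pull-based `(>|)`. The `Effect` constructor, the composition strategy, and the names `idP`, `nats` and `sumN` are all assumptions of this example.

```haskell
import Control.Monad (ap, forever, replicateM)
import Data.Functor.Identity (Identity (..))
import Data.Void (Void, absurd)

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a))  -- assumption: not shown on the slides

instance Functor m => Functor (Pipeline i o m) where
  fmap f (Done a)    = Done (f a)
  fmap f (Yield o k) = Yield o (fmap f k)
  fmap f (Await g)   = Await (fmap f . g)
  fmap f (Effect m)  = Effect (fmap (fmap f) m)

instance Functor m => Applicative (Pipeline i o m) where
  pure  = Done
  (<*>) = ap

instance Functor m => Monad (Pipeline i o m) where
  Done a    >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g   >>= f = Await (\i -> g i >>= f)
  Effect m  >>= f = Effect (fmap (>>= f) m)

infixl 7 >|

-- Pull-based composition: downstream drives, upstream runs on demand.
(>|) :: Functor m => Pipeline i o m a -> Pipeline o o' m a -> Pipeline i o' m a
_  >| Done a    = Done a
up >| Yield o k = Yield o (up >| k)
up >| Effect m  = Effect (fmap (up >|) m)
up >| Await f   = pull up
  where
    pull (Done a)    = Done a
    pull (Yield o k) = k >| f o
    pull (Await g)   = Await (pull . g)
    pull (Effect m)  = Effect (fmap pull m)

eval :: Monad m => Pipeline () Void m a -> m a
eval (Done a)    = return a
eval (Yield v _) = absurd v
eval (Await f)   = eval (f ())
eval (Effect m)  = m >>= eval

-- The identity stage: forward every input unchanged.
idP :: Functor m => Pipeline i i m a
idP = forever (Await (\i -> Yield i (Done ())))

nats :: Pipeline () Int m a
nats = go 1 where go n = Yield n (go (n + 1))

sumN :: Functor m => Int -> Pipeline Int Void m Int
sumN n = sum <$> replicateM n (Await Done)

-- The same pipeline, without id and with id grouped either way.
checks :: [Int]
checks = map (runIdentity . eval)
  [ nats >| sumN 3
  , (nats >| idP) >| sumN 3
  , nats >| (idP >| sumN 3)
  ]

main :: IO ()
main = print checks
```

All three groupings should agree; if a combinator only works for one grouping, composition is about the operator rather than the law.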

Slide 80

Slide 80 text

Vertical Composition

Slide 81

Slide 81 text

Prepare Classify Deliver Poll Mail Delivery

Slide 82

Slide 82 text

Throttle Read Scan Poll

Slide 83

Slide 83 text

Throttle Read Scan Events

Slide 84

Slide 84 text

Throttle Read Scan Events

Slide 85

Slide 85 text

Throttle Read Scan Events Events

Slide 86

Slide 86 text

Throttle Read Scan Events

Slide 87

Slide 87 text

Throttle Read Scan Events Mail

Slide 88

Slide 88 text

Throttle Read Work Home Work Throttle 2 3 5

Slide 89

Slide 89 text

Throttle Read Work Home Work Throttle 1 5 3

Slide 90

Slide 90 text

Throttle Read Work Home Work Throttle 1 5 2

Slide 91

Slide 91 text

Throttle Read Work Home Work Throttle 1 5 2

Slide 92

Slide 92 text

Throttle Read Work Home Work Throttle 0 5 2

Slide 93

Slide 93 text

Throttle Read Work Home Work Throttle 0 5 1

Slide 94

Slide 94 text

Throttle Read Work Home Work Throttle 0 5 1 0

Slide 95

Slide 95 text

Throttle Read Home Work Home Throttle Work 1 5

Slide 96

Slide 96 text

Throttle Read Home Work Home Throttle Work 1 4

Slide 97

Slide 97 text

Throttle Read Home Work Home Throttle Work 0 4

Slide 98

Slide 98 text

Throttle Read Home Work Home Throttle Work 0 4

Slide 99

Slide 99 text

Throttle Read Work Home Home Throttle Work 3 4 Work Throttle

Slide 100

Slide 100 text

Pipes

Components: Throttle, Read, Work, Home, Work, Throttle
Types: Producer Event, Pipe Event Event, Pipe Event Mail
Operators: >->, >->, >>=, forever

Slide 101

Slide 101 text

Conduit

Components: Throttle, Read, Work, Home, Work, Throttle
Types: Source Event, Conduit Event Event, Conduit Event Mail
Operators: $=, =$=, >>=, forever

Slide 102

Slide 102 text

Scalaz Stream

Components: Throttle, Read, Work, Home, Work, Throttle
Types: Process m Event, Process1 Event Event, Process1 Event Mail
Operators: |>, |>, fby, repeat
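Vertical composition sequences stages in time rather than in space: one stage runs until it finishes, then `>>=` hands the same input stream to the next, and `forever` repeats the cycle. A minimal sketch on a hypothetical `Pipeline` type follows. The `Effect` constructor and the names `passN`, `dropN`, `throttle` and `feed` are assumptions, and the pass-two-drop-one policy is invented for the demo rather than taken from the talk's diagram.

```haskell
import Control.Monad (ap, forever, replicateM_)
import Data.Functor.Identity (Identity (..))

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a))  -- assumption: not shown on the slides

instance Functor m => Functor (Pipeline i o m) where
  fmap f (Done a)    = Done (f a)
  fmap f (Yield o k) = Yield o (fmap f k)
  fmap f (Await g)   = Await (fmap f . g)
  fmap f (Effect m)  = Effect (fmap (fmap f) m)

instance Functor m => Applicative (Pipeline i o m) where
  pure  = Done
  (<*>) = ap

instance Functor m => Monad (Pipeline i o m) where
  Done a    >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g   >>= f = Await (\i -> g i >>= f)
  Effect m  >>= f = Effect (fmap (>>= f) m)

yield :: o -> Pipeline i o m ()
yield o = Yield o (Done ())

await :: Pipeline i o m i
await = Await Done

-- Forward exactly n inputs, then finish.
passN :: Functor m => Int -> Pipeline i i m ()
passN n = replicateM_ n (await >>= yield)

-- Discard exactly n inputs, then finish.
dropN :: Functor m => Int -> Pipeline i o m ()
dropN n = replicateM_ n await

-- Vertical composition: pass two, drop one, and repeat.
throttle :: Functor m => Pipeline i i m a
throttle = forever (passN 2 >> dropN 1)

-- Feed a list through a stage and collect its output; stop when input runs out.
feed :: Monad m => [i] -> Pipeline i o m a -> m [o]
feed _        (Done _)    = return []
feed is       (Yield o k) = (o :) <$> feed is k
feed is       (Effect m)  = m >>= feed is
feed []       (Await _)   = return []
feed (i : is) (Await f)   = feed is (f i)

main :: IO ()
main = print (runIdentity (feed [1 .. 7 :: Int] throttle))
```

Each of `passN` and `dropN` terminates with a value, which is what lets `>>` stitch them together; a stage that could never finish would leave nothing to sequence after.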

Slide 103

Slide 103 text

Intuition 3: Parsers

Slide 104

Slide 104 text

type In i m a = Pipeline () i m a

type Out o m a = Pipeline o Void m a

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)

Slide 105

Slide 105 text

type In i m a = Pipeline () i m a

type Out o m a = Pipeline o Void m a

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)

yield :: o -> Pipeline i o m ()
yield o = Yield o (Done ())

await :: Pipeline i o m i
await = Await Done

Slide 106

Slide 106 text

one :: Pipeline i i m ()
one = do
  i <- await
  yield i

cat :: Pipeline i i m ()
cat = forever one

pairs :: Pipeline i (i, i) m ()
pairs = forever $ do
  i1 <- await
  i2 <- await
  yield (i1, i2)

Slide 107

Slide 107 text

counter :: Monad m => Pipeline i (Int, i) m ()
counter = flip evalStateT 0 . forever $ do
  i <- lift await
  n <- get
  put (n + 1)
  lift . yield $ (n, i)

filter :: (i -> Bool) -> Pipeline i i m a
filter f = forever $ do
  i <- await
  when (f i) $ yield i
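The parser-style combinators from the last two slides can be exercised on plain lists. This sketch assumes a `Pipeline` with an added `Effect` constructor and a hypothetical `feed` runner. `counter` threads its count through explicit recursion instead of `StateT` to stay within `base`, and the filter is renamed `filterP` to avoid clashing with the Prelude.

```haskell
import Control.Monad (ap, forever, when)
import Data.Functor.Identity (Identity (..))

data Pipeline i o m a
  = Done a
  | Yield o (Pipeline i o m a)
  | Await (i -> Pipeline i o m a)
  | Effect (m (Pipeline i o m a))  -- assumption: not shown on the slides

instance Functor m => Functor (Pipeline i o m) where
  fmap f (Done a)    = Done (f a)
  fmap f (Yield o k) = Yield o (fmap f k)
  fmap f (Await g)   = Await (fmap f . g)
  fmap f (Effect m)  = Effect (fmap (fmap f) m)

instance Functor m => Applicative (Pipeline i o m) where
  pure  = Done
  (<*>) = ap

instance Functor m => Monad (Pipeline i o m) where
  Done a    >>= f = f a
  Yield o k >>= f = Yield o (k >>= f)
  Await g   >>= f = Await (\i -> g i >>= f)
  Effect m  >>= f = Effect (fmap (>>= f) m)

yield :: o -> Pipeline i o m ()
yield o = Yield o (Done ())

await :: Pipeline i o m i
await = Await Done

-- Group consecutive inputs into pairs.
pairs :: Functor m => Pipeline i (i, i) m a
pairs = forever $ do
  i1 <- await
  i2 <- await
  yield (i1, i2)

-- Tag each input with its zero-based position.
counter :: Functor m => Pipeline i (Int, i) m a
counter = go 0
  where
    go n = do
      i <- await
      yield (n, i)
      go (n + 1)

-- Forward only the inputs that satisfy the predicate.
filterP :: Functor m => (i -> Bool) -> Pipeline i i m a
filterP f = forever $ do
  i <- await
  when (f i) $ yield i

-- Feed a list through a stage and collect its output; stop when input runs out.
feed :: Monad m => [i] -> Pipeline i o m a -> m [o]
feed _        (Done _)    = return []
feed is       (Yield o k) = (o :) <$> feed is k
feed is       (Effect m)  = m >>= feed is
feed []       (Await _)   = return []
feed (i : is) (Await f)   = feed is (f i)

main :: IO ()
main = do
  print (runIdentity (feed [1 .. 5 :: Int] pairs))
  print (runIdentity (feed "abc" counter))
  print (runIdentity (feed [1 .. 6 :: Int] (filterP even)))
```

Note how `pairs` silently drops a trailing unpaired input: the stage is still awaiting its second element when the input ends.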

Slide 108

Slide 108 text

Pipes

yield :: o -> Pipe i o m ()

await :: Pipe i o m i

Conduit

yield :: o -> ConduitM i o m ()

await :: ConduitM i o m (Maybe i)

awaitForever :: (i -> ConduitM i o m a)
             -> ConduitM i o m ()

Scalaz Stream

emit :: o -> Process f o

await1 :: Process1 i i

Slide 109

Slide 109 text

Subtlety Fights Back

Slide 110

Slide 110 text

Internal vs External Management of Resources

Slide 111

Slide 111 text

Layered Streams

Slide 112

Slide 112 text

Constant Memory Streaming

Slide 113

Slide 113 text

How much does elegance cost?

Slide 114

Slide 114 text

to be continued... @markhibberd