Slide 1

Slide 1 text

Sipping Objects from a Firehose: Raku for ETL Steven Lembark Workhorse Computing [email protected]

Slide 2

Slide 2 text

Raku has a lot of features. More than most people know. More than most people use. They fit together so nicely.

Slide 3

Slide 3 text

Raku has a lot of features. More than most people know. More than most people use. They fit together so nicely. The trick now is getting someone to use it.

Slide 4

Slide 4 text

Who wants them? Q: Who would care about the features we have? A: Anyone who has to: Handle complex, variable requirements. Generate both boilerplate and plugins. Run in parallel.

Slide 5

Slide 5 text

Complex Processing: ETL “Extract, Transform, and Load”

Slide 6

Slide 6 text

Complex Processing: ETL Feeding data warehouses.

Slide 7

Slide 7 text

Complex Processing: ETL Feeding data warehouses. Sounds simple: Extract some data. Transform it to local format. Load it into a database/-lake/-cesspool.

Slide 8

Slide 8 text

Complex Processing: ETL Feeding data warehouses. Sounds simple: Extract some data. Transform it to local format. Load it into a database/-lake/-cesspool. Catch: It ain’t.

Slide 9

Slide 9 text

Complex Processing: ETL Extraction fails. Transforms require complex, variable scrubbing logic. Loading has extensive bookkeeping.

Slide 10

Slide 10 text

Complex Processing: ETL The whole is messier than the sum of its failures. Failures propagate: Extraction creates orphan data. Transforms create unloadable data. Loading creates bookkeeping errors.

Slide 11

Slide 11 text

Why this fits Raku “Modern programming” isn’t about programming. It’s about fitting pieces into a ‘framework’. Smaller pieces are better. Raku provides useful tools for frameworks. Nice ways to integrate the pieces.

Slide 12

Slide 12 text

Raku features for ETL Functors gather+take lazy+hyper+map try+catch junctions smartmatch (~~)

Slide 13

Slide 13 text

Raku features for ETL Functors avoid exposing framework hooks. gather/take simplify data acquisition. lazy simplifies handling large datasets. Junctions encapsulate complex, multi-level logic. Smartmatch simplifies comparing messy data.

Slide 14

Slide 14 text

Raku features for ETL Orthogonal, additive constructs. More opportunities for declarative approaches. Fine-grained control over execution. Effective for ETL framework and plugins.

Slide 15

Slide 15 text

Raku features for ETL Well-integrated interface to external libraries. Allows leveraging C, Rust, etc. code. Speed. Compatibility. Portability.

Slide 16

Slide 16 text

Lifecycle Extraction is mostly boilerplate: Remote system access: curl, wget, ssh. Exception handling. Minimal configuration: remote location, authn.

Slide 17

Slide 17 text

Lifecycle Extraction is mostly boilerplate: try/catch logic. Shell commands: wget, curl, ssh, scp. HTTP, FTP endpoints.
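The extraction boilerplate above can be sketched roughly as follows. This is a minimal illustration, not the framework's code: the `extract` sub is a hypothetical name, and `echo` and `sh` stand in for the curl/wget/ssh commands so that it runs without a network (a Unix-like system is assumed).

```raku
# Hypothetical helper: run an external command, return its stdout,
# and die with the exit code and stderr if the command fails.
sub extract ( *@cmd --> Str ) {
    my $proc = run |@cmd, :out, :err;
    my $out  = $proc.out.slurp( :close );
    my $err  = $proc.err.slurp( :close );

    die "Extract failed ({ $proc.exitcode }): @cmd[]: $err"
        if $proc.exitcode != 0;

    $out
}

# A command that works (stand-in for a successful download).
my $good = extract 'echo', 'payload';

# A command that fails (stand-in for an unreachable host).
# try swallows the exception and leaves it in $!.
my $bad = try extract 'sh', '-c', 'echo unreachable >&2; exit 3';
my $why = $!.message;

say $good.chomp;
say $why.chomp;
```

The point is only that the failure handling lives in one place; the plugin supplying the command does not need to know about it.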

Slide 18

Slide 18 text

Lifecycle Transforms are mostly logic: Grammars. Junctions. Handle plugins: gather/take. Parallel, large data: lazy/hyper/map.

Slide 19

Slide 19 text

Lifecycle Loading is mostly bookkeeping: Loading data. Loading metadata. Loading audit data. Parallel execution.

Slide 20

Slide 20 text

Lifecycle All wrapped in shell executables. Configuration. Command line. Process control. Error handling.

Slide 21

Slide 21 text

Basic framework Top level try block. Catches aborts. Framework methods daisy-chain via:

    sub X( ETL $etl --> ETL );

    try {
        ETL
        .new
        .prior
        .extract
        .transform
        .load
        .after;
        CATCH {
            default {
                # audit log for morgue.
            }
        }
    }
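A runnable sketch of the daisy-chain idea follows. The `ETL` class here is a hypothetical stand-in whose stages only record their names, and the `load` stage is rigged to fail so that the top-level handler has something to catch.

```raku
# Hypothetical stand-in for the framework class; stage bodies are placeholders.
class ETL {
    has Str @.log;

    # Each stage records its name and returns the object so calls can chain.
    method prior     ( --> ETL ) { @!log.push: 'prior';     self }
    method extract   ( --> ETL ) { @!log.push: 'extract';   self }
    method transform ( --> ETL ) { @!log.push: 'transform'; self }
    method load      ( --> ETL ) { die 'load failed: database unreachable' }
    method after     ( --> ETL ) { @!log.push: 'after';     self }
}

my $etl = ETL.new;
my $failure;

try {
    $etl.prior.extract.transform.load.after;

    # Any abort in any stage lands here; a real framework would
    # write an audit record instead of saving the message.
    CATCH {
        default { $failure = .message }
    }
}

say $etl.log;
say $failure;
```

Because every stage returns the object, a stage that dies simply stops the chain, and the stages after it never run.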

Slide 22

Slide 22 text

Basic framework Transform stage: Takes the post-extract object. Generates the transformed output.

    try {
        ETL
        .new
        .prior
        .extract
        .transform
        .load
        .after;
        CATCH { default { ... } }
    }

Slide 23

Slide 23 text

Basic framework Transform stage: read-record() & transform-row() are closures. else-blocks handle failure to open (undef).

    method transform( --> ETL ) {
        with Plugin.open-input( $.in-path ) -> &read-record {
            my \input_rows = lazy gather loop { read-record() or last };
            with Plugin.open-output( $.out-path ) -> &transform-row {
                input_rows
                .hyper( degree => $.threads )
                .map( { sink transform-row $_ } );
            } else { … }
        } else { … }
        self
    }

Slide 24

Slide 24 text

Basic framework Transform stage: input_rows is a lazy Seq (a bare promise of rows) returned from the lazy gather. The loop body runs once each time a new record is required.

    method transform( --> ETL ) {
        with Plugin.open-input( $.in-path ) -> &read-record {
            my \input_rows = lazy gather loop { read-record() or last };
            with Plugin.open-output( $.out-path ) -> &transform-row {
                input_rows
                .hyper( degree => $.threads )
                .map( { sink transform-row $_ } );
            } else { … }
        } else { … }
        self
    }
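To see the on-demand behaviour in isolation, here is a small sketch. It assumes a hypothetical reader closure that counts its own calls, and it puts the `take` inside the loop (the slide's plugin may place it elsewhere).

```raku
# Hypothetical reader: hands back 1, 2, 3 ... and Nil after a million records.
my $reads = 0;
my &read-record = sub { $reads < 1_000_000 ?? ++$reads !! Nil };

# Nothing is read here: the gather block only runs when a row is requested.
my \input_rows = lazy gather loop {
    take read-record() // last;
};

my $reads-before = $reads;

# Pull just three rows; the rest of the source is never touched.
my @first = input_rows.head( 3 );

say "reads before pull: $reads-before";
say "first rows: @first[]";
say "reads after pull: $reads";
```

The reader is called only as rows are pulled, which is what keeps a large extract from being loaded into memory up front.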

Slide 25

Slide 25 text

Basic framework Transform stage: input_rows gets a threaded map. hyper spreads the gathered rows from input_rows across $.threads workers, each calling transform-row.

    method transform( --> ETL ) {
        with Plugin.open-input( $.in-path ) -> &read-record {
            my \input_rows = lazy gather loop { read-record() or last };
            with Plugin.open-output( $.out-path ) -> &transform-row {
                input_rows
                .hyper( degree => $.threads )
                .map( { sink transform-row $_ } );
            } else { … }
        } else { … }
        self
    }
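A stand-alone sketch of the threaded map, with invented data: the `transform-row` closure and the pipe-delimited rows are assumptions for illustration, and the `degree` and `batch` values are arbitrary.

```raku
# Hypothetical row transform: normalise one pipe-delimited record.
my &transform-row = sub ( Str $row --> Str ) {
    $row.split( '|' ).map( *.trim.uc ).join( "\t" )
};

# Invented input: 200 messy rows.
my @input = ( 1 .. 200 ).map: { "id$_ | name$_ " };

# hyper runs the map in batches across worker threads
# while keeping the results in input order.
my @output = @input.hyper( degree => 4, batch => 25 ).map( { transform-row( $_ ) } );

say @output.elems;
say @output[0];
```

If output order does not matter, `race` is the unordered counterpart of `hyper`.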

Slide 26

Slide 26 text

Basic framework Transform stage: the object is passed out for the next stage.

    method transform( --> ETL ) {
        with Plugin.open-input( $.in-path ) -> &read-record {
            my \input_rows = lazy gather loop { read-record() or last };
            with Plugin.open-output( $.out-path ) -> &transform-row {
                input_rows
                .hyper( degree => $.threads )
                .map( { sink transform-row $_ } );
            } else { … }
        } else { … }
        self
    }

Slide 27

Slide 27 text

Basic plugin Plugin reader: open-input returns a closure. $in-path is a string: open-input has no access to the ETL object.

    method open-input ( Str $in-path --> Code ) {
        with $in-path.IO.open -> $fh will end {.close} {
            sub ( --> Str ) { $fh.lines }
        } else {
            die "Failed open: $in-path"
        }
    }
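A self-contained variant of the reader idea, for experimenting. It is a sketch under assumptions rather than the slide's plugin: `open-input` is a plain sub here, the closure returns one line per call via `get`, it closes the handle itself at end of file, and the demo file under `$*TMPDIR` is invented.

```raku
# Sketch of a reader plugin: takes a path string, returns a closure
# that yields one line per call and Nil once the file is exhausted.
sub open-input ( Str $in-path --> Callable ) {
    with $in-path.IO.open -> $fh {
        # The closure owns the handle; callers never see it.
        return sub { $fh.get // do { $fh.close; Nil } };
    }
    else {
        die "Failed open: $in-path";
    }
}

# Invented demo input in the temp directory.
my $path = $*TMPDIR.add( "etl-reader-demo-{$*PID}.txt" );
$path.spurt( "first\nsecond\nthird\n" );

my &read-record = open-input( $path.Str );
my @records     = gather loop { take read-record() // last };

$path.unlink;
say @records;
```

The caller only ever holds a closure and a path string, which is the isolation the slide describes.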

Slide 28

Slide 28 text

Basic plugin Plugin writer: Writes to gzip via “run(...).in”, gzip’s stdin.

    method open-output ( Str $path --> Code ) {
        with $path.IO.child( "$path.gz" ).open( :w ) -> $out will end {.close} {
            my $gzip = run( < /bin/gzip -9 >, :out( $out ), :in ).in;
            sub { $gzip.print( $^line ) }
        } else {
            die "Failed open: $path, $!"
        }
    }

Slide 29

Slide 29 text

Basic plugin Plugin writer: Returns a closure that writes to gzip.

    method open-output ( Str $path --> Code ) {
        with $path.IO.child( "$path.gz" ).open( :w ) -> $out will end {.close} {
            my $gzip = run( < /bin/gzip -9 >, :out( $out ), :in ).in;
            sub { $gzip.print( $^line ) }
        } else {
            die "Failed open: $path, $!"
        }
    }

Slide 30

Slide 30 text

Framework and Plugin Are firewalled: gather doesn’t specify the take. take doesn’t know about the gather. Closures accept strings, not ETL objects. Closures encapsulate their filehandles. The isolation simplifies both layers. One more reason Raku is nice for frameworks.
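The gather/take separation can be shown in a few lines. In this sketch the `emit-clean` sub plays the plugin and the `gather` block plays the framework; both names and the sample strings are invented.

```raku
# Plugin side: decides what counts as a record and takes it.
# It has no idea who is gathering.
sub emit-clean ( Str $raw ) {
    take $raw.trim if $raw.trim.chars;
}

# Framework side: gathers whatever gets taken during the calls below.
# It has no idea where in the call chain the take happens.
my @rows = gather {
    emit-clean $_ for "  a ", "", " b", "   ", "c";
};

say @rows;
```

This works because `take` finds its `gather` through the dynamic scope, so the two sides need not share any lexical context.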

Slide 31

Slide 31 text

But wait, there’s MORE!!! Junctions: Encapsulate messy, layered logic. Smartmatch: Simple, flexible code. Works nicely with Junctions.
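A small sketch of junctions combined with smartmatch for scrubbing logic. The missing-value markers, the regexes, and the sub names are all invented for illustration.

```raku
# One junction holds every spelling of "no value" the feeds use.
my $missing = any( '', 'NA', 'N/A', 'null' );

sub is-missing ( Str $field --> Bool ) { so $field.trim eq $missing }

# Junctions of regexes work with smartmatch the same way.
my $dna     = rx/ ^ <[ACGT]>+ $ /;
my $rna     = rx/ ^ <[ACGU]>+ $ /;
my $nucleic = any( $dna, $rna );

sub looks-nucleic ( Str $seq --> Bool ) { so $seq ~~ $nucleic }

# all() checks that every required field of a row is present.
my %row      = id => 7, name => 'sample';
my $complete = so all( %row<id name>.map: *.defined );

say is-missing( ' NA ' );
say looks-nucleic( 'GATTACA' );
say $complete;
```

Adding another marker or another accepted format means editing the junction, not the comparison code.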

Slide 32

Slide 32 text

But wait, there’s MORE!!! Effective, fast external library interface. Make use of fast Rust, etc. libs. Recycle published interfaces. Use Raku framework to exec multi-language stages.
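For a feel of the external library interface, here is about the smallest NativeCall example possible. It assumes a Unix-like system where the C library's `strlen` is already loaded into the process; on other platforms a library name would have to be given to `is native`.

```raku
use NativeCall;

# Declare the C function's signature; the body is supplied by libc.
sub strlen( Str --> size_t ) is native { * }

my $length = strlen( 'firehose' );
say $length;
```

The same declaration style applies to a Rust or C library built as a shared object, with its name passed to `is native`.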

Slide 33

Slide 33 text

Who would use it? Bioinformatics: Grammars handle Truly Messy ™ formats. Smartmatch simplifies validation & testing: if $protein ~~ $dna { … } if $protein ~~ $junction { … } Common requirement for external libs. Common requirement for speed.
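As a taste of grammars on bioinformatics data, here is a deliberately tiny FASTA-like grammar. It is a simplified sketch that only accepts upper-case residues and well-formed records; the token names and the sample text are invented.

```raku
# Minimal FASTA-style grammar: a '>' header line, then sequence lines.
grammar FASTA {
    token TOP      { <record>+ }
    token record   { '>' <id> \h* <desc> \n <residues> }
    token id       { \S+ }
    token desc     { \N* }
    token residues { [ <[A..Z]>+ \n ]+ }
}

my $text = ">seq1 test protein\nMKV\nLLA\n>seq2\nGATTACA\n";

my %seqs;
with FASTA.parse( $text ) -> $match {
    for $match<record>.list -> $rec {
        # Join the wrapped sequence lines back into one string.
        %seqs{ ~$rec<id> } = $rec<residues>.Str.subst( "\n", '', :g );
    }
}

say %seqs<seq1>;
say %seqs<seq2>;
```

Real feeds need more rules (lower case, gaps, comments), but each of those is one more token rather than a rewrite.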

Slide 34

Slide 34 text

Who would use it? Finance: High-volume with metadata & audit requirements. External libraries helpful. Simple plugins.

Slide 35

Slide 35 text

Summary ETL is a nice niche space for Raku: Reliable frameworks. Straightforward plugins. Flexible grammars & comparison tools. It can handle the level of messiness required.

Slide 36

Slide 36 text

Bedside Reading More reading on guts of ETL operations: https://speakerdeck.com/lembark/gathers-and-takes More reading on Raku in general: https://docs.raku.org/