ETL in Raku

Convincing users to adopt Raku requires finding a niche. It turns out that Raku is an ideal language for designing frameworks: plugins in a variety of languages are easily added, and the basic language constructs make firewalling simple. This talk looks at an example of designing an ETL framework and a sample transform plugin. It provides an example that would be easily adaptable to bioinformatics, an area with complicated ETL that is currently heavily invested in Perl.

Steven Lembark

July 02, 2024

Transcript

  1. Raku has a lot of features.

    More than most people know. More than most people use. They fit together so nicely.
  2. Raku has a lot of features.

    More than most people know. More than most people use. They fit together so nicely. The trick now is getting someone to use it.
  3. Who wants them?

    Q: Who would care about the features we have? Handle complex, variable requirements. Generate both boilerplate and plugins. Needs parallel execution.
  4. Complex Processing: ETL

    Feeding data warehouses. Sounds simple: Extract some data. Transform it to local format. Load it into a database/-lake/-cesspool.
  5. Complex Processing: ETL

    Feeding data warehouses. Sounds simple: Extract some data. Transform it to local format. Load it into a database/-lake/-cesspool. Catch: It ain’t.
  6. Complex Processing: ETL

    The whole is messier than the sum of its failures. Failures propagate: Extraction creates orphan data. Transforms create unloadable data. Loading creates bookkeeping errors.
  7. Why this fits Raku

    “Modern programming” isn’t about programming. It’s about fitting pieces into a ‘framework’. Smaller pieces are better. Raku provides useful tools for frameworks. Nice ways to integrate the pieces.
  8. Raku features for ETL

    Functors avoid exposing framework hooks. gather/take simplify data acquisition. lazy simplifies handling large datasets. Junctions encapsulate complex, multi-level logic. Smartmatch simplifies comparing messy data.
  9. Raku features for ETL

    Orthogonal, additive constructs. More opportunities for declarative approaches. Fine-grained control over execution. Effective for ETL framework and plugins.
  10. Raku features for ETL

    Well-integrated interface to external libraries. Allows leveraging C, Rust, etc. code. Speed. Compatibility. Portability.
  11. Lifecycle

    Extraction is mostly boilerplate: Remote system access: curl, wget, ssh. Exception handling. Minimal configuration: remote location, authn.
  12. Basic framework

    Top level try block. Catches aborts. Framework methods daisy-chain via: sub X(ETL $etl --> ETL);

        try {
            ETL
            .new
            .prior
            .extract
            .transform
            .load
            .after;

            CATCH {
                default {
                    # audit log for morgue.
                }
            }
        }
  13. Basic framework

    Transform stage: Takes the post-extract object. Generates the transformed output.

        try {
            ETL
            .new
            .prior
            .extract
            .transform
            .load
            .after;

            CATCH { default { ... } }
        }
  14. Basic framework

    Transform stage: read-record() & transform-row() are closures. else-blocks handle failure to open (undef).

        method transform( --> ETL ) {
            with Plugin.open-input( $.in-path ) -> &read-record {
                my \input_rows = lazy gather loop { read-record() or last };

                with Plugin.open-output( $.out-path ) -> &transform-row {
                    input_rows
                    .hyper( degree => $.threads )
                    .map( { sink transform-row $_ } );
                } else { … }
            } else { … }

            self
        }
  15. Basic framework

    Transform stage: input_rows is a lazy Seq returned from the lazy gather. This calls the loop once each time a new record is required.

        method transform( --> ETL ) {
            with Plugin.open-input( $.in-path ) -> &read-record {
                my \input_rows = lazy gather loop { read-record() or last };

                with Plugin.open-output( $.out-path ) -> &transform-row {
                    input_rows
                    .hyper( degree => $.threads )
                    .map( { sink transform-row $_ } );
                } else { … }
            } else { … }

            self
        }
  16. Basic framework

    Transform stage: input_rows has a threaded map. Passed the gathered rows from input_rows.

        method transform( --> ETL ) {
            with Plugin.open-input( $.in-path ) -> &read-record {
                my \input_rows = lazy gather loop { read-record() or last };

                with Plugin.open-output( $.out-path ) -> &transform-row {
                    input_rows
                    .hyper( degree => $.threads )
                    .map( { sink transform-row $_ } );
                } else { … }
            } else { … }

            self
        }
  17. Basic framework

    Transform stage: the object is passed out for the next stage.

        method transform( --> ETL ) {
            with Plugin.open-input( $.in-path ) -> &read-record {
                my \input_rows = lazy gather loop { read-record() or last };

                with Plugin.open-output( $.out-path ) -> &transform-row {
                    input_rows
                    .hyper( degree => $.threads )
                    .map( { sink transform-row $_ } );
                } else { … }
            } else { … }

            self
        }
  18. Basic plugin

    Plugin reader: open-input returns a closure. $in-path is a string: open-input has no access to the ETL object.

        method open-input( Str $in-path --> Code ) {
            with $in-path.IO.open -> my $fh will end {.close} {
                sub ( --> Str ) { $fh.lines }
            } else {
                die "Failed open: $in-path"
            }
        }
  19. Basic plugin

    Plugin writer: Writes to gzip via “run(...).in”, gzip’s stdin.

        method open-output( Str $path --> Code ) {
            with $path.IO.child( "$path.gz" ).open( :w ) -> $out will end {.close} {
                my $gzip = run( < /bin/gzip -9 >, :out( $out ), :in ).in;
                sub { $gzip.print( $^line ) }
            } else {
                die "Failed open: $path, $!"
            }
        }
  20. Basic plugin

    Plugin writer: Returns a closure that writes to gzip.

        method open-output( Str $path --> Code ) {
            with $path.IO.child( "$path.gz" ).open( :w ) -> $out will end {.close} {
                my $gzip = run( < /bin/gzip -9 >, :out( $out ), :in ).in;
                sub { $gzip.print( $^line ) }
            } else {
                die "Failed open: $path, $!"
            }
        }
  21. Framework and Plugin

    Are firewalled: gather doesn’t specify the take. take doesn’t know about the gather. Closures accept strings, not ETL objects. Closures encapsulate their filehandles. The isolation simplifies both layers. One more reason Raku is nice for frameworks.
  22. But wait, there’s MORE!!!

    Effective, fast external library interface. Make use of fast Rust, etc. libs. Recycle published interfaces. Use Raku framework to exec multi-language stages.
  23. Who would use it?

    Bioinformatics: Grammars handle Truly Messy™ formats. Smartmatch simplifies validation & testing: if ( $protein ~~ $dna ) … if ( $protein ~~ $junction ) … Common requirement for external libs. Common requirement for speed.
  24. Who would use it?

    Finance: High-volume with metadata & audit requirements. External libraries helpful. Simple plugins.
  25. Summary

    ETL is a nice niche space for Raku: Reliable frameworks. Straightforward plugins. Flexible grammars & comparison tools. It can handle the level of messiness required.
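
The features listed on slide 8 can be tried in a few lines. This is a minimal sketch, not code from the talk: the in-memory `@raw` list, the `id=… status=…` record layout, and the names `records`, `$accepted`, and `@good` are all assumptions made for illustration. It shows a lazy gather/take feeding a filter that smartmatches each record's status against a junction.

```raku
# Hypothetical input: an in-memory list stands in for a file or network feed.
my @raw = 'id=1 status=ok', 'id=2 status=bad', '', 'id=3 status=pending';

# lazy gather/take: a record is produced only when a consumer asks for one.
my \records = lazy gather for @raw -> $line {
    take $line if $line.chars;      # skip empty lines
}

# A junction holds every acceptable status in a single value.
my $accepted = any( 'ok', 'pending' );

# Smartmatch against the junction tests all alternatives in one comparison.
# .eager forces the lazy pipeline to run so the result is an ordinary array.
my @good = records.grep( { .split('=').tail ~~ $accepted } ).eager;

.say for @good;
```

Adding another acceptable status only changes the junction; the filter itself stays the same.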
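
The firewalling described on slide 21 depends on `take` being dynamically scoped: the framework owns the `gather`, while the plugin's closure issues the `take`. The following is a small sketch under stated assumptions rather than the talk's implementation. The plugin reads from an in-memory list instead of a path, the transform only upper-cases each row, the `.prior`/`.after` stages and the `.hyper` fan-out from the slides are left out, and the attribute names `@.input` and `@.output` are hypothetical.

```raku
# Hypothetical plugin: it sees plain data, never the ETL object.
class Plugin {
    # Returns a closure that emits one record per call via take.
    # The take binds to whichever gather is active when the closure runs.
    method open-input( @source --> Code ) {
        my $i = 0;
        sub ( --> Bool ) {
            return False if $i >= @source.elems;
            take @source[ $i++ ];
            True
        }
    }

    # Returns a closure that transforms one row; here it only upper-cases it.
    method open-output( --> Code ) {
        sub ( Str $row --> Str ) { $row.uc }
    }
}

# Hypothetical framework: each stage returns the object so calls can chain.
class ETL {
    has @.input;     # stands in for $.in-path on the slides
    has @.output;

    method extract( --> ETL ) { self }    # boilerplate stage, empty in this sketch
    method load(    --> ETL ) { self }    # boilerplate stage, empty in this sketch

    method transform( --> ETL ) {
        my &read-record   = Plugin.open-input( @!input );
        my &transform-row = Plugin.open-output;

        # The gather lives in the framework; the matching take is in the plugin.
        my \input-rows = lazy gather loop { read-record() or last };

        # Single-threaded map; .eager forces the lazy rows to be processed now.
        @!output = input-rows.map( { transform-row $_ } ).eager;
        self
    }
}

my $etl = ETL.new( input => <alpha beta gamma> );

try {
    $etl.extract.transform.load;

    CATCH {
        default { note "ETL aborted: { .message }" }   # stand-in for the audit log
    }
}

say $etl.output;
```

Neither class refers to the other's internals: `Plugin` never sees an `ETL` object, and `ETL` never sees how records are read or rewritten, which is the isolation the slides argue for.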