
gathers-and-takes.pdf

Steven Lembark

November 03, 2023

Transcript

  1. Hypers & Gathers & Takes! Oh My! The Yellow Brick

    Road to Raku ETL. Steven Lembark Workhorse Computing [email protected]
  2. nr.gz is available on NCBI Non-redundant sequence data from NIH.

    123GB of BLAST format AA sequences. 257_100_652 sequences (recently). From a dozen to ~100K chars in length.
  3. Q: Is it really unique? NIH doesn’t do its own quality

    control. Catch: There can be duplicate sequences. How to validate uniqueness?
  4. BLAST format Devised by biologists to make us all miserable:

    Prefix ‘>’ starts a sequence. EOF ends the last sequence.

        > arbitrary text...\n
        sequence line...\n
        sequence line...\n
        > and so it goes...\n
        sequence line...\n
  5. Comparing sequences Extract the sequences. Compare all 257_100_652 of them.

    Only 33_050_372_629_412_552 pairs! About 1_048_020 years at 1KHz.
  6. Comparing sequences But wait! Just SHA the sequences! SHA512 has

    collisions. No one-step way to compare via digests.
  7. Comparing sequences Ignore headers. Extract sequences. Generate two outputs: ID

    + Length + Digest ID + Sequence Merge by length + digest. Compare collision sequences.
  8. Read sequences No good way to thread reading: Sequences are

    variable length. Most larger than a filesystem block.
  9. Pick a format. xz: half the space in twice the time:

        91483591675 nr.gz    # gzip -9 -> 91GB
        51888998124 nr.xz    # xz -9e  -> 51GB

        # time xzcat < nr.xz > /dev/null;
        real 110m3.088s    user 109m3.145s    sys 0m58.175s

        # time gzip -dc < nr.gz > /dev/null;
        real 41m28.595s    user 40m16.881s    sys 1m11.103s
  10. Pick a format. De-compressing on NVMe is non-trivial in both cases.

        91483591675 nr.gz    # gzip -9 -> 91GB
        51888998124 nr.xz    # xz -9e  -> 51GB

        # time xzcat < nr.xz > /dev/null;
        real 110m3.088s    user 109m3.145s    sys 0m58.175s

        # time gzip -dc < nr.gz > /dev/null;
        real 41m28.595s    user 40m16.881s    sys 1m11.103s
  11. Processing sequences Result: Single-stream read with parallel processing. Chunk the

    input, pass chunks off to worker threads. Ooooh, worker pools! Q: Who here enjoys fork-exec? IPC? Data pipes? Semaphores?
  12. Processing sequences Result: Single-stream read with parallel processing. Chunk the

    input, pass chunks off to worker threads. A: Nobody. Raku ends your pain.
  13. Processing sequences One approach: sequence handlers subscribe; the reader writes

    each sequence to a channel. Graceful delivery, bookkeeping all under the hood. Nicely described in docs: https://docs.raku.org/language/concurrency
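    A minimal sketch of that channel pattern, separate from the deck's code: workers subscribe to a Channel, the reader sends work, and close lets the workers drain and exit. The worker count and the say body are stand-ins.

        my $seqs = Channel.new;

        my @workers = ( ^4 ).map:
        {
            start
            {
                react
                {
                    whenever $seqs -> $seq
                    {
                        say "processed: $seq";   # stand-in for real sequence work
                    }
                }
            }
        };

        $seqs.send( $_ ) for < one two three >;
        $seqs.close;       # lets the workers finish
        await @workers;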
  14. First step: Read the input. Deal with input format. Chunk

    the data. Pass chunks to worker threads.
  15. Command line arguments --path='X' is input path of data. --threads=N

    count of worker threads. --chunk_size=N size of read chunk.

        unit sub MAIN
        (
            IO::Path:D :$path = $*PROGRAM.dirname.IO.add( '../data/nr.gz' )
          , Int:D :$threads    = 1
          , Int:D :$chunk_size = 18
        );
  16. Command line arguments $path defaults to /exec/path/../data/nr.gz. unit sub MAIN

    ( IO::Path:D :$path = $*PROGRAM.dirname.IO.add( '../data/nr.gz' ) ... );
  17. Command line arguments Splat ('*') is a “twigil”: it marks a

    pre-defined dynamic variable. unit sub MAIN ( IO::Path:D :$path = $*PROGRAM.dirname.IO.add( '../data/nr.gz' ) ... );
  18. Command line arguments Dots are method calls. unit sub MAIN

    ( IO::Path:D :$path = $*PROGRAM.dirname.IO.add( '../data/nr.gz' ) ... );
  19. Command line arguments Dots are method calls. dirname returns String.

    unit sub MAIN ( IO::Path:D :$path = $*PROGRAM.dirname.IO.add( '../data/nr.gz' ) ... );
  20. Command line arguments Dots are method calls. dirname returns String.

    .IO is a constructor. unit sub MAIN ( IO::Path:D :$path = $*PROGRAM.dirname.IO.add( '../data/nr.gz' ) ... );
  21. Command line arguments Dots are method calls. dirname returns String.

    .IO.add( … ) is a method call to the IO object. unit sub MAIN ( IO::Path:D :$path = $*PROGRAM.dirname.IO.add( '../data/nr.gz' ) ... );
  22. ROOT of all evil ROOT is a value. It cannot

    be changed. Re-assignment is a compile-time error. my \ROOT = $*PROGRAM.dirname.IO.add( '../out' ).IO.cleanup.add( $*PROGRAM.basename );
  23. ROOT of all evil Base for paths of output files.

    cleanup resolves “..” and “//”. Not abspath. my \ROOT = $*PROGRAM.dirname.IO.add( '../out' ).IO.cleanup.add( $*PROGRAM.basename );
  24. Housekeeping with localizes its argument. Assigns to $_ by default.

    with ROOT.basename ~ '.' -> $match { .unlink for ROOT.dirname.IO.dir ( test => { .IO.basename ~~ rx{^ $match } } ); }
  25. Housekeeping ROOT is a value: no sigil. with ROOT.basename ~

    '.' -> $match { .unlink for ROOT.dirname.IO.dir ( test => { .IO.basename ~~ rx{^ $match } } ); }
  26. Housekeeping ‘~’ is catenate. with ROOT.basename ~ '.' -> $match

    { .unlink for ROOT.dirname.IO.dir ( test => { .IO.basename ~~ rx{^ $match } } ); }
  27. Housekeeping -> binds the value to the block parameter. with ROOT.basename ~ '.' -> $match

    { .unlink for ROOT.dirname.IO.dir ( test => { .IO.basename ~~ rx{^ $match } } ); }
  28. Housekeeping postfix for iterates $_.unlink. with ROOT.basename ~ '.' ->

    $match { .unlink for ROOT.dirname.IO.dir ( test => { .IO.basename ~~ rx{^ $match } } ); }
  29. Housekeeping Directory scan. with ROOT.basename ~ '.' -> $match {

    .unlink for ROOT.dirname.IO.dir ( test => { .IO.basename ~~ rx{^ $match } } ); }
  30. Housekeeping ‘:’ is syntactic sugar for trailing ( … ).

    Raku isn’t ((((LISP )))):-) with ROOT.basename ~ '.' -> $match { .unlink for ROOT.dirname.IO.dir: test => { .IO.basename ~~ rx{^ $match } } ; }
  31. Pick a format, any format... Input path defines extract command.

        my $extract = gather given $path
        {
            when /[.]gz   $/ { take 'gzip -dc'  }
            when /[.]bz2? $/ { take 'bzip2 -dc' }
            when /[.]xz   $/ { take 'xzcat'     }

            default { die "Unknown file type:'$path' (gz,bz,xz)." }
        };
  32. Pick a format, any format... given localizes its argument to

    $_. when does a smartmatch.

        my $extract = gather given $path
        {
            when /[.]gz   $/ { take 'gzip -dc'  }
            when /[.]bz2? $/ { take 'bzip2 -dc' }
            when /[.]xz   $/ { take 'xzcat'     }

            default { die "Unknown file type:'$path' (gz,bz,xz)." }
        };
  33. Pick a format, any format... gather accumulates the result of

    takes. take can be anywhere: block, sub-call...

        my $extract = gather given $path
        {
            when /[.]gz   $/ { take 'gzip -dc'  }
            when /[.]bz2? $/ { take 'bzip2 -dc' }
            when /[.]xz   $/ { take 'xzcat'     }

            default { die "Unknown file type:'$path' (gz,bz,xz)." }
        };
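    A toy example of that freedom, apart from the deck: take fires inside a sub called under the gather, and the values still land in the result. The sub name and data are invented.

        sub moods ( $day )
        {
            take 'grumpy' if $day eq 'Monday';
            take 'cheery';
        }

        my @all = gather { moods( $_ ) for < Monday Friday > };
        say @all;   # [grumpy cheery cheery]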
  34. Reading the input run forks a shell command. :out makes

    stdout available.

        with run( "$extract $path", :out ).out -> $input
        {
            # read from $input
        }
        else
        {
            die "Failed '$extract $path', $!\n";
        }
  35. Reading the input $input is assigned the stdout handle from

    run.

        with run( "$extract $path", :out ).out -> $input
        {
            # read from $input
        }
        else
        {
            die "Failed '$extract $path', $!\n";
        }
  36. Reading the input The alternative of with is an undefined value in

    the binding. Lands in the else-block to handle failed execution.

        with run( "$extract $path", :out ).out -> $input
        {
            # read from $input
        }
        else
        {
            die "Failed '$extract $path', $!\n";
        }
  37. Reading the input Catch: run over-buffers the input. Works fine

    for “ls -l” or “git status” as commands. Looks like a memory leak with large input. Fix: Use a named pipe.
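    The slides assume the pipe already exists; creating it is one extra step. A sketch, using mkfifo(1) from the OS rather than anything built into Raku:

        my $pipe = $*PROGRAM.dirname.IO.add( 'read.p' );

        $pipe.e or do
        {
            my $proc = run 'mkfifo', $pipe.Str;
            $proc.exitcode == 0 or die "Failed mkfifo: '$pipe'.";
        };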
  38. Find the pipe Generate & sanity check a pipe.

        my $pipe = gather
        with $*PROGRAM.dirname.IO.add( 'read.p' ) -> $sanity
        {
            $sanity.e or die "Non-existent: '$sanity'.";
            $sanity.r or die "Non-readable: '$sanity'.";

            take $sanity;   # doesn’t need to be at the end!

            say "Pipe: '$sanity'";
        }
  39. Write the pipe Shell syntax for extracting from a file

    into a named pipe. with "$extract < $path > $pipe &" -> $cmd { say "Extract: '$cmd'"; start shell $cmd; }
  40. Write the pipe Shell handles fork/exec. with "$extract < $path

    > $pipe &" -> $cmd { say "Extract: '$cmd'"; start shell $cmd; }
  41. Write the pipe start creates a thread. with "$extract <

    $path > $pipe &" -> $cmd { say "Extract: '$cmd'"; start shell $cmd; }
  42. Read the pipe Reading the named pipe: open a file.

        with open $pipe, :r -> $input
        {
            # ... process $input
        }
        else
        {
            die "Failed open: '$pipe', $!";
        }
  43. Read the pipe Raku files don’t auto-close!

        with open $pipe, :r -> $input
        {
            # ... process $input

            close $input;
        }
        else
        {
            die "Failed open: '$pipe', $!";
        }
  44. Progress meter Check if processing stalls. say adds a newline.

        my $chunks = 0;

        start loop
        {
            sleep 5;
            say "$chunks chunks";
        };
  45. Reading the input. Balance chunk size with thread count. Ideally

    keep everything running. One option: readline with m{ ^ > } as end-of-record. Way too slow.
  46. Reading the input. Balance chunk size with thread count. Ideally

    keep everything running. Faster option: Fixed-size, unbuffered read. Chunk the data, deal with trailing records.
  47. Containers & Data Fixed data has a few advantages: Not

    an lvalue, causes compile-time error. Faster: No container object to un-wrap. Variables: update & interpolate.

        my \READ_BYTES = 2 ** 16;
        my \READ_COUNT = 2 ** ( $chunk_size - 16 );
        my \INTERVAL   = 2 ** ( 25 - $chunk_size );
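    A toy illustration of the value / container split, apart from the deck's constants:

        my \LIMIT = 2 ** 16;   # value: no container to unwrap
        my $count = 0;         # container: updatable, interpolates as "$count"

        $count = LIMIT;        # fine: assignment into the container
        # LIMIT = 0;           # error: LIMIT is a value, not an lvalue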
  48. Chunky data read returns a Buffer. Buffers are blobs, not

    text.

        my $chunk = '';

        for ( 1 .. READ_COUNT )
        {
            $chunk ~= $input.read( READ_BYTES ).decode( 'ascii' );
            $input.eof and last;
        }
  49. Chunky data decode produces text. ‘ascii’ is lowest-overhead.

        my $chunk = '';

        for ( 1 .. READ_COUNT )
        {
            $chunk ~= $input.read( READ_BYTES ).decode( 'ascii' );
            $input.eof and last;
        }
  50. Chunky data decode produces text. ‘ascii’ is lowest-overhead. Decoded text

    is appended to the buffer.

        my $chunk = '';

        for ( 1 .. READ_COUNT )
        {
            $chunk ~= $input.read( READ_BYTES ).decode( 'ascii' );
            $input.eof and last;
        }
  51. Chunky data Chunks don’t align with records. state value preserved

    between calls.

        state $next  = '';
        state $chunk = '';

        my \offset = $input.tell - $next.chars;

        $chunk = $next;

        for ( 1 .. READ_COUNT ) { ... }
  52. Chunky data Take feeds gather. No need to know where

    it goes.

        given $chunk.chars
        {
            when CHUNK_CHARS    # complete chunk
            {
                my $i = 1 + $chunk.rindex( '>' );
                $next = $chunk.substr( $i );
                take( offset, $chunk.substr( 0, $i ) );
            }
            when 0              # empty $next after EOF.
            {
                put "\tInput complete, final chunk at $chunks.";
            }
            default             # partial chunk on EOF.
            {
                $next = '';
                take( offset, $chunk ~ '>' );
            }
        }
  53. Chunky data Full, empty, or small chunk.

        given $chunk.chars
        {
            when CHUNK_CHARS    # complete chunk
            {
                my $i = 1 + $chunk.rindex( '>' );
                $next = $chunk.substr( $i );
                take( offset, $chunk.substr( 0, $i ) );
            }
            when 0              # empty $next after EOF.
            {
                put "\tInput complete, final chunk at $chunks.";
            }
            default             # partial chunk on EOF.
            {
                $next = '';
                take( offset, $chunk ~ '>' );
            }
        }
  54. Chunky data Truncate chunk at final ‘>’.

        given $chunk.chars
        {
            when CHUNK_CHARS    # complete chunk
            {
                my $i = 1 + $chunk.rindex( '>' );
                $next = $chunk.substr( $i );
                take( offset, $chunk.substr( 0, $i ) );
            }
            when 0              # empty $next after EOF.
            {
                put "\tInput complete, final chunk at $chunks.";
            }
            default             # partial chunk on EOF.
            {
                $next = '';
                take( offset, $chunk ~ '>' );
            }
        }
  55. Chunky data Take hands back two values.

        given $chunk.chars
        {
            when CHUNK_CHARS    # complete chunk
            {
                my $i = 1 + $chunk.rindex( '>' );
                $next = $chunk.substr( $i );
                take( offset, $chunk.substr( 0, $i ) );
            }
            when 0              # empty $next after EOF.
            {
                put "\tInput complete, final chunk at $chunks.";
            }
            default             # partial chunk on EOF.
            {
                $next = '';
                take( offset, $chunk ~ '>' );
            }
        }
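    The same split on a literal chunk, just to show the arithmetic (example data invented):

        my $chunk = ">a\nAAA\n>b\nCCC\n>c\nGG";
        my $i     = 1 + $chunk.rindex( '>' );

        say $chunk.substr( 0, $i ).raku;   # ">a\nAAA\n>b\nCCC\n>" complete records
        say $chunk.substr( $i ).raku;      # "c\nGG" partial record, carried as $next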
  56. Chunky data Subs are lexical to their defining scope. with

    open $pipe, :r -> $input { sub read-chunk ( --> Bool ) { $input.read( ... ) ... $chunk.Bool } }
  57. Chunky data Signatures are objects, define args & return type.

    with open $pipe, :r -> $input { sub read-chunk ( --> Bool ) { $input.read( ... ) ... $chunk.Bool } }
  58. Chunky data Return value requires explicit cast. with open $pipe,

    :r -> $input { sub read-chunk ( --> Bool ) { $input.read( ... )... $chunk.Bool } }
  59. Chunky data Chunked input handler: take accumulates data for caller.

    False from empty $chunk on EOF. Interface is simple and fast. Takes are passed back...

        sub read-chunk ( --> Bool )
        {
            state $next  = '';
            state $chunk = '';

            my \offset = $input.tell - $next.chars;

            $chunk = $next;

            for ( 1 .. READ_COUNT )
            {
                $chunk ~= $input.read( READ_BYTES ).decode( 'ascii' );
                $input.eof and last;
            }

            given $chunk.chars
            {
                when CHUNK_CHARS
                {
                    my $i = 1 + $chunk.rindex( EOR );
                    $next = $chunk.substr( $i );
                    take( offset, $chunk.substr( 0, $i ) );
                }
                when 0
                {
                    put "\tInput complete, final chunk at $chunks.";
                }
                default
                {
                    $next = '';
                    take( offset, $chunk ~ '>' );
                }
            }

            $chunk.Bool
        }
  60. Sipping from a firehose … to the gather outside them.

    my @chunked-input = gather loop { read-chunk $input or last; };
  61. Sipping from a firehose Catch: @chunked-input reads all of the

    data. my @chunked-input = gather loop { read-chunk $input or last; };
  62. Sipping from a firehose lazy returns a promise. Populated on

    demand. my @chunked-input = lazy gather loop { read-chunk $input or last; };
  63. Sipping from a firehose Catch: @chunked-input still buffers input. Fix:

    \chunked-input is a bare promise. Delivers chunks without buffering them. my \chunked-input = lazy gather loop { read-chunk $input or last; };
  64. Sipping from a firehose Result: single-stream reader. Agnostic to take

    contents. my \chunked-input = lazy gather loop { read-chunk $input or last; };
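    The laziness is easy to see on a toy case, apart from the deck: the infinite loop below never runs ahead of demand.

        my \squares = lazy gather for 1 .. * -> $n
        {
            take $n ** 2;
        };

        say squares[ ^5 ];   # (1 4 9 16 25): only five takes ever fire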
  65. Sipping from a firehose Count chunks for the watchdog thread.

    my \chunked-input = lazy gather loop { read-chunk $input or last; ++$chunks; };
  66. Sipping from a firehose chunked-input promises data. map reads it.

    One chunk at a time. chunked-input .map: { process-chunk |$_ } ;
  67. Sipping from a firehose take( $offset, $chunk ) saves a list object.

    slip flattens object contents. Coordination between reader & processor, not map. chunked-input .map: { process-chunk |$_ } ;
  68. Two-fisted drinking. hyper parallelizes an operator. Output guaranteed to be

    in the order of inputs. Order bookkeeping done under the hood. chunked-input .hyper( degree => $threads, batch => 1 ) .map: { process-chunk |$_ } ;
  69. Two-fisted drinking. Chunks are independent. race() doesn’t care about input

    order. Faster with independent inputs. chunked-input .race( degree => $threads ) .map: { process-chunk |$_ } ;
  70. Two-fisted drinking. Generic parallel dispatch in Raku: promise supplies data.

    map consumes it. chunked-input .race( degree => $threads ) .map: { process-chunk |$_ } ;
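    A toy comparison, independent of the reader: hyper keeps input order, race hands results back as they finish. Degree, batch, and data are invented for the demo.

        my @in = 1 .. 20;

        my @ordered   = @in.hyper( degree => 4, batch => 1 ).map( * ** 2 );
        my @unordered = @in.race(  degree => 4, batch => 1 ).map( * ** 2 );

        say @ordered;          # always 1 4 9 ... 400
        say @unordered.sort;   # same values, arrival order not guaranteed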
  71. Generic Raku ETL. Lazy reader: reader takes a list.

        my \inputs = lazy gather loop { read-next or last };

        sub read-next ( --> Bool ) { ... take (...); $data.Bool };

    Threaded processor: slip flattens it out.

        inputs.hyper( degree => $x ).map: { process |$_ };
        inputs.race(  degree => $x ).map: { process |$_ };
  72. Generic Raku ETL. Two lines of code:

        my \inputs = lazy gather loop { read-next or last };

        inputs.race( degree => $x ).map: { process |$_ };
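    The same skeleton instantiated on something trivial, counting line lengths. The file name and the worker body are invented for the example:

        my $fh = open 'input.txt', :r or die "No input.txt";

        sub read-next ( --> Bool )
        {
            state $n = 0;

            with $fh.get -> $line   # get returns Nil at EOF
            {
                take ( ++$n, $line );
                True
            }
            else
            {
                False
            }
        }

        my \inputs = lazy gather loop { read-next or last };

        inputs.race( degree => 4 ).map: -> ( $n, $line )
        {
            say "$n: { $line.chars } chars";
        };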
  73. Process sequences For each chunk: Locate records. Skip headers. Extract

    sequence. Save Offset, Length, Digest, Sequence. Separate files for staged processing.
  74. Process sequences Two general approaches: Accumulate all records first, open

    and write files. Open files and write records as they are generated. Tradeoff: memory vs. system call rate. Buffer all sequence data, or lots of separate kernel calls.
  75. Take what you need Arguments expanded from |$_. read-chunk’s take

    == process-chunk’s args.

        sub process-chunk
        (
            Int:D \chunk_off
          , Str:D \chunk
          --> Nil
        )
        { ... }
  76. Take what you need Arguments expanded from |$_. Returns nothing

    (map wasn’t assigned).

        sub process-chunk
        (
            Int:D \chunk_off
          , Str:D \chunk
          --> Nil
        )
        { ... }
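    The expansion itself in miniature (sub name and data invented): |$_ flattens each two-element list into the positional arguments.

        sub show ( Int:D \off, Str:D \text --> Nil )
        {
            say "offset {off}: {text}";
        }

        ( ( 0, 'first' ), ( 64, 'second' ) ).map: { show |$_ };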
  77. Basename for output { my \stub = sprintf '%s.%08x', ROOT,

    offset; ... offset makes finding sequences easy. %x sorts lexically.
  78. Take what you need $i & $j are updated, need

    variables. start is static through the loop. my $i = 0; my $j = 0; my @seqs = gather loop { $i = chunk.index( "\n", $j ) or last; $j = chunk.index( '>', $i ) or die ... ; my \start = chunk_off + $j + 1;
  79. Take what you need substr & subst return new Str objects.

    Suitable for daisy-chaining. my \seq = chunk.substr( $i, $j - $i ).subst( "\n", '', :g ); use Digest::MurmurHash3; my \hash = sprintf "%08x\t%08x", seq.chars, murmurhash3_32(seq,0); take [ start, hash, seq ]; }
  80. Take what you need use is lexically scoped. Avoids version

    collisions in multiple parts of code. my \seq = chunk.substr( $i, $j - $i ).subst( "\n", '', :g ); use Digest::MurmurHash3; my \hash = sprintf "%08x\t%08x", seq.chars, murmurhash3_32(seq,0); take [ start, hash, seq ]; }
  81. Take what you need take snags an array. Allows indexing

    on output. my \seq = chunk.substr( $i, $j - $i ).subst( "\n", '', :g ); use Digest::MurmurHash3; my \hash = sprintf "%08x\t%08x", seq.chars, murmurhash3_32(seq,0); take [ start, hash, seq ]; }
  82. Output what’s gathered < quoted word list > for <

    digest 1 4096 sequence 2 4096 > -> Str \name, Int \field, Int \buffsize { }
  83. Output what’s gathered Block signature takes three at a time.

    for < digest 1 4096 sequence 2 4096 > -> Str \name, Int \field, Int \bytes { }
  84. Output what’s gathered Adjust output buffer size. “True” is default,

    “False” is sync. for < digest 1 4096 sequence 2 4096 > -> Str \name, Int \field, Int \bytes { }
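    The three-at-a-time walk in miniature (names and sizes invented): the pointy block's arity pulls triples off the flat word list, and the allomorphic <...> literals satisfy the Int constraints.

        for < alpha 1 4096 beta 2 8192 > -> Str \name, Int \field, Int \bytes
        {
            say "{name}: field {field}, {bytes}-byte buffer";
        }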
  85. Output what’s gathered ( list ).join.IO yields a file. Variable

    simplifies interpolation.

        my $file = ( stub, name, 'tsv' ).join('.').IO;
        $file.e and die "Collision: '$file'";

        my $fh will leave { .close }
        = $file.open: :w, :enc('ascii'), :out-buffer( bytes );

        $fh.say: $_[ 0, field ].join( "\t" ) for @seqs;

        say "\t$fh";
  86. Output what’s gathered Open for write.

        my $file = ( stub, name, 'tsv' ).join('.').IO;
        $file.e and die "Collision: '$file'";

        my $fh will leave { .close }
        = $file.open: :w, :enc('ascii'), :out-buffer( bytes );

        $fh.say: $_[ 0, field ].join( "\t" ) for @seqs;

        say "\t$fh";
  87. Output what’s gathered Phaser LEAVE called on exit from scope.

    Self-closing file handle.

        my $file = ( stub, name, 'tsv' ).join('.').IO;
        $file.e and die "Collision: '$file'";

        my $fh will leave { .close }
        = $file.open: :w, :enc('ascii'), :out-buffer( bytes );

        $fh.say: $_[ 0, field ].join( "\t" ) for @seqs;

        say "\t$fh";
  88. Output what’s gathered Array slice uses ‘$’ not ‘@’. Where

    the sprintf "\t" comes in.

        my $file = ( stub, name, 'tsv' ).join('.').IO;
        $file.e and die "Collision: '$file'";

        my $fh will leave { .close }
        = $file.open: :w, :enc('ascii'), :out-buffer( bytes );

        $fh.say: $_[ 0, field ].join( "\t" ) for @seqs;

        say "\t$fh";
  89. Output what’s gathered Stringy filehandle is the path.

        my $file = ( stub, name, 'tsv' ).join('.').IO;
        $file.e and die "Collision: '$file'";

        my $fh will leave { .close }
        = $file.open: :w, :enc('ascii'), :out-buffer( bytes );

        $fh.say: $_[ 0, field ].join( "\t" ) for @seqs;

        say "\t$fh";
  90. Output what’s gathered Parallel writes. for < digest 1 …

    > { ... start { my $fh will leave { .close } = ...; $fh.say: $_[ 0, field ].join( "\t" ) for @seqs; } }
  91. More than one way... Buffering all of @seqs can use

    a lot of memory. Nice if take is in a closure or metadata. Allows generic process-chunks. Alternative: Immediate write.
  92. Instant gratification Store open files: no “will leave”. Requires

    separate close.

        my @filz = < digest 1 sequence 2 >
        .map: -> Str \name, Int \field
        {
            my $file = ( stub, name, 'tsv' ).join('.').IO;
            $file.e and die "Collision: '$file'";

            |( $file.open( :w, :enc('ascii') ), field )
        };
  93. Instant gratification Block signature takes file handle & offset. loop

    { ... my \start = chunk_off + $j; ... for @filz -> \fh, \field { fh.say: ( start, hash, seq )[ 0, field ] } }
  94. Fewer files 14_000 to 200_000 files in one dir? XFS

    doesn’t mind. Simple to locate sequences for comparison. Smaller files speed up sequence compares.
  95. Fewer files Alternate: Re-cycle files by thread. my \stub =

    sprintf '%s.%02x', ROOT, $*THREAD.id; ... my $fh will leave { .close } = $file.open( :a, :enc('ascii') );
  96. Fewer files Basename uses thread number. Open in append mode

    at start of thread. my \stub = sprintf '%s.%02x', ROOT, $*THREAD.id; ... my $fh will leave { .close } = $file.open( :a, :enc('ascii') );
  97. Fewer files Catch: It is sloooooow. Overhead of append. With

    XFS using many files is twice the speed. my \stub = sprintf '%s.%02x', ROOT, $*THREAD.id; ... my $fh will leave { .close } = $file.open( :a, :enc('ascii') );
  98. Logging

        my \chunked-input
        = lazy gather loop
        {
            state $prior = now;
            state $last  = 0;

            read-chunk $input or last;

            $chunks % INTERVAL or do
            {
                my \after = now;
                my \curr  = $chunks;
                my \rate  = ( ( curr - $last ) * CHUNK_CHARS / Mi / ( after - $prior ) ).Int;

                $prior = after;
                $last  = curr;

                print-stats label => "Chunk $chunks {rate} MiBaud.";
                VM.request-garbage-collection;
            };

            ++$chunks;
        };
  99. Logging Feel free to mix your metaphors. $foo or {value}:

    both work. This is the real power: Choosing what makes sense.

        my \rate = ( ( curr - $last ) * CHUNK_CHARS / Mi / ( after - $prior ) ).Int;

        $prior = after;
        $last  = curr;

        print-stats label => "Chunk $chunks {rate} MiBaud.";
        VM.request-garbage-collection;
        };
        ++$chunks;
        };
  100. Benchmark Gather + Write vs. Immediate Write. 36 cores (via

    taskset) on 3 CPUs. 64 threads in race(). Chunk size 18, 19, 20, 21, 22, 24.
  101. Benchmark

    Wallclock          Minimize.
    System time        Hopefully small.
    User / Wallclock   ~ CPU utilization.
    MBaud              Processing rate.

    Time includes housekeeping. Tradeoffs between file size and inode count.
  102. Benchmark Note: MOAR only uses threads it needs. race( degree=X

    ) != threads dispatched. degree=1000 dispatches ~60 threads.
  103. Benchmark Note: MOAR only uses threads it needs. race( degree=X

    ) != threads dispatched. degree=1000 dispatches ~60 threads. Ideally chunk processing ~ read time.
  104. Benchmark Pass1: Gather + write, 2 ** 17 chunk.

    label  : Chunk 45056 20 MiBaud.
    output : 11
    sample :
        maxrss : +104472
        minflt : +59642
        stime  : +13.371429     13 / 929 ~ small.
        utime  : +929.906223    929 / 25 ~ 36 cores active
        wtime  : +25.104481
  105. Benchmark Pass1: Gather + write, 2 ** 18 chunk. Above

    this it slows down.

    label  : Chunk 212992 21 MiBaud.
    output : 104
    sample :
        maxrss : +3728
        minflt : +33320
        stime  : +12.577622
        utime  : +902.396013    902 / 24 ~ 36 cores active
        wtime  : +24.161267
  106. Benchmark Pass2: Immediate Write, 2**20 chunk. 5380 / 173 ~

    31 active threads, 23 MByte/sec.

    label  : Chunk 286720 23 MiBaud.
    output : 70
    sample :
        maxrss : +23016
        minflt : +243763
        stime  : +60.411462     vs. 12.5 in Pass1.
        utime  : +5380.4308     5380 / 173 ~ 31 cores
        wtime  : +173.413913
  107. Benchmark Pass2: Immediate Write, 2**24 chunk. Best throughput so far.

    label  : Chunk 10240 26 MiBaud.
    output : 40
    sample :
        majflt : +287
        minflt : +1904178
        stime  : +66.829123     66 / 6600 ~ 10% system
        utime  : +6485.117116   6485 / 155 = 41 cores
        wtime  : +155.685644
  108. Benchmark Pass2: Immediate Write, chunk = 2**24

    label  : Total chunks: 141530 (1 .. 228da)
    Final  : True
    output : 70    141 * 2 ** 21 / 12 ~ 25 MiB / sec
    sample :
        inblock : +0
        majflt  : +17
        maxrss  : +13813668
        minflt  : +23557458
        oublock : +0
        stime   : +3955.09245
        utime   : +379836.841958    380 / 12 ~ 32 cores
  109. Benchmark Take + write handles smaller chunks: Less memory. More,

    smaller files. Better core utilization. Immediate write does better with bigger chunks: Fewer threads & files.
  110. Summary Raku is well suited to ETL: Threading, data management

    straightforward. Handles large inputs. Declarative code offers convenient syntax. 20+ years of 20/20 hindsight in dynamic languages.
  111. Summary Even more than more than one way to do

    it: Functional programming is an option with values. Speed advantage on really large tasks. Signatures make list handling more flexible. Use them or don’t, whichever works.
  112. https://docs.raku.org/ Nice documentation. Syntax & how-to. https://docs.raku.org/language.html X to Raku

    guides (show your friends): https://docs.raku.org/language/5to6-nutshell Tutorials. https://docs.raku.org/language/concurrency