
Neatly Folding a Tree: Functional Perl5 AWS Glacier Hashes

When functional programming works it is elegant and efficient. The AWS "Tree Hash" used to validate uploads is nice fodder for recursion, but has problems with stack size. Perl5 has features for implementing tail call elimination, which reduces the stack, but they look ugly. Fortunately, Perl5 also has Keyword::Declare to wrap the tail call recursion into a neat package.

This talk describes the Glacier Tree Hash, Tail Call Elimination in general, implementing it in Perl5, and going a few steps further to implement the solution in fast, minimal functional code.
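The pairwise fold and the tail call elimination described above can be sketched in core Perl5. This is a minimal illustration rather than the code from the talk: it assumes Perl 5.16 or later for __SUB__, uses sha256 from Digest::SHA, borrows the name reduce_hash from the slides, and the sample chunks 'a', 'b', 'c' are made up for the example.

```perl
use v5.16;      # enables strict and the current_sub feature (__SUB__)
use warnings;
use Digest::SHA qw( sha256 );

# Fold a list of binary digests pairwise until a single one remains.
sub reduce_hash
{
    @_ > 1 or return $_[0];                 # undef for an empty list

    my $count = int( @_ / 2 ) + @_ % 2;     # pairs, plus any odd leftover

    # Build the next level of the tree: hash each pair, and carry an
    # unpaired last digest up unchanged.
    @_ = map { @_ > 1 ? sha256( splice @_, 0, 2 ) : shift } ( 1 .. $count );

    # Restart this sub with the new argument list. goto replaces the
    # current call frame instead of pushing another one.
    goto __SUB__
}

# Hypothetical input: three tiny "chunks" standing in for 1 MiB pieces.
my @leaves = map { sha256( $_ ) } qw( a b c );

say unpack 'H*', reduce_hash( @leaves );
```

Because the sub re-enters itself with goto instead of calling itself, the depth of the call stack stays constant however many digests are passed in.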

Steven Lembark

July 09, 2022

Transcript

  1. In the beginning... There was Spaghetti Code. And it was

    bad. So we invented Objects. Now we have Spaghetti Objects.
  2. Basic rules Constant data. Transparent transforms. Functions require input. Output

    determined fully by inputs. Avoid internal state & side effects.
  3. Where it does: Tree Hash Used with AWS “Glacier” service.

    $0.01/GiB/Month. Large, cold data (discounts for EiB, PiB). Uploads require lots of sha256 values.
  4. Digesting large chunks Uploads chunked in multiples of 1 MiB. Digest

    for each chunk & entire upload. Result: tree-hash.
  5. One solution from Net::Amazon::TreeHash sub calc_tree { my ($self) =

    @_; my $prev_level = 0; while (scalar @{ $self->{tree}->[$prev_level] } > 1) { my $curr_level = $prev_level+1; $self->{tree}->[$curr_level] = []; my $prev_tree = $self->{tree}->[$prev_level]; my $curr_tree = $self->{tree}->[$curr_level]; my $len = scalar @$prev_tree; for (my $i = 0; $i < $len; $i += 2) { if ($len - $i > 1) { my $a = $prev_tree->[$i]; my $b = $prev_tree->[$i+1]; push @$curr_tree, { joined => 0, start => $a->{start}, finish => $b->{finish}, hash => sha256( $a->{hash}.$b->{hash} ) }; } else { push @$curr_tree, $prev_tree->[$i]; } } $prev_level = $curr_level; }
  6. Pass 1: Reduce the hashes Reduce pairs. Until one value

    remaining. sub reduce_hash { # undef for empty list @_ > 1 or return $_[0]; my $count = @_ / 2 + @_ % 2; reduce_hash map { @_ > 1 ? sha256 splice @_, 0, 2 : shift } ( 1 .. $count ) }
  7. Pass 1: Reduce the hashes Reduce pairs. Until one value

    remaining. Catch: Eats Stack sub reduce_hash { # undef for empty list @_ > 1 or return $_[0]; my $count = @_ / 2 + @_ % 2; reduce_hash map { @_ > 1 ? sha256 splice @_, 0, 2 : shift } ( 1 .. $count ) }
  8. Chasing your tail Tail recursion is common. "Tail call elimination"

    recycles stack. "Fold" is a feature of FP languages. Reduces the stack to a scalar.
  9. Fold in Perl5 Reset the stack. Restart the sub. my

    $foo = sub { @_ > 1 or return $_[0]; @_ = … ; # new in v5.16 goto __SUB__ };
  10. Pass 2: Reduce hashes Voilà! Stack shrinks. sub reduce_hash {

    2 > @_ and return $_[0]; my $count = @_ / 2 + @_ % 2; @_ = map { @_ > 1 ? sha256 splice @_, 0, 2 : @_ } ( 1 .. $count ); goto __SUB__ };
  11. Pass 2: Reduce hashes Voilà! Stack shrinks. @_ = is

    ugly. sub reduce_hash { 2 > @_ and return $_[0]; my $count = @_ / 2 + @_ % 2; @_ = map { @_ > 1 ? sha256 splice @_, 0, 2 : @_ } ( 1 .. $count ); goto __SUB__ };
  12. Pass 2: Reduce hashes Voilà! Stack shrinks. @_ = is

    ugly. goto scares people. sub reduce_hash { 2 > @_ and return $_[0]; my $count = @_ / 2 + @_ % 2; @_ = map { @_ > 1 ? sha256 splice @_, 0, 2 : @_ } ( 1 .. $count ); goto __SUB__ };
  13. "Fold" is an FP Pattern. use Keyword::Declare; keyword tree_fold (

    Ident $name, Block $new_list ) { qq { sub $name { \@_ or return; ( \@_ = do $new_list ) > 1 and goto __SUB__; \$_[0] } } } $new_list is source code, not a subref! So is the result of interpolating it.
  14. "Fold" is an FP Pattern. use Keyword::Declare; keyword tree_fold (

    Ident $name, Block $new_list ) { qq { sub $name { \@_ or return; ( \@_ = do $new_list ) > 1 and goto __SUB__; \$_[0] } } } See K::D POD for {{{…}}} to avoid "\@_".
  15. Minimal syntax tree_fold reduce_hash { my $count = @_ /

    2 + @_ % 2; map { @_ > 1 ? sha256 splice @_, 0, 2 : @_ } ( 1 .. $count ) } User supplies generator a.k.a $new_list
  16. Minimal syntax tree_fold reduce_hash { my $count = @_ /

    2 + @_ % 2; map { @_ > 1 ? sha256 splice @_, 0, 2 : @_ } ( 1 .. $count ) } User supplies generator. NQFP: Hacks the stack.
  17. Don't hack the stack Replace splice with offsets. tree_fold reduce_hash

    { my $last = @_ / 2 + @_ % 2 - 1; map { $_[ $_ + 1 ] ? sha256 @_[ $_, $_ + 1 ] : $_[ $_ ] } map { 2 * $_ } ( 0 .. $last ) }
  18. Don't hack the stack Replace splice with offsets. Still messy:

    @_, stacked map. tree_fold reduce_hash { my $last = @_ / 2 + @_ % 2 - 1; map { $_[ $_ + 1 ] ? sha256 @_[ $_, $_ + 1 ] : $_[ $_ ] } map { 2 * $_ } ( 0 .. $last ) }
  19. Using lexical variables Declare fold_hash with parameters. Caller uses lexical

    vars. keyword tree_fold ( Ident $name, List $argz, Block $stack_op ) { ... }
  20. Boilerplate for lexical variables Extract lexical variables. See also: PPI::Token

    my @varz # ( '$foo', '$bar' ) = map { $_->isa( 'PPI::Token::Symbol' ) ? $_->{ content } : () } map { $_->isa( 'PPI::Statement::Expression' ) ? @{ $_->{ children } } : () } @{ $argz->{ children } };
  21. Boilerplate for lexical variables my $lexical = join ',' =>

    @varz; my $count = @varz; my $offset = $count -1; sub $name { \@_ or return; my \$last = \@_ % $count ? int( \@_ / $count ) : int( \@_ / $count ) - 1 ; ... Count & offset used to extract stack.
  22. Boilerplate for lexical variables \@_ = map { my (

    $lexical ) = \@_[ \$_ .. \$_ + $offset ]; do $stack_op } map { \$_ * $count } ( 0 .. \$last ); Interpolate variable names, count, stack offset.
  23. FP uses constants Replace lexical vars with values. Constant module

    uses ties. Slow, only one tie per variable.
  24. FP uses constants Replace lexical vars with values. Const::Fast handles

    scalar, array, hash. Uses SV magic struct. Conflicts with bless (magic gets set too early).
  25. Fast perly constants lvalue wrapper delays dlock. Plays nice with

    bless. sub const : lvalue { dlock $_[0]; $_[0] }
  26. Putting the fun into Perl5 Cannot modify $last, $_, return

    values. tree_fold reduce_hash { const my $last = @_ / 2 + @_ % 2 - 1; map { $_[ $_ + 1 ] ? const sha256 @_[ $_, $_ + 1 ] : const $_[ $_ ] } map { const 2 * $_ } ( 0 .. $last ) }
  27. Consume the buffer. Prior to 5.20: Pass large string by

    ref. Example: File::Slurp returns ref-to-scalar. e.g., my $size = length $$buffer; That or use $_[$i] directly to avoid copies.
  28. Magic constants Const for variable, subroutine results. use Glacier::Util::Const qw(

    const ); keyword value( Var $var ) {{{ const my <{$var}> }}} keyword value {{{ const }}}
  29. value keyword Keywords can be nested: tree_fold reduce_hash { value

    $last = @_ / 2 + @_ % 2 - 1; map { $_[ $_ + 1 ] ? value sha256 @_[ $_, $_ + 1 ] : value $_[ $_ ] } map { value 2 * $_ } ( 0 .. $last ) }
  30. value keyword Or just: tree_fold reduce_hash( $left, $rite ) {

    value $rite ? sha256 $left, $rite : $left }
  31. Avoid a copy: $$buffer Map input to chunks. Reduce them.

    Done. sub buffer_hash { state $mib = 2 ** 20; my $buffer = shift; my $size = length $$buffer; my $count = int $size / $mib + !!( $size % $mib ); reduce_hash map { sha256 substr $$buffer, 0, $mib, '' } ( 1 .. $count ) }
  32. Q: What is left after buffer_hash? A: An empty

    buffer. The caller's buffer! Oops... caller needs a copy to upload! Result: No memory savings at all.
  33. Easier in v5.20+ Pass buffer as‑is. COW preserves caller's buffer.

    No extra copy. sub buffer_hash { state $mib = 2 ** 20; my $buffer = shift; my $size = length $$buffer; my $count = int $size / $mib + !!( $size % $mib ); reduce_hash map { sha256 substr $$buffer, 0, $mib, '' } ( 1 .. $count ) }
  34. Now try it with FP Consuming $buffer breaks the rules.

    Find a way to extract the pieces.
  35. Buffer hash with only copies unpack is the fastest way.

    Less code. One copy of buffer. sub buffer_hash { state $format = '(a' . 2**20 . ')*'; const my $buffer = shift; reduce_hash map { const sha256 $_ } unpack $format => $buffer };
  36. Public interface sub tree_hash { @_ > 1 ? &reduce_hash

    : &buffer_hash } sub tree_hash_hex { unpack 'H*', &tree_hash } Multi-part: hash hashes. Single-part hashes buffer. Useful for testing:
  37. Buffer Size vs. Usr Time Explicit map, keyword with and

    without lexicals. 8-32MiB are good chunk sizes.

     MiB  Explicit  Implicit  Keyword
       1      0.02      0.01     0.02
       2      0.03      0.03     0.04
       4      0.07      0.07     0.07
       8      0.14      0.13     0.10
      16      0.19      0.18     0.17
      32      0.31      0.30     0.26
      64      0.50      0.51     0.49
     128      1.00      1.02     1.01
     256      2.03      2.03     2.03
     512      4.05      4.10     4.06
    1024      8.10      8.10     8.11
  38. Result: FP in Perl5 When FP works it is elegant.

    Core Perl5 syntax helps: lvalue __SUB__ COW strings
  39. Result: FP in Perl5 When FP works it is elegant.

    Keywords: True Laziness ® at its best. Don't repeat boilerplate. Multimethods in Perl5.
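
The later slides combine three ideas: splitting a buffer with unpack, folding by offsets instead of splice, and restarting the sub with goto __SUB__. The sketch below puts them together using only core modules. It is an approximation under stated assumptions, not the talk's final code: it leaves out Keyword::Declare and the const/dlock wrapper, assumes Perl 5.20 or later, and buffer_hash_hex is a hypothetical helper added here for readable output.

```perl
use v5.20;      # enables strict, state and __SUB__; COW strings arrived in 5.20
use warnings;
use Digest::SHA qw( sha256 );

# Fold digests pairwise by index. @_ is only read inside the map, so the
# argument list is not consumed while the next level is being built.
sub reduce_hash
{
    @_ > 1 or return $_[0];             # undef for an empty list

    my $last = int( ( @_ - 1 ) / 2 );   # number of the final pair

    @_ = map {
        my $i = 2 * $_;                 # offset of the left-hand digest

        # Hash the pair if a right-hand partner exists, otherwise carry
        # the odd digest up to the next level unchanged.
        $i < $#_ ? sha256( @_[ $i, $i + 1 ] ) : $_[ $i ]
    } ( 0 .. $last );

    goto __SUB__                        # restart without a new stack frame
}

# Split the buffer into 1 MiB pieces (the last may be shorter), digest
# each piece, then fold the digests. The caller's buffer is left intact.
sub buffer_hash
{
    state $format = '(a' . 2**20 . ')*';

    my $buffer = shift;

    reduce_hash( map { sha256( $_ ) } unpack $format, $buffer )
}

# Hypothetical convenience wrapper: hex text instead of binary.
sub buffer_hash_hex { unpack 'H*', buffer_hash( @_ ) }

# Made-up sample data: two full pieces plus a short tail.
my $data = ( 'a' x 2**20 ) . ( 'b' x 2**20 ) . 'tail';

say buffer_hash_hex( $data );
```

A buffer of 1 MiB or less produces a single digest, so reduce_hash returns it unchanged and the tree hash equals the plain sha256 of the buffer. An empty buffer yields undef in this sketch, which a production version would need to handle.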