Neatly Folding a Tree: Functional Perl5 AWS Glacier Hashes

Steven Lembark
Workhorse Computing

In the beginning...
There was Spaghetti Code. And it was bad.
So we invented Objects.
Now we have Spaghetti Objects.

Alternative: Functional Programming
Based on Lambda Calculus. Few basic ideas: Transparency. Consistency.

Basic rules
Constant data. Transparent transforms. Functions require input. Output determined fully by inputs. Avoid internal state & side effects.
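A minimal sketch of the difference these rules draw, using two hypothetical subs that are not from the talk: `add_tax` is transparent because its output is fully determined by its arguments, while `next_id` depends on hidden state and returns something different on every call.

```perl
use strict;
use warnings;

# Hypothetical examples, for illustration only.

# Transparent: the result depends only on the arguments.
sub add_tax {
    my ( $price, $rate ) = @_;
    return $price * ( 1 + $rate );
}

# Not transparent: hidden state changes the result between calls.
my $calls = 0;
sub next_id { return ++$calls }

print add_tax( 100, 0.5 ), "\n";    # same inputs ...
print add_tax( 100, 0.5 ), "\n";    # ... same output
print next_id(), ' ', next_id(), "\n";    # differs on each call
```

The first kind can be tested, cached, and reordered freely; the second kind is where the exceptions on the following slide come from.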

Catch: It doesn't always work.
time(), random(), readline(), fetchrow_array(): the output depends on more than the arguments. Result: State matters! Fix: Apply reality.

Where it does: Tree Hash
Used with AWS “Glacier” service. $0.01/GiB/Month. Large, cold data (discounts for EiB, PiB). Uploads require lots of sha256 values.

Digesting large chunks
Uploads chunked in multiples of 1 MiB. Digest for each chunk & entire upload. Result: tree-hash.

Image from Amazon Developer Guide (API Version 2012-06-01) http://docs.aws.amazon.com/amazonglacier/latest/dev/checksum-calculations.html
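To make the shape of the tree concrete, here is a small sketch with three tiny stand-in chunks instead of 1 MiB ones. The helper name `tree_root` is hypothetical and the loop is only one plain way to pair up digests; it assumes the pairing rule shown in the diagram, where an unpaired digest is promoted unchanged to the next level.

```perl
use strict;
use warnings;
use Digest::SHA qw( sha256 );

# Hypothetical helper: combine adjacent digests level by level
# until a single root digest remains.
sub tree_root {
    my @level = @_;
    while ( @level > 1 ) {
        my @next;
        while ( my ( $left, $right ) = splice @level, 0, 2 ) {
            # An unpaired digest moves up to the next level unchanged.
            push @next, defined $right ? sha256( $left . $right ) : $left;
        }
        @level = @next;
    }
    return $level[0];    # undef for an empty list
}

# Stand-in "chunks"; real ones would be 1 MiB slices of the upload.
my @chunks = ( 'aaa', 'bbb', 'ccc' );
my @leaves = map { sha256 $_ } @chunks;

# The same tree written out by hand for three leaves.
my $by_hand = sha256( sha256( $leaves[0] . $leaves[1] ) . $leaves[2] );

print unpack( 'H*', tree_root(@leaves) ), "\n";
print tree_root(@leaves) eq $by_hand ? "match\n" : "mismatch\n";
```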

One solution from Net::Amazon::TreeHash

    sub calc_tree
    {
        my ($self) = @_;
        my $prev_level = 0;
        while (scalar @{ $self->{tree}->[$prev_level] } > 1) {
            my $curr_level = $prev_level+1;
            $self->{tree}->[$curr_level] = [];

            my $prev_tree = $self->{tree}->[$prev_level];
            my $curr_tree = $self->{tree}->[$curr_level];
            my $len = scalar @$prev_tree;
            for (my $i = 0; $i < $len; $i += 2) {
                if ($len - $i > 1) {
                    my $a = $prev_tree->[$i];
                    my $b = $prev_tree->[$i+1];
                    push @$curr_tree, { joined => 0, start => $a->{start}, finish => $b->{finish}, hash => sha256( $a->{hash}.$b->{hash} ) };
                } else {
                    push @$curr_tree, $prev_tree->[$i];
                }
            }
            $prev_level = $curr_level;
        }
    }

Possibly simpler?
Trees are naturally recursive. Two-step generation: Split the buffer. Reduce the hashes.

Pass 1: Reduce the hashes
Reduce pairs. Until one value remaining. Catch: Eats Stack.

    sub reduce_hash
    {
        # undef for empty list
        @_ > 1 or return $_[0];

        my $count = @_ / 2 + @_ % 2;

        reduce_hash
        map
        {
            @_ > 1
            ? sha256 splice @_, 0, 2
            : shift
        }
        ( 1 .. $count )
    }

Chasing your tail
Tail recursion is common. "Tail call elimination" recycles stack. "Fold" is a feature of FP languages. Reduces the stack to a scalar.

Fold in Perl5
Reset the stack. Restart the sub.

    my $foo
    = sub
    {
        @_ > 1 or return $_[0];

        @_ = … ;

        # new in v5.16
        goto __SUB__
    };
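The pattern can be tried on something simpler than digests. The sketch below is a hypothetical example, not from the talk: it folds a list of numbers to their sum by adding adjacent pairs, then resets `@_` and restarts the same sub with `goto __SUB__`, so no call frame is added per pass. It assumes Perl v5.16 or later for the `current_sub` feature.

```perl
use strict;
use warnings;
use feature 'current_sub';    # enables __SUB__ (Perl v5.16+)

# Hypothetical example: fold a list of numbers to their sum by
# adding adjacent pairs on each pass.
my $pair_sum = sub {
    @_ > 1 or return $_[0];

    my @next;
    while ( my ( $left, $right ) = splice @_, 0, 2 ) {
        # An unpaired trailing value is carried forward as-is.
        push @next, $left + ( $right // 0 );
    }

    # Reset the stack, then restart this sub in place.
    @_ = @next;
    goto __SUB__;
};

print $pair_sum->( 1 .. 5 ), "\n";    # prints 15
```

Each pass halves the list: (1,2,3,4,5) becomes (3,7,5), then (10,5), then (15).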

Pass 2: Reduce hashes
Voilà! Stack shrinks. @_ = is ugly. goto scares people.

    sub reduce_hash
    {
        2 > @_ and return $_[0];

        my $count = @_ / 2 + @_ % 2;

        @_
        = map
        {
            @_ > 1
            ? sha256 splice @_, 0, 2
            : @_
        }
        ( 1 .. $count );

        goto __SUB__
    }

"Fold" is an FP Pattern.

    use Keyword::Declare;

    keyword tree_fold ( Ident $name, Block $new_list )
    {
        qq
        {
            sub $name
            {
                \@_ or return;

                ( \@_ = do $new_list ) > 1
                and goto __SUB__;

                \$_[0]
            }
        }
    }

$new_list is source code, not a subref! So is the result of interpolating it. See K::D POD for {{{…}}} to avoid "\@_".

Minimal syntax
User supplies generator, a.k.a. $new_list. NQFP: Hacks the stack.

    tree_fold reduce_hash
    {
        my $count = @_ / 2 + @_ % 2;

        map
        {
            @_ > 1
            ? sha256 splice @_, 0, 2
            : @_
        }
        ( 1 .. $count )
    }

Don't hack the stack
Replace splice with offsets. Still messy: @_, stacked map.

    tree_fold reduce_hash
    {
        my $last = @_ / 2 + @_ % 2 - 1;

        map
        {
            $_[ $_ + 1 ]
            ? sha256 @_[ $_, $_ + 1 ]
            : $_[ $_ ]
        }
        map { 2 * $_ } ( 0 .. $last )
    }

Using lexical variables
Declare fold_hash with parameters. Caller uses lexical vars.

    keyword tree_fold ( Ident $name, List $argz, Block $stack_op )
    {
        ...
    }

Boilerplate for lexical variables
Extract lexical variables. See also: PPI::Token

    my @varz    # ( '$foo', '$bar' )
    = map
    {
        $_->isa( 'PPI::Token::Symbol' )
        ? $_->{ content }
        : ()
    }
    map
    {
        $_->isa( 'PPI::Statement::Expression' )
        ? @{ $_->{ children } }
        : ()
    }
    @{ $argz->{ children } };

Boilerplate for lexical variables
Count & offset used to extract stack.

    my $lexical = join ',' => @varz;
    my $count   = @varz;
    my $offset  = $count - 1;

    sub $name
    {
        \@_ or return;

        my \$last
        = \@_ % $count
        ? int( \@_ / $count )
        : int( \@_ / $count ) - 1
        ;
        ...

Boilerplate for lexical variables
Interpolate variable names, count, stack offset.

        \@_
        = map
        {
            my ( $lexical ) = \@_[ \$_ .. \$_ + $offset ];

            do $stack_op
        }
        map { \$_ * $count } ( 0 .. \$last );

Chop shop
Not much body left:

    tree_fold reduce_hash( $left, $rite )
    {
        $rite ? sha256 $left, $rite : $left
    }

FP uses constants
Replace lexical vars with values.
Constant module uses ties. Slow, only one tie per variable.
Const::Fast handles scalar, array, hash. Uses SV magic struct. Conflicts with bless (magic gets set too early).
Data::Lock sets the magic later. Co-exists with bless.

Fast perly constants
lvalue wrapper delays dlock. Plays nice with bless.

    sub const : lvalue
    {
        dlock $_[0];
        $_[0]
    }

Putting the fun into Perl5
Cannot modify $last, $_, return values.

    tree_fold reduce_hash
    {
        const my $last = @_ / 2 + @_ % 2 - 1;

        map
        {
            $_[ $_ + 1 ]
            ? const sha256 @_[ $_, $_ + 1 ]
            : const $_[ $_ ]
        }
        map { const 2 * $_ } ( 0 .. $last )
    }

Consume the buffer.
Prior to 5.20: Pass large string by ref. Example: File::Slurp can return a ref-to-scalar, e.g.,

    my $size = length $$buffer;

That or use $_[$i] directly to avoid copies.
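A short sketch of both options, using hypothetical sub names. Copying is hard to observe directly, so the third sub shows the aliasing instead: writing to `$_[0]` changes the caller's variable, which is only possible because `@_` holds the caller's scalar itself, not a copy.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.

# Takes a reference: only the reference is copied, not the string.
sub size_by_ref { my $buffer = shift; length $$buffer }

# Reads $_[0] in place: @_ aliases the caller's scalar.
sub size_by_alias { length $_[0] }

# Demonstrates the aliasing: this empties the caller's variable.
sub clobber { $_[0] = '' }

my $data = 'x' x 1024;

print size_by_ref( \$data ), "\n";     # 1024
print size_by_alias( $data ), "\n";    # 1024

clobber( $data );
print length( $data ), "\n";           # 0
```

The aliasing cuts both ways: it avoids the copy, but any sub that writes through `$_[0]` or a reference is modifying the caller's data.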

Magic constants
Const for variable, subroutine results.

    use Glacier::Util::Const qw( const );

    keyword value( Var $var ) {{{ const my <{$var}> }}}
    keyword value {{{ const }}}

value keyword
Keywords can be nested:

    tree_fold reduce_hash
    {
        value $last = @_ / 2 + @_ % 2 - 1;

        map
        {
            $_[ $_ + 1 ]
            ? value sha256 @_[ $_, $_ + 1 ]
            : value $_[ $_ ]
        }
        map { value 2 * $_ } ( 0 .. $last )
    }

value keyword
Or just:

    tree_fold reduce_hash( $left, $rite )
    {
        value $rite ? sha256 $left, $rite : $left
    }

Avoid a copy: $$buffer
Map input to chunks. Reduce them. Done.

    sub buffer_hash
    {
        state $mib  = 2 ** 20;

        my $buffer  = shift;
        my $size    = length $$buffer;
        my $count   = int $size / $mib + !!( $size % $mib );

        reduce_hash
        map { sha256 substr $$buffer, 0, $mib, '' }
        ( 1 .. $count )
    }

Q: What is left after buffer_hash?
A: An empty buffer. The caller's buffer! Oops... caller needs a copy to upload! Result: No memory savings at all.
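The effect is easy to reproduce on a small string. This sketch uses a hypothetical `consume` sub and a 4-byte chunk size in place of 1 MiB; four-argument substr removes each chunk from the string it is given, and because that string is reached through a reference, it is the caller's data that shrinks.

```perl
use strict;
use warnings;

# Hypothetical helper: peel fixed-size chunks off the front of the
# referenced string, the way four-argument substr does on the slide.
sub consume {
    my $buffer = shift;
    my @chunks;
    push @chunks, substr( $$buffer, 0, 4, '' ) while length $$buffer;
    return @chunks;
}

my $data   = 'abcdefghij';
my @chunks = consume( \$data );

print scalar(@chunks), " chunks\n";            # 3 chunks
print length($data), " bytes left\n";          # 0 bytes left
```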

Easier in v5.20+
Pass buffer as-is. COW preserves caller's buffer. No extra copy.

    sub buffer_hash
    {
        state $mib  = 2 ** 20;

        my $buffer  = shift;
        my $size    = length $buffer;
        my $count   = int $size / $mib + !!( $size % $mib );

        reduce_hash
        map { sha256 substr $buffer, 0, $mib, '' }
        ( 1 .. $count )
    }

Now try it with FP
Consuming $buffer breaks the rules. Find a way to extract the pieces.

Buffer hash with only copies
unpack is the fastest way. Less code. One copy of buffer.

    sub buffer_hash
    {
        state $format = '(a' . 2**20 . ')*';

        const my $buffer = shift;

        reduce_hash
        map { const sha256 $_ }
        unpack $format => $buffer
    }
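How the format string splits the buffer is easier to see at a small scale. The sketch below uses a hypothetical 4-byte chunk size where the slide uses 2**20; the grouping `( ... )*` repeats the `a4` field until the input runs out, and the final field takes whatever bytes remain.

```perl
use strict;
use warnings;

# Hypothetical 4-byte chunk size so the result is easy to read;
# the slide's format has 2**20 in the same position.
my $format = '(a4)*';

my $buffer = 'abcdefghij';
my @chunks = unpack $format => $buffer;

# The final chunk holds the leftover bytes.
print join( '|', @chunks ), "\n";    # abcd|efgh|ij

# The input is untouched, unlike with four-argument substr.
print "$buffer\n";                   # abcdefghij
```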

Public interface
Multi-part: hash hashes. Single-part: hash the buffer. Hex output is useful for testing.

    sub tree_hash
    {
        @_ > 1
        ? &reduce_hash
        : &buffer_hash
    }

    sub tree_hash_hex
    {
        unpack 'H*', &tree_hash
    }

Buffer Size vs. Usr Time
Explicit map, keyword with and without lexicals. 8-32MiB are good chunk sizes.

     MiB  Explicit  Implicit  Keyword
       1      0.02      0.01     0.02
       2      0.03      0.03     0.04
       4      0.07      0.07     0.07
       8      0.14      0.13     0.10
      16      0.19      0.18     0.17
      32      0.31      0.30     0.26
      64      0.50      0.51     0.49
     128      1.00      1.02     1.01
     256      2.03      2.03     2.03
     512      4.05      4.10     4.06
    1024      8.10      8.10     8.11

Result: FP in Perl5
When FP works it is elegant.
Core Perl5 syntax helps: lvalue, __SUB__, COW strings.
Keywords: True Laziness ® at its best. Don't repeat boilerplate. Multimethods in Perl5.
