
Simple Aggregation Function Interface

An introduction to the new simple function interface for user-defined aggregation functions (UDAFs). The interface lets UDAF authors write less code, and write it row by row, with little to no performance degradation compared to the existing vector-based interface.

Wei He
Software Engineer at Meta

Ali LeClerc

April 05, 2024

Transcript

  1. How much code do you write for one UDAF?

    • Required
      ◦ accumulatorFixedWidthSize()
      ◦ initializeNewGroups()
      ◦ addRawInput()
      ◦ addIntermediateResults()
      ◦ addSingleGroupRawInput()
      ◦ addSingleGroupIntermediateResults()
      ◦ extractValues()
      ◦ extractAccumulators()
    • Optional
      ◦ accumulatorAlignmentSize()
      ◦ accumulatorUsesExternalMemory()
      ◦ isFixedSize()
      ◦ supportsToIntermediate()
      ◦ toIntermediate()
      ◦ destroy()
  2. How much code do you write for one UDAF?

    • Source lines of code

      Function             SLOC (excluding registration code)
      avg()                357
      array_agg()          211
      approx_distinct()    355
  3. How many UDAFs would you write?

    • Presto
      ◦ 91 distinct aggregation functions* (as of March 2024).
    • Spark
      ◦ 65 distinct aggregation functions* (as of March 2024).

    * https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions
    * https://prestodb.io/docs/current/functions/aggregate.html
  4. Simple UDAF Interface

    • Required
      ◦ AccumulatorType() constructor
      ◦ addInput()
      ◦ combine()
      ◦ writeFinalResult()
      ◦ writeIntermediateResult()
    • Optional
      ◦ toIntermediate()
      ◦ destroy()

    All methods are ROW-based.
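To make the row-based shape of these methods concrete, here is a minimal, self-contained sketch of an avg()-style accumulator. It is not Velox code: `AvgAccumulator`, `Intermediate`, and `averageInTwoStages` are hypothetical names, plain standard-library types stand in for Velox's argument and output types, and the allocator parameter is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Stand-in for an intermediate state: (sum, count). NOT a Velox type.
using Intermediate = std::pair<double, int64_t>;

// Hypothetical row-based accumulator that mirrors the method names on the slide.
struct AvgAccumulator {
  double sum{0};
  int64_t count{0};

  // Consume one raw input row.
  void addInput(double value) {
    sum += value;
    count += 1;
  }

  // Merge one intermediate state produced by another accumulator.
  void combine(const Intermediate& other) {
    assert(other.second >= 0); // a row count can never be negative
    sum += other.first;
    count += other.second;
  }

  // Emit the partial state of this group.
  Intermediate writeIntermediateResult() const {
    return {sum, count};
  }

  // Emit the final value; a group without rows yields null.
  std::optional<double> writeFinalResult() const {
    if (count == 0) {
      return std::nullopt;
    }
    return sum / static_cast<double>(count);
  }
};

// Hypothetical driver: partial aggregation over two batches, then a final merge.
inline std::optional<double> averageInTwoStages(
    const std::vector<double>& batchA,
    const std::vector<double>& batchB) {
  AvgAccumulator partialA;
  AvgAccumulator partialB;
  for (double v : batchA) {
    partialA.addInput(v);
  }
  for (double v : batchB) {
    partialB.addInput(v);
  }
  AvgAccumulator merged;
  merged.combine(partialA.writeIntermediateResult());
  merged.combine(partialB.writeIntermediateResult());
  return merged.writeFinalResult();
}
```

Each method touches a single row or a single intermediate state. The loops over batches live in the driver, which in this sketch plays the role of the framework.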
  5. Simple UDAF Interface

    A UDAF author defines a class with the input and output types.

    Example of array_agg: [code shown on slide]
  6. Simple UDAF Interface

    • static constexpr bool default_null_behavior_ = true
      ◦ Most aggregate functions are assumed to have default null behavior: they ignore rows that have null values in raw input and intermediate states, and return null for groups with no input rows or only null rows. E.g.,
        SELECT sum(c0) FROM (values (NULL), (NULL)) AS t(c0); -- NULL

    Example of array_agg: [code shown on slide]
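The default null behavior described above can be sketched outside of Velox as follows. `SumAccumulator` and `sumWithDefaultNullBehavior` are hypothetical names, and `std::optional` stands in for nullable values. The driver function plays the part of the framework: it filters nulls before the accumulator sees them and decides whether the group result is null.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical accumulator. With default null behavior it only ever sees
// non-null values, so it needs no null handling of its own.
struct SumAccumulator {
  int64_t sum{0};

  void addInput(int64_t value) {
    sum += value;
  }

  int64_t writeFinalResult() const {
    return sum;
  }
};

// Hypothetical stand-in for the framework side of default null behavior.
inline std::optional<int64_t> sumWithDefaultNullBehavior(
    const std::vector<std::optional<int64_t>>& rows) {
  SumAccumulator acc;
  bool sawNonNull = false;
  for (const auto& row : rows) {
    if (!row.has_value()) {
      continue; // null rows are ignored
    }
    acc.addInput(*row);
    sawNonNull = true;
  }
  if (!sawNonNull) {
    return std::nullopt; // no rows, or only null rows: the group is null
  }
  return acc.writeFinalResult();
}
```

Calling `sumWithDefaultNullBehavior` with two null rows returns an empty optional, which matches the `sum(c0)` query above.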
  7. struct AccumulatorType

    The UDAF author defines a struct called AccumulatorType inside the UDAF class with the following members.

    • static constexpr bool is_fixed_size_ = true
    • static constexpr bool use_external_memory_ = false
    • static constexpr bool is_aligned_ = false
    • AccumulatorType(HashStringAllocator* allocator)

    Example of array_agg: [code shown on slide]
  8. struct AccumulatorType

    An optional AccumulatorType::destroy() can be called when the accumulator object is destroyed.

    • void destroy(HashStringAllocator* allocator)

    Example of array_agg: [code shown on slide]
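As an illustration of why destroy() takes the allocator, the sketch below uses a made-up `MockAllocator` in place of HashStringAllocator (whose real API is not shown in the slides and is not reproduced here). The hypothetical `BufferAccumulator` takes memory from the allocator while adding rows and hands it back in `destroy()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for HashStringAllocator; it is NOT the Velox class.
struct MockAllocator {
  std::size_t liveBytes{0};

  void* allocate(std::size_t bytes) {
    void* p = std::malloc(bytes);
    assert(p != nullptr);
    liveBytes += bytes;
    return p;
  }

  void release(void* p, std::size_t bytes) {
    std::free(p);
    liveBytes -= bytes;
  }
};

// Hypothetical variable-width accumulator that owns allocator memory.
struct BufferAccumulator {
  int* values{nullptr};
  std::size_t size{0};
  std::size_t capacity{0};

  explicit BufferAccumulator(MockAllocator* /*allocator*/) {}

  // Append one row, doubling the buffer when it is full.
  void addInput(MockAllocator* allocator, int value) {
    if (size == capacity) {
      const std::size_t newCapacity = capacity == 0 ? 4 : capacity * 2;
      int* grown =
          static_cast<int*>(allocator->allocate(newCapacity * sizeof(int)));
      if (size > 0) {
        std::memcpy(grown, values, size * sizeof(int));
        allocator->release(values, capacity * sizeof(int));
      }
      values = grown;
      capacity = newCapacity;
    }
    values[size++] = value;
  }

  // Return everything this accumulator took from the allocator.
  void destroy(MockAllocator* allocator) {
    if (values != nullptr) {
      allocator->release(values, capacity * sizeof(int));
      values = nullptr;
      size = 0;
      capacity = 0;
    }
  }
};
```

In this sketch the accumulator keeps no pointer to the allocator, so whoever destroys it has to pass the allocator in again. That is one plausible reason for the `destroy(HashStringAllocator*)` signature, but it is an inference, not something the slides state.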
  9. struct AccumulatorType

    Non-default-null behavior:

    • bool addInput(HashStringAllocator* allocator, exec::optional_arg_type<T1> data, …)
    • bool combine(HashStringAllocator* allocator, exec::optional_arg_type<IntermediateType> other)
    • bool writeIntermediateResult(bool nonNullGroup, exec::out_type<IntermediateType>& out)
    • bool writeFinalResult(bool nonNullGroup, exec::out_type<OutputType>& out)
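The following sketch shows how the non-default-null signatures might be used by an array_agg-like function that keeps null elements. It rests on assumptions the slides do not spell out: the bool returned by `addInput()` is read as "the group is non-null after this row", and the bool returned by `writeFinalResult()` as "a non-null value was written to out". `std::optional` and `std::vector` are stand-ins for `exec::optional_arg_type` and `exec::out_type`, the allocator parameter is omitted, and all names are hypothetical.

```cpp
#include <cassert>
#include <optional>
#include <vector>

using Element = std::optional<int>;       // stand-in for a nullable input
using ArrayResult = std::vector<Element>; // stand-in for the output writer

// Hypothetical accumulator that opts out of default null behavior.
struct NullKeepingArrayAgg {
  ArrayResult elements;

  // Receives null rows too. Returning true is ASSUMED to mark the group
  // as non-null.
  bool addInput(Element data) {
    elements.push_back(data);
    return true;
  }

  // Writes only for a non-null group. The return value is ASSUMED to mean
  // "out holds a non-null result".
  bool writeFinalResult(bool nonNullGroup, ArrayResult& out) const {
    if (!nonNullGroup) {
      return false;
    }
    out = elements;
    return true;
  }
};

// Hypothetical driver that tracks the per-group non-null flag.
inline std::optional<ArrayResult> aggregateKeepingNulls(
    const std::vector<Element>& rows) {
  NullKeepingArrayAgg acc;
  bool nonNullGroup = false;
  for (const Element& row : rows) {
    nonNullGroup = acc.addInput(row) || nonNullGroup;
  }
  ArrayResult out;
  if (!acc.writeFinalResult(nonNullGroup, out)) {
    return std::nullopt;
  }
  return out;
}
```

Under these assumptions, a group with rows `1, NULL, 3` produces a three-element array with a null in the middle, while a group with no rows produces a null result.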
  10. Simple UDAF Interface

    • toIntermediate: An optional function that converts a raw input directly to an intermediate state, to be used in query plans that abandon the partial aggregation step.

    Example of array_agg: [code shown on slide]
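For an avg()-style function, the idea behind toIntermediate can be sketched as below: a single raw value maps to the intermediate state (value, 1), which the final step can then merge like any other partial result. `AvgState`, `combine`, and `averageWithoutPartialStep` are hypothetical names, and the signature is simplified relative to whatever the real interface uses.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical intermediate state for avg(): running sum and row count.
struct AvgState {
  double sum;
  int64_t count;
};

// Convert one raw input row directly into an intermediate state,
// skipping the accumulator entirely.
inline AvgState toIntermediate(double value) {
  return AvgState{value, 1};
}

// Merge an intermediate state into the target state.
inline void combine(AvgState& target, const AvgState& other) {
  target.sum += other.sum;
  target.count += other.count;
}

// Hypothetical final step when partial aggregation was abandoned:
// every raw row arrives already converted by toIntermediate().
inline double averageWithoutPartialStep(const std::vector<double>& rows) {
  AvgState merged{0.0, 0};
  for (double v : rows) {
    combine(merged, toIntermediate(v));
  }
  assert(merged.count > 0); // this sketch does not model the empty group
  return merged.sum / static_cast<double>(merged.count);
}
```

The final step only ever calls `combine()`, so it does not need to know whether its input came from a real partial aggregation or from `toIntermediate()`.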
  11. Registration

    Simple UDAFs currently use the same registration interface as existing ones. The factory creates a simple UDAF instance as follows: [code shown on slide]

    Previously: std::make_unique<ArrayAggAggregate>(resultType)
  12. Limitations

    • Doesn’t allow function-level states outside of accumulators.
      ◦ Work in progress: #8711, #9167.
    • Doesn’t support optimizations on constant inputs. E.g., processing constant input once per input batch.
    • Doesn’t support pushdown.
  13. Simple UDAF Interface

    Reduces the amount and complexity of code that UDAF authors need to write.

    Does NOT sacrifice performance.
  14. Acknowledgement

    Thanks @laithsakka for helping with the design of the simple UDAF interface.

    Thanks to early adopters of the simple UDAF interface:
    - @liujiayi771: decimal sum, spark collect_list
    - @xumingming: geometric_mean, bitwise_xor
    - @ericyuliu: fb_weighted_avg
    - @mbasmanova: fb_approx_most_frequent

    Special thanks to @liujiayi771 for extending the simple UDAF interface to allow function-level states.