
ClickHouse Aggregate Functions: Design & Implementation Architecture

Comprehensive coverage of design patterns, architectural reference implementations, and best practices for developing custom aggregate functions in ClickHouse. This technical guide provides actionable insights for database engineers and contributors working on high-performance analytical systems.

Shiv Iyer

January 07, 2026


Transcript

  1. TECHNICAL DEEP DIVE ClickHouse Aggregate Functions: Design & Implementation Architecture

    A Complete Guide to Building High-Performance Custom Aggregates Comprehensive coverage of design patterns, architectural reference implementations, and best practices for developing custom aggregate functions in ClickHouse. This technical guide provides actionable insights for database engineers and contributors working on high-performance analytical systems. Prepared by: ChistaDATA Platform Engineering Team Reviewed and approved by: Shiv Iyer, Founder and CEO - ChistaDATA Inc. Copyright © 2016–2026. All Rights Reserved by ChistaDATA®.
  2. Why Aggregate Functions Matter Core to Analytics Aggregation functions are

    fundamental to analytical queries, enabling operations such as SUM, AVG, COUNT, and complex GROUP BY computations that drive business intelligence and data analysis workflows. Massive Dataset Processing ClickHouse processes massive datasets efficiently through optimized aggregate implementations that leverage vectorization, SIMD operations, and intelligent memory management to maintain sub-second query response times. Custom Analytics Capabilities Custom aggregate functions unlock domain-specific analytical capabilities beyond built-in functionalities, enabling specialized calculations essential for ML feature engineering, financial metrics, and industry-specific Key Performance Indicators (KPIs). Production System Impact Well-designed aggregates deliver measurable improvements: efficient memory utilization, linear scaling with thread count, optimal performance at petabyte scale, and seamless distributed processing across cluster nodes.
  3. Problem Statement: Challenges in Custom Aggregate Development Development Gaps Growing

    demand for custom aggregate functions beyond ClickHouse's 100+ built-in options. Lack of consolidated design documentation and architectural patterns. No centralized reference for exemplary implementations. Steep learning curve for understanding internal APIs. Implementation Pitfalls Memory management complexity with Arena allocators. Alignment requirements for cache efficiency and SIMD. Serialization correctness for distributed aggregation. Performance optimization requiring iterative refinement. Thread safety and lock-free design patterns. The consequences of poorly designed aggregates include memory leaks, incorrect distributed results, significant performance degradation (10-100x), and production incidents that are challenging to debug.
  4. What You'll Learn 01 Architectural Patterns Understand the recommended design

    patterns for aggregate functions, including the three-phase aggregation model, state management strategies, and the IAggregateFunctionDataHelper inheritance pattern. 02 Reference Implementations Examine exemplary implementations within the ClickHouse codebase, including sum(), avg(), uniqExact(), and quantiles(), with detailed analysis of design decisions and trade-offs. 03 Implementation Practices Learn specific coding techniques for the add(), merge(), serialize(), and deserialize() methods, including NULL handling, type templating, and result finalization patterns. 04 Memory Management Comprehend Arena-based allocation strategies, cache line alignment for performance, serialization requirements for distributed queries, and managing variable-length state data. 05 Performance Optimization Explore SIMD vectorization opportunities, multi-target code generation, hash table selection strategies, thread-safe patterns, and profiling techniques for identifying bottlenecks.
  5. Presentation Roadmap Architecture & Design Patterns Slides 6-15: Aggregation pipeline,

    three-phase model, memory layout, Arena management, state versus result types, and combinator patterns. Core Implementation Patterns Slides 16-25: Complete interface methods, additive and merge phases, serialization, Arena usage, template specialization, NULL handling, and registration. Gold Standard References Slides 26-35: Analysis of sum(), count(), avg(), min/max(), uniqExact(), quantiles(), groupArray(), source navigation, and adaptation strategies. Performance Optimizations Slides 36-45: Cache alignment, SIMD operations, multi-target code, hash table selection, vectorized blocks, thread safety, and profiling. Case Studies & Best Practices Slides 46-50: Real-world implementations, common pitfalls, testing workflows, development process, and key takeaways.
  6. ARCHITECTURE ClickHouse Aggregation Pipeline Overview Query Processing Layer SQL parsing,

    optimization, and execution planning distribute the aggregation workload across available cores and cluster nodes. Vectorized Execution Block-based processing operates on chunks of 8,192 rows (granules) for optimized cache utilization and SIMD capabilities. Storage Layer Column-oriented storage with efficient compression enables fast data scanning and minimal I/O for aggregation queries. Parallel Execution Multi-threaded aggregation across blocks and shards with per-thread intermediate states, followed by a final merge phase for results. Aggregate functions integrate seamlessly into this pipeline, operating on columnar data blocks with reduced overhead. The vectorized execution model processes thousands of rows per function call, amortizing the overhead of virtual function calls and enabling compiler optimizations.
  7. The Three Phases of Aggregation Additive Phase Accumulate values from

    input blocks into intermediate state structures. The add() method is called for each row, updating state with new data. This phase is highly parallelizable—each thread maintains its own state. Merge Phase Combine partial aggregation states from multiple threads, partitions, or distributed nodes. The merge() method takes two states and produces a unified result. This operation must be associative and commutative for correctness. Finalization Phase Convert the aggregate state to its final result value. The insertResultInto() method performs any necessary calculations (e.g., dividing sum by count for an average) and writes the result to the output column. State transitions through these phases: Raw data → Additive → Merge → Finalize → Result. Understanding this flow is critical for distributed aggregation, where merge operations occur across network boundaries.
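
    A minimal, self-contained sketch of the three phases using a hypothetical AvgState (illustrative only; the real ClickHouse interface, with place pointers and Arena parameters, appears on slide 16):

      #include <cstdint>
      #include <iostream>

      // Hypothetical state for avg(); illustrates the three phases only.
      struct AvgState
      {
          uint64_t sum = 0;
          uint64_t count = 0;

          // Additive phase: one call per input row.
          void add(uint64_t value) { sum += value; ++count; }

          // Merge phase: combine a partial state from another thread or node.
          void merge(const AvgState & other) { sum += other.sum; count += other.count; }

          // Finalization phase: convert state to the user-visible result.
          double finalize() const { return count ? double(sum) / count : 0.0; }
      };

      int main()
      {
          // Two threads (or nodes) each accumulate a partial state...
          AvgState a, b;
          for (uint64_t v : {1, 2, 3}) a.add(v);
          for (uint64_t v : {4, 5}) b.add(v);

          // ...then partial states are merged and finalized exactly once.
          a.merge(b);
          std::cout << a.finalize() << "\n"; // 3 = (1+2+3+4+5)/5
      }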
  8. Aggregate Function Data Structure Pattern State Structure Design The state

    structure holds intermediate computation data throughout the aggregation lifecycle. Key design principles include: Compactness: Minimize size for cache efficiency. Serializability: Facilitate distributed aggregation. Alignment: Ensure compatibility with cache line boundaries. Completeness: Store all necessary data for the merge operation. Example structure:

      struct SumData
      {
          UInt64 sum;    // Primary accumulator
          UInt64 count;  // For average calculation
          Arena * arena; // For variable-length data

          void add(UInt64 value) { sum += value; count++; }

          void merge(const SumData & other)
          {
              sum += other.sum;
              count += other.count;
          }
      };

    Memory layout is critical for performance. Memory efficiency is paramount, particularly when processing billions of aggregation groups. Optimal cache utilization can result in a two-to-three-fold throughput improvement.
  9. The IAggregateFunctionDataHelper Base Class Primary Inheritance Pattern All aggregate functions

    must inherit from IAggregateFunctionDataHelper<Data, Derived>, which provides type-safe access to state data and standardized memory management. Template Parameters The class is templated on two types: Data (the state structure) and Derived (the aggregate function class). This enables compile-time polymorphism, eliminating virtual call overhead. State Access Pattern State is accessed via this->data(place), where place is an opaque pointer to memory. The helper manages placement new construction and ensures proper alignment. Benefits of This Pattern This pattern guarantees consistency across all aggregates, provides safety through type checking, ensures correct alignment for SIMD operations, and simplifies memory management with automatic cleanup.

      template <typename Data, typename Derived>
      class MyAggregate : public IAggregateFunctionDataHelper<Data, Derived>
      {
          void add(AggregateDataPtr place, ...)
          {
              this->data(place).sum += value; // Type-safe access to state
          }
      };
  10. Memory Layout & Alignment Strategy 64 Bytes per Cache Line

    Modern CPUs fetch data in 64-byte cache lines. Aligning aggregate states to this boundary prevents false sharing and improves throughput. 2-3x Performance Improvement Proper alignment can deliver 2-3x throughput gains in high-concurrency aggregation workloads by eliminating cache line contention between threads. 8192 Typical Granule Size ClickHouse processes data in blocks of 8,192 rows, creating thousands of aggregate states that must fit efficiently within the CPU cache hierarchy. Aggregate states are packed into contiguous Arena memory with careful alignment. The allocation pattern utilizes alignedAlloc(total_size, align_aggregate_states), where align_aggregate_states is set to 64 bytes—the CPU cache line boundary. Memory layout visualization: Each state aligns to 64-byte boundaries, preventing multiple threads from modifying adjacent states within the same cache line (known as false sharing). This precise alignment strategy significantly improves parallel aggregation performance.
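
    In portable C++ the same guarantee can be expressed with alignas; a minimal sketch (ClickHouse achieves this through the Arena's aligned allocation rather than a struct attribute):

      #include <cstdint>

      // Pad and align a per-thread state to one cache line so that adjacent
      // states never share a line (prevents false sharing).
      struct alignas(64) AlignedState
      {
          uint64_t sum = 0;
          uint64_t count = 0;
          // The remaining bytes of the 64-byte line are implicit padding.
      };

      static_assert(alignof(AlignedState) == 64, "state must start on a cache line");
      static_assert(sizeof(AlignedState) == 64, "state must fill one cache line");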
  11. Arena-Based Memory Management Why Arena Over malloc/new? Centralized tracking: Enables

    monitoring of query memory limits. Batch deallocation: Eliminates the need for individual deallocations. Spill-to-disk: Facilitates graceful degradation when memory limits are exceeded. Allocation performance: Reduces overhead for millions of states. Arena Lifecycle The Arena pool is initialized at the query's inception and is intrinsically linked to the aggregation operation. All state allocations are provisioned from this Arena. Upon query completion, the entire Arena is deallocated in a single operation, negating the requirement for individual state destructors. Performance Benefits 90% Reduction in allocation overhead compared to malloc 100% Memory fully accounted for within query limits 95% Achieves reduced memory fragmentation over time Arena allocation is crucial for managing variable-length aggregate states, such as those generated by groupArray(), which necessitate storing dynamic amounts of data per group. The Arena efficiently manages all allocations without requiring explicit programmer intervention.
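
    The pattern can be illustrated with a toy bump allocator (a sketch of the idea, not ClickHouse's Arena class): allocations are bumped within large chunks, there is no per-allocation free, and everything is released at once when the arena is destroyed.

      #include <cstddef>
      #include <cstring>
      #include <memory>
      #include <vector>

      // Toy arena: assumes each allocation fits within one chunk.
      class ToyArena
      {
      public:
          char * alloc(size_t size)
          {
              if (chunks_.empty() || used_ + size > kChunkSize)
              {
                  chunks_.emplace_back(new char[kChunkSize]);
                  used_ = 0;
              }
              char * ptr = chunks_.back().get() + used_;
              used_ += size;
              return ptr;
          }

          char * insert(const char * data, size_t size)
          {
              char * ptr = alloc(size);
              std::memcpy(ptr, data, size);
              return ptr;
          }

          // No per-allocation free: all chunks are released together when
          // the ToyArena goes out of scope.
      private:
          static constexpr size_t kChunkSize = 1 << 20; // 1 MiB chunks
          std::vector<std::unique_ptr<char[]>> chunks_;
          size_t used_ = 0;
      };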
  12. Hash Table Structure for GROUP BY During GROUP BY processing,

    ClickHouse maintains a hash table that maps group keys to aggregate state pointers. This intermediate storage accumulates partial results before the final merge phase. Hash Table Type Selection ClickHouse selects the hash table implementation based on factors such as the data types being aggregated, expected cardinality, data distribution characteristics, available memory, and thread count. State Pointer Storage Hash table values contain pointers to aggregate states allocated within the Arena. The hash table itself remains lightweight, storing only keys and pointers, while the more extensive state data resides in Arena memory. Rebuild Strategy During merge operations, hash tables may be rebuilt to combine states from different threads or nodes. This rebuild process involves iterating through one table, looking up corresponding keys in another, and merging their respective states. Hash table variants include: HashTable for low cardinality, TwoLevelHashTable for high cardinality with improved memory locality, and specialized versions designed for specific key types (e.g., integers, strings, or fixed-size keys).
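
    A toy version of this bookkeeping (hypothetical sketch; a std::deque stands in for the Arena-resident states so that the hash table holds only pointers):

      #include <cstdint>
      #include <deque>
      #include <iostream>
      #include <string>
      #include <unordered_map>

      struct SumState { uint64_t sum = 0; };

      int main()
      {
          std::deque<SumState> states;                        // stand-in for Arena storage
          std::unordered_map<std::string, SumState *> groups; // key -> state pointer only

          auto add_row = [&](const std::string & key, uint64_t value)
          {
              auto [it, inserted] = groups.try_emplace(key, nullptr);
              if (inserted)
                  it->second = &states.emplace_back(); // first row for this key
              it->second->sum += value;
          };

          add_row("us", 10);
          add_row("eu", 5);
          add_row("us", 7);
          for (auto & [key, state] : groups)
              std::cout << key << ": " << state->sum << "\n"; // us: 17, eu: 5
      }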
  13. State vs. Result Type Distinction

    State Type: Intermediate Representation Stores raw computation data and algorithmic state. Must be serializable for network transmission. Can have a larger or different structure than the final result. Optimized for efficient merging, not for end-user consumption.

      // State for average
      struct AvgState
      {
          UInt64 sum;
          UInt64 count;
      };

    Result Type: Final Computed Value The value returned to the user in query results. Often simpler than the state (e.g., a single value versus a complex struct). May require calculation from the state (e.g., sum / count). The type is visible in the SQL query result schema.

      // Result for average (cast first to avoid integer division)
      Float64 result = static_cast<Float64>(state.sum) / state.count;

    This distinction enables distributed aggregation and materialized views. The state can be serialized, transmitted across a network, merged on different nodes, and finally converted to a result. Without this separation, distributed aggregation would be infeasible.
  14. Combinator Pattern Overview State Combinator The modifier suffix -State returns

    the aggregate state instead of the final result. For example, sumState() returns the intermediate sum state, not the computed sum. This is used for incremental aggregation in materialized views. Merge Combinator The modifier suffix -Merge takes aggregate states as input and produces the final result. For example, sumMerge(state_column) finalizes previously stored states, completing the two-phase aggregation workflow. MergeState Combinator The modifier suffix -MergeState merges multiple states into a single state without finalization. For instance, sumMergeState() combines states for further processing, enabling multi-stage aggregation pipelines. SimpleState Combinator This is a simplified variant offering improved performance for compatible functions. It employs a simpler serialization format and optimized merge logic. This combinator is applicable only when the state type matches the result type or when conversion is trivial. Combinators facilitate flexible aggregation workflows, enabling capabilities such as pre-computing states in materialized views, transmitting states across networks, combining states from different time periods, and constructing complex multi-stage analytical pipelines.
  15. Architectural Decision Matrix

    Characteristic    | Simple Aggregates      | Complex Aggregates       | Memory Management    | Performance Optimization
    State Structure   | Single value           | Multi-field structure    | Light (<64B)         | SIMD-ready
    Example Functions | sum(), count()         | uniqExact(), quantiles() | Medium (64B-1KB)     | Parallelizable
    Add Complexity    | Single operation       | Algorithm state update   | Heavy (>1KB)         | Sequential only
    Merge Logic       | Arithmetic combination | Data structure merge     | Variable-length      | Vectorizable
    Serialization     | Binary copy            | Custom format            | Arena-dependent      | Scalar fallback
    Typical Use Case  | Standard metrics       | Statistical analysis     | Full dataset storage | High-throughput scenarios

    This matrix assists in selecting the appropriate architectural approach for aggregate functions. Simple aggregates, such as sum and min, typically require minimal state and are well-suited for SIMD optimizations. Complex aggregates, like quantiles and uniqExact, necessitate sophisticated state management and custom serialization techniques. Aggregates with significant memory footprints, such as groupArray, demand careful management of Arena allocations. The potential for optimization largely depends on the algorithm's characteristics; arithmetic operations are often easily vectorized, while complex algorithmic logic may not be.
  16. IMPLEMENTATION The Complete Aggregate Function Interface Every aggregate function must

    implement six core methods that define behavior across the three aggregation phases. These methods establish the contract between your function and ClickHouse's execution engine.

      class IAggregateFunctionDataHelper<Data, Derived>
      {
          // Additive phase: process a single input row
          void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena * arena);

          // Merge phase: combine two aggregate states
          void merge(AggregateDataPtr place, ConstAggregateDataPtr rhs, Arena * arena);

          // Serialization: convert aggregate state to bytes
          void serialize(ConstAggregateDataPtr place, WriteBuffer & buf);

          // Deserialization: reconstruct aggregate state from bytes
          void deserialize(AggregateDataPtr place, ReadBuffer & buf, Arena * arena);

          // Finalization: convert aggregate state to the final result
          void insertResultInto(AggregateDataPtr place, IColumn & column);

          // Alternative finalization: return the aggregate state instead of the final result
          void insertMergeResultInto(AggregateDataPtr place, IColumn & column);
      };

    The order of implementation is crucial for comprehensive correctness testing: Begin by implementing add() and insertResultInto() for fundamental functionality. Next, implement merge() to enable multi-threaded support. Finally, implement serialize() and deserialize() for distributed query execution.
  17. Implementing the add() Method Additive Phase Processing The add() method

    is invoked for each input row, accumulating data into the aggregate state. This constitutes a performance-critical code path, executed with extremely high frequency in production queries, making optimal performance essential. Key Responsibilities Extract values from the input column at the specified row. Update the aggregate state structure with new data. Manage NULL values appropriately, typically by skipping them. Utilize the Arena for variable-length memory allocations. Maintain numerical stability, especially for floating-point operations. Access the state via this->data(place), which returns a typed reference to your state structure. The place pointer is managed by ClickHouse; direct dereferencing of it is not permitted. Sum aggregate example:

      void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena * arena)
      {
          // Access state
          auto & state = this->data(place);

          // Extract value from column
          UInt64 value = columns[0]->getUInt(row_num);

          // Update state
          state.sum += value;
          state.count++;
      }

    Performance consideration: Minimize branches within the add() method. Modern CPUs achieve high instruction throughput but experience performance degradation with unpredictable branches. Employ branchless techniques where feasible.
  18. Implementing the merge() Method The merge() method combines two partial

    aggregate states into one. This operation must be associative and commutative to ensure correctness in distributed and multi-threaded scenarios where merge order is non-deterministic.

      void merge(AggregateDataPtr place, ConstAggregateDataPtr rhs, Arena * arena)
      {
          // Access destination and source states
          auto & dst = this->data(place);
          const auto & src = this->data(rhs);

          // Combine states
          dst.sum += src.sum;
          dst.count += src.count;
      }

    When Merge Is Called Combining per-thread states in parallel aggregation. Merging results from distributed query nodes. Processing materialized view states. Two-level aggregation hash table rebuilds. Critical Properties Associativity: merge(merge(A,B),C) = merge(A,merge(B,C)). Commutativity: merge(A,B) = merge(B,A). These properties ensure identical results regardless of merge order, so thorough testing with different orderings is recommended:

      // Test commutativity (pseudocode: merge written as a pure function)
      auto result1 = merge(stateA, stateB);
      auto result2 = merge(stateB, stateA);
      assert(result1 == result2);

      // Test associativity
      auto result3 = merge(merge(A, B), C);
      auto result4 = merge(A, merge(B, C));
      assert(result3 == result4);

    Complex aggregates, such as uniqExact(), merge hash sets using union operations. Variable-length data requires Arena allocation during the merge process to store combined results.
  19. Serialization Pattern for Distribution & Storage `serialize()` Method Converts an

    aggregate state into a byte stream suitable for network transmission or disk storage. The output must be deterministic; identical states must consistently serialize to identical byte sequences. Implementation should utilize `WriteBuffer` methods. `deserialize()` Method Reconstructs an aggregate state from a byte stream. This operation must be the inverse of `serialize()`, ensuring that round-trip conversion accurately preserves the original state. Reading should be performed using `ReadBuffer` methods, and `Arena` should be used for variable-length allocations. Critical Use Cases Distributed query processing (for inter-node transmission). Spill-to-disk operations when memory limits are exceeded. Materialized views incorporating `AggregateFunction` columns. Query result caching mechanisms. Serialization example:

      void serialize(ConstAggregateDataPtr place, WriteBuffer & buf) const
      {
          const auto & state = this->data(place);
          writeBinary(state.sum, buf);
          writeBinary(state.count, buf);
      }

    Deserialization example:

      void deserialize(AggregateDataPtr place, ReadBuffer & buf, Arena * arena) const
      {
          auto & state = this->data(place);
          readBinary(state.sum, buf);
          readBinary(state.count, buf);
      }
  20. Memory Arena Usage in Implementation The Arena parameter passed to

    add(), merge(), and deserialize() plays a crucial role in managing variable-length aggregate state data. The Arena facilitates efficient batch allocation, thereby eliminating individual malloc overhead. 1 When to Utilize Arena The Arena is required whenever your aggregate state must store variable-length data, such as arrays, strings, hash sets, or any data structure whose size is not determined at compile time. Fixed-size states do not require Arena. 2 Allocation Procedure To allocate memory, invoke arena->insert(data, size) or arena->alloc(size). The Arena retains ownership of this allocated memory, precluding the need for manual deallocation. Memory is automatically reclaimed upon destruction of the entire Arena at the conclusion of the query. 3 Example: groupArray

      void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena * arena)
      {
          auto & arr = this->data(place).array;
          StringRef value = columns[0]->getDataAt(row_num);

          // Allocate space within the Arena for this value
          char * ptr = arena->insert(value.data, value.size);
          arr.push_back(StringRef(ptr, value.size));
      }

    Without the Arena, millions of individual malloc/free calls would lead to severe performance degradation and memory fragmentation. With the Arena, a single bulk allocation approach ensures zero fragmentation and automatic cleanup. This pattern is fundamental to ClickHouse's memory efficiency.
  21. insertResultInto() vs insertMergeResultInto()

    insertResultInto() - Result Type Called during the finalization phase to convert aggregate state into the final result value visible to users. This function may perform calculations, such as dividing a sum by a count for an average. Characteristics Returns the result type (e.g., Float64 for average). Performs final computation from state. Called once per group at the query's end. Used in standard SQL aggregation queries.

      void insertResultInto(AggregateDataPtr place, IColumn & to) const
      {
          const auto & state = this->data(place);
          Float64 result = static_cast<Float64>(state.sum) / state.count;
          to.insert(result);
      }

    insertMergeResultInto() - State Type Returns the aggregate state itself, rather than the computed result. This function is used with the State combinator for materialized views and incremental aggregation workflows. Characteristics Returns the state type (e.g., {sum, count}). No final computation performed. Enables further merges or storage. Used with -State and -MergeState combinators.

      void insertMergeResultInto(AggregateDataPtr place, IColumn & to) const
      {
          const auto & state = this->data(place);

          // Serialize state to column
          WriteBufferFromOwnString buf;
          serialize(place, buf);
          to.insertData(buf.str().data(), buf.str().size());
      }

    The key workflow difference is that insertResultInto() provides immediate finalization for end users, while insertMergeResultInto() enables delayed finalization for pre-aggregated materialized views and multi-stage pipelines.
  22. Template Specialization for Data Types ClickHouse aggregates leverage C++ templates

    to specialize implementation for various input data types. This approach facilitates compile-time optimization, eliminates virtual function overhead, and ensures type safety without the need for runtime checks. Template pattern (the type parameter T specializes the aggregation for types such as UInt32, UInt64, Float32, or Float64):

      template <typename T>
      class SumAggregate : public IAggregateFunctionDataHelper<SumData<T>, SumAggregate<T>>
      {
          void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena *)
          {
              this->data(place).sum += static_cast<const ColumnVector<T> *>(columns[0])->getData()[row_num];
          }
      };

    Benefits of Templating Compile-time Optimization: The compiler generates specialized machine code for each type, optimizing performance. Type Safety: Type mismatches are detected at compile time, preventing runtime errors. Zero Overhead: There are no virtual function calls or runtime type checks, ensuring maximum efficiency. SIMD-Friendly: The compiler can effectively vectorize type-specific operations, further enhancing performance. Registration pattern (a factory method selects the appropriate template instantiation based on the input column's data type during function registration):

      void registerAggregateFunctionSum(AggregateFunctionFactory & factory)
      {
          factory.registerFunction("sum", [](const DataTypes & argument_types)
          {
              if (isInteger(argument_types[0]))
                  return std::make_shared<SumAggregate<UInt64>>(argument_types);
              else if (isFloat(argument_types[0]))
                  return std::make_shared<SumAggregate<Float64>>(argument_types);
              // ... other type specializations
          });
      }
  23. Handling NULL Values & Edge Cases

    NULL Handling Pattern Most aggregate functions, by default, disregard NULL input values, meaning these values do not contribute to calculations such as sum, average, or count. Implementation typically involves checking isNullAt() prior to processing each row:

      void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena * arena)
      {
          if (!columns[0]->isNullAt(row_num))
          {
              auto & state = this->data(place);
              state.sum += columns[0]->getInt(row_num);
              state.count++;
          }
      }

    Special Cases Some aggregate functions require explicit NULL handling. For instance, count() includes NULLs in its total, countIf() treats a NULL condition as false, and any() may return NULL if only NULL values are encountered. Explicit documentation of NULL behavior is essential. Empty State Handling Initialize the aggregate state to a valid "empty" value. For example, 0 for sum, NULL for min/max of an empty group, and an empty array for groupArray. The default-constructed state must represent the identity element for its respective operation. Numerical Stability For floating-point aggregates, it is crucial to employ numerically stable algorithms, such as Welford's algorithm for variance, Kahan summation for sums, or compensated arithmetic for averages. Naive implementations can suffer from catastrophic cancellation, leading to inaccurate results. Edge case testing checklist: Includes all NULL inputs, all non-NULL inputs, mixed NULL and non-NULL, empty groups, single-row groups, overflow conditions for integer types, underflow for floating-point types, and proper handling of NaN and infinity values.
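
    As a concrete illustration of the compensated-summation point above, here is a minimal standalone sketch of Kahan summation (plain C++, not ClickHouse code; note that aggressive floating-point optimizations such as -ffast-math can eliminate the compensation):

      #include <cstdio>

      // Kahan (compensated) summation: the compensation term recovers the
      // low-order bits that a plain += would discard on each step.
      struct KahanSum
      {
          double sum = 0.0;
          double compensation = 0.0;

          void add(double value)
          {
              double y = value - compensation; // apply the previous correction
              double t = sum + y;              // low-order bits of y may be lost here
              compensation = (t - sum) - y;    // recover what was lost
              sum = t;
          }
      };

      int main()
      {
          KahanSum kahan;
          double naive = 0.0;
          for (long i = 0; i < 10'000'000; ++i)
          {
              kahan.add(0.1);
              naive += 0.1;
          }
          // The compensated sum stays much closer to the exact 1,000,000.
          std::printf("kahan=%.6f naive=%.6f\n", kahan.sum, naive);
      }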
  24. Aggregate Function Registration & Factory Custom aggregates must be registered

    with ClickHouse's AggregateFunctionFactory to be accessible in SQL queries. Registration defines the function name, parameter validation, type resolution, and instantiation logic. Registration pattern:

      void registerMyAggregate(AggregateFunctionFactory & factory)
      {
          factory.registerFunction(
              "myAgg",
              [](const std::string & name, const DataTypes & argument_types, const Array & parameters)
              {
                  // Validate parameters (exactly one argument)
                  assertUnary(name, argument_types);

                  // Select implementation by type
                  if (isInteger(argument_types[0]))
                      return std::make_shared<MyAgg<UInt64>>(argument_types);
                  else if (isFloat(argument_types[0]))
                      return std::make_shared<MyAgg<Float64>>(argument_types);

                  throw Exception("Unsupported type");
              });
      }

    Module initialization (called during ClickHouse server startup to populate the global function registry):

      void registerAggregateFunctions()
      {
          auto & factory = AggregateFunctionFactory::instance();
          registerMyAggregate(factory);
          // Register other functions...
      }

    Factory Features Function name and aliases: Specifies the primary name and alternative spellings for the aggregate function. Parameter validation: Ensures the correct number and types of arguments are provided to the function. Type overloading: Allows different implementations to be used based on the input data types. Combinator support: Automatically generates -State and -Merge variants for complex aggregate functions. Properties: Enables marking functions with attributes such as determinism or parallelizability.
  25. Testing & Validation Patterns 01 Unit Tests for Methods Test

    each interface method in isolation: `add()` with various input values, `merge()` with different state combinations, `serialize()` and `deserialize()` for round-trip correctness, and `insertResultInto()` for calculation accuracy. Mock input columns and validate state updates. 02 Integration Tests with SQL Execute SQL queries using your aggregate function, including simple `GROUP BY` with small datasets, complex queries with multiple aggregates, subqueries and nested aggregation, and joins with aggregation. Compare results against reference implementations or known-correct values. 03 Distributed Correctness Tests Verify that distributed aggregation produces identical results to single-node execution. This involves inserting data across multiple shards, running the aggregate query, and comparing the result to its single-node equivalent. Test with different merge orders to verify commutativity and associativity. 04 Performance Benchmarks Measure throughput and compare against built-in aggregates, assessing metrics such as rows per second processed, memory usage per state, the impact of parallelism (scaling with thread count), and serialization overhead. Utilize ClickHouse's built-in benchmarking tools or standalone microbenchmarks. 05 Edge Case Validation Test boundary conditions, including empty input groups, all `NULL` values, single-row groups, maximum cardinality stress tests, integer overflow scenarios, and floating-point special values (`NaN`, infinity). Ensure graceful handling without crashes or incorrect results.
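
    A minimal, self-contained sketch of the first test tier, using plain asserts on a simplified sum-like state (illustrative only; ClickHouse's own unit and functional test harnesses are the real workflow):

      #include <cassert>
      #include <sstream>

      struct SumState
      {
          long sum = 0;
          void add(long v) { sum += v; }
          void merge(const SumState & o) { sum += o.sum; }
          void serialize(std::ostream & out) const { out.write(reinterpret_cast<const char *>(&sum), sizeof(sum)); }
          void deserialize(std::istream & in) { in.read(reinterpret_cast<char *>(&sum), sizeof(sum)); }
      };

      int main()
      {
          // merge() must be commutative: A+B == B+A.
          SumState a, b;
          a.add(1); a.add(2); b.add(40);
          SumState ab = a; ab.merge(b);
          SumState ba = b; ba.merge(a);
          assert(ab.sum == ba.sum);

          // serialize()/deserialize() must round-trip exactly.
          std::stringstream buf;
          ab.serialize(buf);
          SumState restored;
          restored.deserialize(buf);
          assert(restored.sum == ab.sum && restored.sum == 43);
      }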
  26. GOLD STANDARDS Identifying Gold Standards in ClickHouse Aggregate implementations can

    vary significantly in quality. Gold standard references demonstrate production-proven correctness, optimal performance, clear code structure, and comprehensive edge case handling, serving as templates for new aggregate development. Widely Used in Production Deployed in thousands of production queries across diverse workloads. Proven reliable under real-world conditions with billions of rows and extreme cardinalities. Community feedback has contributed to refined edge case handling. Highly Optimized Memory efficiency achieved through compact state structures. CPU optimization via SIMD vectorization. Multi-threaded scalability employing lock-free patterns. Allocation overhead is minimized through effective Arena usage. Well-Documented Code Clear comments explain design decisions. Documented edge cases and limitations. Example usage patterns are provided in tests. The structure is readable and promotes maintainability by other developers. Correct Edge Handling Robust handling of NULL values. Safe behavior regarding integer overflow. Numerically stable floating-point operations. Graceful management of empty groups. Comprehensive test coverage ensures reliability. Best Practice Examples Demonstrates proper serialization format. Showcases efficient Arena memory management. Implements clean merge logic. Utilizes appropriate hash table structures. Serves as Template Similar aggregates can be modeled after these examples. Design patterns are reusable across various domains. The implementation structure is clear and consistent, facilitating adaptation for custom requirements.
  27. Gold Standard #1: sum() Aggregate Why It's the Reference The

    sum() aggregate represents the most straightforward and accurate implementation. It demonstrates fundamental patterns without unnecessary complexity, serving as an ideal starting point for understanding aggregate development. Architecture Highlights Single numeric state field (UInt64 or Float64). No Arena allocation required. Fully SIMD-optimizable for modern CPUs. Trivial serialization (binary copy). Perfect commutativity and associativity. Key Lesson Prioritize simplicity and correctness in initial implementation. Optimization efforts should follow a validated basic implementation. The sum() aggregate illustrates that even the simplest aggregates demand meticulous attention to type handling and NULL values. Implementation:

      // State structure
      template <typename T>
      struct SumData
      {
          T sum = 0;
      };

      // Additive phase - scalar version
      void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena *)
      {
          this->data(place).sum += columns[0]->getInt(row_num);
      }

      // SIMD-optimized version
      void addMany(const Value * ptr, size_t start, size_t end)
      {
      #ifdef __AVX2__
          // Process 4-8 values simultaneously
          // using AVX2/AVX512 vector instructions
      #endif
      }
  28. Gold Standard #2: count() and countIf() Aggregates Why it is

    a Reference Implementation This implementation explicitly demonstrates methodologies for handling NULL values and illustrates the distinction between counting all rows versus counting based on specific conditions. It showcases how to implement conditional aggregation efficiently, utilizing a highly efficient state structure. State structure and logic:

      struct CountData
      {
          UInt64 count = 0;
      };

      // Count all rows, including NULL values
      void add_count(...) { this->data(place).count++; }

      // Count only if the condition is true
      void add_countIf(...)
      {
          if (condition_column->getBool(row_num))
              this->data(place).count++;
      }

    Key Characteristics Simple counter increment operation. NULL values are implicitly included in the count. Merging involves the summation of counters. Direct UInt64 serialization. Achieves optimal parallelization across multiple threads. Variations include: count(), which increments unconditionally; count(column), which increments only for non-NULL values; and countIf(condition), which increments when a specified condition is true. These three related aggregates share the same fundamental implementation pattern.
  29. Gold Standard #3: avg() Aggregate The avg() aggregate exemplifies the

    crucial pattern of distinct state and result types. It maintains a sum and a count within its state, performing division only during finalization to compute the average, thereby preventing precision loss from intermediate divisions. State structure:

      template <typename T>
      struct AvgData
      {
          T sum = 0;
          UInt64 count = 0;
      };

      // Additive phase
      void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena *)
      {
          auto & state = this->data(place);
          state.sum += columns[0]->get(row_num);
          state.count++;
      }

      // Merge phase
      void merge(AggregateDataPtr place, ConstAggregateDataPtr rhs, Arena *)
      {
          auto & dst = this->data(place);
          const auto & src = this->data(rhs);
          dst.sum += src.sum;
          dst.count += src.count;
      }

    Finalization logic:

      // Convert state to result
      void insertResultInto(AggregateDataPtr place, IColumn & to) const
      {
          const auto & state = this->data(place);
          if (state.count == 0)
          {
              to.insertDefault();
              return;
          }
          Float64 result = static_cast<Float64>(state.sum) / static_cast<Float64>(state.count);
          to.insert(result);
      }

    Key Design Points State stores a {sum, count} pair. The result type is Float64 (representing the average value). Division occurs solely during finalization. Gracefully handles empty groups. Ensures numerical stability for distributed merges.
  30. Gold Standard #4: min() and max() Aggregates

    Minimum Aggregate This aggregate maintains the smallest value encountered. The first non-NULL value initializes the state. Subsequent values are compared, and the state is updated if a smaller value is found. Correct handling of NULL values is crucial.

      struct MinData
      {
          T value;
          bool initialized = false;
      };

      void add(...)
      {
          // state is obtained via this->data(place)
          if (!columns[0]->isNullAt(row_num))
          {
              T val = columns[0]->get(row_num);
              if (!state.initialized || val < state.value)
              {
                  state.value = val;
                  state.initialized = true;
              }
          }
      }

    Maximum Aggregate This aggregate maintains the largest value encountered. Its structure is identical to the minimum aggregate, but with an inverted comparison logic.

      struct MaxData
      {
          T value;
          bool initialized = false;
      };

      void add(...)
      {
          if (!columns[0]->isNullAt(row_num))
          {
              T val = columns[0]->get(row_num);
              if (!state.initialized || val > state.value)
              {
                  state.value = val;
                  state.initialized = true;
              }
          }
      }

    Both minimum and maximum aggregates benefit from SIMD (Single Instruction, Multiple Data) min/max instructions available on modern CPUs for enhanced performance. Special considerations for these aggregates include: empty groups return NULL (not an uninitialized value), string comparisons utilize collation-aware ordering, and dates/timestamps are compared as numeric values. SIMD implementations leverage vector min/max operations, often yielding a 4-8x speedup in processing.
  31. Gold Standard #5: uniqExact() Aggregate The uniqExact() aggregate exemplifies efficient

    Arena memory management and advanced hash table patterns. It maintains a hash set of unique values observed, with its state stored entirely within Arena-allocated memory. This pattern is applicable to any aggregate requiring complex data structures. State structure (the hash set is allocated within the Arena, rather than using `new` or `malloc`, ensuring memory is properly tracked and deallocated upon query completion):

      template <typename T>
      struct UniqExactData
      {
          using Set = HashSet<T>;
          Set * set = nullptr;

          void init(Arena * arena)
          {
              set = new (arena->alloc(sizeof(Set))) Set();
          }
      };

    Add operation:

      void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena * arena)
      {
          auto & state = this->data(place);
          if (!state.set)
              state.init(arena);

          T value = columns[0]->get(row_num);
          state.set->insert(value);
      }

    Merge operation:

      void merge(AggregateDataPtr place, ConstAggregateDataPtr rhs, Arena * arena)
      {
          auto & dst = this->data(place);
          const auto & src = this->data(rhs);
          if (!dst.set)
              dst.init(arena);

          // Union two sets
          for (const auto & val : *src.set)
              dst.set->insert(val);
      }

    Serialization (serializes the hash set's contents into a compact binary format; deserialization then reconstructs the set from the provided stream):

      void serialize(ConstAggregateDataPtr place, WriteBuffer & buf) const
      {
          const auto & set = *this->data(place).set;
          writeVarUInt(set.size(), buf);
          for (const auto & val : set)
              writeBinary(val, buf);
      }
  32. Gold Standard #6: quantiles() Aggregate Rationale for Stateful Algorithms The

    quantiles() aggregate demonstrates that state can encapsulate complex algorithm objects, moving beyond simple data structures. It utilizes T-Digest or HDR Histogram for approximate quantile calculation with bounded memory consumption. Key Insights State as an algorithm object, not plain data. Custom serialization for algorithm state. Arena allocator for memory management. SIMD not applicable (algorithm-specific logic). Accuracy-memory efficiency trade-off. Implementation pattern:

      struct QuantilesData
      {
          TDigest * digest = nullptr;

          void init(Arena * arena)
          {
              digest = new (arena->alloc(sizeof(TDigest))) TDigest();
          }

          void add(Float64 value) { digest->add(value); }

          void merge(const QuantilesData & other) { digest->merge(*other.digest); }

          Float64 getQuantile(Float64 level) { return digest->getQuantile(level); }
      };

    Memory-Accuracy Trade-off: Exact quantile calculations necessitate storing all input values (memory complexity of O(n)). T-Digest provides an approximation with <1% error, consuming approximately 1KB per state (memory complexity of O(1)). For datasets of one billion rows, exact methods require 8GB, whereas approximate methods need only 1KB—a memory reduction of approximately 8 million times.
  33. Gold Standard #7: groupArray() Aggregate The groupArray() aggregate showcases variable-length

    state with significant Arena usage. It accumulates all values for each group into an array, demonstrating the patterns required for aggregates that must store unbounded data per group. State structure (PODArray manages memory through an Arena allocator, rather than malloc, and automatically resizes as needed):

      template <typename T>
      struct GroupArrayData
      {
          PODArray<T> array;

          void add(T value, Arena * arena)
          {
              array.push_back(value, arena);
          }
      };

    Add implementation:

      void add(AggregateDataPtr place, const IColumn ** columns, size_t row_num, Arena * arena)
      {
          auto & arr = this->data(place).array;
          T value = columns[0]->get(row_num);

          // Store value in Arena
          arr.push_back(value, arena);
      }

    Memory Considerations Potentially unbounded growth (e.g., millions of values per group). Subject to defined memory limits (max_bytes_before_external_group_by). May spill to disk if memory thresholds are exceeded. Merging can concatenate arrays, leading to significant memory consumption. Serialization strategy:

      void serialize(ConstAggregateDataPtr place, WriteBuffer & buf) const
      {
          const auto & arr = this->data(place).array;
          writeVarUInt(arr.size(), buf);
          for (size_t i = 0; i < arr.size(); ++i)
              writeBinary(arr[i], buf);
      }
  34. Source Code Navigation Guide 01 Primary Directory The src/AggregateFunctions/ directory

    contains the implementations for all aggregate functions. Each aggregate typically features a dedicated .cpp/.h file pair that adheres to the naming pattern AggregateFunctionXxx.cpp. 02 Base Classes The primary header files include IAggregateFunction.h, which defines the core interface; IAggregateFunctionDataHelper.h, which establishes the templated base class pattern; and AggregateFunctionFactory.h, which manages the registration process. 03 Combinator System AggregateFunctionCombinatorFactory.cpp implements the `-State`, `-Merge`, `-MergeState`, and `-SimpleState` modifiers. A review of this file is recommended to understand how combinators transform base aggregate functions. 04 Reference Files for Study AggregateFunctionSum.cpp: Illustrates the simplest implementation. AggregateFunctionCount.cpp: Demonstrates patterns for NULL handling. AggregateFunctionAvg.cpp: Clarifies the distinction between state and result. AggregateFunctionUniq.cpp: Showcases hash set and Arena usage. AggregateFunctionQuantile.cpp: Presents a complex algorithm state. 05 Search Strategy To locate a specific aggregate function, search for the function name prefixed with "AggregateFunction." For example, searching for "AggregateFunctionSum" will identify the sum() implementation. Utilize IDE navigation features to traverse between base classes and their respective implementations.
  35. Adapting Reference Implementations 1 Choose Closest Reference Select the reference

    aggregate most similar to your requirements. For simple arithmetic, consider `sum()`. For unique values, `uniqExact()` is appropriate. For complex algorithms, `quantiles()` can be utilized. A closer match minimizes the adaptation required. 2 Copy Structure & Interface Duplicate the class structure, including inheritance from `IAggregateFunctionDataHelper`. Retain the original method signatures (`add`, `merge`, `serialize`, `deserialize`, `insertResultInto`). Maintain the template pattern to ensure type flexibility. 3 Modify State Structure Adapt the state structure to store the intermediate data for your computation. Add fields for accumulators, counters, or algorithm-specific objects. Consider the memory footprint; maintain compactness where feasible. Ensure all fields are initialized correctly. 4 Update Core Methods Implement `add()` with your specific value processing logic. Adapt `merge()` to combine your state structures appropriately. Update `serialize()` and `deserialize()` to handle your state format. Modify `insertResultInto()` with your final calculation logic. 5 Test Against Reference For identical inputs, verify that your aggregate produces the same results as the reference. Test the correctness of serialization round-trips. Validate the commutativity and associativity of the merge operation. Compare performance characteristics, including memory usage and throughput. Adaptation checklist: Ensure the serialization format preserves all state data, NULL handling aligns with intended semantics, the merge operation is truly associative and commutative, empty group behavior is well-defined, and memory management utilizes the Arena appropriately.
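
    For example, adapting the avg() reference to a hypothetical geometric-mean aggregate changes only the accumulator semantics; the {accumulator, count} state shape, merge logic, and test obligations carry over (sketch only; real code would inherit IAggregateFunctionDataHelper and use the place/columns/arena signatures from slides 16-21):

      #include <cmath>
      #include <cstdint>

      // Hypothetical geometric-mean state: accumulate log(x) instead of x.
      struct GeoMeanState
      {
          double log_sum = 0.0;
          uint64_t count = 0;

          void add(double value) { log_sum += std::log(value); ++count; }

          void merge(const GeoMeanState & other)
          {
              log_sum += other.log_sum;
              count += other.count;
          }

          double finalize() const { return count ? std::exp(log_sum / count) : 0.0; }
      };

    Because the merge remains a pair of additions, it stays associative and commutative, so the slide 25 test checklist applies unchanged.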
  36. PERFORMANCE Memory Alignment & Cache Efficiency CPU Cache Architecture Modern

    CPUs fetch data in 64-byte cache lines. When multiple threads modify data within the same cache line, they compete for exclusive ownership—a phenomenon known as false sharing that significantly degrades parallel performance. Cache Line Contention Consider a false sharing scenario: Two threads aggregate data into adjacent memory locations within the same cache line. Each modification invalidates the other thread's cache line, forcing expensive memory bus traffic. This contention leads to serialized execution, negating the benefits of parallel processing. Alignment solution:

      // Allocate with cache line alignment
      AggregateDataPtr place = arena->alignedAlloc(total_size, align_aggregate_states);

      // align_aggregate_states = 64
      // Matches the CPU cache line size

    2-3x Throughput Gain Achieved through proper alignment in high-concurrency aggregation. 64 Bytes per Cache Line Standard on modern x86-64 CPUs. Performance impact: Misaligned aggregation with 16 threads can result in slower execution than single-threaded processing due to severe cache contention. Proper 64-byte alignment enables linear scaling, allowing 16 threads to deliver approximately 16x throughput.
  37. SIMD Optimizations for Arithmetic Aggregates SIMD (Single Instruction Multiple Data)

    enables processing multiple values in parallel using specialized CPU instructions. Modern processors can execute 4-8 operations simultaneously, delivering substantial performance gains for arithmetic aggregates. SIMD Capabilities AVX2: 256-bit registers process 4× UInt64 or 8× UInt32 simultaneously. AVX512: 512-bit registers process 8× UInt64 or 16× UInt32 simultaneously. Supported Operations: addition, subtraction, multiplication, minimum, maximum, and comparison. Throughput: Achieves 4-8x speedup over scalar code. The addMany() pattern:

      void NO_INLINE addMany(const Value * __restrict ptr, size_t start, size_t end)
      {
          if (isArchSupported(TargetArch::AVX512BW))
          {
              // Process 8 values per iteration
              __m512i sum_vec = _mm512_setzero_si512();
              for (size_t i = start; i < end; i += 8)
              {
                  __m512i data = _mm512_loadu_si512(&ptr[i]);
                  sum_vec = _mm512_add_epi64(sum_vec, data);
              }
              // Horizontal sum of vector
          }
          else if (isArchSupported(TargetArch::AVX2))
          {
              // Process 4 values per iteration
              // AVX2 implementation
          }
          else
          {
              // Scalar fallback
              for (size_t i = start; i < end; ++i)
                  state.sum += ptr[i];
          }
      }

    Runtime CPU detection ensures optimal code path selection. The binary contains multiple implementations, with CPU capabilities determining which one executes at runtime, eliminating the need for multiple binaries or user configuration.
  38. Multi-Target Code and Architecture Specialization 1 USE_MULTITARGET_CODE Macro ClickHouse supports

    compiling the same function multiple times with distinct optimization flags. The macro system generates a scalar baseline version, an AVX2-optimized version, and an AVX512-optimized version. A runtime dispatcher then selects the most optimal available implementation based on the CPU's capabilities. 2 Compilation Strategy

      // Mark function for multi-target compilation
      DECLARE_MULTITARGET_CODE(
          void addManyImpl(const UInt64 * data, size_t count, UInt64 & sum)
          {
          #ifdef __AVX512F__
              // AVX512 implementation
          #elif defined(__AVX2__)
              // AVX2 implementation
          #else
              // Scalar implementation
          #endif
          }
      ) // End multi-target code

    3 Benefits of Multi-Target Compilation Single binary compatible with all CPUs (both legacy and modern architectures). Automatic selection of the most efficient implementation. Newer CPUs automatically leverage the latest instruction sets. Eliminates the need for manual configuration or explicit CPU detection code. The compiler independently optimizes each target implementation. For instance, the sum() aggregate on a modern CPU with AVX512 achieves 8x throughput compared to the scalar baseline. The same binary on an older CPU without AVX512 will seamlessly fall back to the AVX2 (4x throughput) or scalar implementation, ensuring both compatibility and maximized performance across diverse hardware configurations.
  39. Specialized Hash Tables for Different Data Sizes ClickHouse selects hash

    table implementations based on cardinality and memory constraints. Choosing the optimal data structure is critical for efficient in-memory processing and avoiding expensive disk spills. 1 Small Cardinality This category involves a low group count that fits within the CPU cache. An optimized hash table with a cache-friendly layout and linear probing is used, ensuring all states reside in RAM for rapid access. A typical threshold for this scenario is fewer than 100,000 unique groups. 2 Medium Cardinality This covers a moderate group count that fits within available RAM. A standard hash table with robust collision handling is employed. Memory usage is monitored, and preparations for potential spillover are made. This typically applies to a range of 100,000 to 10 million unique groups. 3 Large Cardinality This addresses high group counts that exceed available RAM. A two-level hash table is utilized to provide enhanced memory locality and support for disk spill, as sketched below. External aggregation gracefully manages memory pressure. The spill threshold is configured by max_bytes_before_external_group_by (recommended: roughly 50% of max_memory_usage). Configuration: Setting max_bytes_before_external_group_by = 10GB triggers a disk spill when aggregation exceeds 10GB. The two-level hash table partitions data for efficient external processing. Performance impact: In-memory aggregation is 100-1000 times faster than disk-based processing, necessitating careful tuning of these thresholds.
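
    The two-level idea can be illustrated with a toy partitioned map (hypothetical sketch, not ClickHouse's TwoLevelHashTable): keys are routed to one of 256 sub-tables by their hash, so each sub-table can be merged or spilled independently.

      #include <cstdint>
      #include <functional>
      #include <string>
      #include <unordered_map>

      struct TwoLevelMap
      {
          static constexpr size_t kBuckets = 256;
          std::unordered_map<std::string, uint64_t> buckets[kBuckets];

          uint64_t & operator[](const std::string & key)
          {
              // Bucket chosen from the key's hash (ClickHouse uses the high bits).
              size_t bucket = std::hash<std::string>{}(key) % kBuckets;
              return buckets[bucket][key];
          }
      };

      int main()
      {
          TwoLevelMap counts;
          counts["us"] += 10;
          counts["eu"] += 5;
          // During merge or spill, buckets[0..255] are processed one at a
          // time, bounding peak memory.
      }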
  40. Memory Locality & Data Structure Layout

    Anti-Patterns to Avoid Scattered Allocations Multiple pointers necessitate multiple memory accesses. Each pointer dereference is a potential cache miss. Additionally, padding within data structures can lead to inefficient cache line utilization.

      // BAD: multiple pointer indirections
      struct BadState
      {
          char * ptr1;      // Separate allocation
          char * ptr2;      // Another allocation
          uint64_t val;
          char padding[40]; // Wasted space
      };
      // Each access: 3 cache misses
      // Size: 64 bytes (only 24 useful)

    Optimized Layout Compact Structure Grouping related fields together maximizes cache efficiency. Utilizing appropriate integer sizes (e.g., `UInt32` instead of `UInt64` where sufficient) minimizes memory footprint. Eliminating unnecessary padding further improves data density.

      // GOOD: compact inline data
      struct GoodState
      {
          uint64_t sum;
          uint32_t count;
          uint32_t flags;
      };
      // One access: 1 cache miss
      // Size: 16 bytes (all useful)

    4x Cache Efficiency Gain Compact layouts enable fitting four states per cache line, significantly outperforming a single scattered state. 75% Memory Savings Achieved by eliminating padding and unused space within aggregate states, leading to substantial memory optimization.
  41. Vectorized Block Processing ClickHouse processes data in blocks (segments of

    8,192 to 65,536 rows) rather than individual rows. This block-level processing amortizes overhead and enables powerful optimizations that row-by-row processing cannot achieve. Block Structure A block constitutes the smallest unit of computation, comprising multiple columns with identical row counts. Blocks align with storage granules for efficient I/O. A typical size is 8,192 rows per block, which is tunable via the `max_block_size` setting. Amortized Overhead Virtual function calls are executed per-block, not per-row. Consequently, function call overhead, when distributed across 8,192 rows, becomes negligible. The compiler can optimize block-level functions more aggressively. This approach results in performance that is 10-100 times faster than row-by-row iteration. SIMD Opportunities Contiguous block data facilitates vectorized processing. This enables loading 4-8 values simultaneously from block memory and processing multiple rows per CPU instruction. The cache-friendly sequential memory access pattern maximizes throughput. `addMany()` sketch:

      // Process an entire block at once
      void addMany(const IColumn & column, size_t start, size_t end)
      {
          // In practice the column is cast to its concrete type
          // (e.g., ColumnVector<T>) before accessing raw data.
          const auto * data = column.getData().data();

          // SIMD loop over [start, end)
          for (size_t i = start; i < end; i += 8)
          {
              // Process 8 rows simultaneously
          }
      }
  42. Thread-Safe Aggregation & Lock-Free Patterns Single-Threaded Path Direct state updates

    occur without synchronization overhead. All aggregation is processed sequentially within a single thread. While efficient for single-threaded operations, this approach does not inherently scale to multiple cores. It is typically employed when `max_threads=1` or when data fits entirely within a single partition. Multi-Threaded Pattern Each worker thread maintains its own independent aggregation context. The absence of shared state during the additive phase eliminates lock contention. Each thread processes distinct input blocks in parallel, and a final phase serially merges the per-thread states into the ultimate result. Lock-Free Design Locks are avoided in the critical execution path (e.g., `add`, `addMany` methods). The arena allocator itself employs internal thread-safety mechanisms. State pointers are never shared between threads during the accumulation phase. Only the final merge phase involves potential synchronization, which is typically fast relative to the total processing time. Performance Impact Lock-free aggregation demonstrates near-linear scalability with thread count. For example, 8 threads can deliver approximately 8x throughput, and 16 threads can achieve approximately 16x throughput. This contrasts sharply with locked approaches, where contention typically limits speedup to only 2-3x, regardless of the number of threads. A proper lock-free design choice can yield 5-10x better parallel efficiency. Arena Thread Safety: The Arena allocator utilizes atomic operations for thread-safe memory allocation. This minimal synchronization in the allocation path does not significantly impact performance, as allocations are infrequent compared to state updates.
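
    A minimal standalone sketch of this pattern (plain std::thread code, not ClickHouse internals): each thread accumulates into its own cache-line-aligned partial state, and the only cross-thread step is the cheap serial merge at the end.

      #include <cstdint>
      #include <numeric>
      #include <thread>
      #include <vector>

      // One cache-line-sized partial state per thread (see slide 10):
      // private states mean no locks or atomics in the hot loop.
      struct alignas(64) PartialSum { uint64_t sum = 0; };

      int main()
      {
          std::vector<uint64_t> data(1'000'000);
          std::iota(data.begin(), data.end(), 1);

          const size_t num_threads = 8;
          std::vector<PartialSum> partial(num_threads);
          std::vector<std::thread> workers;

          for (size_t t = 0; t < num_threads; ++t)
              workers.emplace_back([&, t]
              {
                  size_t chunk = data.size() / num_threads;
                  size_t begin = t * chunk;
                  size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
                  for (size_t i = begin; i < end; ++i)
                      partial[t].sum += data[i]; // private state: lock-free by construction
              });
          for (auto & w : workers)
              w.join();

          // Final merge phase: cheap and serial.
          uint64_t total = 0;
          for (const auto & p : partial)
              total += p.sum; // 500000500000
      }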
  43. Serialization Performance Considerations Why Serialization Matters In distributed aggregation, serialization

    often represents an I/O bottleneck. States must be efficiently transmitted over the network between cluster nodes. An inefficient serialization format can significantly impact query time, even when computation is fast. Performance Factors Format Size: Compactness reduces network bandwidth usage. CPU Cost: Compression can improve performance, often justifying the CPU overhead. Frequency: Serialization typically occurs once per group in distributed queries. Determinism: Identical states must produce identical serialized outputs. Optimization Strategies Compact Binary Avoid JSON due to its verbosity. Employ binary formats with variable-length encoding for integers and tight data packing without alignment padding. Batch Operations Batch multiple states into a single buffer operation to amortize buffer flush overhead and enhance write buffer efficiency. Smart Compression For large states, compression can significantly reduce network bandwidth usage. The associated CPU cost is often justified by a 3-10x reduction in size. Prioritize fast algorithms like LZ4 over GZIP. For example, in `uniqExact()` serialization optimization, a naive approach serializes the entire hash table structure, incurring significant overhead. An optimized approach serializes only the unique values, resulting in a 10x smaller serialized size and 10x faster network transmission.
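
    A sketch of the variable-length integer encoding mentioned above (hypothetical helper names with a Sketch suffix; ClickHouse's writeVarUInt/readVarUInt, used on the uniqExact and groupArray slides, apply the same idea):

      #include <cstdint>
      #include <string>

      // LEB128-style encoding: small values take 1-2 bytes instead of a fixed 8.
      void writeVarUIntSketch(uint64_t value, std::string & out)
      {
          while (value >= 0x80)
          {
              out.push_back(static_cast<char>((value & 0x7F) | 0x80)); // 7 payload bits + continuation bit
              value >>= 7;
          }
          out.push_back(static_cast<char>(value));
      }

      uint64_t readVarUIntSketch(const std::string & in, size_t & pos)
      {
          uint64_t value = 0;
          for (int shift = 0;; shift += 7)
          {
              uint8_t byte = static_cast<uint8_t>(in[pos++]);
              value |= uint64_t(byte & 0x7F) << shift;
              if (!(byte & 0x80))
                  return value;
          }
      }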
44. Query Settings for Aggregate Optimization

Setting | Description | Recommendation
max_memory_usage | Per-query memory limit; queries that exceed it are aborted, so pair it with an external-aggregation threshold below it. | 10-50GB
max_bytes_before_external_group_by | Threshold that triggers a disk spill during GROUP BY; recommended at roughly 50% of max_memory_usage. | 5-25GB
max_threads | Thread parallelism for query execution; configure according to the number of available CPU cores. | 8-32
aggregation_memory_efficient_mode | Enables a memory-optimized aggregation algorithm (typically a two-level hash table), trading increased CPU usage for reduced memory consumption. | 1 (enabled)
group_by_two_level_threshold | Cardinality threshold at which GROUP BY switches to a two-level hash table. | 100000
max_block_size | Rows processed per block; larger blocks improve amortization but require more memory. | 8192-65536

Tuning Strategy: Align these settings with available hardware. For example, on a machine with 64GB of RAM: set max_memory_usage to 30GB (leaving headroom for the operating system), max_bytes_before_external_group_by to 15GB (50% of the memory limit), and max_threads to 16 (matching physical CPU cores). Monitor system.query_log for memory_usage metrics to refine further.
45. Performance Profiling & Measurement

01 Query Timing Analysis
Utilize EXPLAIN ANALYZE to obtain detailed execution timing and identify the query stages that consume the most time. Focus on time spent in aggregate functions versus I/O or network operations. The system.query_log table stores historical query performance data.

02 Memory Tracking
Monitor system.processes during query execution to observe real-time memory usage. After query completion, check system.query_log for peak_memory_usage and compare it against configured limits to gauge remaining headroom. Track memory usage per worker thread for granular analysis.

03 CPU Profiling with perf
Use the Linux perf tool to capture CPU samples during aggregation: perf record -g -p <pid>, followed by perf report for hotspot analysis. Generate flame graphs to visualize time distribution across functions; this exposes bottlenecks such as excessive cache misses, branch mispredictions, or inefficient functions.

04 Micro-Benchmarks
Develop standalone C++ tests for specific aggregate functions. Measure add() throughput in rows per second. Test with various data patterns (uniform, skewed, sparse). Compare SIMD versus scalar implementations. Validate linear scaling with increasing thread counts.

05 Baseline Comparison
Compare your aggregate against similar built-in functions. Test across low, medium, and high cardinalities. Measure on the target hardware, as cache effects are architecture-specific. Indicators of suboptimal performance include O(n²) behavior during merging, excessive memory allocations, lock contention, and poor SIMD utilization.
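A minimal sketch of the micro-benchmark idea from step 04, assuming a toy SumState as a stand-in for the aggregate under test: time a tight add() loop over synthetic rows and report rows per second.

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // Toy aggregate state; substitute the add() path you actually want to measure.
    struct SumState
    {
        double value = 0;
        void add(double x) { value += x; }
    };

    int main()
    {
        std::vector<double> rows(1 << 24, 1.5);   // ~16.7M synthetic rows

        SumState state;
        auto start = std::chrono::steady_clock::now();
        for (double v : rows)
            state.add(v);
        double seconds = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();

        // Printing the result prevents the compiler from discarding the loop.
        std::printf("sum=%.1f  throughput=%.0f rows/sec\n",
                    state.value, rows.size() / seconds);
    }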
46. CASE STUDIES
Case Study #1: Custom Approximate Aggregate

Scenario and Requirements
Business Need: Calculate the 95th percentile (P95) latency metric across billions of request logs with an error tolerance under 1%, while maintaining a minimal memory footprint per group.
Challenge: Exact percentile calculations require storing all values (O(n) memory per group). For example, 1 billion values at 8 bytes each would require 8GB per group, which is infeasible for high-cardinality grouping scenarios.
Constraints: At most 100MB of memory per group, sub-second query latency, and distributed aggregation across 100+ nodes.

Solution Approach
An approximate quantile function was implemented using the T-Digest algorithm, which performs streaming approximate percentile calculations with bounded memory usage irrespective of input size.

Key Design Decisions
State: A T-Digest data structure (approximately 1KB fixed size).
Accuracy: Guarantees less than 1% error for any percentile.
Merge: Uses the T-Digest merge algorithm to combine states efficiently.
Serialization: A custom binary format for compact transmission.

Results
~400,000x Memory Reduction: a ~1KB approximate state versus the ~400MB required for exact storage of 50 million values per group.
<1% Error Margin: Accuracy consistently meets business requirements for all percentiles.
100x Throughput Gain: Significantly faster query execution due to reduced memory pressure and the elimination of disk spill operations.
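To illustrate the bounded-memory idea behind the case study, here is a deliberately simplified sketch, not the production T-Digest: a real T-Digest sizes clusters non-uniformly so the tails (e.g., P95) stay accurate, and it also needs a quantile() extraction step; both are omitted here, and all names are hypothetical.

    #include <algorithm>
    #include <vector>

    // One cluster of nearby values: its running mean and how many values it holds.
    struct Centroid
    {
        double mean;
        double count;
    };

    struct ApproxQuantileState
    {
        static constexpr size_t max_centroids = 100;   // bounds memory per group
        std::vector<Centroid> centroids;

        void add(double x)
        {
            centroids.push_back({x, 1});
            if (centroids.size() > 2 * max_centroids)
                compress();
        }

        void merge(const ApproxQuantileState & other)
        {
            centroids.insert(centroids.end(),
                             other.centroids.begin(), other.centroids.end());
            compress();
        }

        // Collapse neighbouring centroids until at most max_centroids remain.
        void compress()
        {
            if (centroids.size() <= max_centroids)
                return;
            std::sort(centroids.begin(), centroids.end(),
                      [](const Centroid & a, const Centroid & b) { return a.mean < b.mean; });

            std::vector<Centroid> out;
            size_t per_bucket = (centroids.size() + max_centroids - 1) / max_centroids;
            for (size_t i = 0; i < centroids.size(); i += per_bucket)
            {
                Centroid c{0, 0};
                size_t stop = std::min(i + per_bucket, centroids.size());
                for (size_t j = i; j < stop; ++j)
                {
                    // Incremental weighted mean of the merged cluster.
                    c.count += centroids[j].count;
                    c.mean += (centroids[j].mean - c.mean) * centroids[j].count / c.count;
                }
                out.push_back(c);
            }
            centroids = std::move(out);
        }
    };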
47. Case Study #2: Streaming ML Feature Aggregates

Real-time machine learning systems must compute statistical features over streaming data, including mean, variance, covariance, and other moments. These features must update incrementally as new data arrives, without recomputation from scratch.

Business Context
An online recommendation system requires real-time user behavior features such as average session duration, variance in click rates, and covariance between various metrics. These features must update with sub-second latency while processing billions of events per day.

Numerical Challenge
The naive approach computes the mean as sum/count and the variance as sum(x²)/n - mean². Catastrophic cancellation occurs when the mean is large relative to the variance (e.g., mean=1000, variance=0.1), producing highly inaccurate variance values.

Solution: Welford's Algorithm

    // Numerically stable variance computation
    struct WelfordState
    {
        UInt64 count = 0;
        Float64 mean = 0;
        Float64 M2 = 0;   // Sum of squared deviations from the running mean

        void add(Float64 value)
        {
            ++count;
            Float64 delta = value - mean;
            mean += delta / count;
            M2 += delta * (value - mean);
        }

        Float64 variance() const { return count ? M2 / count : 0; }
    };

Results & Trade-offs
Memory Efficiency: O(1) memory footprint, independent of input size.
Numerical Stability: Robust across all value ranges.
Parallelization: States can be merged efficiently (see the merge sketch below).
Trade-off: Results are not bit-exact with batch computation, which is acceptable for machine learning feature engineering.
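The parallel merge that the Parallelization point relies on can be written with Chan et al.'s pairwise formula; this sketch is our addition rather than code from the case study, and it assumes the WelfordState and ClickHouse-style UInt64/Float64 aliases from the snippet above.

    // Combine two Welford states: M2_ab = M2_a + M2_b + delta^2 * n_a*n_b / n
    void merge(WelfordState & lhs, const WelfordState & rhs)
    {
        if (rhs.count == 0)
            return;
        if (lhs.count == 0)
        {
            lhs = rhs;
            return;
        }
        Float64 delta = rhs.mean - lhs.mean;
        UInt64 total = lhs.count + rhs.count;
        lhs.M2 += rhs.M2 + delta * delta
                  * (static_cast<Float64>(lhs.count) * static_cast<Float64>(rhs.count)
                     / static_cast<Float64>(total));
        lhs.mean += delta * static_cast<Float64>(rhs.count) / static_cast<Float64>(total);
        lhs.count = total;
    }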
48. Common Challenges and Solutions

Non-Idempotent Serialization
Problem: Serialization followed by deserialization yields a different state, corrupting data and producing incorrect distributed aggregation results.
Solution: Implement comprehensive round-trip tests: for every state, verify that deserialize(serialize(state)) == state. Test thoroughly with edge cases, including empty states, maximum values, and special values.

Memory Leaks in Arena Allocation
Problem: Allocating complex objects with new/malloc directly instead of through the arena leaves memory unmanaged, causing leaks or double-free errors.
Solution: Always use the arena for variable-length allocations; the arena owns the memory, so direct calls to delete are unnecessary. Use placement new for complex objects: new (arena->alloc(sizeof(T))) T().

Non-Commutative Merge Operations
Problem: Distributed aggregation results vary with the order in which states are merged (Merge(A,B) ≠ Merge(B,A)), producing inconsistent outcomes.
Solution: Ensure merge operations are strictly commutative. Validate by testing both orderings: assert(merge(stateA, stateB) == merge(stateB, stateA)). This issue often arises from order-dependent array concatenation or hash table iteration.

SIMD Alignment Violations
Problem: Aligned SIMD load instructions assume aligned data (32-byte for AVX2, 64-byte for AVX-512); unaligned column data then causes segmentation faults.
Solution: Use unaligned load instructions (e.g., _mm256_loadu_si256 instead of _mm256_load_si256), or ensure arena allocations meet the required alignment. Test with intentionally misaligned data to verify robustness.

Inconsistent NULL Handling
Problem: Discrepancies in how NULL values are handled across add(), merge(), and serialize() produce erroneous results on inputs containing NULL.
Solution: Centralize NULL checking within add(). Document the NULL semantics clearly. Test all code paths, including all-NULL inputs, mixed NULL and non-NULL values, and NULLs at edge conditions.
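A minimal sketch of the round-trip and commutativity checks described above, using a toy state; the names here are illustrative, not the ClickHouse test harness.

    #include <cassert>
    #include <istream>
    #include <ostream>
    #include <sstream>

    // Toy state with binary (de)serialization.
    struct State
    {
        double sum = 0;

        void serialize(std::ostream & out) const
        {
            out.write(reinterpret_cast<const char *>(&sum), sizeof(sum));
        }
        void deserialize(std::istream & in)
        {
            in.read(reinterpret_cast<char *>(&sum), sizeof(sum));
        }
        bool operator==(const State & rhs) const { return sum == rhs.sum; }
    };

    State merge(State a, const State & b)
    {
        a.sum += b.sum;
        return a;
    }

    int main()
    {
        State a{1.5};
        State b{2.5};

        // Round-trip check: deserialize(serialize(state)) must equal the state.
        std::stringstream buf;
        a.serialize(buf);
        State restored;
        restored.deserialize(buf);
        assert(restored == a);

        // Commutativity check: merge order must not affect the result.
        assert(merge(a, b) == merge(b, a));
        return 0;
    }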
49. Development Workflow & Testing Best Practices

Start with Scalar Implementation
Begin with a straightforward, unoptimized scalar implementation. Implement all required methods: add(), merge(), serialize(), deserialize(), and insertResultInto(). Prioritize correctness over performance. Use reference aggregates as templates. Verify basic functionality with simple test cases. (A skeleton of these methods is sketched after this list.)

Add Comprehensive Unit Tests
Test each method in isolation with a variety of inputs. Verify add() with edge values (zero, maximum, negative, and NULL). Confirm merge() commutativity and associativity. Validate serialization round-trip correctness. Check insertResultInto() calculation accuracy. Establish a robust test suite before starting performance optimization.

Integration Testing with SQL
Execute actual SQL queries using your aggregate. Test with varying GROUP BY cardinalities. Verify distributed aggregation across multiple nodes. Compare results against reference implementations. Test with subqueries, joins, and complex expressions, ensuring identical results across all scenarios.

Performance Testing & Profiling
Micro-benchmark add() throughput. Measure memory usage per state with diverse data patterns. Compare performance against similar built-in aggregates. Profile with `perf` to locate bottlenecks. Verify linear scaling with thread count, ensuring no lock contention. Document baseline performance characteristics thoroughly.

Optional: SIMD Optimization
Implement SIMD optimization only if profiling identifies add() as the bottleneck. Implement AVX2 before AVX512 for broader compatibility. Use multi-target compilation for automatic selection. Verify that SIMD results match scalar results bit-exactly. Test on multiple CPU architectures. Measure the actual speedup, as SIMD implementations are not always faster.
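The scalar-first skeleton referenced above: the method names mirror the interface discussed throughout this deck, but the simplified signatures (plain State references and iostreams rather than ClickHouse's untyped state pointers, IColumn, and buffer classes) are assumptions for readability.

    #include <istream>
    #include <ostream>

    struct AggregateSumDouble
    {
        struct State
        {
            double sum = 0;
        };

        // Additive phase: fold one input value into the state.
        void add(State & state, double value) const { state.sum += value; }

        // Merge phase: combine two partial states (must be commutative).
        void merge(State & lhs, const State & rhs) const { lhs.sum += rhs.sum; }

        // Serialize/deserialize for distributed aggregation.
        void serialize(const State & state, std::ostream & out) const
        {
            out.write(reinterpret_cast<const char *>(&state.sum), sizeof(state.sum));
        }
        void deserialize(State & state, std::istream & in) const
        {
            in.read(reinterpret_cast<char *>(&state.sum), sizeof(state.sum));
        }

        // Finalize phase: write the result into the output column.
        void insertResultInto(const State & state, double & result) const
        {
            result = state.sum;
        }
    };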
50. Summary & Key Takeaways

Design Patterns
Inherit from IAggregateFunctionDataHelper for type safety and memory management. Distinguish between state and result types for effective distributed aggregation. Employ the three-phase model: additive, merge, and finalize. Use data type templating for compile-time optimization.

Architecture Mastery
Implement arena-based allocation for variable-length state data. Ensure 64-byte alignment for optimal cache efficiency and SIMD performance. Select hash tables based on cardinality. Leverage the combinator pattern for flexible aggregation workflows. Guarantee serialization correctness for distributed queries.

Reference Implementations
Key functions to study: sum() for simplicity, avg() for state-versus-result-type handling, uniqExact() for hash sets and Arena allocation, quantiles() for complex algorithmic implementation, and groupArray() for variable-length data structures. All are located in the src/AggregateFunctions/ directory.

Performance Optimization
Apply SIMD vectorization for arithmetic operations. Use multi-target code for architecture-specific specialization. Employ lock-free patterns for thread scalability. Implement vectorized block processing for throughput. Profile before optimizing: measurement, not conjecture, should guide effort.

Testing Strategy
Conduct comprehensive unit tests for each method, integration tests with SQL queries, and verification of distributed correctness. Benchmark against reference implementations and cover edge cases extensively. Thoroughly test serialization round-trips.

Development Path
1. Study existing implementations within the ClickHouse codebase.
2. Implement the scalar version first, focusing on correctness.
3. Develop a comprehensive test suite.
4. Profile the implementation to identify actual bottlenecks.
5. Optimize based on empirical measurements, not assumptions.
6. Integrate SIMD only if proven necessary by profiling results.

ChistaDATA ClickHouse Blog: https://chistadata.com/blog/

This presentation provides the foundational principles for building production-quality aggregate functions. Success comes from understanding the design patterns, studying the reference implementations rigorously, testing comprehensively, and optimizing from actual profiling data.