Is your feature request related to a problem or challenge?
Aggregation is a key operation of analytic engines. DataFusion has made great progress recently (e.g. #4973 and #6889).
This epic gathers other potential ways we can improve the performance of aggregation.
Core Hash Grouping Algorithm:
- Improve aggregate performance by special casing single group keys #6969
- Improve aggregate performance with specialized groups accumulator for single string group by #7064
- Improve performance for grouping by variable length columns (strings) #9403
- Improved performance for streaming group by #7023
- Evaluate vectorized hash table for group aggregation #7095
Specialized Aggregators:
- Implement fast min/max accumulator for binary / strings (now it uses the slower path) #6906
- Improve the performance of COUNT DISTINCT queries for high cardinality groups #5547
- Speed up `DistinctCountAccumulator` #5472
- [EPIC] Improve aggregate performance with adaptive sizing in accumulators / avoiding reallocations in accumulators #7065
- Improve grouping performance via better vectorization in accumulate functions #7066
New features:
- Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937
- Generate GroupByHash output in multiple `RecordBatch`es rather than one large one #9562
- Better Grouping / aggregation pushdown #8699
- Change `Accumulator::evaluate` and `Accumulator::state` to take `&mut self` #8934
Improved partitioning:
- Lock free MPSC channel for RepartitionExec #6928
- Improve RepartitionExec for better query performance #7001
- Speed up hash partitioning #6822
Describe the solution you'd like
No response
Describe alternatives you've considered
No response
Additional context
No response