apache / datafusion Public

[EPIC] (Even More) Grouping / Group By / Aggregation Performance #7000

Open

Labels: enhancement (New feature or request)

Opened on Jul 17, 2023

Is your feature request related to a problem or challenge?

Aggregation is a key operation in analytic engines. DataFusion has made great progress recently (e.g. #4973 and #6889).

This epic gathers other potential ways we can improve the performance of aggregation.

Core Hash Grouping Algorithm:

- Improve aggregate performance by special casing single group keys #6969
- Improve aggregate performance with specialized groups accumulator for single string group by #7064
- Improve performance for grouping by variable length columns (strings) #9403
- Improved performance for streaming group by #7023
- Evaluate vectorized hash table for group aggregation #7095
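
For readers new to the first item (#6969), here is a rough sketch of why a single group key can be special cased. It is not DataFusion's implementation: the function names and the use of plain `u64` slices instead of Arrow arrays are assumptions made for illustration. The general path encodes every row's key columns into a byte buffer before hashing, while the single-key path hashes the native value directly and skips the per-row encoding and allocation.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not a DataFusion API): assigns a dense group index
/// to each row when there is exactly one `u64` group key.
/// Returns the per-row group indices and the distinct key of each group.
pub fn group_indices_single_key(keys: &[u64]) -> (Vec<usize>, Vec<u64>) {
    let mut map: HashMap<u64, usize> = HashMap::new();
    let mut group_values: Vec<u64> = Vec::new();
    let mut indices = Vec::with_capacity(keys.len());
    for &k in keys {
        // Index the key would get if it starts a new group.
        let next = group_values.len();
        let idx = *map.entry(k).or_insert_with(|| {
            group_values.push(k);
            next
        });
        indices.push(idx);
    }
    (indices, group_values)
}

/// Hypothetical general path: each row's key columns are first encoded
/// into an owned byte buffer, which costs an allocation per row.
pub fn group_indices_multi_key(columns: &[&[u64]]) -> Vec<usize> {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let mut map: HashMap<Vec<u8>, usize> = HashMap::new();
    let mut indices = Vec::with_capacity(num_rows);
    for row in 0..num_rows {
        let mut encoded = Vec::with_capacity(columns.len() * 8);
        for col in columns {
            encoded.extend_from_slice(&col[row].to_le_bytes());
        }
        let next = map.len();
        let idx = *map.entry(encoded).or_insert(next);
        indices.push(idx);
    }
    indices
}
```

Both functions produce the same group indices for a single column; the difference is only in how much work is done per row.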

Specialized Aggregators:

- Implement fast min/max accumulator for binary / strings (now it uses the slower path) #6906
- Improve the performance of COUNT DISTINCT queries for high cardinality groups #5547
- Speed up DistinctCountAccumulator #5472
- [EPIC] Improve aggregate performance with adaptive sizing in accumulators / avoiding reallocations in accumulators #7065
- Improve grouping performance via better vectorization in accumulate functions #7066
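
As context for the last two items (#7065 and #7066), a minimal sketch of a per-group accumulator that keeps all state in one flat vector. `SumGroups` and its methods are hypothetical names and the types are simplified to `i64` slices; this only illustrates the general shape (one state slot per group, grown at most once per batch, updated in a tight loop) and does not implement any of the proposals in those issues.

```rust
/// Hypothetical sum accumulator: one running total per group in a flat Vec.
pub struct SumGroups {
    sums: Vec<i64>,
}

impl SumGroups {
    pub fn new() -> Self {
        Self { sums: Vec::new() }
    }

    /// Adds `values[i]` to the group `group_indices[i]`.
    /// `total_num_groups` is the number of groups seen so far, so the state
    /// is resized at most once per batch rather than once per row.
    pub fn update_batch(
        &mut self,
        values: &[i64],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(values.len(), group_indices.len());
        if self.sums.len() < total_num_groups {
            self.sums.resize(total_num_groups, 0);
        }
        // Simple loop over two parallel slices with no per-row allocation.
        for (&v, &g) in values.iter().zip(group_indices) {
            self.sums[g] += v;
        }
    }

    /// Returns the current total for every group, indexed by group.
    pub fn evaluate(&self) -> &[i64] {
        &self.sums
    }
}
```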

New features:

- Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937
- Generate GroupByHash output in multiple RecordBatches rather than one large one #9562
- Better Grouping / aggregation pushdown #8699
- Change Accumulator::evaluate and Accumulator::state to take &mut self #8934

Improved partitioning:

- Lock free MPSC channel for RepartitionExec #6928
- Improve RepartitionExec for better query performance #7001
- Speed up hash partitioning #6822
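
For reference, a simplified sketch of what hash partitioning does, assuming a single `u64` key column and the standard library's `DefaultHasher`. The function name `hash_partition` is hypothetical and this is not how `RepartitionExec` is implemented; it only shows the per-row work (hash the key, take it modulo the partition count) that the issues above aim to make cheaper.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: returns, for each output partition, the indices of
/// the input rows routed to it. Rows with equal keys always land in the
/// same partition, so each partition can be aggregated independently.
pub fn hash_partition(keys: &[u64], num_partitions: usize) -> Vec<Vec<usize>> {
    assert!(num_partitions > 0);
    let mut partitions = vec![Vec::new(); num_partitions];
    for (row, key) in keys.iter().enumerate() {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let p = (hasher.finish() % num_partitions as u64) as usize;
        partitions[p].push(row);
    }
    partitions
}
```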

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response
