Run DataFusion benchmarks regularly and track performance history over time #5504

Open
@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As we make changes to DataFusion, some changes impact performance and some do not. Right now we mostly rely on reviewers to judge when a change could affect performance and, if so, to run the appropriate benchmarks.

This means that

  1. We may miss some performance regressions (such as the one addressed in #5325, "Optimize Accumulator size function performance (fix regression on clickbench)")
  2. Since the benchmarks are not run regularly, it is hard to know how to interpret results, and some seem to have bitrotted over time
  3. The wide variety of available benchmarks (see #5502, "Review existing datafusion benchmarks and clean them up") makes it hard to know which ones to run and how to determine whether performance has improved or regressed for a particular change

Describe the solution you'd like
I would like

  1. A system that runs DataFusion benchmarks regularly on main
  2. Some automated way to see if a particular PR has improved or regressed performance
  3. Bonus: a webpage that shows performance over time. Databend has a great example: https://perf.databend.rs/
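
To make item 2 a bit more concrete, here is a minimal sketch of what an automated check could look like. It is only an illustration under stated assumptions, not how conbench or any existing DataFusion tooling works: `compare_runs`, the 5% threshold, the input shape (benchmark name mapped to a list of elapsed seconds), and the sample timings are all hypothetical.

```python
from statistics import median


def compare_runs(baseline, candidate, threshold=0.05):
    """Classify each benchmark as 'regressed', 'improved', or 'unchanged'.

    `baseline` and `candidate` map a benchmark name to a list of elapsed
    times in seconds (hypothetical input shape). `threshold` is the relative
    change in the median that counts as significant (assumed 5% here).
    """
    report = {}
    # Only compare benchmarks that were run on both sides.
    for name in sorted(baseline.keys() & candidate.keys()):
        base = median(baseline[name])
        cand = median(candidate[name])
        if base <= 0:
            # A non-positive baseline cannot be compared meaningfully.
            continue
        change = (cand - base) / base  # positive means slower
        if change > threshold:
            verdict = "regressed"
        elif change < -threshold:
            verdict = "improved"
        else:
            verdict = "unchanged"
        report[name] = (change, verdict)
    return report


if __name__ == "__main__":
    # Made-up timings, for illustration only.
    main_times = {"q1": [1.00, 1.02, 0.98], "q2": [2.00, 2.10, 1.90]}
    pr_times = {"q1": [1.20, 1.22, 1.18], "q2": [1.95, 2.05, 2.00]}
    for name, (change, verdict) in compare_runs(main_times, pr_times).items():
        print(f"{name}: {change:+.1%} {verdict}")
```

Comparing medians of several runs rather than single timings is one simple way to dampen noise. A real setup would likely need a threshold tuned per benchmark and a stable machine to run on.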

Suggestion

I believe conbench (https://conbench.ursa.dev/), which is already partially integrated into the repo, is intended for exactly this use case. Using conbench would be nice, as it appears to be actively maintained and resourced, and is already hosted.

The integration lives at https://github.com/apache/arrow-datafusion/tree/main/conbench and was added in #1791 by @dianaclarke.

You can see the integration at work in the comments it posts on PRs after merge, such as #5476 (comment).

Describe alternatives you've considered
We could use existing time-series databases and visualization tools like Grafana to visualize the information.

Additional context

Labels

enhancement (New feature or request), help wanted (Extra attention is needed), performance (Make DataFusion faster)
