Description
Is your feature request related to a problem or challenge?
The interval arithmetic library in DataFusion works via two fundamental methods: evaluate_bounds
for bottom-up evaluation, and propagate_constraints
for top-down propagation. We apply each traversal once to update bounds on columns. However, when there are complex expressions where symbols (e.g. columns) appear more than once, this will give overly pessimistic bounds, resulting in missed opportunities in data pruning and optimizations.
Describe the solution you'd like
If we use Timothy Hickey's interval narrowing approach and utilize a single update_bounds
API (instead of the two APIs we have today), we can utilize a queue of nodes to only narrow intervals of nodes that keep shrinking. Simply stated, the update_bounds
function is a simultaneous computation of both evaluation and propagation logics -- it updates intervals of the parent and the children nodes at the same time.
This will allow us to arrive at more precise bounds without unnecessarily traversing the expression DAG too many times.
Describe alternatives you've considered
We can apply evaluate_bounds
and propagate_constraints
in a loop until column bounds converge to a fixed point (to some tolerance), but that would be inefficient (albeit easy).