-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: incorrect outer join parsing #6875
Comments
|
Also, while this may be less stressful for |
|
It gives wrong answers for |
|
The parser seems to do the right thing. The problem seems to be in |
|
Another issue in the name resolution logic: Postgres: Materialize: |
|
Fixing this is not trivial. The problem is that for the following join the join tree generated by the planner is: but according to the SQL semantics, the join tree generated should be this instead: ie. the following two queries should lead to the same join tree but they don't: My first attempt at fixing this was about treating all comma-join members as nested joins but that broke lateral joins, for the same reason lateral joins don't work within nested joins as they should: In order for the sibling context to be seen from within a nested join, we need to mark all intermediate joins as LATERAL, which is wrong. So, in order for the query above to work, we need to write it as: A few thoughts:
|
|
As an interim measure, can we throw an error "mixing of comma joins with XXX" is not allowed? |
ProposalIn order for lateral joins to be supported properly, which is needed for fixing this join planning issue, we need to support lateral joins in nested joins. The scope of a lateral join operand contains all siblings in the left hand side of the current join, including the left-hand siblings of the ancestors of the current node. For example, in the following join tree, the lateral scope of G is {B, D, F} , and not just {F} as in our current implementation. Given that we use Every time we descend a join node, then, the outer scope is then one level farther. Non-lateral join operands must then be aware that the outer scope is not at level 1, but that there are some phantom scopes in between that cannot be seen. This could be modeled by adding a We also need to change the semantics of this method: The
However, in order to support lateral joins in nested joins, as we have mentioned, all joins will create a new scope, but also that the right hand side might be correlated with a sibling node of some ancestor. In that case, we need to mark as lateral, not the current join, but the ancestor join the correlated symbol is coming from. We can detect this by visiting all columns in the right hand side in Lastly, we would need to modify @benesch, any comments on this proposal? Am I missing anything? Personally, I find this Hir representation a bit hard to deal with, since the code generating it and the code consuming it need to agree on what introduces a new scope, without that contract being explicit anywhere. I'd prefer using a Query Graph Model representation, which column references consist in a pointer to the referenced quantifier (join operand) and the column offset within the columns of that join operand (see example of a lateral join below). This Query Graph Model structure also contains all the information for name resolution purposes. |
|
A few examples of the raw plan after this proposal. In the following one, the LATERAL join operand is correlated with the previous table in the comma join, and hence, that is the join marked as LATERAL (correlated), since its right hand side is correlated with its left hand side. Double correlation. In this other example, the lateral join operand is correlated with both its sibling in the left join and the sibling of the left join in the comma join. Hence, both joins are marked as LATERAL (again, meaning correlated). Edit: Note that the join with the identity tuple can be eliminated in most cases leaving the plan as follows: |
|
Note that this HirRelationExpr that treats all joins as binary joins leads to non-optimal dataflow plans for queries like: select * from a, b, lateral(select * from c where a.a = c.a)...since it leads to a correlated inner join between C and the result of joining A and B. However, ideally we should only evaluate C for every value of A. In the query graph model, which is a representation closer to the SQL semantics, comma joins are represented as N-way cross joins: With such representation, planning join orders based on the dependencies of its operands is trivial. Even nested joins can be flattened to ease this dependency analysis needed for efficient join planning. The following example shows how a query with a nested join before and after being flattened: Another criticizable aspect HirRelationExpr representation is that it imports too many concepts from the syntactic representation, such as the visibility scopes or the concept of LEFT and RIGHT. I think there is a lot of reasoning and query simplification that should happen at that stage, before lowering, which would benefit from QGM-like representation. |
|
For context, the following query was the one being analyzed when this issue was spotted: The second version is of the query just wraps the left join with parenthesis, which makes the planner treat it as a derived join and hence, it doesn't push the comma join into the the left hand side of the left join. As shown above, that makes the query significantly faster (200x faster). Note that the only 'real' issue in the planner is the lack of proper support for lateral joins, since even if the join tree built for the first query doesn't correspond with the pure SQL interpretation of the query, it is semantically equivalent, and hence, the optimizer should be able to produce the same plan for both. So, even after fixing the join tree issue we should investigate why the optimizer wasn't able to reduce both queries to the same plan. This is a typical star-query that may come in different flavors and, although the most common one is the one below, it's not unlikely that they will come in all sort of flavors. |
|
This seems to be a predicate pushdown issue when the join predicates for the inner joins are put in the where clause. The following query builds the same join structure like the one above but, by just putting the predicates in the where clause, we get again poor performance. |
|
Example of wrong results due to this issue: As Frank mentioned above, non-left-commutative joins, like |
|
In the dataflow weekly we decided to error out instead of fixing the root cause. Fixing the root cause should become much easier after the planned QGM work progresses. |
We previously treated comma joins as cross joins, but comma joins should instead be generally handled as nested cross joins to support PG's join precedence semantics. However, our current scope and lateral join implementations preclude this approach from working for relations that rely on lateral joins– nesting prohibits the rhs from seeing the lhs' items. In this case, we continue to error as we did before. Improves MaterializeInc#6875
We previously treated comma joins as cross joins, but comma joins should instead be generally handled as nested cross joins to support PG's join precedence semantics. However, our current scope and lateral join implementations preclude this approach from working for relations that rely on lateral joins– nesting prohibits the rhs from seeing the lhs' items. In this case, we continue to error as we did before. Improves MaterializeInc#6875
We previously treated comma joins as cross joins, but comma joins should instead be generally handled as nested cross joins to support PG's join precedence semantics. However, our current scope and lateral join implementations preclude this approach from working for relations that rely on lateral joins– nesting prohibits the rhs from seeing the lhs' items. In this case, we continue to error as we did before. Improves MaterializeInc#6875
|
I'm going to reopen this because I'm finding it hard to reason about the correctness of our current |
We previously treated comma joins as cross joins, but comma joins should instead be generally handled as nested cross joins to support PG's join precedence semantics. However, our current scope and lateral join implementations preclude this approach from working for relations that rely on lateral joins– nesting prohibits the rhs from seeing the lhs' items. In this case, we continue to error as we did before. Improves MaterializeInc#6875
We previously treated comma joins as cross joins, but comma joins should instead be generally handled as nested cross joins to support PG's join precedence semantics. However, our current scope and lateral join implementations preclude this approach from working for relations that rely on lateral joins– nesting prohibits the rhs from seeing the lhs' items. In this case, we continue to error as we did before. Improves MaterializeInc#6875
Rework join planning so that we can handle lateral references that are mixed with right and full outer joins. The implementation here largely follows what @asenac laid out in @6875, except that the existing outer scope design was found to be sufficient; that is, there is no concept of a "sibling" scope introduced here. The gist is that we now always plan joins with the RHS as a child scope of the LHS. (Previously, we only created the new scope for joins if we could eagerly notice that the RHS had a lateral element.) If the join is a right or full join, we ban referencing any of the elements in the left scope, as required by SQL:2008. When we encounter a non-lateral subquery, we hide any lateral scope items entirely. Yes, it's weird that non-lateral subqueries pretend lateral scope items don't exit, while subqueries on the RHS of a right or full join *error* if they access a lateral scope item, but that's what SQL demands. The lowering code is amended to always treat inner and left joins as correlated, even if they are not. This is correct, but it may produce worse plans. I haven't investigated the movement of plans in much detail yet. Fix MaterializeInc#6875.
Rework join planning so that we can handle lateral references that are mixed with right and full outer joins. The implementation here largely follows what @asenac laid out in @6875, except that the existing outer scope design was found to be sufficient; that is, there is no concept of a "sibling" scope introduced here. The gist is that we now always plan joins with the RHS as a child scope of the LHS. (Previously, we only created the new scope for joins if we could eagerly notice that the RHS had a lateral element.) If the join is a right or full join, we ban referencing any of the elements in the left scope, as required by SQL:2008. When we encounter a non-lateral subquery, we hide any lateral scope items entirely. Yes, it's weird that non-lateral subqueries pretend lateral scope items don't exit, while subqueries on the RHS of a right or full join *error* if they access a lateral scope item, but that's what SQL demands. The lowering code is amended to always treat inner and left joins as correlated, even if they are not. This is correct, but it may produce worse plans. I haven't investigated the movement of plans in much detail yet. Fix MaterializeInc#6875.
Rework join planning so that we can handle lateral references that are mixed with right and full outer joins. The implementation here largely follows what @asenac laid out in @6875, except that the existing outer scope design was found to be sufficient; that is, there is no concept of a "sibling" scope introduced here. The gist is that we now always plan joins with the RHS as a child scope of the LHS. (Previously, we only created the new scope for joins if we could eagerly notice that the RHS had a lateral element.) If the join is a right or full join, we ban referencing any of the elements in the left scope, as required by SQL:2008. When we encounter a non-lateral subquery, we hide any lateral scope items entirely. Yes, it's weird that non-lateral subqueries pretend lateral scope items don't exit, while subqueries on the RHS of a right or full join *error* if they access a lateral scope item, but that's what SQL demands. The lowering code is amended to always treat inner and left joins as correlated, even if they are not. This is correct, but it may produce worse plans. I haven't investigated the movement of plans in much detail yet. Fix MaterializeInc#6875.
Rework join planning so that we can handle lateral references that are mixed with right and full outer joins. The implementation here largely follows what @asenac laid out in @6875, except that the existing outer scope design was found to be sufficient; that is, there is no concept of a "sibling" scope introduced here. The gist is that we now always plan joins with the RHS as a child scope of the LHS. (Previously, we only created the new scope for joins if we could eagerly notice that the RHS had a lateral element.) If the join is a right or full join, we ban referencing any of the elements in the left scope, as required by SQL:2008. When we encounter a non-lateral subquery, we hide any lateral scope items entirely. Yes, it's weird that non-lateral subqueries pretend lateral scope items don't exit, while subqueries on the RHS of a right or full join *error* if they access a lateral scope item, but that's what SQL demands. The lowering code is amended to always treat inner and left joins as correlated, even if they are not. This is correct, but it may produce worse plans. I haven't investigated the movement of plans in much detail yet. Fix MaterializeInc#6875.
Rework join planning so that we can handle lateral references that are mixed with right and full outer joins. The implementation here largely follows what @asenac laid out in @6875, except that the existing outer scope design was found to be sufficient; that is, there is no concept of a "sibling" scope introduced here. The gist is that we now always plan joins with the RHS as a child scope of the LHS. (Previously, we only created the new scope for joins if we could eagerly notice that the RHS had a lateral element.) If the join is a right or full join, we ban referencing any of the elements in the left scope, as required by SQL:2008. When we encounter a non-lateral subquery, we hide any lateral scope items entirely. Yes, it's weird that non-lateral subqueries pretend lateral scope items don't exit, while subqueries on the RHS of a right or full join *error* if they access a lateral scope item, but that's what SQL demands. The lowering code is amended to always treat inner and left joins as correlated, even if they are not. This is correct, but it may produce worse plans. I haven't investigated the movement of plans in much detail yet. Fix MaterializeInc#6875.
Rework join planning so that we can handle lateral references that are mixed with right and full outer joins. The implementation here largely follows what @asenac laid out in @6875, except that the existing outer scope design was found to be sufficient; that is, there is no concept of a "sibling" scope introduced here. The gist is that we now always plan joins with the RHS as a child scope of the LHS. (Previously, we only created the new scope for joins if we could eagerly notice that the RHS had a lateral element.) If the join is a right or full join, we ban referencing any of the elements in the left scope, as required by SQL:2008. When we encounter a non-lateral subquery, we hide any lateral scope items entirely. Yes, it's weird that non-lateral subqueries pretend lateral scope items don't exit, while subqueries on the RHS of a right or full join *error* if they access a lateral scope item, but that's what SQL demands. Fix MaterializeInc#6875.
Rework join planning so that we can handle lateral references that are mixed with right and full outer joins. The implementation here largely follows what @asenac laid out in @6875, except that the existing outer scope design was found to be sufficient; that is, there is no concept of a "sibling" scope introduced here. The gist is that we now always plan joins with the RHS as a child scope of the LHS. (Previously, we only created the new scope for joins if we could eagerly notice that the RHS had a lateral element.) If the join is a right or full join, we ban referencing any of the elements in the left scope, as required by SQL:2008. When we encounter a non-lateral subquery, we hide any lateral scope items entirely. Yes, it's weird that non-lateral subqueries pretend lateral scope items don't exit, while subqueries on the RHS of a right or full join *error* if they access a lateral scope item, but that's what SQL demands. Fix MaterializeInc#6875.











What version of Materialize are you using?
How did you install Materialize?
What was the issue?
The following query
should be syntactically equivalent to this other one:
However, they result in different raw plan, where the joins happen in different order. The right plan should be the one in the second query, since the comma operator has lower precedence.
This also leads to a name resolution issue whereby the sibling tables in the comma join are seen from the ON clause of the outer join, where they shouldn't be seen. Hence, the following query is accepted while it should fail:
... compared to postgres:
Is the issue reproducible? If so, please provide reproduction instructions.
Please attach any applicable log files.
The text was updated successfully, but these errors were encountered: