@Pajaraja
What changes were proposed in this pull request?

Redefine the output of UnionLoop and UnionLoopRef to use new expression IDs.

Why are the changes needed?

Currently, rCTEs misbehave when the anchor selects the same column more than once: the duplicated output columns share an expression ID, so references in the recursion can resolve to the wrong column.
For example, this rCTE:

WITH RECURSIVE tmp(x) AS (
values (1), (2), (3), (4), (5)
), rcte(x, y) AS (
SELECT x, x FROM tmp WHERE x = 1
UNION ALL
SELECT x + 1, x FROM rcte WHERE x < 5
)
SELECT * FROM rcte;

currently returns:

1 1
2 2
3 3
4 4
5 5

instead of the expected:

1 1
2 1
3 2
4 3
5 4
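
To illustrate why distinct expression IDs matter here, the following is a toy sketch, not Spark's implementation. `Attr`, `ExprIdDemo`, `withFreshIds`, and `bind` are hypothetical names introduced only for this example; the sketch assumes that columns are looked up by expression ID, so two columns with the same ID cannot be told apart.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Attribute/ExprId classes.
final case class Attr(name: String, exprId: Long)

object ExprIdDemo {
  // Give every output column its own ID, similar in spirit to calling
  // newInstance() on each anchor attribute.
  def withFreshIds(output: Seq[Attr], firstId: Long): Seq[Attr] =
    output.zipWithIndex.map { case (a, i) => a.copy(exprId = firstId + i) }

  // Key a row's values by the expression ID of the column that holds them.
  def bind(output: Seq[Attr], row: Seq[Int]): Map[Long, Int] =
    output.map(_.exprId).zip(row).toMap

  def main(args: Array[String]): Unit = {
    val x = Attr("x", 1L)
    val anchorOutput = Seq(x, x) // SELECT x, x: both columns carry ID 1
    val row = Seq(2, 1)          // the row (2, 1) from the expected result

    // With a shared ID the two columns collapse into one map entry.
    println(bind(anchorOutput, row).size)                     // 1
    // With fresh IDs each column stays addressable.
    println(bind(withFreshIds(anchorOutput, 100L), row).size) // 2
  }
}
```

In this model, the shared ID leaves only one of the two column values reachable, which is the kind of ambiguity the new expression IDs are meant to remove.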

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests in golden file cte-recursion.sql.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label May 28, 2025
cteDef.id,
anchor,
rewriteRecursiveCTERefs(recursion, anchor, cteDef.id, None),
anchor.output.map(_.newInstance().exprId),
Contributor

nit: we can add a new apply method in object UnionLoop, which takes the same parameters as before, and pass anchor.output.map(_.newInstance().exprId) additionally to construct UnionLoop

Contributor Author

Just adding an apply method in object UnionLoop doesn't work because of default arguments. The compiler reports:
in object UnionLoop, multiple overloaded alternatives of method apply define default arguments.
There are other ways to make this work, but I don't think any of them avoids modifying something in ResolveWithCTE.scala.
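
The compiler restriction quoted above can be reproduced in isolation. The sketch below uses a hypothetical `Loop` case class with made-up fields; it is not the real `UnionLoop` signature, and the `withFreshIds` factory is only one assumed way around the clash, not what the PR does.

```scala
// Hypothetical stand-in for a plan node; NOT Spark's UnionLoop.
final case class Loop(
    id: Long,
    anchor: Seq[String],
    recursion: Seq[String],
    outputIds: Seq[Long],
    limit: Option[Int] = None) // the synthesized apply carries this default

object Loop {
  // Uncommenting this overload fails to compile in Scala 2, because two
  // alternatives of `apply` would then define default arguments:
  //
  // def apply(id: Long, anchor: Seq[String], recursion: Seq[String],
  //           limit: Option[Int] = None): Loop =
  //   Loop(id, anchor, recursion, anchor.indices.map(i => 1000L + i), limit)

  // A differently named factory may keep its default without clashing.
  // The IDs generated here are placeholders for illustration.
  def withFreshIds(
      id: Long,
      anchor: Seq[String],
      recursion: Seq[String],
      limit: Option[Int] = None): Loop =
    Loop(id, anchor, recursion, anchor.indices.map(i => 1000L + i), limit)
}

object DefaultArgsDemo {
  def main(args: Array[String]): Unit = {
    val loop = Loop.withFreshIds(1L, Seq("x", "x"), Seq("x + 1", "x"))
    println(loop.outputIds) // one generated ID per anchor column
  }
}
```

A named factory still changes every call site that constructs the node, which is consistent with the point that some change at the caller is hard to avoid.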

@cloud-fan

thanks, merging to master!

@cloud-fan cloud-fan closed this in 5201a66 May 30, 2025
yhuang-db pushed a commit to yhuang-db/spark that referenced this pull request Jun 9, 2025
…the anchor output is duplicated

Closes apache#51041 from Pajaraja/pavle-martinovic_data/UnionLoopOutput.

Lead-authored-by: pavle-martinovic_data <[email protected]>
Co-authored-by: Pavle Martinovic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>