Skip to content

Conversation

@deepyaman
Copy link
Contributor

@deepyaman deepyaman commented Sep 16, 2025

Description of changes

Implement Backend.upsert() using sqlglot.expressions.merge() under the hood. Upsert support is very important, especially for data engineering use cases.

Starting with a most basic implementation, including only supporting one join column. I think this could be expanded to support a list without much effort.

MERGE INTO support is limited. DuckDB only added support for MERGE statements earlier today in 1.4.0, and many other backends don't support it. However, it seems like the more standard/correct approach for supporting upserts, and it doesn't require merge keys defined ahead of time on tables.

Backends that work:

  • DuckDB (from 1.4.0)
  • Flink
  • Oracle (currently using a hack to work around "AS" getting added to MERGE statement)
  • MS SQL (currently throwing a ; onto the end of every statement 😅)
  • Postgres

Should work, need help to test:

  • Databricks
  • Snowflake
  • BigQuery

Backends that don't work:

  • PySpark ("MERGE INTO TABLE is not supported temporarily.")
  • Clickhouse
  • DataFusion
  • SQLite (supports the nonstandard UPSERT statement)
  • Impala ("The MERGE statement is only supported for Iceberg tables.")
  • MySQL
  • Polars
  • RisingWave
  • Athena ("MERGE INTO is transactional and is supported only for Apache Iceberg tables in Athena engine version 3.")
  • Trino ("connector does not support modifying table rows")

Issues closed

@github-actions github-actions bot added tests Issues or PRs related to tests sql Backends that generate SQL labels Sep 16, 2025
@deepyaman deepyaman added the feature Features or general enhancements label Sep 16, 2025
@github-actions github-actions bot added the oracle The Oracle backend label Sep 17, 2025
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 5a27b34 to 21994eb Compare September 17, 2025 23:14
@deepyaman deepyaman requested a review from cpcloud September 18, 2025 00:52
@deepyaman
Copy link
Contributor Author

@cpcloud Requesting review for initial feedback while I try to improve backend support; I assume I could always mark a lot of them notyet if the approach is correct but some translations just need work.

@github-actions github-actions bot added the mssql The Microsoft SQL Server backend label Sep 18, 2025
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 46b0e4a to 8e47b79 Compare September 18, 2025 04:54
query = query.sql(self.dialect)

if "MERGE" in query:
query = f"{query};"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm honestly not sure how to better do this; I wasn't able to figure out how to add a semicolon to an expression in SQLGlot.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't ideal but I think is fine with me. The "cleaner" way would be to override the _build_upsert_from_table method, but the amount of boilerplate for that feels not worth it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or actually, is it not possible to just always stick on a ; at the end of _build_upsert_from_table() for every backend? or does that break on some backends?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or actually, is it not possible to just always stick on a ; at the end of _build_upsert_from_table() for every backend?

I don't actually know how to stick a semicolon on the end of a SQLGlot expression. 😅 If _build_upsert_from_table() was handling the conversion to SQL, that would have been simple enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I understand.

huh, is this a bug in the upstream mssql engine? Can you confirm that only merge statements require trailing semicolons, but other sql queries/statements do not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, semicolons are only enforced for certain statements, including MERGE.

There's also a link in that answer to the official documentation stating that, at some point in the future, they will require terminators for every statement, but that clearly hasn't happened in the nearly 10 years since that answer was given.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say either:

  1. add a comment saying why only MERGE gets a semicolon, in addition linking to that stackoverflow.
  2. just always append a semicolor

Slight preference for 2 due to the less cognitive overhead for future developers

@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 2e1403d to b40103e Compare September 19, 2025 12:57
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 25061ad to d052b43 Compare September 20, 2025 16:45
@github-actions github-actions bot added the dependencies Issues or PRs related to dependencies label Sep 20, 2025
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch 4 times, most recently from d2b5922 to 8b0d6fe Compare September 21, 2025 02:57
@github-actions github-actions bot added the bigquery The BigQuery backend label Sep 21, 2025
@deepyaman
Copy link
Contributor Author

@cpcloud I'm temporarily cherry-picked the changes from #11636 and the updates to xfail/xpass two DuckDB tests in order to get DuckDB and Oracle tests working (need the newer DuckDB and SQLGlot releases). All of these changes are in the last 3 commits:

With that, all of the functionality to implement Backend.upsert() is working and ready for review. The remaining issues all also exist in #11637, so I won't duplicate solving them. Whenever you do merge #11637, I should be able to back out the last 3 commits and rebase this on top.

@deepyaman
Copy link
Contributor Author

@cpcloud I'm temporarily cherry-picked the changes from #11636 and the updates to xfail/xpass two DuckDB tests in order to get DuckDB and Oracle tests working (need the newer DuckDB and SQLGlot releases). All of these changes are in the last 3 commits:

With that, all of the functionality to implement Backend.upsert() is working and ready for review. The remaining issues all also exist in #11637, so I won't duplicate solving them. Whenever you do merge #11637, I should be able to back out the last 3 commits and rebase this on top.

@cpcloud FYI I've done this and everything is passing! Should be ready to go.

@deepyaman
Copy link
Contributor Author

@NickCrews Addressed all your comments (except didn't add an xfail test, but replied to your suggestion). Please feel free to take another look!

@NickCrews
Copy link
Contributor

Thanks @deepyaman, those tweaks all look great. I came up with several more test cases I'd love to see. Sorry for the continually moving target.

@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from 5f12ee5 to 10be51c Compare October 26, 2025 14:52
@deepyaman
Copy link
Contributor Author

Thanks @deepyaman, those tweaks all look great. I came up with several more test cases I'd love to see. Sorry for the continually moving target.

@NickCrews No worries, thanks for the revised review! I believe I've largely addressed your updated feedback.

@deepyaman deepyaman requested a review from NickCrews October 27, 2025 05:17
@deepyaman deepyaman force-pushed the feat/api/backend-upsert branch from a2458ce to 613dce6 Compare October 28, 2025 13:40

@pytest.fixture
def test_employee_data_3():
import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already imported at the top level

Copy link
Contributor

@NickCrews NickCrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two tiny nits:

and with those LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bigquery The BigQuery backend dependencies Issues or PRs related to dependencies feature Features or general enhancements mssql The Microsoft SQL Server backend oracle The Oracle backend sql Backends that generate SQL tests Issues or PRs related to tests

3 participants