fix(destination-redshift): Support UTF-8 multibyte characters in column names #69112
…mn names
- Override convertStreamName in RedshiftSQLNameTransformer to properly handle Unicode characters
- Implement Redshift-specific identifier rules per AWS documentation
- Support UTF-8 multibyte characters (2-4 bytes) in column names
- Replace spaces with underscores (standard identifiers don't allow spaces)
- Enforce 127-byte UTF-8 length limit
- Add comprehensive unit tests for various Unicode scripts (Japanese, Arabic, Hindi, Chinese, Korean, Greek, Cyrillic)
- Fix bug where characters like タイプ, تاريخ الميلاد, अनाप्लान were being converted to underscores
Fixes issue reported by Vai Ignatavicius where special character column headers were getting converted to underscores at destination.
Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
- Replace for loop with while loop using Character.charCount() to properly handle surrogate pairs (4-byte UTF-8 characters like emoji)
- Use codePoint comparisons instead of char comparisons for consistency
- Remove trailing whitespace from all lines to pass format check
Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
/bump-version type=patch
Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
/format-fix
- Fix KDoc comment formatting (trailing * without space)
- Wrap long lines in KDoc comments
- Format when expression with assignment on separate line
- Split long conditional statements across multiple lines
- Format multi-line assertions in tests
Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
Tested this with dev connector |
What
Fixes a bug where UTF-8 multibyte characters in column names (e.g., タイプ, تاريخ الميلاد, अनाप्लान) are incorrectly converted to underscores in the Redshift destination.
Reported by: Vai Ignatavicius (@vai-airbyte) in Slack #ask-devin-ai
Root Cause: The StandardNameTransformer base class uses Names.toAlphanumericAndUnderscore(), which applies the regex pattern [^\p{Alnum}_]. In Java's default regex mode, \p{Alnum} matches only ASCII letters and digits, so every non-ASCII character (including valid UTF-8 letters supported by Redshift) is replaced with an underscore.
Redshift Support: Per AWS documentation, Redshift standard identifiers support UTF-8 multibyte characters (2-4 bytes) in both first and subsequent positions.
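To make the root cause concrete, here is a minimal Kotlin sketch contrasting the ASCII-only regex with a code-point-based check. Both function names are hypothetical and exist only for illustration; this is not the code in Names or RedshiftSQLNameTransformer, just a sketch assuming the behavior described above.

```kotlin
// Mirrors the pattern quoted in the root cause. In Java/Kotlin regexes,
// \p{Alnum} is ASCII-only unless UNICODE_CHARACTER_CLASS is enabled.
// (Hypothetical helper name, for illustration only.)
fun asciiOnlySanitize(name: String): String =
    name.replace(Regex("[^\\p{Alnum}_]"), "_")

// Unicode-aware alternative: walk the string by code point and keep anything
// that Character.isLetterOrDigit accepts, plus the underscore.
// (Hypothetical helper name; an assumption about the general approach.)
fun unicodeAwareSanitize(name: String): String {
    val out = StringBuilder()
    var i = 0
    while (i < name.length) {
        val cp = name.codePointAt(i)
        if (Character.isLetterOrDigit(cp) || cp == '_'.code) {
            out.appendCodePoint(cp)
        } else {
            out.append('_')
        }
        // Advance by 1 or 2 chars so surrogate pairs are consumed as one unit.
        i += Character.charCount(cp)
    }
    return out.toString()
}
```

Calling asciiOnlySanitize("タイプ") replaces all three Katakana characters with underscores, whereas unicodeAwareSanitize("タイプ") leaves them untouched. Note that combining marks (such as the vowel signs in Devanagari text) are not classified as letters by Character.isLetterOrDigit, so a real implementation would likely need an additional rule for them.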
Link to Devin run: https://app.devin.ai/sessions/cf144af26da44705bc68817a0e2f9030
How
- Override convertStreamName() in RedshiftSQLNameTransformer to implement Redshift-specific identifier rules
- Use Character.isLetter() and Character.isLetterOrDigit() instead of an ASCII-only regex
- Use Locale.ROOT for lowercasing to avoid locale-specific issues
Review Guide
Critical Items (⚠️ High Risk)
RedshiftSQLNameTransformer.kt:92-95 - Surrogate pair handling logic
- codePointAt(i) already handles surrogate pairs, so checking isHighSurrogate(char) afterwards may be incorrect
Breaking change assessment
- _____ → タイプ
- breakingChanges entry in metadata.yaml?
Edge cases in toRedshiftIdentifier():
Test Coverage
RedshiftSQLNameTransformerTest.kt - Comprehensive but untested
Implementation Details
Byte length calculation (RedshiftSQLNameTransformer.kt:40-46)
NFC normalization (line 28)
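As a rough illustration of how NFC normalization and the 127-byte limit can fit together, the following Kotlin sketch normalizes the input and then truncates on code-point boundaries so that no multibyte character is split. The names MAX_IDENTIFIER_BYTES, utf8Length, and truncateToUtf8Bytes are hypothetical, and the sketch assumes well-formed input (no unpaired surrogates); it is not the code at the referenced lines.

```kotlin
import java.text.Normalizer

// Limit taken from the PR description (127 bytes of UTF-8); the constant
// name itself is an assumption for this sketch.
const val MAX_IDENTIFIER_BYTES = 127

// Number of bytes one code point occupies when encoded as UTF-8.
fun utf8Length(codePoint: Int): Int = when {
    codePoint < 0x80 -> 1
    codePoint < 0x800 -> 2
    codePoint < 0x10000 -> 3
    else -> 4
}

// Normalizes to NFC, then keeps whole code points until the byte budget
// would be exceeded. Hypothetical helper, not the connector's implementation.
fun truncateToUtf8Bytes(input: String, maxBytes: Int = MAX_IDENTIFIER_BYTES): String {
    val normalized = Normalizer.normalize(input, Normalizer.Form.NFC)
    val out = StringBuilder()
    var usedBytes = 0
    var i = 0
    while (i < normalized.length) {
        val cp = normalized.codePointAt(i)
        val size = utf8Length(cp)
        if (usedBytes + size > maxBytes) break
        out.appendCodePoint(cp)
        usedBytes += size
        // charCount is 2 for supplementary characters (surrogate pairs).
        i += Character.charCount(cp)
    }
    return out.toString()
}
```

Normalizing before measuring matters because a decomposed sequence such as "e" followed by a combining acute accent occupies more bytes than its composed NFC form, so the order of the two steps can change where the cut falls.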
User Impact
Positive
Negative (Breaking Change)
- _____, _____________, ___ → タイプ, تاريخ_الميلاد, अनाप्लान
Migration Required
Users affected by this change will need to:
Can this PR be safely reverted and rolled back?
Reason: Once deployed, Redshift tables will be created with UTF-8 column names. Reverting would cause:
Recommendation: Consider making this opt-in via feature flag before making it default behavior.
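One possible shape for such an opt-in is sketched below in Kotlin. Everything here is hypothetical: the class name, the allowUnicodeIdentifiers flag, and the two injected functions are assumptions made for illustration and do not correspond to existing connector configuration or code.

```kotlin
// Hypothetical opt-in wrapper: the legacy behavior stays the default, and the
// Unicode-aware path is used only when the flag is explicitly enabled.
class FlaggedNameTransformer(
    private val allowUnicodeIdentifiers: Boolean = false,
    private val legacy: (String) -> String,
    private val unicodeAware: (String) -> String,
) {
    fun convertStreamName(input: String): String =
        if (allowUnicodeIdentifiers) unicodeAware(input) else legacy(input)
}
```

With the flag off, existing syncs would keep producing the same column names, which avoids the rollback problem described above; users who want UTF-8 column names could enable the flag deliberately and plan their own migration.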