
Conversation

@devin-ai-integration
Contributor

What

Fixes bug where UTF-8 multibyte characters in column names (e.g., タイプ, تاريخ الميلاد, अनाप्लान) are incorrectly converted to underscores in Redshift destination.

Reported by: Vai Ignatavicius (@vai-airbyte) in Slack #ask-devin-ai

Root Cause: The StandardNameTransformer base class uses Names.toAlphanumericAndUnderscore(), which applies the regex pattern [^\p{Alnum}_] that only matches ASCII alphanumeric characters. This causes all non-ASCII characters (including valid UTF-8 letters supported by Redshift) to be replaced with underscores.
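The ASCII-only behavior is easy to reproduce. A minimal Java demonstration (the connector code is Kotlin, but the `java.util.regex` semantics are identical): without the `UNICODE_CHARACTER_CLASS` flag, `\p{Alnum}` matches only ASCII letters and digits, so every multibyte character is replaced with an underscore:

```java
public class AsciiRegexDemo {
    public static void main(String[] args) {
        // \p{Alnum} is ASCII-only by default in java.util.regex,
        // so every non-ASCII character is replaced with "_".
        String pattern = "[^\\p{Alnum}_]";
        System.out.println("タイプ".replaceAll(pattern, "_"));          // "___"
        System.out.println("تاريخ الميلاد".replaceAll(pattern, "_"));  // 13 underscores
        System.out.println("column name".replaceAll(pattern, "_"));    // "column_name"
    }
}
```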

Redshift Support: Per AWS documentation, Redshift standard identifiers support UTF-8 multibyte characters (2-4 bytes) in both first and subsequent positions.

Link to Devin run: https://app.devin.ai/sessions/cf144af26da44705bc68817a0e2f9030

How

  1. Override convertStreamName() in RedshiftSQLNameTransformer to implement Redshift-specific identifier rules
  2. Preserve Unicode characters using Character.isLetter() and Character.isLetterOrDigit() instead of ASCII-only regex
  3. Implement Redshift rules:
    • Replace spaces with underscores (standard identifiers don't allow spaces)
    • Allow letters (ASCII or Unicode), digits, underscore, dollar sign
    • Prefix underscore if name starts with digit
    • Enforce 127-byte UTF-8 length limit by tracking byte count per character
    • Apply NFC normalization for consistent representation
    • Use Locale.ROOT for lowercasing to avoid locale-specific issues
  4. Add comprehensive tests for Japanese, Arabic, Hindi, Chinese, Korean, Greek, Cyrillic characters
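The rules above can be sketched as follows. This is a rough Java illustration of the listed behavior, not the PR's Kotlin code from `RedshiftSQLNameTransformer.kt`; the class and method names here are hypothetical, and the exact ordering of steps in the real implementation may differ:

```java
import java.text.Normalizer;
import java.util.Locale;

public final class RedshiftIdentifierSketch {
    private static final int MAX_BYTES = 127; // Redshift identifier byte limit

    // UTF-8 encoded length of a single code point, by range.
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    static String toRedshiftIdentifier(String name) {
        // NFC first, so byte counting sees the composed form; Locale.ROOT
        // avoids surprises like the Turkish dotless-i lowercasing rule.
        String s = Normalizer.normalize(name, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int bytes = 0;
        // Reserve the "_" prefix up front when the name starts with a digit,
        // so the prefix cannot push the result past the byte limit.
        if (!s.isEmpty() && Character.isDigit(s.codePointAt(0))) {
            out.append('_');
            bytes = 1;
        }
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);            // decodes surrogate pairs
            int kept = (cp == ' ') ? '_'          // no spaces in standard identifiers
                    : (Character.isLetterOrDigit(cp) || cp == '_' || cp == '$') ? cp
                    : '_';                        // everything else becomes "_"
            int len = utf8Length(kept);
            if (bytes + len > MAX_BYTES) break;   // enforce the 127-byte limit
            out.appendCodePoint(kept);
            bytes += len;
            i += Character.charCount(cp);         // 2 chars for code points > U+FFFF
        }
        return out.toString();
    }
}
```

Note that under these rules Devanagari combining vowel signs and viramas (category Mc/Mn) are not letters per `Character.isLetterOrDigit`, so `अनाप्लान` maps to `अन_प_ल_न`, which matches the output shown in the reviewer's test below.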

Review Guide

Critical Items (⚠️ High Risk)

  1. RedshiftSQLNameTransformer.kt:92-95 - Surrogate pair handling logic

    • This logic attempts to skip low surrogates after processing high surrogates
    • Potential bug: codePointAt(i) already handles surrogate pairs, so checking isHighSurrogate(char) after may be incorrect
    • Need to verify this correctly handles emoji and other 4-byte UTF-8 characters (code points > U+FFFF)
    • Suggestion: Test with emoji like "😀" or "🎉" to verify
  2. Breaking change assessment

    • Existing users with UTF-8 column names will see names change from _____ to タイプ
    • This could break downstream SQL queries, dbt models, BI dashboards
    • Should this require:
      • Major version bump?
      • breakingChanges entry in metadata.yaml?
      • Migration guide?
      • Feature flag for gradual rollout?
  3. Edge cases in toRedshiftIdentifier():

    • Empty string input → returns empty string (is this valid?)
    • Very long multibyte strings → byte counting logic correctness
    • Mixed 1-byte, 2-byte, 3-byte, 4-byte characters → boundary cases
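Items 1 and 3 above can both be probed in isolation. A Java sketch (the PR code is Kotlin, but the JVM string APIs are identical; the helper names here are hypothetical): `codePointAt` already decodes a full surrogate pair, so advancing by `Character.charCount(cp)` is the only bookkeeping needed, and an extra `isHighSurrogate` check afterwards would double-handle pairs. The same loop shape lets the UTF-8 byte formula be cross-checked against the JDK encoder:

```java
import java.nio.charset.StandardCharsets;

public class Utf8IterationCheck {
    // UTF-8 length of one code point, by range (RFC 3629).
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;      // ASCII
        if (cp < 0x800) return 2;     // e.g. Arabic, Cyrillic
        if (cp < 0x10000) return 3;   // e.g. CJK, Devanagari
        return 4;                     // supplementary plane, e.g. emoji
    }

    // codePointAt decodes a surrogate pair in one step; advancing by
    // Character.charCount(cp) skips the low surrogate automatically.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    // Cross-check the byte formula against the JDK's UTF-8 encoder
    // for mixed 1-, 2-, 3-, and 4-byte input.
    static boolean matchesEncoder(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            total += utf8Length(cp);
            i += Character.charCount(cp);
        }
        return total == s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, "a😀b" has length 4 in UTF-16 units but only 3 code points, since 😀 (U+1F600) is stored as a surrogate pair.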

Test Coverage

  1. RedshiftSQLNameTransformerTest.kt - Comprehensive but untested
    • Tests cover: Japanese, Arabic, Hindi, Chinese, Korean, Greek, Cyrillic
    • Tests for: spaces, leading digits, special chars, 127-byte limit, case conversion
    • Cannot run locally (no Java/Kotlin environment) - may have compilation errors
    • No integration test with actual Redshift database

Implementation Details

  1. Byte length calculation (RedshiftSQLNameTransformer.kt:40-46)

    • Verify UTF-8 byte length formula is correct for all code point ranges
    • Edge case: What if normalized form changes byte length?
  2. NFC normalization (line 28)

    • Ensures consistent representation (e.g., "é" vs "e" + combining accent)
    • Verify this doesn't inadvertently change characters in unexpected ways
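The normalization question in item 2 is real: NFC can change the UTF-8 byte length, which is why the sketch normalizes before counting bytes. A quick Java check:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NfcByteLengthDemo {
    public static void main(String[] args) {
        String decomposed = "e\u0301"; // "e" + combining acute accent
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        // NFC collapses the pair into the single code point U+00E9 ("é")...
        System.out.println(composed.equals("\u00E9")); // true
        // ...which shrinks the UTF-8 encoding from 3 bytes to 2.
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_8).length); // 3
        System.out.println(composed.getBytes(StandardCharsets.UTF_8).length);   // 2
    }
}
```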

User Impact

Positive

  • Users can now use UTF-8 column names (タイプ, تاريخ الميلاد, अनाप्लान, etc.) in their source data and have them preserved in Redshift instead of converted to underscores
  • Improves data quality and readability for international users

Negative (Breaking Change)

  • Existing users with UTF-8 column names in source data will see those column names change in Redshift:
    • Before: ___, _____________, ________ (one underscore per source character)
    • After: タイプ, تاريخ_الميلاد, अनाप्लान
  • This will break:
    • SQL queries that reference old underscore-only column names
    • dbt models with hardcoded column references
    • BI dashboards and reports
    • Any downstream data consumers expecting the old column names

Migration Required

Users affected by this change will need to:

  1. Update all SQL queries referencing the old column names
  2. Update dbt models
  3. Update BI dashboard configurations
  4. Potentially perform a full refresh to recreate tables with new column names

Can this PR be safely reverted and rolled back?

  • NO ❌

Reason: Once deployed, Redshift tables will be created with UTF-8 column names. Reverting would cause:

  • New syncs to fail (column name mismatch)
  • Tables to require dropping and recreating
  • Data loss if users don't have backups

Recommendation: Consider making this opt-in via feature flag before making it default behavior.

…mn names

- Override convertStreamName in RedshiftSQLNameTransformer to properly handle Unicode characters
- Implement Redshift-specific identifier rules per AWS documentation
- Support UTF-8 multibyte characters (2-4 bytes) in column names
- Replace spaces with underscores (standard identifiers don't allow spaces)
- Enforce 127-byte UTF-8 length limit
- Add comprehensive unit tests for various Unicode scripts (Japanese, Arabic, Hindi, Chinese, Korean, Greek, Cyrillic)
- Fix bug where characters like タイプ, تاريخ الميلاد, अनाप्लान were being converted to underscores

Fixes issue reported by Vai Ignatavicius where special character column headers were getting converted to underscores at destination.

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@devin-ai-integration
Contributor Author

Original prompt from Vai
Received message in Slack channel #ask-devin-ai:

@Devin I have user with a Destination: Redshift (Connector version = v3.5.3). Issue is that special character column headers like “タイプ“, “تاريخ الميلاد“, “अनाप्लान” are getting converted as Underscores (“_____“) at destination.

From [our docs](https://docs.airbyte.com/integrations/destinations/redshift#notes-about-redshift-naming-conventions), that's taken from the [Redshift docs](https://docs.aws.amazon.com/redshift/latest/dg/r_names.html):
Standard Identifiers
• Begin with an ASCII single-byte alphabetic character or underscore character, *or a UTF-8 multibyte character two to four bytes long.*
• Subsequent characters can be ASCII single-byte alphanumeric characters, underscores, or dollar signs, or UTF-8 multibyte characters two to four bytes long.
• Be between 1 and 127 bytes in length, not including quotation marks for delimited identifiers.
• Contain no quotation marks and no spaces.
`タイプ` - All 3-byte UTF-8 characters
`تاريخ الميلاد` - All 2-byte UTF-8 characters
`अनाप्लान` - All 3-byte UTF-8 characters

Is this a connector bug?
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1761906324553009?thread_ts=1761906324.553009
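The byte-width claims in the quoted examples check out and can be verified directly (a quick standalone check, not part of the PR):

```java
import java.nio.charset.StandardCharsets;

public class ByteWidthCheck {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // タイプ: 3 Katakana characters at 3 bytes each in UTF-8.
        System.out.println(utf8Bytes("タイプ")); // 9
        // تاريخ الميلاد: 12 Arabic letters at 2 bytes each, plus one ASCII space.
        System.out.println(utf8Bytes("تاريخ الميلاد")); // 25
    }
}
```

Both names are well under the 127-byte limit, so only the character-class handling (not truncation) was at fault here.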
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring
@github-actions
Contributor

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Helpful Resources

PR Slash Commands

Airbyte Maintainers (that's you!) can execute the following slash commands on your PR:

  • /format-fix - Fixes most formatting issues.
  • /bump-version - Bumps connector versions.
    • You can specify a custom changelog by passing changelog. Example: /bump-version changelog="My cool update"
    • Leaving the changelog arg blank will auto-populate the changelog from the PR title.
  • /run-cat-tests - Runs legacy CAT tests (Connector Acceptance Tests)
  • /build-connector-images - Builds and publishes a pre-release docker image for the modified connector(s).
  • JVM connectors:
    • /update-connector-cdk-version connector=<CONNECTOR_NAME> - Updates the specified connector to the latest CDK version.
      Example: /update-connector-cdk-version connector=destination-bigquery
    • /bump-bulk-cdk-version type=patch changelog='foo' - Bump the Bulk CDK's version. type can be major/minor/patch.
  • Python connectors:
    • /poe connector source-example lock - Run the Poe lock task on the source-example connector, committing the results back to the branch.
    • /poe source example lock - Alias for /poe connector source-example lock.
    • /poe source example use-cdk-branch my/branch - Pin the source-example CDK reference to the branch name specified.
    • /poe source example use-cdk-latest - Update the source-example CDK dependency to the latest available version.


@github-actions
Contributor

github-actions bot commented Oct 31, 2025

destination-redshift Connector Test Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit fa270e0.

♻️ This comment has been updated with latest results.

- Replace for loop with while loop using Character.charCount() to properly handle surrogate pairs (4-byte UTF-8 characters like emoji)
- Use codePoint comparisons instead of char comparisons for consistency
- Remove trailing whitespace from all lines to pass format check

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@devin-ai-integration
Contributor Author

/bump-version type=patch

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@devin-ai-integration
Contributor Author

/format-fix

- Fix KDoc comment formatting (trailing * without space)
- Wrap long lines in KDoc comments
- Format when expression with assignment on separate line
- Split long conditional statements across multiple lines
- Format multi-line assertions in tests

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@vai-airbyte
Contributor

Tested this with dev connector 3.5.3-dev.fa270e0383 and it worked well:

id|タイプ   |regular_name |अन_प_ल_न|تاريخ_الميلاد|_airbyte_raw_id                     |_airbyte_extracted_at        |_airbyte_generation_id|_airbyte_meta                    |
--+------+-------------+--------+-------------+------------------------------------+-----------------------------+----------------------+---------------------------------+
 1|Type A|Normal Column|Value 1 |1990-01-15   |49cdac8f-3fd7-4e4d-b704-39dadd9078dc|2025-10-31 15:56:32.714 +0000|                     2|{"changes":[],"sync_id":55720724}|
 3|Type C|Normal Column|Value 3 |1992-11-30   |b13d68d1-5296-44ea-9b38-0dd92fe8f281|2025-10-31 15:56:32.714 +0000|                     2|{"changes":[],"sync_id":55720724}|
 2|Type B|Normal Column|Value 2 |1985-05-20   |616af9d1-6f29-4fb9-9cf1-e7c5d80cea96|2025-10-31 15:56:32.714 +0000|                     2|{"changes":[],"sync_id":55720724}|
@vai-airbyte vai-airbyte removed their request for review October 31, 2025 16:11