
Conversation

@devin-ai-integration
Contributor

What

Fixes bug where UTF-8 multibyte characters in column names (e.g., タイプ, تاريخ الميلاد, अनाप्लान) are incorrectly converted to underscores in Redshift destination.

Reported by: Vai Ignatavicius (@vai-airbyte) in Slack #ask-devin-ai

Root Cause: The StandardNameTransformer base class uses Names.toAlphanumericAndUnderscore(), which applies the regex pattern [^\p{Alnum}_] that only matches ASCII alphanumeric characters. This causes all non-ASCII characters (including valid UTF-8 letters supported by Redshift) to be replaced with underscores.
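The ASCII-only behavior is easy to reproduce. A minimal Java demonstration (the connector code is Kotlin, but the `java.util.regex` semantics are identical): without the `UNICODE_CHARACTER_CLASS` flag, `\p{Alnum}` matches only ASCII letters and digits, so every multibyte character is replaced with an underscore:

```java
public class AsciiRegexDemo {
    public static void main(String[] args) {
        // \p{Alnum} is ASCII-only by default in java.util.regex,
        // so every non-ASCII character is replaced with "_".
        String pattern = "[^\\p{Alnum}_]";
        System.out.println("タイプ".replaceAll(pattern, "_"));          // "___"
        System.out.println("تاريخ الميلاد".replaceAll(pattern, "_"));  // 13 underscores
        System.out.println("column name".replaceAll(pattern, "_"));    // "column_name"
    }
}
```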

Redshift Support: Per AWS documentation, Redshift standard identifiers support UTF-8 multibyte characters (2-4 bytes) in both first and subsequent positions.

Link to Devin run: https://app.devin.ai/sessions/cf144af26da44705bc68817a0e2f9030

How

  1. Override convertStreamName() in RedshiftSQLNameTransformer to implement Redshift-specific identifier rules
  2. Preserve Unicode characters using Character.isLetter() and Character.isLetterOrDigit() instead of ASCII-only regex
  3. Implement Redshift rules:
    • Replace spaces with underscores (standard identifiers don't allow spaces)
    • Allow letters (ASCII or Unicode), digits, underscore, dollar sign
    • Prefix underscore if name starts with digit
    • Enforce 127-byte UTF-8 length limit by tracking byte count per character
    • Apply NFC normalization for consistent representation
    • Use Locale.ROOT for lowercasing to avoid locale-specific issues
  4. Add comprehensive tests for Japanese, Arabic, Hindi, Chinese, Korean, Greek, Cyrillic characters
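The rules above can be sketched as follows. This is a rough Java illustration of the listed behavior, not the PR's Kotlin code from `RedshiftSQLNameTransformer.kt`; the class and method names here are hypothetical, and the exact ordering of steps in the real implementation may differ:

```java
import java.text.Normalizer;
import java.util.Locale;

public final class RedshiftIdentifierSketch {
    private static final int MAX_BYTES = 127; // Redshift identifier byte limit

    // UTF-8 encoded length of a single code point, by range.
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    static String toRedshiftIdentifier(String name) {
        // NFC first, so byte counting sees the composed form; Locale.ROOT
        // avoids surprises like the Turkish dotless-i lowercasing rule.
        String s = Normalizer.normalize(name, Normalizer.Form.NFC).toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int bytes = 0;
        // Reserve the "_" prefix up front when the name starts with a digit,
        // so the prefix cannot push the result past the byte limit.
        if (!s.isEmpty() && Character.isDigit(s.codePointAt(0))) {
            out.append('_');
            bytes = 1;
        }
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);            // decodes surrogate pairs
            int kept = (cp == ' ') ? '_'          // no spaces in standard identifiers
                    : (Character.isLetterOrDigit(cp) || cp == '_' || cp == '$') ? cp
                    : '_';                        // everything else becomes "_"
            int len = utf8Length(kept);
            if (bytes + len > MAX_BYTES) break;   // enforce the 127-byte limit
            out.appendCodePoint(kept);
            bytes += len;
            i += Character.charCount(cp);         // 2 chars for code points > U+FFFF
        }
        return out.toString();
    }
}
```

Note that under these rules Devanagari combining vowel signs and viramas (category Mc/Mn) are not letters per `Character.isLetterOrDigit`, so `अनाप्लान` maps to `अन_प_ल_न`, which matches the output shown in the reviewer's test below.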

Review Guide

Critical Items (⚠️ High Risk)

  1. RedshiftSQLNameTransformer.kt:92-95 - Surrogate pair handling logic

    • This logic attempts to skip low surrogates after processing high surrogates
    • Potential bug: codePointAt(i) already handles surrogate pairs, so checking isHighSurrogate(char) after may be incorrect
    • Need to verify this correctly handles emoji and other 4-byte UTF-8 characters (code points > U+FFFF)
    • Suggestion: Test with emoji like "😀" or "🎉" to verify
  2. Breaking change assessment

    • Existing users with UTF-8 column names will see names change from _____ to タイプ
    • This could break downstream SQL queries, dbt models, BI dashboards
    • Should this require:
      • Major version bump?
      • breakingChanges entry in metadata.yaml?
      • Migration guide?
      • Feature flag for gradual rollout?
  3. Edge cases in toRedshiftIdentifier():

    • Empty string input → returns empty string (is this valid?)
    • Very long multibyte strings → byte counting logic correctness
    • Mixed 1-byte, 2-byte, 3-byte, 4-byte characters → boundary cases
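Items 1 and 3 above can both be probed in isolation. A Java sketch (the PR code is Kotlin, but the JVM string APIs are identical; the helper names here are hypothetical): `codePointAt` already decodes a full surrogate pair, so advancing by `Character.charCount(cp)` is the only bookkeeping needed, and an extra `isHighSurrogate` check afterwards would double-handle pairs. The same loop shape lets the UTF-8 byte formula be cross-checked against the JDK encoder:

```java
import java.nio.charset.StandardCharsets;

public class Utf8IterationCheck {
    // UTF-8 length of one code point, by range (RFC 3629).
    static int utf8Length(int cp) {
        if (cp < 0x80) return 1;      // ASCII
        if (cp < 0x800) return 2;     // e.g. Arabic, Cyrillic
        if (cp < 0x10000) return 3;   // e.g. CJK, Devanagari
        return 4;                     // supplementary plane, e.g. emoji
    }

    // codePointAt decodes a surrogate pair in one step; advancing by
    // Character.charCount(cp) skips the low surrogate automatically.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    // Cross-check the byte formula against the JDK's UTF-8 encoder
    // for mixed 1-, 2-, 3-, and 4-byte input.
    static boolean matchesEncoder(String s) {
        int total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            total += utf8Length(cp);
            i += Character.charCount(cp);
        }
        return total == s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, "a😀b" has length 4 in UTF-16 units but only 3 code points, since 😀 (U+1F600) is stored as a surrogate pair.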

Test Coverage

  1. RedshiftSQLNameTransformerTest.kt - Comprehensive but untested
    • Tests cover: Japanese, Arabic, Hindi, Chinese, Korean, Greek, Cyrillic
    • Tests for: spaces, leading digits, special chars, 127-byte limit, case conversion
    • Cannot run locally (no Java/Kotlin environment) - may have compilation errors
    • No integration test with actual Redshift database

Implementation Details

  1. Byte length calculation (RedshiftSQLNameTransformer.kt:40-46)

    • Verify UTF-8 byte length formula is correct for all code point ranges
    • Edge case: What if normalized form changes byte length?
  2. NFC normalization (line 28)

    • Ensures consistent representation (e.g., "é" vs "e" + combining accent)
    • Verify this doesn't inadvertently change characters in unexpected ways
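The normalization question in item 2 is real: NFC can change the UTF-8 byte length, which is why the sketch normalizes before counting bytes. A quick Java check:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NfcByteLengthDemo {
    public static void main(String[] args) {
        String decomposed = "e\u0301"; // "e" + combining acute accent
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);

        // NFC collapses the pair into the single code point U+00E9 ("é")...
        System.out.println(composed.equals("\u00E9")); // true
        // ...which shrinks the UTF-8 encoding from 3 bytes to 2.
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_8).length); // 3
        System.out.println(composed.getBytes(StandardCharsets.UTF_8).length);   // 2
    }
}
```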

User Impact

Positive

  • Users can now use UTF-8 column names (タイプ, تاريخ الميلاد, अनाप्लान, etc.) in their source data and have them preserved in Redshift instead of converted to underscores
  • Improves data quality and readability for international users

Negative (Breaking Change)

  • Existing users with UTF-8 column names in source data will see those column names change in Redshift:
    • Before: ___, _____________, ________ (one underscore per source character)
    • After: タイプ, تاريخ_الميلاد, अनाप्लान
  • This will break:
    • SQL queries that reference old underscore-only column names
    • dbt models with hardcoded column references
    • BI dashboards and reports
    • Any downstream data consumers expecting the old column names

Migration Required

Users affected by this change will need to:

  1. Update all SQL queries referencing the old column names
  2. Update dbt models
  3. Update BI dashboard configurations
  4. Potentially perform a full refresh to recreate tables with new column names

Can this PR be safely reverted and rolled back?

  • NO ❌

Reason: Once deployed, Redshift tables will be created with UTF-8 column names. Reverting would cause:

  • New syncs to fail (column name mismatch)
  • Tables to require dropping and recreating
  • Data loss if users don't have backups

Recommendation: Consider making this opt-in via feature flag before making it default behavior.

…mn names

- Override convertStreamName in RedshiftSQLNameTransformer to properly handle Unicode characters
- Implement Redshift-specific identifier rules per AWS documentation
- Support UTF-8 multibyte characters (2-4 bytes) in column names
- Replace spaces with underscores (standard identifiers don't allow spaces)
- Enforce 127-byte UTF-8 length limit
- Add comprehensive unit tests for various Unicode scripts (Japanese, Arabic, Hindi, Chinese, Korean, Greek, Cyrillic)
- Fix bug where characters like タイプ, تاريخ الميلاد, अनाप्लान were being converted to underscores

Fixes issue reported by Vai Ignatavicius where special character column headers were getting converted to underscores at destination.

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@devin-ai-integration
Contributor Author

Original prompt from Vai
Received message in Slack channel #ask-devin-ai:

@Devin I have user with a Destination: Redshift (Connector version = v3.5.3). Issue is that special character column headers like “タイプ“, “تاريخ الميلاد“, “अनाप्लान” are getting converted as Underscores (“_____“) at destination.

From [our docs](https://docs.airbyte.com/integrations/destinations/redshift#notes-about-redshift-naming-conventions), that's taken from the [Redshift docs](https://docs.aws.amazon.com/redshift/latest/dg/r_names.html):
Standard Identifiers
• Begin with an ASCII single-byte alphabetic character or underscore character, *or a UTF-8 multibyte character two to four bytes long.*
• Subsequent characters can be ASCII single-byte alphanumeric characters, underscores, or dollar signs, or UTF-8 multibyte characters two to four bytes long.
• Be between 1 and 127 bytes in length, not including quotation marks for delimited identifiers.
• Contain no quotation marks and no spaces.
`タイプ` - All 3-byte UTF-8 characters
`تاريخ الميلاد` - All 2-byte UTF-8 characters
`अनाप्लान` - All 3-byte UTF-8 characters

Is this a connector bug?
Thread URL: https://airbytehq-team.slack.com/archives/C08BHPUMEPJ/p1761906324553009?thread_ts=1761906324.553009
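The byte-width claims in the quoted examples check out and can be verified directly (a quick standalone check, not part of the PR):

```java
import java.nio.charset.StandardCharsets;

public class ByteWidthCheck {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // タイプ: 3 Katakana characters at 3 bytes each in UTF-8.
        System.out.println(utf8Bytes("タイプ")); // 9
        // تاريخ الميلاد: 12 Arabic letters at 2 bytes each, plus one ASCII space.
        System.out.println(utf8Bytes("تاريخ الميلاد")); // 25
    }
}
```

Both names are well under the 127-byte limit, so only the character-class handling (not truncation) was at fault here.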
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring
@github-actions
Contributor

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Helpful Resources

PR Slash Commands

Airbyte Maintainers (that's you!) can execute the following slash commands on your PR:

  • /format-fix - Fixes most formatting issues.
  • /bump-version - Bumps connector versions.
    • You can specify a custom changelog by passing changelog. Example: /bump-version changelog="My cool update"
    • Leaving the changelog arg blank will auto-populate the changelog from the PR title.
  • /run-cat-tests - Runs legacy CAT tests (Connector Acceptance Tests)
  • /build-connector-images - Builds and publishes a pre-release docker image for the modified connector(s).
  • JVM connectors:
    • /update-connector-cdk-version connector=<CONNECTOR_NAME> - Updates the specified connector to the latest CDK version.
      Example: /update-connector-cdk-version connector=destination-bigquery
    • /bump-bulk-cdk-version type=patch changelog='foo' - Bump the Bulk CDK's version. type can be major/minor/patch.
  • Python connectors:
    • /poe connector source-example lock - Run the Poe lock task on the source-example connector, committing the results back to the branch.
    • /poe source example lock - Alias for /poe connector source-example lock.
    • /poe source example use-cdk-branch my/branch - Pin the source-example CDK reference to the branch name specified.
    • /poe source example use-cdk-latest - Update the source-example CDK dependency to the latest available version.


@github-actions
Contributor

github-actions bot commented Oct 31, 2025

destination-redshift Connector Test Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit fa270e0.

♻️ This comment has been updated with latest results.

- Replace for loop with while loop using Character.charCount() to properly handle surrogate pairs (4-byte UTF-8 characters like emoji)
- Use codePoint comparisons instead of char comparisons for consistency
- Remove trailing whitespace from all lines to pass format check

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@devin-ai-integration
Contributor Author

/bump-version type=patch

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@devin-ai-integration
Contributor Author

/format-fix

- Fix KDoc comment formatting (trailing * without space)
- Wrap long lines in KDoc comments
- Format when expression with assignment on separate line
- Split long conditional statements across multiple lines
- Format multi-line assertions in tests

Co-Authored-By: Vai Ignatavicius <vaidotas.ignatavicius@airbyte.io>
@vai-airbyte
Contributor

Tested this with dev connector 3.5.3-dev.fa270e0383 and it worked well:

id|タイプ   |regular_name |अन_प_ल_न|تاريخ_الميلاد|_airbyte_raw_id                     |_airbyte_extracted_at        |_airbyte_generation_id|_airbyte_meta                    |
--+------+-------------+--------+-------------+------------------------------------+-----------------------------+----------------------+---------------------------------+
 1|Type A|Normal Column|Value 1 |1990-01-15   |49cdac8f-3fd7-4e4d-b704-39dadd9078dc|2025-10-31 15:56:32.714 +0000|                     2|{"changes":[],"sync_id":55720724}|
 3|Type C|Normal Column|Value 3 |1992-11-30   |b13d68d1-5296-44ea-9b38-0dd92fe8f281|2025-10-31 15:56:32.714 +0000|                     2|{"changes":[],"sync_id":55720724}|
 2|Type B|Normal Column|Value 2 |1985-05-20   |616af9d1-6f29-4fb9-9cf1-e7c5d80cea96|2025-10-31 15:56:32.714 +0000|                     2|{"changes":[],"sync_id":55720724}|
@vai-airbyte vai-airbyte removed their request for review October 31, 2025 16:11