Improve snippet case transforms support for non-Latin scripts (fix: #286165) #287150

Merged: jrieken merged 10 commits into microsoft:main from lucas-gomes-santana:fix/snippet-unicode-support on Jan 26, 2026

lucas-gomes-santana (Contributor) commented Jan 12, 2026

Description

This PR addresses the problem reported in issue #286165. It improves snippet case transforms by replacing ASCII-only regular expressions with Unicode-aware patterns and locale-aware case mapping.

Previously, snippet transforms such as upcase, downcase, camelcase, pascalcase, kebabcase, and snakecase relied on [a-zA-Z]-based matching. As a result, non-Latin input (e.g. Cyrillic or Greek) was not recognized correctly and transforms were silently skipped, producing no output changes at all.
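
To illustrate the failure mode described above, here is a minimal sketch comparing an ASCII-only character class with a Unicode property escape. The patterns and the `findWords` helper are hypothetical and written for this example; they are not copied from `snippetParser.ts`.

```typescript
// Hypothetical patterns for illustration only.
const asciiWord = /[a-zA-Z]+/g; // matches ASCII letters only
const unicodeWord = /\p{L}+/gu; // matches any Unicode letter (requires the `u` flag)

// Return every run of characters matched by `pattern`, or an empty array.
function findWords(pattern: RegExp, text: string): string[] {
    return text.match(pattern) ?? [];
}

for (const input of ['oneTwo', 'одинДва']) {
    console.log(input, findWords(asciiWord, input), findWords(unicodeWord, input));
}
// The ASCII pattern finds nothing in 'одинДва', so a transform built on it
// has no match to rewrite; the Unicode pattern finds the whole word.
```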


The changes in this PR:

  • Use Unicode property escapes (\p{L}, \p{Lu}, \p{Ll}, \p{Nd}) to properly detect letters and numbers across modern scripts.

  • Use locale-aware casing (toLocaleLowerCase / toLocaleUpperCase) instead of ASCII-only case conversion.

  • Preserve existing behavior for Latin input while improving support for scripts that have uppercase/lowercase distinctions (e.g. Cyrillic, Greek).
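
The approach in the list above can be sketched roughly as follows. This is an assumption-laden illustration, not the code merged in this PR: `splitWords`, `toKebabCase`, `toSnakeCase`, and `toPascalCase` are hypothetical names, and the real transforms in `snippetParser.ts` may split words differently.

```typescript
// Hypothetical helper: split on lowercase-to-uppercase transitions and on
// any run of characters that are neither letters nor decimal digits.
function splitWords(text: string): string[] {
    return text
        .replace(/(\p{Ll})(\p{Lu})/gu, '$1 $2') // insert a boundary between lower and upper
        .split(/[^\p{L}\p{Nd}]+/u)
        .filter(word => word.length > 0);
}

function toKebabCase(text: string): string {
    return splitWords(text).map(word => word.toLocaleLowerCase()).join('-');
}

function toSnakeCase(text: string): string {
    return splitWords(text).map(word => word.toLocaleLowerCase()).join('_');
}

function toPascalCase(text: string): string {
    // charAt(0) assumes the first letter is a single UTF-16 code unit.
    return splitWords(text)
        .map(word => word.charAt(0).toLocaleUpperCase() + word.slice(1).toLocaleLowerCase())
        .join('');
}

console.log(toKebabCase('одинДва'));  // один-два
console.log(toSnakeCase('одинДва3')); // один_два3
console.log(toPascalCase('одинДва')); // ОдинДва
```

Calling `toLocaleLowerCase()` without an argument uses the host's default locale, so results for some Latin letters can vary by environment.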


Limitations

This change does not aim to provide a fully language-aware or linguistically perfect solution for all scripts.
Word-based transforms (camelCase, PascalCase, kebab-case, snake_case) inherently rely on uppercase/lowercase transitions and therefore cannot be meaningfully applied to scripts without case (e.g. Chinese, Japanese, Arabic, Hebrew).

For such scripts, transforms effectively become no-ops, which is consistent with current behavior and preferable to producing arbitrary or destructive output.
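
As a small sketch of why caseless scripts end up as no-ops: their letters belong to neither `\p{Lu}` nor `\p{Ll}`, and case mapping returns them unchanged. The helper names `hasCasedLetter` and `isCaseInvariant` are hypothetical and used only for this example.

```typescript
// True if the text contains at least one uppercase or lowercase letter.
function hasCasedLetter(text: string): boolean {
    return /[\p{Lu}\p{Ll}]/u.test(text);
}

// True if upper- and lowercasing both leave the text unchanged.
function isCaseInvariant(text: string): boolean {
    return text.toLocaleUpperCase() === text && text.toLocaleLowerCase() === text;
}

for (const sample of ['一个测试', 'ひらがなカタカナ', '하나둘', 'одинДва']) {
    console.log(sample, hasCasedLetter(sample), isCaseInvariant(sample));
}
// The CJK and Hangul samples report no cased letters and are case-invariant;
// the Cyrillic sample has cased letters and is not case-invariant.
```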

Summary

  • Fixes silent failures for non-Latin input in snippet transforms

  • Improves Unicode correctness without breaking existing behavior

  • Clearly scoped as an incremental improvement, not a universal linguistic solution


Final results (input -> upcase, downcase, camelcase, pascalcase, kebabcase, snakecase):

одинДва -> ОДИНДВА одиндва одинДва ОдинДва один-два один_два (Russian)
一个测试 -> 一个测试 一个测试 一个测试 一个测试 一个测试 一个测试 (Simplified Chinese)
έναςΔύο -> ΈΝΑΣΔΎΟ έναςδύο έναςΔύο ΈναςΔύο ένας-δύο ένας_δύο (Greek)
ひらがなカタカナ -> ひらがなカタカナ ひらがなカタカナ ひらがなカタカナ ひらがなカタカナ ひらがなカタカナ ひらがなカタカナ (Japanese Hiragana + Katakana)
하나둘 -> 하나둘 하나둘 하나둘 하나둘 하나둘 하나둘 (Korean)
одинДва3 -> ОДИНДВА3 одиндва3 одинДва3 ОдинДва3 один-два3 один_два3 (Russian with number)
ένας_δύο -> ΈΝΑΣ_ΔΎΟ ένας_δύο έναςΔύο ΈναςΔύο ένας-δύο ένας_δύο (Greek with underscore)
こんにちはWorld -> こんにちはWORLD こんにちはworld こんにちはWorld こんにちはWorld world こんにちはworld (Japanese with English word)
واحدإثنين -> واحدإثنين واحدإثنين واحدإثنين واحدإثنين واحدإثنين واحدإثنين (Arabic)

Russian input before the regex changes:

одинДва -> ОДИНДВА одиндва одинДва одинДва одинДва одиндва (wrong formatting)

vs-code-engineering bot commented Jan 12, 2026

📬 CODENOTIFY

The following users are being notified based on files changed in this PR:

@jrieken

Matched files:

  • src/vs/editor/contrib/snippet/browser/snippetParser.ts
  • src/vs/editor/contrib/snippet/test/browser/snippetParser.test.ts

lucas-gomes-santana (Contributor, Author) commented

@microsoft-github-policy-service agree

dmitrivMS (Contributor) left a comment

You mentioned adding some tests - I think it would be really good in this case.

lucas-gomes-santana (Contributor, Author) commented Jan 14, 2026

@dmitrivMS I have now added tests for the modified regexes, including a test with the Turkish language. Waiting for review and feedback.
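
For context, Turkish is a common test case for locale-aware casing because its dotted and dotless "i" map differently than in most other locales. The sketch below is not taken from the PR's test file; `upcaseWithLocale` and `downcaseWithLocale` are hypothetical wrappers, and the Turkish results assume a runtime with standard ICU-backed locale support.

```typescript
// Hypothetical wrappers around the standard locale-aware string methods.
function upcaseWithLocale(text: string, locale: string): string {
    return text.toLocaleUpperCase(locale);
}

function downcaseWithLocale(text: string, locale: string): string {
    return text.toLocaleLowerCase(locale);
}

// 'tr' and 'en-US' are BCP 47 language tags.
console.log(upcaseWithLocale('i', 'en-US'));  // I
console.log(upcaseWithLocale('i', 'tr'));     // İ (U+0130, dotted capital I)
console.log(downcaseWithLocale('I', 'tr'));   // ı (U+0131, dotless small i)
```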

dmitrivMS (Contributor) commented

Apologies for the delay, I'll review and respond in about 24h.

lucas-gomes-santana (Contributor, Author) left a comment

I think these changes should work now. I mentioned the test logs in earlier comments.

lucas-gomes-santana (Contributor, Author) commented

> Apologies for the delay, I'll review and respond in about 24h.

Have you had a chance to review my changes? I think everything is working now.

jrieken (Member) left a comment

Thanks for this @lucas-gomes-santana

vs-code-engineering bot added this to the January 2026 milestone on Jan 26, 2026
jrieken enabled auto-merge on January 26, 2026 16:05
jrieken merged commit 283d8d0 into microsoft:main on Jan 26, 2026
17 checks passed
lucas-gomes-santana deleted the fix/snippet-unicode-support branch on January 28, 2026 12:32