fix(memory): extend tokenizer + slug regex to Thai/Arabic/Hebrew/Cyrillic #104
Merged
warren618 merged 1 commit on May 14, 2026
…llic

The previous CJK tokenizer ranges (#87, #95) only matched ``一-鿿`` and ``㐀-䶿``, so memory entries with Thai, Arabic, Hebrew, or Cyrillic titles:

- Tokenized to the empty set, making recall always miss (e.g. ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body contained the word).
- Had their slug characters stripped to ``_``, so two distinct Thai titles of equal length silently overwrote each other on disk.

The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai + Arabic + Hebrew + Cyrillic and is reused by:

- ``_TOKEN_RE``: single alternation pattern, one ``re.findall`` per ``_tokenize`` call (one text scan instead of two; precompiled at module level so it doesn't go through ``re.compile`` cache lookup on each invocation).
- ``_SLUG_DISALLOWED_RE``: negation pattern used by ``add()``.

Arabic and Hebrew are deliberately narrowed to the basic letter blocks (U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs, where they would render as invisible-but-distinct filenames.

Tests cover tokenization for each script, slug preservation (parametrized across the four new scripts), and a Thai collision-distinction regression.

Out of scope: ``agent/src/session/search.py`` has the same CJK-only range in its FTS sanitizer; worth a follow-up PR to consume the same constant.
ykykj pushed a commit to ykykj/Vibe-Trading that referenced this pull request on May 23, 2026
Summary
Extends ``_tokenize`` and ``add()``'s slug regex in ``src/memory/persistent.py`` to cover Thai, Arabic, Hebrew, and Cyrillic in addition to the existing CJK ranges. Follow-up to #87, #95, and the limitation flagged in #102.
Why
Heavy E2E testing during #102 surfaced that memory entries with non-Latin / non-CJK titles silently misbehaved:

- ``_tokenize("ถัวเฉลี่ย")`` returned ``set()``, so recall on Thai queries never matched, even when the body contained the term verbatim.
- ``add("นโยบาย", ...)`` slugged the entire Thai name to ``_``, producing filenames like ``feedback______________.md``. Two distinct same-length Thai titles silently overwrote each other on disk.

The bug applied symmetrically to Arabic, Hebrew, and Cyrillic.
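The two failure modes above can be reproduced in isolation. The following is a minimal sketch, not the project's actual pre-fix code: `old_tokenize`, `old_slug`, and the exact pattern layout are assumptions, reconstructed only from the description that the earlier ranges matched ``一-鿿`` and ``㐀-䶿`` and that tokenization used two passes.

```python
import re

# ASSUMPTION: reconstruction of the pre-fix behaviour for illustration only.
# The two CJK ranges come from the PR text; everything else is hypothetical.
_OLD_CJK = "\u4e00-\u9fff\u3400-\u4dbf"
_OLD_WORD_RE = re.compile(r"[a-zA-Z0-9]{3,}")
_OLD_CJK_RE = re.compile(f"[{_OLD_CJK}]")
_OLD_SLUG_DISALLOWED_RE = re.compile(f"[^a-zA-Z0-9_{_OLD_CJK}]")


def old_tokenize(text: str) -> set[str]:
    """Two-pass tokenizer: ASCII words of 3+ chars, then single CJK chars."""
    lowered = text.lower()
    return set(_OLD_WORD_RE.findall(lowered)) | set(_OLD_CJK_RE.findall(lowered))


def old_slug(name: str) -> str:
    """Replace every character outside ASCII word chars and CJK with '_'."""
    return _OLD_SLUG_DISALLOWED_RE.sub("_", name)


# Two distinct Thai words, four code points each ("price" and "profit").
price = "\u0e23\u0e32\u0e04\u0e32"
profit = "\u0e01\u0e33\u0e44\u0e23"

print(old_tokenize(price))                   # no Thai character matches either pass
print(old_slug(price) == old_slug(profit))   # both collapse to underscores
```

Because every Thai code point is outside both patterns, the token set is empty and the slug depends only on the title's length, which is why same-length titles collide.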
Changes
``agent/src/memory/persistent.py``

- ``_NON_LATIN_SCRIPT_RANGES`` constant covering 6 ranges: CJK Unified, CJK Extension A, Thai, Arabic letters, Hebrew letters, Cyrillic.
- Arabic and Hebrew are deliberately narrowed to the basic letter blocks (``U+0620-U+064A``, ``U+05D0-U+05EA``) to keep bidi-control codepoints (e.g. ``U+061C`` ARABIC LETTER MARK) and combining marks out of on-disk slugs, where they would render as invisible-but-distinct filenames.
- ``_TOKEN_RE = re.compile(rf"[a-zA-Z0-9]{{3,}}|[{_NON_LATIN_SCRIPT_RANGES}]")`` alternation pattern replaces the old two-pass regex. Precompiled at module level to avoid per-call f-string concat + cache lookup.
- ``_SLUG_DISALLOWED_RE`` reuses the same constant, so adding a new script in the future is a one-line edit.

``agent/tests/test_persistent_memory.py``

- Tests cover tokenization for each script, slug preservation (parametrized across the four new scripts), and a Thai collision-distinction regression (the "collide to underscores" bug).
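The changes listed above can be sketched roughly as follows. Only the ``_TOKEN_RE`` expression, the two CJK ranges, and the Arabic and Hebrew ranges are taken from this PR; the Thai (``U+0E00-U+0E7F``) and Cyrillic (``U+0400-U+04FF``) bounds, the body of ``_SLUG_DISALLOWED_RE``, and the helper names `tokenize` and `slugify` are assumptions made for illustration.

```python
import re

# Shared character-class body. The `re` module resolves \uXXXX escapes itself,
# so raw strings are fine here.
_NON_LATIN_SCRIPT_RANGES = (
    r"\u4e00-\u9fff"   # CJK Unified Ideographs
    r"\u3400-\u4dbf"   # CJK Extension A
    r"\u0e00-\u0e7f"   # Thai (ASSUMED bounds: the whole Thai block)
    r"\u0620-\u064a"   # Arabic letters only (from the PR)
    r"\u05d0-\u05ea"   # Hebrew letters only (from the PR)
    r"\u0400-\u04ff"   # Cyrillic (ASSUMED bounds: the basic Cyrillic block)
)

# Pattern as quoted in the PR: ASCII words of 3+ chars OR one non-Latin char.
_TOKEN_RE = re.compile(rf"[a-zA-Z0-9]{{3,}}|[{_NON_LATIN_SCRIPT_RANGES}]")

# ASSUMED shape: anything outside ASCII word chars and the shared ranges.
_SLUG_DISALLOWED_RE = re.compile(rf"[^a-zA-Z0-9_{_NON_LATIN_SCRIPT_RANGES}]")


def tokenize(text: str) -> set[str]:
    """Single scan over the lowercased text (hypothetical stand-in for _tokenize)."""
    return set(_TOKEN_RE.findall(text.lower()))


def slugify(name: str) -> str:
    """Hypothetical stand-in for the slug step inside add()."""
    return _SLUG_DISALLOWED_RE.sub("_", name)


price = "\u0e23\u0e32\u0e04\u0e32"    # Thai, 4 code points
profit = "\u0e01\u0e33\u0e44\u0e23"   # Thai, 4 code points

print(tokenize(price))                      # Thai characters now yield tokens
print(slugify(price) != slugify(profit))    # same-length titles stay distinct
print(slugify("a\u061cb"))                  # U+061C falls outside the Arabic letter range
```

Keeping the ranges in one string means the tokenizer and the slug filter cannot drift apart: extending the constant by one line changes both patterns at once.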
Out of scope
``agent/src/session/search.py`` (FTS sanitizer) has the same CJK-only range and would benefit from consuming ``_NON_LATIN_SCRIPT_RANGES``. Worth a separate small PR; keeping this one focused on the memory layer flagged in #102.
Test Plan
- ``pytest agent/tests/test_persistent_memory.py agent/tests/test_cli_memory.py agent/tests/test_remember_tool.py agent/tests/test_cli_init.py`` → 90 pass
- ``ruff check agent/src/memory/persistent.py agent/tests/test_persistent_memory.py``: net -1 error vs upstream (the new ``@pytest.mark.parametrize`` makes the previously-unused ``pytest`` import live)
- ``test_remember_tool`` 16/16 still green
- ``vibe-trading run`` saving a Thai+ASCII mixed title via ``remember``

Checklist
src/agent/,src/session/,src/providers/)CONTRIBUTING.md