fix(memory): extend tokenizer + slug regex to Thai/Arabic/Hebrew/Cyrillic by Teerapat-Vatpitak · Pull Request #104 · HKUDS/Vibe-Trading

Teerapat-Vatpitak · 2026-05-13T19:26:46Z

Summary

Extends _tokenize and the slug regex used by add() in agent/src/memory/persistent.py
to cover Thai, Arabic, Hebrew, and Cyrillic in addition to the existing
CJK ranges. Follow-up to #87 and #95, addressing the limitation flagged in #102.

Why

Heavy E2E testing during #102 surfaced that memory entries with
non-Latin / non-CJK titles silently misbehaved:

- _tokenize("ถัวเฉลี่ย") returned set(), so recall on Thai queries
  never matched, even when the body contained the term verbatim.
- add("นโยบาย", ...) replaced every character of the Thai name with _, producing
  filenames like feedback______________.md. Two distinct same-length
  Thai titles silently overwrote each other on disk.

The bug applied symmetrically to Arabic, Hebrew, and Cyrillic.
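
The failure mode can be reproduced with a minimal sketch. The pre-fix source is not shown in this PR, so the pattern shapes and the names old_tokenize and old_slug below are assumptions. Only the CJK-only ranges and the two-pass structure come from the description.

```python
import re

# Assumed pre-fix character ranges: CJK Unified + CJK Extension A only.
_OLD_CJK = "\u4e00-\u9fff\u3400-\u4dbf"
# Hypothetical slug pattern: anything outside ASCII word chars and CJK is disallowed.
_OLD_SLUG_DISALLOWED_RE = re.compile(rf"[^a-zA-Z0-9_\-{_OLD_CJK}]")


def old_tokenize(text):
    """Two-pass tokenizer: ASCII words of 3+ chars, then single CJK chars."""
    words = set(re.findall(r"[a-zA-Z0-9]{3,}", text.lower()))
    cjk = set(re.findall(rf"[{_OLD_CJK}]", text))
    return words | cjk


def old_slug(name):
    """Replace every disallowed character with an underscore."""
    return _OLD_SLUG_DISALLOWED_RE.sub("_", name)


print(old_tokenize("ถัวเฉลี่ย"))                 # empty set: no Thai char matches
print(old_slug("นโยบาย"))                    # underscores only
print(old_slug("นโยบาย") == old_slug("ราคาดี"))  # True: same length, same slug
```

Because the slug keeps only the length of the title, any two titles with the same number of codepoints map to the same filename.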

Changes

agent/src/memory/persistent.py

- New _NON_LATIN_SCRIPT_RANGES constant covering 6 ranges: CJK Unified,
  CJK Extension A, Thai, Arabic letters, Hebrew letters, Cyrillic.
- Arabic and Hebrew are deliberately narrowed to the basic letter blocks
  (U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints
  (e.g. U+061C ARABIC LETTER MARK) and combining marks out of on-disk
  slugs, where they would produce filenames that look identical but are distinct.
- A single alternation pattern,
  _TOKEN_RE = re.compile(rf"[a-zA-Z0-9]{{3,}}|[{_NON_LATIN_SCRIPT_RANGES}]"),
  replaces the old two-pass regex. It is precompiled at module level so each
  call avoids rebuilding the pattern string and the re cache lookup.
- The same constant feeds the slug pattern _SLUG_DISALLOWED_RE, so adding a new
  script in the future is a one-line edit.
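
A minimal sketch of how the shared constant could drive both patterns follows. The Arabic and Hebrew bounds and the _TOKEN_RE expression are taken from the description above. The Thai and Cyrillic bounds (assumed here to be the full Unicode blocks), the ASCII characters allowed in slugs, the lowercasing step, and the helper names tokenize and slug are assumptions for illustration.

```python
import re

_NON_LATIN_SCRIPT_RANGES = (
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\u3400-\u4dbf"  # CJK Extension A
    "\u0e00-\u0e7f"  # Thai (assumed: full block)
    "\u0620-\u064a"  # Arabic letters only
    "\u05d0-\u05ea"  # Hebrew letters only
    "\u0400-\u04ff"  # Cyrillic (assumed: full block)
)

# One scan: ASCII words of 3+ chars, or any single non-Latin script character.
_TOKEN_RE = re.compile(rf"[a-zA-Z0-9]{{3,}}|[{_NON_LATIN_SCRIPT_RANGES}]")
# Assumed slug rule: everything outside ASCII word chars and the ranges is disallowed.
_SLUG_DISALLOWED_RE = re.compile(rf"[^a-zA-Z0-9_\-{_NON_LATIN_SCRIPT_RANGES}]")


def tokenize(text):
    """Hypothetical stand-in for _tokenize; lowercasing is an assumption."""
    return set(_TOKEN_RE.findall(text.lower()))


def slug(name):
    """Hypothetical stand-in for the slug step inside add()."""
    return _SLUG_DISALLOWED_RE.sub("_", name)


print(tokenize("ถัวเฉลี่ย") != set())  # True: Thai characters now tokenize
print(slug("นโยบาย") == "นโยบาย")   # True: Thai title survives intact
print(slug("a\u061cb"))            # a_b: U+061C is outside the narrowed Arabic range
```

Keeping the ranges in one string means the tokenizer and the slug rule cannot drift apart when a script is added.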

agent/tests/test_persistent_memory.py

- Tokenization tests for Thai, Arabic, Hebrew, Cyrillic.
- Parametrized slug-preservation test for all four scripts.
- Regression: two distinct Thai titles produce distinct files (catches
  the "collide to underscores" bug).
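
The shape of the slug tests might look like the sketch below. It is not the PR's test code: the add() function here is a hypothetical stand-in that only writes a file named after the slug, the sample titles are illustrative, and a plain loop takes the place of @pytest.mark.parametrize so the sketch runs without pytest installed.

```python
import re
import tempfile
from pathlib import Path

# Assumed ranges: CJK, CJK Ext. A, Thai, Arabic letters, Hebrew letters, Cyrillic.
_RANGES = "\u4e00-\u9fff\u3400-\u4dbf\u0e00-\u0e7f\u0620-\u064a\u05d0-\u05ea\u0400-\u04ff"
_SLUG_DISALLOWED_RE = re.compile(rf"[^a-zA-Z0-9_\-{_RANGES}]")


def add(directory: Path, name: str, body: str) -> Path:
    """Hypothetical stand-in for PersistentMemory.add(): write <slug>.md."""
    path = directory / f"{_SLUG_DISALLOWED_RE.sub('_', name)}.md"
    path.write_text(body, encoding="utf-8")
    return path


SCRIPT_SAMPLES = {
    "thai": "นโยบาย",
    "arabic": "سلام",
    "hebrew": "שלום",
    "cyrillic": "привет",
}


def test_slug_preserved_for_each_script():
    # Loop instead of parametrize: each title must survive slugging unchanged.
    for script, title in SCRIPT_SAMPLES.items():
        with tempfile.TemporaryDirectory() as tmp:
            assert add(Path(tmp), title, "body").stem == title, script


def test_distinct_thai_titles_produce_distinct_files():
    # Two same-length Thai titles must not overwrite each other.
    with tempfile.TemporaryDirectory() as tmp:
        first = add(Path(tmp), "นโยบาย", "first")
        second = add(Path(tmp), "ราคาดี", "second")
        assert first != second
        assert first.read_text(encoding="utf-8") == "first"
        assert len(list(Path(tmp).iterdir())) == 2


test_slug_preserved_for_each_script()
test_distinct_thai_titles_produce_distinct_files()
```

The collision test checks both the path and the first file's contents, since an overwrite would leave a single file holding the second body.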

Out of scope

agent/src/session/search.py (FTS sanitizer) has the same CJK-only
range and would benefit from consuming _NON_LATIN_SCRIPT_RANGES.
Worth a separate small PR; keeping this one focused on the memory
layer flagged in #102.

Test Plan

- pytest agent/tests/test_persistent_memory.py agent/tests/test_cli_memory.py agent/tests/test_remember_tool.py agent/tests/test_cli_init.py → 90 passed
- ruff check agent/src/memory/persistent.py agent/tests/test_persistent_memory.py: net -1 error vs upstream (the new @pytest.mark.parametrize makes the previously unused pytest import live)
- Backward compat: CJK tokenize/slug unchanged; test_remember_tool 16/16 still green
- Edge cases: empty input, pure ASCII, pure CJK, 4-script mix, single non-Latin char
- Heavy E2E: file-level recall round-trip for Thai/Arabic/Hebrew/Cyrillic, plus a real LLM vibe-trading run saving a mixed Thai+ASCII title via remember

Checklist

- No changes to protected modules (src/agent/, src/session/, src/providers/)
- No hardcoded values
- Follows CONTRIBUTING.md
- Tests added
- On-disk format unchanged for existing CJK/ASCII entries

…llic

The previous CJK tokenizer ranges (#87, #95) only matched ``一-鿿`` and ``㐀-䶿``, so memory entries with Thai, Arabic, Hebrew, or Cyrillic titles:

- Tokenized to the empty set, making recall always miss (e.g. ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body contained the word).
- Had their slug characters stripped to ``_``, so two distinct Thai titles of equal length silently overwrote each other on disk.

The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai + Arabic + Hebrew + Cyrillic and is reused by:

- ``_TOKEN_RE``: single alternation pattern, one ``re.findall`` per ``_tokenize`` call (one text scan instead of two; precompiled at module level so it doesn't go through ``re.compile`` cache lookup on each invocation).
- ``_SLUG_DISALLOWED_RE``: negation pattern used by ``add()``.

Arabic and Hebrew are deliberately narrowed to the basic letter blocks (U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs, where they would render as invisible-but-distinct filenames.

Tests cover tokenization for each script, slug preservation (parametrized across the four new scripts), and a Thai collision-distinction regression.

Out of scope: ``agent/src/session/search.py`` has the same CJK-only range in its FTS sanitizer; worth a follow-up PR to consume the same constant.


warren618 merged commit f85f3b8 into HKUDS:main May 14, 2026
1 check passed

Teerapat-Vatpitak deleted the fix/memory-tokenizer-non-cjk branch May 14, 2026 11:33

Teerapat-Vatpitak mentioned this pull request May 14, 2026

feat(memory): harden PersistentMemory.add() for control bytes, length, empty names #112

Merged

11 tasks
