
fix(memory): extend tokenizer + slug regex to Thai/Arabic/Hebrew/Cyrillic#104

Merged
warren618 merged 1 commit into
HKUDS:mainfrom
Teerapat-Vatpitak:fix/memory-tokenizer-non-cjk
May 14, 2026

Conversation

@Teerapat-Vatpitak

Contributor

Summary

Extends _tokenize and add()'s slug regex in src/memory/persistent.py
to cover Thai, Arabic, Hebrew, and Cyrillic in addition to the existing
CJK ranges. Follow-up to #87 and #95; addresses the limitation flagged in #102.

Why

Heavy E2E testing during #102 showed that memory entries whose titles
use scripts other than Latin or CJK silently misbehaved:

  • _tokenize("ถัวเฉลี่ย") returned set() — recall on Thai queries
    never matched, even when the body contained the term verbatim.
  • add("นโยบาย", ...) replaced every character of the Thai name with _,
    producing filenames like feedback______________.md. Two distinct Thai
    titles of the same length silently overwrote each other on disk.

The same bug affected Arabic, Hebrew, and Cyrillic titles.
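The failure mode above can be reproduced with a minimal sketch. The patterns below are an assumed reconstruction of the pre-fix behaviour (ASCII words of 3+ characters plus the two CJK ranges); the names old_tokenize and old_slug are hypothetical and are not taken from persistent.py.

```python
import re

# Assumed reconstruction of the pre-fix ranges: CJK Unified + Extension A only.
_OLD_CJK_RANGES = "\u4e00-\u9fff\u3400-\u4dbf"
_OLD_WORD_RE = re.compile(r"[a-zA-Z0-9]{3,}")
_OLD_CJK_RE = re.compile(f"[{_OLD_CJK_RANGES}]")
# Anything outside ASCII alphanumerics, "_", "-", and CJK is disallowed.
_OLD_SLUG_DISALLOWED_RE = re.compile(f"[^a-zA-Z0-9_{_OLD_CJK_RANGES}-]")


def old_tokenize(text: str) -> set:
    """Two-pass tokenizer: ASCII words first, then single CJK characters."""
    words = {w.lower() for w in _OLD_WORD_RE.findall(text)}
    return words | set(_OLD_CJK_RE.findall(text))


def old_slug(name: str) -> str:
    """Replace every disallowed character with an underscore."""
    return _OLD_SLUG_DISALLOWED_RE.sub("_", name)


print(old_tokenize("ถัวเฉลี่ย"))   # set()
print(old_slug("ราคา"))          # ____
print(old_slug("ตลาด"))          # ____  (same slug, so same filename)
```

Both Thai titles are four code points long, so under these assumed patterns they collapse to the same four-underscore slug and would map to the same file.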

Changes

agent/src/memory/persistent.py

  • New _NON_LATIN_SCRIPT_RANGES constant covering 6 ranges: CJK Unified,
    CJK Extension A, Thai, Arabic letters, Hebrew letters, Cyrillic.
  • Arabic and Hebrew deliberately narrowed to the basic letter blocks
    (U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints
    (e.g. U+061C ARABIC LETTER MARK) and combining marks out of on-disk
    slugs, where they would render as invisible-but-distinct filenames.
  • Single _TOKEN_RE = re.compile(rf"[a-zA-Z0-9]{{3,}}|[{_NON_LATIN_SCRIPT_RANGES}]")
    alternation pattern replaces the old two-pass regex. It is precompiled at
    module level, so _tokenize no longer rebuilds the pattern string or goes
    through the re.compile cache lookup on each call.
  • Same constant feeds the slug _SLUG_DISALLOWED_RE so adding a new
    script in the future is a one-line edit.
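A minimal sketch of how the shared constant could feed both patterns. The Arabic and Hebrew bounds are the ones stated above; the Thai and Cyrillic bounds (full Unicode blocks), the ASCII part of the slug class, the lowercasing, and the function names tokenize and slugify are assumptions for illustration, not the actual contents of persistent.py.

```python
import re

# One source of truth for the non-Latin ranges. Thai and Cyrillic bounds are
# assumed to be the full blocks; Arabic and Hebrew are the narrowed letter
# ranges given in the PR description.
_NON_LATIN_SCRIPT_RANGES = (
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\u3400-\u4dbf"  # CJK Extension A
    "\u0e00-\u0e7f"  # Thai (assumed bounds)
    "\u0620-\u064a"  # Arabic letters
    "\u05d0-\u05ea"  # Hebrew letters
    "\u0400-\u04ff"  # Cyrillic (assumed bounds)
)

# ASCII runs of 3+ characters, or any single character from the ranges above.
_TOKEN_RE = re.compile(rf"[a-zA-Z0-9]{{3,}}|[{_NON_LATIN_SCRIPT_RANGES}]")
# Negated class: everything NOT allowed in a slug (ASCII part is assumed).
_SLUG_DISALLOWED_RE = re.compile(rf"[^a-zA-Z0-9_{_NON_LATIN_SCRIPT_RANGES}-]")


def tokenize(text: str) -> set:
    """Single scan over the text; lowercasing is an assumption."""
    return {tok.lower() for tok in _TOKEN_RE.findall(text)}


def slugify(name: str) -> str:
    """Keep allowed characters, replace everything else with '_'."""
    return _SLUG_DISALLOWED_RE.sub("_", name)


print(bool(tokenize("ถัวเฉลี่ย")))       # True
print(slugify("ราคา") == "ราคา")       # True
print(slugify("\u061cabc"))            # _abc  (U+061C is outside the Arabic letter range)
```

Because both patterns interpolate the same string, adding a script means appending one range to the constant; the bidi control U+061C falls outside U+0620-U+064A and is therefore replaced rather than kept in the slug.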

agent/tests/test_persistent_memory.py

  • Tokenization tests for Thai, Arabic, Hebrew, Cyrillic.
  • Parametrized slug-preservation test for all four scripts.
  • Regression: two distinct Thai titles produce distinct files (catches
    the "collide to underscores" bug).
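The collision regression can be sketched without the real module. Here add is a hypothetical stand-in that writes one markdown file per entry, and the slug class covers only ASCII plus an assumed Thai block for brevity; the actual test in test_persistent_memory.py may be structured differently.

```python
import re
import tempfile
from pathlib import Path

# ASCII plus the Thai block only (assumed bounds), enough for this check.
_SLUG_DISALLOWED_RE = re.compile("[^a-zA-Z0-9_\u0e00-\u0e7f-]")


def add(root: Path, name: str, body: str, kind: str = "feedback") -> Path:
    """Hypothetical stand-in for add(): one file per slugged title."""
    slug = _SLUG_DISALLOWED_RE.sub("_", name)
    path = root / f"{kind}_{slug}.md"
    path.write_text(body, encoding="utf-8")
    return path


def check_distinct_thai_titles() -> int:
    """Write two same-length Thai titles and count the files on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        first = add(root, "ราคา", "price notes")
        second = add(root, "ตลาด", "market notes")
        # Distinct titles must map to distinct paths ...
        assert first != second
        # ... and the second write must not clobber the first entry.
        assert first.read_text(encoding="utf-8") == "price notes"
        return len(list(root.glob("*.md")))


print(check_distinct_thai_titles())  # 2
```

With a CJK-only slug class both titles would resolve to feedback_____.md and the file count would be 1, which is the condition the regression is meant to catch.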

Out of scope

agent/src/session/search.py (FTS sanitizer) has the same CJK-only
range and would benefit from consuming _NON_LATIN_SCRIPT_RANGES.
Worth a separate small PR; keeping this one focused on the memory
layer flagged in #102.

Test Plan

  • pytest agent/tests/test_persistent_memory.py agent/tests/test_cli_memory.py agent/tests/test_remember_tool.py agent/tests/test_cli_init.py → 90 pass
  • ruff check agent/src/memory/persistent.py agent/tests/test_persistent_memory.py — net -1 error vs upstream (the new @pytest.mark.parametrize makes the previously-unused pytest import live)
  • Backward compat: CJK tokenize/slug unchanged; test_remember_tool 16/16 still green
  • Edge cases: empty input, pure ASCII, pure CJK, 4-script mix, single non-Latin char
  • Heavy E2E: file-level recall round-trip for Thai/Arabic/Hebrew/Cyrillic and a real LLM vibe-trading run saving a Thai+ASCII mixed title via remember

Checklist

  • No changes to protected modules (src/agent/, src/session/, src/providers/)
  • No hardcoded values
  • Follows CONTRIBUTING.md
  • Tests added
  • On-disk format unchanged for existing CJK/ASCII entries
…llic

The previous CJK tokenizer ranges (#87, #95) only matched ``一-鿿``
and ``㐀-䶿``, so memory entries with Thai, Arabic, Hebrew, or
Cyrillic titles:

- Tokenized to the empty set, making recall always miss (e.g.
  ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body
  contained the word).
- Had their slug characters stripped to ``_``, so two distinct Thai
  titles of equal length silently overwrote each other on disk.

The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai +
Arabic + Hebrew + Cyrillic and is reused by:

- ``_TOKEN_RE`` — single alternation pattern, one ``re.findall`` per
  ``_tokenize`` call (one text scan instead of two; precompiled at
  module level so it doesn't go through ``re.compile`` cache lookup on
  each invocation).
- ``_SLUG_DISALLOWED_RE`` — negation pattern used by ``add()``.

Arabic and Hebrew are deliberately narrowed to the basic letter blocks
(U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like
U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs,
where they would render as invisible-but-distinct filenames.

Tests cover tokenization for each script, slug preservation
(parametrized across the four new scripts), and a Thai collision-
distinction regression.

Out of scope: ``agent/src/session/search.py`` has the same CJK-only
range in its FTS sanitizer; worth a follow-up PR to consume the same
constant.
@warren618 warren618 merged commit f85f3b8 into HKUDS:main May 14, 2026
1 check passed
@Teerapat-Vatpitak Teerapat-Vatpitak deleted the fix/memory-tokenizer-non-cjk branch May 14, 2026 11:33
ykykj pushed a commit to ykykj/Vibe-Trading that referenced this pull request May 23, 2026