fix(memory): treat underscores as word boundaries in _tokenize by hp083625 · Pull Request #87 · HKUDS/Vibe-Trading

hp083625 · 2026-05-10T07:21:22Z

Problem

_tokenize in agent/src/memory/persistent.py uses the regex
[a-zA-Z0-9_]{3,}, which treats _ as a word character. Memory titles
saved through RememberTool are slugified to snake_case (e.g.
mcp_wiring_test), so they collapse into a single token. A natural-
language recall query like "mcp wiring" produces tokens
{"mcp", "wiring"} and never intersects {"mcp_wiring_test"} —
recall returns 0 results despite the memory existing on disk.

Repro

remember(action="save", title="mcp_wiring_test", content="…")
remember(action="recall", query="mcp wiring") → count: 0
remember(action="forget", title="mcp_wiring_test") → succeeds (proves the memory was on disk)

Fix

Drop _ from the character class so underscores act as token boundaries.
mcp_wiring_test now tokenizes to {"mcp", "wiring", "test"}. Verbatim
queries ("mcp_wiring_test" as a single string) still work because they
split into the same three tokens on both sides of the comparison.

Tests

All five existing TestTokenize cases still pass.
New test_underscores_split covers the regression scenario directly.

Memory titles saved through RememberTool are slugified to snake_case (e.g. "mcp_wiring_test"), so the previous regex `[a-zA-Z0-9_]{3,}` collapsed them into a single token. A natural-language recall query like "mcp wiring" produces tokens {"mcp", "wiring"} and never intersects {"mcp_wiring_test"} — recall returned 0 results despite the memory existing on disk. Drop `_` from the character class so underscores act as token boundaries. Verbatim queries still work because the same string splits into the same tokens on both sides of the comparison. Adds a regression test for the snake_case case. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

warren618 · 2026-05-10T12:00:32Z

Thanks for the focused fix. Treating underscores as token boundaries makes sense for snake_case memory titles, and the regression test covers the failure mode clearly.

Merged and verified.

Add `vibe-trading memory list/show/search/forget` so users can inspect and manage persistent memory without opening the file store directly. Two recent memory bugs (#87 underscore tokenization, #95 CJK slug collisions) were hard to spot without observability into what is actually saved and recalled. Also expand the PersistentMemory public surface to keep CLI lookup logic out of storage internals: - MEMORY_TYPES tuple as the single source of truth for type names, shared by argparse choices and the CLI style table - list_entries() / find() / remove_entry() back the new commands; find() resolves by title first, then by on-disk filename stem so users can paste either form from the index - remove_entry(entry) skips the re-scan that remove(title) does Existing remove(title) is unchanged; remember_tool stays green.

…llic (#104) The previous CJK tokenizer ranges (#87, #95) only matched ``一-鿿`` and ``㐀-䶿``, so memory entries with Thai, Arabic, Hebrew, or Cyrillic titles: - Tokenized to the empty set, making recall always miss (e.g. ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body contained the word). - Had their slug characters stripped to ``_``, so two distinct Thai titles of equal length silently overwrote each other on disk. The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai + Arabic + Hebrew + Cyrillic and is reused by: - ``_TOKEN_RE`` — single alternation pattern, one ``re.findall`` per ``_tokenize`` call (one text scan instead of two; precompiled at module level so it doesn't go through ``re.compile`` cache lookup on each invocation). - ``_SLUG_DISALLOWED_RE`` — negation pattern used by ``add()``. Arabic and Hebrew are deliberately narrowed to the basic letter blocks (U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs, where they would render as invisible-but-distinct filenames. Tests cover tokenization for each script, slug preservation (parametrized across the four new scripts), and a Thai collision- distinction regression. Out of scope: ``agent/src/session/search.py`` has the same CJK-only range in its FTS sanitizer; worth a follow-up PR to consume the same constant.

Add `vibe-trading memory list/show/search/forget` so users can inspect and manage persistent memory without opening the file store directly. Two recent memory bugs (HKUDS#87 underscore tokenization, HKUDS#95 CJK slug collisions) were hard to spot without observability into what is actually saved and recalled. Also expand the PersistentMemory public surface to keep CLI lookup logic out of storage internals: - MEMORY_TYPES tuple as the single source of truth for type names, shared by argparse choices and the CLI style table - list_entries() / find() / remove_entry() back the new commands; find() resolves by title first, then by on-disk filename stem so users can paste either form from the index - remove_entry(entry) skips the re-scan that remove(title) does Existing remove(title) is unchanged; remember_tool stays green.

…llic (HKUDS#104) The previous CJK tokenizer ranges (HKUDS#87, HKUDS#95) only matched ``一-鿿`` and ``㐀-䶿``, so memory entries with Thai, Arabic, Hebrew, or Cyrillic titles: - Tokenized to the empty set, making recall always miss (e.g. ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body contained the word). - Had their slug characters stripped to ``_``, so two distinct Thai titles of equal length silently overwrote each other on disk. The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai + Arabic + Hebrew + Cyrillic and is reused by: - ``_TOKEN_RE`` — single alternation pattern, one ``re.findall`` per ``_tokenize`` call (one text scan instead of two; precompiled at module level so it doesn't go through ``re.compile`` cache lookup on each invocation). - ``_SLUG_DISALLOWED_RE`` — negation pattern used by ``add()``. Arabic and Hebrew are deliberately narrowed to the basic letter blocks (U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs, where they would render as invisible-but-distinct filenames. Tests cover tokenization for each script, slug preservation (parametrized across the four new scripts), and a Thai collision- distinction regression. Out of scope: ``agent/src/session/search.py`` has the same CJK-only range in its FTS sanitizer; worth a follow-up PR to consume the same constant.

hp083625 force-pushed the fix-memory-tokenize-underscore branch from bbeb076 to 5863ed6 Compare May 10, 2026 07:26

ykykj merged commit 9b6e322 into HKUDS:main May 10, 2026
1 check passed

Teerapat-Vatpitak mentioned this pull request May 13, 2026

feat(cli): add memory introspection commands #102

Merged

10 tasks

Teerapat-Vatpitak mentioned this pull request May 13, 2026

fix(memory): extend tokenizer + slug regex to Thai/Arabic/Hebrew/Cyrillic #104

Merged

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(memory): treat underscores as word boundaries in _tokenize#87

fix(memory): treat underscores as word boundaries in _tokenize#87
ykykj merged 1 commit into
HKUDS:mainfrom
hp083625:fix-memory-tokenize-underscore

hp083625 commented May 10, 2026

Uh oh!

warren618 commented May 10, 2026

Labels

3 participants

Uh oh!

Conversation

hp083625 commented May 10, 2026

Problem

Repro

Fix

Tests

Uh oh!

warren618 commented May 10, 2026

Labels

3 participants