Skip to content

fix(memory): treat underscores as word boundaries in _tokenize#87

Merged
ykykj merged 1 commit into
HKUDS:mainfrom
hp083625:fix-memory-tokenize-underscore
May 10, 2026
Merged

fix(memory): treat underscores as word boundaries in _tokenize#87
ykykj merged 1 commit into
HKUDS:mainfrom
hp083625:fix-memory-tokenize-underscore

Conversation

@hp083625

Copy link
Copy Markdown
Contributor

Problem

_tokenize in agent/src/memory/persistent.py uses the regex
[a-zA-Z0-9_]{3,}, which treats _ as a word character. Memory titles
saved through RememberTool are slugified to snake_case (e.g.
mcp_wiring_test), so they collapse into a single token. A natural-
language recall query like "mcp wiring" produces tokens
{"mcp", "wiring"} and never intersects {"mcp_wiring_test"}
recall returns 0 results despite the memory existing on disk.

Repro

  1. remember(action="save", title="mcp_wiring_test", content="…")
  2. remember(action="recall", query="mcp wiring")count: 0
  3. remember(action="forget", title="mcp_wiring_test") → succeeds (proves the memory was on disk)

Fix

Drop _ from the character class so underscores act as token boundaries.
mcp_wiring_test now tokenizes to {"mcp", "wiring", "test"}. Verbatim
queries ("mcp_wiring_test" as a single string) still work because they
split into the same three tokens on both sides of the comparison.

Tests

  • All five existing TestTokenize cases still pass.
  • New test_underscores_split covers the regression scenario directly.
Memory titles saved through RememberTool are slugified to snake_case
(e.g. "mcp_wiring_test"), so the previous regex `[a-zA-Z0-9_]{3,}`
collapsed them into a single token. A natural-language recall query
like "mcp wiring" produces tokens {"mcp", "wiring"} and never
intersects {"mcp_wiring_test"} — recall returned 0 results despite
the memory existing on disk.

Drop `_` from the character class so underscores act as token
boundaries. Verbatim queries still work because the same string
splits into the same tokens on both sides of the comparison.

Adds a regression test for the snake_case case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@hp083625 hp083625 force-pushed the fix-memory-tokenize-underscore branch from bbeb076 to 5863ed6 Compare May 10, 2026 07:26
@ykykj ykykj merged commit 9b6e322 into HKUDS:main May 10, 2026
1 check passed
@warren618

Copy link
Copy Markdown
Collaborator

Thanks for the focused fix. Treating underscores as token boundaries makes sense for snake_case memory titles, and the regression test covers the failure mode clearly.

Merged and verified.

warren618 pushed a commit that referenced this pull request May 13, 2026
Add `vibe-trading memory list/show/search/forget` so users can inspect
and manage persistent memory without opening the file store directly.
Two recent memory bugs (#87 underscore tokenization, #95 CJK slug
collisions) were hard to spot without observability into what is
actually saved and recalled.

Also expand the PersistentMemory public surface to keep CLI lookup
logic out of storage internals:

- MEMORY_TYPES tuple as the single source of truth for type names,
  shared by argparse choices and the CLI style table
- list_entries() / find() / remove_entry() back the new commands;
  find() resolves by title first, then by on-disk filename stem so
  users can paste either form from the index
- remove_entry(entry) skips the re-scan that remove(title) does

Existing remove(title) is unchanged; remember_tool stays green.
warren618 pushed a commit that referenced this pull request May 14, 2026
…llic (#104)

The previous CJK tokenizer ranges (#87, #95) only matched ``一-鿿``
and ``㐀-䶿``, so memory entries with Thai, Arabic, Hebrew, or
Cyrillic titles:

- Tokenized to the empty set, making recall always miss (e.g.
  ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body
  contained the word).
- Had their slug characters stripped to ``_``, so two distinct Thai
  titles of equal length silently overwrote each other on disk.

The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai +
Arabic + Hebrew + Cyrillic and is reused by:

- ``_TOKEN_RE`` — single alternation pattern, one ``re.findall`` per
  ``_tokenize`` call (one text scan instead of two; precompiled at
  module level so it doesn't go through ``re.compile`` cache lookup on
  each invocation).
- ``_SLUG_DISALLOWED_RE`` — negation pattern used by ``add()``.

Arabic and Hebrew are deliberately narrowed to the basic letter blocks
(U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like
U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs,
where they would render as invisible-but-distinct filenames.

Tests cover tokenization for each script, slug preservation
(parametrized across the four new scripts), and a Thai collision-
distinction regression.

Out of scope: ``agent/src/session/search.py`` has the same CJK-only
range in its FTS sanitizer; worth a follow-up PR to consume the same
constant.
ykykj pushed a commit to ykykj/Vibe-Trading that referenced this pull request May 23, 2026
Add `vibe-trading memory list/show/search/forget` so users can inspect
and manage persistent memory without opening the file store directly.
Two recent memory bugs (HKUDS#87 underscore tokenization, HKUDS#95 CJK slug
collisions) were hard to spot without observability into what is
actually saved and recalled.

Also expand the PersistentMemory public surface to keep CLI lookup
logic out of storage internals:

- MEMORY_TYPES tuple as the single source of truth for type names,
  shared by argparse choices and the CLI style table
- list_entries() / find() / remove_entry() back the new commands;
  find() resolves by title first, then by on-disk filename stem so
  users can paste either form from the index
- remove_entry(entry) skips the re-scan that remove(title) does

Existing remove(title) is unchanged; remember_tool stays green.
ykykj pushed a commit to ykykj/Vibe-Trading that referenced this pull request May 23, 2026
…llic (HKUDS#104)

The previous CJK tokenizer ranges (HKUDS#87, HKUDS#95) only matched ``一-鿿``
and ``㐀-䶿``, so memory entries with Thai, Arabic, Hebrew, or
Cyrillic titles:

- Tokenized to the empty set, making recall always miss (e.g.
  ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body
  contained the word).
- Had their slug characters stripped to ``_``, so two distinct Thai
  titles of equal length silently overwrote each other on disk.

The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai +
Arabic + Hebrew + Cyrillic and is reused by:

- ``_TOKEN_RE`` — single alternation pattern, one ``re.findall`` per
  ``_tokenize`` call (one text scan instead of two; precompiled at
  module level so it doesn't go through ``re.compile`` cache lookup on
  each invocation).
- ``_SLUG_DISALLOWED_RE`` — negation pattern used by ``add()``.

Arabic and Hebrew are deliberately narrowed to the basic letter blocks
(U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like
U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs,
where they would render as invisible-but-distinct filenames.

Tests cover tokenization for each script, slug preservation
(parametrized across the four new scripts), and a Thai collision-
distinction regression.

Out of scope: ``agent/src/session/search.py`` has the same CJK-only
range in its FTS sanitizer; worth a follow-up PR to consume the same
constant.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

3 participants