fix(memory): treat underscores as word boundaries in _tokenize#87
Merged
Conversation
Memory titles saved through RememberTool are slugified to snake_case
(e.g. "mcp_wiring_test"), so the previous regex `[a-zA-Z0-9_]{3,}`
collapsed them into a single token. A natural-language recall query
like "mcp wiring" produces tokens {"mcp", "wiring"} and never
intersects {"mcp_wiring_test"} — recall returned 0 results despite
the memory existing on disk.
Drop `_` from the character class so underscores act as token
boundaries. Verbatim queries still work because the same string
splits into the same tokens on both sides of the comparison.
Adds a regression test for the snake_case case.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
bbeb076 to
5863ed6
Compare
Collaborator
|
Thanks for the focused fix. Treating underscores as token boundaries makes sense for snake_case memory titles, and the regression test covers the failure mode clearly. Merged and verified. |
10 tasks
warren618
pushed a commit
that referenced
this pull request
May 13, 2026
Add `vibe-trading memory list/show/search/forget` so users can inspect and manage persistent memory without opening the file store directly. Two recent memory bugs (#87 underscore tokenization, #95 CJK slug collisions) were hard to spot without observability into what is actually saved and recalled. Also expand the PersistentMemory public surface to keep CLI lookup logic out of storage internals: - MEMORY_TYPES tuple as the single source of truth for type names, shared by argparse choices and the CLI style table - list_entries() / find() / remove_entry() back the new commands; find() resolves by title first, then by on-disk filename stem so users can paste either form from the index - remove_entry(entry) skips the re-scan that remove(title) does Existing remove(title) is unchanged; remember_tool stays green.
10 tasks
warren618
pushed a commit
that referenced
this pull request
May 14, 2026
…llic (#104) The previous CJK tokenizer ranges (#87, #95) only matched ``一-鿿`` and ``㐀-䶿``, so memory entries with Thai, Arabic, Hebrew, or Cyrillic titles: - Tokenized to the empty set, making recall always miss (e.g. ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body contained the word). - Had their slug characters stripped to ``_``, so two distinct Thai titles of equal length silently overwrote each other on disk. The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai + Arabic + Hebrew + Cyrillic and is reused by: - ``_TOKEN_RE`` — single alternation pattern, one ``re.findall`` per ``_tokenize`` call (one text scan instead of two; precompiled at module level so it doesn't go through ``re.compile`` cache lookup on each invocation). - ``_SLUG_DISALLOWED_RE`` — negation pattern used by ``add()``. Arabic and Hebrew are deliberately narrowed to the basic letter blocks (U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs, where they would render as invisible-but-distinct filenames. Tests cover tokenization for each script, slug preservation (parametrized across the four new scripts), and a Thai collision- distinction regression. Out of scope: ``agent/src/session/search.py`` has the same CJK-only range in its FTS sanitizer; worth a follow-up PR to consume the same constant.
ykykj
pushed a commit
to ykykj/Vibe-Trading
that referenced
this pull request
May 23, 2026
Add `vibe-trading memory list/show/search/forget` so users can inspect and manage persistent memory without opening the file store directly. Two recent memory bugs (HKUDS#87 underscore tokenization, HKUDS#95 CJK slug collisions) were hard to spot without observability into what is actually saved and recalled. Also expand the PersistentMemory public surface to keep CLI lookup logic out of storage internals: - MEMORY_TYPES tuple as the single source of truth for type names, shared by argparse choices and the CLI style table - list_entries() / find() / remove_entry() back the new commands; find() resolves by title first, then by on-disk filename stem so users can paste either form from the index - remove_entry(entry) skips the re-scan that remove(title) does Existing remove(title) is unchanged; remember_tool stays green.
ykykj
pushed a commit
to ykykj/Vibe-Trading
that referenced
this pull request
May 23, 2026
…llic (HKUDS#104) The previous CJK tokenizer ranges (HKUDS#87, HKUDS#95) only matched ``一-鿿`` and ``㐀-䶿``, so memory entries with Thai, Arabic, Hebrew, or Cyrillic titles: - Tokenized to the empty set, making recall always miss (e.g. ``find_relevant("ถัวเฉลี่ย")`` returned nothing even when the body contained the word). - Had their slug characters stripped to ``_``, so two distinct Thai titles of equal length silently overwrote each other on disk. The new ``_NON_LATIN_SCRIPT_RANGES`` constant covers CJK + Thai + Arabic + Hebrew + Cyrillic and is reused by: - ``_TOKEN_RE`` — single alternation pattern, one ``re.findall`` per ``_tokenize`` call (one text scan instead of two; precompiled at module level so it doesn't go through ``re.compile`` cache lookup on each invocation). - ``_SLUG_DISALLOWED_RE`` — negation pattern used by ``add()``. Arabic and Hebrew are deliberately narrowed to the basic letter blocks (U+0620-U+064A, U+05D0-U+05EA) to keep bidi-control codepoints like U+061C ARABIC LETTER MARK and combining marks out of on-disk slugs, where they would render as invisible-but-distinct filenames. Tests cover tokenization for each script, slug preservation (parametrized across the four new scripts), and a Thai collision- distinction regression. Out of scope: ``agent/src/session/search.py`` has the same CJK-only range in its FTS sanitizer; worth a follow-up PR to consume the same constant.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
_tokenizeinagent/src/memory/persistent.pyuses the regex[a-zA-Z0-9_]{3,}, which treats_as a word character. Memory titlessaved through
RememberToolare slugified to snake_case (e.g.mcp_wiring_test), so they collapse into a single token. A natural-language
recallquery like"mcp wiring"produces tokens{"mcp", "wiring"}and never intersects{"mcp_wiring_test"}—recall returns 0 results despite the memory existing on disk.
Repro
remember(action="save", title="mcp_wiring_test", content="…")remember(action="recall", query="mcp wiring")→count: 0remember(action="forget", title="mcp_wiring_test")→ succeeds (proves the memory was on disk)Fix
Drop
_from the character class so underscores act as token boundaries.mcp_wiring_testnow tokenizes to{"mcp", "wiring", "test"}. Verbatimqueries (
"mcp_wiring_test"as a single string) still work because theysplit into the same three tokens on both sides of the comparison.
Tests
TestTokenizecases still pass.test_underscores_splitcovers the regression scenario directly.