Problem
compress.js has zero CJK-awareness. All regex patterns (ARTICLES, FILLERS, PLEASANTRIES, HEDGES, LEADERS) are English-only, but the ARTICLES regex /\b(?:a|an|the)\s+(?=[a-z])/gi incorrectly strips English articles from CJK-mixed text where they carry semantic weight.
In Chinese technical conversations, English articles before technical terms are meaningful — they signal specificity or quantity:
| Original | After compress | Problem |
| --- | --- | --- |
| "这是一个 a LLM 模型" | "这是一个 LLM 模型" | Lost "a": loses the "one/some" nuance |
| "这个 the MCP server" | "这个 MCP server" | Lost "the": loses the "this specific" nuance |
| "the Agent 的状态" | "Agent 的状态" | Same: specific reference lost |
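The rows above can be reproduced in isolation. This is a minimal sketch: the `ARTICLES` pattern is copied from this issue, while `stripArticles` is a hypothetical helper that applies only the article step and is not a function in compress.js.

```javascript
// ARTICLES copied verbatim from the issue text.
const ARTICLES = /\b(?:a|an|the)\s+(?=[a-z])/gi;

// Hypothetical helper: runs only the article-stripping step.
function stripArticles(text) {
  return text.replace(ARTICLES, '');
}

const samples = [
  '这是一个 a LLM 模型',
  '这个 the MCP server',
  'the Agent 的状态',
];

// Print each input next to what the article step leaves behind.
for (const s of samples) {
  console.log(JSON.stringify(s), '->', JSON.stringify(stripArticles(s)));
}
```

Because of the `i` flag, the lookahead `(?=[a-z])` also accepts the uppercase first letter of `LLM`, `MCP`, and `Agent`, which is why all three inputs lose their article.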
Even worse, the LEADERS regex ^(?:i'll|i will|...)\s+ can accidentally delete English verb phrases that are embedded in Chinese sentences:

| Original | After compress |
| --- | --- |
| "i will 检查这个 bug" | "检查这个 bug" (intent statement lost) |
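A rough reproduction of the row above, under stated assumptions: `LEADERS_SUBSET` is a hypothetical stand-in that contains only the two alternatives visible in this issue (the real pattern is elided as `...`), and the `i` flag is assumed rather than confirmed.

```javascript
// Hypothetical subset of LEADERS: only the alternatives shown in the issue.
// The real pattern has more alternatives, and the `i` flag is an assumption.
const LEADERS_SUBSET = /^(?:i'll|i will)\s+/i;

// Hypothetical helper: runs only the leader-stripping step.
function stripLeader(text) {
  return text.replace(LEADERS_SUBSET, '');
}

// The leading intent phrase is removed even though the rest is Chinese.
console.log(stripLeader('i will 检查这个 bug'));

// The pattern is anchored with ^, so a phrase later in the sentence survives.
console.log(stripLeader('检查这个 bug, i will 修复'));
```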
Root Cause
- The ARTICLES regex uses the \b word boundary, which matches at CJK↔English transitions, so "a LLM" gets caught.
  - In pure English, removing "a/an/the" is fine because they are filler. In CJK-mixed text, English articles are not filler: they are semantic modifiers for technical terms embedded in Chinese.
  - Chinese has no article concept (中文没有冠词), so the English article is the only way to express specificity for the embedded English term.
- compressProse() runs every regex regardless of whether the text contains CJK characters.
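The boundary behaviour is easy to check directly. In a JavaScript regex without the `u` or `v` flag, `\w` is `[A-Za-z0-9_]`, so a CJK character counts as a non-word character and `\b` matches between it and an ASCII letter. The sketch below uses made-up sample strings; only `ARTICLES` comes from this issue.

```javascript
// 个 is \W and t is \w, so \b matches between them.
const boundaryAfterCJK = /\bthe\b/.test('这个the MCP');

// Inside an English word both neighbours are \w, so there is no boundary.
const insideEnglishWord = /\bthe\b/.test('bathe');

// ARTICLES copied from the issue text.
const ARTICLES = /\b(?:a|an|the)\s+(?=[a-z])/gi;

// Even with no space between the Chinese text and the article, it is stripped.
const glued = '这个the MCP server'.replace(ARTICLES, '');

console.log(boundaryAfterCJK, insideEnglishWord, JSON.stringify(glued));
```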
Suggested Fix
In compress.js, add CJK-awareness to compressProse():
    // Detect CJK-mixed text: Chinese, Japanese, Korean characters
    // (CJK ideographs, CJK punctuation, kana, Hangul syllables)
    const CJK_RE = /[\u4e00-\u9fff\u3000-\u303f\u3041-\u30f6\uac00-\ud7a3]/;

    function compressProse(text) {
      const hasCJK = CJK_RE.test(text);
      let s = text;
      s = s.replace(LEADERS, '');
      s = s.replace(PLEASANTRIES, '');
      s = s.replace(HEDGES, '');
      s = s.replace(FILLERS, '');
      // In CJK-mixed text, English articles are semantic, so skip removal
      if (!hasCJK) {
        s = s.replace(ARTICLES, '');
      }
      // ... rest unchanged
    }
This is a minimal, safe change — it only skips ARTICLES removal when CJK characters are detected. All other compression (fillers, pleasantries, hedges) still applies in CJK-mixed text, which is correct — those are genuinely filler in any language context.
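To check the guard on its own, here is a self-contained sketch limited to the article step. `stripArticlesUnlessCJK` is a hypothetical name, the other patterns (LEADERS, FILLERS, and so on) are left out because their definitions are not shown in this issue, and `CJK_RE` is written with `\u` escapes on the assumption that they cover the same ranges as the proposed pattern.

```javascript
// ARTICLES copied from the issue text.
const ARTICLES = /\b(?:a|an|the)\s+(?=[a-z])/gi;

// Assumed equivalent of the proposed CJK_RE, written with explicit escapes:
// CJK ideographs, CJK punctuation, kana, Hangul syllables.
const CJK_RE = /[\u4e00-\u9fff\u3000-\u303f\u3041-\u30f6\uac00-\ud7a3]/;

// Hypothetical helper: strips articles only when the text has no CJK characters.
function stripArticlesUnlessCJK(text) {
  if (CJK_RE.test(text)) {
    return text; // articles are treated as meaningful in CJK-mixed text
  }
  return text.replace(ARTICLES, '');
}

// Pure English is still compressed.
console.log(stripArticlesUnlessCJK('Fix the bug in a parser'));

// CJK-mixed text keeps its article.
console.log(stripArticlesUnlessCJK('这个 the MCP server'));
```

`CJK_RE` has no `g` flag, so repeated `.test()` calls carry no `lastIndex` state between inputs.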
Related
Environment
- caveman v18e45320 (latest)
- Claude Code, Chinese + English mixed conversations
- macOS, Node.js v22