[Bug] CJK/Chinese text — ARTICLES regex deletes English articles that carry semantic weight in mixed-language context #575

Description

@lg320531124

Problem

compress.js has zero CJK-awareness. All regex patterns (ARTICLES, FILLERS, PLEASANTRIES, HEDGES, LEADERS) are English-only, but the ARTICLES regex /\b(?:a|an|the)\s+(?=[a-z])/gi incorrectly strips English articles from CJK-mixed text where they carry semantic weight.

In Chinese technical conversations, English articles before technical terms are meaningful — they signal specificity or quantity:

| Original | After compress | Problem |
| --- | --- | --- |
| "这是一个 a LLM 模型" | "这是一个 LLM 模型" | Lost "a": loses the "one/some" nuance |
| "这个 the MCP server" | "这个 MCP server" | Lost "the": loses the "this specific" nuance |
| "the Agent 的状态" | "Agent 的状态" | Same: specific reference lost |
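
The rows above can be reproduced with a short script. This is a minimal sketch: the ARTICLES pattern is copied from this issue, while `stripArticles` is a hypothetical helper used only for illustration and is not a function from compress.js.

```javascript
// ARTICLES pattern exactly as quoted in this issue.
const ARTICLES = /\b(?:a|an|the)\s+(?=[a-z])/gi;

// Hypothetical helper for illustration; not part of compress.js.
function stripArticles(text) {
  return text.replace(ARTICLES, '');
}

const samples = [
  '这是一个 a LLM 模型',
  '这个 the MCP server',
  'the Agent 的状态',
];

// Print each sample next to its compressed form.
for (const sample of samples) {
  console.log(JSON.stringify(sample), '->', JSON.stringify(stripArticles(sample)));
}
```

Each article is removed because it is followed by whitespace and an ASCII letter, which is all the pattern checks for.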

Even worse, the LEADERS regex ^(?:i'll|i will|...)\s+ can delete an English verb phrase that opens an otherwise Chinese sentence:

| Original | After compress |
| --- | --- |
| "i will 检查这个 bug" | "检查这个 bug" (intent statement lost) |
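
A minimal sketch of the same effect follows. The full LEADERS pattern is elided in this issue, so `LEADERS_SKETCH` is a stand-in that keeps only the two visible alternatives and assumes a case-insensitive flag; `stripLeader` is a hypothetical helper.

```javascript
// Stand-in pattern: only the two alternatives visible in the issue are kept.
// The elided alternatives and the real flags in compress.js are unknown
// (assumption: case-insensitive).
const LEADERS_SKETCH = /^(?:i'll|i will)\s+/i;

// Hypothetical helper for illustration; not part of compress.js.
function stripLeader(text) {
  return text.replace(LEADERS_SKETCH, '');
}

console.log(stripLeader('i will 检查这个 bug')); // prints "检查这个 bug"
```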

Root Cause

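  1. The ARTICLES regex relies on the \b word boundary. JavaScript's \w class is ASCII-only, so CJK characters count as non-word characters and \b also matches at CJK↔English transitions, so "a LLM" gets caught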
  2. In pure English, removing "a/an/the" is fine — they're filler. In CJK-mixed text, English articles are not filler — they're semantic modifiers for technical terms embedded in Chinese
  3. Chinese has no article concept (中文没有冠词), so the English article is the only way to express specificity for the embedded English term
  4. compressProse() runs all regex regardless of whether the text contains CJK characters
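
The word-boundary behaviour in point 1 can be checked directly. The sketch below uses illustrative names (`isWordChar`, `articleAtBoundary`) that do not come from compress.js.

```javascript
// \w is ASCII-only in JavaScript, so a CJK character is a non-word character.
const isWordChar = (ch) => /\w/.test(ch);

// \b therefore matches between a CJK character and an ASCII letter,
// even when there is no space between them.
const articleAtBoundary = /\bthe\b/;

console.log(isWordChar('个'));                          // false
console.log(isWordChar('t'));                           // true
console.log(articleAtBoundary.test('这个the server'));  // true
console.log(articleAtBoundary.test('bathe server'));    // false
```

The last line shows the boundary doing its intended job in pure English: "the" inside "bathe" is not treated as a standalone word.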

Suggested Fix

In compress.js, add CJK-awareness to compressProse():

// Detect CJK-mixed text — Chinese, Japanese, Korean characters
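// Ranges are an assumption (only the U+9FFF upper bound is legible in the issue text):
// CJK punctuation, kana, CJK ideographs incl. Extension A, Hangul syllables
const CJK_RE = /[\u3000-\u303f\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]/;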

function compressProse(text) {
  const hasCJK = CJK_RE.test(text);

  let s = text;
  s = s.replace(LEADERS, '');
  s = s.replace(PLEASANTRIES, '');
  s = s.replace(HEDGES, '');
  s = s.replace(FILLERS, '');

  // In CJK-mixed text, English articles are semantic — skip removal
  if (!hasCJK) {
    s = s.replace(ARTICLES, '');
  }

  // ... rest unchanged
}

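This is a minimal, safe change: it only skips ARTICLES removal when CJK characters are detected. All other compression (fillers, pleasantries, hedges) still applies in CJK-mixed text, which is correct, since those are genuinely filler in any language context. Note that this change alone does not address the LEADERS case described above, where "i will" is stripped from the start of a mixed-language sentence; that would need its own guard.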
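
As an alternative to hard-coded code point ranges, CJK detection could use Unicode script property escapes, which JavaScript supports under the `u` flag. The sketch below is not the proposed patch itself: `CJK_SCRIPT_RE` and `stripArticlesUnlessCJK` are hypothetical names, and only the ARTICLES pattern (copied from this issue) is applied, since the other patterns in compress.js are not shown here.

```javascript
// ARTICLES pattern exactly as quoted in this issue.
const ARTICLES = /\b(?:a|an|the)\s+(?=[a-z])/gi;

// Hypothetical detector based on Unicode script properties instead of ranges.
// No `g` flag, so test() carries no lastIndex state between calls.
const CJK_SCRIPT_RE =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/u;

// Hypothetical helper: strip articles only when the text contains no CJK.
function stripArticlesUnlessCJK(text) {
  if (CJK_SCRIPT_RE.test(text)) {
    return text; // CJK-mixed text: keep articles untouched
  }
  return text.replace(ARTICLES, '');
}

console.log(stripArticlesUnlessCJK('fix the bug in a parser')); // prints "fix bug in parser"
console.log(stripArticlesUnlessCJK('这个 the MCP server'));      // prints "这个 the MCP server"
```

Script properties cover every code point assigned to each script, so the detector does not need updating when a range is missed. The trade-off is that punctuation shared across scripts is not matched on its own.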

Environment

  • caveman v18e45320 (latest)
  • Claude Code, Chinese + English mixed conversations
  • macOS, Node.js v22
