[Bug] CJK/Chinese text: ARTICLES regex deletes English articles that carry semantic weight in mixed-language context #575

Problem

compress.js has no CJK awareness. All of its regex patterns (ARTICLES, FILLERS, PLEASANTRIES, HEDGES, LEADERS) were written for English only, yet they run on every input. As a result, the ARTICLES regex /\b(?:a|an|the)\s+(?=[a-z])/gi also strips English articles from CJK-mixed text, where they carry semantic weight.

In Chinese technical conversations, English articles before technical terms are meaningful — they signal specificity or quantity:

Original	After compress	Problem
"这是一个 a LLM 模型"	"这是一个 LLM 模型"	Dropping "a" loses the "one/some" nuance
"这个 the MCP server"	"这个 MCP server"	Dropping "the" loses the "this specific" nuance
"the Agent 的状态"	"Agent 的状态"	Same: the specific reference is lost
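The rows above can be reproduced in isolation. This is a minimal sketch, assuming the ARTICLES pattern is exactly the one quoted in this issue and that matches are replaced with an empty string; `stripArticles` is a hypothetical helper for illustration, not a function in compress.js.

```javascript
// ARTICLES pattern as quoted in this issue.
const ARTICLES = /\b(?:a|an|the)\s+(?=[a-z])/gi;

// Hypothetical helper: applies only the article-stripping step.
// Assumption: matches are replaced with an empty string.
function stripArticles(text) {
  return text.replace(ARTICLES, '');
}

const samples = [
  '这是一个 a LLM 模型',
  '这个 the MCP server',
  'the Agent 的状态',
];

// Print each sample next to its stripped form.
for (const sample of samples) {
  console.log(JSON.stringify(sample), '->', JSON.stringify(stripArticles(sample)));
}
```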

Worse, the LEADERS regex ^(?:i'll|i will|...)\s+ is anchored at the start of the text, so it can delete an English verb phrase that opens an otherwise Chinese sentence:

Original	After compress	Problem
"i will 检查这个 bug"	"检查这个 bug"	Intent statement lost
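The LEADERS case can be sketched the same way. The full pattern is elided in this issue, so the regex below is a reduced stand-in containing only the two alternatives that are quoted; the `i` flag, the empty replacement string, and the names `LEADERS_SAMPLE` and `stripLeader` are all assumptions made for illustration.

```javascript
// Reduced stand-in for LEADERS: only the two alternatives quoted in this
// issue. The real pattern has more alternatives (elided as "..." above),
// and the `i` flag here is an assumption.
const LEADERS_SAMPLE = /^(?:i'll|i will)\s+/i;

// Hypothetical helper: applies only the leader-stripping step.
function stripLeader(text) {
  return text.replace(LEADERS_SAMPLE, '');
}

// The English leader opens the sentence, so it is removed.
console.log(stripLeader('i will 检查这个 bug'));

// The pattern is anchored with ^, so a leader later in the text is untouched.
console.log(stripLeader('好的, i will 检查这个 bug'));
```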

Root Cause

The ARTICLES regex relies on the \b word boundary. JavaScript's \w covers only ASCII letters, digits, and underscore, so CJK characters count as non-word characters and \b matches at every CJK↔English transition; "a LLM" gets caught even when it directly follows Chinese text
In pure English, removing "a/an/the" is fine because they are filler. In CJK-mixed text, English articles are not filler: they are semantic modifiers for technical terms embedded in Chinese
Chinese has no articles, so the English article is the only way to express specificity for the embedded English term
compressProse() runs every regex regardless of whether the text contains CJK characters
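The word-boundary behaviour named in the first point can be checked directly. The snippet below is a small sketch using only built-in RegExp behaviour; the variable names and sample strings are made up for illustration.

```javascript
// \w is ASCII-only in JavaScript, so a CJK character is a non-word character.
const cjkIsWordChar = /\w/.test('个');

// With no space between the CJK character and the article, \b still matches
// at the transition, so the article is found.
const boundaryAtTransition = /\bthe\b/.test('这个the server');

// Contrast: between two ASCII word characters there is no boundary,
// so "the" inside "bathe" is not matched.
const boundaryInsideWord = /\bthe\b/.test('bathe server');

console.log({ cjkIsWordChar, boundaryAtTransition, boundaryInsideWord });
```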

Suggested Fix

In compress.js, add CJK-awareness to compressProse():

// Detect CJK-mixed text — Chinese, Japanese, Korean characters
const CJK_RE = /[一-鿿　-〿ぁ-ヶ가-힣]/;

function compressProse(text) {
  const hasCJK = CJK_RE.test(text);

  let s = text;
  // Each pass deletes its matches ('' is assumed as the replacement string)
  s = s.replace(LEADERS, '');
  s = s.replace(PLEASANTRIES, '');
  s = s.replace(HEDGES, '');
  s = s.replace(FILLERS, '');

  // In CJK-mixed text, English articles are semantic, so skip their removal
  if (!hasCJK) {
    s = s.replace(ARTICLES, '');
  }

  // ... rest unchanged
}

This is a minimal, safe change: it only skips ARTICLES removal when CJK characters are detected. All other compression (fillers, pleasantries, hedges) still applies in CJK-mixed text, which is correct because those are filler in any language context. LEADERS also still runs, so the "i will" case described above is not addressed by this change.
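To see the effect of the guard on its own, the article step can be isolated. This is a sketch under stated assumptions: `stripArticlesUnlessCJK` is a hypothetical helper, not part of compress.js; the CJK class is written with \u escapes for the blocks named in the comment; and the empty replacement string is assumed.

```javascript
// Article pattern as quoted in this issue.
const ARTICLES_PATTERN = /\b(?:a|an|the)\s+(?=[a-z])/gi;

// CJK detection written with \u escapes: CJK Unified Ideographs,
// CJK Symbols and Punctuation, kana, and Hangul Syllables.
const CJK_PATTERN = /[\u4e00-\u9fff\u3000-\u303f\u3041-\u30f6\uac00-\ud7a3]/;

// Hypothetical helper isolating the guarded step from compressProse().
function stripArticlesUnlessCJK(text) {
  if (CJK_PATTERN.test(text)) {
    return text; // CJK-mixed input: keep English articles
  }
  return text.replace(ARTICLES_PATTERN, ''); // pure non-CJK input: strip them
}

const cases = [
  '这个 the MCP server',
  'the Agent 的状态',
  'restart the MCP server',
];

// Print each case next to its result.
for (const input of cases) {
  console.log(JSON.stringify(input), '->', JSON.stringify(stripArticlesUnlessCJK(input)));
}
```

The check runs once over the whole input, so a single CJK character anywhere disables article removal for all of it, including any purely English sentences in the same text. A per-sentence or per-match check would be finer-grained but is a larger change than the one proposed here.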

Environment

caveman v18e45320 (latest)
Claude Code, Chinese + English mixed conversations
macOS, Node.js v22
