
Conversation

@xming521 (Owner) commented Nov 1, 2025

Summary by Sourcery

Implement concurrent and batch online inference for LLMs with JSON-guided decoding, integrate into the data cleaning workflow, enrich QA generation with chat member relationships, and bump version to 0.3.03.

New Features:

  • Add asynchronous and batch chat methods (chat_async, chat_batch) to OnlineLLM with thread pool support
  • Introduce guided decoding JSON parsing utility (parse_guided_decoding_results) for pydantic validation
  • Extend QA generator to load chat member relationships from users.json and include them in system prompts

Enhancements:

  • Refactor data cleaning pipeline to use chat_batch for scoring, skip image-based samples, and adopt unified JSON parsing
  • Centralize JSON parsing logic for both online and offline inference into a shared function
  • Rename load_csv to load_file and adjust file loading logic in QA generator

Build:

  • Bump project version and config version to 0.3.03

Tests:

  • Update test_full_pipe to copy all files in test data source, removing CSV extension filter

Note

Add concurrent/batch online LLM inference with unified JSON-guided parsing, integrate into cleaning, enrich QA with chat-member relations, and bump version/config to 0.3.03.

  • Core Inference:
    • OnlineLLM: Introduce thread-pooled chat_async/chat_batch with JSON responses; optional guided decoding via Pydantic; shared parser parse_guided_decoding_results (also used by offline vLLM).
    • Offline: Refactor to use shared JSON parser; support parsing OpenAI ChatCompletion.
  • Data Cleaning:
    • Switch to single CLEAN_PROMPT; refactor online cleaning to batch-score via chat_batch, skip image samples, and unify JSON parsing.
  • Dataset/QA Generation:
    • Load users.json per chat folder to capture relation; when add_relation is enabled, append relation and formatted time to system prompt; track talker through QA matching.
    • Rename load_csv to load_file; pipeline adjusted accordingly.
  • Models/Data Filters:
    • Extend skip types (用户上传的GIF表情, sticker2).
    • PII batch analyzer perf: n_process 24, batch_size 32.
  • Config & Templates:
    • Add add_relation flag; increase clean_batch_size default to 50; update example/test configs; version/config_version → 0.3.03 with changelog.
  • Tests:
    • test_full_pipe: copy all files from test data source (no CSV-only filter); update test fixtures.
  • Misc:
    • Update .gitignore and submodule pointer.

Written by Cursor Bugbot for commit 8ee9d05. This will update automatically on new commits. Configure here.

Introduce a ThreadPoolExecutor in the OnlineLLM class to enable
concurrent API calls, significantly boosting throughput for LLM operations.
Refactor the OlineLLMCleaningStrategy to leverage this new batching
capability, allowing multiple data points to be processed in parallel.
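
Below is a minimal sketch of that batching pattern, written as a free function rather than the real `OnlineLLM.chat_batch` method (which, per the class diagram further down, also accepts temperature, max_tokens, top_p, stream, callback, and guided_decoding_class). The one behavior taken from the diff is that a failed call is returned as the raised exception so the caller can check `isinstance(response, Exception)`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chat_batch_sketch(
    chat_fn: Callable[[str], object], prompts: List[str], max_workers: int = 8
) -> List[object]:
    """Run one blocking chat call per prompt on a thread pool, preserving input order.

    A failed call is returned as the exception it raised instead of aborting the
    whole batch, mirroring the isinstance(response, Exception) check in the
    cleaning strategy.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(chat_fn, prompt) for prompt in prompts]
        results: List[object] = []
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:  # surface the per-prompt error to the caller
                results.append(exc)
        return results
```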

Increase the default `clean_batch_size` in settings and raise
`n_process`/`batch_size` for PII detection to match the new
concurrency. Simplify prompt management by removing a dedicated prompt
for online LLM cleaning. Add context manager support to OnlineLLM for
reliable resource cleanup. Ensure LLM chat responses explicitly request
JSON format.
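
A hedged usage sketch of the context-manager behavior described above; the constructor and `chat_batch` arguments follow the class diagram in the reviewer's guide, the module path is taken from the file-level changes table, and the credentials are placeholders.

```python
# Module path per the file-level changes table (weclone/core/inference/online_infer.py); treat as an assumption.
from weclone.core.inference.online_infer import OnlineLLM

prompts = ["score QA pair 1 ...", "score QA pair 2 ..."]  # placeholder prompts

# __enter__/__exit__ are expected to shut the thread pool down even if the body raises.
with OnlineLLM(
    api_key="sk-placeholder",
    base_url="https://api.example.com/v1",
    model_name="some-model",
    max_workers=8,
) as llm:
    responses = llm.chat_batch(prompts, temperature=0)
```
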
Expands the list of message types to be skipped or ignored by the application.
This ensures proper handling for user-uploaded GIF stickers and a new sticker
type identified as 'sticker2', preventing potential processing errors.

Introduces a unified utility to parse and validate JSON-structured outputs
from both vLLM and OpenAI API inference results using Pydantic models.
This enables guided decoding for the OnlineLLM.chat method.
The existing guided decoding logic in vllm_infer is refactored to use
this new shared utility, improving consistency and error handling across
inference modes.
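
A short usage sketch of the shared parser; the function name and return shape come from the code quoted later on this page, the import path is assumed from the file-level changes table, and the Pydantic model here is a stand-in for the project's `QaPairScoreWithId`.

```python
from typing import List

from pydantic import BaseModel

# Path assumed from the file-level changes table (weclone/core/inference/offline_infer.py).
from weclone.core.inference.offline_infer import parse_guided_decoding_results


class QaPairScore(BaseModel):
    """Stand-in for QaPairScoreWithId: an id plus a numeric score."""

    id: str
    score: int


# Would be List[ChatCompletion] from OnlineLLM.chat_batch or List[RequestOutput] from vLLM.
responses: List = []

parsed, failed_indices = parse_guided_decoding_results(responses, QaPairScore)
scores = [p for p in parsed if p is not None]  # failed items come back as None
```
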
Enhance error messages when JSON parsing of LLM outputs fails in
offline inference. The system now dynamically extracts relevant
text snippets based on the response object type (e.g., RequestOutput,
ChatCompletion), making error logs more informative.

Additionally, suppress verbose debug/info logs from the OpenAI client
and httpx to reduce console noise during online inference.
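
One common way to do that with the standard logging module is shown below; the exact logger names and levels used in the diff are an assumption.

```python
import logging

# Silence chatty request/debug logs from the OpenAI client and httpx during online inference.
for noisy_logger in ("openai", "httpx"):
    logging.getLogger(noisy_logger).setLevel(logging.WARNING)
```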

Update the 'WC-exp' submodule and add 'models_final/*' to .gitignore
for repository hygiene.

- Updated the `qa_generator.py` to include a new mechanism for managing chat member relationships, allowing the addition of contextual information about the relationship between users in conversations.
- Refactored the CSV loading function to support loading user relationship data from a `users.json` file, improving the context provided during QA generation (a condensed sketch of this loading logic follows this list).
- Added a new configuration option `add_relation` to the dataset settings, enabling users to toggle this feature.
- Updated the `.gitignore` to exclude additional data directories and cache files for better repository hygiene.
- Bumped version to 0.3.03 in `pyproject.toml` to reflect these changes.
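
The sketch below condenses the `users.json` handling from the `load_file` hunk quoted in the review comments further down: the relation is read once per chat folder and cached, and failures fall back to no relation (the real code logs a warning instead of passing silently).

```python
import json
import os


def load_relation(folder_path: str, relations: dict) -> None:
    """Read the optional users.json next to a chat export and cache its 'relation' field."""
    folder_name = os.path.basename(folder_path)
    if folder_name in relations:
        return  # already cached for this chat folder
    users_json_path = os.path.join(folder_path, "users.json")
    if not os.path.exists(users_json_path):
        return
    try:
        with open(users_json_path, encoding="utf-8") as f:
            relation = json.load(f).get("relation", "")
        if relation:
            relations[folder_name] = relation
    except (OSError, json.JSONDecodeError):
        pass  # the real code logs a warning here
```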

sourcery-ai bot commented Nov 1, 2025

Reviewer's Guide

Introduces multithreaded and batch-capable online LLM interfaces, unifies JSON parsing for guided decoding, refactors data cleaning to leverage batch chat, enriches QA generation with user relations, updates version/config, and tunes parallelism in PII detection.

Sequence diagram for batch data cleaning with OnlineLLM.chat_batch

sequenceDiagram
    participant "OlineLLMCleaningStrategy"
    participant "OnlineLLM"
    participant "OpenAI API"
    participant "Logger"
    "OlineLLMCleaningStrategy"->>"OnlineLLM": chat_batch(batch_prompts)
    loop for each prompt in batch
        "OnlineLLM"->>"OpenAI API": chat(prompt)
        "OpenAI API"-->>"OnlineLLM": response
    end
    "OnlineLLM"-->>"OlineLLMCleaningStrategy": responses
    alt response is Exception
        "OlineLLMCleaningStrategy"->>"Logger": log error
    else response is valid
        "OlineLLMCleaningStrategy"->>"Logger": log success
    end

ER diagram for MakeDatasetArgs and QaPair with relation field

erDiagram
    MAKE_DATASET_ARGS {
        string add_relation
    }
    QA_PAIR {
        string id
        string messages
        string images
        string talker
    }
    MAKE_DATASET_ARGS ||--o{ QA_PAIR : "generates"

Class diagram for updated OnlineLLM with batch and async chat

classDiagram
    class OnlineLLM {
        -api_key: str
        -base_url: str
        -model_name: str
        -default_system: Optional[str]
        -max_workers: int
        -client: OpenAI
        -executor: ThreadPoolExecutor
        +chat(prompt_text, temperature, max_tokens, top_p, stream): ChatCompletion
        +chat_async(prompt_text, temperature, max_tokens, top_p, stream): Future
        +chat_batch(prompts, temperature, max_tokens, top_p, stream, callback, guided_decoding_class): List/tuple
        +close()
        +__enter__()
        +__exit__(exc_type, exc_val, exc_tb)
    }
    OnlineLLM --> OpenAI
    OnlineLLM --> ThreadPoolExecutor

Class diagram for QaGenerator with user relations support

classDiagram
    class QaGenerator {
        -c: Config
        -relations: dict
        +main()
        +match_qa(messages): List[QaPair]
        +load_file(file_path): List[ChatMessage]
    }
    QaGenerator --> QaPair
    QaGenerator --> ChatMessage

Class diagram for parse_guided_decoding_results utility

classDiagram
    class parse_guided_decoding_results {
        +parse_guided_decoding_results(results, guided_decoding_class): tuple[List[Optional[BaseModel]], List[int]]
    }

Class diagram for MakeDatasetArgs with add_relation field

classDiagram
    class MakeDatasetArgs {
        +add_relation: bool
        ...
    }

File-Level Changes

Change Details Files
Extend OnlineLLM with multithreading, async and batch chat APIs
  • Configured ThreadPoolExecutor with max_workers param
  • Added chat_async and chat_batch methods with callback and guided decoding support
  • Implemented context manager (enter/exit) and shutdown logic
  • Suppressed verbose OpenAI/httpx logs
weclone/core/inference/online_infer.py
Refactor cleaning strategy to process prompts in batches using chat_batch
  • Built inputs list from messages and user/images logic
  • Replaced sequential prompts/JSON parsing with chat_batch responses
  • Handled exceptions per batch item and appended QaPairScoreWithId
weclone/data/clean/strategies.py
Unify JSON parsing for guided decoding results
  • Added parse_guided_decoding_results to extract and validate JSON from ChatCompletion or RequestOutput
  • Replaced duplicate parsing logic in offline and vllm inference functions
weclone/core/inference/offline_infer.py
Enhance QA generator with relation awareness and context enrichment
  • Loaded relations from users.json per folder
  • Tracked conversation_talker throughout match_qa
  • Appended relation info into system prompt when configured
weclone/data/qa_generator.py
weclone/utils/config_models.py
Bump project and config versions and expose new add_relation flag
  • Updated version in pyproject.toml to 0.3.03
  • Incremented tool.weclone config_version to 0.3.03
  • Added add_relation field to MakeDatasetArgs
pyproject.toml
weclone/utils/config_models.py
Increase parallelism thresholds in PII detection
  • Doubled n_process from 12 to 24
  • Doubled batch_size from 16 to 32
weclone/core/PII/pii_detector.py
Adjust test setup to align with loading logic changes
  • Relaxed file extension check in test_full_pipe setup to copy all files
tests/test_full_pipe.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `weclone/data/clean/strategies.py:174-175` </location>
<code_context>
+        inputs = []
+        prompt_template = PromptTemplate.from_template(CLEAN_PROMPT)
+        for qa in data:
+            if qa.images:
+                qa.score = 6
+            else:
+                messages_str = ""
</code_context>

<issue_to_address>
**issue (bug_risk):** Assigning score=6 for items with images may cause confusion.

If scores are limited to 1-5, using 6 for items with images may introduce ambiguity. Consider a dedicated flag or clearer logic for these cases.
</issue_to_address>

### Comment 2
<location> `weclone/core/PII/pii_detector.py:179` </location>
<code_context>
             language=self.language,
             entities=self.filtered_entities,
             score_threshold=self.threshold,
-            n_process=12,
-            batch_size=16,
+            n_process=24,
+            batch_size=32,
         )
</code_context>

<issue_to_address>
**suggestion (performance):** Increasing n_process and batch_size may impact resource usage.

Consider making n_process and batch_size configurable so users can adjust resource usage as needed.

Suggested implementation:

```python
            language=self.language,
            entities=self.filtered_entities,
            score_threshold=self.threshold,
            n_process=n_process,
            batch_size=batch_size,
        )

```

You will also need to:
1. Add `n_process=24` and `batch_size=32` as parameters to the function or method containing this code.
2. Update any calls to this function/method to pass these parameters if custom values are desired.
</issue_to_address>

### Comment 3
<location> `weclone/core/inference/offline_infer.py:62-66` </location>
<code_context>
def parse_guided_decoding_results(
    results: List[RequestOutput] | List[ChatCompletion] | List, guided_decoding_class: type[BaseModel]
) -> tuple[List[Optional[BaseModel]], List[int]]:
    """Parse guided decoding results and return parsed results with failed indices.

    Args:
        results: Raw vLLM generation results
        guided_decoding_class: Pydantic model class for validation

    Returns:
        tuple: (parsed_results, failed_indices) where failed_indices contains
               indices of failed JSON parsing
    """
    parsed_results = []
    failed_indexs = []

    for idx, result in enumerate(results):
        try:
            if isinstance(result, RequestOutput):
                json_text = extract_json_from_text(result.outputs[0].text)
            elif isinstance(result, ChatCompletion):
                json_text = extract_json_from_text(result.choices[0].message.content)
            else:
                raise ValueError(f"Unsupported result type: {type(result)}")
            parsed_result = guided_decoding_class.model_validate_json(json_text)
            parsed_results.append(parsed_result)
        except Exception as e:
            if isinstance(result, RequestOutput):
                log_text = result.outputs[0].text[:100] + "..."
            elif isinstance(result, ChatCompletion):
                log_text = result.choices[0].message.content[:100] + "..."
            else:
                log_text = str(result)[:100] + "..."
            logger.warning(
                f"Failed to parse JSON from result at sequence index {idx}: {log_text}, error: {e}"
            )
            failed_indexs.append(idx)
            parsed_results.append(None)

    return parsed_results, failed_indexs

</code_context>

<issue_to_address>
**issue (code-quality):** Use f-string instead of string concatenation [×3] ([`use-fstring-for-concatenation`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-fstring-for-concatenation/))
</issue_to_address>

### Comment 4
<location> `weclone/core/inference/offline_infer.py:76` </location>
<code_context>
def vllm_infer(
    inputs: List[str],
    model_name_or_path: str,
    adapter_name_or_path: Optional[str] = None,
    dataset: str = "alpaca_en_demo",
    dataset_dir: str = "data",
    template: str = "default",
    cutoff_len: int = 2048,
    max_samples: Optional[int] = None,
    vllm_config: str = "{}",
    save_name: str = "generated_predictions.jsonl",
    default_system: Optional[str] = None,
    enable_thinking: bool = False,
    temperature: float = 0.95,
    top_p: float = 0.7,
    top_k: int = 50,
    guided_decoding_class: Optional[type[BaseModel]] = None,
    bad_words: Optional[List[str]] = None,
    logprobs: Optional[int] = None,
    max_new_tokens: int = 1024,
    repetition_penalty: float = 1.0,
    skip_special_tokens: bool = True,
    seed: Optional[int] = None,
    pipeline_parallel_size: int = 1,
    image_max_pixels: int = 768 * 768,
    image_min_pixels: int = 32 * 32,
) -> tuple[List[RequestOutput] | List[Optional[BaseModel]], List[int]]:
    r"""Perform batch generation using vLLM engine, which supports tensor parallelism.

    Returns:
        tuple: (results, failed_indices) where failed_indices contains indices of failed JSON parsing
    """
    if pipeline_parallel_size > get_device_count():
        raise ValueError("Pipeline parallel size should be smaller than the number of gpus.")

    wc_vllm_args = cast(VllmArgs, load_config("vllm"))
    model_args, data_args, _, generating_args = get_infer_args(
        {
            "model_name_or_path": model_name_or_path,
            "adapter_name_or_path": adapter_name_or_path,
            "dataset": dataset,
            "dataset_dir": dataset_dir,
            "template": template,
            "cutoff_len": cutoff_len,
            "max_samples": max_samples,
            "preprocessing_num_workers": 16,
            "vllm_config": vllm_config,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
            "max_new_tokens": max_new_tokens,
            "repetition_penalty": repetition_penalty,
            "enable_thinking": enable_thinking,
        }
    )

    tokenizer_module = load_tokenizer(model_args)
    tokenizer = tokenizer_module["tokenizer"]
    template_obj = get_template_and_fix_tokenizer(tokenizer, data_args)
    template_obj.mm_plugin.expand_mm_tokens = False  # for vllm generate

    if guided_decoding_class:
        json_schema = guided_decoding_class.model_json_schema()
        guided_decoding_params = GuidedDecodingParams(json=json_schema, disable_any_whitespace=True)

    sampling_params = SamplingParams(
        repetition_penalty=generating_args.repetition_penalty or 1.0,
        temperature=generating_args.temperature,
        top_p=generating_args.top_p or 1.0,
        top_k=generating_args.top_k or -1,
        stop_token_ids=template_obj.get_stop_token_ids(tokenizer),
        max_tokens=generating_args.max_new_tokens,
        skip_special_tokens=skip_special_tokens,
        seed=seed,
        bad_words=bad_words,
        guided_decoding=guided_decoding_params if guided_decoding_class else None,
    )
    if model_args.adapter_name_or_path is not None:
        lora_request = LoRARequest("default", 1, model_args.adapter_name_or_path[0])
    else:
        lora_request = None

    engine_args = {
        "model": model_args.model_name_or_path,
        "trust_remote_code": True,
        "dtype": model_args.infer_dtype,
        "max_model_len": cutoff_len + max_new_tokens,
        "disable_log_stats": True,
        "enable_lora": model_args.adapter_name_or_path is not None,
        "enable_prefix_caching": True,
        "guided_decoding_backend": "guidance",
        "guided_decoding_disable_any_whitespace": True,
    }

    if template_obj.mm_plugin.__class__.__name__ != "BasePlugin":
        engine_args["limit_mm_per_prompt"] = {"image": 4, "video": 2, "audio": 2}

    wc_vllm_dict = {k: v for k, v in wc_vllm_args.model_dump().items() if v is not None}
    engine_args.update(wc_vllm_dict)

    if isinstance(model_args.vllm_config, dict):
        engine_args.update(model_args.vllm_config)

    messages_list = [[{"role": "user", "content": text}] for text in inputs]

    llm = LLM(**engine_args)

    results = llm.chat(
        messages_list,
        sampling_params,
        lora_request=lora_request,
        chat_template_kwargs={"enable_thinking": enable_thinking},
    )  # type: ignore

    del llm
    torch.cuda.empty_cache()

    if guided_decoding_class:
        # TODO better json decode  https://github.com/vllm-project/vllm/commit/1d0ae26c8544fd5a62e171e30c2dcc2973a23bc8#diff-3b27790a2ce97bc50cdd5476f7b0057da682ed0d1ec8426a7b76c5e21454e57d
        parsed_results, failed_indexs = parse_guided_decoding_results(results, guided_decoding_class)
        return parsed_results, failed_indexs
    else:
        return results, []

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Merge dictionary updates via the union operator [×2] ([`dict-assign-update-to-union`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/dict-assign-update-to-union/))
- Low code quality found in vllm\_infer - 23% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

<br/><details><summary>Explanation</summary>

The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>

### Comment 5
<location> `weclone/data/clean/strategies.py:159` </location>
<code_context>
    def judge(self, data: List[QaPair]) -> None:
        config = self.make_dataset_config
        logger.info("Starting online model scoring of data")
        logger.info(f"Using model {config.model_name}")

        client = OnlineLLM(
            api_key=config.llm_api_key,
            base_url=config.base_url,
            model_name=config.model_name,
            max_workers=config.clean_batch_size + 5,
        )

        inputs = []
        prompt_template = PromptTemplate.from_template(CLEAN_PROMPT)
        for qa in data:
            if qa.images:
                qa.score = 6
            else:
                messages_str = ""
                for msg in qa.messages:
                    if msg.role == "user":
                        messages_str += f"Q: {msg.content}\n"
                    elif msg.role == "assistant":
                        messages_str += f"A: {msg.content}\n"
                prompt_value = prompt_template.invoke({"id": qa.id, "messages": messages_str.strip()})
                inputs.append(prompt_value.to_string())

        prompt_template = PromptTemplate.from_template(CLEAN_PROMPT)

        parsed_scores = []
        clean_batch_size = config.clean_batch_size

        for i in tqdm(range(0, len(inputs), clean_batch_size), desc="Online model scoring progress"):
            batch = inputs[i : i + clean_batch_size]

            try:
                responses = client.chat_batch(batch, temperature=0)

                for j, response in enumerate(responses):
                    if isinstance(response, Exception):
                        logger.error(f"Failed to get response for batch item {i + j}: {str(response)}")
                        continue
                    result_text = response.choices[0].message.content
                    if "</think>" in result_text:
                        result_text = result_text.split("</think>", 1)[1]
                    result_text = re.sub(r"^```json\s*|```$", "", result_text.strip(), flags=re.MULTILINE)
                    try:
                        item_res = json.loads(result_text)
                    except json.JSONDecodeError as e:
                        logger.error(
                            f"JSON parsing failed for batch item {i + j}: {e}\nContent: {result_text}"
                        )
                        continue
                    parsed_scores.append(QaPairScoreWithId(**item_res))
            except Exception as e:
                logger.error(
                    f"Failed to call online model or parse result for batch starting at index {i}, error: {str(e)}"
                )

        score_map = {score.id: score.score for score in parsed_scores}
        for qa in data:
            if qa.id in score_map:
                qa.score = score_map[qa.id]
            else:
                logger.warning(f"No score obtained for QA ID {qa.id}, default assigned 0")
                qa.score = 0

        scores = [qa.score for qa in data if qa.score is not None]
        score_series = pd.Series(scores)
        score_counts = score_series.value_counts().sort_index()
        score_percentages = score_series.value_counts(normalize=True).sort_index() * 100
        pd.set_option("display.unicode.east_asian_width", True)
        distribution_df = pd.DataFrame(
            {
                "Count": score_counts,
                "Percentage(%)": score_percentages.round(2),
            }
        )
        distribution_df.index.name = "Score"
        printable_df_str = distribution_df.reset_index().to_string(index=False)
        logger.success(f"Online model scoring distribution:\n{printable_df_str}")

</code_context>

<issue_to_address>
**issue (code-quality):** Low code quality found in OlineLLMCleaningStrategy.judge - 18% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>

### Comment 6
<location> `weclone/data/qa_generator.py:560-561` </location>
<code_context>
    def load_file(self, file_path) -> List[ChatMessage]:
        """
        Perform overall first preprocessing, filter rows that don't meet conditions, check if images exist and change type to cut if not, add DataModality field
        """
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)

        if folder_name not in self.relations:
            users_json_path = os.path.join(folder_path, "users.json")
            if os.path.exists(users_json_path):
                try:
                    with open(users_json_path, encoding="utf-8") as f:
                        users_data = json.load(f)
                        relation = users_data.get("relation", "")
                        if relation:
                            self.relations[folder_name] = relation
                            logger.debug(f"Loaded relation for {folder_name}: {relation}")
                except (FileNotFoundError, json.JSONDecodeError) as e:
                    logger.warning(f"Failed to load users.json from {folder_path}: {e}")

        df = pd.read_csv(
            file_path,
            encoding="utf-8",
            dtype={"msg": str, "src": str},
            escapechar=None,
            keep_default_na=False,
        )

        df = df[~df["type_name"].isin(values=self.skip_type_list)]

        if "is_forward" in df.columns:
            df = df[~((df["is_sender"] == 1) & (df["is_forward"]))]

        # Batch process text messages for PII detection and blocked words
        text_indices = []
        text_messages = []

        for i in df.index:
            if df.loc[i, "type_name"].lower() in ["文本", "text"]:  # type: ignore
                msg_str = str(df.loc[i, "msg"])
                msg_str = msg_str.replace("\n", "")
                text_indices.append(i)
                text_messages.append(msg_str)

        # TODO Deleting directly by batch_has_pii returning true/false.
        indices_to_drop = []
        if text_messages:
            pii_results = self.pii_detector.batch_has_pii(text_messages)

            for idx, (df_index, msg_str, has_pii) in enumerate(zip(text_indices, text_messages, pii_results)):
                if has_pii:
                    indices_to_drop.append(df_index)
                    continue

                # Check blocked words
                for blocked_word in self.blocked_words:
                    if blocked_word in msg_str:
                        indices_to_drop.append(df_index)
                        break

        df = df.drop(index=indices_to_drop)

        # Process other message types
        for i in df.index:
            if df.loc[i, "type_name"].lower() in ["文本", "text"]:
                continue
            if df.loc[i, "src"].lower().endswith(".gif"):
                df.loc[i, "src"] = ""
                df.loc[i, "type_name"] = "动画表情" if self.c.platform == PlatformType.CHAT else "sticker"
                continue
            if df.loc[i, "type_name"].lower() in ["图片", "image"]:  # type: ignore
                if self.c.platform in [PlatformType.CHAT, PlatformType.TELEGRAM]:
                    result = check_image_file_exists(str(df.loc[i, "src"]))
                    if isinstance(result, str) and df.loc[i, "is_sender"] == 0:
                        df.loc[i, "src"] = result
                        df.loc[i, "msg"] = "<image>"
                        df.loc[i, "modality"] = DataModality.IMAGE
                    else:
                        df.loc[i, "type_name"] = "Cut"
            elif df.loc[i, "type_name"] in ["sticker", "动画表情"]:
                if self.c.platform in [PlatformType.CHAT, PlatformType.TELEGRAM]:
                    df.loc[i, "src"] = ""
                    continue
            else:
                df.loc[i, "msg"] = ""

        df = df.dropna(how="all")
        # Time format: 2021-07-07 10:27:23
        df["CreateTime"] = pd.to_datetime(df["CreateTime"])

        return [ChatMessage(**row) for row in df.to_dict("records")]  # type: ignore

</code_context>

<issue_to_address>
**suggestion (code-quality):** We've found these issues:

- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Low code quality found in DataProcessor.load\_file - 15% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

```suggestion
                        if relation := users_data.get("relation", ""):
```

<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Comment on lines +174 to +175
if qa.images:
qa.score = 6

issue (bug_risk): Assigning score=6 for items with images may cause confusion.

If scores are limited to 1-5, using 6 for items with images may introduce ambiguity. Consider a dedicated flag or clearer logic for these cases.

language=self.language,
entities=self.filtered_entities,
score_threshold=self.threshold,
n_process=12,

suggestion (performance): Increasing n_process and batch_size may impact resource usage.

Consider making n_process and batch_size configurable so users can adjust resource usage as needed.

Suggested implementation:

            language=self.language,
            entities=self.filtered_entities,
            score_threshold=self.threshold,
            n_process=n_process,
            batch_size=batch_size,
        )

You will also need to:

  1. Add n_process=24 and batch_size=32 as parameters to the function or method containing this code.
  2. Update any calls to this function/method to pass these parameters if custom values are desired.
Comment on lines +62 to +66
log_text = result.outputs[0].text[:100] + "..."
elif isinstance(result, ChatCompletion):
log_text = result.choices[0].message.content[:100] + "..."
else:
log_text = str(result)[:100] + "..."

issue (code-quality): Use f-string instead of string concatenation [×3] (use-fstring-for-concatenation)
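
The f-string form the rule is asking for looks like this; the branching mirrors the snippet above, while the surrounding function is an assumption:

```python
from openai.types.chat import ChatCompletion  # type referenced in the snippet above

def preview(result) -> str:
    # Truncate each result to 100 characters for logging, using f-strings
    # instead of string concatenation.
    if hasattr(result, "outputs"):                      # vLLM RequestOutput
        return f"{result.outputs[0].text[:100]}..."
    if isinstance(result, ChatCompletion):              # OpenAI chat response
        return f"{result.choices[0].message.content[:100]}..."
    return f"{str(result)[:100]}..."
```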

return parsed_results, failed_indexs


def vllm_infer(

issue (code-quality): We've found these issues:


Explanation

The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
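
As a generic illustration of those two moves (guard clauses plus an extracted helper); the function below is a made-up example, not the actual vllm_infer:

```python
import json
from typing import Optional

def parse_one(raw: Optional[str]) -> Optional[dict]:
    # Guard clauses return early instead of nesting the happy path.
    if not raw:
        return None
    raw = raw.strip()
    if not raw.startswith("{"):
        return None
    return _load_json(raw)

def _load_json(raw: str) -> Optional[dict]:
    # Extracted helper keeps parse_one short and its variables tightly scoped.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None
```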
"""Strategy for data cleaning using large language models"""

# TODO: images clean support
def judge(self, data: List[QaPair]) -> None:

issue (code-quality): Low code quality found in OlineLLMCleaningStrategy.judge - 18% (low-code-quality)


Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
Comment on lines +560 to +561
relation = users_data.get("relation", "")
if relation:

suggestion (code-quality): We've found these issues:

Suggested change
relation = users_data.get("relation", "")
if relation:
if relation := users_data.get("relation", ""):


Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
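
For context, a hedged sketch of how a talker-to-relation map might be built from each chat folder's users.json (the folder layout and key names are assumptions inferred from the snippets quoted in this review); it also uses the named-expression form suggested above:

```python
import json
import os
from typing import Dict

def load_relations(dataset_dir: str) -> Dict[str, str]:
    """Map each chat folder (talker) to the relation recorded in its users.json, if any."""
    relations: Dict[str, str] = {}
    for folder in os.listdir(dataset_dir):
        users_path = os.path.join(dataset_dir, folder, "users.json")
        if not os.path.isfile(users_path):
            continue  # guard clause: skip folders without a users.json
        with open(users_path, encoding="utf-8") as f:
            users_data = json.load(f)
        if relation := users_data.get("relation", ""):
            relations[folder] = relation
    return relations
```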
@xming521 xming521 requested a review from Copilot November 1, 2025 09:21

@cursor cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


ids_in_batch = [qa["id"] for qa in qa_list]
logger.error(
f"Failed to call online model or parse result, current batch QA ID list: {ids_in_batch}, error: {str(e)}"
f"Failed to call online model or parse result for batch starting at index {i}, error: {str(e)}"

Bug: Incorrect score override for image QA pairs

In OlineLLMCleaningStrategy.judge(), QA pairs with images are assigned score=6 at line 175, but then at lines 214-219, the code iterates through ALL QA pairs and assigns score=0 if the ID is not in score_map. Since QA pairs with images are not added to the inputs list, they won't be in score_map, causing their score to be overwritten from 6 to 0. The fix should check if the QA pair already has a non-zero score before assigning 0, or skip QA pairs with images in the final loop.
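
A minimal sketch of the second fix (skip image samples in the final loop); QaItemSketch and its field names are simplified stand-ins inferred from the snippets in this review, not the real implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List

IMAGE_SKIP_SCORE = 6  # sentinel given to image samples that are never sent to the LLM

@dataclass
class QaItemSketch:
    id: int
    images: List[str] = field(default_factory=list)
    score: int = 0

def apply_scores(data: List[QaItemSketch], score_map: Dict[int, int]) -> None:
    for qa in data:
        if qa.images:
            qa.score = IMAGE_SKIP_SCORE     # keep the sentinel, do not overwrite with 0
            continue
        qa.score = score_map.get(qa.id, 0)  # missing/unparsed results fall back to 0
```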



Copilot AI left a comment


Pull Request Overview

This PR introduces concurrent batch processing for online LLM inference with JSON schema validation, enriches QA pairs with optional chat member relationships from users.json, and updates the project to version 0.3.03.

  • Adds threaded async/batch chat methods to OnlineLLM for parallel processing (see the sketch after this list)
  • Introduces unified JSON parsing via parse_guided_decoding_results supporting both vLLM and OpenAI responses
  • Extends QA generation to load and inject chat member relationships into system prompts
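
A minimal sketch of the thread-pooled batch chat referenced in the first bullet; the class name, signatures, and JSON response mode are assumptions, not the actual OnlineLLM API:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

class OnlineLLMSketch:
    def __init__(self, client, model: str, max_workers: int = 8):
        self.client = client                      # e.g. an openai.OpenAI instance
        self.model = model
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    def chat(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},   # ask for JSON output
        )
        return resp.choices[0].message.content

    def chat_batch(self, prompts: List[str]) -> List[str]:
        # executor.map preserves input order, so results line up with prompts
        return list(self.executor.map(self.chat, prompts))
```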

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| weclone/utils/config_models.py | Adds add_relation flag to control relationship metadata injection |
| weclone/prompts/clean_data.py | Comments out batch prompt template, retains per-item scoring format |
| weclone/data/qa_generator.py | Loads users.json per folder, injects relation/time into system prompts, tracks talker through QA matching |
| weclone/data/models.py | Adds "用户上传的GIF表情" and "sticker2" to skip types |
| weclone/data/clean/strategies.py | Refactors to use chat_batch with per-QA prompts, skips image-based samples |
| weclone/core/inference/online_infer.py | Adds chat_async, chat_batch, thread pool executor, and context manager support |
| weclone/core/inference/offline_infer.py | Extracts shared JSON parsing into parse_guided_decoding_results function |
| weclone/core/PII/pii_detector.py | Increases batch analyzer parallelism (24 processes, batch size 32) |
| tests/tests_data/test_person/test_0_730.csv | Normalizes talker/room fields to test_person |
| tests/test_full_pipe.py | Removes CSV extension filter to copy all test files |
| tests/configs/qwen3.jsonc | Updates model to Qwen2.5-0.5B, enables add_relation/add_time |
| tests/configs/Qwen2.5-VL.jsonc | Enables add_relation and add_time flags |
| settings.template.jsonc | Adds add_relation config, increases clean_batch_size to 50 |
| pyproject.toml | Bumps version to 0.3.03, updates changelog |
| WC-exp | Updates submodule commit |
| .gitignore | Adds exclusions for models_final, /data, /llamaboard_cache |

system_content = self.system_prompt
if self.c.add_time:
system_content += f" Current datetime: {time_stamp.strftime('%m-%d %H:%M:%S')}"
system_content += f"\n 现在是{time_stamp.strftime('%m-%d %H:%M')}"

Copilot AI Nov 1, 2025


[nitpick] Remove leading whitespace after '\n' in formatted string for consistent formatting.

Suggested change
system_content += f"\n 现在是{time_stamp.strftime('%m-%d %H:%M')}"
system_content += f"\n现在是{time_stamp.strftime('%m-%d %H:%M')}"
if self.c.add_relation and talker:
relation = self.relations.get(talker, "")
if relation:
system_content += f"\n 对方是你的{relation},你们正在聊天"

Copilot AI Nov 1, 2025


[nitpick] Remove leading whitespace after '\n' in formatted string for consistent formatting.

Suggested change
system_content += f"\n 对方是你的{relation},你们正在聊天"
system_content += f"\n对方是你的{relation},你们正在聊天"
for item_name in os.listdir(test_data_source_dir):
source_item_path = os.path.join(test_data_source_dir, item_name)
if os.path.isfile(source_item_path) and item_name.lower().endswith('.csv'):
if os.path.isfile(source_item_path) :

Copilot AI Nov 1, 2025


Remove trailing whitespace before colon.

Suggested change
if os.path.isfile(source_item_path) :
if os.path.isfile(source_item_path):
Comment on lines +174 to +175
if qa.images:
qa.score = 6

Copilot AI Nov 1, 2025


The magic number 6 for image-based sample scores should be documented or defined as a named constant to clarify its meaning (e.g., SKIP_IMAGE_SCORE = 6).

@xming521 xming521 closed this Nov 2, 2025

Labels

None yet

2 participants