
Conversation

@xming521 (Owner) commented Nov 1, 2025

Summary by Sourcery

Implement concurrent and batch online inference for LLMs with JSON-guided decoding, integrate into the data cleaning workflow, enrich QA generation with chat member relationships, and bump version to 0.3.03.

New Features:

  • Add asynchronous and batch chat methods (chat_async, chat_batch) to OnlineLLM with thread pool support
  • Introduce guided decoding JSON parsing utility (parse_guided_decoding_results) for pydantic validation
  • Extend QA generator to load chat member relationships from users.json and include them in system prompts

Enhancements:

  • Refactor data cleaning pipeline to use chat_batch for scoring, skip image-based samples, and adopt unified JSON parsing
  • Centralize JSON parsing logic for both online and offline inference into a shared function
  • Rename load_csv to load_file and adjust file loading logic in QA generator

Build:

  • Bump project version and config version to 0.3.03

Tests:

  • Update test_full_pipe to copy all files in test data source, removing CSV extension filter

Note

Add concurrent/batch online LLM inference with unified JSON-guided parsing, integrate into cleaning, enrich QA with chat-member relations, and bump version/config to 0.3.03.

  • Core Inference:
    • OnlineLLM: Introduce thread-pooled chat_async/chat_batch with JSON responses; optional guided decoding via Pydantic; shared parser parse_guided_decoding_results (also used by offline vLLM).
    • Offline: Refactor to use shared JSON parser; support parsing OpenAI ChatCompletion.
  • Data Cleaning:
    • Switch to single CLEAN_PROMPT; refactor online cleaning to batch-score via chat_batch, skip image samples, and unify JSON parsing.
  • Dataset/QA Generation:
    • Load users.json per chat folder to capture relation; when add_relation is enabled, append relation and formatted time to system prompt; track talker through QA matching.
    • Rename load_csv to load_file; pipeline adjusted accordingly.
  • Models/Data Filters:
    • Extend skip types (用户上传的GIF表情, sticker2).
    • PII batch analyzer perf: n_process 24, batch_size 32.
  • Config & Templates:
    • Add add_relation flag; increase clean_batch_size default to 50; update example/test configs; version/config_version → 0.3.03 with changelog.
  • Tests:
    • test_full_pipe: copy all files from test data source (no CSV-only filter); update test fixtures.
  • Misc:
    • Update .gitignore and submodule pointer.

Written by Cursor Bugbot for commit 8ee9d05. This will update automatically on new commits. Configure here.

Introduce a ThreadPoolExecutor in the OnlineLLM class to enable
concurrent API calls, significantly boosting throughput for LLM operations.
Refactor the OlineLLMCleaningStrategy to leverage this new batching
capability, allowing multiple data points to be processed in parallel.
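
Below is a minimal sketch of that batching pattern, written as a free function rather than the real `OnlineLLM.chat_batch` method (which, per the class diagram further down, also accepts temperature, max_tokens, top_p, stream, callback, and guided_decoding_class). The one behavior taken from the diff is that a failed call is returned as the raised exception so the caller can check `isinstance(response, Exception)`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def chat_batch_sketch(
    chat_fn: Callable[[str], object], prompts: List[str], max_workers: int = 8
) -> List[object]:
    """Run one blocking chat call per prompt on a thread pool, preserving input order.

    A failed call is returned as the exception it raised instead of aborting the
    whole batch, mirroring the isinstance(response, Exception) check in the
    cleaning strategy.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(chat_fn, prompt) for prompt in prompts]
        results: List[object] = []
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:  # surface the per-prompt error to the caller
                results.append(exc)
        return results
```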

Increase the default `clean_batch_size` in settings and raise
`n_process`/`batch_size` for PII detection to match the new
concurrency. Simplify prompt management by removing a dedicated prompt
for online LLM cleaning. Add context manager support to OnlineLLM for
reliable resource cleanup. Ensure LLM chat responses explicitly request
JSON format.
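
A hedged usage sketch of the context-manager behavior described above; the constructor and `chat_batch` arguments follow the class diagram in the reviewer's guide, the module path is taken from the file-level changes table, and the credentials are placeholders.

```python
# Module path per the file-level changes table (weclone/core/inference/online_infer.py); treat as an assumption.
from weclone.core.inference.online_infer import OnlineLLM

prompts = ["score QA pair 1 ...", "score QA pair 2 ..."]  # placeholder prompts

# __enter__/__exit__ are expected to shut the thread pool down even if the body raises.
with OnlineLLM(
    api_key="sk-placeholder",
    base_url="https://api.example.com/v1",
    model_name="some-model",
    max_workers=8,
) as llm:
    responses = llm.chat_batch(prompts, temperature=0)
```
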
Expands the list of message types to be skipped or ignored by the application.
This ensures proper handling for user-uploaded GIF stickers and a new sticker
type identified as 'sticker2', preventing potential processing errors.

Introduces a unified utility to parse and validate JSON-structured outputs
from both vLLM and OpenAI API inference results using Pydantic models.
This enables guided decoding for the OnlineLLM.chat method.
The existing guided decoding logic in vllm_infer is refactored to use
this new shared utility, improving consistency and error handling across
inference modes.
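
A short usage sketch of the shared parser; the function name and return shape come from the code quoted later on this page, the import path is assumed from the file-level changes table, and the Pydantic model here is a stand-in for the project's `QaPairScoreWithId`.

```python
from typing import List

from pydantic import BaseModel

# Path assumed from the file-level changes table (weclone/core/inference/offline_infer.py).
from weclone.core.inference.offline_infer import parse_guided_decoding_results


class QaPairScore(BaseModel):
    """Stand-in for QaPairScoreWithId: an id plus a numeric score."""

    id: str
    score: int


# Would be List[ChatCompletion] from OnlineLLM.chat_batch or List[RequestOutput] from vLLM.
responses: List = []

parsed, failed_indices = parse_guided_decoding_results(responses, QaPairScore)
scores = [p for p in parsed if p is not None]  # failed items come back as None
```
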
Enhance error messages when JSON parsing of LLM outputs fails in
offline inference. The system now dynamically extracts relevant
text snippets based on the response object type (e.g., RequestOutput,
ChatCompletion), making error logs more informative.

Additionally, suppress verbose debug/info logs from the OpenAI client
and httpx to reduce console noise during online inference.
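
One common way to do that with the standard logging module is shown below; the exact logger names and levels used in the diff are an assumption.

```python
import logging

# Silence chatty request/debug logs from the OpenAI client and httpx during online inference.
for noisy_logger in ("openai", "httpx"):
    logging.getLogger(noisy_logger).setLevel(logging.WARNING)
```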

Update the 'WC-exp' submodule and add 'models_final/*' to .gitignore
for repository hygiene.

- Updated the `qa_generator.py` to include a new mechanism for managing chat member relationships, allowing the addition of contextual information about the relationship between users in conversations.
- Refactored the CSV loading function to support loading user relationship data from a `users.json` file, improving the context provided during QA generation (a condensed sketch of this loading logic follows this list).
- Added a new configuration option `add_relation` to the dataset settings, enabling users to toggle this feature.
- Updated the `.gitignore` to exclude additional data directories and cache files for better repository hygiene.
- Bumped version to 0.3.03 in `pyproject.toml` to reflect these changes.
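
The sketch below condenses the `users.json` handling from the `load_file` hunk quoted in the review comments further down: the relation is read once per chat folder and cached, and failures fall back to no relation (the real code logs a warning instead of passing silently).

```python
import json
import os


def load_relation(folder_path: str, relations: dict) -> None:
    """Read the optional users.json next to a chat export and cache its 'relation' field."""
    folder_name = os.path.basename(folder_path)
    if folder_name in relations:
        return  # already cached for this chat folder
    users_json_path = os.path.join(folder_path, "users.json")
    if not os.path.exists(users_json_path):
        return
    try:
        with open(users_json_path, encoding="utf-8") as f:
            relation = json.load(f).get("relation", "")
        if relation:
            relations[folder_name] = relation
    except (OSError, json.JSONDecodeError):
        pass  # the real code logs a warning here
```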

sourcery-ai bot commented Nov 1, 2025

Reviewer's Guide

Introduces multithreaded and batch-capable online LLM interfaces, unifies JSON parsing for guided decoding, refactors data cleaning to leverage batch chat, enriches QA generation with user relations, updates version/config, and tunes parallelism in PII detection.

Sequence diagram for batch data cleaning with OnlineLLM.chat_batch

sequenceDiagram
    participant "OlineLLMCleaningStrategy"
    participant "OnlineLLM"
    participant "OpenAI API"
    participant "Logger"
    "OlineLLMCleaningStrategy"->>"OnlineLLM": chat_batch(batch_prompts)
    loop for each prompt in batch
        "OnlineLLM"->>"OpenAI API": chat(prompt)
        "OpenAI API"-->>"OnlineLLM": response
    end
    "OnlineLLM"-->>"OlineLLMCleaningStrategy": responses
    alt response is Exception
        "OlineLLMCleaningStrategy"->>"Logger": log error
    else response is valid
        "OlineLLMCleaningStrategy"->>"Logger": log success
    end

ER diagram for MakeDatasetArgs and QaPair with relation field

erDiagram
    MAKE_DATASET_ARGS {
        string add_relation
    }
    QA_PAIR {
        string id
        string messages
        string images
        string talker
    }
    MAKE_DATASET_ARGS ||--o{ QA_PAIR : "generates"

Class diagram for updated OnlineLLM with batch and async chat

classDiagram
    class OnlineLLM {
        -api_key: str
        -base_url: str
        -model_name: str
        -default_system: Optional[str]
        -max_workers: int
        -client: OpenAI
        -executor: ThreadPoolExecutor
        +chat(prompt_text, temperature, max_tokens, top_p, stream): ChatCompletion
        +chat_async(prompt_text, temperature, max_tokens, top_p, stream): Future
        +chat_batch(prompts, temperature, max_tokens, top_p, stream, callback, guided_decoding_class): List/tuple
        +close()
        +__enter__()
        +__exit__(exc_type, exc_val, exc_tb)
    }
    OnlineLLM --> OpenAI
    OnlineLLM --> ThreadPoolExecutor

Class diagram for QaGenerator with user relations support

classDiagram
    class QaGenerator {
        -c: Config
        -relations: dict
        +main()
        +match_qa(messages): List[QaPair]
        +load_file(file_path): List[ChatMessage]
    }
    QaGenerator --> QaPair
    QaGenerator --> ChatMessage

Class diagram for parse_guided_decoding_results utility

classDiagram
    class parse_guided_decoding_results {
        +parse_guided_decoding_results(results, guided_decoding_class): tuple[List[Optional[BaseModel]], List[int]]
    }

Class diagram for MakeDatasetArgs with add_relation field

classDiagram
    class MakeDatasetArgs {
        +add_relation: bool
        ...
    }

File-Level Changes

Change Details Files
Extend OnlineLLM with multithreading, async and batch chat APIs
  • Configured ThreadPoolExecutor with max_workers param
  • Added chat_async and chat_batch methods with callback and guided decoding support
  • Implemented context manager (enter/exit) and shutdown logic
  • Suppressed verbose OpenAI/httpx logs
weclone/core/inference/online_infer.py
Refactor cleaning strategy to process prompts in batches using chat_batch
  • Built inputs list from messages and user/images logic
  • Replaced sequential prompts/JSON parsing with chat_batch responses
  • Handled exceptions per batch item and appended QaPairScoreWithId
weclone/data/clean/strategies.py
Unify JSON parsing for guided decoding results
  • Added parse_guided_decoding_results to extract and validate JSON from ChatCompletion or RequestOutput
  • Replaced duplicate parsing logic in offline and vllm inference functions
weclone/core/inference/offline_infer.py
Enhance QA generator with relation awareness and context enrichment
  • Loaded relations from users.json per folder
  • Tracked conversation_talker throughout match_qa
  • Appended relation info into system prompt when configured
weclone/data/qa_generator.py
weclone/utils/config_models.py
Bump project and config versions and expose new add_relation flag
  • Updated version in pyproject.toml to 0.3.03
  • Incremented tool.weclone config_version to 0.3.03
  • Added add_relation field to MakeDatasetArgs
pyproject.toml
weclone/utils/config_models.py
Increase parallelism thresholds in PII detection
  • Doubled n_process from 12 to 24
  • Doubled batch_size from 16 to 32
weclone/core/PII/pii_detector.py
Adjust test setup to align with loading logic changes
  • Relaxed file extension check in test_full_pipe setup to copy all files
tests/test_full_pipe.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `weclone/data/clean/strategies.py:174-175` </location>
<code_context>
+        inputs = []
+        prompt_template = PromptTemplate.from_template(CLEAN_PROMPT)
+        for qa in data:
+            if qa.images:
+                qa.score = 6
+            else:
+                messages_str = ""
</code_context>

<issue_to_address>
**issue (bug_risk):** Assigning score=6 for items with images may cause confusion.

If scores are limited to 1-5, using 6 for items with images may introduce ambiguity. Consider a dedicated flag or clearer logic for these cases.
</issue_to_address>

### Comment 2
<location> `weclone/core/PII/pii_detector.py:179` </location>
<code_context>
             language=self.language,
             entities=self.filtered_entities,
             score_threshold=self.threshold,
-            n_process=12,
-            batch_size=16,
+            n_process=24,
+            batch_size=32,
         )
</code_context>

<issue_to_address>
**suggestion (performance):** Increasing n_process and batch_size may impact resource usage.

Consider making n_process and batch_size configurable so users can adjust resource usage as needed.

Suggested implementation:

```python
            language=self.language,
            entities=self.filtered_entities,
            score_threshold=self.threshold,
            n_process=n_process,
            batch_size=batch_size,
        )

```

You will also need to:
1. Add `n_process=24` and `batch_size=32` as parameters to the function or method containing this code.
2. Update any calls to this function/method to pass these parameters if custom values are desired.
</issue_to_address>

### Comment 3
<location> `weclone/core/inference/offline_infer.py:62-66` </location>
<code_context>
def parse_guided_decoding_results(
    results: List[RequestOutput] | List[ChatCompletion] | List, guided_decoding_class: type[BaseModel]
) -> tuple[List[Optional[BaseModel]], List[int]]:
    """Parse guided decoding results and return parsed results with failed indices.

    Args:
        results: Raw vLLM generation results
        guided_decoding_class: Pydantic model class for validation

    Returns:
        tuple: (parsed_results, failed_indices) where failed_indices contains
               indices of failed JSON parsing
    """
    parsed_results = []
    failed_indexs = []

    for idx, result in enumerate(results):
        try:
            if isinstance(result, RequestOutput):
                json_text = extract_json_from_text(result.outputs[0].text)
            elif isinstance(result, ChatCompletion):
                json_text = extract_json_from_text(result.choices[0].message.content)
            else:
                raise ValueError(f"Unsupported result type: {type(result)}")
            parsed_result = guided_decoding_class.model_validate_json(json_text)
            parsed_results.append(parsed_result)
        except Exception as e:
            if isinstance(result, RequestOutput):
                log_text = result.outputs[0].text[:100] + "..."
            elif isinstance(result, ChatCompletion):
                log_text = result.choices[0].message.content[:100] + "..."
            else:
                log_text = str(result)[:100] + "..."
            logger.warning(
                f"Failed to parse JSON from result at sequence index {idx}: {log_text}, error: {e}"
            )
            failed_indexs.append(idx)
            parsed_results.append(None)

    return parsed_results, failed_indexs

</code_context>

<issue_to_address>
**issue (code-quality):** Use f-string instead of string concatenation [×3] ([`use-fstring-for-concatenation`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-fstring-for-concatenation/))
</issue_to_address>

### Comment 4
<location> `weclone/core/inference/offline_infer.py:76` </location>
<code_context>
def vllm_infer(
    inputs: List[str],
    model_name_or_path: str,
    adapter_name_or_path: Optional[str] = None,
    dataset: str = "alpaca_en_demo",
    dataset_dir: str = "data",
    template: str = "default",
    cutoff_len: int = 2048,
    max_samples: Optional[int] = None,
    vllm_config: str = "{}",
    save_name: str = "generated_predictions.jsonl",
    default_system: Optional[str] = None,
    enable_thinking: bool = False,
    temperature: float = 0.95,
    top_p: float = 0.7,
    top_k: int = 50,
    guided_decoding_class: Optional[type[BaseModel]] = None,
    bad_words: Optional[List[str]] = None,
    logprobs: Optional[int] = None,
    max_new_tokens: int = 1024,
    repetition_penalty: float = 1.0,
    skip_special_tokens: bool = True,
    seed: Optional[int] = None,
    pipeline_parallel_size: int = 1,
    image_max_pixels: int = 768 * 768,
    image_min_pixels: int = 32 * 32,
) -> tuple[List[RequestOutput] | List[Optional[BaseModel]], List[int]]:
    r"""Perform batch generation using vLLM engine, which supports tensor parallelism.

    Returns:
        tuple: (results, failed_indices) where failed_indices contains indices of failed JSON parsing
    """
    if pipeline_parallel_size > get_device_count():
        raise ValueError("Pipeline parallel size should be smaller than the number of gpus.")

    wc_vllm_args = cast(VllmArgs, load_config("vllm"))
    model_args, data_args, _, generating_args = get_infer_args(
        {
            "model_name_or_path": model_name_or_path,
            "adapter_name_or_path": adapter_name_or_path,
            "dataset": dataset,
            "dataset_dir": dataset_dir,
            "template": template,
            "cutoff_len": cutoff_len,
            "max_samples": max_samples,
            "preprocessing_num_workers": 16,
            "vllm_config": vllm_config,
            "temperature": temperature,
            "top_p": top_p,
            "top_k": top_k,
            "max_new_tokens": max_new_tokens,
            "repetition_penalty": repetition_penalty,
            "enable_thinking": enable_thinking,
        }
    )

    tokenizer_module = load_tokenizer(model_args)
    tokenizer = tokenizer_module["tokenizer"]
    template_obj = get_template_and_fix_tokenizer(tokenizer, data_args)
    template_obj.mm_plugin.expand_mm_tokens = False  # for vllm generate

    if guided_decoding_class:
        json_schema = guided_decoding_class.model_json_schema()
        guided_decoding_params = GuidedDecodingParams(json=json_schema, disable_any_whitespace=True)

    sampling_params = SamplingParams(
        repetition_penalty=generating_args.repetition_penalty or 1.0,
        temperature=generating_args.temperature,
        top_p=generating_args.top_p or 1.0,
        top_k=generating_args.top_k or -1,
        stop_token_ids=template_obj.get_stop_token_ids(tokenizer),
        max_tokens=generating_args.max_new_tokens,
        skip_special_tokens=skip_special_tokens,
        seed=seed,
        bad_words=bad_words,
        guided_decoding=guided_decoding_params if guided_decoding_class else None,
    )
    if model_args.adapter_name_or_path is not None:
        lora_request = LoRARequest("default", 1, model_args.adapter_name_or_path[0])
    else:
        lora_request = None

    engine_args = {
        "model": model_args.model_name_or_path,
        "trust_remote_code": True,
        "dtype": model_args.infer_dtype,
        "max_model_len": cutoff_len + max_new_tokens,
        "disable_log_stats": True,
        "enable_lora": model_args.adapter_name_or_path is not None,
        "enable_prefix_caching": True,
        "guided_decoding_backend": "guidance",
        "guided_decoding_disable_any_whitespace": True,
    }

    if template_obj.mm_plugin.__class__.__name__ != "BasePlugin":
        engine_args["limit_mm_per_prompt"] = {"image": 4, "video": 2, "audio": 2}

    wc_vllm_dict = {k: v for k, v in wc_vllm_args.model_dump().items() if v is not None}
    engine_args.update(wc_vllm_dict)

    if isinstance(model_args.vllm_config, dict):
        engine_args.update(model_args.vllm_config)

    messages_list = [[{"role": "user", "content": text}] for text in inputs]

    llm = LLM(**engine_args)

    results = llm.chat(
        messages_list,
        sampling_params,
        lora_request=lora_request,
        chat_template_kwargs={"enable_thinking": enable_thinking},
    )  # type: ignore

    del llm
    torch.cuda.empty_cache()

    if guided_decoding_class:
        # TODO better json decode  https://github.com/vllm-project/vllm/commit/1d0ae26c8544fd5a62e171e30c2dcc2973a23bc8#diff-3b27790a2ce97bc50cdd5476f7b0057da682ed0d1ec8426a7b76c5e21454e57d
        parsed_results, failed_indexs = parse_guided_decoding_results(results, guided_decoding_class)
        return parsed_results, failed_indexs
    else:
        return results, []

</code_context>

<issue_to_address>
**issue (code-quality):** We've found these issues:

- Merge dictionary updates via the union operator [×2] ([`dict-assign-update-to-union`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/dict-assign-update-to-union/))
- Low code quality found in vllm\_infer - 23% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

<br/><details><summary>Explanation</summary>

The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>

### Comment 5
<location> `weclone/data/clean/strategies.py:159` </location>
<code_context>
    def judge(self, data: List[QaPair]) -> None:
        config = self.make_dataset_config
        logger.info("Starting online model scoring of data")
        logger.info(f"Using model {config.model_name}")

        client = OnlineLLM(
            api_key=config.llm_api_key,
            base_url=config.base_url,
            model_name=config.model_name,
            max_workers=config.clean_batch_size + 5,
        )

        inputs = []
        prompt_template = PromptTemplate.from_template(CLEAN_PROMPT)
        for qa in data:
            if qa.images:
                qa.score = 6
            else:
                messages_str = ""
                for msg in qa.messages:
                    if msg.role == "user":
                        messages_str += f"Q: {msg.content}\n"
                    elif msg.role == "assistant":
                        messages_str += f"A: {msg.content}\n"
                prompt_value = prompt_template.invoke({"id": qa.id, "messages": messages_str.strip()})
                inputs.append(prompt_value.to_string())

        prompt_template = PromptTemplate.from_template(CLEAN_PROMPT)

        parsed_scores = []
        clean_batch_size = config.clean_batch_size

        for i in tqdm(range(0, len(inputs), clean_batch_size), desc="Online model scoring progress"):
            batch = inputs[i : i + clean_batch_size]

            try:
                responses = client.chat_batch(batch, temperature=0)

                for j, response in enumerate(responses):
                    if isinstance(response, Exception):
                        logger.error(f"Failed to get response for batch item {i + j}: {str(response)}")
                        continue
                    result_text = response.choices[0].message.content
                    if "</think>" in result_text:
                        result_text = result_text.split("</think>", 1)[1]
                    result_text = re.sub(r"^```json\s*|```$", "", result_text.strip(), flags=re.MULTILINE)
                    try:
                        item_res = json.loads(result_text)
                    except json.JSONDecodeError as e:
                        logger.error(
                            f"JSON parsing failed for batch item {i + j}: {e}\nContent: {result_text}"
                        )
                        continue
                    parsed_scores.append(QaPairScoreWithId(**item_res))
            except Exception as e:
                logger.error(
                    f"Failed to call online model or parse result for batch starting at index {i}, error: {str(e)}"
                )

        score_map = {score.id: score.score for score in parsed_scores}
        for qa in data:
            if qa.id in score_map:
                qa.score = score_map[qa.id]
            else:
                logger.warning(f"No score obtained for QA ID {qa.id}, default assigned 0")
                qa.score = 0

        scores = [qa.score for qa in data if qa.score is not None]
        score_series = pd.Series(scores)
        score_counts = score_series.value_counts().sort_index()
        score_percentages = score_series.value_counts(normalize=True).sort_index() * 100
        pd.set_option("display.unicode.east_asian_width", True)
        distribution_df = pd.DataFrame(
            {
                "Count": score_counts,
                "Percentage(%)": score_percentages.round(2),
            }
        )
        distribution_df.index.name = "Score"
        printable_df_str = distribution_df.reset_index().to_string(index=False)
        logger.success(f"Online model scoring distribution:\n{printable_df_str}")

</code_context>

<issue_to_address>
**issue (code-quality):** Low code quality found in OlineLLMCleaningStrategy.judge - 18% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>

### Comment 6
<location> `weclone/data/qa_generator.py:560-561` </location>
<code_context>
    def load_file(self, file_path) -> List[ChatMessage]:
        """
        Perform overall first preprocessing, filter rows that don't meet conditions, check if images exist and change type to cut if not, add DataModality field
        """
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)

        if folder_name not in self.relations:
            users_json_path = os.path.join(folder_path, "users.json")
            if os.path.exists(users_json_path):
                try:
                    with open(users_json_path, encoding="utf-8") as f:
                        users_data = json.load(f)
                        relation = users_data.get("relation", "")
                        if relation:
                            self.relations[folder_name] = relation
                            logger.debug(f"Loaded relation for {folder_name}: {relation}")
                except (FileNotFoundError, json.JSONDecodeError) as e:
                    logger.warning(f"Failed to load users.json from {folder_path}: {e}")

        df = pd.read_csv(
            file_path,
            encoding="utf-8",
            dtype={"msg": str, "src": str},
            escapechar=None,
            keep_default_na=False,
        )

        df = df[~df["type_name"].isin(values=self.skip_type_list)]

        if "is_forward" in df.columns:
            df = df[~((df["is_sender"] == 1) & (df["is_forward"]))]

        # Batch process text messages for PII detection and blocked words
        text_indices = []
        text_messages = []

        for i in df.index:
            if df.loc[i, "type_name"].lower() in ["文本", "text"]:  # type: ignore
                msg_str = str(df.loc[i, "msg"])
                msg_str = msg_str.replace("\n", "")
                text_indices.append(i)
                text_messages.append(msg_str)

        # TODO Deleting directly by batch_has_pii returning true/false.
        indices_to_drop = []
        if text_messages:
            pii_results = self.pii_detector.batch_has_pii(text_messages)

            for idx, (df_index, msg_str, has_pii) in enumerate(zip(text_indices, text_messages, pii_results)):
                if has_pii:
                    indices_to_drop.append(df_index)
                    continue

                # Check blocked words
                for blocked_word in self.blocked_words:
                    if blocked_word in msg_str:
                        indices_to_drop.append(df_index)
                        break

        df = df.drop(index=indices_to_drop)

        # Process other message types
        for i in df.index:
            if df.loc[i, "type_name"].lower() in ["文本", "text"]:
                continue
            if df.loc[i, "src"].lower().endswith(".gif"):
                df.loc[i, "src"] = ""
                df.loc[i, "type_name"] = "动画表情" if self.c.platform == PlatformType.CHAT else "sticker"
                continue
            if df.loc[i, "type_name"].lower() in ["图片", "image"]:  # type: ignore
                if self.c.platform in [PlatformType.CHAT, PlatformType.TELEGRAM]:
                    result = check_image_file_exists(str(df.loc[i, "src"]))
                    if isinstance(result, str) and df.loc[i, "is_sender"] == 0:
                        df.loc[i, "src"] = result
                        df.loc[i, "msg"] = "<image>"
                        df.loc[i, "modality"] = DataModality.IMAGE
                    else:
                        df.loc[i, "type_name"] = "Cut"
            elif df.loc[i, "type_name"] in ["sticker", "动画表情"]:
                if self.c.platform in [PlatformType.CHAT, PlatformType.TELEGRAM]:
                    df.loc[i, "src"] = ""
                    continue
            else:
                df.loc[i, "msg"] = ""

        df = df.dropna(how="all")
        # Time format: 2021-07-07 10:27:23
        df["CreateTime"] = pd.to_datetime(df["CreateTime"])

        return [ChatMessage(**row) for row in df.to_dict("records")]  # type: ignore

</code_context>

<issue_to_address>
**suggestion (code-quality):** We've found these issues:

- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Low code quality found in DataProcessor.load\_file - 15% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))

```suggestion
                        if relation := users_data.get("relation", ""):
```

<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into
  their own functions. This is the most important thing you can do - ideally a
  function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
  sits together within the function rather than being scattered.</details>
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Comment on lines +174 to +175
if qa.images:
qa.score = 6

issue (bug_risk): Assigning score=6 for items with images may cause confusion.

If scores are limited to 1-5, using 6 for items with images may introduce ambiguity. Consider a dedicated flag or clearer logic for these cases.

language=self.language,
entities=self.filtered_entities,
score_threshold=self.threshold,
n_process=12,

suggestion (performance): Increasing n_process and batch_size may impact resource usage.

Consider making n_process and batch_size configurable so users can adjust resource usage as needed.

Suggested implementation:

            language=self.language,
            entities=self.filtered_entities,
            score_threshold=self.threshold,
            n_process=n_process,
            batch_size=batch_size,
        )

You will also need to:

  1. Add n_process=24 and batch_size=32 as parameters to the function or method containing this code.
  2. Update any calls to this function/method to pass these parameters if custom values are desired.
Comment on lines +62 to +66
log_text = result.outputs[0].text[:100] + "..."
elif isinstance(result, ChatCompletion):
log_text = result.choices[0].message.content[:100] + "..."
else:
log_text = str(result)[:100] + "..."

issue (code-quality): Use f-string instead of string concatenation [×3] (use-fstring-for-concatenation)
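
The f-string form the rule is asking for looks like this; the branching mirrors the snippet above, while the surrounding function is an assumption:

```python
from openai.types.chat import ChatCompletion  # type referenced in the snippet above

def preview(result) -> str:
    # Truncate each result to 100 characters for logging, using f-strings
    # instead of string concatenation.
    if hasattr(result, "outputs"):                      # vLLM RequestOutput
        return f"{result.outputs[0].text[:100]}..."
    if isinstance(result, ChatCompletion):              # OpenAI chat response
        return f"{result.choices[0].message.content[:100]}..."
    return f"{str(result)[:100]}..."
```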

return parsed_results, failed_indexs


def vllm_infer(

issue (code-quality): We've found these issues:


Explanation

The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
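
As a generic illustration of those two moves (guard clauses plus an extracted helper); the function below is a made-up example, not the actual vllm_infer:

```python
import json
from typing import Optional

def parse_one(raw: Optional[str]) -> Optional[dict]:
    # Guard clauses return early instead of nesting the happy path.
    if not raw:
        return None
    raw = raw.strip()
    if not raw.startswith("{"):
        return None
    return _load_json(raw)

def _load_json(raw: str) -> Optional[dict]:
    # Extracted helper keeps parse_one short and its variables tightly scoped.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None
```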
"""Strategy for data cleaning using large language models"""

# TODO: images clean support
def judge(self, data: List[QaPair]) -> None:

issue (code-quality): Low code quality found in OlineLLMCleaningStrategy.judge - 18% (low-code-quality)


Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
Comment on lines +560 to +561
relation = users_data.get("relation", "")
if relation:

suggestion (code-quality): We've found these issues:

Suggested change
relation = users_data.get("relation", "")
if relation:
if relation := users_data.get("relation", ""):


Explanation
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.
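
For context, a hedged sketch of how a talker-to-relation map might be built from each chat folder's users.json (the folder layout and key names are assumptions inferred from the snippets quoted in this review); it also uses the named-expression form suggested above:

```python
import json
import os
from typing import Dict

def load_relations(dataset_dir: str) -> Dict[str, str]:
    """Map each chat folder (talker) to the relation recorded in its users.json, if any."""
    relations: Dict[str, str] = {}
    for folder in os.listdir(dataset_dir):
        users_path = os.path.join(dataset_dir, folder, "users.json")
        if not os.path.isfile(users_path):
            continue  # guard clause: skip folders without a users.json
        with open(users_path, encoding="utf-8") as f:
            users_data = json.load(f)
        if relation := users_data.get("relation", ""):
            relations[folder] = relation
    return relations
```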
@xming521 xming521 requested a review from Copilot November 1, 2025 09:21

@cursor cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


ids_in_batch = [qa["id"] for qa in qa_list]
logger.error(
f"Failed to call online model or parse result, current batch QA ID list: {ids_in_batch}, error: {str(e)}"
f"Failed to call online model or parse result for batch starting at index {i}, error: {str(e)}"

Bug: Incorrect score override for image QA pairs

In OlineLLMCleaningStrategy.judge(), QA pairs with images are assigned score=6 at line 175, but then at lines 214-219, the code iterates through ALL QA pairs and assigns score=0 if the ID is not in score_map. Since QA pairs with images are not added to the inputs list, they won't be in score_map, causing their score to be overwritten from 6 to 0. The fix should check if the QA pair already has a non-zero score before assigning 0, or skip QA pairs with images in the final loop.
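
A minimal sketch of the second fix (skip image samples in the final loop); QaItemSketch and its field names are simplified stand-ins inferred from the snippets in this review, not the real implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List

IMAGE_SKIP_SCORE = 6  # sentinel given to image samples that are never sent to the LLM

@dataclass
class QaItemSketch:
    id: int
    images: List[str] = field(default_factory=list)
    score: int = 0

def apply_scores(data: List[QaItemSketch], score_map: Dict[int, int]) -> None:
    for qa in data:
        if qa.images:
            qa.score = IMAGE_SKIP_SCORE     # keep the sentinel, do not overwrite with 0
            continue
        qa.score = score_map.get(qa.id, 0)  # missing/unparsed results fall back to 0
```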



Copilot AI left a comment


Pull Request Overview

This PR introduces concurrent batch processing for online LLM inference with JSON schema validation, enriches QA pairs with optional chat member relationships from users.json, and updates the project to version 0.3.03.

  • Adds threaded async/batch chat methods to OnlineLLM for parallel processing (see the sketch after this list)
  • Introduces unified JSON parsing via parse_guided_decoding_results supporting both vLLM and OpenAI responses
  • Extends QA generation to load and inject chat member relationships into system prompts
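
A minimal sketch of the thread-pooled batch chat referenced in the first bullet; the class name, signatures, and JSON response mode are assumptions, not the actual OnlineLLM API:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

class OnlineLLMSketch:
    def __init__(self, client, model: str, max_workers: int = 8):
        self.client = client                      # e.g. an openai.OpenAI instance
        self.model = model
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    def chat(self, prompt: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},   # ask for JSON output
        )
        return resp.choices[0].message.content

    def chat_batch(self, prompts: List[str]) -> List[str]:
        # executor.map preserves input order, so results line up with prompts
        return list(self.executor.map(self.chat, prompts))
```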

Reviewed Changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.

Show a summary per file
| File | Description |
| --- | --- |
| weclone/utils/config_models.py | Adds add_relation flag to control relationship metadata injection |
| weclone/prompts/clean_data.py | Comments out batch prompt template, retains per-item scoring format |
| weclone/data/qa_generator.py | Loads users.json per folder, injects relation/time into system prompts, tracks talker through QA matching |
| weclone/data/models.py | Adds "用户上传的GIF表情" and "sticker2" to skip types |
| weclone/data/clean/strategies.py | Refactors to use chat_batch with per-QA prompts, skips image-based samples |
| weclone/core/inference/online_infer.py | Adds chat_async, chat_batch, thread pool executor, and context manager support |
| weclone/core/inference/offline_infer.py | Extracts shared JSON parsing into parse_guided_decoding_results function |
| weclone/core/PII/pii_detector.py | Increases batch analyzer parallelism (24 processes, batch size 32) |
| tests/tests_data/test_person/test_0_730.csv | Normalizes talker/room fields to test_person |
| tests/test_full_pipe.py | Removes CSV extension filter to copy all test files |
| tests/configs/qwen3.jsonc | Updates model to Qwen2.5-0.5B, enables add_relation/add_time |
| tests/configs/Qwen2.5-VL.jsonc | Enables add_relation and add_time flags |
| settings.template.jsonc | Adds add_relation config, increases clean_batch_size to 50 |
| pyproject.toml | Bumps version to 0.3.03, updates changelog |
| WC-exp | Updates submodule commit |
| .gitignore | Adds exclusions for models_final, /data, /llamaboard_cache |

system_content = self.system_prompt
if self.c.add_time:
system_content += f" Current datetime: {time_stamp.strftime('%m-%d %H:%M:%S')}"
system_content += f"\n 现在是{time_stamp.strftime('%m-%d %H:%M')}"

Copilot AI Nov 1, 2025


[nitpick] Remove leading whitespace after '\n' in formatted string for consistent formatting.

Suggested change
system_content += f"\n 现在是{time_stamp.strftime('%m-%d %H:%M')}"
system_content += f"\n现在是{time_stamp.strftime('%m-%d %H:%M')}"
if self.c.add_relation and talker:
relation = self.relations.get(talker, "")
if relation:
system_content += f"\n 对方是你的{relation},你们正在聊天"

Copilot AI Nov 1, 2025


[nitpick] Remove leading whitespace after '\n' in formatted string for consistent formatting.

Suggested change
system_content += f"\n 对方是你的{relation},你们正在聊天"
system_content += f"\n对方是你的{relation},你们正在聊天"
for item_name in os.listdir(test_data_source_dir):
source_item_path = os.path.join(test_data_source_dir, item_name)
if os.path.isfile(source_item_path) and item_name.lower().endswith('.csv'):
if os.path.isfile(source_item_path) :

Copilot AI Nov 1, 2025


Remove trailing whitespace before colon.

Suggested change
if os.path.isfile(source_item_path) :
if os.path.isfile(source_item_path):
Comment on lines +174 to +175
if qa.images:
qa.score = 6

Copilot AI Nov 1, 2025


The magic number 6 for image-based sample scores should be documented or defined as a named constant to clarify its meaning (e.g., SKIP_IMAGE_SCORE = 6).

@xming521 xming521 closed this Nov 2, 2025

Labels

None yet

2 participants