Dev #205
Conversation
Introduce a ThreadPoolExecutor in the OnlineLLM class to enable concurrent API calls, significantly boosting throughput for LLM operations. Refactor the OlineLLMCleaningStrategy to leverage this new batching capability, allowing multiple data points to be processed in parallel. Increase the default `clean_batch_size` in settings and enhance `n_process`/`batch_size` for PII detection to optimize for the new concurrency. Simplify prompt management by removing a dedicated prompt for online LLM cleaning. Add context manager support to OnlineLLM for reliable resource cleanup. Ensure LLM chat responses explicitly request JSON format.
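For orientation, the concurrency pattern described here can be sketched as below. This is a minimal illustration, not the PR's actual implementation: the constructor arguments, defaults, and return handling are assumptions, and only the method names (`chat`, `chat_async`, `chat_batch`, `close`) follow the class diagram later in the review.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Union

from openai import OpenAI


class OnlineLLM:
    """Sketch of an OpenAI-compatible client with a thread pool for concurrent calls."""

    def __init__(self, api_key: str, base_url: str, model_name: str, max_workers: int = 8):
        self.client = OpenAI(api_key=api_key, base_url=base_url)
        self.model_name = model_name
        self.executor = ThreadPoolExecutor(max_workers=max_workers)

    def chat(self, prompt_text: str, temperature: float = 0.0):
        # Explicitly request JSON output, as the PR description mentions.
        return self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt_text}],
            temperature=temperature,
            response_format={"type": "json_object"},
        )

    def chat_async(self, prompt_text: str, **kwargs) -> Future:
        # Submit one chat call to the pool and return the Future immediately.
        return self.executor.submit(self.chat, prompt_text, **kwargs)

    def chat_batch(self, prompts: List[str], **kwargs) -> List[Union[object, Exception]]:
        # Fan out all prompts, then gather results in order; per-item failures are
        # returned as Exception objects so the caller can log them and continue.
        futures = [self.chat_async(p, **kwargs) for p in prompts]
        results: List[Union[object, Exception]] = []
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:
                results.append(exc)
        return results

    def close(self):
        self.executor.shutdown(wait=True)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()
```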
Expands the list of message types to be skipped or ignored by the application. This ensures proper handling for user-uploaded GIF stickers and a new sticker type identified as 'sticker2', preventing potential processing errors.
Introduces a unified utility to parse and validate JSON-structured outputs from both vLLM and OpenAI API inference results using Pydantic models. This enables guided decoding for the OnlineLLM.chat method. The existing guided decoding logic in vllm_infer is refactored to use this new shared utility, improving consistency and error handling across inference modes.
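At its core the shared utility validates extracted JSON text against a Pydantic model. A minimal, self-contained illustration of that validation step follows; the `QaPairScoreWithId` fields shown here are a simplified stand-in for the real model, and the raw string stands in for text extracted from an inference result.

```python
from pydantic import BaseModel, ValidationError


class QaPairScoreWithId(BaseModel):
    # Simplified stand-in; the real model in the repo may carry more fields.
    id: str
    score: int


raw_json = '{"id": "qa_001", "score": 4}'  # e.g. JSON extracted from a model response
try:
    parsed = QaPairScoreWithId.model_validate_json(raw_json)
    print(parsed.id, parsed.score)
except ValidationError as exc:
    # The unified parser records the failing index instead of raising.
    print(f"JSON validation failed: {exc}")
```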
Enhance error messages when JSON parsing of LLM outputs fails in offline inference. The system now dynamically extracts relevant text snippets based on the response object type (e.g., RequestOutput, ChatCompletion), making error logs more informative. Additionally, suppress verbose debug/info logs from the OpenAI client and httpx to reduce console noise during online inference. Update the 'WC-exp' submodule and add 'models_final/*' to .gitignore for repository hygiene.
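Suppressing the HTTP-layer chatter usually amounts to raising the relevant loggers' levels. A minimal sketch, assuming the standard `openai`/`httpx` logger names (the exact loggers the PR adjusts are not shown here):

```python
import logging

# Only warnings and errors from the OpenAI client and its HTTP stack reach the console.
for noisy_logger in ("openai", "httpx", "httpcore"):
    logging.getLogger(noisy_logger).setLevel(logging.WARNING)
```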
- Updated the `qa_generator.py` to include a new mechanism for managing chat member relationships, allowing the addition of contextual information about the relationship between users in conversations.
- Refactored the CSV loading function to support loading user relationship data from a `users.json` file, improving the context provided during QA generation (see the sketch after this list).
- Added a new configuration option `add_relation` to the dataset settings, enabling users to toggle this feature.
- Updated the `.gitignore` to exclude additional data directories and cache files for better repository hygiene.
- Bumped version to 0.3.03 in `pyproject.toml` to reflect these changes.
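A rough sketch of the relation loading and prompt injection described in the list above. The file layout, the `relation` key, and the appended wording follow the review excerpts further down; the helper functions themselves are hypothetical.

```python
import json
import os


def load_relation(folder_path: str) -> str:
    """Read the optional users.json next to a chat export and return its 'relation' field."""
    users_json_path = os.path.join(folder_path, "users.json")
    if not os.path.exists(users_json_path):
        return ""
    try:
        with open(users_json_path, encoding="utf-8") as f:
            return json.load(f).get("relation", "")
    except (OSError, json.JSONDecodeError):
        return ""


def build_system_prompt(base_prompt: str, relation: str, add_relation: bool) -> str:
    """Append relationship context when the add_relation flag is enabled."""
    if add_relation and relation:
        # Wording taken from the diff excerpt reviewed below.
        return f"{base_prompt}\n对方是你的{relation},你们正在聊天"
    return base_prompt
```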
Reviewer's Guide
Introduces multithreaded and batch-capable online LLM interfaces, unifies JSON parsing for guided decoding, refactors data cleaning to leverage batch chat, enriches QA generation with user relations, updates version/config, and tunes parallelism in PII detection.
Sequence diagram for batch data cleaning with OnlineLLM.chat_batch
sequenceDiagram
participant "OlineLLMCleaningStrategy"
participant "OnlineLLM"
participant "OpenAI API"
participant "Logger"
"OlineLLMCleaningStrategy"->>"OnlineLLM": chat_batch(batch_prompts)
loop for each prompt in batch
"OnlineLLM"->>"OpenAI API": chat(prompt)
"OpenAI API"-->>"OnlineLLM": response
end
"OnlineLLM"-->>"OlineLLMCleaningStrategy": responses
alt response is Exception
"OlineLLMCleaningStrategy"->>"Logger": log error
else response is valid
"OlineLLMCleaningStrategy"->>"Logger": log success
end
ER diagram for MakeDatasetArgs and QaPair with relation field
erDiagram
MAKE_DATASET_ARGS {
string add_relation
}
QA_PAIR {
string id
string messages
string images
string talker
}
MAKE_DATASET_ARGS ||--o{ QA_PAIR : "generates"
Class diagram for updated OnlineLLM with batch and async chat
classDiagram
class OnlineLLM {
-api_key: str
-base_url: str
-model_name: str
-default_system: Optional[str]
-max_workers: int
-client: OpenAI
-executor: ThreadPoolExecutor
+chat(prompt_text, temperature, max_tokens, top_p, stream): ChatCompletion
+chat_async(prompt_text, temperature, max_tokens, top_p, stream): Future
+chat_batch(prompts, temperature, max_tokens, top_p, stream, callback, guided_decoding_class): List/tuple
+close()
+__enter__()
+__exit__(exc_type, exc_val, exc_tb)
}
OnlineLLM --> OpenAI
OnlineLLM --> ThreadPoolExecutor
Class diagram for QaGenerator with user relations support
classDiagram
class QaGenerator {
-c: Config
-relations: dict
+main()
+match_qa(messages): List[QaPair]
+load_file(file_path): List[ChatMessage]
}
QaGenerator --> QaPair
QaGenerator --> ChatMessage
Class diagram for parse_guided_decoding_results utility
classDiagram
class parse_guided_decoding_results {
+parse_guided_decoding_results(results, guided_decoding_class): tuple[List[Optional[BaseModel]], List[int]]
}
Class diagram for MakeDatasetArgs with add_relation field
classDiagram
class MakeDatasetArgs {
+add_relation: bool
...
}
File-Level Changes
Tips and commands
Interacting with Sourcery
Customizing Your Experience
Access your dashboard to:
Getting Help
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `weclone/data/clean/strategies.py:174-175` </location>
<code_context>
+ inputs = []
+ prompt_template = PromptTemplate.from_template(CLEAN_PROMPT)
+ for qa in data:
+ if qa.images:
+ qa.score = 6
+ else:
+ messages_str = ""
</code_context>
<issue_to_address>
**issue (bug_risk):** Assigning score=6 for items with images may cause confusion.
If scores are limited to 1-5, using 6 for items with images may introduce ambiguity. Consider a dedicated flag or clearer logic for these cases.
</issue_to_address>
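For illustration, the reviewer's point could be addressed with a named sentinel or a dedicated flag. A hypothetical sketch (constant and attribute names are not from the PR):

```python
# Hypothetical: name the sentinel instead of writing a bare 6 at the call site.
IMAGE_SKIP_SCORE = 6  # "not scored by the LLM because the sample contains images"


def mark_image_samples(qa_pairs) -> None:
    """Mark image-bearing QA pairs explicitly rather than relying on a magic number."""
    for qa in qa_pairs:
        if qa.images:
            qa.score = IMAGE_SKIP_SCORE
            qa.skipped_for_images = True  # hypothetical dedicated flag
```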
### Comment 2
<location> `weclone/core/PII/pii_detector.py:179` </location>
<code_context>
language=self.language,
entities=self.filtered_entities,
score_threshold=self.threshold,
- n_process=12,
- batch_size=16,
+ n_process=24,
+ batch_size=32,
)
</code_context>
<issue_to_address>
**suggestion (performance):** Increasing n_process and batch_size may impact resource usage.
Consider making n_process and batch_size configurable so users can adjust resource usage as needed.
Suggested implementation:
```python
language=self.language,
entities=self.filtered_entities,
score_threshold=self.threshold,
n_process=n_process,
batch_size=batch_size,
)
```
You will also need to:
1. Add `n_process=24` and `batch_size=32` as parameters to the function or method containing this code.
2. Update any calls to this function/method to pass these parameters if custom values are desired.
</issue_to_address>
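Concretely, the suggestion is to lift the two values into parameters with the current numbers as defaults. A sketch under stated assumptions: the `analyzer` callable stands in for whatever batch analyzer the detector wraps, and only the keyword arguments mirror the diff above.

```python
def batch_analyze(analyzer, texts, language, entities, score_threshold,
                  n_process: int = 24, batch_size: int = 32):
    """Sketch: the formerly hard-coded parallelism knobs become keyword arguments
    that callers (or a config file) can override."""
    return analyzer(
        texts,
        language=language,
        entities=entities,
        score_threshold=score_threshold,
        n_process=n_process,
        batch_size=batch_size,
    )
```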
### Comment 3
<location> `weclone/core/inference/offline_infer.py:62-66` </location>
<code_context>
def parse_guided_decoding_results(
results: List[RequestOutput] | List[ChatCompletion] | List, guided_decoding_class: type[BaseModel]
) -> tuple[List[Optional[BaseModel]], List[int]]:
"""Parse guided decoding results and return parsed results with failed indices.
Args:
results: Raw vLLM generation results
guided_decoding_class: Pydantic model class for validation
Returns:
tuple: (parsed_results, failed_indices) where failed_indices contains
indices of failed JSON parsing
"""
parsed_results = []
failed_indexs = []
for idx, result in enumerate(results):
try:
if isinstance(result, RequestOutput):
json_text = extract_json_from_text(result.outputs[0].text)
elif isinstance(result, ChatCompletion):
json_text = extract_json_from_text(result.choices[0].message.content)
else:
raise ValueError(f"Unsupported result type: {type(result)}")
parsed_result = guided_decoding_class.model_validate_json(json_text)
parsed_results.append(parsed_result)
except Exception as e:
if isinstance(result, RequestOutput):
log_text = result.outputs[0].text[:100] + "..."
elif isinstance(result, ChatCompletion):
log_text = result.choices[0].message.content[:100] + "..."
else:
log_text = str(result)[:100] + "..."
logger.warning(
f"Failed to parse JSON from result at sequence index {idx}: {log_text}, error: {e}"
)
failed_indexs.append(idx)
parsed_results.append(None)
return parsed_results, failed_indexs
</code_context>
<issue_to_address>
**issue (code-quality):** Use f-string instead of string concatenation [×3] ([`use-fstring-for-concatenation`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-fstring-for-concatenation/))
</issue_to_address>
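The rule's fix is simply to fold the slice and the ellipsis into one f-string; a tiny sketch of the pattern:

```python
def truncate_for_log(text: str, limit: int = 100) -> str:
    # One f-string instead of `text[:limit] + "..."` concatenation.
    return f"{text[:limit]}..."
```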
### Comment 4
<location> `weclone/core/inference/offline_infer.py:76` </location>
<code_context>
def vllm_infer(
inputs: List[str],
model_name_or_path: str,
adapter_name_or_path: Optional[str] = None,
dataset: str = "alpaca_en_demo",
dataset_dir: str = "data",
template: str = "default",
cutoff_len: int = 2048,
max_samples: Optional[int] = None,
vllm_config: str = "{}",
save_name: str = "generated_predictions.jsonl",
default_system: Optional[str] = None,
enable_thinking: bool = False,
temperature: float = 0.95,
top_p: float = 0.7,
top_k: int = 50,
guided_decoding_class: Optional[type[BaseModel]] = None,
bad_words: Optional[List[str]] = None,
logprobs: Optional[int] = None,
max_new_tokens: int = 1024,
repetition_penalty: float = 1.0,
skip_special_tokens: bool = True,
seed: Optional[int] = None,
pipeline_parallel_size: int = 1,
image_max_pixels: int = 768 * 768,
image_min_pixels: int = 32 * 32,
) -> tuple[List[RequestOutput] | List[Optional[BaseModel]], List[int]]:
r"""Perform batch generation using vLLM engine, which supports tensor parallelism.
Returns:
tuple: (results, failed_indices) where failed_indices contains indices of failed JSON parsing
"""
if pipeline_parallel_size > get_device_count():
raise ValueError("Pipeline parallel size should be smaller than the number of gpus.")
wc_vllm_args = cast(VllmArgs, load_config("vllm"))
model_args, data_args, _, generating_args = get_infer_args(
{
"model_name_or_path": model_name_or_path,
"adapter_name_or_path": adapter_name_or_path,
"dataset": dataset,
"dataset_dir": dataset_dir,
"template": template,
"cutoff_len": cutoff_len,
"max_samples": max_samples,
"preprocessing_num_workers": 16,
"vllm_config": vllm_config,
"temperature": temperature,
"top_p": top_p,
"top_k": top_k,
"max_new_tokens": max_new_tokens,
"repetition_penalty": repetition_penalty,
"enable_thinking": enable_thinking,
}
)
tokenizer_module = load_tokenizer(model_args)
tokenizer = tokenizer_module["tokenizer"]
template_obj = get_template_and_fix_tokenizer(tokenizer, data_args)
template_obj.mm_plugin.expand_mm_tokens = False # for vllm generate
if guided_decoding_class:
json_schema = guided_decoding_class.model_json_schema()
guided_decoding_params = GuidedDecodingParams(json=json_schema, disable_any_whitespace=True)
sampling_params = SamplingParams(
repetition_penalty=generating_args.repetition_penalty or 1.0,
temperature=generating_args.temperature,
top_p=generating_args.top_p or 1.0,
top_k=generating_args.top_k or -1,
stop_token_ids=template_obj.get_stop_token_ids(tokenizer),
max_tokens=generating_args.max_new_tokens,
skip_special_tokens=skip_special_tokens,
seed=seed,
bad_words=bad_words,
guided_decoding=guided_decoding_params if guided_decoding_class else None,
)
if model_args.adapter_name_or_path is not None:
lora_request = LoRARequest("default", 1, model_args.adapter_name_or_path[0])
else:
lora_request = None
engine_args = {
"model": model_args.model_name_or_path,
"trust_remote_code": True,
"dtype": model_args.infer_dtype,
"max_model_len": cutoff_len + max_new_tokens,
"disable_log_stats": True,
"enable_lora": model_args.adapter_name_or_path is not None,
"enable_prefix_caching": True,
"guided_decoding_backend": "guidance",
"guided_decoding_disable_any_whitespace": True,
}
if template_obj.mm_plugin.__class__.__name__ != "BasePlugin":
engine_args["limit_mm_per_prompt"] = {"image": 4, "video": 2, "audio": 2}
wc_vllm_dict = {k: v for k, v in wc_vllm_args.model_dump().items() if v is not None}
engine_args.update(wc_vllm_dict)
if isinstance(model_args.vllm_config, dict):
engine_args.update(model_args.vllm_config)
messages_list = [[{"role": "user", "content": text}] for text in inputs]
llm = LLM(**engine_args)
results = llm.chat(
messages_list,
sampling_params,
lora_request=lora_request,
chat_template_kwargs={"enable_thinking": enable_thinking},
) # type: ignore
del llm
torch.cuda.empty_cache()
if guided_decoding_class:
# TODO better json decode https://github.com/vllm-project/vllm/commit/1d0ae26c8544fd5a62e171e30c2dcc2973a23bc8#diff-3b27790a2ce97bc50cdd5476f7b0057da682ed0d1ec8426a7b76c5e21454e57d
parsed_results, failed_indexs = parse_guided_decoding_results(results, guided_decoding_class)
return parsed_results, failed_indexs
else:
return results, []
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Merge dictionary updates via the union operator [×2] ([`dict-assign-update-to-union`](https://docs.sourcery.ai/Reference/Default-Rules/suggestions/dict-assign-update-to-union/))
- Low code quality found in vllm\_infer - 23% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
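The dict-union rewrite the first bullet refers to looks roughly like the following; the values are placeholders, and only the operator usage is the point.

```python
engine_args = {"model": "some-model", "trust_remote_code": True}
wc_vllm_dict = {"dtype": "bfloat16"}
extra_vllm_config = {"max_model_len": 4096}

# Python 3.9+: the in-place union operator replaces successive .update(...) calls.
engine_args |= wc_vllm_dict
engine_args |= extra_vllm_config
```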
### Comment 5
<location> `weclone/data/clean/strategies.py:159` </location>
<code_context>
def judge(self, data: List[QaPair]) -> None:
config = self.make_dataset_config
logger.info("Starting online model scoring of data")
logger.info(f"Using model {config.model_name}")
client = OnlineLLM(
api_key=config.llm_api_key,
base_url=config.base_url,
model_name=config.model_name,
max_workers=config.clean_batch_size + 5,
)
inputs = []
prompt_template = PromptTemplate.from_template(CLEAN_PROMPT)
for qa in data:
if qa.images:
qa.score = 6
else:
messages_str = ""
for msg in qa.messages:
if msg.role == "user":
messages_str += f"Q: {msg.content}\n"
elif msg.role == "assistant":
messages_str += f"A: {msg.content}\n"
prompt_value = prompt_template.invoke({"id": qa.id, "messages": messages_str.strip()})
inputs.append(prompt_value.to_string())
prompt_template = PromptTemplate.from_template(CLEAN_PROMPT)
parsed_scores = []
clean_batch_size = config.clean_batch_size
for i in tqdm(range(0, len(inputs), clean_batch_size), desc="Online model scoring progress"):
batch = inputs[i : i + clean_batch_size]
try:
responses = client.chat_batch(batch, temperature=0)
for j, response in enumerate(responses):
if isinstance(response, Exception):
logger.error(f"Failed to get response for batch item {i + j}: {str(response)}")
continue
result_text = response.choices[0].message.content
if "</think>" in result_text:
result_text = result_text.split("</think>", 1)[1]
result_text = re.sub(r"^```json\s*|```$", "", result_text.strip(), flags=re.MULTILINE)
try:
item_res = json.loads(result_text)
except json.JSONDecodeError as e:
logger.error(
f"JSON parsing failed for batch item {i + j}: {e}\nContent: {result_text}"
)
continue
parsed_scores.append(QaPairScoreWithId(**item_res))
except Exception as e:
logger.error(
f"Failed to call online model or parse result for batch starting at index {i}, error: {str(e)}"
)
score_map = {score.id: score.score for score in parsed_scores}
for qa in data:
if qa.id in score_map:
qa.score = score_map[qa.id]
else:
logger.warning(f"No score obtained for QA ID {qa.id}, default assigned 0")
qa.score = 0
scores = [qa.score for qa in data if qa.score is not None]
score_series = pd.Series(scores)
score_counts = score_series.value_counts().sort_index()
score_percentages = score_series.value_counts(normalize=True).sort_index() * 100
pd.set_option("display.unicode.east_asian_width", True)
distribution_df = pd.DataFrame(
{
"Count": score_counts,
"Percentage(%)": score_percentages.round(2),
}
)
distribution_df.index.name = "Score"
printable_df_str = distribution_df.reset_index().to_string(index=False)
logger.success(f"Online model scoring distribution:\n{printable_df_str}")
</code_context>
<issue_to_address>
**issue (code-quality):** Low code quality found in OlineLLMCleaningStrategy.judge - 18% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
<br/><details><summary>Explanation</summary>The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>
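One plausible way to shrink `judge` is to pull the response-parsing steps into helpers. A hypothetical sketch that mirrors only the parsing logic quoted above; the helper names and boundaries are not from the PR.

```python
import json
import re


def extract_result_text(response) -> str:
    """Strip a trailing </think> block and Markdown json fences from a chat response."""
    text = response.choices[0].message.content
    if "</think>" in text:
        text = text.split("</think>", 1)[1]
    return re.sub(r"^```json\s*|```$", "", text.strip(), flags=re.MULTILINE)


def parse_score(response):
    """Return the parsed score payload, or None for failed calls and invalid JSON."""
    if isinstance(response, Exception):
        return None
    try:
        return json.loads(extract_result_text(response))
    except json.JSONDecodeError:
        return None
```

`judge` would then reduce to batching the prompts, calling `chat_batch`, and mapping `parse_score` over the responses.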
### Comment 6
<location> `weclone/data/qa_generator.py:560-561` </location>
<code_context>
def load_file(self, file_path) -> List[ChatMessage]:
"""
Perform overall first preprocessing, filter rows that don't meet conditions, check if images exist and change type to cut if not, add DataModality field
"""
folder_path = os.path.dirname(file_path)
folder_name = os.path.basename(folder_path)
if folder_name not in self.relations:
users_json_path = os.path.join(folder_path, "users.json")
if os.path.exists(users_json_path):
try:
with open(users_json_path, encoding="utf-8") as f:
users_data = json.load(f)
relation = users_data.get("relation", "")
if relation:
self.relations[folder_name] = relation
logger.debug(f"Loaded relation for {folder_name}: {relation}")
except (FileNotFoundError, json.JSONDecodeError) as e:
logger.warning(f"Failed to load users.json from {folder_path}: {e}")
df = pd.read_csv(
file_path,
encoding="utf-8",
dtype={"msg": str, "src": str},
escapechar=None,
keep_default_na=False,
)
df = df[~df["type_name"].isin(values=self.skip_type_list)]
if "is_forward" in df.columns:
df = df[~((df["is_sender"] == 1) & (df["is_forward"]))]
# Batch process text messages for PII detection and blocked words
text_indices = []
text_messages = []
for i in df.index:
if df.loc[i, "type_name"].lower() in ["文本", "text"]: # type: ignore
msg_str = str(df.loc[i, "msg"])
msg_str = msg_str.replace("\n", "")
text_indices.append(i)
text_messages.append(msg_str)
# TODO Deleting directly by batch_has_pii returning true/false.
indices_to_drop = []
if text_messages:
pii_results = self.pii_detector.batch_has_pii(text_messages)
for idx, (df_index, msg_str, has_pii) in enumerate(zip(text_indices, text_messages, pii_results)):
if has_pii:
indices_to_drop.append(df_index)
continue
# Check blocked words
for blocked_word in self.blocked_words:
if blocked_word in msg_str:
indices_to_drop.append(df_index)
break
df = df.drop(index=indices_to_drop)
# Process other message types
for i in df.index:
if df.loc[i, "type_name"].lower() in ["文本", "text"]:
continue
if df.loc[i, "src"].lower().endswith(".gif"):
df.loc[i, "src"] = ""
df.loc[i, "type_name"] = "动画表情" if self.c.platform == PlatformType.CHAT else "sticker"
continue
if df.loc[i, "type_name"].lower() in ["图片", "image"]: # type: ignore
if self.c.platform in [PlatformType.CHAT, PlatformType.TELEGRAM]:
result = check_image_file_exists(str(df.loc[i, "src"]))
if isinstance(result, str) and df.loc[i, "is_sender"] == 0:
df.loc[i, "src"] = result
df.loc[i, "msg"] = "<image>"
df.loc[i, "modality"] = DataModality.IMAGE
else:
df.loc[i, "type_name"] = "Cut"
elif df.loc[i, "type_name"] in ["sticker", "动画表情"]:
if self.c.platform in [PlatformType.CHAT, PlatformType.TELEGRAM]:
df.loc[i, "src"] = ""
continue
else:
df.loc[i, "msg"] = ""
df = df.dropna(how="all")
# Time format: 2021-07-07 10:27:23
df["CreateTime"] = pd.to_datetime(df["CreateTime"])
return [ChatMessage(**row) for row in df.to_dict("records")] # type: ignore
</code_context>
<issue_to_address>
**suggestion (code-quality):** We've found these issues:
- Use named expression to simplify assignment and conditional ([`use-named-expression`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-named-expression/))
- Low code quality found in DataProcessor.load\_file - 15% ([`low-code-quality`](https://docs.sourcery.ai/Reference/Default-Rules/comments/low-code-quality/))
```suggestion
if relation := users_data.get("relation", ""):
```
<br/><details><summary>Explanation</summary>
The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.</details>
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| """Strategy for data cleaning using large language models""" | ||
|
|
||
| # TODO: images clean support | ||
| def judge(self, data: List[QaPair]) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (code-quality): OlineLLMCleaningStrategy.judge 中发现代码质量低下 - 18% (low-code-quality)
解释
此函数的质量得分低于 25% 的质量阈值。此得分是方法长度、认知复杂度和工作内存的组合。
您如何解决这个问题?
重构此函数以使其更短、更易读可能值得。
- 通过将部分功能提取到它们自己的函数中来减少函数长度。这是您可以做的最重要的事情——理想情况下,一个函数应该少于 10 行。
- 减少嵌套,例如通过引入守卫子句来提前返回。
- 确保变量的作用域紧密,以便使用相关概念的代码在函数内紧密地放在一起,而不是分散开来。
Original comment in English
issue (code-quality): Low code quality found in OlineLLMCleaningStrategy.judge - 18% (low-code-quality)
Explanation
The quality score for this function is below the quality threshold of 25%.This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into
their own functions. This is the most important thing you can do - ideally a
function should be less than 10 lines. - Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts
sits together within the function rather than being scattered.
This PR is being reviewed by Cursor Bugbot
Bug: Incorrect score override for image QA pairs
In OlineLLMCleaningStrategy.judge(), QA pairs with images are assigned score=6 at line 175, but then at lines 214-219, the code iterates through ALL QA pairs and assigns score=0 if the ID is not in score_map. Since QA pairs with images are not added to the inputs list, they won't be in score_map, causing their score to be overwritten from 6 to 0. The fix should check if the QA pair already has a non-zero score before assigning 0, or skip QA pairs with images in the final loop.
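A minimal sketch of the fix Bugbot suggests: do not overwrite the image sentinel when filling default scores. The function name is hypothetical, the variable names mirror the quoted code, and the sentinel value is whatever the cleaning strategy settles on.

```python
def apply_scores(data, score_map, logger, image_sentinel: int = 6) -> None:
    """Fill scores from the LLM results without clobbering image-skipped QA pairs."""
    for qa in data:
        if qa.images:
            qa.score = image_sentinel  # keep the value assigned earlier for image samples
            continue
        if qa.id in score_map:
            qa.score = score_map[qa.id]
        else:
            logger.warning(f"No score obtained for QA ID {qa.id}, default assigned 0")
            qa.score = 0
```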
Pull Request Overview
This PR introduces concurrent batch processing for online LLM inference with JSON schema validation, enriches QA pairs with optional chat member relationships from users.json, and updates the project to version 0.3.03.
- Adds threaded async/batch chat methods to OnlineLLM for parallel processing
- Introduces unified JSON parsing via parse_guided_decoding_results supporting both vLLM and OpenAI responses
- Extends QA generation to load and inject chat member relationships into system prompts
Reviewed Changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| weclone/utils/config_models.py | Adds add_relation flag to control relationship metadata injection |
| weclone/prompts/clean_data.py | Comments out batch prompt template, retains per-item scoring format |
| weclone/data/qa_generator.py | Loads users.json per folder, injects relation/time into system prompts, tracks talker through QA matching |
| weclone/data/models.py | Adds "用户上传的GIF表情" and "sticker2" to skip types |
| weclone/data/clean/strategies.py | Refactors to use chat_batch with per-QA prompts, skips image-based samples |
| weclone/core/inference/online_infer.py | Adds chat_async, chat_batch, thread pool executor, and context manager support |
| weclone/core/inference/offline_infer.py | Extracts shared JSON parsing into parse_guided_decoding_results function |
| weclone/core/PII/pii_detector.py | Increases batch analyzer parallelism (24 processes, batch size 32) |
| tests/tests_data/test_person/test_0_730.csv | Normalizes talker/room fields to test_person |
| tests/test_full_pipe.py | Removes CSV extension filter to copy all test files |
| tests/configs/qwen3.jsonc | Updates model to Qwen2.5-0.5B, enables add_relation/add_time |
| tests/configs/Qwen2.5-VL.jsonc | Enables add_relation and add_time flags |
| settings.template.jsonc | Adds add_relation config, increases clean_batch_size to 50 |
| pyproject.toml | Bumps version to 0.3.03, updates changelog |
| WC-exp | Updates submodule commit |
| .gitignore | Adds exclusions for models_final, /data, /llamaboard_cache |
    system_content = self.system_prompt
    if self.c.add_time:
        system_content += f" Current datetime: {time_stamp.strftime('%m-%d %H:%M:%S')}"
        system_content += f"\n 现在是{time_stamp.strftime('%m-%d %H:%M')}"

Copilot AI · Nov 1, 2025
[nitpick] Remove leading whitespace after '\n' in formatted string for consistent formatting.
Suggested change:
-        system_content += f"\n 现在是{time_stamp.strftime('%m-%d %H:%M')}"
+        system_content += f"\n现在是{time_stamp.strftime('%m-%d %H:%M')}"
    if self.c.add_relation and talker:
        relation = self.relations.get(talker, "")
        if relation:
            system_content += f"\n 对方是你的{relation},你们正在聊天"

Copilot AI · Nov 1, 2025
[nitpick] Remove leading whitespace after '\n' in formatted string for consistent formatting.
Suggested change:
-            system_content += f"\n 对方是你的{relation},你们正在聊天"
+            system_content += f"\n对方是你的{relation},你们正在聊天"
    for item_name in os.listdir(test_data_source_dir):
        source_item_path = os.path.join(test_data_source_dir, item_name)
-       if os.path.isfile(source_item_path) and item_name.lower().endswith('.csv'):
+       if os.path.isfile(source_item_path) :

Copilot AI · Nov 1, 2025
Remove trailing whitespace before colon.
Suggested change:
-        if os.path.isfile(source_item_path) :
+        if os.path.isfile(source_item_path):
    if qa.images:
        qa.score = 6

Copilot AI · Nov 1, 2025
The magic number 6 for image-based sample scores should be documented or defined as a named constant to clarify its meaning (e.g., SKIP_IMAGE_SCORE = 6).
Summary by Sourcery
Implement concurrent and batch online inference for LLMs with JSON-guided decoding, integrate into the data cleaning workflow, enrich QA generation with chat member relationships, and bump version to 0.3.03.
New Features:
- Add asynchronous and batch chat interfaces (chat_async, chat_batch) to OnlineLLM
- Add a shared guided-decoding parser (parse_guided_decoding_results) for Pydantic validation
- Load chat member relationships from users.json and include them in the system prompt
Enhancements:
- Refactor online data cleaning to score via chat_batch, skip image-based samples, and adopt the unified JSON parsing
- Rename load_csv to load_file and adjust the file-loading logic in the QA generator
Build:
- Bump the project version to 0.3.03
Tests:
- Update test_full_pipe to copy all files from the test data source, removing the CSV extension filter
Note
Add concurrent/batch online LLM inference with unified JSON-guided parsing, integrate into cleaning, enrich QA with chat-member relations, and bump version/config to 0.3.03.
chat_async/chat_batchwith JSON responses; optional guided decoding via Pydantic; shared parserparse_guided_decoding_results(also used by offline vLLM).ChatCompletion.CLEAN_PROMPT; refactor online cleaning to batch-score viachat_batch, skip image samples, and unify JSON parsing.users.jsonper chat folder to capturerelation; whenadd_relationis enabled, append relation and formatted time tosystemprompt; track talker through QA matching.load_csvtoload_file; pipeline adjusted accordingly.用户上传的GIF表情,sticker2).n_process24,batch_size32.add_relationflag; increaseclean_batch_sizedefault to 50; update example/test configs; version/config_version →0.3.03with changelog.test_full_pipe: copy all files from test data source (no CSV-only filter); update test fixtures..gitignoreand submodule pointer.Written by Cursor Bugbot for commit 8ee9d05. This will update automatically on new commits. Configure here.