Summary
A critical performance vulnerability has been identified in the input preprocessing logic of the multimodal tokenizer. The code dynamically replaces placeholder tokens (e.g., <|audio_|>, <|image_|>) with repeated tokens based on precomputed lengths. Due to inefficient list concatenation operations, the algorithm exhibits quadratic time complexity (O(n²)), allowing malicious actors to trigger resource exhaustion via specially crafted inputs.
Details
Affected Component: the input_processor_for_phi4mm function in vllm/vllm/model_executor/models/phi4mm.py, lines 1182 to 1197 at commit 8cac35b.
while i < len(input_ids):
    token_id = input_ids[i]
    if token_id == _AUDIO_PLACEHOLDER_TOKEN_ID:
        token_count = next(audio_embed_size_iter)
        audio_cnt += 1
    elif token_id == _IMAGE_PLACEHOLDER_TOKEN_ID:
        token_count = next(image_token_count_iter)
        img_cnt += 1
    else:
        user_text_input_cnt += 1 if token_id not in \
            NON_USER_INPUT_TOKENS else 0
        i += 1
        continue
    tokens = [token_id] * token_count
    input_ids = input_ids[:i] + tokens + input_ids[i + 1:]
    i += token_count
The code rebuilds the input_ids list on every replacement via input_ids = input_ids[:i] + tokens + input_ids[i + 1:]. Each such concatenation copies the entire list, costing O(n) per replacement. For k placeholders, the total work is on the order of k full-list copies, i.e., O(k·n); since an attacker can make k proportional to the input length n, the worst case is O(n²).
PoC
Test data demonstrates quadratic time growth:
test_cases = [100, 200, 400, 800, 1600, 3200, 6400]
run_times = [0.002, 0.007, 0.028, 0.136, 0.616, 2.707, 11.854] # seconds
Doubling input size increases runtime by ~4x (consistent with O(n²)).
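The timings above can be reproduced with a standalone harness along the following lines. This is a minimal sketch: PLACEHOLDER, EXPANSION, and the all-placeholder input are illustrative assumptions, not values taken from phi4mm.py.

import time

PLACEHOLDER = -1  # stand-in for _AUDIO_PLACEHOLDER_TOKEN_ID (assumed)
EXPANSION = 8     # stand-in for a precomputed expansion length (assumed)

def expand_quadratic(input_ids):
    # Mirrors the vulnerable pattern: rebuild the whole list on every replacement.
    i = 0
    while i < len(input_ids):
        if input_ids[i] == PLACEHOLDER:
            input_ids = input_ids[:i] + [PLACEHOLDER] * EXPANSION + input_ids[i + 1:]
            i += EXPANSION
        else:
            i += 1
    return input_ids

for n in [100, 200, 400, 800, 1600, 3200, 6400]:
    ids = [PLACEHOLDER] * n  # worst case: every token is a placeholder
    start = time.perf_counter()
    expand_quadratic(ids)
    print(f"n={n}: {time.perf_counter() - start:.3f}s")

On a typical machine, doubling n roughly quadruples the measured time, matching the table above.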
Impact
Denial-of-Service (DoS): An attacker could submit inputs with many placeholders (e.g., 10,000 <|audio_1|> tokens), causing CPU/memory exhaustion.
Example: 10,000 placeholders, each forcing a full copy of a list at least 10,000 elements long, amounts to roughly 100 million element copies.
Remediation Recommendations
Precompute all placeholder positions and expansion lengths upfront.
Replace repeated list concatenation with a single output list built in one pass, as in the sketch below.
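The following single-pass rewrite restores linear time. It reuses the identifiers from the affected loop for illustration; the counter updates (audio_cnt, img_cnt, user_text_input_cnt) are omitted for brevity and would need to be carried over in a real patch.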
# O(n) rewrite: each token is emitted exactly once
# (counter updates from the original loop omitted for brevity)
new_input_ids = []
for token_id in input_ids:
    if token_id == _AUDIO_PLACEHOLDER_TOKEN_ID:
        new_input_ids.extend([token_id] * next(audio_embed_size_iter))
    elif token_id == _IMAGE_PLACEHOLDER_TOKEN_ID:
        new_input_ids.extend([token_id] * next(image_token_count_iter))
    else:
        new_input_ids.append(token_id)
input_ids = new_input_ids
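Because list.append and list.extend are amortized O(1) per element, the rewrite's runtime is proportional to the length of the expanded output, so doubling the input roughly doubles the runtime instead of quadrupling it.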