Description
System Info
transformers version: 4.57.1
python version: 3.11
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I'm opening this issue to request that config.json and preprocessor_config.json be added to the kyutai/moshiko-pytorch-bf16 model repository.
Problem:
Currently, the call AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16"), taken from the model doc page at huggingface.co/docs/transformers/en/model_doc/moshi (under the heading "1. Model generation"), fails with an OSError because preprocessor_config.json is missing from the repository. This is inconsistent with other repos in the collection, such as kyutai/moshiko-pytorch-q8 and kmhf/hf-moshiko, which do contain these configuration files.
from datasets import load_dataset, Audio
import torch, math
from transformers import MoshiForConditionalGeneration, AutoFeatureExtractor, AutoTokenizer
librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
# this call raises the OSError shown below: the repo has no preprocessor_config.json
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16")
tokenizer = AutoTokenizer.from_pretrained("kyutai/moshiko-pytorch-bf16")
device = "cuda"
dtype = torch.bfloat16
# prepare user input audio
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio_sample = librispeech_dummy[-1]["audio"]["array"]
user_input_values = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(device=device, dtype=dtype)
# prepare moshi input values - we suppose moshi didn't say anything while the user spoke
moshi_input_values = torch.zeros_like(user_input_values.input_values)
# prepare moshi input ids - we suppose moshi didn't say anything while the user spoke
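# note: `waveform_to_token_ratio` and `model` are not defined in this snippet as copied; the failure occurs before these lines run
num_tokens = math.ceil(moshi_input_values.shape[-1] * waveform_to_token_ratio)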
input_ids = torch.ones((1, num_tokens), device=device, dtype=torch.int64) * tokenizer.encode("<pad>")[0]
# generate 25 new tokens (around 2s of audio)
output = model.generate(input_ids=input_ids, user_input_values=user_input_values.input_values, moshi_input_values=moshi_input_values, max_new_tokens=25)
text_tokens = output.sequences
audio_waveforms = output.audio_sequences
Error:
OSError: kyutai/moshiko-pytorch-bf16 does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/kyutai/moshiko-pytorch-bf16/tree/main' for available files.
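The missing files can also be confirmed without loading any model, by comparing a repository's file listing against the files requested in this issue. The following is a minimal sketch, not the check Transformers itself performs: `REQUIRED_FILES`, `missing_config_files`, and `list_remote_files` are hypothetical names, and `example_listing` is an illustrative listing built from file names mentioned in this issue rather than the verified contents of the repo. Fetching the real listing is assumed to go through `huggingface_hub.list_repo_files`, which needs network access.

```python
# Files requested in this issue; treating them as "required" is an assumption.
REQUIRED_FILES = ("config.json", "preprocessor_config.json")


def missing_config_files(repo_files):
    """Return the required config files that are absent from a file listing."""
    present = set(repo_files)
    return [name for name in REQUIRED_FILES if name not in present]


def list_remote_files(repo_id):
    """Fetch the file listing from the Hub (needs network and huggingface_hub)."""
    from huggingface_hub import list_repo_files  # imported lazily, optional dependency

    return list_repo_files(repo_id)


# Hypothetical listing for illustration only; the real repo may differ.
example_listing = [
    "README.md",
    "model.safetensors",
    "tokenizer-e351c8d8-checkpoint125.safetensors",
    "tokenizer_spm_32k_3.model",
]
missing = missing_config_files(example_listing)
print(missing)
```

To run the check against the live repository, replace `example_listing` with `list_remote_files("kyutai/moshiko-pytorch-bf16")`.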
Confirmation from Source Repository:
The model's authors have confirmed that this is an issue for the Transformers port to handle (see kyutai-labs/moshi#234).
Expected behavior
Proposed Solution:
Adding the missing configuration files to kyutai/moshiko-pytorch-bf16 will resolve this. Their content can be derived from the existing kmhf/hf-moshiko and kyutai/moshiko-pytorch-q8 repositories.
Proposed preprocessor_config.json:
(Copied from kmhf/hf-moshiko)
{
  "feature_extractor_type": "EncodecFeatureExtractor",
  "sampling_rate": 24000,
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "chunk_length_s": null,
  "overlap": null
}
Proposed config.json:
(Based on kyutai/moshiko-pytorch-q8 and kyutai/moshiko-pytorch-bf16/)
{
  "moshi_name": "model.safetensors",
  "mimi_name": "tokenizer-e351c8d8-checkpoint125.safetensors",
  "tokenizer_name": "tokenizer_spm_32k_3.model",
  "quantize": false,
  "dim": 4096,
  "text_card": 32000,
  "existing_text_padding_id": 3,
  "n_q": 16,
  "dep_q": 8,
  "card": 2048,
  "num_heads": 32,
  "num_layers": 32,
  "hidden_scale": 4.125,
  "causal": true,
  "layer_scale": null,
  "context": 3000,
  "max_period": 10000,
  "gating": "silu",
  "norm": "rms_norm_f32",
  "positional_embedding": "rope",
  "depformer_dim": 1024,
  "depformer_dim_feedforward": 4224,
  "depformer_num_heads": 16,
  "depformer_num_layers": 6,
  "depformer_causal": true,
  "depformer_layer_scale": null,
  "depformer_multi_linear": true,
  "depformer_context": 8,
  "depformer_max_period": 10000,
  "depformer_gating": "silu",
  "depformer_pos_emb": "none",
  "depformer_weights_per_step": true,
  "delays": [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
}
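Before uploading, the proposed values can be sanity-checked with simple arithmetic. The sketch below copies a subset of the fields from the proposed config.json above and checks that they are mutually consistent. Two of the checks rest on assumptions that this issue does not state: that the depformer feed-forward width equals `hidden_scale * depformer_dim`, and that `delays` holds one entry for the text stream plus one per audio codebook (`n_q`). If either assumption is wrong for this model, the corresponding assertion should be dropped.

```python
import json

# Subset of the proposed config.json; only the fields used by the checks.
proposed = json.loads("""
{
  "dim": 4096,
  "num_heads": 32,
  "hidden_scale": 4.125,
  "n_q": 16,
  "dep_q": 8,
  "depformer_dim": 1024,
  "depformer_dim_feedforward": 4224,
  "depformer_num_heads": 16,
  "delays": [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
}
""")

# Model widths must divide evenly across attention heads.
assert proposed["dim"] % proposed["num_heads"] == 0
assert proposed["depformer_dim"] % proposed["depformer_num_heads"] == 0
head_dim = proposed["dim"] // proposed["num_heads"]
depformer_head_dim = proposed["depformer_dim"] // proposed["depformer_num_heads"]

# Assumption: feed-forward width = hidden_scale * model width.
expected_ff = proposed["hidden_scale"] * proposed["depformer_dim"]
assert expected_ff == proposed["depformer_dim_feedforward"]

# Assumption: one delay for the text stream plus one per audio codebook.
assert len(proposed["delays"]) == 1 + proposed["n_q"]

# The depformer cannot predict more codebooks than the model has.
assert proposed["dep_q"] <= proposed["n_q"]

print("head_dim:", head_dim, "depformer_head_dim:", depformer_head_dim)
```

These checks only catch copy-and-paste mistakes between the q8 and bf16 variants; they do not verify that the values match the released weights.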