
Missing config.json and preprocessor_config.json in kyutai/moshiko-pytorch-bf16 model repo #41935

@akshatvishu

Description

System Info

transformers version: 4.57.1
python version: 3.11

Who can help?

@Cyrilvallez @eustlb

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I'm opening this issue to request that config.json and preprocessor_config.json be added to the kyutai/moshiko-pytorch-bf16 model repository.

Problem:
Currently, AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16"), taken from the example under the "1. Model generation" heading of the model doc page at huggingface.co/docs/transformers/en/model_doc/moshi, fails with an OSError because preprocessor_config.json is missing. This is inconsistent with other repos in the collection, such as kyutai/moshiko-pytorch-q8 and kmhf/hf-moshiko, which do contain these configuration files.

from datasets import load_dataset, Audio
import torch, math
from transformers import MoshiForConditionalGeneration, AutoFeatureExtractor, AutoTokenizer


librispeech_dummy = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/moshiko-pytorch-bf16")
tokenizer = AutoTokenizer.from_pretrained("kyutai/moshiko-pytorch-bf16")
device = "cuda"
dtype = torch.bfloat16

# load the model (needed for generate() below; note that in practice the script already
# fails at the AutoFeatureExtractor call above, so this line is never reached)
model = MoshiForConditionalGeneration.from_pretrained("kyutai/moshiko-pytorch-bf16", torch_dtype=dtype, device_map=device)

# prepare user input audio 
librispeech_dummy = librispeech_dummy.cast_column("audio", Audio(sampling_rate=feature_extractor.sampling_rate))
audio_sample = librispeech_dummy[-1]["audio"]["array"]
user_input_values = feature_extractor(raw_audio=audio_sample, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt").to(device=device, dtype=dtype)

# prepare moshi input values - we suppose moshi didn't say anything while the user spoke
moshi_input_values = torch.zeros_like(user_input_values.input_values)

# prepare moshi input ids - we suppose moshi didn't say anything while the user spoke
# ratio of audio codebook frames per waveform sample (Mimi encodes 24 kHz audio at 12.5 frames/s);
# defined here so the snippet is self-contained - the attribute path is assumed
waveform_to_token_ratio = model.config.audio_encoder_config.frame_rate / feature_extractor.sampling_rate
num_tokens = math.ceil(moshi_input_values.shape[-1] * waveform_to_token_ratio)
input_ids = torch.ones((1, num_tokens), device=device, dtype=torch.int64) * tokenizer.encode("<pad>")[0]

# generate 25 new tokens (around 2s of audio)
output = model.generate(input_ids=input_ids, user_input_values=user_input_values.input_values, moshi_input_values=moshi_input_values, max_new_tokens=25)

text_tokens = output.sequences
audio_waveforms = output.audio_sequences

Error:

OSError: kyutai/moshiko-pytorch-bf16 does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/kyutai/moshiko-pytorch-bf16/tree/main' for available files.

Confirmation from Source Repository:
This has been confirmed by the model's authors as an issue for the Transformers port to handle (see kyutai-labs/moshi#234).

Expected behavior

Proposed Solution:
Adding the missing configuration files will resolve this. The content can be derived from the existing q8 variant.
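
In the meantime, a possible interim workaround (my own sketch, not verified end to end) is to load the preprocessing classes from a sibling repo that already ships these files, such as kmhf/hf-moshiko:

from transformers import AutoFeatureExtractor, AutoTokenizer

# workaround sketch: take the feature extractor and tokenizer from a repo that does
# contain preprocessor_config.json; the bf16 weights themselves are unaffected
feature_extractor = AutoFeatureExtractor.from_pretrained("kmhf/hf-moshiko")
tokenizer = AutoTokenizer.from_pretrained("kmhf/hf-moshiko")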

Proposed preprocessor_config.json:
(Copied from kmhf/hf-moshiko)

{
  "feature_extractor_type": "EncodecFeatureExtractor",
  "sampling_rate": 24000,
  "feature_size": 1,
  "padding_side": "right",
  "padding_value": 0.0,
  "return_attention_mask": true,
  "chunk_length_s": null,
  "overlap": null
}
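
As a quick local sanity check (a sketch, assuming the JSON above is saved as preprocessor_config.json inside a hypothetical local directory ./moshiko-bf16-configs), the feature extractor should load from it:

from transformers import AutoFeatureExtractor

# ./moshiko-bf16-configs/ is a hypothetical local directory holding the proposed file
feature_extractor = AutoFeatureExtractor.from_pretrained("./moshiko-bf16-configs")
print(type(feature_extractor).__name__, feature_extractor.sampling_rate)  # EncodecFeatureExtractor 24000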

Proposed config.json:
(Based on kyutai/moshiko-pytorch-q8 and kyutai/moshiko-pytorch-bf16)

{
    "moshi_name": "model.safetensors",
    "mimi_name": "tokenizer-e351c8d8-checkpoint125.safetensors",
    "tokenizer_name": "tokenizer_spm_32k_3.model",
    "quantize": false,
    "dim": 4096,
    "text_card": 32000,
    "existing_text_padding_id": 3,
    "n_q": 16,
    "dep_q": 8,
    "card": 2048,
    "num_heads": 32,
    "num_layers": 32,
    "hidden_scale": 4.125,
    "causal": true,
    "layer_scale": null,
    "context": 3000,
    "max_period": 10000,
    "gating": "silu",
    "norm": "rms_norm_f32",
    "positional_embedding": "rope",
    "depformer_dim": 1024,
    "depformer_dim_feedforward": 4224,
    "depformer_num_heads": 16,
    "depformer_num_layers": 6,
    "depformer_causal": true,
    "depformer_layer_scale": null,
    "depformer_multi_linear": true,
    "depformer_context": 8,
    "depformer_max_period": 10000,
    "depformer_gating": "silu",
    "depformer_pos_emb": "none",
    "depformer_weights_per_step": true,
    "delays": [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
}
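
For completeness, a sketch of how the two files could be pushed to the repo once the contents are agreed on (this assumes write access to kyutai/moshiko-pytorch-bf16 and local copies of the JSON above):

from huggingface_hub import upload_file

# hypothetical local copies of the proposed files; requires write access to the repo
for fname in ("preprocessor_config.json", "config.json"):
    upload_file(
        path_or_fileobj=fname,
        path_in_repo=fname,
        repo_id="kyutai/moshiko-pytorch-bf16",
        commit_message=f"Add missing {fname}",
    )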
