This is the code for the multimedia mining search and retrieval (MMSR) WS25 course at the Johannes Kepler University Linz.
The code aims at developing and training a neural network (NN) architecture that, given two tracks, classifies them as "sharing at least one genre" (1, positive output) or "not sharing any music genre" (0, negative output). The task is therefore formulated as binary classification.
You will use this code to train the algorithm on a set of labeled track pairs.
The dataset creation is shown in the notebook /data/create_dataset.ipynb. As you can see, we started from an initial set of tracks and built pairs labeled as positive (the two tracks share at least one genre) or negative (they share none).
We then shuffled the resulting pairs and selected 80% of them as training set, 10% as validation set, and 10% as test set.
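For illustration, here is a minimal sketch of this pairing and splitting procedure (not the notebook's actual code; the track IDs and genre sets below are synthetic placeholders):

```python
import random
import pandas as pd
from sklearn.model_selection import train_test_split

GENRES = ["rock", "pop", "jazz", "metal", "soul"]

def label_pair(genres_a: set, genres_b: set) -> int:
    """1 if the two tracks share at least one genre, 0 otherwise."""
    return int(bool(genres_a & genres_b))

# hypothetical toy data: 100 random pairs of genre sets standing in for real tracks
random.seed(1)
rows = []
for k in range(100):
    genres_a = set(random.sample(GENRES, 2))
    genres_b = set(random.sample(GENRES, 2))
    rows.append({"id_a": f"a{k}", "id_b": f"b{k}", "label": label_pair(genres_a, genres_b)})
pairs = pd.DataFrame(rows)

# shuffle and split into 80% training, 10% validation, 10% test
train, rest = train_test_split(pairs, test_size=0.2, shuffle=True, random_state=1)
val, test = train_test_split(rest, test_size=0.5, shuffle=True, random_state=1)
```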
The data for training, evaluating, and testing the NN is shared with you in the folder binary_classification available here.
Notice that the files to be used for the actual retrieval task (i.e., the retrieval set) are available here.
The code for the actual NN models is in ssnet_fop/binary_classification_model.py. All models rely on the SingleBranchGeneral class,
which is a subclass of PyTorch's nn.Module. All models have a forward method that, given the feature vectors of
two tracks:
- Processes each of them with an embedding module (see below for description)
- Computes the cosine similarity of the resulting embeddings
- Returns the cosine similarity as logits to compute the probability that the two tracks share at least one genre.
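A simplified sketch of this forward pass is shown below (illustrative only, not the actual SingleBranchGeneral code; the embedding modules are stubbed with plain linear layers):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairClassifierSketch(nn.Module):
    """Hypothetical stand-in for SingleBranchGeneral."""

    def __init__(self, dim_i: int, dim_j: int, dim_embed: int = 128):
        super().__init__()
        # stand-ins for the embedding modules described below
        self.embed_i = nn.Linear(dim_i, dim_embed)
        self.embed_j = nn.Linear(dim_j, dim_embed)

    def forward(self, feat_i: torch.Tensor, feat_j: torch.Tensor) -> torch.Tensor:
        # 1. process each feature vector with its embedding module
        emb_i = self.embed_i(feat_i)
        emb_j = self.embed_j(feat_j)
        # 2. compute the cosine similarity of the resulting embeddings
        sim = F.cosine_similarity(emb_i, emb_j, dim=-1)
        # 3. return the similarity as logits; applying a sigmoid yields the
        #    probability that the two tracks share at least one genre
        return sim
```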
The feature embedding modules are multi-layer perceptrons (MLPs) with fully connected linear layers, each followed by batch normalization,
a ReLU activation function, and dropout regularization. All embedding modules are child classes of EmbedBranchGeneral.
In all embedding modules, the same final layer self.fc_shared is shared between the two input modalities.
We currently support the following options:
- EmbedBranchDownproject: Before being passed as input to self.fc_shared, feature 1 and feature 2 are passed to a modality-specific layer, self.fc_i and self.fc_j respectively. This means that the first layer is not shared between the two modalities.
- EmbedBranchPadding: Before being passed as input to self.fc_shared, feature 1 and feature 2 are passed to the same layer, shared between modalities. Since the modalities might have different dimensionality, the algorithm pads the lower-dimensional modality with zeros to ensure that both modalities are compatible with the input dimension of self.fc_shared. In this case, one of self.fc_i and self.fc_j is the identity operator, while the other is the padding operator.
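As an illustration of this structure, here is a minimal sketch of one fully connected block and of the two merging strategies (not the code of EmbedBranchGeneral or its subclasses; the feature dimensionalities and dropout probability are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fc_block(in_dim: int, out_dim: int, dropout: float = 0.1) -> nn.Sequential:
    """One MLP block: fully connected layer, batch normalization, ReLU, dropout."""
    return nn.Sequential(
        nn.Linear(in_dim, out_dim),
        nn.BatchNorm1d(out_dim),
        nn.ReLU(),
        nn.Dropout(dropout),
    )

# downproject-style branch (sketch): modality-specific first layers, shared final layer
fc_i = fc_block(in_dim=500, out_dim=256)        # hypothetical dimensionality of feature i
fc_j = fc_block(in_dim=300, out_dim=256)        # hypothetical dimensionality of feature j
fc_shared = fc_block(in_dim=256, out_dim=128)   # shared between the two modalities

# padding-style branch (sketch): zero-pad the lower-dimensional modality so that
# both modalities match the input dimension of the shared layer
feat_j = torch.randn(8, 300)                    # a batch of 8 feature-j vectors
feat_j_padded = F.pad(feat_j, (0, 500 - 300))   # now compatible with a 500-dim input
```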
Models are trained with binary cross entropy loss. The code for training the NN is in ssnet_fop/main.py. You can run it as follows:
```
cd ssnet_fop
python main.py
```

All arguments of this script are optional, but be aware of the default values.
```
usage: main.py [-h] [--seed S] [--cuda] [--save_dir SAVE_DIR] [--lr LR] [--batch_size BATCH_SIZE] [--max_num_epoch MAX_NUM_EPOCH] [--intermediate_emb INTERMEDIATE_EMB] [--dim_embed DIM_EMBED] [--feature_i FEATURE_I] [--feature_j FEATURE_J]
               [--merging_technique MERGING_TECHNIQUE]

options:
  -h, --help            show this help message and exit
  --seed S              Random Seed. Default 1
  --device              Device to be used, either cuda (if a GPU is available) or cpu. Default cuda
  --save_dir SAVE_DIR   Directory for saving checkpoints. Default model
  --lr LR               learning rate. Default: 1e-5
  --batch_size BATCH_SIZE
                        Batch size for training. Default 128
  --max_num_epoch MAX_NUM_EPOCH
                        Max number of epochs to train. Default 500
  --intermediate_emb INTERMEDIATE_EMB
                        Intermediate Layer. Default 256
  --dim_embed DIM_EMBED
                        Embedding Size. Default 128
  --feature_i FEATURE_I
                        feature i (first modality). Default mfcc_bow
  --feature_j FEATURE_J
                        feature j (second modality). Default mfcc_bow
  --merging_technique MERGING_TECHNIQUE
                        whether to downproject or pad if there is a dimension mismatch. Default downproject
```
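Since the model returns cosine similarities as logits, a single training step conceptually boils down to the following sketch (illustrative only; main.py additionally handles batching, the optimizer, checkpointing, and evaluation):

```python
import torch

criterion = torch.nn.BCEWithLogitsLoss()        # binary cross entropy on raw logits

# hypothetical tensors standing in for one batch of 128 pairs
logits = torch.randn(128, requires_grad=True)   # model output: one cosine similarity per pair
labels = torch.randint(0, 2, (128,)).float()    # 1 = shared genre, 0 = no shared genre

loss = criterion(logits, labels)
loss.backward()                                 # followed by an optimizer step in practice
```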
The repository is structured as follows:

```
.
├── README.md
├── data
│   ├── todo folder
│   ├── todo folder
│   ├── create_dataset.ipynb
│   └── inspect.ipynb
├── ssnet_fop
│   ├── best_fc2_mfcc_bow_mfcc_bow_model
│   │   └── checkpoint.pth.tar
│   ├── fc2_mfcc_bow_mfcc_bow_model
│   ├── main.py
│   ├── online_evaluation.py
│   ├── output
│   ├── retrieval_model.py
```

For training, evaluating, and testing the model with our scripts, the dataset should be stored in the data folder.
The data for training, evaluating, and testing the NN is shared with you in the folder binary_classification available here.
The pre-trained model should be stored in the ssnet_fop/best_fc2_<modality_1>_<modality_2>_model folder. A model is
also stored in this folder any time you re-run the code to train a new one. Pre-trained model instances are available (not yet ready) here.
Their names follow the same convention as the model folder and allow you to identify which features were used during training.
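To inspect or reuse such a checkpoint, you can load it with PyTorch, for example as follows (a sketch; the exact keys stored in checkpoint.pth.tar depend on how main.py saves the model):

```python
import torch

# path of the pre-trained model shipped with the repository
ckpt_path = "ssnet_fop/best_fc2_mfcc_bow_mfcc_bow_model/checkpoint.pth.tar"

# map_location="cpu" lets you inspect the checkpoint on machines without a GPU;
# weights_only=False may be needed if the checkpoint stores more than raw tensors
checkpoint = torch.load(ckpt_path, map_location="cpu", weights_only=False)
print(checkpoint.keys())  # e.g. model state dict, training epoch, validation metric
```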
We recommend using a conda environment to run the code. Once you have installed conda, you can run the following commands to reproduce our setup:

```
conda create -n sbnet python=3.12
conda activate sbnet
pip install pandas==2.3.3 scikit-learn==1.7.2 scipy==1.16.2 tqdm==4.67.1 torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0
```
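Afterwards, you can quickly check that PyTorch is installed correctly and whether a GPU is visible (relevant because training defaults to the cuda device):

```python
import torch

print(torch.__version__)          # should print 2.8.0
print(torch.cuda.is_available())  # True if training can run on the GPU
```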