Temporal Shift for Speech Emotion Recognition [arXiv]

Code for ICASSP 2023 paper "Mingling or Misalignment? Temporal Shift for Speech Emotion Recognition with Pre-trained Representations".


Libraries and Dependencies

Data Preparation

Download our preprocessed wavfeature_7.5.tar.gz directly and unzip it to the /dataset/IEMOCAP directory.

Or obtain IEMOCAP from USC and run

```shell
cd dataset/IEMOCAP
python Preprocess.py --path $path to directory of IEMOCAP$
cd ../..
```

You will then have wavfeature_7.5.pkl, in which each processed audio clip is cut to 7.5 s and sampled at 16 kHz.
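Preprocess.py handles this step; purely for illustration, here is a minimal NumPy sketch of fixed-length clipping at 16 kHz (zero-padding short utterances is an assumption, not necessarily what the script does):

```python
import numpy as np

SAMPLE_RATE = 16000                              # target sampling rate
CLIP_SECONDS = 7.5                               # fixed clip length
CLIP_SAMPLES = int(SAMPLE_RATE * CLIP_SECONDS)   # 120000 samples

def clip_or_pad(wav: np.ndarray) -> np.ndarray:
    """Truncate waveforms longer than 7.5 s; zero-pad shorter ones."""
    if len(wav) >= CLIP_SAMPLES:
        return wav[:CLIP_SAMPLES]
    return np.pad(wav, (0, CLIP_SAMPLES - len(wav)))

# a 10 s waveform is truncated, a 3 s one is padded: both end up at 120000 samples
long_clip = clip_or_pad(np.zeros(10 * SAMPLE_RATE))
short_clip = clip_or_pad(np.zeros(3 * SAMPLE_RATE))
```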

Training

We train the models specified in our paper with the same placement/proportion of shift. Note that the placement/proportion of shift and other hyperparameters (see config.py) can be adjusted flexibly.

Key arguments for easy config modification in main.py are as follows:

  • --model: the model chosen for training.
    • cnn: ConvNeXt-like 1D CNN with 2 blocks.
    • rnn: 1-layer bidirectional LSTM.
    • transformer: 2-block Transformer with relative positional embedding by default.
  • --shift: whether to use temporal shift. The placement of shift is hard-coded as described in our paper: residual shift for the CNN, replacement of MHSA for the Transformer, and in-place shift for the LSTM. Alternative placements are provided in comments.
  • --ndiv: proportion of shift; 1/ndiv of the channels are shifted while the others remain unchanged.
  • --stride: the step by which channels are shifted along time, kept as 1 by default.
  • --bidirectional: whether to use bidirectional temporal shift.
  • --finetune: whether to finetune the pretrained model or use it as a frozen feature extractor. By default, we use wav2vec 2.0 for finetuning and HuBERT for feature extraction.
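To make the --ndiv, --stride, and --bidirectional options concrete, here is a plain-NumPy sketch of a bidirectional temporal shift over a (time, channels) tensor. The function name and the zero-filling of vacated positions are illustrative assumptions; the repo's actual modules operate on batched PyTorch tensors:

```python
import numpy as np

def temporal_shift(x: np.ndarray, ndiv: int = 4, stride: int = 1,
                   bidirectional: bool = True) -> np.ndarray:
    """Shift 1/ndiv of the channels along time; the rest stay unchanged.

    x has shape (T, C). With bidirectional=True, one 1/ndiv slice of
    channels moves forward in time by `stride`, another moves backward,
    and vacated positions are zero-filled.
    """
    T, C = x.shape
    fold = C // ndiv
    out = np.zeros_like(x)
    if bidirectional:
        out[stride:, :fold] = x[:-stride, :fold]                   # shift forward
        out[:-stride, fold:2 * fold] = x[stride:, fold:2 * fold]   # shift backward
        out[:, 2 * fold:] = x[:, 2 * fold:]                        # untouched channels
    else:
        out[stride:, :fold] = x[:-stride, :fold]                   # forward only
        out[:, fold:] = x[:, fold:]
    return out

x = np.arange(24, dtype=float).reshape(4, 6)   # T=4 frames, C=6 channels
y = temporal_shift(x, ndiv=3)                  # 2 channels forward, 2 backward, 2 kept
```

With ndiv=3 and 6 channels, two channels are shifted in each direction and the remaining two pass through unchanged, matching the "1/ndiv shifted, others unchanged" description above.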

For CNN and ShiftCNN

```shell
# Feature extraction for basic ConvNeXt
python main.py --model cnn
# Finetuning for basic ConvNeXt
python main.py --model cnn --finetune
# Feature extraction for ShiftCNN
python main.py --model cnn --shift --ndiv 16
# Finetuning for ShiftCNN
python main.py --model cnn --shift --ndiv 16 --finetune
```

For Transformer and Shiftformer

```shell
# Feature extraction for Transformer
python main.py --model transformer
# Finetuning for Transformer
python main.py --model transformer --finetune
# Feature extraction for Shiftformer
python main.py --model transformer --shift --ndiv 4 --bidirectional
# Finetuning for Shiftformer
python main.py --model transformer --shift --ndiv 4 --bidirectional --finetune
```

For LSTM and ShiftLSTM

```shell
# Feature extraction for LSTM
python main.py --model rnn
# Finetuning for LSTM
python main.py --model rnn --finetune
# Feature extraction for ShiftLSTM
python main.py --model rnn --shift --ndiv 4
# Finetuning for ShiftLSTM
python main.py --model rnn --shift --ndiv 4 --finetune
```

The final results of the 5 folds are expected to be in /log.

Citation

```
@inproceedings{shen2023mingling,
  title={Mingling or Misalignment? Temporal Shift for Speech Emotion Recognition with Pre-trained Representations},
  author={Shen, Siyuan and Liu, Feng and Zhou, Aimin},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
```
