I have been working on audio classification with the UrbanSound8K dataset and the MPSSC snore classification dataset. I am using transfer learning: I extract features from AlexNet and VGG19 pre-trained on ImageNet and feed those features to an SVM. Strangely, on both datasets I get better performance when I render the spectrogram with the viridis colormap than when I copy the same 2D grayscale spectrogram array into each of the 3 channels. What I don't understand is how a colormap can add any information that wasn't present in the original spectrogram.
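To make the comparison concrete, the two kinds of input can be sketched roughly as below. This is only a minimal illustration, not my exact pipeline: it assumes NumPy and matplotlib, min-max scaling of the spectrogram to [0, 1], and the helper names (`normalize`, `to_gray_rgb`, `to_viridis_rgb`) are hypothetical. Resizing to the network input size and ImageNet mean/std normalization are left out.

```python
import numpy as np
import matplotlib.pyplot as plt


def normalize(spec):
    """Min-max scale a 2D spectrogram to [0, 1] (assumed preprocessing)."""
    spec = np.asarray(spec, dtype=np.float64)
    value_range = spec.max() - spec.min()
    if value_range == 0:
        # Constant input: avoid division by zero.
        return np.zeros_like(spec)
    return (spec - spec.min()) / value_range


def to_gray_rgb(spec):
    """Copy the same 2D array into each of the 3 channels."""
    norm = normalize(spec)
    return np.stack([norm, norm, norm], axis=-1)


def to_viridis_rgb(spec):
    """Map the 2D array through the viridis colormap and drop alpha."""
    norm = normalize(spec)
    rgba = plt.get_cmap("viridis")(norm)  # shape (H, W, 4), floats in [0, 1]
    return rgba[..., :3]


# Hypothetical stand-in for a real (n_mels, n_frames) spectrogram.
rng = np.random.default_rng(0)
spec = rng.random((64, 32))

gray_input = to_gray_rgb(spec)
viridis_input = to_viridis_rgb(spec)
print(gray_input.shape, viridis_input.shape)
```

Both arrays are deterministic functions of the same 2D spectrogram. The only difference is that the grayscale version has three identical channels, while viridis spreads the values across the channels differently.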
I went through answers such as "Do I need 3 RGB channels for a spectrogram CNN?", which say that training a CNN gives similar performance with different colormaps. Is the same true for pre-trained networks too?