I am working on an audio classification problem with two classes. I collected samples through Jotform, whose audio widget is supposed to collect .wav audio, but it turned out that the widget actually stores the data in .mp3 format.
Because of this, the two classes come in different formats:
Class A: all samples are in .wav format. Class B: all 100 samples are in .mp3 format (the Jotform collection). I am adding a sample from each class here:
Class A sample audio : it's in .wav format
Details :
General
Complete name : count_class_1.wav
Format : Wave
File size : 1.41 MiB
Duration : 15 s 445 ms
Overall bit rate mode : Constant
Overall bit rate : 768 kb/s

Audio
Format : PCM
Format settings : Little / Signed
Codec ID : 1
Duration : 15 s 445 ms
Bit rate mode : Constant
Bit rate : 768 kb/s
Channel(s) : 1 channel
Sampling rate : 48.0 kHz
Bit depth : 16 bits
Stream size : 1.41 MiB (100%)

Class B sample audio : Jotform says it is .wav, but only the extension is .wav; the file is actually in .mp3 format.
Details :
General
Complete name : count.wav
Format : MPEG Audio
File size : 183 KiB
Duration : 9 s 360 ms
Overall bit rate mode : Constant
Overall bit rate : 160 kb/s
Writing library : LAME3.99.5
FileExtension_Invalid : m1a mpa mpa1 mp1 m2a mpa2 mp2 mp3

Audio
Format : MPEG Audio
Format version : Version 1
Format profile : Layer 3
Format settings : Joint stereo / MS Stereo
Duration : 9 s 360 ms
Bit rate mode : Constant
Bit rate : 160 kb/s
Channel(s) : 2 channels
Sampling rate : 48.0 kHz
Frame rate : 41.667 FPS (1152 SPF)
Compression mode : Lossy
Stream size : 183 KiB (100%)
Writing library : LAME3.99.5
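Since the extension cannot be trusted here, I check the actual container by sniffing the file header instead of relying on the file name. A minimal sketch of such a check (the `sniff_audio_format` helper and the hard-coded file names are only illustrative, assuming Python 3 with no extra dependencies):

```
# Sketch: identify whether a ".wav" file is really RIFF/WAVE or an MPEG stream
# by looking at the first bytes instead of trusting the extension.
def sniff_audio_format(path):
    """Return 'wav', 'mp3' or 'unknown' based on the file header."""
    with open(path, "rb") as fh:
        header = fh.read(12)
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"   # PCM/RIFF container, like the class A files
    if header[:3] == b"ID3" or (len(header) > 1 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0):
        return "mp3"   # ID3 tag or MPEG frame sync, like the Jotform files
    return "unknown"

if __name__ == "__main__":
    for f in ["count_class_1.wav", "count.wav"]:   # the two sample files above
        print(f, "->", sniff_audio_format(f))
```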
What I am doing before feeding the audio to the neural network (a code sketch of these steps follows the list):
- Downsampled to 16 kHz and normalized the signal level
- Segmented the audio by removing the silences in the signal
- High-pass filtered (pre-emphasis filter); the audio segments were then divided into non-overlapping Hamming-windowed frames of 25 ms
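For reference, here is a rough sketch of how these preprocessing steps could look in code (assuming librosa and numpy; the `top_db=30` silence threshold and the 0.97 pre-emphasis coefficient are placeholder values, not necessarily the exact ones I use):

```
# Sketch of the preprocessing pipeline described above.
# Decoding the .mp3 files relies on librosa's audioread/soundfile backends.
import numpy as np
import librosa

SR = 16000                    # target sampling rate
FRAME_LEN = int(0.025 * SR)   # 25 ms -> 400 samples

def preprocess(path, top_db=30, preemph=0.97):
    # 1. Decode (wav or mp3), downmix to mono and resample to 16 kHz in one step
    y, _ = librosa.load(path, sr=SR, mono=True)
    # 2. Peak-normalize the signal level
    y = librosa.util.normalize(y)
    # 3. Drop silent regions and stitch the voiced segments back together
    intervals = librosa.effects.split(y, top_db=top_db)
    y = np.concatenate([y[s:e] for s, e in intervals])
    # 4. Pre-emphasis (first-order high-pass filter)
    y = librosa.effects.preemphasis(y, coef=preemph)
    # 5. Non-overlapping 25 ms frames, Hamming-windowed
    frames = librosa.util.frame(y, frame_length=FRAME_LEN, hop_length=FRAME_LEN)
    return frames * np.hamming(FRAME_LEN)[:, None]
```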
After this, I extract various features from each frame, including MFCCs, zero-crossing rate (ZCR), the first four formants, etc., and finally feed all these features to a simple dense neural network, or to a CNN when using the spectrogram representation.
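A sketch of the feature extraction under the same assumptions (librosa; formant estimation is omitted here, and the 40 mel bands are a placeholder choice):

```
# Sketch: per-frame features for the dense network and a log-mel spectrogram
# for the CNN, using the same 25 ms non-overlapping framing as above.
import numpy as np
import librosa

SR, FRAME_LEN = 16000, 400   # 25 ms frames at 16 kHz

def frame_features(y):
    """Per-frame MFCCs + ZCR for the dense network (shape: n_frames x 14)."""
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13,
                                n_fft=FRAME_LEN, hop_length=FRAME_LEN,
                                win_length=FRAME_LEN, window="hamming")
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=FRAME_LEN,
                                             hop_length=FRAME_LEN)
    return np.vstack([mfcc, zcr]).T

def spectrogram_image(y):
    """Log-mel spectrogram input for the CNN branch."""
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=FRAME_LEN,
                                         hop_length=FRAME_LEN, n_mels=40)
    return librosa.power_to_db(mel, ref=np.max)
```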
The problem is that the two classes' audio files are in different formats: class A samples are .wav and class B samples are .mp3, so there is a high chance that the network becomes biased towards the format or audio encoding rather than the actual class content.
Solutions I have thought of:
- Downsample all files to 16 kHz (but the format issue is still there)
- Or convert all files into one universal format, for example converting all .mp3 files to .wav so that every file has the same format; I could convert one into the other, but I am afraid I will lose quality in the converted files (a conversion sketch follows this list)
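For the second option, a minimal conversion sketch (assuming librosa and soundfile; the `dataset_raw` / `dataset_16k` folder names are placeholders): decode every file once, resample to 16 kHz mono, and re-save everything as 16-bit PCM .wav so both classes go through the same decode path:

```
# Sketch: convert the whole dataset (wav and mp3 alike) to 16 kHz mono 16-bit PCM wav.
import os
import librosa
import soundfile as sf

SR = 16000
SRC_DIR, DST_DIR = "dataset_raw", "dataset_16k"   # placeholder folder names

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    if not name.lower().endswith((".wav", ".mp3")):
        continue
    # librosa decodes both containers and resamples to the target rate
    y, _ = librosa.load(os.path.join(SRC_DIR, name), sr=SR, mono=True)
    out = os.path.join(DST_DIR, os.path.splitext(name)[0] + ".wav")
    sf.write(out, y, SR, subtype="PCM_16")        # 16-bit PCM, 16 kHz mono
```

Of course, re-encoding .mp3 as .wav does not bring back what the MP3 encoder discarded; it only ensures that both classes are stored and decoded identically.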
My doubt: if I downsample both classes' audio samples (.wav and .mp3 alike) to 16 kHz, will my neural network still be format-biased?
What would be a good strategy for audio classification when the audio files come in different formats?