I am working on an audio classification problem with two classes. I collected samples through Jotform, whose audio widget is supposed to collect .wav audio, but it turned out that the widget actually stores the data in .mp3 format.
Because of this, the two classes come in different formats:
Class A: all samples are in .wav format. Class B: all 100 samples are in .mp3 format (the Jotform collection). I am adding a sample from each class here:
Class A sample audio : it's in .wav format
Details :
General
Complete name : count_class_1.wav
Format : Wave
File size : 1.41 MiB
Duration : 15 s 445 ms
Overall bit rate mode : Constant
Overall bit rate : 768 kb/s

Audio
Format : PCM
Format settings : Little / Signed
Codec ID : 1
Duration : 15 s 445 ms
Bit rate mode : Constant
Bit rate : 768 kb/s
Channel(s) : 1 channel
Sampling rate : 48.0 kHz
Bit depth : 16 bits
Stream size : 1.41 MiB (100%)

Class B sample audio : Jotform says it is .wav, but only the extension is .wav; the file is actually in .mp3 format.
Details :
General
Complete name : count.wav
Format : MPEG Audio
File size : 183 KiB
Duration : 9 s 360 ms
Overall bit rate mode : Constant
Overall bit rate : 160 kb/s
Writing library : LAME3.99.5
FileExtension_Invalid : m1a mpa mpa1 mp1 m2a mpa2 mp2 mp3

Audio
Format : MPEG Audio
Format version : Version 1
Format profile : Layer 3
Format settings : Joint stereo / MS Stereo
Duration : 9 s 360 ms
Bit rate mode : Constant
Bit rate : 160 kb/s
Channel(s) : 2 channels
Sampling rate : 48.0 kHz
Frame rate : 41.667 FPS (1152 SPF)
Compression mode : Lossy
Stream size : 183 KiB (100%)
Writing library : LAME3.99.5
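Since the extension cannot be trusted here, I check the actual container by sniffing the file header instead of relying on the file name. A minimal sketch of such a check (the `sniff_audio_format` helper and the hard-coded file names are only illustrative, assuming Python 3 with no extra dependencies):

```
# Sketch: identify whether a ".wav" file is really RIFF/WAVE or an MPEG stream
# by looking at the first bytes instead of trusting the extension.
def sniff_audio_format(path):
    """Return 'wav', 'mp3' or 'unknown' based on the file header."""
    with open(path, "rb") as fh:
        header = fh.read(12)
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"   # PCM/RIFF container, like the class A files
    if header[:3] == b"ID3" or (len(header) > 1 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0):
        return "mp3"   # ID3 tag or MPEG frame sync, like the Jotform files
    return "unknown"

if __name__ == "__main__":
    for f in ["count_class_1.wav", "count.wav"]:   # the two sample files above
        print(f, "->", sniff_audio_format(f))
```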
What I am doing before feeding the audio to the neural network (a code sketch of these steps follows the list):
- Downsampled to 16 kHz and normalized the signal level
- Segmented the audio by removing the silences in the signal
- High-pass filtered (pre-emphasis filter); the audio segments were then divided into non-overlapping Hamming-windowed frames of 25 ms
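For reference, here is a rough sketch of how these preprocessing steps could look in code (assuming librosa and numpy; the `top_db=30` silence threshold and the 0.97 pre-emphasis coefficient are placeholder values, not necessarily the exact ones I use):

```
# Sketch of the preprocessing pipeline described above.
# Decoding the .mp3 files relies on librosa's audioread/soundfile backends.
import numpy as np
import librosa

SR = 16000                    # target sampling rate
FRAME_LEN = int(0.025 * SR)   # 25 ms -> 400 samples

def preprocess(path, top_db=30, preemph=0.97):
    # 1. Decode (wav or mp3), downmix to mono and resample to 16 kHz in one step
    y, _ = librosa.load(path, sr=SR, mono=True)
    # 2. Peak-normalize the signal level
    y = librosa.util.normalize(y)
    # 3. Drop silent regions and stitch the voiced segments back together
    intervals = librosa.effects.split(y, top_db=top_db)
    y = np.concatenate([y[s:e] for s, e in intervals])
    # 4. Pre-emphasis (first-order high-pass filter)
    y = librosa.effects.preemphasis(y, coef=preemph)
    # 5. Non-overlapping 25 ms frames, Hamming-windowed
    frames = librosa.util.frame(y, frame_length=FRAME_LEN, hop_length=FRAME_LEN)
    return frames * np.hamming(FRAME_LEN)[:, None]
```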
After this, I extract various features from each frame, including MFCCs, zero-crossing rate (ZCR), the first four formants, etc., and finally feed all these features to a simple dense neural network, or to a CNN when using the spectrogram representation.
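A sketch of the feature extraction under the same assumptions (librosa; formant estimation is omitted here, and the 40 mel bands are a placeholder choice):

```
# Sketch: per-frame features for the dense network and a log-mel spectrogram
# for the CNN, using the same 25 ms non-overlapping framing as above.
import numpy as np
import librosa

SR, FRAME_LEN = 16000, 400   # 25 ms frames at 16 kHz

def frame_features(y):
    """Per-frame MFCCs + ZCR for the dense network (shape: n_frames x 14)."""
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=13,
                                n_fft=FRAME_LEN, hop_length=FRAME_LEN,
                                win_length=FRAME_LEN, window="hamming")
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=FRAME_LEN,
                                             hop_length=FRAME_LEN)
    return np.vstack([mfcc, zcr]).T

def spectrogram_image(y):
    """Log-mel spectrogram input for the CNN branch."""
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=FRAME_LEN,
                                         hop_length=FRAME_LEN, n_mels=40)
    return librosa.power_to_db(mel, ref=np.max)
```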
The problem is that the two classes' audio files are in different formats: class A samples are .wav and class B samples are .mp3, so there is a high chance that the network becomes biased towards the format or audio encoding rather than the actual class content.
Solutions I have thought of:
- Downsample all files to 16 kHz (but the format issue is still there)
- Or convert all files into one universal format, for example converting all .mp3 files to .wav so that every file has the same format; I could convert one into the other, but I am afraid I will lose quality in the converted files (a conversion sketch follows this list)
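For the second option, a minimal conversion sketch (assuming librosa and soundfile; the `dataset_raw` / `dataset_16k` folder names are placeholders): decode every file once, resample to 16 kHz mono, and re-save everything as 16-bit PCM .wav so both classes go through the same decode path:

```
# Sketch: convert the whole dataset (wav and mp3 alike) to 16 kHz mono 16-bit PCM wav.
import os
import librosa
import soundfile as sf

SR = 16000
SRC_DIR, DST_DIR = "dataset_raw", "dataset_16k"   # placeholder folder names

os.makedirs(DST_DIR, exist_ok=True)
for name in os.listdir(SRC_DIR):
    if not name.lower().endswith((".wav", ".mp3")):
        continue
    # librosa decodes both containers and resamples to the target rate
    y, _ = librosa.load(os.path.join(SRC_DIR, name), sr=SR, mono=True)
    out = os.path.join(DST_DIR, os.path.splitext(name)[0] + ".wav")
    sf.write(out, y, SR, subtype="PCM_16")        # 16-bit PCM, 16 kHz mono
```

Of course, re-encoding .mp3 as .wav does not bring back what the MP3 encoder discarded; it only ensures that both classes are stored and decoded identically.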
My doubt: if I downsample both classes' audio samples (.wav and .mp3 alike) to 16 kHz, will my neural network still be format-biased?
What would be a good strategy for audio classification when the audio files come in different formats?