3
$\begingroup$

Question

Please help understand the cause and solution of the problem of unable to remove the audio noise by using the signal power as filter threshold. If the approach is not correct, please advise the better or correct ways.

Background

Try to remove the audio noises in the JFK speech. For instance, ... my fellow Americans <noise> ask not what your country can do for you <noise> ..., which are the red rectangles in the waveform.

enter image description here

Ideally aiming to get similar to the melspectgram below where the noises are removed in between phrases.

enter image description here

Data

# https://github.com/openai/whisper/blob/main/tests/jfk.flac wget -P ./data https://github.com/openai/whisper/blob/main/tests/jfk.flac 

Idea

Using the power of the signal to remove noise as Professor Steve Brunton showed in Denoising Data with FFT. The code exceprt from CH02_SEC02_2_Denoise.ipynb in CODE.zip + DATA.zip (PYTHON CODE by Daniel Dylewsky, unzip into same folder):

fhat = np.fft.fft(f,n) # Compute the FFT PSD = fhat * np.conj(fhat) / n # Power spectrum (power per freq) indices = PSD > 100 # Find all freqs with large power fhat = indices * fhat # Zero out small Fourier coeffs. in Y ffilt = np.fft.ifft(fhat) # Inverse FFT for filtered time signal 

enter image description here

Problem

Implemented the code, but it could not remove the noises in between phrases as the white cloudy parts seen in the mel spectgraum. Not sure why the weak power (white cloudy parts) cannot not been removed.

enter image description here

Code

First, find out the power range where the human voice 300 - 3000 Hz exists. From the power/frequency diagram, I should extract above -100 dB, hence this is the power threshold to filter.

import librosa import matplotlib.pyplot as plt import numpy as np import scipy as sp data, original_sampling_rate = librosa.load("./data/jfk.flac", sr=None) sampling_rate = original_sampling_rate N = num_total_samples = data.shape[0] # -------------------------------------------------------------------------------- # Power per Frequency # -------------------------------------------------------------------------------- dft = np.fft.rfft(data, norm="forward") amplitude = 2 * np.abs(dft) db = 20 * np.log10(amplitude) frequency = np.fft.rfftfreq(n=len(data), d=1/sampling_rate) plt.figure().set_figwidth(8) plt.plot(frequency, db) plt.xlabel("Frequency (Hz)") plt.ylabel("Power (dB)") plt.grid(visible=True, which='major') plt.grid(visible=True, which='minor', linestyle='--', alpha=0.5) plt.xscale("log") # -------------------------------------------------------------------------------- # Mel Spectrogram # -------------------------------------------------------------------------------- S = librosa.feature.melspectrogram( y=data, sr=sampling_rate, n_mels=128, fmax=sampling_rate/2, n_fft=1024, hop_length=512 ) S_dB = librosa.power_to_db(S, ref=1, top_db=None, amin=1e-10) plt.figure().set_figwidth(8) librosa.display.specshow(S_dB, x_axis="time", y_axis="mel", sr=sampling_rate, fmax=sampling_rate/2) plt.colorbar() 

enter image description here

Extract the area surrounded by the threshold -100 dB and the 300-3000 band.

# -------------------------------------------------------------------------------- # Cut low/high frequencies with scipy.signal.butterworth # -------------------------------------------------------------------------------- HIGHCUT = 3000 LOWCUT = 300 ORDER = 10 bandpassed = butter_bandpass_filter(signal=data, lowcut=LOWCUT, highcut=HIGHCUT, sampling_rate=sampling_rate, order=ORDER) # -------------------------------------------------------------------------------- # Cut low power < -100 dB # -------------------------------------------------------------------------------- THRESHOLD = -100 fhat = np.fft.rfft(a=bandpassed, norm="forward", axis=-1) fhat_amplitude = 2 * np.abs(fhat) fhat_power = 20 * np.log10(fhat_amplitude) # zero out < -100 dB fhat_filter_indices = fhat_power > THRESHOLD fhat_power_filtered = fhat * fhat_filter_indices plt.figure().set_figwidth(8) plt.plot(frequency, 20 * np.log10(2 * np.abs(fhat_power_filtered))) plt.grid(visible=True, which='major') plt.grid(visible=True, which='minor', linestyle='--', alpha=0.5) plt.xlabel("Frequency (Hz)") plt.ylabel("Power (dB)") plt.grid() plt.xscale("log") 

enter image description here

However, the noises in-between were not removed.

data_filtered = np.fft.irfft(fhat_power_filtered, norm="forward") plt.figure().set_figwidth(8) plt.grid() librosa.display.waveshow(y=data_filtered, sr=sampling_rate) S_filtered = librosa.feature.melspectrogram(y=data_filtered, sr=sampling_rate, n_mels=128, fmax=sampling_rate/2) S_filtered_dB = librosa.power_to_db(S_filtered) plt.figure().set_figwidth(8) librosa.display.specshow(S_filtered_dB, x_axis="time", y_axis="mel", sr=sampling_rate, fmax=sampling_rate/2) plt.colorbar() 

enter image description here


Workaround

If I cut the sound data with its amplitude, the result mel spectgram looks what needed.

data, original_sampling_rate = librosa.load("./data/jfk.flac", sr=None) sampling_rate = original_sampling_rate data = data * (data > 0.1) # <--- Cut the small amplitude data # Same code as above from here. 

enter image description here

However, the sound has high note noise and the voice is like that of someone with stuffed nose. Also, if power is form amplitude as in power = (amplitude ** 2) / 2, then why cut by amplitude remove the noises in-between, but power cannot?

from IPython.display import ( Audio, display ) display(Audio(data=data_filtered, rate=sampling_rate)) 

Butterworth filter

def butter_bandpass( lowcut, highcut, sampling_rate: int, order: int = 5 ): """ Args: signal: signal data to filter lowcut: low cut-off frequency highcut: high cut-off frequency sampling_rate: sampling rate used to sample the signal order: filter order Returns: filtered output """ nyquist = 0.5 * sampling_rate low = lowcut / nyquist high = highcut / nyquist sos = sp.signal.butter(N=order, Wn=[low, high], btype='band', analog=False, output='sos') return sos def butter_bandpass_filter( signal: np.ndarray, lowcut, highcut, sampling_rate: int, order: int = 5 ): """Butterworth bandpath filter Args: signal: signal data to filter lowcut: low cut-off frequency highcut: high cut-off frequency sampling_rate: sampling rate used to sample the signal order: filter order """ sos = butter_bandpass( lowcut=lowcut, highcut=highcut, sampling_rate=sampling_rate, order=order) y = sp.signal.sosfilt(sos=sos, x=signal) return y 

Update

It looks zero-ing in the frequency domain is not a good idea according to Why is it a bad idea to filter by zeroing out FFT bins?, but I have no idea what the explanation mean.

Zeroing bins in the frequency domain is the same as multiplying by a rectangular window in the frequency domain. Multiplying by a window in the frequency domain is the same as circular convolution by the transform of that window in the time domain. The transform of a rectangular window is the Sinc function ($\sin(\omega t)/\omega t$). Note that the Sinc function has lots of large ripples and ripples that extend the full width of time domain aperture. If a time-domain filter that can output all those ripples (ringing) is a "bad idea", then so is zeroing bins.

$\endgroup$
2
  • 2
    $\begingroup$ I don’t have time to go through your lengthy question at the moment, but you might have better luck with a Weiner filter $\endgroup$ Commented Jul 26, 2024 at 4:31
  • 2
    $\begingroup$ The thresholding approaches are almost never the way to go. Consider the fact that you will get Gib's phenomena if you go with the "brickwall" frequency filtering approach. In addition, filtering the waveform will also remove the data that is not considered "noise" but still under the threshold that you pick. In this case, the waveform that is going from e.g., 1000 to -1000 will jump from +threshold to 0 and then from 0 to -threshold. There are many denoisers out there, check this one for example: pypi.org/project/noisereduce $\endgroup$ Commented Jul 26, 2024 at 7:52

1 Answer 1

1
$\begingroup$

I still do not completely understand the exact mechanism, but using a threshold and zero-out in the frequency domain was a wrong way as explained in Why is it a bad idea to filter by zeroing out FFT bins?, which further broken down into the explanation in Explanation to layman on the effect of zero-ing in frequency domain

Zeroing out a bin is the same as subtracting out a perfect sine wave of a specific bin centered frequency.

Most (or all) real world signals are imperfect. If the signal you are trying to filter out is slightly different from an exactly perfect sinewave of that bin frequency, the subtraction will leave behind a lot of "junk".

Signals that are "between bins" are different from any (or any small subset of) basis vectors of an FFT, thus can't be removed by subtracting a basis vector, which are all perfect sine waves of bin centered frequencies, e.g. frequencies whose period is an exact integer sub-multiple of the FFT's length. The subtraction will leave behind, often nasty, "junk".


Weiner Filter

As suggested by Jdip, experimented it and got a better result with background noise reduced in the speech.

wienered = sp.signal.wiener(im=data, mysize=28) # <--- 28 is found out by tweakig mysize paramter. bandpassed = butter_bandpass_filter(signal=wienered, lowcut=LOWCUT, highcut=HIGHCUT, sampling_rate=sampling_rate, order=ORDER) S = librosa.feature.melspectrogram(y=bandpassed, sr=sampling_rate, n_mels=128, fmax=sampling_rate/2) S_dB = librosa.power_to_db(S) plt.figure().set_figwidth(12) librosa.display.specshow(S_dB, x_axis="time", y_axis="mel", sr=sampling_rate, fmax=sampling_rate/4) plt.colorbar() 

enter image description here

$\endgroup$
1
  • $\begingroup$ I recommend you check this: Gibbs phenomenon $\endgroup$ Commented Jul 29, 2024 at 6:07

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.