Question
Please help understand the cause and solution of the problem of unable to remove the audio noise by using the signal power as filter threshold. If the approach is not correct, please advise the better or correct ways.
Background
Try to remove the audio noises in the JFK speech. For instance, ... my fellow Americans <noise> ask not what your country can do for you <noise> ..., which are the red rectangles in the waveform.
Ideally aiming to get similar to the melspectgram below where the noises are removed in between phrases.
Data
# https://github.com/openai/whisper/blob/main/tests/jfk.flac wget -P ./data https://github.com/openai/whisper/blob/main/tests/jfk.flac Idea
Using the power of the signal to remove noise as Professor Steve Brunton showed in Denoising Data with FFT. The code exceprt from CH02_SEC02_2_Denoise.ipynb in CODE.zip + DATA.zip (PYTHON CODE by Daniel Dylewsky, unzip into same folder):
fhat = np.fft.fft(f,n) # Compute the FFT PSD = fhat * np.conj(fhat) / n # Power spectrum (power per freq) indices = PSD > 100 # Find all freqs with large power fhat = indices * fhat # Zero out small Fourier coeffs. in Y ffilt = np.fft.ifft(fhat) # Inverse FFT for filtered time signal Problem
Implemented the code, but it could not remove the noises in between phrases as the white cloudy parts seen in the mel spectgraum. Not sure why the weak power (white cloudy parts) cannot not been removed.
Code
First, find out the power range where the human voice 300 - 3000 Hz exists. From the power/frequency diagram, I should extract above -100 dB, hence this is the power threshold to filter.
import librosa import matplotlib.pyplot as plt import numpy as np import scipy as sp data, original_sampling_rate = librosa.load("./data/jfk.flac", sr=None) sampling_rate = original_sampling_rate N = num_total_samples = data.shape[0] # -------------------------------------------------------------------------------- # Power per Frequency # -------------------------------------------------------------------------------- dft = np.fft.rfft(data, norm="forward") amplitude = 2 * np.abs(dft) db = 20 * np.log10(amplitude) frequency = np.fft.rfftfreq(n=len(data), d=1/sampling_rate) plt.figure().set_figwidth(8) plt.plot(frequency, db) plt.xlabel("Frequency (Hz)") plt.ylabel("Power (dB)") plt.grid(visible=True, which='major') plt.grid(visible=True, which='minor', linestyle='--', alpha=0.5) plt.xscale("log") # -------------------------------------------------------------------------------- # Mel Spectrogram # -------------------------------------------------------------------------------- S = librosa.feature.melspectrogram( y=data, sr=sampling_rate, n_mels=128, fmax=sampling_rate/2, n_fft=1024, hop_length=512 ) S_dB = librosa.power_to_db(S, ref=1, top_db=None, amin=1e-10) plt.figure().set_figwidth(8) librosa.display.specshow(S_dB, x_axis="time", y_axis="mel", sr=sampling_rate, fmax=sampling_rate/2) plt.colorbar() Extract the area surrounded by the threshold -100 dB and the 300-3000 band.
# -------------------------------------------------------------------------------- # Cut low/high frequencies with scipy.signal.butterworth # -------------------------------------------------------------------------------- HIGHCUT = 3000 LOWCUT = 300 ORDER = 10 bandpassed = butter_bandpass_filter(signal=data, lowcut=LOWCUT, highcut=HIGHCUT, sampling_rate=sampling_rate, order=ORDER) # -------------------------------------------------------------------------------- # Cut low power < -100 dB # -------------------------------------------------------------------------------- THRESHOLD = -100 fhat = np.fft.rfft(a=bandpassed, norm="forward", axis=-1) fhat_amplitude = 2 * np.abs(fhat) fhat_power = 20 * np.log10(fhat_amplitude) # zero out < -100 dB fhat_filter_indices = fhat_power > THRESHOLD fhat_power_filtered = fhat * fhat_filter_indices plt.figure().set_figwidth(8) plt.plot(frequency, 20 * np.log10(2 * np.abs(fhat_power_filtered))) plt.grid(visible=True, which='major') plt.grid(visible=True, which='minor', linestyle='--', alpha=0.5) plt.xlabel("Frequency (Hz)") plt.ylabel("Power (dB)") plt.grid() plt.xscale("log") However, the noises in-between were not removed.
data_filtered = np.fft.irfft(fhat_power_filtered, norm="forward") plt.figure().set_figwidth(8) plt.grid() librosa.display.waveshow(y=data_filtered, sr=sampling_rate) S_filtered = librosa.feature.melspectrogram(y=data_filtered, sr=sampling_rate, n_mels=128, fmax=sampling_rate/2) S_filtered_dB = librosa.power_to_db(S_filtered) plt.figure().set_figwidth(8) librosa.display.specshow(S_filtered_dB, x_axis="time", y_axis="mel", sr=sampling_rate, fmax=sampling_rate/2) plt.colorbar() Workaround
If I cut the sound data with its amplitude, the result mel spectgram looks what needed.
data, original_sampling_rate = librosa.load("./data/jfk.flac", sr=None) sampling_rate = original_sampling_rate data = data * (data > 0.1) # <--- Cut the small amplitude data # Same code as above from here. However, the sound has high note noise and the voice is like that of someone with stuffed nose. Also, if power is form amplitude as in power = (amplitude ** 2) / 2, then why cut by amplitude remove the noises in-between, but power cannot?
from IPython.display import ( Audio, display ) display(Audio(data=data_filtered, rate=sampling_rate)) Butterworth filter
def butter_bandpass( lowcut, highcut, sampling_rate: int, order: int = 5 ): """ Args: signal: signal data to filter lowcut: low cut-off frequency highcut: high cut-off frequency sampling_rate: sampling rate used to sample the signal order: filter order Returns: filtered output """ nyquist = 0.5 * sampling_rate low = lowcut / nyquist high = highcut / nyquist sos = sp.signal.butter(N=order, Wn=[low, high], btype='band', analog=False, output='sos') return sos def butter_bandpass_filter( signal: np.ndarray, lowcut, highcut, sampling_rate: int, order: int = 5 ): """Butterworth bandpath filter Args: signal: signal data to filter lowcut: low cut-off frequency highcut: high cut-off frequency sampling_rate: sampling rate used to sample the signal order: filter order """ sos = butter_bandpass( lowcut=lowcut, highcut=highcut, sampling_rate=sampling_rate, order=order) y = sp.signal.sosfilt(sos=sos, x=signal) return y Update
It looks zero-ing in the frequency domain is not a good idea according to Why is it a bad idea to filter by zeroing out FFT bins?, but I have no idea what the explanation mean.
Zeroing bins in the frequency domain is the same as multiplying by a rectangular window in the frequency domain. Multiplying by a window in the frequency domain is the same as circular convolution by the transform of that window in the time domain. The transform of a rectangular window is the Sinc function ($\sin(\omega t)/\omega t$). Note that the Sinc function has lots of large ripples and ripples that extend the full width of time domain aperture. If a time-domain filter that can output all those ripples (ringing) is a "bad idea", then so is zeroing bins.








