STFT window width vs. detectable Hz range

Question

Assuming 44100 samples per second in a short 4-8 second-long audio .wav file, if I want to perform an FFT to detect the power spectrum, amplitude, and phase shift from 20 Hz to 10000 Hz (humans lose the ability to hear above 10000 Hz with increasing age), I am thinking I will need window lengths of at least 4410 samples? This is because, within a second, a 20 Hz cycle will occur every 2205 samples (44100/20), but due to Nyquist, I will need twice the 2205, or 4410 samples.

For human hearing, what is the typical window width used so that deconvolution will result in clean audio quality? For some reason, I think the long wavelengths of 20 Hz will require window widths no shorter than 4410 samples. Given this, why are there STFT matrices shown in reports that have 1024 bins in the time domain for 4-second .wav files? (see the image below from the NMF chapter at musicinformationretrieval.com). The 1024 bins must be based on the jump size -- that site never states that they employ window widths of 4410 with much shorter jump sizes.

UPDATE

The goal is to replicate the NMF results presented here via OOP with a compiled programming language (.NET) -- not an interpreted environment like R, Matlab, Librosa, etc. Their STFT results were obtained using Librosa by merely invoking the syntax S = librosa.stft(x). It would be futile to chase down Librosa's default settings for FFT bin spacing and window jump length for STFT matrix generation, because I am not using Librosa.

See the update in the OP.
– user16354, Commented Oct 26, 2019 at 17:38

hotpaw2 · Accepted Answer · 2019-10-26 17:08:59Z

Humans do not use FFTs or fixed size windows when perceiving sound. So the mechanism of displaying STFTs can't be used to categorize what a human will "detect", or to produce data that will have a "clean audio quality".

For high frequencies, humans can perceive with a higher time resolution than STFTs that are spaced apart by 1024 when overlapped.

The Nyquist sampling criterion is about the highest frequencies, not the lowest. For the lowest audible frequencies, IIRC, some psychoacoustic experiments seem to show that humans require 6 or more periods of a sound frequency to hear it as a tone at a point in time, which would be displayed by an FFT frame length of closer to 16384, not 4410.
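To make the arithmetic behind the 16384 figure explicit, here is a minimal sketch. It assumes the 6-period figure quoted above, a 20 Hz tone, and Fs = 44100; the function names are made up for illustration, and rounding up to a power of two is just one common way to pick a radix-2 FFT length.

```python
import math


def frame_length_for_periods(freq_hz, periods, sample_rate):
    """Samples needed to span `periods` full cycles of `freq_hz`."""
    return math.ceil(periods * sample_rate / freq_hz)


def next_power_of_two(n):
    """Smallest power of two that is >= n (n must be a positive integer)."""
    return 1 << (n - 1).bit_length()


# Assumed values: 6 periods of a 20 Hz tone at 44100 samples/second.
samples = frame_length_for_periods(20.0, 6, 44100)
print("samples spanning 6 periods of 20 Hz:", samples)
print("next power-of-two frame length:", next_power_of_two(samples))
```

Six periods of 20 Hz last 0.3 seconds, so the frame has to be far longer than the 4410 samples proposed in the question.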

A spectrogram that more realistically models human hearing would have to use something like a different frequency analysis window width (time duration) for every horizontal frequency line or band (N per octave for some N that depends on frequency), and more closely spaced near the highest audible tone frequencies than for low.
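A rough sketch of what "a different window width for every band" could look like is below. This is only an illustration, not a model of hearing: it assumes a fixed number of bands per octave (the paragraph above suggests that number would itself vary with frequency), a fixed 6 periods per window, and the 20 Hz to 10000 Hz range from the question. All names are hypothetical.

```python
import math


def band_centers(f_min, f_max, bands_per_octave):
    """Geometrically spaced center frequencies from f_min up to f_max."""
    centers = []
    k = 0
    while True:
        f = f_min * 2.0 ** (k / bands_per_octave)
        if f > f_max:
            break
        centers.append(f)
        k += 1
    return centers


def window_samples(freq_hz, periods, sample_rate):
    """Window length (in samples) covering `periods` cycles of freq_hz."""
    return math.ceil(periods * sample_rate / freq_hz)


# Assumed settings: 1 band per octave, 6 periods per window, Fs = 44100.
for f in band_centers(20.0, 10000.0, 1):
    n = window_samples(f, 6, 44100)
    print(f"{f:8.1f} Hz -> {n:6d} samples ({1000.0 * n / 44100:7.2f} ms)")
```

The window shrinks from roughly 300 ms in the lowest band to about 1 ms in the highest, which is the kind of frequency-dependent time resolution a single fixed-size FFT cannot provide.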

Just because one has a hammer does not make everything else a nail. Fixed-size (and fixed-offset) FFTs are used because they are mathematically clean, complete, invertible, and computationally efficient linear basis transforms. Easier to graph. But the parameters of human perception do not necessarily fit the criteria or result of computing or displaying any particular window size or offset of FFT.

That's interesting, since it implies that the problem of choosing Hz bin spacing for STFT generation, and for the ISTFT that deconvolutes with the IFFT, is not really linear, and that there are no optimal rules of thumb. Everything always depends on the data. An FFT frame length (N) of 16384 at low frequencies would exacerbate the already heavy computational expense.
– user16354, Commented Oct 26, 2019 at 17:59
A good question to ask might be why audio compression algorithms get away with using smallish fixed-size DCTs, even after throwing away much of the detail in the DCT results by filter-banking and thresholding the outputs. Possibly because the visual display of filtered DCT results would be uglier and unintuitive.
– hotpaw2, Commented Oct 26, 2019 at 18:32

Richard Lyons · Accepted Answer · 2019-10-26 10:24:13Z

Regarding the first paragraph of your question: The number of samples used in a discrete Fourier transform (DFT) depends on your desired DFT bin spacing (the frequency difference between adjacent DFT bins) in the frequency domain. If that desired bin spacing is Fo Hz, and the time-domain signal sample rate is Fs samples/second, then the number of time samples, N, is:

N = Fs/Fo samples. (1)

For example, at Fs = 44100, if you select N = 4410 samples then the DFT bin spacing will be:

Fo = Fs/N = 44100/4410 = 10 Hz. (2)

However, for the traditional radix-2 fast Fourier transform (FFT) N must be an integer power of two. So using the above Eq. (2) when N = 2^12 = 4096 then Fo = 44100/4096 = 10.766 Hz.
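As a quick numerical check of Eqs. (1) and (2), the sketch below evaluates both directions of the relationship and lists the bin spacing for a few power-of-two lengths. The function names are made up for illustration; only Fs = 44100 and the values of N come from the discussion.

```python
def samples_for_spacing(sample_rate, spacing_hz):
    """Eq. (1): number of time samples N for a desired bin spacing Fo."""
    return sample_rate / spacing_hz


def bin_spacing(sample_rate, n):
    """Eq. (2): bin spacing Fo in Hz for an N-point DFT."""
    return sample_rate / n


FS = 44100

print("N for 10 Hz spacing:", samples_for_spacing(FS, 10.0))
print("Fo for N = 4410:", bin_spacing(FS, 4410))

# Radix-2 FFT lengths near the sizes discussed in this thread.
for n in (1024, 2048, 4096, 8192, 16384):
    print(f"N = {n:5d} -> Fo = {bin_spacing(FS, n):7.3f} Hz")
```

Doubling N halves the bin spacing, at the cost of a frame that covers twice as much time.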

Regarding the second paragraph of your question: I don't fully understand that paragraph but if Fs = 44100, someone performing 1024-point FFTs (N = 1024) produces spectra where the FFT bin spacing is:

Fo = Fs/N = 44100/1024 = 43.06 Hz. (3)

Does that Fo = 43.06 Hz FFT bin spacing seem correct based on other information you see from the NMF chapter?
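To see what Eq. (3) means for the 20 Hz to 10000 Hz range in the question, the following sketch lists where the bins of a 1024-point FFT fall. It assumes bin k is centered at k*Fs/N and counts only bins 0 through N/2 - 1 (the Nyquist bin at N/2 is left out); the helper names are hypothetical.

```python
def positive_bin_frequencies(sample_rate, n):
    """Center frequencies (Hz) of bins 0 .. n/2 - 1, i.e. below Nyquist."""
    spacing = sample_rate / n
    return [k * spacing for k in range(n // 2)]


def nearest_bin(freq_hz, sample_rate, n):
    """Index of the bin whose center is closest to freq_hz."""
    return round(freq_hz * n / sample_rate)


FS, N = 44100, 1024
freqs = positive_bin_frequencies(FS, N)

print("number of positive-frequency bins:", len(freqs))
print("bin 1 center (Hz):", freqs[1])
print("highest bin center (Hz):", freqs[-1])
print("bin nearest 10000 Hz:", nearest_bin(10000.0, FS, N))
print("bin nearest 20 Hz:", nearest_bin(20.0, FS, N))
```

With N = 1024 the bins extend well past 10000 Hz, while a 20 Hz tone lies between bin 0 (DC) and bin 1; the frame length limits resolution at the low end, not the highest detectable frequency.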

@JoleT. I don't know what you mean by: "since N=1024 for a radix-2 FFT can not reliably detect 10000 Hz due to Nyquist." But remember, when N = 1024 and Fs = 44100, the positive frequency range of the FFT will go from zero Hz to (1/2-1/N)*Fs = 22.007 kHz.
– Richard Lyons, Commented Oct 26, 2019 at 17:58
Thanks, that chapter on NMF does not provide any information on the FFT frame length (N) or the jump length of overlapping windows. I presume they did go down to 20 Hz, which would require an N of 2048 for 21.53 Hz bin spacing.
– user16354, Commented Oct 26, 2019 at 18:04
