Assuming 44100 samples per second in a short 4-8 second .wav file, if I want to perform an FFT to obtain the power spectrum, amplitude, and phase shift from 20 Hz to 10000 Hz (humans lose the ability to hear above 10000 Hz with increasing age), I am thinking I will need window lengths of at least 4410 samples? This is because one cycle of a 20 Hz tone spans 2205 samples (44100/20), but due to Nyquist, I will need twice that, or 4410 samples.
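To make the arithmetic above concrete, here is a minimal Python sketch. The function names are hypothetical, and the "two cycles of the lowest frequency" rule is only the assumption stated in the paragraph above, not an established requirement. The one relation used that does not come from the question is the standard FFT bin spacing, sample rate divided by window length.

```python
import math


def samples_per_cycle(sample_rate_hz, freq_hz):
    """Number of samples spanned by one period of a tone at freq_hz."""
    return sample_rate_hz / freq_hz


def window_length_for_cycles(sample_rate_hz, lowest_freq_hz, cycles):
    """Window length (in samples) that holds `cycles` periods of the lowest frequency.

    `cycles` is a rule-of-thumb parameter (assumed to be 2 in the question).
    """
    return math.ceil(cycles * sample_rate_hz / lowest_freq_hz)


def bin_spacing_hz(sample_rate_hz, window_length):
    """Spacing between adjacent FFT bins for a window of `window_length` samples."""
    return sample_rate_hz / window_length


if __name__ == "__main__":
    fs = 44100
    print("samples per 20 Hz cycle:", samples_per_cycle(fs, 20))
    n = window_length_for_cycles(fs, 20, cycles=2)
    print("window length for two cycles:", n)
    print("bin spacing at that length (Hz):", bin_spacing_hz(fs, n))
    print("bin spacing for a 1024-sample window (Hz):", bin_spacing_hz(fs, 1024))
```

Comparing the bin spacing for different window lengths shows how coarsely a shorter window would resolve the region around 20 Hz.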
For human hearing, what is the typical window width used so that deconvolution will result in clean audio quality? For some reason, I think the long wavelength at 20 Hz will require window widths no shorter than 4410 samples. Given this, why are there STFT matrices shown in reports which have 1024 bins in the time domain for 4-second .wav files? (See the image below from the NMF chapter at musicinformationretrieval.com.) The 1024 bins must be based on the jump size; that site never states that they employ window widths of 4410 with much shorter jump sizes.
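As a way to check where a figure like 1024 could come from, the dimensions of an STFT matrix can be computed directly from the signal length, the window length, and the jump (hop) size. The sketch below is an illustration under stated assumptions: the function name `stft_shape` is hypothetical, the 2048/512 and 4410/1024 values are example inputs rather than settings confirmed for the figure, and `center=True` models the common convention of padding half a window on each side of the signal.

```python
def stft_shape(n_samples, window_length, hop_length, center=False):
    """Return (n_freq_bins, n_frames) for a one-sided STFT.

    n_freq_bins: a real FFT of length N yields N // 2 + 1 non-redundant bins.
    n_frames:    number of window positions that fit in the (optionally padded) signal.
    """
    if window_length <= 0 or hop_length <= 0:
        raise ValueError("window_length and hop_length must be positive")

    n_freq_bins = window_length // 2 + 1

    # With centering, assume half a window of padding is added on both ends.
    padded = n_samples + 2 * (window_length // 2) if center else n_samples

    if padded < window_length:
        n_frames = 0
    else:
        n_frames = 1 + (padded - window_length) // hop_length
    return n_freq_bins, n_frames


if __name__ == "__main__":
    n_samples = 4 * 44100  # a 4-second file at 44100 samples per second
    print(stft_shape(n_samples, 2048, 512))
    print(stft_shape(n_samples, 2048, 512, center=True))
    print(stft_shape(n_samples, 4410, 1024))
```

Under these assumptions, a count near 1024 appears on the frequency axis when the window is 2048 samples long, whereas the number of time frames is set by the hop size and the file length.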
UPDATE
The goal is to replicate the NMF results presented here via OOP in a compiled language (.NET), not an interpreted environment like R, Matlab, or Python with Librosa. Their STFT results were obtained using Librosa by merely invoking S = librosa.stft(x). It would be futile to chase down Librosa's default settings for FFT bin spacing and window jump length for STFT matrix generation, because I am not using Librosa.
