I've spent a lot of time trying to understand Google's WaveNet work (also used in their DeepVoice model), but I'm still confused about some very basic aspects. I'm referring to this TensorFlow implementation of WaveNet.
Page 2 of the paper says:

"In this paper we introduce a new generative model operating directly on the raw audio waveform."
If we already have the raw audio waveform, why do we need WaveNet? Isn't that waveform exactly what the model is supposed to generate?