I've spent a lot of time trying to understand Google's WaveNet work (also used in their DeepVoice model), but I'm still confused about some very basic aspects. I'm referring to this TensorFlow implementation of WaveNet.

Page 2 of the paper says:

"In this paper we introduce a new generative model operating directly on the raw audio waveform."

If we already have the raw audio waveform, why do we need WaveNet? Isn't that what the model is supposed to generate?
  • It appears that you're asking two different questions, one about why the paper is concerned with generative models and one that asks about the input size. Perhaps you could edit it to focus on just one question? Commented Jun 15, 2020 at 19:19
  • I've edited it to focus on just one question for now, as requested. Commented Jul 15, 2020 at 1:26
  • I have updated the question so it asks only one specific question, as suggested by a previous comment. This seems to be a relevant topic for this forum, since it concerns a well-known text-to-speech system by Google. Can the question be reopened now that it addresses the issue cited in the comment above? Commented Jul 26, 2020 at 19:40
  • I've voted to reopen. It needs four additional votes to be reopened. Commented Jul 26, 2020 at 20:54

1 Answer

I think you may have misunderstood the quote you posted. Having read the paper, and having just finished a graduate course on speech technology, I think the part you have missed is this:

WaveNet, as opposed to earlier forms of speech synthesis, creates a raw audio waveform from the text it is given, one sample at a time. This is very different from how parametric or concatenative synthesis produced speech in text-to-speech applications.
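To make the distinction concrete, here is a minimal, hypothetical sketch of the autoregressive sampling loop WaveNet uses at inference time. All names here are illustrative, not from the paper's code; the real model replaces the toy logits function with a stack of causal dilated convolutions conditioned on text-derived features. The point is that the waveform is the *output* of the loop, built sample by sample, not something the user supplies:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_next_sample_logits(history, conditioning, num_classes=256):
    # Hypothetical stand-in for a WaveNet stack: the real network
    # computes these logits from `history` (previously generated
    # samples) and `conditioning` (e.g. linguistic/text features)
    # with causal dilated convolutions. Here we just return noise.
    return rng.normal(size=num_classes)

def generate(conditioning, num_samples=100, num_classes=256):
    """Autoregressive sampling: each new sample depends only on
    previously generated samples (and the conditioning), so the
    model *produces* the raw waveform rather than requiring one
    as input."""
    samples = []
    for _ in range(num_samples):
        logits = toy_next_sample_logits(samples, conditioning, num_classes)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()  # softmax over 256 quantized amplitude levels
        samples.append(int(rng.choice(num_classes, p=probs)))
    return samples

waveform = generate(conditioning=None, num_samples=100)
```

With real weights, the 8-bit (mu-law quantized) samples would then be decoded back to a continuous waveform; the sketch only shows the shape of the loop.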
  • I understand it's supposed to generate audio, but could you reconcile that with what I quoted? How else is one to interpret "operating directly on the raw audio waveform"? What's the input to WaveNet when it's used with Tacotron 2 for text-to-speech, especially the input to the input_convolution described in the OP? Commented Jun 15, 2020 at 21:55
  • Where is the quote "creates a raw audio waveform from the text it is given" in the paper? I couldn't find it, though I understand WaveNet is supposed to generate audio, and that's exactly why it's unclear to me; this is the reason stated in the title and why I asked this question. Commented Jun 15, 2020 at 21:58
  • In the paper, "operate on the audio waveform" does not mean "take the audio waveform as input". It simply means that they model the audio waveform directly. Your post is off topic here, though; try Stack Overflow next time, perhaps. Commented Jul 15, 2020 at 3:56
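The point in the last comment can be sketched with a toy numpy example (illustrative only, not the paper's code): "modeling the waveform directly" means that during training the raw waveform itself supplies both the network's inputs (past samples) and its prediction targets (next samples), while at inference the waveform is generated from scratch:

```python
import numpy as np

# Toy, self-contained illustration of training-time teacher forcing:
# the raw waveform is both what the causal stack sees and what it
# learns to predict -- no separate "input audio" is needed at
# generation time.
rng = np.random.default_rng(0)
audio = rng.integers(0, 256, size=16000)  # pretend 1 s of 8-bit quantized audio

inputs = audio[:-1]   # x_1 .. x_{T-1}: past samples fed to the network
targets = audio[1:]   # x_2 .. x_T: the next-sample prediction targets

# Causal dilated receptive field: with kernel size 2 and dilations
# 1, 2, 4, ..., 512, sample t can see the previous 1024 samples.
dilations = [2 ** i for i in range(10)]
receptive_field = sum(dilations) + 1
```

The dilation schedule above matches the doubling pattern described in the paper; the exact depth and repeat count vary by implementation.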
