I've spent a lot of time trying to understand Google's WaveNet work (also used in their DeepVoice model), but I'm still confused about some very basic aspects. I'm referring to this TensorFlow implementation of WaveNet.

Page 2 of the paper says:

"In this paper we introduce a new generative model operating directly on the raw audio waveform."

If we already have the raw audio waveform, why do we need WaveNet? Isn't that what the model is supposed to generate?
  • It appears that you're asking two different questions, one about why the paper is concerned with generative models and one that asks about the input size. Perhaps you could edit it to focus on just one question? Commented Jun 15, 2020 at 19:19
  • I've edited it to focus on just one question for now, as requested. Commented Jul 15, 2020 at 1:26
  • I have updated the question so it asks only one specific question, as suggested by a previous comment. This seems to be a relevant topic for this forum, since it concerns a well-known text-to-speech system by Google. Can the question be reopened now that it addresses the issue cited in the comment above? Commented Jul 26, 2020 at 19:40
  • I've voted to reopen. It needs four additional votes to be reopened. Commented Jul 26, 2020 at 20:54

1 Answer

I think you may have misunderstood the quote you posted. Having read the paper, and having just finished a graduate course on speech technology, I think the part you have missed is this:

WaveNet, as opposed to earlier forms of speech synthesis, creates a raw audio waveform from the text it is given, one sample at a time. This is very different from how parametric or concatenative synthesis produced speech in text-to-speech applications.
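To make the distinction concrete, here is a minimal, hypothetical sketch of the autoregressive sampling loop WaveNet uses at inference time. All names here are illustrative, not from the paper's code; the real model replaces the toy logits function with a stack of causal dilated convolutions conditioned on text-derived features. The point is that the waveform is the *output* of the loop, built sample by sample, not something the user supplies:

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_next_sample_logits(history, conditioning, num_classes=256):
    # Hypothetical stand-in for a WaveNet stack: the real network
    # computes these logits from `history` (previously generated
    # samples) and `conditioning` (e.g. linguistic/text features)
    # with causal dilated convolutions. Here we just return noise.
    return rng.normal(size=num_classes)

def generate(conditioning, num_samples=100, num_classes=256):
    """Autoregressive sampling: each new sample depends only on
    previously generated samples (and the conditioning), so the
    model *produces* the raw waveform rather than requiring one
    as input."""
    samples = []
    for _ in range(num_samples):
        logits = toy_next_sample_logits(samples, conditioning, num_classes)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()  # softmax over 256 quantized amplitude levels
        samples.append(int(rng.choice(num_classes, p=probs)))
    return samples

waveform = generate(conditioning=None, num_samples=100)
```

With real weights, the 8-bit (mu-law quantized) samples would then be decoded back to a continuous waveform; the sketch only shows the shape of the loop.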
  • I understand it's supposed to generate audio, but could you reconcile that with what I quoted? How else is one to interpret "operating directly on the raw audio waveform"? What's the input to WaveNet when it's used with Tacotron 2 for text-to-speech, especially the input to the input_convolution described in the OP? Commented Jun 15, 2020 at 21:55
  • Where is the quote "creates a raw audio waveform from the text it is given" in the paper? I couldn't find it, though I understand WaveNet is supposed to generate audio, and that's exactly why it's unclear to me; this is the reason stated in the title and why I asked this question. Commented Jun 15, 2020 at 21:58
  • In the paper, "operate on the audio waveform" does not mean "take the audio waveform as input". It simply means that they model the audio waveform directly. Your post is off topic here, though; try Stack Overflow next time, perhaps. Commented Jul 15, 2020 at 3:56
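The point in the last comment can be sketched with a toy numpy example (illustrative only, not the paper's code): "modeling the waveform directly" means that during training the raw waveform itself supplies both the network's inputs (past samples) and its prediction targets (next samples), while at inference the waveform is generated from scratch:

```python
import numpy as np

# Toy, self-contained illustration of training-time teacher forcing:
# the raw waveform is both what the causal stack sees and what it
# learns to predict -- no separate "input audio" is needed at
# generation time.
rng = np.random.default_rng(0)
audio = rng.integers(0, 256, size=16000)  # pretend 1 s of 8-bit quantized audio

inputs = audio[:-1]   # x_1 .. x_{T-1}: past samples fed to the network
targets = audio[1:]   # x_2 .. x_T: the next-sample prediction targets

# Causal dilated receptive field: with kernel size 2 and dilations
# 1, 2, 4, ..., 512, sample t can see the previous 1024 samples.
dilations = [2 ** i for i in range(10)]
receptive_field = sum(dilations) + 1
```

The dilation schedule above matches the doubling pattern described in the paper; the exact depth and repeat count vary by implementation.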
