I'm working on an implementation of a Variational Autoencoder (VAE). There are lots of helpful examples and guides out there, which typically introduce VAEs in the context of image data, e.g. MNIST. Since the pixels (input features) are scaled to lie in $[0,1]$, i.e. $x \in [0,1]$, these examples use a sigmoid activation in the last layer of the decoder $d: z \mapsto x$, so that the reconstructions match the range of the data. That makes sense, but what if I cannot assume anything about the scale of $x$? Do I just use no activation at all, i.e. a linear output layer? And if so, does that make learning significantly harder?
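To make the question concrete, here is a minimal sketch (PyTorch, with placeholder dimensions that are just for illustration, not from any particular dataset) of the two decoder heads I'm contrasting:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
latent_dim, x_dim = 16, 784

class Decoder(nn.Module):
    """Maps a latent code z back to input space x."""
    def __init__(self, bounded: bool):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, x_dim),
        )
        # Sigmoid squashes outputs into (0, 1), matching x in [0, 1];
        # Identity leaves them unbounded when the scale of x is unknown.
        self.out = nn.Sigmoid() if bounded else nn.Identity()

    def forward(self, z):
        return self.out(self.net(z))

# The [0, 1] case (e.g. MNIST), typically paired with a BCE
# reconstruction loss in the tutorials I've seen:
bounded_decoder = Decoder(bounded=True)

# The unknown-scale case I'm asking about, presumably paired
# with something like an MSE reconstruction loss:
unbounded_decoder = Decoder(bounded=False)

z = torch.randn(8, latent_dim)
print(bounded_decoder(z).min().item() >= 0)  # outputs lie in (0, 1)
print(unbounded_decoder(z).shape)            # unbounded real-valued outputs
```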
Any help would be much appreciated!