I'm working on an implementation of a Variational Autoencoder (VAE). There are lots of helpful examples and guides out there, which typically introduce VAEs in the context of image data, e.g. MNIST. Since the pixels (input features) are scaled to lie in $[0,1]$, i.e. $x \in [0,1]$, these examples use a sigmoid activation in the last layer of the decoder $d: z \mapsto x$, so that the reconstructions match the range of the data. That makes sense, but what if I cannot assume anything about the scale of $x$? Do I just use no activation at all, i.e. a linear output layer? And if so, does that make learning significantly harder?
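To make the question concrete, here is a minimal sketch (PyTorch, with placeholder dimensions that are just for illustration, not from any particular dataset) of the two decoder heads I'm contrasting:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
latent_dim, x_dim = 16, 784

class Decoder(nn.Module):
    """Maps a latent code z back to input space x."""
    def __init__(self, bounded: bool):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, x_dim),
        )
        # Sigmoid squashes outputs into (0, 1), matching x in [0, 1];
        # Identity leaves them unbounded when the scale of x is unknown.
        self.out = nn.Sigmoid() if bounded else nn.Identity()

    def forward(self, z):
        return self.out(self.net(z))

# The [0, 1] case (e.g. MNIST), typically paired with a BCE
# reconstruction loss in the tutorials I've seen:
bounded_decoder = Decoder(bounded=True)

# The unknown-scale case I'm asking about, presumably paired
# with something like an MSE reconstruction loss:
unbounded_decoder = Decoder(bounded=False)

z = torch.randn(8, latent_dim)
print(bounded_decoder(z).min().item() >= 0)  # outputs lie in (0, 1)
print(unbounded_decoder(z).shape)            # unbounded real-valued outputs
```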
Any help would be much appreciated!