There are 3 possible outputs for y - thus I used Softmax in the final layer - if I was to OHE the output, would using something like Sigmoid be more appropriate as the values are bound between 0 and 1?
Softmax outputs probabilities that sum to 1, which is appropriate for classification tasks with categorical outputs. However, an autoencoder (AE) reconstructs continuous-valued input features. Softmax constrains the outputs to sum to 1, distorting the reconstruction. For instance, if the input is $[5.1, 3.5, 1.4, 0.2]$, softmax might output $[0.4, 0.3, 0.2, 0.1]$, which does not preserve the input values. For continuous, unbounded inputs, a linear activation is often preferred, while a sigmoid activation is more suitable for data bounded between 0 and 1 (Goodfellow et al., 2016, Chapter 14).
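A quick NumPy sketch of the constraint (not the OP's model, just the activation itself): whatever values feed into a softmax, the output is non-negative and sums to 1, so it cannot reproduce an input such as $[5.1, 3.5, 1.4, 0.2]$.

import numpy as np

x = np.array([5.1, 3.5, 1.4, 0.2])   # one Iris sample
e = np.exp(x - x.max())              # numerically stable softmax
softmax_out = e / e.sum()
print(softmax_out)        # approx. [0.81, 0.16, 0.02, 0.01]
print(softmax_out.sum())  # 1.0 -- the values are squashed onto the simplex,
                          # nothing like the original measurements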
Altering the smallest change in the layers (encoding layer going to 6 instead of 3) --> causes a major shift in the loss -- is this normal?
In a nutshell: yes. Changing the latent layer size changes the model's capacity to represent the input, and the loss with it. Shrinking the latent space constrains the model, increasing reconstruction error because fewer dimensions are available to encode the data. Expanding it increases capacity but risks overfitting without some kind of regularisation. This trade-off between compression and reconstruction fidelity is exactly the AE's objective: balancing dimensionality reduction against faithful reconstruction of the input.
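To see the effect concretely, here is a minimal sketch (a deliberately simplified single-hidden-layer AE, not the OP's architecture) that trains on the same data with different latent sizes and prints the final reconstruction loss. The exact numbers will vary, but the trend with latent size should be visible.

import tensorflow as tf
from sklearn import datasets
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras import Model

x = datasets.load_iris().data.astype("float32")

def build_ae(latent_dim, input_dim=4):
    # Single hidden layer acting as the latent space, linear reconstruction
    inp = Input(shape=(input_dim,))
    z = Dense(latent_dim, activation="relu")(inp)
    out = Dense(input_dim, activation="linear")(z)
    ae = Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    return ae

for latent_dim in (2, 3, 6):
    tf.random.set_seed(0)   # same seed so the runs are comparable
    ae = build_ae(latent_dim)
    hist = ae.fit(x, x, epochs=50, batch_size=8, verbose=0)
    print(f"latent dim {latent_dim}: final MSE {hist.history['loss'][-1]:.4f}")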
Each run of the autoencoder produces a different result - is this normal that it is not deterministic?
Yes, that is normal: neural networks initialise their weights randomly, so training begins from different initial conditions and the optimiser may settle in different local minima (Glorot & Bengio, 2010). Setting a random seed for the pseudorandom number generators in NumPy (np.random.seed()) and TensorFlow (tf.random.set_seed()) makes runs reproducible.
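For reference, a minimal seeding sketch (the full example further down does the same thing). Note that in TensorFlow ≥ 2.7 a single utility call seeds Python, NumPy and TensorFlow together, and that some GPU operations can remain nondeterministic unless op determinism is also enabled.

import numpy as np
import tensorflow as tf

SEED = 15
np.random.seed(SEED)       # NumPy's global generator
tf.random.set_seed(SEED)   # TensorFlow's global generator (weight init, shuffling)
# In TF >= 2.7 one call covers Python's random, NumPy and TensorFlow:
# tf.keras.utils.set_random_seed(SEED)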
Why does the last layer have to be the same size (4) as the input dimension - are we able to force this to allow for an output of 3 for example? I know I can read from a latent layer, but then I can't fit the model based on that layer.
In AEs, the output layer size should match the input size because the model reconstructs the input features. Reducing the output size extracts a compressed representation but does not reconstruct the input. To access latent features, train the full AE, then define a new model using only the encoder layers. This approach enables exploration of the compressed representation while preserving reconstruction during training (Hinton & Salakhutdinov, 2006; Kingma & Welling, 2014).
Here is an attempt to improve on the code in the OP. It replaces softmax with a linear activation, uses mean squared error for continuous reconstruction, fixes random seeds, and extracts the latent representation.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn import datasets
from tensorflow.keras.layers import Input, Dense, BatchNormalization, LeakyReLU
from tensorflow.keras import Model, optimizers

# Set random seeds for reproducibility
my_seed = 15
np.random.seed(my_seed)
tf.random.set_seed(my_seed)

# Load data and split it
iris = datasets.load_iris()
x = iris.data
X_train, X_test = train_test_split(x, test_size=0.20, random_state=my_seed)
We are building a fully-connected autoencoder whose first hidden layer is overcomplete. First we define the input layer, which accepts data with 4 features, matching the dimensionality of the Iris dataset.
input_dim = Input(shape=(X_train.shape[1],)) # 4 input features
Encoder
The first dense layer in the encoder is known as an "overcomplete layer" because it contains more neurons than the dimensionality of the input data. In the OP, the input has 4 dimensions, but the overcomplete layer expands it to 6. This additional capacity helps the model learn more complex relationships and transformations by mapping the input to a higher-dimensional space.
The primary advantage of an overcomplete layer is its ability to improve the network's expressiveness, allowing it to capture detailed patterns and dependencies within the input. However, this increase in capacity comes with a risk of overfitting, as the network may learn to memorise details or noise in the data rather than generalising effectively. To mitigate this, regularisation techniques such as Batch Normalisation, dropout, or weight decay are typically applied, ensuring that the model learns meaningful and stable features. In the present example we will use Batch Normalisation.
encoded = Dense(6)(input_dim)
encoded = BatchNormalization()(encoded)
Next we introduce some nonlinearity by applying the LeakyReLU activation to the normalised output. Unlike its "non-Leaky" cousin ReLU, LeakyReLU gives negative inputs a small slope instead of clipping them to zero, which helps avoid "dead" neurons: units that output zero for every input and stop learning because their gradient is also zero.
encoded = LeakyReLU()(encoded)
The last dense layer in the encoder has fewer neurons (3, in fact), compressing the data into a 3-dimensional latent space: a reduced representation of the input that, ideally, captures its most compact and salient features.
encoded = Dense(3)(encoded) # Latent space
Decoder
The decoder reconstructs the input data from the latent space. This layer has 4 units (matching the input dimension). The linear activation function outputs a weighted sum of the latent features plus a bias term. It applies no non-linear transformation, which makes it suitable for reconstructing continuous-valued inputs without distorting their scale or range.
decoded = Dense(4, activation='linear')(encoded)
We have now defined the model architecture, i.e. how the data flows from the input, through the encoder layers, to the decoder. The next step is to wrap this architecture into a formal model object and specify the optimiser, loss function, and metrics we will use for training, via a method rather unhelpfully called "compile".
autoencoder = Model(inputs=input_dim, outputs=decoded)
opt = optimizers.Adam(learning_rate=0.001)
autoencoder.compile(optimizer=opt, loss='mse', metrics=['mae'])

# Train it.
history = autoencoder.fit(
    X_train, X_train,
    epochs=50,
    batch_size=8,
    validation_data=(X_test, X_test),
    verbose=2
)

# Extract the latent space representation
encoder = Model(inputs=autoencoder.input, outputs=encoded)
latent_space = encoder.predict(X_test)
print("Latent representation shape:", latent_space.shape)
Here we removed the softmax activation, introduced random seeds, and used mean squared error to align with the continuous nature of the input (Rumelhart et al., 1986). The latent space representation is extracted separately after training. By aligning the activation function and loss with the input features, the autoencoder effectively learns both reconstruction and dimensionality reduction.
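As an optional follow-up (assuming matplotlib is installed), you can plot the training curves stored in history and scatter the 3-dimensional latent codes in latent_space to see what structure the encoder has learned:

import matplotlib.pyplot as plt

# Training and validation reconstruction error per epoch
plt.plot(history.history['loss'], label='train MSE')
plt.plot(history.history['val_loss'], label='validation MSE')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()

# 3-D scatter of the latent codes for the test set
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(latent_space[:, 0], latent_space[:, 1], latent_space[:, 2])
ax.set_title('Latent representation of X_test')
plt.show()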
References:
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249-256). JMLR Workshop and Conference Proceedings. (A non-paywalled version is available online.)
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Available in full online; Chapter 14 is particularly useful for this Q&A.)
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507. https://dbirman.github.io/learn/hierarchy/pdfs/Hinton2006.pdf
Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1312.6114
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536. https://www.cs.toronto.edu/~hinton/absps/naturebp.pdf