
I'm trying to get a simple autoencoder working on the iris dataset to explore autoencoders at a basic level. However, I'm running into an issue where the model's loss is extremely high (>20).

Can someone help me understand if this model looks normal to them to begin with?

Some questions I'd love some help on understanding:

  • There are 3 possible outputs for y, so I used activation='softmax' in the final layer. If I were to OneHotEncoder (OHE) the output, would using something like 'sigmoid' be more appropriate, as the values are bound between 0 and 1?
  • Even the smallest change in the layers (e.g., the encoding layer going to 6 instead of 3) causes a major shift in the loss -- is this normal?
  • Each run of the autoencoder produces a different result. Is it normal that it is not deterministic?
  • Why does the last layer have to be the same size (4) as the input dimension - are we able to force this to allow for an output of 3, for example? I know I can read from a latent layer, but then I can't fit the model based on that layer.
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn import datasets
from tensorflow.keras.layers import Input, Dense, BatchNormalization, LeakyReLU
from tensorflow.keras import backend, layers, models, metrics, utils
from tensorflow.keras import regularizers, Input, Model, optimizers

iris = datasets.load_iris()
x = iris.data
y = iris.target.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20)

input_dim = Input(shape=(X_train.shape[1],))

encoded = layers.Dense(6, input_dim='input_dim')(input_dim)
encoded = BatchNormalization()(encoded)
encoded = LeakyReLU()(encoded)
encoded = layers.Dense(3)(encoded)

decoded = layers.Dense(4, activation='softmax')(encoded)

autoencoder = Model(inputs=input_dim, outputs=decoded)

opt = optimizers.Adam(lr=0.00001)
autoencoder.compile(optimizer=opt,
                    loss='categorical_crossentropy',
                    metrics=[metrics.CategoricalAccuracy()])

history = autoencoder.fit([X_train],
                          [X_train],
                          epochs=16,
                          batch_size=2,
                          verbose=2,
                          validation_data=((X_test), (X_test)))

Thank you for any help!

  • OHE means?..... one hot? Commented Jun 26, 2022 at 12:49
  • Yes, using OHE as one hot encoding - thanks for reminding me to clarify. Commented Jun 27, 2022 at 1:11

3 Answers


There are 3 possible outputs for y, so I used softmax in the final layer. If I were to OHE the output, would using something like sigmoid be more appropriate, as the values are bound between 0 and 1?

Softmax outputs probabilities that sum to 1, which is appropriate for classification tasks with categorical outputs. However, an autoencoder (AE) reconstructs continuous-valued input features. Softmax constrains the outputs to sum to 1, distorting the reconstruction. For instance, if the input is $[5.1, 3.5, 1.4, 0.2]$, softmax might output $[0.4, 0.3, 0.2, 0.1]$, which does not preserve the input values. For continuous, unbounded inputs, a linear activation is often preferred, while a sigmoid activation is more suitable for data bounded by 0 and 1 (Goodfellow et al., 2016, Chapter 14).
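To make the distortion concrete, here is a minimal NumPy sketch (the numbers are illustrative, not taken from a trained model): applying softmax to any vector squashes it onto the probability simplex, so a softmax output layer can never reproduce the raw iris measurements.

import numpy as np

def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - np.max(z))
    return e / e.sum()

x = np.array([5.1, 3.5, 1.4, 0.2])   # one iris sample

print(softmax(x))        # ~[0.81, 0.16, 0.02, 0.01] -- forced to sum to 1
print(softmax(x).sum())  # 1.0: the layer can never output values like 5.1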

Even the smallest change in the layers (e.g., the encoding layer going to 6 instead of 3) causes a major shift in the loss -- is this normal?

In a nutshell - yes. Altering the latent layer size can result in changes in loss due to the model’s capacity to represent the input. Reducing the latent space size constrains the model, increasing reconstruction error as fewer parameters are available for representation. Expanding it increases capacity but risks overfitting without some kind of regularisation. This trade-off between compression and reconstruction reflects the AE's objective of balancing dimensionality reduction and input fidelity.

Each run of the autoencoder produces a different result. Is it normal that it is not deterministic?

Yes, that is normal because neural networks initialise weights randomly. Training begins with different initial conditions, and the optimiser may find different local minima (Glorot & Bengio, 2010). Setting a random seed makes the pseudorandom number generators in NumPy (np.random.seed()) and TensorFlow (tf.random.set_seed()) reproducible.
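For example, a minimal seeding block placed at the top of the script, before any layers are created (the seed value is arbitrary):

import random
import numpy as np
import tensorflow as tf

SEED = 42                 # arbitrary choice
random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy RNG (e.g. data shuffling/splitting)
tf.random.set_seed(SEED)  # TensorFlow RNG (e.g. weight initialisation)

Even with seeds set, some GPU operations can remain non-deterministic.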

Why does the last layer have to be the same size (4) as the input dimension - are we able to force this to allow for an output of 3 for example? I know I can read from a latent layer, but then I can't fit the model based on that layer.

In AEs, the output layer size should match the input size because the model reconstructs the input features. Reducing the output size extracts a compressed representation but does not reconstruct the input. To access latent features, train the full AE, then define a new model using only the encoder layers. This approach enables exploration of the compressed representation while preserving reconstruction during training (Hinton & Salakhutdinov, 2006; Kingma & Welling, 2014).

Here is an attempt to improve on the code in the OP. It replaces softmax with a linear activation, uses mean squared error for continuous reconstruction, fixes random seeds, and extracts the latent representation.

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn import datasets
from tensorflow.keras.layers import Input, Dense, BatchNormalization, LeakyReLU
from tensorflow.keras import Model, optimizers

# Set random seeds for reproducibility
my_seed = 15
np.random.seed(my_seed)
tf.random.set_seed(my_seed)

# Load data and split it
iris = datasets.load_iris()
x = iris.data
X_train, X_test = train_test_split(x, test_size=0.20, random_state=my_seed)

We are building a fully-connected overcomplete autoencoder. First we define the input layer which needs to accept input data with 4 features, matching the dimensionality of the Iris dataset.

input_dim = Input(shape=(X_train.shape[1],)) # 4 input features 

Encoder

The first dense layer in the encoder is known as an "overcomplete layer" because it contains more neurons than the dimensionality of the input data. In the OP, the input has 4 dimensions, but the overcomplete layer expands it to 6 dimensions. This additional capacity helps the model learn more complex relationships and transformations by mapping the input to a higher-dimensional space.

The primary advantage of an overcomplete layer is its ability to improve the network's expressiveness, allowing it to capture detailed patterns and dependencies within the input. However, this increase in capacity comes with a risk of overfitting, as the network may learn to memorise details or noise in the data rather than generalising effectively. To mitigate this, regularisation techniques such as Batch Normalisation, dropout, or weight decay are typically applied, ensuring that the model learns meaningful and stable features. In the present example we will use Batch Normalisation.

encoded = Dense(6)(input_dim)
encoded = BatchNormalization()(encoded)

Next we introduce some nonlinearity by applying the LeakyReLU activation to the normalised output. Unlike its "non-Leaky" cousin ReLU, LeakyReLU uses a small slope for negative values instead of zero, which helps avoid dead neurons: units that get stuck always producing the same output because their gradient is zero.

encoded = LeakyReLU()(encoded) 

The last dense layer in the encoder has fewer neurons (3, in fact), which compress the data into a 3-dimensional latent space. This reduced representation of the input data should capture its most compact, salient features.

encoded = Dense(3)(encoded) # Latent space 

Decoder

The decoder reconstructs the input data from the latent space. This layer has 4 units (matching the input dimension). The linear activation function outputs a weighted sum of the latent features plus a bias term. It applies no non-linear transformation, which makes it suitable for reconstructing continuous-valued inputs without distorting their scale or range.

decoded = Dense(4, activation='linear')(encoded) 

We have now created the model architecture by defining how the data flows from the input, through the encoder layers, to the decoder. The next step is to wrap this architecture into a formal model object and specify the optimiser we will use for training, the loss function, and the metrics, via a method rather unhelpfully called "compile".

autoencoder = Model(inputs=input_dim, outputs=decoded)
opt = optimizers.Adam(learning_rate=0.001)
autoencoder.compile(optimizer=opt, loss='mse', metrics=['mae'])

# Train it.
history = autoencoder.fit(
    X_train, X_train,
    epochs=50,
    batch_size=8,
    validation_data=(X_test, X_test),
    verbose=2
)

# Extract the latent space representation
encoder = Model(inputs=autoencoder.input, outputs=encoded)
latent_space = encoder.predict(X_test)
print("Latent representation shape:", latent_space.shape)

Here we removed the softmax activation, introduced random seeds, and used mean squared error to align with the continuous nature of the input (Rumelhart et al., 1986). The latent space representation is extracted separately after training. By aligning the activation function and loss with the input features, the autoencoder effectively learns both reconstruction and dimensionality reduction.

References:

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, 249-256.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Available in full online at https://www.deeplearningbook.org; Chapter 14 is particularly relevant to this Q&A.

Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504-507. https://dbirman.github.io/learn/hierarchy/pdfs/Hinton2006.pdf

Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1312.6114

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536. https://www.cs.toronto.edu/~hinton/absps/naturebp.pdf

  • Again, this answer also seems generated by Gen-AI tools, and it is not the fully right answer! Some parts of your (generated) answer are not fully correct; it does not work in practice! Have you implemented or tested your scripts? Note that you cannot guarantee reproducible results by just setting seeds! Just re-run your scripts and you will see the loss curve change each time! "Here is an attempt to improve on the code in the OP. It replaces ..." Are you talking with us, or are 3rd-party Gen-AI tools talking with OP or Jack & Joe? Commented Dec 18, 2024 at 23:13

There is some confusion in your question, since you are talking about an AE but you are using softmax as the output... which makes almost no sense.

First of all, the loss by itself has no meaning: if I give you two models with two different losses (in your case, two AEs), you cannot in any way tell which is the better one, so a loss of 20 means nothing on its own.
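As a minimal illustration (NumPy only, with made-up numbers): the same reconstruction quality can produce very different loss values depending on the loss function and the scale of the data, so a raw value like 20 is not interpretable on its own.

import numpy as np

x_true = np.array([5.1, 3.5, 1.4, 0.2])   # an iris sample
x_hat  = np.array([5.0, 3.4, 1.5, 0.3])   # a reasonable reconstruction

print(np.mean((x_true - x_hat) ** 2))     # MSE ~0.01
print(np.mean(np.abs(x_true - x_hat)))    # MAE  0.10

# Same model quality, but features measured on a 100x larger scale:
print(np.mean((100 * x_true - 100 * x_hat) ** 2))  # MSE ~100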

About your questions:

  1. It depends on the distribution... one-hot encoding an output supposes a multinomial distribution, which you can approximate with both a softmax and a sigmoid, but 99% of the time you should use a softmax, since it encodes the constraints of the distribution directly and the NN doesn't have to learn them (similarly, when approximating a probability you could use a linear output layer, but then the NN has to learn to output values between 0 and 1, and there is no guarantee that it always will).
  2. An AE aims to reduce the dimensionality... you start with 4 features and you allow it to use 6 dimensions (in this case we are talking about an overcomplete AE, but you need some constraint to make it work as expected; see the sketch after this list).
  3. Welcome to the world of non-convex optimization... if the starting point is random (which it is, since the weights are initialized randomly), you will probably end up in a different local minimum each run.
  4. An autoencoder aims to learn the identity function ($decoder(encoder(x)) = x$), so obviously no: the output has to have the same shape as the input.
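Regarding point 2, here is a minimal sketch (layer sizes and penalty weight are illustrative) of one such constraint: an overcomplete Keras AE whose 6-unit hidden layer carries an L1 activity penalty, so it cannot simply copy the 4 inputs through.

from tensorflow.keras import Input, Model, regularizers
from tensorflow.keras.layers import Dense

inputs = Input(shape=(4,))
# Overcomplete hidden layer (6 > 4 units); the L1 activity penalty
# discourages it from learning the trivial identity mapping
hidden = Dense(6, activation='relu',
               activity_regularizer=regularizers.l1(1e-4))(inputs)
outputs = Dense(4, activation='linear')(hidden)

sparse_ae = Model(inputs, outputs)
sparse_ae.compile(optimizer='adam', loss='mse')
# Usage: sparse_ae.fit(X_train, X_train, epochs=50, batch_size=8)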

At this point, I'm pretty certain that an AE is not what you are looking for, but most likely a normal discriminative NN (since you are using categorical cross-entropy as the loss function).

  • I find the tone of this answer to be disappointing. On a certain level, you can compare losses between models - because if a loss is zero, then we say the model has predicted perfectly. 1. Why do you recommend softmax here, but dismiss it above for an AE? 2. The reduction in dimensionality is only in the middle layers, as the decoding reconstructs - I'm not sure how this answers the above. 3. Thanks for clarifying this - you're right, I should have seeded the model. 4. Ok, point noted, so you're saying an AE has strict rules that surround it - but those are more human-established. Commented Jun 27, 2022 at 1:13
  • @user37649 A loss of zero means that your model has just memorized the training set, and I'm saying that it's not clear what you aim to do... An AE aims to do dimensionality reduction or density estimation, which has nothing to do with a softmax output layer, since that is used in a discriminative model to specify the output category of a multinomial distribution. Commented Jun 27, 2022 at 10:38
  • Does the validation set give us an opportunity to measure the validation loss, such that we can infer the model performance? On softmax or any activation function, is that not a relationship of the input (and output) vectors (e.g., values bound between [0,1], [-1,1], etc.)? Commented Jun 27, 2022 at 14:57

Let's first debug your code and experiment with it step by step. I debugged it, and with the fixes below it runs.

  • There are 3 possible outputs for y, so I used activation='softmax' in the final layer. If I were to OneHotEncoder (OHE) the output, would using something like 'sigmoid' be more appropriate, as the values are bound between 0 and 1?

You import OneHotEncoder (OHE) and LabelEncoder (LE) but never use them in your code, and their use in combination with different activation functions (AFs) needs to be experimented with!

The short answer is that you should modify the final layer to use 'sigmoid' and compare the reconstruction loss against versions with other AFs. In general, 'softmax' is good for classification tasks over categorical distributions, but if you rescale your data to [0, 1], 'sigmoid' is a better fit. (I added this to your code.)

[Fig. 1: loss curves with activation='softmax'. Fig. 2: loss curves with activation='sigmoid'.]
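For reference, here is a minimal self-contained sketch of the rescale-plus-sigmoid combination (hyperparameters are illustrative, not tuned):

from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense, BatchNormalization, LeakyReLU

# Rescale the features to [0, 1] so a sigmoid output can reproduce them
X = datasets.load_iris().data
X_scaled = MinMaxScaler().fit_transform(X)

inputs = Input(shape=(4,))
h = Dense(6)(inputs)
h = BatchNormalization()(h)
h = LeakyReLU()(h)
latent = Dense(3)(h)
outputs = Dense(4, activation='sigmoid')(latent)  # bounded output for bounded targets

ae = Model(inputs, outputs)
ae.compile(optimizer='adam', loss='mse')
ae.fit(X_scaled, X_scaled, epochs=50, batch_size=8, verbose=0)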

Also, one can easily plot and visualize the data to see whether the features are all positive or take negative values, as a minimal EDA step. For iris:

import seaborn as sns
from sklearn import datasets

# Plot pairs of features of the Iris dataset
# (load with as_frame=True so that iris.frame is available)
iris = datasets.load_iris(as_frame=True)
iris.frame["target"] = iris.target_names[iris.target]
_ = sns.pairplot(iris.frame, hue="target")

Note #1: check this post, where 'sigmoid' was used with an AE and recommended.

  • Even the smallest change in the layers (e.g., the encoding layer going to 6 instead of 3) causes a major shift in the loss -- is this normal?

The following plot demonstrates the trade-off between compression (fewer dimensions) and reconstruction loss. One needs to experiment with this rather than put too much trust in Gen-AI tool answers, such as the one I quote below:

[Figure: validation loss over epochs for encoding dimensions 2, 3, and 6 (Experiment 2)]

"Impact of Layer Dimensions on Loss: A significant change in loss when altering the dimensions of the encoding layer is expected. A smaller encoding layer forces the autoencoder to compress the information more, which might increase reconstruction error if the model cannot learn an efficient representation." [this paragraph generated by ChatGPT 4o, Accessed: 19.12.2024, prompt: Q3]

# Experiment 2: Varying Encoding Dimensions
import matplotlib.pyplot as plt
from tensorflow.keras.optimizers import Adam

# Assumes X_train, X_test, Input, Dense, BatchNormalization, LeakyReLU and Model
# are already available from the debugged code below
input_layer = Input(shape=(X_train.shape[1],))

encoding_dims = [2, 3, 6]
results = {}

for dim in encoding_dims:
    # Encoder
    encoded = Dense(dim)(input_layer)
    encoded = BatchNormalization()(encoded)
    encoded = LeakyReLU()(encoded)
    # Decoder
    decoded = Dense(4, activation='sigmoid')(encoded)
    # Model
    autoencoder = Model(inputs=input_layer, outputs=decoded)
    # Create a new optimizer instance for each model
    opt = Adam(learning_rate=0.0001)
    autoencoder.compile(optimizer=opt, loss='mse')
    # Train
    history = autoencoder.fit(
        X_train, X_train,
        epochs=50,
        batch_size=8,
        verbose=0,
        validation_data=(X_test, X_test)
    )
    results[dim] = history.history['val_loss']

# Plot Loss for Different Encoding Dimensions
for dim, loss in results.items():
    plt.plot(loss, label=f'Encoding Dim {dim}')
plt.xlabel('Epochs')
plt.ylabel('Validation Loss')
plt.legend()
plt.title('Experiment 2: Effect of Encoding Dimensions')
plt.show()
  • Each run of the autoencoder produces a different result. Is it normal that it is not deterministic?

This one is easy to find by searching. However, I noticed that setting seeds alone does not guarantee reproducibility; thanks to this answer and the tensorflow-determinism package, the problem was solved. (I added def setup_seed(seed) to your code.)

  • Why does the last layer have to be the same size (4) as the input dimension - are we able to force this to allow for an output of 3, for example? I know I can read from a latent layer, but then I can't fit the model based on that layer.

As @Anon stated, this goes back to the intuition behind AEs: the final layer should match the input dimensions if you're reconstructing the input, $$decoder(encoder(X)) = X.$$ However, if you want an output of a different size (e.g., 3), you must treat it as a supervised learning problem, not an AE; see this source for further info, and the sketch below.
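If that is what you actually want, a minimal sketch of the supervised alternative could look like this (a small classifier mapping the 4 features to the 3 iris classes; layer sizes are illustrative):

from sklearn import datasets
from sklearn.model_selection import train_test_split
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, to_categorical(iris.target), test_size=0.20, random_state=2024)

inputs = Input(shape=(4,))
hidden = Dense(8, activation='relu')(inputs)
outputs = Dense(3, activation='softmax')(hidden)  # 3 classes -> output size 3

clf = Model(inputs, outputs)
clf.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
clf.fit(X_train, y_train, epochs=50, batch_size=8, verbose=0,
        validation_data=(X_test, y_test))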


Side note #1: You should also consider whether the AE architecture uses linear or non-linear strategies in your exploration; check yours with autoencoder.summary().


Side note #2: It is better to make this part of your code explicit; see this notebook from a Harvard University assignment:

# Get the number of data samples i.e. the number of columns
input_dim = ___
output_dim = ___

# Specify the number of neurons for the dense layers
encoding_dim = ___

Debugged code:

import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, MinMaxScaler
from tensorflow.keras.layers import Input, Dense, BatchNormalization, LeakyReLU
from tensorflow.keras import backend, layers, models, metrics, utils
from tensorflow.keras import regularizers, Input, Model, optimizers
from tensorflow.keras.metrics import MeanSquaredError

#!pip install tensorflow-determinism

# Function for reproducibility, from the answer of @Patrick J. Holt,
# see https://stackoverflow.com/a/68829028/10452700
def setup_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)  # tf cpu fix seed
    os.environ['TF_DETERMINISTIC_OPS'] = '1'  # tf gpu fix seed; needs `pip install tensorflow-determinism` first

# Make the result reproducible
# Please note: setting random seeds alone is not enough for reproducibility
setup_seed(2024)

# Load Iris dataset
iris = datasets.load_iris()

# Get the predictor and response variables
X = iris.data
# y = iris.target
y = iris.target.reshape(-1, 1)

# Get the Iris label names
target_names = iris.target_names
print(X.shape, y.shape)  # (150, 4) (150, 1)

# Rescale the data to [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.20, random_state=2024)

# Experiment 0: Softmax Activation in Final Layer
# Define the input layer correctly
input_dim = Input(shape=(X_train.shape[1],))

# Encoder
# encoded = layers.Dense(6, input_dim='input_dim')(input_dim)
encoder = Dense(6)(input_dim)  # Remove input_dim from the Dense layer
encoder = BatchNormalization()(encoder)
encoder = LeakyReLU()(encoder)
encoder = Dense(3)(encoder)

# Decoder
# decoder = Dense(4, activation='sigmoid')(encoder)
decoder = Dense(4, activation='softmax')(encoder)

# Model
autoencoder = Model(inputs=input_dim, outputs=decoder)

# opt = optimizers.Adam(lr=0.00001)
opt = optimizers.Adam(learning_rate=0.00001)  # use learning_rate instead of the deprecated lr argument

# Compile the model with mean squared error loss
# autoencoder.compile(optimizer=opt, loss='categorical_crossentropy', metrics=[metrics.CategoricalAccuracy()])
autoencoder.compile(optimizer=opt, loss='mse', metrics=[MeanSquaredError()])

# Train the model
history = autoencoder.fit(X_train, X_train,
                          epochs=50,
                          batch_size=8,
                          verbose=1,
                          validation_data=(X_test, X_test))

# Plot the loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.title('Experiment 0: softmax activation\n(the script proposed in the question)')
plt.show()
