
I'm trying to solve a regression problem where I predict the molar compositions of some chemical species. I'm using this kind of network:

import keras  # or: from tensorflow import keras, depending on the installation

inp = keras.layers.Input(shape=X_train.shape[1:])        # 14 input features
x0 = keras.layers.Activation('swish')(inp)
x1 = keras.layers.Dense(64)(x0)
x1 = keras.layers.Activation('gelu')(x1)
x2 = keras.layers.Dense(64)(x1)
x2 = keras.layers.Activation('swish')(x2)
x_concat = keras.layers.Concatenate()([x0, x1, x2])      # concatenate input and hidden activations
x3 = keras.layers.Dense(64)(x_concat)
x3 = keras.layers.Activation('gelu')(x3)
out2 = keras.layers.Dense(20, activation='linear', name='subsets')(x3)  # 20 regression outputs
model = keras.Model(inputs=inp, outputs=out2)

I have 14 inputs and 20 outputs. When I use the AdamW optimizer with lr=1e-4 and a Huber loss, I get these results for loss and val_loss: [plot of training and validation loss]. Also, my average gradient per step looks like this: [plot of average gradient per step].
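The training step is set up roughly like this (a sketch, not my exact code; the epoch count, batch size, and validation-split names are illustrative, and keras.optimizers.AdamW is assumed to be available in the Keras version I'm using):

model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=1e-4),  # AdamW with lr = 1e-4
    loss=keras.losses.Huber(),                             # Huber loss
)
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),   # X_val/y_val are placeholder names for my validation split
    epochs=500,                       # illustrative value
    batch_size=32,                    # illustrative value
)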
I have tried to reach zero loss on a small set of samples (say 200). I couldn't do that unless I widened the network (1,000 neurons per layer) and also set batch_size = 200 (one full batch per epoch).

I have also found that when I make the network wider and deeper, the results improve a tiny bit.

I think I can conclude from this that my inputs and outputs have a complex nonlinear relationship, but how can I simplify this relationship so that my network becomes more precise? I have tried feature importance and eliminated the two features with the least impact on the model, and the results improved a little again. I have also tried to create nonlinear combinations of features (powers, ratios, products, etc.; a sketch of this is included below), but had no luck making the model more precise. Also, these are my input and output features for ten rows:

Inputs (scaled with StandardScaler): [table screenshot of ten input rows]

Outputs (not scaled, since they are already between 0 and 1, and if I scale them the performance gets even worse): [table screenshot of ten output rows]

Also, here is the average SMAPE% error chart for the outputs on the test data (4,200 samples): [bar chart of average SMAPE% per output]
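The feature-engineering attempts mentioned above looked roughly like this (a sketch; PolynomialFeatures and the variable names are illustrative, not my exact code):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Degree-2 terms add all pairwise products and squares of the 14 inputs.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)           # (n_samples, 14 + 105) = (n_samples, 119)
X_test_poly = poly.transform(X_test)                  # X_test is a placeholder name

# Ratios are built by hand, e.g. feature 0 divided by feature 1.
eps = 1e-8                                            # guard against division by zero
ratio_01 = X_train[:, 0] / (X_train[:, 1] + eps)
X_train_ext = np.column_stack([X_train_poly, ratio_01])

# Re-scale after adding features, since the new columns live on different scales.
scaler = StandardScaler()
X_train_ext = scaler.fit_transform(X_train_ext)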

  • First things first: what gives you the confidence that zero loss is possible for your problem? Commented Aug 6 at 7:02
  • @Ipounng Even if it's not possible, I should reach something better than ~1e-4 for 200 samples, right? But it doesn't happen easily or quickly enough. Commented Aug 6 at 7:32

1 Answer


tldr

What does it mean when even a small set of samples doesn't give 0 loss?

It doesn't mean much. Achieving zero loss is not always possible (or needed). If, as in your case, your training and validation loss both decrease and end up converging to a similar, relatively low value, your neural net is about as close to having "perfectly" learned your data as it can get.

longer answer

First of all, I'd like to point out that achieving zero loss is:

  • not always possible due to the data (i.e., you could have two identical inputs pointing to different targets, which you can never fully satisfy; see the small numerical sketch after this list)
  • not always possible due to numerical precision
  • not always possible due to the size of the neural network
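As a small made-up numerical illustration of the first point: if the same input appears twice with conflicting targets, any deterministic model has to output a single value for both rows, so the loss floor is strictly above zero.

import numpy as np

x = np.array([[1.0, 2.0], [1.0, 2.0]])    # two identical inputs
y = np.array([0.0, 1.0])                  # conflicting targets

# Under MSE, the best a single prediction p can do is
# ((p - 0)^2 + (p - 1)^2) / 2, minimised at p = 0.5.
p = y.mean()
best_mse = np.mean((p - y) ** 2)
print(best_mse)                           # 0.25, never 0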

I have tried to reach zero loss on a small set of samples (say 200). I couldn't do that unless I widened the network (1,000 neurons per layer) and also set batch_size = 200 (one full batch per epoch).

I'd say that you've "accidentally" run into a few important concepts in working with neural nets.

  • Mini-batch vs full batch gradients

Your gradient is the "direction" that points towards "better" weights. When you use mini-batch gradients (i.e., not your whole training set), you estimate a gradient based on a subset of your data. That gradient will be representative of that subset, and therefore, may deviate from the gradient you would have gotten with the whole training set. If your goal is to achieve zero loss, you will likely only be able to do so when using your entire training set as a batch. This would at least ensure that your model will get better for your entire training set, instead of potentially improving on some batches and being worse on others.
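In Keras terms, full-batch training just means setting the batch size to the size of the training set (a sketch, reusing the X_train/y_train names from the question; the epoch count is arbitrary):

full_batch = len(X_train)            # e.g. 200 in the small-sample experiment
model.fit(
    X_train, y_train,
    batch_size=full_batch,           # one gradient step per epoch, computed on all samples
    epochs=2000,                     # arbitrary illustrative value
    verbose=0,
)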

  • Wider and deeper is typically "better"

Neural networks are universal function approximators, meaning that a neural network can approximate "any" function, as long as it is wide and/or deep enough.
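As a rough illustration of "wider and/or deeper" (the layer widths and depth below are arbitrary choices, not a recommendation):

# Illustrative wider/deeper variant of the question's architecture.
inp = keras.layers.Input(shape=X_train.shape[1:])
x = inp
for units in (256, 256, 128, 128):    # four hidden layers instead of three, with more units
    x = keras.layers.Dense(units)(x)
    x = keras.layers.Activation('gelu')(x)
out = keras.layers.Dense(20, activation='linear')(x)
wider_model = keras.Model(inputs=inp, outputs=out)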

When you train a neural network on some data, you assume that this data behaves according to a particular function (i.e., f(x) = y), and your neural net will attempt to approximate it given the data you feed it during training.

However, as mentioned, this is an assumption, and it might not always hold. There might not be a perfect function mapping your input to your output. It could be that your data has some quirks or some randomness to it, making it impossible to perfectly map your input to your output.

  • Thank you so much for this detailed answer. I have another question: when checking my 20 outputs on the test data, I noticed that some of them were predicted nicely but some were not at all (I will share the average SMAPE% on 4,200 of my test samples in the question). How can I show that the network has learned everything there was to learn, and how can I improve the test predictions for the outputs with a high SMAPE%? Commented Aug 6 at 10:56
  • A zero loss may also indicate severe overfitting. Naivahash80, you may be interested in this post about knowing when you have reached the limit of predictability and in this post about the issues with the MAPE - the sMAPE has very similar problems. If this is 4200, then a flat zero prediction will be "better" (with an sMAPE of 200%). See also this. Commented Aug 6 at 15:49
  • @StephanKolassa Aside from artificially constructed scenarios, would a zero loss ever not be indicative of overfitting? Commented Aug 6 at 16:54
  • @JimmyJamessupportsCanada One could imagine problems with discrete outcomes and sufficiently good input data where a zero loss is achievable. At what point these stop being real-life problems and become artificially constructed is a matter of opinion. Commented Aug 7 at 6:40
  • @StephanKolassa Thank you so much for the links you've shared; I really appreciate it! Just out of curiosity: since I sense that the problem I'm working on can't be explained by a simple y = f(x), and hence a neural network is of limited use here, how can I identify other inputs that might be informative for my outputs, so that I can use neural networks to predict them well? If you have anything in mind, please share it! My outputs are the feed compositions in a distillation column (this is related to chemical engineering). Commented Aug 7 at 13:28
