I'm trying to solve a regression problem where I predict the molar compositions of some chemical species. I'm using this kind of network:
```python
from tensorflow import keras

inp = keras.layers.Input(shape=X_train.shape[1:])
x0 = keras.layers.Activation('swish')(inp)
x1 = keras.layers.Dense(64)(x0)
x1 = keras.layers.Activation('gelu')(x1)
x2 = keras.layers.Dense(64)(x1)
x2 = keras.layers.Activation('swish')(x2)
x_concat = keras.layers.Concatenate()([x0, x1, x2])
x3 = keras.layers.Dense(64)(x_concat)
x3 = keras.layers.Activation('gelu')(x3)
out2 = keras.layers.Dense(20, activation='linear', name='subsets')(x3)
model = keras.Model(inputs=inp, outputs=out2)
```
I have 14 inputs and 20 outputs. When I use the AdamW optimizer with lr=1e-4 and a Huber loss, I get these results for loss and val_loss:
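For completeness, the training setup looks roughly like this (the epoch count, batch size, and validation split below are placeholders, not my exact values):

```python
model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=1e-4),  # AdamW is built into recent Keras/TF
    loss=keras.losses.Huber(),
)
history = model.fit(
    X_train, y_train,
    validation_split=0.2,  # placeholder split
    epochs=300,            # placeholder
    batch_size=32,         # placeholder
)
```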
Also, my average gradient per step looks like this:
I have tried to drive the loss to zero on a small sample (say 200 rows). I couldn't do that unless I widened my network (1000 neurons per layer) and also set batch_size = 200 (one full-batch step per epoch).
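Roughly what that sanity check looked like (the activations, depth, and epoch count here are illustrative, not my exact script):

```python
# Try to memorize a tiny subset with a deliberately oversized network
X_small, y_small = X_train[:200], y_train[:200]

inp = keras.layers.Input(shape=X_small.shape[1:])
h = keras.layers.Dense(1000, activation='swish')(inp)
h = keras.layers.Dense(1000, activation='gelu')(h)
out = keras.layers.Dense(20, activation='linear')(h)
wide_model = keras.Model(inp, out)

wide_model.compile(optimizer=keras.optimizers.AdamW(learning_rate=1e-4),
                   loss=keras.losses.Huber())
# batch_size equals the sample count, so each epoch is a single full-batch gradient step
wide_model.fit(X_small, y_small, batch_size=200, epochs=5000, verbose=0)
print(wide_model.evaluate(X_small, y_small, verbose=0))
```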
I have also observed that when I make the network wider and deeper, the results improve a tiny bit.
I think I can conclude from this that my inputs and outputs have a complex nonlinear relationship, but how can I simplify this relationship so that my network becomes more precise? (I have tried feature importance and eliminated the two features with the least impact on the model, which again improved the result a little. I have also tried creating nonlinear combinations of features (powers, ratios, products, etc.), but had no luck making the model more precise; a sketch of what I tried is shown after the data below.) Also, these are my input and output features for ten rows:
Inputs (scaled with StandardScaler):
Outputs (not scaled, since they are already between 0 and 1, and if I scale them the performance gets even worse):
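As mentioned above, this is roughly the kind of nonlinear feature engineering I tried (the column indices and the eps guard are just examples, not my actual choices):

```python
import numpy as np

eps = 1e-8  # guard against division by zero
X_eng = np.column_stack([
    X_train,
    X_train[:, 0] ** 2,                     # power of a feature
    X_train[:, 1] * X_train[:, 2],          # product of two features
    X_train[:, 3] / (X_train[:, 4] + eps),  # ratio of two features
])
```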
Also, here is the average SMAPE% error chart for the outputs on the test data (4200 samples):
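The per-output SMAPE is computed along these lines (this is the standard definition; my exact variant may differ slightly in the denominator convention):

```python
import numpy as np

def smape(y_true, y_pred, eps=1e-8):
    """Symmetric mean absolute percentage error, in percent."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0 + eps
    return 100.0 * np.mean(np.abs(y_pred - y_true) / denom)

# Average SMAPE per output column over the 4200 test samples
per_output = [smape(y_test[:, j], y_pred[:, j]) for j in range(y_test.shape[1])]
```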