Normally, zero initialization is really frowned upon when training neural networks for classification. But today, while implementing softmax regression from scratch and playing around with it a bit, I came across a phenomenon where, apparently, when there is no hidden layer it is actually better to use zero initialization for the model's weights.
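For context, here is a minimal sketch of what I mean by softmax regression with the two initialization schemes. This is not my exact notebook code (that's linked below); the class name, the `zero_init` flag, and the random scale of 0.01 are just illustrative choices.

```python
import torch
import torch.nn as nn

# Softmax regression is just one linear layer with no hidden layer;
# the softmax itself is applied implicitly by the cross-entropy loss.
class SoftmaxRegression(nn.Module):
    def __init__(self, in_features, num_classes, zero_init=True):
        super().__init__()
        self.linear = nn.Linear(in_features, num_classes)
        if zero_init:
            # Zero initialization: every weight and bias starts at exactly 0.
            nn.init.zeros_(self.linear.weight)
            nn.init.zeros_(self.linear.bias)
        else:
            # Random initialization: small Gaussian weights
            # (one possible scheme; the exact scale is just illustrative).
            nn.init.normal_(self.linear.weight, mean=0.0, std=0.01)
            nn.init.zeros_(self.linear.bias)

    def forward(self, x):
        # Flatten images into vectors and return raw logits.
        return self.linear(x.flatten(start_dim=1))
```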
For example, when I train the model on the MNIST dataset, here are the plots of training accuracy, training loss, validation accuracy, and validation loss for each weight initialization scheme I used.
As you can see, the metrics are much better when I use zero initialization. To make sure that this is not specific to MNIST, I also ran the same experiment on the CIFAR10 dataset. Here are the plots for CIFAR10.
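If it helps to see the structure of the experiment, here is roughly how each run was done. This is a simplified sketch, not the exact notebook code: `SoftmaxRegression` refers to the model sketched above, `run_experiment` is a hypothetical helper name, and the data loaders are assumed to be standard torchvision ones.

```python
import torch
import torch.nn as nn

def run_experiment(model, train_loader, val_loader, epochs=20, lr=0.1, device="cpu"):
    """Train one model and record the per-epoch metrics shown in the plots."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    history = {"train_loss": [], "train_acc": [], "val_loss": [], "val_acc": []}

    for _ in range(epochs):
        # Training pass
        model.train()
        total, correct, loss_sum = 0, 0, 0.0
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            logits = model(x)
            loss = criterion(logits, y)
            loss.backward()
            optimizer.step()
            loss_sum += loss.item() * y.size(0)
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += y.size(0)
        history["train_loss"].append(loss_sum / total)
        history["train_acc"].append(correct / total)

        # Validation pass
        model.eval()
        total, correct, loss_sum = 0, 0, 0.0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                logits = model(x)
                loss_sum += criterion(logits, y).item() * y.size(0)
                correct += (logits.argmax(dim=1) == y).sum().item()
                total += y.size(0)
        history["val_loss"].append(loss_sum / total)
        history["val_acc"].append(correct / total)
    return history

# Example comparison for MNIST (784 inputs, 10 classes):
# zero_hist = run_experiment(SoftmaxRegression(784, 10, zero_init=True), train_loader, val_loader)
# rand_hist = run_experiment(SoftmaxRegression(784, 10, zero_init=False), train_loader, val_loader)
```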
As you can see from both datasets (more clearly in MNIST), the randomly initialized models have converged, yet they are still quite far behind the zero-initialized models. I would really appreciate it if anyone could explain this phenomenon, or share ideas on why it might happen.
If you want to reproduce the experiments, here is the notebook I created.

