Sidebands arise when you sample the signal in the time domain. Sampling in the time domain causes aliasing in the Frequency domain.
As Hilmar said, almost any book will derive this mathematically.
The key step is that, as you say, multiplication in one domain is convolution in the other. Then, the transform of a train of deltas (spacing T) is itself another train of deltas (spacing 1/T = fs).
Finally, convolution with a single delta just relocates the function to where the delta is centered. Superpose all of the shifted deltas, convolve, and then you get the superposition of copies of the input spectrum at every k.fs.
If you want a more "intuitive" explanation, think that if you sample the analog signal $\cos(2\pi kf_st +\phi_0)$ at regular intervals $t_n=n/f_s$, you get a constant sequence. Hence, you can't know just by looking at the samples which was the original value for $k$. As far as you know, every analog frequency $k.f_s$ could have given that sequence. You see every frequency $k.f_s$ as possibly present in the sampled signal.
With another notation, form the sequence $x[n] = \cos(2\pi f n/ f_s + phi_0)$. Take $f= kf_s$, and it generates a constant sequence. Thus, you can synthetise the same constant sequence with any of the $f = kf_s$.
These examples show how aliasing occurs for $f =kf_s$, but of course the analogy can be taken further for all other frequencies. The usual examples with rotating wheels should be useful (see "stroboscopic effect")