I have been trying to understand non-parametric regression using Gaussian processes (GP), which are used to represent prior distributions over the space of functions. The linear model considered is $$ \mathbf{y}_i = f(\mathbf{x}_i) + \mathbf{\epsilon}\\=f_i +\mathbf{\epsilon}$$ where $\mathbf{y}_i$ is the observations, $f_i$ is the output of the function at input location $\mathbf{x}_i$ and $\mathbf{\epsilon}$ is the additive Gaussian noise. The folowing GP is chosen in this case $$p(f|\mathbf{X},\mathbf{\theta}) = \mathcal{N}(\mathbf{0},k(\mathbf{X},\mathbf{X})).$$ where, $k(\cdot;\cdot)$ is the covariance function and $\mathbf{\theta}$ is the hyper-parameters.
I would appreciate if someone can shed light on the choice of this prior and what it actually does? Also, what is the approach to formulating the joint likelihood $p(\mathbf{Y},\mathbf{X},f,\mathbf{\theta})$ of the entire model.
Thanks in advance!!