Shortly after posting this question, I found a paper that explains the signal model for GMT very well: "End-to-End Moving Target Indication for Airborne Radar Using Deep Learning". I have attached the link to the paper in case anyone needs clarification.
According to this paper, the space-time steering vector for a single moving target combines the spatial steering vector (say $A_{s}$) and the temporal steering vector (the one formed from the Doppler frequency, say $A_{d}$). The final steering vector is $A_{s} \otimes A_{d}$, where $\otimes$ denotes the Kronecker product, i.e. the outer product of the two vectors stacked into a single long vector. This answers my first question: the Doppler frequency does enter the array manifold calculation. Including it there also answers my second question. Hope this helps, and please correct me if I am wrong.
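For anyone who wants to see this numerically, here is a minimal NumPy sketch of forming $A_{s} \otimes A_{d}$ for one target. It is not taken from the paper. It assumes a uniform linear array and a constant pulse repetition interval, so both steering vectors are plain complex exponentials. The function names, the normalized frequencies, and the sizes `N` and `M` are illustrative choices of mine.

```python
import numpy as np

def spatial_steering(n_elements, spatial_freq):
    # Assumed model: uniform linear array, with spatial_freq the
    # normalized spatial frequency (d / wavelength) * sin(angle).
    n = np.arange(n_elements)
    return np.exp(2j * np.pi * spatial_freq * n)

def temporal_steering(n_pulses, doppler_freq):
    # Assumed model: constant PRI, with doppler_freq the Doppler
    # frequency normalized by the pulse repetition frequency.
    m = np.arange(n_pulses)
    return np.exp(2j * np.pi * doppler_freq * m)

def space_time_steering(n_elements, n_pulses, spatial_freq, doppler_freq):
    a_s = spatial_steering(n_elements, spatial_freq)
    a_d = temporal_steering(n_pulses, doppler_freq)
    # Kronecker product: entry [n * n_pulses + m] equals a_s[n] * a_d[m].
    return np.kron(a_s, a_d)

N, M = 4, 8  # illustrative: 4 array elements, 8 pulses
v = space_time_steering(N, M, spatial_freq=0.25, doppler_freq=0.1)
print(v.shape)  # (32,)
```

The ordering $A_{s} \otimes A_{d}$ versus $A_{d} \otimes A_{s}$ only changes how the element and pulse indices are interleaved in the stacked vector, so it has to match the way the received data snapshot is stacked.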