I was going through Marsden's book, Elementary Classical Analysis, and came across the following exercise in Chapter 6. It reads as follows:
If $f: A \subset \mathbb{R}^n \to \mathbb{R}^m$ and $g: B \subset \mathbb{R}^m \to \mathbb{R}^p$, show that \begin{align*} D^2(g \circ f)(x_0)(x, y) &= D^2 g(f(x_0)) (Df(x_0) \cdot x, Df(x_0) \cdot y) \\ &+\; Dg(f(x_0)) \cdot D^2f(x_0)(x, y). \end{align*}
I found this question concerning the same exercise, but my problem was not answered there. I know that I am supposed to apply the chain rule twice to compute this result. What I do not understand is why the result involves a sum in the first place. How is the use of the product rule justified here?
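For comparison, in the scalar case $n = m = p = 1$ (assuming $f$ and $g$ are twice differentiable and $f(A) \subset B$), the identity reduces to the familiar formula $$ (g \circ f)''(x_0) = g''(f(x_0))\, f'(x_0)^2 + g'(f(x_0))\, f''(x_0), $$ where the sum arises from applying the ordinary product rule to $(g \circ f)'(x) = g'(f(x))\, f'(x)$. My difficulty is that in higher dimensions this product of numbers becomes a composition of linear maps.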
If I apply the chain rule once, I get $$ D(g \circ f)(x_0) = Dg(f(x_0)) \circ Df(x_0),$$ where $Df : A \to L(\mathbb{R}^n, \mathbb{R}^m)$ and $Dg : B \to L(\mathbb{R}^m, \mathbb{R}^p)$. Clearly this is the composition of two linear transformations. But neither the product rule (introduced in the text to differentiate $gf$, where $f : A \subset \mathbb{R}^n \to \mathbb{R}^m$ and $g: A \to \mathbb{R}$) nor the chain rule applies directly here.
I know that we can view this equation in terms of matrix multiplication for suitably chosen bases. But how can I differentiate the composition of linear transformations as written above? Can I view the composition of these linear maps as a bilinear map, and apply the generalized product rule to differentiate it?
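To make the idea concrete, here is a sketch of what I have in mind. It assumes that $f$ and $g$ are twice differentiable, that $f(A) \subset B$, and that $D^2 h(x_0)(x, y)$ means $\big(D(Dh)(x_0) \cdot x\big) \cdot y$. The symbols $\beta$, $F$ and $G$ are my own notation, not the book's. Let $$ \beta : L(\mathbb{R}^m, \mathbb{R}^p) \times L(\mathbb{R}^n, \mathbb{R}^m) \to L(\mathbb{R}^n, \mathbb{R}^p), \qquad \beta(S, T) = S \circ T. $$ Since $\beta$ is bilinear, its derivative should be $D\beta(S, T)(H, K) = H \circ T + S \circ K$. Writing $F(x) = Dg(f(x))$ and $G(x) = Df(x)$, we have $D(g \circ f)(x) = \beta(F(x), G(x))$, so the chain rule would give \begin{align*} D\big(D(g \circ f)\big)(x_0) \cdot x &= \big(DF(x_0) \cdot x\big) \circ G(x_0) + F(x_0) \circ \big(DG(x_0) \cdot x\big), \\ DF(x_0) \cdot x &= D^2 g(f(x_0))\big(Df(x_0) \cdot x, \,\cdot\,\big), \\ DG(x_0) \cdot x &= D^2 f(x_0)(x, \,\cdot\,). \end{align*} Evaluating the first line at $y$ and substituting the other two would then yield $$ D^2(g \circ f)(x_0)(x, y) = D^2 g(f(x_0))\big(Df(x_0) \cdot x, Df(x_0) \cdot y\big) + Dg(f(x_0)) \cdot D^2 f(x_0)(x, y), $$ which matches the stated formula, with the sum coming from the two terms of $D\beta$. Is this reasoning valid?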