This confused me when I first took calculus. After some years, and learning more advanced concepts, I realized that the confusion boiled down to the way that vectors and linear functionals (linear maps that take a vector as input and return a scalar as output) are conflated in typical calculus classes. The derivative is naturally a linear functional. The gradient, as a vector, only makes sense as the Riesz representative of the derivative with respect to an inner product.
This answer is, perhaps, at a higher level than the question asks. However, this is the logical way to think about it that eventually made sense to me.
Definition 1: The first derivative, $Df(x)$, is the linear functional that best approximates $v \mapsto f(x + v) - f(x)$ for small $v$.
Definition 2: The gradient, $\nabla f(x)$, is the Riesz representative of the first derivative. That is, the vector satisfying $$ Df(x)v = (\nabla f(x), v) $$ for all vectors $v \in \mathbb{R}^n$, where $(\cdot, \cdot)$ is an inner product.
Theorem 1: The gradient (as defined in Definition 2) points in the direction of steepest ascent.
Theorem 2: If the inner product $(\cdot,\cdot)$ is the dot product, then \begin{equation*} \nabla f(x) = \left(\frac{df}{dx_1}(x), \frac{df}{dx_2}(x), \dots, \frac{df}{dx_n}(x)\right). \end{equation*}
If you start with the formula for $\nabla f(x)$ in Theorem 2 as the definition (as posed in the question), you just turn around the logic a bit. From Theorem 2, $\nabla f(x)$ is the Riesz representative of $Df(x)$ under the dot product, and from Theorem 1, the Riesz representative of $Df(x)$ points in the direction of steepest ascent.
There is another interesting thing this perspective illustrates: if you change the inner product, the gradient changes. However, the inner product defines the notion of distance in the domain, so changing the inner product affects the meaning of steepness. If you walk one meter horizontally and go up one meter vertically, that is pretty steep. If you walk one kilometer horizontally and go up one meter vertically, that is pretty flat. Somehow these two changes exactly cancel, so the gradient is still the direction of steepest ascent!
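To make this concrete, here is a small worked example under an assumption that is not part of the argument above: take the weighted inner product $(u, v)_A = u^T A v$, where $A$ is some symmetric positive definite matrix chosen for illustration. Write $g$ for the vector of partial derivatives from Theorem 2, so that $Df(x)v = g^T v$. The gradient with respect to $(\cdot,\cdot)_A$, call it $\nabla_A f(x)$, must satisfy \begin{equation*} g^T v = Df(x)v = (\nabla_A f(x), v)_A = \left(A \nabla_A f(x)\right)^T v \end{equation*} for all $v$, which forces \begin{equation*} \nabla_A f(x) = A^{-1} g. \end{equation*} With $A = I$ this is the usual gradient. With any other $A$ it is a different vector, but it is the steepest direction when "unit step" means $\sqrt{v^T A v} = 1$.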
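Proof of Theorem 1: The direction of steepest ascent is the unit vector that maximizes the directional derivative, \begin{equation*} \operatorname*{arg\,max}_{\|v\|=1} Df(x)v = \operatorname*{arg\,max}_{\|v\|=1} (\nabla f(x), v), \end{equation*} where $\|v\| = \sqrt{(v, v)}$ is the norm induced by the inner product. By the Cauchy-Schwarz inequality, $(\nabla f(x), v) \le \|\nabla f(x)\| \, \|v\| = \|\nabla f(x)\|$, with equality exactly when $v = \nabla f(x) / \|\nabla f(x)\|$ (assuming $\nabla f(x) \ne 0$). So the maximizer is the gradient, scaled to unit length. $\square$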
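If you want to see Theorem 1 numerically, the following is a minimal sketch, not a definitive implementation. Everything in it is an assumption made up for illustration: the function $f(x, y) = x^2 + 3y$, the point $(1, 2)$, the weighted inner product $(u, v)_A = u^T A v$ with $A = \mathrm{diag}(1, 100)$, and all helper names. It computes the Riesz representative by solving $A w = g$, then compares the slope in that direction against a brute-force search over unit vectors (unit in the $A$-norm).

```python
import math

# Hypothetical example: f(x, y) = x**2 + 3*y.
def partials(x, y):
    """Components of Df at (x, y) in the standard basis: (df/dx, df/dy)."""
    return (2.0 * x, 3.0)

# Assumed inner product (u, v)_A = u^T A v, with A symmetric positive definite.
A = ((1.0, 0.0), (0.0, 100.0))

def inner(u, v):
    """Weighted inner product u^T A v."""
    Av = (A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1])
    return u[0] * Av[0] + u[1] * Av[1]

def riesz_gradient(g):
    """Solve A w = g (2x2 case), so that (w, v)_A = g . v for every v."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return ((A[1][1] * g[0] - A[0][1] * g[1]) / det,
            (A[0][0] * g[1] - A[1][0] * g[0]) / det)

def normalize(v):
    """Scale v to unit length in the norm induced by the inner product."""
    n = math.sqrt(inner(v, v))
    return (v[0] / n, v[1] / n)

def slope(g, v):
    """Directional derivative Df(x) v = g . v."""
    return g[0] * v[0] + g[1] * v[1]

g = partials(1.0, 2.0)
grad_A = riesz_gradient(g)

# Slope along the normalized Riesz representative.
best_slope = slope(g, normalize(grad_A))

# Brute force: largest slope over 3600 sampled unit directions.
sampled_slope = max(
    slope(g, normalize((math.cos(t), math.sin(t))))
    for t in (2.0 * math.pi * k / 3600 for k in range(3600))
)

# Slope along the vector of partials, normalized in the same A-norm.
partials_slope = slope(g, normalize(g))

print(best_slope, sampled_slope, partials_slope)
```

Under these assumptions, no sampled direction should beat `best_slope`, while the vector of partial derivatives (the gradient for the dot product) does noticeably worse once steps are measured in the $A$-norm.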
Proof of Theorem 2: If $e_i=(0,\dots,0,1,0,\dots,0)$ is the unit vector which has a one in the $i$th position and zeros elsewhere, then the limit definition of the derivative in 1D, applied to $t \mapsto f(x + t e_i)$, implies $$ \frac{df}{dx_i}(x) = Df(x)e_i. $$ The $i$th component of a vector is the dot product of that vector with the unit vector $e_i$. Hence, \begin{equation*} \left(\nabla f(x)\right)_i = (\nabla f(x), e_i) = Df(x) e_i = \frac{df}{dx_i}(x). ~\square \end{equation*}