I'm trying to understand the concept of the directional derivative as presented in my multivariable calculus textbook. I've typed out a summary of the explanation, with the questions I couldn't answer in boldface. Any intuitive, geometrical, or physical answers are welcome. Formal, rigorous answers are also welcome. Partial explanations (answering only one of the questions, etc.) are also very welcome!
Consider the problem of calculating the rate of change of $\phi$ in some particular direction. For an infinitesimal vector displacement $d \mathbf{r},$ forming its scalar product with $\nabla \phi$ we obtain
$$
\begin{aligned}
\nabla \phi \cdot d \mathbf{r} &=\left(\mathbf{i} \frac{\partial \phi}{\partial x}+\mathbf{j} \frac{\partial \phi}{\partial y}+\mathbf{k} \frac{\partial \phi}{\partial z}\right) \cdot(\mathbf{i} d x+\mathbf{j} d y+\mathbf{k} d z) \\
&=\frac{\partial \phi}{\partial x} d x+\frac{\partial \phi}{\partial y} d y+\frac{\partial \phi}{\partial z} d z \\
&=d \phi
\end{aligned}
$$
which is the infinitesimal change in $\phi$ in going from position $\mathbf{r}$ to $\mathbf{r}+d \mathbf{r}.$ In particular, if $\mathbf{r}$ depends on some parameter $u$ such that $\mathbf{r}(u)$ defines a space curve, then the total derivative of $\phi$ with respect to $u$ along the curve is simply
$$
\frac{d \phi}{d u}=\nabla \phi \cdot \frac{d \mathbf{r}}{d u}.
$$

**Question 1: How did we get this? Should I just divide both sides of $\nabla \phi \cdot d \mathbf{r} = d\phi$ by $du$? I don't even know if that's a valid operation.**

In the particular case where the parameter $u$ is the arc length $s$ along the curve, the total derivative of $\phi$ with respect to $s$ along the curve is given by
$$
\frac{d \phi}{d s}=\nabla \phi \cdot \hat{\mathbf{t}}
$$
where $\hat{\mathbf{t}}$ is the unit tangent to the curve at the given point.

**Question 2: Then why isn't $\frac{d \phi}{d s} = 0$? Surely $\nabla \phi$ is perpendicular/tangent to the surface of $\phi$, so it will be perpendicular to $\hat{\mathbf{t}}$!**

In general, the rate of change of $\phi$ with respect to the distance $s$ in a particular direction $\mathbf{a}$ is given by
$$
\frac{d \phi}{d s}=\nabla \phi \cdot \hat{\mathbf{a}}
$$
(**Question 3: (most burning question) I have no idea how to obtain or understand the above result, or why it holds.**
**Also, am I to think $\nabla \phi \cdot \hat{\mathbf{a}} = \nabla \phi \cdot \hat{\mathbf{t}}$?**) and is called the directional derivative. Since $\hat{\mathbf{a}}$ is a unit vector we have
$$
\frac{d \phi}{d s}=|\nabla \phi| \cos \theta
$$
where $\theta$ is the angle between $\hat{\mathbf{a}}$ and $\nabla \phi$. Clearly $\nabla \phi$ lies in the direction of the fastest increase in $\phi$ and $|\nabla \phi|$ is the largest possible value of $d \phi / d s$.

**Question 4: I get that the largest possible value of $d \phi / d s$ occurs when $\theta = 0$, i.e. in the direction of $\nabla \phi$, but why does the largest $\frac{d \phi}{d s}$ imply the direction of fastest increase of $\phi$?**
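
For anyone who wants to test the textbook's formulas numerically before answering, here is a minimal sketch. It compares finite-difference estimates with $\nabla \phi \cdot \frac{d \mathbf{r}}{d u}$ along a curve and with $\nabla \phi \cdot \hat{\mathbf{a}}$ in a fixed direction. The field $\phi = x^2 y + z$, the helix $\mathbf{r}(u)$, the sample point, the direction $\mathbf{a}$, and all function names are arbitrary choices made for illustration; none of them come from the textbook.

```python
import math

def phi(x, y, z):
    # Example scalar field (assumption for illustration, not from the textbook).
    return x * x * y + z

def grad_phi(x, y, z):
    # Gradient of the example field above, worked out by hand.
    return (2 * x * y, x * x, 1.0)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def unit(a):
    # Normalise a vector to unit length.
    n = math.sqrt(dot(a, a))
    return tuple(p / n for p in a)

def curve(u):
    # Example space curve r(u): a helix (assumption for illustration).
    return (math.cos(u), math.sin(u), u)

def curve_velocity(u):
    # dr/du for the helix above.
    return (-math.sin(u), math.cos(u), 1.0)

def dphi_du_fd(u, h=1e-6):
    # Central-difference estimate of d(phi)/du along the curve.
    return (phi(*curve(u + h)) - phi(*curve(u - h))) / (2 * h)

def directional_derivative_fd(point, direction, h=1e-6):
    # Central-difference estimate of d(phi)/ds: step a distance h
    # forwards and backwards along the unit vector of `direction`.
    a_hat = unit(direction)
    fwd = tuple(p + h * c for p, c in zip(point, a_hat))
    bwd = tuple(p - h * c for p, c in zip(point, a_hat))
    return (phi(*fwd) - phi(*bwd)) / (2 * h)

# Check 1: d(phi)/du versus grad(phi) . dr/du on the curve.
u0 = 0.7
chain_fd = dphi_du_fd(u0)
chain_formula = dot(grad_phi(*curve(u0)), curve_velocity(u0))
print("d(phi)/du  finite difference:", chain_fd)
print("d(phi)/du  grad . dr/du     :", chain_formula)

# Check 2: d(phi)/ds in direction a versus grad(phi) . a_hat.
point = (1.0, 2.0, 0.5)
a = (1.0, -1.0, 2.0)
g = grad_phi(*point)
fd = directional_derivative_fd(point, a)
formula = dot(g, unit(a))
print("d(phi)/ds  finite difference:", fd)
print("d(phi)/ds  grad . a_hat     :", formula)

# Check 3: compare a few directions with the gradient direction itself.
grad_norm = math.sqrt(dot(g, g))
samples = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (-1, 2, 0.5)]
rates = [dot(g, unit(d)) for d in samples]
print("|grad(phi)|:", grad_norm)
print("rates in sample directions:", rates)
print("rate along grad(phi):", dot(g, unit(g)))
```

The script only confirms that the numbers agree for this one example; it does not explain why the formulas hold, which is what the questions above are asking.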