tl;dr: perspective-correct interpolation of NDCs doesn't give us what we want; we need to _linearly_ interpolate them in screen-space.
----------------------------------------
This answer uses math notation from [this paper](https://www.comp.nus.edu.sg/~lowkl/publications/lowk_persp_interp_techrep.pdf), _section 3: interpolating vertex attributes_. (_I highly recommend reading the entire paper_ to understand how perspective-correct interpolation of vertex attributes works; the following discussion is based on it.)
---
Why does the code in the tutorial work?
-------
The hardware does perspective-correct interpolation of vertex clip-space coordinates to yield a pixel's clip-space coordinates. Then the perspective divide (by the interpolated _w_-coordinate) in the pixel shader transforms that into NDC.
But let's dive a bit deeper and see what the interpolation really does under the hood:
Let $I_1$ and $I_2$ be _clip-space_ coordinates of two vertices, and let $Z_1$ and $Z_2$ be their respective view-space _z_-coordinate (or clip-space _w_-coordinate).
We want to find a pixel's NDCs, or mathematically: $I_t/Z_t$ where $I_t$ is its clip-space coordinates and $Z_t$ is the view-space _z_-coordinate.
Now recall that NDCs are coordinates of a 3D object _after projecting it on the 2D projection window_ (screen-space). So to find a pixel's NDCs in screen-space, it suffices to linearly interpolate the vertices' NDCs _in screen-space_. Mathematically:
$$
\tag{*}\label{*}
\frac{I_t}{Z_t}=\frac{I_1}{Z_1}+s\left(\frac{I_2}{Z_2}-\frac{I_1}{Z_1}\right)
$$
($s$ is a screen-space barycentric coordinate, see in the paper linked above).
To get this, we first let the hardware do perspective-correct interpolation of the vertex _clip-space_ coordinates. This is **equation (16)** in the paper and it gives us $I_t$ - the pixel's _clip-space_ coords:
$$
I_t=\left[\frac{I_1}{Z_1}+s\left(\frac{I_2}{Z_2}-\frac{I_1}{Z_1}\right)\right]Z_t
$$
Then in the pixel shader we divide by _w_ which is simply the $Z_t$ of a pixel (remember that _w_ gets interpolated as well), which gives us $I_t/Z_t$ (NDC) as desired.
So _the hardware does the perspective divide and interpolation for us_, which gives the pixel's NDCs, but it also multiplies by $Z_t$ which transforms from NDC back to clip-space. And what the perspective divide in the pixel shader really does is transform back to NDC by cancelling out the $Z_t$.
---
What happens if we do the division in the vertex shader instead?
---
The vertex NDCs are $I_1/Z_1$ and $I_2/Z_2$ and the perspective-correct interpolation equation gives:
$$
\left[\frac{I_1}{Z_1^2}+s\left(\frac{I_2}{Z_2^2}-\frac{I_1}{Z_1^2}\right)\right]Z_t
$$
which is clearly not $I_t/Z_t$ as defined in $\eqref{*}$.
The reason this equation doesn't work in this case is that its derivation is based on the fact that the attribute _varies linearly across the triangle in 3D space_ (view space). But screen-space NDC coords don't vary linearly in 3D space because of the perspective divide, so the premise is broken. (Clip-space coords _do_ vary linearly in view-space because they are the result of a linear transform.)
**It's incorrect to do a perspective-correct interpolation of values that have undergone perspective divide**. They are not in 3D space anymore, rather in 2D projection space and thus require **linear interpolation in screen-space** to get correct results.
And yes, to get correctly interpolated NDCs we need to do the perspective division _per-pixel_.