Your example only applies when the variable $X$ *should be in the model*. It certainly doesn't apply when one uses the usual least squares estimates. To see this, note that if we estimate $a$ by least squares in your example, we get:

$$\hat{a}=\frac{\frac{1}{N}\sum_{i=1}^{N}X_{i}Y_{i}}{\frac{1}{N}\sum_{i=1}^{N}X_{i}^{2}}=\frac{\frac{1}{N}\sum_{i=1}^{N}X_{i}Y_{i}}{s_{X}^{2}+\overline{X}^{2}}$$
where $s_{X}^2=\frac{1}{N}\sum_{i=1}^{N}(X_{i}-\overline{X})^{2}$ is the (sample) variance of $X$ and $\overline{X}=\frac{1}{N}\sum_{i=1}^{N}X_{i}$ is the (sample) mean of $X$.
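Here is a quick numerical sanity check of that identity in Python (the toy data, the $0.5$ slope and the noise level are just illustrative choices of mine, nothing from your example):

```python
import numpy as np

# Illustrative data only - the 0.5 slope and unit noise are arbitrary choices.
rng = np.random.default_rng(0)
N = 500
X = rng.normal(loc=2.0, scale=3.0, size=N)
Y = 0.5 * X + rng.normal(size=N)

a_hat = np.sum(X * Y) / np.sum(X**2)               # least squares through the origin
s2_X = np.mean((X - X.mean())**2)                  # sample variance (1/N convention)
a_hat_alt = np.mean(X * Y) / (s2_X + X.mean()**2)  # same estimate via s_X^2 + Xbar^2

print(np.isclose(a_hat, a_hat_alt))  # True
```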

Plugging this estimate in, the quantity your example relies on becomes

$$\hat{a}^{2}Var[X]=\hat{a}^{2}s_{X}^{2}=\frac{\left(\frac{1}{N}\sum_{i=1}^{N}X_{i}Y_{i}\right)^2}{s_{X}^2}\left(\frac{s_{X}^{2}}{s_{X}^{2}+\overline{X}^{2}}\right)^2$$

Now the second factor is never greater than $1$ (it equals $1$ only when $\overline{X}=0$, and tends to $1$ as $s_{X}^{2}\to\infty$), so we get an *upper bound* for the contribution to $R^2$ from the variable $X$:

$$\hat{a}^{2}Var[X]\leq \frac{\left(\frac{1}{N}\sum_{i=1}^{N}X_{i}Y_{i}\right)^2}{s_{X}^2}$$
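Again just as a sanity check (same sort of made-up data as above), both the identity and the bound can be verified numerically:

```python
import numpy as np

# Same kind of illustrative data as before - arbitrary choices, not from the post.
rng = np.random.default_rng(1)
N = 500
X = rng.normal(loc=2.0, scale=3.0, size=N)
Y = 0.5 * X + rng.normal(size=N)

m_XY = np.mean(X * Y)
s2_X = np.mean((X - X.mean())**2)
a_hat = m_XY / (s2_X + X.mean()**2)

lhs = a_hat**2 * s2_X                                      # contribution to R^2
rhs = (m_XY**2 / s2_X) * (s2_X / (s2_X + X.mean()**2))**2  # the identity above
bound = m_XY**2 / s2_X                                     # the upper bound

print(np.isclose(lhs, rhs), lhs <= bound)  # True True
```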

And so unless $\left(\frac{1}{N}\sum_{i=1}^{N}X_{i}Y_{i}\right)^2\to\infty$ as well, we will actually see $R^2\to 0$ as $s_{X}^{2}\to\infty$, because the numerator of $R^2$ goes to zero while its denominator tends to $Var[\epsilon]>0$. We may also get $R^2$ converging to something between $0$ and $1$, depending on how quickly the two terms diverge. Now $\left(\frac{1}{N}\sum_{i=1}^{N}X_{i}Y_{i}\right)^2$ will generally diverge faster than $s_{X}^{2}$ if $X$ should be in the model, and more slowly if it shouldn't. In both cases $R^2$ moves in the right direction.
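To see this behaviour concretely, here is a small simulation sketch (the $0.1$ coefficient, the unit-variance noise, and writing $R^2$ as explained/(explained + residual) are my own choices for illustration). Inflating $Var[X]$ drives $R^2$ towards $1$ only when $X$ really belongs in the model; when it doesn't, $R^2$ stays near $0$ no matter how large $s_{X}^{2}$ gets:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1000
eps = rng.normal(size=N)

def r2(X, Y):
    """R^2 of the no-intercept fit Y ~ a*X, written as explained/(explained + residual)."""
    a_hat = np.mean(X * Y) / np.mean(X**2)
    explained = a_hat**2 * np.mean((X - X.mean())**2)   # a_hat^2 * s_X^2
    resid_var = np.mean((Y - a_hat * X)**2)
    return explained / (explained + resid_var)

for sigma_X in [1, 10, 100, 1000]:
    X = rng.normal(scale=sigma_X, size=N)
    print(f"sigma_X = {sigma_X:4d}   "
          f"R^2 when X belongs: {r2(X, 0.1 * X + eps):.4f}   "
          f"R^2 when X is irrelevant: {r2(X, eps):.4f}")
```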

Also note that for any finite data set (i.e. a real one) we can never have $R^2=1$ unless all the errors are exactly zero. This basically indicates that $R^2$ is a relative measure rather than an absolute one: unless $R^2$ is actually equal to $1$, we can always find a better-fitting model. This is probably the "dangerous" aspect of $R^2$: because it is scaled to lie between $0$ and $1$, it looks like something we can interpret in an absolute sense.

It is probably more useful to look at how much $R^2$ changes as you add variables to the model. And last, but not least, it should never be ignored in variable selection, as $R^2$ is effectively a sufficient statistic for variable selection - it contains all the information on variable selection that is in the data. The only thing needed is to choose the change in $R^2$ which corresponds to "fitting the errors", which usually depends on the sample size and the number of variables.
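As a rough sketch of that last point (the data, the $0.3$ coefficient and the specific benchmark are my own illustrative choices, not a prescribed rule): one simple calibration is to compare the increase in $R^2$ from adding a candidate variable with $(1-R^2)/(N-p)$, which is roughly the increase you would expect from adding one pure-noise column; the ratio of these two quantities is just the partial $F$ statistic.

```python
import numpy as np

# Illustrative data: x2 genuinely helps (small coefficient), x3 is pure noise.
rng = np.random.default_rng(7)
N = 200
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
x3 = rng.normal(size=N)
y = 1.0 * x1 + 0.3 * x2 + rng.normal(size=N)

def r_squared(y, cols):
    """Ordinary R^2 of a least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean())**2)

r2_base = r_squared(y, [x1])
for name, x_new in [("x2 (relevant)", x2), ("x3 (noise)", x3)]:
    r2_full = r_squared(y, [x1, x_new])
    p_full = 3                                    # intercept + 2 slopes
    noise_gain = (1.0 - r2_full) / (N - p_full)   # approx. gain from fitting noise
    print(f"{name}: R^2 gain = {r2_full - r2_base:.4f}, "
          f"noise-level benchmark approx. {noise_gain:.4f}")
```

On the (hypothetical) data above, the gain from the relevant variable is an order of magnitude larger than the benchmark, while the gain from the noise variable is of the same size as the benchmark, which is the kind of comparison the paragraph above has in mind.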