
I have been normalizing vectors in my work, and there are generally two methods I follow. I assumed the two methods were equivalent until I found out they are not. They are described below.

  1. Take the squared norm of the vector and divide it by the vector's dimension; to normalize, divide the vector by the square root of this value. This corresponds to the operation given below for any vector $\mathbf{x}$: $$ E_{\mathbf{x}} = \frac{\|\mathbf{x}\|^2}{|\mathbf{x}|}\hspace{2mm} \text{and}\hspace{2mm} \hat{\mathbf{x}} = \frac{\mathbf{x}}{\sqrt{E_{\mathbf{x}}}}, $$ where $|\cdot|$ denotes the dimension of its argument.
  2. Use the variance function from the math library and divide the vector by the square root of its variance. The equivalent operations for a vector $\mathbf{x}$ are $$ \text{var}_{\mathbf{x}} = \mathbb{E}\big[(\mathbf{x} - \mathbb{E}\mathbf{x})^2\big] \hspace{2mm} \text{and}\hspace{2mm} \hat{\mathbf{x}} = \frac{\mathbf{x}}{\sqrt{\text{var}_{\mathbf{x}}}}. $$ (A NumPy sketch of both methods follows below.)
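In NumPy terms, the two methods look roughly like this (the helper names are mine, just to fix ideas):

```python
import numpy as np

def normalize_by_mean_square(x):
    # Method 1: divide by sqrt(||x||^2 / dim(x)), i.e. the root-mean-square of x
    e_x = np.linalg.norm(x) ** 2 / x.size
    return x / np.sqrt(e_x)

def normalize_by_std(x):
    # Method 2: divide by the square root of the variance
    return x / np.sqrt(np.var(x))
```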

The reason I believe both are equivalent is that in the norm case we take the origin to be $\mathbf{0}$ and sum the squared distances from the origin to each entry of the vector, while in the variance case we move the origin to the mean of the random variable and then sum the squared distances from that mean.
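If I am not mistaken, writing $\bar{x}$ for the sample mean and using a divide-by-$n$ variance, this intuition becomes the identity $$ \frac{\|\mathbf{x}\|^2}{n} = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 + \bar{x}^2, $$ so the two methods should coincide exactly when $\bar{x} = 0$.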

Now I implement these two methods in Python; the results follow.

```python
import numpy as np

a = np.random.randn(1000)
np.linalg.norm(a) ** 2 / 1000
# 1.006560252222734
np.var(a)
# 1.003290114164144
```

In these lines of code I generate 1000 standard normal samples. Methods 1 and 2 give me equal values in this case. However, when my samples are correlated, this is no longer true.

```python
import numpy as np

cov_matrix = 0.3 * np.abs(1 - np.eye(1000)) + np.eye(1000)
a = np.random.multivariate_normal(mean=np.zeros(1000), cov=cov_matrix)
np.linalg.norm(a) ** 2 / 1000
# 1.036685431728734
np.var(a)
# 0.6900743017090415
```

I generate a 1000-dimensional multivariate normal random vector whose covariance matrix has 1's on the diagonal and 0.3 in every off-diagonal entry. This is where I am confused: method 1 and method 2 return different values.

Why is this the case? Why do both methods return the same value in the i.i.d. case but different values when the vector is correlated? Thanks.

  • Your definition of $E_{\mathbf{x}}$ seems to be a random variable, while $\sqrt{\operatorname{Var}(X)}$ is a real number (a constant). They are not the same type of thing. (Commented Dec 14, 2020 at 15:34)
  • PS: I assume "cardinality of the argument" means "dimension of the vector." (Commented Dec 14, 2020 at 15:35)
  • @Michael Yes, I mean the dimension of the vector; I will update my question. (Commented Dec 14, 2020 at 15:35)
  • @Michael Yes, $E_{\mathbf{x}}$ is a random variable. But for a given $\mathbf{x}$, the value is a constant, just like $\text{var}_{\mathbf{x}}$, right? (Commented Dec 14, 2020 at 15:40)
  • You are doing vector-based operations in the computer, which is not the same as an actual (probabilistic) variance. So your definitions for $\text{var}_{\mathbf{x}}$ are a bit misleading, as you are really computing a sample variance. (Commented Dec 14, 2020 at 16:02)

1 Answer


Your definitions are really vector-based operations as implemented by MATLAB or Python, not the probabilistic variance. So let me define them more clearly (I will use the MATLAB definitions; I assume the Python definitions are the same). You are dealing with $n$-dimensional random vectors $X=(X_1, \dots, X_n)$. Then:

Definition 1: $$E_X = \frac{\sum_{i=1}^n X_i^2}{n}$$

Definition 2: $$M_X = \frac{1}{n}\sum_{i=1}^n X_i$$

Definition 3: $$V_X = \frac{\sum_{i=1}^n (X_i-M_X)^2}{n-1}$$

Notice that $E_X, M_X, V_X$ are all random variables.
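For concreteness, here is a direct NumPy transcription of these definitions (a sketch; note that NumPy's own np.var divides by $n$ by default, so Definition 3 corresponds to np.var(x, ddof=1)):

```python
import numpy as np

def E_X(x):
    # Definition 1: average of the squared entries (the mean square)
    return np.sum(x ** 2) / x.size

def M_X(x):
    # Definition 2: sample mean
    return np.mean(x)

def V_X(x):
    # Definition 3: sample variance with the n-1 divisor, i.e. np.var(x, ddof=1)
    return np.sum((x - np.mean(x)) ** 2) / (x.size - 1)
```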

Observation 1:

If $\{X_i\}_{i=1}^{\infty}$ are i.i.d. with mean $E[X_1]$ and second moment $E[X_1^2]$, then by the law of large numbers $$ \lim_{n\rightarrow\infty} E_X = E[X_1^2] \quad \mbox{(with prob 1)} $$ So in the special case when $\{X_i\}_{i=1}^{\infty}$ are i.i.d. Gaussian $N(0,1)$, we get $E_X\rightarrow 1$ with prob 1. Since $n=1000$ is "large," your numerical value $E_X=1.006560252222734$ makes sense. If you independently repeat the experiment you will get a new number for $E_X$ but, with high probability, it will still be very close to $1$. You would get similar results with any i.i.d. variables $\{X_i\}_{i=1}^{\infty}$ satisfying $E[X_1^2]=1$, not necessarily Gaussian.
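A quick simulation of this convergence (a sketch; the seed and sample sizes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (arbitrary)
for n in (100, 10_000, 1_000_000):
    x = rng.standard_normal(n)
    # E_X should approach E[X_1^2] = 1 as n grows
    print(n, np.linalg.norm(x) ** 2 / n)
```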

Observation 2:

In the special case when $\{X_i\}_{i=1}^{\infty}$ are i.i.d. Gaussian $N(0,1)$, a surprising but classic result of statistics says that $(n-1)V_X$ has a chi-square distribution with $n-1$ degrees of freedom. In particular, its mean is $n-1$ and its variance is $2(n-1)$. So $V_X = \frac{(n-1)V_X}{n-1}$ has mean $1$ and variance $\frac{2(n-1)}{(n-1)^2} = \frac{2}{n-1} \approx 0$ for large $n$. So for $n=1000$ your numerical value $V_X = 1.003290114164144 \approx 1$ makes sense. If you independently repeat the experiment you will get a different numerical value for $V_X$ but, with high probability, it will still be very close to $1$.
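One can check this concentration empirically by simulating many independent copies of $V_X$ (a sketch; seed and trial count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 1000, 5000
# Each entry is one realization of V_X (sample variance with n-1 divisor)
v = np.array([np.var(rng.standard_normal(n), ddof=1) for _ in range(trials)])
print(v.mean())              # close to 1, the mean of V_X
print(v.var(), 2 / (n - 1))  # empirical variance vs. the chi-square prediction
```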

When you remove the i.i.d. assumption and make the $\{X_i\}$ values correlated, $V_X$ no longer has the same distribution. That is why you get different numerical results in that case.
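In fact, for NumPy's divide-by-$n$ variance the two statistics always differ by exactly the squared sample mean (see the identity in the comments below), and under this equicorrelated covariance the sample mean does not concentrate at zero. A sketch of the comparison (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
# 1's on the diagonal, 0.3 everywhere off the diagonal
cov = 0.3 * (np.ones((n, n)) - np.eye(n)) + np.eye(n)
a = rng.multivariate_normal(np.zeros(n), cov)
print(np.linalg.norm(a) ** 2 / n)   # Definition 1 (E_X)
print(np.var(a))                    # NumPy's divide-by-n variance
print(np.var(a) + np.mean(a) ** 2)  # equals E_X up to rounding
```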

  • +1. So, suppose I had to normalize the multivariate vector: which method should I be using? (Commented Dec 14, 2020 at 17:20)
  • Your value $E_X$ can be viewed as an estimate of the second moment of the $\{X_i\}$ variables. Your value $V_X$ can be viewed as an estimate of the variance (it is often called the "sample variance"). The second moment is only the same as the variance when the mean is zero. If you already know the mean is zero, you might want to use $E_X$. To be robust to cases where the mean is unknown (possibly nonzero), and to conform to more widely used estimation methods, I would suggest the sample variance $V_X$. Of course, it may depend on what you want to do. (Commented Dec 14, 2020 at 18:25)
  • By mean and variance I meant the probabilistic versions for a random variable $Z$: $$Var(Z) = E[Z^2] - E[Z]^2.$$ You can also check that the sample-path versions always satisfy $$V_X = \frac{n}{n-1}E_X - \frac{n}{n-1} M_X^2,$$ so regardless of distribution, and regardless of i.i.d., we have $V_X \approx E_X$ whenever $M_X \approx 0$ and $\frac{n}{n-1} \approx 1$. PS: You should double-check how Python defines the sample variance, i.e. whether it divides by $n-1$ or by $n$. Dividing by $n-1$ produces an unbiased estimator for i.i.d. samples, but I think Python divides by $n$. (Commented Dec 14, 2020 at 18:55)
  • For any vector $a$ of dimension $n$, I believe Python will give $$ \texttt{np.var(a)} = -\texttt{np.mean(a)}\cdot\texttt{np.mean(a)} + \texttt{np.linalg.norm(a)}^2 / n. $$ (Commented Dec 14, 2020 at 19:04)
  • To divide by $n-1$ we can indeed use $$ \frac{1}{n-1} = \left(\frac{1}{n}\right)\left( \frac{n}{n-1}\right). $$ My prior comment shows that, even with Python using the divide-by-$n$ method (instead of $n-1$), we have $\texttt{np.var(a)} \approx \texttt{np.linalg.norm(a)}^2/n$ whenever $\texttt{np.mean(a)} \approx 0$. For your numbers in the correlated case we get $$\underbrace{\texttt{np.var(a)}}_{0.6900743017090415} = -\texttt{np.mean(a)}^2 +\underbrace{\texttt{np.linalg.norm(a)}^2/n}_{1.036685431728734}, $$ and so I think in your case you had $\texttt{np.mean(a)} = 0.58873689371373$ or $\texttt{np.mean(a)} = -0.58873689371373$. (Commented Dec 14, 2020 at 19:59)
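For reference, NumPy exposes the choice of divisor through the ddof argument of np.var, and the decomposition above is easy to check numerically (a sketch):

```python
import numpy as np

a = np.random.randn(1000)
n = a.size
print(np.var(a))          # default ddof=0: divides by n
print(np.var(a, ddof=1))  # divides by n - 1 (the unbiased sample variance)
# The decomposition from the comment above: both printed values agree
print(np.var(a) + np.mean(a) ** 2, np.linalg.norm(a) ** 2 / n)
```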
