[Solved]: Maximum variance and useful information of dataset

Problem Detail:

I am reading about PCA, and it says that the principal component with the maximum variance carries most of the information. Can we apply that to any data set? If a data set has n attributes and most of the attributes show high variance, can we infer that the data set has captured a lot of useful information?

I am trying to understand why a high-variance data set would contain useful information.

I'll answer this, even though the question is not well defined and would possibly be better suited for Cross Validated. But since it has some connection to information, it's not totally off topic here.

Short version: No, we can't apply that to any data set. Variance alone is not a good measure of information. PCA chooses the components with the highest eigenvalues because they explain most of the input variance (i.e. the squared Euclidean distance from the origin, for centered data). That's why you should normalize your input if you don't want to emphasize a certain input component.
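To make the normalization point concrete, here is a minimal sketch in Python (standard library only). It takes one attribute and a copy of it expressed in units ten times smaller, then standardizes both. The Gaussian samples, the fixed seed and the helper `standardize` are assumptions made for illustration only.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed, arbitrary choice
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # attribute in its original unit
ys = [10.0 * x for x in xs]                      # same attribute, rescaled by 10


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Before standardizing, the rescaled copy has 100 times the variance.
var_ratio_raw = statistics.pvariance(ys) / statistics.pvariance(xs)

# After standardizing, both attributes have variance 1 and coincide.
zx, zy = standardize(xs), standardize(ys)
var_zx = statistics.pvariance(zx)
var_zy = statistics.pvariance(zy)
max_diff = max(abs(a - b) for a, b in zip(zx, zy))

print(var_ratio_raw, var_zx, var_zy, max_diff)
```

The rescaled copy dominates the variance only because of its unit; after standardization the two attributes are indistinguishable.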

Long version:

Let's say you have two random variables $X$ and $Y:=10X$ where $E(X)=E(Y)=0$, $\mathrm{Var}(X)=1$ and thus $\mathrm{Var}(Y) = 100$. Now assume a third variable $K$ uniformly distributed among $\{1,\dots,k\}$ (this models drawing a random individual from your sample).

The variance before the PCA is not a sufficient measure for information, e.g. if we use $X$ and $Y$ to distinguish the individuals: $$P(K=k|X=x) = P(K=k|Y=10x)$$

The covariance matrix looks like this: $$\left(\matrix{1&10\\10 & 100}\right)$$
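As a quick check, the entries follow from the definitions of $X$ and $Y$, using $E(X)=E(Y)=0$ and $Y=10X$: $$\mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y) = 10\,E(X^2) = 10\,\mathrm{Var}(X) = 10, \qquad \mathrm{Var}(Y) = \mathrm{Var}(10X) = 100\,\mathrm{Var}(X) = 100$$

The determinant is $1\cdot 100 - 10\cdot 10 = 0$ and the trace is $1 + 100 = 101$, so the two eigenvalues must be $101$ and $0$.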

Using R's "eigen" we get:

    $values
    [1] 101   0

    $vectors
               [,1]        [,2]
    [1,] 0.09950372 -0.99503719
    [2,] 0.99503719  0.09950372
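For readers without R at hand, the same decomposition can be sketched in plain Python using the closed form for a symmetric 2×2 matrix. This is not how R's "eigen" computes it; the function `eigen_sym_2x2` is a hypothetical helper written for this example, and it assumes a non-zero off-diagonal entry. Eigenvectors are only defined up to sign, so the second vector may come out with the opposite sign to the R output.

```python
import math


def eigen_sym_2x2(a, b, d):
    """Eigenvalues and unit eigenvectors of the symmetric matrix [[a, b], [b, d]].

    Assumes b != 0, so that (b, lam - a) is a non-zero eigenvector.
    """
    trace = a + d
    det = a * d - b * b
    root = math.sqrt(trace * trace - 4.0 * det)
    values = [(trace + root) / 2.0, (trace - root) / 2.0]  # largest first
    vectors = []
    for lam in values:
        vx, vy = b, lam - a  # solves (a - lam) * vx + b * vy = 0
        norm = math.hypot(vx, vy)
        vectors.append((vx / norm, vy / norm))
    return values, vectors


# Covariance matrix of (X, Y) with Y = 10 X and Var(X) = 1.
values, vectors = eigen_sym_2x2(1.0, 10.0, 100.0)

# Share of the total variance explained by each component.
explained = [v / sum(values) for v in values]

print(values)     # [101.0, 0.0]
print(vectors)    # approximately (0.0995, 0.9950) and (0.9950, -0.0995)
print(explained)  # [1.0, 0.0]
```

The first component explains all of the variance, which is what you would expect when $Y$ is just a rescaled copy of $X$.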

If we now transform $X$ and $Y$ to $X'=M_{11}\cdot X + M_{21}\cdot Y$ and $Y'=M_{12}\cdot X + M_{22}\cdot Y$ (where $M$ is the "vectors" matrix from R), $X'$ contains "all the information": $$\forall y\;:P(K=k|X'=x) \neq P(K=k|Y'=y)=P(K=k)$$
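The transformation can be tried numerically with a small Python sketch. It assumes Gaussian samples for $X$ (the answer only fixes mean 0 and variance 1) and writes the entries of $M$ exactly as $1/\sqrt{101}$ and $10/\sqrt{101}$, which round to the values printed by R. The point it illustrates is that $Y'$ is zero for every sample, so observing it says nothing about $K$, while $X'$ is just $\sqrt{101}\cdot X$.

```python
import math
import random

# Eigenvector matrix M, written exactly:
# 0.09950372 ~ 1/sqrt(101) and 0.99503719 ~ 10/sqrt(101).
s = math.sqrt(101.0)
M = [[1.0 / s, -10.0 / s],
     [10.0 / s, 1.0 / s]]

rng = random.Random(0)  # fixed seed, arbitrary choice
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # samples of X (Gaussian is an assumption)
ys = [10.0 * x for x in xs]                      # Y := 10 X

# X' = M11 * X + M21 * Y and Y' = M12 * X + M22 * Y
x_prime = [M[0][0] * x + M[1][0] * y for x, y in zip(xs, ys)]
y_prime = [M[0][1] * x + M[1][1] * y for x, y in zip(xs, ys)]

# Y' vanishes (up to rounding); X' equals sqrt(101) * X (up to rounding).
max_abs_y_prime = max(abs(v) for v in y_prime)
max_x_prime_error = max(abs(xp - s * x) for xp, x in zip(x_prime, xs))

print(max_abs_y_prime, max_x_prime_error)
```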

But, of course, again $X''=10^{-100}\cdot X'$ contains "the same information".
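Spelled out, the rescaling shrinks the variance enormously but, being invertible, leaves the conditional distribution of $K$ untouched: $$\mathrm{Var}(X'') = 10^{-200}\cdot\mathrm{Var}(X') = 101\cdot 10^{-200}, \qquad P(K=k|X''=10^{-100}x) = P(K=k|X'=x)$$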