Cheap and Secure Web Hosting Provider : See Now

[Solved]: Under what conditions is K-means clustering transformation-invariant?

, , No Comments
Problem Detail: 

Given a set of data points $X = \{x_1, x_2, \ldots, x_m\}$ where $x_i \in \mathbb{R}^d$ we run K-means on $X$ and obtain the clusters $c_1, c_2, \ldots, c_k$.

Now, if we create a new dataset $Y = \{y_1, y_2, \ldots, y_m\}$ where $y_i = Ax_i + b$ and $y_i \in \mathbb{R}^d$ and run K-means on $Y$ to get clusters $g_1, g_2, \ldots g_k$.

Under what conditions of $A$ and $b$ are we guaranteed to get the same clusters?

Let's assume that K-means is using the euclidean distance and has same initial conditions on both algorithms, that is, if the initial centers for X are $c^0_1, \ldots, c^0_k$ then the initial centers for Y are $g^0_1, \ldots, g^0_k$ where $g^0_i = Ac^0_i + b$.

So far I've thought that $A$ has to be full rank and $b$ can be any vector. However, I haven't been able to prove it.

Asked By : Ana Echavarria

Answered By : Yuval Filmus

The answer depends on your K-means algorithm, but what follows should work for standard algorithms.

You will get the same result if your transformation $T$ satisfies two conditions:

  1. It preserves distances: $d(z,w) = d(T(z),T(w))$, where $d$ is your metric, say $d(z,w) = \|z-w\|$.
  2. It preserves averages: if $\sum_i p_i z_i$ is a convex combination that $T(\sum_i p_i z_i) = \sum_i p_i T(z_i)$.

You can check this by going over the algorithm, showing that it always makes the same choices.

Best Answer from StackOverflow

Question Source :

 Ask a Question

 Download Related Notes/Documents


Post a Comment

Let us know your responses and feedback