Cheap and Secure Web Hosting Provider : See Now

[Solved]: Finding similar high dimensional real vectors

, , No Comments
Problem Detail: 

I have a collection of vectors $v_1,v_2\in [0,1]^n$ and I want to find similar pairs quickly. For similarity, I want to use the Euclidean distance metric $L: [0,1]^n \times [0,1]^n \longrightarrow R$. $n$ will be around $200$.

I've implemented Locality Similar Hashing (LSH) for word based documents, and I'm curious if there's an adaptation of it which will work for real based vectors.

I read here on Computer Science Stackoverflow that universal hashing will work.

If I have the vector $s=(0.55, 0.34, 0.90, 0.99)$ I can break it up into two sub vectors $s_1=(0.55, 0.34)$ and $s_2=(0.90, 0.99)$ and apply a universal hash to each:

$h_n= a\cdot s_n \space mod \space p$

Where $a$ is a random vector and $p$ is prime. I can now define a space of bins, where each $h_n$ defines a bin in this space. Each vector with a subvector in $h_n$ is mapped to the bin $h_n$

The problem is that this method is too precise. The vectors $(0.57, 0.34)$ and $(0.55, 0.34)$ are very similar but will get mapped to different bins.

I'm wondering if it's there's any theoretical and practical issues with either truncating the last digit or rounding up/down such that both vectors are now mapped to a bin representing the vector $(0.5, 0.3)$

This I believe is equivalent to multiplying all the vector elements by 10 and truncating the decimals, leading back to normal LSH.

I'm quite new at all this, so please forgive the slight errors in notation or understanding.

Asked By : AdrianGW

Answered By : D.W.

One approach: Quantization

One approach is to quantize the values, then apply an existing LSH that works on discrete values. In other words, you split up the space $[0,1]$ into several ranges, map each real number to its containing range, and then apply an existing LSH that works with discrete values. This is essentially equivalent to first truncating each coordinate of the vector, then applying some discrete LSH.

For example, suppose you split the interval $[0,1]$ up into ten ranges, $[0,0.1)$, $[0.1,0.2)$, ..., $[0.8,0.9)$, and $[0.9,1.0]$. Now you can map any value in $[0,1]$ to its corresponding range -- this is basically just truncating to keep only the first digit in the decimal representation. Map each interval to a digit from 0 to 9: e..g, $[0,0.1) \mapsto 0$, etc. In this way, given any vector $v \in [0,1]^n$, you can get a new vector $w \in \{0,1,\dots,9\}^n$. For instance, the vector $v=(0.55,0.34,0.90,0.99) \in [0,1]^4$ maps to the new vector $w=(5,3,9,9) \in \{0,\dots,9\}^n$.

The new vector $w$ is in a discrete space, rather than in a continuous space. Now if you have a LSH that works on this discrete space, you can apply it to $w$. For instance, suppose $h_1:\{0,\dots,9\}^n \to \mathbb{R}$ is a locality-sensitive hash (LSH) on this discrete space. Let $f:[0,1]^n \to \{0,\dots,9\}^n$ be the mapping that sends each real number to its corresponding range, i.e., $f(v)=(g(v_1),\dots,g(v_n))$ where $g(x)= \lfloor 10x \rfloor$ is the truncation map $g:[0,1] \to \{0,\dots,9\}$. Define $h_0 : [0,1]^n \to \mathbb{R}$ by $$h_0(v) = h_1(f(v)).$$ Then $h_0$ is a LSH on continuous values.

If you plan to generate multiple different LSHs, it's probably slightly better to apply the following tweak. Pick a random $n$-vector number $r \in [0,0.1)^n$. Now the LSH for $[0,1]^n$ is defined to be $h_0(v) = h_1(f(v+r))$ as above, where $f:[0,1]^n \to \{0,1,\dots,10\}^n$ is the truncation map defined as above and $h_1:\{0,1,\dots,10\}^n\to\mathbb{R}$ is a LSH on discrete values. This tweak lets you generate multiple independent LSH's for $[0,1]^n$. If you are only picking a single hash function, you can skip this step.

There's nothing special about base-10. Instead of using base-10, you can also use base-2 or any other base. The performance will depend upon the base you use as well as the specific discrete LSH you choose, so you might need to play with the options to see which is best overall. I suspect you will find that the optimal choice of base is base $b=2$ or $b=3$ (and, if your discrete LSH is based upon hashing a random subset of the coordinates, choosing a subset of size proportional to $\sqrt{n}/b$), but it's not possible to give a hard-and-fast rule: the optimum depends to some extent on the distribution of your data. That's why I suggest trying a few different parameter choices to see which seems to work best on your data.

This is basically the "truncation" idea in your original post (truncate all real values, then apply a discrete LSH), but now we apply it to the entire vector, rather than trying to split the vector in half. Of course, the discrete LSH might itself work by picking a few randomly chosen coordinates and hashing only those coordinates (e.g., using a universal hash).

Another approach: Use a LSH designed for continuous values

There are also some LSH's that are designed specifically for continuous real numbers. I don't know if there are any for the Euclidean distance.

A totally different approach: nearest-neighbor data structures

Finally, it's worth mentioning that there are also other approaches to this problem that don't involve LSH's at all. There are some data structures that are designed to support nearest-neighbor search (or approximate nearest-neighbor search). You might look at metric trees, k-d trees, and nearest-neighbor search.

However, beware of the curse of dimensionality. If the dimension $n$ is large, none of these approaches are likely to work well: high dimensions are just plain hard. For instance, one rule of thumb I've seen is that k-d trees tend not to work well when $n \ge 30$. I'm not sure any of these techniques will be terribly effective when $n$ is large: nearest-neighbor search in high dimensions is hard.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/44997

3.2K people like this

 Download Related Notes/Documents

0 comments:

Post a Comment

Let us know your responses and feedback