Cheap and Secure Web Hosting Provider : See Now

# [Answers] Proof Knuth S algortihm correctness

, ,
Problem Detail:

In the programming pearls book by Jon Bentley, there is a section about the problem of finding a random set of m integers from range 0 to n-1 integers. To do so they use Knuth's algorithm given by the following code (see also here). How can I proof that this code indeed chooses every set of m integers from the range 0 to n-1 with equal probability.

def select_random(n, m):     to_pick = n     remaining = m     positions = []     for position in range(n):         big_int = generate_random_big_int()         if big_int % remaining < to_pick:             position.append(position)              to_pick -= 1         remaining -= 1     return positions 

#### Answered By : Petar Mihalj

I have finally succeded in proving Knuth algorithm S. Lets state it first:

Knuth algorithm S takes a sample of $M$ elements from $N$ inputs, N is not known until the end of inputing. A uniform sample will be available after each input and it will remain uniform when you do all inputs.

Algorithm:

1. First $M$ numbers should definately take part in your sample, so until $N>M$, use all numbers in your sample.

2. Decide if $x_{th}$ number will take part in your sample, and if it does, put it in place of a random element already in sample. Probability for the first statement should be $\frac{M}{x}$

3. Repeat 2 for all n items.

We now have to prove that this algorithms gives each element an equal chance of being in a random sample at each subsequent step, and at the end. If we can prove that it works for $n$ (considered as individual step), it will work for $N$ (total number), since the total number is just another step.

Basically it must fix up the probabilities without knowing how many numbers will be inputted.

Algorithm is correct if and only if the probability of each item to get into sample is the same, more precisely - equal to $M/N$. Let us consider how likely is for the number to get in the sample after n steps.

$P_G(x)$ - chance of item $x$ getting in the sample after $n$ steps.

$P_E(x)$ - chance of item getting in when it is first introduced

$P_R(x)$ - chance of item remaining in the sample, in other words not getting kicked out by the remaining items

$P_S(i)$ - chance of item not getting kicked out by one of the remaining items, $i_{th}$

$P_G(x) = P_E(x) * P_R(x)$ since the item has to get in and remain inside

$P_R(x) = \prod_{i=x+1}^nP_S(i)$ - the chance of item surviving all remaining turns is equal to a product of surviving each of the individual turns.

Chance of item surviving a turn is equal to $1 - nS$ (not surviving).
$Ns$ is equal to another item both:
1. getting into sample $\frac{M}{i}$
2. kicking exactly our item out $\frac{1}{M}$

$P_S(i) = 1 - \frac{M}{i} * \frac{1}{M} = 1 - \frac{1}{i} = \frac{i-1}{i}$

It turns out that:

$P_G(x)=\frac{M}{x} * \prod_{i=x+1}^n(\frac{i-1}{i}) = M/N$
(If you notice that product is basically $\frac{n+1}{n}$, you can write it out and terms cancel out.)

Exactly what we wanted to prove.

Hopefully I did not make it too hard to read.