Cheap and Secure Web Hosting Provider : See Now

# [Solved]: Sorting when there are only O(log n) many different numbers

, ,
Problem Detail:

We have $n$ integers with lot's of repeated numbers. In this list, the number of distinct elements is $O(\log n)$. What's the best asymptotic number of comparisons for sorting this list?

Any idea or hint or pseudo code? In fact I want to learn pseudo code.

#### Answered By : Chao Xu

Because you asked for minimum number of comparisons, so I assume the algorithm can only compare the numbers.

The idea is to extend the sorting lower bound argument. Assume you want to sort $n$ elements knowing there exist at most $k$ distinct values. There are $n!$ ways to permute the elements, but many of them are equivalent. If there are $n_i$ element of the $i$th values. Each permutation is equivalent to $\prod_{i=1}^k n_i!$ other permutations. So the total number of distinct permutations is

$$\frac{n!}{\prod_{i=1}^k n_i!}$$

The number of required comparisons is bounded below by $$\log_2 \left( n!/\min \{ \prod_{i=1}^k n_i! \big| \sum_{i=1}^k n_i = n, n_i\geq 0\text{ for all } i\} \right)$$

Good thing that minimization part can be easily shown by extend factorial to the continuous domain. $\min \{ \prod_{i=1}^k n_i! \big| \sum_{i=1}^k n_i = n, n_i\geq 0\text{ for all } i\}$ is attained when $n_i=n/k$. (note the $\log$ in the next computation is base $e$ for convenience)

$$\log \left( n!/{(n/k)!^k} \right) = \log (n!) - k \log ((n/k)!) = n\log(n) - n\log(n/k) + O(\log n)= n(\log(n)-\log(n)+\log(k)) + O(\log n) = \Omega(n\log k)$$

$\log(n!) = n\log n - n + O(\log n)$ is Ramanujan's approximation.

To get an upper bound. Just consider storing the unique values in a binary search tree, and each insert we either increase the number of occurrence of an element in the BST, or insert a new element into the BST. Finally, print the output from the BST. This would take $O(n\log k)$ time.

Since both the lower bound and upper bound works for all $k$, the algorithm would take $O(n\log \log n)$ time for your problem.

I just figured out from @Pseudonym's comment that this proof also proves that we need at least $nH$ comparisons where $H$ is the entropy of the alphabet, so I might as well add this to the answer.

Let $c = \log 2$ and $p_i = n_i/n$. The entropy of the alphabet where the $i$th letter appears $n_i$ time is $H=-\sum p_i \log_2 p_i$. $nH = -\sum n_i (\log_2(n_i)-\log_2(n)) = \sum n_i (\log_2(n) - \log_2(n_i)) = c \sum n_i (\log(n) - \log(n_i))$.

\begin{align*} \log_2 \left( n!/\prod_{i=1}^k n_i! \right) &= \log_2(n!)-\sum_{i=1}^k \log_2(n_i!) \\ &= \log_2(n!)-\sum_{i=1}^k \log_2(n_i!) \\ &= c (\log(n!)-\sum_{i=1}^k \log(n_i!)) \\ &= c (n \log n-n + O(\log n) - \sum_{i=1}^k n_i \log(n_i)-n_i+O(\log n_i)) \\ &\geq c (n \log n-n - \sum_{i=1}^k n_i \log(n_i)-n_i) \\ &= c (-n + \sum_{i=1}^k n_i(\log(n) - \log(n_i))+n_i) \\ &= c (\sum_{i=1}^k n_i(\log(n) - \log(n_i))) \\ &= nH \end{align*}