**Problem Detail:**

As I understand it, Google or a large SQL database can build the word -> pages index, since indexed search has complexity O(1) -- a lookup gives you a billion-page result set at once:

```
computer           -> About 2.14 bln results
science            -> About 1.93 bln results
computer science   -> About 377 mln results
"computer science" -> About 147 mln results
science computer   -> About 0.97 bln results
"science computer" -> About 452k results
```

So you get 2 billion pages immediately. To find the pages that contain both words, you seem to need to compute an intersection. How do they do it instantly? I do not believe they try all ~10^18 pairs:

```python
for p in pages(word1):
    for q in pages(word2):
        if p == q:
            yield p
```

###### Asked By : Little Alien

###### Answered By : D.W.

If you're asking how Google actually works, you'd have to ask Google.

If you're asking how a search engine could plausibly achieve this functionality in a reasonable amount of time (say, within 1 second), here are several strategies, any one of which could work fine:

**Use an SSD.** Normally, a search index maps each search term to a *sorted* list of document IDs (the IDs of the documents that contain that term). Because these lists are sorted, you can compute the intersection of two lists in time linear in their combined length, using the "merge" step from mergesort. (This is far faster than the quadratic algorithm you list.) As a result, the intersection can be computed in one pass, in a streaming fashion. For the parameters you mention, the bottleneck is the time to read those sorted lists from disk. If you store them on an SSD, this can be relatively fast: some modern SSDs achieve a read bandwidth of perhaps 2500 MB/s, so if each document ID is 4 bytes, reading 4 billion document IDs takes about half a second. The computation time is negligible in comparison, so the intersection can be computed in under a second if the indices are stored on a fast SSD.
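A minimal sketch of that merge-style intersection, assuming the posting lists are plain sorted Python lists (the document IDs below are made up for illustration):

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists of document IDs in O(len(a) + len(b)) time."""
    i, j = 0, 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])   # ID appears in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the list with the smaller head
        else:
            j += 1
    return result

# Example with made-up document IDs:
print(intersect_sorted([1, 4, 7, 9, 12], [2, 4, 9, 15]))  # [4, 9]
```

Each iteration advances at least one pointer, which is why the whole intersection is one streaming pass over both lists.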

**Parallel computation.** If you split the indices across multiple SSDs (say, using RAID or striping), you can increase the read bandwidth. If you use a cluster of machines, each with its own SSD, you can parallelize both the data transfer and the computation. The intersection task is trivially parallelizable, so you can achieve excellent speeds this way. For instance, even with magnetic hard disks at a read bandwidth of 200 MB/s, striping the indices and computation across 8 machines is enough to finish the task in under a second.

**Caching.** You only need to do this work once. Having computed the size of the intersection, you can store it in a table and cache it for some period (a month or more), as it is unlikely to change much; then, when someone else issues the same query, you can return the number of results instantly.

**Estimation via random sampling.** An even better approach is to avoid computing the entire intersection and instead estimate its size by random sampling. Repeatedly do the following: randomly choose a document that contains `computer`, and check whether it also contains `science`. Each trial (randomly picking a document that contains `computer`) can be done in one operation, by picking a random offset into the sorted list for `computer`; it is then easy to check whether that document also contains the word `science` (say, by binary search into the sorted list for `science`). Repeat this 1000 times, and you get a good estimate of the fraction of documents containing `computer` that also contain `science` -- this uses the same principles as political polling, where we survey a random sample of people and extrapolate to draw inferences about the entire population. Multiply that fraction by the total number of documents containing `computer`, and you have an estimate of the size of the intersection.
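A minimal sketch of this sampling estimator, again assuming the posting lists are plain sorted Python lists of document IDs (the lists and the trial count are made up for illustration):

```python
import random
from bisect import bisect_left

def estimate_intersection(postings_a, postings_b, trials=1000):
    """Estimate |A ∩ B| by sampling random doc IDs from A and
    checking membership in B with binary search."""
    hits = 0
    for _ in range(trials):
        doc = random.choice(postings_a)        # random document containing word A
        j = bisect_left(postings_b, doc)       # binary search in sorted list B
        if j < len(postings_b) and postings_b[j] == doc:
            hits += 1
    # Scale the sampled fraction up to the full list for word A.
    return hits / trials * len(postings_a)

# Made-up data: 10,000 docs contain word A, 5,000 of them also contain word B.
a = list(range(0, 20000, 2))
b = list(range(0, 20000, 4))
print(estimate_intersection(a, b))  # estimate close to the true size, 5000
```

Each trial costs O(log n) for the binary search, so 1000 trials are far cheaper than streaming through billions of IDs; the price is a sampling error that shrinks as the number of trials grows.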

You can apply similar ideas to count the number of documents containing the phrase `"computer science"`.

Question Source : http://cs.stackexchange.com/questions/57936
