Cheap and Secure Web Hosting Provider : See Now

Taking intersection in large search

, , No Comments
Problem Detail: 

As I understand, you can build the the word -> pages index in Google or large SQL database since indexed search has complexity O(1) -- lookup gives you a billion-page result at once

сomputer -> About 2.14 bln results  science -> About 1.93 bln results 

computer science -> About 377 mln results "computer science" -> About 147 mln results science computer -> About 0.97 bln results "science computer" -> About 452k results 

So, you have got 2 billion pages immediately. To find out the pages which contain both words, you seem to need to make an intersection. How do they make it immediately? I do not believe that they try all 10^18 matches

for p in pages(word1):  for q in pages(word2):   if p == q yield p 
Asked By : Little Alien
Answered By : D.W.

If you're asking how Google actually works, you'd have to ask Google.

If you're asking how a search engine could plausibly achieve this functionality in a reasonable amount of time (say, within 1 second), here are several strategies, any one of which could work fine:

  • Use a SSD. Normally, a search index is a map from a search term to a sorted list of document IDs (the IDs of the documents that contain that search term). Because these lists are sorted, you can compute the intersection of two lists in time linear in the length of those lists, by using the "merge" algorithm from mergesort. (This is faster than the quadratic algorithm you list.)

    As a result, the intersection can be computed in one pass, a streaming fashion. For the parameters you mention, the bottleneck will be the time to read in those sorted lists from disk. If you store them on a SSD, this can be relatively fast. Some modern SSDs can achieve a read bandwidth of perhaps 2500 MB/s. If each document ID is 4 bytes, reading in 4 billion document IDs will take about a half a second. The computation time is negligible in comparison, so the intersection could be computed under a second if indices are stored on a fast SSD.

  • Parallel computation. If you split the indices across multiple SSDs (say, using RAID or striping), you can increase the read bandwidth. If you use a cluster of machines, each with their own SSD, you can parallelize both data transfer and computation. The intersection task is trivially parallelizable, so you can achieve excellent speeds this way. For instance, if you're using magnetic hard disks with read bandwidth of 200MB/s, striping the indices and computation across 8 machines is enough to finish this task in under a second.

  • Caching. You only need to do this work once. Once you've computed the size of the intersection once, you can store it in a table and cache it for some time period (a month or more), as it's unlikely to change much in the future, and then once someone else asks the same cache, you can return the number of results instantaneously.

  • Estimation via random sampling. An even better approach is to avoid computing the entire intersection, but just estimate its size through random sampling. You repeatedly do the following: randomly choose a document that contains computer, and check whether it also contains science. Each trial (randomly picking a document that contains computer) can be done in one operation, by picking a random offset into the sorted list for computer, and then it's easy to check whether that document also contains the word science (say, by binary search into the sorted list for science). Repeat this 1000 times, and you can get a good estimate of the fraction of documents containing computer that also contain science -- this uses the same principles as political polling, where we survey a random sample of people and then extrapolate to draw inferences about the entire population. Multiply that fraction by the total number of documents containing computer, and you've got an estimate of the size of the intersection.

You can apply similar ideas to count the number of documents containing the phrase "computer science".

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/57936

3200 people like this

 Download Related Notes/Documents

0 comments:

Post a Comment

Let us know your responses and feedback