
Indexing billions of domain names / strings

Problem Detail: 

I am trying to find a data structure/algorithm suitable for indexing and retrieving domain names and their associated documents.

The records that need indexing will typically be like:

www.facebook.com, 123
stackoverflow.com, 1231
www.facebook.com, 124
stackoverflow.com, 3456
cdn.facebook.com, 4566
google.com, 903002
google.co.nz, 2342
google.co.in, 84992
google.com, 902002

The retrieval will involve getting all the document ids (not top K, but all of them) for a given search. Since the content has structure, people can query using parts of domain names.

e.g. a search for 'google' will match all documents that have google.co.nz, google.co.in & google.com associated with them.

Assumptions

  • As far as the distribution goes, less than 30% of the domain names will point to 80% of the documents.
  • The number of records being indexed will be in hundreds of billions
  • The index will most likely not fit into memory and will have to be stored in file-system

Initial thoughts

  • I thought of a solution where the domain names are tokenised (e.g. www.facebook.com gets tokenised into www, facebook, com, facebook.com, www.facebook.com) and hashed.
  • The hash value lookup leads to all the documents that are part of that hash bucket.
  • Some of the problems I see with this approach are:
    • The values (document ids) are stored multiple times, and some of the TLDs/commonly used subdomains (www) will have almost all the entries in them.
    • Removing one entry will lead to updating multiple entries & their lists.
    • Hash collisions can affect the quality of results, so the hash function is quite vital.
    • It won't be possible to address misspellings or substring searches, e.g. *goog*.
    • The hash key will have to be 64-bit or more, depending on the function.
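The tokenisation step described above can be sketched as follows (the function name is mine, for illustration; it produces each label plus every dotted suffix of the domain):

```python
def tokenize_domain(domain):
    """Split a domain into search tokens: each label plus every dotted suffix.

    'www.facebook.com' yields www, facebook, com, facebook.com, www.facebook.com.
    """
    labels = domain.split(".")
    tokens = set(labels)
    # Every dotted suffix with at least two labels (facebook.com, www.facebook.com, ...)
    for i in range(len(labels) - 1):
        tokens.add(".".join(labels[i:]))
    return tokens
```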

The question is what data structure/algorithm to use to store & search for documents matching a given domain name.

Alternatively, if the solution proposed above is suitable, what would be a good hash function in this context?

Asked By : Sundarram P.V.

Answered By : D.W.

The standard solution is to build an index: a data structure that maps from search term to all of the records that match that search term. Your index could be a hashmap or a B-tree, for instance. In this case, the search terms are things like "google" or "facebook" or "facebook.com", and each one maps to a list of all records that match it. Thus, facebook.com would be associated with two search terms ("facebook.com" and "facebook") and would be present in two of these lists.
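An in-memory Python version of such an index might look like this (the tokeniser is an assumption based on the question's example; a production index over hundreds of billions of records would live on disk as a hashmap or B-tree rather than in a dict):

```python
from collections import defaultdict

def tokenize(domain):
    # Hypothetical tokeniser: each label plus every dotted suffix.
    labels = domain.split(".")
    tokens = set(labels)
    for i in range(len(labels) - 1):
        tokens.add(".".join(labels[i:]))
    return tokens

# In-memory stand-in for the on-disk index structure.
index = defaultdict(list)

def add_record(domain, doc_id):
    # One record is filed under every search term it matches.
    for term in tokenize(domain):
        index[term].append(doc_id)

add_record("facebook.com", 123)
add_record("www.facebook.com", 124)
```

After these two inserts, the lists for both "facebook" and "facebook.com" contain both document ids, exactly as described above.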

You raised some concerns about this approach, but they are all manageable:

  • Yes, search terms like "com" will match many records. You should decide whether searching for "com" is useful or not. If it's not useful, don't include that in the index: build a list of "stopwords" (e.g., com, org, etc.), and don't include any of them in the index. For instance, you might decide that the TLD part of a domain name is not one of its matching search terms.

  • Yes, inserting or removing one record will require updating multiple entries in the index. But this just isn't a big problem: most domain names have very few components, so e.g. "www.google.com" requires changing only 2 entries in the index (assuming you have blacklisted "com").

  • Hash collisions are dealt with using standard methods for building hash tables and just aren't that big a deal. Or, you can use any other data structure, like a B-tree.

  • No, this can't handle misspellings. Misspellings are an entirely different problem, which requires a range of different solutions, and you should ask a separate question about that (but first, study Levenshtein distance, metric trees, and data structures for spelling correction and answering nearest-neighbor search in the Levenshtein distance metric -- there are a bunch of questions on this site that are relevant).

  • No, there is no reason that the hash key needs to be 64 bits. The range of the hash function only needs to be at least the number of buckets in your hash table.
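Levenshtein distance, mentioned above as the starting point for the misspelling problem, is a standard dynamic program. A minimal sketch (two-row variant, O(len(a)·len(b)) time):

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution (free on match)
        prev = cur
    return prev[-1]
```

For example, a misspelling like "googel" is within distance 2 of "google"; a nearest-neighbor structure over this metric is what the spelling-correction literature builds on top of it.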

I suggest you read some standard resources on building indices, e.g., in database textbooks, and read about how search engines work. Those topics are well-studied and those techniques will be applicable to your problem.


Question Source : http://cs.stackexchange.com/questions/52194


