If you want to make money online : Register now

[Solved]: How do I efficiently checking if a string matches any substring in a collection

, , No Comments
Problem Detail: 

I have a collection of substrings

"this" "is" "a" "antelope" 

I need to look at any given string and answer the question "Are any of the given substrings in this string"

So my input string might be


Which would be a match because "is" is a substring of "issue"

My first stab at this, I incorrectly turned my collection of substrings into a trie. That got me nowhere fast as it answered the inverse "is the input string a substring of the given collection".

Is there some algorithm or data structure I can transform my collection into to efficiently answer this question? I mean, I could do the simple brute force "check the input against every substring" method, but it seems like there is a better way.

In my given example, I would expect that "antelope" would never be checked because "a" covers every case that antelope would. I may even expect that "is" would remove "this" since every case that "this" would find a match "is" would as well. So it seems like eliminating longer substrings with shorter ones would yield some better performance.

I'm rambling... Anything I should be looking into?

Asked By : Cogman

Answered By : James Evans

I think the Aho-Corasick algorithm is your best bet here. The algorithm is designed to solve the set matching problem (essentially the one you've defined), in which you want to determine which elements of a set of substrings, L, can be found in a longer string, S.

It runs in time $O(n + m + z)$, where $n = \sum \limits_{l \in L} |l|$ (the total combined length of the substrings in $L$), $m$ is the length of S, and $z$ is the number of matches of $L$ in $S$.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/35819

3.2K people like this

 Download Related Notes/Documents


Post a Comment

Let us know your responses and feedback