If you want to make money online : Register now

# [Solved]: NFA-style representation of regexps which includes AND

, ,
Problem Detail:

I'm looking for a regexp matching strategy that supports conjunction and which is based on an NFA-style machine. I know how to handle conjunctions using a DFA, or using Brzozowski's approach, but for various reasons, I can't use Brzozowski's technique and bump into state explosion when using a DFA.

I know about the cross-product construction to compute the intersection of two NFAs, but there as well: while the state explosion is not as severe as when going through a DFA, it is sometimes still too severe for my needs, so I'd like an NFA-like representation that lets me build the cross-product "on the fly", kind of like "Thompson's NFA" matching constructs the DFA states on the fly.

It is a bit difficult to answer you because you state that you are using NFA to do the matching. NFA being non deterministic, that leave the interpretation technique open, depth first, breadth first, or variants. So I am trying to give you an answer that should fit any interpretation of the NFA. This technique should be applicable to other constructions than intersection.

You have 2 NFAs, $A_i=(Q_i,\Sigma,\delta_i,\hat q_i, F_i)$ for $i=1,2$. The intersection is a NFA $A=(Q,\Sigma,\delta,\hat q, F)$ defined as usual for the intersection, notably with $Q=Q_1\times Q_2$.

The idea of the technique is to implement the transition function $\delta$ not as a table, but as a function $\delta: (Q_1\times Q_2)\times\Sigma\to 2^{(Q_1\times Q_2)}$, that calls in turn the function $\delta_1$ and $\delta_2$.

function delta((q1,q2),a) { return (delta1(q1,a), delta2(q2,a))} 

This is to be read as a non deterministic program, in the sense described in this answer to How does one formulate a backtracking algorithm? That means that, in actual interpretation, the functions are actually returning sets of results, one by one if backtracking, or together if breadth-first.

The rest is up to the way you translate non-determinism into determinism, either by a compiler, or by hand. It can be done depth-first or breadth first. It can even use memo function to compute incrementally and remember the part of the transitions of the automaton $A$ that is needed for the string at hand, and possibly keep it for future strings to be matched.

Of course, if your automata are DFAs, this is much simpler to read as usual deterministic functions (possibly implemented by tables for delta1 and delta2. Then memoisation of delta will construct incrementally the table for the cross product.

The can be seen as an incremental construction, or as a generalisation of call by need. Rather than building the automaton up-front, you build the parts you need when you need them (and you can even forget some, if you run out of space). This can also be used for the powerset construction, and in many other circumstances.