**Problem Detail:**

I'm looking into PCFG sentence grammar dependency structure parsing using StanfordNLP PCFG parser. It generates tree structures represented as a string like this:

` Happy new year => (ROOT (VP (VBN Happy) (NP (JJ new) (NN year)))) `

Or in a more visual way, the tree structure looks like this:

`(ROOT (VBN (JJ Happy) (NP (JJ new) (NN year)))) `

Since some machine learning algorithms are generative, and I'm wondering if I group tree structures into two sets: one set contains sentences with negative emotions, one set on positive emotions. Can I use the machine learning algorithm to get the most commonly occurred subtree structure within each set?

I'm very sorry if I caused any confusion. First of all, I want to know if this problem is solvable in the realm of machine learning.

Remember topic modeling? Where the algorithm reads large chunks of text and generate similar topics and return keyword strings grouped together as a topic? One topic could be: `Gun, ship, sailer, ocean, fish, war, Gulf, oil`

It provides an insight into what computer understands what this document might be talking about, so we can compare it with what humans think.

However, what I want to do, is ask the question if I can find common syntactical subtree structures in a sentence that might indicate the attribute of this sentence: negative or positive.

No, I am not asking for a sentiment analysis classifier, and that is not my interest at all.

For example, (this is an unpublished study right now), as us humans, we can say certain things about the future, and it will appear in a special structure, like "I will go to the mall" or "I'm going to jogging." You can see that "will" or "going to" might appear in many sentences, and serve as a common grammatical/syntactical structure. Those structures can be spotted by us human beings (namely linguists), however, is it possible to let machine find such common pattern?

I know in data science, or in realm of machine learning, people use algorithms to find/generate common patterns. I can group my parsed sentences (as tree structures represented above) into two groups: negative group and positive group (NOT emotions, just names for two groups), and then I want to use this algorithm to see if one or multiple subtree structures are very common among the group so I can extract them out.

As I stated before, I only need two things:

- Is this solvable in Machine Learning realm?
- If it is, which algorithm should I use?

###### Asked By : Wind Dweller

#### Answered By : D.W.

It sounds like what you want is a machine learning algorithm that works on trees, in the supervised learning context. I'd imagine there are probably many machine learning algorithms that could be adapted to this setting, but a natural one to try first is naive Bayes. So, let me sketch how to use naive Bayes in this setting.

First, we need a feature set. A simple feature set would be: each possible subtree that appears anywhere in the training set is a feature. So, in the example in the question, your feature set would include the subtree `(NP (JJ new) (NN year))`

and `(JJ new)`

and every subtree of every parse tree that appears in the training set. Let $S_1,\dots,S_n$ denote these $n$ subtrees. This gives you a bunch of features. Now you can label each sentence with its label (positive or negative); and for each, you know whether it contains that feature or not.

Now remember how Naive bayes works. We have a boolean random variable $Y$ that indicates the class of an item, and $n$ boolean random variables $X_1,\dots,X_n$ that indicate which features it has. In particular, $X_i=+$ means that the $i$th feature is positive, i.e., that the $i$th subtree $S_i$ is a subtree of the current item. We want to calculate

$$\Pr[Y=+ | X_1=x_1,\dots,X_n=x_n].$$

We do this using Bayes theorem. In particular, we use our training set to calculate

$$\Pr[X_i=x_i | Y=y],$$

for each of the 4 possibilities of $x_i,y$. Then, we use Bayes theorem to compute $\Pr[Y=+ | X_1=x_1,\dots,X_n=x_n]$.

Now everything works as usual for naive Bayes. We've converted this into a situation where there's a set of $n$ features that can be present or absent for each item, and we want to label all the items; there's no more need to reason about the tree structure, we can just use a standard classifier like naive Bayes. You could also plug in some other classifier as well, too; hopefully you can see how this might work, from here.

###### Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/28218

## 0 comments:

## Post a Comment

Let us know your responses and feedback