**Problem Detail:**

Part-of-speech (POS) tagging is a classic problem in Natural Language Processing, and a popular solution is the Hidden Markov Model (HMM).

Given a sentence $x_1 \dots x_n$, we want to find the sequence of POS tags $y_1 \dots y_n$ such that $y_1 \dots y_n = \arg\max_{y_1 \dots y_n} p(Y,X)$.

By the chain rule of probability, $P(X,Y)=P(Y)P(X \mid Y)$.

Solving POS tagging with an HMM relies on the Markov assumptions that the joint distribution factors into transition probabilities $p(y_i \mid y_{i-1})$ and emission probabilities $p(x_i \mid y_i)$.
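To make the factorization concrete, here is a minimal sketch of scoring and decoding under those two HMM assumptions. All probabilities, tags, and words are made-up illustrative values, not from any real corpus; a real tagger would use Viterbi rather than brute force:

```python
from itertools import product
from math import log

# Toy HMM (hypothetical numbers). "<s>" is a start-of-sentence state.
trans = {  # transition p(y_i | y_{i-1})
    ("<s>", "DET"): 0.6, ("<s>", "NOUN"): 0.4,
    ("DET", "NOUN"): 0.9, ("DET", "DET"): 0.1,
    ("NOUN", "NOUN"): 0.3, ("NOUN", "DET"): 0.7,
}
emit = {  # emission p(x_i | y_i)
    ("DET", "the"): 0.7, ("DET", "dog"): 0.3,
    ("NOUN", "dog"): 0.6, ("NOUN", "the"): 0.4,
}

def joint_log_prob(words, tags):
    """log p(Y, X) = sum_i [ log p(y_i | y_{i-1}) + log p(x_i | y_i) ]."""
    lp, prev = 0.0, "<s>"
    for w, t in zip(words, tags):
        lp += log(trans[(prev, t)]) + log(emit[(t, w)])
        prev = t
    return lp

# Brute-force arg max over all tag sequences (Viterbi does this efficiently).
words = ["the", "dog"]
best = max(product(["DET", "NOUN"], repeat=len(words)),
           key=lambda tags: joint_log_prob(words, tags))
print(best)  # → ('DET', 'NOUN')
```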

My first question: is there any particular reason why we prefer to solve this with a generative model and its many assumptions, rather than directly estimating $P(Y \mid X)$? Given the training corpus, it is still possible to estimate $p(y_i \mid x_i)$.

My second question: even if we are convinced that a generative model is preferable, why compute it as $P(Y,X)=P(Y)P(X \mid Y)$ and not as $P(X,Y)=P(X)P(Y \mid X)$? Given an appropriate generative story, I could use $P(X,Y)=P(X)P(Y \mid X)$ as well. Is it stated anywhere that the former generative story is preferred?

###### Asked By : user16168

#### Answered By : alto

Isn't this exactly the same question you asked previously? I'll make some additional comments and add some links here. Hopefully that will help.

> is there any particular reason why we prefer to solve it with a generative model with a lot of assumptions and not directly by estimating $P(Y \mid X)$, given that from the training corpus it's still possible to estimate $p(y_i \mid x_i)$?

It just depends. Choosing whether to model $P(X,Y)$ or $P(Y \mid X)$ is simply the choice of generative versus discriminative, and both have advantages. See the paper *On Discriminative vs. Generative Classifiers* by Ng and Jordan. One thing worth mentioning, which I didn't say last time, is that unsupervised learning in a generative framework is normally straightforward, which in turn makes it fairly obvious how to do semi-supervised learning. Semi-supervised learning can be very helpful for NLP tasks, where the amount of unlabeled data is essentially infinite and labeled data is hard to obtain. Semi-supervised learning is typically not as easy in a discriminative framework; see co-training as an example of the latter.

As for how one decomposes the joint, well, that's up to you. There's no rule saying you can't decompose it as $P(X,Y) = P(X)P(Y|X)$. Doing so would be perfectly valid, just not sensible. Notice that decomposing the joint this way already includes the factor $P(Y|X)$. If you're ultimately interested in predicting $Y$ given $X$, then you should predict $$ \begin{align*} \arg\max_y P(Y=y,X=x) &= \arg\max_y P(X=x)P(Y=y|X=x) \\ &= \arg\max_y P(Y=y|X=x). \end{align*} $$ So you just use $P(Y|X)$ and ignore $P(X)$, and we're back at a discriminative classifier.
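The step above works because $P(X=x)$ is a constant with respect to $y$, so multiplying by it cannot change which $y$ wins the arg max. A two-line numeric check with hypothetical values:

```python
# P(X=x) is constant in y, so it cannot change the arg max over y.
p_x = 0.2                                  # hypothetical marginal P(X=x)
p_y_given_x = {"NOUN": 0.7, "VERB": 0.3}   # hypothetical conditional P(Y=y | X=x)

joint = {y: p_x * p for y, p in p_y_given_x.items()}  # P(Y=y, X=x)
best_joint = max(joint, key=joint.get)
best_cond = max(p_y_given_x, key=p_y_given_x.get)
print(best_joint, best_cond)  # → NOUN NOUN
```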

###### Best Answer from StackExchange

Question Source : http://cs.stackexchange.com/questions/20185

