Simple Bayesian classification with Laplace smoothing question

Problem Detail: 

I'm having a hard time getting my head around smoothing, so I've got a very simple question about Laplace/Add-one smoothing based on a toy problem I've been working with.

The problem is a simple Bayesian classifier for periods that end sentences vs. periods that do not, based on the word immediately before the period. During training I collect the following counts: the number of periods, the number of sentence-ending periods (for the prior), the count of each word appearing before a sentence-ending period, and the count of each word appearing before a non-sentence-ending period.
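For concreteness, here is a minimal sketch (in Python) of how those counts might be collected. The `(word_before_period, ends_sentence)` representation and the `collect_counts` helper are assumptions made for illustration, not part of the original question.

```python
from collections import Counter

def collect_counts(period_contexts):
    """Collect training counts from (word_before_period, ends_sentence) pairs.

    `period_contexts` is a hypothetical pre-labeled form of the training
    text: one entry per period, giving the word that precedes it and
    whether that period ends a sentence.
    """
    n_periods = 0
    n_ending = 0                   # for the prior P(ending)
    ending_counts = Counter()      # count(w, ending)
    non_ending_counts = Counter()  # count(w, not ending)

    for word, ends_sentence in period_contexts:
        n_periods += 1
        if ends_sentence:
            n_ending += 1
            ending_counts[word] += 1
        else:
            non_ending_counts[word] += 1

    return n_periods, n_ending, ending_counts, non_ending_counts
```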

With add-one smoothing, I understand that

$$P(w|\text{ending}) = \frac{\text{count}(w,\text{ending}) + 1}{\text{count}(w) + N},$$

where $P(w|\text{ending})$ is the conditional probability of word $w$ appearing before a sentence-ending period, $\text{count}(w,\text{ending})$ is the number of times $w$ appeared in the training text before a sentence-ending period, $\text{count}(w)$ is the number of times $w$ appeared in the training text (or should that be the number of times it appeared in the context of any period?), and $N$ is the "vocabulary size".

The question is, what is $N$?

Is it the number of different words in the training text? Is it the number of different words that appeared in the context of any period? Or just in the context of a sentence-ending period?

Asked By : Tommy McGuire
Answered By : D.W.

The correct formula is

$$P(w|\text{ending}) = \frac{\text{count}(w,\text{ending}) + 1}{\text{count}(\text{ending}) + N},$$

where $\text{count}(\text{ending})$ is the total number of sentence-ending periods in the training text (equivalently, the sum of $\text{count}(w,\text{ending})$ over all $w$) and $N$ is the number of possible values of $w$. Here $w$ ranges over the set of all words that you'll ever want to estimate $P(w|\text{ending})$ for: this includes all the words in the training text, as well as any other words you might want to compute a probability for. For instance, if you limit yourself to computing probabilities only for English words in a particular dictionary, $N$ might be the number of words in that dictionary.
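As a rough sketch, the smoothed estimate could be computed as below. The `smoothed_prob` helper and the choice of the words seen before any period as the set of possible words are assumptions for illustration; as the answer notes, $N$ could just as well be the size of a fixed dictionary.

```python
from collections import Counter

def smoothed_prob(word, class_counts, n_possible_words):
    """Add-one (Laplace) estimate of P(word | class).

    `class_counts` maps each word to count(w, class) for this class
    (e.g. the ending_counts collected earlier); its values sum to
    count(class). `n_possible_words` is N, the number of distinct
    words we might ever ask about.
    """
    total = sum(class_counts.values())  # count(class)
    return (class_counts.get(word, 0) + 1) / (total + n_possible_words)

# Illustrative usage with made-up counts:
ending_counts = Counter({"said": 3, "today": 2})
non_ending_counts = Counter({"dr": 4, "inc": 1})
vocabulary = set(ending_counts) | set(non_ending_counts)
p = smoothed_prob("dr", ending_counts, len(vocabulary))  # (0 + 1) / (5 + 4)
```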


The intuition behind add-one smoothing is this: in addition to every occurrence of a word in the training set, we imagine seeing one artificial "occurrence" of each possible word as well. In other words, we augment the training set with these artificial occurrences (exactly one per possible word) and then compute probabilities using the ordinary unsmoothed formula on the augmented training set. That's why we get $+1$ in the numerator and $+N$ in the denominator: $N$ is the number of artificial occurrences we've added, i.e., the number of possible words.
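A tiny check of that equivalence, with invented words and counts used purely for illustration: adding one artificial occurrence of every possible word and applying the ordinary unsmoothed formula gives exactly the $+1$/$+N$ estimate.

```python
from collections import Counter

vocab = ["dr", "mr", "inc", "said", "today"]        # the N possible words
ending_counts = Counter({"said": 3, "today": 2})    # count(w, ending)
N = len(vocab)

# Augmented training set: one artificial occurrence of every possible word.
augmented = Counter({w: 1 for w in vocab})
augmented.update(ending_counts)

for w in vocab:
    smoothed = (ending_counts[w] + 1) / (sum(ending_counts.values()) + N)
    unsmoothed_on_augmented = augmented[w] / sum(augmented.values())
    assert abs(smoothed - unsmoothed_on_augmented) < 1e-12
```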

Question Source : http://cs.stackexchange.com/questions/51861
