
Probabilities, Unigram and Bigram

Problem Detail: 

Assume that we have the following bigram and unigram counts (note: this is not real data):

#a (word starts with a) = 21
bc = 42
cf = 32
de = 64
e# = 23

# = 43
a = 84






What is the probability of generating a word like "abcfde"? I think the probability that a word starts with a is 21/43. How about bc? Is it count(bc)/count(b)?

Asked By : liza

Answered By : farhanhubble

Augment the string "abcde" with # as start and end markers to get #abcde#. Now, as @Yuval Filmus pointed out, we need to make some assumption about the kind of model that generates this data. Because we have both unigram and bigram counts, we can assume a bigram model. In a bigram (character) model, we find the probability of a word by multiplying conditional probabilities of successive pairs of characters, so:

$\Pr(\#abcde\#) = \Pr(a|\#) \cdot \Pr(b|a) \cdot \Pr(c|b) \cdot \Pr(d|c) \cdot \Pr(e|d) \cdot \Pr(\#|e)$

To find the conditional probability of a character $c_2$ given its preceding character $c_1$, $\Pr(c_2|c_1)$, we divide the number of occurrences of the bigram $c_1c_2$ by the number of occurrences of the unigram $c_1$.

So, for example, with $\mathrm{count}(d) = 150$, we get $\Pr(e|d) = \mathrm{count}(de)/\mathrm{count}(d) = 64/150$.
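The recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `conditional_probability` and `word_probability` are made up for this example, the estimates are plain count ratios with no smoothing, and a missing count raises an error rather than being treated as zero, because the table in the question is incomplete. The first call uses the counts from the question; the `toy_*` counts are hypothetical and exist only to show a full word being scored.

```python
from fractions import Fraction


def conditional_probability(c2, c1, bigram_counts, unigram_counts):
    """Estimate Pr(c2 | c1) as count(c1 c2) / count(c1)."""
    pair = c1 + c2
    if pair not in bigram_counts or c1 not in unigram_counts:
        # The count tables may be partial, so refuse to guess a value.
        raise KeyError(f"missing count for {pair!r} or {c1!r}")
    return Fraction(bigram_counts[pair], unigram_counts[c1])


def word_probability(word, bigram_counts, unigram_counts, marker="#"):
    """Multiply Pr(next | previous) over the word padded with start/end markers."""
    padded = marker + word + marker
    prob = Fraction(1)
    for c1, c2 in zip(padded, padded[1:]):
        prob *= conditional_probability(c2, c1, bigram_counts, unigram_counts)
    return prob


# Counts exactly as listed in the question (the table is incomplete).
question_bigrams = {"#a": 21, "bc": 42, "cf": 32, "de": 64, "e#": 23}
question_unigrams = {"#": 43, "a": 84}

p_start_a = conditional_probability("a", "#", question_bigrams, question_unigrams)
print(p_start_a)  # 21/43

# Hypothetical toy counts, NOT from the question, to show a complete word.
toy_bigrams = {"#a": 1, "ab": 1, "b#": 1}
toy_unigrams = {"#": 2, "a": 2, "b": 1}
print(word_probability("ab", toy_bigrams, toy_unigrams))  # 1/4
```

The first result matches the asker's guess of 21/43 for a word starting with a. Scoring a longer word with the question's data would need the remaining unigram and bigram counts, which are not listed.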

Best Answer from StackOverflow
