# Probabilities, Unigram and Bigram

Problem Detail:

Assume that we have the following bigram and unigram counts (note: this is not real data).
bigram:

bc= 42
cf= 32
de= 64
e#= 23

unigram:

#= 43
a= 84
b= 123
c= 142
f= 161
d= 150
e= 170

What is the probability of generating a word like "abcfde"? I think the probability that a word starts with a is 21/43. How about bc? Is it count(bc)/count(b)?

Augment the string "abcde" with # as start and end markers to get #abcde#. Now, as @Yuval Filmus pointed out, we need to make some assumption about the kind of model that generates this data. Because we have both unigram and bigram counts, we can assume a bigram model. In a bigram (character) model, we find the probability of a word by multiplying conditional probabilities of successive pairs of characters, so:

\$\Pr[\#abcde\#] = \Pr(a|\#) \cdot \Pr(b|a) \cdot \Pr(c|b) \cdot \Pr(d|c) \cdot \Pr(e|d) \cdot \Pr(\#|e)\$
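To make the factorization concrete, here is a minimal Python sketch that pads a word with the boundary marker and lists its successive character pairs, one pair per factor in the product. The function name `bigrams` and its `boundary` parameter are illustrative choices, not something from the question.

```python
def bigrams(word, boundary="#"):
    """Return the successive character pairs of `word` after padding it
    with the boundary marker at both ends (hypothetical helper)."""
    padded = boundary + word + boundary
    # Each pair padded[i:i+2] corresponds to one factor Pr(second | first).
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


print(bigrams("abcde"))  # ['#a', 'ab', 'bc', 'cd', 'de', 'e#']
```

A word of length n therefore contributes n + 1 conditional probabilities, including the start and end transitions.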

To find the conditional probability of a character \$c_2\$ given its preceding character \$c_1\$, \$\Pr(c_2|c_1)\$, we divide the number of occurrences of the bigram \$c_1c_2\$ by the number of occurrences of the unigram \$c_1\$.

So, for example, \$\Pr(e|d) = \mathrm{count}(de)/\mathrm{count}(d) = 64/150 \approx 0.427\$.
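The same calculation can be sketched in Python using the counts from the question. This is only an illustration under stated assumptions: the names `cond_prob` and `word_prob` are made up here, and a bigram that is missing from the table is treated as having count 0, which the question does not actually specify.

```python
from fractions import Fraction

# Counts copied from the question (stated there to be made-up data).
bigram_counts = {"bc": 42, "cf": 32, "de": 64, "e#": 23}
unigram_counts = {"#": 43, "a": 84, "b": 123, "c": 142,
                  "f": 161, "d": 150, "e": 170}


def cond_prob(c1, c2):
    """Unsmoothed estimate Pr(c2 | c1) = count(c1 c2) / count(c1).

    Assumption: a bigram absent from the table has count 0.
    """
    return Fraction(bigram_counts.get(c1 + c2, 0), unigram_counts[c1])


def word_prob(word, boundary="#"):
    """Multiply the conditional probabilities of successive character pairs."""
    padded = boundary + word + boundary
    p = Fraction(1)
    for c1, c2 in zip(padded, padded[1:]):
        p *= cond_prob(c1, c2)
    return p


print(cond_prob("d", "e"))   # 32/75, i.e. 64/150
print(cond_prob("b", "c"))   # 14/41, i.e. 42/123
print(word_prob("abcde"))    # 0
```

With only the four bigram counts listed, the product for a whole word comes out as 0 under this assumption, because pairs such as #a and ab have no count. Computing a nonzero word probability would need the counts for every bigram in the padded word.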