Understanding Naive Bayes: computing the conditional probabilities


For a sentiment analysis task, suppose we have classes denoted by $c$ and features indexed by $i$.

We can represent the conditional probability of each class given a feature as: $$P(c | w_i) = \frac{P(w_i|c) \cdot P(c)}{P(w_i)}$$ where $w_i$ represents feature $i$ and $c$ is the class. Empirically, we can estimate $$P(w_i|c) = \frac{n_{ci}}{n_c}$$ $$P(w_i) = \frac{n_{i}}{n}$$ The prior for each class is then given by: $$P(c) = \frac{n_c}{n}$$ where:

$n$ is the total feature count over all classes,

$n_{ci}$ is the count of feature $i$ in class $c$,

$n_c$ is the total feature count in class $c$, and

$n_i$ is the count of feature $i$ over all classes (a counting sketch of these estimates follows below).
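In code, my understanding of these empirical estimates would look something like the following minimal sketch (the toy corpus and variable names are purely illustrative, assuming documents are already tokenized):

```python
from collections import Counter

# Hypothetical toy corpus: (tokens, class) pairs, tokens already split.
docs = [
    (["great", "movie", "great"], "pos"),
    (["terrible", "movie"], "neg"),
]

n_ci = Counter()  # n_ci: count of feature i in class c
n_c = Counter()   # n_c:  total feature count in class c
n_i = Counter()   # n_i:  count of feature i over all classes
n = 0             # n:    total feature count over all classes

for tokens, c in docs:
    for w in tokens:
        n_ci[(c, w)] += 1
        n_c[c] += 1
        n_i[w] += 1
        n += 1

def p_w_given_c(w, c):
    return n_ci[(c, w)] / n_c[c]  # P(w_i|c) = n_ci / n_c

def p_w(w):
    return n_i[w] / n             # P(w_i) = n_i / n

def p_c(c):
    return n_c[c] / n             # P(c) = n_c / n

print(p_w_given_c("great", "pos"))  # 2/3
print(p_w("movie"))                 # 2/5
print(p_c("pos"))                   # 3/5
```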

Is my understanding of the above correct? Also, given these $P(c|w_i)$ probabilities for each word, my reading of the naive Bayes assumption is that the words are independent, so I would simply multiply the per-word probabilities for a document under a certain class, i.e. compute $\prod_{i=1}^{N} P(c|w_i)$, where $N$ is the number of words in the document. Is this correct?

To actually compute the conditional probability numerically, would it suffice to do the following:

$$P(c | w_i) = \frac{P(w_i|c) \cdot P(c)}{P(w_i)} = \frac{n_{ci}}{n_c} \cdot \frac{n_c}{n}\cdot \frac{n}{n_i} = \frac{n_{ci}}{n_i}$$

The last part of the equation looks a bit suspicious to me as it seems way too simple to compute for a rather complex probability.
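For concreteness, a hypothetical example: if the word "great" occurs $n_{ci} = 8$ times in positive documents and $n_i = 10$ times in total, the cancellation gives $$P(\text{pos} \mid \text{great}) = \frac{8}{10} = 0.8,$$ i.e. simply the fraction of occurrences of "great" that fall in positive documents.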

user19241256

Posted 2018-01-24T20:01:23.673

Reputation: 173

Answers


Your formula is correct for one $w_i$, but if you want to classify a document, you need to compute $P(c | w_1,\ldots,w_N)$.

Then you have $$P(c | w_1,\ldots,w_N) = \frac{P(c)\cdot P(w_1,\ldots,w_N|c)}{P(w_1,\ldots,w_N)} = \frac{P(c) \cdot \prod_{i=1}^N P(w_i|c)}{P(w_1,\ldots,w_N)} \neq \prod_{i=1}^NP(c|w_i)$$

where the second equality holds because of the naïve Bayes assumption.

For classification purposes you can ignore $P(w_1,\ldots,w_N)$ because it is constant (given the data). The formula is still simple ("naïve") but doesn't simplify quite as much.
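For example, here is a minimal sketch of this scoring rule in Python. The toy data, the Laplace smoothing, and the document-frequency prior are illustrative additions on my part, not part of your setup:

```python
import math
from collections import Counter

# Hypothetical toy training data: (tokens, class) pairs.
train = [
    (["great", "movie"], "pos"),
    (["great", "great", "fun"], "pos"),
    (["terrible", "movie"], "neg"),
]

classes = {c for _, c in train}
vocab = {w for tokens, _ in train for w in tokens}
n_ci = Counter((c, w) for tokens, c in train for w in tokens)  # n_ci
n_c = Counter()                                                # n_c
for tokens, c in train:
    n_c[c] += len(tokens)
doc_count = Counter(c for _, c in train)  # documents per class, for P(c)

def classify(tokens, alpha=1.0):
    """Return argmax_c of log P(c) + sum_i log P(w_i|c).

    P(w_1,...,w_N) is dropped: it is the same for every class,
    so it does not change the argmax.
    """
    best_class, best_score = None, -math.inf
    for c in classes:
        score = math.log(doc_count[c] / len(train))  # prior P(c)
        for w in tokens:
            # Laplace-smoothed P(w|c), so unseen words don't zero out a class
            score += math.log((n_ci[(c, w)] + alpha) /
                              (n_c[c] + alpha * len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["great", "fun"]))  # 'pos'
```

Working in log space avoids numerical underflow from multiplying many small probabilities, which is the standard trick in practice.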

The last part of the equation looks a bit suspicious to me as it seems way too simple to compute for a rather complex probability.

Keep in mind that while Naïve Bayes is a decent classifier for many applications, the generated probabilities are usually not very representative.

oW_

Posted 2018-01-24T20:01:23.673

Reputation: 5 477

Thanks for your answer. In actual programs, why can't this result be achieved? I have seen many implementations of naive Bayes and none of them directly compute $n_{ci}$ for a word. – user19241256 – 2018-01-24T21:34:57.133

Not sure I understand the question... in some form or another it would come down to counting. Can you give an example? – oW_ – 2018-01-24T22:32:29.247