notes:bayesian_classification


This page discusses the application of Bayes' Theorem as a simple classifier for text and outlines the mathematical basis and the algorithmic approach.

The information in this page is heavily cribbed from the Wikipedia articles on [[wikipedia>Bayesian spam filtering]], [[wikipedia>naive Bayes classifier]] and [[wikipedia>Bayes' Theorem]]. There's also a [[http://cs.wellesley.edu/~anderson/writing/naive-bayes.pdf|useful paper on combining word probabilities]] which is worth a read, especially the final section which discusses an erroneous assumption that some implementations make.

===== Bayes' Theorem =====


\begin{equation} P(C_i|W) = \frac{P(W|C_i)P(C_i)}{\sum\limits_{j=1}^n{P(W|C_j)P(C_j)}} \end{equation}

This depends partly on the ratio of messages with particular classifications $P(C_i)$, which makes it a **biased** classifier (i.e. it makes assumptions about the distribution of incoming messages before even looking at them). However, some classifiers make the simplifying assumption that all classifications are initially equally likely, which yields:

\begin{equation*} P(C_1) = P(C_2) = ... = P(C_n) = \frac{1}{n} \end{equation*}

Putting this into the equation above allows us to simplify it and obtain an **unbiased** classifier:

\begin{equation*} P(C_i|W) = \frac{P(W|C_i) \frac{1}{n} }{ \frac{1}{n} \sum\limits_{j=1}^n{P(W|C_j)}} \end{equation*}

The $\frac{1}{n}$ factors cancel, leaving:

\begin{equation} P(C_i|W) = \frac{P(W|C_i)}{\sum\limits_{j=1}^n{P(W|C_j)}} \end{equation}


This allows the probability of a given word classifying the message correctly to be expressed in terms of the relative frequencies of that word in the different categories, which are easily acquired through suitable training.
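To make this concrete, here is a minimal sketch in Python of the unbiased single-word classifier, assuming $P(W|C)$ is estimated as the relative frequency of messages in a category containing the word. The category names, tokens and counts are invented purely for illustration:

```python
# Minimal sketch of the unbiased single-word classifier. P(W|C) is the
# relative frequency of messages in C containing W; the equal priors cancel.
# Categories, tokens and counts here are invented for illustration.
counts = {"spam": {"drugs": 40, "meeting": 2},
          "ham":  {"drugs": 1,  "meeting": 30}}
totals = {"spam": 100, "ham": 100}    # messages trained per category

def p_word_given_class(word, cls):
    return counts[cls].get(word, 0) / totals[cls]

def p_class_given_word(word, cls):
    denom = sum(p_word_given_class(word, c) for c in counts)
    return p_word_given_class(word, cls) / denom

print(p_class_given_word("drugs", "spam"))  # 0.4 / 0.41, roughly 0.976
```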


==== Combining words ====

Of course, messages may be made up of many tokens and each has its own contribution to make to the overall classification. Combining the probabilities from the previous section requires more applications of Bayes' Theorem and some other techniques. For the sake of this illustration we assume there are only two classifications $C_1$ and $C_2$, and only two tokens $W_a$ and $W_b$ that have been found in the message. We shall generalise the approach after this simple example.

We are interested in determining the probability that the message belongs to a particular classification; we shall take $C_1$ for this example. We already have $P(C_1|W_a)$ and $P(C_1|W_b)$ as derived in the previous section. As a starting point for the derivation, consider a simple application of Bayes' Theorem to calculate the probability of the message being classified as $C_1$ given that both tokens have been found:

\begin{equation} P(C_1|W_a \cap W_b) = \frac{P(W_a \cap W_b|C_1)P(C_1)}{P(W_a \cap W_b)} \end{equation}

Clearly this isn't suitable, as each token's occurrence in messages is only recorded independently (because this is a **naive** classifier), so neither $P(W_a \cap W_b)$ nor $P(W_a \cap W_b|C_1)$ is available. We can simplify this, however, by assuming that the tokens are [[wikipedia>Conditional independence|conditionally independent]] given the message's classification. This allows us to use the equality:

\begin{equation} P(W_a \cap W_b|C_1) = P(W_a|C_1)P(W_b|C_1) \end{equation}

This allows us to simplify our original equation to:

\begin{equation} P(C_1|W_a \cap W_b) = \frac{P(W_a|C_1)P(W_b|C_1)P(C_1)}{P(W_a \cap W_b)} \end{equation}

This is an improvement, but now we need to remove the denominator term as well. We can do this via **normalisation**. This uses the fact that the message containing both tokens //must// be classified as either $C_1$ or $C_2$ --- that is to say, their probabilities must sum to 1. We can then restate these probabilities using Bayes' Theorem and multiply through by the shared denominator to obtain an equivalent form for it:

\begin{equation*} 1 = P(C_1|W_a \cap W_b) + P(C_2|W_a \cap W_b) \end{equation*}

\begin{equation*} \Rightarrow 1 = \frac{P(W_a \cap W_b|C_1)P(C_1)}{P(W_a \cap W_b)} + \frac{P(W_a \cap W_b|C_2)P(C_2)}{P(W_a \cap W_b)} \end{equation*}

\begin{equation} \Rightarrow P(W_a \cap W_b) = P(W_a \cap W_b|C_1)P(C_1) + P(W_a \cap W_b|C_2)P(C_2) \end{equation}

As earlier, we can use the assumption of conditional independence to break this down into:

\begin{equation} P(W_a \cap W_b) = P(W_a|C_1)P(W_b|C_1)P(C_1) + P(W_a|C_2)P(W_b|C_2)P(C_2) \end{equation}

This finally allows us to rewrite our original equation without any probabilities of the tokens occurring together:

\begin{equation} P(C_1|W_a \cap W_b) = \frac{P(W_a|C_1)P(W_b|C_1)P(C_1)}{P(W_a|C_1)P(W_b|C_1)P(C_1) + P(W_a|C_2)P(W_b|C_2)P(C_2)} \end{equation}
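This two-token, two-class formula transcribes almost directly into code. A minimal sketch, where all the probabilities and token names are invented illustrative values:

```python
# Direct transcription of the two-token, two-class equation above.
# All probabilities and token names are invented illustrative values.
p_w_given_c = {("drugs", "C1"): 0.40, ("drugs", "C2"): 0.01,
               ("free",  "C1"): 0.30, ("free",  "C2"): 0.05}
p_c = {"C1": 0.5, "C2": 0.5}          # priors (equal in this example)

def p_c1_given_both(wa, wb):
    # Numerator and the C2 term of the normalising denominator.
    num = p_w_given_c[(wa, "C1")] * p_w_given_c[(wb, "C1")] * p_c["C1"]
    alt = p_w_given_c[(wa, "C2")] * p_w_given_c[(wb, "C2")] * p_c["C2"]
    return num / (num + alt)

print(p_c1_given_both("drugs", "free"))  # 0.06 / 0.06025, roughly 0.996
```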

It's fairly easy to see how this method extends to arbitrary numbers of words and classifications, yielding the general expression:

\begin{equation} P(C_i|W_a \cap W_b \cap ... \cap W_z) = \frac{P(C_i)\prod\limits_{j=a}^z{P(W_j|C_i)}}{\sum\limits_{k=1}^n{P(C_k)\prod\limits_{j=a}^z{P(W_j|C_k)}}} \end{equation}

Please forgive the slightly loose notation; there are a few too many dimensions over which to iterate for complete clarity.

One slight simplification to note results from the fact that $P(C_i)$ is presumably determined by dividing the number of messages trained in a category by the total number of messages trained. Let $N_{C_i}$ denote the number of messages trained in category $C_i$, $N$ the number of messages trained overall, and $N_{C_i}(W_a)$ the number of messages containing token $W_a$ that were trained in category $C_i$. The equation above then becomes:

\begin{equation*} P(C_i|W_a \cap W_b \cap ... \cap W_z) = \frac{\frac{1}{N}N_{C_i}\prod\limits_{j=a}^z{\frac{N_{C_i}(W_j)}{N_{C_i}}}}{\frac{1}{N}\sum\limits_{k=1}^n{N_{C_k}\prod\limits_{j=a}^z{\frac{N_{C_k}(W_j)}{N_{C_k}}}}} \end{equation*}

\begin{equation} \Rightarrow P(C_i|W_a \cap W_b \cap ... \cap W_z) = \frac{\prod\limits_{j=a}^z{N_{C_i}(W_j)}}{N_{C_i}^{x-1}\sum\limits_{k=1}^n{\frac{1}{N_{C_k}^{x-1}}\prod\limits_{j=a}^z{N_{C_k}(W_j)}}} \end{equation}

where $x$ is the total number of words. This version keeps the values relatively large, so it should reduce problems with floating point underflow (although it may be susceptible to overflow if the number of tokens becomes excessive).
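A sketch of how this count-based form might be implemented; the category names and counts are invented, and no smoothing of zero counts is attempted:

```python
# Sketch of the count-based equation above. n_msgs holds N_C, the number
# of messages trained per category; n_msgs_with holds N_C(W), the number
# of those containing each token. All names and counts are invented.
n_msgs = {"spam": 100, "ham": 200}
n_msgs_with = {"spam": {"drugs": 40, "free": 35},
               "ham":  {"drugs": 2,  "free": 10}}

def classify(tokens):
    x = len(tokens)
    scores = {}
    for cat, n_c in n_msgs.items():
        prod = 1.0
        for t in tokens:
            prod *= n_msgs_with[cat].get(t, 0)   # unseen token gives 0
        scores[cat] = prod / n_c ** (x - 1)      # product over N_C^(x-1)
    total = sum(scores.values())                 # shared denominator
    return {cat: s / total for cat, s in scores.items()}

print(classify(["drugs", "free"]))  # spam roughly 0.993
```

Note that the normalisation at the end divides each category's score by the sum over all categories, which is exactly the denominator of the equation above.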

==== Two-category case ====

A common case is that there are exactly two categories --- for example, email spam detection. In this case it can be tempting to simplify the above equation using the fact that $P(C_1) = (1 - P(C_2))$. However, this is not as effective as it seems: you would also need to assume that $P(W_i|C_1) = (1 - P(W_i|C_2))$ to achieve any significant simplification, and this is clearly not the case --- just because the word "drugs" occurs in 20% of spam email, for example, it doesn't follow that it occurs in 80% of non-spam.

==== Precision issues ====

Since many of the probabilities for particular words may be quite low once a large corpus of messages has been analysed, the product of large numbers of them can lead to underflow if floating point representations are used. One solution is to limit the analysis to a small number of "most interesting" words --- this brings other performance benefits as well.

Another technique is to perform the multiplications in log space, using addition instead of multiplication. This relies on the identity:

\begin{equation*} p_1 p_2 ... p_n = e^{\ln{p_1} + \ln{p_2} + ... + \ln{p_n}} \end{equation*}

+ | | ||

+ | This is probably of limited use in the equation above because the summations require conversion back from log space anyway, but it may prove useful. |
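For completeness, a sketch of the log-space approach: the per-token log-probabilities are summed, and the final normalisation subtracts the maximum before exponentiating (the usual log-sum-exp trick) so the conversion back from log space stays well-conditioned. The data is invented, and every token is assumed to have been seen in every category; a real implementation would smooth zero counts before taking logs:

```python
import math

# Log-space version of the prior-weighted classifier. Counts are invented;
# every token is assumed seen in every category (real implementations
# smooth zero counts before taking logs).
n_msgs = {"spam": 100, "ham": 200}
n_msgs_with = {"spam": {"drugs": 40, "free": 35},
               "ham":  {"drugs": 2,  "free": 10}}

def classify_log(tokens):
    n_total = sum(n_msgs.values())
    log_scores = {}
    for cat, n_c in n_msgs.items():
        s = math.log(n_c / n_total)                   # log P(C)
        for t in tokens:
            s += math.log(n_msgs_with[cat][t] / n_c)  # log P(W|C)
        log_scores[cat] = s
    # Normalise via log-sum-exp: subtracting the max keeps exp() in range.
    m = max(log_scores.values())
    total = sum(math.exp(s - m) for s in log_scores.values())
    return {cat: math.exp(s - m) / total for cat, s in log_scores.items()}

print(classify_log(["drugs", "free"]))  # spam roughly 0.993
```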

notes/bayesian_classification.txt · Last modified: 2013/03/15 14:13 by andy