notes:bayesian_classification

This shows you the differences between two versions of the page.

Both sides previous revision Previous revision Next revision | Previous revision Last revision Both sides next revision | ||

notes:bayesian_classification [2013/03/15 11:43] andy [Combining words] |
notes:bayesian_classification [2013/03/15 14:11] andy [Combining words] |
||
---|---|---|---|

Line 3: | Line 3: | ||

This page discusses the application of Bayes Theorem as a simple classifier for text and outlines the mathematical basis and the algorithmic approach. | This page discusses the application of Bayes Theorem as a simple classifier for text and outlines the mathematical basis and the algorithmic approach. | ||

- | The information in this page is heavily cribbed from the Wikipedia articles on [[wikipedia>Bayesian spam filtering]], [[wikipedia>naive Bayes classifier]] and [[wikipedia>Bayes' Theorem]]. | + | The information in this page is heavily cribbed from the Wikipedia articles on [[wikipedia>Bayesian spam filtering]], [[wikipedia>naive Bayes classifier]] and [[wikipedia>Bayes' Theorem]]. There's also a [[http://cs.wellesley.edu/~anderson/writing/naive-bayes.pdf|useful paper on combining word probabilities]] which is worth a read, especially the final section which discusses an erroneous assumption that some implementations make. |

===== Bayes' Theorem ===== | ===== Bayes' Theorem ===== | ||

Line 90: | Line 90: | ||

Please forgive the slightly loose use of notation, there are a few too many dimensions over which to iterate for clarity. | Please forgive the slightly loose use of notation, there are a few too many dimensions over which to iterate for clarity. | ||

+ | |||

+ | One slight simplification to note results from the fact that $P(C_i)$ is presumably determined by dividing a number of trained messages by the total number of messages trained. Let $N_{C_i}$ indicate the number of messages trained in category $C_i$, $N$ indicate the number of messages trained overall and $N_{C_i}(W_a)$ indicate the number of messages containing token $W_a$ that were trained in category $C_i$. Thus the equation above becomes: | ||

+ | |||

+ | \begin{equation*} P(C_i|W_a \cap W_b \cap ... \cap W_z) = \frac{\frac{1}{N}N_{C_i}\prod\limits_{j=a}^z{\frac{N_{C_i}(W_j)}{N_{C_i}}}}{\frac{1}{N}\sum\limits_{k=1}^n{N_{C_k}\prod\limits_{j=a}^z{\frac{N_{C_k}(W_j)}{N_{C_k}}}}} \end{equation*} | ||

+ | \begin{equation} \Rightarrow P(C_i|W_a \cap W_b \cap ... \cap W_z) = \frac{\prod\limits_{j=a}^z{N_{C_i}(W_j)}}{N_{C_i}^{x-1}\sum\limits_{k=1}^n{\frac{1}{N_{C_k}^{x-1}}\prod\limits_{j=a}^z{N_{C_k}(W_j)}}} \end{equation} | ||

+ | |||

+ | Where $x$ is the total number of words. This version may help avoid underflow, but may instead be susceptible to overflow due to the exponentiation involved. | ||

+ | ==== Two-category case ==== | ||

+ | |||

+ | A common case is that there are two categories --- for example, this is the case for email spam detection. In this case it can be tempting to simplify the above equation using the fact that $P(C_1) = (1 - P(C_2))$. However, this is not as effective as it seems as you would also need to assume that $P(W_i|C_1) = (1 - P(W_i|C_2))$ to achieve any significant simplifcation. However, this is clearly not the case --- just because the word "drugs" occurs in 20% of spam email, for example, it doesn't follow that it occurs in 80% of non-spam. | ||

==== Precision issues ==== | ==== Precision issues ==== | ||

Line 100: | Line 110: | ||

This is probably of limited use in the equation above because the summations require conversion back from log space anyway, but it may prove useful. | This is probably of limited use in the equation above because the summations require conversion back from log space anyway, but it may prove useful. | ||

- |

notes/bayesian_classification.txt ยท Last modified: 2013/03/15 14:13 by andy