notes:bayesian_classification

This page discusses the application of Bayes' Theorem as a simple classifier for text, and outlines both the mathematical basis and the algorithmic approach.

The information in this page is heavily cribbed from the Wikipedia articles on Bayesian spam filtering, the naive Bayes classifier and Bayes' Theorem.

A reference to the meaning of the notation described below:

$P(A)$ | The unconditional probability of event $A$ occurring. |
---|---|
$P(A \cap B)$ | The unconditional probability of both events $A$ and $B$ occurring. |
$P(A|B)$ | The probability of event $A$ occurring given that event $B$ also occurs. |

We start with the definition of conditional probability:

\begin{equation} P(A \cap B) = P(A|B)P(B) \end{equation}

This encapsulates the multiplicative nature of conditional probabilities. Note that $A$ and $B$ can be swapped without affecting the meaning due to the commutativity of $P(A \cap B)$:

\begin{equation} P(A \cap B) = P(B|A)P(A) \end{equation}

Setting these two equal yields:

\begin{equation*} P(A|B)P(B) = P(B|A)P(A) \end{equation*} \begin{equation} \Rightarrow P(A|B) = \frac{P(B|A)P(A)}{P(B)} \end{equation}

This assumes that $P(A) \not= 0$ and $P(B) \not= 0$. This is a simple statement of Bayes' Theorem. If we assume that the event space can be partitioned into a set of mutually exclusive, exhaustive events $A_j$, so that the terms $P(B|A_j)P(A_j)$ sum to $P(B)$, then we can generate the **extended form**:

\begin{equation} P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum\limits_j{P(B|A_j)P(A_j)}} \end{equation}

This may be easier to interpret using the concrete example of a Bayesian classifier, where $P(B)$ represents the probability that a specific word will occur in a message, and $P(B|A_i)$ represents the probability that the word will occur in a message of a specific category $A_i$, on the assumption that the categories are complete (i.e. each message always belongs to exactly one of the predefined categories). It is therefore easy to see that summing the terms $P(B|A_i)P(A_i)$ over all categories yields $P(B)$, since the categories are mutually exclusive and together cover all the possible ways that a message containing the word can arise (the law of total probability).
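The extended form can be checked numerically. The following sketch uses hypothetical priors and likelihoods for two categories; the specific numbers are made up for illustration only.

```python
# Hypothetical class priors P(A_i) and word likelihoods P(B|A_i)
# for a single word B across two mutually exclusive categories.
p_A = {"spam": 0.4, "ham": 0.6}          # P(A_i)
p_B_given_A = {"spam": 0.5, "ham": 0.1}  # P(B|A_i)

# Law of total probability: P(B) = sum_j P(B|A_j) P(A_j)
p_B = sum(p_B_given_A[a] * p_A[a] for a in p_A)

# Extended form of Bayes' Theorem for category "spam":
p_spam_given_B = p_B_given_A["spam"] * p_A["spam"] / p_B
```

Here $P(B) = 0.5 \times 0.4 + 0.1 \times 0.6 = 0.26$, so the posterior for "spam" is $0.2 / 0.26 \approx 0.77$.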

The process of determining the class of a piece of text involves splitting it up into tokens (words) and calculating the probability of each word occurring in each class of message. We assume $n$ classifications of messages, $C_1, C_2, ... C_n$, in the examples below and consider the effect of a word $W$.
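The tokenising and counting step can be sketched as follows. The training data, the whitespace tokeniser and the class names here are all hypothetical; a real classifier would use a larger corpus and a more careful tokeniser.

```python
from collections import Counter

def tokenize(text):
    """Split a message into lowercase word tokens (naive whitespace split)."""
    return text.lower().split()

# Hypothetical training data: (message, class) pairs.
training = [
    ("cheap pills cheap offer", "spam"),
    ("meeting agenda attached", "ham"),
    ("cheap flights offer", "spam"),
    ("lunch meeting tomorrow", "ham"),
]

# word_counts[c] counts occurrences of each word in class c;
# totals[c] is the total number of tokens seen in class c.
word_counts = {}
totals = Counter()
for message, c in training:
    counts = word_counts.setdefault(c, Counter())
    for w in tokenize(message):
        counts[w] += 1
        totals[c] += 1

def p_word_given_class(w, c):
    """Estimate P(W|C) as the relative frequency of w among class c's tokens."""
    return word_counts[c][w] / totals[c]
```

With this toy corpus, "cheap" accounts for 3 of the 7 spam tokens, so `p_word_given_class("cheap", "spam")` is 3/7.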

The classifier is **naive** because it assumes the contribution of each token to the classification of the message is independent. Cases where tokens occurring together provide a much stronger indication than either token appearing individually may not be suitable for the naive approach.

Using the extended form of Bayes' Theorem, we can specify the probability that a message containing a particular word $W$ will be given a particular classification $C_i$:

\begin{equation} P(C_i|W) = \frac{P(W|C_i)P(C_i)}{\sum\limits_{j=1}^n{P(W|C_j)P(C_j)}} \end{equation}

This depends partly on the prior proportion of messages with each classification, $P(C_i)$. However, some classifiers make the simplifying assumption that all classifications are initially equally likely, which yields:

\begin{equation*} P(C_1) = P(C_2) = ... = P(C_n) = \frac{1}{n} \end{equation*}

Putting this into the equation above allows us to simplify it:

\begin{equation*} P(C_i|W) = \frac{P(W|C_i) \frac{1}{n} }{ \frac{1}{n} \sum\limits_{j=1}^n{P(W|C_j)}} \end{equation*} \begin{equation} \Rightarrow P(C_i|W) = \frac{P(W|C_i)}{\sum\limits_{j=1}^n{P(W|C_j)}} \end{equation}

This expresses the probability that a given word indicates a particular classification in terms of the relative frequencies of that word in the different categories, which are easily acquired through suitable training.
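The simplified equal-priors form reduces to normalising the per-class likelihoods of the word. A minimal sketch, using hypothetical likelihood values:

```python
def classify_word(p_w_given):
    """P(C_i|W) under equal priors: P(W|C_i) divided by the sum of
    P(W|C_j) over all classes, per the simplified equation above."""
    total = sum(p_w_given.values())
    return {c: p / total for c, p in p_w_given.items()}

# Hypothetical relative frequencies of the word "cheap" in each class:
posterior = classify_word({"spam": 0.3, "ham": 0.1})
```

With these numbers the posterior for "spam" is $0.3 / (0.3 + 0.1) = 0.75$.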

Of course, messages may be made up of many tokens and each has its own contribution to make to the overall classification.
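Under the naive independence assumption, the standard way to combine the tokens' contributions is to multiply their per-class likelihoods; working in log-space avoids floating-point underflow on long messages. This sketch assumes equal priors and uses a hypothetical smoothing floor for words unseen in training.

```python
import math

def classify(tokens, likelihoods):
    """Score each class by the product of per-token likelihoods P(W|C)
    (the naive independence assumption), assuming equal class priors.
    likelihoods maps class -> {word: P(word|class)}."""
    floor = 1e-6  # hypothetical floor for words unseen in training
    scores = {c: sum(math.log(table.get(w, floor)) for w in tokens)
              for c, table in likelihoods.items()}
    # Normalise back to probabilities (log-sum-exp for stability).
    m = max(scores.values())
    exp_scores = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: e / z for c, e in exp_scores.items()}

# Hypothetical trained likelihood tables:
likelihoods = {
    "spam": {"cheap": 0.4, "offer": 0.3, "meeting": 0.01},
    "ham":  {"cheap": 0.02, "offer": 0.05, "meeting": 0.3},
}
result = classify(["cheap", "offer"], likelihoods)
```

The returned dictionary sums to 1; with the numbers above, a message containing "cheap offer" is classified overwhelmingly as spam.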
