Naïve Bayes

Chapter Preview

Top

Bayes And Naïve Bayes

The name for this classifier looks a bit strange. You may ask: who is Bayes and why this person is considered to be naïve. You may also question the power of such a “naïve” algorithm. But do not rush to make conclusions as we will see later in this chapter that Naïve Bayes is not helpless in practice.

This classifier got its name after Thomas Bayes (1702-1761), the Englishman who was a clergyman as well as a mathematician. His contribution “Essay towards solving a problem in the doctrine of chances” was published in 1764 three years after his death in the Philosophical Transactions of the Royal Society. His work concerned so called inverse probability in connection with gambling and insurance and contained a version of what is today known as Bayes’ Theorem. This theorem lays foundations of Naïve Bayes.

From Probability to Bayes’ Theorem

We begin with a number of definitions that will be used in this chapter and across the book. As a primal source of our definitions, we will rely on (Everitt, 2006).

The notion of probability lies in the heart of Naïve Bayes. Hence, let us introduce it first. Thus, probability is a measure associated with an event A and denoted by 978-1-60960-557-5.ch004.m01 which takes a value such that 978-1-60960-557-5.ch004.m02 (Everitt, 2006). Probability is the quantitative expression of the chance that an event will occur. The higher 978-1-60960-557-5.ch004.m03, the more likely it is that the event A will occur. If an event cannot happen, 978-1-60960-557-5.ch004.m04; if an event is certainly to happen, 978-1-60960-557-5.ch004.m05. Suppose that there are n equally likely outcomes and the event A is associated with 978-1-60960-557-5.ch004.m06 of them, then 978-1-60960-557-5.ch004.m07. This formula for calculating 978-1-60960-557-5.ch004.m08 is based on the assumption that an experiments can be repeated a large number of times, n, and in r cases the event A occurs. The quantity 978-1-60960-557-5.ch004.m09 is called the relative frequency of A. As 978-1-60960-557-5.ch004.m10, 978-1-60960-557-5.ch004.m11, i.e., the relative frequency converges to probability in the limit.

978-1-60960-557-5.ch004.m12is often called unconditional probability, because it does not depend on the outcome of other events (Everitt, 2006). This implies that there is also conditional probability, which is the probability that an event A occurs given the outcome of some other event B. This conditional probability is denoted by 978-1-60960-557-5.ch004.m13 and should be read as the probability of A given B. It is not true that 978-1-60960-557-5.ch004.m14. If events A and B are independent of each other, then 978-1-60960-557-5.ch004.m15.

If two events A and B are mutually exclusive (i.e., they cannot occur at the same time), the probability of either event occurring is the sum of the individual probabilities, i.e.

978-1-60960-557-5.ch004.m16

It is straightforwardly to extend this formula to k mutually exclusive events 978-1-60960-557-5.ch004.m17:

978-1-60960-557-5.ch004.m18

This formula is the sum or addition rule for probabilities.

For events A and B that are independent, the probability that both occur is the product of the individual probabilities:

978-1-60960-557-5.ch004.m19

If there are k independent events, then the last formula can be extended to

978-1-60960-557-5.ch004.m20

As you might already guess, this formula is called the product or multiplication rule for probabilities.

In the last two formulas it was assumed that events are independent of each other. If this is not the case, then

978-1-60960-557-5.ch004.m21
Since 978-1-60960-557-5.ch004.m22

978-1-60960-557-5.ch004.m23

From this equation, we can get the famous Bayes’ Theorem as follows

978-1-60960-557-5.ch004.m24

This theorem gives us a procedure for revising and updating the probability of some event in the light of new evidence1. Given k mutually exclusive and exhaustive events 978-1-60960-557-5.ch004.m25, we rewrite the last formula as

978-1-60960-557-5.ch004.m26
where the sum rule in the form 978-1-60960-557-5.ch004.m27 (marginalization) was used in order to express the denominator in terms of the quantities appearing in the numerator.

This gives the probabilities of the 978-1-60960-557-5.ch004.m28 when A is known to have occurred. The quantity 978-1-60960-557-5.ch004.m29 is termed the prior probability and 978-1-60960-557-5.ch004.m30 the posterior probability, i.e. the probability of 978-1-60960-557-5.ch004.m31 after A has become known. As you can see, there is a subtle but important difference between prior and posterior probabilities. In the former case, we judge on the probability of 978-1-60960-557-5.ch004.m32 without knowing anything about A while in the latter case, we estimate the probability of 978-1-60960-557-5.ch004.m33 after observing A.

Now let us come back to the classification problem. Let us assume that 978-1-60960-557-5.ch004.m34 are k different classes of data and A is a feature vector of gene expression values. In this interpretation the posterior probability 978-1-60960-557-5.ch004.m35 is the probability of assigning an example x to class 978-1-60960-557-5.ch004.m36, based on the feature vector A.

Let us suppose that 978-1-60960-557-5.ch004.m37. Then

978-1-60960-557-5.ch004.m38
978-1-60960-557-5.ch004.m39

The natural classification rule for classifying x given these posterior probabilities is 978-1-60960-557-5.ch004.m40. In other words, we assign x to the class that has the highest posterior probability. One can notice that when comparing the posteriors, one can safely omit the denominator from calculations as it does not affect the result. Therefore, the posterior probabilities are

978-1-60960-557-5.ch004.m41
978-1-60960-557-5.ch004.m42
where the symbol 978-1-60960-557-5.ch004.m43 stands for ‘proportional to’.

The quantity 978-1-60960-557-5.ch004.m44 is termed the likelihood, which is the probability of A given 978-1-60960-557-5.ch004.m45. It can be seen that both the prior probabilities and likelihoods can be found from the data. Thus, construction of the classification based on Bayes’ Theorem does not seem to be difficult. But where is ‘Naïve’?

Complete Chapter List

Search this Book:
Reset