The name of this classifier looks a bit strange. You may ask who Bayes is and why he is considered naïve. You may also question the power of such a “naïve” algorithm. But do not rush to conclusions: as we will see later in this chapter, Naïve Bayes is far from helpless in practice.
This classifier is named after Thomas Bayes (1702-1761), an English clergyman and mathematician. His contribution “An Essay towards solving a Problem in the Doctrine of Chances” was published in 1764, three years after his death, in the Philosophical Transactions of the Royal Society. His work concerned so-called inverse probability in connection with gambling and insurance, and contained a version of what is today known as Bayes’ Theorem. This theorem lays the foundation of Naïve Bayes.
From Probability to Bayes’ Theorem
We begin with a number of definitions that will be used in this chapter and across the book. As the primary source for our definitions, we will rely on (Everitt, 2006).
The notion of probability lies at the heart of Naïve Bayes, so let us introduce it first. Probability is a measure associated with an event $A$ and denoted by $P(A)$, which takes a value such that $0 \le P(A) \le 1$ (Everitt, 2006). Probability is the quantitative expression of the chance that an event will occur: the higher $P(A)$, the more likely it is that the event $A$ will occur. If an event cannot happen, $P(A) = 0$; if an event is certain to happen, $P(A) = 1$. Suppose that there are $n$ equally likely outcomes and the event $A$ is associated with $r$ of them; then $P(A) = r/n$. This formula for calculating $P(A)$ is based on the assumption that an experiment can be repeated a large number of times, $n$, and in $r$ of these cases the event $A$ occurs. The quantity $r/n$ is called the relative frequency of $A$. As $n \to \infty$, $r/n \to P(A)$, i.e., the relative frequency converges to the probability in the limit.
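To make the relative-frequency view concrete, here is a minimal sketch (a hypothetical illustration in Python, not part of the original text) that simulates rolling a fair six-sided die and tracks how the relative frequency $r/n$ of the event $A$ = “a six is rolled” approaches $P(A) = 1/6$ as $n$ grows.

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def relative_frequency(n_trials):
    """Roll a fair die n_trials times; return r/n for the event 'a six is rolled'."""
    r = sum(1 for _ in range(n_trials) if random.randint(1, 6) == 6)
    return r / n_trials

for n in (10, 100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: r/n = {relative_frequency(n):.4f}   (P(A) = {1/6:.4f})")
```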
$P(A)$ is often called the unconditional probability, because it does not depend on the outcome of other events (Everitt, 2006). This implies that there is also a conditional probability, which is the probability that an event $A$ occurs given the outcome of some other event $B$. This conditional probability is denoted by $P(A|B)$ and should be read as the probability of $A$ given $B$. In general, it is not true that $P(A|B) = P(A)$; however, if the events $A$ and $B$ are independent of each other, then $P(A|B) = P(A)$.
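As a small illustration of the difference between conditional and unconditional probability (a hypothetical Python sketch; the die events are invented for this example), take one roll of a fair die with $A$ = “the roll is even” and $B$ = “the roll is greater than 3”. Counting outcomes gives $P(A) = 1/2$ but $P(A|B) = 2/3$, so here $A$ and $B$ are not independent.

```python
from fractions import Fraction

outcomes = range(1, 7)                   # six equally likely outcomes of one die roll
A = {o for o in outcomes if o % 2 == 0}  # event A: the roll is even -> {2, 4, 6}
B = {o for o in outcomes if o > 3}       # event B: the roll is greater than 3 -> {4, 5, 6}

P_A = Fraction(len(A), len(outcomes))       # unconditional: P(A) = 3/6 = 1/2
P_A_given_B = Fraction(len(A & B), len(B))  # restrict the sample space to B: 2/3

print(P_A, P_A_given_B)  # 1/2 2/3, so P(A|B) != P(A): A and B are not independent here
```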
If two events $A$ and $B$ are mutually exclusive (i.e., they cannot occur at the same time), the probability of either event occurring is the sum of the individual probabilities, i.e.,
$$P(A \text{ or } B) = P(A) + P(B).$$
It is straightforward to extend this formula to $k$ mutually exclusive events $A_1, A_2, \ldots, A_k$:
$$P(A_1 \text{ or } A_2 \text{ or } \ldots \text{ or } A_k) = P(A_1) + P(A_2) + \ldots + P(A_k).$$
This formula is the sum or addition rule for probabilities.
For events $A$ and $B$ that are independent, the probability that both occur is the product of the individual probabilities:
$$P(A \text{ and } B) = P(A)P(B).$$
If there are $k$ independent events $A_1, A_2, \ldots, A_k$, then the last formula can be extended to
$$P(A_1 \text{ and } A_2 \text{ and } \ldots \text{ and } A_k) = P(A_1)P(A_2)\ldots P(A_k).$$
As you might already guess, this formula is called the product or multiplication rule for probabilities.
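Both rules can be checked numerically on a toy example (a hedged sketch in Python using exact fractions; the die events are again invented): for one die roll, the mutually exclusive events “roll a 1” and “roll a 2” obey the sum rule, while for two separate rolls the independent events “the first roll is a 1” and “the second roll is a 2” obey the product rule.

```python
from fractions import Fraction

# Sum rule: mutually exclusive events on a single roll of a fair die.
P_roll_1 = Fraction(1, 6)
P_roll_2 = Fraction(1, 6)
P_roll_1_or_2 = Fraction(2, 6)               # favourable outcomes {1, 2} out of 6
assert P_roll_1_or_2 == P_roll_1 + P_roll_2  # P(A or B) = P(A) + P(B)

# Product rule: independent events on two separate rolls.
pairs = [(i, j) for i in range(1, 7) for j in range(1, 7)]  # 36 equally likely pairs
P_first_1_and_second_2 = Fraction(
    sum(1 for i, j in pairs if i == 1 and j == 2), len(pairs)
)
assert P_first_1_and_second_2 == P_roll_1 * P_roll_2        # P(A and B) = P(A)P(B)

print("sum and product rules hold on this toy example")
```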
In the two multiplication-rule formulas above it was assumed that the events are independent of each other. If this is not the case, then
$$P(A \text{ and } B) = P(A|B)P(B).$$
Since $P(A \text{ and } B) = P(B \text{ and } A) = P(B|A)P(A)$, we have
$$P(A|B)P(B) = P(B|A)P(A).$$
From this equation, we can get the famous Bayes’ Theorem as follows:
$$P(B|A) = \frac{P(A|B)P(B)}{P(A)}.$$
This theorem gives us a procedure for revising and updating the probability of some event in the light of new evidence. Given $k$ mutually exclusive and exhaustive events $B_1, B_2, \ldots, B_k$, we rewrite the last formula as
$$P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_{j=1}^{k} P(A|B_j)P(B_j)},$$
where the sum rule in the form
$$P(A) = \sum_{j=1}^{k} P(A|B_j)P(B_j)$$
(marginalization) was used in order to express the denominator in terms of the quantities appearing in the numerator.
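A small numeric sketch of the theorem (in Python, with priors and conditional probabilities invented purely for illustration): given prior probabilities $P(B_i)$ for $k = 3$ mutually exclusive and exhaustive events and the conditional probabilities $P(A|B_i)$, the posterior $P(B_i|A)$ is obtained by marginalizing over all $B_j$ to form the denominator $P(A)$.

```python
# Hypothetical priors P(B_i) for k = 3 mutually exclusive and exhaustive events.
priors = [0.5, 0.3, 0.2]
# Hypothetical conditional probabilities P(A | B_i) of the evidence A under each B_i.
p_A_given_B = [0.10, 0.40, 0.70]

# Marginalization: P(A) = sum_j P(A | B_j) P(B_j).
p_A = sum(l * p for l, p in zip(p_A_given_B, priors))

# Bayes' Theorem: P(B_i | A) = P(A | B_i) P(B_i) / P(A).
posteriors = [l * p / p_A for l, p in zip(p_A_given_B, priors)]

print(f"P(A) = {p_A:.2f}")                # 0.31
for i, post in enumerate(posteriors, start=1):
    print(f"P(B_{i} | A) = {post:.3f}")   # 0.161, 0.387, 0.452
# The posteriors sum to 1 and revise the priors in the light of the evidence A.
```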
The rewritten formula gives the probabilities of the $B_i$ when $A$ is known to have occurred. The quantity $P(B_i)$ is termed the prior probability and $P(B_i|A)$ the posterior probability, i.e., the probability of $B_i$ after $A$ has become known. As you can see, there is a subtle but important difference between prior and posterior probabilities: in the former case, we judge the probability of $B_i$ without knowing anything about $A$, while in the latter case, we estimate the probability of $B_i$ after observing $A$.
Now let us come back to the classification problem. Let us assume that $B_1, B_2, \ldots, B_k$ are $k$ different classes of data and $A$ is a feature vector of gene expression values. In this interpretation, the posterior probability $P(B_i|A)$ is the probability of assigning an example $x$ to class $B_i$, based on the feature vector $A$.
Let us suppose that we have observed the feature vector $A$ of an example $x$. Then
$$P(B_i|A) = \frac{P(A|B_i)P(B_i)}{\sum_{j=1}^{k} P(A|B_j)P(B_j)}.$$
The natural classification rule for classifying $x$ given these posterior probabilities is
$$i^* = \arg\max_{i} P(B_i|A).$$
In other words, we assign $x$ to the class that has the highest posterior probability. One can notice that, when comparing the posteriors, one can safely omit the denominator from the calculations, as it does not affect the result. Therefore, the posterior probabilities are
$$P(B_i|A) \propto P(A|B_i)P(B_i),$$
where the symbol $\propto$ stands for ‘proportional to’.
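The classification rule above can be sketched in a few lines (a hypothetical Python illustration; the class names, priors, and values of $P(A|B_i)$ are invented, and how $P(A|B_i)$ is actually modelled is addressed in the rest of the chapter): each class receives the score $P(A|B_i)P(B_i)$, and $x$ is assigned to the class with the largest score, since the common denominator $P(A)$ does not change the ranking.

```python
# Hypothetical priors P(B_i) and class-conditional probabilities P(A | B_i)
# for an observed feature vector A; the numbers are for illustration only.
priors = {"class_1": 0.6, "class_2": 0.3, "class_3": 0.1}
p_A_given_class = {"class_1": 0.02, "class_2": 0.09, "class_3": 0.05}

# Unnormalized posteriors: P(B_i | A) is proportional to P(A | B_i) P(B_i).
scores = {c: p_A_given_class[c] * priors[c] for c in priors}

# Assign x to the class with the highest posterior; P(A) can be safely omitted.
predicted_class = max(scores, key=scores.get)

print(scores)           # class_2 gets the largest score (0.3 * 0.09 = 0.027)
print(predicted_class)  # -> 'class_2'
```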
The quantity $P(A|B_i)$ is termed the likelihood, which is the probability of $A$ given $B_i$. It can be seen that both the prior probabilities and the likelihoods can be estimated from the data. Thus, constructing a classifier based on Bayes’ Theorem does not seem to be difficult. But where is the ‘Naïve’?