Establishing a relationship of the empirical or observed results of an experiment to factors affecting the system or to pure chance calls for various statistical models and measures. In such scenarios, an observation is deemed statistically significant if its presence cannot be attributed to randomness alone. The literature hosts a number of statistical models to capture the uniqueness of such observations such as p-value and z-score. In the next few sections, we discuss different important statistical tools that are used for this purpose.
Before venturing forward, we provide a formal definition of the problem.
It is this measure of deviation of a sequence that we will capture by using various statistical models. In the remainder of the chapter, we interchangeably use the term string with sequence and substring with subsequence.
Hypothesis Testing and P-Value
Given an observation sample X (in this case a substring), with an associated score of S(X), the p-value of X is defined as the probability of obtaining a random sample with score S(X) or greater under the same probability model (Bejerano et al., 2004; Regnier & Vandenbogaert, 2006). For each such observation, we test the null hypothesis H0 that the substring is drawn from the given probability model P against the alternate hypothesis H1 that the subsequence is not drawn from the same probability distribution. The p-value measures the chance of rejecting the null hypothesis; in other words, the less the p-value, the less likely it is that the null hypothesis is true.
Figure 1 shows an example. For a particular score S, the shaded area represents the chance of having a sample with a score greater than the one under consideration. In other words, the p-value is the value of the cumulative density function (cdf) measured at S subtracted from the total probability, i.e.,
pvalue(S)=1-cdf(S).
Figure 1. Computing the p-value of X with score S
If the probability density function (pdf) of the scores is known, it is relatively simpler to compute the p-value of a particular score using the above formula. However, in most real situations, the pdf is hard to estimate or can be non-parametric. The accurate computation of the p-value then needs all the possible outcomes to be listed, their scores computed, and the number of outcomes having scores more than S counted. Since the number of possible outcomes is large, and is exponential in most cases, computing the p-value in such a manner is practically infeasible.
To alleviate this problem, various branch-and-bound techniques have been proposed (Bejerano et al., 2004). In systems where such accuracy in measurement is not a necessity and a small factor of error can be tolerated, an approximation of the p-value can be calculated using other statistical tools (Rahmann, 2003).