Scaling Behavior of Maximal Repeat Distributions in Genomic Sequences: A Randomize Test Follow Up Study

J. D. Wang, Ka-Lok Ng
DOI: 10.4018/978-1-60566-902-1.ch028

Abstract

The maximal repeat distribution analysis is redone by plotting the relative frequency of maximal repeat patterns against the rank of the appearance. In almost all cases, the rank plots give better coefficient of determination values than the authors’ previous work, in which frequency plots were used. The maximal repeat study is also repeated on a randomized version; for it, the rank plot regression analysis does not support scaling behavior, which indicates that the findings are not due to an artifact.

Results

In this section, we present the study of maximal repeat frequencies in different genomic sequences. To carry out the analysis with the power-law model, we first record the total number of identified maximal repeats (N) that appear in the genomic sequence under study. For each frequency of appearance of a maximal repeat, a, we record the total number, N(a), of maximal repeat patterns that have that frequency of appearance in the genomic sequence. The frequencies of appearance a are ranked in descending order; for example, a frequency of appearance of two (a = 2) is ranked number one, that is, k equals one. We divide N(a) by N, call the result P(k), and plot Log P(k) against Log(k), where k denotes the rank of a with non-zero P(k). The power law states that P(k) ~ k^(-γ), where γ is the exponent. P(k) is the fraction of the total number of maximal repeats with rank k (1 ≦ k ≦ 999) in the genomic sequence under study. For example, if a frequency of appearance of 1000 is ranked 902, then P(902) is the fraction of all maximal repeat patterns that appear in the genomic sequence one thousand times. The length of the maximal repeat patterns we search for ranges from 3 to 50 bp.
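The bookkeeping described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `rank_distribution` and the input format (one appearance count per maximal repeat pattern) are assumptions made for illustration. Ranks are assigned following the worked example in the text, so the smallest observed frequency (a = 2) receives k = 1.

```python
from collections import Counter


def rank_distribution(appearance_counts):
    """Return a list of (k, a, P(k)) tuples.

    appearance_counts: iterable of ints, one entry per maximal repeat
    pattern, giving how many times that pattern appears in the sequence
    (hypothetical input format, assumed for this sketch).
    """
    n_of_a = Counter(appearance_counts)   # N(a): patterns sharing frequency a
    total = sum(n_of_a.values())          # N: total number of patterns
    # Only frequencies that actually occur are ranked, so every P(k) is
    # non-zero. Assumption: rank order follows the example in the text,
    # where a = 2 is ranked k = 1.
    ranked_frequencies = sorted(n_of_a)
    return [
        (k, a, n_of_a[a] / total)
        for k, a in enumerate(ranked_frequencies, start=1)
    ]


# Toy input: four patterns appear twice, two appear three times, one appears five times.
toy_counts = [2, 2, 2, 2, 3, 3, 5]
for k, a, p in rank_distribution(toy_counts):
    print(k, a, round(p, 4))
```

Because unobserved frequencies are skipped, the rank k can be smaller than the frequency a it stands for, which is consistent with the example of a = 1000 being ranked 902.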

Table 1 presents the results of the regression analysis for genomic sequences chosen from different groups of taxa. For each genomic sequence, we give the total genomic sequence length, the exponent of the power law, and r². An entry of ‘chromosomes’ in the GenBank ID column means that the power-law result was obtained with the species’ chromosomes assembled together. The regression analysis finds that the exponent γ ranges from 1.81 to 2.06 (the same range as in our previous work [1]), and the data for all 56 species are well fitted by power-law distributions (the data reject the null hypothesis of a uniform distribution with p-value << 10^-6). All 56 species have an r² value larger than 0.938.
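As a rough illustration of how an exponent and a coefficient of determination can be obtained from a rank plot, the sketch below fits a straight line to Log P(k) against Log(k) by ordinary least squares. The source does not state the fitting procedure or the logarithm base, so least squares, base-10 logarithms, the power-law form P(k) ∝ k^(-γ), and the function name `fit_power_law` are all assumptions of this sketch.

```python
import math


def fit_power_law(ranks, fractions):
    """Least-squares line through (log10 k, log10 P(k)).

    Returns (gamma, intercept, r_squared), where gamma is the negated
    slope, so that P(k) is proportional to k ** (-gamma).
    """
    xs = [math.log10(k) for k in ranks]
    ys = [math.log10(p) for p in fractions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, r^2 is the squared correlation coefficient.
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, intercept, r_squared


# Synthetic check: data generated with exponent 2 should be recovered.
ranks = list(range(1, 51))
fractions = [k ** -2.0 for k in ranks]
gamma, intercept, r_squared = fit_power_law(ranks, fractions)
print(round(gamma, 3), round(r_squared, 3))
```

The synthetic data only confirm that the fit recovers a known exponent; they are not taken from Table 1.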
