Receive a 20% Discount on All Purchases Directly Through IGI Global's Online Bookstore

Wangsen Feng (Peking University, China) and Lusheng Wang (City University of Hong Kong, China)

Copyright: © 2012
|Pages: 14

DOI: 10.4018/978-1-4666-1785-8.ch001

Chapter Preview

TopMotif identification for DNA sequences has many important applications in biological studies, e.g., diagnostic probe design, locating binding sites and regulatory signals, and potential drug target identification (Dopazo, Rodriguez, Saiz, & Sobrino, 1993; Lucas, Busch, Mossinger, & Thompson, 1991; Proutski & Holme, 1996). The most general problem is as follows:

**Two Groups:**Given two groups of sequences*B*and*G*, find a length-*l*(distinguishing) motif that appears in every sequence in*B*and does not appear in anywhere of the sequences in*G*.

The motif we want to find is called the *distinguishing* motif that can differentiate the two groups. If no error is allowed, the problem is easy. However, in practice, the occurrence of the motif in each of the given sequences has errors (mutations and indels). The problem becomes extremely hard when errors appear. Many mathematic models have been proposed.

The following definition was first stated in (Lanctot, Li, Ma, Wang, & Zhang, 1999).

**Distinguishing substring selection:**given a set*B*= {*s*_{1},*s*_{2,}*…, s*of_{n1}}*n*(bad) strings of length at least_{1}*l*, and a set*G*= {*g*_{1},g_{2},*…, g*of_{n2}}*n*(good) strings of length at least_{2}*l*, the problem is to find a string*s*of length*l*such that for each string*s*there exists a length-_{i}∈B*l*substring*t*of_{i}*s*with_{i}*d*(_{H}*s, t*)≤_{i}*d*and for any length-_{b}*l*substring*u*of_{i}*g*,_{i}∈ G*d*(_{H}*s, u*)≥_{i}*d*. Here_{g}*d*and_{b}*d*are two parameters, called the radius of the two groups. We want_{g}*d*to be small and_{b}*d*to be large._{g}*d*(,) represents the Hamming distance between two strings._{H}

When indels are allowed in the occurrences of the motif, the problem is called the *distinguishing subsequence selection* problem, where the input is the same as the previous problem and the objective here is to find a string *s* of length *l* such that for each string *s _{i} ∈ B* there exists a substring

A special case, where only one group of sequences is involved is as follows:

**Single Group**: Given a group of*n*given sequences, find a motif that appears in each of the given sequences and those occurrences of the motif are*similar*.

There are several ways to define the similarity of the motif occurrences, e.g., the *consensus score*, *general consensus score* and SP-score (Hertz & Stormo, 1995; Li, Ma, & Wang, 1999). Here we use the following mathematical definitions.

**The Closest Substring Problem:**Given*n*sequences {*s*}, each is of length_{1}, s_{2}, . . ., s_{n}*m (m ≥ l),*the problem is to find a center string*s*of length*l*and a substring*t*of length_{i}*l*in*s*such that_{i}

Search this Book:

Reset

Copyright © 1988-2018, IGI Global - All Rights Reserved