DNA-Based Indexing

Max H. Garzon (The University of Memphis, USA), Kiran C. Bobba (The University of Memphis, USA), Andrew Neel (The University of Memphis, USA) and Vinhthuy Phan (The University of Memphis, USA)
Copyright: © 2010 |Pages: 21
DOI: 10.4018/jnmc.2010070102


DNA has been acknowledged as a suitable medium for massively parallel computing and as a “smart” glue for self-assembly. In this paper, a third capability of DNA is described in detail: as a memory capable of encoding and processing large amounts of data so that information can be retrieved associatively, based on content. The technique is based on a novel representation of data on DNA that can shed light on the way DNA, RNA, and other biomolecules encode information, which may prove important in fields such as bioinformatics, genetics, and natural language processing. Analyses are also provided of the sensitivity, robustness, and bounds on the theoretical capacity of the memories. Finally, the potential use of the memories is illustrated with two applications: one in genomic analysis for identification and classification, the other in information retrieval from text data in abiotic form.
Article Preview


Techniques for large-scale compact data representation and mining have recently been introduced. Notable examples are artificial neural networks (Haykin, 1988), Kanerva's associative memories (Kanerva, 1988), and LSI (Latent Semantic Indexing; Deerwester et al., 1990). These memories make it possible to represent and process large data corpora and to extract information useful to humans in semantic terms, unlike the conventional syntactic methods used in electronic databases and data warehouses. The new methods share a set of common features. They are trained on representative sample data sets to build the memory by extracting deep patterns from the sample data, and they are then used on unknown data to perform similar functions with levels of success comparable to those on the known data. Thus we have training algorithms, such as back-propagation with neural nets, learning patterns with Kanerva's memories, and dimension reduction and principal component analysis with LSI.
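As a rough illustration of the LSI approach mentioned above (not taken from the paper), associative, content-based retrieval can be sketched with a truncated SVD: documents and queries are projected into a low-dimensional latent space, and retrieval ranks documents by cosine similarity there. The toy term-document matrix and the choice k = 2 below are assumptions made purely for the example.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# Illustrative data only, not from the paper.
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
], dtype=float)

# LSI: keep the k largest singular values, projecting documents
# into a k-dimensional "latent semantic" space.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

def query_similarity(q):
    """Fold a query (term-count vector) into the latent space and
    return its cosine similarity to every document."""
    q_vec = np.linalg.inv(np.diag(s[:k])) @ U[:, :k].T @ q
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    return doc_vecs @ q_vec / norms

# Retrieve associatively by content: query mentions terms 0 and 1.
sims = query_similarity(np.array([1, 1, 0, 0], dtype=float))
best = int(np.argmax(sims))
```

The point of the sketch is the shared pattern the text describes: a "training" phase (the SVD extracts deep patterns from the sample corpus) followed by use on unseen inputs (queries folded into the same latent space).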

On the other hand, for over a decade now, Adleman's idea to use DNA for computational purposes has proven fertile ground for computation (Adleman, 1994) and nano-assembly (Seeman, 1999; Winfree et al., 1998). However, the problem of finding a systematic procedure to map both symbolic (abiotic) and nonsymbolic (e.g., biological) information onto biomolecules for massively parallel processing in wet test tubes has faced several challenges. Mapping non-biological information for processing in vitro is an enormous challenge. Even the easier direct readout problem, i.e., converting genomic data into electronic form for conventional analysis, is an expensive and time-consuming process in bioinformatics (Mount, 2001). Moreover, the results of these analyses are usually available only in forms that cannot be fed back directly to the carriers of genomic information. In this paper, we propose and discuss an approach that addresses both of these problems.

Three properties are critical for the eventual success of such a mapping algorithm/protocol, as discussed in (Blain & Garzon, 2004). First, the representation or indexing has to be universal, scalable, automatic, and fast. Universal means that any kind of symbolic data/pattern can be mapped, in principle, to DNA; otherwise the mapping restricts the kind of information that can be mapped, and the processing capabilities in DNA form may be too constrained. Scalable means that the mapping is only justified for massive quantities of data that cannot be processed by conventional means in reasonable time; it must therefore scale to the terabytes, and higher orders of magnitude, it will eventually encounter. Currently, no cost-effective techniques exist for transferring these volumes by manual addition and extraction of patterns one by one. Ordinary symbol-wise transductions require manually manufacturing the corresponding DNA strands, an impossible task with current technology. The indexing must also be automatic and high-speed, because manual mapping, e.g., by synthesis of individual strands, is also very costly time-wise. An effective strategy must be automatable and eventually orders of magnitude faster than processing of the data in silico.
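To make concrete what a naive symbol-wise transduction amounts to in silico, the sketch below spells out each byte of symbolic data as four DNA bases, two bits per base. This is only a baseline illustration of the term used in the text: the function names and the 2-bit encoding are assumptions for the example, the paper's actual indexing scheme is different, and the real obstacles (strand synthesis cost, codeword design to avoid cross-hybridization) are exactly what such a direct mapping does not address.

```python
# Fixed 2-bit-per-base alphabet; the assignment of bit pairs to
# bases is an arbitrary choice made for this illustration.
BASES = "ACGT"

def encode_to_dna(data: bytes) -> str:
    """Map each byte to 4 bases, most-significant bit pair first."""
    strand = []
    for b in data:
        for shift in (6, 4, 2, 0):
            strand.append(BASES[(b >> shift) & 0b11])
    return "".join(strand)

def decode_from_dna(strand: str) -> bytes:
    """Inverse mapping: every 4 bases back to one byte."""
    out = bytearray()
    for i in range(0, len(strand), 4):
        b = 0
        for ch in strand[i:i + 4]:
            b = (b << 2) | BASES.index(ch)
        out.append(b)
    return bytes(out)

dna = encode_to_dna(b"DNA")
# The mapping round-trips back to the original bytes.
assert decode_from_dna(dna) == b"DNA"
```

Even this trivial scheme makes the scalability point plain: a terabyte of data becomes four trillion bases, which is why manual strand-by-strand synthesis cannot serve as a transfer mechanism.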
