Machine Learning Algorithms for Analysis of DNA Data Sets

John Yearwood (Federation University, Australia), Adil Bagirov (University of Ballarat, Australia) and Andrei V. Kelarev (University of Ballarat, Australia)
DOI: 10.4018/978-1-4666-1833-6.ch004

Applying machine learning algorithms to data sets of DNA sequences is an important problem. This chapter reports an experimental investigation of several machine learning algorithms on the JLA data set, which consists of DNA sequences derived from non-coding segments in the junction of the large single copy region and inverted repeat A of the chloroplast genome in Eucalyptus, collected by Australian biologists. Data sets of this sort represent a new situation, in which sophisticated alignment scores have to be used as the measure of similarity. Because alignment scores do not satisfy the properties of a Minkowski metric, new machine learning approaches have to be investigated. The authors' experiments show that machine learning algorithms based on local alignment scores achieve very good agreement with the known biological classes for this data set. A new machine learning algorithm based on graph partitioning performed best for clustering of the JLA data set, and the authors' novel k-committees algorithm produced the most accurate results for classification. Two new examples of synthetic data sets demonstrate that the k-committees algorithm can outperform both the Nearest Neighbour and k-medoids algorithms simultaneously.
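To illustrate why a local alignment score behaves as a similarity rather than a metric, the following minimal sketch computes a Smith-Waterman style score and uses it for nearest-neighbour classification. It is not the chapter's implementation: the function names, the scoring parameters (match = 2, mismatch = -1, linear gap = -2) and the toy sequences are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (Smith-Waterman,
    linear gap penalty). Scoring parameters are illustrative assumptions."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def nearest_neighbour_label(query, labelled):
    """Return the label of the training sequence most similar to the query.
    `labelled` is a list of (sequence, label) pairs; higher score = closer."""
    _, label = max(labelled, key=lambda pair: local_alignment_score(query, pair[0]))
    return label


# Hypothetical toy data, not taken from the JLA data set.
training = [("ACGTACGT", "class A"), ("TTGGCCAA", "class B")]
predicted = nearest_neighbour_label("ACGTACG", training)
```

Under these assumed parameters, the score of a sequence with itself grows with its length (8 for "ACGT", 4 for "AC") instead of being zero, so the score cannot serve directly as a distance; classifiers therefore pick the largest score rather than the smallest distance.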
Chapter Preview

Preliminaries And Background Information

We use standard machine learning terminology and notions. For prerequisites on machine learning techniques, we refer the reader to the monographs by Kaufman & Rousseeuw (1990), Witten & Frank (2005), and Yearwood & Mammadov (2010); for background information on nucleotide sequences, see Baldi & Brunak (2001) and Gusfield (1997).
