Introduction
The frequency distribution of k-mers is useful for many bioinformatics applications that work with Next-Generation Sequencing (NGS) data (Behera et al., 2018). Examples include de Bruijn graph-based assembly, read error correction, genome size estimation, and digital normalization. Counting (or estimating) low-frequency k-mers is a pre-processing phase when designing tools for such applications. The number of k-mers in genome data, and more specifically their frequency distribution, is therefore a central component of many genomic analyses (Carvalho et al., 2016; Rangavittal et al., 2017).
The analysis consists of input, output, k-mer counting, sequencing, similarity search, and assembly. Of these tasks, k-mer counting and similarity search in sequence data are the major modules to address. An alignment (mapping/searching) task arranges sequences to achieve the highest level of similarity, and alignment helps in generating a phylogenetic tree. There are two types of alignment: local (only a portion of the sequence is aligned) and global (the sequences are aligned over their full length). Living things are related by evolution, so the DNA, RNA, and protein (amino acid) sequences of different organisms are related to one another and show similarities. Assembly is the reconstruction of a DNA sequence by combining and aligning small fragments. It is part of novel DNA sequencing, which involves (a) cutting the DNA into small pieces, (b) reading the small fragments, and (c) reconstituting the original DNA by merging the information from the various fragments.
While computing genomic data, k-mers (Miller et al., 2008) are used in sequence assembly and sequence alignment. K-mer generation produces all possible substrings of length K from an input DNA sequence. The total number of k-mers generated from a sequence of length ‘len’ is len-K+1. When performing a sequence assembly task, the choice of K is very important. With smaller k-mers, the graph has fewer edges, so less space is required to store the sequence; with larger k-mers, the graph has more edges, so more memory is required. Next Generation Sequencing (NGS) technologies (Manekar et al., 2018) generate billions of reads in every run. There is always scope for designing a framework that is efficient in terms of memory and time. Many researchers work with disk-based approaches that use local resources, which are limited by nature. Compressed input gives better results in terms of processing time. The authors therefore concentrate on k-mer generation and counting.
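The k-mer generation and counting step described above can be illustrated with a minimal Python sketch. It is an in-memory illustration only, not the authors' framework; the function name count_kmers and the example sequence are assumptions made for this example.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of the sequence.

    A sequence of length n yields n - k + 1 k-mers in total.
    Hypothetical helper for illustration; not taken from the article.
    """
    if k <= 0 or k > len(sequence):
        return Counter()
    # Slide a window of width k across the sequence, one base at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = count_kmers("ACGTACGT", 3)
total = sum(counts.values())  # 8 - 3 + 1 = 6 k-mers in total
```

For the 8-base example with K = 3, the total is 6, and the k-mers ACG and CGT each occur twice. A Counter holds every distinct k-mer in memory, which is exactly the cost that becomes limiting at NGS scale.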
As technology improves, the quantity of data increases substantially, which has intensified the need for new, effective techniques to speed up the search for compatible DNA sequences in a large data collection (Li et al., 2019; Hiraishi, 2019). One of the main problems of matching approaches is the variability in the length of sequences in a given sample, which affects the results. There will always be the most common subsequence since the longest series between the others (Alazzam et al., 2018). Two key variables are used during the matching process to determine the efficiency of a string matching algorithm: the total number of character comparisons and the total number of attempts (Sameer et al., 2017).
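To make the two efficiency measures concrete, the sketch below instruments a naive (brute-force) string matcher so that it reports both the number of attempts (alignments tried) and the number of character comparisons. The naive matcher is chosen only because it is the simplest baseline; it is not the technique proposed in the article, and the name naive_search is an assumption.

```python
def naive_search(text: str, pattern: str):
    """Brute-force matcher that also reports its own cost.

    Returns (positions, attempts, comparisons), where attempts is the
    number of alignments tried and comparisons is the number of
    single-character comparisons performed.
    Hypothetical helper for illustration only.
    """
    n, m = len(text), len(pattern)
    positions, attempts, comparisons = [], 0, 0
    if m == 0 or m > n:
        return positions, attempts, comparisons
    for start in range(n - m + 1):
        attempts += 1
        matched = 0
        # Compare left to right and stop at the first mismatch.
        while matched < m:
            comparisons += 1
            if text[start + matched] != pattern[matched]:
                break
            matched += 1
        if matched == m:
            positions.append(start)
    return positions, attempts, comparisons


positions, attempts, comparisons = naive_search("ACGTACGT", "ACG")
```

For this input the matcher finds the pattern at positions 0 and 4 after 6 attempts and 10 character comparisons. More refined algorithms aim to lower one or both of these counts.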
As computing becomes easier and the sources of genomes expand, the domain of bioinformatics is sure to grow and change radically, allowing us to build new models of greater complexity and usefulness. When sequence analysis reveals the cause of a disease, the number of occurrences of the sequence indicates the likelihood of the disease. As the genome is a huge database, the authors propose a Stream/String and Pattern matching technique to find a particular sequence in a given large input sequence. Bloom filters provide a suitable data structure for classification while performing sequencing. (NajamL Jin et al., 2018) proposed Multiple Bloom Filters, which locate a specific pattern in DNA and also report its number of repetitions. These two factors are very important when predicting any sort of disease. This proposal focuses on a new approach for detecting the patterns present in a gene database. Stream matching finds the exact location of a specified pattern.
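The two quantities highlighted above, where a pattern occurs and how often it repeats, can be illustrated with a plain hash-based k-mer position index. This is a deliberately simple stand-in, not the Multiple Bloom Filters method or the authors' proposed technique; the names build_kmer_index and locate, and the example sequence, are assumptions.

```python
from collections import defaultdict


def build_kmer_index(sequence: str, k: int) -> dict:
    """Map each k-mer to the list of 0-based positions where it starts.

    Hypothetical helper; an exact in-memory index, unlike a Bloom
    filter, which is probabilistic and more compact.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def locate(index: dict, pattern: str) -> list:
    """Return all start positions of a length-k pattern (empty if absent)."""
    return index.get(pattern, [])


index = build_kmer_index("GATTACAGATTACA", 4)
hits = locate(index, "GATT")   # start positions of the pattern
repetitions = len(hits)        # number of occurrences
```

Here the pattern GATT is found at positions 0 and 7, so it repeats twice. The index answers queries only for patterns of exactly length k, and it stores every position explicitly, which trades memory for exact answers.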