Dynamic Data Retrieval Using Incremental Clustering and Indexing

Uma Priya D, Santhi Thilagam P
Copyright: © 2020 | Pages: 18
DOI: 10.4018/IJIRR.2020070105

Abstract

The evolution of the Internet and real-time applications has driven the growth of massive unstructured data, which makes efficient retrieval of dynamic data increasingly complex. Existing research uses clustering methods and indexes to speed up retrieval. However, the quality of clustering depends on the data representation model, and existing models suffer from dimensionality explosion and sparsity. As documents evolve, reconstructing the index from scratch is expensive. In this work, compact document vectors generated by the Doc2Vec model are used to cluster the documents, and the indexes are updated incrementally at lower cost using the diff method. The probabilistic ranking scheme BM25+ is used to improve retrieval quality for user queries. Experimental analysis demonstrates that the proposed system significantly improves clustering performance and reduces the time needed to retrieve the top-k results.
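
As a rough illustration of the BM25+ ranking scheme named in the abstract, the following minimal sketch scores tokenized documents against a query and returns the top-k document positions. It is not the authors' implementation: the function names `bm25_plus_scores` and `top_k`, the parameter defaults (`k1=1.2`, `b=0.75`, `delta=1.0`), and the IDF form `log((N + 1) / df)` are assumptions chosen for illustration, and the corpus is assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_plus_scores(query, docs, k1=1.2, b=0.75, delta=1.0):
    """Score each tokenized document against a tokenized query with BM25+.

    Parameter defaults and the IDF form are illustrative assumptions.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization shared by all terms of this document.
        norm = k1 * (1.0 - b + b * len(d) / avg_len)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue  # only terms present in the document contribute
            idf = math.log((n_docs + 1) / df[term])
            # delta lower-bounds the contribution of any matched term.
            score += idf * ((k1 + 1) * tf[term] / (norm + tf[term]) + delta)
        scores.append(score)
    return scores


def top_k(query, docs, k=2):
    """Return the positions of the k highest-scoring documents."""
    scores = bm25_plus_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

For example, with `docs = [["incremental", "index", "update"], ["cluster", "documents", "vectors"], ["index", "index", "cluster"]]`, calling `top_k(["index"], docs)` ranks the third document ahead of the first, because both have the same length and the third contains the query term twice.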

Introduction

Over the past two decades, technological innovation, the emergence of social networks, the adoption of hand-held computing devices, and the explosive growth in the use of the Internet and computing services have contributed to the tremendous growth of heterogeneous data of structured, semi-structured, and unstructured types, commonly known as Big Data. Every day, 2.5 quintillion bytes of data are generated (EDBD Statistics, 2015) as emails, audio, video, web pages, social media messages, and so forth, of which 90% is unstructured. This growth makes efficient retrieval of the data increasingly complex. Conventional methods are well suited to static data, but dynamic unstructured text data demands a more efficient way of organizing and processing.

In this big data era, querying large data necessitates organized storage in which the incoming data (usually represented as vectors) are categorized based on vector similarity. Similar documents can then be retrieved quickly for user queries instead of processing the entire collection at once. As documents evolve, clustering algorithms should cope with the dynamic nature of the data with minimal sacrifice in clustering quality. Several clustering algorithms have been proposed with different data representation models (Ding and He, 2004; Campr and Jezek, 2015), similarity measures (Audhkhasi and Verma, 2007; Huang, 2008), and grouping techniques (Dhillon et al., 2004; Shindler et al., 2011; Cai et al., 2013). The data representation refers to the number of classes and the available patterns applicable to the clustering algorithm. Good representations capture a vast number of possible patterns. Hence, the quality of clustering algorithms is highly dependent on representation learning. To transform the data into a more cluster-friendly form, representation learning models (Mikolov et al., 2013b; Pennington et al., 2014; Yang et al., 2016; Kim et al., 2017; Joshi et al., 2018; Ren et al., 2019) are used to generate distributed representations of words. Traditional machine learning models always result in a locally optimal solution, whereas distributed representation learners are trained on many samples to learn the representation. In terms of expressiveness, traditional machine learning models such as decision trees and support vector machines (SVM) require O(N) input samples to distinguish O(N) regions. In contrast, distributed representation learning models represent O(2^k) regions for the same samples (Bengio et al., 2013), where k denotes the number of non-zero elements in the distributed representation.
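
The gap between O(N) and O(2^k) can be made concrete with a small counting exercise. The sketch below only counts how many distinct codes a local (one-hot) representation and a distributed binary representation can express with the same number of units; it is a simplified illustration rather than the argument given by Bengio et al. (2013), and the function names `local_regions` and `distributed_regions` are hypothetical.

```python
from itertools import product


def local_regions(n_units):
    """Distinct codes of a one-hot (local) representation: one unit active at a time."""
    return {
        tuple(1 if i == j else 0 for i in range(n_units))
        for j in range(n_units)
    }


def distributed_regions(k_features):
    """Distinct codes of a distributed representation with k binary features."""
    return set(product((0, 1), repeat=k_features))


# With 8 units, the one-hot code separates 8 regions,
# while 8 binary features can separate 2**8 regions.
n_local = len(local_regions(8))
n_distributed = len(distributed_regions(8))
```

Under these assumptions, the local code grows linearly with the number of units, while the distributed code grows exponentially with the number of features.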

While clustering aims to organize data efficiently to improve retrieval performance, the complexity of the search operation on dynamic data is yet another challenge. Applying a suitable indexing method improves query processing by reducing the complexity of the search operation. Because the input is unstructured, documents are searched by content, i.e., by keyword search. In practice, the inverted index is the most popular indexing method for keyword search on unstructured data. Given the dynamic nature of the data, the index must also be dynamic for retrieval to remain efficient. Existing research mainly concentrates on reducing index build time and keyword query processing time, and most of it focuses on static data. In contrast, this work focuses on improving accuracy on dynamically clustered data while reducing retrieval time.
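
To make the idea of a dynamic inverted index concrete, the following minimal sketch updates postings incrementally when a document changes, touching only the terms whose frequency differs between the old and new versions instead of rebuilding the index. It is a sketch under stated assumptions and not necessarily the diff method used in the article: the class name `IncrementalInvertedIndex`, its methods, and the choice to store per-document term counts are all hypothetical.

```python
from collections import Counter, defaultdict


class IncrementalInvertedIndex:
    """Inverted index that applies only the difference between document versions."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_terms = {}                # doc_id -> Counter of its current terms

    def upsert(self, doc_id, tokens):
        """Insert or update a document; return the number of postings touched."""
        old = self.doc_terms.get(doc_id, Counter())
        new = Counter(tokens)
        touched = 0
        # Terms that disappeared from the document: drop their postings.
        for term in old.keys() - new.keys():
            del self.postings[term][doc_id]
            if not self.postings[term]:
                del self.postings[term]
            touched += 1
        # Terms that are new or whose frequency changed: rewrite the posting.
        for term, freq in new.items():
            if old.get(term) != freq:
                self.postings[term][doc_id] = freq
                touched += 1
        self.doc_terms[doc_id] = new
        return touched

    def search(self, term):
        """Return the sorted ids of documents containing the term."""
        return sorted(self.postings.get(term, {}))
```

For instance, after `idx.upsert("d1", ["data", "index", "index"])`, a later call `idx.upsert("d1", ["data", "index", "query"])` rewrites only the postings for `index` and `query`, leaving the unchanged posting for `data` untouched.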
