Introduction
There is increasing demand for analyzing and processing multi-model data, which comprise structured, semi-structured, and unstructured data. Structured data commonly refer to relational, key-value, and graph data; semi-structured data mainly include JSON and XML documents; and unstructured data are typically text files. For multi-model data management, developers inevitably face a difficult trade-off between a single multi-model database and polyglot persistence. However, existing studies suggest that current benchmarks are not fully suitable for evaluating and comparing multi-model databases and polyglot persistence, whether in terms of test datasets, workloads, or metrics. First, obtaining large-scale real multi-model data is difficult and costly, and few data generators can produce multi-model test datasets. Second, the workloads of existing benchmarks are not comprehensive and cannot cover the diverse application scenarios of multi-model data. Finally, most existing benchmarks focus on the execution time of workloads while ignoring metrics for infrastructure resource usage and nonfunctional attributes. In a distributed environment, database system failure is considered a normal event rather than an accident (Ghemawat et al., 2003), so collecting and measuring database resource usage and nonfunctional attributes is very important. However, to the best of our knowledge, no existing benchmark for multi-model databases and polyglot persistence takes resource usage and nonfunctional attributes as metrics. To address these problems, we propose an end-to-end benchmark named MDBench for evaluating and comparing multi-model databases and polyglot persistence. The main contributions of this paper are summarized as follows:
- 1.
A scalable multi-model data generator is designed to produce multi-model test datasets. The generator's key algorithm is memory-efficient, ensuring that generation does not exhaust memory no matter how large the produced dataset is.
- 2.
Four groups of representative workload experiments are designed and implemented to simulate different multi-model data application scenarios. In particular, a multi-threaded workload experiment and reliability and availability experiments are conducted for evaluating and comparing multi-model databases and polyglot persistence.
- 3.
Based on our data store selection, we use MDBench to conduct a comprehensive performance evaluation of the single multi-model database ArangoDB and a polyglot-persistence instance consisting of MongoDB and Neo4j, and we systematically analyze the experimental results.
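To make the memory claim in contribution 1 concrete, the following is a minimal sketch (ours, not MDBench's actual algorithm) of one common way to keep data generation memory-bounded: records are produced lazily by a generator and flushed to disk in fixed-size batches, so peak memory depends on the batch size rather than on the total dataset size. The record schema here is purely illustrative.

```python
import json
import random
from typing import Iterator


def gen_records(n: int) -> Iterator[dict]:
    """Lazily yield synthetic multi-model records: a JSON document per
    customer plus graph edges linking customers (illustrative schema)."""
    for i in range(n):
        yield {
            "doc": {"id": i, "name": f"customer-{i}", "orders": random.randint(0, 9)},
            "edges": [{"from": i, "to": random.randrange(n)} for _ in range(2)],
        }


def write_dataset(path: str, n: int, batch_size: int = 10_000) -> int:
    """Stream records to a JSON-lines file in batches.

    Peak memory is O(batch_size) regardless of n, so arbitrarily large
    datasets can be generated without running out of memory.
    """
    written = 0
    batch = []
    with open(path, "w") as f:
        for record in gen_records(n):
            batch.append(json.dumps(record))
            if len(batch) >= batch_size:
                f.write("\n".join(batch) + "\n")
                written += len(batch)
                batch.clear()
        if batch:  # flush the final partial batch
            f.write("\n".join(batch) + "\n")
            written += len(batch)
    return written
```

The batch size trades memory for I/O efficiency: larger batches amortize write calls, while peak memory stays fixed.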
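The multi-threaded workload experiment of contribution 2 can be sketched in a few lines: a pool of worker threads issues operations concurrently against the system under test while the harness records throughput and per-operation latencies. This sketch is our illustration, not MDBench's implementation; the `op` callable is a hypothetical stand-in for a single database request.

```python
import concurrent.futures
import time


def run_workload(op, n_ops: int, n_threads: int):
    """Issue n_ops operations across n_threads worker threads.

    `op` stands in for one database request (a real harness would call
    the driver of the system under test). Returns overall throughput
    (operations per second) and the list of per-operation latencies.
    """
    def timed_op(i):
        start = time.monotonic()
        op(i)
        return time.monotonic() - start

    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as pool:
        latencies = list(pool.map(timed_op, range(n_ops)))
    elapsed = time.monotonic() - start
    throughput = n_ops / elapsed if elapsed > 0 else float("inf")
    return throughput, latencies
```

Varying `n_threads` while holding the operation mix fixed reveals how each data store scales under concurrent load, which is the purpose of a multi-threaded workload experiment.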
The rest of this paper is organized as follows. First, the research status of database benchmarking is summarized. Next, we introduce the data stores involved in the evaluation and the reasons for selecting them. Then, MDBench is described in detail from three aspects: multi-model data generation, workloads, and the metrics mechanism. After that, the experimental results are presented and analyzed. Finally, the paper is summarized and future work is discussed.
Overview of DBMS Benchmarks
A database benchmark enables repeatable, comparable, and quantitative tests of performance indicators. Existing database benchmarks in industry can be divided into two categories: RDBMS benchmarks and NoSQL benchmarks. Multi-model database benchmarks belong to the NoSQL category; because of the particular nature of their data models, we introduce them separately.