A Benchmark for Performance Evaluation of a Multi-Model Database vs. Polyglot Persistence

As the need for handling data from various sources becomes crucial for making optimal decisions, managing multi-model data has become a key area of research. Currently, it is challenging to strike a balance between two methods: polyglot persistence and multi-model databases. Moreover, existing studies suggest that current benchmarks are not completely suitable for comparing these two methods, whether in terms of test datasets, workloads, or metrics. To address this issue, the authors introduce MDBench, an end-to-end benchmark tool. Based on the multi-model dataset and proposed workloads, the experiments reveal that ArangoDB is superior at insertion operations of graph data, while the polyglot persistence instance is better at handling the deletion operations of document data. When it comes to multi-thread and associated queries to multiple tables, the polyglot persistence outperforms ArangoDB in both execution time and resource usage. However, ArangoDB has the edge over MongoDB and Neo4j regarding reliability and availability.


INTRODUCTION
There is an increasing demand for analyzing and processing multi-model data, including structured, semi-structured, and unstructured data. In particular, structured data commonly refer to relational, key-value, and graph data; semi-structured data mainly include JSON and XML documents; and unstructured data are typically text files. For multi-model data management, developers inevitably face a difficult trade-off between multi-model databases and polyglot persistence.
The remainder of this paper is organized as follows. First, the research status of database benchmarking is summarized. In the next section, we introduce the data stores involved in the evaluation and the reasons for selecting them. Then, MDBench is introduced in detail from three aspects: multi-model data generation, workloads, and metrics mechanism. Next, the experimental results are presented and analyzed. Finally, the paper is summarized.

OVERVIEW OF DBMS BENCHMARKS
A database benchmark performs repeatable, comparable, quantitative tests on performance indicators. Existing database benchmarks in the industry can be divided into two categories: RDBMS benchmarks and NoSQL benchmarks. Multi-model database benchmarks belong to the NoSQL category; because of the particularity of their data model, we introduce them separately.

RDBMS Benchmarks
Research on this kind of benchmark started early and covered a wide range, and it has been a driving force behind the rapid development of relational database systems. For example, the Wisconsin benchmark (Bitton et al., 1983) consisted of 32 SQL statements and took the total time to execute the full workloads as the only metric. DebitCredit was designed by the Tandem team; it simulated a real-world transaction scenario and measured the throughput and cost-effectiveness of various transaction processing systems. The TPC benchmarks, jointly established by Microsoft, Intel, HP, and others, became the standard for evaluating RDBMSs. They test a DBMS's ACID characteristics, query speed, and online transaction processing capabilities. Among the thirteen TPC benchmarks, TPC-C and TPC-E were designed for OLTP databases, while TPC-H and TPC-DS were designed for decision support systems. Currently, these benchmarks remain a key reference for evaluating DBMS-based data management solutions.

NoSQL Benchmarks
Data management technology has developed by continuously integrating semi-structured and unstructured data into the DBMS to reduce cost and improve efficiency. For each kind of NoSQL store, there are different benchmarks for evaluating and comparing related big data systems, such as XBench (Yao et al., 2004), YCSB (Matallah et al., 2017), YCSB++ (Patil et al., 2011), BG (Alabdulkarim et al., 2018), BigDataBench (Zhan et al., 2016), and CloudSuite (Ferdman et al., 2012). XBench is a stand-alone XML benchmark that covers an overall database design defined by application categories. It tests the scalability of the database and the full XQuery functionality captured in the XML query use cases. YCSB is an open-source tool from Yahoo for evaluating the performance of data-serving systems. It compares different data stores "apples to apples" regarding performance, elasticity, and availability. The YCSB framework consists of a workload generation client and standard workloads covering all aspects of the performance space. YCSB++ extends YCSB to evaluate the advanced features of NoSQL stores. BG is a benchmark for evaluating interactive social networking behavior, which it simulates through read and update operations on the database. CloudSuite is designed for scale-out cloud applications and provides popular scale-out workloads to evaluate different NoSQL stores deployed in cloud architectures. CloudSuite also provides a list of real datasets and supports the extension of these datasets.

Multi-Model Database Benchmarks
As a part of NoSQL benchmarks, the multi-model database benchmark is listed separately due to the particularity of its data model. Messaoudi et al. (2017, 2018), working with biomedical big data, selected the multi-model database OrientDB and a polyglot persistence instance composed of MongoDB and Neo4j, and evaluated their performance with multiple workloads, such as insertion, deletion, and search operations. The results showed that MongoDB performed better than OrientDB in processing document data, and OrientDB performed better than Neo4j in querying graph data when the depth of the graph reached three layers. Although many workloads were involved in the evaluation, evaluating execution times alone did not give a complete picture of the capabilities of the different data stores. Shah et al. (2014) evaluated eight databases, including OrientDB, Neo4j, and TitanDB, in terms of processing time and disk space usage. They found that OrientDB, Neo4j, and TitanDB performed well in persistence, while Neo4j and MongoDB performed well in query performance. Despite covering fewer workload types, Shah et al. evaluated a wider range of databases than other researchers. Bagga and Sharma (2020) compared six databases, including MongoDB, CouchDB, and HBase, in terms of backup, consistency, partition, and performance. Fernandes and Bernardino (2018) evaluated graph databases and multi-model databases with graph functionality from seven aspects: storage mode, query language, partitioning, backup, multi-model, multi-architecture, extendibility, and cloud deployment. The experimental results showed that Neo4j and ArangoDB had the best performance. Both of these teams focused more on functional attributes.
Macak et al. (2020) compared MongoDB and Neo4j with the multi-model database OrientDB using eight query workloads. They found that Neo4j was more efficient than OrientDB in processing graph data with a depth of less than four, that OrientDB performed better when the depth was greater than four, and that MongoDB was much more efficient than OrientDB in querying document data. Although only the time consumed by the query workloads was measured, the measurements from Macak et al. covered many graph and document query workloads, which makes this research a valuable reference for query-heavy application scenarios. Jayathilake et al. (2012) used a column, document, tuple, graph, and multi-model database to process tree data. Membase showed the lowest latency and the highest throughput during tree creation. On the other hand, the graph database Neo4j and the multi-model database achieved excellent results in data retrieval. Tree data are an important data type, so these results fill a gap in the NoSQL database evaluation of tree data. The experiment run by Oliveira and del val Cura (2016) compared ArangoDB and OrientDB with the combination of MongoDB and Neo4j. The experiment was divided into two parts: insert and query. ArangoDB inserted document data efficiently, whereas MongoDB did so when documents had many fields. ArangoDB was the most efficient when inserting graph data. In the query part, when the depth was less than two, ArangoDB performed better; when the depth was between two and four, OrientDB performed better; and when the depth was greater than four, the combination of MongoDB and Neo4j performed better. Although only insert and query workloads were evaluated, Oliveira and del val Cura's findings on graph depth traversal laid the foundation for the results reported by Macak et al. (2020).
In Table 1, we list the characteristics of six representative benchmarks from three aspects: dataset, workload, and metric. Table 1 shows that, in the performance evaluation of different databases, most benchmarks focus on the execution time of workloads while ignoring the occupation of hardware resources, and that the workload types are relatively limited. Most notably, all of these benchmarks ignore the evaluation of multi-thread workloads. Therefore, we propose an end-to-end benchmark named MDBench for multi-model databases and polyglot persistence, aiming to provide a comprehensive solution for storing and managing multi-model data.

OVERVIEW OF THE EXPERIMENTAL DATABASE
Selecting the right objects for benchmarking is the starting point. Performance, price, and energy consumption are the most common metrics for computer program evaluation (Han et al., 2017). Therefore, based on the above standards, we select outstanding databases in various data model fields for benchmarking. Here, MongoDB and Neo4j are selected as the instances of polyglot persistence, and ArangoDB is the representative single multi-model database for this study. Next, we introduce the characteristics of these data stores and the basis for selecting them, drawing on an analysis and comparison of the literature.

MongoDB
MongoDB is a document-oriented and scalable high-performance database (Banker et al., 2016; Plugge et al., 2015), whose efficient indexing mechanism brings high-speed queries that make it stand out among NoSQL databases (Zong et al., 2017). Truică et al. (2018) proposed the T2K2 and T2K2D2 benchmarks and used them to test the performance of MongoDB, Oracle, and PostgreSQL. Experimental results showed that MongoDB performed better than Oracle and PostgreSQL in calculating top-K keywords and documents.

Neo4j
Neo4j is a high-performance graph database engine whose Cypher language enables convenient graph data processing (Holzschuher & Peinl, 2013). Following the graph data model, it maintains three data structures: nodes, relationships, and attributes. In addition, it is reliable, transactional, highly available, and secure (Miller, 2013). Although Neo4j is a relatively new open-source project, it has been used on graphs with over 100 million nodes and meets enterprise robustness and performance requirements. Beis et al. (2015) conducted a comprehensive comparative evaluation of three popular graph databases: Titan, OrientDB, and Neo4j. Experimental results showed that Neo4j was the most efficient graph database for most workloads. Only by knowing the capabilities and limitations of each system can researchers know where to focus their efforts. Therefore, Lissandrini et al. (2018) conducted a comprehensive performance evaluation and analysis of seven graph databases: ArangoDB, BlazeGraph, Neo4j, OrientDB, Sparksee, SQLG, and Titan. The results showed that Neo4j and three other databases performed better in graph traversal. Furthermore, completing the entire set of queries in single and batch mode was the most efficient. Dominguez et al. (2010) evaluated four of the most scalable native graph databases, Neo4j, HypergraphDB, Jena, and DEX, against the HPC extensible graph analysis benchmark and tested the performance of each database for different typical graph operations and graph sizes. The results showed that Neo4j and DEX were the most efficient graph databases.

ArangoDB
In ArangoDB, documents are stored in collections. Collections use _id to uniquely identify each document. The _id can be assigned by the user at creation time or automatically generated by ArangoDB. Indexes are created for both the _id and _key attributes; the index on the _key attribute is called the primary index, which exists in each collection and cannot be deleted. There are two types of collections in ArangoDB: vertex collections and edge collections. Documents in an edge collection have two additional attributes, _from and _to. Both must be bound to the _id attribute of the corresponding vertex document. ArangoDB uses the ArangoDB Query Language (AQL) to manipulate graphs or collections. AQL syntax is different from SQL syntax, although many of the same keywords exist. Compared with SQL, AQL is described as more powerful, and it supports both read and write operations.
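To make the layout of vertex and edge collections concrete, the following minimal sketch models documents as plain Python dictionaries. The collection names `customers` and `knows`, the keys, and the helpers `make_id` and `edge_is_bound` are hypothetical and chosen only for illustration; no ArangoDB driver is involved.

```python
def make_id(collection: str, key: str) -> str:
    """Build a document handle of the form '<collection>/<_key>'."""
    return f"{collection}/{key}"


# Two vertex documents in a hypothetical vertex collection "customers",
# indexed by their _id for quick lookup.
vertices = {
    make_id("customers", key): {
        "_key": key,
        "_id": make_id("customers", key),
        "name": name,
    }
    for key, name in [("1", "Alice"), ("2", "Bob")]
}

# One edge document in a hypothetical edge collection "knows".
# _from and _to hold the _id values of the vertices it connects.
edge = {
    "_key": "100",
    "_id": make_id("knows", "100"),
    "_from": make_id("customers", "1"),
    "_to": make_id("customers", "2"),
}


def edge_is_bound(edge_doc: dict, vertex_index: dict) -> bool:
    """Return True if both endpoints of the edge refer to existing vertex _id values."""
    return edge_doc["_from"] in vertex_index and edge_doc["_to"] in vertex_index
```

Here `edge_is_bound(edge, vertices)` holds because both endpoints name existing vertex documents; an edge pointing at a missing `_id` would fail the check.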
Currently, OrientDB and ArangoDB are representative and influential multi-model databases. Zhang et al. (2018) proposed the UniBench benchmark for multi-model database evaluation and used it to evaluate OrientDB and ArangoDB; the experimental results showed that ArangoDB performed better than OrientDB in most cases.
Studies on multi-model databases are few, and there is no other relevant research apart from the evaluation by Zhang et al. (2018) above. Therefore, we select either OrientDB or ArangoDB as the multi-model database in the experiment. Based on workloads {C1, C2, R1, R2, U1, U2, D1, D2}, we compare the running times of OrientDB and ArangoDB. The experimental results in Figure 1 show that the overall performance of ArangoDB is indeed better than that of OrientDB under basic document and graph operations. Therefore, we choose ArangoDB as the multi-model database for the experiment.
In summary, by analyzing and comparing existing studies, the single multi-model database ArangoDB and a specific polyglot persistence instance composed of MongoDB and Neo4j are chosen as research objects for benchmarking and comparison.

THE BENCHMARK PROPOSED
From a macro perspective, the three elements of a benchmark are data, workload, and metrics mechanism (Xia et al., 2015). This section presents our end-to-end benchmark from these three perspectives.

Multi-Model Data Generation
One of the challenges facing the performance evaluation of multi-model databases and polyglot persistence is the lack of a large-scale multi-model dataset. Previous data generators have focused on single-model data. Combining multiple single-model data generators can increase system instability because each data generator is tailored to a specific application scenario. Jiaheng Lu and his team proposed a multi-model data generator in UniBench that can generate JSON, XML, relational, document, and graph data. However, the hardware requirements of that data generator vary with the workload factors. The data generator of MDBench is an optimized version of the UniBench data generator. Compared with the pre-optimized UniBench generator, it occupies very little memory, freeing up three-quarters of the memory space and thus saving hardware resources to a large extent. Algorithm 1 shows the implementation process of the data generator. This data generator generates the dataset used in the experiments presented in this paper. Here, we select two types of data: document data and graph data. The document data comprise commodity and order information, while the graph dataset comprises customers and their social networks. In addition, the productId in the order points to the item's primary key, and the personId points to the customer in the graph dataset. The orderId in the suborder points to the order primary key. Figure 2 shows the specific information on goods, orders, and customers and the relationships between them.
Workloads
There are many types of workloads; some database vendors focus on query performance, while others focus on transaction consistency. Different databases behave differently even when they handle the same workloads, so the workloads should be designed with broad coverage. To explore and compare the processing capability of the multi-model database and polyglot persistence on different workload types, to make the evaluation scenario similar to a real big data application scenario, and to reflect real-world use cases, a series of workloads is designed, as shown in Table 2. Each workload contains a label, brief description, data model, and quantity. Categorized from the perspective of create, delete, update, and query, C = {C1, C2, C3, C4, C5, C6} are the insert workloads, R = {R1, R2, R3, R4, R5, R6, R7} are the query workloads, U = {U1, U2, U3} are the update workloads, and D = {D1, D2, D3, D4, D5, D6} are the delete workloads. From the perspective of data type, D = {C1, C3, C4, C5, R1, R3, U1, U3, D1, D3, D4, D5} are the document workloads, G = {C2, C6, R2, U2, D2, D6} are the graph workloads, and M = {R4, R5, R6, R7} are the multi-model workloads.
Previous metrics of database benchmarks have mainly focused on the execution time of workloads. However, at a time when data volumes are exploding, it is not reasonable to focus solely on execution time. Therefore, compared with previous database benchmarks, the measurement mechanism of MDBench proposed in this paper measures the experimental results in four dimensions. The first dimension is the execution time of workloads, the second is the resource occupation, the third is reliability, and the fourth is availability.
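The two categorizations of the workload labels given above can be cross-checked with a short sketch. The variable names and the helper `is_partition` are illustrative assumptions; only the labels themselves come from the text. The sketch confirms that the three data-type groups cover each of the 22 workloads exactly once.

```python
# Workloads grouped by operation type (labels taken from the text).
create = {f"C{i}" for i in range(1, 7)}
read = {f"R{i}" for i in range(1, 8)}
update = {f"U{i}" for i in range(1, 4)}
delete = {f"D{i}" for i in range(1, 7)}
all_workloads = create | read | update | delete

# Workloads grouped by data type (labels taken from the text).
document = {"C1", "C3", "C4", "C5", "R1", "R3", "U1", "U3", "D1", "D3", "D4", "D5"}
graph = {"C2", "C6", "R2", "U2", "D2", "D6"}
multi_model = {"R4", "R5", "R6", "R7"}


def is_partition(parts, universe):
    """Return True if the parts are pairwise disjoint and together equal the universe."""
    seen = set()
    for part in parts:
        if seen & part:
            return False  # a label appears in two groups
        seen |= part
    return seen == universe
```

Calling `is_partition([document, graph, multi_model], all_workloads)` returns `True`, so every workload belongs to exactly one data-type group.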

Execution Time
The execution time T is measured using the Timer class built into the Java Development Kit (JDK). In these experiments, the statistics of execution time follow Formula (1), where w_is is the start time of the i-th workload and w_ie is the end time of the i-th workload.
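The start and end timestamps described above can be illustrated with a minimal sketch. It is written in Python rather than with the JDK timer used by the evaluation platform, and the function names are hypothetical. Summing the per-workload differences into a single total is an assumption made here for illustration, not a restatement of Formula (1).

```python
import time


def timed(workload):
    """Run a callable and return its (start, end) timestamps in seconds."""
    start = time.perf_counter()
    workload()
    end = time.perf_counter()
    return start, end


def total_execution_time(intervals):
    """Sum (end - start) over a sequence of (start, end) pairs.

    Aggregating by summation is an assumption of this sketch.
    """
    return sum(end - start for start, end in intervals)
```

For example, `total_execution_time([timed(w) for w in workloads])` would give the combined duration of a list of workload callables `workloads`.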

Resource Occupation
Resource occupation is monitored by the distributed monitoring system Prometheus, which collects server performance data. The time series data collected by Prometheus are presented as graphs by Grafana, a visualization tool. CPU and memory statistics follow Formula (2), where r_i is the CPU or memory consumed by the i-th workload.

Reliability
The reliability metric is the ratio of successfully responded requests to the total number of requests, as shown in Formula (3).
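Written out symbolically, the verbal definition above takes the following form. The symbol names are assumptions introduced here: N_success is the number of successfully responded requests and N_total is the total number of requests.

```latex
R = \frac{N_{\text{success}}}{N_{\text{total}}}
```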

Availability
The measure of availability is the ratio of effective working time to total working time, as shown in Formula (4).
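In the same spirit, the verbal definition of availability can be written as a ratio. The symbol names are again assumptions: T_effective is the effective working time and T_total is the total working time.

```latex
A = \frac{T_{\text{effective}}}{T_{\text{total}}}
```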

DESIGN OF EXPERIMENTS
From the perspective of the practical application of big data, this paper divides the experiment into four groups: the single-table workload experiment, the multi-thread workload experiment, the multi-table joint query experiment, and the reliability and availability experiment. We tune each database for performance prior to experimentation to ensure that it can show its full advantages.

Experimental Configuration
The experiment runs on three servers: one master node and two slave nodes. The master node is configured with an eight-core CPU and 16 GB of memory, and each of the two slave nodes is configured with a four-core CPU and 8 GB of memory. The three databases used in the experiment, MongoDB, Neo4j, and ArangoDB, are installed and deployed on the three servers in a distributed architecture. The evaluation platform is implemented in Java and runs on a single slave node, so that server has a preinstalled dependency environment. The measurement of experimental results is divided into two parts: execution time and resource occupancy. The execution time of the workload is measured using the Java Development Kit (JDK). The resource occupancy is measured by the distributed resource monitoring platforms Prometheus and Grafana. We show the software and hardware parameters in Table 3.

COMPARISON OF THE MULTI-MODEL DATABASE AGAINST POLYGLOT PERSISTENCE
This paper's experiments on the performance evaluation of a multi-model database and polyglot persistence comprise four parts: a single-table workload experiment, a multi-thread workload experiment, a multi-table joint query experiment, and a reliability and availability experiment.
The first part of the experiment is a single-table workload experiment. It is motivated by the fact that practical applications must migrate the data in the database, and persistence is inevitable in iteratively operated systems. For example, in data migration, the consumption of time and resources imposed on the downstream consumer must be predictable. Otherwise, it will directly affect the normal operation of the downstream system. Therefore, it is necessary to measure this kind of single-table workload.
The second part of the experiment is the multi-thread workload experiment. With the popularization of the Internet and the growing intelligence of mobile devices, servers are facing increasing concurrency pressure. The ability to process sudden bursts of highly concurrent requests is something distributed databases should have, and it is also a necessary condition for a database to be put into production.
The third part is the multi-table joint query experiment. In real application scenarios, the data in the tables grow with time and business development, which increases the cost of database operations. Therefore, at the beginning of system design, developers split the data by factors such as function modules and data relationships and store them in separate tables. When these data are needed, they are retrieved via an associated query to multiple tables. Except for data migration and persistence involving only one table, most operations involve associated queries to multiple tables. Therefore, how well multi-model databases and polyglot persistence handle associated queries on multiple tables is also a concern of users. The above three experiments are measured by execution time and resource occupation.
The fourth part is the reliability and availability experiment. Bauer and Adams (2012) proposed the calculation formulas for reliability and availability. In this experiment, we send 1000 requests to the database under evaluation. During the request process, we adopt fault injection to simulate the restart of the server after a power failure. The quantified reliability result is obtained by calculating the ratio of the number of successfully responded requests to the total number of requests. The availability is quantified from the time at which the last request was answered before the restart and the time at which the first request was answered after the restart. We perform the reliability and availability tests for each database five times, remove one maximum and one minimum, and average the remaining three values. The reliability and availability of polyglot persistence follow the Cannikin law.
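The procedure above can be sketched as follows. All function names are hypothetical. Two points are assumptions of this sketch rather than statements from the text: the outage is taken as the gap between the last response before the restart and the first response after it, and the Cannikin law is read as taking the weakest component value of the polyglot persistence instance.

```python
def reliability(successful: int, total: int) -> float:
    """Ratio of successfully answered requests to all requests."""
    return successful / total


def availability(total_ms: float, last_before_ms: float, first_after_ms: float) -> float:
    """Share of the observation window in which the database answered requests.

    Assumption: downtime is the gap between the last response before the
    restart and the first response after it.
    """
    downtime = first_after_ms - last_before_ms
    return (total_ms - downtime) / total_ms


def trimmed_mean(values):
    """Drop one maximum and one minimum, then average the remaining values."""
    if len(values) < 3:
        raise ValueError("need at least three measurements")
    kept = sorted(values)[1:-1]
    return sum(kept) / len(kept)


def polyglot_metric(component_values):
    """Cannikin (shortest plank) reading: the combined system is only as good
    as its weakest component. This interpretation is an assumption."""
    return min(component_values)
```

With five runs per database, `trimmed_mean` yields the reported value for a single store, and `polyglot_metric` would combine the MongoDB and Neo4j values into one figure for the polyglot persistence instance.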

Single Table Workload Experiment
Single-table workloads are the simplest of all workload types and form the basis of the others. The measurement of single-table workloads includes a total of eight workloads. C5, R3, U3, and D5 are the CRUD workloads of document data, while C6, R2, U2, and D6 are the CRUD workloads of graph data. The experiments are measured by execution time and resource occupation. Figure 5 shows the comparison of processing time for single-table workloads, and Figure 6 shows the comparison of resource occupation for single-table workloads.
Figure 5 shows that ArangoDB takes almost twice as long as polyglot persistence to delete document data. When inserting graph data, polyglot persistence takes nearly three times as long as ArangoDB. For the other workloads, there is little difference between polyglot persistence and ArangoDB.
From the comparison of resource occupation of single-table workloads in Figure 6, we can see that in most cases, the CPU consumption of polyglot persistence is higher than that of ArangoDB, and the memory consumption of ArangoDB is generally higher than that of polyglot persistence.

Multi-Thread Workload Experiment
The multi-thread experiment simulates a high-concurrency application scenario by controlling the number of threads created by the evaluation platform. R4 is selected to carry out the multi-thread workload experiment. R4 is a mixed workload involving document and graph operations. The evaluation platform measures the experimental results in two dimensions: execution time and resource occupation.
Figure 7 shows the comparison of ArangoDB and polyglot persistence under multi-thread workloads. As we observe in Figure 7, the execution time changes significantly as the number of threads increases from 1 to 5. This is because both ArangoDB and polyglot persistence can handle high-concurrency scenarios. However, when the number of threads increases from 5 to 80, the processing times of ArangoDB and polyglot persistence do not change, which is normal because all systems that support high concurrency have a performance ceiling on the number of concurrent processes they can support.
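How a thread count can be controlled while timing a workload is illustrated by the minimal sketch below. It is written in Python rather than in the Java of the evaluation platform, and `run_concurrently` and `dummy_workload` are hypothetical names; the dummy workload merely sleeps and stands in for a real query such as R4.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(workload, n_threads: int, n_requests: int):
    """Issue n_requests calls to workload using n_threads worker threads.

    Returns the wall-clock time in seconds and the list of results.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda _: workload(), range(n_requests)))
    elapsed = time.perf_counter() - start
    return elapsed, results


def dummy_workload() -> bool:
    """Placeholder for a database query; sleeps briefly and reports success."""
    time.sleep(0.001)
    return True
```

Repeating `run_concurrently(dummy_workload, n, 100)` for increasing `n` gives one timing per thread count, which is the kind of series that a figure such as Figure 7 plots.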
Figure 8 shows the resource occupation comparison of ArangoDB and polyglot persistence under multi-thread workloads. As the figure shows, the memory occupation of ArangoDB and polyglot persistence is remarkably stable, consistently at 40%. We find that when the number of threads is small, the CPU usage of polyglot persistence is significantly higher than that of ArangoDB. As the number of threads increases, the CPU usage of both approaches 100%.

Multi-Table Joint Query Experiment
This paper selects R4, R5, R6, and R7 to conduct the multi-table joint query experiment. This part of the experiment uses a vector-based method to represent the results of the associated query to multiple tables. Specifically, based on known parameters, an intermediate result is first queried, and then the eventual result is obtained progressively based on the intermediate result. For example, R4 is given the customer's name and queries the total price of orders paid by the customer. According to the customer's name, a null vector (|C|, |O|, |CO|) (|O|, |S|, |OS|) can be calculated. This method can reflect how many associated queries the workload contains and the size of the intermediate results at each step.
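The idea of recording table sizes and the intermediate result size for an associated query can be illustrated on toy data. The records, the field names `name` and `totalPrice`, and the function `r4_total_price` are hypothetical; only `personId` and the shape (|C|, |O|, |CO|) are taken from the text. The sketch is an in-memory illustration, not the query issued against either system.

```python
# Toy customer and order tables (hypothetical records).
customers = [
    {"personId": 1, "name": "Alice"},
    {"personId": 2, "name": "Bob"},
]
orders = [
    {"orderId": 10, "personId": 1, "totalPrice": 30.0},
    {"orderId": 11, "personId": 1, "totalPrice": 12.5},
    {"orderId": 12, "personId": 2, "totalPrice": 8.0},
]


def r4_total_price(customer_table, order_table, name):
    """Given a customer name, return the size vector (|C|, |O|, |CO|)
    and the total price of that customer's orders."""
    # Step 1: intermediate result, the ids of customers with the given name.
    selected_ids = {c["personId"] for c in customer_table if c["name"] == name}
    # Step 2: associate the intermediate result with the order table.
    joined = [o for o in order_table if o["personId"] in selected_ids]
    vector = (len(customer_table), len(order_table), len(joined))
    total = sum(o["totalPrice"] for o in joined)
    return vector, total
```

For the toy data, `r4_total_price(customers, orders, "Alice")` returns `((2, 3, 2), 42.5)`: two customers, three orders, and two orders in the associated result.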
Figure 9 compares the time taken by ArangoDB and polyglot persistence under the associated query workloads. The figure shows that as the number of associated tables increases, the time taken by both ArangoDB and polyglot persistence increases gradually. At the same time, the time taken by polyglot persistence is always less than that of ArangoDB.
Figure 10 displays the resource occupation comparison of ArangoDB and polyglot persistence under the associated query to multiple tables. The figure shows that the number of associated tables has little influence on CPU and memory consumption, except that the CPU and memory consumption of polyglot persistence increases slightly when the number of associated tables increases from 4 (R6) to 6 (R7).

Reliability and Availability Experiment
Table 4 shows the number of failed requests among the 1000 requests in the reliability experiment. Table 5 shows the time in milliseconds that each database takes to process the fault in the availability experiment. Reliability and availability calculations for polyglot persistence follow the Cannikin law. To avoid experimental contingency, one maximum and one minimum are removed and the remaining three values are averaged. In the end, the reliability of ArangoDB was 97.40%, and the availability was 97.19%. Polyglot persistence has
Figure 1. ArangoDB vs. OrientDB

Figure 4 is an architecture diagram of MDBench. It comprises four parts: a database cluster, a resource monitoring unit composed of Prometheus and Grafana, a data pipeline unit composed of Zookeeper and Kafka, and a workload injection unit written in Java.
Figure 2. Relationship between datasets
Figure 4. The architecture of MDBench

Figure 5. Time consumption of ArangoDB (MD) and polyglot persistence (PP) when processing a single table workload
Figure 6. CPU and memory usage of ArangoDB (MD) and polyglot persistence (PP) when processing a single table workload

Figure 9. Time consumption of ArangoDB (MD) and polyglot persistence (PP) when processing the associated query to multiple tables
Truică et al. (2021) also proposed a universal document-oriented distributed benchmark, TEXTBENDS, which was used to evaluate the computational efficiency of word weighting under two different weighting schemes: TF-IDF and Okapi BM25. Comparing MongoDB, Hive, and Spark, the experimental results showed that MongoDB had the best overall performance. Mishra et al. evaluated the performance of four document databases and databases with a document model. When comparing database throughput and runtime in a single-threaded state, MongoDB outperformed the other databases with the highest throughput and lowest runtime. In a comprehensive analysis of MongoDB and ArangoDB for some threads under different workloads, MongoDB outperformed ArangoDB by a high percentage.

Table 3. Software and hardware parameters
In the vector for R4, |C| represents the size of the customer table, |O| represents the size of the order table, and |CO| represents the size of the associated query result. In the same way, R5 can be expressed as (|O|, |S|, |P|, |OSP|), where |S| represents the size of the suborder table and |P| represents the size of the goods table. R6 can be expressed as (|P|, |S|, |O|, |C|, |PSOC|). R7 can be expressed as (|X|, |Y|, |XY|), where X = (|P|, |S|, |O|, |C|, |PSOC|), Y =