Privacy-Preserving Contact Tracing for Curbing the Spread of Infectious Disease

The world is facing one of the greatest public health threats in modern history. Various techniques based on contact tracing have been developed to support non-pharmaceutical interventions. Growing evidence shows that app-based contact tracing can reduce the spread of COVID-19 if a sufficient proportion of the population uses the apps. However, the risk of privacy breaches that comes with such apps has long been a public concern and may hinder their uptake. In this paper, the authors seek a solution that completes the spatiotemporal intersection computation without exposing the infected patient's location and the user's location to one another. The authors implement the solution in a WeChat applet to aid the local health center. This study conducts experiments for six scenarios to assess the applicability of the applet. Experimental results indicate that the applet is a promising non-pharmaceutical tool for curbing the spread of COVID-19.


INTRODUCTION
The COVID-19 pandemic has spread across the world since the sudden outbreak of the virus in 2019. As of August 18, 2022, there have been 593,424,218 confirmed cases and 6,463,805 confirmed deaths across 227 countries and territories (virusncov.com). The outbreak of COVID-19 has interrupted immunization services and overwhelmed health systems in many countries. In response, countries have introduced various Non-Pharmaceutical Interventions (NPIs) to slow down the transmission of the virus. The most common NPIs are physical distancing and closures, such as the temporary closing of workplaces and schools and the canceling of events that attract large crowds. NPIs are among the best ways of controlling a pandemic when vaccines are not yet available. For example, they were found to be very effective in the H1N1 influenza pandemic of 1918-1919, which was similar to the COVID-19 pandemic (Ferguson et al., 2020).
The countries or cities that implemented NPIs early in the COVID-19 pandemic successfully reduced the number of infected cases and the mortality rate. Nevertheless, this came at great socioeconomic cost. Scientists have thus suggested an alternative strategy: tracing the close contacts of those infected with COVID-19 via a smartphone app. A previous study found that the 1.83 m (6 feet) social distancing policy is a minimum requirement and is not sufficient to avoid exposure to COVID-19, owing to the complexity of environmental wind conditions (Feng et al., 2020). Once the distance to a newly infected person has been determined, the goal is to prevent further infections by asking those who were nearby to self-isolate. Ferretti et al. (2020) show that fast contact tracing combined with large-scale nucleic acid testing is able to delay or even entirely stop the epidemic.
The use of contact-tracing apps has shown promise. Thus far, various contact-tracing apps have been developed and deployed in many countries to eliminate the bottlenecks of labor-intensive methods (Nageshwaran et al., 2021). For example, the app (Eames & Keeling, 2003) uses Bluetooth signals to detect whether two individuals have come into close enough physical proximity to risk an infection. A growing body of evidence indicates that app-based contact tracing can curb the spread of COVID-19 if a large enough proportion of the population uses the app (Union, 2020). Nevertheless, concerns have been raised about such apps due to potential privacy breaches (Kaptchuk et al., 2020). Altmann et al. (2020) investigated the factors that may hinder or facilitate uptake and found that concerns about privacy, together with a lack of trust in the government, are the main barriers to increasing uptake. Private information such as a person's location data is captured at high resolution and speed by smart devices and applications, online services, and offline services that involve digital records. Once such sensitive personal data is collected and then leaked to unauthorized parties or misused, citizens are at risk of privacy breaches. Thus, the mainstream approach to protecting user privacy today is to minimize data collection and aggregation (De Viti et al., 2023).
In this paper, the authors address the problem of privacy protection in contact tracing using the cryptographic technique of Private Set Intersection (PSI). Data encryption is a common method to protect the credibility of citizen data (Saleh et al., 2016). The goal of PSI is to allow two parties to compute the intersection of their sets without revealing any element outside the intersection. Based on PSI, the authors developed a WeChat applet that calculates the intersection of time and space without disclosing a confirmed patient's trajectory to the citizen using the system, or the citizen's private data to the health prevention department. The two databases that store the trajectory data of citizens and of patients are protected by measures such as encryption whenever their contents are read for intersection calculation.
The rest of the paper is structured as follows. First, the background section briefly introduces related technologies, including traditional contact tracing and private set intersection. Then the private set intersection method and its implementation are introduced in detail. Based on PSI, the authors propose a computational application model of spatiotemporal intersection and illustrate how to process the trajectory databases of citizens and patients. The experimental results of the model are then presented, along with some discussion of performance and limitations.

Contact Tracing
Contact tracing is one of the most effective non-pharmaceutical measures to curb the spread of COVID-19. It utilizes information in the smartphone to identify and notify individuals who have been in close contact with confirmed patients. Keeping a safe distance from people who may have been exposed to the virus is a key challenge related to physical containment (Storey et al., 2022). Abeler et al. (2020) proposed a contact tracing method that periodically receives the temporary IDs of nearby people via Bluetooth. The system then automatically reads these temporary IDs and compares them with the encrypted patient database provided by the health center to find the contacts. Some contact tracing applications also use base station signals to provide the fundamental data. Altuwaiyan et al. (2018) identified the individuals at risk by collecting data from nearby base stations and Wi-Fi routers.
However, the success of this category of contact tracing critically depends on the level of uptake. Since it requires tracing individuals' interactions with others, privacy concerns may undermine the adoption of apps. Calculating the distance between two individuals from encrypted trajectory data is one way to enable contact tracing with privacy protection. Blockchain technology supports operations on encrypted and anonymous personal identities; hence, Xu et al. (2022) created two chains and a public key to handle location data. The control and prevention authority then generated a diagnostic key to determine whether the individual is a close contact (Xu et al., 2022). To precisely find who is in close contact with a patient, W. Kim et al. (2020) proposed a model using encrypted spatiotemporal trajectory data. An et al. (2021) proposed a secure proximity calculation method based on a homomorphic encryption scheme.
In the era of big data, data collection has increased dramatically, and privacy threats are unprecedented (Janghyun et al., 2022). Since the trajectory data of citizens is private, its protection should be given a very high priority (Kozak et al., 2014). Dong et al. (2022) proposed a linear privacy budget allocation strategy to enhance privacy protection. In this paper, the authors construct a new tracking service model that pays more attention to privacy protection. In addition, it is equally important to protect the databases that store personal data. Kantarcioglu et al. (2008) proposed a cryptographic model-based approach to safely answer count queries without compromising private information in DNA sequence databases. In terms of hardware security, Canim et al. (2011) proposed a protocol that uses secure encryption hardware to store, share, and query clinical genomics data, so that queries run securely over encrypted data. Subsequently, Pappas et al. (2014) built a technique called Blind Seer to support a fairly rich set of queries on private database management systems. Considering the computational cost, Choi et al. (2021) proposed a new secure search structure based on fully homomorphic encryption to complete data queries at a lower computational cost. This study is partly based on some strategies in  and W. Kim et al., (2020). The authors performed the spatiotemporal intersection computation using the PSI method. In this way, the infected patient's location and the user's location are not exposed to one another, and the privacy of both databases is protected.
The main contributions of this paper are summarized as follows. First, the authors develop a contact-tracing model that provides privacy protection via the PSI protocol. To date, little published literature applies PSI to contact tracing of infectious diseases, so the model proposed in this paper is a novel application of private set intersection in this field. Second, the scheme balances computational efficiency and privacy protection. Most contact tracing schemes do not technically provide privacy protection. Other PSI methods, such as those built on public-key homomorphic encryption, have a higher computational complexity, which prevents them from being used in real-world applications. Taking Paillier homomorphic encryption (Paillier, 1999) as an example, the data encrypted with a realistic key occupies more storage space than the original data. Longer keys make data less likely to be cracked, but they come with significant overhead. The model proposed in this paper, by contrast, does not increase the size of the encrypted data, thus keeping the computational complexity at a low level. Third, the authors pay special attention to privacy protection, which is important but often overlooked as a concern of engineering ethics.

Private Set Intersection
Private set intersection is a privacy-preserving technique that allows two or more parties to compute the intersection of their sets without revealing any other elements except the intersection. Other scholars have implemented PSI from the perspective of the Diffie-Hellman problem (DHP) (Meadows, 1986) and RSA blind signature schemes (De Cristofaro & Tsudik, 2012).
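To make the Diffie-Hellman line of work concrete, the following is a minimal Python sketch of a commutative-blinding PSI in the spirit of that approach. It is not the scheme used in this paper. The modulus `P`, the helper names `hash_to_group`, `blind`, and `dh_psi`, and the string encoding of trajectory cells are all assumptions made for illustration, and the parameters are far too weak for real use. Both parties are simulated inside one function.

```python
import hashlib
import secrets

# Toy modulus (assumption): a Mersenne prime, chosen only to keep the demo short.
P = 2**127 - 1

def hash_to_group(item: str) -> int:
    """Map an item to a nonzero residue modulo P (illustrative only)."""
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def blind(values, secret):
    """Raise every value to a secret exponent modulo P."""
    return [pow(v, secret, P) for v in values]

def dh_psi(alice_items, bob_items):
    """Simulate both parties; returns the intersection as seen by Bob."""
    a = secrets.randbelow(P - 3) + 2  # Alice's secret exponent
    b = secrets.randbelow(P - 3) + 2  # Bob's secret exponent
    # Alice sends H(y)^a; Bob raises each value to b, giving H(y)^(ab).
    alice_double = set(blind(blind([hash_to_group(y) for y in alice_items], a), b))
    # Bob sends H(x)^b; Alice raises each value to a and returns them in order.
    bob_double = blind(blind([hash_to_group(x) for x in bob_items], b), a)
    # Equal items yield equal doubly blinded values because exponentiation commutes.
    return {x for x, tag in zip(bob_items, bob_double) if tag in alice_double}

patient_cells = ["09:00|cell-17", "09:01|cell-18", "09:02|cell-19"]
user_cells = ["09:01|cell-18", "09:02|cell-19", "09:03|cell-40"]
print(dh_psi(patient_cells, user_cells))
```

Neither party sees the other's raw items in this flow, only blinded group elements, which is the property the PSI definition above asks for.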
Homomorphic encryption allows computations on data while the data is encrypted, and thus makes it possible to compute PSI without sharing the secret key with any service provider. The Paillier cryptosystem, a typical additively homomorphic encryption scheme, has been successfully applied to computing the PSI (M. Kim et al., 2012). Fully homomorphic schemes go further and support arbitrary computations on ciphertexts. However, homomorphic encryption is quite slow and requires large ciphertexts and keys (Boyle et al., 2017).
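The ciphertext growth mentioned above can be seen in a toy Paillier implementation. The sketch below is illustrative only: the primes 1009 and 1013 are an assumption chosen for readability and offer no security, and the function names `keygen`, `encrypt`, and `decrypt` are hypothetical. It uses the common simplification of generator n + 1.

```python
import math
import secrets

def keygen(p: int, q: int):
    """Toy Paillier key generation with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse of lam modulo n (Python 3.8+)
    return n, (lam, mu)

def encrypt(n: int, message: int) -> int:
    """Encrypt message in [0, n) as (n + 1)^message * r^n mod n^2."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n2) * pow(r, n, n2)) % n2

def decrypt(n: int, private_key, ciphertext: int) -> int:
    lam, mu = private_key
    n2 = n * n
    l_value = (pow(ciphertext, lam, n2) - 1) // n  # L(u) = (u - 1) / n
    return (l_value * mu) % n

n, private_key = keygen(1009, 1013)
c1, c2 = encrypt(n, 20), encrypt(n, 22)
# Multiplying ciphertexts adds the underlying plaintexts.
total = decrypt(n, private_key, (c1 * c2) % (n * n))
print(total)
# Ciphertexts live modulo n^2, so they need about twice the bits of a plaintext.
print(n.bit_length(), (n * n).bit_length())
```

Because every ciphertext is a residue modulo n squared, the encrypted data is roughly twice as long as the plaintext it protects, which is the storage overhead the text refers to.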
Another way to build the PSI protocol is based on oblivious transfer (OT). Kolesnikov et al. (2016) adopted the OT extension to implement Oblivious Pseudorandom Functions (OPRFs) (Freedman et al., 2005). An OPRF is a secure two-party protocol that implements the functionality $g(r, w) = (\bot, f_r(w))$ for some pseudorandom function family $f_r$ (Kolesnikov et al., 2016). One can extend the OPRF solution to the PSI problem without much difficulty. Moreover, the OPRF-based PSI computation is significantly faster than other schemes. Since the trajectory data is fairly large, this study uses the OPRF-based method to complete the spatiotemporal intersection computation.

METHODS
Overview
The method first treats the trajectories of some individual during a fixed interval as a set. To determine if an individual is in close proximity to an infected patient, one may compute the intersection of these two sets. If the intersection is empty, then the individual is safe. Otherwise, the individual is at a risk of infection to some extent. Figure 1 illustrates the data exchanges between the client and the server for PSI computation. In Figure 1, » p gives the key space; E is the encryption function; F is the function of processing the encrypted data; H is the function used to construct the mapping matrix; and Z is the function to calculate the intersection.
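As a point of reference, the set view of trajectories can be sketched without any privacy protection at all. The snippet below is a plain, non-private baseline, not the PSI protocol itself. The representation of a trajectory point as a `(time_slot, cell_id)` tuple and the function name `contact_points` are assumptions made for illustration.

```python
def contact_points(user_trajectory, patient_trajectory):
    """Return the (time_slot, cell_id) points that both trajectories share."""
    return set(user_trajectory) & set(patient_trajectory)

# Hypothetical trajectories over the same fixed interval.
patient = [("09:00", "cell-17"), ("09:01", "cell-18"), ("09:02", "cell-19")]
user = [("09:00", "cell-03"), ("09:01", "cell-18"), ("09:02", "cell-21")]

shared = contact_points(user, patient)
if shared:
    print("at risk, shared points:", sorted(shared))
else:
    print("no spatiotemporal intersection")
```

The PSI protocol described next computes the same result while keeping the non-matching points of each party hidden from the other.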

Figure 1. Sketch of the Two-Party PSI Computation
The server encrypts the original data in the database using the session key $k$ and sends the session key to the client. Next, the client and the server choose an OT method to exchange information. Both parties then calculate $X \cap Y$ from the received data, without learning any information other than the intersection and without learning the private information in each other's database. The algorithm has a complexity of $O(n)$.

PSI Algorithm
Assume the name of the server is "Alice" and the client is "Bob." In a real application, the server may have collected trajectory data from multiple confirmed positive individuals. Without loss of generality, the following content only takes Alice for illustration.
The authors convert the location information at a given moment into $w$-bit binary data. The $m$ pieces of location data along the trajectory during a certain interval then compose a matrix denoted by $Y_{m \times w}$. That is to say, Alice's initial set is the matrix $Y_{m \times w}$. Likewise, Bob's initial set is denoted by $X_{m \times w}$. Starting from the initial sets, Alice and Bob carry out the OT step. They both perform a mapping operation on the encrypted data. After that, Alice and Bob send the mapped matrices to each other for PSI calculation.
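One way a location could be turned into a fixed-width row is sketched below. The paper states only that a location becomes $w$-bit binary data, and elsewhere that longitude and latitude take 32 bits each. The fixed-point encoding (degrees multiplied by 10^7 and stored as signed 32-bit integers) and the function names `pack_location` and `unpack_location` are assumptions made for illustration, not the authors' encoding.

```python
import struct

SCALE = 10**7  # assumption: fixed-point degrees with 7 decimal places

def pack_location(longitude: float, latitude: float) -> int:
    """Pack longitude and latitude into one 64-bit row (32 bits each)."""
    raw = struct.pack(">ii", round(longitude * SCALE), round(latitude * SCALE))
    return int.from_bytes(raw, "big")

def unpack_location(row: int):
    """Recover (longitude, latitude) from a 64-bit row."""
    lon, lat = struct.unpack(">ii", row.to_bytes(8, "big"))
    return lon / SCALE, lat / SCALE

def to_bits(row: int, width: int = 64):
    """Expand a row into a list of 0/1 values, most significant bit first."""
    return [(row >> (width - 1 - i)) & 1 for i in range(width)]

row = pack_location(116.3912, 39.9066)
print(len(to_bits(row)), unpack_location(row))
```

Stacking one such row per sampled moment gives a matrix with $m$ rows and $w = 64$ columns, matching the shape of $X$ and $Y$ in the text.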

Matrix XOR Modulo Operation
The authors implemented the OPRF encryption in the form of a matrix XOR modulo operation. In this module, Alice randomly chooses a matrix $k_{m \times w}$ from the key space, where each element of $k$ is 0 or 1. She takes the $i$-th row of $Y$ and XORs it with $k$, row-wise, to obtain a new matrix. By performing the "mod $m$" operation on each row of the new matrix, Alice obtains a number between 1 and $m$. The results of $w$ modulo operations form the ciphertext corresponding to the $i$-th row of the original matrix. Traversing every row of the original matrix yields the encrypted matrix $Y'_{m \times w}$, as shown at the bottom of Figure 2. Next, Alice sends the key matrix $k$ to Bob.
Bob then calculates the encrypted matrix $X'_{m \times w}$ using the same $k$.
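The arithmetic of this step can be sketched as follows. The description leaves the exact key layout open to interpretation, so the sketch makes a loud assumption: since each ciphertext row has $w$ entries, it uses one $w$-bit key row per output entry. Rows are held as Python integers, and the names `make_key`, `encode_row`, and `encode_matrix` are hypothetical. The snippet illustrates only the XOR and modulo arithmetic and makes no claim about the security of the authors' construction.

```python
import secrets

def make_key(w: int):
    """Assumed key layout: w random rows, each w bits wide."""
    return [secrets.randbits(w) for _ in range(w)]

def encode_row(row: int, key_rows, m: int):
    """XOR the row with every key row, then reduce to the range 1..m."""
    return [((row ^ key_row) % m) + 1 for key_row in key_rows]

def encode_matrix(rows, key_rows):
    """Encode all m rows; the result has m rows of w numbers each."""
    m = len(rows)
    return [encode_row(row, key_rows, m) for row in rows]

# Tiny example with m = 3 rows and w = 4 bits, using a fixed key for clarity.
fixed_key = [0b1010, 0b0110, 0b0001, 0b1111]
Y = [0b1100, 0b0011, 0b0101]
X = [0b0101, 0b1001, 0b1100]
Y_enc = encode_matrix(Y, fixed_key)
X_enc = encode_matrix(X, fixed_key)
print(Y_enc[0], X_enc[2])  # equal input rows give equal encoded rows
```

Because both parties use the same key, identical rows of $X$ and $Y$ map to identical encoded rows, which is what the later intersection step relies on.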

Oblivious Transfer
In the 1-out-of-2 OT protocol, Alice has two input strings $M_0$ and $M_1$. For the two-party PSI computation, Bob only needs to receive one of the two sets of information sent by Alice. The authors implement a simple 1-out-of-2 OT protocol similar to Chou and Orlandi (2015). In this OT protocol, Alice possesses the plaintexts $M_0$ and $M_1$. Traversing each row of $Y'$ from top to bottom, one obtains the matrix $D'_{m \times w}$, as illustrated in Figure 3.
Alice randomly generates matrix $A$, whose elements are 0 or 1. XOR-ing the elements in the same position of matrix $A$ and matrix $D'$ yields matrix $B$. Next, Bob randomly selects a string $s$ of length $w$, each element of which is also 0 or 1. This step is referred to as the mapping of the encryption matrix, as illustrated in Figure 4. Once the elements of matrix $Y_r$ are filled in, Alice sends it to Bob.
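For readers unfamiliar with 1-out-of-2 OT, the following is a toy sketch in the style of the Chou and Orlandi (2015) protocol, written over integers modulo a prime instead of an elliptic curve. It is not the authors' implementation. The modulus `P`, the base `G`, and the function names are assumptions, the parameters are insecure, and both parties are simulated in a single function for brevity.

```python
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus (assumption)
G = 3           # toy base (assumption)

def derive_key(element: int) -> bytes:
    """Hash a group element into a 32-byte pad."""
    return hashlib.sha256(element.to_bytes(16, "big")).digest()

def xor_bytes(data: bytes, pad: bytes) -> bytes:
    """XOR a message of at most 32 bytes with the pad."""
    return bytes(d ^ p for d, p in zip(data, pad))

def ot_1_of_2(m0: bytes, m1: bytes, choice: int) -> bytes:
    # Sender (Alice): picks a and publishes A = G^a.
    a = secrets.randbelow(P - 2) + 1
    big_a = pow(G, a, P)
    # Receiver (Bob): hides the choice bit inside B.
    b = secrets.randbelow(P - 2) + 1
    big_b = pow(G, b, P) if choice == 0 else (big_a * pow(G, b, P)) % P
    # Sender derives one key per message and encrypts both messages.
    k0 = derive_key(pow(big_b, a, P))
    k1 = derive_key(pow((big_b * pow(big_a, -1, P)) % P, a, P))
    c0, c1 = xor_bytes(m0, k0), xor_bytes(m1, k1)
    # Receiver can derive only the key that matches the choice bit.
    k_receiver = derive_key(pow(big_a, b, P))
    return xor_bytes(c1 if choice else c0, k_receiver)

print(ot_1_of_2(b"message-0", b"message-1", 0))
print(ot_1_of_2(b"message-0", b"message-1", 1))
```

In the PSI setting described here, such a transfer would be repeated once per bit of Bob's string $s$, so that Bob obtains one of two candidate columns each time without Alice learning which one.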

Figure 2. Matrix XOR and Modulo Operation
Bob uses the same mapping method to generate a mapping matrix $X_r$ of the matrix $X'$ in $C$, and then sends $X_r$ to Alice. Both Alice and Bob compute $X_r \cap Y_r$ to obtain the same intersection. The indices of the intersection in the original matrices $X$ and $Y$ are not necessarily the same.
Alice and Bob each obtain the indices of the intersection by calculating $X_r \cap Y_r$, and thereby obtain $X \cap Y$ from those indices. Figure 5 summarizes the improved PSI calculation proposed in this study with an illustrative example. Alice and Bob each have a matrix with 3 rows and 4 columns. They attempt to obtain $X \cap Y$ without revealing any data except the intersection. Alice and Bob need to cooperate by going through the encryption module, oblivious transfer, and matrix mapping. After that, Alice and Bob obtain $X_r$ and $Y_r$ respectively.
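The final lookup, from matching mapped rows back to original rows, can be sketched as below. The mapped matrices are represented as lists of rows, and the function name `intersect_by_index` as well as the sample values are hypothetical. The sketch shows only Bob's side, where the original matrix $X$ is available locally.

```python
def intersect_by_index(x_rows, x_mapped, y_mapped):
    """Find rows of X whose mapped form also appears among the mapped rows of Y.

    Returns the matching indices in X together with the original rows.
    """
    y_lookup = {tuple(row) for row in y_mapped}
    indices = [i for i, row in enumerate(x_mapped) if tuple(row) in y_lookup]
    return indices, [x_rows[i] for i in indices]

# Hypothetical 3-by-4 example: row 0 of X matches row 2 of Y.
X = [[0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
X_r = [[3, 1, 2, 2], [1, 3, 3, 1], [2, 2, 1, 3]]
Y_r = [[1, 1, 1, 1], [2, 3, 2, 3], [3, 1, 2, 2]]

indices, rows = intersect_by_index(X, X_r, Y_r)
print(indices, rows)
```

The matching index differs between the two parties (0 for $X$, 2 for $Y$ in this example), which is why each side resolves the intersection against its own original matrix.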

Spatiotemporal Intersection Computation
When Spatio-Temporal Intersection (STI) computation is applied to the problem of contact tracing, a health center collects personal trajectory data of newly infected patients and stores it in a database. The health center is authorized to provide encrypted trajectory data on behalf of Alice, one of the infected patients. Bob is an application user who wants to know whether he was too close to Alice at some point. Therefore, in this application, Bob will read the location information from
This study makes the "semi-honest" assumption in the STI computation scenario. When the health center and the user exchange information, the information is transmitted over a secure channel. The data transmission channels are not subject to malicious third-party attacks. Patient trajectory data is honestly collected by the health center and used only for historical intersection calculations. A user who wants to know whether he or she is safe is willing to cooperate and will not participate in the calculation with false data. Therefore, assuming semi-honest security in this scenario is generally plausible. Figure 6 illustrates the data transfer process. The data of an infected patient managed by the health center is denoted by $Y_{m \times w}$, and the user's personal trajectory data is denoted by $X_{m \times w}$. In the STI computation scenario, the user sends an "STI Computation" request and exchanges the parameters for OT with the health center. The health center cannot access or store the user's personal information, since the PSI is calculated locally. In addition, the user's personal trajectory data is masked via encryption.
When the health center confirms a freshly infected patient, it collects the patient's historical trajectory data and updates the core database. If some patient's data is three weeks overdue, then the health center also removes or disables the record from the core database. That is to say, the health center updates the core database constantly.
Figure 6. Sequence Diagram of STI Computation
From the raw trajectory data, the model assigns each visited location a risk factor $r$. The risk factor represents the risk level of that location. Setting and supervising the risk factors is the responsibility of health institutions. After the user starts a computation request, the process continues as follows:
1. Upon receiving the user's STI computation request, the server, on behalf of the health center, generates a key matrix $k$ to encrypt the patient's trajectory data $Y$. The server specifies the parameters (matrices $A$, $B$, and $D$), where matrices $A$ and $B$ serve as input to the OT step. Next, the server calculates the mapping $Y_r$ of the encrypted data in matrix $A$. Then, the server sends the risk factor set $\Phi_y$, the mapped data $Y_r$, and the key $k$ to the user.
2. The client, on behalf of the user, randomly samples and generates the string $s$ locally, and obtains matrix $C$ after $w$ rounds of OT with the server. The client encrypts the trajectory dataset $X$ locally, picking the same time interval as the patient. After calculating the mapping $X_r$ of $X'$ in matrix $C$, the client obtains $X_r \cap Y_r$ and, from it, the historical spatiotemporal intersection.
3. If the intersection is empty, the user has not been in close contact with the patient. Otherwise, the client uses the set of risk factors to calculate the user's infection risk. If the calculated risk index exceeds a certain threshold, the health center issues a warning to the user about close contact with a COVID-19 patient.
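The decision made after the intersection is known can be sketched as follows. The text does not specify how the risk index is aggregated or which threshold is used, so the sketch assumes the index is the maximum risk factor among the intersecting locations and uses a hypothetical threshold of 0.5. The function name `assess_risk` and the status labels are likewise illustrative.

```python
def assess_risk(intersection_risk_factors, threshold=0.5):
    """Return a (status, risk_index) pair for one STI computation.

    Assumption: the risk index is the maximum risk factor among the
    locations in the spatiotemporal intersection.
    """
    if not intersection_risk_factors:
        return "no_close_contact", 0.0
    risk_index = max(intersection_risk_factors)
    status = "warn" if risk_index > threshold else "below_threshold"
    return status, risk_index

print(assess_risk([]))
print(assess_risk([0.1, 0.3]))
print(assess_risk([0.2, 0.8]))
```

Other aggregation rules, such as a sum over the time spent at each location, would fit the same interface; the choice is left to the health institution that sets the risk factors.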

EXPERIMENTS
The authors developed a WeChat applet based on the above-mentioned STI computation scheme. Some user interfaces of the applet are displayed in Figure 7. Figure 7(1) shows the main page of the application, which includes three basic functions: STI Computation, query of trajectory data for the past 14 days, and COVID risk assessment. When the health center obtains trajectory data of the newly infected patients, the applet notifies the user that the database has been updated and is ready for a new STI computation. If the user requests STI computation and obtains the confirmed response from the central server, the applet displays the calculation progress, as shown in Figure 7(2). After the calculation is completed, the applet displays the STI results on the page, and assesses whether or not the user is at risk of being infected. If the risk level is high, the applet will prompt a warning message, as displayed in Figure 7(3).
The authors also established a web service for the health center to manage patient information and disease profiles during the pandemic. Figure 8 shows the main page, including trajectory data of confirmed patients and markers for risk regions. When the administrator uploads the trajectory data of newly diagnosed patients, the web page is updated automatically. The records of diagnosed patients contain the encrypted ID, the date range, and the number of comparisons. Changes in the risk level of a region also trigger a refresh of the web page.

Data Source
To evaluate the performance of the applet, the authors conducted a small-scale experiment in the district around campus and constructed a trajectory dataset inspired by the one released by Google (Fu et al., 2020). Each record of trajectory data is composed of time, longitude, latitude, and risk factor $r$. In addition, when location information is collected from multiple devices, a constrained integrity check (Jaziri et al., 2016) can be performed to ensure consistency, and the data is kept in a database and processed in encrypted form. In accordance with the guidance of the Centers for Disease Control and Prevention (CDC), a high-risk region is defined by $r \in (0.5, 1]$, a medium-risk region by $r \in (0.2, 0.5]$, and a low-risk region by $r \in [0, 0.2]$. The authors conducted experiments simulating six scenarios, as presented in Table 1. Instantaneous GPS data is sampled once per minute in the first three scenarios. Scenario 1 records the trajectory data between 9 a.m. and 9 p.m.; hence, the total amount of data in two weeks is 10,080 records. In most scenarios, the outdoor time interval is the primary consideration in STI calculation. Indoor intervals are also taken into account in STI calculations if the pandemic situation becomes severe. Wider time intervals can be found in Scenarios 3 through 6, in which the trajectory data from both daytime and night are included.
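The three risk bands can be expressed as a small classifier. The sketch reads the intervals as high risk for r in (0.5, 1], medium risk for r in (0.2, 0.5], and low risk for r in [0, 0.2]; the function name `classify_region` and the returned labels are assumptions made for illustration.

```python
def classify_region(r: float) -> str:
    """Map a risk factor in [0, 1] to a risk band."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("risk factor must lie in [0, 1]")
    if r > 0.5:
        return "high"
    if r > 0.2:
        return "medium"
    return "low"

for value in (0.0, 0.2, 0.35, 0.5, 0.9):
    print(value, classify_region(value))
```

The boundaries 0.2 and 0.5 fall into the lower band in each case, which follows from the half-open intervals.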
The authors designed the collection time points based on the above six scenarios. In the case of sampling once per minute, the collection was performed at the 30th second. At 15 seconds and 45 seconds, the system will collect location information twice per minute. A sampling at the 10th, 30th, and 50th second yields three samples per minute. In a situation that required six samples per minute, the authors collected location information every 10 seconds. For the location information matrix, each row corresponds to a record in the dataset. The sizes of the longitude and latitude are both 32-bit. Hence, each row of the location information matrix is 64-bit. If the user remains in one place for an extended period of time without movement, the system merges the location data within that time period to speed up STI calculations.
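The sampling schedule and the resulting data volume can be checked with a few lines of arithmetic. In the sketch below, the offsets for one, two, and three samples per minute come from the text; the offsets for six samples per minute assume that the 10-second grid starts at second 0. The pairing of 24-hour recording with six samples per minute for Scenario 6 is also an assumption, chosen because it reproduces the reported record count. The names `SAMPLING_OFFSETS` and `record_count` are hypothetical.

```python
# Seconds within each minute at which a location is collected.
SAMPLING_OFFSETS = {
    1: [30],
    2: [15, 45],
    3: [10, 30, 50],
    6: [0, 10, 20, 30, 40, 50],  # assumption: the 10-second grid starts at 0
}

def record_count(days: int, hours_per_day: int, samples_per_minute: int) -> int:
    """Number of trajectory records collected over the whole period."""
    return days * hours_per_day * 60 * samples_per_minute

# Scenario 1: 9 a.m. to 9 p.m. for two weeks, one sample per minute.
print(record_count(14, 12, 1))
# Assumed setting for Scenario 6: all day for two weeks, six samples per minute.
print(record_count(14, 24, 6))
```

Both values agree with the record counts reported for Scenario 1 and Scenario 6 in the performance results.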

Performance
In the small-scale experiment, the server has an 8-core CPU with 16 gigabytes of memory. The authors recorded and analyzed the data transfer time and the total running time of STI calculations in the six scenarios. Data transmission refers to the parameter transmission and the OT message exchange between the client and the server. The total running time recorded includes the data encryption, data transfer, and intersection computation phases.
Suppose that each number pair to be compared has $n$ bits. The computational complexities of data encryption, OT, and intersection calculation are all $O(n)$, which agrees with other literature regarding garbled circuits (Kolesnikov et al., 2009). If the data volume is considered, namely the number of pairs denoted by $k$, then the corresponding computational complexity is $O(kn)$.
The timer is triggered when the client starts an STI computation request and stops when the STI result is returned. The running time for each piece of data increases with the scale of the dataset, as shown in Figure 9. Scenario 1 contains 10,080 pieces of data, and the average total running time is 1.60 s. When the data volume increases to 120,960 (Scenario 6), the average total running time reaches 81.91 s. If the data volume is small, then the average running time per piece of data can be less than 0.2 ms. During the experiment, the actual average delay of the network connection was 0.139 ms.
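Dividing the reported totals by the record counts gives the per-record running time directly. The short calculation below uses only the figures quoted above; the function name `per_record_ms` is hypothetical.

```python
def per_record_ms(total_seconds: float, records: int) -> float:
    """Average running time per record, in milliseconds."""
    return total_seconds / records * 1000.0

scenario_1 = per_record_ms(1.60, 10_080)
scenario_6 = per_record_ms(81.91, 120_960)
print(f"Scenario 1: {scenario_1:.3f} ms per record")
print(f"Scenario 6: {scenario_6:.3f} ms per record")
```

The per-record cost is about 0.16 ms in Scenario 1 and about 0.68 ms in Scenario 6, so it grows roughly fourfold while the dataset grows twelvefold.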

LIMITATIONS
The proposed STI computation method has some limitations to consider. First, the applet is unreliable when both the user and the infected patient have been to the same high-rise building. Since the applet cannot distinguish between the floors of a building, it may falsely warn the user even if the user was far away on a different floor. Second, the application model makes semi-honest assumptions to ensure that citizens' privacy is not violated. However, the model does not carefully protect the privacy of infected patients. A lack of trust may reduce the uptake of the applet.
Third, as mentioned previously, the running time for STI computation increases with the scale of the dataset. The method is applicable when the number of actively infected patients registered in the health center is not very large, or the disease outbreak is in the early stage.

CONCLUSION
In response to the COVID-19 pandemic, the authors developed a PSI-based applet aiming to curb the spread of the pandemic. The applet performs spatiotemporal intersection computation between the end user and the infected individual without breaching the user's privacy, thus providing a useful tool for risk assessment. The authors conducted a small-scale experiment to verify the applicability and efficiency of the applet across six scenarios. If the number of actively infected patients is not overly large, the running time can be confined to less than 0.2 ms per item. In the era of big data, where privacy problems are particularly prominent, it is of practical significance to explore a contact-tracing scheme that balances privacy protection and computing costs for controlling the spread of diseases. The STI model proposed in this paper could also be applied to other scenarios involving intersection computing and privacy protection, such as finding common friends in social applications. Encrypted databases are a popular way to protect data from database management systems (Grubbs et al., 2017). The encryption module in the STI model protects the information in each party's database from leakage. The study also provided a web service for local health centers to manage datasets of confirmed patients. The authors are continuing to work with the health center to jointly promote the applet for curbing the pandemic.

AUTHOR NOTE
The authors of this publication declare there are no competing interests.