Trajectory Data Publication Based on Differential Privacy

Analyzing trajectory data can improve people's quality of life, but publishing trajectory data directly leaks private information. The authors propose a trajectory data publication method based on differential privacy (TDDP). The TDDP method consists of two stages. In the location generalization stage, the locations at each timestamp are first clustered into classes by the k-means++ algorithm, and then the representative location of each class is selected by the exponential mechanism. In the generalized trajectory data publication stage, the authors design a sampling mechanism that draws locations from the representative locations at different timestamps to form the generalized trajectories. The TDDP method avoids generating non-semantic representative locations and ensures that the generalized trajectories can resist filtering attacks. The experimental results show that the trajectory data released by the TDDP method achieves a good balance between privacy protection and data availability.


INTRODUCTION
In the era of big data, location-aware technologies such as mobile communications and sensing devices digitize the geographic locations of people and objects and generate a large amount of trajectory data. Location data reflects characteristics of human behavior, so analyzing and mining trajectory data makes it possible to provide better services to people (Yang et al., 2019). For example, urban traffic can be planned to avoid congestion by analyzing trajectory data (Yuan et al., 2012; Yuan et al., 2013). However, trajectory data contains a lot of sensitive personal information, such as home address, work address, and physical health status. If location or trajectory data is released directly, it will lead to privacy leakage (Wernke et al., 2014; Gursoy et al., 2019; Ding et al., 2020) and, in serious cases, may even threaten people's personal safety and property. Research on trajectory data privacy protection falls mainly into two types. The first is privacy protection in offline mode, in which a specific organization collects trajectory data for analysis and mining in order to provide useful information to specific customers (Abul et al., 2008; Hua et al., 2015; Li et al., 2017; Ma et al., 2021). The second is online privacy protection, as in location-based services, where the real-time trajectory data of moving objects must be uploaded to the service provider and therefore also requires privacy protection (Zhang et al., 2017; Zhang et al., 2018). This paper mainly studies the privacy protection of trajectory data in offline mode.
Existing trajectory data privacy protection methods mainly include the k-anonymity method (Sweeney et al., 2002), the encryption method, and the random perturbation method. The k-anonymity method is vulnerable to attacks that use background knowledge. The encryption method is not commonly used because of its high computational cost. Among the random perturbation methods, trajectory data publishing based on differential privacy has become a popular research topic (Hua et al., 2015; Liu et al., 2021). Differential privacy (Dwork et al., 2017) is the strongest unconditional privacy protection technology currently known, and it can resist attacks that use arbitrary background knowledge. However, current research on trajectory data publishing based on differential privacy still has aspects that need to be improved.
(a) Some existing methods require that any two trajectories have the same start and end times, or assume that the raw trajectories share the same prefix or a common subsequence. However, trajectory data collected in practice rarely has these characteristics. (b) Some existing methods cluster the locations of all trajectories at each timestamp, use the cluster center of each class as the representative of all locations in that class, and finally use the cluster centers to generate the generalized trajectories. However, cluster centers sometimes carry no semantic information, and non-semantic representative locations can even appear in multiple clusters, which allows the adversary to identify and filter the published trajectories.
To address these issues, the authors propose a trajectory data publishing method based on differential privacy (TDDP). The contributions are as follows: (a) The authors propose the TDDP method, which consists of two stages. In the first stage, the locations at each timestamp are clustered into K classes, and the representative location of each class at each timestamp is selected by the exponential mechanism. In the second stage, locations are sampled from the representative locations at different timestamps to form the generalized trajectories, which avoids generating non-semantic locations and resists filtering attacks. (b) To improve the utility of the published trajectory data, in the second stage the TDDP method makes the generalized trajectory data set consist partly of the generalized trajectories that contain the raw trajectories. (c) The authors conduct experiments on a real data set. Hausdorff distance and spatial range queries are used to measure the utility of the published trajectory data, and mutual information is used to measure the total privacy loss. The results show that the published trajectory data maintains good utility.
Abul et al. (2008) proposed the NWA (Never Walk Alone) algorithm, which uses the inherent uncertainty of the motion trajectory to make the k trajectories in a trajectory cylinder indistinguishable and thereby achieves k-anonymity. When constructing the trajectory k-anonymity set, the NWA algorithm uses the Euclidean distance to calculate the distance between two trajectories. It requires that any two trajectories have the same start and end times and that their sampling points match. However, trajectory data collected in practice rarely meets these requirements. Nergiz et al. (2008) proposed to generalize the time and space of the trajectory data. The log cost distance is used to measure the similarity of any two trajectories, locations are then randomly selected in each anonymous area to reorganize the trajectories, and finally the reorganized trajectories are released to improve the utility of the released trajectory data. Abul et al. (2010) proposed the W4M (Wait for Me) algorithm, an improvement of the NWA algorithm. In the trajectory clustering stage, the Euclidean distance is replaced by the EDR distance function to calculate the distance between two trajectories. The W4M algorithm solves the problem of mismatched trajectory lengths during trajectory clustering. Yu et al. (2018) proposed a method that uses frequent path patterns for trajectory privacy protection. First, the trajectories are divided into road segments and infrequent road segments are removed. Then the most frequent path is found and the k-anonymity sets are constructed. Finally, the representative trajectory of each k-anonymity set is published. Gursoy et al. (2018) proposed the AdaTrace method, which builds a generative model from a given set of real traces through a four-stage synthesis process consisting of feature extraction, synopsis learning, and privacy- and utility-preserving noise injection, and then generates differentially private synthetic location traces. The output traces crafted by AdaTrace are robust against known location trace attacks.
Although the k-anonymity method is relatively mainstream, it is vulnerable to attacks that use background knowledge. In recent years, trajectory data publishing methods based on differential privacy have emerged. Because of its rigorous mathematical form, differential privacy can guarantee unconditional privacy: even if the attacker has partial background knowledge, inference attacks are not possible. Chen et al. (2011) were the first to use differential privacy to solve the privacy protection problem of large-scale trajectory data release. They proposed a prefix tree to store the trajectory data and used the Laplace mechanism to add noise to every node in the tree except the root node. To handle the data inconsistency easily caused by independent noise, they proposed a consistency processing of the noisy values that exploits the structure of the prefix tree. The released trajectory data is suitable for counting queries and frequent pattern queries. Chen et al. (2012) extended the prefix tree work with a variable-length n-gram model for sequence data and developed a solution for generating a synthetic data set, which enables a wider spectrum of data analysis tasks. However, the above work assumes that the raw trajectories contain the same prefix or a common subsequence of a certain length, which is difficult to achieve in practice. To address this challenge, Hua et al. (2015) proposed a differential privacy algorithm based on spatial generalization. First, the exponential mechanism is used to merge nearby locations at the same timestamp. Then the Laplace mechanism is used to add noise to the counts of the generalized trajectories. This method removes the requirement, common to most earlier methods, that the trajectories share the same prefix. However, Laplace noise is unbounded, which can lead to privacy leakage (Li et al., 2017). Li et al. (2017) therefore proposed a trajectory publishing method with bounded Laplace noise. To improve the utility of the published trajectory data, they released only the generalized trajectories that contain the raw trajectories during the trajectory generalization stage. When the correlation between multiple locations in a trajectory is ignored, the trajectory is vulnerable to a large number of inference attacks. Lu et al. (2017) proposed an algorithm that protects the location privacy of multiple correlated locations. They used a Hidden Markov similarity metric to quantify the correlation of two locations and then used the Laplace mechanism to publish the trajectory data. Tian et al. (2018) proposed a personalized privacy-preserving method for trajectory data release, which weights the representative units of clusters based on privacy budgets and satisfies personalized privacy preferences. Yuan et al. (2021) proposed a differential privacy trajectory data protection scheme based on the R-tree. First, a trajectory similarity tree structure is built on the R-tree, which supports spatial storage and query processing of trajectory data. Second, the DPTS-tree is constructed by adding noise to the statistical values of users in the nodes, which greatly improves resistance to attacks that use arbitrary background knowledge. Finally, consistency constraints are applied to solve the data inconsistency that the added independent noise may cause. Liu et al. (2021) designed two privacy-preserving trajectory data publishing algorithms, NPGG and PGG. In the NPGG method, the locations at each timestamp are first clustered into K classes, with the cluster center taken as the representative of all locations in the same class, and the representative locations at different timestamps are then randomly connected to construct generalized trajectories. The data set finally released by NPGG consists of two parts: the generalized trajectories that contain the original trajectories, and the generalized trajectories generated by randomly connecting the cluster centers of different timestamps. Finally, a bounded staircase noise generation algorithm is used to perturb the counts of the generalized trajectories. Liu et al. (2021) also proposed the PGG method, an improvement of NPGG. The only difference is that PGG uses the staircase mechanism to perturb the cluster centers before constructing the generalized trajectories. However, using the cluster center to represent the locations of a class sometimes produces generalized trajectories that contain non-semantic locations, so the published trajectories are easily identified and filtered by adversaries.

PRELIMINARIES
Differential Privacy
Differential privacy provides rigorous privacy protection for sensitive information, and the protection can be quantified mathematically. The essence of differential privacy is to randomly perturb query results. Two common mechanisms are the Laplace mechanism and the exponential mechanism: the Laplace mechanism is suitable for numerical queries, and the exponential mechanism is suitable for non-numerical queries.
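Definition 1 (Differential Privacy) (Dwork et al., 2017) A randomized algorithm $M$ satisfies $\varepsilon$-differential privacy if, for any two neighboring databases $D_1$, $D_2$ and for any $S \subseteq \mathrm{Range}(M)$,
$$\Pr[M(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[M(D_2) \in S],$$
where $\varepsilon$ is a non-negative real number called the privacy budget.
Definition 2 (Sensitivity) (Dwork et al., 2006) Let $f: D \to \mathbb{R}^d$ be a function that maps a database into a fixed-size vector of real numbers. For any neighboring databases $D_1$ and $D_2$, the sensitivity of $f$ is defined as
$$\Delta f = \max_{D_1, D_2} \lVert f(D_1) - f(D_2) \rVert_1 .$$
Theorem 2 (Parallel Composition) (Dwork et al., 2014) Let $M_1, M_2, \ldots, M_n$ be a series of privacy algorithms with privacy budgets $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$. If the algorithms are applied to disjoint subsets of the data set, the combined algorithm satisfies $\max_i \varepsilon_i$-differential privacy.
Theorem 3 (Post-Processing Immunity) (Dwork et al., 2014) Let $M: D \to R$ be a randomized algorithm that satisfies $\varepsilon$-differential privacy, and let $f: R \to R'$ be an arbitrary mapping. Then $f \circ M: D \to R'$ satisfies $\varepsilon$-differential privacy.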
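The two mechanisms named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `laplace_mechanism` and `exponential_mechanism` and their parameters are hypothetical, and the sketch assumes the standard forms of both mechanisms (Laplace noise with scale $\Delta f / \varepsilon$, and sampling with probability proportional to $\exp(\varepsilon u / (2\Delta u))$).

```python
import math
import random


def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Return true_value plus Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_value + noise


def exponential_mechanism(candidates, utility, sensitivity, epsilon, rng=random):
    """Sample one candidate with probability proportional to
    exp(epsilon * utility(c) / (2 * sensitivity))."""
    scores = [epsilon * utility(c) / (2.0 * sensitivity) for c in candidates]
    top = max(scores)  # subtract the maximum score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

A numerical query (for example a count) would go through `laplace_mechanism`, while selecting one item from a finite candidate set, such as a representative location, would go through `exponential_mechanism`.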

Problem Description
Definition 5 (Time-Series Trajectory) A time-series trajectory can be represented as $tr = (l_1, t_1) \to (l_2, t_2) \to \cdots \to (l_T, t_T)$, where $l_i = (x_i, y_i)$ is a discrete spatial point represented by its latitude and longitude coordinates, and $T$ is the length of the trajectory.
The trajectory data set is $D = \{tr_1, tr_2, \ldots, tr_n\}$. Given a trajectory data set $D$, the goal of this paper is to publish a synthetic trajectory data set $\tilde{D}$ that has the same scale as $D$ and high data utility while satisfying differential privacy. The mathematical notations used in this paper are summarized in Table 1.

The TDDP Method
The TDDP method for trajectory data publishing consists of two stages: location generalization and generalized trajectory data publication. In the first stage, the k-means++ algorithm is used to cluster the locations at each timestamp into $K$ classes. Then the exponential mechanism of differential privacy is used to randomly select one location as the representative of all locations in the same class. After the first stage, the $K$ representative locations at each timestamp are obtained. In the second stage, the exponential mechanism is used again to sample locations from the representative locations at different timestamps, and the sampled locations are connected to form the generalized trajectories for publishing. The details of location generalization are given in Pseudo-code 1, and the details of generalized trajectory data publishing are given in Pseudo-code 2.

Differentially Private Location Generalization
In the first stage, the locations at each timestamp are clustered into $K$ classes by k-means++, so that adjacent locations fall into the same class. Within each class, the exponential mechanism of differential privacy is used to select the representative location, which protects the privacy of the locations in the class. Compared with methods that use the cluster center as the representative location, a representative selected in this way is always one of the original locations, so non-semantic locations are not generated and the published trajectories cannot be identified and filtered by adversaries on that basis. The utility function in this stage is defined on the Euclidean distance $d_{tki}$ of each location from its cluster center: in each class, the probability of a location being selected as the representative location decreases with its distance from the cluster center. This utility function makes Pseudo-code 1 more inclined to select locations close to the cluster center, which helps preserve the utility of the published trajectory data. The location generalization process satisfies $\varepsilon$-differential privacy, and its details are given in Pseudo-code 1.
Pseudo-code 1. Location generalization
for t = 1 to T do
    Use k-means++ to cluster the locations at timestamp t into K classes
    $d_{tki}$ = the Euclidean distance of location $l_{tki}$ from its cluster center
    for k = 1 to K do
        Select the representative location $l_{tk}$ from $D_{tk}$ by the exponential mechanism
    end for
end for
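The location generalization stage can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the helper names `kmeans_pp` and `generalize_locations` are hypothetical, locations are treated as planar `(x, y)` tuples, and the utility of a location is assumed to be the negative of its distance to the cluster center with sensitivity 1, since the exact utility function is not reproduced in the text above.

```python
import math
import random


def kmeans_pp(points, k, rng, iters=20):
    """Cluster 2-D points into k classes; returns (centers, labels)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # k-means++ seeding: choose the next center with probability
        # proportional to the squared distance to the nearest chosen center.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        if sum(d2) == 0:  # every point already coincides with a center
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])

    def assign():
        return [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]

    labels = assign()
    for _ in range(iters):
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a class becomes empty
                centers[j] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
        new_labels = assign()
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels


def generalize_locations(locations_by_time, k, epsilon, rng=None):
    """For each timestamp, return (representative locations, class sizes)."""
    rng = rng or random.Random()
    result = []
    for points in locations_by_time:
        centers, labels = kmeans_pp(points, k, rng)
        reps, sizes = [], []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                continue
            # ASSUMPTION: utility = -distance to the cluster center, sensitivity 1.
            scores = [-epsilon * math.dist(p, centers[j]) / 2.0 for p in members]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            # Exponential mechanism: the representative is an original location.
            reps.append(rng.choices(members, weights=weights, k=1)[0])
            sizes.append(len(members))
        result.append((reps, sizes))
    return result
```

Because every representative is drawn from the members of its class, the output never contains a synthetic point such as a centroid, which is the property the text relies on to avoid non-semantic locations.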

Table 1. Mathematical notations

| Symbol | Description |
| --- | --- |
| $D$ | The raw trajectory data set. |
| $\tilde{D}$ | The published trajectory data set. |
| $tr$ | The raw trajectory in $D$. |
| $\widetilde{tr}$ | The generalized trajectory in $\tilde{D}$. |
| $n$ | The number of trajectories in $D$. |
| $T$ | The number of timestamps (length of the trajectory). |
| $K$ | The number of classes into which the locations are clustered at each timestamp. |
| $D_{tk}$ | The set of locations belonging to the $k$-th class at timestamp $t$. |
| $L_t$ | The set of $K$ representative locations at timestamp $t$. |
| $n_{tk}$ | The number of locations in $D_{tk}$. |
| $l_{tk}$ | The representative location of the $k$-th class at timestamp $t$. |
| $l_{tki}$ | The $i$-th location of the $k$-th class at timestamp $t$. |
| $(x_{tki}, y_{tki})$ | The latitude and longitude coordinates of the $i$-th location in the $k$-th class at timestamp $t$. |
| $d_{tki}$ | The distance between the $i$-th location of the $k$-th class at timestamp $t$ and its cluster center. |

With Pseudo-code 1, the locations at each timestamp are clustered into $K$ classes and the representative location of each class is obtained. The task of Pseudo-code 2 is then to generate the generalized trajectories by sampling locations from the representative locations at different timestamps. A simple approach is to select one of the $K$ representative locations with equal probability at each timestamp and connect the selected locations across timestamps to form a generalized trajectory for publishing. However, this is infeasible because of the very high dimensionality of the trajectory data, even though the location universe has been greatly compressed in the first stage. For example, if the raw trajectory data set contains 500 trajectories, the locations at each timestamp are clustered into 40 classes, and the number of timestamps is 30, then the number of candidate generalized trajectories to consider is $40^{30}$ (about $1.15 \times 10^{48}$), which is infeasible.
To address this challenge and improve the utility of the published trajectory data set, the authors use the generalized trajectories that contain the raw trajectories as one part of the published trajectory data set $\tilde{D}$; that is, they generate $n_1$ generalized trajectories that contain the raw trajectories. To reach the same scale as the raw trajectory data set $D$, the authors design a sampling algorithm that generates the remaining $n - n_1$ generalized trajectories of $\tilde{D}$. The exponential mechanism is used again to sample a location from the $K$ representative locations at each timestamp, and the locations sampled at different timestamps are connected to form a generalized trajectory. The design idea of Pseudo-code 2 is that the more locations a class contains, the more likely its representative location should be sampled, which improves the utility of the released trajectory data. The authors therefore define the utility function $v(L_t, l_{tk}) = n_{tk}$, where $n_{tk}$ is the number of locations in $D_{tk}$. The sensitivity of the utility function is $\Delta v = 1$. Given the utility of all candidate representative locations, the location $l_{tk}$ is sampled from $L_t$ with probability proportional to
$$\exp\left(\frac{\theta\, v(L_t, l_{tk})}{2\Delta v}\right),$$
where $\theta$ is the parameter of Pseudo-code 2. The more locations a class contains, the more easily its representative location is sampled, and the utility of the published trajectory data is improved. The details of generalized trajectory data publishing are given in Pseudo-code 2.
Pseudo-code 2. Generalized trajectory data publishing
for t = 1 to T do
    for k = 1 to K do
        $v(L_t, l_{tk}) = n_{tk}$
    end for
    $P_{t0} = 0$
    for k = 1 to K do
        $P_{tk} = P_{t(k-1)} + \exp\left(\theta\, v(L_t, l_{tk}) / (2\Delta v)\right) / \sum_{j=1}^{K} \exp\left(\theta\, v(L_t, l_{tj}) / (2\Delta v)\right)$
    end for
end for
for t = 1 to T do
    Generate a random number $r \in (0, 1)$
    for k = 1 to K do
        if $P_{t(k-1)} < r \le P_{tk}$ then
            $l_t = l_{tk}$
        end if
    end for
end for
Generate the generalized trajectory $\widetilde{tr} = (l_1, t_1) \to (l_2, t_2) \to \cdots \to (l_T, t_T)$
For example, in Figure 1, the trajectory data set includes seven raw trajectories. In the first stage, the locations at each timestamp are clustered into $K = 2$ classes, and the set of representative locations at each timestamp is obtained by the exponential mechanism of differential privacy. In the second stage, the released trajectory data set consists of the generalized trajectories that contain the raw trajectories and the remaining generalized trajectories generated by Pseudo-code 2.
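The sampling loop of Pseudo-code 2 can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function name `sample_generalized_trajectory` and its arguments are hypothetical, and the sketch assumes the utility $v(L_t, l_{tk}) = n_{tk}$ with sensitivity 1 and a sampling parameter `theta` playing the role of the parameter of Pseudo-code 2.

```python
import math
import random


def sample_generalized_trajectory(reps_by_time, sizes_by_time, theta, rng=random):
    """Pick one representative location per timestamp, favoring larger classes.

    reps_by_time[t]  : list of representative locations at timestamp t
    sizes_by_time[t] : list of class sizes n_tk aligned with reps_by_time[t]
    """
    trajectory = []
    for t, (reps, sizes) in enumerate(zip(reps_by_time, sizes_by_time), start=1):
        # Utility v(L_t, l_tk) = n_tk with sensitivity 1 (assumed form).
        scores = [theta * n / 2.0 for n in sizes]
        top = max(scores)  # subtract the maximum score for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Cumulative probabilities P_t1 <= P_t2 <= ... <= P_tK.
        cumulative, acc = [], 0.0
        for w in weights:
            acc += w / total
            cumulative.append(acc)
        r = rng.random()
        chosen = reps[-1]  # fallback guards against rounding in the last bin
        for rep, p in zip(reps, cumulative):
            if r < p:
                chosen = rep
                break
        trajectory.append((chosen, t))
    return trajectory
```

Calling this function $n - n_1$ times would yield the sampled part of the released data set; the $n_1$ generalized trajectories that contain the raw trajectories are produced separately and are not covered by this sketch.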

Privacy Analysis
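Lemma 1. Pseudo-code 1 satisfies $\varepsilon$-differential privacy.
Proof: The locations at timestamp $t$ are clustered into $K$ classes. Denote by $D_{tk} = \{l_{tk1}, l_{tk2}, \ldots, l_{tkn_{tk}}\}$ the set of locations in the $k$-th class. In Pseudo-code 1, the location $l_{tki}$ is sampled from $D_{tk}$ as the representative location with probability proportional to
$$\exp\left(\frac{\varepsilon\, u(D_{tk}, l_{tki})}{2\Delta u}\right),$$
where $u$ is the utility function of the location generalization stage and $\Delta u$ is its sensitivity. By the exponential mechanism (Definition 3), this sampling satisfies $\varepsilon$-differential privacy. Since the sets $D_{t1}, D_{t2}, \ldots, D_{tK}$ are disjoint, according to Theorem 2, Pseudo-code 1 satisfies $\varepsilon$-differential privacy.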
The first stage (Pseudo-code 1) of the TDDP method guarantees $\varepsilon$-differential privacy. In the second stage (Pseudo-code 2), the generalized trajectories are generated from the output of Pseudo-code 1. According to the post-processing immunity of differential privacy (Theorem 3), the TDDP method satisfies $\varepsilon$-differential privacy.

Time Complexity Analysis
The time complexity of the TDDP method is $O(TKn)$, for the following reasons. In Pseudo-code 1, the time cost comes mainly from clustering the locations of the original trajectories: the locations at each of the $T$ timestamps must be clustered into $K$ classes. After clustering, the distance between each location in a class and the center of that class is calculated, so the time complexity of this stage is $O(TKn_{tk})$, where $T$ is the number of timestamps, $K$ is the number of classes, and $n_{tk}$ is the number of locations in the $k$-th class at timestamp $t$. Since $n_{tk}$ is proportional to $n$, the time complexity of this stage is $O(TKn)$. In Pseudo-code 2, the time cost comes mainly from selecting a location at each timestamp to generate the released trajectories, so the time complexity of this stage is $O(TK)$. Based on the above analysis, the time complexity of the TDDP method is $O(TKn + TK) = O(TKn)$.

EXPERIMENT
In this section, the authors conduct experiments on a real data set to verify the effectiveness of the TDDP method. The real trajectory data set used in this paper is the T-Drive data set, which contains one-week trajectories of 10357 taxis in Beijing. The authors choose 500 trajectories from the same time period, 8:30 to 13:30, for the experiment. Each trajectory is refined to contain 30 timestamps. The authors validate the effectiveness of the TDDP method by comparing it with the NPGG and PGG methods (Liu et al., 2021). Each group of experiments is repeated 20 times and the results are averaged. In the experiment, the privacy budget $\varepsilon$ in Pseudo-code 1 takes different values, and the parameter $\theta$ in Pseudo-code 2 is set to 0.1. Data utility is measured from two aspects: the query distortion rate and the Hausdorff distance. Finally, mutual information is used to measure the total privacy loss.

Similarity
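The Hausdorff distance is used to measure the similarity between the raw trajectory data set $D$ and the generalized trajectory data set $\tilde{D}$. The smaller the Hausdorff distance between the two data sets, the higher the utility of the published trajectory data set $\tilde{D}$. The Hausdorff distance between $D$ and $\tilde{D}$ is
$$H(D, \tilde{D}) = \max\{h(D, \tilde{D}),\, h(\tilde{D}, D)\}, \qquad h(D, \tilde{D}) = \max_{tr \in D} \min_{\widetilde{tr} \in \tilde{D}} d(tr, \widetilde{tr}),$$
where $d(tr, \widetilde{tr})$ is the distance between two trajectories.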
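As a rough illustration of this utility measure, the following Python sketch computes the Hausdorff distance between two trajectory data sets. The function names are hypothetical, and the trajectory-level distance (mean Euclidean distance between time-aligned locations of equal-length trajectories) is an assumption made for the example; the text does not fix that choice.

```python
import math


def trajectory_distance(tr_a, tr_b):
    """ASSUMPTION: mean Euclidean distance between time-aligned locations."""
    return sum(math.dist(p, q) for p, q in zip(tr_a, tr_b)) / len(tr_a)


def directed_hausdorff(set_a, set_b, dist=trajectory_distance):
    """Largest distance from a trajectory in set_a to its nearest one in set_b."""
    return max(min(dist(a, b) for b in set_b) for a in set_a)


def hausdorff(set_a, set_b, dist=trajectory_distance):
    """Symmetric Hausdorff distance between two trajectory data sets."""
    return max(directed_hausdorff(set_a, set_b, dist),
               directed_hausdorff(set_b, set_a, dist))
```

The direct computation compares every pair of trajectories, so its cost grows with the product of the two data set sizes; for 500 trajectories per data set this remains small.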
Figure 2 shows that the Hausdorff distance of the trajectory data set released by the TDDP method is significantly smaller than that of the PGG method, which indicates that the TDDP method preserves the utility of the published trajectory data. Although the Hausdorff distance of the trajectory data set released by the TDDP method is higher than that of the NPGG method, the NPGG method provides no privacy protection in the location generalization stage and uses cluster centers as representative locations, which may generate non-semantic locations.

Query Distortion Rate
The purpose of trajectory data publishing is to allow the data to be queried and analyzed, so utility can be measured by comparing query results on the original data set $D$ with those on the published data set $\tilde{D}$. Two kinds of spatio-temporal range queries (Abul et al., 2008), denoted $Q_1$ and $Q_2$, are used to measure the utility of the released trajectories. The query distortion rate, which compares the result of a query on $D$ with its result on $\tilde{D}$, is used as the metric to evaluate the utility of the published data $\tilde{D}$. The authors randomly generate query circles with radii of 0.5 km and 1 km and perform 1000 queries for each radius. Figures 3 and 4 show the query distortion rates of $Q_1$ and $Q_2$, respectively. As can be seen, the query distortion rates of the TDDP and PGG methods are relatively close, and the query distortion rate of the NPGG method is sometimes slightly higher. The experimental results on the query distortion rate demonstrate the effectiveness of the TDDP method.
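To make the idea of a query distortion rate concrete, the following Python sketch evaluates one circular range query on a raw and a published data set. Everything specific here is an assumption for illustration: the query form (count trajectories that enter the circle at least once), the relative-error formula $|Q(D) - Q(\tilde{D})| / \max(Q(D), Q(\tilde{D}))$, the planar coordinates, and the function names are hypothetical and are not taken from the paper.

```python
import math


def count_in_circle(dataset, center, radius):
    """Count trajectories that visit the query circle at least once
    (a hypothetical query form used only for this example)."""
    return sum(any(math.dist(p, center) <= radius for p in tr) for tr in dataset)


def distortion_rate(raw, published, center, radius):
    """ASSUMED metric: |Q(D) - Q(D~)| / max(Q(D), Q(D~)); 0 if both counts are 0."""
    a = count_in_circle(raw, center, radius)
    b = count_in_circle(published, center, radius)
    return 0.0 if max(a, b) == 0 else abs(a - b) / max(a, b)
```

Averaging `distortion_rate` over many randomly placed circles of a fixed radius would give one point on a curve like those reported for the two query types.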

Mutual Information
Mutual information is used to measure the total privacy loss of the different methods. Mutual information is a standard measure in information theory, and privacy, as a kind of information, can be quantified by information entropy. Here mutual information measures the interdependence between two data sets (Peng et al., 2016). The privacy parameter $\varepsilon$ takes different values, the mutual information of the three methods is compared, and the results are shown in Figure 5. The experimental results show that the mutual information of the TDDP method lies between that of NPGG and PGG. The mutual information of the TDDP method is slightly higher than that of the PGG method, which means that its privacy loss is higher; however, the TDDP method avoids generating non-semantic locations and resists filtering attacks.
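For reference, the empirical mutual information of two paired discrete sequences can be computed as in the Python sketch below. The function name is hypothetical, and how the raw and published data sets are turned into paired discrete symbols (for example, class or grid-cell identifiers at matching timestamps) is an assumption; the text does not spell out that step.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits for paired discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # I(X; Y) = sum over (x, y) of p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())
```

A value of 0 means the published symbols reveal nothing about the raw ones, while larger values indicate stronger dependence and therefore higher privacy loss.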

CONCLUSION
In this paper, the authors propose the TDDP method for trajectory data publishing based on differential privacy. The method does not require the trajectories to share many identical prefixes or n-grams, nor does it require them to have the same length. In addition, the representative locations generated by the TDDP method are sampled from the original locations, which avoids generating non-semantic representative locations and ensures that the generalized trajectories can resist filtering attacks. The experimental results show that the TDDP method achieves a good balance between privacy protection and data availability. However, this work also has shortcomings. For example, it does not consider the different privacy requirements of different locations in a trajectory, although sometimes only a few sensitive locations in the trajectory need to be protected. In the future, the authors will study how to protect locations on trajectories with different privacy requirements.
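In Definition 2, $\mathbb{R}^d$ denotes the real number space to which $f$ maps, and $d$ denotes the query dimension of the function $f$.
Definition 3 (Exponential Mechanism) (McSherry et al., 2007) For any utility function $u$ with sensitivity $\Delta u$, a mechanism that outputs a candidate $o$ with probability proportional to $\exp\left(\varepsilon\, u(D, o) / (2\Delta u)\right)$ satisfies $\varepsilon$-differential privacy.
Theorem 1 (Sequential Composition) (Dwork et al., 2014) Let $M_1, M_2, \ldots, M_n$ be a series of privacy algorithms with privacy budgets $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$. For the same data set $D$, the combined algorithm $M(M_1(D), M_2(D), \ldots, M_n(D))$ satisfies $\sum_{i=1}^{n} \varepsilon_i$-differential privacy.
In Pseudo-code 1, the utility function $u$ assigns a score value to each location within each class. The sensitivity of the utility function is $\Delta u = 1$. In the set $D_{tk}$, one location $l_{tki}$ is sampled as the representative location with probability proportional to $\exp\left(\varepsilon\, u(D_{tk}, l_{tki}) / (2\Delta u)\right)$. According to Definition 3, the process of sampling the location $l_{tki}$ from $D_{tk}$ satisfies $\varepsilon$-differential privacy. The sets $D_{t1}, D_{t2}, \ldots, D_{tK}$ are disjoint.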

Figure 1. Differentially private release of the raw trajectories