Random Forest Algorithm Based on Linear Privacy Budget Allocation

In the era of big data, with exponential growth in data volume, reducing data security risks such as the data leakage that machine learning can cause has become a hot research area. Existing privacy budget allocation strategies are usually suitable only for specific application settings and cannot meet users' personalized needs for privacy budget allocation. Therefore, a linear privacy budget allocation strategy is proposed. From the root of the decision tree to the bottom, the strategy assigns each layer a linearly increasing privacy budget controlled by a coefficient and a constant term. Combining this strategy with the random forest algorithm yields a random forest algorithm based on linear privacy budget allocation (DiffPRF_linear). Experimental results show that the proposed algorithm can reproduce the effects of the uniform, arithmetic, and geometric privacy budget allocation policies and can also achieve better classification results than those policies, which not only meets users' needs for personalized protection of private data but also maintains high classification accuracy.

Data leakage and other data security problems pose significant risks to users, developers, and society (Siau et al., 2020). Aiming to reduce the risk of privacy disclosure (Dumbill et al., 2013; Meng et al., 2013), researchers have proposed privacy protection technologies such as anonymization (Sweeney, 2002; Machanavajjhala et al., 2007; Li et al., 2007; Xiao et al., 2007), cryptography (Clifton et al., 2002; Rothe, 2002; Jiang et al., 2006; Ishai et al., 2006), differential privacy (Dwork, 2006; Dwork, 2008) and blockchain technology (Turesson et al., 2021). Among them, differential privacy is currently the mainstream privacy protection technology. Because noise is added to the original data, it is difficult for attackers to use background knowledge to infer individuals' sensitive attributes, so the disclosure of personal privacy is controlled within a small range (Li et al., 2012; Xiong et al., 2014).
Adjusting the magnitude of the privacy budget offers different degrees of protection for data and affects data availability (McSherry et al., 2007). Dwork et al. (2012) first proposed a uniform allocation strategy, i.e., the privacy protection budget is evenly distributed to each layer of the tree structure. However, such a scheme wastes much of the privacy budget, uses it inefficiently, and is relatively fixed and unadjustable. Cormode et al. (2012) proposed a geometric budget allocation strategy, namely increasing the privacy budget geometrically from the root node, but the query error may be proportional to the number of leaves in the query. Wang et al. (2016) put forward an adaptive privacy budget allocation strategy to provide privacy-protected statistics published over unlimited timestamps, but this strategy can only be applied to specific data. Another study proposed a p-series privacy budget allocation method against the unlimited attacks that attackers may launch; however, when there are too many iterations, the privacy budget tends to zero, resulting in poor data availability. For the real-time location protection of users, Li et al. (2017) proposed a privacy budget allocation strategy that adapts to the distribution of the underlying data. However, when the total number of users is too large, the privacy budget allocated at each timestamp is too small, resulting in the addition of too much noise. Wang et al. (2018) proposed privacy budget allocation methods based on arithmetic and geometric sequences, but they are vulnerable to the influence of parameters and are not flexible enough. Ke et al. (2021) proposed a dynamic privacy allocation mechanism based on the different contributions of different input features to the model's output, but the accuracy of the model is low and a balance between model utility and privacy protection is not reached. Blum et al. (2005) first combined differential privacy with a single learner, proposed the SuLQ-based ID3 algorithm, and applied Laplace noise. However, because excessive noise was added, the accuracy of the classification results decreased significantly compared with the noise-free case. Since a single learner can no longer satisfy the desire for a model that is stable and performs well in all respects, Breiman (2001) proposed the random forest algorithm. Patil et al. (2014) first combined the random forest algorithm with differential privacy and used the ID3 algorithm to construct the subtrees of the random forest. However, this method can only handle discrete attributes and must preprocess continuous features before use. The DiffPRFs algorithm proposed by Mu et al. (2016) eliminates the dependence on discrete datasets and selects split points and split attributes through the exponential mechanism. However, each iteration requires two calls to the exponential mechanism, which consumes too much of the privacy budget. Li et al. (2020) further proposed the RFDPP-Gini algorithm, using the CART classification tree as the subtree and invoking the exponential mechanism only once when processing continuous features, to improve the efficiency of the privacy protection budget. However, since the selected privacy budget allocation scheme is uniform allocation, the utilization rate of the privacy budget is low.
The authors present DiffPRF_linear, a random forest algorithm based on linear privacy budget allocation. Through the linear allocation strategy, users can flexibly adjust the coefficient or constant term to allocate the privacy budget of each layer and can also reproduce the effects of the uniform, arithmetic, and geometric privacy budget allocation strategies. In addition, the algorithm retains high classification accuracy while protecting personal privacy.

Differential privacy protection technology
This technology was first applied in the field of database security. Attackers cannot carry out differential attacks because, when querying two datasets that differ in only one record, the probabilities of obtaining the same result are very similar.

Definition and Related Concepts of Differential Privacy
Let datasets D and D′ have the same attribute structure. D △ D′ denotes their symmetric difference, and |D △ D′| denotes the number of records in D △ D′. If |D △ D′| = 1, D and D′ are called adjacent datasets.

Definition 1. Differential Privacy (Dwork, 2006). Given a random algorithm M, let P_M be the set of all possible outputs of M. For any two adjacent datasets D and D′ and any subset S_M of P_M, if algorithm M satisfies

Pr[M(D) ∈ S_M] ≤ e^ε · Pr[M(D′) ∈ S_M]

then algorithm M provides ε-differential privacy protection, where the parameter ε is called the privacy protection budget, which bounds the change in the output probability of algorithm M when one record is added to or removed from the dataset, and Pr[·] denotes the probability of an event.
Definition 2. Local Sensitivity (Yang et al., 2019). Given a query function f: D → R^d, for the datasets D′ adjacent to a given dataset D, the local sensitivity of f on D is

LS_f(D) = max_{D′} ‖f(D) − f(D′)‖₁

where ‖·‖₁ denotes the Manhattan (L1) distance.
Differential privacy protection technology has flexible composition characteristics, namely serial combination and parallel combination.

Definition 3. Serial Combination (Dwork et al., 2006). Given n random algorithms K₁, K₂, …, K_n, where K_i satisfies ε_i-differential privacy, then for the same dataset the combined algorithm (K₁, K₂, …, K_n) satisfies (Σᵢ ε_i)-differential privacy. This characteristic shows that, on the same dataset, if each algorithm satisfies ε_i-differential privacy, then the combined algorithm provides privacy protection equal to the sum of the privacy budgets of all component algorithms.

Definition 4. Parallel Combination (Dwork et al., 2006). Given n random algorithms K₁, K₂, …, K_n, where K_i satisfies ε_i-differential privacy and any two of the datasets they process do not intersect, the combined algorithm satisfies (maxᵢ ε_i)-differential privacy. This characteristic shows that if the sub-datasets processed by the algorithms are pairwise disjoint, then the privacy protection level of the combined algorithm is determined by the component with the weakest protection level, namely the largest budget.
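As a concrete illustration, the two combination properties reduce to simple budget arithmetic. The following sketch (the helper names are ours, not from the paper) computes the total privacy cost under each rule:

```python
def sequential_budget(epsilons):
    """Serial combination: algorithms run on the SAME dataset add their budgets."""
    return sum(epsilons)

def parallel_budget(epsilons):
    """Parallel combination: algorithms on DISJOINT subsets cost the maximum budget."""
    return max(epsilons)

budgets = [0.1, 0.3, 0.2]
print(sequential_budget(budgets))  # ≈ 0.6
print(parallel_budget(budgets))    # 0.3
```

This is why DiffPRF_linear can split its total budget across trees and layers: disjoint nodes within a layer share one budget, while the layers of a single tree must sum to the tree's budget.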

Mechanism
Differential privacy technology usually realizes data privacy protection through the Laplace mechanism and the exponential mechanism. Numerical data are usually protected by the Laplace mechanism, and non-numerical data by the exponential mechanism.
The Laplace mechanism realizes differential privacy protection by adding random noise, drawn from the Laplace distribution, to query results. With scale parameter λ and location parameter 0, the Laplace distribution Lap(λ) has the probability density function

p(x) = (1 / 2λ) · exp(−|x| / λ)

The Laplace distribution at different scale parameters λ is shown in Figure 1.

Definition 5. Laplace Mechanism (Dwork et al., 2006). Given a dataset D, a privacy budget ε, and a query function f(D) with sensitivity Δf, the random algorithm M(D) = f(D) + Y provides ε-differential privacy protection, where Y is random noise obeying the Laplace distribution with scale parameter Δf/ε.

The overall idea of the Laplace mechanism is to output values that deviate from the true result with a certain probability. The flatter the Laplace density curve is, the higher the probability of outputting an outlier, i.e., a value that deviates stably from the real value.
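The Laplace mechanism can be sketched in a few lines. The example below is a minimal illustration (the function names are hypothetical) for a count query, whose sensitivity is 1; Laplace noise with scale Δf/ε is drawn via the inverse-CDF transform:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Lap(0, scale) via the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_count(true_count, epsilon, sensitivity=1.0, seed=0):
    """Noisy count query: adding or removing one record changes a count by
    at most 1, so Δf = 1 and the Laplace scale is Δf/ε. Smaller ε means a
    larger scale, a flatter density, and a noisier output."""
    rng = random.Random(seed)
    return true_count + laplace_noise(sensitivity / epsilon, rng)

noisy = laplace_count(1000, epsilon=0.5)
```

With ε = 0.5 the noise scale is 2, so the reported count typically deviates from 1000 by only a few units, while any single record's presence is hidden.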

Figure 1. Laplace distributions with different scale parameters λ
Different from the Laplace mechanism, the exponential mechanism achieves differential privacy not by simply adding noise to the query results but by introducing a scoring function that assigns a score to each possible output; the normalized scores serve as the probabilities with which the query returns each output, i.e., the elements of the discrete set {R₁, R₂, …, R_N} are output probabilistically.
Definition 6. Exponential Mechanism (McSherry, 2007). Given a random algorithm M with input dataset D and output an entity object R_i, let q(D, R_i) be the availability function that evaluates the quality of output R_i, and let Δq be the sensitivity of q. If algorithm M selects and outputs R_i with probability proportional to exp(ε · q(D, R_i) / (2Δq)), then M provides ε-differential privacy protection. The overall idea of the exponential mechanism is that, after receiving a query, it does not output a result deterministically but uses the scoring function to determine the probability that each output value R_i is selected, thereby achieving differential privacy. The higher the scoring function value, the higher the probability of being selected for output; the lower the value, the lower the probability.
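The exponential mechanism amounts to weighted sampling over candidate outputs. The following minimal sketch (the names and toy Gini values are hypothetical) draws each candidate R_i with probability proportional to exp(ε · q(D, R_i) / (2Δq)); since a smaller Gini coefficient is better, the score is taken as the negative Gini:

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity, seed=0):
    """Return one candidate with probability proportional to
    exp(ε·score/(2Δq)); higher scores are exponentially more likely."""
    weights = [math.exp(epsilon * score(c) / (2.0 * sensitivity)) for c in candidates]
    r = random.Random(seed).random() * sum(weights)
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]

# Toy Gini value per candidate split; smaller Gini is better, so score = -Gini.
splits = {"A": 0.48, "B": 0.30, "C": 0.45}
chosen = exponential_mechanism(list(splits), lambda s: -splits[s],
                               epsilon=1.0, sensitivity=1.0)
```

As ε grows, the selection concentrates on the best-scoring candidate; as ε shrinks, the choice approaches uniform, trading accuracy for privacy.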

RANDOM FOREST
Random forest, proposed by Breiman (2001), is an ensemble learning algorithm that combines several weak learners into a strong one; its base learner is a decision tree. In this paper, the CART classification tree is used as the base decision tree of the random forest.

CART (Classification and Regression Tree) Decision Tree
The CART tree is a binary tree that handles both classification and regression. Fearn et al. (2006) used the squared-error minimization criterion to generate a regression tree for continuous data. For discrete data, the Gini coefficient minimization criterion is used to generate a classification tree; the smaller the Gini coefficient, the more significant the feature.
Definition 7. Gini Coefficient (Breiman, 1984). Assume there are n categories and the probability of the i-th category is p_i. The Gini coefficient of the probability distribution is

Gini(p) = Σᵢ₌₁ⁿ p_i(1 − p_i) = 1 − Σᵢ₌₁ⁿ p_i²

For the two-class problem, if p is the probability that a sample belongs to the first category, the Gini coefficient of the probability distribution is

Gini(p) = 2p(1 − p)

For a sample set D with |D| samples, where the i-th category contains |C_i| samples, the Gini coefficient of D is

Gini(D) = 1 − Σᵢ₌₁ⁿ (|C_i| / |D|)²

If a value a of feature A divides D into D₁ and D₂, then under the condition of feature A, the Gini coefficient of sample set D is

Gini(D, A) = (|D₁| / |D|) · Gini(D₁) + (|D₂| / |D|) · Gini(D₂)
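The sample-set and conditional Gini expressions can be checked with a short sketch (the helper names are ours):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - Σ (|C_i|/|D|)² over the class counts |C_i|."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Gini(D, A) = |D1|/|D|·Gini(D1) + |D2|/|D|·Gini(D2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini([0, 0, 1, 1]))  # 0.5 (maximally impure two-class node)
print(gini([0, 0, 0, 0]))  # 0.0 (pure node)
```

A split that separates the classes perfectly drives the conditional Gini to zero, which is why CART chooses the split with the smallest value.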

CART Classification Tree Algorithm
In practice, each dataset sample usually includes continuous attributes and discrete attributes. The CART classification tree has different processing methods for different types of attributes.

Continuous Attribute Processing by the CART Classification Tree Algorithm
When facing continuous attributes, they first need to be discretized. Assume there are m samples in the dataset and that continuous feature A takes m values, sorted from smallest to largest as a₁, a₂, …, a_m. The average of each pair of adjacent values is taken as a candidate dividing point, giving m − 1 dividing points in total, where the i-th dividing point is

T_i = (a_i + a_{i+1}) / 2

The Gini coefficients of the m − 1 points are calculated, and the point with the lowest Gini coefficient is selected as the binary discretization point of the continuous feature. The dataset of this node is then divided into two parts on the basis of this point.
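The midpoint construction T_i = (a_i + a_{i+1}) / 2 can be sketched as:

```python
def candidate_split_points(values):
    """Midpoints of adjacent sorted values, T_i = (a_i + a_{i+1}) / 2,
    giving m - 1 candidate thresholds for m samples."""
    s = sorted(values)
    return [(s[i] + s[i + 1]) / 2.0 for i in range(len(s) - 1)]

print(candidate_split_points([3, 1, 4, 2]))  # [1.5, 2.5, 3.5]
```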

Discrete Attribute Processing by the CART Classification Tree Algorithm
When facing discrete attributes, the Gini coefficient of each value of the discrete attributes available at the current node is first calculated, and the feature with the smallest Gini coefficient and its corresponding value are selected as the optimal feature and optimal split point. The dataset of this node is then divided into two parts on the basis of this point.

Algorithm 1. CART classification tree generation algorithm
Input: Training dataset D, Gini coefficient threshold Thr_gini, sample number threshold Thr_sample, feature set A.
Output: CART classification tree.
1: Initialize Thr_gini, Thr_sample.
2: Starting from the root node, recursively perform the following for each node until the number of node samples is less than Thr_sample, or the Gini coefficient of the sample set is less than Thr_gini.
3: Let the training dataset of the node be D_n.
4: for each feature A_i in A
5: For each possible value a, calculate the Gini coefficient when A_i = a.
6: Take the value with the minimum Gini coefficient as the optimal splitting point of A_i.
7: end for
8: Take the feature with the smallest Gini coefficient and its corresponding value, and divide the training dataset into two subnodes according to that feature and value.
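Combining the midpoint candidates with the Gini criterion, the core of the continuous-feature split in Algorithm 1 can be sketched as follows (a non-private illustration with hypothetical helper names; the noise-adding steps of the later differentially private algorithms are omitted):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - Σ (|C_i|/|D|)² over the class counts."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Scan the m-1 midpoints of the sorted feature values and return the
    threshold with the smallest weighted Gini coefficient of the two halves."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    best_g, best_t = float("inf"), None
    for i in range(len(xs) - 1):
        t = (xs[i] + xs[i + 1]) / 2.0
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        g = len(left) / len(pairs) * gini(left) + len(right) / len(pairs) * gini(right)
        if g < best_g:
            best_g, best_t = g, t
    return best_t

# Perfectly separable toy data: the best cut sits between 2 and 10.
print(best_threshold([1, 2, 10, 11], [0, 0, 1, 1]))  # 6.0
```

In DiffPRF_linear, this deterministic argmin is replaced by an exponential-mechanism draw over the candidate thresholds, which is what consumes part of the node's privacy budget.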

Random Forest Algorithm
The core of random forest is sample randomness and attribute randomness. During classification, each subtree votes, and the class with the most votes is output as the final result.

Algorithm 2. Random forest generation algorithm based on the CART classification tree
Input: Training dataset D, number of CART classification trees T, depth of CART classification tree h, number of features selected randomly during splitting m, minimum splitting threshold Thr, and feature set F.
Output: Random forest composed of T CART classification trees.
1: Initialize T, h, m, Thr.
2: for i = 1 to T
3: Randomly draw N training samples with replacement from the training dataset D to obtain the training set D_t.
4: if (the minimum splitting threshold Thr is reached or the maximum decision tree depth h is reached)
5: Generate CART classification tree T_i.
6: else
7: Randomly select m features from feature set F, and select the optimal split feature as the node.
8: end if
9: end for

Linear Privacy Budget Allocation Strategy
To meet users' need to individually adjust the privacy budget and reasonably protect personal privacy, this paper puts forward a linear privacy budget allocation strategy. A full binary tree (Cormen et al., 2001) is taken as an example; assume the depth of the tree is h, the layer containing the root node is the first layer, and the leaf nodes are in the h-th layer, as shown in Figure 2.
From the layer where the root node is located, privacy budgets ε₁, ε₂, …, ε_h are allocated to the layers of a decision tree of depth h. From the second layer on, the budget allocated to each layer satisfies the linear recurrence

ε_i = q · ε_{i−1} + d,  i = 2, 3, …, h

where q is the coefficient and d is the constant term. Unrolling the recurrence (for q ≠ 1) gives

ε_i = q^{i−1} · ε₁ + d · (q^{i−1} − 1) / (q − 1)

Under the per-tree constraint Σᵢ₌₁ʰ ε_i = ε_T and the requirement ε₁ > 0, solving for ε₁ determines the privacy protection budget allocated to every node in the i-th layer of the decision tree; since the sample sets of the nodes in one layer are disjoint, by parallel combination all nodes in a layer can share that layer's budget.
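The linear allocation ε_i = q·ε_{i−1} + d can be computed by unrolling the recurrence and solving for ε₁ from the per-tree constraint Σε_i = ε_T. A minimal sketch (the function name is ours); it also shows how the special cases fall out: q = 1 with d = 0 gives uniform allocation, q = 1 with d > 0 gives arithmetic allocation, and d = 0 with q > 1 gives geometric allocation:

```python
def linear_budgets(eps_tree, h, q=1.0, d=0.0):
    """Per-layer budgets ε_1..ε_h with ε_i = q·ε_{i-1} + d for i >= 2,
    where ε_1 is solved from the constraint Σ ε_i = eps_tree."""
    # Unroll the recurrence: ε_i = a_i·ε_1 + b_i.
    a, b = [1.0], [0.0]
    for _ in range(h - 1):
        a.append(q * a[-1])
        b.append(q * b[-1] + d)
    eps1 = (eps_tree - sum(b)) / sum(a)
    if eps1 <= 0:
        raise ValueError("d is too large for this eps_tree and h: ε_1 would be non-positive")
    return [ai * eps1 + bi for ai, bi in zip(a, b)]

print(linear_budgets(1.0, 4, q=1.0, d=0.1))  # arithmetic: [0.1, 0.2, 0.3, 0.4]
```

The constraint check mirrors the paper's observation that an overly large coefficient or constant squeezes the first layer's budget toward zero.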

DiffPRF_linear Algorithm Description
The linear privacy budget allocation strategy has excellent combinability: multiple allocation schemes can be obtained by combining the coefficient q and the constant term d. Therefore, the DiffPRF_linear algorithm is proposed, combining the linear privacy budget allocation strategy with the random forest algorithm to meet users' personalized demand for privacy protection.

Algorithm 3. Generation algorithm of a random forest satisfying ε-differential privacy (DiffPRF_linear)
6: Randomly draw n training samples with replacement from D to obtain the training set D_n.
7: Recursively execute the following steps to build a decision tree T_i.
8: if (the classification of all samples in the node is consistent, or the number of samples is smaller than the minimum required for splitting, or the privacy budget B is exhausted, or the maximum depth h is reached)
9: Select the category with the largest count in the sample as the label of the leaf node.
10: else
11: Set the current node as a decision node.
12: Calculate the number of samples of each category at the decision node, and add Laplace noise.
13: if (D_n has continuous features)
14: Randomly select several features from F.
15: if (there are m continuous features)
16: Allocate the privacy budget equally to each continuous feature, reserving one share for the discrete features: ε′ = ε_N / (m + 1).
17: Calculate the value of the scoring function for each candidate and select the best continuous feature and split point through the exponential mechanism, i.e., with probability proportional to exp(ε′ · q(D_c, R_i) / (2Δq)), where the score q is based on the Gini coefficient (the smaller the Gini coefficient, the better), Δq is the sensitivity of the scoring function, and D_c is the training set on the current node.
18: Calculate the Gini coefficients of the remaining discrete features and compare them with that of the best continuous feature. Take the smallest Gini coefficient as the optimal split feature and optimal split point, and split this node.
19: else
20: Calculate the Gini coefficients of the discrete features, take the smallest Gini coefficient as the optimal split feature and optimal split point, and split the node according to this feature and value.
21: end if
22: else
23: Calculate the Gini coefficients of the discrete features, take the smallest Gini coefficient as the optimal split feature and optimal split point, and split the node according to this feature and value.
24: end if
25: end if
26: end for

After establishing a random forest satisfying ε-differential privacy through Algorithm 3, the process of classifying the test set is described in Algorithm 4.

Algorithm 4. Algorithm for classifying test set samples with a random forest satisfying ε-differential privacy
Input: Test set S, random forest satisfying ε-differential privacy.
Output: Test set classification results.
1: for each sample in S
2: for i = 1 to T
3: Starting from the root node of the i-th tree, determine from the decision at the current node which branch the sample enters, until it reaches a leaf node.
4: Obtain the classification result of the i-th tree.
5: end for
6: Take the mode of the classification results of all trees as the classification result for the sample.
7: end for
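The majority vote at the heart of Algorithm 4 can be sketched as follows (the stub trees stand in for trained CART subtrees; all names are hypothetical):

```python
from collections import Counter

def forest_predict(sample, trees):
    """Each tree votes; the mode of the votes is the forest's prediction.
    `trees` is any iterable of callables mapping a sample to a class label."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical stub trees standing in for trained subtrees.
trees = [lambda s: ">50K", lambda s: "<=50K", lambda s: ">50K"]
print(forest_predict({}, trees))  # >50K
```

Because the noise added during training perturbs each tree independently, the vote tends to average out individual trees' noisy mistakes.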

Algorithm Analysis
All the privacy budget consumed in the DiffPRF_linear algorithm conforms to ε-differential privacy. The total privacy budget B is evenly distributed to the T decision trees, ε_T = B / T, and within each tree a linearly increasing privacy budget ε_i = q · ε_{i−1} + d is distributed to each layer from top to bottom. When q = 1 and d = 0, the strategy is equivalent to uniform allocation; when q = 1 and d > 0, it is equivalent to arithmetic allocation, i.e., the budget increases by the constant d from layer to layer; when d = 0 and q > 1, it is equivalent to geometric allocation, i.e., the privacy budget allocated at each layer from root to leaf increases by a constant factor q. Therefore, the analysis shows that the linear privacy budget allocation strategy is more flexible and offers greater combination diversity.

The Experimental Data
The programming language used in this experiment is Python 3.8. The selected datasets are the Adult and Bank Marketing datasets from the publicly available UCI machine learning repository. The Adult dataset has 32,561 samples, each with 15 attributes: 14 features and 1 classification result. Analysis of the attribute set found 5,057 samples of non-American nationality, accounting for only 16% of the total; therefore, only the records of American nationality were selected for training and testing in this experiment, covering 6 continuous attributes and 7 discrete attributes. The classification attribute 'income level' is divided into "≤50K" and ">50K". The Bank Marketing dataset has 41,188 samples, each with 21 attributes: 20 features and 1 classification result. The attribute set contains 10 continuous attributes and 10 discrete attributes. The classification attribute 'has the client subscribed a term deposit' is divided into two categories: "yes" and "no". In the following experiments, 70% of the Adult and Bank Marketing datasets were used as training sets and 30% as test sets.

Experimental Results and Analysis
With ε_T = 0.2, 0.4, 0.8, 1, 2, 4, 8, T = 25, and h = 9, the classification accuracy of the DiffPRF_linear algorithm was calculated under different combinations of parameter q and parameter d, with q₁ = 1.1, q₂ = 1.2, q₃ = 1.3, and q₄ = 2. Each group of experiments with the same parameters was repeated 15 times and the average was taken, as shown in Figures 3 and 4.
As seen from Figures 3 and 4, as ε_T increases, the classification accuracy of the DiffPRF_linear algorithm occasionally decreases to varying degrees, but the overall trend is upward. From Figures 3(a)–3(c) and 4(a)–4(c), when ε_T ≤ 0.8, the classification accuracy of the DiffPRF_linear algorithm increases, and the growth rate is fast. When ε_T > 0.8, the classification accuracy occasionally dips but tends to increase gently. This is because, when ε_T is too small, the privacy budget allocated to each node is extremely small; Δf/ε is then extremely large, i.e., excessive noise is added. Based on the Laplace density curves in Figure 1, when λ = Δf/ε is too large, the curve flattens toward a straight line and the probability of outputting an outlier is extremely high, resulting in extremely low classification accuracy. As ε_T gradually increases, Δf/ε gradually decreases, i.e., the added noise gradually decreases; according to Figure 1, the probability of outputting an outlier falls, and the classification accuracy of the DiffPRF_linear algorithm fluctuates occasionally but still tends to grow slowly. Observing Figures 3(d) and 4(d), when q = q₄, the classification accuracy of the DiffPRF_linear algorithm is generally lower than in Figures 3(a)–3(c) and 4(a)–4(c) under combinations with the same d values. This is because the total privacy budget ε_T of each subtree is fixed; under the linear allocation strategy, the budget allocated to each layer increases linearly over the layer above it, so when the coefficient is too large, the privacy budget distributed to the first layer of the tree becomes extremely small.
The choice q₄ = 2 in Figures 3(d) and 4(d) causes the added noise to be too large and the data availability to be extremely low, so the classification accuracy is low. However, as ε_T increases, the privacy budget distributed to each decision tree increases, so the budget distributed to the first layer also increases, the noise level decreases, and data availability improves. Therefore, the classification accuracy of the DiffPRF_linear algorithm gradually improves.
In addition, d = d₁, d₂, q = q₁, q₂, ε_T = 1, and h = 5, 6, 7, 8, 9, 10 are selected to compare the classification accuracy of the DiffPRF_linear algorithm with the uniform, arithmetic, and geometric privacy budget allocation methods under different decision tree depths. Line charts drawn from Tables 1 and 2 are shown in Figures 5 and 6, where d = d₁ with q = q₁ gives the coefficients of DiffPRF_linear3, and d = d₂ with q = q₁ gives the coefficients of DiffPRF_linear4. As seen from Figures 5 and 6, owing to the strong combinability of the linear privacy budget allocation strategy, the classification accuracy results become more diverse, i.e., multiple combinations of results exist. From Figure 5, for h = 5, 6, 7, and 9, the linear privacy budget allocation method achieves better classification results on the Adult dataset than the uniform, arithmetic, and geometric methods. From Figure 6, for h = 5, 6, 7, 8, and 9, the linear method achieves better classification results on the Bank Marketing dataset than the uniform, arithmetic, and geometric methods. However, when the tree is too deep, the classification accuracy of the DiffPRF_linear algorithm decreases as the depth increases. This is because an excessive tree depth leaves less privacy budget for each layer of the decision tree; the smaller the budget, the larger the added noise, so the classification accuracy drops significantly. Table 3 summarizes the characteristics of the different methods under differential privacy. In conclusion, the DiffPRF_linear algorithm is more flexible and offers a variety of combinations.
By changing the values of d and q, in addition to reproducing the effects of the uniform, arithmetic, and geometric privacy budget allocation strategies, a better classification effect can also be achieved by combining the coefficient q and the constant term d. This provides users with more options for choosing an appropriate privacy budget.

CONCLUSION
This paper puts forward a random forest algorithm based on linear privacy budget allocation: DiffPRF_linear. Through the linear privacy budget allocation strategy, users can realize personalized privacy budget allocation, with more schemes from which to reasonably choose the privacy budget. The strategy can reproduce the uniform, arithmetic, and geometric privacy budget allocations at the same time and can be combined with a random forest algorithm to classify data while protecting privacy. The experimental results indicate that the DiffPRF_linear algorithm achieves both good privacy protection and high classification accuracy. Of course, when the coefficient q is too large, the privacy budget allocated to each layer is too small, which makes the added noise too large and leads to poor classification results when ε is small. In future work, the algorithm will be further optimized, and the linear privacy budget allocation strategy will be combined with other machine learning algorithms to obtain better classification results.

Table 3. Comparison of different methods under differential privacy

RFDPP-Gini — Allocation: assigns the same privacy budget to each level of the sub-decision tree. Limitation: exponential-mechanism selection efficiency is low and is vulnerable to the influence of the tree fan-out.

DiffPRF-arithmetic — Allocation: starting from the root node's level, the privacy budget allocated to each sub-decision tree level increases by a constant d over the previous level. Limitation: the setting of the d value is limited by the tree depth h, and the classification effect is relatively average.

DiffPRF-geometric — Allocation: starting from the root node's level, the privacy budget allocated to each sub-decision tree layer is a constant q (q ≥ 1) times that of the previous layer. Limitation: when ε is fixed, a large q value leaves a very small privacy budget for the first layer of the decision tree and poor data availability.

DiffPRF_linear — Allocation: starting from the root node's level, the privacy budget allocated to each level of the sub-decision tree is a linear combination with constant d and coefficient q. Advantage: more flexible, with more combinations to choose from.