Limiting Attribute Disclosure in Randomization Based Microdata Release
- Author: Guo Ling, Ying Xiaowei, Wu Xintao
- Published: Journal of Computing Science and Engineering, Volume 5, Issue 3, pp. 169-182, 30 June 2011
-
ABSTRACT
Privacy preserving microdata publication has received wide attention. In this paper, we investigate the randomization approach and focus on attribute disclosure under linking attacks. We give efficient solutions to determine optimal distortion parameters, such that we can maximize utility preservation while still satisfying privacy requirements. We compare our randomization approach with l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.
-
KEYWORD
Algorithms, Privacy preservation, Randomization, Attribute disclosure, Linking attack
-
I. INTRODUCTION

Privacy preserving microdata publication has received wide attention [1-5]. Each record corresponds to one individual and has a number of attributes, which can be divided into the following three categories: 1) identity attributes (e.g., Name and SSN), whose values can uniquely identify an individual; 2) quasi-identifier (QI) attributes (e.g., demographic attributes such as ZIP code, age, and gender), whose values taken together can potentially identify an individual; and 3) sensitive attributes (e.g., disease and income), which indicate confidential information about individuals. Before microdata are released, identity attributes are often directly removed to preserve the privacy of the individuals whose data are in the released table. However, the
QI information may be used by attackers to link to other public data sets to obtain the private information of individuals. This is known as a linking attack in microdata publishing. Two types of information disclosure have been identified under linking attacks: identity disclosure and attribute disclosure [6]. Identity disclosure occurs if attackers can identify an individual from the released data. Attribute disclosure occurs when confidential information about an individual is revealed and can be attributed to that individual. Samarati and Sweeney [1] proposed the k-anonymity model and presented a generalization approach that divides tuples into QI-groups by transforming their QI-values into less specific forms, such that tuples in the same QI-group cannot be uniquely identified by attackers, to counter linking attacks based on quasi-identifiers. It was identified that k-anonymity is vulnerable to homogeneity and background knowledge attacks when the data in a QI-group lack diversity in the sensitive attributes [2]. l-diversity [2], as well as subsequent models (e.g., t-closeness [7]), were proposed to protect against attribute disclosure. l-diversity requires that the sensitive attribute has at least l well-represented values in each QI-group of the generalized dataset. The randomization approach has also been adopted to publish microdata [8-12]. Instead of generalizing
QI attribute values, the randomization approach distorts an original value to another domain value according to some distortion probabilities. The application of the randomization technique to preventing identity disclosure under linking attacks in data publishing was studied in [12], which focused on evaluating the risk of successfully linking a target individual to the index of his record given the values of the QI attributes. Our research moves one step further to investigate attribute disclosure under linking attacks. We focus on evaluating the risk of successfully predicting the sensitive attribute value of a target individual given his QI attribute values. We present a general randomization framework and give efficient solutions to determine optimal randomization parameters for both QI and sensitive attributes. Thus, we can maximize data utility while satisfying the privacy preservation requirements for sensitive attributes. We compare our randomization approach with other anonymization approaches within this framework (two representative approaches, l-diversity [2] and anatomy [5], are used in this paper). Our results show that the randomization approach can better recover the distribution of the original data set from the disguised one. Thus, intuitively, it might yield a disguised database with higher data utility than l-diversity generalization and anatomy.

Our contributions are summarized as follows. We present a systematic study of the randomization method in preventing attribute disclosure under linking attacks. We propose the use of a specific randomization model and present an efficient solution to derive distortion parameters that satisfy the privacy preservation requirements while maximizing data utility. We propose a general framework and present a uniform definition of attribute disclosure that is compatible with both randomization and generalization models. We compare our randomization approach with
l-diversity and anatomy in terms of utility preservation (under the same privacy requirements) from three aspects (reconstructed distributions, accuracy of answering queries, and preservation of correlations). Our empirical results show that randomization incurs significantly smaller utility loss.

The remainder of this paper is organized as follows. In Section II, we discuss closely related work on group based anonymization approaches and randomization approaches in privacy preservation data publishing. In Section III, we present background on randomization based distortions, including analysis of and attacks on the randomized data. In Section IV, we first quantify attribute disclosure under linking attacks and then show our theoretical results on maximizing utility under privacy constraints. In Section V, we conduct empirical evaluations and compare the randomization based distortion with two representative group based anonymization approaches (l-diversity [2] and anatomy [5]). We conclude our work in Section VI.

II. RELATED WORK

> A. Group Based Anonymization
k-anonymity was proposed [1] to counter linking attacks in data publishing. The method generalizes the values of quasi-identifier attributes to less-specific ones, so that each individual cannot be distinguished from at least k − 1 other individuals based on quasi-identifier information. There has been much work on designing efficient algorithms for k-anonymity using generalization and suppression techniques [2, 3, 5, 13, 14]. k-anonymity provides protection against identity disclosure, but it contributes little against attribute disclosure. l-diversity [2] was proposed to counter attribute disclosure. A table is l-diverse if, in each QI-group, at most 1/l of the tuples possess the most frequent sensitive value. Thus, an individual can be linked to his sensitive value correctly with probability at most 1/l. Similarly, the notion of t-closeness [7] requires that the distribution of a sensitive attribute in any equivalence class be close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). The anatomy method was proposed to protect against attribute disclosure by breaking the link between quasi-identifiers and sensitive attributes [5]. Anatomy releases all the quasi-identifier and sensitive values directly in two separate tables: a quasi-identifier table (QIT) and a sensitive table (ST). The authors presented an algorithm to compute anatomized tables that satisfy the l-diversity requirement while minimizing the error of reconstructing the data. Kifer and Gehrke [15] investigated the problem of injecting additional information into k-anonymous and
l-diverse tables to improve the utility of published data. Koudas et al. [16] studied the problem of preserving privacy through sensitive attribute permutation and generalization, with the addition of fake sensitive attribute values. Narayanan and Shmatikov [17] presented a de-anonymization methodology for sparse multi-dimensional microdata, such as individual transactions, preferences, and so on. The de-anonymization algorithm could be used by attackers against data sets containing anonymized high-dimensional data. Brickell and Shmatikov [18] compared the privacy loss (defined by certain kinds of information learned by attackers) with the utility gain (defined as the same kinds of information learned by analysts) caused by data anonymization, and concluded that "even modest privacy gains require almost complete destruction of the data mining utility" in generalization methods. Li and Li [19] proposed the use of the worst-case privacy loss, rather than the average privacy loss among all individuals adopted in [18], to measure privacy loss. They further presented a methodology to evaluate the privacy-utility tradeoff, where privacy loss is quantified by the adversary's knowledge gain about the sensitive values of specific individuals with the trivially-anonymized data as the baseline, and utility loss is measured by the information loss about the sensitive values of large populations with the original data as the baseline. In our paper, we use 1/l to bound the attribute disclosure for all records, and use the difference between the distributions of the original data and the reconstructed data to quantify utility loss. We also examine utility loss from the perspectives of answering aggregate queries and preserving correlations between the sensitive attribute and the QI attributes. In this paper, we focus on the randomization model in resisting attribute disclosure in data publishing. The proposed randomization model is compared to the traditional
l-diversity [2] and anatomy [5] models. Recently, Ghinita et al. [14] proposed a framework to solve the privacy-constrained and accuracy-constrained data anonymization problems under the k-anonymity and l-diversity models. They developed efficient heuristics to solve the one-dimensional problems in linear time and generalized the solutions to multi-dimensional attributes using space-mapping techniques. Their empirical evaluation showed that the privacy/accuracy-constrained methods outperform existing generalization approaches in terms of execution time and information loss. We would also point out that researchers have developed various anonymity models, including p-sensitive k-anonymity [20], (α, k)-anonymity [21], (k, e)-anonymity [4], (c, k)-safety [22], m-confidentiality [23], privacy skyline [24], δ-presence [25], (ε, m)-anonymity [26], and (t, λ)-uniqueness [27]. Detailed comparisons to these works are beyond the scope of this paper.

> B. Randomization

The first randomization model, called Randomized Response (RR), was proposed by Warner [28] in 1965. The model deals with one dichotomous attribute, i.e., every person in the population belongs either to a sensitive group A or to its complement Ā. The problem is to estimate π_A, the unknown proportion of population members in group A. Each respondent is provided with a randomization device by which the respondent chooses one of the following two questions, "Do you belong to A?" or "Do you belong to Ā?", with respective probabilities p and 1 − p, and then replies "yes" or "no" to the question chosen. The technique provides response confidentiality and increases respondents' willingness to answer sensitive questions, because no one but the respondent knows which question the answer pertains to. It has been extended to the field of privacy preserving data mining [29]. Aggarwal and Yu [30] provided a survey of randomization models for privacy preservation data mining, including various reconstruction methods for randomization, adversarial attacks, and the optimality and utility of randomization. Rizvi and Haritsa [9] developed the MASK technique to preserve privacy in frequent item set mining. Agrawal and Haritsa [31] presented a general framework for random perturbation in privacy preservation data mining. Du and Zhan [8] studied the use of a randomized response technique to build decision tree classifiers. Guo et al. [10] investigated data utility in terms of the accuracy of reconstructed measures in privacy preservation market basket data analysis. Aggarwal [32] proposed adding noise from a specified distribution to produce a probabilistic model of k-anonymity. Zhu and Liu [33] investigated the construction of optimal randomization schemes for privacy preservation density estimation and proposed a general framework for randomization using mixture models. Rebollo-Monedero et al. [34] defined a privacy measure similar to t-closeness [7] and formalized the privacy-utility tradeoff from an information theory perspective. The resulting solution turns out to be the post-randomization (PRAM) method [35] in the discrete case and a form of random noise addition in the general case. They also proved that, in the continuous case, the optimal perturbation is in general randomized rather than deterministic.
Huang and Du [36] studied the search for optimal distortion parameters to balance privacy and utility. They developed an evolutionary multi-objective optimization method to find optimal distortion matrices when applying the RR technique to a single attribute. The complexity of the algorithm is O(((N_Q + N_V)^3 + n^2) L), where N_Q is the population size, N_V is the archive size, n is the number of columns in the RR matrix, and L is the maximum number of iterations. However, there is no guarantee that the derived matrices are optimal in the entire search space, and the performance tends to suffer for multiple attributes due to the intractable size of the search space. Similarly, Xiao et al. [37] investigated optimal random perturbation at multiple privacy levels. In our paper, we focus on a specific randomization model (i.e., perturbation retains the original value of an attribute
A_i with probability p_i and replaces the original value with a random value from the domain of A_i; see Equation 1) and derive an efficient solution to determine the optimal randomization parameters under linking attacks. Chaytor and Wang [38] also applied this simple randomization model, randomizing a sensitive value only within a subset of the entire domain. Their algorithm, called Small Domain Randomization, can effectively derive optimal partitions, such that randomizing the sensitive attribute values of the records (with small domains) in each partition separately retains more data utility, at the same level of privacy, than randomizing the whole data set over the entire domain. Our work focuses on attribute disclosure when applying full domain randomization to both the QI attributes and the sensitive attribute S. The derived optimal randomization parameter values under privacy constraints can be incorporated into small domain randomization.

III. RANDOMIZATION BASED DISTORTION

Dataset
T contains N records and m + 1 categorical attributes: A_1, A_2, ..., A_m, and S. We use QI = {A_1, ..., A_m} to denote the set of quasi-identifier attributes (e.g., demographic attributes) whose values may be known to the attacker for a given individual, and we use S to denote one sensitive attribute whose value should not be associated with an individual by attackers. Generally, T may also contain other attributes that are neither sensitive nor quasi-identifying. Those attributes are usually kept unchanged in the released data. We exclude them from our setting, since they incur neither privacy disclosure risk nor utility loss. All of the discussions in this paper are presented in the single sensitive attribute setting and can be generalized to multiple sensitive attributes.

Attribute
A_i has d_i categories, denoted 0, 1, ..., d_i − 1. We use Ω_i to denote the domain of A_i, Ω_i = {0, 1, ..., d_i − 1}, and Ω_QI = Ω_1 × ··· × Ω_m is the domain of the quasi-identifiers. Similarly, attribute S has d_s categories, denoted 0, 1, ..., d_s − 1. We use Ω_s = {0, 1, ..., d_s − 1} to denote the domain of S. The r-th record R_r is denoted by (A_1r, A_2r, ..., A_mr, S_r), or simply (QI_r, S_r). Let D = d_s ∏_{i=1}^m d_i be the total number of cells in the contingency table. Let π_{i_1, ..., i_m, i_s} denote the true proportion corresponding to the categorical combination (A_1 = i_1, ..., A_m = i_m, S = i_s). Let π be the column vector with D elements π_{i_1, ..., i_m, i_s} arranged in a fixed order. Table 1 shows one contingency table example for a data set with one QI attribute (Gender, d_1 = 2) and one sensitive attribute S (Disease, d_s = 3). Table 1a shows the original contingency table, where π = (π_00, π_01, π_02, π_10, π_11, π_12)' corresponds to a fixed order of the cell entries π_ij in the 2 × 3 contingency table. π_10 denotes the proportion of records with Female and Cancer. The column sum π_+0 represents the proportion of records with Cancer across both genders. Note that contingency tables are widely used in statistics to record and analyze the relationship between categorical attributes. The results of most data mining tasks (e.g., clustering, decision tree learning, and association rule mining), as well as aggregate queries, are solely determined by the contingency table. That is, analysis on contingency tables is equivalent to analysis on the original categorical data. Table 1b shows one contingency table instance derived from a data set with 100 tuples. We will use this contingency table as an example to illustrate properties of link disclosure.
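To make the notation concrete, the following Python sketch builds such a contingency table and its vectorization π. The counts are hypothetical, except that the Female/Cancer cell matches the value π_10 = 0.12 used in the Section IV example; the disease labels other than Cancer are also placeholders.

import numpy as np

# Hypothetical 2 x 3 instance of Table 1: rows = Gender (Male, Female),
# columns = Disease (Cancer plus two placeholder diseases).
counts = np.array([[30, 25, 15],    # Male   (hypothetical counts)
                   [12, 10,  8]])   # Female (12/100 = 0.12 matches pi_10)
N = counts.sum()                    # 100 tuples, as in Table 1b

# pi: cell proportions in the fixed order (pi_00, pi_01, pi_02, pi_10, pi_11, pi_12)'
pi = (counts / N).reshape(-1)
print(pi[3])                        # 0.12, the (Female, Cancer) cell

# Column sums give marginals such as pi_+0, the overall proportion with Cancer.
print(counts.sum(axis=0) / N)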
We use the left part of Fig. 1 to illustrate the process of privacy preserving data publishing. For each record R_j, the data owner independently randomizes each attribute A_i using the distortion matrix P_i. Specifically, for attribute A_i (or S) with d_i categories, the randomization process changes a record belonging to the v-th category (v = 0, ..., d_i − 1) to the u-th category with probability p_uv^(i): Pr(Ã_i = u | A_i = v) = p_uv^(i). Let P_i = [p_uv^(i)]_{d_i × d_i}; we call P_i (or P_s) the distortion matrix for A_i (or S). Naturally, the column sums of P_i are equal to 1. The original database T is changed to T̃, and then both the randomized data set and the distortion matrices are published. The randomization matrices indicate the magnitude of the randomization; this can help data analysts estimate the original data distribution.

Let λ denote the contingency table of the randomized data
T̃. We arrange λ into a column vector in the same order as π. Table 1c shows one example of a randomized contingency table. The randomized contingency table has a close relationship to the original contingency table and the randomization matrices:

E(λ) = P π,  where P = P_1 ⊗ ··· ⊗ P_m ⊗ P_s,

and ⊗ stands for the Kronecker product. The Kronecker product is an operation on two matrices, an m-by-n matrix A and a p-by-q matrix B, resulting in an mp-by-nq block matrix. The distortion matrices determine the privacy and utility of the randomized data. How to find optimal distortion parameters under privacy or utility constraints has remained a challenging problem [39]. In this paper, we limit the choice of randomization parameters for each QI attribute A_i (and the sensitive attribute S) as:

p_uv^(i) = p_i if u = v,  and  p_uv^(i) = q_i = (1 − p_i)/(d_i − 1) if u ≠ v.   (1)

That is, for each attribute A_i, all categories have the same probability p_i of remaining unchanged, and the same probability q_i of being distorted to any given different category. With this restricted choice, we can derive an efficient algorithm (with an explicit formula) to determine the optimal randomization parameters (as shown in Section IV-B).
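As an illustration, here is a minimal Python sketch of the distortion matrix in Equation (1) and of randomizing one categorical column with it; the domain size and parameter value are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(0)

def distortion_matrix(d, p):
    # Equation (1): keep a category with probability p; move to each of the
    # other d - 1 categories with probability q = (1 - p) / (d - 1).
    q = (1.0 - p) / (d - 1)
    return (p - q) * np.eye(d) + q * np.ones((d, d))   # columns sum to 1

def randomize_column(values, P):
    # Independently distort each value v to u with probability P[u, v].
    d = P.shape[0]
    return np.array([rng.choice(d, p=P[:, v]) for v in values])

P_s = distortion_matrix(3, 0.6)          # e.g., Disease with d_s = 3, p_s = 0.6
disease = rng.integers(0, 3, size=1000)  # synthetic original column
disease_tilde = randomize_column(disease, P_s)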
> B. Analysis on the Randomized Data

One advantage of the randomization procedure is that the pure random process allows data analysts to estimate the original data distribution based on the released data and the randomization parameters. The right part of Fig. 1 shows how data analysts estimate the original data distribution. With the randomized data T̃ and its contingency table λ, the unbiased estimate of π is given by π̂ = P^{-1} λ, and the covariance matrix of the estimator is

Σ = cov(π̂) = (1/N) P^{-1} [(Pπ)_δ − (Pπ)(Pπ)^T] (P^{-1})^T,   (2)

where (Pπ)_δ stands for the diagonal matrix whose diagonal values are Pπ.
Thus, data analysts can derive an estimate of the distribution of the original data (in terms of the contingency table) without disclosing the individual information of each record. We choose the accuracy of the reconstructed distribution as the target utility function in Section IV-B, because most data mining applications are based on the probability distribution of the data.
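For a single attribute, the estimation step above can be sketched as follows: form the contingency table λ of the randomized column and solve π̂ = P^{-1} λ. The true distribution used here is arbitrary, chosen only to show that the estimate recovers it.

import numpy as np

rng = np.random.default_rng(1)

def estimate_distribution(randomized, P):
    # Unbiased estimate pi_hat = P^{-1} lambda, where lambda is the
    # contingency table (in proportions) of the randomized column.
    lam = np.bincount(randomized, minlength=len(P)) / len(randomized)
    return np.linalg.solve(P, lam)

# Distortion matrix of Equation (1) with d = 3 categories and p = 0.6.
d, p = 3, 0.6
q = (1 - p) / (d - 1)
P = (p - q) * np.eye(d) + q * np.ones((d, d))

original = rng.choice(d, size=50_000, p=[0.5, 0.3, 0.2])
randomized = np.array([rng.choice(d, p=P[:, v]) for v in original])
print(estimate_distribution(randomized, P))   # approx. [0.5, 0.3, 0.2]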
> C. Attacks on Randomized Data
Let X be an attribute or a subset of attributes in the data set T with domain Ω_X, and let X̃ be the randomized value of X in T̃. Given the randomization process and parameters, it is not reasonable for attackers to regard the observed value as the true value of X. Instead, attackers can try to estimate the original value based on the observed data and the released randomization parameters. Let X̂ denote the attacker's estimate of the original value of X. Any value in Ω_X is possible due to the randomization procedure. We assume that the attacker is able to calculate the posterior probability of content in the data set and takes the following probabilistic strategy: for any μ, ν ∈ Ω_X,

Pr(X̂ = μ | X̃ = ν) = Pr(X = μ | X̃ = ν),   (3)

where Pr(X = μ | X̃ = ν) denotes the attacker's posterior belief that the original value is X = μ when he observes X̃ = ν. With Bayes' theorem, it can be calculated by:

Pr(X = μ | X̃ = ν) = Pr(X̃ = ν | X = μ) Pr(X = μ) / Σ_{k ∈ Ω_X} Pr(X̃ = ν | X = k) Pr(X = k).   (4)

The following lemma gives the accuracy of the attacker's estimation.
LEMMA 1.
Suppose attackers adopt the probabilistic strategy specified in (3) to estimate the data. The probability that attackers accurately estimate the original value of X is given by:

Pr(X̂ = X = μ) = Σ_{ν ∈ Ω_X} Pr(X̃ = ν | X = μ) Pr(X = μ | X̃ = ν).   (5)

PROOF. For a particular observed value X̃ = ν ∈ Ω_X, Pr(X̂ = μ, X̃ = ν | X = μ) = Pr(X̃ = ν | X = μ) Pr(X = μ | X̃ = ν). Then, with the law of total probability, summing over all ν ∈ Ω_X yields (5).
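A small Python sketch of the strategy in (3)-(5): the posterior follows Bayes' theorem, and the round-trip probability of Lemma 1 weights the distortion probabilities by the posterior. The indexing convention P[u, v] = Pr(X̃ = u | X = v) matches the distortion matrices above.

import numpy as np

def posterior(P, pi):
    # post[mu, nu] = Pr(X = mu | X~ = nu), by Bayes' theorem (4).
    joint = P * pi[np.newaxis, :]          # joint[nu, mu] = Pr(X~ = nu, X = mu)
    return (joint / joint.sum(axis=1, keepdims=True)).T

def correct_estimation_prob(P, pi):
    # Lemma 1: Pr(X^ = X = mu), i.e., sum over observed nu of
    # Pr(X~ = nu | X = mu) * Pr(X = mu | X~ = nu), for every mu.
    post = posterior(P, pi)
    return np.einsum('nm,mn->m', P, post)

# With p = 1/d the posterior equals the prior, so the round trip succeeds
# only with probability pi_mu -- the minimum case of Property 1 below.
d = 3
pi = np.array([0.5, 0.3, 0.2])
print(correct_estimation_prob(np.full((d, d), 1.0 / d), pi))  # ~[0.5, 0.3, 0.2]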
The probability of the attackers' correct estimation is also defined as the reconstruction probability, Pr(X = μ → X̂ = μ), in [9, 12]. It captures the round trip of going from the original content to the distorted one and then returning to an estimate of the original content. That is, it indicates how much information about the original data is preserved after randomization. To keep the notation concise, we adopt the notation Pr(X̂ = X = μ) for the reconstruction probability used to evaluate the risk of sensitive attribute disclosure.

IV. ATTRIBUTE DISCLOSURE UNDER LINKING ATTACKS
We measure privacy in terms of the attribute disclosure risk, whose formal definition is given as follows:
DEFINITION 1.
The attribute disclosure risk under linking attacks is defined to be the probability that the attacker predicts S_r successfully given QI_r of a target individual r, denoted as Pr(S_r | QI_r).

We need to quantify the background knowledge of attackers to derive the disclosure probability Pr(S_r | QI_r). We make the following standard assumptions about the background knowledge of attackers in this paper. We assume that the attacker has access to the published data set T̃ and knows that T̃ is a randomized version of some base table T. The attacker knows the domain of each attribute of T. We also assume that the attacker can obtain the QI-values of the target individual (e.g., Alice) from some public database or background knowledge, and knows that the target individual is definitely contained in the published data. However, he has no knowledge of which record in the published data belongs to the target individual. Finally, we assume that the distortion matrices P_i are available to the attacker, because they are necessary for data miners to conduct analysis. The algorithm of l-diversity in [2] preserves privacy by generalizing the QI attributes to form QI-groups. Individuals in a group are linked to any sensitive value with probability at most 1/l, i.e., Pr(S_r | QI_r) ≤ 1/l. However, the randomization based approach achieves privacy protection probabilistically. In the following subsection, we show how to quantify the attribute disclosure risk in the randomization setting.

> A. Quantifying Attribute Disclosure
When no randomization is applied, for those records with quasi-identifiers equal to QI_r, the attacker simply regards every record as equally likely to represent individual r. The risk of sensitive attribute disclosure is equal to

Pr(S_r | QI_r) = π_{QI_r, S_r} / π_{QI_r},

because a proportion π_{QI_r} of the records fall within the group of QI_r, and a proportion π_{QI_r, S_r} of the records additionally have the sensitive value S_r. This case corresponds to the worst case of the attribute disclosure risk. When randomization is applied, the attribute disclosure risk is reduced, because randomization increases the attacker's uncertainty.

1) Randomize S only (RR-S): When data owners only apply randomization to the sensitive attribute, for each record within the group of QI_r, the attacker takes a guess at its sensitive value using the observed sensitive value and the posterior probability in (4). According to (5), the probability of a correct estimation is Pr(Ŝ_r = S_r | QI_r), so the risk of sensitive attribute disclosure is

Pr(S_r | QI_r) = Pr(Ŝ_r = S_r | QI_r).
2) Randomize QI only (RR-QI): Similarly to RR-S, when data owners only apply randomization to the quasi-identifiers, the probability of correctly reconstructing QI_r is given by Pr(Q̂I_r = QI_r), and hence the risk of sensitive attribute disclosure is

Pr(S_r | QI_r) = Pr(Q̂I_r = QI_r) · π_{QI_r, S_r} / π_{QI_r}.
3) Randomize QI and S (RR-Both): When data owners apply randomization to both QI and S, the attacker first needs the values of the quasi-identifier attributes to be correctly reconstructed; the probability of this is Pr(Q̂I_r = QI_r). Second, the attacker needs the value of the sensitive attribute to be correctly reconstructed, given correctly reconstructed quasi-identifier values. We summarize the risk of sensitive attribute disclosure in RR-Both, as well as in RR-S and RR-QI, in the following result, which gives the general calculation of the attribute disclosure risk in the randomization settings.
RESULT 1. Assume an individual r has quasi-identifier QI_r = α = {i_1, i_2, ..., i_m}, i_k ∈ Ω_k, and sensitive attribute S_r = u, u ∈ Ω_s. The probability of successfully predicting the sensitive attribute value S_r of the target individual r, given his quasi-identifier values QI_r, is:

Pr(S_r | QI_r) = Pr(Q̂I_r = QI_r) · Pr(Ŝ_r = S_r | QI_r).   (6)

We give the formal expressions of the two reconstruction probabilities needed to calculate Pr(S_r | QI_r) in (6). The reconstruction probability of the quasi-identifier is given by:

Pr(Q̂I_r = QI_r) = Σ_{β ∈ Ω_QI} [Pr(Q̃I_r = β | QI_r = α)]^2 π_α / λ_β,   (7)

where Pr(Q̃I_r = β | QI_r = α) = ∏_{k=1}^m p_{j_k i_k}^(k) is the probability that α = {i_1, i_2, ..., i_m} is distorted to β = {j_1, j_2, ..., j_m}, and λ_β is the corresponding cell of the expected randomized QI contingency table.

The reconstruction probability of the sensitive attribute S for the target individual with quasi-identifier QI_r is given by:

Pr(Ŝ_r = S_r | QI_r) = Σ_{ν ∈ Ω_s} (p_{νu}^(s))^2 θ_u / (Σ_{k ∈ Ω_s} p_{νk}^(s) θ_k),   (8)

where θ_k = π_{QI_r, k} / π_{QI_r} denotes the within-group proportion of records with S = k.
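Since (7) and (8) share the same round-trip structure, one helper covers both in a sketch of Result 1. The helper names and the within-group distribution argument theta are ours, not the paper's, and P_qi may be applied via the Lemma 2 trick described below instead of being materialized.

import numpy as np

def reconstruction_prob(P, pi):
    # Round-trip probability Pr(X^ = X | X = alpha) for every alpha:
    # sum over beta of Pr(beta | alpha)^2 * pi_alpha / lambda_beta, as in (7).
    lam = P @ pi                        # expected randomized contingency table
    return pi * ((P ** 2).T @ (1.0 / lam))

def disclosure_risk(P_qi, pi_qi, P_s, theta, alpha, u):
    # Result 1 (RR-Both): product of the two reconstruction probabilities.
    # theta is the within-group distribution of S for the group QI_r = alpha;
    # alpha and u are the indices of QI_r and S_r in the fixed cell order.
    return reconstruction_prob(P_qi, pi_qi)[alpha] * reconstruction_prob(P_s, theta)[u]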
We are interested in when the attribute disclosure risk reaches its minimum. We show our results in the following property and include the proof in the Appendix.

PROPERTY 1. Given QI_r of individual r: for RR-S, Pr(S_r | QI_r) is minimized when p_s = 1/d_s; for RR-QI, Pr(S_r | QI_r) is minimized when p_i = 1/d_i for i = 1, ..., m; for RR-Both, Pr(S_r | QI_r) is minimized when p_s = 1/d_s and p_i = 1/d_i for all i.

Example. We use the instance shown in Table 1b to illustrate this property. For an individual r with (QI_r = Female, S_r = Cancer), we randomize S (Disease) with p_s and QI (Gender) with p_G independently. Fig. 2 shows how the attribute disclosure risk varies as we apply different randomization parameters. We can see that Pr(S_r | QI_r) reaches its maximum when no randomization is introduced (i.e., p_s = p_G = 1). Fig. 2a shows the scenario where we only randomize S. We can see that min Pr(S_r | QI_r) is attained when p_s = 1/d_s = 1/3.
Fig. 2b shows the scenario where we only randomize QI. We can see that min Pr(S_r | QI_r) = π_10 = 0.12 when p_G = 1/d_G = 1/2. Fig. 2c shows the case where randomization is applied to both QI and S; Pr(S_r | QI_r) reaches its minimum only when both p_s = 1/3 and p_G = 1/2.

Computational Cost. The main computational cost in (6) comes from calculating Pr(Q̂I_r = QI_r). Let P_QI be the distortion matrix on the quasi-identifiers, P_QI = P_1 ⊗ ··· ⊗ P_m; let π_QI be the contingency table on all quasi-identifiers arranged as a column vector; and let λ_QI denote the expected QI contingency table of the randomized data, λ_QI = P_QI π_QI. Then, the denominator in (7) is exactly the cell of λ_QI corresponding to β. Let η denote the column vector of the reconstruction probabilities of the quasi-identifiers, arranged in the same order as π_QI. We can further express (7) in matrix form:

η = π_QI × [(P_1^2 ⊗ ··· ⊗ P_m^2)^T λ_QI^{-1}],   (9)

where × denotes component-wise multiplication, P_i^2 is the component-wise square of P_i, and λ_QI^{-1} is the component-wise inverse of λ_QI.

In (9), we need to repeatedly calculate (P_1 ⊗ ··· ⊗ P_m) x, where x denotes a column vector. Assume we use the naive algorithm in all matrix multiplications. Calculating (P_1 ⊗ ··· ⊗ P_m) x directly results in a time and storage complexity of O(∏_i d_i^2); the main storage cost comes from storing the matrix P_QI. The following lemma allows us to reduce the cost of this computation:

LEMMA 2.
Let A, B, and X be matrices of size n × n, m × m, and m × n, respectively. Then

(A ⊗ B) vec(X) = vec(B X A^T),

where vec(X) denotes the vectorization of the matrix X formed by stacking the columns of X into a single column vector. Applying Lemma 2 recursively, we can reduce the time complexity to O((Σ_i d_i) ∏_i d_i). The storage complexity is also reduced to O(∏_i d_i + Σ_i d_i^2), which is mainly used to store the contingency table.
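A sketch of the resulting computation: contracting one Kronecker factor at a time via Lemma 2 avoids ever materializing P_QI. The tensor formulation below is equivalent to applying (A ⊗ B) vec(X) = vec(B X A^T) recursively.

import numpy as np
from functools import reduce

def kron_matvec(mats, x):
    # Compute (P_1 kron ... kron P_m) @ x one factor at a time.
    dims = [P.shape[0] for P in mats]
    t = x.reshape(dims)
    for k, P in enumerate(mats):
        t = np.tensordot(P, t, axes=([1], [k]))  # contract the k-th index
        t = np.moveaxis(t, 0, k)                 # restore the axis order
    return t.reshape(-1)

# Sanity check against the explicit Kronecker product on small matrices.
rng = np.random.default_rng(0)
mats = [rng.random((d, d)) for d in (2, 3, 4)]
x = rng.random(2 * 3 * 4)
assert np.allclose(kron_matvec(mats, x), reduce(np.kron, mats) @ x)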
> B. Maximizing Utility with Privacy Constraints
The ultimate goal of publishing data is to maximize utility while simultaneously minimizing the risk of attribute disclosure. The utility of any data set, whether randomized or not, is innately dependent on the tasks that one may perform on it. Without a workload context, it is difficult to say whether a data set is useful or not. This motivates us to focus on the data distribution when evaluating the utility of a database, since many data mining applications are based on the probability distribution of the data.

PROBLEM 1. Determine the randomization parameters p_1, ..., p_m and p_s that

minimize E[d(π̂, π)]  subject to  Pr(S_r | QI_r) ≤ τ for every record r,

where E[d(·)] denotes the expectation of d(·) and d(π̂, π) denotes a certain distance between π̂ and π. We can set the privacy threshold to match the privacy requirement of l-diversity (i.e., τ = 1/l) to compare the performance of different disguise schemes. That is, we determine the optimal randomization parameters (p_1, p_2, ..., p_m, and p_s) to maximize the utility, while ensuring that the sensitive value of any individual involved in the data set cannot be correctly inferred by an attacker with probability more than 1/l. A larger l leads to stronger privacy protection. In general, privacy constraints may be flexible; for example, different individuals may have different concerns about their privacy, so we can set different thresholds for Pr(S_r | QI_r).

Problem 1 is a nonlinear optimization problem. In general, we can solve it using optimization packages (e.g., the trust region algorithm [40]). In Section IV-A, we discussed how to efficiently calculate the attribute disclosure risk of target individuals (shown as the constraints in Problem 1). Next, we show how we can efficiently calculate E[||π̂ − π||_2^2], the expected squared Euclidean distance between the original distribution and the estimated one.

RESULT 2.
When d(π̂, π) is the squared Euclidean distance, Problem 1 is equivalent to: determine p_i, i = 1, ..., m, and p_s to minimize ∏_i ||P_i^{-1}||_F^2 (the product taken over all randomized attributes, including S) subject to the privacy constraints of Problem 1.

When d(π̂, π) is the squared Euclidean distance, by Lemma 3 shown in the Appendix, minimizing E[d(π̂, π)] is equivalent to minimizing trace(Σ), where Σ is the covariance matrix of the estimator of the cell values in the contingency table (shown in Equation 2). Calculating trace(Σ) still involves a high computational cost. However, when P_i has the specific form shown in (1), we can further reduce the problem to minimizing ∏_i ||P_i^{-1}||_F^2, and with Lemma 4 we have

P_i^{-1} = (1/(p_i − q_i)) (I − q_i 1 1^T),  so  ||P_i^{-1}||_F^2 = [d_i (1 − q_i)^2 + d_i (d_i − 1) q_i^2] / (p_i − q_i)^2,

where 1 is the column vector whose cells are all equal to 1.
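Because the objective above decreases in each p_i while the disclosure risk increases in p_i, the single-attribute optimum sits where the privacy constraint is tight. A minimal sketch, assuming a monotone callable risk(p) built from Result 1 (the function names are ours):

import numpy as np

def inv_frobenius_sq(p, d):
    # ||P_i^{-1}||_F^2 for the Equation-(1) matrix, closed form per Lemma 4.
    q = (1.0 - p) / (d - 1)
    return (d * (1 - q) ** 2 + d * (d - 1) * q ** 2) / (p - q) ** 2

def optimal_p(risk, tau, d, iters=60):
    # Largest p in (1/d, 1] with risk(p) <= tau, found by bisection; by the
    # monotonicity argument this p also minimizes the objective above.
    lo, hi = 1.0 / d, 1.0
    if risk(hi) <= tau:          # no randomization needed at all
        return hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if risk(mid) <= tau else (lo, mid)
    return lo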
V. EMPIRICAL EVALUATION

We ran our experiments on the Adult database from the UCI data mining repository [41]. The same database has been used in previous work on k-anonymity, l-diversity, t-closeness, and anatomy [2-5]. The Adult database contains 45,222 tuples from US census data and 14 attributes. Table 2 is a summary description of the data, including the attributes we used, the number of distinct values for each attribute, and the types of the attributes adopted in the evaluation.

It is expected that a good publication method should preserve both privacy and data utility. We set different l values as privacy disclosure thresholds. To quantify utility, we adopt the following standard distance measures to compare the distributions of the original and reconstructed data. Given two distributions P = (p_1, p_2, ..., p_m) and Q = (q_1, q_2, ..., q_m), the KL distance is defined as

d_KL(P, Q) = Σ_i p_i log(p_i / q_i),

and the χ²-distance is

d_χ²(P, Q) = Σ_i (p_i − q_i)^2 / p_i.

In our evaluation, we first investigate utility vs. privacy of the randomization method in protecting against attribute disclosure from two aspects: 1) data utility across the different randomization scenarios, and 2) the impact of cardinality on data utility. Then, we compare randomization with l-diversity and anatomy.
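For reference, a minimal sketch of the two distance measures as used here; how zero-probability cells are handled is an implementation choice that the text does not specify.

import numpy as np

def kl_distance(p, q):
    # d_KL(P, Q) = sum_i p_i log(p_i / q_i); cells with p_i = 0 contribute 0.
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def chi2_distance(p, q):
    # Chi-square distance: sum_i (p_i - q_i)^2 / p_i over cells with p_i > 0.
    m = p > 0
    return float(np.sum((p[m] - q[m]) ** 2 / p[m]))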
We treated Education, Marital-status, Gender, and Race as the quasi-identifier and used Workclass as the sensitive attribute. We term this data set EMGRW. For randomization, we distort only the QI attributes, only the sensitive attribute S, or both, in the different application scenarios. Table 3 shows the derived randomization parameters p for the three scenarios (RR-QI, RR-S, and RR-Both). We set l = 2, 3, 4, 5. We can observe that more distortion (p further away from 1) is needed to achieve stronger privacy protection (as l increases). We can also observe that p_i for the QI attributes in RR-Both is generally closer to 1 than in RR-QI, because RR-Both can also distort the sensitive attribute, in addition to the QI attributes, to achieve a given level of privacy protection. Thus, a smaller magnitude of distortion is needed for the QI attributes in RR-Both.

Fig. 3 shows the results on the various distance measures. Naturally, there is a tradeoff between minimizing utility loss and maximizing privacy protection. In Fig. 3, the smaller the distance values, the smaller the difference between the distorted and the original databases, and the better the utility of the distorted database. We can observe that the utility loss (in terms of distance differences) increases approximately linearly with increasing privacy protection across all three randomization scenarios.
RR-Both achieves the best utility preservation, because we use the optimal randomization parameters for both QI and the sensitive attribute.

As discussed in Section III, one advantage of the randomization scheme is that the more data we have, the more accurate a reconstruction we can achieve. To investigate the impact of data size on the utility of the randomized data, we generated four more data sets of varied sizes by randomly sampling r·N tuples from the Adult data, where N is the size of the original Adult data set and r ∈ {0.5, 1.5, 2, 2.5}. All four generated data sets have the same distribution as the original one. Fig. 4 shows the accuracy of the reconstructed data distribution as the data size increases. We see that data utility is further improved when more data are available.

> B. Comparison to Other Models
To compare the randomization scheme with l-diversity and anatomy, we chose Education, Salary, Gender, and Race as QI, and Occupation as the sensitive attribute, similar to the settings of the empirical evaluations in [2, 5]. We term this data set ESGRO. We did not use the previous EMGRW data set, because l-diverse partitions cannot be derived for it by the anatomy algorithm or entropy l-diversity. As specified in Machanavajjhala et al. [2] and Xiao and Tao [5], an l-diverse partition exists if and only if at most N/l records are associated with the same sensitive value, where N is the cardinality of the data set; the distribution of attribute Workclass is highly skewed, and the frequency of one value is much larger than the others. In this section, we focus on the use of RR-QI for the comparison with the two group based models, l-diversity and anatomy, because the overall distribution of sensitive attribute values is unchanged both by the group based methods after anonymization and by RR-QI after randomization. Generalization approaches usually measure utility syntactically, by the number of generalization steps applied to quasi-identifiers [42], the average size of quasi-identifier equivalence classes [2], the sum of squares of class sizes, or the preservation of marginals [3]. Since the randomization scheme is not based on generalization or suppression, we further examine utility preservation from the perspective of answering aggregate queries, in addition to the previous distribution distance measures. We adopt query answering accuracy, the same metric used in Zhang et al. [4]. We also consider the variation of correlations between the sensitive attribute and quasi-identifiers. We used an implementation of the Incognito algorithm [3] to generate the entropy
l-diverse tables and used the anatomy algorithm in [5] in our experiments.

1) Distribution Distance: We compare the data utility of the reconstructed probability distributions for the different models according to the distance measures. Fig. 5 shows the distances between π̂ and π for anatomy, l-diversity, and RR-QI on the data set ESGRO. We can observe that randomization outperforms both the anatomy and l-diversity methods, because we can partially recover the original data distribution in the randomization scheme, whereas the data distribution within each generalized equivalence class is lost in l-diversity generalization, and the relations between the quasi-identifier table (QIT) and the sensitive table (ST) are lost in anatomy.
Another observation is that data utility (in terms of the distance between the original and reconstructed distributions) decreases monotonically as the privacy threshold l increases for randomization and anatomy. This is expected: for randomization, more randomness must be introduced as the privacy requirement increases, and for anatomy, a larger l means that more tuples are included in each group, which decreases the accuracy of the estimate of the original data distribution. However, there is no similar monotonic trend for l-diversity, because the generalization algorithm chooses different attributes for generalization at different l values; this makes the accuracy of the estimated distribution vary to different extents.
2) Query Answering Accuracy: The accuracy of answering aggregate queries is one of the important aspects in evaluating the utility of distorted data. We compare randomization against the two other anonymization approaches [3, 5] using the average relative error in the returned count values. For each count value (corresponding to the number of records in a group), the relative error is defined as |act − est| / act, where act is the actual count value from the original data, and est is the estimate from the reconstructed data for the RR approaches or from the anonymized data for the anonymization approaches. We consider two types of queries in our evaluation.
- Base group-by query, with the form:

SELECT A1, ..., Am, S, COUNT(*) FROM data
WHERE A1 = i1 AND ... AND Am = im AND S = is
GROUP BY (A1, ..., Am, S)

where i_k ∈ Ω_k and i_s ∈ Ω_s.
- Cube query with the form:
SELECT A1,· · · , Am, S, COUNT(*) FROM data
GROUP BY CUBE (A1, · · · , Am, S)
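A minimal sketch of the relative-error metric defined above, averaged over the count values such queries return; we skip groups whose actual count is zero, an assumption the text does not spell out.

import numpy as np

def avg_relative_error(act, est):
    # Average |act - est| / act over all returned count values with act > 0.
    act, est = np.asarray(act, float), np.asarray(est, float)
    m = act > 0
    return float(np.mean(np.abs(act[m] - est[m]) / act[m]))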
The base group-by query returns the number of records in each group (A_1 = i_1, ..., A_m = i_m, and S = i_s). Note that the total number of groups is equal to D = d_s ∏_{i=1}^m d_i. After running the base group-by query on the original data and on the reconstructed or anonymized data, we get the count values act and est, respectively. Fig. 6a shows the calculated average relative errors of the three methods (anatomy,
l-diversity, and RR-QI) with varied l (l = 1, ..., 7). We use the CUBE query to describe all possible hierarchical aggregate queries. (In multidimensional jargon, a cube is a cross-tabulated summary of detail rows. CUBE enables a SELECT statement to calculate subtotals for all possible combinations of a group of dimensions; it also calculates a grand total.) The CUBE query returns aggregate values for all possible combinations of the QI attributes. We calculate the average relative error for each l = 1, ..., 7 and show the results in Fig. 6b. We can observe from Figs. 6a and 6b that randomization permits significantly more accurate aggregate analysis than both l-diversity and anatomy, since it recovers a more accurate data distribution. Conversely, l-diversity loses the data distribution of the records within each generalized QI-group, and anatomy loses the correlations between the QI attributes and the sensitive attribute S.

3) Correlation Between Attributes: A good publishing method should also preserve data correlations (especially between QI and sensitive attributes). We use the uncertainty coefficient (U) to evaluate the correlation between two multi-category variables:

U(X | Y) = [H(X) − H(X | Y)] / H(X),

where H(X) is the entropy of X and H(X | Y) is the conditional entropy of X given Y.
The uncertainty coefficient takes values between -1 and 1; larger values represent a stronger association between the variables. When the response variable has several possible categorizations, these measures tend to take smaller values as the number of categories increases.
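A sketch of the uncertainty coefficient computed from a two-way contingency table, using the standard definition U(X | Y) = I(X; Y) / H(X); the exact variant used in the paper may differ.

import numpy as np

def uncertainty_coefficient(counts):
    # counts: contingency table with X on the rows and Y on the columns.
    pxy = counts / counts.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    h_x = -np.sum(px[px > 0] * np.log(px[px > 0]))                 # H(X)
    mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))  # I(X; Y)
    return float(mi / h_x)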
Tables 4 and 5 show how the correlation (uncertainty coefficient) values between each pair of attributes vary under the three models (anatomy, l-diversity, and RR-QI) on the data set ESGRO. We vary l from 2 to 7, but only include the correlation values for l = 3, 4, and 5 due to space limitations. We use the attribute pair of Salary (S, one QI attribute) and Occupation (O, the sensitive attribute) as an example (the column in bold font in Table 5) to study how the correlations between QI and S change. The original uncertainty coefficient is 2.74 × 10^-2. RR-QI achieves good correlation preservation, i.e., 2.41, 2.27, and 2.17 (×10^-2) for l = 3, 4, and 5, respectively. Conversely, the uncertainty coefficient under anatomy is only 0.20, 0.12, and 0.05 (×10^-2), correspondingly. l-diversity achieves zero correlation preservation when l = 3 and 5, while it perfectly preserves the correlation when l = 4, because the Salary attribute is generalized to "All" when l = 3 and 5, while it is unchanged across all QI-groups when l = 4. In general, l-diversity cannot preserve correlations well, because it is intractable to predict which QI attributes will be generalized.

Fig. 7 shows the average value of the uncertainty coefficients among attributes for anatomy, l-diversity, and RR-Both on the data set ESGRO. As shown in Fig. 7a, randomization achieves the best preservation of the correlation between the sensitive attribute and the quasi-identifiers across all privacy thresholds. Fig. 7b also clearly shows that randomization preserves the correlations among quasi-identifiers better than l-diversity does. Note that anatomy best preserves the correlations among quasi-identifiers, since it does not change the values of quasi-identifiers.
In summary, the evaluation shows that randomization better preserves utility (in terms of distribution reconstruction, accuracy of aggregate query answering, and correlation among attributes) under the same privacy requirements; its utility loss is significantly smaller than that of the generalization and anatomy approaches. Furthermore, the randomization approach can further improve the accuracy of the reconstructed distribution (and hence utility preservation) when more data are available, whereas the generalization and anatomy approaches do not have this property.
VI. CONCLUSION AND FUTURE WORK

In this paper, we investigated attribute disclosure under linking attacks. We compared randomization to other anonymization schemes (l-diversity and anatomy) in terms of utility preservation under the same privacy protection requirements. Our experimental evaluations showed that randomization significantly outperforms generalization, i.e., it achieves better utility preservation while yielding the same privacy protection.
There are several other avenues for future work. We aim to extend our research to handle multiple sensitive attributes. In this paper, we limited our scope to attackers who have no knowledge about the sensitive attribute of specific individuals in the population and/or the table. In practice, this may not be true, since the attacker may have instance-level background knowledge (e.g., the attacker might know that the target individual does not have cancer, or might know complete information about some people in the table other than the target individual) or partial demographic background knowledge about the distribution of sensitive and insensitive attributes in the population. Different forms of background knowledge have been studied in privacy preservation data publishing recently. For example, formal languages [22, 24] and maximum estimation constraints [43] have been proposed to express the background knowledge of attackers. We will investigate privacy disclosure under those background knowledge attacks. We should note that randomization may not outperform generalization in all cases, especially when some specific background knowledge is utilized in the adversary model: generalization is truthful, and the ambiguity is between true tuples (e.g., individuals a and b are indistinguishable), whereas randomization can be regarded as adding noise, and noise can be removed when some background knowledge is known.
We will continue our study of further reducing the computational overhead of the randomization approach. The execution time of randomization is usually longer than that of l-diversity and anatomy. For example, the execution time of our RR-Both approach on the Adult data sets was 10 times longer than that of the generalization approaches, due to the heavy cost of determining the optimal distortion parameters under attribute disclosure constraints, especially when the attribute domains are large. One advantage of randomization is that the reconstruction accuracy increases when more data are available. Guo et al. [10] conducted a preliminary theoretical analysis of how the randomization process affects the accuracy of various measures (e.g., support, confidence, and lift) in market basket data analysis. In our future work, we will study the accuracy of reconstruction in terms of the bias and variance of the estimates when randomizing general microdata.
Finally, we are interested in comparing our randomization approaches to recently developed generalization approaches (e.g., the accuracy-constrained l-diversification method in [14]). We are also interested in combining the derived optimal randomization schemes with small domain randomization [38] to further improve utility preservation. It is our belief that extensive studies comparing different privacy preservation data publishing approaches are crucial.
> Appendix 1. Proof of Property 1
We start by applying RR to the sensitive attribute only, so that Pr(Q̂I_r = QI_r) = 1. Without loss of generality, we assume that S_r = 1. Combine the other categories 0, 2, ..., d_s − 1 into a new category, and still use 0 to denote the new category. To keep the notation simple, we write the within-group proportion of the new category simply as π_0 in this proof; then π_1 = 1 − π_0. This adjustment does not change the reconstruction probability Pr(Ŝ_r = S_r = 1 | QI_r). After the adjustment, the randomization probabilities are given by

Pr(S̃ = 1 | S = 1) = p_s,  Pr(S̃ = 0 | S = 1) = 1 − p_s,
Pr(S̃ = 1 | S = 0) = q_s,  Pr(S̃ = 0 | S = 0) = 1 − q_s,   (13)

where q_s = (1 − p_s)/(d_s − 1).
By definition, the posterior probabilities are given by

Pr(S = 1 | S̃ = 1) = p_s π_1 / (p_s π_1 + q_s π_0),   (14)
Pr(S = 1 | S̃ = 0) = (1 − p_s) π_1 / ((1 − p_s) π_1 + (1 − q_s) π_0).   (15)
Combining (13), (14), and (15), we have

Pr(Ŝ_r = S_r = 1 | QI_r) = p_s^2 π_1 / (p_s π_1 + q_s π_0) + (1 − p_s)^2 π_1 / ((1 − p_s) π_1 + (1 − q_s) π_0).   (16)
Taking the derivative with respect to p_s shows that (16) is minimized when p_s = 1/d_s, and the minimal value is π_1.
Following similar strategies, we can prove the general case when we randomize both QI and S.
> Appendix 2. Proof of Result 2
LEMMA 3. If d(π̂, π) = ||π̂ − π||_2^2, we have E[d(π̂, π)] = trace(Σ), where Σ is the covariance matrix of π̂ shown in (2).
PROOF. We know that π̂ asymptotically follows the normal distribution N(π, Σ). Let Σ = XΛX^T be the eigen-decomposition of Σ, where Λ = diag(λ_1, ..., λ_n) and X^T X = I. Let η = X^T (π̂ − π); then η is normally distributed with E(η) = 0 and

Cov(η) = X^T Cov(π̂ − π) X = X^T Σ X = Λ.

Notice that Λ is a diagonal matrix, and hence Cov(η_i, η_j) = 0 if i ≠ j, and Var(η_i) = λ_i, i.e., the η_i independently follow the normal distributions N(0, λ_i). Therefore, we have:

E[d(π̂, π)] = E[||π̂ − π||_2^2] = E[||X η||_2^2] = E[||η||_2^2] = Σ_i λ_i = trace(Λ) = trace(Σ).
The last equality is due to the fact that

trace(Σ) = trace(XΛX^T) = trace(ΛX^TX) = trace(Λ).

This proves the lemma.
LEMMA 4. Let P_i be the randomization matrix specified in (1). When p_i ≠ 1/d_i, we have

P_i^{-1} = (1/(p_i − q_i)) (I − q_i 1 1^T),  and hence  ||P_i^{-1}||_F^2 = [d_i (1 − q_i)^2 + d_i (d_i − 1) q_i^2] / (p_i − q_i)^2,

where 1 is the column vector whose cells all equal 1.

PROOF. We can re-write P_i as follows:

P_i = (p_i − q_i) I + q_i 1 1^T.

With p_i + (d_i − 1) q_i = 1 and the binomial inverse theorem [45], we immediately get P_i^{-1}, and ||P_i^{-1}||_F^2 can be directly calculated from P_i^{-1}.
LEMMA 5. Let (P^{-1})_i denote the i-th column (or row) of P^{-1}. Then, for any i = 1, 2, ..., D, we have

||(P^{-1})_i||_2^2 = ||P^{-1}||_F^2 / D.

PROOF. With Lemma 4, we observe that every row (or column) of P_i^{-1} has the same components, except for a change of order. Since P^{-1} = P_1^{-1} ⊗ ··· ⊗ P_m^{-1} ⊗ P_s^{-1}, we can also conclude that all rows (or columns) of P^{-1} have the same components, except for a change of order. Therefore, for all i's,

||(P^{-1})_i||_2^2 = (1/D) Σ_{j=1}^D ||(P^{-1})_j||_2^2 = ||P^{-1}||_F^2 / D.
Next, we prove the main result. With Lemma 3, when the distance is the squared Euclidean distance, minimizing E[d(π̂, π)] is equivalent to minimizing trace(Σ). With (2), we have

trace(Σ) = (1/N) [Σ_β λ_β ||(P^{-1})_β||_2^2 − ||π||_2^2] = (1/N) [||P^{-1}||_F^2 / D − ||π||_2^2],

where the second equality follows from Lemma 5 and Σ_β λ_β = 1. Since the Frobenius norm factorizes over the Kronecker product, ||P^{-1}||_F^2 = ∏_i ||P_i^{-1}||_F^2 (the product taken over all randomized attributes, including S), so minimizing trace(Σ) is equivalent to minimizing this product. Notice that the constraint function is an increasing function of the p_i, so the optimal solution must occur when the equality holds, and we have proved the result.
-
[Table 1.] 2 × 3 contingency tables for two variables: Gender (QI) and Disease (sensitive)
-
[Fig. 1.] Randomization based privacy-preserving data publishing.
-
[Fig. 2.] Pr(S_r | QI_r) vs. randomization parameters. QI: quasi-identifier.
-
[Table 2.] Description of the Adult data set used in the evaluation
-
[Table 3.] Randomization parameters p_i for three cases of RR (data set EMGRW)
-
[Fig. 3.] Distances between π̂ and π for three scenarios of RR (data set EMGRW). RR: randomized response; QI: quasi-identifier.
-
[Fig. 4.] For RR-Both, distances between π̂ and π decrease as the data set size increases (data set EMGRW). RR: randomized response.
-
[Fig. 5.] Distances between π̂ and π for anatomy, l-diversity, and RR-QI (randomized response, quasi-identifier) on the data set ESGRO.
-
[Fig. 6.] Relative errors of queries for anatomy, l-diversity, and RR-QI (randomized response, quasi-identifier) on the data set ESGRO.
-
[Table 4.] Variation of correlation (uncertainty coefficient) between pairs of quasi-identifier (QI) attributes under different models (×10^-2) (data set ESGRO)
-
[Table 5.] Variation of correlation (uncertainty coefficient) between quasi-identifier (QI) attributes and S under different models (×10^-2) (data set ESGRO)
-
[Fig. 7.] Average value of uncertainty coefficients among attributes for anatomy, l-diversity, and RR-QI (randomized response, quasi-identifier) on the data set ESGRO.