Differential Privacy in Practice
 Author: Nguyen Hiep H., Kim Jong, Kim Yoonho
 Publish: Journal of Computing Science and Engineering, Volume 7, Issue 3, pp. 177-186, 30 Sep 2013

ABSTRACT
We briefly review the problem of statistical disclosure control under the differential privacy model, which entails a formal and ad omnia privacy guarantee separating the utility of the database from the risk due to individual participation. It has borne fruitful results over the past ten years, both in theoretical connections to other fields and in practical applications to real-life datasets. The promises of differential privacy help to relieve concerns of privacy loss, which hinder the release of community-valuable data. This paper covers the main ideas behind differential privacy, its interactive versus non-interactive settings, perturbation mechanisms, and typical applications found in recent research.

KEYWORD
Differential privacy, Protection mechanisms, Attacks

I. INTRODUCTION
Differential privacy has become a de facto principle for privacy-preserving data analysis tasks, and has had many successful applications despite having been devised less than ten years ago. However, its rapid expansion calls for a systematic approach to grasp the fundamental concepts from different viewpoints, as well as to clarify its conceivable limits. The ultimate goal of this paper is to highlight the ideas of differential privacy by presenting the motivation behind them, along with popular mechanisms and applications. Helpful advice is also given for those who want to go deeper and obtain further benefits from this emerging paradigm.
> A. Motivation of Differential Privacy
Digital information is collected daily in growing volume by governments, corporations, and individuals. Mutual benefits drive the demand for the exchange and publication of data among parties. However, how to handle this data properly is unclear, because the data in its original form typically contains sensitive information about participants.
Syntactic processing paradigms like the suppression of identifying fields were eventually proven unsuccessful, as confirmed by past privacy breaches achieved simply by joining a de-identified table with publicly available databases. Ad hoc anonymization models like k-anonymity [1], l-diversity [2], and t-closeness [3] formalize the intuition of "privacy by blending yourself into a crowd" to achieve stronger guarantees. In k-anonymity, quasi-identifier fields are suppressed or generalized so that each record is indistinguishable from at least k-1 other records. Record linkage attacks are thus avoided, but not attribute linkage attacks, which are prevented in l-diversity. The concept of earth mover's distance is employed by t-closeness to deal with probabilistic attacks, focusing on how an attacker would change his probabilistic belief in the sensitive information of a victim after accessing the published data. Common limitations of these approaches include ad hoc assumptions on auxiliary information, heavy information loss, and suboptimality. An excellent survey on privacy-preserving data publishing (PPDP) was presented by Fung et al. [4].
To respond to the need for a firm foundation for PPDP, the concept of differential privacy was proposed by Dwork [5]. The concept stemmed from the impossibility of Dalenius's desideratum on statistical database privacy: nothing about an individual should be learnable from the database that cannot be learned without access to the database. A new measure is therefore suggested by separating the utility of the database from the risk due to the victim's participation. Two settings exist in PPDP: interactive and non-interactive. In the interactive setting, a curator sitting between the users and the database may modify the responses to queries posed by users to satisfy respondents' privacy requirements. Some difficulties occur in this model, such as how to answer large query sets in the face of even naive difference attacks. The non-interactive setting addresses these problems by releasing some statistics (a sanitization) once; the data are not used further. The quality of answers is affected, and the range of data processing operations is reduced, among other limitations of this "one-shot" scheme.
> B. Differential Privacy Concepts
Roughly speaking, differential privacy ensures that the outcome of any analysis on a database is not influenced substantially by the presence or absence of any individual. It is therefore difficult for an adversary to mount inference attacks on any data row. The idea revolves around the popular indistinguishability concept in semantic security. Privacy concerns about the removal or addition of a single row suggest formulating the guarantee over a pair of adjacent databases (D, D') differing in only one row.

Definition 1. (ε-differential privacy [5]) A randomized function K gives ε-differential privacy if for all data sets D and D' differing in at most one element, and all S ⊆ Range(K),

Pr[K(D) ∈ S] ≤ e^{ε} × Pr[K(D') ∈ S],

where the probability space in each case is over the coin flips of K.

The multiplicative factor e^{ε} simplifies the composition rules discussed in later sections. It implies that a zero-probability output on a given database is also zero-probability on any neighboring database, ruling out the subsample-and-release paradigm [5]. Group privacy is automatically satisfied when a collection of k users opt in or out if we extend the ratio in Definition 1 by a factor of e^{kε}. The definition below relaxes the strict relative shift for events that are not especially likely.

Definition 2. ((ε, δ)-differential privacy [6]) A randomized function K gives (ε, δ)-differential privacy if for all data sets D and D' differing in at most one element, and all S ⊆ Range(K),

Pr[K(D) ∈ S] ≤ e^{ε} × Pr[K(D') ∈ S] + δ,

where the probability space in each case is over the coin flips of K.

The additive factor δ denotes the failure tolerance of the constraint.

We can achieve ε-differential privacy by calibrating the noise magnitude added to the output. This noise depends on the sensitivity of the function K, a fundamental quantity in this privacy paradigm.

Definition 3. (Global sensitivity [7]) For f : D → ℝ^{d}, the L_{p}-sensitivity of f is

Δf = max_{D,D'} ||f(D) − f(D')||_{p}

for all D, D' differing in at most one element.

As an example, the count function over a set S, f(S) = |S|, has L_{1}-sensitivity 1. For many functions, such as the median, global sensitivity yields large noise, and hence destroys the utility. Nissim et al. [8] proposed the concept of local sensitivity, taking into account not only the function but also the database. The goal is to add instance-specific noise with smaller magnitude than the worst-case noise implied by global sensitivity.

Definition 4. (Local sensitivity [8]) For f : D → ℝ^{d} and x ∈ D, the L_{p}-local sensitivity of f at x is

LS_{f}(x) = max_{y : d(x,y)=1} ||f(x) − f(y)||_{p}.

However, directly adding noise proportional to LS_{f}(x) might reveal information about x itself. So, the noise has to be an insensitive function. One study defined a β-smooth upper bound on LS_{f}(x) [8], and the noise is calibrated according to these smooth upper bounds. A similar idea can be found in several works where data-specific ε-DP mechanisms were devised to avoid naive but inefficient noise injection approaches [9,10].

II. MECHANISMS
This section surveys two widely used mechanisms, the Laplace mechanism [7] and the exponential mechanism [11]. In addition, other variants are also introduced.
> A. Laplace and Gaussian Mechanisms
For the case of real output, adding properly calibrated Laplace noise to the output is a standard technique to realize differential privacy. The noise is sampled from a Laplace distribution with probability density function (pdf)

p(x|λ) = (1/2λ) e^{−|x|/λ},

where λ is determined by both Δf and the desired privacy budget ε.

Theorem 1. (Laplace mechanism [12]) For any function f : D → ℝ^{d}, the mechanism

K(D) = f(D) + (L_{1}(λ), ..., L_{d}(λ))

gives ε-differential privacy if λ = Δf/ε and the L_{i}(λ) are i.i.d. Laplace random variables.

Notice that Laplace noise has zero mean and variance 2λ^{2}. So, larger values of ε mean lower noise variance, and the noisy outputs are then closer to the true ones, resulting in better utility but a lower privacy level.

To achieve (ε, δ)-differential privacy, we can use Gaussian noise, with the L_{1}-sensitivity replaced by the L_{2}-sensitivity Δ_{2}f = max_{D,D'} ||f(D) − f(D')||_{2}. Zero-mean Gaussian noise has the pdf

p(x) = (1/(σ√(2π))) e^{−x^{2}/(2σ^{2})}.

Theorem 2. (Gaussian mechanism [6]) For any function f : D → ℝ^{d}, the mechanism

K(D) = f(D) + (N_{1}(0, σ^{2}), ..., N_{d}(0, σ^{2}))

gives (ε, δ)-differential privacy if σ ≥ Δ_{2}f √(2 ln(1.25/δ))/ε and the N_{i}(0, σ^{2}) are i.i.d. Gaussian random variables.

> B. Exponential Mechanism
The Laplace mechanism makes no sense for some tasks, such as when the function f maps databases to discrete structures like strings, trees, or strategies. McSherry and Talwar [11] proposed a mechanism that can choose an output t ∈ T close to the optimum while satisfying ε-differential privacy. In a nutshell, the exponential mechanism takes an input D, an output range T, a privacy parameter ε, and a score function u : (D × T) → ℝ that assigns a real value to every t, where a higher score means better utility. The sensitivity of the score function is defined as Δu = max_{t,D,D'} |u(D, t) − u(D', t)|.

Theorem 3. (Exponential mechanism [11]) For any function u : (D × T) → ℝ, the mechanism

M(D, u) = {return t with probability proportional to e^{εu(D,t)/(2Δu)}}

satisfies ε-differential privacy.

Here, 2ε-DP is the worst-case privacy cost of sampling proportionally to e^{εu(D,t)/Δu}: a neighboring database changes the numerator by at most a factor of e^{ε} and the normalizing denominator by at least e^{−ε}; the factor 1/2 in Theorem 3 compensates for this. The cost reduces to ε-DP when the denominator is a constant, as in the cases of the Laplace and Gaussian mechanisms. These mechanisms can be considered special cases of the exponential mechanism by taking u(D, t) = −|f(D) − t| (for Laplace) and u(D, t) = −(f(D) − t)^{2} (for Gaussian), where t is the output. The exponential mechanism can capture any differential privacy mechanism K by taking u(D, r) to be the logarithm of the probability density of K(D) at r. While the exponential mechanism takes time linear in the number of possible outputs to add noise, the Laplace and Gaussian mechanisms are much more efficient because of the O(1) complexity of sampling Laplace/Gaussian noise. The exponential mechanism is used extensively in ε-DP research [10,13-19].

> C. Other Mechanisms
The geometric mechanism [20] is a discrete variant of the Laplace mechanism with integral output range ℤ and random noise Δ generated from a two-sided geometric distribution,

Pr[Δ = k] = ((1 − α)/(1 + α)) α^{|k|} for every integer k, where α = e^{−ε}.

Among all ε-differentially private mechanisms, this discretized analogue strongly maximizes user utility [20].

> D. Composability
Issues of composition play a key role in any approach to privacy: the sanitization mechanism should preserve its privacy guarantees even when several outputs are taken together. Ganta et al. [21] exhibited composition attacks on independent k-anonymizations of intersecting data sets. Differential privacy satisfies both sequential composition and parallel composition [22].

Theorem 4. (Sequential composition [22]) Let A_{i} provide ε_{i}-differential privacy. A sequence of A_{i}(D) over the dataset D provides (Σ_{i} ε_{i})-differential privacy.

Theorem 5. (Parallel composition [22]) Let A_{i} provide ε_{i}-differential privacy, and let D_{i} be arbitrary disjoint subsets of the input domain D. The sequence of A_{i}(D_{i}) provides (max_{i} ε_{i})-differential privacy.

As examples, both compositions are commonly used in building partitioning trees for data summary under differential privacy [9,17,23,24].
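As a concrete illustration, the Laplace mechanism (Theorem 1), the exponential mechanism (Theorem 3), and sequential budget accounting (Theorem 4) can be put together in a few lines of Python. This is a minimal sketch for intuition only; the function names and toy data are ours, and a production implementation would need hardened floating-point sampling.

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace(sensitivity/epsilon) noise (Theorem 1)."""
    scale = sensitivity / epsilon
    u = rng.random() - 0.5                     # uniform in [-0.5, 0.5)
    noise = -scale * math.copysign(math.log(1 - 2 * abs(u)), u)
    return true_value + noise

def exponential_mechanism(candidates, score, sensitivity, epsilon, rng):
    """Pick t with probability proportional to
    exp(epsilon * score(t) / (2 * sensitivity)) (Theorem 3)."""
    scores = [score(t) for t in candidates]
    m = max(scores)                            # subtract max for numerical stability
    weights = [math.exp(epsilon * (s - m) / (2 * sensitivity)) for s in scores]
    r = rng.random() * sum(weights)
    for t, w in zip(candidates, weights):
        r -= w
        if r <= 0:
            return t
    return candidates[-1]

# Sequential composition (Theorem 4): two releases on the same data
# cost the sum of their budgets.
rng = random.Random(0)
data = [12, 7, 3, 41, 25]
noisy_count = laplace_mechanism(len(data), sensitivity=1, epsilon=0.5, rng=rng)
best = exponential_mechanism([3, 7, 12], score=lambda t: -abs(t - 7),
                             sensitivity=1, epsilon=0.5, rng=rng)
total_budget = 0.5 + 0.5   # the combined release is 1.0-differentially private
```

Note the design choice of subtracting the maximum score before exponentiating; without it, large ε values overflow the weights.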
> E. Utility Metrics
Popular utility metrics include (α, β)-usefulness [14], relative error with a sanity bound [25], absolute error [26], KL-divergence [19], and relative entropy [27]. The choice of proper metrics depends highly on the query types.

Definition 5. ((α, β)-usefulness [14]) A database mechanism A is (α, β)-useful for queries in class C if, with probability at least 1 − β, for every Q ∈ C and every database D,

|Q(A(D)) − Q(D)| ≤ α.

III. INTERACTIVE SETTING
For the interactive setting, queries are submitted to the curator in an interactive (and adaptive) manner. It remains unclear whether large numbers of queries can be answered accurately while preserving privacy. The first ideas were studied in the early days of differential privacy, with pioneering work by Dinur and Nissim [28]. They showed that in order to achieve privacy against a polynomially bounded adversary, one has to add perturbation of magnitude Ω(√n). Large numbers of queries in the interactive setting were revisited in recent studies [29,30].
> A. Methods
1) Median Mechanism
The first privacy mechanism capable of answering exponentially more queries than previous interactive mechanisms based on independent Laplace noise is the median mechanism by Roth and Roughgarden [29]. They give a basic (but inefficient) implementation and an efficient alternative with time complexity polynomial in the number of queries k, the database size n, and the domain size |X|. The key challenge is to determine the appropriate correlations between different query outputs on the fly, without knowledge of future queries. The main idea behind the median mechanism is to classify queries as "hard" or "easy" at low privacy cost. The number of "hard" queries is bounded by O(log k · log |X|) thanks to a Vapnik-Chervonenkis (VC) dimension argument [31] and the constant-factor shrinkage of the number of consistent databases every time a "hard" query is answered. A collection of databases consistent with the mechanism's answers is maintained, and a query deemed "easy" is answered by the median of the query over the database collection. The query error scales as (1/n^{1/3}) · polylog(k).

2) Multiplicative Weights Mechanism
Some open questions in the median mechanism are solved in the so-called private multiplicative weights (PMW) mechanism by Hardt and Rothblum [30]. The main result is achieving a running time only linear in N (for each of the k queries), nearly tight with previous cryptographic hardness results [32], while the error scales roughly as log k. Moreover, the proposed mechanism makes partial progress toward sidestepping previous negative results [32] by relaxing the utility requirement. Specifically, Hardt and Rothblum [30] considered accuracy guarantees for the class of pseudo-smooth databases with sublinear (or even polylogarithmic) running time. The main idea of PMW is to use a privacy-preserving multiplicative weights mechanism. Let the real "fractional" database be x. In each round t with query f_{t}, an updated database x_{t} is maintained, and the noisy answer is compared with the answer given by the previous round's database, f_{t}(x_{t−1}), to classify the round as "lazy" or "update" depending on whether the difference is "close" or "far". A "lazy" round simply outputs f_{t}(x_{t−1}) and sets x_{t} ← x_{t−1}, while an "update" round improves x_{t} using multiplicative weights re-weighting. If the total number of update rounds exceeds n, the mechanism fails and terminates (a low-probability event).

IV. NONINTERACTIVE SETTING
Differential privacy in a noninteractive setting is closely related to learning problems in a privacypreserving manner. Simple statistical quantities like counting and summing can be used as building blocks in a wide range of learning tasks [33].
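As an example of such a building block, a differentially private histogram needs only one Laplace draw per bin: adding or removing a record changes exactly one count, so parallel composition charges the whole histogram a single ε. A minimal sketch with our own function names:

```python
import math
import random

def laplace_noise(scale, rng):
    """One draw of zero-mean Laplace noise with the given scale."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_histogram(records, bins, epsilon, rng):
    """Noisy count per bin; the bins are disjoint, so by parallel
    composition the entire release costs only epsilon."""
    counts = {b: 0 for b in bins}
    for r in records:
        counts[r] += 1
    return {b: c + laplace_noise(1.0 / epsilon, rng) for b, c in counts.items()}
```

Note that empty bins must also be perturbed; releasing exact zeros would reveal which bins are unoccupied.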
> A. Applications
1) Histogram and Contingency Table
A histogram is typically defined over a specific domain as a partition of data points into disjoint bins. Two widely used types of histograms are unattributed (the semantic meaning of each bin is irrelevant to the analysis, so the histogram can be viewed as a multiset of counts) and attributed (the semantic meaning of each bin is maintained). Histogram sanitization under a formal privacy model was introduced early by Chawla et al. [34]. Although not exactly in the sense of differential privacy, (c, t)-isolation [34] embodies a similar idea of protecting items from isolation attacks. Its limitation lies in the assumption of data point distribution uniformity.
A contingency table is, informally, a table of counts over a set of attributes. These counts are called marginals, and their release aims at revealing correlations between many different sets of attributes. Barak et al. [35] applied the Laplace mechanism to the Fourier coefficients of a dataset vector (after casting it from the high-dimensional space indexed by attribute tuples), and then employed linear programming to create non-negative synthetic contingency tables. An improvement for sparse data was sketched by Cormode et al. [26], with a shortcut approach using sampling and filtering techniques. Geometric noise [20] was added to fulfill ε-differential privacy.
Hay et al. [36] pointed out that consistency-constrained inference, used as a post-processing step, can boost the accuracy of histogram queries, for both unattributed and universal histograms. An application of the concepts presented by Hay et al. [36] to accurately estimate the degree distribution of private networks was demonstrated in another study [37]. Range count queries are optimized by a wavelet transform in the Privelet technique [25]. Privelet adds polylogarithmic noise to the transformed data. One-dimensional ordinal and nominal data require the Haar wavelet and a nominal wavelet, respectively, while multidimensional data need standard decompositions to apply a one-dimensional transform along each dimension in turn.
Further improvement in DP-compliant histogram publication for compressible histograms was demonstrated by Acs et al. [19]. Their enhanced Fourier perturbation algorithm uses a better score function for exponential-mechanism-based sampling, exploiting the redundancy of the Fourier coefficients. Alternatively, private hierarchical partitioning (P-HPartition) is a clustering-based sanitization that exploits the redundancy between bins.
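The two-sided geometric noise [20] used for integral counts above admits a simple sampler: the difference of two i.i.d. one-sided geometric variables has exactly the distribution Pr[Δ = k] ∝ α^{|k|}. A sketch assuming sensitivity 1, so α = e^{−ε}:

```python
import math
import random

def one_sided_geometric(alpha, rng):
    """Geometric on {0, 1, 2, ...} with Pr[G = k] = (1 - alpha) * alpha**k,
    sampled by inversion."""
    return int(math.floor(math.log(1.0 - rng.random()) / math.log(alpha)))

def two_sided_geometric(epsilon, rng):
    """Noise with Pr[k] = ((1-a)/(1+a)) * a**|k|, a = exp(-epsilon):
    the difference of two i.i.d. one-sided geometric variables."""
    alpha = math.exp(-epsilon)
    return one_sided_geometric(alpha, rng) - one_sided_geometric(alpha, rng)

def geometric_mechanism(count, epsilon, rng):
    """Integer-valued private release of an integral count (sensitivity 1)."""
    return count + two_sided_geometric(epsilon, rng)
```

Unlike the Laplace mechanism, the output stays an integer, which matters when released counts feed downstream integer-only post-processing.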
2) Linear Counting Queries
Li et al. [38] generalized two approaches [25,36] with a matrix mechanism, an algorithm answering a workload W of linear counting queries. Their idea is to use another set of queries A (called a query strategy) as a proxy to the database. Noisy answers on the query strategy are then used to derive answers for the workload. By doing so, they hoped to exploit the correlation of the noise distribution to increase accuracy. A semidefinite program with a rank constraint was formulated, with complexity O(n^{8}), and some approximations were also developed. A distinct novelty of this approach [38] was to point out that some earlier mechanisms [25,36] are special cases of the matrix mechanism. EigenDesign [39] is an efficient implementation with a complexity of O(n^{4}). It uses a singular value decomposition lower bound to reduce the search space for the matrix A from the full space ℝ^{n×n} to the linear space spanned by the eigenvectors of W^{T}W. The full-rank limitation of [38] was emphasized by Yuan et al. [40]. They proposed the low-rank mechanism (LRM) to reach the theoretical lower bound proven by Hardt and Talwar [41]. The workload matrix was decomposed, and the constraint was relaxed. The results show LRM's outperformance of the previous studies [25,36] in most query scenarios.
Other approaches [41,42] used convex geometry to devise differentially private mechanisms and upper/lower bounds for query workloads. These schemes were based on the exponential mechanism.

3) Partitioning
Tree-based partitioning techniques find useful applications under differential privacy. Cormode et al. [23] focused on spatial data indexing for range queries without revealing data points. Their method, private spatial decomposition, addresses both data-dependent (e.g., kd-tree) and data-independent (e.g., quadtree) tree structures with geometric privacy budget allocation. For data-dependent structures, a private mechanism (e.g., a private median) is needed for choosing splitting points. Post-processing enhances the query accuracy. DP-tree, proposed by Peng et al. [43], builds a nested tree structure with consistency enforcement and adaptive privacy budget assignment. The work improves the asymptotic error bound and query accuracy compared to the approach by Cormode et al. [23].
To publish sequential data like Web browsing histories and mobility traces in private settings, Chen et al. [24] employed a variable-length n-gram model, widely used in natural language processing, to extract essential information from a sequential dataset and build an exploration tree satisfying ε-DP. Synthetic data constructed from the tree can be safely used by analysts. The salient contribution of this work is to promote novel techniques based on Markov assumptions. Similarly, private set-valued data publishing, which is high-dimensional by nature, needs data-dependent partitioning to ensure utility. Chen et al. [9] demonstrated such an approach with the help of context-free taxonomy trees. Their scheme is (α, β)-useful and can be applied in the context of relational databases.
Great effort has been dedicated to blending ε-DP with traditional learning tasks, as demonstrated throughout a series of papers over the last ten years. Examples include clustering tasks like k-means [33] and mixtures of Gaussians [8]; classification tasks like ID3 [33,16], C4.5 [17], and the Gaussian classifier [44]; and linear and logistic regression [13,45]. Other tasks like dimensionality reduction [33,46,47] and statistical estimators [48,49] also have private versions. Support vector machines [50] and boosting [51] have also been studied in the context of differential privacy. However, the applicability of ε-DP to a vast number of other learning techniques remains unclear. One reason for this is the cumbersome sensitivity analysis that accompanies ε-DP.

> B. Frameworks
Several analysis frameworks and runtime toolkits have been developed to simplify the use of differential privacy in daily data processing tasks, hence shaping ε-DP thinking. Sublinear query (SuLQ) [33] is a primitive powerful enough to be used in a collection of learning techniques. Privacy integrated queries (PINQ), written by McSherry [22], is a building block in various applications like network trace analysis [52] and recommender systems [53]. For the MapReduce framework, Airavat [54] combined mandatory access control and differential privacy to provide strong security and privacy guarantees for distributed computations on sensitive data. It has some limitations in composition and requires trusted mappers. PASTE [18] aims at the aggregation of distributed time-series data via ε-DP Fourier transformation and homomorphic encryption. GUPT [55] uses a sample-and-aggregate framework [8] and a new model of data sensitivity. It resists side-channel attacks and gains better accuracy with resampling techniques.
For non-interactive contexts, a formal learning theory was proposed by Blum et al. [14]. They demonstrated that it is possible to release synthetic private databases that are useful for all queries over a discretized domain from a concept class with polynomial VC dimension. However, their mechanism is inefficient, because it requires sampling from a nontrivial probability distribution over an unstructured space of exponential size. Nissim et al. [8] provided an alternative, data-dependent approach to sensitivity analysis that can solve hard problems like median publishing or the cost of the minimum spanning tree. Beyond that, it put forward a general sampling technique called sample-and-aggregate [8].

V. CONTINUAL OBSERVATION SETTING
> A. Methods
The continual observation setting, in which data aggregators continually release updated statistics, also needs to be placed in a differential privacy context to protect contributors' privacy. Examples include Amazon's and IMDb's popular item recommendations, and Google's and Yahoo's hot search keyword suggestions. Chan et al. [56] addressed the continual counting problem over a bit stream (with bounded or unbounded time). Using the idea of p-sums as intermediate results, they came up with the binary counting mechanism for the time-bounded case, whose error at each time step t scales roughly as O((log t)^{1.5}/ε). The extension to the time-unbounded case with a hybrid mechanism achieves the same utility. To satisfy the consistency condition (where the difference between the current count and the previous value is 0 or 1), the error increases by a factor of (log t)^{2}. A slight modification converts the hybrid mechanism to a pan-private one [57]. However, the work was limited to event-level privacy only [56].
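The p-sum bookkeeping can be sketched concretely: each prefix count at time t is assembled from the noisy p-sums corresponding to the 1-bits of t, so any single item influences at most log2(T)+1 released values. A minimal sketch with our own naming; splitting the budget evenly across levels is one simple (not necessarily optimal) choice.

```python
import math
import random

def binary_counting(bits, epsilon, T, rng):
    """Continual counting over a bit stream of length T using noisy
    p-sums: level i covers a dyadic block of length 2^i."""
    def noise():
        # Each of the log2(T)+1 levels consumes epsilon/(log2(T)+1).
        scale = (math.log2(T) + 1) / epsilon
        u = rng.random() - 0.5
        return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

    levels = int(math.log2(T)) + 1
    alpha = [0.0] * levels        # running exact p-sums
    noisy = [0.0] * levels        # their noisy released versions
    counts = []
    for t, b in enumerate(bits, start=1):
        # i = index of the lowest set bit of t: blocks below i merge into it.
        i = (t & -t).bit_length() - 1
        alpha[i] = sum(alpha[:i]) + b
        for j in range(i):
            alpha[j] = 0.0
            noisy[j] = 0.0
        noisy[i] = alpha[i] + noise()
        # The count at time t sums the noisy p-sums on t's 1-bits.
        counts.append(sum(noisy[j] for j in range(levels) if t >> j & 1))
    return counts
```

At each step only O(log t) noisy terms are summed, which is where the polylogarithmic error of the binary mechanism comes from.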
panprivate algorithms. Roughly speaking, these algorithms retain their privacy properties even if their internal state become visible to intruders. Another contribution was theuserlevel privacy regarding the existence of all events that belong to a user in the stream, which is stronger thaneventlevel privacy .εdifferentially panprivate versions of several counting algorithms were provided, such as the density estimator, tcropped mean estimator, kheavy hitters estimator, tincidence estimator, and modk incidence counter [57]. Some impossibility results and separation results betweenrandomized response andprivate sketching were also provided.VI. ATTACKS
> A. Blatant Nonprivacy
In an interactive setting, a database d in the form of an n-bit vector is blatantly non-private [28] if, after interacting with the database curator, an adversary can reconstruct a candidate database c that agrees with d on all but o(n) entries. Dinur and Nissim [28] showed that adding o(√n) noise to every response is blatantly non-private against a polynomial-time bounded attacker asking O(n log^{2} n) queries. The attack consists of two steps: posing O(n log^{2} n) random subset-sum queries, then solving a linear program with n variables and O(n log^{2} n) constraints and rounding the results. Dwork and Yekhanin [58] noted the inefficiency of this attack, with its worst-case running time of O(n^{5} log^{4} n), and proposed a sharper attack relying on basic properties of the Fourier transform over the group ℤ_{2}^{k}. Their method requires only n queries and runs in O(n log n) time. Their second contribution was an interpolation attack against a class of curators answering at least a (1/2 + ε) fraction of queries with low noise. The main idea was to achieve error correction via polynomial interpolation, with a running time of poly(e/ε). A more comprehensive summary of blatant non-privacy is available elsewhere [59].

> B. Side-Channel Attacks
Apart from blatant non-privacy, there exist other vulnerabilities under side-channel attacks. Processing side effects like long or rejected responses also reveal some information about victims. Haeberlen et al. [60] pointed out such threats against the existing systems PINQ [22] and Airavat [54]. They tested three attacks: the timing attack, the state attack, and the privacy budget attack. By intentionally pausing for a long time in the query code when a certain condition is detected, the privacy mechanism reveals one bit (yes/no). The state attack exploits a global variable to open a channel between microqueries. The privacy budget attack checks how much the given privacy budget has decreased when the outer query returns. To cope with these attacks, the Fuzz system, with built-in attack-resistance capabilities, was proposed. The GUPT system [55] was claimed to be safe under these three types of attacks.

> C. No Free Lunch in Data Privacy
Kifer and Machanavajjhala [61] critically analyzed the privacy protections of differential privacy via non-privacy games. They addressed several popular misconceptions about differential privacy, including: that it makes no assumptions about how data are generated; that it protects an individual's information even if an attacker knows about all other individuals in the data; and that it is robust to arbitrary background knowledge. Employing a no-free-lunch theorem, they argued that it is not possible to offer both privacy and utility without making assumptions about how the data are generated. They emphasized that a user's privacy is preserved if the attacker's inference about the user's participation in the data-generating process is limited; it is difficult, however, to come up with a general definition of participation that applies to all data-generating mechanisms. These ideas were clarified through examples from social network research and tabular data. In a social network case study, several network evolution models were used to evaluate how the existence of an edge is revealed in special cases. Similarly, for contingency tables, differential privacy may become useless if applied after deterministic data have been released.

VII. CONCLUSIONS
Since its emergence, differential privacy has expanded its frontier rapidly, with many interesting ideas under consideration, such as the geometry and algorithmic complexity of differential privacy, or alternative privacy guarantees that compose automatically and give better accuracy. The determination of a conceptually simple definition of differential privacy has attracted great interest over the last decade. The interactive and non-interactive settings have both witnessed theoretical and practical advancements. Connections between differential privacy and other fields such as cryptography, statistics, complexity, combinatorics, mechanism design, and optimization provide fertile ground for upcoming growth. This paper is a short recapitulation of the main results from these works, in the hope of promoting this emerging privacy model. We have emphasized the motivations, popular mechanisms, and typical work in various settings. Existing work on the misconceptions about differential privacy and on possible attacks in special cases suggests that it should be used with caution.