Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 4. Dataset Transformations, Dataset Distance, and Stability

Data transformations are at the heart of data analysis tasks. In the context of transformations, it will aid your understanding to consider a dataset to be any form of data that has not been made private. That is, computed statistics (such as the sum or mean estimates) are also dataset(s). Transformations encompass any function from a dataset to a dataset.

Some commonly-used data transformations compute statistics - when you have a dataset, and you want to understand a property of it. A statistic aggregates many records into a summary, like a sum, mean, standard deviation, or even regression parameter(s).

Some transformations may modify each record in a dataset one at a time, therefore preserving the dimensions of the data, like when applying mathematical transformations to each element in a column, or when assigning bin numbers to each record. Other transformations still may compute a dataset join, or load data from a csv file or database.

In order to establish privacy guarantees later, it is first necessary to consider how an individual’s influence on the data may change once a transformation is applied. This change in influence in characterized by a function’s stability.

This chapter discusses the stability of a number of dataset transformations that are typically used in differentially private releases.

In DP, it is valuable to look at statistics as special cases of more abstract data transformations. A broader perspective means that you can use more general tools to understand the sensitivity of a transformation and ensure that your data analysis process has the expected privacy guarantees.

One of the most common dataset transformations is the sum. In many situations, each element in the data you are privatizing has no natural bounds - values could approach positive or negative infinity without limitation. Naturally, unbounded data implies that many queries, like the sum, have unbounded sensitivity. When the sensitivity is unbounded, you can’t construct a DP mechanism that will privatize your statistic, since a change in a single value could cause an arbitrarily large change in the statistic, therefore leaking information about the dataset. This kind of scenario highlights the importance of having a well-defined sensitivity when performing a private data analysis.

Just like a statistic has an associated sensitivity, a more general data transformation has a notion of stability. Stability measures how much a transformation changes, given a change in the input data. By the end of this chapter, you should understand the notion of stability with respect to dataset transformations. Using this knowledge, you should be able to identify stable dataset transformations that complement differentially private mechanisms. In particular, you will be able to determine when it is appropriate to use clipping/clamping when performing a DP analysis.

Stability of Data Transformations

A transformation is a computation that maps a dataset to another dataset, or maps a dataset to an aggregate. Consider a transformation that squares each element ina dataset: $f left-parenthesis x right-parenthesis equals x squared$ . Then given the dataset $upper X equals StartSet 1 comma 2 comma 3 EndSet$ , the transformed data is:

$f left-parenthesis upper X right-parenthesis equals StartSet 1 comma 4 comma 9 EndSet$

This transformation operates on each row independently and returns a transformed dataset with the same number of rows. Such a transformation is called a row-by-row transformation. Such row-by-row transformations are rather common; you will find them useful as you learn more about DP.

The mean is another example of a transformation - one that maps a dataset to value computed from an aggregate of the data. Taking the same dataset, we can compute

$μ (X) = \frac{1 + 2 + 3}{3} = 2$

In this case, the result of the transformation is a single value, not a dataset. This single value takes the data in aggregate and returns a value that describes a property of the dataset. We’ll generally refer to such transformations as aggregators.

In this chapter, we are interested in transformations with bounded stability. Stability is a property of a transformation that guarantees that similar inputs map to similar outputs. If a transformation maps similar inputs to similar outputs, then we say it is stable.

You might be wondering what we mean by similar here - great question. Mathematically speaking, you will always want to be precise with your notions of similarity and closeness. Much like you say a release is " $ϵ$ -private,” you will also say that a transformation $T$ is " $c$ -stable.” Relative to the absolute distance, this means that whenever $StartAbsoluteValue u minus v EndAbsoluteValue less-than-or-equal-to d Subscript i n$ , then it is also true that $StartAbsoluteValue upper T left-parenthesis u right-parenthesis minus upper T left-parenthesis v right-parenthesis EndAbsoluteValue less-than-or-equal-to c dot d Subscript i n$ . In other words, inputs that are within $d_{i n}$ distance from each other will result in outputs which are no further apart than $c \cdot d_{i n}$ from each other.

Stability may already feel familiar to you, because stability is a generalization of sensitivity. Recall the definition of sensitivity from the previous chapter:

Δ f = max_{A \sim B} d_{A b s} (f (A), f (B)) = max_{A \sim B} | f (A) - f (B) |

Recall that $A \sim B$ requires that A and B are similar; the distance between them is bounded. The sensitivity is a form of stability - it guarantees that, when any two input datasets A and B are similar, the outputs are similar.

Sensitivity of the Sum Aggregator

We can use the definition of sensitivity to compute the sensitivity of a sum aggregator function, $f$ . This is not as daunting as it may sound! Define the function as a sum of the inputs: $f left-parenthesis upper X right-parenthesis equals sigma-summation Underscript i Endscripts upper X Subscript i$ where each $upper X Subscript i Baseline element-of double-struck upper R$ and substitute this form of $f left-parenthesis upper X right-parenthesis$ into the definition of sensitivity:

Δ f = max_{A \sim B} d_{A b s} (f (A), f (B)) = max_{A \sim B} \sum_{i} A_{i} - \sum_{i} B_{i}|

If you know that $A$ and $B$ differ by at most one record, then $\sum_{i} B_{i} = \sum_{i} A_{i} \pm B_{n}$ . With this observation, the formula collapses:

max_{A \sim B} \sum_{i} A_{i} - \sum_{i} B_{i}| = max_{A \sim B} \sum_{i} A_{i} - \sum_{i} A_{i} \pm B_{n}| = max_{A \sim B} | B_{n} |

This quantity is almost useful, but the magnitude of $B_{n}$ is unbounded, so the sensitivity is infinite. Therefore, the sum transformation is not stable. At least, not in its current state. All you need is an assumption that $B_{n}$ is bounded.

If you know that $B_{n}$ is no smaller than $L$ , and no greater than $U$ , then $max Underscript upper A tilde upper B Endscripts StartAbsoluteValue upper B Subscript n Baseline EndAbsoluteValue less-than-or-equal-to k max left-parenthesis StartAbsoluteValue upper L EndAbsoluteValue comma upper U right-parenthesis$ . Therefore, you can say that the sum transformation on a dataset with bounded elements is " $max left-parenthesis StartAbsoluteValue upper L EndAbsoluteValue comma upper U right-parenthesis$ -stable.”

Sensitivity and Clipping

Let’s think back to the classroom example - every score in the class was between 0 and 100. Why is that? Simply by circumstance, there was no extra credit on the exam, and there was no (negative) penalty for guessing incorrectly, so the minimum score was 0.

What happens in a scenario where we don’t have such a natural domain for the data? Let’s consider financial situations where the maximum is unknown or undefined. What is the maximum price of a house? The maximum income possible? Such scenarios do not give us natural bounds on the data.

Now remember how we calculated sensitivity as the maximum change in the final statistic that is possible from adding or removing a single data point. We knew that the lowest average score was 0, and the highest single score was 100, which gave us a sensitivity of 10. What happens if we don’t have a maximum score? We can no longer compute the sensitivity, since it becomes $\frac{(0 + \infty)}{10}$ which is undefined. In situations like this, we need to “clip” the data so that we can have a well-defined sensitivity.

Clipping: Clipping is a transformation that replaces each value outside of a set of pre-defined bounds with their nearest neighbor within the bounds. np.clip does this for the scalar case. The OpenDP Library similarly has a constructor for clamping data:

import opendp.prelude as dp
input_space = dp.vector_domain(dp.atom_domain(T=float)), dp.symmetric_distance()

clip_trans = input_space >> dp.t.then_clamp((0., 10.))

mock_dataset = np.random.normal(0., scale=10.)
transformed_dataset = clip_trans(data)

Now you have the tools to make your sensitivity value well-defined, even in cases where the data is unbounded. By using a clipping function like the ones above, you can ensure that your dataset lies between min_value and max_value, resulting in a sensitivity value that is meaningful for constructing DP statistics.

Example: DP Analysis of Income Data

Suppose you have a dataset of individuals and their annual incomes. The least amount they can be paid per year is $0, but it is unclear what the maximum is. For this reason, your data is on the domain $left-bracket 0 comma normal infinity right-parenthesis$ .

In this case, you would need to use outside domain knowledge to estimate a good maximum given the dataset. If you set it very high, you are almost guaranteed to not cut off any of the values, but your sensitivity will suffer. Conversely, if the maximum is set too low, then you will clip too many values and lose utility in your final statistic.

Choosing an optimal clipping value

There are several approaches you can take when selecting clipping bounds. Consider the DP mean process - for data without any natural bounds, the sensitivity is undefined. In order to guarantee DP in a mean release, you need to somehow provide bounds to the data.

There are two ways to deal with extreme values: trimming, and clipping. Trimming means removing any values outside the range [min, max], while clipping means replacing values outside the range with min or max.

Percentiles are a natural starting point. If you define a percentile value $p$ , you can trim (remove) the $StartFraction p Over 2 EndFraction$ % smallest and $(1 - \frac{p}{2})$ % largest values. Trimming a total of 10% of the data (the 5th and 95th percentiles) is a relatively common starting point.

This has an effect on the final statistic. As an example, consider the set of numbers $left-bracket 1 comma 2 comma 8 comma 9 comma 11 comma 22 comma 25 comma 27 comma 35 comma 47 comma 58 comma 68 comma 68 comma 74 comma 75 comma 89 comma 90 comma 94 comma 98 comma 100$ ] with mean 50.05. This dataset has 20 values, meaning that each value represents 5% of the data. If you trim this dataset based on the 5th and 95th percentiles, you end up removing the smallest and largest value: $left-bracket 2 comma 8 comma 9 comma 11 comma 22 comma 25 comma 27 comma 35 comma 47 comma 58 comma 68 comma 68 comma 74 comma 75 comma 89 comma 90 comma 94 comma 98$ ]. This has mean 50.00.

Python Functions for Bound Estimation

Thankfully, you don’t have to go through all of these steps every time you want to establish optimal clipping parameters.

import numpy as np
import opendp.prelude as dp

def release_dp_mean(bounds, contributions, epsilon):
    context = dp.Context.compositor(
        data=data,
        privacy_unit=dp.unit_of(contributions=contributions),
        privacy_loss=dp.loss_of(epsilon=epsilon),
        split_evenly_over=2
    )

    numerator = context.query().clamp(bounds).sum().laplace().release()
    denominator = context.query().count().laplace().release()
    return numerator / denominator

pr = []
true_mean = sum(data) / len(data)
print(f"True mean: {true_mean}
")

print("Mean	Utility")
for upper in [50., 75., 90., 100.]:
    clipped_dp_mean = release_dp_mean((0., upper), contributions=1, epsilon=1.)
    utility = abs(clipped_dp_mean - true_mean)
    print(f"{clipped_dp_mean}	{utility}")

Categorical Data Transformations

Categorical data is, as its name suggests, data that be divided into different categories.

Table 4-1. Vehicle database
License Plate	Vehicle Type
ABC123	Car
JCO549	Truck
OFJ295	Car
EMF494	Motorcycle
QMC583	Truck

In this case, you can aggregate the vehicles by the column “Vehicle Type” and generate a count: 2 cars, 2 trucks, 1 motorcycle. The possible categories (car, truck, motorcycle) are also known as the keys.

When dealing with categorical data, the set of keys might be entirely known or only partially known. In some cases, the set of keys is clear and well known. For example, consider a census survey. If there is a question related to citizenship status, there are only two possible keys: citizen or non-citizen. If a data analyst decides to create a histogram of citizenship status, the data analyst will query the number of individuals whose status is “citizen” and the number of individuals whose status is “non-citizen.”

The data analyst does not know how many individuals are in each category, so they will query the count of citizens and the count of non-citizen individuals. In this query, an individual belongs to either the category “citizen” or “non-citizen”, so all neighboring databases will have the same set of categories.

Next, consider an illustrative example where a data analyst queries citizenship status statistics from a database:

Table 4-2. Survey of individual characteristics of employees in a company
First Name	Last Name	Budget	Citizenship Status
Caryl	Baptie	$898,031.59	Citizen
Moyra	Leverson	$847,791.81	non-Citizen
Elinore	Gillbard	$729,605.84	Citizen
Farleigh	Crampton	$9,742,235.31	Citizen
Sebastien	Marples	$3,677,589.94	non-Citizen
Baxy	Doohan	$1,044,044.63	Citizen
Henrie	Whawell	$4,670,377.71	Citizen
Dorothy	Drummer	$2,641,401.28	Citizen
Meaghan	Clinnick	$989,042.50	non-Citizen
Tildy	Gutans	$5,986,640.99	Citizen

The data analyst queries the count of individuals with Citizenship Status == Citizen and the count of individuals where Citizenship Status == non-Citizen. These two queries represent the citizenship status histogram. In this case, adding or subtracting individuals from this database will not change the key set in the histogram, since the only two options are “citizen” and “non-citizen.”

If the analyst builds a histogram of citizenship counts from this table, it will appear like so:

This dataset contains an example of a categorical variable with a known key set. Because the key set is known, the set of keys is consistent over all possible neighboring databases. There is no third possible key that can be introduced here - the category is binary. If the data analyst wants to release a differentially-private histogram of citizenship statuses of individuals in a database, they can use a differentially private count query on the database. Because the keys are known, the DP count query will provide a differentially private data release.

Recall from chapter 3 that DP has the parallel composition property: this means that performing a $ϵ$ -DP operation on different (non-overlapping) parts of a dataset is still $ϵ$ -DP. Further, the total $ϵ$ does not increase with the number of subsets. Using this definition, we can apply a DP count function to each bin in the histogram, adding noise separately to each bin.

An example of a DP count with noise addition could look like this:

Note that the counts are close to the true values of 7 and 3.

Adding or subtracting individuals from this database will not affect the key set, since there are only two fixed options. The possible key set always remains the same independently of the individuals in the dataset.

When categorical variables have a wider variety of possible categories, you may encounter a scenario in which the key set of the categorical variable is unknown. Consider an example where a data analyst is building a histogram from the same dataset, but this time based on job titles. The job titles are self-reported by the individuals, meaning that the set of possible job titles is not known ahead of time.

Table 4-3. Job title associated with each user in the data
First Name	Last Name	Budget	Job Title	Citizenship Status
Caryl	Baptie	$898,031.59	Accountant	Citizen
Moyra	Leverson	$847,791.81	Teacher	non-Citizen
Elinore	Gillbard	$729,605.84	Actor	Citizen
Farleigh	Crampton	$9,742,235.31	Engineer	Citizen
Sebastien	Marples	$3,677,589.94	Teacher	non-Citizen
Baxy	Doohan	$1,044,044.63	Medical Doctor	Citizen
Henrie	Whawell	$4,670,377.71	Accountant	Citizen
Dorothy	Drummer	$2,641,401.28	Teacher	Citizen
Meaghan	Clinnick	$989,042.50	Accountant	non-Citizen
Tildy	Gutans	$5,986,640.99	Engineer	Citizen

Suppose the data analyst decides to make a GROUP BY query, getting the counts of individuals for each job title:

And the differentially private version of the histogram is

Now, suppose that Elionore Gillbard requests that her data be redacted from the database. The new database without Elionore Gillbard is a neighboring database. The new histogram produced by the data analyst will look like so:

[[histogram_of_job_titles_of_a_neighboring database]] .Histogram of job titles of a neighboring database (non-private) image::images/ch4_fig4.png["query database without Barb"]

Applying differential privacy to the counts will result in the following histogram

When comparing the histogram from Figure 4-4 with the histogram from Figure 4-6, it is easy to notice that there is a privacy violation happening in the data release. Both histograms should be differentially private, and yet they are leaking information. Based on a quick inspection, you can infer that the job title of the person who left the database is “Actor.”

To overcome this issue, the data analyst should pre-define the key set before making the analysis. In the case of the job titles variable, the data analyst can create a set of job titles based on their previous knowledge of possible answers to the question of “what is your job title?” In addition to the job titles that the data analyst includes in the pre-defined set, the data analyst can define a category “other” that will include new job titles and outliers.

Now let’s understand how the pre-defined key set is defined and used to make differetially private queries to categorical data in the following example.

Example: Histogram Queries

Let’s consider the database described in Table 4-4. To query a differentially private histogram of job title, the data analyst will:

Define the key set of the categories in the histogram based on previous information
Count the number of individuals in each of the pre-defined categories

The data analyst starts by defining the following job titles categories as the key set for the data analysis:

Accountant
Teacher
Engineer
Other

The histogram maps the job titles Accountant, Teacher, and Engineer to equivalent categories in the key set, while the job titles Actor and Medical Doctor are mapped to the category Other. The non-private histogram of job counts appears like so:

The private histogram has a very similar appearance, with noise added to the counts using a Laplace mechanism with $epsilon equals 1$ :

Notice that with this configuration, any individual, with any job title, can be added or deleted from the database without resulting in privacy violations. In contrast to the previous example, any unexpected job title will simply be mapped to Other.

Suppose Marta James, an architect, is added to the database:

Table 4-4. Job title associated with each user in the data
First Name	Last Name	Budget	Job Title	Citizenship Status
Caryl	Baptie	$898,031.59	Accountant	Citizen
Moyra	Leverson	$847,791.81	Teacher	non-Citizen
Elinore	Gillbard	$729,605.84	Actor	Citizen
Farleigh	Crampton	$9,742,235.31	Engineer	Citizen
Sebastien	Marples	$3,677,589.94	Teacher	non-Citizen
Baxy	Doohan	$1,044,044.63	Medical Doctor	Citizen
Henrie	Whawell	$4,670,377.71	Accountant	Citizen
Dorothy	Drummer	$2,641,401.28	Teacher	Citizen
Meaghan	Clinnick	$989,042.50	Accountant	non-Citizen
Tildy	Gutans	$5,986,640.99	Engineer	Citizen
Marta	James	$9,480,446,38	Architect	non-Citizen

The pre-defined key set prevents the new histogram configuration from adding a job title Architect in the histogram, and potentially leaking the addition of Marta to the dataset.

Summary

Exercises

Identify the following transformations as either aggregates or general transformations. If they are not aggregates, specify their range. Construct a limitation on their input so that the stability is defined.
1. f(x, min, max) = x if x in [min, max]
2. $f (x) = \frac{1}{| x |} \sum_{i} x$
3. $f left-parenthesis x right-parenthesis equals x squared$
4. $f left-parenthesis ModifyingAbove x With right-arrow right-parenthesis equals upper A ModifyingAbove x With right-arrow$ where $A = \begin{matrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{matrix})$
Generate an array of 1000 random floats using a library like NumPy.
1. Calculate the mean of the data
2. Calculate the 5th and 95th percentiles for the data
3. Clip these values
4. Recompute the mean - how has it changed?
5. Repeat c and d with trimming
Prove that DP observes parallel composition

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for
4. Dataset Transformations, Dataset Distance, and Stability

Chapter 4. Dataset Transformations, Dataset Distance, and Stability

Stability of Data Transformations

Sensitivity of the Sum Aggregator

Sensitivity and Clipping

Example: DP Analysis of Income Data

Choosing an optimal clipping value

Python Functions for Bound Estimation

Categorical Data Transformations

Figure 4-1. Histogram of citizenship status (non-private)

Figure 4-2. Histogram of citizenship status (private)

Figure 4-3. Histogram of job titles (non-private)

Figure 4-4. Histogram of job titles (private)

Figure 4-5. Histogram of job titles of a database (private) .

Example: Histogram Queries

Figure 4-6. Histogram of job titles with a pre-defined key set

Figure 4-7. Histogram of job titles with a pre-defined key set.

Summary

Exercises

Table of Contents for 4. Dataset Transformations, Dataset Distance, and Stability

Create new playlist

Sign In

Sign Up

Chapter 4. Dataset Transformations, Dataset Distance, and Stability

Stability of Data Transformations

Sensitivity of the Sum Aggregator

Sensitivity and Clipping

Example: DP Analysis of Income Data

Choosing an optimal clipping value

Python Functions for Bound Estimation

Categorical Data Transformations

Figure 4-1. Histogram of citizenship status (non-private)

Figure 4-2. Histogram of citizenship status (private)

Figure 4-3. Histogram of job titles (non-private)

Figure 4-4. Histogram of job titles (private)

Figure 4-5. Histogram of job titles of a database (private) .

Example: Histogram Queries

Figure 4-6. Histogram of job titles with a pre-defined key set

Figure 4-7. Histogram of job titles with a pre-defined key set.

Summary

Exercises

Table of Contents for
4. Dataset Transformations, Dataset Distance, and Stability