Chapter 10. Protecting Your Data Against Privacy Attacks

This chapter discusses attacks on data releases and how differential privacy can protect against them. While other chapters in this book touch briefly on privacy attacks, this chapter covers a wider variety of attacks in greater detail, along with the ramifications of each: an attack may be used to reconstruct an individual’s data, or to infer whether an individual is present in a dataset.

The attacks are explained from the perspective of two parties, a data analyst and a data curator. To ensure the protections on the data are sufficiently robust to protect the privacy of individuals, assume that the data analyst harbors the worst intentions: the analyst is an adversary who is determined to violate the privacy of individuals in the sensitive dataset. Fortunately, differential privacy not only shares this assumption, but it also strengthens it, in that it assumes that the adversary has potentially unbounded computational power and access to auxiliary information. Note that, even though differential privacy makes these strong assumptions, many privacy attacks can be practically deployed without either assumption.

This chapter assumes an interactive model, where the adversary can interactively communicate with the data curator. In many scenarios, the data curator only makes one static release. This is a trivial case of the interactive model, where there is only one round of communication from the curator to the adversary. Thus, the interactive model and the techniques discussed in this chapter are just as applicable when the curator makes a one-time release.

This chapter starts with an overly-generous data curator, who only removes personally identifying columns before sharing the sensitive dataset with an adversary. As the chapter progresses, increasingly sophisticated attacks are discussed and demonstrated.

  • Record linkage

  • Singling out

  • Differencing

  • Reconstruction via systems of equations

  • Tracing

  • K-Anonymity vulnerabilities

  • ML black-box membership inference

These attacks motivate imposing certain minimal restrictions on the communication between the adversary and data curator that will maintain the privacy of individuals in the sensitive dataset. A general pattern emerges in that restrictions placed to thwart a certain kind of attack are not guaranteed to protect against all attacks. To protect against all possible attacks, one needs the formal mathematical guarantees provided by differential privacy.

Definition of a Privacy Violation

In order to qualify as a privacy attack, the attack must be able to reveal information about a specific individual.

A common refrain in the data privacy community goes:

Statistical inference is not a privacy violation.

Thus, a technique is not considered here if it is only able to infer group-level or demographic information. The attacks discussed in this chapter result in three kinds of privacy violations: membership inference, reconstruction and re-identification.

A membership inference attack is used to determine if an individual is a member of the dataset. In some cases, just leaking that an individual is in a certain dataset can cause serious harm. This kind of sensitivity occurs in many contexts, including healthcare information, human research data, and refugee data. For example, being exposed as a member of a medical trial may imply certain medical afflictions.

Microdata reconstruction is a greater privacy violation than membership inference, but it is also more difficult to execute. Microdata reconstruction is also a superset of membership inference: if a privacy attack is capable of microdata reconstruction, it is also capable of membership inference. Microdata reconstruction does not necessarily imply re-identification, but any microdata reconstruction vulnerability is just one linkage attack away from re-identification.

Re-identification occurs when the attacker is able to uniquely identify the individual to whom a piece of microdata belongs.

Tabular Datasets

The first attacks discussed in this chapter focus on the statistical analysis of tabular datasets. Tabular data is characterized by structured datasets organized in rows and columns, such as a table in a relational database or a CSV file. In this section, you will learn about attacks that specifically target vulnerabilities in tabular data.

Record Linkage

A record linkage attack leverages an auxiliary dataset with uniquely identified individuals to re-identify individuals in a sensitive dataset. This kind of attack is applicable whenever the attacker has access to the microdata of the sensitive dataset, that is, row-level access.

Microdata is the original form of the data, before any aggregations have been applied, where each row corresponds to individual contributions. Row-level access means that the adversary can read a row from the original dataset (or at least a subset of a row).

A data curator with the best intentions to protect the privacy of individuals in their data still grants row-level access to the data if they release their dataset with personally identifying columns removed. That is, removing names and social security numbers from your dataset still exposes individuals in your dataset to record linkage attacks.

Say a data curator has a teacher survey that they want to share with the public. The columns of the dataset are described in the following code snippet.

import pandas as pd
df = pd.read_csv("teacher_survey.csv")
df.columns = ['name',
              'sex',
              'age',
              'maritalStatus',
              'hasChildren',
              'highestEducationLevel',
              'sourceOfStress',
              'smoker',
              'optimism',
              'lifeSatisfaction',
              'selfEsteem']

Before releasing this dataset, the data curator attempts to anonymize it by removing the name column. Unfortunately, the name column is not the only column that can be used to re-identify an individual.

# naively "anonymize" by removing the name column
del df["name"]

Now switch your perspective to that of an adversary who would like to re-identify their coworker’s data. All you need to do is filter the dataset down to those rows that match attributes you know about your coworker. The following predicate applies to only one row in the dataset:

df.loc[(df['sex'] == 3) & (df['age'] == 27)]

You’ve now violated the privacy of your coworker by gaining access to all of their survey responses.

The attack is easier to execute if your coworker has any particularly distinctive attributes. If the predicate matches more than one record you may still be able to resolve the ambiguity by collecting more information about the target of interest and refining the predicate. If the ambiguity cannot be resolved, you still gain probabilistic information about the target’s responses. That is, if four records are matched, and three are smokers, then you’ve learned that your coworker is likely a smoker.

In this attack, the adversary took advantage of pseudo-identifiers in the dataset. A pseudo-identifier is a column that is not personally-identifying, but can be used to link records in a dataset to records in another dataset. Unfortunately, nearly any column in a dataset can be used as a pseudo-identifier, which makes it difficult to protect against by removing columns.

In this example linkage attack, the adversary was able to link records in a dataset to a trivial auxiliary dataset of one record. This attack can easily be generalized to auxiliary datasets with more than one record by using a database join instead of a filter. In practice, there are many commercially available datasets that can be used as auxiliary datasets.
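To make this concrete, here is a minimal sketch of such a join against a small, hypothetical auxiliary table (the names, values, and columns are invented for illustration):

import pandas as pd

# hypothetical auxiliary dataset, e.g., scraped from a staff directory
aux = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "sex":  [3, 1],
    "age":  [27, 54],
})

# join the "anonymized" survey to the auxiliary data on pseudo-identifiers
linked = aux.merge(df, on=["sex", "age"], how="inner")

# any name that matches exactly one survey record is re-identified
match_counts = linked["name"].value_counts()
reidentified = linked[linked["name"].isin(match_counts[match_counts == 1].index)]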

Linkage attacks are often the second phase of a privacy attack, used to enrich row-level microdata that has been extracted from another attack. Linkage attacks escalate what was once just microdata reconstruction to full re-identification.

If you wanted to employ differential privacy to privately release the microdata, you could imagine using locally differentially private mechanisms with parallel composition over the row-space, and sequential composition over the column space. It quickly becomes apparent that the statistical utility of the data release is poor compared to central differential privacy. The privacy loss parameters grow in size proportional to the number of columns in the data, and significant noise must be added to each entry. Another failure of this approach is that the primary benefit of local DP is not realized, in that a central authority, the data curator, still has access to data in the clear. These drawbacks are to be expected: while differential privacy could be used to release the microdata, this is not the ideal point to apply DP because the release would consist of a very large body of microdata that is highly sensitive in nature.
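As a rough sketch of what such a local DP release would look like, consider applying randomized response independently to each binary entry (the mechanism and budget split here are illustrative, not a recommended release strategy):

import numpy as np

def randomized_response(value, epsilon):
    """Report a binary value truthfully with probability exp(eps) / (1 + exp(eps))."""
    p_truth = np.exp(epsilon) / (1 + np.exp(epsilon))
    return value if np.random.random() < p_truth else 1 - value

# rows compose in parallel, but every column consumes additional budget
# (sequential composition), and every entry is individually noised
epsilon_per_column = 0.5
noisy_smoker = df["smoker"].apply(randomized_response, epsilon=epsilon_per_column)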

At this point, the data curator prevents re-identification via linkage attacks by trying to deny access to microdata completely. The data curator instead creates a portal from which the public can pose queries to be run on the data. This begins a battle of wits between the adversary and the data curator. While preventing access to microdata thwarts record linkage attacks, this is easier said than done. If the data curator does answer the queries via differential privacy, then the adversary can still gain access to microdata via singling out, differencing, and reconstruction.

Singling Out

In response to the data curator restricting access to the microdata, the adversary updates their strategy to use queries that single out individuals. A query singles out an individual when it contains a predicate that uniquely identifies an individual. If the data curator answers any query that singles out an individual, they unknowingly give the attacker access to microdata.

This attack takes advantage of the false assumption that data aggregations, like the mean, afford privacy to individuals in the data. While the following query, which counts the number of smokers in a given demographic, may seem benign, it actually extracts the smoking status of the previously targeted coworker.

df.loc[(df['sex'] == 3) & (df['age'] == 27)]['smoker'].sum()

Singling out is more difficult to defend against than you might expect. A data curator may unknowingly single out an individual by releasing statistics grouped by several attributes. For instance, in the teacher survey example, a query for the smoking rates when grouped by “sex” and “age” would single out any individuals in any group of size one, which definitively reveals their smoking status.
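To see this in the running example, the following sketch releases grouped smoking rates along with group sizes; any group of size one has its smoking status published outright:

# grouped release of smoking rates; groups of size one are singled out
grouped = df.groupby(['sex', 'age'])['smoker'].agg(['mean', 'size'])
singled_out = grouped[grouped['size'] == 1]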

The data curator may try to protect against singling out by refusing to answer any query that matches a single individual. While this approach prevents microdata reconstruction, it still allows for membership inference. If the data curator rejects any query that singles out an individual, then the refusal to answer the query gives the adversary sufficient information to distinguish if the target individual exists in the dataset.

If you wanted to employ differential privacy to release the singling-out query, then you’ll notice that the utility of these kinds of queries works out to be very poor. This poor utility is, perhaps counterintuitively, ideal! In the query discussed above, the range of the query will be no greater in magnitude than the sensitivity of the query. Therefore, any signal provided by the query will be dwarfed by the noise scale. Legitimate queries involve datasets of greater size, where the range of the query dwarfs the noise scale. This is the typical behavior of DP methods, stemming from how noise is calibrated according to the sensitivity of each query.
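As a rough numerical illustration (not tied to any particular DP library), compare the scale of Laplace noise calibrated for a counting query against the largest possible value of the singling-out query:

import numpy as np

epsilon = 1.0
sensitivity = 1               # one person changes a count by at most 1
scale = sensitivity / epsilon

# the singling-out query matches at most one record, so its true value is 0 or 1,
# while the noise has standard deviation sqrt(2) * scale, roughly 1.41
noisy_singling_out = 1 + np.random.laplace(loc=0.0, scale=scale)

# for a legitimate count over thousands of records, the same noise is negligible
noisy_large_count = 8500 + np.random.laplace(loc=0.0, scale=scale)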

Suppose the data curator is reluctant to employ differential privacy, opting instead to refuse any query with a predicate that matches fewer than k records. Intuitively, a higher threshold should afford stronger protections. Unfortunately, any choice of threshold will prove unsatisfactory.

Differencing Attack

The differencing attack circumvents thresholding-based protections against microdata reconstruction by distributing the attack over more than one query. In this attack, an adversary simply computes the difference between two carefully-crafted queries. Just like singling out, since this attack is potent enough to reconstruct microdata, it is also possible to use differencing attacks to infer membership.

Returning to the teacher survey example, the adversary can craft two queries for which the predicates match a reasonably large number of individuals, but still differ by only a single individual. Once they receive the answers to these two queries, they can simply compute the difference.

predicate_a = (df['maritalStatus'] == 'Un-Married')
predicate_b = (df['maritalStatus'] == 'Un-Married') | \
    ((df['sex'] == 3) & (df['age'] == 27))
# the difference reveals the coworker's smoker value
df.loc[predicate_b]["smoker"].sum() - df.loc[predicate_a]["smoker"].sum()

This differencing attack with two queries is relatively easy to see through, but the adversary can just as easily set the trap over a larger set of queries, or take advantage of auxiliary information to make the predicate seem more general than it really is. For instance, if the adversary already knew the number of un-married teachers in the dataset, then they would only need to submit the second query. Auxiliary information makes differencing attacks impossible for a data curator to guard against, because the data curator has no way of recognizing queries that may single out individuals in the dataset.

Differencing attacks are also particularly easy to execute over temporal data, where it is natural to want to compare differences between time-steps.

The differencing attack may even collapse down to a single query. If the adversary asks for the smoking rate within a group, and all members of the group share the same survey answer, none of the individuals in the group benefit from plausible deniability. Otherwise, the smoking rate within a small group can still be used to give a probabilistic estimate of the coworker’s smoking status. Such a probabilistic estimate can be refined by collecting auxiliary information about the other individuals in the group.

If you wanted to employ differential privacy to release the pair of differencing queries, then you’ll notice that the utility of each query in isolation remains high, but the utility of their difference is very poor! In this case, the privacy loss parameters are split between two queries, resulting in two noise additions whose variances are each at least twice as large. Since the variance of the difference of two random variables is simply the sum of their variances, the variance of the difference is now at least four times greater than it was in the singling-out attack. As discussed before, the variance of the query in the singling-out attack was already untenable.

At this point, the data curator is left without recourse: any incoming query may potentially violate privacy. The only solace is that the adversary must carefully and manually craft queries that gain them only a single attribute at a time, and that the adversary must hold significant auxiliary information.

Least Squares Solution

The methods shown in this section provide a more practical approach that exploits the information in a set of aggregate queries in order to reconstruct multiple rows. Reconstructing the entire dataset by replicating the differencing attack multiple times is inefficient, because one query may be useful for reconstructing more than one entry in the dataset. Both the singling-out and differencing attacks are also difficult to scale up to many records, because the adversary must be able to craft queries that single out each individual.

The previous two attacks can be rephrased as simple systems of equations to solve for a single entry in the dataset. The singling-out query is a particularly simple system, consisting of one equation, and the differencing attack involves a system of up to k equations. In this section, you will learn how to randomly construct a system of equations that can be used to solve for, and thus reconstruct, an arbitrarily large subset of the dataset.

In this example, the goal of the adversary is to craft a set of predicates that will provide sufficient information to reconstruct the entire “smoker” column from the teacher survey dataset. At this point, it is necessary to define the query interface that will be broken:

target = "smoker"


def query_interface(predicates):
    """Count the number of smokers that satisfy each predicate.
    Resembles a public query interface on a sequestered dataset.

    :param predicates: a list of predicates on the public variables
    :returns a 1-d np.ndarray of exact answers to the subset sum queries"""

    # 1. data curator checks predicates
    # 2. data curator executes and returns queries:
    return data[target].values @ np.stack([pred(data) for pred in predicates],
                                          axis=1)

An adversary may submit a set of queries like so:

query_interface([
    lambda data: data['sex'] == 1,            # "is-female" predicate
    lambda data: data['maritalStatus'] == 1,  # "is-married" predicate
])

To expedite the process of selecting suitable predicate functions, the adversary might construct them randomly, using a hashing scheme.

def make_random_predicate():
    """Returns a (pseudo)random predicate function by
       hashing public identifiers."""
    prime = 691
    desc = np.random.randint(prime, size=len(pub))
    # this predicate maps data into a 1-d ndarray of booleans
    #   (where `@` is the dot product and `%` modulus)
    return lambda data: ((data[pub].values @ desc) % prime % 2).astype(bool)

# Example usage
random_predicate = make_random_predicate()
num_smokers_that_matched_random_predicate = query_interface([random_predicate])

# The boolean mask from applying the example predicate to the data:
random_predicate_mask = random_predicate(data)

If we consider the predicate mask A, the target (smoker) column x, and the subset sum answers b, then what we need to do is find the x that minimizes ‖Ax − b‖₂. The target column is equivalent to the least squares solution (assuming the public variables uniquely identify each individual).

def reconstruction_attack(data_pub, predicates, answers):
    """Reconstructs a target column based on the `answers` to queries
       about `data`.

    :param data_pub: data of length n consisting of public identifiers
    :param predicates: a list of k predicate functions
    :param answers: a list of k answers to a query on data filtered by the
                    k predicates
    :return: a 1-dimensional boolean ndarray"""
    masks = np.stack([pred(data_pub) for pred in predicates])
    return np.linalg.lstsq(masks, answers, rcond=None)[0] > 0.5

We’re now ready to carry out the attack. Generate a large set of random queries, submit them to the query interface, and then find the least squares solution.

predicates = [make_random_predicate() for _ in range(2 * len(data))]
exact_answers = query_interface(predicates)

# solve the least squares problem to recover the target column
reconstructed_target = reconstruction_attack(
    data_pub=data[pub], predicates=predicates, answers=exact_answers)

# complete reconstruction of the target column
assert np.array_equal(reconstructed_target, data[target])

As has been shown, this approach circumvents all ad-hoc protections placed on the data thus far, giving the adversary the ability to reconstruct the smoker column.

Multiple columns can be reconstructed at once by augmenting the queries and system of equations. In this example, the number of queries is chosen to be high enough to exactly reconstruct the data. The adversary may be able to reconstruct the microdata with fewer queries, by choosing predicate functions that are maximally entropic, instead of random, and by allowing some uncertainty in the reconstructed microdata. In practice, it is not necessary to perfectly reconstruct the microdata in order to carry out a record linkage attack.
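One simple way to extend the attack to several columns (a sketch that reuses the random predicates above, assuming the interface can be queried per target column) is to collect a separate vector of answers for each target and solve each system independently:

# reuse the same random predicates to reconstruct several binary columns
masks = np.stack([pred(data) for pred in predicates])

reconstructed = {}
for col in ["smoker", "hasChildren"]:          # illustrative binary targets
    answers = data[col].values @ masks.T       # stand-in for per-column query answers
    reconstructed[col] = np.linalg.lstsq(masks, answers, rcond=None)[0] > 0.5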

Yet again, differential privacy provides a robust counter to this approach. This attack is most potent when it maximally exploits parallel composition. If the attack were adjusted to do so, it would still be no more effective than the singling out attack. Since the privacy loss parameters grow commensurately with the number of queries made on the data, the singling-out attack for a single data entry remains the most practical attack.

Tip

One limitation of this example is that it assumes an interactive model, where the adversary can choose their queries. In practice, the attack is just as applicable to static data releases: the adversary reconstructs the predicate mask A from the queries the data curator has already answered in the release.

In fact, there are examples of this being done. Many statistical agencies, like the U.S. Census Bureau, make data releases of grouped queries at many levels of disaggregation. These releases are just as vulnerable to the same attack! The U.S. Census Bureau conducted a proof-of-concept reconstruction attack on the 2010 Census data that, with the aid of detailed geographic data that has not been made publicly available, reconstructed microdata including sex, age, race, and ethnicity. The reconstructed microdata was then subjected to a record linkage attack against commercial databases, resulting in confirmed re-identification of 52 million persons.

Another limitation of this example is that the target column must be binary. This is manageable: when the target column is not binary, any numeric column can be binned into a categorical column, and any categorical column can be binarized via a one-hot encoding. For example, if the column to reconstruct consisted of age, then age could be reconstructed to arbitrary precision by choosing the binning granularity and number of queries commensurately. Similarly, categorical columns used in the predicates can trivially be converted to an ordinal encoding.
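As a sketch of the binning idea, an age column could be reduced to a handful of indicator columns, each of which then serves as a valid binary reconstruction target (the bin edges here are arbitrary):

# bin a numeric column and one-hot encode the bins; each resulting indicator
# column can be reconstructed with the same subset-sum machinery
age_bins = pd.cut(data["age"], bins=[0, 30, 45, 60, 120])
age_indicators = pd.get_dummies(age_bins, prefix="age")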

A further limitation of this approach is its computational cost. Notice that the number of unknowns in the system of equations grows linearly with the size of the dataset. When the dataset is very large, it becomes more computationally feasible to rephrase the problem as a constraint satisfaction problem and to use a SAT solver.

Tracing

Tracing is a kind of membership inference attack that leverages a reference sample from the population. This attack can still accurately infer membership when the data curator uses privacy-enhancing techniques to distort releases. When this attack was first published, it significantly narrowed the gap between the worst-case privacy guarantees (in terms of differential privacy loss parameters) and real-world attacks.

This section only discusses an abridged version of the attack that more clearly demonstrates the ideas. For convenience, assume the coworker is named Alice.

Just like in previous attacks, the adversary collects auxiliary information about Alice, storing it in the individual variable. The adversary collects several mean answers on different data subsets and columns by either submitting queries to a query interface or collecting information from a data release. Finally, the adversary collects a reference sample from the population to form a baseline to compare against.

The core idea of the attack is that, if the answers on the private dataset are more similar to the targeted individual than they are to the reference sample from the population, then the targeted individual is likely to be a member of the private dataset.

def membership_attack(individual, answers, reference_samples, alpha=.05):
    """Perform membership attack using dwork et al. test statistic.
    See figure 1 in
    https://privacytools.seas.harvard.edu/files/privacytools/files/robust.pdf

    :param individual: y, a boolean vector of shape (d,)
    :param answers: q, a float vector with elements in [0, 1] of shape (d,)
    :param reference_samples: z, a boolean vector of length (1, d)
    :param alpha: statistical significance
    :return: True if alice is in data with (1-alpha)100% confidence."""
    individual = individual * 2 - 1
    answers = answers * 2 - 1
    reference_samples = reference_samples * 2 - 1

    alice_similarity = np.dot(individual, answers) # cosine similarity
    reference_similarity = np.dot(reference_samples[0], answers)
    statistic = alice_similarity - reference_similarity

    d = len(individual)
    tau = np.sqrt(8 * d * np.log(1 / alpha))
    return statistic > tau

The attack itself is very simple. First, all three inputs are rescaled from {0, 1} to {-1, +1}. The dot product (proportional to cosine similarity, since all rescaled vectors have the same norm) is then used to measure the similarity between individual and answers, and between the reference sample and answers. The test statistic is simply the difference of these similarities, being positive if the query answers are more similar to Alice, and negative if they are more similar to the population.

Just like any standard statistical hypothesis test, the test statistic is compared to a critical value, tau. If the test statistic exceeds tau, then Alice is a member of the private dataset with (1-α)100% confidence.

As mentioned above, this test has been abridged; more sophisticated variations of it can detect membership in the private dataset with greater statistical power.
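To see the pieces fit together, here is a usage sketch on synthetic binary data (the data-generating process and dimensions are invented purely for illustration):

import numpy as np

# a small private dataset described by many binary attributes
d, n = 20_000, 10
population_rates = np.random.uniform(0.2, 0.8, size=d)

private_data = np.random.binomial(1, population_rates, size=(n, d))
reference_samples = np.random.binomial(1, population_rates, size=(1, d))

alice = private_data[0]                  # Alice is a member of the private dataset
non_member = np.random.binomial(1, population_rates, size=d)
answers = private_data.mean(axis=0)      # attribute means released by the curator

print(membership_attack(alice, answers, reference_samples))       # likely True
print(membership_attack(non_member, answers, reference_samples))  # likely False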

K-Anonymity Vulnerabilities

Another form of ad-hoc privacy protection is K-Anonymity, a popular approach to data anonymization that is susceptible to record linkage attacks. To employ K-Anonymity, the data curator takes several steps. Any identifying columns are immediately removed. Then the curator identifies all columns that are likely to be used as pseudo-identifiers. The “privatization” guarantees there are always at least K records corresponding to any given combination of the pseudo-identifiers. This may be done either by suppressing pseudo-identifiers (removing them from the dataset as well) or by generalizing them (re-binning with less granularity).
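As a rough sketch of these steps on the teacher survey, generalizing a pseudo-identifier and checking group sizes might look like the following (the binning and the choice of K are illustrative):

# generalize "age" into coarse bands and check that every combination of the
# chosen pseudo-identifiers appears at least K times
K = 5
df["ageBand"] = pd.cut(df["age"], bins=[0, 30, 45, 60, 120])

group_sizes = df.groupby(["sex", "ageBand"], observed=True).size()
violations = group_sizes[group_sizes < K]   # groups still requiring suppression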

A core issue of K-anonymity is that it is impossible to determine just what a pseudo-identifier is. This means that this approach to privacy is susceptible to auxiliary information held by the adversary. In practice, just about any column may be used as a pseudo-identifier, and labeling too many columns as pseudo-identifiers destroys the utility of the analysis.

In the teacher survey dataset, quite literally every column could be used as a pseudo-identifier. If you wanted any utility from K-Anonymity, then you wouldn’t be able to label optimism, life satisfaction, and self-esteem as pseudo-identifiers. Unfortunately, you can still use these attributes to disambiguate the exact record belonging to your coworker from the set matched by the pseudo-identifiers (which, by K-Anonymity, has size at least K).

Even if you have columns in your data that are not pseudo-identifiers, privacy can still be trivially violated in the case where all K records share the same answer on the attribute of interest. This is the same vulnerability as when a singling-out query, discussed previously, returns a small result set. Individuals in that group either completely lose plausible deniability, or are subject to a probabilistic inference that is vulnerable to refinement from auxiliary information.

Attacks on Machine Learning

This section discusses alternative attacks that are applicable in a machine learning context. In this context it is unusual to release grouped queries or use predicates. In these attacks, the adversary is assumed to have access to the model parameters, and is determined to violate the privacy of individuals in the training data. The previously-discussed attacks may still be applied to attack the dataset used for inference and model scoring.

There is a broad literature that tailors attacks to specific ML problems. For instance, recommender systems based on collaborative filtering are particularly susceptible to differencing attacks because recommendations are continually updated in response to new data.

The bar for a privacy violation is higher than you might expect. For instance, consider model inversion, a technique where an adversary attempts to recover the training dataset from learned parameters. Model inversion does not necessarily constitute a privacy violation! This is because a well-generalized model aims to preserve the statistical properties of the population, not of individual training examples. A well-generalized model will more likely only retain the statistical properties within subpopulations, leading to a “reconstructed” dataset whose records don’t relate to individual records in the training data.

Regardless, there is an abundance of neural networks large enough to memorize vital information about individuals in the training data. In these cases, the attacks are very similar in nature to those we’ve discussed already: dataset reconstruction (in whole or in part) and membership inference. As a general rule, privacy attacks are more potent when models are overfitted, or when the training data lacks diversity.

There are many ad-hoc protections a data curator may use to prevent specific vulnerabilities: only releasing aggregates, refusing queries based on attributes of the query or the data, and K-Anonymity. There is no end to ad-hoc protections, nor to the attacks that circumvent them.

If the data curator starts adding noise to query results, then the adversary can simply average repeated queries to reduce the variance of the noise. If the data curator uses the query itself as a seed for the random number generator, such that the answer is always the same for the same query, then the adversary can simply submit many equivalent formulations of the same query that each result in different seeding. In fact, using seeds for random number generators opens another class of vulnerabilities, in that the seed itself can be reconstructed to predict and strip the noise out of the answer.
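Here is a quick sketch of the averaging attack: if the curator adds fresh, fixed-scale noise to every repetition of a query without accounting for the cumulative privacy loss, the adversary recovers the true answer to high precision.

import numpy as np

true_answer = 1.0          # e.g., the smoker count from the singling-out query
noise_scale = 1.0          # fixed scale, re-drawn for every repeated query
repeats = 10_000

noisy_answers = true_answer + np.random.laplace(0.0, noise_scale, size=repeats)
estimate = noisy_answers.mean()   # standard error shrinks like 1 / sqrt(repeats)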

Other ad-hoc protections have included swapping attributes of the data, rounding answers, and random sampling, all met with questionable utility and continued vulnerabilities.

A data curator employing ad-hoc protections can never be confident that their protections are effective, nor that the privacy of individuals in their data is preserved. This motivates the more formal, mathematically justified approach of differential privacy. DP is an approach that works equally well regardless of the underlying dataset, or the computational capabilities or auxiliary knowledge of the attacker.

While the attacks shown motivate the use of differential privacy, differential privacy can only mitigate privacy attacks if the privacy loss parameters are chosen appropriately. While many of these attacks are completely foiled by the use of differential privacy, there are some, particularly the membership inference attacks related to tracing, that may still be viable when the privacy loss parameters are set too loose. Keep in mind that the protections afforded are only as strong as the privacy loss parameters.

There are cases where the chance of a privacy attack on sensitive data is simply too great a risk. In cases like this, you can’t access the data directly, but can instead work with synthetic data, that is, data that is derived from an existing dataset and is statistically similar to it. In the next chapter, you will learn how to generate DP synthetic data.

Exercises
