Chapter 1. Introduction

A few years ago we partnered with O’Reilly to write a book of case studies and methods for anonymizing health data, walking readers through practical methods to produce anonymized data sets in a variety of contexts.1 Since that time, interest in anonymization, sometimes also called de-identification, has increased due to the growth and use of data, evolving and stricter privacy laws, and expectations of trust from privacy regulators, private industry, and the citizens from whom data is being collected and processed.

Data is recognized as an important driver of innovation in economic and research activities, used to improve services and derive new insights. Services are delivered more efficiently, at lower cost and with increased usability, based on an analysis of relevant data about how a service is provided and used. Insights improve outcomes in many facets of our lives: reducing the likelihood of fatal accidents (in travel, work, or leisure), getting us better returns from financial investments, or improving health-related outcomes by understanding disease progression and environmental influences, to name but a few examples. Sharing and using data responsibly is at the core of all these data-driven activities.

The focus of this book is on implementing and deploying solutions to reduce identifiability within a data collection pipeline, so it’s important to establish context around the technologies and data flows that will be used in production. Example applications include everything from structured data collection to IoT and device data (smart cities, telco, medical). In addition to the advantages and limitations of particular technologies, decision makers need to understand where those technologies apply within a deployed pipeline to manage the “spectrum of identifiability.” By this we mean that identifiability is more than a black-and-white concept, as we will see when we explore a range of data transformations and disclosure contexts.

This is, however, a book of strategy, not a book of theory. Consider this book your advisor on how to plan for and use the full spectrum of anonymization tools and processes to get richer data that can be legally and defensibly used for purposes other than those originally envisioned when collecting the data or providing a service. We will work through different scenarios based on three distinct classes of identifiability, and provide case studies to illustrate some of the strategic considerations that organizations are struggling with.

Our Motivation in Writing This Book

We have a combined three decades of experience in data privacy, from academic research and authorship to delivering training courses, seminars, and presentations, as well as leading highly skilled teams of researchers, data scientists, and practitioners. We’ve learned a great deal, and continue to learn a great deal, about how to put privacy technology into practice. We want to share that knowledge to help drive best practice forward, demonstrating that it is possible to achieve the “win-win” of data privacy championed by the likes of former privacy commissioner Dr. Ann Cavoukian in her highly influential concept of Privacy by Design. Many privacy advocates believe that we can and should treat privacy as a societal good that is encouraged and enforced, and that there are practical ways to achieve this while meeting the wants and needs of our modern society.

The central question we are consistently asked is how to use data in a way that protects individual privacy while still ensuring the data are of sufficient granularity that analytics will be useful and meaningful. By incorporating anonymization methods to reduce identifiability, organizations can establish and integrate secure, repeatable anonymization processes into their data flows and analytics in a sustainable manner.

Best practice recognizes a spectrum of identifiability,2 and this spectrum can be leveraged to create various pipelines to anonymization. We will explain how to objectively compare data sharing options for various data collection use cases to help you match your problems to privacy solutions, thereby enabling secure and privacy-preserving analytics. There is a range of permutations in how, where, and when to reduce identifiability, to provide useful data while meaningfully protecting privacy based on benefits and needs.

While technology is an important enabler of anonymization, technology is not the end of the story. Accounting for risk in an anonymization process is critical to achieving the right level of de-identification and the resulting data utility, which influences the analytic outcomes. Accordingly, to maximize outcomes, an organization must have efficient methods of measuring, monitoring, and assuring the controls associated with each disclosure context. Planning and documenting are also critical in any regulated area, as auditors and investigators need to review implementations to ensure the right balance is struck when managing risks.

And, ultimately, anonymization can be a catalyst for responsibly using data, as it is privacy enhancing. There is a security component that comes from limiting the ability to identify individuals, as well as an ethical component that comes from deriving insights that are broader than single individuals. Conceptually, we can think of this as using “statistics” (that is, numerical pieces of information) rather than single individuals, and using those statistics to leverage insights into broader populations and application areas to increase reach and impact.

Getting to Terms

Before we describe anonymization in any more detail, there are some terms it would be best to introduce at the outset, for those not familiar with the privacy landscape. We will describe a variety of considerations in this book based on potential data flows, and we will simply describe this as data sharing. Whether the data is released, as in a copy of the data is provided to another party, or access is granted to a repository or system, it’s all sharing to us. Sometimes the term disclosure is also used, and in a very broad sense. In an attempt to keep things simple, we will make no distinction between these terms. Also, to explain this sharing, we will use the terms data custodian to refer to the entity sharing data, and data recipient to refer to the entity receiving data.

We would struggle to describe anonymization, and privacy generally, without explaining that personal data is information about an identifiable individual. You may also come across the terms personal information (e.g., Canada), personally identifiable information (e.g., US), or protected health information (identifiable health information defined for specific US health organizations). Personal data is probably the broadest term (and, due to EU privacy regulations, also of high impact globally), and since our focus is on data for analytics, we will use this term throughout the book.

An identifiable individual, referred to when describing personal data, is often called a data subject. The data subject is not necessarily the “unit of analysis” (a term commonly used in scientific research to mean the person or thing under study). Any individual represented in the data is considered a data subject. The unit of analysis could be households, where the adult guardians represent the individuals of primary interest to the study. Although the number of children a person has (as parent or guardian) is personal, the children are also data subjects in their own right.

Warning

Our aim is to help match privacy considerations to technical solutions. This book is generic, however, touching on a variety of topics relevant to anonymization. Legal interpretations are contextual, and we urge you to consult with your legal and privacy team! Materials presented in this book are for informational purposes only, and not for the purpose of providing legal advice. There, now that we’ve given our disclaimer, we can breathe easy.

Regulations

Data protection or privacy regulations (which we will simply call regulations or privacy regulations), and subsequent legal precedents, define what is meant by personal data. This isn’t a book about the law, and there are many regulations and laws to consider (national, regional, and sectoral, and even cultural or tribal norms depending on the country). However, two are notable for our purposes, as they have influenced the field of anonymization in terms of how it is defined and its reach:

Health Insurance Portability and Accountability Act (HIPAA)

Specific to US health data (and a subset at that), HIPAA includes a Privacy Rule that provides the most descriptive definition of anonymization (called de-identification in the act). Known as Expert Determination, this approach requires someone familiar with generally accepted statistical or scientific principles and methods to anonymize data such that the risk of identifying an individual is very small.

General Data Protection Regulation (GDPR)

This very comprehensive regulation of the European Union has had far-reaching effects, in part due to its extraterritorial scope (applying to residents of the EU, regardless of where their data is processed, when a service intentionally targets the EU) and in part due to the severity of the fines it introduced, based on an organization’s global revenue. The regulation is risk based, with many references to risk analysis.

Since we’ve introduced US and EU privacy regulations, we should also clarify some of the terms used between the two to refer to similar concepts. We’re focusing on the two above, although in truth there are also state-level privacy laws in the US (such as the California Consumer Privacy Act), as well as member state-level privacy laws in the EU, that add additional layers. For our purposes the terms in Table 1-1 should be sufficient. And, yes, you may notice that we’ve repeated personal data for the sake of completeness. The definitions are only basic interpretations in an attempt to bring the US and EU terms into alignment. This is only meant to provide some guidance on aligning the terms; be sure to discuss your particular situation with your legal and privacy team.

Table 1-1. Basic definitions based on the similarities between US and EU terms, in case you come across them. HIPAA is sectoral, but we will disregard the specific health sector focus and concentrate on the commonalities. There are nuances to the definitions between jurisdictions, but hopefully this helps provide some alignment.

US HIPAA: Protected Health Information
EU GDPR: Personal Data
Common definition: Information about an identifiable individual

US HIPAA: De-identification
EU GDPR: Anonymization
Common definition: Process that removes the association between the identifying data and the data subject

US HIPAA: Covered Entity
EU GDPR: Data Controller
Common definition: Entity that determines the purposes and the means of the processing of personal data

US HIPAA: Business Associate
EU GDPR: Data Processor
Common definition: Entity that processes personal data on behalf of the data controller

US HIPAA: Data Recipient
EU GDPR: Data Processor (for personal data)
Common definition: Entity to which data is disclosed by the data custodian

US HIPAA: Limited Data Set
EU GDPR: Pseudonymized Data
Common definition: Personal data that can no longer be attributed to a specific data subject without the use of additional information (in the case of a limited data set, only direct identifiers are removed, whereas pseudonymized data has a broader interpretation)

Back to the subject of what makes personal data identifiable, and how we can interpret identifiable for the purpose of defining anonymization. Guidance from authorities is almost exclusively risk based, attempting to balance the benefits of sharing data against an interpretation of anonymization that will sufficiently reduce identifiability to appropriately manage risks. We won’t go through the various guidance documents available. Our previous work has helped influence guidance, and this book as a whole has been influenced by that guidance as well. We’re all in this together! Let’s consider various conditions on identifiability that have been put forward, as shown in Table 1-2.

Table 1-2. Conditions on identifiability from various authorities.

Federal Court (Canada): Serious possibility that an individual could be identified through the use of that information, alone or in combination with other information.

Office of the Privacy Commissioner of Canada: “Serious possibility” means something more than a frivolous chance and less than a balance of probabilities.

GDPR: Identifiability is defined by the “means reasonably likely to be used” to identify a data subject, taking into consideration objective factors (such as the cost and time required to identify).

Illinois Court (US): Not identifiable if it requires a highly skilled person to perform the re-identification.

HIPAA: Reasonable basis to believe the information is identifiable, whereas it is not identifiable if an expert certifies that the risk of re-identification is “very small.”

Yakowitz: Only public information can be used for re-identification.

States of Data

We mentioned the identifiability spectrum, which is influenced by the conditions on identifiability from authorities, as well as various sections in regulations, their interpretations, and guidance. The full spectrum includes verifying the identity of the data recipient, contractual controls, privacy and security controls, and transformations of identifying information. This book has in fact been organized around a few points along that spectrum, based on three main states of data:

Identified

We use this term to mean that there is directly identifying information in the data, such as names or addresses. We make a slight distinction with the term identifiable, meaning it would be reasonable to expect that an individual could be identified (alone or in combination with other information). Many points along the spectrum will be considered identifiable, and therefore personal. But identified means the identity is known and associated with the data, which is often the case when delivering a service to an exact person.

Pseudonymized

The term pseudonymization was popularized with the introduction of the GDPR. Technically speaking, the directly identifying information doesn’t need to be replaced with a pseudonym; it could just as well be replaced with a token or fake data, or even suppressed entirely. The legal term pseudonymization simply means that direct identifiers have been removed in some way, as a data protection mechanism, and that any additional information required to re-identify is kept separate and subject to technical and administrative (or organizational) controls.
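To make this concrete, here is a minimal sketch (our illustration, with hypothetical field names) of pseudonymizing direct identifiers with a keyed hash, where the key needed to recreate or link the tokens is held separately from the shared data:

```python
# A minimal sketch of pseudonymization, assuming hypothetical field names:
# direct identifiers are replaced with keyed tokens, and the secret key needed
# to recreate or link those tokens is kept separately, under its own controls.
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # stored apart from the pseudonymized data

def pseudonymize(record, direct_identifiers=("name", "email")):
    out = dict(record)
    for field in direct_identifiers:
        if field in out:
            token = hmac.new(SECRET_KEY, out[field].encode(), hashlib.sha256)
            out[field] = token.hexdigest()[:16]  # shortened token as the pseudonym
    return out

print(pseudonymize({"name": "Jane Doe", "email": "jane@example.com", "age": 42}))
```

Note that the result is still personal data: anyone holding the key, or able to link the remaining attributes to other information, may re-identify individuals.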

Anonymized

When we use the term “anonymization,” we mean anonymization that is legally defensible: anonymization that meets the standards of current legal frameworks, and that can be presented as evidence to governing bodies and regulatory authorities (i.e., data protection and privacy commissioners) to mitigate exposure and demonstrate that you have taken your responsibility toward data subjects seriously. The terms anonymization and de-identification are sometimes used interchangeably, but be careful, as de-identification is sometimes also used interchangeably with pseudonymization!

There will be other terms introduced throughout the book, but these are the ones you need to start reading. Many of the terms we’ve just introduced necessarily include some discussion of regulations, so this section has served to introduce both terms and regulations, at least to some degree. We will describe regulations where needed to explain a concept or consideration. The next section delves deeper into regulatory considerations as they relate to the process of anonymization.

Anonymization as Data Protection

There has been, and will continue to be, considerable debate around the term “anonymous,” often focusing on the output of anonymization alone (i.e., can someone be reasonably identified in the data, and what’s considered “reasonable”). We shouldn’t lose sight of the fact that anonymization is a form of data protection, and privacy enhancing. To be effective at enhancing privacy, anonymization needs to be used, and that means it also needs to be practical and produce useful data. Barriers that discourage or limit the use of anonymization technology will simply drive organizations to use identified data, or not innovate at all. There are many benefits that can be extracted from sharing and using data; let’s make sure it’s done responsibly.

We keep mentioning the need to produce “useful data” from the process of anonymization. There is a reality here that we can’t escape, something we called the Goldilocks Principle in our previous book. It’s the idea that we need to balance risk against benefits, and in this case the benefits relate to the utility of the data. It is possible to produce useful data, depending on the purpose and needs, and achieve a win-win. But as data geeks we have to be upfront in saying that there is no such thing as zero risk. When we cross the road and look both ways, we are taking a measured risk. The risk can be quantified, but it’s statistical in nature and never zero unless very little to no information is shared. We consider probable risks, and aim to achieve very low probabilities.

Consider the rock and the hard place we are caught between. In a data sharing scenario in which we wish to achieve private data analysis, there will always be a sender (the data custodian) and a recipient (the data analyst). But the recipient is also deemed an eavesdropper or adversary (using standard security language, in this case referring to an entity or individual that may re-identify data, whether intentionally or not). Compare this with encryption, in which the recipient gets to decrypt and gain access to the original data shared by the sender. The recipient in the encryption example is not considered an adversary. Not so with anonymization.

Our goal in anonymizing data is to balance the needs of the recipient (providing them with useful data) while minimizing the ability of an adversary (including the recipient) to extract personal information from the data. The two roles the recipient plays, as an eventual user of the data and also a potential adversary, are what distinguish anonymization from encryption (in which the adversary and recipient are mutually exclusive), and what make producing useful and safe data so challenging.

A more practical and realistic approach is to focus on the process of minimizing risk, considering anonymization as a risk-management process. This is the approach taken, for example, by the HITRUST Alliance.3 It is also the approach taken in data security, which is largely process based and contextual. We call this risk-based anonymization, which in our work has always included process- and harm-based assessments to provide a holistic approach to anonymization. This approach informs the statistical estimators and data transformations that are applied directly to data. Guidance on the topic of anonymization is almost always risk based.

Warning

If personal data is pseudonymized, or otherwise falls short of full anonymization, subsequent uses of the data must still be compatible with the original purpose of the data collection, and may require an additional legal basis for processing. Either way, reducing identifiability in data helps support the case for using the data in ways that were not originally intended (secondary purposes). We will therefore also consider methods to reduce identifiability that may fall short of full anonymization, because they are both useful in their own right and likely to build towards full anonymization. We need to understand all the tools at our disposal.

Approval or Consent

As a form of data protection, anonymization itself does not require the approval of data subjects, although transparency is recommended and possibly required in some jurisdictions. As with other data protections, it is being done on their behalf, to remove the association between them and the data. We use the term approval here rather than consent because under the GDPR consent is more restrictive than in other jurisdictions (i.e., it must be “freely given, specific, informed and unambiguous”, with additional details and guidance around the interpretation).

Getting the approval of data subjects can be extremely difficult and impractical. Imagine asking someone going to a hospital for treatment whether they would allow their data to be anonymized for other purposes. Is it even appropriate to ask them while they are seeking care? Would some people feel pressured or coerced, or answer in a reactionary way out of frustration or spite? It will be different in other scenarios, where the stakes aren’t as high and the information isn’t as harmful or sensitive. But timing and framing will be important.

At the other extreme, approval to anonymize could be sought days, months, perhaps even years later. This could make for awkward situations when people have moved on and acquaintances are asked for contact information. There may be conflict between acquaintances, or simply a reluctance to share contact information. Or the individuals concerned may even be deceased. Contacting thousands of individuals for their approval is likely to be impractical, and unlikely to be fruitful.

But let’s assume data subjects are reachable. Some privacy scholars have argued that approval can be meaningless, either because the details are described in impenetrable legalese, or because people don’t understand the implications or simply don’t want to be bothered. Depending on how the approval is structured, they may accept just to get access to something being offered, or neglect to find and select the opt-out. How is this privacy preserving?

On the other hand, imagine an approval process in which acceptance is optional, separated from all services or offerings. There could be a potentially endless stream of requests to anonymize data, for every use case and every service hoping to improve operations or innovate. Government and the private sector would burden individuals with requests, to the point that they would simply ignore all requests. The concept of priming also suggests that even when cool heads prevail, people often only think about privacy when it’s brought to their attention. They become sensitive to the topic because they are now thinking of it, and perhaps unnecessarily so. Opt-in would be rare, even when the benefits would go to data subjects or a broader population.

The truth probably lies somewhere in the middle. Specific sectors or use cases may see different rates of approval. Certain socioeconomic groups may be more sensitive to privacy concerns, and services and insights would become biased toward specific groups. Making anonymization the default, provided the process meets guidance or standards, would ensure non-personal data is available to improve services and derive new insights. This is why regulations offer alternatives to approval, and focus on much more than the process of reducing identifiability.

Purpose Specification

Debate regarding anonymization usually comes from sharing data for purposes other than those for which the data was originally collected. Although the process of anonymization is important, it’s the uses of anonymized data that concern people. There have been too many examples of misuses of data, in which people felt discriminated against or harmed in some way, although interestingly most of these probably involved identified data. Anonymization will not solve these misuses, although it can help mitigate concerns.

Personal data may, for example, be collected to perform a banking transaction, but that personal data is anonymized and the insights are used to determine which age groups use a banking app versus an ATM, and at what times and on what weekdays. Such data-driven insights can improve services for different age groups based on current usage patterns. Some may take issue with this form of targeting, even when the intent is to improve services by age group. All organizations have to make decisions to ensure the return on their investments is reasonable, otherwise they will cease to exist, and this will inevitably mean making trade-offs. However, if the targeting touches on sensitive demographic groups, it will enter the realm of ethical considerations.

If data are to be used for other purposes, for which the approval of data subjects is not explicitly sought, a full range of considerations should be included to ensure the uses of the data are appropriate. Specifically, harms should be considered in the broader context of ethical uses of data. Although this may be deemed orthogonal to anonymization, the reality is that it could set the tone for how a risk-management approach to anonymization is evaluated. We therefore consider framing anonymization within the broader context of data protection.

Reducing identifiability to a level at which data becomes non-personal is, by its very nature, technical, using a blend of statistics, computer science, and risk assessment. In order to engender trust, we must also look beyond the technical and use best practice in privacy and data protection more broadly. Consider making the case to use anonymized data based on the purposes for which the resulting data will be used. For example, we can take a page from EU privacy regulations and consider “legitimate interests” as a way to frame anonymization as a tool to support the lawful and ethical reuse of data. That is, a data sharing scenario can consider how reusing the data (called “processing” in the regulatory language of GDPR) is legitimate, necessary, and balanced, so that it’s found to be reasonable for the specified purposes.

Legitimate

Reuse of the data should be something that is done now or in the very near future. The interests in reusing the data can be commercial, individual, or societal, but the reuse should avoid causing harm. It should also be possible to explain those interests clearly, such that the reuse would seem reasonable in the hypothetical case that it was explained to individuals.

Necessary

Reuse of the data should be somewhat specific and targeted to the use case, and minimized to what is required to meet the objectives that are laid out in advance. Overcollection will be frowned upon by the public, so it’s best to ensure needs are well laid out. Again, imagine the hypothetical case of trying to explain the reuse of all that data to individuals.

Balanced

Reuse of the data should have well-articulated benefits that outweigh residual risks and any data protection or privacy obligations. Consider impacts, and how they can be mitigated. A form of risk-benefit analysis can help underscore mitigation strategies. Hint: reduce identifiability!

Anonymization can help with two aspects of the above: it can more clearly limit the data to what is necessary, at least in terms of information that may be identifiable; and it can support the benefits of reusing the data, acting as a mitigation strategy that reduces risks and makes the reuse more balanced towards benefits. In other words, it is the legitimacy of reuse that needs to be explained; anonymization will focus attention on what is necessary and balance towards benefits. But how the anonymized data is used needs to be considered to ensure it is appropriate.

Now this isn’t to say that we need to make the case of “legitimate interests” to use anonymized data. What we are suggesting is that the privacy considerations in the above can help “legitimize” that use. We are simply drawing from some best practices to help frame the conversation and, ultimately, the reporting that takes place to explain anonymization.

Re-Identification Attacks

There is a small set of well-known re-identification attacks that are repeated at conferences, in academic publications, and by the media, often in an attempt to raise awareness around the field of anonymization. Like any scientific discipline, the field treats these data points as evidence to inform and evolve practice (and where there isn’t evidence, the field relies on scientific plausibility). We call them demonstration attacks because they serve to demonstrate a potential vulnerability, although not its likelihood or impact. Demonstration attacks target the most “re-identifiable” individual to prove the possibility of re-identification. They are a risk in public data sharing, since there are no controls and the attacker can gain notoriety for a successful attempt.

These well-known and publicized re-identification attacks were not on what we would consider to be anonymized data, nor would the data have been considered anonymized by experts in the field of statistical disclosure control (the field defined by decades of practice by experts at national statistical organizations). Although the methods of statistical disclosure control have existed for decades, they were predominantly applied to national statistics and government data sharing. Let’s consider a handful of demonstration attacks to learn about them, and the lessons we can extract.

AOL Search Queries

In 2006 a team at AOL thought it would be of value to researchers in natural language processing (a field of computer science that develops algorithms to understand language) to share three months of web searches, around 20 million queries from 657,000 users. They made the data publicly available, and it can still be found on the computers of researchers around the world, and probably on peer-to-peer networks, even though AOL removed the search data from its site shortly after the release, when a New York Times reporter published a story after having identified user 4417749.

User 4417749’s searches included “tea for good health”, “numb fingers”, “hand tremors”, “dry mouth”, “60 single men”, “dog that urinates on everything”, “landscapers in Lilburn, GA”, and “homes sold in shadow lake subdivision Gwinnett County Georgia”. Pay close attention to the last two searches. Geographic information narrows the population in a very obvious way, in this case allowing a reporter to visit the neighborhood and find a potential match. And this is how Thelma was found from her search queries.4

What’s more, others claimed they were able to identify people in the search data. Many search queries contained identifying information in the form of names from vanity searches (in which you search for yourself to see what’s publicly available) or searches for friends and neighbors, place names associated with a home or place of work, and other identifiers that could be used by pretty much anyone, since the search data was public. And of course the searches also included sensitive personal information that people expected to be kept private. It’s a good example of the risks of sharing pseudonymized data publicly.

Netflix Prize

Again in 2006, Netflix launched a data analytics competition to predict subscribers’ movie ratings based on their past movie ratings. Better algorithms could, in theory, be used to provide Netflix users with targeted film recommendations so that they stay engaged and keep using the service. The competition was pretty much open to anyone, and by joining, participants would gain access to a training set of 100,480,507 ratings for 17,770 movies by 480,189 subscribers. Each rating included a pseudonym in place of the subscriber name, the movie, the date of the rating, and the rating itself.5

A group of researchers demonstrated how they could match a few dozen ratings to the Internet Movie Database (IMDb), a limit imposed by the IMDb terms of service, using a robust algorithm that would attempt to optimize the matches. They hypothesized that the ratings of Netflix users who also rated movies on IMDb would strongly agree. The researchers claimed that subscribers in the Netflix dataset were unique based on a handful of ratings outside the top 500 movies and approximate dates (+/- 1 week), and that they had found two especially strong candidates for re-identification. Based on the matching between the public IMDb movie ratings and the Netflix movie ratings, the researchers claimed to be able to infer political affiliation and religious views from the non-public movies viewed and rated in the Netflix data.

Whether an adversary could know this level of detail, and confirm their target was in the sample dataset, is debatable. However, given an appropriate database with names and overlapping information, the algorithm developed could be effective at matching datasets. It’s hard to know from a demonstration attack alone. In the case of mobility traces, researchers found a similar approach to have a precision of about 20% given overlapping data from the same population (although they had found that 75% of trajectories were unique from 5 data points).6
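As a rough illustration of this kind of matching (our sketch under simple assumptions, not the researchers’ published algorithm), candidate matches between public and pseudonymous ratings can be scored by agreement on ratings, with some slack on dates, weighting rare titles more heavily:

```python
# A rough sketch of this kind of matching (our illustration, not the published
# algorithm): score candidates by agreement on ratings, tolerate small date
# differences, and weight rare titles more heavily than popular ones.
from datetime import date

def match_score(public_ratings, pseudonymous_ratings, popularity, tolerance_days=7):
    score = 0.0
    for title, (rating, rated_on) in public_ratings.items():
        if title in pseudonymous_ratings:
            other_rating, other_date = pseudonymous_ratings[title]
            if rating == other_rating and abs((rated_on - other_date).days) <= tolerance_days:
                score += 1.0 / popularity.get(title, 1)  # rare titles are more identifying
    return score

# Invented example: one rare title in common is worth far more than a blockbuster
popularity = {"Obscure Documentary": 3, "Blockbuster": 100_000}
public = {"Obscure Documentary": (5, date(2005, 3, 1)), "Blockbuster": (4, date(2005, 3, 2))}
private = {"Obscure Documentary": (5, date(2005, 3, 4)), "Blockbuster": (4, date(2005, 3, 2))}
print(match_score(public, private, popularity))
```

The highest-scoring pseudonymous record becomes the claimed match, which is why a handful of rare ratings with approximate dates can be enough to single someone out.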

State Inpatient Database

Both of the previous examples involved data in which pseudonyms had replaced user names. Let’s consider a different example, in which not only were names removed, but some information was generalized, such as date of birth being replaced by age. In this case we turn to the Healthcare Cost and Utilization Project (HCUP), which shares databases for research and policy analysis. In 2013 the State Inpatient Database (SID) of Washington State from 2011 was subject to a demonstration attack using publicly available news reports. Privacy experts had warned that these databases required additional protection, and since the demonstration attack multiple improvements have been introduced.

Basically, a team searched news archives for stories about hospital encounters in Washington State. One story was about a 61-year-old man from Soap Lake who was thrown from his motorcycle on a Saturday and hospitalized at Lincoln Hospital. Raymon was re-identified in the SID based on this publicly available information, and from there all of his other hospital encounters in the state that year were available, since the database was longitudinal.

A total of 81 news reports from 2011 were collected from the news archives with the word “hospitalized” in them, and 35 patients were uniquely identified in the SID of 648,384 hospitalizations.7 On the one hand, you could argue that 35 individuals out of 81 news reports is a significant risk, provided there’s public reporting of the hospitalization; on the other hand, you could argue that 35 individuals out of 648,384 hospitalizations is a very small number compared to the benefits provided by sharing the data. Public sharing is challenging given the risk of a demonstration attack, whereas controls can dramatically reduce the likelihood of such incidents. What’s important, however, is what we learn about the potentially identifying information used to identify an individual, which can be used to properly estimate risk.
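For illustration only (hypothetical field names and invented records, not the actual study data), the core of such a linkage attack can be sketched as matching publicly reported attributes against the shared database and flagging unique matches:

```python
# Hypothetical sketch of a linkage attack: match publicly reported attributes
# against a shared database on common quasi-identifiers, and flag records that
# match uniquely (a unique match is a potential re-identification).
def unique_matches(reports, database, keys=("age", "gender", "hospital", "month")):
    found = []
    for report in reports:
        candidates = [rec for rec in database
                      if all(rec.get(k) == report.get(k) for k in keys)]
        if len(candidates) == 1:
            found.append((report, candidates[0]))
    return found

# Invented example records, loosely inspired by the scenario described above
reports = [{"age": 61, "gender": "M", "hospital": "Lincoln Hospital", "month": "2011-10"}]
database = [
    {"age": 61, "gender": "M", "hospital": "Lincoln Hospital", "month": "2011-10", "diagnosis": "..."},
    {"age": 47, "gender": "F", "hospital": "Lincoln Hospital", "month": "2011-10", "diagnosis": "..."},
]
print(unique_matches(reports, database))
```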

Lessons Learned

We need to distinguish between what is possible and what is probable; otherwise we would spend our lives crossing the street in fear of a plane falling on our heads (possible, but not probable). Demonstration attacks are important for understanding what is possible, but they don’t always scale or make sense outside of a very targeted attack on public data, where there are no controls on who has access and what they can do with the data. Our focus in this book is primarily non-public data sharing, and how we can assess risk based on the context of that sharing.

Let’s draw some lessons from these demonstration attacks.

  • Pseudonymized data, in which names and other directly identifying information have been removed, are vulnerable (which is why they are considered personal data).

  • Data shared publicly is at risk of demonstration attacks, the worst kind since it only takes one re-identification to claim success. Notoriety is an important motivator, leading to publication of the results.

  • Contractual controls can discourage attempts at a demonstration attack (e.g., the IMDb terms of service), but will not be sufficient to eliminate all attacks. Additional controls and data transformations will be required.

Risk-Based Anonymization

Let’s turn our attention to what we mean by risk based, since we’ve used this term a few times already. An evaluation of risk implicitly involves careful risk assessments, to understand more precisely where there is risk and what the impact of different mitigation strategies would be. This drives better decisions about how to prioritize and manage these risks. The process also means that risk is evaluated in an operational context, using repeatable and objective assessments to achieve our data sharing goals.

We take a very scientific approach to risk-based anonymization. Besides being evidence based, so that the approach is reasonable and adaptive to a changing threat landscape, we also determine risk tolerance using a threshold that is independent from how we measure the probability of re-identification. Based on risk assessments that drive an evaluation of the context of the data sharing scenario, we can compare the risk measurement to the threshold to determine how much to transform identifying information until the statistical measure of risk is within the predefined risk tolerance. We will describe this process in detail in the following chapters, but this aspect at least can be seen in Figure 1-1.

images/anonymization_cycle.png
Figure 1-1. Quantitatively evaluating the risk of re-identification involves setting a risk threshold and comparing a statistical risk measurement against that threshold. This drives the transformations of identifying data.
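To give a feel for the loop in Figure 1-1, here is a minimal sketch under simplifying assumptions (an invented dataset, a group-size measure of risk over quasi-identifiers, and a single generalization step on age); real measurements and transformations are far richer, as later chapters will show:

```python
# A minimal sketch, with an invented dataset and a deliberately simple
# group-size (k-anonymity style) measure of risk over quasi-identifiers.
from collections import Counter

def measure_risk(records, quasi_identifiers):
    """Maximum re-identification risk, taken as 1 / (smallest group size)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return 1.0 / min(groups.values())

def generalize_age(records, width):
    """One example transformation: replace exact age with an age band."""
    out = []
    for r in records:
        low = (r["age"] // width) * width
        out.append({**r, "age": f"{low}-{low + width - 1}"})
    return out

data = [{"age": 34, "zip": "98001"}, {"age": 36, "zip": "98001"},
        {"age": 35, "zip": "98001"}, {"age": 52, "zip": "98002"},
        {"age": 53, "zip": "98002"}, {"age": 51, "zip": "98002"}]

threshold = 0.34  # risk tolerance, set independently of the measurement

for width in (5, 10, 20):  # widen the age bands until risk is within tolerance
    transformed = generalize_age(data, width)
    risk = measure_risk(transformed, ["age", "zip"])
    if risk <= threshold:
        break

print(f"age bands of width {width} give a measured risk of {risk:.2f}")
```

The threshold comes first, from the disclosure context; measurement and transformation then iterate until the measured risk falls within that tolerance.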

Contrast this with a fixed, list-based approach to the necessary data transformations. HIPAA, mentioned earlier, includes in its Privacy Rule a method known as Safe Harbor that uses a fixed list of 18 identifiers that need to be transformed. It includes many directly identifying pieces of information that need to be removed, such as names and Social Security numbers. Individual-level dates must be limited to year only, and there are limits on the accuracy of geographic information. Regardless of context, regardless of what data is being shared, the same approach is used.

The only saving grace of the HIPAA Safe Harbor approach is a “no actual knowledge” requirement that has been interpreted as a catch-all to verify that there are no obvious patterns that could be used to identify someone, such as a rare disease. Although the approach is simple, it’s not very robust privacy protection and is only really useful for annual reporting. Also note that it’s only suitable under HIPAA, as it was derived using US census information and there are no provisions in the regulations of other jurisdictions to use this specific list.
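Purely as an informal illustration of what a fixed, list-based rule set looks like in practice (hypothetical field names, not a compliance implementation of Safe Harbor), the same transformations would be applied to every record regardless of context:

```python
# Informal illustration only (hypothetical field names, not a compliance
# implementation of Safe Harbor): a fixed rule set applied to every record,
# regardless of context or of the data being shared.
import re

def apply_list_based_rules(record):
    out = dict(record)
    out.pop("name", None)  # direct identifiers are dropped outright
    out.pop("ssn", None)
    if "birth_date" in out:  # individual-level dates reduced to year only
        out["birth_year"] = out.pop("birth_date")[:4]
    if "zip" in out:  # geographic detail coarsened
        out["zip3"] = re.sub(r"\d{2}$", "**", out.pop("zip"))
    return out

print(apply_list_based_rules(
    {"name": "Jane Doe", "ssn": "000-00-0000", "birth_date": "1980-06-15",
     "zip": "98101", "diagnosis": "J45"}))
```

Notice that nothing in these rules depends on the data being shared or on who will receive it, which is exactly the limitation of a list-based approach.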

Another approach would be heuristics, which are rules of thumb derived from past experience. These tend to be more complicated than simple lists, with conditions and exceptions. Buyer beware: the devil is in the details, and it can be hard to justify heuristics without defensible evidence or metrics. They may provide a subjective gut check that things make sense, but this will be insufficient when faced with regulatory scrutiny.

The purpose of a risk-based approach is to replace an otherwise subjective gut check with a more guided decision-making approach. This is why we described risk-based anonymization as a risk-management approach. And one of the most important ways you can reduce risk in a repeatable way is through automation, as shown in Figure 1-2. Creating automated risk-management processes, in general, ensures you capture all necessary information without missing anything, with auditable proof of what was done in case an issue arises that you need to correct for next time.

images/anonymization_automation.png
Figure 1-2. Automation means replacing a gut check with repeatable processes and auditable proof of what was done.
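As a small, hypothetical sketch of that idea (the function names and log format are ours), an anonymization run can be wrapped so that its parameters and measured risk are recorded automatically as auditable evidence:

```python
# Hypothetical sketch (the function names and log format are ours): wrap an
# anonymization run so its parameters and measured risk are captured as an
# auditable record, rather than relying on someone remembering what was done.
import datetime
import hashlib
import json

def audited_run(records, params, anonymize, measure_risk, log_path="audit_log.jsonl"):
    result = anonymize(records, **params)
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "input_fingerprint": hashlib.sha256(
            json.dumps(records, sort_keys=True).encode()).hexdigest(),
        "parameters": params,
        "measured_risk": measure_risk(result),
    }
    with open(log_path, "a") as f:  # append-only log of every run
        f.write(json.dumps(entry) + "\n")
    return result
```

The anonymize and measure_risk callables stand in for whatever process and risk measurement are actually in place; the point is simply that every run leaves evidence behind.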

About This Book

We’ll provide a conceptual basis for understanding anonymization, starting with an understanding of re-identification risk; that is, providing a reasonable measure of risk regarding the ability to identify an individual in data. We will do this in two chapters, starting with the idea of an identifiability spectrum to understand data risk in Chapter 2, and then a governance framework that explains the context of data sharing to understand context risk in Chapter 3. Risk will be assessed in terms of both data and context, since they are intimately linked. Our identifiability spectrum will therefore evolve from the concept of data risk into one that encompasses both data and context.

From this conceptual basis of re-identification risk we will then look at data pipelines. We’ll start with identified data, and concepts from privacy engineering, in Chapter 4. That is, how do you design a system with privacy in mind, building in protections and, in particular, reducing identifiability for those novel uses of data that fall outside of the original purposes of data collection? We will also touch on the subject of having both identified and anonymized data.

Once we’ve established the requirements that are of concern with identified data, we will consider another class of data from which direct identifiers have been removed, which we described above as pseudonymized. This is the first step to reducing identifiability: removing the names and addresses of the people in the data. In [Link to Come], we start to work explicitly towards anonymizing data. We first look at how pseudonymization fits as data protection, and introduce a first step towards anonymization. We also consider technologies that can sit on top of pseudonymized data, and what that means in terms of anonymization.

Our final data pipeline, in [Link to Come], is focused entirely on anonymization (and so entirely on secondary uses of data). We start with the more traditional approach of pushing anonymization at the source to a recipient. But then we turn things around, considering the anonymized data as being pulled by the recipient. This way of thinking provides an interesting opportunity to leverage anonymization from a different set of requirements, and opens up a way to build data lakes. We will do this by building on concepts introduced in other chapters, to come up with novel approaches to building a pipeline.

We finish the book in [Link to Come] with a discussion of the safe use of data, including the topics of accountability and ethics. The practical use of “deep learning” and related methods in artificial intelligence and machine learning (AIML) has introduced new concerns to the world of data privacy. Many frameworks and guiding principles have been suggested to manage these concerns, and we wish to summarize and provide some practical considerations in the context of building anonymization pipelines.

1 El Emam, K., & Arbuckle, L. (2013, updated 2014). Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly Media, Inc.

2 For an excellent summary of the identifiability spectrum applied across a range of controls, see Future of Privacy Forum (2016). A Visual Guide to Practical De-Identification.

3 HITRUST Alliance (2015). HITRUST De-Identification Framework. hitrustalliance.net/de-identification

4 AOL Search Data Leak

5 Netflix Prize

6 Wang, H., et al. (2018, January). De-Anonymization of Mobility Trajectories: Dissecting the Gaps Between Theory and Practice. In The 25th Annual Network & Distributed System Security Symposium (NDSS’18).

7 Sweeney, L. (2015). Only You, Your Doctor, and Many Others May Know. Technology Science, 2015092903.
