Chapter 3. A Practical Risk-Management Framework

While technology is an important enabler of data de-identification, it is not the end of the story. Building an effective anonymization pipeline at an enterprise level is as much about governance as it is about technology, since the aim is to deliver trust to stakeholders.1 Accounting for risk in de-identification technology is critical to achieving the right level of de-identification and the resulting data utility, which in turn influences the analytic outcomes.

To maximize outcomes, an organization must have efficient methods of measuring, monitoring and assuring the controls associated with each disclosure context. More broadly, organizations should establish a framework to manage re-identification risks holistically while enabling a wide range of data uses.

If you only apply technology to anonymize data, you miss out on a vital area of the overall strategy—the people and decisions behind the solution and the processes and procedures that instill consistency. Without these elements, you miss the tenets of governance—accountability, transparency, and applicability. And you end up with less useful data.

The techniques used to achieve anonymization cannot be separated from the context in which data are shared: the exact data you’re working with, the people you’re sharing it with, and the goals of subsequent analysis. This is called risk-based anonymization. There is a framework that has emerged from statistical data sharing by government agencies that is predominantly a risk-based approach. We’ll demonstrate how it can be operationalized in a broader setting.

Five Safes of Anonymization

Responsible data sharing requires an assessment of many factors, all of which need to be considered objectively to compare data sharing options. Only then can data custodians determine the most appropriate option for their particular circumstances, given the risks and benefits of sharing data in the first place.

One framework that has gained popularity after more than a decade of use is known as the Five Safes,2 which is intended to capture the relevant dimensions to assess the context and results of a data sharing scenario in an effort to make sound decisions. Those dimensions are: Safe Projects, Safe People, Safe Settings, Safe Data, and Safe Outputs. The term “safe” is treated on a spectrum, as in “how safe” is it, so that this balancing can take place.

Note

The entire premise of the Five Safes is based on the idea of risk assessment, which may be seen as subjective but with objective support through risk estimation. Greater emphasis is then placed on empirical evidence to drive decision-making.

Let’s compare with risk-based anonymization, which requires an evaluation of the external information available to an adversary (whether a re-identification is intentional or not), and how they may combine it to re-identify data. Removing personal information from data using a risk-based methodology requires an assessment of the environment and the circumstances in which the data will be shared (to know what external information will be available to an adversary), and an assessment of the data itself (to determine how the external information available to an adversary may be used to re-identify data).

With that in mind, the Five Safes can be described using concepts from risk-based anonymization, as shown in Figure 3-1. Our goal with this framework is the safe use of data, while maintaining as much granularity as possible. This is why the framework starts with defining project boundaries, and then focuses on people and settings. That way residual risk is managed by de-identification, i.e., transforming the data to meet project needs and maintain the highest level of data utility we can. But what we do with that data (i.e., the outputs) will still pose some risks, which we consider last.

images/five_safes.png
Figure 3-1. Overall risk exposure using the Five Safes, operationalized through risk-based anonymization.

We can summarize the steps of the Five Safes, operationalized through risk-based anonymization, in greater detail as follows:3

Safe Projects

What are the legal and ethical boundaries of a data sharing scenario, and is de-identification needed as a privacy-protective measure?

Safe People

Who are the anticipated data recipients, what are their motivations and capacity to re-identify, and who may they know in the data?

Safe Settings

What are the technical and organizational controls in place to prevent a deliberate attempt to re-identify or to prevent a data breach?

Safe Data

What is the re-identification risk, considering the people and settings of the data environment, and what de-identification can be applied?

Safe Outputs

What are the risks of using the de-identified data for the intended and other purposes, and what is a suitable risk threshold?

Imagine a healthcare scientist seeking access to data. In general we may think that the use of health data will provide public benefit, and should therefore be supported. Health data can, however, be among the most sensitive data about individuals, revealing lifestyle habits, personal events that may trigger strong emotions, embarrassing information, or just things they want to keep private because, well, it’s personal. Taking our responsibility towards data subjects seriously, we need to ensure the safe use of this data. Let’s walk through the Five Safes in detail, keeping this example in mind.

Safe Projects

Our healthcare scientist is seeking access to personal data, which could be within the same institution but through a different department, or it could be from an external organization. The intended use of the data may be of benefit to the data subjects, in their healthcare treatment or in some other way, or it may be more general and of potential benefit to the public at large. We need to capture all these considerations through an evaluation of data flow and primary and secondary purposes, before we decide whether to provide access and launch into the effort of ensuring the safe use of data.

Data Flow

It’s important to understand the flow of data, to recognize legal and ethical boundaries and intended purposes so that we can identify the parameters needed to assess risk and create safe projects.

  • Where the collected data are coming from, who collected them, and the legal and ethical grounds for doing so.

  • Where the shared data are going, who wants access, and the legal and ethical grounds for doing so.

  • Whether the data are considered personal or not, and how anonymization is applied in accordance with regulations.

Primary and Secondary Purposes

Understanding the legal context for collection, approval mechanism,4 and transparency will be important to determine the appropriate mechanisms for sharing data, especially for secondary purposes.

  • The data custodian may have collected information for a primary purpose, such as providing care to a patient.

  • Or the data custodian may have collected information explicitly for a secondary purpose, such as constructing a database of patients with diabetes for subsequent research.

  • Personal information may also come indirectly through one or more data custodians, where permitted.

  • Alternatively, data may come from another source claiming to be anonymized (which may need to be assessed in its own right before being used or combined with personal information).

When properly anonymized, data are no longer personal and therefore not subject to privacy legislation. But first the uses need to be understood to determine whether they are primary or secondary purposes, and to determine the legislative requirements.

  • An agent, acting on behalf of the data custodian, may use personal data for a primary purpose.

  • Depending on the jurisdiction, there may not be a legislative requirement to de-identify information that an agent uses for secondary purposes, or a requirement to obtain additional approval from data subjects for such uses. However, it may be encouraged or desirable.

  • The data custodian may also receive a request to share with an internal or external recipient for some secondary purpose. Sharing of personal data is sometimes mandatory, whereas in other cases it is at the discretion of the data custodian. The conditions for discretionary sharing vary.

  • Other data sharing that is not explicitly permitted in legislation requires that either approval be obtained from the data subjects or the personal data be anonymized.

When to De-Identify

There are circumstances in which we may not de-identify the data, even for secondary uses. But there are also circumstances in which we may want to de-identify for the sake of protecting privacy, and circumstances in which we must do it. We divide these into four scenarios to consider in deciding how the data can be shared, and what, if any, de-identification is needed.

Mandatory sharing

No approval is required, and the data do not require anonymization because it is likely that individuals need to be identified (e.g., law enforcement). However, there may be considerable underreporting by individuals due to privacy concerns.

Internal sharing

It is often unnecessary for an agent to have data in identifiable form to perform their functions, even for primary purposes, and de-identification is desired to enhance privacy and avoid potential breaches.

Permitted sharing

Approval may be optional, under the discretion of the data custodian, for the public good (e.g., public health). There is reluctance, however, by data custodians to share personal data due to issues of individual and public trust, which de-identification can help remedy.

Other sharing

When approval is not possible or practical, and there are no exceptions in the legislation, the custodian must anonymize the personal data before sharing with a data recipient.

For our healthcare scientist, we’ll assume their desired use of the data falls within the category of “other data sharing”. Even if it had been a permitted sharing scenario we would likely have wanted to anonymize the data, but given the legislative authority our risk tolerance would have been higher, meaning the release of more granular data.

Safe People

Our healthcare scientist is unlikely to be the only one that will have access to the requested data. There may be analysts and technologists, perhaps even students, that will be working with the data. We need to understand the lab in which the data will be used, who will have access, and under what circumstances.

Data recipients are central to an assessment of context risk because the entity or its employees may re-identify data, whether intentionally or not. It may come as a surprise, but the anticipated recipient is also considered an adversary. This isn’t to say they are malicious; adversary is a general term meant to capture those entities that pose risks. Unintended recipients may also need to be considered, and therefore a more complete picture of all the possible recipients is warranted.

We assume that the adversary has access to the shared data, and has some background knowledge that will be used in a re-identification. The nature of that background knowledge will depend on the assumptions one is willing to make. Figure 3-2 provides some examples of the types of adversaries we consider, which will also have different depths of knowledge.

For example, the researcher, media, or marketer will use publicly available information to re-identify, which relates to the Sample to Population direction in the “Direction of Matching” described previously, whereas the relative, neighbor, or co-worker will use publicly available information (since it’s public, after all) as well as information known to them as acquaintances, which relates to the Population to Sample direction. We therefore start to see how our concepts of re-identification science relate to the scenarios that would form the basis of threat models.

images/adversaries.png
Figure 3-2. Adversaries can be divided into two categories: those that use public information to re-identify, and those that are acquaintances and have more in-depth knowledge to re-identify.

Recipient Trust

We can begin to evaluate the likelihood of an attempt to re-identify by considering these potential adversaries. Consider the motives and capacity of the anticipated data recipient to re-identify the shared data. We assume that the data custodian is sharing data that have gone through some kind of de-identification.

Motives

The motive to re-identify individuals in the data implies an intentional re-identification, considering issues such as conflicts of interest and the potential for financial gain from a re-identification.

Capacity

The capacity to re-identify individuals in the data considers whether the data recipient has the skills and financial resources to re-identify the data.

Motives can be managed by having enforceable data sharing agreements or contracts with the data recipient. Such an agreement will determine how likely a deliberate re-identification attempt would be. Contractual obligations need to include very specific clauses (otherwise there are some very legitimate ways to re-identify a dataset):

  • A prohibition on re-identification, on attempting to contact any of the patients in the data set, and on linking with other data sets without permission from the data custodian;

  • An audit requirement that allows the data custodian to conduct spot checks to ensure compliance with the agreement, or a requirement for regular third-party audits;

  • A prohibition on sharing the data with other third parties (so that the data custodian can keep track of who has the data), or a requirement to pass on the above restrictions to any other party the data is subsequently shared with.

You can imagine our healthcare researcher being resistant to some of these clauses. Of course these are all optional, and to be determined by the use case and governance that a data custodian wants to have in place, based on their risk tolerance. We often hear of organizations, for example, that prefer not to have restrictions on linking with other data sets. This can be managed somewhat by adding some fine print, such as no linking to identified data, or personal data, or with data that may increase identifiability.

Acquaintances

Recipient trust is about attempts to re-identify, but there is still a risk even when there is no attempt. Data recipients may have prior knowledge of personal information because they’re acquaintances of individuals in the data (remember our list of potential adversaries in Figure 3-2). This in turn may lead them to re-identify inadvertently, or spontaneously (yes, that’s actually what it’s sometimes called!), simply by recognizing them. It’s a factor that needs to be considered when evaluating risk, because it relates to how safe it is to have people working with data.

Our healthcare scientist, and those working in their lab, provide a perfect example if they are working with data from the same geography they live in or are from. In order to know intimate information about an acquaintance, those individuals working with the data would need to be some kind of friend. We can therefore incorporate in our models the probability that the adversary knows someone in the defined population covered by the data.

VIPs are also at elevated risk, because more information about them is known publicly (we are all their acquaintance!). This would include individuals that are in the public realm often, and where there would be a media interest in writing about information that may be contained in the shared data, especially if they were unusual or pertinent to their public role. Typical VIPs would be politicians, actors and artists, and sports personalities.

A re-identification of a VIP may seem like a low-likelihood event, although they are potentially more likely to be targets. A successful re-identification would, however, have a high impact, perhaps more damaging to public trust due to the increased media interest. The easiest approach to dealing with VIPs would be to remove them from shared data, rather than inflate the re-identification risk measurement for all data subjects. That being said, the transformations that are planned may be sufficient if identifiers known to acquaintances are included, and if the data is a sample.

Safe Settings

We need to assess the data environment of our healthcare scientist, that is, the environment in which the shared data will be used. If anyone in an organization can walk in and use the data, we know the environment is on the low end of safe, and this will leave a significant residual risk to account for through data transformations. On the other end of the spectrum, a safer environment will mean more granular data for our healthcare scientist and their team.

The security and privacy practices of the data recipient will have an impact on the likelihood of a rogue employee at the data recipient’s site being able to re-identify the shared data. A rogue employee may not necessarily be bound by a contract unless there are strong mitigating controls in place. These practices also determine the likelihood of an outsider gaining access to the shared data.

Note

An evaluation of mitigating controls needs to be detailed and evidence based, preferably mapped to existing professional, international, and government regulations, standards, and policies, including ISO/IEC 27002, where appropriate. Using a standardized approach also ensures consistency, not only for a single organization that is sharing data, but across organizations, e.g., the HITRUST De-Identification Framework.5

There are several mitigating controls that need to be considered in dealing with personal data, and to ensure the assessment of Safe Settings is defensible. These are considered the most basic forms of controls. Think of them as minimum standards only! We can only give you a taste of what’s expected in the subsections that follow, because it’s pretty detailed (although this summary covers a lot of ground).

Controlling Access, Disclosure, Retention, and Disposition of Personal Data

  • Only authorized staff should have access to data, and only when they need it to do their jobs.

  • There should be data sharing agreements in place with collaborators and subcontractors, and all of the above should have to sign nondisclosure or confidentiality agreements.

  • There should be a data retention policy with limits on long-term use, and regular purging of data to reduce vulnerability to breaches.

  • If any data is going to leave the relevant jurisdiction in which the data sharing is taking place, there should be enforceable data sharing agreements and policies in place to control disclosure to third parties.

Safeguarding Personal Data

  • It’s important to respond to complaints or incidents, and that all staff receive privacy, confidentiality, and security training.

  • Personnel need to be disciplined for violations of these policies and procedures, and there should be a tried and tested protocol for privacy breaches.

  • Authentication measures must be in place with logs that can be used to investigate an incident.

  • Data can be accessed remotely, but that access must be secure and logged.

  • On the technical side, a regularly updated program needs to be in place to prevent malicious or mobile code from being run on servers, workstations and mobile devices, and data should be transmitted securely.

  • It’s also necessary to have physical security in place to protect access to computers and files, with mandatory photo ID.

Ensuring Accountability and Transparency in the Management of Personal Data

  • There should be someone in a position of seniority who is accountable for the privacy, confidentiality, and security of data, and there needs to be a way to contact that person.

  • Internal or external auditing and monitoring mechanisms also need to be in place.

Risk Matrix

A detailed assessment of Safe Settings can be combined with our assessment of Safe People to create a standard risk matrix for an internal adversary, as shown in Figure 3-3. If you’ve ever seen a risk matrix before, you’ll know they usually contain subjective entries. The entries in our risk matrix are, however, known as expert probabilities, which have been derived from past data releases by reputable organizations and regulatory or industry guidance.6

images/riskmatrix.png
Figure 3-3. A risk matrix provides a visual demonstration of risks to assist decision making, and in this case expert probabilities are used so that more objective support can be provided in estimating re-identification risk.

Having expert probabilities, instead of subjective categories of low, medium, and high, allows us to combine the entries with measures of re-identification risk in the data itself, which we’ll see explicitly when we discuss Safe Data. This means that we can assign a probability of attempting to re-identify to our researcher and lab personnel. As can be seen from the risk matrix, the more we can trust the recipients, and the stronger the privacy and security settings, the lower the assigned probability that they will attempt to re-identify data.
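To make this concrete, here is a minimal sketch in Python of the risk matrix as a lookup from the assessments of Safe People and Safe Settings to an expert probability of a deliberate attempt. Only the 0.05 entry is taken from the worked example later in this chapter; every other value, and the three-level ratings themselves, are illustrative placeholders rather than the published matrix.

# Expert probability of a deliberate re-identification attempt, indexed by
# (recipient trust, strength of privacy and security controls).
# Only the 0.05 entry is grounded in this chapter's worked example; the
# remaining values are illustrative placeholders.
PR_DELIBERATE = {
    ("high", "high"): 0.05,
    ("high", "medium"): 0.10, ("high", "low"): 0.20,       # placeholders
    ("medium", "high"): 0.20, ("medium", "medium"): 0.30,  # placeholders
    ("medium", "low"): 0.40,  ("low", "high"): 0.40,       # placeholders
    ("low", "medium"): 0.50,  ("low", "low"): 0.60,        # placeholders
}

def pr_deliberate_attempt(recipient_trust: str, controls: str) -> float:
    # Look up the expert probability for the assessed data sharing context.
    return PR_DELIBERATE[(recipient_trust, controls)]

print(pr_deliberate_attempt("high", "high"))  # 0.05, as used for our healthcare scientist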

Safe Data

At this point we’ve done everything we can to capture the mitigating controls in place, both technical and organizational, to evaluate our healthcare scientist’s data environment. We are now left with reducing the residual risk of re-identification through data transformations.

An assessment of Safe People and Safe Settings results in an evaluation of context risk. A structured approach, known as threat modelling, can be used to assess context risk and evaluate whether an attack will be realized. Consistent with the modelling of threat sources used in information security and risk modelling, there are three plausible attacks that can be made on data:7

Deliberate

A targeted attempt by the data recipient as an entity, or a rogue employee due to a lack of sufficient controls, to re-identify individuals in the shared data. The risk matrix from Figure 3-3 is used to capture this probability.

Accidental (inadvertent)

An inadvertent or unintentional re-identification, for example, an individual being recognized while a recipient is working with the shared data. This probability can be estimated as the probability of having at least one acquaintance in the defined population.8

Environmental (breach)

The data could also be lost or stolen in the case where all the controls put in place have failed to prevent a data breach. Industry-specific rates provide a means to estimate the probability of a data breach.

To produce Safe Data the overall risk of re-identification needs to be assessed, which is a combination of context risk (the probability of an attack) and data risk (the probability of re-identification when there is an attack).9 As summarized in Figure 3-4, this will drive the de-identification required to reduce re-identification risk so that residual risks are appropriately managed.

images/overall_risk_data_and_threats.png
Figure 3-4. Overall re-identification risk is a combination of the probability of re-identification in the data given an attack, times the probability of an attack in the first place (determined through threat modelling).

Quantifying Risk

Because risk measurement invariably requires the use of statistical methods, any risk measurement technique will be based on a model of plausible re-identification attacks, and models make assumptions about the real world. Therefore, risk measurement will always imply a series of assumptions that need to be made explicit. Furthermore, because of the statistical nature of risk measurement, there will also be uncertainty in these measurements and this uncertainty needs to be taken into account.

The risk measurement we’re referring to applies to indirectly identifying data. Three kinds of risk need to be managed, for which detailed metrics can be derived:10

Prosecutor risk

The prosecutor has background information about a specific person that is known to them, and uses this background information to search for a matching record in the shared data.

Journalist risk

The journalist doesn’t know whether a particular individual is in the shared data, which is a subset of a larger public dataset, but does know that everyone in the shared data exists in that larger public dataset.

Marketer risk

The marketer is less concerned if some of the records are misidentified. Here the risk pertains to everyone in the data. Marketer risk is always less than prosecutor or journalist risk, and is therefore often ignored.

In practice either prosecutor or journalist risk is used, as they represent targeted attacks, whereas marketer risk is an average and will always be a lower probability. Prosecutor risk is used when a target individual is known to be in the shared data, whereas journalist risk is used when the target individual is in a larger defined population of which the shared data is only a sample. In other words, if the shared data represents the entire defined population, use prosecutor risk; if the shared data represents a sample from the defined population, use journalist risk.

Note

Prosecutor and journalist, although representing targeted attacks, are forms of average risk. There is still a risk that uniques remain in the data, even though on average the cluster sizes are much larger. For this reason, we advocated for what we termed strict average in our previous book,11 in which a maximum risk metric is included that ensures there are no population uniques in the data.

If a population registry has information about individuals who are known to be in the shared data, an adversary may target the highest-risk data subjects. In this case, the maximum of the risk metric is taken across all data subjects when there are no controls in place to prevent such an attack (e.g., public data sharing). On the other hand, if an adversary will not target the highest-risk data subjects, because there are controls in place to prevent such an attack, but is trying to find information about a specific individual, the risk metric is averaged across all data subjects since the target is random (e.g., private data sharing).
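To make these metrics concrete, here is a minimal sketch in Python of average and maximum prosecutor and journalist risk, computed from equivalence classes over the quasi-identifiers. The record structure, column names, and helper functions are our own illustrations, not a particular library’s API, and the equivalence-class formulation is the standard one rather than a specific published algorithm.

from collections import Counter

def class_sizes(records, quasi_identifiers):
    # Count how many records share each combination of quasi-identifier values.
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def prosecutor_risk(sample, quasi_identifiers):
    # The adversary knows the target is in the sample, so a record's risk is
    # 1 / (size of its equivalence class in the sample).
    sizes = class_sizes(sample, quasi_identifiers)
    per_record = [1 / size for size in sizes.values() for _ in range(size)]
    return sum(per_record) / len(per_record), max(per_record)  # (average, maximum)

def journalist_risk(sample, population_counts, quasi_identifiers):
    # The adversary only knows the target is in the larger population, so a record's
    # risk is 1 / (size of the matching equivalence class in the population).
    per_record = [1 / population_counts[tuple(r[q] for q in quasi_identifiers)]
                  for r in sample]
    return sum(per_record) / len(per_record), max(per_record)  # (average, maximum)

# Toy data for illustration only.
sample = [{"age": 30, "sex": "F"}, {"age": 30, "sex": "F"}, {"age": 45, "sex": "M"}]
population_counts = {(30, "F"): 20, (45, "M"): 5}
print(prosecutor_risk(sample, ["age", "sex"]))                      # (~0.67, 1.0)
print(journalist_risk(sample, population_counts, ["age", "sex"]))   # (0.1, 0.2)

The strict average mentioned in the note above would report the average but also check the maximum (for example, that no population equivalence class has a size of one), ensuring there are no population uniques in the data.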

Safe Outputs

Once we share data with the healthcare scientist, it should go without saying that they will produce models and statistics. The scientist and team want to learn from the data. The question is what do they want to learn, and how will they use this information. We should have captured their purposes under Safe Projects, but it’s possible they will find other uses that we need to keep an eye on. The anonymized data itself is an output, to that healthcare lab, but so are the analytical results and decisions they make from the data. We want to ensure these are not disclosive in a way that would be deemed inappropriate.

Imagine the healthcare scientist stated upfront, in their request for data, that they wanted to study vaccination rates. Through the use of the anonymized data, the scientist finds that there is a population group that is under vaccinated. The scientist may now take this information and launch a targeted education campaign, or publish in local media as well as an academic journal.

Although public education seems laudable, the disclosure may result in that population group being targeted by others in the community in less than ideal ways, through shaming or being treated in a biased way. These decisions require careful consideration. Not to mention the possibility that the results are also used in other ways, such as for marketing purposes. Although we would like to capture as many of these as possible in defining a Safe Project, we must recognize that circumstances change once the results are in hand and understood.

Ultimately, the degree of de-identification necessary to reduce risk to a suitable tolerance level raises the question of risk thresholds. There are many precedents going back multiple decades for what is a suitable probability for sharing anonymized data, with a range of options shown in Figure 3-5. To decide which threshold to use, we can look at the sensitivity of the data and the approval mechanism that was in place when the data was originally collected.

Invasion of Privacy

Invasion of privacy is a subjective criterion that can be used by the data custodian to influence the selection of a risk threshold. If the invasion of privacy is deemed to be high, that should skew the decision more toward a lower threshold. On the other hand, if the invasion of privacy is deemed to be low, a higher threshold would be selected.

  • Are the data highly detailed, and are they highly sensitive and personal in nature?

  • What is the potential injury to individuals from an inappropriate processing of the data?

  • What is the appropriateness of approval by data subjects for disclosing the data?

Although approval from data subjects is not required for sharing properly anonymized data, sharing is considered less privacy invasive when approval has been provided by data subjects than when no approval is sought. There are in fact multiple levels of notice and approval that can exist for the sharing of anonymized data.

  • There is a court order or a provision in the relevant legislation permitting the sharing and use of the data without notice or approval of data subjects.

  • The data were unsolicited or given freely or voluntarily by the data subjects with little expectation of it being maintained in total confidence.

  • The data subjects have provided express approval that their data can be shared and used for this purpose when it was originally collected or at some point since then.

  • The data custodian has consulted well-defined groups or communities regarding the sharing and use of the data and had a positive response.

  • A strategy for informing or notifying the public about potential sharing and use for the data requestor’s purpose was in place when the data were collected or since then.

  • Obtaining approval from data subjects at this point is inappropriate or impractical.

The practical consequence of evaluating invasion of privacy is that the suitable threshold (or the definition of “very small risk”) will be lower under the most invasive scenario. Even under the most invasive scenario, however, it is possible to share the data, but the degree of de-identification would be greater.

images/thresholds.png
Figure 3-5. The risk threshold represents the maximum tolerable risk for sharing the data. This threshold needs to be quantitative and defensible. The expert probabilities in this diagram are based on past precedents.
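As a minimal sketch, the choice of threshold can be expressed as a lookup from an invasion-of-privacy rating. Only the 0.10 value is taken from this chapter’s worked example; the three-level rating and the other threshold values are illustrative placeholders standing in for the documented precedents summarized in Figure 3-5.

# Maximum tolerable risk (threshold) chosen from the invasion-of-privacy rating.
# Only the 0.10 value is grounded in this chapter's worked example; the other
# values are illustrative placeholders for documented precedents.
ILLUSTRATIVE_THRESHOLDS = {
    "low invasion of privacy": 0.33,     # placeholder
    "medium invasion of privacy": 0.10,  # used for the cancer-data example later
    "high invasion of privacy": 0.05,    # placeholder
}

def risk_threshold(invasion_of_privacy: str) -> float:
    # More invasive sharing scenarios get a lower (stricter) threshold.
    return ILLUSTRATIVE_THRESHOLDS[invasion_of_privacy]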

Five Safes in Practice

Let’s pull together the information presented into a risk-based assessment of re-identification. We’ll use our healthcare scientist, but be more specific about the context of the data sharing.

Safe Project

The personal data collected is from a hospital that wishes to leverage their data assets for scientific research into treatments and effects of cancer. Data will be made available to healthcare scientists for secondary research purposes, but only after an ethics review to ensure the uses are deemed appropriate. The environment in which the data will be used will be external to the hospital, but in the same jurisdiction.

Safe People

The hospital has decided that only identified researchers from approved research institutions will have access to data. They have recognized that some researchers will have analysts and technologists, perhaps even students, that will be involved in working with data, and could have acquaintances in the data. For this reason, contracts will be required with the research institution to ensure there is oversight, and all staff that will access data will be required to take privacy training and sign agreements regarding ethical use of data they are being entrusted with. There should be no obvious reason to want to re-identify data (i.e., low motives and capacity).

Safe Settings

Although the hospital would like to make the data broadly available, they are not in a position to assess the data environment on a case-by-case basis. Rather, they will only share data when the level of privacy and security controls is deemed high, which will be spelled out in standard data sharing contracts. These will result in a fixed risk score, simplifying data sharing from their perspective, and will require the institution to agree to be accountable for the research scientist’s lab environment.

Safe Data

With the previous information, the hospital is in a position to assess risk based on plausible re-identification attacks, which represent the data context.

  • Deliberate: The Safe People in this case have been assumed to have low motives and capacity to re-identify. The Safe Settings are fixed so that the privacy and security controls will be high. Combined, Safe People and Safe Settings are mapped to the risk matrix in Figure 3-3 to provide a risk score of 0.05.

  • Accidental (inadvertent): The most prevalent disease in the data will be breast cancer, and the probability of knowing at least one woman with breast cancer is about 0.70. On the other hand, the probability of knowing at least one person with oral cancer, which is much less common, is about 0.054. This means the risk of having an acquaintance in the data will vary by cancer type, although a conservative estimate would be to use breast cancer (the sketch following this list shows how such figures can be computed).

  • Environmental (breach): Breach rates vary by the level of privacy and security controls. Previous breach rates in US healthcare were reported to be 0.14 for strong controls, which we can use for this exercise, although it should be verified against the relevant industry and jurisdiction where breach rates are available.
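Here is a minimal sketch of the accidental (inadvertent) attack probability, using the acquaintance model from footnote 8: with roughly 150 acquaintances (the Dunbar number) and a knowable characteristic of prevalence p, the probability of knowing at least one person in the defined population is 1 − (1 − p)^150. The prevalence values below are illustrative choices that roughly reproduce the 0.70 and 0.054 figures quoted above, not official statistics.

DUNBAR = 150  # assumed average number of acquaintances (footnote 8)

def pr_acquaintance(prevalence: float, acquaintances: int = DUNBAR) -> float:
    # Probability of having at least one acquaintance with the characteristic.
    return 1 - (1 - prevalence) ** acquaintances

print(round(pr_acquaintance(0.008), 2))    # ~0.70, e.g., breast cancer (illustrative prevalence)
print(round(pr_acquaintance(0.00037), 3))  # ~0.054, e.g., oral cancer (illustrative prevalence)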

Safe Outputs

To ensure the data is appropriately transformed, based on data and context risk, a risk tolerance needs to be defined. Although health data is inherently sensitive, cancer data is not itself more sensitive (compared to things like abuse, sexual orientation, etc.). Based on past precedent, we will assume an appropriate risk threshold of 0.10.

To operationalize the above, the hospital will measure re-identification risk of the data, combined with context, to determine how much the data needs to be transformed. If we assume data on oral cancer is being shared, based on the above the primary driver of risk will be from the data environment, with the risk of data being lost or stolen. In other words, we have a context risk of 0.14, and an overall threshold of 0.10, so that the data will need to be transformed so that 0.14 * data_risk ≤ 0.10.
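A minimal sketch of this final check, assuming (as the text does for this example) that context risk is taken as the largest of the three attack probabilities:

def max_tolerable_data_risk(deliberate: float, accidental: float, breach: float,
                            threshold: float) -> float:
    # Context risk is the largest of the three attack probabilities, consistent
    # with the 0.14 used above; the data must then be transformed until
    # context_risk * data_risk <= threshold.
    context_risk = max(deliberate, accidental, breach)
    return threshold / context_risk

# Oral cancer example: deliberate 0.05, accidental 0.054, breach 0.14, threshold 0.10.
print(round(max_tolerable_data_risk(0.05, 0.054, 0.14, 0.10), 2))  # ~0.71

In other words, the measured data risk (for example, an average prosecutor or journalist risk) would need to be at or below roughly 0.71 for the overall risk to stay under the 0.10 threshold.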

Probabilistic risk estimation is a tool to drive decision making, providing guidance on what aspects of data sharing, be it context or data, need to be modified to ensure risk is appropriately managed. The estimation is based on a long history in statistical disclosure control. The modeling is useful, but as we already pointed out there is subjectivity to all models. Our goal is to ensure the models are defensible, while capturing the broader context of data sharing to ensure our picture of risk is complete and reasonable. The Five Safes provides a framework to capture this context in a (hopefully) memorable way.

Final Thoughts

In many jurisdictions, demonstrating that data has a very small risk of re-identification is a legal or regulatory requirement. Our methodology provides a basis for meeting these requirements in a defensible, evidence-based way. We have demonstrated how the Five Safes framework can be operationalized using risk-based anonymization: each dimension is evaluated independently of the others, brought together by an overall assessment of risk. This allows for the evaluation of scenarios of responsible data sharing, which will be context driven given the impact different scenarios will have on the usefulness of the data.

Data utility is important for those using anonymized data, because the results of their analyses are critical for informing services provided, policy, and investment decisions. Also, the cost of getting access to data is not trivial, making it important to ensure the quality of the data received. We don’t want to be wasteful, spending time and money collecting high-quality data, only to then watch that quality deteriorate through anonymization practices meant to prepare the data for secondary use.

The impact of de-identification on data utility is important, and very context-driven. All stakeholders need to provide input on what is most important to them, be it data utility or privacy. It’s not easy to balance the needs of everyone involved, but open communication and a commitment to producing useful data with a sufficiently low risk of re-identification is all that is really needed to get started. It’s not an easy negotiation, and it may be iterative, but its importance cannot be overstated. Ideally, framing that conversation around the Five Safes should help to clarify the most important points.

1 Templar, M. (2017). Get Governed: Building World Class Data Governance Programs. Iron Lady Publishing.

2 Ritchie, F. (2017). The ‘Five Safes’: A Framework for Planning, Designing and Evaluating Data Access Solutions. Data For Policy 2017, London, UK, September 2017.

3 Arbuckle, L., & Ritchie, F. (2019). The Five Safes of Risk-Based Anonymization. IEEE Security & Privacy, 17(5), 84-89.

4 We use the word “approval” rather than “consent” because the latter can have very specific conditions and interpretations associated with it based on the relevant privacy laws.

5 HITRUST Alliance (2015). HITRUST De-Identification Framework. hitrustalliance.net/de-identification

6 See Chapter 18 of El Emam, K. (2013). Guide to the De-Identification of Personal Health Information. CRC Press.

7 See, for example, ISO 27005 Information Security Risk Management, NIST SP 800-30 Risk Management Guide for IT Systems, and CSE TRA-1 Harmonized Threat and Risk Assessment Methodology.

8 On average people tend to have 150 friends, called the Dunbar number. Given the prevalence ρ of a knowable characteristic that defines the population of the data, the probability of having an acquaintance in the data can be computed in a straightforward manner using 1 − (1 − ρ)^150. See Measuring Risk Under Plausible Attacks in Chapter 2 of El Emam, K. & Arbuckle, L. (2013, updated 2014). Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly Media.

9 Marsh, C., Skinner, C., Arber, S., Penhale, B., Openshaw, S., Hobcraft, J., Lievesley, D., & Walford, N. (1991). The Case for Samples of Anonymized Records From the 1991 Census. Journal of the Royal Statistical Society: Series A (Statistics in Society), 154(2), 305-340.

10 See Chapter 18 of El Emam, K. (2013). Guide to the De-Identification of Personal Health Information. CRC Press.

11 See Managing Re-Identification Risk in Chapter 2 of El Emam, K. & Arbuckle, L. (2013, updated 2014). Anonymizing Health Data: Case Studies and Methods to Get You Started. O’Reilly Media.
