Lunchtime for Data Privacy

This post will be a bit of a philosophical one, about what privacy is and why we might choose to call things by that name. The story comes in the context of the paper No Free Lunch in Data Privacy, which argues that one can't have everything when it comes to statistical analysis of private data: if you want some accuracy you have to give up some privacy.

I don't agree, and the crux of the issue is "just what do you get to call 'privacy'?", a question that gets surprisingly little attention.

Historically, privacy research has struggled to nail down a good formal definition of "privacy". There have been many, many definitions, though most of these are post-hoc characterizations of whatever a given paper achieves as "privacy", rather than aspirational goals about what should be protected. Even the most successful definitions, say k-anonymity and differential privacy, are used primarily to describe properties of computations rather than to formalize our intuitive requirements of something that calls itself "privacy".

This lack of clarity leads to confusion when people speak of things "not providing privacy" or being a "privacy breach", because (i) there are so many definitions of privacy, and (ii) it isn't clear whether the authors' beliefs about privacy have any bearing beyond the publication of their paper.

Let's take as an example the Privacy in Pharmacogenetics paper, which we previously discussed. The paper describes an attack, model inversion, which predicts genetic markers through access to (i) other private data of yours, and (ii) correlations observed in other people between their genetic markers and their corresponding private data. The authors conclude that the disclosure of (ii), the correlations, which again involve only the data of people other than you, constitutes an attack on your private data.

Nothing written above is wrong in the factual sense, but it is rhetorically misleading. An attacker may indeed be able to learn about your private data, but it is in the same sense that someone who sees you in public may be able to make a guess about your weight. Although this could be annoying, should we call the understanding of the general correlation between size and weight a "privacy attack"? Is it really no different than revealing the same information directly from your most recent medical examination, which we likely would view as a breach of privacy? Should we perhaps distinguish these two, so that we might prevent one and worry less about the other?

There is a valuable difference between "your secrets" and "secrets about you". As a society we place great value on one's right to self-determination, and somewhat less value on one's right to image management. We value one's ability to keep what secrets they have, but find it perhaps less reasonable to demand ignorance of others, even about one's self. If we must trade away some secrets for accurate statistics in the common interest, it is helpful to distinguish these two types of secrets we might be trading away, and it may be harmful to fail to distinguish them. We can (and should) argue about the relative costs and benefits of keeping each type of secret, but we must distinguish between them to have meaningful discussions.

Differential privacy is a formal distinction between "your secrets" and "secrets about you".

Implementations of differential privacy formally demonstrate that you do not need to trade away any of "your secrets" to get some accurate statistical information. On the other hand, it does not guarantee that any "secret about you" remains (or indeed ever was) a secret, for the very important reason that such a guarantee is generally impossible.
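
To make that first claim concrete, here is a minimal sketch of the standard Laplace mechanism for releasing an average (in Python; the weights, bounds, and epsilon are all made up, and a production implementation would be more careful than this):

```python
import random

def laplace_noise(scale):
    # The difference of two independent exponentials is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_average(values, lower, upper, epsilon):
    # Clamp each value so that changing any one record moves the true average
    # by at most (upper - lower) / n, then add Laplace noise at that scale.
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clamped)
    return sum(clamped) / len(clamped) + laplace_noise(sensitivity / epsilon)

weights = [62.0, 75.5, 81.2, 58.9, 90.3, 70.1]   # hypothetical weights, in kg
print(noisy_average(weights, lower=40.0, upper=120.0, epsilon=0.5))
```

The released average is close to the truth for even modestly sized datasets, yet swapping out any one person's record barely changes the distribution of outputs, which is the sense in which nobody trades away "their secrets".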

We are going to go a bit deeper on each of these points, and circle round to the No Free Lunch paper and its implications for data privacy generally, and differential privacy in particular. The authors claim to address misconceptions about differential privacy, but I worry that they introduce misconceptions of their own instead.

No Free Lunch

The No Free Lunch in Data Privacy paper is organized around a "no free lunch" type of theorem, one which says that for a technique to do one thing well, it must do other things badly. This makes some sense in the context of data privacy, where there is a tension between the two main goals, the privacy of individuals and accuracy across large populations. If you want to do one very well, you may need to do the other badly.

The theorem says that if you have a secret about a dataset---the authors give the example of reading the first record in the dataset---then any non-useless computation can take the secret from "totally secret" to "pretty much public". This would suck; records in the dataset feel very much like a "your secret" type of secret.

To understand, let's look at the definition of non-useless the authors use, which they call discrimination.

[Image: the paper's formal definition of a "discriminant".]

Informally, discrimination says that there should be at least a few input datasets that are likely to produce different sorts of results. This means that if you see one of those results, the input probably wasn't one of the other datasets (with the parameters set appropriately).

The theorem uses this intuition to reverse out the precise value of a secret, from any discriminating computation:

[Image: the paper's statement of the no-free-lunch theorem.]

The proof picks datasets with different values of the secret as inputs, and asks the computation to discriminate between them, at which point it knows the input dataset and hence the secret. Unfortunately, the definition of discrimination doesn't require a computation to distinguish between arbitrary datasets, just some datasets. Maybe not these datasets. The proof doesn't hold, and indeed there are counter-examples to the theorem: consider a query which wants to read out my record, and a computation which says "buzz off" if my record is present in the input (but is otherwise discriminating).
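
Here is a toy version of that counter-example in Python; the records and the particular "otherwise discriminating" behavior (reporting the dataset size) are invented purely to make the point:

```python
def mechanism(dataset):
    # Happily discriminates between many pairs of datasets (it reveals their
    # sizes), but clams up whenever my record is present, so it can never be
    # used to read my weight (the secret) back out of the input.
    if any(name == "frank" for (name, weight) in dataset):
        return "buzz off"
    return len(dataset)

print(mechanism([("alice", 62.0), ("bob", 75.5)]))     # 2
print(mechanism([("alice", 62.0), ("frank", 81.2)]))   # buzz off
```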

The theorem and proof can be fixed by changing the definitions and the statement somewhat; if nothing else, it is meant to be a simplification of prior work by Cynthia Dwork and Moni Naor. The theorem's final form notwithstanding, the authors follow it with a clear example in which a discriminating computation fails to meet their standard for privacy. I'm going to paraphrase it in the language of our size-and-weight example, but I think it is otherwise faithful to theirs.

Consider a dataset where each entry is a pair of height and weight. Your secret is your weight. Imagine k datasets, in each of which you have a distinct weight and the participants have a distinct density (ratio of weight to height). Now any computation which reveals the average density across the dataset identifies which of the k datasets was the input, and so also reveals your weight.

The authors' conclusion is that we require more assumptions about how data are produced, since the flexibility to tie your specific weight to a broad statistical trend stresses the credibility of the example. But let's change the example somewhat, and see that you get the same argument without implausible datasets.

Consider a dataset where each entry is a pair of height and weight. Your secret is your density, because your height is public information. Imagine k datasets in each of which all participants have the same density, but with different densities for different datasets. Now any computation which reveals the average density across the dataset also reveals your density.
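
A quick sketch of this version of the example, with made-up heights and a made-up common density; the only point is the arithmetic:

```python
def average_density(dataset):
    # Density here is the post's ratio of weight to height.
    return sum(weight / height for (height, weight) in dataset) / len(dataset)

my_height = 1.80                  # public information
shared_density = 42.0             # the dataset-wide "secret about you"

# Every participant in this hypothetical dataset has the same density, so the
# exactly released average *is* your density, even though your record is never
# singled out; combined with your public height, it is also your weight.
dataset = [(h, h * shared_density) for h in (1.60, 1.75, my_height)]

print(average_density(dataset))                # the shared density
print(average_density(dataset) * my_height)    # your weight, recovered
```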

Again we have what counts as a privacy violation, where we have only given all participants a common statistical property (density). Is it implausible that all participants share some common statistical property? I don't think so. Is our understanding of privacy in any way improved by introducing mathematical assumptions on the data generation process that preclude this possibility? Again, I don't think so.

Any computation which reveals something about a dataset may reveal something about an individual. It's a fact.

Privacy as agency

The definition of privacy as protecting arbitrary "secrets about you" has some fundamental issues. The most important one, in my mind, is agency: if the rest of the world can learn your secret without your help, it is not your secret. It may be a secret, currently, but this is only true for as long as the rest of the world isn't bothered enough to think too hard about it. The requirement that it remain a secret is a requirement made on the rest of the world, and how it is allowed to think.

At the same time, protecting arbitrary "secrets about you" is a much stronger standard than is actually needed to provide meaningful and actionable privacy guarantees. When asked to contribute data, I have two options (contribute, or don't) and only need to be reassured that I will not regret participation. I might regret that someone is asking the questions they are, but if these conclusions would be reached with or without my participation I have no practical reason not to participate.

As mentioned before, differential privacy provides a guarantee that your secrets remain your secrets. Participation in a differentially private computation will reveal nothing that wouldn't be revealed without your participation. The results can say nothing that couldn't be learned without your data. Any secret you could keep before can still be kept. Secrets you could not keep might now have been revealed, but not because of your participation.
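
To see that guarantee in action, here is a small Python experiment with a counting query (the weights, the threshold, and epsilon are all made up): the distribution of released answers is nearly the same whether or not your record is included.

```python
import random
from collections import Counter

def laplace_noise(scale):
    # The difference of two independent exponentials is Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

others   = [62.0, 75.5, 58.9, 90.3, 70.1]   # everyone else's weights
with_you = others + [81.2]                  # ... plus yours

def noisy_count_over_80(values, epsilon):
    # A counting query has sensitivity 1: adding or removing one record
    # changes the exact count by at most 1.
    exact = sum(1 for v in values if v > 80.0)
    return round(exact + laplace_noise(1.0 / epsilon))

# The probability of any released count changes by at most a factor of
# e^epsilon between the two inputs, so the output says almost nothing about
# whether you participated, though it does reveal something about the group.
print(Counter(noisy_count_over_80(with_you, 0.5) for _ in range(10000)))
print(Counter(noisy_count_over_80(others,   0.5) for _ in range(10000)))
```

Whether the released count is itself a "secret about you" is exactly the distinction drawn above; the guarantee is only that your participation changes nothing.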

These guarantees hold with no assumption on how data are generated.

Misconceptions

The main body of the No Free Lunch paper addresses what its authors feel are "misconceptions" about differential privacy. I can't speak to how well differential privacy is understood (not well, is my best guess), but the authors' clarifications are not entirely helpful.

[Image: the "misconceptions" about differential privacy, as quoted from the paper.]

To be clear, these statements are true as written. Differential privacy makes no requirements of data generation, or what an attacker does or does not know. It provides the guarantee that your secrets are kept with no further assumptions.

The authors argue that differential privacy requires further assumptions to protect "privacy", but this is only true when privacy means "secrets about you"; their arguments do not distinguish between "your secrets" and "secrets about you". To the extent that there are misconceptions about differential privacy's guarantee, I think the largest is that masking all "secrets about you" is the right privacy standard to expect.

While we can debate whether one is entitled to impose secrets about oneself on others (e.g. the debate around the right to be forgotten), we should be able to do so without muddying the understanding of other, more actionable (and less impossible) guarantees, such as those provided by differential privacy.

As a concrete proposal, based on the right to be forgotten, consider using the following terms:

  • Privacy: protecting secrets that could only be revealed with your participation.

  • Forgettability: protecting secrets that could be revealed without your participation.

I'm guessing that folks currently working with what is now "forgettability" are all "LOL! no.", but seriously, make the case that what you are about should be called privacy.