
# Differential privacy and Demographics

Demography is the study of populations, most commonly human populations.

The subjects of demographic study (people) have some quirks that aren't as clearly raised in other areas of statistics. People generally participate (or not) in demographic studies of their own free will. The responses they submit or behavior they present can reflect their beliefs about the nature or implications of the study. Finally, in many contexts we generally agree that people have rights, to privacy or agency, over their personal data. These differences distinguish the study of human populations from, say, that of migratory birds or the failure rates of industrial components.

What with the US Census working through differential privacy for the 2020 census, I've had the chance to see more and more demographers in my Twitter timeline, and the results were somewhat surprising.

In one sense, many of the responses were entirely predictable. Enforced differential privacy is an existential threat to a generation of demographers who have limited training in modern data privacy. Where previously demographers may have received raw data dumps with names redacted, a higher privacy standard means they can no longer do the work they trained to do the way they were trained to do it. This work is, as you might imagine, quite useful and important in a functioning society, and preventing demographers from performing it to the best of their ability is understandably frustrating.

What surprised me then? Two things, as best as I can tell. First, that both the concerns and mechanics of privacy are fundamentally properties of populations; these issues should be right in the demographers' wheelhouse, but they seem to have had nothing specific to say about them (most specific data on the subject comes from the US Census itself). Second, as advanced by Kinsey, privacy guarantees are crucial if you hope to have full and honest participation by a population that can choose what data to contribute; I had thought that other demographers might hold similar values, but it never came up.

## A demographic study of privacy

It seems to me, admittedly naive on these matters, that almost all of the social and mechanical questions about privacy are questions about populations. If one scientific field should be home to the study of privacy, you might think demography would be it.

A recent, and very excellent, Tweetorial from John Abowd, Chief Scientist at the US Census, reports a great deal of interesting quantitative data about the privacy properties of the set of responses to the 2010 US Census. For example, take the question:

> How many individuals are uniquely identified by their combination of age, race, gender, ethnicity within their census tract?

This quantity is rather important if you want to understand whether releasing this information in the clear might have privacy risks. It also happens to be a quantity that can be trivially reported, with very high accuracy, using differential privacy. Aside: I'm not aware of another formal privacy criterion that would permit this release.

The number is 50%, by the way. Half of the respondents in the 2010 US Census are uniquely identified by their age, race, gender, ethnicity within their census tract. That's a pretty large number, and has some serious implications for whether we should publish this information.
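
To make that measurement concrete, here is a minimal sketch of how one could compute the fraction from raw microdata. Everything here is an illustrative assumption (pandas, the file name, the column names); the real Census files and schemas differ.

```python
import pandas as pd

# Hypothetical microdata: one row per respondent.
df = pd.read_csv("census_microdata.csv")  # hypothetical file and schema

quasi_identifier = ["age", "race", "gender", "ethnicity", "tract"]

# Number of respondents sharing each quasi-identifier combination.
sizes = df.groupby(quasi_identifier).size()

# Each combination held by exactly one person is one uniquely
# identified respondent.
fraction_unique = (sizes == 1).sum() / len(df)
print(f"{fraction_unique:.1%} of respondents are uniquely identified")
```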

Even though there are no names or addresses above, demographic data that uniquely identify an individual can still cause harm. These five values form what is called a quasi-identifier: enough information to uniquely identify a person. Imagine that ICE or the IRS or the ATF (or whichever central authority you like least) knows that among the 100-ish people in your census tract, exactly one matches your demographic description; they don't need to know your name to go door to door asking after your description. If the US adds the question "are you a US citizen" to the 2020 census, this becomes a much less hypothetical concern.

If 50% is such a large number, why are we discovering this just now in 2019 rather than in 2010 as that Census landed? Which community is meant to be studying privacy risks to populations? I think demographers would have been a great fit!

What about the social views of privacy violations? Do people care about privacy violations? Do they change their behavior because of perceived privacy risks? Would they not do so with better privacy mechanisms? These all seem like great questions for demographic research. Perhaps this is already underway.

Unfortunately, it seems like there is a bit of a gap here. John's Tweetorial, for example, indicates that for 46% of 2010 US Census respondents, their quasi-identifiers can be exactly reconstructed from statistical information the US Census could publish. That is almost half the population, facing a potentially serious consequence (quasi-identifier release), and something to be concerned about.

Instead, we have prominent demographers LOL'ing at how exact reconstruction of quasi-identifiers for 46% of the 2010 respondents "isn't high accuracy".

## Shame

This is just embarrassing. If I had the ability to crack 46% of the passwords of US citizens, or held naked photos of 46% of these demographers' family members, you could be damned sure they would pay attention. Either these people don't understand what they are talking about, or they just don't take it very seriously.

It would be great to have demographers study the privacy characteristics and concerns of human populations, but it may require some new generation of demographers. This isn't how you get full and honest participation from subjects with privacy concerns.

## Differential privacy benefits demography

There is a great deal of angst about the harm differential privacy may cause to demographic research, and this is an entirely reasonable angst. If nothing else, differential privacy will be very disruptive to established methodology, and there could very easily be very significant "else"s.

At the same time, there are useful things that differential privacy does for demographic research, and I wanted to call them out. This isn't meant to absolve differential privacy of its challenges, or to ask the demographers to be thankful, but just to point out that there is a virtuous cycle of feedback here, if you look for it. Perhaps it can help.

### Differential privacy allows you to ask new questions

Differential privacy provides a formal basis for asking and answering questions whose privacy properties are otherwise challenging to justify.

We saw this above, with the question:

> How many individuals are uniquely identified by their combination of age, race, gender, ethnicity within their census tract?

The exact response to this question, with the addition of Laplace noise with parameter 2/epsilon, provides epsilon-differential privacy. (The count has sensitivity two: changing one record can change the number of uniquely identified individuals by at most two, the changed record itself plus one other record it ties with or unties from.)

This is neat because while other statistical disclosure limitation mechanisms might randomize and then release the source data, the introduced entropy will increase the apparent number of uniquely identifiable individuals, unfairly over-reporting the threat. Differential privacy's ability to ask direct questions, and get direct answers with unbiased noise, has the potential to dramatically simplify some statistical analyses.
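
As a sketch of how simple that release could be, the function below adds exactly the Laplace noise described above to the exact count. The sensitivity bound is from the text; the function name, and the reuse of the `sizes` series from the earlier sketch, are illustrative assumptions.

```python
import numpy as np

def dp_unique_count(sizes, epsilon, rng=None):
    """Epsilon-DP estimate of the number of uniquely identified
    respondents, via Laplace noise scaled to the count's sensitivity
    of two."""
    rng = rng or np.random.default_rng()
    true_count = int((sizes == 1).sum())
    return true_count + rng.laplace(loc=0.0, scale=2.0 / epsilon)
```

With epsilon = 0.1 the noise has standard deviation around 28 (the scale is 2/epsilon = 20, and a Laplace has standard deviation scale times sqrt(2)), against a true count over one hundred million. That is the "very high accuracy" claimed above.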

### Differential privacy improves inferences

The data that many demographers currently use, public use microdata released by the US Census, are not accurate. They are randomized and adjusted in various ways in order to improve privacy (at least since 1990). However, the mechanisms the US Census uses to do this are not public. Demographers who want to use these measurements and draw conclusions about the true underlying population must make informed guesses about the process (or, not uncommonly, just ignore the process).

This was a very real problem in the 2000 Census, the 2003-2006 American Community Survey, and the 2004-2009 Current Population Survey, as reported by Alexander, Davern, and Stevenson in *Inaccurate age and sex data in the Census PUMS files: Evidence and Implications*. Specifically,

> For women and men ages 65 and older, age- and sex-specific population estimates generated from the PUMS files differ by as much as 15% from counts in published data tables. [...] These problems were an unintentional by-product of the misapplication of a newer generation of disclosure avoidance procedures carried out on the data.

Because the US Census could not report the specific processes they applied to the data, consumers of the data had no idea about the level of inaccuracy and could not easily evaluate it for themselves. Some number of conclusions resulted that did not reflect the true underlying data, and likely still today other conclusions have been drawn that are not representative; to the best of my understanding, because the underlying processes are secret, most demographers just don't know.

One very appealing property of differential privacy is that one can be public about the mechanism used, and the levels of noise introduced in the publications. The specific values of noise cannot be released, but the exact probabilistic relationship between each publication and the source data can be published in the clear.
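
As a sketch of what a public mechanism buys the consumer, suppose a count is published with Laplace noise of a publicly stated scale (the function and names here are hypothetical). The consumer can state exactly how uncertain the published value is:

```python
import math

def laplace_count_interval(published, scale, coverage=0.95):
    """Confidence interval for a count released with Laplace(scale) noise.

    Because the mechanism is public, the error distribution is known
    exactly: the published value is unbiased, and the noise magnitude
    exceeds scale * ln(1 / (1 - coverage)) with probability exactly
    1 - coverage.
    """
    radius = scale * math.log(1.0 / (1.0 - coverage))
    return (published - radius, published + radius)
```

No analogous function can be written for the secret PUMS adjustments, which is a large part of how the 15% discrepancies above went unnoticed.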

A detailed understanding of a statistical process is table stakes for drawing conclusions based on observations of that process. It boggles my mind (and the minds of several colleagues) that this is not more of an issue with demographers. I would have imagined demographers demanding a clear explanation of the statistical processes in the name of proper science; instead they are currently demanding the opposite.

### Differential privacy can increase accuracy

At the moment, public use microdata releases are (to the best of my understanding) based on a roughly 1% sample of the US population. This is because the data are sensitive, and further subsampling provides an additional layer of protection over the other protocols used by the US Census.

Differential privacy provides a justification for increasing the volume of data fed into these releases by a factor of 100.
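
A back-of-envelope comparison, using illustrative numbers rather than actual Census figures, of the error a 1% sample introduces against the error Laplace noise on a full count introduces:

```python
import math

population = 320_000_000          # illustrative, not a Census figure
sample = population // 100        # a PUMS-style 1% sample
p = 0.001                         # some rare subpopulation of interest

# Sampling: the count in the sample is roughly Binomial(sample, p), and
# scaling it by 100 to estimate the population count scales the error too.
sampling_error = 100 * math.sqrt(sample * p * (1 - p))

# Full data with noise: the exact count plus Laplace(1/epsilon) noise
# (a count has sensitivity one); this error ignores the population size.
epsilon = 0.1
dp_error = math.sqrt(2) / epsilon  # standard deviation of Laplace(1/epsilon)

print(f"sampling error: ~{sampling_error:,.0f} people")
print(f"dp noise:       ~{dp_error:,.0f} people")
```

Under these made-up numbers the sampling error is several thousand people and the noise is about fourteen; the hundred-fold increase in data swamps the cost of the noise.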

The limits imposed by differential privacy also improve the accuracy of the conclusions of demographers as a group, by formally preventing overfitting of results to the source data and informally preventing data dredging. These limits have the potential to be frustrating to the publication-hungry demographer, but they are actually good news for policy makers who want to act with confidence based on their findings.

### Differential privacy can increase participation

Perhaps most important, the strong guarantees of differential privacy have the potential to mitigate the issue of self-censorship associated with statistical surveys.

Years ago, Alfred Kinsey surveyed various populations about human sexuality, and controversially instituted strong anonymity protocols in an attempt to get the most candid and honest answers. Shy voters report to pollsters behavior they believe will be better received, despite ultimately voting differently. Currently, the US government hopes to introduce a US Census question that will likely suppress participation by minorities and immigrants. Surveys of human populations must deal with the issue that their subjects, humans, are able to modulate their participation in their own self-interest.

In each of these cases, data are contaminated by the human subjects' view of the process, and the concerns that their specific answers might feed back in undesirable ways.

This might be fine, if the behavioral changes were uniform across the population, but invariably they are not. The potential harms of privacy loss are felt more strongly by people with something to lose. People reasonably argue that the US citizenship question will be meaningless, as the self-censorship will so obviously color the participation. John Abowd argues that one justification for differential privacy is the Equal Protection Clause: equally strong privacy guarantees should apply to the most interesting, unique, and vulnerable members of our society, as apply to fairly generic Irish-German folk like me who benefit from being demographically unremarkable.

Much as weak privacy protection can lead to poor data, strong privacy guarantees have the potential to improve the quality of collected data.
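
Nothing here prescribes a specific mechanism, but Warner-style randomized response, the classic technique for sensitive survey questions and an instance of (local) differential privacy, illustrates the idea: each individual answer is deniable, while the population rate remains estimable. The 3/4 probability is an illustrative choice, giving ln(3)-differential privacy.

```python
import random

def randomized_response(truth: bool) -> bool:
    """Report the truth with probability 3/4, the opposite with 1/4."""
    return truth if random.random() < 0.75 else not truth

def estimate_rate(answers):
    """Unbiased estimate of the true 'yes' rate from randomized answers.

    E[observed] = 0.75 * p + 0.25 * (1 - p) = 0.5 * p + 0.25,
    so p = 2 * observed - 0.5.
    """
    observed = sum(answers) / len(answers)
    return 2.0 * observed - 0.5
```

A respondent who answers "yes" can always claim the coin made them say it; a respondent who would otherwise skip the question, or shade their answer, loses their reason to.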

## Demography and Privacy, going forward

While there are probably going to be growing pains for a while, the field of demography will need to come to terms with the data privacy requirements of their subjects. That resolution shouldn't be the subjugation of one to the other, but rather a framework where the non-trivial trade-off between the benefits of more and less information can be managed. Differential privacy provides one framework, where more aggregate and less specific information can be released, but any other similarly serious proposal is worth considering.

Academics LOL'ing about the privacy non-risks is not a helpful contribution.

Aaron Roth likes to point out that one of the more important contributions of differential privacy is that it gives a framework for trading off accuracy and privacy: withholding data provides 0-differential privacy, and releasing data in the clear provides infinite-differential privacy, but there is a spectrum between. As soon as you concede that finite-differential privacy has value, you can have a sane discussion about where on that spectrum to sit to maximize social good. We can ask serious quantitative questions that weigh the efficacy of enforcing the Civil Rights Act from accurate data against the risks of disclosing individual detail in immigrant populations.
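
For reference, the spectrum Aaron describes is the parameter epsilon in the standard definition (stated generically; nothing here is specific to the Census): a randomized mechanism M is epsilon-differentially private when, for all pairs of datasets D and D' differing in one record, and all sets S of possible outputs,

```latex
\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S].
```

At epsilon = 0 the two distributions must be identical, so the output reveals nothing about any individual (or anything else); as epsilon grows the constraint relaxes toward release in the clear.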

A lot of the public discourse about differential privacy and demography research is framed as "demographers vs differential privacy", but in my experience very few differential privacy researchers have religion that one must use differential privacy. Rather, like Aaron above, they have religion that with access to responsible privacy technology, you should have a serious discussion about whether and why you should or should not use it. In fairness, our jobs don't depend on people using differential privacy nearly as much as the demographers' jobs appear to depend on them not using it; it's a bit easier to be dispassionate with less skin in the game.

That current demographers largely don't understand modern data privacy is a reason to be wary of immediate large-scale deployment, but this is also something demographers will need to fix. It also appears they might need some encouragement to do so, and perhaps the US Census and their statutory obligations are a fine way to kick-start that. But if they do, anyone and everyone expert with differential privacy should work hard to ease the transition.