[liberationtech] Opinion on a paper?
Paul Bernal (LAW)
Paul.Bernal at uea.ac.uk
Mon Sep 10 00:35:13 PDT 2012
Thanks Jodi, Alec, Joss, Robert and Nick
That's really helpful - and very quick! A lot to digest - but it looks pretty devastating for the piece, confirming my suspicions.
Many thanks again
Dr Paul Bernal
UEA Law School
University of East Anglia
Norwich Research Park
Norwich NR4 7TJ
email: paul.bernal at uea.ac.uk
On 10 Sep 2012, at 03:20, Robert Munro <robert.munro at gmail.com> wrote:
I second the criticism about the assumptions of a 'perfect population
register'. This is a much broader problem, as shown by the Netflix
case. For a good synopsis, see Pete Warden's take on the problem, some
examples of how external data can be used to help reverse anonymized
data, and some suggestions for ways to operate with imperfect
You certainly don't need to be high-profile, either, like the article
suggests. Last year I was working on disease outbreak tracking. There
was an actual case where a girl in East Africa had been reported as
testing positive to Ebola. Her village was named in reports and this
was a region where victims of diseases are often vilified and
sometimes killed. She would have likely been the only person from her
village who was rushed to a hospital at that time (and more likely the
only girl of her age-bracket). It would have been simple for everyone
from her village to immediately make the connection. We decided we
would not want to publish this information, even though many other
health organizations did. Her diagnosis was ultimately incorrect,
which doesn't really affect the anonymization issue, but it makes any
identification/vilification even more disturbing.
We were information managers and health professionals, not lawyers,
and the international aspect no doubt complicates things. I assume
that the health organizations who did publicize this acted within the
law. For us, this wasn't enough. If it was reported in a health
journal 5 years later? That might be ok. But as a real-time report it
was clearly unethical. I doubt the other organizations published this
in malice - it was one piece of information among many - but it
highlights the problem.
On 9 September 2012 15:30, Joss Wright
<joss-liberationtech at pseudonymity.net> wrote:
On Sun, Sep 09, 2012 at 07:19:22PM +0000, Paul Bernal (LAW) wrote:
I wondered if anyone had an opinion on it - I don't have the technical
knowledge to be able to evaluate it properly. The basic conclusion
seems to be that re-identification of 'anonymised' data is not nearly
as easy as we had previously thought (from the work of Latanya
Sweeney, Paul Ohm etc). Are these conclusions valid?
My concern is that I can see this paper being used to justify all
kinds of potentially risky information being released - particularly
health data, which could get into the hands of insurance companies and
others who could use it to the detriment of individuals. On the other
hand, if the conclusions are really valid, then perhaps people like me
shouldn't be as concerned as we are.
I've gone over this paper quite quickly, partially because it's late
here and I should be asleep; apologies for any bizarre turns of phrase,
repetition (hesitation or deviation...), or bad-tempered
I'll also certainly defer to the hardcore reidentification experts if
they turn up.
(This email has become slightly longer than I intended. To sum up:
"Lots of problems. False assumptions. Cherry-picked examples. Ignores or
wholly misunderstands subsequent decade of research. Somewhat
misrepresents statistics. Wishful-thinking recommendations. Correct in
stating that we don't need to delete all data everywhere in order to
avoid reidentification, but that's about it.")
My initial response is that the paper is partially correct, in that the
Sweeney example was a dramatic, anecdotal demonstration of
reidentification and shouldn't be taken as representative of data in
general. On the other hand, the paper goes wildly off in the other
direction, and claims that the specifics of the Sweeney example somehow
demonstrate that reidentification in general is barely feasible and can
easily be handled with a few simple rules of thumb.
Overall, I would say that there are a number of serious flaws in the
arguments of the author.
Firstly, the paper is predicated almost entirely on what the author
refers to as `the myth of the perfect population register' -- that
almost no realistic database covers an entire population, and so any
apparently unique record could in fact also match someone outside of the
database. This is certainly true, but is used by the author to justify
an assumption that does not hold, in my opinion.
This assumption, the largest conceptual flaw in the paper, is that a
reidentification has to be unique and perfect to be of any value. The
author claims, based on the `perfect population register', that because
some reidentified record, relating to, say, the health information of an
individual, could potentially match someone who wasn't in the
database, there is no guarantee that the match is accurate, and
thus the reidentification is useless. This is not true -- even such
partial or probabilistic reidentifications reduce the set of
possibilities, and reveal information regarding an individual. This can
be used and combined with further data sources to achieve either
reidentification, if that is the goal, or simply the revelation of
sensitive personal information.
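A minimal sketch of that point, assuming an attacker who joins a released
health table against some partial public list on shared quasi-identifiers.
All names, fields and values below are invented for illustration; nothing
here comes from the paper or from Sweeney's data.

```python
from collections import defaultdict

# Hypothetical toy data: every name, field and value is invented.
voter_list = [
    {"name": "Alice", "zip": "11111", "dob": "1945-07-31", "sex": "F"},
    {"name": "Bob",   "zip": "11111", "dob": "1950-01-02", "sex": "M"},
    {"name": "Carl",  "zip": "11111", "dob": "1950-01-02", "sex": "M"},
    {"name": "Dana",  "zip": "22222", "dob": "1962-03-15", "sex": "F"},
]
health_records = [
    {"zip": "11111", "dob": "1945-07-31", "sex": "F", "diagnosis": "asthma"},
    {"zip": "11111", "dob": "1950-01-02", "sex": "M", "diagnosis": "diabetes"},
]

# Assumed quasi-identifiers shared by both tables.
QUASI_IDENTIFIERS = ("zip", "dob", "sex")


def key(record):
    """Project a record onto its quasi-identifier tuple."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(health, voters):
    """For each health record, list the names sharing its quasi-identifiers."""
    by_key = defaultdict(list)
    for voter in voters:
        by_key[key(voter)].append(voter["name"])
    return [(h["diagnosis"], by_key.get(key(h), [])) for h in health]


candidates = link(health_records, voter_list)
for diagnosis, names in candidates:
    print(diagnosis, names)
```

The second record does not resolve to one person, and the voter list is
incomplete, yet the attacker now holds a short candidate list to check
against other sources rather than a whole population.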
As an example: Sweeney used William Weld's unique characteristics in
the voter database to reidentify his anonymous health data. As some
hypothetical `Person X' who was not in the voter database could have
matched those apparently unique characteristics, the anonymous health
data could have belonged to Person X rather than William Weld. As the
author notes, this is overcome in the Sweeney case by making use of
public information to confirm that the data was that of William Weld --
the author seems to believe that any such auxiliary information for
other individuals could not reasonably exist, despite the existence of
Google and Facebook.
The author takes from this that any partial or probabilistic
reidentification is therefore worthless, and claims that it was only the
widely publicized `auxiliary information' about William Weld's health
status that made such reidentification possible.
What the author fails to address is that such auxiliary information is
exactly what is being made available with greater and greater frequency
by the release of poorly-anonymised databases. As such, whilst the
initial reidentification cannot be made
with perfect accuracy, subsequent pieces of auxiliary information can be
used to verify, research and identify an individual. (Of course, an
attacker may simply be seeking to gain a given piece of sensitive
information, so a true `reidentification' may not be a useful goal in
considering the risks of such databases.)
The author states in the abstract that `... most re-identification
attempts face a strong challenge in being able to create a complete and
accurate population register', and claims that this strong assumption
underlies most other reidentification work. (Using the entirely
objective phrase `somewhat furtive "insider" trade secret'.) In fact,
this strong assumption is entirely too strong, and is given as an
assumption only by the author themselves. I would point to the seminal
Shmatikov and Narayanan work on the Netflix Prize for a deeper analysis
that shatters exactly this kind of assumption. This claim by the author
is somewhat of a strawman argument, and one on which the entire paper is
A second flaw is that the paper switches several times, according to the
argument needed, on whether the attacker is interested in identifying
a targeted individual (`We need William Weld's data'), or whether any
individual will do (`We need someone's data, but don't care who it
is.'). These raise very different problems, and different sets of
statistics, and need to be clearly separated in analysis.
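To illustrate why the two attacker goals give different statistics, here is
a rough sketch over an invented table of (zip, birth year, sex) tuples. The
function names and data are hypothetical, and the table is treated as the
whole population only for the sake of the example.

```python
from collections import Counter

# Hypothetical released records, reduced to (zip, birth year, sex) tuples.
released = [
    ("11111", 1945, "F"),
    ("11111", 1950, "M"),
    ("11111", 1950, "M"),
    ("22222", 1962, "F"),
    ("22222", 1962, "F"),
    ("22222", 1971, "M"),
]


def fraction_unique(records):
    """'Anyone will do': share of records whose quasi-identifiers are unique."""
    sizes = Counter(records)
    return sum(1 for r in records if sizes[r] == 1) / len(records)


def target_class_size(records, target):
    """'We need this person': how many records the target hides among."""
    return Counter(records)[target]


print(fraction_unique(released))
print(target_class_size(released, ("11111", 1950, "M")))
```

A third of these records are unique, so an attacker content with any victim
has easy pickings, while the particular target chosen here still shares a
class of two. Quoting one figure to answer a question about the other is
the conflation described above.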
A third flaw, related to the first and epitomised by the section
starting with the final paragraph of page 6, is that an attacker would
need to somehow build their perfect database before reidentifying an
individual. The author states that the attacker would have to check all
other individuals outside of the original database to complete the
reidentification. In fact, they could simply seek alternative forms of
auxiliary information to make their reidentification more and more
certain. I do find it bizarre that the author makes this claim, as the
more intelligent approach of using auxiliary information is precisely
that employed by Sweeney in the case of William Weld.
The author does address the problem of probabilistic reidentification at
the latter stages of the paper (top of page 9) but dismisses it
entirely, and unreasonably, out of hand. I could write a whole essay on
this particular argument, but I'll simply note that with a 35% chance
of error, you simply have a very good starting point to find extra
auxiliary information to reduce your error to whatever you decide is
acceptable. (This should not be ignored, however, as the author's
insistence that reidentification must be 100% certain is probably the
deepest flaw here.)
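As a hedged illustration of how quickly a 35% error can shrink, the sketch
below applies Bayes' rule in odds form. It assumes the 35% error figure
means a 0.65 prior that the match is right, that each extra piece of
auxiliary information is independent, and that each is five times more
likely to be seen if the match is right than if it is wrong. Those
likelihood ratios are invented, not taken from the paper.

```python
def update(prior, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)


def combine(prior, likelihood_ratios):
    """Apply several pieces of evidence in turn (assumes they are independent)."""
    belief = prior
    for ratio in likelihood_ratios:
        belief = update(belief, ratio)
    return belief


# 0.65 is the assumed starting confidence; the two ratios of 5.0 are hypothetical.
posterior = combine(0.65, [5.0, 5.0])
print(round(posterior, 3))
```

Under those assumptions two modest extra facts take the attacker from 65%
to roughly 98% confidence. Real evidence is rarely independent, so treat
the numbers as indicative only.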
A more worrying problem comes in the surprising lack of coverage of any
of the subsequent, and equally highly publicized, reidentification
attacks, or any of the developments in anonymisation since k-anonymity.
Even if we brush aside the vast amount of work on differential privacy,
which is extremely popular in anonymity research today, the author has
not addressed concepts such as l-diversity or t-closeness, which would
seem necessary for a reasonable study.
(As a quick example, consider this application of an l-diversity
problem: We cannot identify William Weld uniquely in the health
database, but we can isolate him as one of four people. All of those
four have been prescribed antidepressants in the last six months, and
three are being treated for an STD. No perfect reidentification, but
certainly a sensitive data leak for the poor governor.)
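The same example as a small sketch: given the group of four records the
target cannot be separated from, count how many distinct values each
sensitive attribute takes (distinct l-diversity) and how confident an
attacker can be about each value. The field names are hypothetical and the
data simply encodes the four-person example above.

```python
# Hypothetical candidate group: four indistinguishable records.
group = [
    {"antidepressants": True, "std_treatment": True},
    {"antidepressants": True, "std_treatment": True},
    {"antidepressants": True, "std_treatment": True},
    {"antidepressants": True, "std_treatment": False},
]


def distinct_l(records, attribute):
    """Distinct l-diversity: number of different values the attribute takes."""
    return len({r[attribute] for r in records})


def value_share(records, attribute, value):
    """Fraction of the group holding a value, i.e. the attacker's confidence."""
    return sum(1 for r in records if r[attribute] == value) / len(records)


for attribute in ("antidepressants", "std_treatment"):
    print(attribute, distinct_l(group, attribute),
          value_share(group, attribute, True))
```

The group is 4-anonymous, but the antidepressant attribute has l = 1, so it
leaks with certainty, and the STD attribute leaks with 75% confidence.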
The total lack of coverage of, for example, Shmatikov and Narayanan's
reidentification of the Netflix Prize dataset, and the (wonderful)
analysis and methodology used there show a worrying lack of familiarity
with the state of the art, and certainly call into question the
conclusions drawn from the author's analysis.
I do find the total focus on the Sweeney example, and the picking apart
of the details, a very concerning example of the kind of thinking that
often surrounds anonymisation: that by fixing the specific problem that
you identify with a specific example, you can fix the wider problem.
This is a `patching up the holes' approach, rather than an attempt to
systemically fix a problem; this has rarely been shown to be an effective
strategy, particularly in computer security. ("This was caused by a
combination of gender, birthdate and zip code? Quick, make those
sensitive pieces of data!")
The recommendations at the end of the paper are simply unrealistic.
Point by point:
1) Make it illegal to reidentify data -- this approach has been
criticised at length in the literature, as the author acknowledges and
dismisses, but I would focus particularly here on how difficult it is to
detect reidentification attempts. This will stop only the most ethical
2) Require anyone linking in new data to maintain anonymity --
recognizes the problem of auxiliary information, but somehow ignores it
at the same time.
3) Give data `anonymous' status, but allow that status to be withdrawn
-- I assume that all the copies of the dataset will automatically
self-destruct once this status is withdrawn.
4) Specify that recipients must comply with restrictions -- if you can
state this then you have already solved most of the world's problems.
More seriously, this (and other recommendations here) seems to conflate
anonymisation that is shared with trusted researchers, which /is/ less
of a problem, with anonymisation that is released to the public. If you
are restricting access, there are a lot of extra approaches that you can
This is extremely important to understand, as the public release of data
continually combines to provide more and more auxiliary data. This is
why it is critically important that data for public release is
anonymised, as there is no realistic way to pull that data back once it
is in the public domain. All information is auxiliary information for
the next attack.
5) Require that data holders are secure -- again, this is a fine wish,
but gives nothing practical.
6) Data use agreements that pass on to further recipients -- trust is
not transitive, and this holds most of the same wishful thinking
problems as the other recommendations here.
All of these recommendations are based on an assumption of trust, good
faith and playing by the rules. In short, entirely the opposite of
conventional security-based thinking. While we shouldn't throw away
everything to meet some puritanical ideal of security, we shouldn't
ignore an entire field of study because we don't like their conclusions.
I don't entirely dismiss the need for a regulatory approach to this. In
fact, several of these recommendations are reasonable if combined with
other, stronger, guarantees. There should be penalties for misuse of
data, or poor anonymisation, but they should be backed up at the
technical level by effective techniques that can safeguard information.
More importantly, none of these recommendations provide any kind of
practical or constructive approach to best practice for anonymising
data, or how to weigh up the risk or effects of data release. This seems
to follow the overall tone of the paper that these risks are not a
The final conclusions of the paper are that the Sweeney example was not
representative, and I agree; I also wholly disagree with almost all of
the analysis and conclusions of the paper. From the choice use of
language regarding, particularly, the `somewhat furtive "insider" trade
secret', the author clearly believes that researchers into
reidentification are massively and knowingly overplaying the chances of
reidentification. I resent that.
The one point on which I do agree is that there needs to be a balance
between the benefits of access to large-scale databases, and the risks
of reidentification. Where that point of balance should be is, I think,
something on which I would strongly disagree with the author; although
perhaps not as much as one might think.
I do fully appreciate that the author comes from the perspective of
wanting to use data for the greater good, and that some claims of the
risks of database release are overly cautious. This paper, though,
massively overstates the difficulties, and massively understates the
We should have a better understanding of the actual risks of
reidentification, and weigh this against the benefits from access to
aggregate personal data. The way to do this, however, is in a
broad-based study of the real-world risks, research into the means for
reidentification and anonymisation, and a systemic approach to the
protection of personal data; not by hand-waving away the risks by
picking apart one unrepresentative example and ignoring the subsequent
decade of active research into the area.
Happy to answer any other questions, on- or off-list.
Unsubscribe, change to digest, or change password at: https://mailman.stanford.edu/mailman/listinfo/liberationtech