[liberationtech] need advice on using hashes for preserving PII's utility for disambiguation while protecting sensitive info
tom at ritter.vg
Fri Feb 7 02:42:18 PST 2014
In addition to what the others have said, I'll give a name to some of
The process of assigning an opaque random identifier to an easily
reversed string is 'Tokenization'. I don't work in payment processing
- but it's big there. Don't want to have a ton of PCI requirements?
Pay a tokenization service - send the credit card to them, they give
you an identifier (say a guid like
1C4E0B18-ABE6-4657-8B1B-79474EC80A94) and you store that. A horrible
way of doing this is choosing a 'secret' salt and making every token
MD5(salt || identifier). A safe way is a database of GUID-Identifier
mappings. (But secure the database!) It's a pretty safe situation and
used all over the payment industry BUT like others have said, requires
some meta-authority between government organizations to do this, which
probably makes it a non-starter.
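To make the GUID-Identifier mapping concrete, here is a minimal sketch of a token vault. The class and method names are hypothetical, and the in-memory dicts stand in for what would have to be a properly secured database. The point is only that the token is random, so it carries no information about the identifier, while the same identifier always gets the same token back.

```python
import uuid


class TokenVault:
    """Hypothetical in-memory token vault (a real one is a secured database)."""

    def __init__(self):
        self._token_by_identifier = {}
        self._identifier_by_token = {}

    def tokenize(self, identifier: str) -> str:
        # Reuse the existing token so two records with the same identifier
        # can still be matched against each other.
        token = self._token_by_identifier.get(identifier)
        if token is None:
            # uuid4 is random: unlike MD5(salt || identifier), nothing about
            # the identifier can be recovered from the token itself.
            token = str(uuid.uuid4()).upper()
            self._token_by_identifier[identifier] = token
            self._identifier_by_token[token] = identifier
        return token

    def detokenize(self, token: str) -> str:
        # Only whoever holds the vault can go from token back to identifier.
        return self._identifier_by_token[token]


vault = TokenVault()
first = vault.tokenize("4111-1111-1111-1111")
second = vault.tokenize("4111-1111-1111-1111")
other = vault.tokenize("5500-0000-0000-0004")
```

Here `first` and `second` are the same token and `other` is a different one, which is exactly the disambiguation property being asked for. All of the security rests on who can read the mapping.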
Similar, but different, is Format Preserving Encryption (FPE). FPE is
used in the situation where you have a database full of credit card
numbers of the form WWWW-XXXX-YYYY-ZZZZ (or SSNs) and you realize
"Holy crap, storing these in plaintext is a horrible idea!" - BUT you
have so much software that expects credit card numbers to be in a
specific form and you can't rewrite it all to handle
V1dXVy1YWFhYLVlZWVktWlpaWgo= or something like that. FPE converts the
credit card to NDKG-NDSH-LKAU-QNCB so your application sees it in a
correct format, but it's 'encrypted' (it can even have a correct Luhn
number.) The encryption is strong, as long as you keep the key
secret. But this _also_ requires some sort of meta-authority
government organization and thus is even less likely to work, because
as soon as you say 'crypto' everyone gets huffy about control of the
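To give a feel for how format preservation works, here is a toy sketch: a balanced Feistel network over the 16 digits, with HMAC-SHA256 as the round function. This is an assumption-laden illustration, not a standardized scheme such as NIST's FF1, and it has had no review, so treat it as a picture of the idea rather than something to deploy. All function names, the round count, and the key are made up for the example. It also outputs digits only and does not try to produce a valid Luhn check digit.

```python
import hashlib
import hmac

HALF = 10 ** 8   # each Feistel half is 8 decimal digits
ROUNDS = 10      # arbitrary choice for this toy


def _round_value(key: bytes, round_no: int, half: int) -> int:
    # Keyed pseudo-random function of one half, reduced to 8 digits.
    msg = bytes([round_no]) + f"{half:08d}".encode("ascii")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % HALF


def _split(card: str):
    digits = card.replace("-", "")
    if len(digits) != 16 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected 16 digits, e.g. 1234-5678-9012-3456")
    return int(digits[:8]), int(digits[8:])


def _join(left: int, right: int) -> str:
    digits = f"{left:08d}{right:08d}"
    return "-".join(digits[i:i + 4] for i in range(0, 16, 4))


def encrypt(key: bytes, card: str) -> str:
    left, right = _split(card)
    for r in range(ROUNDS):
        # Modular addition keeps every intermediate value an 8-digit number.
        left, right = right, (left + _round_value(key, r, right)) % HALF
    return _join(left, right)


def decrypt(key: bytes, card: str) -> str:
    left, right = _split(card)
    for r in reversed(range(ROUNDS)):
        # Undo one round: subtract what encryption added.
        left, right = (right - _round_value(key, r, left)) % HALF, left
    return _join(left, right)


key = b"hypothetical demo key, keep the real one secret"
ciphertext = encrypt(key, "4111-1111-1111-1111")
plaintext = decrypt(key, ciphertext)
```

The ciphertext still looks like WWWW-XXXX-YYYY-ZZZZ, so legacy software accepts it, and only the key holder can get the original number back.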
The idea of using a CRC, a function that does collide but is unlikely
to collide in the situation you care about, is interesting, and I have
to imagine there's a term for it and studies about it.
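A rough sketch of that idea, assuming made-up DLNs of the form D0000000 and an arbitrary choice of 8 bits: keep only a few bits of a CRC32, so that every published tag is shared by many possible inputs. The function name and the ID format are hypothetical.

```python
import zlib
from collections import defaultdict


def short_tag(identifier: str, bits: int = 8) -> int:
    # Keep only the low `bits` bits of the CRC, so collisions are deliberate.
    return zlib.crc32(identifier.encode("utf-8")) & ((1 << bits) - 1)


# Hypothetical universe of 100,000 driver's license numbers.
all_dlns = [f"D{n:07d}" for n in range(100_000)]

buckets = defaultdict(list)
for dln in all_dlns:
    buckets[short_tag(dln)].append(dln)

largest_bucket = max(len(members) for members in buckets.values())
```

With at most 256 possible tags, a tag on its own only narrows a record down to a bucket of several hundred candidate DLNs, yet two John Smiths with different DLNs will usually get different tags. The catch is that CRC is unkeyed, so anyone can compute these buckets, and combining the tag with other published fields may narrow things further than you'd like.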
The unfortunate situation is data anonymization is wildly
difficult. I realize that you're in the trenches and saying
"Actually, the problem we have is that the anonymization is too good".
The long term approach I have to problems like this is find a
professor researching it, and hook up with a grad student where you
get them to work on your problem. They come up with an idea, write a
paper, someone else breaks their paper and your work - but hey, so
long as the grad student did their homework, it's been broken in a new
and novel way. ;)
 Look up the Federal Bridge PKI for a great example
On 6 February 2014 15:49, Tom Lee <tlee at sunlightfoundation.com> wrote:
> We've been kicking around an idea at Sunlight that aims to use cryptographic
> ideas to resolve some of the concerns around the publication of publicly
> identifiable information in government disclosures. I could use some smart
> people to tell me what's dumb about it.
> We often face challenges related to disambiguating entities: is the John
> Smith who gave political donation A the same John Smith that gave political
> donation B? One obvious solution to this problem is to push to expand the
> information that's collected and disclosed -- if we had John's driver's
> license number (DLN), for instance, it'd be easy to disambiguate these
> records. But that could introduce privacy concerns for John. One approach to
> this problem (which I don't think government has tried) is employing a
> one-way hash.
> Obviously the input key space for DLNs and most other personal ID numbers is
> so small that reversing this with a dictionary attack would be trivial. You
> can add a salt, but only on a per-entity basis (not a per-record basis) if
> you want to preserve the capacity to disambiguate. That in turn calls for a
> lookup table in which the input keys are stored, which kind of defeats the
> point of using a hash (you might as well just assign random output IDs for
> each input ID). I would worry about government's ability to keep this lookup
> table secure, and I worry about the brittleness of such a system.
> Alternately, you can use a single system-wide secret (or set of secrets) to
> transform inputs into reliable outputs. I think this is less brittle and
> maybe easier to preserve as a secret, but this system might be too easily
> reversible given the ability to observe its outputs and know the universe of
> possible inputs. I'm unsure of the cryptographic options that might be
> appropriate here.
> For all I know, the lack of implementations using this kind of one-way
> transformation isn't about government sluggishness but rather about its
> feasibility. I'd be very curious to hear folks' ideas on this score, though.
> My general hunch is that something must be possible -- even a few bits'
> worth of disambiguating information would be hugely useful to us, and
> presumably you're not leaking important amounts of information by, say,
> sharing the last digit of a DLN. So there must be a spectrum of options. But
> as is probably apparent, I don't think I've got a handle on how to think
> about this problem rigorously.