Anonymizing is hard to do


One of the few positive things to come out of the recent giant AOL data spill (which we have asked the FTC to look into) is that it has hopefully taught a few lessons about just how hard it is to truly anonymize data. With luck, the lesson will be "don't be fooled into thinking you can do it" and not "just avoid what AOL did."

There is some irony in that, in general, AOL is one of the better performers. They don't keep a permanent log of searches tied to userid, though it is tied, reports say, to a virtual ID. (I have seen other reports suggesting even this is erased after a while.) AOL also lets you turn off short-term logging of the association with your real ID. Google, MSN, Yahoo and others keep the data effectively forever.

Everybody has pointed out that for many people, just the search queries themselves can be enough to identify a person, because people search for things that relate to them. But many people's searches will not be trackable back to them.

However, the AOL records maintain the exact time of the search, to the second or perhaps more precisely. They also maintain the site the user clicked on after doing the search. AOL may have wiped logs, but most sites don't. Let's say you go through the AOL logs and discover an AOL user searched and clicked on your site. You can go into your own logs and find that visit, both from the timestamp and from the fact that the "referer" field will identify that the user came via an AOL search for those specific terms.
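To make that join concrete, here is a minimal sketch in Python. It assumes the released records carry an anonymized ID, the query, a timestamp and the clicked site, and that your own log holds an IP address, a timestamp and a referer. The field names, the `query` parameter in the referer URL and the five-second tolerance are all assumptions made for illustration, not the real AOL or web server log formats.

```python
from datetime import timedelta
from urllib.parse import urlparse, parse_qs


def referer_query(referer, param="query"):
    """Pull the search terms out of a referer URL.

    The parameter name is an assumption; real search engines differ.
    """
    values = parse_qs(urlparse(referer).query).get(param)
    return values[0].strip().lower() if values else None


def match_visits(released, server_log, my_site, tolerance=timedelta(seconds=5)):
    """Pair anonymized IDs with IP addresses seen in our own log.

    released:   dicts with anon_id, query, time, click_site (hypothetical layout)
    server_log: dicts with ip, time, referer (hypothetical layout)
    """
    matches = []
    for rec in released:
        if rec["click_site"] != my_site:
            continue  # this click went to somebody else's site
        wanted = rec["query"].strip().lower()
        for hit in server_log:
            same_terms = referer_query(hit["referer"]) == wanted
            close_in_time = abs(hit["time"] - rec["time"]) <= tolerance
            if same_terms and close_in_time:
                matches.append((rec["anon_id"], hit["ip"]))
    return matches
```

Nothing here is clever. It is a plain join on search terms and time, which is the point: any site operator with ordinary logs could run it.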

Now you can learn the IP address of the user, and their cookies or even account with your site, if your site has accounts.

If you're a lawyer, however, working on a case where you can subpoena information, you could use that tool to identify almost any user in the AOL database who did a modest volume of searches. And the big sites with accounts could probably identify all their users who are in the database, getting their account id (and thus often name and email and the works).

So even if AOL can't uncover who many of these users are due to an erasure policy, the truth is that's not enough. Even removing the clicked site from the released data does not stop the big sites from tracking their own users, because their own logs have the timestamped searches. And an investigator could look for a query, do the query, see what sites you would likely click on, and search the logs of those sites. They would still find you. Even without the timestamp this is possible for an uncommon query. And uncommon queries are surprisingly common. :-)
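As a rough illustration of what "uncommon" means here, the sketch below picks out queries that only one anonymized ID ever issued. The input layout, pairs of (anonymized ID, query), is an assumption for the example rather than the actual release format.

```python
from collections import defaultdict


def unique_queries(records):
    """Map each query issued by exactly one anonymized ID to that ID.

    records: iterable of (anon_id, query) pairs (hypothetical layout).
    """
    users_by_query = defaultdict(set)
    for anon_id, query in records:
        users_by_query[query.strip().lower()].add(anon_id)
    # A query seen from a single ID points at one person even with no timestamp.
    return {q: next(iter(ids)) for q, ids in users_by_query.items() if len(ids) == 1}
```

Every query this returns is one where a matching referer in some site's log, at any time, narrows the field to a single searcher.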

I have a static IP address, so my IP address links directly to me. Broadband users who have dynamic IP addresses may be fooled -- if you have a network gateway box or leave your sole computer on, your address may stay stable for months at a time -- it's almost as close a tie as a static IP.

The point here is that once the data are collected, making them anonymous is very, very hard. Harder than you think, even when you take into account this rule about how hard it is.

Comments

In Europe, most dynamic IP addresses change every 24 hours or so.
However, the ISP knows who had what IP when. There is a debate
about how long they should/must remember this information. Also,
many dynamic-DNS services keep track of who has what IP when. (There
is a legitimate reason for this. My dynamic-DNS provider,
http://www.dynaccess.com/, uses the IP addresses to determine whether
email can be sent out through his SMTP server (rather than using,
say, SMTP authentication). If someone complains about spam, in the
interest of himself and his non-spamming customers he has to know
whom to take action against.)

The question is how much anonymization is required to provide an appropriate degree of anonymity.

There was a time when anonymizing the IP address of a packet was sufficient. Today, though, the content of many packets (e.g. Google searches) provides traceable information. If I do a search on "Bill Gates", does that have to be anonymized? How 'bout if I do a search on "Brad Templeton"? On my own name?

Then the question becomes: who gets to do the anonymization of the raw data? And how much anonymized data do they need to publish in order to validate their research results?

Only after we begin to define what anonymization requires can we begin to guess how easy or difficult it is to achieve. In the interim, all we can say for certain is that the raw data is not anonymous. Nor has it ever been.

Anonymizing is hard to do... Today I saw this:
http://www.ip-adress.com

How does this work? This is heavy.
