One of the few positive things over the recent giant AOL data spill (which we have asked the FTC to look into) is it has hopefully taught a few lessons about just how hard it is to truly anonymize data. With luck, the lesson will be “don’t be fooled into thinking you can do it” and not “Just avoid what AOL did.”
There is some Irony that in general, AOL is one of the better performers. They don’t keep a permanent log of searches tied to userid, though it is tied, reports say, to a virtual ID. (I have seen other reports to suggest even this is erased after a while.) AOL also lets you turn off short term logging of the association with your real ID. Google, MSN, Yahoo and others keep the data effectively forever.
Everybody has pointed out that for many people, just the search queries themselves can be enough to identify a person, because people search for things that relate to them. But many people’s searches will not be trackable back to them.
However, the AOL records maintain the exact time of the search, to the second or perhaps more accurately. They also maintain the site the user clicked on after doing the search. AOL may have wiped logs, but most sites don’t. Let’s say you go through the AOL logs and discover an AOL user searched and clicked on your site. You can go into your own logs and find that search, both from the timestamp, and the fact the “referer” field will identify that the user came via an AOL search for those specific terms.
Now you can learn the IP address of the user, and their cookies or even account with your site, if your site has accounts.
If you’re a lawyer, however, doing a case where you can subpoena information, you could use that tool to identify almost any user in the AOL database who did a modest volume of searches. And the big sites with accounts could probably identify all their users who are in the database, getting their account id (and thus often name and email and the works.)
So even if AOL can’t uncover who many of these users are due to an erasure policy, the truth is that’s not enough. Even removing the site does not stop the big sites from tracking their own users, because their own logs have the timestamped searches. And an investigator could look for a query, do the query, see what sites you would likely click on, and search the logs of those sites. They would still find you. Even without the timestamp this is possible for an uncommon query. And uncommon queries are surprisingly common. :-)
I have a static IP address, so my IP address links directly to me. Broadband users who have dynamic IP addresses may be fooled — if you have a network gateway box or leave your sole computer on, your address may stay stable for months at a time — it’s almost as close a tie as a static IP.
The point here is that once the data are collected, making them anonymous is very, very hard. Harder than you think, even when you take into account this rule about how hard it is.