Google is currently fighting a subpoena from the DoJ for its search logs. The DoJ experts in the COPA online porn case want to mine Google’s logs, not for anybody’s data in particular, but because they are such a great repository of statistics on internet activity. Google is fighting hard, as it should. Apparently several Google competitors caved in.
These logs are a treasure trove of information, just as the DoJ experts say they are. No wonder they want them. They are particularly valuable to Google, of course, so much so that they have resisted all calls to wipe them or anonymize them. In fact, Google has built a fancy system with its own custom computer language to do massively parallel computing to let it gather statistics from this giant pool of data.
The DoJ and the companies that didn’t fight the order insist there is no personally identifiable information in these logs, but that’s certainly not true of the source logs. Even if you remove the Google account cookie that is now sent with most people’s queries, the IP address is recorded. I have a static IP address myself on my DSL. It’s always the same, and so it would be easy to extract all my searches, which include some pretty confidential stuff, things like me entering the names of medicines I have been prescribed. (It even includes me searching for “Kiddie Porn” because I wanted to see if any AdWords ads would be presented on such a search. There were not, in case you are wondering.) Yahoo and MSN state that the IP address and other information were stripped from what they handed over.
Static IPs are the norm for corporations and more savvy internet users, but while most DSL and cable users have a dynamic IP, it isn’t really very dynamic. If you have a home gateway box or computer that is on all the time, it changes very infrequently, in some cases, never. All your activity can be linked back to you through that address. Only dial-up users can expect any anonymity from their dynamic IP, and even then ISPs keep logs for some period of time which connect dynamic IPs and accounts.
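To illustrate how little work it takes to pull one person's history out of a log keyed by IP address, here is a minimal Python sketch. The log format (tab-separated timestamp, IP and query), the sample entries and the function name `searches_by_ip` are all invented for illustration; nothing here reflects how Google or any other search engine actually stores its logs.

```python
from collections import defaultdict

# Hypothetical log lines: "timestamp<TAB>ip<TAB>query".
# The addresses come from ranges reserved for documentation.
SAMPLE_LOG = [
    "2006-01-19T10:02:11\t203.0.113.7\tname of a prescribed medicine",
    "2006-01-19T10:05:40\t198.51.100.23\tcheap flights",
    "2006-01-20T21:14:03\t203.0.113.7\tdivorce lawyer near me",
]

def searches_by_ip(log_lines):
    """Group (timestamp, query) pairs by the IP address that sent them."""
    by_ip = defaultdict(list)
    for line in log_lines:
        # Split on the first two tabs only, so a query containing a tab
        # stays in one piece.
        timestamp, ip, query = line.split("\t", 2)
        by_ip[ip].append((timestamp, query))
    return by_ip

history = searches_by_ip(SAMPLE_LOG)
for timestamp, query in history["203.0.113.7"]:
    print(timestamp, query)
```

With a static or rarely changing address, a single dictionary lookup returns everything that household searched for, which is why stripping only the account cookie does not make the log anonymous.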
But there is something far more frightening about this collection of data. I hope Google wins its fight over this data, because the DoJ really has no business forcing a private company to help them with their statistics problems.
But what about when a subpoena comes about an individual? Imagine you are under investigation for something, or just in a frivolous lawsuit or even a messy divorce. You can bet lawyers are going to want to say, for those with mostly-static IPs, “I want the search records for this IP, or this cookie.” And it’s going to be a lot harder for search engines to turn down those requests, because they will be specific and will relate to the data the search companies are holding on all of us.
One way to hold the lawyers back will be to make it expensive. But how long will it remain expensive? After a few requests, the software to pull the records will exist, and it will not be possible to claim it’s more expensive than the data mining Google already does for itself, to improve its own business.
Now, before it seems like I am ragging on Google here, let’s not forget that Google’s competition (AOL, Yahoo and MSN) did not even fight this first salvo. Yahoo has a whole department to comply with legal requests for its records, and famously handed over the ID of a journalist who sent an E-mail that has landed him in a Chinese jail. When it comes to intent, Google has indeed been the “do the least evil” company here.
But with court orders, intent matters not. This pool of data is an “attractive nuisance.” In the end, I think Google will realize it has to start anonymizing this data to the point that it can respond to requests with “we don’t have that information.” Doing so will erase information that can be valuable to Google’s business. It will come at a cost to them. Worse, the cost can’t be predicted because they will lose the ability to learn new things they haven’t even realized they want to learn about how people use their tools. But in the end, it’s the only choice, both to keep their subpoena costs down, and to make users comfortable with searching.
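As a rough idea of what such anonymizing could look like, the Python sketch below drops the account cookie, zeroes the low bits of the IP address and keeps only the date of each query. The record fields (`timestamp`, `ip`, `cookie`, `query`) and the function name `anonymize_record` are assumptions made for this example, not anything Google is known to use, and truncation of this kind is only one possible approach. Query text can still identify a person on its own.

```python
import ipaddress

def anonymize_record(record, keep_bits=24):
    """Return a copy of a hypothetical log record with identifying detail removed.

    - the account cookie is dropped entirely
    - the IP address is truncated to its first `keep_bits` bits
    - the timestamp is coarsened to the calendar date
    """
    # strict=False lets ip_network accept a host address and mask it down.
    network = ipaddress.ip_network(f"{record['ip']}/{keep_bits}", strict=False)
    return {
        "date": record["timestamp"][:10],  # "YYYY-MM-DD" from an ISO timestamp
        "ip_prefix": str(network.network_address),
        "query": record["query"],
    }

raw = {
    "timestamp": "2006-01-19T10:02:11",
    "ip": "203.0.113.77",
    "cookie": "abc123",
    "query": "name of a prescribed medicine",
}
print(anonymize_record(raw))
# {'date': '2006-01-19', 'ip_prefix': '203.0.113.0', 'query': 'name of a prescribed medicine'}
```

The point of discarding the detail, instead of encrypting or hashing it with a key the company keeps, is that the original address and cookie can no longer be produced under any order. The price is the one described above: any analysis that needed the full address or the cookie is no longer possible.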
Perhaps these logs were handed over without IPs or user names. But what if somebody browses them and sees queries on things like kiddie porn or White House security or how to build a nuclear bomb? Could that be sufficient cause for a further order to get the identifying information associated with that query?
In the meantime, if you feel motivated to foolishly search for things that could be misinterpreted, as I did, may I recommend you do so through Tor, the anonymizing proxy. (The EFF provided significant financial support to the development of Tor.) Tor bounces your web requests through a series of randomly chosen servers, all encrypted, so nobody can trace your requests back to you. Be sure not to log in when using it, though!