Data Deposit Box instead of data portability

Submitted by brad on Mon, 2008-05-05 21:08

I've been ranting of late about the dangers inherent in "Data Portability" which I would like to rename as BEPSI to avoid the motherhood word "portability" for something that really has a strong dark side as well as its light side.

But it's also important to come up with an alternative. I think the best alternative may lie in what I would call a "data deposit box" (formerly "data hosting.") It's a layered system, with a data layer and an application layer on top. Instead of copying the data to the applications, bring the applications to the data.

A data deposit box approach has your personal data stored on a server chosen by you. That server's duty is not to exploit your data, but rather to protect it. That's what you're paying for. Legally, you "own" it, either directly, or in the same sense as you have legal rights when renting an apartment -- or a safety deposit box.

Your data box's job is to perform actions on your data. Rather than giving copies of your data out to a thousand companies (the Facebook and Data Portability approach) you host the data and perform actions on it, programmed by those companies who are developing useful social applications.

As such, you don't join a site like Facebook or LinkedIn. Rather, companies like those build applications and application containers which can run on your data. They don't get the data, rather they write code that works with the data and runs in a protected sandbox on your data host -- and then displays the results directly to you.

To take a simple example, imagine a social application wishes to send a message to all your friends who live within 100 miles of you. Using permission tokens provided by you, it is able to connect to your data host and ask it to create that subset of your friend network, and then e-mail a message to that subset. It never sees the friend network at all. Let's say it wants to display that subset of friends to you. It puts in the query, and then directs your web browser to embed in the page a frame fetched from the data host keyed to that operation. The data host talks directly to your browser, the application company again never sees the data. (Your data host knows you and your browser via a cookie or similar technology.) This would also apply to custom applications running on your home PC -- they could also ask the data host to perform actions.
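As a rough sketch of that "friends within 100 miles" example, assuming a much simpler host than a real one would be: the class name `DataHost`, its method names, the opaque "handle" for a subset, and the in-memory `outbox` standing in for real e-mail are all hypothetical, invented here for illustration. The point is only that the application holds a token and a handle, and never the friend list itself.

```python
import math
import secrets

EARTH_RADIUS_MILES = 3958.8


def miles_between(a, b):
    """Great-circle distance in miles between two (lat, lon) pairs (haversine)."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


class DataHost:
    """Hypothetical data host: holds the friend network, hands out only handles."""

    def __init__(self, home, friends):
        self._home = home            # owner's (lat, lon)
        self._friends = friends      # name -> (email, (lat, lon))
        self._tokens = set()         # permission tokens issued by the owner
        self._subsets = {}           # opaque handle -> list of friend names
        self.outbox = []             # stand-in for actually sending mail

    def issue_token(self):
        token = secrets.token_hex(16)
        self._tokens.add(token)
        return token

    def _check(self, token):
        if token not in self._tokens:
            raise PermissionError("unknown or revoked token")

    def friends_within(self, token, miles):
        """Build the subset on the host; the caller only gets an opaque handle."""
        self._check(token)
        names = [name for name, (_email, loc) in self._friends.items()
                 if miles_between(self._home, loc) <= miles]
        handle = secrets.token_hex(8)
        self._subsets[handle] = names
        return handle

    def email_subset(self, token, handle, message):
        """Send the message to the subset without revealing who is in it."""
        self._check(token)
        for name in self._subsets[handle]:
            email, _loc = self._friends[name]
            self.outbox.append((email, message))
```

An application would call `friends_within(token, 100)` and then `email_subset(token, handle, "...")`. Neither call returns names or addresses, which is the property the example above is after.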

The problem with a first level (API) approach to this is that one could never design a remote procedure call style API able to do everything that application developers can think of. The API would always lag behind the innovation, and different data hosting companies would support different generations of the API.

So I believe we may want to consider a less pure approach. In this approach, the data host runs a protected virtual machine environment, similar in nature to the Java virtual machine. This environment has complete access to the data, and can do anything with it that you want to authorize. The developers provide little applets which run on your data host and provide the functionality. Inside the virtual machine is a capability-based security environment which precisely controls what the applets can see and what they can do with the data. In addition, data on your host is stored encrypted so that even the data host can't access it. Rather, when you enable an application to run on your data, you give it access tokens (capability handles), which are decryption keys. When it wants to work on your data, it sends those decryption keys along with its applet, and only then is it physically able to access the data. The system should be devised so that access can be readily revoked for any party.
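Here is a minimal sketch of the grant-and-revoke part of that idea. It leaves out the encryption entirely (the handles here are just lookup keys, not decryption keys), and plain Python offers no real sandbox, so treat it as an illustration of the control flow only. `CapabilityHost`, `grant`, `revoke` and `run` are hypothetical names.

```python
import secrets


class CapabilityHost:
    """Hypothetical host: applets see only the fields their handle allows."""

    def __init__(self, records):
        self._records = dict(records)   # the owner's data, field -> value
        self._grants = {}               # capability handle -> allowed fields

    def grant(self, fields):
        """Owner authorizes access to some fields; returns an unguessable handle."""
        handle = secrets.token_urlsafe(16)
        self._grants[handle] = frozenset(fields)
        return handle

    def revoke(self, handle):
        """Owner withdraws access; later runs with this handle will fail."""
        self._grants.pop(handle, None)

    def run(self, applet, handle):
        """Run the applet against a filtered copy of the data."""
        allowed = self._grants.get(handle)
        if allowed is None:
            raise PermissionError("capability missing or revoked")
        view = {k: v for k, v in self._records.items() if k in allowed}
        return applet(view)
```

In a fuller design the handle would be the key that decrypts the stored fields, so that revocation could be done by re-encrypting rather than by trusting the host's lookup table.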

Now, even with this level of security, I think it would still be possible for a malicious applet developer to break the rules, and find ways to grab your data and copy it somewhere beyond your control. However, we've changed the rules of the game greatly. Currently, everybody is copying your data (and your friends' data) just as a matter of course. That's the default. They would have to work very hard not to keep a copy. In the data hosting model, they would have to work extra hard, and maliciously, and in violation of contract, to make a copy of your data. Changing copying from an implicit act to an overt one can make all the difference.

You could have more than one data host, and even use different hosts for different personas. You could also personally run the data host if you have always-on facilities or don't need to roam to other computers. Or you might distribute your data hosting, so a computer in your own house is the data host when you are online, but a mirrored computer you rent space on elsewhere is the data host when you are roaming.

Social Graph Apps

Your database would store your own personal data, and the data your connections have decided to reveal to you. In addition, you would subscribe to a feed of changes from all friends on their data. This allows applications that just run on your immediate social network to run entirely in the data hosting server.

This approach still presents a problem, however, for social graph applications that go beyond dealing with friends to friends-of-friends (FoF) and friends-of-friends-of-friends (FoFoF). Fortunately, it has turned out that even FoF has spawned very few useful apps and FoFoF has spawned perhaps one or two. It may be the case that the added security of data hosting outweighs the cost of making social graph apps harder.

Applications can traverse social graphs in one of two ways. In the first, each user can generate an identity for an application or group of applications, and pass this to their connections. Each connection can get an ID made by combining these two identity codes. In that way, a connection can only be understood by applications that are trusted by both ends of the connection. Connections can be made public or semi-public, but applications can only understand the links between parties that have approved the application. The second way is for those who wish to use such applications to accept they must make their graph more available to those apps. The people providing social graph search would get access to large groups of data, but because this is the exception, and not the rule, they could be held to higher standards of data protection when they do this. This provides functionality like we have today but with better oversight and more isolation of the apps that truly need the data.
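One way the first scheme might look, purely as a sketch: each user derives a per-application identity code from a private secret, and the ID for a connection is a hash of the two codes. The use of HMAC-SHA256 and SHA-256, and the function names `app_identity` and `connection_id`, are my assumptions for illustration, not a worked-out protocol.

```python
import hashlib
import hmac


def app_identity(user_secret: bytes, app_name: str) -> str:
    """Per-application identity code; a different app gets an unrelated code."""
    return hmac.new(user_secret, app_name.encode(), hashlib.sha256).hexdigest()


def connection_id(code_a: str, code_b: str) -> str:
    """ID for the link between two users, the same whichever end computes it."""
    first, second = sorted((code_a, code_b))
    return hashlib.sha256(f"{first}:{second}".encode()).hexdigest()
```

An application that has been given both users' codes can recompute the connection ID and recognize the link. An application approved by only one end has only one code and can't.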

Trusted 3rd party "external data" apps

This architecture does make it more expensive to produce apps that act as a trusted 3rd party, allowed to know things you would not actually disclose to your contact. A typical example would be a "crush" app, where you can declare crushes which are only revealed when mutual. (There are ways to do this particular app without a 3rd party, but I am not sure that applies to the general problem.)

So it probably will come to pass that some apps will have a legitimate need to combine data over a large network of users. The goal is to make such applications the exception, rather than the rule. It is not to make it impossible to build apps that users want.

However, sites that want to keep the data can get extra scrutiny, and promise to remove it on demand.

Much simpler layers

For those who think this layered approach would be too difficult, consider that in a way, it already is used. Today, most personal data apps keep all the data in an SQL database, and the applications involve doing queries on the database, processing the results, and displaying them. The results are rarely remembered.

The layer approach simply involves putting the database under different ownership and rules from the application, and getting stricter about the not remembering. While there are applications that need to remember things, this would again become the exception, rather than the norm.

Building this infrastructure

Something must pay for this data hosting, and it generally needs to be on quality servers with good bandwidth, security and redundancy.

It would be ideal if users paid for it themselves, perhaps as part of the services they buy from their ISP, the same way they usually get web hosting. However, user-pay business models are rare on the internet today outside the ISP itself.

As such, it is expected that application providers will want a way to pay for data hosting for their users. They will want to offer it (seamlessly) to users who do not have data hosting that will serve this particular application. In the extreme case, we could end up with applications offering free hosting only for their own use, which is effectively how things are today, but even the concept of a firewall between data and application could have value.

While micropayments are not usually very useful, because of the human cost of thinking about small payments, they can make sense for corporate settlements. Data hosts not being paid by the user might accept requests that come with small micropayment tokens if this could be standardized; a basic token good enough for the CPU and bandwidth of a typical request. If this is done to keep the prices at market levels, there is no reason companies should not feel happy to pay for outsourced hosting.

There is of course a danger that data hosts would appear which offer hosting for free in exchange for the right to exploit the data. Clearly this is what we're trying to avoid, but it still could be better to have just one such company holding your data, so you can keep watch on what it does and actually try to understand its contract, than to have 100 companies do it, with no time to consider the contracts.

It actually makes sense for users to join together on shared hosts for a variety of reasons, one of which may be shared negotiation. When a new cool external data application appears, and it wants users to authorize it for everything, large data hosts can negotiate how much data the application actually needs.

Updates

Local hosting on your own PC

One interesting possible architecture is local hosting on your own PC, with sync to an external data host for roaming and exchanging updates with contacts.

In this case, you go to the social networking site, which embeds an application. The application is an iframe sourced from yourname.datahostdns.com:port or similar. When at home, this resolves to localhost (127.0.0.1) meaning your own machine, where a data hosting server is running. Your own machine's data host looks at the request, which may trigger it to connect to the application's server for new code or data, but in many cases the code will be cached locally. (The URL would include the version number of the app it specifies to save even that lookup.)

Then your own machine performs the operation and feeds back the resulting HTML to the embedded frame in your browser -- all very fast and all local.
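The caching step can be sketched like this, assuming a URL layout I've made up for the purpose: the port 8080, the `/run` path and the `app` and `version` query parameters are all hypothetical, as are `LocalDataHost` and the `fetch_code` callback that stands in for contacting the application's server.

```python
from urllib.parse import parse_qs, urlsplit


class LocalDataHost:
    """Hypothetical local host that caches applet code by (app, version)."""

    def __init__(self, fetch_code):
        self._fetch_code = fetch_code   # callable(app, version) -> code
        self._cache = {}
        self.fetches = 0                # how often we had to go to the network

    def code_for(self, url):
        """Return the applet code named in the iframe URL, fetching only on a miss."""
        query = parse_qs(urlsplit(url).query)
        key = (query["app"][0], query["version"][0])
        if key not in self._cache:
            self._cache[key] = self._fetch_code(*key)
            self.fetches += 1
        return self._cache[key]
```

Because the version is in the URL, a cached copy never has to be revalidated: a new release of the app simply shows up as a new cache key.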

Cloud data hosting is now much simpler and cheaper. It is used as a cloud cache. It handles feed updates with your contacts, syncs data with your personal workstations, or acts as your server when you are roaming to untrusted machines. It performs any other operations that require an always-on server, since your PC is not that. (Google's Browser Sync is a proto-example of this model.) The cloud-based host can also provide data hosting for users who can't or won't install a data hosting app on their PC, which may include most corporate users at the start. This is vastly cheaper to operate, and thus it's easier to finance this infrastructure.

This approach requires that you be able to really sandbox the social app code. It's interesting to consider, though, that one reason we don't trust random code on our machines is to protect our personal data, and so we shouldn't drive all our personal data into the hands of 3rd parties in the name of protecting it. We're not talking about running random code in your sandbox though, but only code from application developers you have decided to trust. In that sense it's not much different from running ActiveX controls or applets in your browser.

Another interesting approach would be to integrate data hosting into a home router (with USB so it can connect to a large flash drive.) As long as apps aren't super CPU intensive, this could provide an always-on server that's in your house and isolated from your PCs.

Open Social

Kevin Marks writes that plans for OpenSocial involve implementing many elements of this architecture. I hope that's true, and I have more to learn about OpenSocial, since my first-blush evaluation mostly saw a repurposing of the Google Gadget framework, but the plans for the future go further.