Old think on data storage for movies

A story from the New York Times suggests it costs over $12,000/year to store a movie in digital form.

This number is entirely bogus, and based on old thinking, namely the assumption of offline storage on DVDs and tapes. Offline media do degrade, and you must copy them before they have a chance to degrade, which takes people, though frankly it still should not be as expensive as this. For my calculations, I am going to assume a movie needs 100 GB of storage with low-loss lossy compression. You can scale the numbers up if you want to assume more; even at 1 TB it doesn’t change that much.

A film occupying 100 GB of storage can go on about 20 DVDs (or 11 dual-layer ones), costing about $8. It can go on 4 independent sets of 20 DVDs for $32 in media. Ideally you would rack these in a DVD jukebox, but if they are just sleeved, then once a year a person could pull out the DVDs and put them in a reader that tests them. Any that tested fine would be re-sleeved; any that did not would flag the other copies to be pulled, and the data would be copied to new media (probably better media, like Blu-ray). There are algorithms to distribute the data so that a large number of the disks must fail in that year to actually lose something. Of course, you use different vaults around the world. When the media approach the age where failure rates go up, you burn new copies even if the old ones still test fine.
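As a rough check on the disc arithmetic above, here is a minimal Python sketch. The nominal capacities (4.7 GB single-layer, 8.5 GB dual-layer) are assumptions not stated in the post, and the 40-cent price per disc is simply what "about 20 DVDs for about $8" implies. With those nominal capacities the counts come out slightly higher than the round numbers above (22 and 12 discs), which does not change the conclusion.

```python
import math

MOVIE_GB = 100           # the post's assumed movie size
SINGLE_LAYER_GB = 4.7    # assumption: nominal single-layer DVD capacity
DUAL_LAYER_GB = 8.5      # assumption: nominal dual-layer DVD capacity
COST_PER_DISC = 0.40     # implied by "about 20 DVDs ... about $8"
COPIES = 4               # independent sets, as in the post


def discs_needed(size_gb, disc_gb):
    """Whole discs required to hold size_gb of data."""
    return math.ceil(size_gb / disc_gb)


single_layer_discs = discs_needed(MOVIE_GB, SINGLE_LAYER_GB)
dual_layer_discs = discs_needed(MOVIE_GB, DUAL_LAYER_GB)

# Media cost for all four independent single-layer sets, in dollars.
media_cost = COPIES * single_layer_discs * COST_PER_DISC

print(single_layer_discs, dual_layer_discs, round(media_cost, 2))
```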

This takes human time, though not all that much: perhaps half an hour of actual human time swapping disks. Burning them takes much more real time, but you don’t do just one at a time.

However, even better is the new style of archival: online storage. Hard disks are 20 cents/gigabyte and continuing to fall. NAS boxes (small boxes that hold a collection of drives and allow access to them over Ethernet, with no computer needed) are more expensive now, but there is no reason they won’t drop to very reasonable prices, so that a NAS case adds perhaps 5 cents/gigabyte (i.e. $100 for a box of four 500 GB drives, which lasts for 10-15 years). They also cost about 2 cents/gigabyte/year for power if on all the time, and some small amount for space, though they would tend to sit in computer centers that already exist.

Those are today’s prices, which will just get cheaper, much cheaper, except for the power. If a drive (20 cents/gigabyte) lasts an average of 4 years before failing and a NAS case (5 cents/gigabyte) lasts 10 years, that is 5 cents plus 0.5 cents per gigabyte per year. Add 2 cents for power and this works out to 7.5 cents/gigabyte/year. Of course you will store your files redundantly, in 4 different places (which is actually overkill), and so it’s 30 cents/gigabyte/year.

Which is still just $30 a year for a 100 GB file, or $300 a year for a TB.
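The arithmetic behind those figures can be written out as a small Python sketch. Every constant comes from the estimates above; the function names are hypothetical, chosen only for illustration, and the small cost for space is ignored as it is in the text.

```python
DRIVE_CENTS_PER_GB = 20.0        # purchase price of disk
DRIVE_LIFE_YEARS = 4             # average drive lifetime
NAS_CENTS_PER_GB = 5.0           # share of the NAS case
NAS_LIFE_YEARS = 10              # NAS case lifetime
POWER_CENTS_PER_GB_YEAR = 2.0    # power if left on all the time
COPIES = 4                       # redundant copies in different places


def cents_per_gb_year(copies=COPIES):
    """Yearly cost in cents of keeping one gigabyte online, all copies included."""
    per_copy = (
        DRIVE_CENTS_PER_GB / DRIVE_LIFE_YEARS
        + NAS_CENTS_PER_GB / NAS_LIFE_YEARS
        + POWER_CENTS_PER_GB_YEAR
    )
    return per_copy * copies


def dollars_per_year(size_gb, copies=COPIES):
    """Yearly cost in dollars of keeping a file of size_gb online."""
    return size_gb * cents_per_gb_year(copies) / 100


print(cents_per_gb_year(copies=1))   # one copy
print(cents_per_gb_year())           # four copies
print(dollars_per_year(100))         # a 100 GB movie
print(dollars_per_year(1000))        # a 1 TB movie
```

Changing any one constant (a cheaper drive, fewer copies) shows how quickly the yearly figure moves.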

Online storage is live. You can check its integrity regularly, even continuously. You can either leave it off and spin it up every few days (to save power) or just leave it on all the time. If one, two or three of the 4 disks fail, computers can copy the data to fresh disks in the network, and you are still alive. Your disks should last 3 to 4 years, but many will last much longer. You need a computer system to control all this, but you only need one for the entire cloud of NAS boxes, or at most a few. Its cost is low.
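To make the check-and-repair idea concrete, here is a minimal sketch in Python. It assumes a SHA-256 digest of each file was recorded when the file was first stored, and it models the four copies as in-memory byte strings, with None standing in for a dead disk. The names sha256_hex and scrub are hypothetical; a real system would read from the NAS boxes and write to fresh drives.

```python
import hashlib


def sha256_hex(data):
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas, expected_digest):
    """Check every copy against the recorded digest and repair the bad ones.

    replicas is a list of byte strings, or None for a dead disk.
    Returns (repaired_replicas, indices_that_were_repaired).
    """
    good = next(
        (r for r in replicas
         if r is not None and sha256_hex(r) == expected_digest),
        None,
    )
    if good is None:
        raise RuntimeError("every copy is lost or corrupt")

    repaired, fixed = [], []
    for i, r in enumerate(replicas):
        if r is None or sha256_hex(r) != expected_digest:
            repaired.append(good)   # copy from a surviving good replica
            fixed.append(i)
        else:
            repaired.append(r)
    return repaired, fixed


# Three of the four copies are bad: two dead disks and one corrupted copy.
original = b"movie bytes"
digest = sha256_hex(original)
replicas = [original, None, b"movie bytez", None]
repaired, fixed = scrub(replicas, digest)
print(fixed)
```

As long as one copy still matches the recorded digest, the other three can be rebuilt from it.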

The real cost is people. But companies like Google have solved the problem of running large server farms. They tolerate single drive failures. The computers copy the data to new drives right away, and technicians go by every few days to pull the old drives and slot in fresh ones, which are there for the next need, not for the same file. This takes just a few minutes of the tech’s time, and there is no rush to their work. For each 100 GB file, you should expect to have a replacement about once every 4 years (i.e. the lifetime of an average drive).

Now all this is at today’s price of $100 for a 500 GB drive. But that’s dropping fast, faster than Moore’s law. The replacements will be 1 TB and 2 TB drives before long, and the cost will continue to fall. And this is with 4 copies of every file. You can actually get by with less using modern data distribution algorithms, which can scatter a 100 GB file into 200 pieces of 1 GB each, of which roughly half must be lost before the whole file is lost. Several data centers could burn down without losing any files if things are done right. I have not accounted here for the bandwidth used by replacements, which would usually be done within the same data center except in unusual circumstances.
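A rough way to compare the two schemes is to assume each copy or piece is lost independently with some probability p between repairs, and to assume the scattering scheme is an ideal erasure code where any 100 of the 200 pieces rebuild the file. Both are simplifying assumptions for illustration, not something established above, and the function names are hypothetical. Under them, a short Python sketch gives the chance of losing the file:

```python
from math import comb


def replication_loss(p, copies=4):
    """Chance that every one of the full copies is lost."""
    return p ** copies


def erasure_loss(p, total=200, needed=100):
    """Chance that fewer than `needed` of `total` pieces survive.

    Assumes an ideal erasure code: any `needed` pieces rebuild the file,
    so the file is lost only if more than total - needed pieces are lost.
    """
    max_survivable = total - needed
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(max_survivable + 1, total + 1)
    )


def storage_overhead(total=200, needed=100):
    """Raw storage used per gigabyte of actual data."""
    return total / needed


p = 0.1  # assumption: a deliberately pessimistic per-piece loss rate
print(replication_loss(p), erasure_loss(p))
print(storage_overhead())   # versus 4.0 for four full copies
```

Even at that pessimistic loss rate, the scattered layout uses half the raw disk of four full copies while being far less likely to lose the file, provided the independence assumption roughly holds.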

The biggest cost is the people to set all this up. However, presuming big demand, the cost per gigabyte for those people should become modest.