Old think on data storage for movies

A story from the New York Times suggests it costs over $12,000/year to store a movie in digital form.

This number is entirely bogus, based on old thinking, namely the assumption of offline storage on DVDs and tapes. Offline media do degrade, and you must copy them before they have a chance to degrade, which takes people, though frankly even that should not be as expensive as this. For my calculations, I am going to assume a movie needs 100 GB of storage with low-loss lossy compression. You can scale the numbers up if you want to assume more; even at 1 TB it doesn't change that much.

A film occupying 100 GB of storage can go on about 20 DVDs (or 11 dual-layer), costing about $8. It can go on 4 independent sets of 20 DVDs for $32 in media. Ideally you could rack these in a DVD jukebox, but if they are just sleeved, then once a year a person could pull out the DVDs and put them in a reader which would test them. Any that tested fine would be re-sleeved; any that did not would flag its set, and the other discs would be pulled and copied to new media. (Probably better media, like Blu-ray.) There are algorithms to distribute the data so that a large number of the disks must fail in that year before anything is actually lost. Of course, you use different vaults around the world. When approaching the point where failure rates go up for the media, you re-burn new copies even if the old ones still test fine.
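To make the media arithmetic concrete, here is a quick back-of-the-envelope sketch in Python. The $0.40/disc price and the 4.7 GB / 8.5 GB capacities are my assumptions; the "about 20 discs" above rounds the same numbers down.

```python
import math

FILM_GB = 100
DVD_GB = 4.7           # single-layer DVD capacity
DVD_DL_GB = 8.5        # dual-layer DVD capacity
PRICE_PER_DISC = 0.40  # assumed bulk media price, in dollars
SETS = 4               # independent redundant copies

single = math.ceil(FILM_GB / DVD_GB)     # 22 single-layer discs
dual = math.ceil(FILM_GB / DVD_DL_GB)    # 12 dual-layer discs
cost_one_set = single * PRICE_PER_DISC   # ~$8.80 per set
cost_all = SETS * cost_one_set           # ~$35 for all 4 sets

print(f"{single} single-layer or {dual} dual-layer discs per set")
print(f"${cost_one_set:.2f} per set, ${cost_all:.2f} for {SETS} sets")
```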

This takes human time, though not all that much. Perhaps half an hour of actual human time swapping disks, though much more wall-clock time to burn them; but you don't do just one film at a time.

However, even better is the new style of archival -- online storage. Hard disks are 20 cents/GB and continuing to fall. NAS boxes are more expensive now, but there is no reason they won't drop to very reasonable prices, so that a NAS case adds perhaps 5 cents/GB (ie. $100 for a 4x500 GB drive box which lasts for 10-15 years). (NAS boxes are small boxes that hold a collection of drives and allow access to them over ethernet. No computer is needed.) They also cost about 2 cents/GB/year for power if on all the time, and some small amount for space, though they would tend to sit in computer centers that already exist.

Those are today's prices, and they will only get cheaper -- much cheaper -- except for the power. If a drive lasts an average of 4 years before failing and a NAS lasts 10 years, this works out to 7.5 cents/GB/year. Of course you will store your files redundantly, in 4 different places (which is actually overkill), and so it's 30 cents/GB/year.
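A quick worked version of that arithmetic, using the figures above:

```python
# Per-gigabyte annual cost of live NAS storage, from the numbers above.
DRIVE_COST_PER_GB = 0.20   # $100 for a 500 GB drive
DRIVE_LIFE_YEARS = 4       # average drive lifetime before failure
NAS_COST_PER_GB = 0.05     # ~$100 case amortized over 4 x 500 GB
NAS_LIFE_YEARS = 10        # case/electronics lifetime
POWER_PER_GB_YEAR = 0.02   # if spinning all the time

per_gb_year = (DRIVE_COST_PER_GB / DRIVE_LIFE_YEARS   # 5.0 cents
               + NAS_COST_PER_GB / NAS_LIFE_YEARS     # 0.5 cents
               + POWER_PER_GB_YEAR)                   # 2.0 cents
REPLICAS = 4
print(f"{per_gb_year * 100:.1f} cents/GB/year per copy")   # 7.5
print(f"{per_gb_year * REPLICAS * 100:.1f} cents/GB/year with {REPLICAS} copies")  # 30.0
```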

Which is still just $30/year for a 100 GB file, or $300/year for a terabyte.

Online storage is live: you can check its integrity regularly, all the time. You can either leave it off and spin it up every few days (to save power) or just leave it on all the time. If one, two or three of the 4 disks fail, computers can copy the data to fresh disks in the network, and you stay alive. Your disks should last 3 to 4 years, but many will last much longer. You need a computer system to control all this, but you only need one for the entire cloud of NAS boxes, or at most a few. Its cost is low.
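As a sketch of what that control computer does, here is a toy scrub-and-repair loop with an in-memory stand-in for the NAS boxes. Everything here is illustrative; a real system would read from disks over the network rather than from byte arrays.

```python
import hashlib
import random

def sha(data):
    return hashlib.sha256(data).hexdigest()

# Toy "cloud": each replica is just a bytearray that may rot.
movie = bytes(random.getrandbits(8) for _ in range(1024))  # stand-in for 100 GB
checksum = sha(movie)
replicas = [bytearray(movie) for _ in range(4)]
replicas[2][10] ^= 0xFF  # simulate silent corruption on one "disk"

def scrub_and_repair(replicas, checksum):
    """Verify every copy against its checksum; rebuild bad copies from a good one."""
    good = [r for r in replicas if sha(bytes(r)) == checksum]
    if not good:
        raise RuntimeError("all replicas lost between scrubs")
    for i, r in enumerate(replicas):
        if sha(bytes(r)) != checksum:
            replicas[i] = bytearray(good[0])  # copy to a "fresh disk"
    return len(good)

survivors = scrub_and_repair(replicas, checksum)
print(f"{survivors}/4 copies passed; bad copies rebuilt")
```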

The real cost is people. But companies like Google have solved the problem of running large server farms. They tolerate single drive failures. The computers copy the data to new drives right away, and technicians go by every few days to pull old ones and slot in fresh ones for the next need -- not for the same file. This takes just a few minutes of the tech's time, and there is no rush to their work. For each 100 GB file, you should expect to need a replacement about once every 4 years (ie. the lifetime of an average drive).

Now all this is at today's price of $100 for a 500 GB drive. But that's dropping fast, faster than Moore's law. The replacements will be 1 TB and 2 TB drives before long, and the cost will continue to fall. And this is with 4 copies of every file. You can actually get by with less using modern data distribution algorithms, which can scatter a 100 GB file into 200 pieces of 1 GB, of which almost half must be lost before the whole file is lost. Several data centers could burn down without losing any files if things are done right. I have not accounted for bandwidth here for replacements, which would usually be done within the same data center except in unusual circumstances.
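To put a number on that: if any 100 of the 200 pieces suffice to rebuild the file, it is lost only when more than 100 pieces fail within one repair interval. A quick binomial estimate, with a deliberately pessimistic assumed 2% chance that any given piece dies in one interval:

```python
from math import comb

N = 200   # coded pieces stored
K = 100   # any K pieces suffice to reconstruct the file
p = 0.02  # assumed per-interval piece-loss probability (pessimistic)

# P(file lost) = P(more than N-K pieces fail before repair)
p_loss = sum(comb(N, f) * p**f * (1 - p)**(N - f)
             for f in range(N - K + 1, N + 1))
print(f"P(loss in one interval) = {p_loss:.3e}")  # astronomically small
```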

The biggest cost is the people to set all this up. However, presuming big demand, the cost per gigabyte for those people should become modest.

Comments

I just re-read the story, and noticed that the cost calculations in the article are even worse than you write. From the article:

To store a digital master record of a movie costs about $12,514 a year, versus the $1,059 it costs to keep a conventional film master.

I believe that these are the costs the studios are being charged by third party companies. I also believe that the cost includes digitization of the master (or one would hope!) along with QA, redundant storage, and of course, profit margin. Still nuts, and somebody somewhere is making gobs of money.

--
John <><

Your 100 GB (or even 1 TB) assumption is likely to be way, way off. They are storing these movies as raw high-resolution files which are much, much larger than even 25 GB HD discs. When you add up all the extra footage that is saved but is not part of the movie, you could be looking at hundreds of TBs per movie, or even more.

Uncompressed 2000 x 1000 pixels x 24 bits x 24 fps x 90 x 60 seconds is 777 GB. However, there are lossless compressions that can take this down a bit, and there are "invisibly lossy" compressions that can take it a great deal further.
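That figure is straightforward to reproduce:

```python
# Uncompressed 2K master size, as computed above.
width, height = 2000, 1000
bytes_per_pixel = 3   # 24 bits of color
fps = 24
seconds = 90 * 60     # a 90-minute feature

total_bytes = width * height * bytes_per_pixel * fps * seconds
print(f"{total_bytes / 1e9:.1f} GB uncompressed")  # 777.6 GB
```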

In particular, VBR invisibly lossy compressions are possible that are near-lossless in periods of complexity and movement, and highly compressed in simpler scenes. So no, 100 GB is really not too far off for this sort of film.

And more to the point, if it's costing you $12,000/year to store uncompressed, you start thinking about more lossy compressions pretty fast!

I agree compression is probably being used, but your calculation of size is off. I imagine they prefer to keep everything shot (which is where the magic deleted scenes come from).

According to my fading memory from film class many years ago regular film would have a shooting ratio of 6:1. Documentaries were lower.

Even lossy compressions are going to have a hard time with that much footage.

I think the way I'd store one movie is to put it on hard drives, plus one extra hard drive that holds a parity calculation of the other drives (RAID 5). Then you could lose a storage disk and still recover.

Except the RAID concept can be extended much further. If you have a file that takes up 8 drives' worth of space, you can store it on 16 different drives such that you can lose any 8 of them and still recover the file (or perhaps it is 7). So only twice the space to be able to withstand a major loss, an extremely unlikely loss if the drives are spread over 8 different facilities.

However, as soon as you lose 9 drives you have lost *everything*, which is the downside. (Same for RAID 5: you can lose 1 drive OK, but with 2 drives all your data is gone, which is why people go to RAID 6 for large arrays, where you can lose 2 drives, and 3 takes you out.)
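RAID 5's one-failure tolerance is nothing more than XOR parity, as this self-contained toy shows (short byte strings stand in for whole drives):

```python
from functools import reduce

def xor_parity(strips):
    """Compute the parity strip: byte-wise XOR across all given strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

def recover_missing(surviving_strips, parity):
    """Rebuild the single missing strip from the survivors plus parity."""
    return xor_parity(surviving_strips + [parity])

data = [b"movie-p1", b"movie-p2", b"movie-p3"]  # three toy "drives"
parity = xor_parity(data)                       # the fourth, parity "drive"

lost = data[1]                                  # drive 2 dies
rebuilt = recover_missing([data[0], data[2]], parity)
assert rebuilt == lost                          # one failure is fully recoverable
print("rebuilt:", rebuilt)
```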

Yes, I understand you want to store all the extra footage. So increase the size tenfold. It's still nowhere near $12,000 per year.

Doh, I meant the differentiator between your idea and mine would be that I would take mine offline (so of course I didn't put that part in...), mostly for power considerations. With hot-swappable drives and robot loaders I don't see a need to waste power keeping them live all the time. Automated refresh routines to ensure data integrity once a week or month should be sufficient.

And yeah, still nowhere near $12,000 a year per movie for storage.

Well, there's offline (ie. a person has to come and insert a disk or tape), there's online (access at any time), and there's on-demand, where it's "off" but the machine can command reading it, either by turning it on or by having a jukebox that can grab tapes or DVDs.

The key to the on-demand or online forms is that you detect failures quickly, and then act to create another redundant copy right away. If you don't check for failure often, you run the risk of multiple failures between checks, adding to the risk of data loss.

With hard drives, there is an interesting question. Spin-down/spin-up is stressful. So is running all the time. You would want to calculate (some may have already researched this) just what frequency of spin-ups gives the longest lifetime along with appropriately quick failure detection. And yes, staying off has lower power costs, but that's only one factor.

Fully online storage notifies you of failure right away in many cases, or you can have it constantly rereading the disks looking for trouble. But it costs power, and may cost lifetime.

However, all these numbers can be worked out, if they haven't been yet, and it may well be that the best strategy is to keep the drives off and spin them up once a day for a full error check, or some other period like that. (Power will be a minimal cost at anything like one check every few hours.) Drives that are off can also fail from being off, it turns out. Anyway, the hardware for NAS boxes is now cheap enough to make this easy to do.
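As a toy illustration of that trade-off, with every rate invented for the example rather than measured: the danger window is the time between checks, so the chance of all copies dying undetected scales roughly with (failure rate x window) raised to the number of extra replicas.

```python
# Toy model: chance that all 4 copies die before the next integrity check.
# The 5%/year failure rate is an assumed number, purely for illustration.
AFR = 0.05    # assumed annual drive failure rate
REPLICAS = 4

for interval_days in (1, 7, 30, 365):
    w = interval_days / 365  # undetected-failure window, in years
    # One copy dies, then the other 3 must all die inside the window.
    p_cascade = AFR * (AFR * w) ** (REPLICAS - 1)
    print(f"check every {interval_days:>3} days: "
          f"~{p_cascade:.2e} chance/year of total loss")
```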
